lecture_notes:04-24-2015

Team 3 Report: SGA (String Graph Assembler)

String Graph Assembler

  • Memory efficient, uses compressed data structures
  • Very modular, each component can be run independently
  • Emphasis on accuracy
  • Most of the CPU time is in steps that can be parallelized and merged
  • Tries to excel at substring coverage

User Experience

  • The current pipeline differs from the one described in the original paper, and the documentation is unclear or scattered
  • Minimum of 20x coverage; 40x is recommended
  • Reads must be at least 100 bp long
  • Key parameters: overlap size, k-mer size for error correction
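
As a quick check against the coverage guideline above, average coverage can be estimated as total sequenced bases divided by genome size. The sketch below assumes a hypothetical library (read count, read length, and genome size are invented for illustration); the function name is ours and is not part of SGA.

```python
def estimated_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


# Hypothetical library: 2 million reads of 100 bp against a 5 Mbp genome.
coverage = estimated_coverage(2_000_000, 100, 5_000_000)
print(f"{coverage:.0f}x")  # 40x

# Thresholds taken from the notes above: 20x minimum, 40x recommended.
print("meets minimum:", coverage >= 20)
print("meets recommendation:", coverage >= 40)
```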

Algorithm

  • Relies on the BWT (Burrows-Wheeler Transform), a reversible transform that rearranges the string into runs of similar characters, which makes it easy to compress
  • FM-Index (Full-text Index in Minute space), built on top of the BWT; its memory footprint scales with the size of the input alphabet
  • Only some fraction of the last column of the BWT is stored
  • First column does not store nucleotides, merely pointers to where each nucleotide would start
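
To make the points above concrete, here is a minimal sketch of a BWT and an FM-index style backward search. It is not SGA's implementation: it builds the BWT by sorting rotations (quadratic, for illustration only) and stores the full occurrence table rather than a sampled, compressed one. The function names are ours. The `C` table plays the role of the "first column" pointers: it records only where each character's block would start.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    s = text + "$"  # '$' is a unique terminator that sorts before A, C, G, T
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(last: str):
    """Return (C, occ) for the BWT string `last`.

    C[c]      : number of characters in the text strictly smaller than c,
                i.e. where the block of c's starts in the implicit first column.
    occ[c][i] : number of occurrences of c in last[:i] (stored in full here;
                a real index would keep only sampled checkpoints).
    """
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(last) + 1) for ch in counts}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count_occurrences(pattern: str, C, occ, n: int) -> int:
    """Backward search: count occurrences of `pattern` in the indexed text."""
    lo, hi = 0, n  # half-open interval of matching rows
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


text = "GATTACAGATTACA"
last = bwt(text)
C, occ = build_fm_index(last)
print(last)                                          # BWT of the text
print(count_occurrences("ATTA", C, occ, len(last)))  # 2
print(count_occurrences("GG", C, occ, len(last)))    # 0
```

Note that the search never looks at the original text, only at `C` and `occ`, which is what lets the index stay small.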

Assembly

  1. Construct FM Index
  2. Merge paths to reduce graph size
  3. Re-index
  4. Build string graph

The assembly algorithm queries the FM-Index and follows unambiguous paths, where a k-mer can be extended by only one nucleotide, and condenses each such path into a single read.
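
The idea of following unambiguous paths can be sketched as follows. This is a simplified illustration, not SGA's code: it uses a plain Python set of k-mers where SGA would query the FM-Index, it only extends to the right, and the function name `extend_right` is hypothetical.

```python
def extend_right(seed: str, kmers: set) -> str:
    """Extend `seed` to the right while exactly one next k-mer exists.

    Extension stops at a branch (several successors), a dead end (none),
    a join (the successor has more than one predecessor), or a cycle.
    """
    contig = seed
    current = seed
    seen = {seed}
    while True:
        successors = [current[1:] + base for base in "ACGT"
                      if current[1:] + base in kmers]
        if len(successors) != 1:
            break
        nxt = successors[0]
        predecessors = [base + nxt[:-1] for base in "ACGT"
                        if base + nxt[:-1] in kmers]
        if len(predecessors) != 1 or nxt in seen:
            break
        contig += nxt[-1]  # the path is unambiguous: absorb one nucleotide
        seen.add(nxt)
        current = nxt
    return contig


sequence = "ACGTTGCA"
k = 4
kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
print(extend_right("ACGT", kmers))             # ACGTTGCA
print(extend_right("ACGT", kmers | {"CGTA"}))  # ACGT (branch after the seed)
```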

String Graph

  • Remove duplicate reads, index with l-mers, and then check reads that share l-mers for longer overlaps
  • Builds edges between overlapping reads, labels each edge with the non-matching (overhanging) sequence, and removes transitive edges
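
A minimal sketch of these two steps, under simplifying assumptions: overlaps are found by brute-force comparison of all read pairs rather than through an l-mer index or the FM-Index, only exact forward-strand suffix/prefix overlaps are considered, and `min_overlap` is assumed to be at least 1. All function names are ours, not SGA's.

```python
def suffix_prefix_overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest proper suffix of `a` equal to a prefix of `b`.

    Returns 0 if no overlap of at least `min_overlap` (>= 1) exists.
    """
    for length in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def build_string_graph(reads, min_overlap):
    """Return (unique_reads, edges); edges maps (i, j) to the overhang of read j."""
    unique_reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    edges = {}
    for i, a in enumerate(unique_reads):
        for j, b in enumerate(unique_reads):
            if i != j:
                length = suffix_prefix_overlap(a, b, min_overlap)
                if length:
                    edges[(i, j)] = b[length:]  # the non-matching part of b
    return unique_reads, edges


def remove_transitive_edges(edges):
    """Drop edge i->k when i->j->k exists and spells the same sequence."""
    reduced = dict(edges)
    for (i, j), label_ij in edges.items():
        for (j2, k), label_jk in edges.items():
            if j2 == j and edges.get((i, k)) == label_ij + label_jk:
                reduced.pop((i, k), None)
    return reduced


reads = ["ACGTAC", "GTACGG", "ACGGTT", "ACGTAC"]  # last read is a duplicate
unique_reads, edges = build_string_graph(reads, min_overlap=2)
print(edges)                           # {(0, 1): 'GG', (0, 2): 'GGTT', (1, 2): 'TT'}
print(remove_transitive_edges(edges))  # {(0, 1): 'GG', (1, 2): 'TT'}
```

The edge from read 0 to read 2 is transitive because its label "GGTT" is exactly the concatenation of the labels along 0 to 1 to 2.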

Bubble Popping

  • For a pair of nodes with multiple walks, choose one to remain
  • Compare other walks and if they are similar enough (95%) to the chosen walk, remove them
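
The bubble-popping rule can be sketched as below. Two assumptions are ours, since the notes do not specify them: the first walk is the one chosen to remain, and similarity is measured with Python's `difflib.SequenceMatcher` ratio rather than a true alignment identity.

```python
from difflib import SequenceMatcher


def pop_bubble(walks, min_similarity=0.95):
    """Keep the first walk and remove the others that are similar enough to it.

    Walks whose similarity to the kept walk is below `min_similarity`
    are left in place, since they may represent real differences.
    """
    kept = walks[0]
    survivors = [kept]
    for walk in walks[1:]:
        similarity = SequenceMatcher(None, kept, walk, autojunk=False).ratio()
        if similarity < min_similarity:
            survivors.append(walk)
    return survivors


main_walk = "ACGTTGCAAGGCTTAACCGGATCGATTACAGGCTAGCTAA"
one_mismatch = "ACGTTGCAAGGCTTAACCGGTTCGATTACAGGCTAGCTAA"  # single substitution
unrelated = "T" * 40

survivors = pop_bubble([main_walk, one_mismatch, unrelated])
print(len(survivors))  # 2: the near-identical walk was removed
```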

Scaffolding

  • Create a potential order of contigs, removing uncertain or repetitive ones
  • Standalone module, uses k-mer search on gaps to find paths

Current Progress

  • Successfully ran sga preprocess, sga index, sga correct, and sga index (again, after correction)

Future Goals

  • Adjust error correction and assembly
  • Finish running full pipeline on one library
  • Re-run completed pipeline using all libraries
lecture_notes/04-24-2015.txt · Last modified: 2015/04/24 17:52 by jdhouser