Team 3 Report: SGA (String Graph Assembler)
String Graph Assembler
Memory efficient, uses compressed data structures
Very modular, each component can be run independently
Emphasis on accuracy
Most of the CPU time is in steps that can be parallelized and merged
Tries to excel at substring coverage
User Experience
Current pipeline is different than the original paper, documentation is unclear or scattered
Minimum of 20x coverage, recommended 40x coverage.
Reads must be 100bp or greater
Key parameters: overlap size, k-mer size for error correction
Algorithm
Relies on BWT (Burrows Wheeler Transform), which allows for reversible compression and rearranges the string into runs of similar characters
FM-Index (Full-text Index in Minute space) - scales with the size of the input alphabet
Only some fraction of the last column of the BWT is stored
First column does not store nucleotides, merely pointers to where each nucleotide would start
Assembly
Construct FM Index
Merge paths to reduce graph size
Re-index
Build string graph
The assembly algorithm queries the FM Index and follows (unambiguous) paths where one k-mer maps to a given end nucleotide and condenses these paths to a single read.
String Graph
Remove duplicate reads, index with l-mers and then check reads with shared l-mers for longer overlaps
Builds edges and labels them with the non-matching sequence from two overlapping k-mers and removes transitive edges
Bubble Popping
For a pair of nodes with multiple walks, choose one to remain
Compare other walks and if they are similar enough (95%) to the chosen walk, remove them
Scaffolding
Create a potential order of contigs, removing uncertain or repetitive ones
Standalone module, uses k-mer search on gaps to find paths
Current Progress
Future Goals
Adjust error correction and assembly
Finish running full pipeline on one library
Re-run completed pipeline using all libraries