======Team 3 Report: SGA (String Graph Assembler)====== =====String Graph Assembler===== * Memory efficient, uses compressed data structures * Very modular, each component can be run independently * Emphasis on accuracy * Most of the CPU time is in steps that can be parallelized and merged * Tries to excel at substring coverage =====User Experience===== * Current pipeline is different than the original paper, documentation is unclear or scattered * Minimum of 20x coverage, recommended 40x coverage. * Reads must be 100bp or greater * Key parameters: overlap size, k-mer size for error correction =====Algorithm===== * Relies on BWT (Burrows Wheeler Transform), which allows for reversible compression and rearranges the string into runs of similar characters * FM-Index (Full-text Index in Minute space) - scales with the size of the input alphabet * Only some fraction of the last column of the BWT is stored * First column does not store nucleotides, merely pointers to where each nucleotide would start =====Assembly===== - Construct FM Index - Merge paths to reduce graph size - Re-index - Build string graph The assembly algorithm queries the FM Index and follows (unambiguous) paths where one k-mer maps to a given end nucleotide and condenses these paths to a single read. =====String Graph===== * Remove duplicate reads, index with l-mers and then check reads with shared l-mers for longer overlaps * Builds edges and labels them with the non-matching sequence from two overlapping k-mers and removes transitive edges =====Bubble Popping===== * For a pair of nodes with multiple walks, choose one to remain * Compare other walks and if they are similar enough (95%) to the chosen walk, remove them =====Scaffolding===== * Create a potential order of contigs, removing uncertain or repetitive ones * Standalone module, uses k-mer search on gaps to find paths =====Current Progress===== * Successfully run sga preprocess, sga index, sga correct, sga index (after correction) =====Future Goals===== * Adjust error correction and assembly * Finish running full pipeline on one library * Re-run completed pipeline using all libraries