Team 3 Report: SGA (String Graph Assembler)

String Graph Assembler

Memory efficient, uses compressed data structures
Very modular, each component can be run independently
Emphasis on accuracy
Most of the CPU time is in steps that can be parallelized and merged
Tries to excel at substring coverage

User Experience

Current pipeline is different than the original paper, documentation is unclear or scattered
Minimum of 20x coverage, recommended 40x coverage.
Reads must be 100bp or greater
Key parameters: overlap size, k-mer size for error correction

Algorithm

Relies on BWT (Burrows Wheeler Transform), which allows for reversible compression and rearranges the string into runs of similar characters
FM-Index (Full-text Index in Minute space) - scales with the size of the input alphabet
Only some fraction of the last column of the BWT is stored
First column does not store nucleotides, merely pointers to where each nucleotide would start

Assembly

Construct FM Index
Merge paths to reduce graph size
Re-index
Build string graph

The assembly algorithm queries the FM Index and follows (unambiguous) paths where one k-mer maps to a given end nucleotide and condenses these paths to a single read.

String Graph

Remove duplicate reads, index with l-mers and then check reads with shared l-mers for longer overlaps
Builds edges and labels them with the non-matching sequence from two overlapping k-mers and removes transitive edges

Bubble Popping

For a pair of nodes with multiple walks, choose one to remain
Compare other walks and if they are similar enough (95%) to the chosen walk, remove them

Scaffolding

Create a potential order of contigs, removing uncertain or repetitive ones
Standalone module, uses k-mer search on gaps to find paths

Current Progress

Successfully run sga preprocess, sga index, sga correct, sga index (after correction)

Future Goals

Adjust error correction and assembly
Finish running full pipeline on one library
Re-run completed pipeline using all libraries

You could leave a comment if you were logged in.

Banana Slug Genomics

Table of Contents

Team 3 Report: SGA (String Graph Assembler)

String Graph Assembler

User Experience

Algorithm

Assembly

String Graph

Bubble Popping

Scaffolding

Current Progress

Future Goals

Banana Slug Genomics

User Tools

Site Tools

Table of Contents

Team 3 Report: SGA (String Graph Assembler)

String Graph Assembler

User Experience

Algorithm

Assembly

String Graph

Bubble Popping

Scaffolding

Current Progress

Future Goals

Page Tools