======Team 3 Report: SGA (String Graph Assembler)======

=====String Graph Assembler=====

  * Memory efficient, uses compressed data structures
  * Very modular, each component can be run independently
  * Emphasis on accuracy
  * Most of the CPU time is in steps that can be parallelized and merged
  * Tries to excel at substring coverage

=====User Experience=====

  * The current pipeline differs from the one described in the original paper, and the documentation is unclear or scattered
  * Minimum of 20x coverage; 40x is recommended
  * Reads must be 100 bp or longer
  * Key parameters: overlap size, k-mer size for error correction
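
To check a library against the 20x / 40x guideline above, expected coverage can be estimated as total sequenced bases divided by genome size. The sketch below is a back-of-the-envelope calculation only; the function names and the example numbers (a 5 Mbp genome, 100 bp reads) are illustrative assumptions, not values from our data.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Expected depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: 2 million 100 bp reads on a 5 Mbp genome.
example_depth = coverage(2_000_000, 100, 5_000_000)
meets_minimum = example_depth >= 20
meets_recommended = example_depth >= 40
```

This ignores reads discarded during preprocessing and error correction, so the usable coverage will be somewhat lower than the estimate.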

=====Algorithm=====

  * Relies on the BWT (Burrows-Wheeler Transform), a reversible rearrangement of the input that groups similar characters into runs and therefore compresses well
  * FM-Index (Full-text index in Minute space) - built on top of the BWT; its size scales with the size of the input alphabet
  * Only some fraction of the last column of the BWT is stored
  * The first column is not stored as nucleotides; only pointers to where each nucleotide's block would start are kept
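
The points above can be illustrated with a toy BWT and FM-index. This is a minimal sketch for short strings, not SGA's implementation: it builds the BWT by sorting all rotations and keeps a full occurrence table, whereas a real index is built and stored far more compactly. All function names are our own.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    text += "$"  # sentinel, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(last: str):
    """Return (C, occ) for the BWT string `last`.

    C[c]      = row where the block of c starts in the first column
    occ[c][i] = number of times c occurs in last[:i]
    """
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(last) + 1) for ch in counts}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def inverse_bwt(last: str) -> str:
    """Recover the original text, showing that the transform is reversible."""
    C, occ = build_fm_index(last)
    row, out = 0, []  # row 0 is the rotation that starts with '$'
    for _ in range(len(last) - 1):
        ch = last[row]
        out.append(ch)
        row = C[ch] + occ[ch][row]  # LF mapping
    return "".join(reversed(out))


def count_occurrences(pattern: str, last: str) -> int:
    """Count exact matches of `pattern` using backward search."""
    C, occ = build_fm_index(last)
    lo, hi = 0, len(last)
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


genome = "GATTACATTA"
transformed = bwt(genome)
hits = count_occurrences("TTA", transformed)
```

Note that the search never reads the first column directly; it only needs the block start offsets in ''C'' and the counts in ''occ''.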

=====Assembly=====

  - Construct FM Index
  - Merge paths to reduce graph size
  - Re-index
  - Build string graph

The assembly algorithm queries the FM Index and follows unambiguous paths, i.e. paths where a k-mer can be extended by only one nucleotide, and condenses each such path into a single merged read.
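
The idea of following an unambiguous path can be sketched as follows. This is a simplified illustration under stated assumptions: a plain Python set of k-mers stands in for the FM Index queries, only rightward extension is shown, and the check for multiple predecessors is omitted. The names ''kmers_of'' and ''extend_right'' are hypothetical.

```python
def kmers_of(reads, k):
    """Collect every k-mer that occurs in the reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def extend_right(seed, kmers, k):
    """Extend `seed` one base at a time while exactly one extension exists."""
    seq = seed
    seen = {seq[-k:]}
    while True:
        suffix = seq[-(k - 1):]
        candidates = [base for base in "ACGT" if suffix + base in kmers]
        if len(candidates) != 1:
            break  # dead end or branch: the path is no longer unambiguous
        kmer = suffix + candidates[0]
        if kmer in seen:
            break  # guard against walking around a cycle forever
        seen.add(kmer)
        seq += candidates[0]
    return seq


# Two toy reads that agree up to "...CATC" and then branch (A vs. G).
toy_reads = ["ACGTTGCATCA", "GCATCG"]
merged = extend_right("ACGT", kmers_of(toy_reads, 4), 4)
```

The extension stops at the branch, so the merged sequence covers only the part of the path that both reads support.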

=====String Graph=====

  * Removes duplicate reads, indexes the rest with l-mers, and then checks reads that share an l-mer for longer overlaps
  * Builds edges, labels each with the non-matching sequence from the two overlapping k-mers, and removes transitive edges
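
A minimal sketch of these two steps on toy reads is given below. It is not how SGA does it: overlaps are found by brute-force comparison of every read pair rather than through an index, the transitive check is a naive scan over all edges, contained reads and reverse complements are ignored, and all function names are our own.

```python
def longest_overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b, or 0."""
    for length in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_string_graph(reads, min_len):
    """Map (source, target) -> label, the part of target past the overlap."""
    reads = list(dict.fromkeys(reads))  # remove duplicate reads, keep order
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = longest_overlap(a, b, min_len)
                if n:
                    edges[(a, b)] = b[n:]
    return edges


def remove_transitive_edges(edges):
    """Drop a->c when some a->b->c spells out exactly the same label."""
    reduced = dict(edges)
    for (a, c), label in edges.items():
        for (x, b), first in edges.items():
            if x == a and b != c and (b, c) in edges:
                if first + edges[(b, c)] == label:
                    reduced.pop((a, c), None)
                    break
    return reduced


toy_reads = ["ACGTTG", "GTTGCA", "TGCATC", "ACGTTG"]  # last one is a duplicate
graph = build_string_graph(toy_reads, min_len=2)
reduced_graph = remove_transitive_edges(graph)
```

In this example the edge from the first read directly to the third is transitive, because its label is the concatenation of the labels along the two-step path, so only the two adjacent-read edges remain.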

=====Bubble Popping=====

  * For a pair of nodes connected by multiple walks, choose one walk to remain
  * Compare the other walks to the chosen walk and remove those that are similar enough (95%)
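
The rule above can be sketched as follows. Two points here are our own assumptions and not taken from SGA: the walk with the highest coverage is the one kept, and "similar enough" is measured as one minus the edit distance divided by the longer walk length. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match / substitution
        prev = cur
    return prev[-1]


def identity(a, b):
    """Fraction of the longer sequence left unchanged by the edits."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def pop_bubble(walks, min_identity=0.95):
    """walks: list of (sequence, coverage) between the same two nodes.

    Keeps the highest-coverage walk (an assumption), removes walks at
    least `min_identity` similar to it, and retains divergent walks.
    """
    kept = max(walks, key=lambda walk: walk[1])
    survivors = [kept]
    for walk in walks:
        if walk is kept:
            continue
        if identity(walk[0], kept[0]) < min_identity:
            survivors.append(walk)
    return survivors


main_walk = "ACGT" * 10
snp_walk = main_walk[:20] + "T" + main_walk[21:]  # one substitution
divergent_walk = "TTTT" * 10
result = pop_bubble([(snp_walk, 8), (main_walk, 30), (divergent_walk, 5)])
```

Here the single-substitution walk is removed as a likely error or variant, while the divergent walk falls below the threshold and is left in the graph.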

=====Scaffolding=====

  * Create a potential order of contigs, removing uncertain or repetitive ones
  * Standalone module, uses k-mer search on gaps to find paths

=====Current Progress=====

  * Successfully ran sga preprocess, sga index, sga correct, and sga index (again, after correction)

=====Future Goals=====

  * Adjust error correction and assembly
  * Finish running full pipeline on one library
  * Re-run completed pipeline using all libraries