Sanger String Graph Assembler
Methods
String Graphs
Nodes are reads (reads that are substrings are condensed into superstrings). Edges are overlaps between reads, and the non-overlapping prefix is stored in the forward edge and suffix is stored in the backwards edge.
Condenses repeats like a de Bruijn graph.
More expensive to construct than a de Bruijn graph.
BWT/FM-index
Like BWA, it uses the FM-index, which is a compressed method of inferring the suffix array.
The Burrows Wheeler transform B_X is an array of the last characters in the alphabetically sorted suffix array.
The FM-index (two data structures: 1. C_X(a) be the number of symbols in X that are lexographically lower than the symbol a, 2. Occ_X(a, i) be the number of occurrences of the symbol a in B_X[1, i], the ) allows substring searching and can be extended to construct the string graph.
String Graph Construction with the FM-index
Installation
Installation of SGA from the GitHub is a major pain, because it has so many dependencies. It needs
google-sparsehash (also needed for Abyss)
hoard
bamtools, which is a particular pain to install because it uses
cmake (newer than the version installed on campusrocks), and cmake does not support installing stuff anywhere other than in the initial directory or root-access-only directories, so we can't do the usual “configure –prefix=/campusdata/BME235”
Once cmake, bamtools, hoard, and google/sparsehash are all installed as best they can be, then the UCSC-install.bash script can be run to install SGA. Actually running SGA is still to be tested!