Workaround for broken symlinks on head and children nodes
Check assemblies/Pog/map-colorspace-5/Makefile for the workaround
explicitly assigns working directory rather than using cwd
Kevin has been working on H. Pylori, which is a much more challenging genome than Pog. MAy have developed some new methods for dealing with genome assembly
One trick is a method to estimate how frequently a contig appears in a genome.
Look at reads/base. For long regions, reads/base is roughly constant.
Expect to have more reads/base in repeat regions.
This measure is more noisy with smaller regions
Estimated gap size is helpful for interpolating smaller contigs between other contigs.
Short fragments can fit in regions with large estimated gap size.
Biological tricks:
c→b→a→b'→d
an inversion of b→a→b' turns the scaffold into c→b→a'→b'→d… Note that flipping b→a→b' reverses and complements the sequences, so that it becomes (b')→a'→b' or b→a'→b'.
Burrows-Wheeler Transform (BWT)
The algorithm behind bowtie, bwa, soap2 and possibly a few other mapping algorithms.
The strength of this is it allows you to map short reads to a large genome rapidly and without alot of memory.
Dependent on a reference genome.
Indexes the genome and produces a fairly compact genome.
The current implementations are all optimized for short reads.
The algorithm inherently gets worse as the reads get longer.
Longer reads will have to be broken into shorter pieces, indexed the reads separately and piece them back together.
What is the Burrows-Wheeler Transform and where did it come from?
Came from data compression.
Original idea was to make a string and take an invertible string.
The converted string would be much easier to compress than the original string.
Also used for bzip.
Example of how BWT works:
Note must, have an end marker at the end of the string
Take the initial string (GOOGOL$) and all rotations:
GOOGOL$
OOGOL$G
OGOL$GO
GOL$GOO
OL$GOOG
L$GOOGO
Sort them Lexographically
$GOOGOL
GOL$GOO
GOOGOL$
L$GOOGO
OGOL$GO
OL$GOOG
OOGOL$G
The BWT is the last column Array B
1 $GOOGOL
2 GOL$GOO
3 GOOGOL$
4 L$GOOGO
5 OGOL$GO
6 OL$GOOG
7 OOGOL$G
How do you invert this thing and get the original string back?
You know the all the letters in your original string. You know how to sort them alphabetically.
This gives you the first column
You know the first and the last are paired together.
Take all the pairs you get from putting the first and last column together and reproduce the 2nd column.
What is this really representing?
You can also keep track of the permutation you did when you did the sorting. The array which maps position in BWT to original sorted array is know as the S
0⇒6 $GOOGOL
1⇒3 GOL$GOO
2⇒0 GOOGOL$
3⇒5 L$GOOGO
4⇒2 OGOL$GO
5⇒4 OL$GOOG
6⇒1 OOGOL$G
These numbers are also a method of keeping track of the transform
Let's say you have a repeated subsequence
Take “GO” for example.
All the repeated subsequences are clumped together (0 & 3) GOOGOL
There's another way of looking at this from a computer science perspective
Trie
The branching structure of the tree
The branching is based on the alphabet. The branching factor is the number of letters in your alphabet.
BWT is aka a Prefix Trie
[Insert Diagram] See the BWA paper for the picture [Fast and Accurate Short read alignment with Burrows-Wheeler Transform, Heng Li and Richard Durbin, bioinformatics 2009]
It turns out there is a nice simple correspondence between the trie and the array of numbers in the BWT.
Let's say you want to find a subsequence [OG] in the Trie.
Invert the string and trace the path back to the first character. Each possibility is a child of that node.
The table representation can also find all string starting with [OG] (4,4)
Each node in the tree corresponds to a range from the array of the BWT.
[GO] has range (1,2)
Trie is not stored, but enough data to recalculate the trie is stored.
Algorithm
C is the number of letters in the first letter of the BWT matrix lexicographically before a. [G⇒1,L⇒3,O⇒4)
O(a,i) = # occurrences of letter a within B[0..1]
R_(aw) = C(a) + O(a,R_(w)-1)+1, where a is a prefix letter and w is a word. R_ is the lower bound
R-(aw) = C(a) + O(a, r-(w), where a is a prefix letter and w is a word. R- is the upper bound
Let's see if we can map “GO”
R_(nil) = 0
R-(nil) = 6
R_(“O”) = c(O) + O(“O”, -1) +1 = 4
R-(“O”) = c(O) + O(“O”) + 1 = 6
R_(“GO”) =
Kevin recommends reading the paper. Fast and accurate short read alignment with Burrows-Wheeler transform.
References
1.
a
Heng Li and Richard Durbin.
Bioinformatics. 2009 Jul 15;25(14):1754-60. Epub 2009 May 18.
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK.
doi:10.1093/bioinformatics/btp324