This is an old revision of the document!
Workaround for broken symlinks on head and children nodes - Check assemblies/Pog/map-colorspace-5/Makefile for the workaround - explicitly assigns working directory rather than using cwd - Kevin has been working on H. Pylori, which is a much more challenging genome than Pog. MAy have developed some new methods for dealing with genome assembly - One trick is a method to estimate how frequently a contig appears in a genome. - Look at reads/base. For long regions, reads/base is roughly constant. - Expect to have more reads/base in repeat regions. - This measure is more noisy with smaller regions - Estimated gap size is helpful for interpolating smaller contigs between other contigs. - Short fragments can fit in regions with large estimated gap size. - Biological tricks: - c->b->a->b'->d - an inversion of b->a->b' turns the scaffold into c->b->a'->b'->d...(I think this should be c->b'->a'->b->d) Burrows-Wheeler Transform (BWT) - The algorithm behind bowtie, bwa, soap2 and possibly a few other mapping algorithms. - The strength of this is it allows you to map short reads to a large genome rapidly and without alot of memory. - Dependent on a reference genome. - Indexes the genome and produces a fairly compact genome. - The current implementations are all optimized for short reads. - The algorithm inherently gets worse as the reads get longer. - Longer reads will have to be broken into shorter pieces, indexed the reads separately and piece them back together. What is the Burrows-Wheeler Transform and where did it come from? - Came from data compression. - Original idea was to make a string and take an invertible string. - The converted string would be much easier to compress than the original string. - Also used for bzip. - Example of how BWT works: - **Note must, have an end marker at the end of the string** - Take the initial string (GOOGOL$) and all rotations: - GOOGOL$ - OOGOL$G - OGOL$GO - GOL$GOO - OL$GOOG - L$GOOGO - Sort them Lexographically - $GOOGOL - GOL$GOO - GOOGOL$ - L$GOOGO - OGOL$GO - OL$GOOG - OOGOL$G - The BWT is the last column Array **B** - **1** $GOOGO**L** - **2** GOL$GO**O** - **3** GOOGOL**$** - **4** L$GOOG**O** - **5** OGOL$G**O** - **6** OL$GOO**G** - **7** OOGOL$**G** - How do you invert this thing and get the original string back? - You know the all the letters in your original string. You know how to sort them alphabetically. - This gives you the first column - You know the first and the last are paired together. - Take all the pairs you get from putting the first and last column together and reproduce the 2nd column. - What is this really representing? - You can also keep track of the permutation you did when you did the sorting. The array which maps position in BWT to original sorted array is know as the **S** - 0=>**6** $GOOGOL - 1=>**3** GOL$GOO - 2=>**0** GOOGOL$ - 3=>**5** L$GOOGO - 4=>**2** OGOL$GO - 5=>**4** OL$GOOG - 6=>**1** OOGOL$G - These numbers are also a method of keeping track of the transform - Let's say you have a repeated subsequence - Take "GO" for example. - All the repeated subsequences are clumped together (0 & 3) **GO**O**GO**L - There's another way of looking at this from a computer science perspective - Trie - The branching structure of the tree - The branching is based on the alphabet. The branching factor is the number of letters in your alphabet. - BWT is aka a Prefix Trie - [Insert Diagram] See the BWA paper for the picture [Fast and Accurate Short read alignment with Burrows-Wheeler Transform, Heng Li and Richard Durbin, bioinformatics 2009] - It turns out there is a nice simple correspondence between the trie and the array of numbers in the BWT. - Let's say you want to find a subsequence [OG] in the Trie. - Invert the string and trace the path back to the first character. Each possibility is a child of that node. - The table representation can also find all string starting with [OG] (4,4) - Each node in the tree corresponds to a range from the array of the BWT. - [GO] has range (1,2) - Trie is not stored, but enough data to recalculate the trie is stored. - Algorithm - C is the number of letters in the first letter of the BWT matrix lexicographically before a. [G=>1,L=>3,O=>4) - O(a,i) = # occurrences of letter a within B[0..1] - R_(aw) = C(a) + O(a,R_(w)-1)+1, where a is a prefix letter and w is a word. R_ is the lower bound - R-(aw) = C(a) + O(a, r-(w), where a is a prefix letter and w is a word. R- is the upper bound - Let's see if we can map "GO" - R_(nil) = 0 - R-(nil) = 6 - R_("O") = c(O) + O("O", -1) +1 = 4 - R-("O") = c(O) + O("O") + 1 = 6 - R_("GO") = - Kevin recommends reading the paper. [Fast and Accurate Short read alignment with Burrows-Wheeler Transform, Heng Li and Richard Durbin, bioinformatics 2009]