lecture_notes:05-03-2010

Workaround for broken symlinks on head and children nodes

- Check assemblies/Pog/map-colorspace-5/Makefile for the workaround
- explicitly assigns working directory rather than using cwd

- Kevin has been working on H. Pylori, which is a much more challenging genome than Pog. MAy have developed some new methods for dealing with genome assembly
- One trick is a method to estimate how frequently a contig appears in a genome.
- Look at reads/base. For long regions, reads/base is roughly constant.
- Expect to have more reads/base in repeat regions.
- This measure is more noisy with smaller regions

- Estimated gap size is helpful for interpolating smaller contigs between other contigs.
- Short fragments can fit in regions with large estimated gap size.

- Biological tricks:
- c→b→a→b'→d
- an inversion of b→a→b' turns the scaffold into c→b→a'→b'→d… Note that flipping b→a→b' reverses and complements the sequences, so that it becomes (b')→a'→b' or b→a'→b'.

Burrows-Wheeler Transform (BWT)

- The algorithm behind bowtie, bwa, soap2 and possibly a few other mapping algorithms.
- The strength of this is it allows you to map short reads to a large genome rapidly and without alot of memory.
- Dependent on a reference genome.
- Indexes the genome and produces a fairly compact genome.
- The current implementations are all optimized for short reads.
- The algorithm inherently gets worse as the reads get longer.
- Longer reads will have to be broken into shorter pieces, indexed the reads separately and piece them back together.

What is the Burrows-Wheeler Transform and where did it come from?

- Came from data compression.
- Original idea was to make a string and take an invertible string.
- The converted string would be much easier to compress than the original string.
- Also used for bzip.
- Example of how BWT works:
**Note must, have an end marker at the end of the string**- Take the initial string (GOOGOL$) and all rotations:
- GOOGOL$
- OOGOL$G
- OGOL$GO
- GOL$GOO
- OL$GOOG
- L$GOOGO

- Sort them Lexographically
- $GOOGOL
- GOL$GOO
- GOOGOL$
- L$GOOGO
- OGOL$GO
- OL$GOOG
- OOGOL$G

- The BWT is the last column Array
**B****1**$GOOGO**L****2**GOL$GO**O****3**GOOGOL**$****4**L$GOOG**O****5**OGOL$G**O****6**OL$GOO**G****7**OOGOL$**G**

- How do you invert this thing and get the original string back?
- You know the all the letters in your original string. You know how to sort them alphabetically.
- This gives you the first column

- You know the first and the last are paired together.
- Take all the pairs you get from putting the first and last column together and reproduce the 2nd column.

- What is this really representing?
- You can also keep track of the permutation you did when you did the sorting. The array which maps position in BWT to original sorted array is know as the
**S**- 0⇒
**6**$GOOGOL - 1⇒
**3**GOL$GOO - 2⇒
**0**GOOGOL$ - 3⇒
**5**L$GOOGO - 4⇒
**2**OGOL$GO - 5⇒
**4**OL$GOOG - 6⇒
**1**OOGOL$G

- These numbers are also a method of keeping track of the transform

- Let's say you have a repeated subsequence
- Take “GO” for example.
- All the repeated subsequences are clumped together (0 & 3)
**GO**O**GO**L

- There's another way of looking at this from a computer science perspective
- Trie
- The branching structure of the tree
- The branching is based on the alphabet. The branching factor is the number of letters in your alphabet.
- BWT is aka a Prefix Trie
- [Insert Diagram] See the BWA paper for the picture [Fast and Accurate Short read alignment with Burrows-Wheeler Transform, Heng Li and Richard Durbin, bioinformatics 2009]

- It turns out there is a nice simple correspondence between the trie and the array of numbers in the BWT.
- Let's say you want to find a subsequence [OG] in the Trie.
- Invert the string and trace the path back to the first character. Each possibility is a child of that node.
- The table representation can also find all string starting with [OG] (4,4)
- Each node in the tree corresponds to a range from the array of the BWT.
- [GO] has range (1,2)

- Trie is not stored, but enough data to recalculate the trie is stored.

- Algorithm
- C is the number of letters in the first letter of the BWT matrix lexicographically before a. [G⇒1,L⇒3,O⇒4)
- O(a,i) = # occurrences of letter a within B[0..1]
- R_(aw) = C(a) + O(a,R_(w)-1)+1, where a is a prefix letter and w is a word. R_ is the lower bound
- R-(aw) = C(a) + O(a, r-(w), where a is a prefix letter and w is a word. R- is the upper bound
- Let's see if we can map “GO”
- R_(nil) = 0
- R-(nil) = 6
- R_(“O”) = c(O) + O(“O”, -1) +1 = 4
- R-(“O”) = c(O) + O(“O”) + 1 = 6
- R_(“GO”) =

- Kevin recommends reading the paper.
**Fast and accurate short read alignment with Burrows-Wheeler transform.**[1]

Bioinformatics. 2009 Jul 15;25(14):1754-60. Epub 2009 May 18.

Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK.

doi:10.1093/bioinformatics/btp324

You could leave a comment if you were logged in.

lecture_notes/05-03-2010.txt · Last modified: 2010/05/09 01:11 by galt