This shows you the differences between two versions of the page.
Next revision | Previous revision Next revision Both sides next revision | ||
lecture_notes:04-22-2011 [2011/04/25 22:01] eyliaw created |
lecture_notes:04-22-2011 [2011/06/09 23:41] eyliaw [Searching] |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Burrows Wheeler Transform ====== | + | ====== Burrows Wheeler Aligner ====== |
+ | We discussed the [[bioinformatic_tools:bwa]]. It uses the Burrows Wheeler Transform to represent a prefix trie, allowing for short read alignment with mismatches and gaps. | ||
===== The prefix trie ====== | ===== The prefix trie ====== | ||
- | The prefix trie is a tree built from possible prefixes, starting at the end of a branch and going backwards to the root. Each node is the range in the suffix array where the | + | The prefix trie is a tree built from possible prefixes, starting at the end of a branch and going backwards to the root. Each node is the range in the suffix array where that prefix may be found. Each edge is a character in the prefix. A string can be found--allowing for mismatches and gaps--in the prefix tree by going up each branch and comparing characters to the prefix until the all characters have been matched or the mismatch limit is met. |
===== Suffix array ====== | ===== Suffix array ====== | ||
- | The suffix array is found by sorting the cyclic transformation lexicographically and taking their original indices. | + | The suffix array is found by appending an end character and sorting the cyclic transformation of the resulting string lexicographically and storing the original indices. |
- | Range in the suffix array can be found by the R_ and R- | + | |
+ | Example: | ||
+ | GOOGOL | ||
+ | Append end character: | ||
+ | X = GOOGOL$ | ||
+ | Cyclic transformation: | ||
+ | 0 GOOGOL$ | ||
+ | 1 OOGOL$G | ||
+ | 2 OGOL$GO | ||
+ | 3 GOL$GOO | ||
+ | 4 OL$GOOG | ||
+ | 5 L$GOOGO | ||
+ | 6 $GOOGOL | ||
+ | Sort: | ||
+ | 6 $GOOGOL | ||
+ | 3 GOL$GOO | ||
+ | 0 GOOGOL$ | ||
+ | 5 L$GOOGO | ||
+ | 2 OGOL$GO | ||
+ | 4 OL$GOOG | ||
+ | 1 OOGOL$G | ||
+ | |||
+ | S(i) = [6,3,0,5,2,4,1] | ||
+ | The Burrows Wheeler transform takes the last character of the sorted cyclic strings: | ||
+ | B(i) = LO$OOGG | ||
+ | ===== Searching ===== | ||
+ | We can use the FM-index to find the upper and lower bounds of a substring. | ||
+ | |||
+ | C_X(a) is the number of characters lexicographically before a in X. | ||
+ | $ 0 | ||
+ | G 1 | ||
+ | L 3 | ||
+ | O 4 | ||
+ | O_X(a,i) is the number of occurrences of a in B_X[0,i]. | ||
+ | ^ i ^ B_X[0,i] ^ O_X(G,i) ^ O_X(L,i) ^ O_X(O,i) ^ | ||
+ | | 1 | L | 0 | 1 | 0 | | ||
+ | | 2 | LO | 0 | 1 | 1 | | ||
+ | | 3 | LO$ | 0 | 1 | 1 | | ||
+ | | 4 | LO$O | 0 | 1 | 2 | | ||
+ | | 5 | LO$OO | 0 | 1 | 3 | | ||
+ | | 6 | LO$OOG | 1 | 1 | 3 | | ||
+ | | 7 | LO$OOGG | 2 | 1 | 3 | | ||
+ | |||
+ | |