User Tools

Site Tools


lecture_notes:04-22-2011

This is an old revision of the document!


A PCRE internal error occured. This might be caused by a faulty plugin

====== Burrows Wheeler Aligner ====== We discussed the [[bioinformatic_tools:bwa]]. It uses the Burrows Wheeler Transform to represent a prefix trie, allowing for short read alignment with mismatches and gaps. ===== The prefix trie ====== The prefix trie is a tree built from possible prefixes, starting at the end of a branch and going backwards to the root. Each node is the range in the suffix array where that prefix may be found. Each edge is a character in the prefix. A string can be found--allowing for mismatches and gaps--in the prefix tree by going up each branch and comparing characters to the prefix until the all characters have been matched or the mismatch limit is met. ===== Suffix array ====== The suffix array is found by appending an end character and sorting the cyclic transformation of the resulting string lexicographically and storing the original indices. Example: GOOGOL Append end character: X = GOOGOL$ Cyclic transformation: 0 GOOGOL$ 1 OOGOL$G 2 OGOL$GO 3 GOL$GOO 4 OL$GOOG 5 L$GOOGO 6 $GOOGOL Sort: 6 $GOOGOL 3 GOL$GOO 0 GOOGOL$ 5 L$GOOGO 2 OGOL$GO 4 OL$GOOG 1 OOGOL$G S(i) = [6,3,0,5,2,4,1] The Burrows Wheeler transform takes the last character of the sorted cyclic strings: B(i) = LO$OOGG ===== Searching ===== We can use the FM-index to find the upper and lower bounds of a substring. C_X(a) is the number of characters lexicographically before a in X. $ 0 G 1 L 3 O 4 O_X(a,i) is the number of occurrences of a in B_X[0,i]. ^ i ^ B_X[0,i] ^ O_X(G,i) ^ O_X(L,i) ^ O_X(O,i) ^ | 1 | L | 0 | 1 | 0 | | 2 | LO | 0 | 1 | 1 | | 3 | LO$ | 0 | 1 | 1 | | 4 | LO$O | 0 | 1 | 2 | | 5 | LO$OO | 0 | 1 | 3 | | 6 | LO$OOG | 1 | 1 | 3 | | 7 | LO$OOGG | 2 | 1 | 3 |

You could leave a comment if you were logged in.
lecture_notes/04-22-2011.1307662863.txt.gz · Last modified: 2011/06/09 23:41 by eyliaw