User Tools

Site Tools


lecture_notes:04-22-2011

This is an old revision of the document!


A PCRE internal error occured. This might be caused by a faulty plugin

====== Burrows Wheeler Aligner ====== We discussed the [[bioinformatic_tools:bwa]]. It uses the Burrows Wheeler Transform to represent a prefix trie, allowing for short read alignment with mismatches and gaps. ===== The prefix trie ====== The prefix trie is a tree built from possible prefixes, starting at the end of a branch and going backwards to the root. Each node is the range in the suffix array where that prefix may be found. Each edge is a character in the prefix. A string can be found--allowing for mismatches and gaps--in the prefix tree by going up each branch and comparing characters to the prefix until the all characters have been matched or the mismatch limit is met. ===== Suffix array ====== The suffix array is found by appending an end character and sorting the cyclic transformation of the resulting string lexicographically and storing the original indices. Example: GOOGOL Append end character: X = GOOGOL$ Cyclic transformation: 0 GOOGOL$ 1 OOGOL$G 2 OGOL$GO 3 GOL$GOO 4 OL$GOOG 5 L$GOOGO 6 $GOOGOL Sort: 6 $GOOGOL 3 GOL$GOO 0 GOOGOL$ 5 L$GOOGO 2 OGOL$GO 4 OL$GOOG 1 OOGOL$G S(i) = [6,3,0,5,2,4,1] The Burrows Wheeler transform takes the last character of the sorted cyclic strings: B(i) = LO$OOGG ===== Searching ===== We can use the FM-index to find the upper and lower bounds of a substring. C_X(a) is the number of characters lexicographically before a in X. $ 0 G 1 L 3 O 4 O_X(a,i) is the number of occurrences of a in B_X[0,i]. ^ B_X[0,i]: ^ O_X(G,i): ^ O_X(L,i) ^ O_X(O,i) | 1 L | 0 | 1 | 0 | 2 LO | 0 | 1 | 1 | 3 LO$ | 0 | 1 | 1 | 4 LO$O | 0 | 1 | 2 | 5 LO$OO | 0 | 1 | 3 | 6 LO$OOG | 1 | 1 | 3 | 7 LO$OOGG | 2 | 1 | 3

You could leave a comment if you were logged in.
lecture_notes/04-22-2011.1307662703.txt.gz · Last modified: 2011/06/09 23:38 by eyliaw