This shows you the differences between two versions of the page.
| Both sides previous revision Previous revision Next revision | Previous revision | ||
|
lecture_notes:04-22-2011 [2011/06/08 17:14] eyliaw [Suffix array] |
lecture_notes:04-22-2011 [2015/08/09 23:06] (current) 212.129.31.47 ↷ Links adapted because of a move operation |
||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ====== Burrows Wheeler Aligner ====== | ====== Burrows Wheeler Aligner ====== | ||
| - | We discussed the [[bioinformatic_tools:bwa]]. It uses the Burrows Wheeler Transform to represent a prefix trie, allowing for short read alignment with mismatches and gaps. | + | We discussed the [[archive:bioinformatic_tools:bwa]]. It uses the Burrows Wheeler Transform to represent a prefix trie, allowing for short read alignment with mismatches and gaps. |
| ===== The prefix trie ====== | ===== The prefix trie ====== | ||
| Line 11: | Line 11: | ||
| GOOGOL | GOOGOL | ||
| Append end character: | Append end character: | ||
| - | GOOGOL$ | + | X = GOOGOL$ |
| Cyclic transformation: | Cyclic transformation: | ||
| 0 GOOGOL$ | 0 GOOGOL$ | ||
| Line 21: | Line 21: | ||
| 6 $GOOGOL | 6 $GOOGOL | ||
| Sort: | Sort: | ||
| - | 6 $GOOGO**L** | + | 6 $GOOGOL |
| - | 3 GOL$GO**O** | + | 3 GOL$GOO |
| - | 0 GOOGOL**$** | + | 0 GOOGOL$ |
| - | 5 L$GOOG**O** | + | 5 L$GOOGO |
| - | 2 OGOL$G**O** | + | 2 OGOL$GO |
| - | 4 OL$GOO**G** | + | 4 OL$GOOG |
| - | 1 OOGOL$**G** | + | 1 OOGOL$G |
| S(i) = [6,3,0,5,2,4,1] | S(i) = [6,3,0,5,2,4,1] | ||
| The Burrows Wheeler transform takes the last character of the sorted cyclic strings: | The Burrows Wheeler transform takes the last character of the sorted cyclic strings: | ||
| B(i) = LO$OOGG | B(i) = LO$OOGG | ||
| + | ===== Searching ===== | ||
| + | We can use the FM-index to find the upper and lower bounds of a substring. | ||
| + | |||
| + | C_X(a) is the number of characters lexicographically before a in X. | ||
| + | G 0 | ||
| + | L 2 | ||
| + | O 3 | ||
| + | O_X(a,i) is the number of occurrences of a in B_X[0,i]. | ||
| + | ^ i ^ B_X[0,i] ^ O_X(G,i) ^ O_X(L,i) ^ O_X(O,i) ^ | ||
| + | | 1 | L | 0 | 1 | 0 | | ||
| + | | 2 | LO | 0 | 1 | 1 | | ||
| + | | 3 | LO$ | 0 | 1 | 1 | | ||
| + | | 4 | LO$O | 0 | 1 | 2 | | ||
| + | | 5 | LO$OO | 0 | 1 | 3 | | ||
| + | | 6 | LO$OOG | 1 | 1 | 3 | | ||
| + | | 7 | LO$OOGG | 2 | 1 | 3 | | ||
| + | There are then two recursive formulas to find the start and end positions of a substring. | ||
| + | End R_(aW) = C(a) + O(a,R_(W) - 1) + 1 or 1 if W is the empty string | ||
| + | Start R-(aW) = C(a) + O(a,R-(W)) or (n - 1) if W is the empty string | ||
| + | Where a is the first character and W is the word. | ||