We discussed the bwa. It uses the Burrows Wheeler Transform to represent a prefix trie, allowing for short read alignment with mismatches and gaps.
The prefix trie is a tree built from possible prefixes, starting at the end of a branch and going backwards to the root. Each node is the range in the suffix array where that prefix may be found. Each edge is a character in the prefix. A string can be found–allowing for mismatches and gaps–in the prefix tree by going up each branch and comparing characters to the prefix until the all characters have been matched or the mismatch limit is met.
The suffix array is found by appending an end character and sorting the cyclic transformation of the resulting string lexicographically and storing the original indices.
Example:
GOOGOL
Append end character:
X = GOOGOL$
Cyclic transformation:
0 GOOGOL$ 1 OOGOL$G 2 OGOL$GO 3 GOL$GOO 4 OL$GOOG 5 L$GOOGO 6 $GOOGOL
Sort:
6 $GOOGOL 3 GOL$GOO 0 GOOGOL$ 5 L$GOOGO 2 OGOL$GO 4 OL$GOOG 1 OOGOL$G
S(i) = [6,3,0,5,2,4,1]
The Burrows Wheeler transform takes the last character of the sorted cyclic strings:
B(i) = LO$OOGG
We can use the FM-index to find the upper and lower bounds of a substring.
C_X(a) is the number of characters lexicographically before a in X.
G 0 L 2 O 3
O_X(a,i) is the number of occurrences of a in B_X[0,i].
i | B_X[0,i] | O_X(G,i) | O_X(L,i) | O_X(O,i) |
---|---|---|---|---|
1 | L | 0 | 1 | 0 |
2 | LO | 0 | 1 | 1 |
3 | LO$ | 0 | 1 | 1 |
4 | LO$O | 0 | 1 | 2 |
5 | LO$OO | 0 | 1 | 3 |
6 | LO$OOG | 1 | 1 | 3 |
7 | LO$OOGG | 2 | 1 | 3 |
There are then two recursive formulas to find the start and end positions of a substring.
End R_(aW) = C(a) + O(a,R_(W) - 1) + 1 or 1 if W is the empty string Start R-(aW) = C(a) + O(a,R-(W)) or (n - 1) if W is the empty string
Where a is the first character and W is the word.