Table of Contents

Burrows Wheeler Aligner

We discussed the bwa. It uses the Burrows Wheeler Transform to represent a prefix trie, allowing for short read alignment with mismatches and gaps.

The prefix trie

The prefix trie is a tree built from possible prefixes, starting at the end of a branch and going backwards to the root. Each node is the range in the suffix array where that prefix may be found. Each edge is a character in the prefix. A string can be found–allowing for mismatches and gaps–in the prefix tree by going up each branch and comparing characters to the prefix until the all characters have been matched or the mismatch limit is met.

Suffix array

The suffix array is found by appending an end character and sorting the cyclic transformation of the resulting string lexicographically and storing the original indices.

Example:

 GOOGOL

Append end character:

 X = GOOGOL$

Cyclic transformation:

 0 GOOGOL$
 1 OOGOL$G
 2 OGOL$GO
 3 GOL$GOO
 4 OL$GOOG
 5 L$GOOGO
 6 $GOOGOL

Sort:

 6 $GOOGOL
 3 GOL$GOO
 0 GOOGOL$
 5 L$GOOGO
 2 OGOL$GO
 4 OL$GOOG
 1 OOGOL$G
 S(i) = [6,3,0,5,2,4,1]

The Burrows Wheeler transform takes the last character of the sorted cyclic strings:

 B(i) = LO$OOGG

Searching

We can use the FM-index to find the upper and lower bounds of a substring.

C_X(a) is the number of characters lexicographically before a in X.

 G 0
 L 2
 O 3

O_X(a,i) is the number of occurrences of a in B_X[0,i].

i B_X[0,i] O_X(G,i) O_X(L,i) O_X(O,i)
1 L 0 1 0
2 LO 0 1 1
3 LO$ 0 1 1
4 LO$O 0 1 2
5 LO$OO 0 1 3
6 LO$OOG 1 1 3
7 LO$OOGG 2 1 3

There are then two recursive formulas to find the start and end positions of a substring.

 End R_(aW) = C(a) + O(a,R_(W) - 1) + 1 or 1 if W is the empty string
 Start R-(aW) = C(a) + O(a,R-(W)) or (n - 1) if W is the empty string

Where a is the first character and W is the word.