Banana Slug Genomics

Site Tools

lecture_notes:04-22-2011

This is an old revision of the document!

Burrows Wheeler Aligner

We discussed the bwa. It uses the Burrows Wheeler Transform to represent a prefix trie, allowing for short read alignment with mismatches and gaps.

The prefix trie

The prefix trie is a tree built from possible prefixes, starting at the end of a branch and going backwards to the root. Each node is the range in the suffix array where that prefix may be found. Each edge is a character in the prefix. A string can be found–allowing for mismatches and gaps–in the prefix tree by going up each branch and comparing characters to the prefix until the all characters have been matched or the mismatch limit is met.

Suffix array

The suffix array is found by appending an end character and sorting the cyclic transformation of the resulting string lexicographically and storing the original indices.

Example:

` GOOGOL`

Append end character:

` X = GOOGOL\$`

Cyclic transformation:

``` 0 GOOGOL\$
1 OOGOL\$G
2 OGOL\$GO
3 GOL\$GOO
4 OL\$GOOG
5 L\$GOOGO
6 \$GOOGOL```

Sort:

``` 6 \$GOOGOL
3 GOL\$GOO
0 GOOGOL\$
5 L\$GOOGO
2 OGOL\$GO
4 OL\$GOOG
1 OOGOL\$G```
` S(i) = [6,3,0,5,2,4,1]`

The Burrows Wheeler transform takes the last character of the sorted cyclic strings:

` B(i) = LO\$OOGG`

Searching

We can use the FM-index to find the upper and lower bounds of a substring.

C_X(a) is the number of characters lexicographically before a in X.

``` \$ 0
G 1
L 3
O 4```

O_X(a,i) is the number of occurrences of a in B_X[0,i].

B_X[0,i] O_X(G,i) O_X(L,i) O_X(O,i)
1 L 0 1 0
2 LO 0 1 1
3 LO\$ 0 1 1
4 LO\$O 0 1 2
5 LO\$OO 0 1 3
6 LO\$OOG 1 1 3
7 LO\$OOGG 2 1 3