Banana Slug Genomics

**This is an old revision of the document!** ----

A PCRE internal error occured. This might be caused by a faulty plugin

===== Suffix Arrays ===== Example Suffix Array, A[ ], of "$banana" {{:lecture_notes:screen_shot_2015-06-01_at_3.42.20_pm.png|}} ===== Burrows Wheeler Transform ===== An example BWT of "^BANANA|" {{:lecture_notes:screen_shot_2015-06-01_at_3.41.29_pm.png|}} ===== FM Index ===== For example, the BWT of the string T = "abracadabra" is "ard$rcaaaabb", and here it is represented by the matrix M where each row is a rotation of the text that has been sorted. The transform corresponds to the last column labeled L. {{:lecture_notes:screen_shot_2015-06-01_at_3.44.14_pm.png|}} C[c] is a table that, for each character c in the alphabet, contains the number of occurrences of lexically smaller characters in the text. {{:lecture_notes:screen_shot_2015-06-01_at_3.43.25_pm.png|}} The function Occ(c, k) is the number of occurrences of character c in the prefix L[1..k] {{:lecture_notes:screen_shot_2015-06-01_at_3.43.33_pm.png|}} <code> Charles M. - BWT Notes BWA datastructure lecture # Suffix array # Burrows Wheeler Transform - variation on suffix array, useful for data compression and indexing # FM index - uses both suffix arrays and BWT for fast lookup suffix array definition: X[0:n] with a '$' character spliced in the range of [0,n]. e.g. string S: z a b c d e index: 0,1,2,3,4,5 # BWT definition: B[i] = X[S[i]-1] 1) look at suffixes of a string a de cde bcde abcde zabcde 2) sort order of suffixes: Pos Suffix 6 $ 1 abcde$ 2 bcde$ 3 cde$ 4 de$ 5 e$ 0 zabcde$ 3) rotate the string: zabcde[$] abcde$[z] bcde$z[a] . . . $zabcd[e] ^ (last column is the burrows wheeler transform) left column (since it is in alphabet order) just record where a character starts since there is repetition of a character. -> $ $ $ -> A A A A ... -> C C ... 4) Record Occupancy: Definitions: C(a) # Where char's start. Occ(a,i) # Array of number of char's before i in B. (e.g. # of a's in B[0,i]) - Can store part of the Occupancy array to obtain greater compression at the cost of slightly greater cpu time. B # BWT array Use last_to_first(i) function which allows you to take the BWT string and find the 1st column row index of a BWT character. Use that row index to find the previous character in the BWT string (It's the last column character of the same row index). Occupancy function: C(B[i]) + Occ(B[i],i) # Search algorithm Given Pattern P R(P): where pattern starts Rbar(P): where pattern ends Rbar(aP) = C(a) + Occ(a, Rbar(P)) R(aP) = C(a) + Occ(a,R(P)-1)+1 </code> Additional resources and better examples can be found below: [[http://www.cs.jhu.edu/~langmea/resources/bwt_fm.pdf|BWT and FM Index tutorial]] [[http://www.ncbi.nlm.nih.gov/pubmed/19451168|BWT paper]]

You could leave a comment if you were logged in.

Banana Slug Genomics

User Tools

Site Tools

Page Tools