We examined the 19-mers with the highest counts from Illumina run 1, excluding the control lane. Many came from low-complexity sequences, and Illumina adapters also appeared often. SeqPrep can remove these adapter sequences, but we still need to look into removing sequences that differ from the adapter sequence by one base, which are still quite common.
Get the reads cleaned up this week so we can run an assembler over the weekend. Run SeqPrep first, and then Quake.
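As a rough way to quantify the near-adapter problem noted above, the sketch below counts k-mers and flags those within one mismatch of any adapter k-mer. It assumes "differ by 1" means a single substitution (Hamming distance 1, no indels). The function names are made up for illustration, the adapter string must be supplied by the user, and this is not how SeqPrep or Quake work internally.

```python
from collections import Counter


def count_kmers(reads, k=19):
    """Count every k-mer (as a plain string) across an iterable of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def near_adapter_kmers(counts, adapter, k=19, max_mismatches=1):
    """Return {kmer: count} for k-mers within max_mismatches of an adapter k-mer.

    Assumption: substitutions only; insertions and deletions are not considered.
    """
    adapter_kmers = {adapter[i:i + k] for i in range(len(adapter) - k + 1)}
    hits = {}
    for kmer, n in counts.items():
        if any(hamming(kmer, a) <= max_mismatches for a in adapter_kmers):
            hits[kmer] = n
    return hits
```

Running count_kmers on the reads and then near_adapter_kmers with the actual adapter sequence would show how much of the high-count list is explained by exact or near-exact adapter matches. The comparison is brute force, so it is only practical on the top of the count list, not on every k-mer in the run.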
We continued our discussion of the paper by Li and Durbin on short sequence alignment with the Burrows-Wheeler transform. The main idea is to use an implicit prefix trie to search for substrings. Instead of storing the trie, two arrays are used to find the range in the suffix array where the substring is present. A fraction of each array is kept in memory and the rest is calculated on the fly.
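To make the range search concrete, here is a minimal sketch of exact-match backward search. It assumes the two arrays are the usual cumulative count array (called C here) and the occurrence table over the BWT (called occ here). For simplicity both are stored in full and the suffix array is built by naive sorting, so this illustrates the idea only and does not reflect the memory-saving layout discussed in the paper.

```python
def build_index(text):
    """Build suffix array, C array and occ table for text (naive, small inputs only)."""
    text = text + "$"  # sentinel, sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix
    alphabet = sorted(set(text))
    # C[c] = number of characters in text that are strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def backward_search(pattern, C, occ, n):
    """Return the half-open suffix array interval [lo, hi) of suffixes starting with pattern."""
    lo, hi = 0, n
    for ch in reversed(pattern):  # extend the match one character to the left each step
        if ch not in C:
            return 0, 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi
```

For example, after sa, C, occ = build_index(genome), calling backward_search(read, C, occ, len(sa)) gives an interval whose width hi - lo is the number of exact occurrences, and sa[lo:hi] lists their start positions.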
The suffix array is not required if we only need to count the number of occurrences of each substring. However, it needs to be kept if we want to find where the substring occurs. The paper presents a method of reducing the memory required to store the suffix array.
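One common way to cut suffix array memory, sketched here under assumptions, is to keep only the entries whose text position is a multiple of some step and recover the others by walking backwards through the text with the BWT. The names (build_sampled_index, locate, step) are made up for illustration, and the paper's exact scheme may differ in detail.

```python
def build_sampled_index(text, step=4):
    """Build BWT, C, occ and a sampled suffix array (naive, small inputs only)."""
    text = text + "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]
    alphabet = sorted(set(text))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total  # number of characters strictly smaller than c
        total += text.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    # Keep only suffix array rows whose text position is a multiple of step.
    sampled = {row: pos for row, pos in enumerate(sa) if pos % step == 0}
    return bwt, C, occ, sampled, sa  # full sa is returned only for checking


def locate(row, bwt, C, occ, sampled):
    """Recover the text position for a suffix array row using only the samples."""
    steps = 0
    while row not in sampled:
        ch = bwt[row]
        row = C[ch] + occ[ch][row]  # move to the row of the suffix one position earlier
        steps += 1
    return sampled[row] + steps
```

With step=4 only about a quarter of the positions are stored, at the cost of up to step - 1 extra lookups per located hit. Counting occurrences never calls locate, which matches the point above that the suffix array is only needed for finding positions.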