lecture_notes:05-09-2011

Presented by Edward. We talked about the reasoning behind kmer counting for error correction. Since we know the coverage to be some value, we expect each kmer from the genome to appear roughly that many times across all the reads. Sequencing miscalls occur at only about a 1% error rate, so kmers that contain a miscall should appear far less often than correct ones.
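
As a rough illustration of the counting argument above, here is a minimal Python sketch. The function name count_kmers, the toy genome, and the cutoff of 3 are made up for the example and are not taken from Quake; real reads would be short overlapping fragments rather than full copies of the genome.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy data (hypothetical): five error-free copies of a short "genome"
# plus one copy with a single miscall (A -> T at position 7).
genome = "ACGTTGCAAGGCTTAC"
bad_read = genome[:7] + "T" + genome[8:]
reads = [genome] * 5 + [bad_read]

counts = count_kmers(reads, k=4)

# Kmers seen fewer than 3 times (arbitrary cutoff for this toy data)
# are the ones that overlap the miscalled base.
suspect = sorted(km for km, c in counts.items() if c < 3)
print(counts["ACGT"])  # 6: present in every read
print(suspect)         # ['CTAG', 'GCTA', 'TAGG', 'TGCT']
```

The kmers covering the miscall show up once, while every true kmer shows up five or six times, which is the gap the correction step relies on.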

We also discussed the qmer counting approach, where Quake increments the count by the probability of the base being correct instead of by 1. That probability is interpreted from the Phred score Q = -10*log_10(1-P), where P is the probability of the base being correct.
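
A small Python sketch of the quality-weighted counting idea follows. It inverts the Phred formula to get P = 1 - 10^(-Q/10) and, as an assumption of this sketch, weights each kmer occurrence by the product of P over all of its bases. The names phred_to_p_correct and count_qmers are hypothetical.

```python
from collections import defaultdict
from math import prod

def phred_to_p_correct(q):
    """Invert Q = -10*log10(1 - P) to get P, the chance the base is correct."""
    return 1.0 - 10 ** (-q / 10)

def count_qmers(reads, k):
    """reads is a list of (sequence, phred_scores) pairs.

    Each kmer occurrence adds the product of its bases' P values
    (an assumption of this sketch) instead of adding 1.
    """
    weights = defaultdict(float)
    for seq, quals in reads:
        p = [phred_to_p_correct(q) for q in quals]
        for i in range(len(seq) - k + 1):
            weights[seq[i:i + k]] += prod(p[i:i + k])
    return weights

# Two reads with the same sequence but very different base qualities.
reads = [("ACGT", [30, 30, 30, 30]), ("ACGT", [10, 10, 10, 10])]
weights = count_qmers(reads, k=4)
print(phred_to_p_correct(20))  # about 0.99
print(weights["ACGT"])         # about 1.65 rather than a plain count of 2
```

The low-quality read contributes only about 0.66 to the total, so kmers supported mainly by unreliable bases end up with a lower weight than a raw count would give them.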

Lastly, we talked about how Quake processes corrections. It uses a probability model based on the GC% and the Illumina base miscall substitution rate to calculate the likelihood of a miscall. It then tries corrections in order of decreasing likelihood, starting with the most probable error, until it finds one that matches a trusted kmer. It uses a bit array to store the trusted kmers, so the actual counts of each kmer are not stored; one weakness of this approach is that it does not take the likelihood of that kmer occurring into account when making the corrections. After the first match, it keeps searching for another matching correction, up to a threshold. If more than one is found, the correction is ambiguous and it discards the result instead of correcting it.
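
The search described above can be sketched in Python as follows. This is a heavily simplified illustration, not Quake's actual implementation: it only tries single-base substitutions, it derives the likelihood from the Phred score alone (split evenly over the three alternative bases) instead of using GC% and substitution rates, it uses a Python set in place of the bit array, and the names correct_read and ratio are hypothetical.

```python
def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def all_trusted(seq, k, trusted):
    return all(km in trusted for km in kmers(seq, k))

def correct_read(seq, quals, k, trusted, ratio=0.1):
    """Return the corrected read, or None if there is no fix or it is ambiguous."""
    if all_trusted(seq, k, trusted):
        return seq
    # One candidate per (position, alternative base), scored by how likely
    # that base was miscalled according to its Phred score.
    candidates = []
    for i, (base, q) in enumerate(zip(seq, quals)):
        p_err = 10 ** (-q / 10)
        for alt in "ACGT":
            if alt != base:
                candidates.append((p_err / 3, i, alt))
    candidates.sort(key=lambda c: -c[0])  # most likely error first

    hits, best = [], None
    for p, i, alt in candidates:
        # After the first hit, stop once candidates become much less likely.
        if hits and p < best * ratio:
            break
        fixed = seq[:i] + alt + seq[i + 1:]
        if all_trusted(fixed, k, trusted):
            hits.append(fixed)
            if best is None:
                best = p
            if len(hits) > 1:
                break  # a second match makes the correction ambiguous
    return hits[0] if len(hits) == 1 else None

genome = "ACGTTGCAAGGCTTAC"
trusted = set(kmers(genome, 4))
read = genome[:7] + "T" + genome[8:]   # miscall at position 7
quals = [30] * 7 + [5] + [30] * 8      # low quality at the miscalled base
print(correct_read(read, quals, 4, trusted) == genome)  # True

# Two equally likely fixes (ACCTA or ACGTA) make the read ambiguous.
trusted2 = set(kmers("ACGTA", 3)) | set(kmers("ACCTA", 3))
print(correct_read("ACTTA", [20] * 5, 3, trusted2))     # None
```

Because the trusted set only records membership, the two candidate fixes in the second example cannot be ranked by how often their kmers were observed, which is the weakness noted above.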

lecture_notes/05-09-2011.txt · Last modified: 2011/05/14 01:26 by eyliaw