K-mer Counting Theory

Choosing the K-mer size involves a trade-off. A short K-mer will appear often in the genome, while a long K-mer has a higher chance of containing a sequencing error.

k = length of K-mer
b = base error rate (e.g. 0.01 = 1 base in 100 wrong)
j = # of changes or errors
G = genome size
c = coverage

Big assumption: P(A) = P(C) = P(G) = P(T) = 1/4

P( K-mer is OK ) = (1-b)^k

P( K-mer has exactly j errors ) = ( k choose j ) b^j (1-b)^(k-j) [Binomial distribution]
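
As a rough numerical check, the error probabilities can be sketched in Python. This assumes errors at each base are independent with rate b, and takes the error-free case as the j = 0 term of the binomial, (1-b)^k. The function names are made up for illustration.

```python
from math import comb


def p_kmer_ok(k: int, b: float) -> float:
    """Probability that all k bases of a K-mer are error-free."""
    return (1.0 - b) ** k


def p_exactly_j_errors(k: int, b: float, j: int) -> float:
    """Binomial probability that exactly j of the k bases are wrong."""
    return comb(k, j) * b**j * (1.0 - b) ** (k - j)


if __name__ == "__main__":
    k, b = 20, 0.01  # example values only
    print("P(OK)       =", p_kmer_ok(k, b))
    for j in range(4):
        print(f"P({j} errors) =", p_exactly_j_errors(k, b, j))
```

With j = 0 the binomial term reduces to the error-free probability, and the terms for j = 0..k sum to 1.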

# of neighboring K-mers (K-mers that differ from a given K-mer at exactly j positions)
1 change:  3 * k
2 changes: ( k choose 2 ) 3^2
j changes: ( k choose j ) 3^j
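
The neighbor counts can be checked by brute force for small k. The sketch below (Python, with hypothetical function names) compares the closed form ( k choose j ) 3^j against an explicit enumeration of all 4^k K-mers, so the enumeration is only practical for short K-mers.

```python
from itertools import product
from math import comb


def neighbor_count(k: int, j: int) -> int:
    """Closed form: choose j positions, each with 3 alternative bases."""
    return comb(k, j) * 3**j


def brute_force_neighbor_count(kmer: str, j: int) -> int:
    """Count K-mers at Hamming distance exactly j by enumerating all 4^k."""
    return sum(
        1
        for cand in product("ACGT", repeat=len(kmer))
        if sum(a != b for a, b in zip(kmer, cand)) == j
    )


if __name__ == "__main__":
    kmer = "ACGTA"  # arbitrary example K-mer
    for j in range(3):
        print(j, neighbor_count(len(kmer), j), brute_force_neighbor_count(kmer, j))
```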

P( a random K-mer is in genome ) = 1 - (1-(1/4^k))^G

P( K-mer has a "trusted" j-neighbor ) = 1 - ( (1-(1/4^k))^G ) ^ ( (k choose j) 3^j )
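
A minimal Python sketch of the two genome-membership probabilities follows. It assumes the uniform base model above and treats the G genome positions (and the individual neighbors) as independent trials. The function names are illustrative. The expressions are the same formulas rewritten with log1p and expm1, because for k of about 27 or more 1/4^k is too small to subtract from 1 directly in floating point.

```python
from math import comb, expm1, log1p


def p_random_kmer_in_genome(k: int, G: int) -> float:
    """1 - (1 - 1/4^k)^G, computed in a numerically stable form (k >= 1)."""
    return -expm1(G * log1p(-(0.25**k)))


def p_trusted_j_neighbor(k: int, G: int, j: int) -> float:
    """1 - ((1 - 1/4^k)^G)^n, where n = (k choose j) 3^j neighbors."""
    n = comb(k, j) * 3**j
    return -expm1(n * G * log1p(-(0.25**k)))


if __name__ == "__main__":
    G = 3_000_000_000  # example genome size, for illustration only
    for k in (15, 19, 25, 31):
        print(k, p_random_kmer_in_genome(k, G), p_trusted_j_neighbor(k, G, 1))
```

Printing a range of k values shows how quickly the chance of a random match drops as k grows, which is the other side of the trade-off against the per-K-mer error probability.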

