Choosing the K-mer size involves a trade-off. A short K-mer will appear often in the genome, while a long K-mer has a higher chance of containing an error.
k = length of K-mer
b = base error rate (e.g. 0.01, i.e. 1 base in 100 wrong)
j = # of changes or errors
G = genome size
c = coverage

Big assumption: P(A) = P(C) = P(G) = P(T) = 1/4

P( K-mer is OK ) = (1 - b)^k
P( K-mer has exactly j errors ) = ( k choose j ) b^j (1 - b)^(k-j)   [Binomial distribution]

Number of neighboring K-mers:
  1 change:  3 * k
  2 changes: ( k choose 2 ) 3^2
  j changes: ( k choose j ) 3^j

P( a random K-mer is in genome ) = 1 - (1 - 1/4^k)^G
P( K-mer has a "trusted" j-neighbor ) = 1 - ( (1 - 1/4^k)^G )^( (k choose j) 3^j )
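
To get a feel for the trade-off, the formulas can be evaluated numerically. The following is a minimal Python sketch, not part of the original notes. The function names are made up for illustration, and P( K-mer is OK ) is taken as (1 - b)^k, the j = 0 case of the binomial. Because 1/4^k is far below floating-point precision for realistic k, the sketch computes 1 - (1 - x)^n through log1p and expm1 instead of directly. That is an implementation choice made here and does not change the formulas.

```python
from math import comb, expm1, log1p

# All function names below are hypothetical, chosen for this sketch only.

def p_kmer_ok(k, b):
    """P(K-mer is OK): all k bases were read correctly."""
    return (1.0 - b) ** k

def p_kmer_j_errors(k, b, j):
    """P(K-mer has exactly j errors), binomial distribution."""
    return comb(k, j) * b**j * (1.0 - b) ** (k - j)

def n_neighbors(k, j):
    """Number of K-mers that differ in exactly j positions."""
    return comb(k, j) * 3**j

def p_random_kmer_in_genome(k, G):
    """1 - (1 - 1/4^k)^G, assuming uniform, independent bases."""
    # (1 - x)^G = exp(G * log1p(-x)); 1 - exp(y) = -expm1(y)
    return -expm1(G * log1p(-(0.25**k)))

def p_trusted_j_neighbor(k, G, j):
    """1 - ((1 - 1/4^k)^G)^(C(k,j) * 3^j)."""
    return -expm1(G * n_neighbors(k, j) * log1p(-(0.25**k)))

if __name__ == "__main__":
    b = 0.01           # base error rate from the notes
    G = 5_000_000      # assumed example genome size, not from the notes
    for k in (11, 15, 21, 31):
        print(
            k,
            p_kmer_ok(k, b),
            p_random_kmer_in_genome(k, G),
            p_trusted_j_neighbor(k, G, 1),
        )
```

As k grows, the first probability falls (more K-mers contain an error) while the last two fall much faster (fewer chance matches in the genome), which is the trade-off described above.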