The 2nd homework will be due. Make sketches of what these things will look like; think about what he showed on Wednesday.
H. pylori: there were a couple of pieces to be copied and inserted, then run through Newbler again, and they disappeared. Ran another check with check-hypotheses-with-solid, which works well for small regions.
There are 2 different models of how the genome assembles. Look for cases where a read maps into one place but not the other.
For the ones that disappeared in the Newbler remapping, check-hypotheses-with-solid showed that there was support for both variants. About 30% of the population had the insertion they wanted to make. There were 3 places, and 2 disappeared; the majority read wasn't there. Newbler was cleaning up a bubble.
Look for it in the reads. There are multiple copies in the genome; megablast found hits, annotated as a transposase. It seems to be really active in this strain. Where does it insert? What are the recognition sites? Where are all the places it inserts?
Library prep can sometimes distort the ratios.
It is not known a priori how many plasmids there are.
For Pog, expect a small number of cells to be infected. Plasmids tend to be helpful and more ubiquitous.
Homework: plots of R vs. F. There is no single correct answer.
You can take a look and produce plots from the Pog data; the deletions and insertions will not appear in Pog, however. You can also take a look at H. pylori: projects/lowelab/users/~karplus/ plot /hp. Find the mate-pair mapping; colormapping 8, 6, 12, 16, 19 is the newest. Can look in trim9.chr19-delete.rdb and -invert.rdb.
Nader: For the mate-pairs, how are you filtering? That could be a source of bias. -- Have not done it, partly because it is expensive and we are not quantitating off them. Also, removing identical reads does not help because the error rate is so high: about 2.5% per base, starting low but getting high by the end of the read.
Trying to get 20X coverage that maps uniquely. 40X is still better than 20X, even if you did 100X and 60X of it is duplicates.
3 things:
1. How the SOLiD data is being used: not a high-powered algorithm, but it works OK for smaller stuff. Kind of like what's used in BLAST.
BLAST uses a seed, a short k-mer, and then tries to extend an alignment; you only need to look where that k-mer occurs. The basic notion is the k-mer match.
I have a short read; I am interested in it if a big-enough k-mer from it matches. If it matches a k-mer at exactly one place in the genome, I consider that a good match. Even if the contigs are slightly imperfectly ordered, it won't matter much, so using the reference-assembly contigs is OK.
The problem is that anything with an error in the first base won't map. You find stuff quickly this way, but you don't find everything you should. It was good enough for some of the work.
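A minimal sketch of this first-k-mer strategy (the function names and structure are illustrative, not the actual code):

    from collections import defaultdict

    def build_kmer_index(genome, k):
        """Map each k-mer to the list of positions where it occurs."""
        index = defaultdict(list)
        for i in range(len(genome) - k + 1):
            index[genome[i:i + k]].append(i)
        return index

    def map_read_first_kmer(read, index, k):
        """Accept the read only if its first k-mer hits exactly one place."""
        hits = index.get(read[:k], [])
        return hits[0] if len(hits) == 1 else None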
The next version did an initial-or-final k-mer match: check if the first k bases match uniquely; if so, stop, done. Otherwise try the last k bases (only 24-colorspace-readlength). That picked up more stuff that had failed the first time, especially if the bad colors came from clogged fluidics.
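A sketch of the initial-or-final fallback, reusing the index sketch above (again illustrative only):

    def map_read_either_end(read, index, k):
        """Try the first k-mer; if that fails, try the last k-mer."""
        pos = map_read_first_kmer(read, index, k)
        if pos is not None:
            return pos
        hits = index.get(read[-k:], [])
        if len(hits) == 1:
            return hits[0] - (len(read) - k)   # infer where the read starts
        return None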
The 3rd version looked at shorter k-mers. Since it's not clear which k-mer to trust, it checks all the shorter k-mers in the read and wants to find two k-mers that match the same location. More aggressive; it managed to find almost all of the SOLiD reads. It tends to find hits in multiple places, but that's usually error, so now only reads mapping to a single location are accepted.
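A sketch of that aggressive version: slide every short k-mer along the read and require two of them to vote for the same location (a guess at the idea, not the real code):

    from collections import Counter

    def map_read_two_kmers(read, index, k):
        """Accept a location only if >= 2 k-mers agree on it and it is unique."""
        votes = Counter()
        for offset in range(len(read) - k + 1):
            for pos in index.get(read[offset:offset + k], []):
                votes[pos - offset] += 1       # implied start of the read
        starts = [s for s, n in votes.items() if n >= 2]
        return starts[0] if len(starts) == 1 else None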
39 million read pairs, in Python. Didn't want to have to do full alignments.
Exact match on the k-mer; he wants the uniquely mapping ones.
Then combine with the other end of the pair: find all locations for each of R and F, and look for placements that match them together.
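A sketch of the pairing step; the insert-size window below is an assumed parameter, not from the lecture:

    def pair_locations(r_starts, f_starts, min_sep=500, max_sep=5000):
        """Keep (r, f) placements whose separation fits the library insert size."""
        return [(r, f) for r in r_starts for f in f_starts
                if min_sep <= abs(f - r) <= max_sep]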
To handle indels better with this scheme, he could possibly tolerate +/- 3 positions; don't know if he actually put that in the code.
mapcolorspace goes through a series of these: first k-mer, last k-mer, then the small k-mers for the more aggressive rescue matching. It takes the cheapest one first. Each step should have few false positives and some false negatives; the first step has lots of false negatives, and the last step should have fewer. Managed to pick up 96% or more of the data; less than 1% didn't match either end, and 2-3% matched just one end. About 60% map on the initial k-mer (fast), a few percent on the final k-mer (also fast), and the last step is expensive but has far fewer troublesome reads to worry about.
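The cascade might look roughly like this, chaining the earlier sketches from cheapest to most expensive (the real mapcolorspace surely differs in detail):

    def map_cascade(read, index, k_long, k_short):
        """Run the cheap mappers first; only leftovers hit the expensive rescue."""
        for mapper in (lambda r: map_read_first_kmer(r, index, k_long),
                       lambda r: map_read_either_end(r, index, k_long),
                       lambda r: map_read_two_kmers(r, index, k_short)):
            pos = mapper(read)
            if pos is not None:
                return pos
        return None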
How would you do an exact k-mer search with a fairly large k, e.g. 13-mer to 24-mer?
Hash the reference genome (and also the reverse complement), skipping along. Got worse behavior than expected: not linear time, almost N-squared in the genome size. Tried hashing a bunch of numbers; that wasn't the problem. It was slicing: each Python string slice is a separate object, so memory allocation dominated. Came up with a work-around: translating to binary is close, but he just used strings of digits as numbers, with a digit put in front so the leading zeros aren't lost. Python has support for huge numbers. The lookup is a little bit slower now, but the building time is a huge savings.
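One way to realize the digit trick (whether this matches his actual work-around is an assumption): translate the sequence to digits once, prepend a sentinel digit so leading zeros survive, and roll the integer along so no per-k-mer slice is allocated.

    ENCODE = str.maketrans("ACGT", "0123")

    def kmer_keys(seq, k):
        """Yield (position, integer key) for every k-mer of seq."""
        digits = seq.translate(ENCODE)
        base, shift = 10 ** k, 10 ** (k - 1)
        key = int("1" + digits[:k])            # leading '1' keeps the zeros
        yield 0, key
        for i in range(1, len(digits) - k + 1):
            # drop the leading digit of the k-mer, append the next one
            key = base + ((key - base) % shift) * 10 + int(digits[i + k - 1])
            yield i, key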
With a real language like C, he could have used a binary encoding and pointers into the huge genome string.
Can't do this for Slug; it's too huge. Memory-efficiency tricks are still important.
Sequencing throughput is not doubling each year; it's growing more like 5x.
Got to use the memory smarter, old-school.
What can we do besides a single k-mer? Allow mutations? Multiple seeds, as used in BLAT: it uses a fixed-length seed and requires multiple hits in the right region. That only works if your sequences are similar enough, and the k-mers have to be pretty long.
Spaced seeds: match at a few positions, don't care at a few others. That makes the seed span more, and in combination with multiple seeds, instead of needing a block where everything matches, a substitution can fall in a don't-care hole and the seed still matches. It tolerates a little error, and you can design optimally spaced seeds. It would be expensive to do in Python; in a compiled binary it's just shift, add, and skip, very fast.
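A sketch of spaced-seed key extraction; the pattern below is PatternHunter's well-known 18-position, weight-11 seed (1 = must match, 0 = don't care):

    SEED = "111010010100110111"

    def spaced_key(seq, start, seed=SEED):
        """Keep only the 'care' positions of the window as the hash key,
        so a substitution falling in a 0-hole still lets the seed match."""
        window = seq[start:start + len(seed)]
        if len(window) < len(seed):
            return None
        return "".join(c for c, bit in zip(window, seed) if bit == "1")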
People are not using these as much anymore. As the genomes get bigger, the hashes get bigger, and you still have to have a subpart that matches exactly or else you cannot match the read at all.
Now people want up to 3 errors, and spaced seeds and k-mer matches have trouble with that.
How do you do it with the Burrows-Wheeler transform?
There is a variant of the algorithm that allows various kinds of inexact matches, such as SNPs and indels.
Going from the prefix trie representation, taken from the same paper as before: the Heng Li and Richard Durbin paper discussed Monday.
(Get the figure from the paper.) Each node represents a location where the range occurs in the array. If we are trying to match “”, the exact match is fast; the BWT can get small memory, a 3 Gb genome in 1 GB of RAM. What about allowing 1 letter to be different? Do it as a depth-first search; you might find several nodes that match within the tolerance, e.g. a “GOF” search with 2 mismatches.
The indel mismatches: deletion or insertion. If the letter is not there in the trie, treat it as a deletion; to do an insertion, say “this is an extra letter” and skip it. I am really scoring an alignment as I do this search; you can even do affine gap costs. You get a total cutoff price you are willing to pay before you stop searching.
Conceptually that's what we are going to do, but instead of depth-first search, we do best-first search, from AI techniques. Make a triple: position in the trie, position in the search string, and the score at this point. You can do a breadth-first search, or go down whichever branch is most promising: keep a heap keyed on scores and keep pulling out the best-so-far. You can say: only go for more mismatches when the exact matches have not been successful, etc. (See the sketch below.)
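A toy sketch of the best-first idea, using a nested-dict suffix trie instead of the FM-index and scoring mismatches only (real BWA walks the BWT intervals and also handles indels):

    import heapq
    from itertools import count

    def build_suffix_trie(text):
        """Nested-dict trie of all suffixes (fine only for toy examples)."""
        root = {}
        for s in range(len(text)):
            node = root
            for c in text[s:]:
                node = node.setdefault(c, {})
        return root

    def best_first_match(trie, query, max_cost=2):
        """Pop the lowest-cost state (cost, query position, trie node) first."""
        tiebreak = count()                     # dict nodes aren't comparable
        heap = [(0, next(tiebreak), 0, trie)]
        while heap:
            cost, _, i, node = heapq.heappop(heap)
            if i == len(query):
                return cost                    # cheapest full match found
            for letter, child in node.items():
                step = 0 if letter == query[i] else 1   # mismatch costs 1
                if cost + step <= max_cost:
                    heapq.heappush(heap, (cost + step, next(tiebreak),
                                          i + 1, child))
        return None

    # e.g. best_first_match(build_suffix_trie("GOOGOL"), "GOF") -> 1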
The part that makes this efficient is that you can prune off certain branches without ever exploring them. You can do exact searches for pieces of the read and estimate how many errors there are in the prefix up to a given point. You can't get the count exactly, but you can get a lower bound (which could be zero). If I already have 3 errors, and the lower bound says there are 3 more, this branch just stops.
D(i) is a lower bound on the number of errors in W[0..i].
As long as you keep finding the prefix in the genome, fine; but when the exact search fails, there must be at least one error, and you resume looking for exact matches from there. Keep going to build the lower bound on the errors, which can grow as you go. The backward search then basically adds the lower bound from one side to the exact score from the other side. There are some tricks to compute it very quickly.
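The lower bound can be computed in one scan, as in the Li & Durbin CalculateD routine; here genome_contains stands in for the exact-substring test (backed by the FM-index in the real program):

    def calculate_D(W, genome_contains):
        """D[i] = lower bound on differences needed to match W[0..i]."""
        D = [0] * len(W)
        z, j = 0, 0
        for i in range(len(W)):
            if not genome_contains(W[j:i + 1]):
                z += 1          # this stretch can't match exactly: 1 more error
                j = i + 1       # restart the exact search after position i
            D[i] = z
        return D

    # e.g. calculate_D("GOF", lambda s: s in "GOOGOL") -> [0, 0, 1]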
This is a very clever combination of tricks. It uses a hell of a lot less memory, because you trade it for a lot of computation. These algorithms are fast and flexible.
The difference between suffix and prefix tries is just convention.
This is the BWA program, which we need to install. It handles stuff that Bowtie didn't handle, like colorspace.