Genome assembly—theory and practice
Administrative
Lucigen mate pair data is up
Presentations starting next week
How does the assembler work? (theory)
What are the theoretical advantages of the approach implemented in your assembler?
What was your user experience? (holistic)
Some results, e.g. N50
IMPORTANT: Just because your first run gets awful results, doesn't mean your assembler sucks. Make sure that you are making good choices for parameters, etc, before claiming your assembler is bad.
Idea: try a positive control with your assembler. Use public read data for a known genome, see if you can get a decent assembly.
Don't prevent your assemblers from mapping things to very short contigs, that will give you a lot of false positives. (not sure of the context here?)
We want more data! We probably have 27X coverage now, which is below
spec for most of our assemblers.
We do want another lane of shotgun data
Do we want to make another library?
No MinIon data right now, but we will ask about it
Transcriptome data: can be used for scaffolding, but using it for assessing quality of the scaffolding is much better
Grading
Pass/fail people: do your share and you will pass
No specific standards for letter grades at the moment
No need for grades: “There's nothing more motivating than fear of looking like an idiot in front of your peers.” -Ed
N50
How to find N50? Find the total length and divide it by 2. Order your contigs from longest to shortest. Go down the list and keep track of the total bases you've seen so far. Whichever contig puts you over the length/2 is your N50 contig, and you can report it's length.
Limitations
de Bruijn graphs
Reminder: de Bruijn graph algorithm takes kmers from reads and builds and adjacency graph (usually nodes are the kmers, but the kmers could be the edges if you want).
In theory we don't need extremely long kmers to get mostly unique places in the genome.
Each read represents either the top strand or the bottom strand at that region of the genome. Another read that covered that area could have the reverse complement.
What about kmers that are palindromes?
Introduces ambiguity because you lose the directionality of the arrow because you don't know if you are looking at the forward strand or the reverse strand.
Just make k odd! Then you can never have a perfect palindrome.
Tip from Ed: Write reverse complement strand as mirror image, the actual way that it is in the DNA. That way 5' to 3' is explicit, you can physically rotate the paper and it looks correct still.
Picking k
k should be long enough so that most single-copy genome regions are unique
L is read length
number of kmers = L-k+1
number of arcs = L-k (that is what gives us connectivity information, so we can't have k be too close to L)
in case of sequencing error: the number of kmers affected is k (so especially in error prone reads you would want a smaller k)
so we need to balance k being long enough for uniquesness and short enough for connectivity, plus take into account that with bigger k, more kmers are affected by a single sequencing error.
preqc gives us a recommendation of k to use. But your assembler will have its own specifications, so you don't want to just use the recommended k blindly
What could possibly go wrong?
(See slides for diagrams referred to in this section.)
Picture A: Sequencing error at the end of a read (very likely).
Picture B: Sequencing error in the middle of a read
Picture C: Repetitive elements of genome. There are multiple ways in and out of the middle section, but how do you know which corresponds with which?
If you have reads that are long enough to span the whole area, you can keep track of which reads go through which paths
Mate pair data: if the pairs map onto either side (span the repeat), that will disambiguate it