User Tools

Site Tools


lecture_notes:04-17-2015

Genome assembly—theory and practice

Administrative

  • Lucigen mate pair data is up
    • Josh from Ed's lab worked on it, we can ask him questions
  • Presentations starting next week
    • How does the assembler work? (theory)
    • What are the theoretical advantages of the approach implemented in your assembler?
    • What was your user experience? (holistic)
    • Some results, e.g. N50
    • IMPORTANT: Just because your first run gets awful results, doesn't mean your assembler sucks. Make sure that you are making good choices for parameters, etc, before claiming your assembler is bad.
      • All our assemblers were highly ranked in Assemblathon II, BUT they were being used by their authors. So they can work really well, the question is can we figure out how to make them do that.
    • Idea: try a positive control with your assembler. Use public read data for a known genome, see if you can get a decent assembly.
  • Don't prevent your assemblers from mapping things to very short contigs, that will give you a lot of false positives. (not sure of the context here?)
  • We want more data! We probably have 27X coverage now, which is below spec for most of our assemblers.
    • We do want another lane of shotgun data
    • Do we want to make another library?
    • No MinIon data right now, but we will ask about it
      • Mostly good for scaffolding
    • Transcriptome data: can be used for scaffolding, but using it for assessing quality of the scaffolding is much better
  • Grading
    • Pass/fail people: do your share and you will pass
    • No specific standards for letter grades at the moment
    • No need for grades: “There's nothing more motivating than fear of looking like an idiot in front of your peers.” -Ed

N50

How to find N50? Find the total length and divide it by 2. Order your contigs from longest to shortest. Go down the list and keep track of the total bases you've seen so far. Whichever contig puts you over the length/2 is your N50 contig, and you can report it's length.

Limitations

  • False scaffolding (e.g. just cat all the contigs together and then N50 will be great!)
  • How do you check for this?
    • Gene contiguity correctness from RNAseq or other data
    • Paired end mapping
    • Synteny analysis (not an option for us, since there isn't a close enough genome and we have no idea about how synteny would be conserved in slugs anyway)
      • Saying something about chromosome evolution in our paper might be cool

de Bruijn graphs

Reminder: de Bruijn graph algorithm takes kmers from reads and builds and adjacency graph (usually nodes are the kmers, but the kmers could be the edges if you want). In theory we don't need extremely long kmers to get mostly unique places in the genome.

  • Each read represents either the top strand or the bottom strand at that region of the genome. Another read that covered that area could have the reverse complement.
    • So label each node with the kmer and the reverse complement, because they are equivalent pieces of data (and equally likely).
  • What about kmers that are palindromes?
    • Introduces ambiguity because you lose the directionality of the arrow because you don't know if you are looking at the forward strand or the reverse strand.
    • Just make k odd! Then you can never have a perfect palindrome.
  • Tip from Ed: Write reverse complement strand as mirror image, the actual way that it is in the DNA. That way 5' to 3' is explicit, you can physically rotate the paper and it looks correct still.

Picking k

  • k should be long enough so that most single-copy genome regions are unique
  • L is read length
  • number of kmers = L-k+1
  • number of arcs = L-k (that is what gives us connectivity information, so we can't have k be too close to L)
  • in case of sequencing error: the number of kmers affected is k (so especially in error prone reads you would want a smaller k)
  • so we need to balance k being long enough for uniquesness and short enough for connectivity, plus take into account that with bigger k, more kmers are affected by a single sequencing error.
  • preqc gives us a recommendation of k to use. But your assembler will have its own specifications, so you don't want to just use the recommended k blindly

What could possibly go wrong?

(See slides for diagrams referred to in this section.)

  • Picture A: Sequencing error at the end of a read (very likely).
  • Picture B: Sequencing error in the middle of a read
  • Picture C: Repetitive elements of genome. There are multiple ways in and out of the middle section, but how do you know which corresponds with which?
    • If you have reads that are long enough to span the whole area, you can keep track of which reads go through which paths
    • Mate pair data: if the pairs map onto either side (span the repeat), that will disambiguate it
You could leave a comment if you were logged in.
lecture_notes/04-17-2015.txt · Last modified: 2015/04/20 00:29 by sihussai