This is an old revision of the document!
======Genome assembly—theory and practice====== =====Administrative===== * Lucigen mate pari data is up * Josh from Ed's lab worked on it, we can ask him questions * Presentations starting next week * How does the assembler work? (theory) * What are the theoretical advantages of the approach implemented in your assembler? * What was your user experience? (holistic) * Some results, e.g. N50 * IMPORTANT: Just because your first run gets awful results, doesn't mean your assembler sucks. Make sure that you are making good choices for parameters, etc, before claiming your assembler is bad. * All our assemblers were highly ranked in Assemblathon II, BUT they were being used by their authors. So they can work really well, the question is can we figure out how to make them do that. * Idea: try a positive control with your assembler. Use public read data for a known genome, see if you can get a decent assembly. * Don't prevent your assemblers from mapping things to very short contigs, that will give you a lot of false positives. (not sure of the context here?) * We want more data! We probably have 27X coverage now, which is below spec for most of our assemblers. * We do want another lane of shotgun data * Do we want to make another library? * No MinIon data right now, but we will ask about it * Mostly good for scaffolding * Transcriptome data: can be used for scaffolding, but using it for assessing quality of the scaffolding is much better * Grading * Pass/fail people: do your share and you will pass * No specific standards for letter grades at the moment * No need for grades: "There's nothing more motivating than fear of looking like an idiot in front of your peers." -Ed =====N50===== How to find N50? Find the total length and divide it by 2. Order your contigs from longest to shortest. Go down the list and keep track of the total bases you've seen so far. Whichever contig puts you over the length/2 is your N50 contig, and you can report it's length. ====Limitations==== * False scaffolding (e.g. just cat all the contigs together and then N50 will be great!) * How do you check for this? * Gene contiguity correctness from RNAseq or other data * Paired end mapping * Synteny analysis (not an option for us, since there isn't a close enough genome and we have no idea about how synteny would be conserved in slugs anyway) * Saying something about chromosome evolution in our paper might be cool =====de Bruijn graphs===== Reminder: de Bruijn graph algorithm takes kmers from reads and builds and adjacency graph (usually nodes are the kmers, but the kmers could be the edges if you want). In theory we don't need extremely long kmers to get mostly unique places in the genome. * Each read represents either the top strand or the bottom strand at that region of the genome. Another read that covered that area could have the reverse complement. * So label each node with the kmer and the reverse complement, because they are equivalent pieces of data (and equally likely). * What about kmers that are palindromes? * Introduces ambiguity because you lose the directionality of the arrow because you don't know if you are looking at the forward strand or the reverse strand. * **Just make k odd!** Then you can never have a perfect palindrome. * Tip from Ed: Write reverse complement strand as mirror image, the actual way that it is in the DNA. That way 5' to 3' is explicit, you can physically rotate the paper and it looks correct still.