Differences

This shows you the differences between two versions of the page.

--- lecture_notes:04-17-2015 [2015/04/17 23:25]
sihussai created (incomplete)
+++ lecture_notes:04-17-2015 [2015/04/20 00:29]
sihussai
@@ Line 2: / Line 2: @@
 =====Administrative=====
-  * Lucigen mate pari data is up
+  * Lucigen mate pair data is up
     * Josh from Ed's lab worked on it, we can ask him questions
   * Presentations starting next week
@@ Line 44: / Line 44: @@
     * Introduces ambiguity because you lose the directionality of the arrow because you don't know if you are looking at the forward strand or the reverse strand.
     * **Just make k odd!** Then you can never have a perfect palindrome.
   * Tip from Ed: Write reverse complement strand as mirror image, the actual way that it is in the DNA. That way 5' to 3' is explicit, you can physically rotate the paper and it looks correct still.
+====Picking k====
+  * k should be long enough that most kmers from single-copy regions of the genome are unique
+  * L is the read length
+  * number of kmers per read = L-k+1
+  * number of arcs per read = L-k (the arcs are what give us connectivity information, so we can't have k be too close to L)
+  * in case of a sequencing error: a single wrong base affects up to k kmers, since every kmer that overlaps that base is wrong (so especially in error-prone reads you would want a smaller k)
+  * so we need to balance k being long enough for uniqueness and short enough for connectivity, and take into account that with a bigger k, more kmers are affected by a single sequencing error.
+  * preqc recommends a k to use, but your assembler will have its own specifications, so don't use the recommended k blindly
+====What could possibly go wrong?====
+(See slides for diagrams referred to in this section.)
+  * Picture A: Sequencing error at the end of a read (very likely).
+  * Picture B: Sequencing error in the middle of a read.
+  * Picture C: Repetitive elements of the genome. There are multiple ways in and out of the middle section, but how do you know which way in corresponds to which way out?
+    * If you have reads that are long enough to span the whole area, you can keep track of which reads go through which paths
+    * Mate pair data: if the pairs map onto either side (span the repeat), that will disambiguate it
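
The counting rules in the "Picking k" notes can be checked with a small sketch. This is not from the lecture: the function names `kmer_counts` and `kmers_hit_by_error` and the example values L=100, k=31 are hypothetical, chosen only for illustration. It assumes a single read with exactly one erroneous base.

```python
def kmer_counts(read_length, k):
    """Return (number of kmers, number of arcs) for one read of length L."""
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    n_kmers = read_length - k + 1   # L - k + 1
    n_arcs = read_length - k        # L - k, one arc between consecutive kmers
    return n_kmers, n_arcs


def kmers_hit_by_error(read_length, k, pos):
    """Count the kmers of one read that overlap a wrong base at 0-based pos."""
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    if not 0 <= pos < read_length:
        raise ValueError("pos must lie inside the read")
    first_start = max(0, pos - k + 1)        # earliest kmer start covering pos
    last_start = min(pos, read_length - k)   # latest kmer start covering pos
    return last_start - first_start + 1


if __name__ == "__main__":
    L, k = 100, 31  # hypothetical example values
    print(kmer_counts(L, k))
    print(kmers_hit_by_error(L, k, 50))      # error in the middle of the read
    print(kmers_hit_by_error(L, k, L - 1))   # error at the very end of the read
```

An error in the middle of the read touches the full k kmers, while an error at the last base touches only one, which is why k is an upper bound on the damage from a single error.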

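The "Just make k odd!" tip can also be verified by brute force. With odd k, the middle base of a kmer would have to be its own complement, which no base is, so no kmer can equal its own reverse complement. The sketch below is an illustration only; `reverse_complement` and `count_self_revcomp` are hypothetical helper names, and the enumeration is only practical for small k.

```python
from itertools import product

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def count_self_revcomp(k):
    """Count kmers of length k that equal their own reverse complement."""
    count = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            count += 1
    return count


if __name__ == "__main__":
    for k in range(1, 7):
        print(k, count_self_revcomp(k))
```

For even k the count is nonzero (for example, `ACGT` is its own reverse complement), while for every odd k it is zero.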
Banana Slug Genomics