Differences

This shows you the differences between two versions of the page.

--- lecture_notes:04-09-2010 [2010/04/10 05:01]
cbrumbau Fixed less than or equal to symbol
+++ lecture_notes:04-09-2010 [2010/04/12 21:37] (current)
cbrumbau Changed for reference format, italics, punctuation, spelling, etc.
@@ Line 7: / Line 7: @@
 Take fix mode script from /projects/compbio/bin/scripts and replace protein user group with BME 235 user group.
-Next week will have a reference genome (POG) to use for testing the tools on.
+Next week we will have a reference genome (//Pyrobaculum oguniense//, aka "//Pog//") to use for testing the tools.
-For the most part POG is done; however, there are still some uncertainty with 8 SNPs left. It is definitely past the MIAMI standard at this point.
+For the most part //Pog// is done; however, there is still some uncertainty about the 8 remaining SNPs. It is definitely past the MIAMI standard at this point. (//Pog// assembly is down to only 8 SNPs & one potentially variable insert.)
+Note about sequencing platform quality scores: most platforms are trying to use the Phred quality score[(cite:phred>[[wp>Phred quality score]])], so the quality score is theoretically comparable between the platforms and runs (although calibration causes scores to vary between runs and instruments nonetheless).
+It can be informative, once reads are mapped, to look at the quality scores for reads with observed errors.
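As a rough illustration of the Phred scale mentioned above: a score Q corresponds to an estimated error probability P = 10^(-Q/10). The sketch below assumes Phred+33 ASCII encoding of quality strings, which is common in FASTQ files but was not specified in the lecture; the function names are made up for illustration.

```python
import math

def phred_to_error_prob(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def decode_quality_string(qual, offset=33):
    """Decode an ASCII quality string into a list of Phred scores.

    offset=33 is an assumption (Phred+33); some older pipelines used 64.
    """
    return [ord(c) - offset for c in qual]

# Hypothetical quality string, one character per base.
scores = decode_quality_string("II5+!")
probs = [phred_to_error_prob(q) for q in scores]
```

Once reads are mapped, per-base scores decoded this way could be compared against the positions where mismatches were actually observed.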
+Lior Pachter (from UC Berkeley) is visiting on Monday to speak about the Bowtie/TopHat/Cufflinks algorithms. (Bowtie: mapping; TopHat/Cufflinks: find splice junctions, predict spliced transcripts. Bowtie is used in a lot of the assembly algorithms.)
 ===== Main lecture: Assembler graphs =====
@@ Line 14: / Line 20: @@
 Types of assembler graphs:
   * Overlap graph
-  * de Bruijn graph
+  * de Bruijn graph (pronounced like "De Broin")
 Differences are "What are the nodes?"
@@ Line 30: / Line 36: @@
             B
 </code>
-The problem is the direction of the reads when aligning:
+The problem with edges between contig nodes is in defining the direction of the reads when aligning:
   * 4 different edge scenarios:
     * -> -> (A -> B)
@@ Line 37: / Line 43: @@
     * <- <- (B -> A)
   * 3 different edge types:
-    * A to B
+    * Same dir: A to B / B to A
-    * B to A
+    * Tail-to-tail (convergent): A' to B
-    * A' to B / A to B'
+    * Head-to-head (divergent): A to B'
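One way to see why four orientation scenarios give only three edge types is to note that an edge read from the opposite strand is the same edge with both the order and the orientations flipped. The sketch below is an illustration under that assumption, not any particular assembler's data model; the tuple layout `(a, a_orient, b, b_orient)` and the function names are hypothetical.

```python
def flip(orient):
    """Flip a strand orientation: '+' (forward) <-> '-' (reversed, e.g. A')."""
    return "-" if orient == "+" else "+"

def reverse_edge(edge):
    """The same overlap as seen from the opposite strand."""
    a, a_orient, b, b_orient = edge
    return (b, flip(b_orient), a, flip(a_orient))

def edge_type(edge):
    """Classify an edge using the three types listed in the notes."""
    _, a_orient, _, b_orient = edge
    if a_orient == b_orient:
        return "same direction"      # A to B, or equivalently B to A
    if a_orient == "+":
        return "head-to-head"        # A to B'
    return "tail-to-tail"            # A' to B

def canonical(edge):
    """Pick one representative of an edge and its reverse, so each is stored once."""
    return min(edge, reverse_edge(edge))

# All four orientation combinations collapse to three edge types.
types = {edge_type(("A", oa, "B", ob)) for oa in "+-" for ob in "+-"}
```

Here `<- <-` between A and B becomes `("B", "+", "A", "+")` after reversal, i.e. the same-direction edge B to A.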
+Need to have some tolerance for error because the reads are noisy. When creating read overlaps, if you require a 100% match, you’ll miss a lot of data. Overlaps also include the read ends, where quality falls off, so you need an “overlap quality score”.
+Can’t do all-vs-all searches (n<sup>2</sup> algorithms are not a good idea with billions of reads). So how do you decide which reads to try to overlap? Most algorithms do a BLAST-like filter before trying to align edges (~n log n).
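A minimal sketch of the filter-then-align idea, assuming a shared-k-mer index as a simple stand-in for the "BLAST-like filter" (real assemblers use more refined seeding). Only read pairs that share at least one k-mer are passed to the overlap check, and the check tolerates a fraction of mismatches. All names and the default thresholds (`k`, `min_len`, `max_mismatch_rate`) are hypothetical, and only forward-strand suffix/prefix overlaps are considered.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k=4):
    """Return pairs of read indices that share at least one k-mer."""
    index = defaultdict(set)
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)
    pairs = set()
    for rids in index.values():
        # Highly repeated k-mers can still produce many pairs here.
        pairs.update(combinations(sorted(rids), 2))
    return pairs

def best_overlap(a, b, min_len=4, max_mismatch_rate=0.1):
    """Length of the longest suffix of a that matches a prefix of b
    within the mismatch tolerance, or 0 if none is at least min_len long."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches <= max_mismatch_rate * length:
            return length
    return 0

reads = ["ACGTACGGAT", "CGGATTTCAG", "GGGGGGGGGG"]
pairs = candidate_pairs(reads)
overlaps = {(i, j): best_overlap(reads[i], reads[j]) for i, j in pairs}
```

A fuller version would weight mismatches by base quality, which is one way to arrive at an overlap quality score.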
+Side Note: For transcriptome libraries, if done properly, reads should have known strandedness, so they shouldn’t be run through algorithms that treat strandedness as arbitrary (story about problems with a prominent yeast microarray transcriptome analysis incorrectly finding a lot of “antisense” mRNAs due to a library prep error).
-Need to have some tolerance for error because the reads are noisy.
 === de Bruijn graphs ===
@@ Line 90: / Line 101: @@
 Realistically, there are issues:
-Spurs:
+== End of contig boundaries ==
+What if A->B and A->C and A->D //BUT// A->B and A->C are inconsistent with each other?
+… A becomes “end of contig”, because you aren’t sure where to go next.
+Also end of contig if there are no more edges from the node.
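The end-of-contig rule can be sketched as a walk that keeps extending while there is exactly one way forward. This is a simplification: it treats any node with more than one outgoing edge as ambiguous, whereas the notes only end the contig when the outgoing edges are inconsistent with each other. The adjacency-dict representation and the name `extend_contig` are assumptions for illustration.

```python
def extend_contig(graph, start):
    """Follow unambiguous edges from start.

    Stops when the current node has no outgoing edges (dead end),
    more than one outgoing edge (treated here as ambiguous),
    or when the walk would revisit a node (cycle guard).
    """
    path = [start]
    seen = {start}
    node = start
    while True:
        successors = graph.get(node, [])
        if len(successors) != 1:
            break  # end of contig: dead end or branch
        node = successors[0]
        if node in seen:
            break  # avoid looping forever on a cycle
        path.append(node)
        seen.add(node)
    return path

# Hypothetical graph: C branches to D and E, so the contig from A ends at C.
graph = {"A": ["B"], "B": ["C"], "C": ["D", "E"], "D": [], "E": []}
contig = extend_contig(graph, "A")
```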
+== Spurs ==
 <code>
 kmer -> kmer -> kmer -> kmer -> kmer
     \-> kmer -> kmer -> kmer (off to nowhere)
 </code>
+Path diverges but does not reconverge, resulting in source/sink dead-ends (these are likely due to read errors).
+== Bubbles ==
-Collapse bubbles:
 <code>
     /-> kmer -> kmer -> kmer -\
@@ Line 102: / Line 122: @@
 </code>
-Other issues:
+The path splits due to a SNP but then converges. This can happen with real SNPs, read error SNPs, and real repeats which differ by a SNP or two.
+== Loop ==
-Loop:
 <code>
 kmer -> kmer -> kmer -> kmer -> kmer -> kmer -> kmer
                             \- kmers <-/
 </code>
+Tandem repeats will generate a circle with edges in and out, but the copy number is hard to disambiguate. If the data is really clean (e.g. in/out edges are ~10 read-depth with low SD, and inside the circle has ~20 read-depth with low SD), we can guess that there might be 2 copies of the repeat, but this is not highly reliable.
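The depth-ratio guess can be written down as a small calculation: divide the mean depth inside the loop by the mean depth of the flanking edges, and only trust the result when both are low-variance. The cutoff `max_cv` (coefficient of variation) and the function name are hypothetical; the notes give no threshold for "low SD". The sketch assumes positive depths and at least two measurements per region.

```python
from statistics import mean, stdev

def estimate_copy_number(flank_depths, loop_depths, max_cv=0.2):
    """Guess the repeat copy number from read depth, or return None
    if either region is too noisy to make a guess."""
    flank_mean = mean(flank_depths)
    loop_mean = mean(loop_depths)
    for depths, m in ((flank_depths, flank_mean), (loop_depths, loop_mean)):
        if stdev(depths) / m > max_cv:
            return None  # SD too high relative to the mean depth
    return round(loop_mean / flank_mean)

# Depths roughly matching the example in the notes (~10 outside, ~20 inside).
clean = estimate_copy_number([10, 9, 11, 10], [20, 21, 19, 20])
noisy = estimate_copy_number([2, 30, 5, 25], [20, 21, 19, 20])
```

Even when a value is returned, it is only a guess, in line with the caveat above that this approach is not highly reliable.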
-Take the loop?
+== Multiple paths ==
-Multiple paths:
 <code>
 A                      B
@@ Line 126: / Line 147: @@
 Largest bias usually comes from PCR for amplification.
-Need to collapse the graph (both overlap and de Bruijn) to assemble the reads.
+=== Assembly ===
+Algorithms (both overlap and de Bruijn) need to collapse bubbles and trim spurs.\\
+Spurs: Discard if their read count is low.\\
+Bubbles: Tricky, because they can represent real, divergent paths.
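The spur rule ("discard if their read count is low") might look roughly like the sketch below: start from each dead-end node, walk backwards along the simple chain until a fork is reached, and drop the chain if it is short and poorly covered. This is an illustration only; the thresholds `min_cov` and `max_len`, the dict-based graph, and the per-node `coverage` map are assumptions, and every node is assumed to appear as a key in `graph`. Bubbles are deliberately not handled.

```python
def trim_spurs(graph, coverage, min_cov=3, max_len=3):
    """Return a copy of graph without short, low-coverage dead-end branches."""
    graph = {n: list(succs) for n, succs in graph.items()}
    preds = {n: [] for n in graph}
    for n, succs in graph.items():
        for s in succs:
            preds[s].append(n)

    tips = [n for n, succs in graph.items() if not succs]
    for tip in tips:
        branch, node, fork = [tip], tip, None
        # Walk backwards while the path is a simple chain.
        while len(preds[node]) == 1 and len(branch) <= max_len:
            parent = preds[node][0]
            if len(graph[parent]) > 1:
                fork = parent  # the branch hangs off this node
                break
            branch.append(parent)
            node = parent
        if fork is None or len(branch) > max_len:
            continue  # not a short side branch
        if sum(coverage[n] for n in branch) / len(branch) >= min_cov:
            continue  # well supported, so keep it
        graph[fork].remove(branch[-1])
        for n in branch:
            del graph[n]
    return graph

# Hypothetical graph shaped like the spur diagram: main path k1..k5, spur s1 -> s2.
graph = {"k1": ["k2", "s1"], "k2": ["k3"], "k3": ["k4"], "k4": ["k5"],
         "k5": [], "s1": ["s2"], "s2": []}
coverage = {"k1": 20, "k2": 20, "k3": 20, "k4": 20, "k5": 20, "s1": 1, "s2": 2}
trimmed = trim_spurs(graph, coverage)
```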
+===== References =====
+<refnotes>notes-separator: none</refnotes>
+~~REFNOTES cite~~

Banana Slug Genomics
