This shows you the differences between two versions of the page.
lecture_notes:04-09-2010 [2010/04/10 05:01] cbrumbau Fixed less than or equal to symbol |
lecture_notes:04-09-2010 [2010/04/12 21:37] (current) cbrumbau Changed for reference format, italics, punctuation, spelling, etc. |
||
---|---|---|---|
Line 7: | Line 7: | ||
Take fix mode script from /projects/compbio/bin/scripts and replace protein user group with BME 235 user group. | Take fix mode script from /projects/compbio/bin/scripts and replace protein user group with BME 235 user group. | ||
- | Next week will have a reference genome (POG) to use for testing the tools on. | + | Next week we will have a reference genome (//Pyrobaculum oguniense//, aka "//Pog//") for testing the tools. |
- | For the most part POG is done; however, there are still some uncertainty with 8 SNPs left. It is definitely past the MIAMI standard at this point. | + | For the most part //Pog// is done; however, there is still some uncertainty, with 8 SNPs left. It is definitely past the MIAMI standard at this point. (//Pog// assembly is down to only 8 SNPs & one potentially variable insert.) |
+ | |||
+ | Note about sequencing platform quality scores: most platforms are trying to use the Phred quality score[(cite:phred>[[wp>Phred quality score]])], so the quality score is theoretically comparable between the platforms and runs (although calibration causes scores to vary between runs and instruments nonetheless). | ||
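The Phred scale relates a quality score Q to an error probability P by Q = -10 log10(P). A minimal sketch of the conversion follows; the function names are made up for illustration, and the Phred+33 ASCII offset used for decoding is an assumption (some platforms and pipeline versions use a different offset).

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to error probability p."""
    return -10 * math.log10(p)

def decode_quality_string(quality_string, offset=33):
    """Decode an ASCII quality string into integer Phred scores.

    offset=33 is an assumption (Sanger-style encoding); check what
    your platform actually writes before relying on it.
    """
    return [ord(c) - offset for c in quality_string]

scores = decode_quality_string("I5+!")
error_probs = [phred_to_error_prob(q) for q in scores]
```

Because of the calibration differences mentioned above, a given score from one run is only approximately comparable to the same score from another.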
+ | |||
+ | It can be informative, once reads are mapped, to look at the quality scores for reads with observed errors. | ||
+ | |||
+ | Lior Pachter (from UC Berkeley) is visiting on Monday to speak about the Bowtie/TopHat/Cufflinks algorithms. (Bowtie: mapping; TopHat/Cufflinks: find splice junctions, predict spliced transcripts. Bowtie is used in a lot of the assembly algorithms.) | ||
===== Main lecture: Assembler graphs ===== | ===== Main lecture: Assembler graphs ===== | ||
Line 14: | Line 20: | ||
Types of assembler graphs: | Types of assembler graphs: | ||
* Overlap graph | * Overlap graph | ||
- | * de Bruijn graph | + | * de Bruijn graph (pronounced like "De Broin") |
Differences are "What are the nodes?" | Differences are "What are the nodes?" | ||
Line 30: | Line 36: | ||
B | B | ||
</code> | </code> | ||
- | The problem is the direction of the reads when aligning: | + | The problem with edges between contig nodes is in defining direction of the reads when aligning: |
* 4 different edge scenarios: | * 4 different edge scenarios: | ||
* -> -> (A -> B) | * -> -> (A -> B) | ||
Line 37: | Line 43: | ||
* <- <- (B -> A) | * <- <- (B -> A) | ||
* 3 different edge types: | * 3 different edge types: | ||
- | * A to B | + | * Same dir: A to B / B to A |
- | * B to A | + | * Tail-to-tail (convergent): A' to B |
- | * A' to B / A to B' | + | * Head-to-head (divergent): A to B' |
+ | |||
+ | Need to have some tolerance for error because the reads are noisy. When creating read overlaps, if you require 100% matching, you’ll miss a lot of data. Overlaps also include the read ends, where quality falls off, so you need an “overlap quality score”. | ||
+ | |||
+ | Can’t do all-vs-all searches (n<sup>2</sup> algorithms are not a good idea with billions of reads). So how do you decide which reads to try overlapping? Most algorithms do a BLAST-like filter before trying to align edges (~n log n). | ||
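One way to picture such a filter is a shared k-mer index: only reads that have at least one exact k-mer in common are passed on to the expensive alignment step. This is a rough sketch, not any particular assembler's method; `candidate_pairs` is a hypothetical name, a hash index stands in for whatever seed structure a real tool uses, and reverse-complement matches are ignored to keep it short.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return pairs of read indices that share at least one exact k-mer."""
    # Map each k-mer to the set of reads that contain it.
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)

    # Any two reads listed under the same k-mer are worth aligning.
    pairs = set()
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            pairs.add((a, b))
    return pairs

reads = ["ACGTACGG", "TACGGTTA", "CCCCCCCC"]
pairs = candidate_pairs(reads, k=4)
```

Here only reads 0 and 1 share a 4-mer, so only that pair would go on to alignment. Very common k-mers (repeats) can still produce large candidate sets, so real tools typically cap or mask them.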
+ | |||
+ | Side note: For transcriptome libraries, if done properly, reads should have known strandedness, so they shouldn’t be run through algorithms that treat strandedness as arbitrary (story about problems with a prominent yeast microarray transcriptome analysis incorrectly finding a lot of “antisense” mRNAs due to a library prep error). | ||
- | Need to have some tolerance for error because the reads are noisy. | ||
=== de Bruijn graphs === | === de Bruijn graphs === | ||
Line 90: | Line 101: | ||
Realistically, there are issues: | Realistically, there are issues: | ||
- | Spurs: | + | == End of contig boundaries == |
+ | |||
+ | What if A->B and A->C and A->D //BUT// A->B and A->C are inconsistent with each other? | ||
+ | … A becomes “end of contig”, because you aren’t sure where to go next. | ||
+ | Also end of contig if there are no more edges from the node. | ||
+ | |||
+ | == Spurs == | ||
<code> | <code> | ||
kmer -> kmer -> kmer -> kmer -> kmer | kmer -> kmer -> kmer -> kmer -> kmer | ||
\-> kmer -> kmer -> kmer (off to nowhere) | \-> kmer -> kmer -> kmer (off to nowhere) | ||
</code> | </code> | ||
+ | Path diverges but does not reconverge, resulting in source/sink dead-ends (these are likely due to read errors). | ||
+ | |||
+ | == Bubbles == | ||
- | Collapse bubbles: | ||
<code> | <code> | ||
/-> kmer -> kmer -> kmer -\ | /-> kmer -> kmer -> kmer -\ | ||
Line 102: | Line 122: | ||
</code> | </code> | ||
- | Other issues: | + | The path splits due to a SNP but then converges. This can happen with real SNPs, read error SNPs, and real repeats which differ by a SNP or two. |
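To see a bubble form, here is a small sketch that builds a k-mer graph from two made-up sequences differing by a single SNP. Nodes are k-mers and edges join consecutive k-mers within a read, following the diagrams in these notes; the sequences, the choice k=4, and the function names are all illustrative assumptions.

```python
from collections import defaultdict

def build_kmer_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for seq in reads:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for a, b in zip(kmers, kmers[1:]):
            graph[a].add(b)
    return graph

def branch_nodes(graph):
    """Nodes with more than one outgoing edge (where a path splits)."""
    return sorted(n for n, succ in graph.items() if len(succ) > 1)

def merge_nodes(graph):
    """Nodes with more than one incoming edge (where paths rejoin)."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for s in successors:
            indegree[s] += 1
    return sorted(n for n, count in indegree.items() if count > 1)

allele_a = "ATGGCGTACTT"
allele_b = "ATGGCATACTT"  # same sequence with one SNP (G -> A)
graph = build_kmer_graph([allele_a, allele_b], k=4)
split = branch_nodes(graph)
rejoin = merge_nodes(graph)
```

The graph splits at one k-mer and rejoins at another, with k k-mers on each branch (every k-mer that covers the SNP position). The graph alone cannot tell whether the SNP is real or a read error; read depth on each branch is the usual hint.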
+ | |||
+ | == Loop == | ||
- | Loop: | ||
<code> | <code> | ||
kmer -> kmer -> kmer -> kmer -> kmer -> kmer -> kmer | kmer -> kmer -> kmer -> kmer -> kmer -> kmer -> kmer | ||
\- kmers <-/ | \- kmers <-/ | ||
</code> | </code> | ||
+ | Tandem repeats generate a circle with edges in and out, but the copy number is hard to disambiguate. If the data is really clean (i.e. in/out edges are ~10 read-depth with low SD, and inside the circle is ~20 read-depth with low SD), we can guess that there might be 2 copies of the repeat, but this is not highly reliable. | ||
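The depth-ratio guess can be written down as a short sketch. The function name and the `max_cv` cutoff (maximum allowed SD relative to the mean) are hypothetical choices for illustration, not values from the lecture.

```python
from statistics import mean, stdev

def estimate_copy_number(flank_depths, loop_depths, max_cv=0.2):
    """Guess repeat copy number from read depth inside vs. outside a loop.

    Returns None when either set of depths is too noisy to trust.
    max_cv is an arbitrary illustrative threshold.
    """
    flank_mean = mean(flank_depths)
    loop_mean = mean(loop_depths)
    # Refuse to guess unless both regions have low SD relative to their mean.
    if stdev(flank_depths) / flank_mean > max_cv:
        return None
    if stdev(loop_depths) / loop_mean > max_cv:
        return None
    return round(loop_mean / flank_mean)

clean = estimate_copy_number([10, 11, 9, 10], [20, 19, 21, 20])
noisy = estimate_copy_number([2, 30, 5, 3], [20, 19, 21, 20])
```

With flanks at ~10x and the loop at ~20x the guess is 2 copies; with noisy flank depths the function declines to answer. Coverage biases such as PCR amplification can distort the ratio, which is why the guess is not highly reliable.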
- | Take the loop? | + | == Multiple paths == |
- | Multiple paths: | ||
<code> | <code> | ||
A B | A B | ||
Line 126: | Line 147: | ||
Largest bias usually comes from PCR for amplification. | Largest bias usually comes from PCR for amplification. | ||
- | Need to collapse the graph (both overlap and de Bruijn) to assemble the reads. | + | === Assembly === |
+ | |||
+ | Algorithms (both overlap and de Bruijn) need to collapse bubbles and trim spurs.\\ | ||
+ | Spurs: Discard if their read count is low.\\ | ||
+ | Bubbles: Tricky, because they can represent real, divergent paths. | ||
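A minimal sketch of spur trimming under simple assumptions: the graph is a dict of successor sets, each node has a read count, and a spur is a short dead-end chain hanging off a branch point whose nodes all have low coverage. `trim_spurs`, `min_coverage`, and `max_length` are hypothetical names and parameters, not taken from any specific assembler.

```python
def trim_spurs(edges, coverage, min_coverage, max_length):
    """Remove short, low-coverage dead-end chains from a directed graph.

    edges: dict mapping node -> set of successor nodes
    coverage: dict mapping node -> read count
    """
    # Make sure every node appears as a key, then build predecessor sets.
    succ = {n: set(s) for n, s in edges.items()}
    for targets in edges.values():
        for t in targets:
            succ.setdefault(t, set())
    pred = {n: set() for n in succ}
    for n, targets in succ.items():
        for t in targets:
            pred[t].add(n)

    removed = set()
    for tip in [n for n in succ if not succ[n]]:
        path, node = [tip], tip
        # Walk backwards along the unbranched chain leading to this dead end.
        while len(pred[node]) == 1:
            parent = next(iter(pred[node]))
            if len(succ[parent]) > 1:
                # Reached the branch point: the chain is a spur candidate.
                if (len(path) <= max_length
                        and max(coverage[n] for n in path) < min_coverage):
                    removed.update(path)
                break
            path.append(parent)
            node = parent

    return {n: s - removed for n, s in succ.items() if n not in removed}

edges = {"k1": {"k2"}, "k2": {"k3", "s1"}, "k3": {"k4"}, "k4": {"k5"},
         "s1": {"s2"}, "s2": {"s3"}}
coverage = {"k1": 20, "k2": 20, "k3": 20, "k4": 20, "k5": 20,
            "s1": 1, "s2": 1, "s3": 1}
trimmed = trim_spurs(edges, coverage, min_coverage=3, max_length=5)
```

The well-covered end of the main path is also a dead end but survives because its read count is high. Bubbles are not handled here, since choosing which branch to keep needs more than a coverage cutoff.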
+ | |||
+ | ===== References ===== | ||
+ | <refnotes>notes-separator: none</refnotes> | ||
+ | ~~REFNOTES cite~~ |