====== Lecture Notes for April 19, 2010 ======
Need to fill in any ** INCOMPLETE ** portions

{{:lecture_notes:20100419_banana_slug_seminar.ppt|Daniel's Powerpoint}}

===== Presentation =====
  * Integrates alignment information (SAM, BAM file)
  * Edits, breaks up, reconnects ** INCOMPLETE **

===== Galt's Notes =====

Assembling and comparing genomes or transcriptomes using de Bruijn graphs.

Daniel Zerbino, Marcel Schulz, Wendy Lee, Ewan Birney and David Haussler.

Slides will be made available.

Talking about:
  * ways to build graphs
  * handling repeats
  * practical considerations
  * extensions to Velvet

==== OLC ====
The traditional approach is overlap-layout-consensus:
Phrap, etc.\\
For short reads: Edena\\
It is more expensive computationally.
It uses whole-read alignments.

=== Quick OLC example ===
Filter out all reads that are identical to, or contained within, another read.
Only left-to-right overlaps etc. are allowed.
Compute all their overlaps.
You get many redundant transitive connections.
Many algorithms are available for transitive reduction.
You obtain simplified paths through the reads.
The example yields 4 big chunks of reads.
It is hard to distinguish which reads belong to which copy of a repeat.
This is how contigs were first built.
You have to compare all reads against each other, which is quadratic.
This has become difficult.
NGS reads are short, and overlaps between reads are short.
They are getting a little longer with newer technology.
The OLC method at least uses all the information in the read.

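The steps of the quick OLC example above can be sketched in a few lines of Python. This is only a toy illustration, not Phrap's or Edena's actual algorithm: the reads, the minimum overlap length, and all function names are made up.

```python
# Toy OLC sketch: compute suffix-prefix overlaps between reads, then
# remove transitively redundant edges. Illustration only.

def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b (>= min_len), else 0."""
    for olen in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:olen]):
            return olen
    return 0

def overlap_graph(reads, min_len=3):
    """All left-to-right overlaps as a dict (i, j) -> overlap length."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                olen = overlap(a, b, min_len)
                if olen:
                    edges[(i, j)] = olen
    return edges

def transitive_reduction(edges):
    """Drop edge (a, c) whenever some b gives both (a, b) and (b, c)."""
    nodes = {n for e in edges for n in e}
    reduced = dict(edges)
    for a, c in list(edges):
        for b in nodes:
            if b not in (a, c) and (a, b) in edges and (b, c) in edges:
                reduced.pop((a, c), None)
                break
    return reduced

reads = ["GATTACAGG", "TACAGGT", "CAGGTTT"]
g = overlap_graph(reads)
# the direct 0 -> 2 edge is implied by 0 -> 1 -> 2, so reduction drops it
r = transitive_reduction(g)
```

The remaining edges form the simplified left-to-right path through the reads that the notes describe.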
==== DBG ====

De Bruijn graphs, popularized by Pavel Pevzner in a 2001 paper.
Why do we use them?
High-throughput, cheap, short reads with errors.

In the next-gen era (2007) there were many papers: SSAKE, SHARCGS, VCAKE.
These have a k-mer hash: much faster and leaner, but you get noise.

Build a dictionary of words in the data set.
His example is 4-mers, but typically it is 20-mers and longer.
For each, create a node in the hash table.
It is a big dataset, so typically a binary marker says whether a k-mer is present.
Sometimes at a branch there are two possible extensions.
The greedy approach is essentially a De Bruijn graph with local info only.
In this structure you could have many paths through, and reads of different lengths can be used.

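A minimal sketch of the k-mer dictionary described above, using 4-mers as in the talk's example. The reads and function names are invented for illustration; real assemblers use a compact hash with a presence marker rather than a Python set.

```python
# Sketch: hash every k-mer in the reads, then link k-mers that overlap
# by k-1 letters to form the de Bruijn graph.

from collections import defaultdict

def kmer_dict(reads, k=4):
    """The set of all k-length words present in the data set."""
    present = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            present.add(read[i:i + k])
    return present

def debruijn_edges(kmers):
    """Edge u -> v when the last k-1 letters of u equal the first k-1 of v."""
    by_prefix = defaultdict(list)
    for v in kmers:
        by_prefix[v[:-1]].append(v)
    return {u: sorted(by_prefix[u[1:]]) for u in kmers}

kmers = kmer_dict(["GATTACA", "ATTACAG"], k=4)
edges = debruijn_edges(kmers)
```

Note that the two reads of different lengths contribute to the same dictionary, which is the point made in the notes.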
Later Euler, ALLPATHS, Velvet, ABySS, SOAPdenovo came out.
They souped up the k-mer hash, but processed it more globally.
Graph construction and correction is similar in all these assemblers.
ABySS handles parallel implementation.

First, concatenate the simple paths without branches into nodes, compacting the non-branching runs into single items.
The biochemistry of short reads means errors tend to be near the ends.
Errors in the second half of the reads create millions of tips, but those are at least easy to remove later.

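The tip removal just mentioned can be sketched as follows. This is a simplification of what Velvet actually does (Velvet also uses tip length and coverage thresholds, which are not modeled here); the graph convention, a dict of node to successor set, is an assumption for illustration.

```python
# Sketch of tip clipping: delete dead ends whose predecessor still has
# another way out (the classic error tip), keep true contig ends.

def clip_tips(succ):
    """Remove dead-end tips hanging off branch points."""
    g = {u: set(vs) for u, vs in succ.items()}
    changed = True
    while changed:
        changed = False
        for n in list(g):
            if g[n]:
                continue  # not a dead end
            preds = [u for u in g if n in g[u]]
            # clip only if every predecessor keeps an alternative branch
            if preds and all(len(g[u]) > 1 for u in preds):
                del g[n]
                for u in preds:
                    g[u].discard(n)
                changed = True
    return g

# "X" is an error tip off the branch at "A"; "C" is a legitimate contig end.
graph = {"A": {"B", "X"}, "B": {"C"}, "X": set(), "C": set()}
clipped = clip_tips(graph)
```

After clipping, the spurious branch is gone and the linear path A -> B -> C survives intact.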
Each read is a path through the graph. The head of the read is still kept, so we're not losing the read placement information.

We are still left with bubbles.
They are possibly an accumulation of overlapping spur-tips, but more often variation, e.g. a SNP.
We would like to keep that information where possible.
A bubble could also come from a long read, but we don't want to cut the long read in the middle.
When Velvet finds a bubble, it remaps it back onto the branch with the higher coverage.
Sometimes a loop goes back on itself (not a bubble), e.g. tandem repeats.

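A very small sketch of bubble detection, in the spirit of (but much simpler than) Velvet's Tour Bus: a bubble is a node whose branches reconverge at a common descendant. Only length-1 branches are checked here; Tour Bus does a breadth-first search and aligns the two paths before merging the lower-coverage one. The graph and names are made up.

```python
# Sketch: report (start, branch_a, branch_b, end) for each pair of
# single-node branches that reconverge. Illustration only.

from itertools import combinations

def find_bubbles(succ):
    """Find simple bubbles: two branches from one node meeting again."""
    bubbles = []
    for u in sorted(succ):
        for a, b in combinations(sorted(succ[u]), 2):
            for m in sorted(set(succ.get(a, ())) & set(succ.get(b, ()))):
                bubbles.append((u, a, b, m))
    return bubbles

# A SNP-style bubble: S splits into A and B, which rejoin at E.
graph = {"S": {"A", "B"}, "A": {"E"}, "B": {"E"}, "E": set()}
bubbles = find_bubbles(graph)
```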
OLC has atomic arcs.
DBG has a greater variety of overlaps.

OLC gets error correction from long reads.
DBG requires more work to correct.

==== Tour Bus ====
Errors.
Compare 5 Mb chunks from 4 species: E. coli, yeast, C. elegans, human.
Coverage is increased from 0 to 50x, with N50 on the y axis.
First there is exponential growth, as expected.
Then the complexity causes trouble, unless you have other kinds of information.
This is not using paired-read information.
You see that the human is a little more complex than E. coli.

Scaling up the hard way: chrX.
Half a billion reads from flow-sorted human chrX.
Fit in 70 GB of RAM.
Overall length 130 Mb.
The contigs are short and numerous.
How to handle the repeats?

**Margulies** used reduced representation, with restriction enzymes and a gel, to cut the problem into tractable bits.
Chunks are cut from the gel, and each gel slice is then sequenced separately, or at least in different lanes.
The assembly problem is then smaller and more localized instead of whole-genome.

Combined re-mapping and de novo sequencing (**Cheetem and Pleasance**).
Map to a reference or similar genome, then locally re-map the stuff that's unique.

MAQ and BWA only allow 3 or 4 mismatches.
You can see reads failing to map in a small area with too many SNPs or a breakpoint rearrangement.

Code parallelization (**Birol and Cook**).
Only ABySS has published yet.

Improved indexing:
fit WGS onto 200 GB.

Use of intermediate re-mapping (**Haimel**).
They wanted to parallelize the second-level work of building scaffolds.

But those are mostly engineering issues.

The DBG for a small organism looks like long-ish contigs, but it still makes a hair-ball.

==== Rock Band ====
Use long and short reads together.
The long reads are mapped seamlessly into the graph.

35 bp reads at 50x, plus some long reads (100 bp - 1 kb) at 0-20x.
They played with the length.
The longer your long reads are, the more bridgeable the graph is and the longer the contigs.
"Why not just assemble the long reads?" someone might ask.

A different approach: spectral graph analysis.
Find the vector that threads its way through.
It has a fancy theory with massive matrix diagonalization.
It takes all the data at once in one huge matrix, and is not too affected by noise.
No one has actually tried this approach; too much computation.
The traditional approach is contigs with paired-end reads.
Find chains of contigs connected by read pairs.

Your coverage is huge:
each basepair has to handle 1000x coverage of read pairs.
Long reads are separated by a mess in the middle.
Check the possible paths and find which are acceptable.
Find all the reads that start and end on the two long reads, then consider them all together; this is much more efficient.

5% or more of mate pairs can be wrong.
Hopefully we will be able to filter them out.
Velvet uses a likelihood approach.
We have a distance estimate.
Given the coverage of A and B, how many pairs would we expect to see?
If we see too few, we discard the connection.

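A back-of-the-envelope sketch of the filtering idea described above. The model here is assumed for illustration and is not Velvet's actual formula: the expected number of read pairs bridging contigs A and B is taken to scale with coverage and the number of positions where a pair of the given insert size can straddle the gap, and the cutoff fraction is invented.

```python
# Rough expectation model: discard an A-B connection supported by far
# fewer pairs than coverage predicts. All parameters are illustrative.

def expected_bridging_pairs(cov_a, cov_b, insert_size, read_len, gap):
    """Rough expected count of pairs with one end on A and one on B."""
    # positions where a pair of this insert size can straddle the gap
    span = insert_size - 2 * read_len - gap
    if span <= 0:
        return 0.0
    pair_cov = min(cov_a, cov_b) / (2 * read_len)  # pairs starting per bp
    return pair_cov * span

def keep_connection(observed, cov_a, cov_b, insert_size, read_len, gap,
                    min_fraction=0.1):
    """Keep the A-B connection only if we see enough of the expected pairs."""
    exp = expected_bridging_pairs(cov_a, cov_b, insert_size, read_len, gap)
    return exp > 0 and observed >= min_fraction * exp
```

For example, at 50x coverage with 35 bp reads, a 300 bp insert and a 50 bp gap, roughly 129 bridging pairs are expected under this toy model, so seeing only a handful would discard the connection.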
ALLPATHS did better by using small localized problems, a single small scaffold each, for parallelization.

The Shorty algorithm uses the variance between read pairs anchored on a common contig at a k-mer.
The projected image of the other ends creates the possibility of assembling a smaller set, the projection from a node.

==== PEBBLE ====
Velvet used the reverse idea: the projections to a node.
Hit the nodes in the vicinity of the starting contig to get distance information on the immediate vicinity.
It is a greedy algorithm: find the way to the next unique node.

Having a more precise insert size could help.
Velvet assumed a 10% spread on the insert size.
Solexa might have been different.
SOLiD goes up to 2 kb, Solexa just 300 bp.
It is difficult to get enough DNA in a narrow size range.
They don't love barcoding because it uses a lot of reagents.

==== COLOR-SPACE ====

A 4-letter alphabet.
Double-encoded files use ACGT; be careful not to confuse them.
There are different rules for complementarity: just reverse the colors, don't complement them.
The assemblers and mappers should in theory adapt to this.
You could convert to base-space, but one error would ruin the rest of the read.
That is a huge loss, so do everything in color-space, make a complete assembly, and only convert to base calls at the end.
Typically the error is detected after 1 or 2 bp, then corrected.
SOLiD has developed tools.

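The "reverse, don't complement" rule can be seen directly from the standard SOLiD dibase color code (0 = no change between adjacent bases; 1, 2, 3 = the three kinds of change). Because the code is symmetric and complement-invariant, the color string of a reverse complement is just the reversed color string. The function names below are made up; only the color table itself is the standard one.

```python
# Sketch of SOLiD-style 2-base color encoding.

COLOR = {}
for i, x in enumerate("ACGT"):
    for j, y in enumerate("ACGT"):
        COLOR[x + y] = i ^ j  # XOR of base indices gives the standard color

def encode(seq):
    """Base sequence -> (anchor base, colors between adjacent bases)."""
    colors = [COLOR[seq[i:i + 2]] for i in range(len(seq) - 1)]
    return seq[0], colors

def revcomp(seq):
    """Reverse complement in base space."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))

anchor, colors = encode("ATGGC")
rc_anchor, rc_colors = encode(revcomp("ATGGC"))
# rc_colors is colors reversed: no complementation needed in color space
```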
Different error models:
454 had homopolymer issues (flowspace); Illumina and ABI are different again.
This makes it hard to handle errors, especially in a DBG, where applying the correction rules becomes much more complex.
Newbler may be better for 454.

Pre-filtering reads:
some assemblers have built-in filtering of reads.

Filtering low-quality bases can cut down on the computational cost.

These are user-specific, data-space-specific solutions.
There is no tried and true consistent approach.

Some assemblers require reads of identical lengths.

----

==== OASES ====

De novo transcriptome assembly.
People are generally using NGS because it's cheaper.
There would be just one path through the graph were it not for alternative splicing.
Pevzner mentioned in 2002 that a DBG is equivalent to a splice graph, so the algorithms developed for splice graphs can be applied to DBGs.

=== Oases Process ===
A genetic locus should produce a connected cluster of contigs.
Sometimes one cluster gets broken into two, but it works pretty well.

Dynamic assembly: find the path of highest weight.

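"Find the path of highest weight" is a standard dynamic program on a directed acyclic graph: process nodes in topological order and extend the best path from each node's successors. This is a generic sketch, not Oases' actual implementation; the node weights standing in for read coverage, and the toy graph, are assumptions.

```python
# Sketch: heaviest path through a DAG (dict node -> list of successors).

def heaviest_path(succ, weight):
    """Return (total weight, node list) of the highest-weight path."""
    order, seen = [], set()

    def visit(n):  # depth-first topological sort
        if n in seen:
            return
        seen.add(n)
        for m in succ.get(n, ()):
            visit(m)
        order.append(n)

    for n in succ:
        visit(n)

    # best[n] = heaviest path starting at n; successors are finished first
    best = {n: (weight[n], [n]) for n in order}
    for n in order:
        for m in succ.get(n, ()):
            cand = weight[n] + best[m][0]
            if cand > best[n][0]:
                best[n] = (cand, [n] + best[m][1])
    return max(best.values())

graph = {"S": ["A", "B"], "A": ["E"], "B": ["E"], "E": []}
weights = {"S": 5, "A": 1, "B": 7, "E": 3}
```

In a splice-graph setting, the two branches A and B would be alternative exons and the heavier branch wins the primary transcript.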
=== Oases Output ===
With a skipped exon, and one artifact.
It handles contigs with different coverage levels.
Find donor and acceptor sites on or around junctions.

Drosophila reads:
read length 36, k = 21, insert size 200.
90% map to the genome.
Found an unannotated gene with 7 exons.

The assembler has no notion of the reference genome, but still found one example that mapped very well.

==== COLUMBUS ====
Mapping.
Either map, or do de novo.
There are many good, mature mapping tools.
Mapping is easier to analyze and comes with genomic coords.
De novo is harder and may require special hardware, but it works without a reference and handles novel structures.

How to get the best of both worlds?
Just throw the reference genome in as long reads?
That is chucking in long reads and getting short contigs back.
What they really wanted was to start from a reference genome, then handle a difference as a little insertion.
Columbus takes a DBG and SAM/BAM alignments (BWA, Bowtie).
It keeps the reference sequences, which are not allowed to overlap with themselves.
They are starting with comparative transcriptome assembly.
Would like to ? cancer.

DBG is the convention for NGS.

For transcriptome work they are using k = 21.
Reads as short as 25 bp can be used.
How much coverage is needed? It depends on how many genes you want to catch.

With a genome, you generally want long inserts.
With a transcriptome, short inserts are more useful; they have been using 200 bp, tiny stuff.

==== End-of-class Business ====

We can put Daniel's slides on the wiki.

Galt will send Daniel a link to the wiki.

Kevin on Wednesday will be talking about:
using SOLiD data for ordering and orientation of Pog contigs, with special-purpose scripts.
The data was so noisy.