====== Lecture Notes for April 19, 2010 ======
Need to fill in any **INCOMPLETE** portions

{{:lecture_notes:20100419_banana_slug_seminar.ppt|Daniel's Powerpoint}}
  
===== Presentation =====

    * Integrates alignment information (SAM, BAM file)
    * Edits, breaks up, reconnects **INCOMPLETE**


===== Galt's Notes =====

Assembling and comparing genomes or transcriptomes using de Bruijn graphs.

Daniel Zerbino, Marcel Schulz, Wendy Lee,
Ewan Birney and David Haussler.

Slides will be made available.

Talking about:
  * ways to build graphs
  * handling repeats
  * practical considerations
  * extensions to Velvet

==== OLC ====
The traditional approach is overlap-layout-consensus:
Phrap, etc.\\
For short reads: Edena.\\
It is more expensive computationally and
uses whole-read alignments.

=== Quick OLC example ===
(A toy code sketch of these steps follows the list.)
  * Filter out all the reads that are identical to, or contained in, another read.
  * Compute all their overlaps; only left-to-right overlaps etc. are allowed.
  * You get many transitive, redundant connections; many algorithms are available for transitive reduction.
  * You obtain simplified paths through the reads; the example yields 4 big chunks of reads.
  * It is hard to distinguish which reads belong to which copy of a repeat.
  * This is how contigs were first built.
  * You have to compare all reads against each other, which is quadratic; this has become difficult.
  * NGS reads are short, and overlaps between reads are short, though they are getting a little longer with newer technology.
  * At least the OLC method uses all the information in the read.
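
A minimal, hedged sketch of those steps in Python (toy reads and a naive quadratic overlap scan for illustration; this is not Phrap or Edena):

<code python>
# Toy OLC sketch: drop contained reads, compute left-to-right
# suffix/prefix overlaps, then remove transitive edges.

def overlap(a, b, min_len):
    """Longest suffix of a that is a prefix of b (at least min_len)."""
    for l in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-l:] == b[:l]:
            return l
    return 0

def olc_edges(reads, min_len=3):
    # Filter out reads identical to or contained in another read.
    kept = [r for i, r in enumerate(reads)
            if not any(r in s and (s != r or j < i)
                       for j, s in enumerate(reads) if j != i)]
    # Quadratic all-vs-all overlap computation (the expensive part).
    edges = {(a, b): l for a in kept for b in kept if a != b
             for l in [overlap(a, b, min_len)] if l > 0}
    # Transitive reduction: drop a->c when a->b and b->c also exist.
    return {(a, c): l for (a, c), l in edges.items()
            if not any((a, b) in edges and (b, c) in edges
                       for b in kept if b not in (a, c))}

print(olc_edges(["ATGGCGT", "GGCGTGC", "GTGCAAT"]))
# {('ATGGCGT', 'GGCGTGC'): 5, ('GGCGTGC', 'GTGCAAT'): 4}
</code>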

==== DBG ====

De Bruijn graphs were popularized by Pavel Pevzner in a 2001 paper.
Why do we use them?
Because next-gen sequencing gives high-throughput, cheap, short reads with errors.

By 2007 there were many next-gen assembly papers (SSAKE, SHARCGS, VCAKE);
they use a kmer hash, which is much faster and leaner, but you get noise.

Build a dictionary of words in the data set.
His example uses 4mers, but typically 20mers and longer.
For each word, create a node in the hash table.
The dataset is big, so typically a binary marker says whether a word is present.
Sometimes at a branch there are two possible extensions.
The greedy approach is essentially a De Bruijn graph with
local information only. In this structure you could have
many paths through, and reads of different lengths can be used.
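
A minimal sketch of that dictionary-building step (k=4 as in the talk's toy example; real assemblers use k of 20 or more and a compact presence marker rather than Python sets):

<code python>
from collections import defaultdict

def de_bruijn(reads, k=4):
    """Hash every k-long word; nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:                      # reads may differ in length
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # adjacent words overlap by k-1
    return graph

# A node with two successors is a branch: two possible extensions.
g = de_bruijn(["ATGGCGTGCA", "GGCGTGCAAT"])
for node in sorted(g):
    print(node, "->", sorted(g[node]))
</code>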

Later Euler, Allpaths, Velvet, ABySS, SOAPdenovo came out.
They souped up the kmer hash, but processed it more globally.
Graph construction and correction is similar in all these assemblers.
ABySS adds a parallel implementation.

First, concatenate the simple chains without branches:
compact each non-branching run into a single node.
The biochemistry of short reads means errors tend to be near the ends.
Errors in the second half of the reads
create millions of tips, but those are at least easy to
remove later. (A compaction sketch follows.)
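
A hedged sketch of that compaction step on a toy graph (it ignores pure cycles and is not Velvet's actual code):

<code python>
from collections import defaultdict

def compact(graph):
    """Glue each non-branching run of (k-1)-mer nodes into one sequence."""
    indeg = defaultdict(int)
    for u in graph:
        for v in graph[u]:
            indeg[v] += 1

    def simple(v):  # exactly one way in and one way out
        return indeg[v] == 1 and len(graph.get(v, ())) == 1

    chains = []
    for u in list(graph):
        if simple(u):            # middle of a run: not a chain start
            continue
        for v in graph[u]:
            chain = [u]
            while simple(v):
                chain.append(v)
                v = next(iter(graph[v]))
            chain.append(v)
            # adjacent nodes share all but one letter, so append last letters
            chains.append(chain[0] + "".join(n[-1] for n in chain[1:]))
    return chains

# the toy 3-mer graph from the de Bruijn sketch above, written out by hand
g = {"ATG": {"TGG"}, "TGG": {"GGC"}, "GGC": {"GCG"}, "GCG": {"CGT"},
     "CGT": {"GTG"}, "GTG": {"TGC"}, "TGC": {"GCA"}, "GCA": {"CAA"},
     "CAA": {"AAT"}}
print(compact(g))  # ['ATGGCGTGCAAT']
</code>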

Each read is a path through the graph. The head of the read
is still kept, so we are not losing the read placement information.

We are still left with bubbles.
They are possibly an accumulation of overlapping spur-tips,
but more often variation, e.g. a SNP.
We would like to keep that information where possible.
A bubble could also come from a long read, and we don't want to cut
the long read in the middle. When Velvet finds a bubble,
it remaps the lower-coverage branch back onto the higher-coverage one.
Sometimes there is a loop that goes back
on itself (not a bubble), e.g. tandem repeats. (A minimal bubble detector is sketched below.)
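
A hedged sketch of detecting the simplest kind of bubble, two single-node branches that diverge and reconverge (Velvet's Tour Bus actually does a Dijkstra-like traversal plus sequence alignment; the node names here are made up):

<code python>
def simple_bubbles(graph):
    """Find (start, branch1, branch2, end) where two paths reconverge."""
    found = []
    for u in graph:
        succs = sorted(graph[u])
        for i in range(len(succs)):
            for j in range(i + 1, len(succs)):
                # both branches lead to a common node: a bubble
                shared = graph.get(succs[i], set()) & graph.get(succs[j], set())
                for w in sorted(shared):
                    found.append((u, succs[i], succs[j], w))
    return found

# toy bubble: A splits into B1/B2 (e.g. a SNP) and rejoins at C
g = {"A": {"B1", "B2"}, "B1": {"C"}, "B2": {"C"}, "C": set()}
print(simple_bubbles(g))  # [('A', 'B1', 'B2', 'C')]
</code>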

OLC has atomic arcs;
DBG has a greater variety of overlaps.

OLC gets error correction from long reads;
DBG requires more work to correct.

==== Tour Bus ====
Errors.
Compare a 5 Mb chunk from 4 species: E. coli, yeast, C. elegans, human.
Coverage increases from 0 to 50x, with N50 on the y axis.
First there is exponential growth, as expected.
Then the complexity causes trouble,
unless you have other kinds of information
(paired-read information is not used here).
You see that human is a little more complex than E. coli.

Scaling up the hard way: chrX.
Half a billion reads from flow-sorted human chrX
fit in 70 GB of RAM.
The overall length is 130 Mb.
The contigs are short and numerous.
How to handle the repeats?

**Margulies** used reduced representation,
using restriction enzymes and a gel
to cut the problem into tractable bits:
cut chunks from the gel, then
sequence each gel slice separately,
or at least in different lanes.
Then the assembly problem is smaller
and more localized instead of whole-genome.

Combined re-mapping and de novo sequencing
(**Cheetem and Pleasance**):
map to a reference or similar genome,
then locally re-map the stuff that is unique.

MAQ and BWA only allow 3 or 4 mismatches.
You can see reads failing to map in a small area
with too many SNPs or a breakpoint rearrangement.

Code parallelization (**Birol and Cook**):
only ABySS has published yet.

Improved indexing:
fit whole-genome shotgun data onto 200 GB.

Use of intermediate re-mapping (**Haimel**):
they wanted to parallelize the second-level
work of building scaffolds.

But those are mostly engineering issues.

The DBG for a small organism
looks like long-ish contigs,
but still makes a hair-ball.

==== Rock Band ====
Use long and short reads together:
long reads are mapped seamlessly into the graph.

35 bp reads at 50x, plus some long reads (100 bp to 1 kb) at 0-20x.
They played with the length:
the longer your long reads are, the more bridgeable
the graph is and the longer the contigs.
Someone might ask: why not just assemble the long reads?

A different approach is spectral graph analysis:
find the vector that threads its way through.
It has a fancy theory with massive matrix diagonalization.
It takes all the data at once in one huge matrix,
and is not too affected by noise. No one has actually
tried this approach; it is too much computation.
The traditional approach is contigs with paired-end reads:
find chains of contigs connected by read pairs.

Your coverage is huge:
each basepair has to handle 1000x coverage
of read-pairs. Long reads are separated by a mess
in the middle. Check the possible paths and
find which are acceptable.
Find all the reads that start and end on the
two long reads, then consider them all
together; that is much more efficient.

5% or more of mate-pairs can be wrong;
hopefully we will be able to filter them out.
Velvet uses a likelihood approach:
given a distance estimate and
the coverage of A and B, how many pairs would
we expect to see? If we see too few,
we discard the connection. (A back-of-envelope sketch follows.)
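
A rough sketch of that expected-count check (the formula and the numbers are my illustrative assumptions, not Velvet's actual model):

<code python>
def expected_spanning_pairs(coverage, read_len, insert_len, gap):
    """Rough expected count of read pairs bridging two adjacent contigs."""
    # insert placements where read 1 ends on contig A and read 2 starts on B
    window = max(insert_len - gap - 2 * read_len, 0)
    pairs_per_bp = coverage / (2 * read_len)  # read-pair density per basepair
    return pairs_per_bp * window

expected = expected_spanning_pairs(coverage=50, read_len=35,
                                   insert_len=300, gap=100)
print(f"expect ~{expected:.0f} bridging pairs")  # ~93
# If we observe far fewer than this, we discard the connection.
</code>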

Allpaths did better by using small localized problems,
a single small scaffold at a time, for parallelization.

The Shorty algorithm uses the variance between
read pairs anchored on a common contig at a kmer.
The projected image of the other ends creates the possibility
of assembling a smaller set: the projection from a node.

==== PEBBLE ====
Velvet used the reverse idea: projections to a node.
Hit the nodes in the vicinity of the starting contig,
and get distance information on the immediate vicinity.
It is a greedy algorithm: find the way to the next unique node.

Having a more precise insert size could help.
Velvet assumed a 10% spread on the insert size;
Solexa might have been different.
SOLiD goes up to 2 kb, Solexa just 300 bp.
It is difficult to get enough DNA of a narrow size range,
and they don't love barcoding because it uses a lot of reagents.


==== COLOR-SPACE ====

It is a 4-letter alphabet.
Double-encoded files use ACGT;
be careful not to confuse them.
There are different rules for complementarity:
just reverse the colors, don't complement them.
The assemblers and mappers should
in theory adapt to this.
You could work in base-space, but one error
will ruin the rest of the read. That is
a huge loss, so do everything in colorspace,
make a complete assembly,
and only convert to base calls at the end.
Typically the error is detected after
1 or 2 bp, then corrected.
SOLiD has developed tools. (A small encoding sketch follows.)
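
A hedged sketch of the standard SOLiD 2-base color code, showing why one color error ruins every base downstream and why reversal works without complementing (toy code, not the SOLiD tools):

<code python>
# Standard SOLiD dibase code: a color encodes the transition between bases.
ENCODE = {"AA": 0, "CC": 0, "GG": 0, "TT": 0,
          "AC": 1, "CA": 1, "GT": 1, "TG": 1,
          "AG": 2, "GA": 2, "CT": 2, "TC": 2,
          "AT": 3, "TA": 3, "CG": 3, "GC": 3}
DECODE = {(pair[0], color): pair[1] for pair, color in ENCODE.items()}

def to_colors(seq):
    return [ENCODE[seq[i:i + 2]] for i in range(len(seq) - 1)]

def to_bases(first_base, colors):
    seq = first_base                  # decoding needs one anchor base
    for c in colors:
        seq += DECODE[(seq[-1], c)]
    return seq

colors = to_colors("ATGGC")           # [3, 1, 0, 3]
print(to_bases("A", colors))          # ATGGC
bad = list(colors); bad[1] = 0        # a single color error...
print(to_bases("A", bad))             # ATTTA: all downstream bases are wrong

rc = "GCCAT"                          # reverse complement of ATGGC
print(to_colors(rc) == colors[::-1])  # True: just reverse, don't complement
</code>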

There are different error models:
454 had homopolymer issues (flowspace);
Illumina and ABI are different again.
That makes it hard to handle errors,
especially in a DBG, where applying the correction
rules becomes much more complex.
Newbler may be better for 454.

Pre-filtering reads:
some assemblers have built-in filtering of reads.
Filtering low-quality bases can cut down
on the computational cost.

These are user-specific, data-space-specific solutions;
there is no tried-and-true consistent approach.

Some assemblers require reads of identical lengths.

----

==== OASES ====

De-novo transcriptome assembly.
People are using NGS generally because it is cheaper.
There would be just one path through the graph except for alternative splicing.
Pevzner mentioned in 2002 that a DBG is equivalent to a
splice graph, so we can apply the algorithms developed for
splice graphs to DBGs.

=== Oases Process ===
A genetic locus should create a connected cluster of contigs.
Sometimes one cluster gets broken into two, but it works pretty well.

Dynamic programming assembly: find the path of highest weight. (A sketch follows.)
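
A minimal sketch of a highest-weight path search on a DAG, as a stand-in for that dynamic programming step (the exon node names and weights are made up; this is not Oases's actual code):

<code python>
from functools import lru_cache

def best_path(graph, weight, start):
    """Dynamic programming: highest total-weight path from start in a DAG."""
    @lru_cache(maxsize=None)
    def best(v):
        succs = graph.get(v, ())
        if not succs:
            return weight[v], (v,)
        w, path = max(best(s) for s in succs)
        return weight[v] + w, (v,) + path
    return best(start)

# toy splice graph: one isoform skips exon e2; weights stand in for coverage
graph = {"e1": ("e2", "e3"), "e2": ("e3",), "e3": ()}
weight = {"e1": 10, "e2": 2, "e3": 9}
print(best_path(graph, weight, "e1"))  # (21, ('e1', 'e2', 'e3'))
</code>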

=== Oases Output ===
The example output shows a skipped exon, and one artifact.
It handles contigs with different coverage levels.
Find donor and acceptor sites on or around junctions.

Drosophila reads:
read length 36, k=21, insert size 200.
90% map to the genome.
They found an unannotated gene with 7 exons.

The assembler has no notion of the reference genome,
but still found an example that mapped very well.


==== COLUMBUS ====
Mapping:
either map, or do de-novo assembly.
There are many good, mature mapping tools;
mapping is easier to analyze and comes with genomic coords.
De-novo is harder and may require special hardware,
but it works without a reference and handles novel structures.

How to get the best of both worlds?
Just throw the reference genome in as long reads?
Chucking in long reads gives only short contigs.
What they really wanted was to start from a
reference genome, then handle differences as little insertions.
Columbus takes a DBG plus SAM/BAM alignments (BWA, bowtie).
It keeps the reference sequences, which are not allowed to overlap with themselves.
They are starting with comparative transcriptome assembly,
and would like to ? cancer.

DBG is the convention for NGS.

For transcriptome work they use k=21,
and can use reads as short as 25 bp.
How much coverage is needed?
It depends on how many genes you want to catch. (See the note below.)
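
As a hedged side note, the usual k-mer coverage relation (which I believe Velvet's documentation uses) shows why short reads at a high k leave little effective coverage:

<code python>
# Effective k-mer coverage C_k versus base coverage C for read length L:
# each L bp read contributes only L - k + 1 words of length k.
L, k, C = 25, 21, 50
C_k = C * (L - k + 1) / L
print(C_k)  # 10.0 -- only a fifth of the base coverage survives at k=21
</code>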

With a genome, you generally want long inserts.
With a transcriptome, short inserts are more
useful; they have been using 200 bp inserts, tiny stuff.

==== End-of-class Business ====

We can put Daniel's slides on the wiki.

Galt will send Daniel a link to the wiki.

Kevin will be talking on Wednesday about
using SOLiD data for ordering and orienting Pog contigs,
with special-purpose scripts, because
the data was so noisy.