====== Lecture Notes for April 19, 2010 ======
Need to fill in any **INCOMPLETE** portions

{{:lecture_notes:20100419_banana_slug_seminar.ppt|Daniel's Powerpoint}}
  
===== Presentation =====

    * Integrates alignment information (SAM, BAM file)
    * Edits, breaks up, reconnects **INCOMPLETE**


===== Galt's Notes =====

Assembling and comparing genomes or transcriptomes using de Bruijn graphs.

Daniel Zerbino, Marcel Schulz, Wendy Lee,
Ewan Birney and David Haussler.

Slides will be made available.

Talking about:
  * ways to build graphs
  * handling repeats
  * practical considerations
  * extensions to Velvet

==== OLC ====
The traditional approach is overlap-layout-consensus:
Phrap, etc.\\
For short reads: Edena.\\
It is more expensive computationally and
uses whole-read alignments.

=== Quick OLC example ===
(A toy code sketch of these steps follows the list.)
  * Filter out all the reads that are identical to, or contained in, another read.
  * Compute all their overlaps; only left-to-right overlaps etc. are allowed.
  * You get many transitive, redundant connections; many algorithms are available for transitive reduction.
  * You obtain simplified paths through the reads; the example yields 4 big chunks of reads.
  * It is hard to distinguish which reads belong to which copy of a repeat.
  * This is how contigs were first built.
  * You have to compare all reads against each other, which is quadratic; this has become difficult.
  * NGS reads are short, and overlaps between reads are short, though they are getting a little longer with newer technology.
  * At least the OLC method uses all the information in the read.
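
A minimal, hedged sketch of those steps in Python (toy reads and a naive quadratic overlap scan for illustration; this is not Phrap or Edena):

<code python>
# Toy OLC sketch: drop contained reads, compute left-to-right
# suffix/prefix overlaps, then remove transitive edges.

def overlap(a, b, min_len):
    """Longest suffix of a that is a prefix of b (at least min_len)."""
    for l in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-l:] == b[:l]:
            return l
    return 0

def olc_edges(reads, min_len=3):
    # Filter out reads identical to or contained in another read.
    kept = [r for i, r in enumerate(reads)
            if not any(r in s and (s != r or j < i)
                       for j, s in enumerate(reads) if j != i)]
    # Quadratic all-vs-all overlap computation (the expensive part).
    edges = {(a, b): l for a in kept for b in kept if a != b
             for l in [overlap(a, b, min_len)] if l > 0}
    # Transitive reduction: drop a->c when a->b and b->c also exist.
    return {(a, c): l for (a, c), l in edges.items()
            if not any((a, b) in edges and (b, c) in edges
                       for b in kept if b not in (a, c))}

print(olc_edges(["ATGGCGT", "GGCGTGC", "GTGCAAT"]))
# {('ATGGCGT', 'GGCGTGC'): 5, ('GGCGTGC', 'GTGCAAT'): 4}
</code>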

==== DBG ====

De Bruijn graphs were popularized by Pavel Pevzner in a 2001 paper.
Why do we use them?
Because next-gen sequencing gives high-throughput, cheap, short reads with errors.

By 2007 there were many next-gen assembly papers (SSAKE, SHARCGS, VCAKE);
they use a kmer hash, which is much faster and leaner, but you get noise.

Build a dictionary of words in the data set.
His example uses 4mers, but typically 20mers and longer.
For each word, create a node in the hash table.
The dataset is big, so typically a binary marker says whether a word is present.
Sometimes at a branch there are two possible extensions.
The greedy approach is essentially a De Bruijn graph with
local information only. In this structure you could have
many paths through, and reads of different lengths can be used.
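
A minimal sketch of that dictionary-building step (k=4 as in the talk's toy example; real assemblers use k of 20 or more and a compact presence marker rather than Python sets):

<code python>
from collections import defaultdict

def de_bruijn(reads, k=4):
    """Hash every k-long word; nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:                      # reads may differ in length
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # adjacent words overlap by k-1
    return graph

# A node with two successors is a branch: two possible extensions.
g = de_bruijn(["ATGGCGTGCA", "GGCGTGCAAT"])
for node in sorted(g):
    print(node, "->", sorted(g[node]))
</code>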

Later Euler, Allpaths, Velvet, ABySS, SOAPdenovo came out.
They souped up the kmer hash, but processed it more globally.
Graph construction and correction is similar in all these assemblers.
ABySS adds a parallel implementation.

First, concatenate the simple chains without branches:
compact each non-branching run into a single node.
The biochemistry of short reads means errors tend to be near the ends.
Errors in the second half of the reads
create millions of tips, but those are at least easy to
remove later. (A compaction sketch follows.)
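
A hedged sketch of that compaction step on a toy graph (it ignores pure cycles and is not Velvet's actual code):

<code python>
from collections import defaultdict

def compact(graph):
    """Glue each non-branching run of (k-1)-mer nodes into one sequence."""
    indeg = defaultdict(int)
    for u in graph:
        for v in graph[u]:
            indeg[v] += 1

    def simple(v):  # exactly one way in and one way out
        return indeg[v] == 1 and len(graph.get(v, ())) == 1

    chains = []
    for u in list(graph):
        if simple(u):            # middle of a run: not a chain start
            continue
        for v in graph[u]:
            chain = [u]
            while simple(v):
                chain.append(v)
                v = next(iter(graph[v]))
            chain.append(v)
            # adjacent nodes share all but one letter, so append last letters
            chains.append(chain[0] + "".join(n[-1] for n in chain[1:]))
    return chains

# the toy 3-mer graph from the de Bruijn sketch above, written out by hand
g = {"ATG": {"TGG"}, "TGG": {"GGC"}, "GGC": {"GCG"}, "GCG": {"CGT"},
     "CGT": {"GTG"}, "GTG": {"TGC"}, "TGC": {"GCA"}, "GCA": {"CAA"},
     "CAA": {"AAT"}}
print(compact(g))  # ['ATGGCGTGCAAT']
</code>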

Each read is a path through the graph. The head of the read
is still kept, so we are not losing the read placement information.

We are still left with bubbles.
They are possibly an accumulation of overlapping spur-tips,
but more often variation, e.g. a SNP.
We would like to keep that information where possible.
A bubble could also come from a long read, and we don't want to cut
the long read in the middle. When Velvet finds a bubble,
it remaps the lower-coverage branch back onto the higher-coverage one.
Sometimes there is a loop that goes back
on itself (not a bubble), e.g. tandem repeats. (A minimal bubble detector is sketched below.)
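
A hedged sketch of detecting the simplest kind of bubble, two single-node branches that diverge and reconverge (Velvet's Tour Bus actually does a Dijkstra-like traversal plus sequence alignment; the node names here are made up):

<code python>
def simple_bubbles(graph):
    """Find (start, branch1, branch2, end) where two paths reconverge."""
    found = []
    for u in graph:
        succs = sorted(graph[u])
        for i in range(len(succs)):
            for j in range(i + 1, len(succs)):
                # both branches lead to a common node: a bubble
                shared = graph.get(succs[i], set()) & graph.get(succs[j], set())
                for w in sorted(shared):
                    found.append((u, succs[i], succs[j], w))
    return found

# toy bubble: A splits into B1/B2 (e.g. a SNP) and rejoins at C
g = {"A": {"B1", "B2"}, "B1": {"C"}, "B2": {"C"}, "C": set()}
print(simple_bubbles(g))  # [('A', 'B1', 'B2', 'C')]
</code>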

OLC has atomic arcs;
DBG has a greater variety of overlaps.

OLC gets error correction from long reads;
DBG requires more work to correct.

==== Tour Bus ====
Errors.
Compare a 5 Mb chunk from 4 species: E. coli, yeast, C. elegans, human.
Coverage increases from 0 to 50x, with N50 on the y axis.
First there is exponential growth, as expected.
Then the complexity causes trouble,
unless you have other kinds of information
(paired-read information is not used here).
You see that human is a little more complex than E. coli.

Scaling up the hard way: chrX.
Half a billion reads from flow-sorted human chrX
fit in 70 GB of RAM.
The overall length is 130 Mb.
The contigs are short and numerous.
How to handle the repeats?

**Margulies** used reduced representation,
using restriction enzymes and a gel
to cut the problem into tractable bits:
cut chunks from the gel, then
sequence each gel slice separately,
or at least in different lanes.
Then the assembly problem is smaller
and more localized instead of whole-genome.

Combined re-mapping and de novo sequencing
(**Cheetem and Pleasance**):
map to a reference or similar genome,
then locally re-map the stuff that is unique.

MAQ and BWA only allow 3 or 4 mismatches.
You can see reads failing to map in a small area
with too many SNPs or a breakpoint rearrangement.

Code parallelization (**Birol and Cook**):
only ABySS has published yet.

Improved indexing:
fit whole-genome shotgun data onto 200 GB.

Use of intermediate re-mapping (**Haimel**):
they wanted to parallelize the second-level
work of building scaffolds.

But those are mostly engineering issues.

The DBG for a small organism
looks like long-ish contigs,
but still makes a hair-ball.

==== Rock Band ====
Use long and short reads together:
long reads are mapped seamlessly into the graph.

35 bp reads at 50x, plus some long reads (100 bp to 1 kb) at 0-20x.
They played with the length:
the longer your long reads are, the more bridgeable
the graph is and the longer the contigs.
Someone might ask: why not just assemble the long reads?

A different approach is spectral graph analysis:
find the vector that threads its way through.
It has a fancy theory with massive matrix diagonalization.
It takes all the data at once in one huge matrix,
and is not too affected by noise. No one has actually
tried this approach; it is too much computation.
The traditional approach is contigs with paired-end reads:
find chains of contigs connected by read pairs.

Your coverage is huge:
each basepair has to handle 1000x coverage
of read-pairs. Long reads are separated by a mess
in the middle. Check the possible paths and
find which are acceptable.
Find all the reads that start and end on the
two long reads, then consider them all
together; that is much more efficient.

5% or more of mate-pairs can be wrong;
hopefully we will be able to filter them out.
Velvet uses a likelihood approach:
given a distance estimate and
the coverage of A and B, how many pairs would
we expect to see? If we see too few,
we discard the connection. (A back-of-envelope sketch follows.)
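
A rough sketch of that expected-count check (the formula and the numbers are my illustrative assumptions, not Velvet's actual model):

<code python>
def expected_spanning_pairs(coverage, read_len, insert_len, gap):
    """Rough expected count of read pairs bridging two adjacent contigs."""
    # insert placements where read 1 ends on contig A and read 2 starts on B
    window = max(insert_len - gap - 2 * read_len, 0)
    pairs_per_bp = coverage / (2 * read_len)  # read-pair density per basepair
    return pairs_per_bp * window

expected = expected_spanning_pairs(coverage=50, read_len=35,
                                   insert_len=300, gap=100)
print(f"expect ~{expected:.0f} bridging pairs")  # ~93
# If we observe far fewer than this, we discard the connection.
</code>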

Allpaths did better by using small localized problems,
a single small scaffold at a time, for parallelization.

The Shorty algorithm uses the variance between
read pairs anchored on a common contig at a kmer.
The projected image of the other ends creates the possibility
of assembling a smaller set: the projection from a node.

==== PEBBLE ====
Velvet used the reverse idea: projections to a node.
Hit the nodes in the vicinity of the starting contig,
and get distance information on the immediate vicinity.
It is a greedy algorithm: find the way to the next unique node.

Having a more precise insert size could help.
Velvet assumed a 10% spread on the insert size;
Solexa might have been different.
SOLiD goes up to 2 kb, Solexa just 300 bp.
It is difficult to get enough DNA of a narrow size range,
and they don't love barcoding because it uses a lot of reagents.


==== COLOR-SPACE ====

It is a 4-letter alphabet.
Double-encoded files use ACGT;
be careful not to confuse them.
There are different rules for complementarity:
just reverse the colors, don't complement them.
The assemblers and mappers should
in theory adapt to this.
You could work in base-space, but one error
will ruin the rest of the read. That is
a huge loss, so do everything in colorspace,
make a complete assembly,
and only convert to base calls at the end.
Typically the error is detected after
1 or 2 bp, then corrected.
SOLiD has developed tools. (A small encoding sketch follows.)
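
A hedged sketch of the standard SOLiD 2-base color code, showing why one color error ruins every base downstream and why reversal works without complementing (toy code, not the SOLiD tools):

<code python>
# Standard SOLiD dibase code: a color encodes the transition between bases.
ENCODE = {"AA": 0, "CC": 0, "GG": 0, "TT": 0,
          "AC": 1, "CA": 1, "GT": 1, "TG": 1,
          "AG": 2, "GA": 2, "CT": 2, "TC": 2,
          "AT": 3, "TA": 3, "CG": 3, "GC": 3}
DECODE = {(pair[0], color): pair[1] for pair, color in ENCODE.items()}

def to_colors(seq):
    return [ENCODE[seq[i:i + 2]] for i in range(len(seq) - 1)]

def to_bases(first_base, colors):
    seq = first_base                  # decoding needs one anchor base
    for c in colors:
        seq += DECODE[(seq[-1], c)]
    return seq

colors = to_colors("ATGGC")           # [3, 1, 0, 3]
print(to_bases("A", colors))          # ATGGC
bad = list(colors); bad[1] = 0        # a single color error...
print(to_bases("A", bad))             # ATTTA: all downstream bases are wrong

rc = "GCCAT"                          # reverse complement of ATGGC
print(to_colors(rc) == colors[::-1])  # True: just reverse, don't complement
</code>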

There are different error models:
454 had homopolymer issues (flowspace);
Illumina and ABI are different again.
That makes it hard to handle errors,
especially in a DBG, where applying the correction
rules becomes much more complex.
Newbler may be better for 454.

Pre-filtering reads:
some assemblers have built-in filtering of reads.
Filtering low-quality bases can cut down
on the computational cost.

These are user-specific, data-space-specific solutions;
there is no tried-and-true consistent approach.

Some assemblers require reads of identical lengths.

----

==== OASES ====

De-novo transcriptome assembly.
People are using NGS generally because it is cheaper.
There would be just one path through the graph except for alternative splicing.
Pevzner mentioned in 2002 that a DBG is equivalent to a
splice graph, so we can apply the algorithms developed for
splice graphs to DBGs.

=== Oases Process ===
A genetic locus should create a connected cluster of contigs.
Sometimes one cluster gets broken into two, but it works pretty well.

Dynamic programming assembly: find the path of highest weight. (A sketch follows.)
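
A minimal sketch of a highest-weight path search on a DAG, as a stand-in for that dynamic programming step (the exon node names and weights are made up; this is not Oases's actual code):

<code python>
from functools import lru_cache

def best_path(graph, weight, start):
    """Dynamic programming: highest total-weight path from start in a DAG."""
    @lru_cache(maxsize=None)
    def best(v):
        succs = graph.get(v, ())
        if not succs:
            return weight[v], (v,)
        w, path = max(best(s) for s in succs)
        return weight[v] + w, (v,) + path
    return best(start)

# toy splice graph: one isoform skips exon e2; weights stand in for coverage
graph = {"e1": ("e2", "e3"), "e2": ("e3",), "e3": ()}
weight = {"e1": 10, "e2": 2, "e3": 9}
print(best_path(graph, weight, "e1"))  # (21, ('e1', 'e2', 'e3'))
</code>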

=== Oases Output ===
The example output shows a skipped exon, and one artifact.
It handles contigs with different coverage levels.
Find donor and acceptor sites on or around junctions.

Drosophila reads:
read length 36, k=21, insert size 200.
90% map to the genome.
They found an unannotated gene with 7 exons.

The assembler has no notion of the reference genome,
but still found an example that mapped very well.


==== COLUMBUS ====
Mapping:
either map, or do de-novo assembly.
There are many good, mature mapping tools;
mapping is easier to analyze and comes with genomic coords.
De-novo is harder and may require special hardware,
but it works without a reference and handles novel structures.

How to get the best of both worlds?
Just throw the reference genome in as long reads?
Chucking in long reads gives only short contigs.
What they really wanted was to start from a
reference genome, then handle differences as little insertions.
Columbus takes a DBG plus SAM/BAM alignments (BWA, bowtie).
It keeps the reference sequences, which are not allowed to overlap with themselves.
They are starting with comparative transcriptome assembly,
and would like to ? cancer.

DBG is the convention for NGS.

For transcriptome work they use k=21,
and can use reads as short as 25 bp.
How much coverage is needed?
It depends on how many genes you want to catch. (See the note below.)
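
As a hedged side note, the usual k-mer coverage relation (which I believe Velvet's documentation uses) shows why short reads at a high k leave little effective coverage:

<code python>
# Effective k-mer coverage C_k versus base coverage C for read length L:
# each L bp read contributes only L - k + 1 words of length k.
L, k, C = 25, 21, 50
C_k = C * (L - k + 1) / L
print(C_k)  # 10.0 -- only a fifth of the base coverage survives at k=21
</code>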

With a genome, you generally want long inserts.
With a transcriptome, short inserts are more
useful; they have been using 200 bp inserts, tiny stuff.

==== End-of-class Business ====

We can put Daniel's slides on the wiki.

Galt will send Daniel a link to the wiki.

Kevin will be talking on Wednesday about
using SOLiD data for ordering and orienting Pog contigs,
with special-purpose scripts, because
the data was so noisy.