lecture_notes:04-19-2010 [2010/04/22 06:10] galt Added Galt's notes.
  * Integrates alignment information (SAM, BAM file)
  * Edits, breaks up, reconnects ** INCOMPLETE **

===== Galt's Notes =====

Assembling and comparing genomes or transcriptomes using de Bruijn graphs.

Daniel Zerbino, Marcel Schulz, Wendy Lee, Ewan Birney and David Haussler.

Slides will be made available.
Talking about:
  * ways to build graphs
  * handling repeats
  * practical considerations
  * extensions to Velvet

==== OLC ====
The traditional approach is overlap-layout-consensus.\\
Phrap, etc.\\
For short reads: Edena.\\
More expensive computationally.
Uses whole-read alignments.

=== Quick OLC example ===
  * Filter out all reads that are identical to, or contained in, another read.
  * Compute all their overlaps (only left-to-right overlaps allowed).
  * You get many redundant transitive connections; many algorithms are available for transitive reduction.
  * You obtain simplified paths through the reads.
  * The example yields 4 big chunks of reads.
  * It is hard to distinguish which reads belong to which copy of a repeat.

This is how contigs were first built. You have to compare all reads against each other, which is quadratic; this has become difficult. NGS reads are short, and overlaps between reads are short (they are getting a little longer with newer technology). The OLC method at least uses all the information in the read.
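The filter/overlap/reduce steps above can be sketched in Python; this is a toy illustration (the `overlaps` helper and the minimum-overlap threshold are my choices, not any real assembler's):

```python
def overlaps(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b
    (left-to-right overlaps only); 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def olc_edges(reads, min_len=3):
    # Filter out reads contained in another read.
    kept = [r for r in reads if not any(r != s and r in s for s in reads)]
    # Compute all pairwise overlaps (quadratic, as noted above).
    edges = {(a, b): n for a in kept for b in kept
             if a != b and (n := overlaps(a, b, min_len))}
    # Transitive reduction: drop a->c when a->b and b->c already exist.
    return {(a, c) for (a, c) in edges
            if not any((a, b) in edges and (b, c) in edges
                       for b in kept if b not in (a, c))}
```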

==== DBG ====

De Bruijn graphs, popularized by Pavel Pevzner in a 2001 paper.
Why do we use them?
High-throughput, cheap, short reads with errors.

By 2007 the next-gen wave meant there were many papers: SSAKE, SHARCGS, VCAKE.
These use a k-mer hash; much faster and leaner, but you get noise.

Build a dictionary of words in the data set.
His example uses 4-mers, but typically 20-mers and longer.
For each, create a node in the hash table.
The dataset is big; typically a binary marker says whether a k-mer is present.
Sometimes at a branch there are two possible extensions.
The greedy approach is essentially a De Bruijn graph with local information only.
In this structure you can have many paths through, and reads of different lengths can be used.
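A minimal sketch of that k-mer dictionary, using the speaker's 4-mer example size: every k-mer in a read becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix.

```python
from collections import defaultdict

def build_dbg(reads, k=4):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes
    that follow it somewhere in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph
```

Reads of different lengths drop in naturally, since each read just contributes its own k-mers.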

Later Euler, Allpaths, Velvet, ABySS, and SOAPdenovo came out.
They souped up the k-mer hash but process it more globally.
Graph construction and correction is similar in all these assemblers.
ABySS handles parallel implementation.

First, concatenate the simple paths without branches into nodes, compacting each non-branching run into a single item.
The biochemistry of short reads means errors tend to be near the ends.
Errors in the second half of reads create millions of tips, but those are at least easy to remove later.
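Tip clipping might look like the sketch below: short dead-end branches hanging off a branch point are removed, while the trunk survives. The branch test and `max_tip_len` here are illustrative, not Velvet's actual criteria.

```python
def clip_tips(graph, max_tip_len=2):
    """graph: node -> set of successor nodes. Remove dead-end tips,
    but only where the predecessor has an alternative branch,
    so the main path is never clipped."""
    for _ in range(max_tip_len):
        tips = set()
        for node, succs in graph.items():
            if len(succs) > 1:              # branch point
                for s in succs:
                    if not graph.get(s):    # dead end one step away: a tip
                        tips.add(s)
        if not tips:
            break
        for node in graph:                  # detach and delete the tips
            graph[node] -= tips
        for t in tips:
            graph.pop(t, None)
    return graph
```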

Each read is a path through the graph. The head of the read is still kept, so we're not losing the read-placement information.

We are still left with bubbles.
Possibly an accumulation of overlapping spur-tips; more often variation, e.g. a SNP.
Would like to keep that information where possible.
A bubble could also come from a long read, but we don't want to cut the long read in the middle.
When Velvet finds a bubble, it remaps the lower-coverage branch back onto the higher-coverage one.
We may also have a loop that goes back on itself (not a bubble), e.g. tandem repeats.
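The simplest bubble pattern (two one-node branches that reconverge one step later) can be spotted as below, keeping the higher-coverage branch as described. This is a toy model of the idea, not Velvet's Tour Bus implementation.

```python
def find_simple_bubbles(graph, coverage):
    """Find branch points whose two successors reconverge at the same
    node; return (start, keep, drop, end) tuples, where `drop` is the
    lower-coverage branch to be remapped onto `keep`."""
    bubbles = []
    for start, succs in graph.items():
        if len(succs) == 2:
            # Order the two branches by coverage, highest first.
            a, b = sorted(succs, key=lambda n: -coverage.get(n, 0))
            if graph.get(a) and graph.get(a) == graph.get(b):
                end = next(iter(graph[a]))
                bubbles.append((start, a, b, end))
    return bubbles
```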

OLC has atomic arcs; DBG has a greater variety of overlaps.

OLC gets error correction from long reads; DBG requires more work to correct.

==== Tour Bus ====
Errors.
Compare a 5 Mb chunk from 4 species: E. coli, yeast, C. elegans, human.
Coverage increased from 0 to 50x, with N50 on the y axis.
First, exponential growth as expected.
Then the complexity causes trouble, unless you have other kinds of information.
This is not using paired-read information.
You see that human is a little more complex than E. coli.
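For reference, the N50 statistic on the y axis can be computed like this:

```python
def n50(contig_lengths):
    """N50: the contig length at which the running total of lengths,
    taken largest-first, first reaches half the assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```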

Scaling up the hard way: chrX.
Half a billion reads from flow-sorted human chrX.
Fit in 70 GB of RAM.
Overall length 130 Mb.
The contigs were short and numerous.
How to handle the repeats?

**Margulies** used reduced representation, with restriction enzymes and a gel, to cut the problem into tractable bits.
Chunks were cut from the gel, and each gel slice sequenced separately, or at least in different lanes.
Then the assembly problem is smaller and more localized instead of whole-genome.

Combined re-mapping and de novo sequencing (**Cheetem and Pleasance**).
Map to a reference or similar genome, then locally re-map some of the stuff that's unique.

MAQ and BWA only allow 3 or 4 mismatches.
You can see reads failing to map in a small area with too many SNPs or a breakpoint rearrangement.

Code parallelization (**Birol and Cook**).
Only ABySS has published yet.

Improved indexing: fit WGS into 200 GB.

Use of intermediate re-mapping (**Haimel**).
Wanted to parallelize the second-level work of building scaffolds.

But those are mostly engineering issues.

The DBG for a small organism looks like long-ish contigs, but still makes a hair-ball.
==== Rock Band ====
Use long and short reads together.
Long reads are mapped seamlessly into the graph.

35 bp reads at 50x, plus some long reads (100 bp to 1 kb) at 0-20x.
They played with the length: the longer your long reads are, the more bridgeable the graph is and the longer the contigs.
Why not just assemble the long reads, someone might ask.

A different approach: spectral graph analysis.
Find the vector that threads its way through.
It has a fancy theory with massive matrix diagonalization.
It takes all the data at once in one huge matrix and is not too affected by noise.
No one has actually tried this approach; too much computation.
The traditional approach is contigs with paired-end reads: find chains of contigs connected by read pairs.

Your coverage is huge: each basepair has to handle 1000x coverage of read-pairs.
Long reads are separated by a mess in the middle.
Check the possible paths and find which are acceptable.
Find all the reads that start and end on the two long reads, then consider them all together; much more efficient.

5% or more of mate-pairs can be wrong.
Hopefully we will be able to filter them out.
Velvet uses a likelihood approach.
Given a distance estimate and the coverage of A and B, how many pairs would we expect to see? If we see too few, we discard the connection.
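The expected-count test can be sketched as below. The expectation model and the 30% cutoff are illustrative assumptions, not Velvet's actual likelihood computation.

```python
def expected_pairs(cov_a, cov_b, insert_mean, read_len, gap):
    """Rough expectation of read pairs spanning a gap between contigs
    A and B: the window of insert placements that puts one read on A
    and its mate on B, scaled by the lower of the two coverages."""
    span = max(insert_mean - gap - 2 * read_len, 0)
    return min(cov_a, cov_b) * span / insert_mean

def keep_connection(observed, expected, min_fraction=0.3):
    """Discard the A-B link if we saw far fewer pairs than expected."""
    return observed >= min_fraction * expected
```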

Allpaths did better by using small localized problems, a single small scaffold each, for parallelization.

The Shorty algorithm uses the variance between read pairs anchored by a k-mer on a common contig.
The projected image of the other ends creates the possibility of assembling a smaller set: the projection from a node.

==== PEBBLE ====
Velvet used the reverse idea: the projections to a node.
Hit the nodes in the vicinity of the starting contig, and get distance information on the immediate vicinity.
It is a greedy algorithm: find the way to the next unique node.

Having a more precise insert size could help.
Velvet assumed a 10% spread around the insert size.
Solexa might have been different.
SOLiD goes up to 2 kb; Solexa just 300 bp.
It is difficult to get enough DNA in a narrow size range.
They don't love barcoding because it uses a lot of reagents.


==== COLOR-SPACE ====

A 4-letter alphabet.
Double-encoded files use ACGT; be careful not to confuse them.
There are different rules for complementarity: just reverse, don't complement.
The assemblers and mappers should in theory adapt to this.
You could convert to base-space first, but then one error ruins the rest of the read.
That is a huge loss, so do everything in colorspace, make a complete assembly, and convert to base calls only at the end.
Typically an error is detected after 1 or 2 bp, then corrected.
SOLiD has developed tools.
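Both the reverse-without-complement rule and the error-propagation problem fall out of the encoding, where each color records the transition between adjacent bases. A sketch using the standard SOLiD transition table:

```python
# SOLiD colorspace: AA/CC/GG/TT=0, AC/CA/GT/TG=1, AG/GA/CT/TC=2, AT/TA/CG/GC=3.
CODE = {("A", "A"): "0", ("C", "C"): "0", ("G", "G"): "0", ("T", "T"): "0",
        ("A", "C"): "1", ("C", "A"): "1", ("G", "T"): "1", ("T", "G"): "1",
        ("A", "G"): "2", ("G", "A"): "2", ("C", "T"): "2", ("T", "C"): "2",
        ("A", "T"): "3", ("T", "A"): "3", ("C", "G"): "3", ("G", "C"): "3"}
DECODE = {(prev, color): nxt for (prev, nxt), color in CODE.items()}

def to_basespace(primer, colors):
    """Decode a colorspace read (primer base + color string) into bases.
    One wrong color flips every downstream base, which is why assembly
    stays in colorspace and converts only at the end."""
    seq, prev = "", primer
    for c in colors:
        prev = DECODE[(prev, c)]
        seq += prev
    return seq

def reverse_colors(colors):
    """Reverse 'complement' in colorspace: just reverse, don't complement."""
    return colors[::-1]
```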

Different error models.
454 had homopolymer issues (flowspace); Illumina and ABI are different.
That makes it hard to handle errors, especially in a DBG: applying the correction rules becomes much more complex.
Newbler may be better for 454.

Pre-filtering reads: some assemblers have built-in filtering.

Filtering low-quality bases can cut down on the computational cost.

These are user-specific, data-space-specific solutions; no tried-and-true consistent approach.

Some assemblers require reads of identical lengths.

-----------

==== OASES ====

De novo transcriptome assembly.
People are using NGS generally because it's cheaper.
There would be just one path except for alternative splicing.
Pevzner mentioned in 2002 that a DBG is equivalent to a splice graph, so the algorithms developed for splice graphs can be applied to DBGs.

=== Oases Process ===
A genetic locus should create a connected cluster of contigs.
Sometimes one cluster gets broken into two, but it works pretty well.

Dynamic programming on the assembly: find the path of highest weight.
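Finding the highest-weight path through an acyclic splice graph by dynamic programming can be sketched as follows (a toy version; node weights stand in for coverage):

```python
def heaviest_path(graph, weight, start, end):
    """Highest-weight start-to-end path in a DAG, by DP over a
    topological order. graph: node -> list of successors;
    weight: node -> coverage-style weight."""
    order, seen = [], set()
    def visit(n):                       # depth-first post-order
        if n not in seen:
            seen.add(n)
            for s in graph.get(n, []):
                visit(s)
            order.append(n)
    visit(start)
    best, back = {start: weight[start]}, {}
    for n in reversed(order):           # topological order
        for s in graph.get(n, []):
            cand = best.get(n, float("-inf")) + weight[s]
            if cand > best.get(s, float("-inf")):
                best[s] = cand
                back[s] = n
    path = [end]                        # walk back-pointers to start
    while path[-1] != start:
        path.append(back[path[-1]])
    return path[::-1], best[end]
```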

=== Oases Output ===
With a skipped exon, and one artifact.
Handles contigs with different coverage levels.
Finds donor and acceptor sites on or around junctions.

Drosophila reads: read length 36, k=21, insert size 200.
90% map to the genome.
Found an unannotated gene with 7 exons.

The assembler has no notion of the reference genome, but still found one example that mapped very well.


==== COLUMBUS ====
Mapping.
Either map, or do de novo.
There are many good, mature mapping tools; mapping is easier to analyze and comes with genomic coordinates.
De novo is harder and may require special hardware, but it works without a reference and handles novel structures.

How to get the best of both worlds?
Just throw in the reference genome as long reads?
That is chucking in long reads and getting short contigs back.
What they really wanted was to start from a reference genome, then handle differences like a little insertion.
Columbus takes a DBG plus SAM/BAM (BWA, Bowtie).
It keeps the reference sequences, which are not allowed to overlap with themselves.
Starting with comparative transcriptome assembly.
Would like to (?) cancer.

DBG conventions for NGS.

For transcriptome, use k=21.
Reads as short as 25 bp can be used.
How much coverage is needed? It depends on how many genes you want to catch.

With a genome, you generally want long inserts.
With a transcriptome, short inserts are more useful; they have been using tiny 200 bp inserts.
==== End-of-class Business ====

We can put Daniel's slides on the wiki.

Galt will send Daniel a link to the wiki.

Kevin will be talking on Wednesday about:
using SOLiD data for ordering and orientation of Pog contigs,
with special-purpose scripts.
The data was so noisy.

lecture_notes/04-19-2010.txt · Last modified: 2010/04/24 14:54 by karplus