Differences

This shows you the differences between two versions of the page.

--- lecture_notes:04-19-2010 [2010/04/24 02:26]
cbrumbau Started to merge notes (incomplete), cleaned up formatting, filled in missing notes from presentation
+++ lecture_notes:04-19-2010 [2010/04/24 14:54]
karplus added correction about quadratic behavior of overlap-consensus
@@ Line 23: / Line 23: @@
   * Use whole-read alignments
-=== Quick OLC example ===
+=== Quick Overlap-Layout-Consensus (OLC) example ===
   * Filter out all the reads that are identical or self-contained in another read. Only left-to-right, etc. are allowed.
   * Compute all their overlaps.
@@ Line 34: / Line 34: @@
   * Have to compare all reads against each other.
     * Quadratic. This has become difficult.
-  * NGS reads are short, and overlaps between reads are short. They are getting a little longer with newer technology.
+    * Correction: quadratic behavior can be reduced to about O(n log n) by doing a clustering before aligning: just like in several fast search techniques, you only attempt to align sequences that share seed hits to relatively short k-mers (much shorter than those needed for de Bruijn graphs, since they only need to reduce the number of hits, not get unique locations in the genome). --- //[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/04/24 07:51//
+  * Next generation sequencing reads are short, and overlaps between reads are short. They are getting a little longer with newer technology.
   * OLC method uses all information in the read at least.
@@ Line 49: / Line 50: @@
     * Big dataset. Typically a binary marker to say if present.
   * Sometimes at branch have two possible extensions.
-  * Greedy approach essentially De Bruijn graph with local info only. In this structure you could have many paths through.
+  * Greedy approach essentially de Bruijn graph with local info only. In this structure you could have many paths through.
   * Can throw in reads of different lengths with no problems.
@@ Line 117: / Line 118: @@
   * Combined re-mapping and //de novo// sequencing (Cheetham et al., Pleaseance et al.).
     * Map to reference or similar genome, then re-map locally some stuff that is unique.
-    * MAQ and BWA only allow 3 or 4 mismatches.
+    * MAQ (Mapping and Assembly with Quality) and BWA (Burrows-Wheeler Alignment Tool) only allow 3 or 4 mismatches.
     * Can see reads not mapping in a small area with too many SNPs or breakpoint rearrangments.
   * Code parallelization (Birol et al., Jeffrey Cook).
     * Only ABySS has published yet
   * Improved indexing (Zamin Iqbal and Mario Caccamo).
-    * Fit WGS onto 200GB
+    * Fit whole genome shotgun onto 200GB
   * Use of intermediate remapping (Matthias Haimei).
     * Wanted to parallelize the second level work of building scaffolds, but those are mostly engineering issues.
@@ Line 135: / Line 136: @@
   * Played with the length
   * The longer your long reads are, the more bridgeable it is and the longer the contigs.
-    * "Why not just assemble the long reads?" someone might ask
+    * Someone might ask: "Why not just assemble the long reads?"
 === Different approaches to repeat resolution ===
@@ Line 201: / Line 202: @@
   * When using different technologies, you have to take into account different technologies
     * Trivial when doing OLC assembly
-    * Much more tricky when doing De Bruijn graph assemble, since k-mers are not assigned to reads
+    * Much more tricky when doing de Bruijn graph assemble, since k-mers are not assigned to reads
     * Different assemblers had different settings (e.g. Newbler)
   * 454 had homopolymer issues in flow-space.
@@ Line 221: / Line 222: @@
 === De Brujin graphs ~ splice graphs ===
-  * Pavel Pevzner mentioned that De Bruijn graphs are more or less like splice graphs
+  * People are using next generation sequencing generally because it's cheaper.
+  * Would have just one path except for alternate splicing.
+  * Pavel Pevzner mentioned in 2002 that de Bruijn graphs are equivalent to splice graphs, so we can apply those algorithms developed for splice graphs to de Bruijn graphs.
 === Oases: Process ===
@@ Line 227: / Line 230: @@
     * Connecting reads
     * Connecting read pairs
+    * The genetic locus should create a connected locus, cluster of contigs.
+    * Sometimes one cluster gets broken into two, but works pretty well.
   * Re-implement traditional algorithms to run on graphs, instead of a reference genome
     * Greedy transitive reduction (Myers, 2005)
     * Motif searches
     * Dynamic assembly of transcripts (Lee, 2003)
+      * find the path of highest weight
 === Oases: Output ===
   * Full length transcripts - handles alternative splicing.
+    * With skipped exon and one artifact.
   * Transcripts from different genes - handles varying coverage levels.
   * Alternative splicing events: when possible, distinguished between common AS events.
+    * Find donor and acceptor sites on or around junctions.
 === Oases: Results ===
@@ Line 245: / Line 253: @@
 === Oases: results of Drosophila PE-reads ===
   * The number of transcripts and loci are in the ballpark of what they were expecting.
+  * rl 36, k21 ins size 200
+  * Map to genome ~90%
+=== Oases: examples ===
+  * Found an unannotated gene with 7 exons
+  * Assembler has no notion of the reference genome, but still found one example that mapped very well.
 === Mapping vs. de novo ===
   * Mapping
-    * Many tools available
+    * Many good mature tools available
     * Easy to compute
-    * Simpler to analyze
+    * Simpler to analyze, comes with genetic coords
   * //De novo//
+    * Harder than mapping
+    * May require special hardware
     * No reference needed
     * Handles novel structures without //a priori//
+  * How to get the best of both worlds?
 === Columbus: Velvet module ===
   * Solution: the reference-based de Bruijn graph
-    * Preserves the structure of the reference sequences
+    * Just throw in the reference genome as long reads.
-    * Integrates alignment information (SAM, BAM file)
+    * Chucking in long reads and getting short contigs.
+    * Preserves the structure of the reference sequences. Not allowed to overlap with self.
+      * What they wanted really was to start from a reference genome, then handle it as a little insertion.
+    * Integrates alignment information (SAM, BAM file with BWA, Bowtie)
     * Edits, breaks up, reconnects, and extends the "alignment contigs" as necessary
+  * Applications
+    * Comparative transcriptome assembly
+    * Cancer RNAseq
+    * Comparative assembly
-===== Galt's Notes =====
+=== De Bruijn graph convention for next generation sequencing ===
+  * For transcriptome, using k=21.
-==== OASES ====
+  * Can use reads as short as 25.
+  * How many X coverage needed?
-De-novo Transcriptome Assembly.
+  * How many genes do you want to catch?
-People are using NGS generally because it's cheaper.
+  * With genome, want long inserts generally.
-Would have just one path except for alternate splicing.
+  * With transcriptome, short inserts are more useful. They have been using 200bp tiny stuff.
-Pevzner in 2002 mentioned DBG is equivalent to a
-splice-graph.  So can apply those algorithms developed for
-Splice-graphs to DBG.
-=== Oases Process ===
-The genetic locus should create a connected locus, cluster of contigs.
-Sometimes one cluster gets broken into two, but works pretty well.
-Dynamic assembly, find the path of highest weight.
-=== Oases Output ===
-With skipped exon, and one artifact.
-Handles contigs with different coverage levels.
-Find donor and acceptor sites on or around junctions.
-Drosophila reads.
-rl 36, k21 ins size 200.
-Map to genome 90%
-Found an unannotated gene with 7 exons.
-Assembler has no notion of the reference genome,
-but still found one example that mapped very well.
-==== COLUMBUS ====
-Mapping.
-Either map, or do de-novo.
-Many good mature mapping tools.
-Easier to analyze, comes with genomic coords.
-De-novo harder.  May require special hardware.
-Works without reference.  Handles novel structures.
-How to get the best of both worlds?
-Just throw in the ref genome as long reads.
-Chucking in long reads and getting short contigs.
-What they wanted really was to start from a
-reference genome.  Then handle it as a little insertion.
-Takes DBG and SAM/BAM (BWA, bowtie)
-Keeps ref sequences.  Not allowed to overlap with self.
-Starting with comparative transcriptome assembly.
-Would like to ? cancer.
-DBG convention for NGS.
-For transcriptome, using k=21.
-Can use reads as short as 25.
-How many X coverage needed?
-How many genes do you want to catch.
-With genome, want long inserts generally.
-With transcriptome, short inserts are more
-useful.  They have been using 200bp tiny stuff.
-===== End-of-class business =====
+===== After presentation =====
-Kevin on Wed. will be talking about:
+Kevin on Wednesday will be talking about:
   * Use SOLiD data for ordering and orientation of //Pog// contigs
   * Special purpose scripts
   * The data was so noisy

Banana Slug Genomics

User Tools

Site Tools

Differences

Page Tools