Differences

This shows you the differences between two versions of the page.

--- lecture_notes:04-13-2015 [2015/04/13 18:31]
ceisenhart
+++ lecture_notes:04-13-2015 [2015/04/17 23:00] (current)
sihussai
@@ Line 1: / Line 1: @@
-Banana Slug Genomics Notes
+======Illumina sequencing technology======
-Chris Eisenhart
-Kevin's notes on fixing the wiki
+=====Administrative=====
-	Be more careful with meta-data, specifically the processing (FASTQC/preqc) results
+We need to be better about documenting everything.
-		should be clearly linked to the data set they were made from.
+  * Need to have clear documentation of how each file was produced (and why)
-	More discussion regarding results
+  * Be more careful with metadata: the processing (FastQC/preqc) results should be clearly linked to the data set they were made from.
-	Make a page for each data set that was collected, linking relevant information
+  * Need more discussion regarding results
-	The wiki needs more information analysis, less information dump. (Look at the old data for examples)
+  * Make a page for each data set that was collected, linking relevant information
+      * Should have detailed info about the data set and what was done to it, with discussion and interpretation of results, if applicable
+  * The wiki needs more information analysis, less information dump. (Look at the old data for examples)
+  * Make sure everything you do (process, results, discussion, notes) ends up on the wiki! If it isn't up, it never happened.
-Ed’s Lecture
+=====High-throughput sequencing techniques=====
-	High-throughput sequencing techniques
+====Background====
-	Get DNA (Complex DNA sample) -> Adapter Ligation -> PCR amplification & sequencing
+  * Sanger sequencing
-		Often techniques differ in PCR amplification & sequencing
+    * NOT high-throughput sequencing
-	Illumine Library Construction
+    * Was the standard for decades
-		blunt end repair
+    * Huge problem: higher throughput basically meant buying lots more (expensive) machines
-		ligate adaptors
+====High-throughput sequencing====
-		fill in adaptors
+Everything in this class is based on the idea of a complex DNA sample: you make the library without knowing anything about the DNA. The basic workflow:
-		PCR amplification
-		Sequencing
+    Get DNA (Complex DNA sample) -> Adapter Ligation -> PCR amplification & sequencing
-			attach primers to either side of the target segment
-	Note that DNA can only be extended on the 3’ end (There is an -OH on the ribose at the 3’ end)
+The difference between different technologies is generally in the PCR amplification & sequencing steps.
-	PCR Biases
-		Shorter strands are more amplified
+We can't amplify everything together because the result would be too noisy, so library molecules are physically separated first. Each technology does this in its own way. One approach is to dilute down to a single molecule and then amplify, but this has low throughput. Newer technologies have clever ways to physically partition molecules from each other and then amplify them all at the same time.
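As a rough illustration of why diluting down to single molecules has low throughput, the sketch below assumes that molecules land in wells independently, so the count per well follows a Poisson distribution with mean `lam`. That loading model and the function names are assumptions made for illustration, not something stated in lecture.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k molecules in a well under Poisson loading."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def single_molecule_fraction(lam: float) -> float:
    """Fraction of wells that hold exactly one template molecule."""
    return poisson_pmf(1, lam)


if __name__ == "__main__":
    # lam is the average number of molecules per well after dilution.
    for lam in (0.1, 0.5, 1.0, 2.0):
        empty = poisson_pmf(0, lam)
        single = single_molecule_fraction(lam)
        multiple = 1.0 - empty - single
        print(f"mean={lam:.1f} empty={empty:.3f} "
              f"single={single:.3f} multiple={multiple:.3f}")
```

Under this assumption the single-molecule fraction never exceeds about 37% (at a mean of one molecule per well), and pushing the dilution further to avoid mixed wells leaves most wells empty.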
-		GC content
-	Often Fastq sequences come based on the tile where they were sequenced
+====Illumina library construction====
-3 pg is roughly 3 gig abases, therefore 1pg = 1gb
+  - Sonicate to break DNA into small templates
-2pg = 1 banana slug genome
+  - Repair DNA to get blunt ends
-1mg/2pg = 500,000 genomes
+    - If the shorter strand can be extended 5' to 3' (a 5' overhang), fill it in. Otherwise (a 3' overhang) you have to trim back the longer strand, because DNA can only be extended in one direction.
-		there are 4 million 500 mers in a banana slug genome
+  - Ligate adaptors (P5, P7)
-2,000,000,000,000 total fragments for the MiSeq run
+  - Fill in adaptors (they are designed to be extendable)
-	Dilution parameter explanation
+  - PCR amplification
-		Too dilute and not enough colonies form
+    - Uses an indexing oligo on one end that has a barcode specific to the library, so that later you will know where the sequence came from
-		Too concentrated and colonies merge into each other, making sequencing impossible
+    - This step introduces lots of bias (more on this later)
-	Sequencing by synthesis
+  - Sequence
-		Add a new base (flourescent tagged) and watch for the light
+    - Attach primers to either side of the target segment
-	Limitations of Illumina Sequencing
+Note that DNA can only be extended at the 3’ end (there is an -OH on the deoxyribose at the 3’ end).
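The role of the library-specific barcode added during PCR can be sketched as a lookup from index read to library. This is a minimal illustration only: the barcode sequences, library names, and the `demultiplex` function are hypothetical, and it requires an exact barcode match, which real demultiplexing tools often relax.

```python
from collections import defaultdict

# Hypothetical barcode-to-library table; real index sequences would come
# from the library prep records, not from this sketch.
BARCODES = {
    "ACGTAC": "library_A",
    "TGCATG": "library_B",
}


def demultiplex(reads, barcodes=BARCODES):
    """Group (index_read, sequence) pairs by library using exact barcode matches."""
    by_library = defaultdict(list)
    for index_read, sequence in reads:
        # Reads whose index matches no known barcode are kept separately.
        library = barcodes.get(index_read, "undetermined")
        by_library[library].append(sequence)
    return dict(by_library)


if __name__ == "__main__":
    example_reads = [
        ("ACGTAC", "GATTACA"),
        ("TGCATG", "CCGGAA"),
        ("NNNNNN", "TTTT"),
    ]
    for library, sequences in demultiplex(example_reads).items():
        print(library, sequences)
```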
-		slow run time
-		short reads
+===PCR biases===
-		Fluorophore overlap
+  * Shorter strands are more amplified
-		Out-of-phase accumulation
+  * GC content: anything that is especially GC-rich or -poor will be a problem to amplify
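To check whether a fragment is likely to hit the GC bias described above, one can compute its GC fraction. The sketch below is illustrative: the function names are hypothetical and the 30%/70% cutoffs are assumed placeholders, not thresholds given in lecture.

```python
def gc_fraction(sequence: str) -> float:
    """Return the fraction of G and C among the A/C/G/T bases of a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous bases such as N are ignored
    return gc / total if total else 0.0


def is_gc_extreme(sequence: str, low: float = 0.3, high: float = 0.7) -> bool:
    """Flag sequences outside an assumed [low, high] GC window."""
    frac = gc_fraction(sequence)
    return frac < low or frac > high


if __name__ == "__main__":
    for seq in ("ATGCATGC", "GGGGCCCC", "ATATATAT"):
        print(seq, gc_fraction(seq), is_gc_extreme(seq))
```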
+===Other notes===
+  * If you want a random subset of your reads, it generally works to take a contiguous block of them, as long as you skip the first million or so. This is easy using head and tail on Linux.
+  * The main factor in seeing duplicates isn't how many rounds of PCR you do, it's how many unique molecules you started with (compare the orders of magnitude below):
+    * 3 pg is roughly 3 gigabases, therefore 1 pg = 1 Gb
+    * 2 pg = 1 banana slug genome
+    * 1 µg / 2 pg = 500,000 genomes
+    * there are 4 million 500-mers in a banana slug genome (2 Gb / 500 bp)
+    * 500,000 genomes x 4 million fragments = 2,000,000,000,000 total fragments for the MiSeq run
+  * FASTQ reads are often ordered by the tile on which they were sequenced
+  * Dilution parameter (critical!)
+    * Too dilute and not enough colonies form
+    * Too concentrated and colonies merge into each other, making sequencing impossible
+  * Sequencing by synthesis
+    * Add a new base (fluorescently tagged) and watch for the light
+    * Quality is determined by how well separated the signal for the called base is from the signal for any other base
+    * The "secret sauce" of Illumina sequencing: getting a polymerase that will accept the modified bases as legitimate, and that won't back up for error correction
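The fragment-count arithmetic in the notes above can be reproduced in a few lines. This sketch assumes 1 microgram of input DNA (the amount that gives the 500,000 genome copies quoted above at 2 pg per genome); the constant and function names are made up for illustration.

```python
PG_PER_GENOME = 2            # ~2 pg per banana slug genome (from the notes)
INPUT_PG = 1_000_000         # assumed input: 1 microgram = 1,000,000 pg
GENOME_BP = 2_000_000_000    # ~2 Gb, using 1 pg ~ 1 Gb
FRAGMENT_BP = 500            # fragment size used in the notes


def genome_copies(input_pg: int = INPUT_PG, pg_per_genome: int = PG_PER_GENOME) -> int:
    """Number of genome copies in the input DNA."""
    return input_pg // pg_per_genome


def fragments_per_genome(genome_bp: int = GENOME_BP, fragment_bp: int = FRAGMENT_BP) -> int:
    """Number of non-overlapping fragments of the given size in one genome."""
    return genome_bp // fragment_bp


def total_fragments() -> int:
    """Total fragments in the sample: genome copies times fragments per genome."""
    return genome_copies() * fragments_per_genome()


if __name__ == "__main__":
    print(f"genome copies:        {genome_copies():,}")         # 500,000
    print(f"fragments per genome: {fragments_per_genome():,}")  # 4,000,000
    print(f"total fragments:      {total_fragments():,}")       # 2,000,000,000,000
```

The total is many orders of magnitude larger than the number of reads a sequencing run produces, which is why the starting number of unique molecules dominates the duplicate rate.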
+===Limitations of Illumina sequencing===
+  * Slow run time
+    * Each step involves real live chemistry
+    * A 2x100 run typically takes a week to 10 days
+  * Short reads (~150 nt)
+  * Fluorophore overlap
+  * Out-of-phase accumulation
+    * If one template messes up slightly (by not incorporating a base at one step), then it will be out of phase forever. The number of templates getting out of phase increases with time, so the signal goes down and the noise goes up. This is why quality scores drop off at the ends of reads, and what limits the length of reads, since at some point it's just not worth continuing anymore.
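The out-of-phase argument can be made concrete with a toy model. It assumes each in-phase template independently fails to incorporate a base with a fixed probability `p_fail` per cycle and never recovers; both the model and the example failure rate are illustrative assumptions, not measured Illumina figures.

```python
def in_phase_fraction(cycle: int, p_fail: float) -> float:
    """Fraction of templates in a cluster still in phase after `cycle` cycles."""
    return (1.0 - p_fail) ** cycle


def max_useful_cycles(p_fail: float, min_fraction: float) -> int:
    """Last cycle at which at least `min_fraction` of templates are still in phase."""
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < min_fraction <= 1.0:
        raise ValueError("min_fraction must be in (0, 1]")
    cycle = 0
    # Keep extending the read while the next cycle still meets the threshold.
    while in_phase_fraction(cycle + 1, p_fail) >= min_fraction:
        cycle += 1
    return cycle


if __name__ == "__main__":
    p_fail = 0.01  # assumed per-cycle failure rate, for illustration only
    for cycle in (1, 50, 100, 150):
        print(cycle, round(in_phase_fraction(cycle, p_fail), 3))
    print("cycles with at least half in phase:", max_useful_cycles(p_fail, 0.5))
```

Because the in-phase fraction decays geometrically, the usable signal shrinks with every cycle, which matches the drop in quality scores toward the ends of reads.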
