Differences

This shows you the differences between two versions of the page.

--- lecture_notes:04-08-2015 [2015/04/09 19:12]
jolespin
+++ lecture_notes:04-08-2015 [2015/04/09 23:05]
sihussai
@@ Line 1: / Line 1: @@
-	 **De nova Assembly II** | Wed 8 April 2015 | Stefan Prost | stefan.prost@berkley.edu | jolespin notes
+=====De novo Assembly II=====
+**Guest lecturer: Stefan Prost, stefan.prost@berkley.edu**
-#Illumina Paired-end Sequencing Libraries
+====Illumina Paired-end Sequencing Libraries====
-	. MiSeq has 300 bp reads
+  * MiSeq has 300 bp reads
-	. Paired ends read from both directions
+  * Paired ends read from both directions, so you get one read for each end (may or may not overlap depending on molecule and read size)
-		=====> read_1
+		=====> end_1
 		________________________
-		           read_2 <=====
+		           end_2 <=====
-	. Can’t sequence repeat regions with paired-end reads
-#Illumina Mate-Pair Sequencing Libraries
+  * Problem: not really sufficient for repetitive regions
-			<===== read_1
+    * ends can't be very far apart because Illumina can't handle big molecules
+    * not enough info for scaffolding
+====Illumina Mate-Pair Sequencing Libraries====
+  * Idea: get paired reads that are much farther away (for more scaffolding info)
+  * Basically, the same idea as paired-ends, except the middle section is missing and the ends are oriented the opposite way.
+			<===== end_1
 		________________________
-		           read_2 =====>
+		           end_2 =====>
-	. Dependent on inferring insert size
-	. Genomic DNA > Fragment (2-5 kb) > biotinylate ends >
+  * Genomic DNA > Fragment (2-5 kb) > biotinylate ends > Circularize > Fragment (400-600 bp) > Enrich biotinylated fragments > Ligate adaptors > …
-	  Circularize > Fragment (400-600 bp) > Enrich biotinylated fragments >
+    * Cut DNA, attach a biotin tag to both ends of the target molecule
-          Ligate adaptors > …
+    * Circularize target molecule
+      * This step is hard
+    * Circularized molecule is fragmented; the fragment carrying the biotin tag (i.e. the ends of the original target joined to each other backwards, with the tags in between) is enriched
+    * Then end repair, A-tailing, adapters added, amplification, sequencing
+  * Dependent on inferring insert size (can be tricky)
+  * Most companies can get you 8 kb inserts; with skill you can get up to 20 kb
+  * Complicated process, weird stuff can happen in between
+  * Important difference between paired ends and mate-pairs: ends are oriented the opposite way.
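The orientation difference matters in practice: mate-pair ends point outward, paired ends point inward. A minimal sketch of flipping a mate pair into paired-end orientation by reverse-complementing both ends. The function names are hypothetical, and whether a given assembler wants this conversion done beforehand is an assumption to check against its documentation.

```python
# Complement table for DNA bases; N stays N, lowercase is handled too.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def mate_pair_to_paired_end(end_1, end_2):
    """Flip outward-facing mate-pair ends so they face inward.

    Hypothetical helper: reverse-complementing both ends turns the
    mate-pair orientation into the paired-end orientation.
    """
    return reverse_complement(end_1), reverse_complement(end_2)


if __name__ == "__main__":
    print(mate_pair_to_paired_end("AACG", "TTTT"))
```

If quality strings are carried along, they would need to be reversed (not complemented) together with the sequence.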
+====BAC (Bacterial Artificial Chromosome) and Fosmid Libraries====
+  * Uncommon and expensive, but the gold standard
+  * Bacterial F-plasmid takes < 40 kb insert size
+  * Fragment query DNA > DNA into BAC > Transform E. coli with BAC > ~300 kb long reads
+  * http://www.scq.ubc.ca/wp-content/plasmidtext.gif
-#BAC (Bacterial Artificial Chromosome) and Fosmid Libraries
+====Read Quality Assessment====
-	. Bacterial F-plasmid takes< 40 kb insert size
+  * Base quality: Phred scores reported by sequencer.
-	. Fragment query DNA > DNA into BAC > Transform E. coli with BAC > ~300 kb long reads
+  * FASTQ files: FASTA files, plus encoded Phred scores
-	. http://www.scq.ubc.ca/wp-content/plasmidtext.gif
+    * Need to know if your file has phred33 or phred64 encoding
+  * Quality for each individual base is not the whole story; the context matters to the signal processing too
+  * Reads decrease in quality further down the read
+    * **Depending on the assembler, you might need to trim lower-quality bases off the ends of reads. Other assemblers want untrimmed data**
+  * Pacific Biosciences (PacBio) sequencing doesn’t have GC/AT bias
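A rough sketch of what the Phred encoding and end trimming in the notes above amount to. The quality character minus an offset (33 or 64) gives the Phred score Q, and the error probability is 10^(-Q/10). All function names and the `min_q=20` threshold are illustrative assumptions, not settings from the lecture, and the offset guess is only a heuristic.

```python
def guess_phred_offset(quality_strings):
    """Guess the encoding offset (33 or 64) from the lowest character seen.

    Heuristic only: characters below ';' (ASCII 59) cannot occur in a
    Phred+64 file, so seeing one implies Phred+33. A Phred+33 file that
    contains only high scores would be misclassified.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    return 33 if lowest < 59 else 64


def decode_phred(quality, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def trim_3prime(seq, quality, min_q=20, offset=33):
    """Drop bases from the 3' end until one with score >= min_q is reached."""
    scores = decode_phred(quality, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


if __name__ == "__main__":
    print(decode_phred("II5#"))
    print(trim_3prime("ACGT", "II5#"))
```

Real trimming tools usually use sliding windows or running sums instead of this single-base cutoff; the sketch only shows the idea.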
-#Read Quality Assessment Tools
+===Tools===
-	. FastQC (Most popular tool to tell you about the read library)
+  * FastQC (Most popular tool to tell you about the read library)
-	. Preqc
+    * FastQC reported an issue with our data with kmer count (related to adapter content)
-	. Reads decrease in quality further down the read
+      * **This needs to be checked out and diagnosed!**
-	. Pacific Bio doesn’t have GC|AT bias
+  * Preqc
+    * Estimates how difficult the assembly will be
-#Estimating Genome Size from Read Data
+====Estimating Genome Size from Read Data====
-	. G = (pn(1-k+1))/(λ_k)
+	G = (pn(l-k+1))/(λ_k)
 	G = Genome size
 	pn = proportion of correct reads
@@ Line 38: / Line 64: @@
 	Simpson 2013, arXiv
-	. To estimate genome size need to know:i) total number of reads; ii) length of reads;  and iii) kmer distribution
+  * To estimate genome size, you need to know: i) the total number of reads; ii) the length of the reads; and iii) the kmer distribution
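A worked sketch of the genome size estimate, reading the numerator as p·n·(l − k + 1), with n the number of reads, l the read length, and λ_k the typical kmer depth. This is one interpretation of the formula in these notes. The parameter names and the example numbers are made up for illustration.

```python
def estimate_genome_size(n_reads, read_length, k, kmer_depth, p_correct=1.0):
    """Estimate genome size G = p * n * (l - k + 1) / lambda_k.

    n_reads     : total number of reads (n)
    read_length : length of each read in bases (l)
    k           : kmer size
    kmer_depth  : typical kmer depth from the kmer distribution (lambda_k)
    p_correct   : assumed fraction of the data that is error-free (p)
    """
    kmers_per_read = read_length - k + 1
    if kmers_per_read <= 0:
        raise ValueError("k must not exceed the read length")
    if kmer_depth <= 0:
        raise ValueError("kmer_depth must be positive")
    return p_correct * n_reads * kmers_per_read / kmer_depth


if __name__ == "__main__":
    # Made-up numbers: 1 million 100 bp reads, k = 31, kmer depth 35.
    print(estimate_genome_size(1_000_000, 100, 31, 35))
```

With these made-up inputs each read contributes 70 kmers, so 70 million kmers at depth 35 correspond to a genome of about 2 Mb.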
-#Error Correction
+====Error Correction====
-	. High amount of small kmers are usually errors
+  * Low-count kmers (those seen only a few times) are usually sequencing errors, and there tend to be a lot of them
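One way to read the bullet above: kmers that show up only once or twice across all reads are more likely sequencing errors than real genomic sequence. A minimal sketch of counting kmers and flagging the rare ones. The function names and the `min_count` cutoff are assumptions for illustration, not the method of any particular error corrector.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every kmer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rare_kmers(counts, min_count=2):
    """Return kmers seen fewer than min_count times (candidate errors)."""
    return {kmer for kmer, c in counts.items() if c < min_count}


if __name__ == "__main__":
    # The third read differs from the others by one base (A -> T).
    reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
    counts = count_kmers(reads, k=4)
    print(sorted(rare_kmers(counts)))
```

In the toy reads, the single changed base creates two kmers that occur only once, which is the signal an error corrector looks for.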
-***Simulated contif length in the k-de Brujin graph can estimate the best kmer to use for assembly. Based on contif length N50
+**Simulated contig length in the de Bruijn graph for each k can estimate the best kmer size to use for assembly. Based on contig length N50**
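Since the choice of kmer size is judged by contig N50, here is a small sketch of how N50 is computed from a list of contig lengths. N50 is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The function name is hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # First contig at which the running total covers half the assembly.
        if running >= half_total:
            return length
    raise ValueError("need at least one contig")


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Comparing this value across simulated assemblies at different k gives one number per kmer size to compare.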

Banana Slug Genomics
