lecture_notes:04-08-2015

Differences

This shows you the differences between two versions of the page.

lecture_notes:04-08-2015 [2015/04/09 19:12]
jolespin
lecture_notes:04-08-2015 [2015/04/17 22:34] (current)
sihussai fixing capitalization
Line 1: Line 1:
- **De nova Assembly ​II** | Wed 8 April 2015 | Stefan Prost stefan.prost@berkley.edu ​| jolespin notes+======De novo assembly ​II====== 
 +**Guest lecturer: Stefan Prost stefan.prost@berkley.edu** 
  
-#Illumina ​Paired-end Sequencing Libraries +=====Illumina ​paired-end sequencing libraries===== 
- MiSeq has 300 bp reads +  ​* ​MiSeq has 300 bp reads 
- Paired ends read from both directions+  ​* ​Paired ends read from both directions, so you get one read for each end (may or may not overlap depending on molecule and read size)
  
- =====> ​read_1+ =====> ​end_1
  ________________________  ________________________
-            read_2 ​<===== +            end_2 <=====
- . Can’t sequence repeat regions with paired-end reads+
  
-#Illumina ​Mate-Pair Sequencing Libraries +  * Problem: not really sufficient for repetitive regions 
- <​===== ​read_1+    * ends can't be very far apart because ​Illumina ​can't handle big molecules 
 +    * not enough info for scaffolding 
 + 
 +=====Illumina mate-pair sequencing libraries===== 
 +  * Idea: get paired reads that are much farther away (for more scaffolding info) 
 +  * Basically, the same idea as paired-ends,​ except the middle section is missing and the ends are oriented the opposite way.  
 + 
 + <​===== ​end_1
  ________________________  ________________________
-            read_2 ​=====> +            end_2 =====> 
- . Dependent on inferring insert size + 
- Genomic DNA > Fragment (2-5 kb) > biotinylate ends > +  ​* ​Genomic DNA > Fragment (2-5 kb) > biotinylate ends > Circularize > Fragment (400-600 bp) > Enrich biotinylated fragments > Ligate adaptors > … 
-   ​Circularize > Fragment (400-600 bp) > Enrich biotinylated fragments > +    * Cut DNA, attach a biotin tag to both ends of the target molecule 
-          ​Ligate adaptors > …+    * Circularize target molecule 
 +      * This step is hard 
 +    * Circularized molecule is fragmented; the fragment with the biotin tag (a.k.a. the ends of the original target stuck to each other backwards, with the tags in between) is enriched 
 +    * Then end repair, A-tailing, adapters added, amplification,​ sequencing  
 +  * Dependent on inferring insert size (can be tricky)  
 +  * Most companies can get you 8 kb inserts, with skill you can get up to 20 kb 
 +  * Complicated process, weird stuff can happen in between  
 +  * Important difference between paired ends and mate-pairs: ends are oriented the opposite way.  
 + 
 + 
 +=====BAC (Bacterial Artificial Chromosome) and fosmid libraries===== 
 +  * Uncommon and expensive, but the gold standard  
 +  * Bacterial F-plasmid takes < 40 kb insert size 
 +  * Fragment query DNA > DNA into BAC > Transform E. coli with BAC > ~300 kb long reads 
 +  * http://​www.scq.ubc.ca/​wp-content/​plasmidtext.gif
  
-#BAC (Bacterial Artificial Chromosome) and Fosmid Libraries +=====Read quality assessment===== 
-Bacterial F-plasmid takes< 40 kb insert size +  * Base quality: Phred scores reported by sequencer.  
-Fragment query DNA > DNA into BAC > Transform E. coli with BAC > ~300 kb long reads +  * Fastq files: fasta files, plus encoded phred scores 
- . http://​www.scq.ubc.ca/​wp-content/​plasmidtext.gif+    * Need to know if your file has phred33 or phred64 encoding ​   
 +  * Quality for each individual base is not the whole story; the context matters to the signal processing too 
 +  * Reads decrease in quality further down the read 
 +    *  **Depending on the assembler, you might need to trim lower quality reads off the end. Others want untrimmed data** 
 +  * Pacific Biosciences (PacBio) doesn’t have GC/AT bias
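
The phred33 vs. phred64 point above can be illustrated with a minimal Python sketch. It assumes the usual FASTQ convention that each quality character encodes a score as its ASCII code minus an offset (33 or 64), and that a Phred score Q corresponds to an error probability of 10^(-Q/10). The function names `decode_phred` and `error_probability` are hypothetical and were not part of the lecture.

```python
def decode_phred(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores.

    offset is 33 for phred33 files and 64 for phred64 files.
    """
    scores = [ord(ch) - offset for ch in quality_string]
    # A negative score means the characters are too low for this offset,
    # which usually indicates the file uses the other encoding.
    if scores and min(scores) < 0:
        raise ValueError("negative Phred score: wrong offset for this file?")
    return scores


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    quals = decode_phred("II5!", offset=33)
    print(quals)
    print([error_probability(q) for q in quals])
```

Decoding a phred33 string with offset 64 gives negative scores, which is why the sketch raises an error in that case instead of silently returning nonsense.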
  
-#Read Quality Assessment ​Tools +====Tools==== 
- FastQC (Most popular tool to tell you about the read library)  +  ​* ​FastQC (Most popular tool to tell you about the read library)  
- Preqc +    * FastQC reported an issue with our data with kmer count (related to adapter content) 
- . Reads decrease in quality further down the read +      * **This needs to be checked out and diagnosed!**  
- . Pacific Bio doesn’t have GC|AT bias+  * Preqc 
 +    * Estimates how difficult ​the assembly will be
  
-#Estimating ​Genome Size from Read Data +=====Estimating ​genome size from read data===== 
- G = (pn(1-k+1))/(λ_k)+ G = (pn(l-k+1))/(λ_k)
  G = Genome size  G = Genome size
  pn = proportion of correct reads  pn = proportion of correct reads
Line 38: Line 64:
  Simpson 2013, arXiv  Simpson 2013, arXiv
  
-To estimate genome size need to know:i) total number of reads; ii) length of reads;  and iii) kmer distribution+  * To estimate genome size, you need to know: i) the total number of reads; ii) the length of the reads; and iii) the kmer distribution
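
As a rough worked example of the estimate above, here is a minimal Python sketch. It assumes the numerator counts the kmers per read as (l - k + 1), with l the read length, that n is the total number of reads, that p is the proportion of correct (error-free) data, and that λ_k is the mean kmer depth taken from the kmer distribution. The function name and the example numbers are hypothetical, not taken from the lecture or from Simpson 2013.

```python
def estimate_genome_size(n_reads, read_length, k, kmer_depth, p_correct=1.0):
    """Estimate genome size as G = p * n * (l - k + 1) / lambda_k.

    n_reads     -- total number of reads (n)
    read_length -- length of each read in bp (l), assumed constant
    k           -- kmer size
    kmer_depth  -- mean kmer depth (lambda_k) from the kmer distribution
    p_correct   -- proportion of correct data (p), assumed 1.0 by default
    """
    if read_length < k:
        raise ValueError("reads shorter than k contain no kmers")
    if kmer_depth <= 0:
        raise ValueError("kmer depth must be positive")
    kmers_per_read = read_length - k + 1
    return p_correct * n_reads * kmers_per_read / kmer_depth


if __name__ == "__main__":
    # Made-up numbers: 1 million 100 bp reads, k = 31, mean kmer depth 20.
    print(estimate_genome_size(1_000_000, 100, 31, 20))
```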
  
-#Error Correction +=====Error correction===== 
- High amount of small kmers are usually errors+  * Kmers that occur only a few times in the reads (low-count kmers) are usually errors
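
The idea behind kmer-based error detection can be sketched in Python as below. This is only an illustration, assuming that a kmer seen fewer than some threshold number of times is treated as a likely error. The threshold, the toy reads, and the function names are hypothetical, and real error correctors are considerably more involved.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every kmer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def low_count_kmers(counts, min_count=2):
    """Return kmers seen fewer than min_count times (likely errors)."""
    return {kmer for kmer, c in counts.items() if c < min_count}


if __name__ == "__main__":
    # Five identical reads plus one read with a single-base error (G -> C).
    reads = ["ACGTACGT"] * 5 + ["ACGTACCT"]
    counts = count_kmers(reads, k=4)
    print(sorted(low_count_kmers(counts)))
```

In the toy data, only the kmers that overlap the miscalled base fall below the threshold. Kmers shared with the correct reads keep high counts.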
  
-***Simulated contif length in the k-de Brujin graph can estimate the best kmer to use for assembly. Based on contif length N50 +**Simulated contig length in the k-de Bruijn graph can be used to estimate the best kmer size for assembly, based on contig length N50** 
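
For reference, N50 is commonly defined as the contig length at which contigs of that length or longer contain at least half of the total assembled bases. A minimal Python sketch of that definition follows. The function name `n50` and the example lengths are hypothetical. Comparing N50 across kmer sizes, as in the note above, would mean calling it once per candidate k.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of all bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))
```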
lecture_notes/04-08-2015.1428606722.txt.gz · Last modified: 2015/04/09 19:12 by jolespin