lecture_notes:04-08-2015

Differences

This shows you the differences between two versions of the page.

lecture_notes:04-08-2015 [2015/04/09 12:12] jolespin
lecture_notes:04-08-2015 [2015/04/17 14:53] sihussai (fixing formatting)
Line 1: Line 1:
- **De nova Assembly II** | Wed 8 April 2015 | Stefan Prost stefan.prost@berkley.edu | jolespin notes
+======De novo Assembly II======
 +**Guest lecturer: Stefan Prost, stefan.prost@berkley.edu**
  
-#Illumina Paired-end Sequencing Libraries
+=====Illumina Paired-end Sequencing Libraries=====
- MiSeq has 300 bp reads
+  * MiSeq has 300 bp reads
- Paired ends read from both directions
+  * Paired ends are read from both directions, so you get one read for each end (the two reads may or may not overlap, depending on fragment length and read length)
  
- =====> read_1
+ =====> end_1
  ________________________
-            read_2 <=====
+            end_2 <=====
- . Can’t sequence repeat regions with paired-end reads
+
  
-#Illumina Mate-Pair Sequencing Libraries
+  * Problem: paired ends are not really sufficient for repetitive regions
- <===== read_1
+    * the two ends can't be very far apart because Illumina sequencers can't handle long fragments
 +    * not enough info for scaffolding
 +
 +=====Illumina Mate-Pair Sequencing Libraries=====
 +  * Idea: get paired reads that are much farther apart (for more scaffolding info)
 +  * Basically the same idea as paired ends, except that the middle section is missing and the ends are oriented the opposite way.
 + 
 + <===== end_1
  ________________________
-            read_2 =====>
+            end_2 =====>
- . Dependent on inferring insert size
+
- Genomic DNA > Fragment (2-5 kb) > biotinylate ends >
+  * Genomic DNA > Fragment (2-5 kb) > Biotinylate ends > Circularize > Fragment (400-600 bp) > Enrich biotinylated fragments > Ligate adaptors > …
-   Circularize > Fragment (400-600 bp) > Enrich biotinylated fragments >
+    * Cut DNA, attach a biotin tag to both ends of the target molecule
-          Ligate adaptors > …
+    * Circularize target molecule
 +      * This step is hard
 +    * The circularized molecule is fragmented, and the fragment carrying the biotin tag (i.e. the two ends of the original target molecule joined to each other backwards, with the tags in between) is enriched
 +    * Then end repair, A-tailing, adapter ligation, amplification, sequencing
 +  * Dependent on inferring insert size (can be tricky)
 +  * Most companies can get you 8 kb inserts; with skill you can get up to 20 kb
 +  * Complicated process, so weird stuff can happen along the way
 +  * Important difference between paired ends and mate pairs: the ends are oriented the opposite way.
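
The orientation difference matters in practice: some assemblers and scaffolders expect inward-facing pairs, so mate-pair reads are sometimes reverse-complemented before use. A minimal Python sketch of that flip, assuming your tool wants it (check its documentation first). The function names are illustrative and not from the lecture.

```python
from typing import Tuple

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def mate_pair_to_paired_end(end_1: str, end_2: str) -> Tuple[str, str]:
    """Flip both mates so an outward-facing pair (<== ==>) faces inward (==> <==).

    Hypothetical helper: only the sequences are handled here. With real FASTQ
    records the quality strings would have to be reversed as well.
    """
    return reverse_complement(end_1), reverse_complement(end_2)


if __name__ == "__main__":
    # Toy sequences, not real data.
    print(mate_pair_to_paired_end("AACG", "TTGA"))
```

Many tools take a library-orientation setting instead, in which case no manual flipping is needed.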
 + 
 + 
 +=====BAC (Bacterial Artificial Chromosome) and Fosmid Libraries===== 
 +  * Uncommon and expensive, but the gold standard  
 +  * Vectors are based on the bacterial F-plasmid; fosmids take inserts of < 40 kb
 +  * Fragment query DNA > Clone fragments into BAC > Transform E. coli with BAC > ~300 kb long inserts
 +  * http://www.scq.ubc.ca/wp-content/plasmidtext.gif
  
-#BAC (Bacterial Artificial Chromosome) and Fosmid Libraries
+=====Read Quality Assessment=====
-Bacterial F-plasmid takes< 40 kb insert size
+  * Base quality: Phred scores reported by the sequencer.
-Fragment query DNA > DNA into BAC > Transform E. coli with BAC > ~300 kb long reads
+  * FASTQ files: FASTA sequences plus encoded Phred scores
- . http://www.scq.ubc.ca/wp-content/plasmidtext.gif
+    * Need to know whether your file uses phred33 or phred64 encoding
 +  * The quality of each individual base is not the whole story; the sequence context matters to the signal processing too
 +  * Reads decrease in quality further down the read
 +    * **Depending on the assembler, you might need to trim lower-quality bases off the ends of reads. Other assemblers want untrimmed data**
 +  * Pacific Biosciences (PacBio) reads don’t have GC|AT bias
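
To make the phred33/phred64 point and the end-trimming point concrete, here is a small Python sketch. It assumes the usual convention that a quality character encodes `ord(char) - offset`. The offset guess is only a heuristic, and the function names and the Q20 cutoff are illustrative choices, not something given in the lecture.

```python
from typing import Iterable, List, Tuple


def guess_phred_offset(quality_strings: Iterable[str]) -> int:
    """Heuristic: characters below ';' (ASCII 59) cannot occur in +64 files.

    If no such character is seen the answer is ambiguous and 64 is returned,
    so this should be run over many reads, not just one.
    """
    lowest = min(ord(c) for q in quality_strings for c in q)
    return 33 if lowest < 59 else 64


def decode_phred(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(c) - offset for c in quality]


def trim_3prime(seq: str, quality: str, offset: int = 33,
                min_q: int = 20) -> Tuple[str, str]:
    """Drop bases from the 3' end while their Phred score is below min_q."""
    scores = decode_phred(quality, offset)
    keep = len(scores)
    while keep > 0 and scores[keep - 1] < min_q:
        keep -= 1
    return seq[:keep], quality[:keep]


if __name__ == "__main__":
    # Toy read, not real data: six good bases followed by two poor ones.
    seq, qual = "ACGTACGT", "IIIIII#!"
    offset = guess_phred_offset([qual])
    print(offset, decode_phred(qual, offset))
    print(trim_3prime(seq, qual, offset))
```

Whether to trim at all depends on the assembler, as noted above.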
  
-#Read Quality Assessment Tools
+====Tools====
- FastQC (Most popular tool to tell you about the read library)
+  * FastQC (most popular tool for summarizing a read library)
- Preqc
+    * FastQC flagged an issue with the k-mer content of our data (related to adapter content)
- . Reads decrease in quality further down the read
+      * **This needs to be checked out and diagnosed!**
- . Pacific Bio doesn’t have GC|AT bias
+  * Preqc
 +    * Estimates how difficult the assembly will be
  
-#Estimating Genome Size from Read Data
+=====Estimating Genome Size from Read Data=====
- G = (pn(l-k+1))/(λ_k)
+ G = (pn(l-k+1))/(λ_k)
  G = Genome size
  pn = proportion of correct reads
Line 38: Line 64:
  Simpson 2013, arXiv
  
-To estimate genome size need to know:i) total number of reads; ii) length of reads;  and iii) kmer distribution
+  * To estimate genome size, you need to know: i) the total number of reads; ii) the length of the reads; and iii) the k-mer distribution
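
A rough Python sketch of the estimate, under an assumed reading of the formula: G = p * n * (l - k + 1) / λ_k, with n reads of length l, λ_k the mean k-mer depth taken from the k-mer distribution, and p a correction for erroneous k-mers. That reading of p is an assumption (the notes only define `pn`), and all numbers below are made up.

```python
def estimate_genome_size(n_reads: int, read_length: int, k: int,
                         mean_kmer_depth: float,
                         error_free_fraction: float = 1.0) -> float:
    """Estimate genome size from read counts and k-mer depth.

    Assumed reading of the formula: G = p * n * (l - k + 1) / lambda_k.
    error_free_fraction (p) is a hypothetical parameter name.
    """
    if not 0 < k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    if mean_kmer_depth <= 0:
        raise ValueError("mean k-mer depth must be positive")
    # Each read of length l contributes l - k + 1 k-mers.
    total_kmers = n_reads * (read_length - k + 1)
    return error_free_fraction * total_kmers / mean_kmer_depth


if __name__ == "__main__":
    # Made-up example: 1 million 100 bp reads, k = 31, mean k-mer depth 14.
    print(estimate_genome_size(1_000_000, 100, 31, 14.0))  # 5000000.0
```

The three inputs match the list above: number of reads, read length, and a depth read off the k-mer distribution.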
  
-#Error Correction
+=====Error Correction=====
- High amount of small kmers are usually errors
+  * A large number of low-frequency k-mers usually indicates sequencing errors
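
To illustrate the idea, a minimal Python sketch that counts k-mers and flags the rare ones as likely errors. The count threshold and function names are illustrative assumptions. Real error correctors pick the cutoff from the k-mer count distribution and use far more memory-efficient counting.

```python
from collections import Counter
from typing import Iterable, Set


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer across all reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rare_kmers(counts: Counter, max_count: int = 1) -> Set[str]:
    """Return k-mers seen at most max_count times (likely sequencing errors)."""
    return {kmer for kmer, c in counts.items() if c <= max_count}


if __name__ == "__main__":
    # Toy reads: three identical copies plus one read with a single base change.
    reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
    counts = count_kmers(reads, k=4)
    print(sorted(rare_kmers(counts)))
```

In the toy data the single substitution creates three k-mers that each occur once, while the true k-mers occur three times.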
  
-***Simulated contif length in the k-de Brujin graph can estimate the best kmer to use for assembly. Based on contif length N50
+**Simulated contig lengths in the k-de Bruijn graph can be used to estimate the best k-mer size for assembly, based on contig N50.**
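
For reference, N50 is the contig length at which contigs of that length or longer contain at least half of the total assembled bases. A short Python sketch of computing it and picking the k with the highest N50. The contig lengths per k are made-up numbers, not output from any tool.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Return the N50 of a list of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # unreachable for non-empty input, kept for completeness


if __name__ == "__main__":
    # Hypothetical simulated contig lengths for two candidate k values.
    contigs_by_k: Dict[int, List[int]] = {21: [10, 10, 10], 31: [20, 5, 5]}
    best_k = max(contigs_by_k, key=lambda k: n50(contigs_by_k[k]))
    print(best_k, n50(contigs_by_k[best_k]))
```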
lecture_notes/04-08-2015.txt · Last modified: 2015/04/17 15:34 by sihussai