lecture_notes:04-08-2015

Differences

This shows you the differences between two versions of the page.

lecture_notes:04-08-2015 [2015/04/09 15:42]
sihussai
lecture_notes:04-08-2015 [2015/04/17 15:34] (current)
sihussai fixing capitalization
Line 1: Line 1:
-=====De novo Assembly ​II=====+======De novo assembly ​II======
 **Guest lecturer: Stefan Prost, stefan.prost@berkley.edu** ​ **Guest lecturer: Stefan Prost, stefan.prost@berkley.edu** ​
  
-====Illumina ​Paired-end Sequencing Libraries====+=====Illumina ​paired-end sequencing libraries=====
   * MiSeq has 300 bp reads   * MiSeq has 300 bp reads
   * Paired ends read from both directions, so you get one read for each end (may or may not overlap depending on molecule and read size)   * Paired ends read from both directions, so you get one read for each end (may or may not overlap depending on molecule and read size)
Line 10: Line 10:
             end_2 <=====             end_2 <=====
  
-  * Can’t sequence repeat ​regions ​with paired-end reads+  * Problem: not really sufficient for repetitive ​regions
     * ends can't be very far apart because Illumina can't handle big molecules     * ends can't be very far apart because Illumina can't handle big molecules
     * not enough info for scaffolding     * not enough info for scaffolding
  
-====Illumina ​Mate-Pair Sequencing Libraries===+=====Illumina ​mate-pair sequencing libraries====
   * Idea: get paired reads that are much farther away (for more scaffolding info)   * Idea: get paired reads that are much farther away (for more scaffolding info)
   * Basically, the same idea as paired-ends,​ except the middle section is missing and the ends are oriented the opposite way.    * Basically, the same idea as paired-ends,​ except the middle section is missing and the ends are oriented the opposite way. 
Line 23: Line 23:
  
   * Genomic DNA > Fragment (2-5 kb) > biotinylate ends > Circularize > Fragment (400-600 bp) > Enrich biotinylated fragments > Ligate adaptors > …   * Genomic DNA > Fragment (2-5 kb) > biotinylate ends > Circularize > Fragment (400-600 bp) > Enrich biotinylated fragments > Ligate adaptors > …
-  ​* Dependent on inferring insert size+    * Cut DNA, attach a biotin tag to both ends of the target molecule 
 +    * Circularize target molecule 
 +      * This step is hard 
 +    * The circularized molecule is fragmented, and the fragment carrying the biotin tag (i.e. the ends of the original target joined to each other backwards, with the tags in between) is enriched 
 +    * Then end repair, A-tailing, adapters added, amplification,​ sequencing  
 +  ​* Dependent on inferring insert size (can be tricky)  
 +  * Most companies can get you 8 kb inserts; with skill you can get up to 20 kb 
 +  * Complicated process, weird stuff can happen in between  
 +  * Important difference between paired ends and mate-pairs: ends are oriented the opposite way. 
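The orientation difference noted above matters in practice, because a tool that expects paired-end orientation may need mate-pair reads flipped first. The following is a minimal sketch, assuming the usual convention that flipping a read means reverse-complementing its sequence and reversing its quality string. The function names are hypothetical and are not taken from the lecture.

```python
# Complement table for DNA bases; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_read(seq, qual):
    """Flip one read: reverse-complement the bases, reverse the qualities."""
    return reverse_complement(seq), qual[::-1]


def mate_pair_to_paired_end(mate_1, mate_2):
    """Flip both mates of a pair.

    Each mate is a (sequence, quality) tuple. Whether a given assembler
    wants this conversion is an assumption to check in its documentation.
    """
    return flip_read(*mate_1), flip_read(*mate_2)


if __name__ == "__main__":
    print(mate_pair_to_paired_end(("AACG", "IIH#"), ("TTTC", "IIII")))
```

Some scaffolders accept mate-pair orientation directly through a library setting, so this conversion is only needed when the downstream tool cannot be told the orientation.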
  
  
-====BAC (Bacterial Artificial Chromosome) and Fosmid Libraries====+=====BAC (Bacterial Artificial Chromosome) and fosmid libraries====
 +  * Uncommon and expensive, but the gold standard ​
   * Bacterial F-plasmid takes < 40 kb insert size   * Bacterial F-plasmid takes < 40 kb insert size
   * Fragment query DNA > DNA into BAC > Transform E. coli with BAC > ~300 kb long reads   * Fragment query DNA > DNA into BAC > Transform E. coli with BAC > ~300 kb long reads
   * http://​www.scq.ubc.ca/​wp-content/​plasmidtext.gif   * http://​www.scq.ubc.ca/​wp-content/​plasmidtext.gif
  
-====Read ​Quality Assessment Tools==== +=====Read ​quality assessment===== 
-  * FastQC (Most popular tool to tell you about the read library) ​ +  * Base quality: Phred scores reported by sequencer.  
-  * Preqc+  * Fastq files: fasta files, plus encoded phred scores 
 +    * Need to know if your file has phred33 or phred64 encoding ​   
 +  * Quality for each individual base is not the whole story; the context also matters to the signal processing
   * Reads decrease in quality further down the read   * Reads decrease in quality further down the read
 +    *  **Depending on the assembler, you might need to trim low-quality bases off the ends of reads. Other assemblers want untrimmed data**
   * PacBio (Pacific Biosciences) doesn’t have GC/AT bias   * PacBio (Pacific Biosciences) doesn’t have GC/AT bias
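The points above about Phred encoding and quality dropping along the read can be sketched in a few lines. This is a rough illustration rather than a replacement for a real trimming tool: the offset guess is a heuristic (a file containing only high-quality Phred+33 bases can look like Phred+64), and the threshold `min_q=20` is an assumed default, not a value from the lecture.

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the lowest character seen.

    Characters below ';' (ASCII 59) do not occur in +64 encodings, so
    seeing one implies Phred+33. Otherwise 64 is only a best guess.
    """
    lowest = min(ord(c) for q in quality_strings for c in q)
    return 33 if lowest < 59 else 64


def decode_phred(qual, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(c) - offset for c in qual]


def trim_low_quality_tail(seq, qual, offset=33, min_q=20):
    """Drop bases from the 3' end while their score is below min_q."""
    scores = decode_phred(qual, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    offset = guess_phred_offset(["II5#"])
    print(offset, decode_phred("II5#", offset))
    print(trim_low_quality_tail("ACGT", "II5#", offset))
```

Trimming only from the end of the read matches the observation that quality falls off further down the read. Whether to trim at all depends on the assembler, as noted above.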
  
-====Estimating ​Genome Size from Read Data==== +====Tools==== 
- G = (pn(1-k+1))/​(λ_k)+  * FastQC (Most popular tool to tell you about the read library)  
 +    * FastQC reported an issue with our data with kmer count (related to adapter content) 
 +      * **This needs to be checked out and diagnosed!**  
 +  * Preqc 
 +    * Estimates how difficult the assembly will be 
 + 
 +=====Estimating ​genome size from read data===== 
 + G = (pn(1-k+1))/​(λ_k)
  G = Genome size  G = Genome size
  pn = proportion of correct reads  pn = proportion of correct reads
Line 47: Line 66:
   * To estimate genome size, you need to know: i) the total number of reads; ii) the length of the reads; and iii) the kmer distribution   * To estimate genome size, you need to know: i) the total number of reads; ii) the length of the reads; and iii) the kmer distribution
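As a worked illustration of the three inputs listed above, here is a small sketch of a kmer-based genome size estimate. It assumes the formula in these notes is the common estimate G = p * n * (l - k + 1) / λ_k, with n reads of length l, p the fraction of error-free data, and λ_k the mean kmer depth. The notes define `pn` as a single quantity and print `1` where this sketch reads a read length `l`, so that reading is an assumption. The parameter names are hypothetical.

```python
def estimate_genome_size(n_reads, read_length, k, mean_kmer_depth, p_correct=1.0):
    """Estimate genome size from kmer counts.

    Each read of length l contributes (l - k + 1) kmers. Dividing the
    number of (assumed error-free) kmers by the mean kmer depth gives
    the number of distinct genomic positions, i.e. the genome size.
    """
    if k > read_length:
        raise ValueError("k must not exceed the read length")
    if mean_kmer_depth <= 0:
        raise ValueError("mean kmer depth must be positive")
    kmers_per_read = read_length - k + 1
    return p_correct * n_reads * kmers_per_read / mean_kmer_depth


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    print(estimate_genome_size(n_reads=1_000_000, read_length=100, k=31,
                               mean_kmer_depth=35.0))
```

In practice the mean kmer depth is read off the main peak of the kmer distribution, which is why the distribution is the third required input.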
  
-====Error ​Correction====+=====Error ​correction=====
   * A large number of low-count kmers usually indicates sequencing errors   * A large number of low-count kmers usually indicates sequencing errors
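One way to read the point above is that kmers seen only once or a few times across all reads are mostly sequencing errors, while true genomic kmers are seen roughly as often as the coverage depth. The sketch below counts kmers and flags the rare ones. The cutoff `min_count=2` is an assumption for illustration, and real error correctors choose the cutoff from the kmer distribution.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every kmer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def likely_error_kmers(counts, min_count=2):
    """Return the kmers seen fewer than min_count times."""
    return {kmer for kmer, count in counts.items() if count < min_count}


if __name__ == "__main__":
    # Three identical reads plus one read with a single base error (T -> A).
    reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
    counts = count_kmers(reads, k=4)
    print(sorted(likely_error_kmers(counts)))
```

A single base error creates up to k novel kmers, which is why errors inflate the low-count end of the kmer distribution so strongly.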
  
-***Simulated contif length in the k-de Bruijn graph can estimate the best kmer to use for assembly. Based on contif length N50 +**Simulated contig length in the k-de Bruijn graph can estimate the best kmer to use for assembly. Based on contig length N50** 
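The N50 comparison mentioned above can be sketched as follows. N50 is the contig length at which contigs of that length or longer contain at least half of the total assembled bases. The `best_k` helper and its input layout (a dict from kmer size to simulated contig lengths) are hypothetical, and producing the simulated contig lengths themselves is out of scope here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def best_k(contig_lengths_by_k):
    """Pick the kmer size whose contig lengths give the highest N50."""
    return max(contig_lengths_by_k, key=lambda k: n50(contig_lengths_by_k[k]))


if __name__ == "__main__":
    # Made-up contig lengths for three candidate kmer sizes.
    simulated = {21: [5, 5, 5, 5], 31: [10, 6, 4], 41: [3, 3, 3]}
    print({k: n50(v) for k, v in simulated.items()}, best_k(simulated))
```

N50 alone rewards long contigs regardless of correctness, so it is a convenient ranking for candidate kmer sizes rather than a full measure of assembly quality.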
lecture_notes/04-08-2015.1428619367.txt.gz · Last modified: 2015/04/09 15:42 by sihussai