lecture_notes:04-08-2015

Differences

This shows you the differences between two versions of the page.

lecture_notes:04-08-2015 [2015/04/09 22:42]
sihussai
lecture_notes:04-08-2015 [2015/04/17 21:53]
sihussai fixing formatting
Line 1: Line 1:
-=====De novo Assembly II=====+======De novo Assembly II======
 **Guest lecturer: Stefan Prost, stefan.prost@berkley.edu** ​ **Guest lecturer: Stefan Prost, stefan.prost@berkley.edu** ​
  
-====Illumina Paired-end Sequencing Libraries====+=====Illumina Paired-end Sequencing Libraries=====
   * MiSeq has 300 bp reads   * MiSeq has 300 bp reads
   * Paired ends read from both directions, so you get one read for each end (may or may not overlap depending on molecule and read size)   * Paired ends read from both directions, so you get one read for each end (may or may not overlap depending on molecule and read size)
Line 10: Line 10:
             end_2 <=====             end_2 <=====
  
-  * Can’t sequence repeat ​regions ​with paired-end reads+  * Problem: not really sufficient for repetitive ​regions
     * ends can't be very far apart because Illumina can't handle big molecules     * ends can't be very far apart because Illumina can't handle big molecules
     * not enough info for scaffolding     * not enough info for scaffolding
  
-====Illumina Mate-Pair Sequencing Libraries===+=====Illumina Mate-Pair Sequencing Libraries====
   * Idea: get paired reads that are much farther away (for more scaffolding info)   * Idea: get paired reads that are much farther away (for more scaffolding info)
   * Basically, the same idea as paired-ends,​ except the middle section is missing and the ends are oriented the opposite way.    * Basically, the same idea as paired-ends,​ except the middle section is missing and the ends are oriented the opposite way. 
Line 23: Line 23:
  
   * Genomic DNA > Fragment (2-5 kb) > biotinylate ends > Circularize > Fragment (400-600 bp) > Enrich biotinylated fragments > Ligate adaptors > …   * Genomic DNA > Fragment (2-5 kb) > biotinylate ends > Circularize > Fragment (400-600 bp) > Enrich biotinylated fragments > Ligate adaptors > …
-  ​* Dependent on inferring insert size+    * Cut DNA, attach a biotin tag to both ends of the target molecule 
 +    * Circularize target molecule 
 +      * This step is hard 
 +    * Circularized molecule is fragmented; the fragment carrying the biotin tag (i.e. the two ends of the original target joined to each other backwards, with the tags in between) is enriched 
 +    * Then end repair, A-tailing, adapters added, amplification,​ sequencing  
 +  ​* Dependent on inferring insert size (can be tricky)  
 +  * Most companies can get you 8 kb inserts, with skill you can get up to 20 kb 
 +  * Complicated process, weird stuff can happen in between  
 +  * Important difference between paired ends and mate-pairs: ends are oriented the opposite way. 
  
  
-====BAC (Bacterial Artificial Chromosome) and Fosmid Libraries====+=====BAC (Bacterial Artificial Chromosome) and Fosmid Libraries====
 +  * Uncommon and expensive, but the gold standard ​
   * Bacterial F-plasmid takes < 40 kb insert size   * Bacterial F-plasmid takes < 40 kb insert size
   * Fragment query DNA > DNA into BAC > Transform E. coli with BAC > ~300 kb long reads   * Fragment query DNA > DNA into BAC > Transform E. coli with BAC > ~300 kb long reads
   * http://​www.scq.ubc.ca/​wp-content/​plasmidtext.gif   * http://​www.scq.ubc.ca/​wp-content/​plasmidtext.gif
  
-====Read Quality Assessment ​Tools==== +=====Read Quality Assessment===== 
-  * FastQC (Most popular tool to tell you about the read library) ​ +  * Base quality: Phred scores reported by sequencer.  
-  * Preqc+  * Fastq files: fasta files, plus encoded phred scores 
 +    * Need to know if your file has phred33 or phred64 encoding ​   
 +  * Quality for each individual base is not the whole story; the context matters to the signal processing too
   * Reads decrease in quality further down the read   * Reads decrease in quality further down the read
 +    *  **Depending on the assembler, you might need to trim low-quality bases off the ends of reads; other assemblers want untrimmed data**
   * Pacific Bio doesn’t have GC|AT bias   * Pacific Bio doesn’t have GC|AT bias
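
The Phred encoding and end-trimming points above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular trimmer: the function names, the offset heuristic (any character below ASCII 59 implies Phred+33; only characters at or above ASCII 64 suggests Phred+64) and the default threshold of Q20 are all assumptions made for the example.

```python
def guess_phred_offset(quality_strings):
    """Heuristically guess the ASCII offset (33 or 64) of FASTQ quality strings.

    Assumption: characters below ASCII 59 can only occur in Phred+33 files,
    while files whose lowest character is >= 64 are probably Phred+64.
    Anything in between is ambiguous, so None is returned.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    if lowest < 59:
        return 33
    if lowest >= 64:
        return 64
    return None


def decode_phred(quality, offset):
    """Convert a quality string to a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_3prime(seq, quality, offset, min_q=20):
    """Remove trailing bases whose Phred score is below min_q (hypothetical default)."""
    scores = decode_phred(quality, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]
```

For example, `trim_3prime("ACGTAC", "IIII##", 33)` drops the last two bases, because `#` decodes to Q2 under Phred+33. A file containing only high-quality Phred+33 bases would be misclassified by this heuristic, so the guess should be checked against what the sequencing provider reports.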
  
-====Estimating Genome Size from Read Data==== +====Tools==== 
- G = (pn(1-k+1))/​(λ_k)+  * FastQC (Most popular tool to tell you about the read library)  
 +    * FastQC reported an issue with our data with kmer count (related to adapter content) 
 +      * **This needs to be checked out and diagnosed!**  
 +  * Preqc 
 +    * Estimates how difficult the assembly will be 
 + 
 +=====Estimating Genome Size from Read Data===== 
 + G = (pn(l-k+1))/(λ_k)
  G = Genome size  G = Genome size
  pn = proportion of correct reads  pn = proportion of correct reads
Line 47: Line 66:
   * To estimate genome size, you need to know: i) the total number of reads; ii) the read length; and iii) the k-mer distribution   * To estimate genome size, you need to know: i) the total number of reads; ii) the read length; and iii) the k-mer distribution
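
As a rough worked example of the genome size estimate, the sketch below assumes that the term written `1-k+1` in the formula is the number of k-mers per read, `l-k+1` with `l` the read length, and that `λ_k` is the mean (peak) k-mer depth read off the k-mer distribution. The function and parameter names are hypothetical, and the proportion of correct reads defaults to 1 purely for illustration.

```python
def estimate_genome_size(n_reads, read_length, k, kmer_depth, proportion_correct=1.0):
    """Estimate genome size G = p * n * (l - k + 1) / lambda_k.

    n_reads            -- total number of reads (n)
    read_length        -- read length in bases (l), assumed equal for all reads
    k                  -- k-mer size
    kmer_depth         -- mean k-mer depth from the k-mer distribution (lambda_k)
    proportion_correct -- proportion of correct reads (p); assumed, default 1.0
    """
    if read_length < k:
        raise ValueError("read_length must be at least k")
    if kmer_depth <= 0:
        raise ValueError("kmer_depth must be positive")
    kmers_per_read = read_length - k + 1
    return proportion_correct * n_reads * kmers_per_read / kmer_depth


# Toy numbers, not real data: 1 million 100 bp reads, k = 31, k-mer depth 20.
print(estimate_genome_size(1_000_000, 100, 31, 20.0))
```

With these toy numbers each read contributes 70 k-mers, so the estimate is 70,000,000 / 20 = 3.5 Mb.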
  
-====Error Correction====+=====Error Correction=====
   * A large number of low-frequency k-mers usually indicates sequencing errors   * A large number of low-frequency k-mers usually indicates sequencing errors
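
The idea behind k-mer based error detection can be illustrated with a small counting sketch. It is not how any specific error corrector works: the function names and the cutoff `min_count=2` are assumptions for the example, and real tools choose the cutoff from the shape of the k-mer distribution.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


def low_frequency_kmers(counts, min_count=2):
    """Return k-mers seen fewer than min_count times (candidate errors)."""
    return {kmer for kmer, c in counts.items() if c < min_count}


# Toy reads: the third read differs from the others in its last base.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
counts = count_kmers(reads, 4)
print(kmer_spectrum(counts))
print(low_frequency_kmers(counts))
```

In the toy input the single mismatching base creates one k-mer (`ACGA`) that is seen only once, while the correct k-mers are seen three or more times.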
  
-***Simulated contif length in the k-de Brujin graph can estimate the best kmer to use for assembly. Based on contif length N50 +**Simulated contig length in the k-de Bruijn graph can be used to estimate the best k-mer size for assembly, judged by contig N50** 
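
For reference, N50 is the contig length at which contigs of that length or longer contain at least half of the total assembled bases. A minimal sketch of the calculation, and of picking the k with the highest N50, is given below. The function names are hypothetical, and choosing k purely by N50 is a simplification of what the notes describe.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("contig_lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


def best_k_by_n50(lengths_by_k):
    """Given {k: [contig lengths]}, return the k whose contigs have the highest N50."""
    return max(lengths_by_k, key=lambda k: n50(lengths_by_k[k]))
```

For contig lengths 2 through 10 (total 54), the three longest contigs (10, 9, 8) sum to 27, exactly half the total, so `n50` returns 8.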
lecture_notes/04-08-2015.txt · Last modified: 2015/04/17 22:34 by sihussai