This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
|
lecture_notes:04-08-2015 [2015/04/09 19:11] jolespin created |
lecture_notes:04-08-2015 [2015/04/17 22:34] (current) sihussai fixing capitalization |
||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | De nova Assembly | Wed 8 April 2015 | Stefan Prost | stefan.prost@berkley.edu | jolespin notes | + | ======De novo assembly II====== |
| + | **Guest lecturer: Stefan Prost, stefan.prost@berkley.edu** | ||
| - | #Illumina Paired-end Sequencing Libraries | + | =====Illumina paired-end sequencing libraries===== |
| - | . MiSeq has 300 bp reads | + | * MiSeq has 300 bp reads |
| - | . Paired ends read from both directions | + | * Paired ends read from both directions, so you get one read for each end (may or may not overlap depending on molecule and read size) |
| - | =====> read_1 | + | =====> end_1 |
| ________________________ | ________________________ | ||
| - | read_2 <===== | + | end_2 <===== |
| - | . Can’t sequence repeat regions with paired-end reads | + | |
| - | #Illumina Mate-Pair Sequencing Libraries | + | * Problem: not really sufficient for repetitive regions |
| - | <===== read_1 | + | * ends can't be very far apart because Illumina can't handle big molecules |
| + | * not enough info for scaffolding | ||
| + | |||
| + | =====Illumina mate-pair sequencing libraries===== | ||
| + | * Idea: get paired reads that are much farther away (for more scaffolding info) | ||
| + | * Basically, the same idea as paired-ends, except the middle section is missing and the ends are oriented the opposite way. | ||
| + | |||
| + | <===== end_1 | ||
| ________________________ | ________________________ | ||
| - | read_2 =====> | + | end_2 =====> |
| - | . Dependent on inferring insert size | + | |
| - | . Genomic DNA > Fragment (2-5 kb) > biotinylate ends > | + | * Genomic DNA > Fragment (2-5 kb) > biotinylate ends > Circularize > Fragment (400-600 bp) > Enrich biotinylated fragments > Ligate adaptors > … |
| - | Circularize > Fragment (400-600 bp) > Enrich biotinylated fragments > | + | * Cut DNA, attach a biotin tag to both ends of the target molecule |
| - | Ligate adaptors > … | + | * Circularize target molecule |
| + | * This step is hard | ||
| + | * The circularized molecule is fragmented, and the fragment carrying the biotin tag (i.e. the ends of the original target joined to each other backwards, with the tags in between) is enriched | ||
| + | * Then end repair, A-tailing, adapters added, amplification, sequencing | ||
| + | * Dependent on inferring insert size (can be tricky) | ||
| + | * Most companies can get you 8 kb inserts, with skill you can get up to 20 kb | ||
| + | * Complicated process, weird stuff can happen in between | ||
| + | * Important difference between paired ends and mate-pairs: ends are oriented the opposite way. | ||
| + | |||
| + | |||
| + | =====BAC (Bacterial Artificial Chromosome) and fosmid libraries===== | ||
| + | * Uncommon and expensive, but the gold standard | ||
| + | * Bacterial F-plasmid takes < 40 kb insert size | ||
| + | * Fragment query DNA > DNA into BAC > Transform E. coli with BAC > ~300 kb long reads | ||
| + | * http://www.scq.ubc.ca/wp-content/plasmidtext.gif | ||
| - | #BAC (Bacterial Artificial Chromosome) and Fosmid Libraries | + | =====Read quality assessment===== |
| - | . Bacterial F-plasmid takes< 40 kb insert size | + | * Base quality: Phred scores reported by sequencer. |
| - | . Fragment query DNA > DNA into BAC > Transform E. coli with BAC > ~300 kb long reads | + | * Fastq files: fasta files, plus encoded phred scores |
| - | . http://www.scq.ubc.ca/wp-content/plasmidtext.gif | + | * Need to know if your file has phred33 or phred64 encoding |
| + | * Quality for each individual base is not the whole story; the context matters to the signal processing too | ||
| + | * Reads decrease in quality further down the read | ||
| + | * **Depending on the assembler, you might need to trim lower-quality bases off the ends of the reads. Other assemblers want untrimmed data** | ||
| + | * Pacific Biosciences (PacBio) sequencing doesn’t have GC/AT bias | ||
| - | #Read Quality Assessment Tools | + | ====Tools==== |
| - | . FastQC (Most popular tool to tell you about the read library) | + | * FastQC (Most popular tool to tell you about the read library) |
| - | . Preqc | + | * FastQC reported an issue with our data with kmer count (related to adapter content) |
| - | . Reads decrease in quality further down the read | + | * **This needs to be checked out and diagnosed!** |
| - | . Pacific Bio doesn’t have GC|AT bias | + | * Preqc |
| + | * Estimates how difficult the assembly will be | ||
| - | #Estimating Genome Size from Read Data | + | =====Estimating genome size from read data===== |
| - | . G = (pn(1-k+1))/(λ_k) | + | G = (pn(l-k+1))/(λ_k) |
| G = Genome size | G = Genome size | ||
| pn = proportion of correct reads | pn = proportion of correct reads | ||
| Line 38: | Line 64: | ||
| Simpson 2013, arXiv | Simpson 2013, arXiv | ||
| - | . To estimate genome size need to know:i) total number of reads; ii) length of reads; and iii) kmer distribution | + | * To estimate genome size, you need to know: i) the total number of reads; ii) the length of the reads; and iii) the k-mer distribution |
| - | #Error Correction | + | =====Error correction===== |
| - | . High amount of small kmers are usually errors | + | * A high number of low-frequency k-mers usually indicates sequencing errors |
| - | ***Simulated contif length in the k-de Brujin graph can estimate the best kmer to use for assembly. Based on contif length N50 | + | **Simulated contig lengths in the k-mer de Bruijn graph can be used to estimate the best k-mer size for assembly, based on contig N50** |
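
On the phred33 vs. phred64 point in the read quality section: a minimal sketch of how one might guess the encoding from raw FASTQ quality strings and turn characters into Phred scores. The function names are hypothetical, and the cut-offs (ASCII 59 and 74) are heuristic assumptions about typical Illumina output, not something stated in the lecture. A file containing only high-quality bases can be ambiguous, in which case the sketch returns `None`.

```python
def guess_phred_offset(quality_strings):
    """Heuristically guess the ASCII offset (33 or 64) of FASTQ quality strings.

    Returns 33, 64, or None when the observed characters fit both encodings.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        return None
    # Assumption: characters below ';' (ASCII 59) never occur with a +64 offset.
    if min(codes) < 59:
        return 33
    # Assumption: characters above 'J' (ASCII 74, Q41 at +33) suggest +64.
    if max(codes) > 74:
        return 64
    return None


def decode_quality(quality, offset):
    """Convert one quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(phred):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10)


if __name__ == "__main__":
    quals = ["IIIIIIII#", "IIIHHGG##"]
    offset = guess_phred_offset(quals)
    print(offset, decode_quality(quals[0], offset))
```

In practice FastQC reports the detected encoding, so a check like this is mainly useful as a sanity test before handing reads to an assembler.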
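
On the genome size formula: a worked sketch, assuming the numerator counts the k-mers contained in the reads (number of reads times read length minus k plus 1) and that λ_k is the mean k-mer depth read off the k-mer distribution. The `correct_fraction` parameter is an assumed stand-in for the "proportion of correct reads" term; its exact definition in the lecture is not visible here, so it defaults to 1.0.

```python
def estimate_genome_size(num_reads, read_length, k, kmer_depth, correct_fraction=1.0):
    """Estimate genome size as (usable k-mers in the reads) / (mean k-mer depth).

    correct_fraction is an assumption: the share of reads (or k-mers)
    treated as error-free.
    """
    if read_length < k:
        raise ValueError("read_length must be at least k")
    if kmer_depth <= 0:
        raise ValueError("kmer_depth must be positive")
    kmers_per_read = read_length - k + 1
    total_kmers = num_reads * kmers_per_read
    return correct_fraction * total_kmers / kmer_depth


if __name__ == "__main__":
    # 1,000,000 reads of 100 bp, k = 31, mean k-mer depth of 20
    print(estimate_genome_size(1_000_000, 100, 31, 20))
```

With these example numbers each read holds 70 k-mers, giving 70,000,000 k-mers in total and an estimate of 3.5 Mb.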
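
On choosing k by contig N50: a minimal sketch of the N50 calculation, plus a helper that picks the k with the highest N50. N50 here is the standard definition (the length of the contig at which the running total, taken from the longest contig down, reaches half of the total assembly length). The `best_k_by_n50` helper and its input layout (a dict mapping k to simulated contig lengths) are hypothetical and only illustrate the idea; they are not how Preqc itself is invoked.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


def best_k_by_n50(contigs_by_k):
    """Pick the k whose (simulated) contig lengths give the largest N50.

    contigs_by_k: hypothetical dict mapping k -> list of contig lengths.
    """
    return max(contigs_by_k, key=lambda k: n50(contigs_by_k[k]))


if __name__ == "__main__":
    simulated = {
        21: [500, 400, 300, 200],
        31: [900, 700, 100, 50],
        41: [600, 300, 300, 300],
    }
    print({k: n50(v) for k, v in simulated.items()})
    print(best_k_by_n50(simulated))
```

N50 alone rewards long contigs without checking that they are correct, so it is best treated as one signal among several when settling on k.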