De novo Assembly | Wed 8 April 2015 | Stefan Prost | stefan.prost@berkley.edu | jolespin notes

#Illumina Paired-end Sequencing Libraries
. MiSeq has 300 bp reads
. Paired ends read from both directions
  =====> read_1 ________________________ read_2 <=====
. Can’t sequence repeat regions with paired-end reads

#Illumina Mate-Pair Sequencing Libraries
  <===== read_1 ________________________ read_2 =====>
. Dependent on inferring insert size
. Genomic DNA > Fragment (2-5 kb) > Biotinylate ends > Circularize > Fragment (400-600 bp) > Enrich biotinylated fragments > Ligate adaptors > …

#BAC (Bacterial Artificial Chromosome) and Fosmid Libraries
. Fosmid (bacterial F-plasmid) takes < 40 kb insert size
. Fragment query DNA > Insert DNA into BAC > Transform E. coli with BAC > ~300 kb long inserts
. http://www.scq.ubc.ca/wp-content/plasmidtext.gif

#Read Quality Assessment Tools
. FastQC (most popular tool to tell you about the read library)
. Preqc
. Read quality decreases toward the end of the read
. Pacific Biosciences (PacBio) doesn’t have GC|AT bias

#Estimating Genome Size from Read Data
. $G = \frac{p \, n \, (l - k + 1)}{\lambda_k}$
  G = genome size
  p = proportion of correct (error-free) k-mers in the reads
  n = total number of reads
  l = read length
  k = k-mer length
  λ_k = mode of the k-mer count histogram
  Simpson 2013, arXiv
. To estimate genome size you need to know: i) total number of reads; ii) length of reads; and iii) k-mer distribution

#Error Correction
. K-mers that occur only a few times (low counts) are usually errors
***Simulated contig length in the k-de Bruijn graph can be used to estimate the best k-mer size for assembly, based on contig N50.
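
Two of the calculations in these notes can be sketched in a few lines of Python: the genome size estimate and the contig N50. This is a minimal sketch, not the Preqc implementation. It assumes the genome size formula reads G = p n (l - k + 1) / λ_k, with n the total number of reads and l the read length (an interpretation of the notes), and that all reads have the same length. The function names and the `min_count` cutoff for skipping low-count (likely erroneous) k-mers are hypothetical choices for illustration.

```python
from collections import Counter


def kmer_count_histogram(reads, k):
    """Map each k-mer count to the number of distinct k-mers seen that many times."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(counts.values())


def histogram_mode(histogram, min_count=2):
    """Most frequent k-mer count, ignoring counts below min_count.

    min_count is an assumed cutoff: low-count k-mers are treated as errors.
    """
    candidates = {c: n for c, n in histogram.items() if c >= min_count}
    if not candidates:
        raise ValueError("no k-mer count at or above min_count")
    return max(candidates, key=candidates.get)


def estimate_genome_size(n_reads, read_length, k, mode, p_correct=1.0):
    """Assumed reading of the formula: G = p * n * (l - k + 1) / lambda_k."""
    return p_correct * n_reads * (read_length - k + 1) / mode


def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths is empty")
```

For example, 1,000,000 reads of 100 bp with k = 31 and a histogram mode of 35 give `estimate_genome_size(1_000_000, 100, 31, 35)` = 2,000,000 bp, and `n50([100, 80, 50, 30, 20])` returns 80.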