
De novo Assembly II

Guest lecturer: Stefan Prost

Illumina Paired-end Sequencing Libraries

  • MiSeq produces reads up to 300 bp long
  • Paired-end sequencing reads each fragment from both ends, so you get one read per end (the two reads may or may not overlap, depending on fragment length and read length)
	=====> end_1
	           end_2 <=====
  • Paired-end reads can't resolve long repeat regions
    • the two ends can't be very far apart because Illumina can't sequence large molecules
    • so the pairs don't carry enough information for scaffolding
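
Whether the two reads of a pair overlap follows directly from the fragment length and the read length. A minimal sketch of that arithmetic, assuming both reads have the same length and ignoring adapters and trimming (the function name is hypothetical):

```python
def paired_end_overlap(fragment_len: int, read_len: int) -> int:
    """Return how many bases are covered by both reads of a pair (0 if none)."""
    # A read cannot extend past the end of the fragment.
    r = min(read_len, fragment_len)
    # Read 1 covers [0, r); read 2 covers [fragment_len - r, fragment_len).
    return max(0, 2 * r - fragment_len)


if __name__ == "__main__":
    for fragment_len in (200, 500, 800):
        print(fragment_len, paired_end_overlap(fragment_len, read_len=300))
```

With 300 bp reads, a 500 bp fragment gives a 100 bp overlap, while an 800 bp fragment leaves a 200 bp unsequenced gap between the reads.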

Illumina Mate-Pair Sequencing Libraries

  • Idea: get paired reads that are much farther apart (for more scaffolding information)
  • Basically the same idea as paired ends, except that the middle section is missing and the ends are oriented the opposite way.
		<===== end_1
	           end_2 =====>
  • Genomic DNA > Fragment (2-5 kb) > biotinylate ends > Circularize > Fragment (400-600 bp) > Enrich biotinylated fragments > Ligate adaptors > …
  • Dependent on inferring insert size
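
One common way to infer the insert size is to map the mates to a reference or to existing contigs and look at the distance they span. The sketch below assumes you already have the leftmost mapped coordinate of each mate; the function name and the coordinates are made up for illustration, and the median is used only because it is robust to a few outlying pairs.

```python
from statistics import median


def estimate_insert_size(pairs, read_len):
    """Estimate the insert size from mapped mate positions.

    pairs: iterable of (start_1, start_2), the leftmost mapped coordinate
           of each mate on the same reference sequence (hypothetical input).
    read_len: read length in bp.
    """
    # Span from the leftmost base of one mate to the rightmost base of the other.
    spans = [abs(s2 - s1) + read_len for s1, s2 in pairs]
    return median(spans)


if __name__ == "__main__":
    # Made-up coordinates; the last pair is a short-insert outlier.
    example_pairs = [(100, 3000), (500, 3600), (1000, 3900), (2000, 2100)]
    print(estimate_insert_size(example_pairs, read_len=100))
```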

BAC (Bacterial Artificial Chromosome) and Fosmid Libraries

Read Quality Assessment Tools

  • FastQC (the most popular tool for summarizing the quality of a read library)
  • Preqc
  • Base quality decreases toward the end of the read
  • Pacific Biosciences (PacBio) reads don't show GC/AT bias
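
The drop in quality along the read can be seen by averaging the Phred scores at each read position. This is a simplified sketch of that idea, not how FastQC or Preqc implement it; it assumes Phred+33 quality strings (the usual encoding in current Illumina FASTQ files), and the function name and example strings are made up.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position.

    quality_strings: FASTQ quality lines (assumed Phred+33 encoded).
    """
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                # First read long enough to reach this position.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


if __name__ == "__main__":
    # 'I' = Q40, '5' = Q20, '#' = Q2 in Phred+33.
    print(mean_quality_per_position(["IIII#", "II5#"]))
```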

Estimating Genome Size from Read Data

G = (p · n · (l - k + 1)) / λ_k
G = genome size
p = proportion of correct (error-free) k-mers
n = total number of reads
l = read length
k = k-mer length
λ_k = mode of the k-mer count histogram
Simpson 2013, arXiv
  • To estimate genome size you need to know: i) the total number of reads; ii) the read length; and iii) the k-mer count distribution
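
A worked example of the estimate, reading the formula as G = p · n · (l - k + 1) / λ_k with n reads of length l. The function name and all input values are made up for illustration; in practice λ_k comes from the k-mer count histogram of the real library.

```python
def estimate_genome_size(n_reads, read_len, k, kmer_mode, p_correct=1.0):
    """Estimate genome size G = p * n * (l - k + 1) / lambda_k.

    n_reads:   total number of reads (n)
    read_len:  read length in bp (l)
    k:         k-mer length
    kmer_mode: mode of the k-mer count histogram (lambda_k)
    p_correct: proportion of error-free k-mers (p), assumed 1.0 by default
    """
    kmers_per_read = read_len - k + 1
    return p_correct * n_reads * kmers_per_read / kmer_mode


if __name__ == "__main__":
    # Made-up library: 1,000,000 reads of 100 bp, k = 31, histogram mode at 14.
    print(estimate_genome_size(1_000_000, 100, 31, 14))
```

With these numbers each read contributes 70 k-mers, so the estimate is 70,000,000 / 14 = 5,000,000 bp.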

Error Correction

  • k-mers that occur only a few times in the read set (low counts) are usually sequencing errors; there tend to be a lot of them
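
A minimal sketch of how low-count k-mers can be flagged as likely errors: count every k-mer in the reads, then pick out those below a count threshold. The threshold of 2 and the toy reads are assumptions for illustration; real error correctors choose the cutoff from the k-mer count histogram.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def low_count_kmers(counts, min_count=2):
    """Return k-mers seen fewer than min_count times (likely errors)."""
    return {kmer for kmer, c in counts.items() if c < min_count}


if __name__ == "__main__":
    # Three identical reads plus one read with a single-base error (T -> A).
    reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
    counts = count_kmers(reads, k=4)
    print(sorted(low_count_kmers(counts)))
```

The single substitution creates three k-mers that each occur once, while the correct k-mers occur three times each.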

***Simulated contig lengths in the k-de Bruijn graph can be used to estimate the best k-mer size for assembly, based on the contig N50.
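
For reference, N50 is the contig length at which contigs of that length or longer contain at least half of the total assembled bases. The sketch below computes it and picks the k with the highest N50; the contig lengths per k are made-up numbers, not output from any tool.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if running * 2 >= total:
            return length
    return 0


def best_k_by_n50(contigs_by_k):
    """Return the k whose contig set has the highest N50."""
    return max(contigs_by_k, key=lambda k: n50(contigs_by_k[k]))


if __name__ == "__main__":
    # Hypothetical simulated contig lengths for three k-mer sizes.
    contigs_by_k = {
        21: [50, 40, 30, 20],
        31: [100, 80, 60, 40, 20],
        41: [90, 30, 20],
    }
    for k, lengths in contigs_by_k.items():
        print(k, n50(lengths))
    print("best k:", best_k_by_n50(contigs_by_k))
```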

lecture_notes/04-08-2015.1428619367.txt.gz · Last modified: 2015/04/09 15:42 by sihussai