
Lecture Notes 4/6/2015

Note Taker: Christopher Kan

A Road Map to the De Novo Assembly of the Banana Slug Genome

  1. Stefan Prost

De Novo vs. Reference Genome

  1. A reference can be biased by its own assembly, e.g. some areas may not be annotated or reads may not be available.
  2. De novo costs more

Scaffolds and Contigs

  1. Contigs have little to no gaps
  2. Scaffolds can have missing regions but the linear order of the contigs within each scaffold is known
  3. N50s for Scaffold and Contigs are used as quality measures.

○ Sort the scaffolds or contigs from longest to shortest, then sum their sizes until you reach 1/2 the total length of the assembly. The size of the last scaffold or contig added is the N50. It’s a way to obtain a median-esque measure of assembly quality

  1. Ideally # scaffolds = # chromosomes
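The N50 recipe above can be sketched in a few lines of Python. This is only an illustration: the function name `n50`, the optional `total` argument, and the example lengths are made up for these notes and were not shown in the lecture. By default the halfway point is taken from the summed assembly length; passing an expected genome size as `total` gives the variant usually called NG50.

```python
def n50(lengths, total=None):
    """Return the N50 of a collection of contig or scaffold lengths.

    lengths: iterable of sequence lengths (in bp).
    total:   length to take half of; defaults to the sum of all lengths.
             Passing an expected genome size here gives the NG50 variant.
    """
    ordered = sorted(lengths, reverse=True)  # longest first
    if total is None:
        total = sum(ordered)
    half = total / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            # This is the last piece needed to reach the halfway point.
            return length
    raise ValueError("the lengths never reach half of the total")


# Made-up contig lengths, for illustration only.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))
```

A larger N50 means that half of the assembled bases sit in longer pieces, which is why it is read as a quality measure.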

Definition: K-mer - a short, unique element of DNA of a certain length n

  1. The elements can overlap
  2. Used to summarize data by assemblers
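As a rough illustration of how overlapping k-mers summarize sequence data, the sketch below slides a window of length `k` over a string and counts each k-mer. The function name `count_kmers` and the toy sequence are assumptions for this example; real assemblers use far more compact data structures.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer (substring of length k) in a sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Consecutive windows overlap by k - 1 bases.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy sequence, for illustration only.
counts = count_kmers("ATGATGA", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

A sequence of length L yields L - k + 1 k-mers, so the 7-base toy sequence gives five 3-mers.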

A priori knowledge of a genome

  1. Expected Genome Size
    • C-values from
    • C-value is the genome size in picograms
    • A C-value of 1 pg ≈ 980 Mb
    • Depending on the clade, information from related genomes can be used to provide a priori knowledge

§ Some clades have low variation and high synteny, e.g. birds

    • Genomes of 6-7 Gb become difficult to assemble
  2. Databases
  3. Expected repeat content
    • Correlated with genome size
    • Small repeats and pseudogenes, genome duplications
  4. Expected Heterozygosity
  5. Haploid, diploid or polyploid?
    • No assembler can currently assemble polyploid genomes
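The C-value conversion from the list above is a single multiplication. The sketch below uses the factor from these notes (about 980 Mb per picogram; the commonly cited value is 978 Mb). The function name and the 2.0 pg input are hypothetical and are not a measurement for the banana slug.

```python
MB_PER_PG = 980  # factor from the notes; 978 Mb per pg is the commonly cited value


def c_value_to_mb(c_value_pg, mb_per_pg=MB_PER_PG):
    """Convert a C-value in picograms to an approximate genome size in Mb."""
    if c_value_pg < 0:
        raise ValueError("C-value cannot be negative")
    return c_value_pg * mb_per_pg


# Hypothetical C-value, for illustration only.
print(c_value_to_mb(2.0), "Mb")
```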

Sequencing Technology

  1. 1st Gen

○ Sanger

  1. 2nd Gen (PCR Needed)
    • Illumina

§ It took me a long time to understand how this works; these two videos helped me: Link

  • Roche: 454
  • Ion Torrent
  • ABI: SOLiD
  1. 3rd Gen (Single Molecule Sequencing)
  • Heliscope
  • PacBio RS II

§ Problems

  • Polymerase needs to be fast with a low error rate
  • Poor yield from each cell

~ Need to wash with a low concentration to ensure most cells hold only one molecule

  • Insertions and deletions: a nucleotide is missed, or one hangs around

~ These errors are random. This property is used for error correction

  • Light emission at the time of incorporation
    • Real time; allows discernment of the 3D structure of the molecule based on the time between incorporations
  • Can circularize small DNA fragments and get multiple reads ~3 kb, possibly up to 8 kb
  • MinION and GridION
  • Sequences by taking the molecule apart
  • Nanopore lets the molecule through based on a salt gradient
  • Sequence as the molecule goes through

~ The DNA molecule is held by an enzyme that clips off one nucleotide at a time (an exonuclease)

				~ Measure the charge at the nanopore. 
			* OR sequence the molecule as it goes through while it is held by a helicase
		* Some systematic errors, which are harder to correct 
		* Can use a hairpin to run both strands of the DNA so it is effectively paired
		* No restriction on size, hypothetically
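The point about random errors can be made concrete. If the same circularized fragment is read several times and the errors fall at random positions, a per-position majority vote recovers the true sequence. The sketch below assumes the passes have already been aligned to equal length, with `-` marking a gap; the alignment step is the hard part and is not shown. The function name and the example passes are made up for illustration.

```python
from collections import Counter


def consensus(aligned_passes):
    """Majority vote per column over pre-aligned passes of one fragment.

    aligned_passes: equal-length strings over A, C, G, T and '-' (gap).
    Columns where the gap wins are dropped from the consensus.
    """
    if not aligned_passes:
        raise ValueError("need at least one pass")
    width = len(aligned_passes[0])
    if any(len(p) != width for p in aligned_passes):
        raise ValueError("passes must be aligned to the same length")
    called = []
    for column in zip(*aligned_passes):
        base, _ = Counter(column).most_common(1)[0]
        if base != "-":
            called.append(base)
    return "".join(called)


# Made-up passes over the same fragment, each with at most one random error.
passes = [
    "ACGT-ACGTA",
    "ACGTTACGTA",  # random insertion
    "AC-T-ACGTA",  # random deletion
    "ACGT-ACCTA",  # random substitution
    "ACGT-ACGTA",
]
print(consensus(passes))
```

Systematic errors would show up at the same position in every pass, so a vote like this cannot remove them. That is why they are harder to correct.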

Issues with 3rd Gen

  1. High error
  2. High cost
  3. Error correction very computationally expensive

Note Taker: Gepoliano Chaves


Guest lecturer: Stefan Prost

Lecturer contact:

OVERVIEW TOPICS

• A priori information about the genome
• Sequencing strategies and platforms
• Sequencing libraries
• Raw data processing and quality assessment
• Assembly strategies and tools
• Assembly quality assessment
• Further improvement of the assembly
• What is a finished assembly? (There’s no finished assembly)
• Downstream processing

There are two approaches to assembling genomic reads: de novo genome assembly and reference-based assembly.

De novo Assembly

• There is no previous genome to map the sequencing reads to
• Sequence reads are clustered into contigs (one read after the other), with no gaps
• Scaffolds group different contigs: one contiguous sequence, but with gaps between the contigs
• Repeat reads are difficult to resolve
• N50: with thousands of scaffolds or contigs, rank them by length; the N50 is a kind of median of the contig lengths

Reference-based Assembly

• K-mer = short, unique element of DNA sequence of length n
• A commonly used platform to get sequencing data is Illumina’s HiSeq; this platform allows k-mers as big as 100 bp
• Reads are then mapped back to the genome


As a starting point, four topics should be kept in mind for the assembly:

• Expected genome size (there is previous data for the slugs)
• Expected repeat content
• Expected heterozygosity
• Haploid, diploid or polyploid (this represents a serious problem); as I understood, there’s no information about that for the banana slug

Karyotype information: can we derive that from the assembly?
      C-value = weight of the genome (in picograms); 1 pg ≈ 1 Gb long
      c-value from

Information from other genomes

The lungfish has the largest vertebrate genome
Big genomes and repeats: genome size and repeat content correlate positively
      Drosophila has a repeat content of 2%.
      Mammalian genomes are trickier than bird genomes
      Genome Synteny (Poelstra 2014)
      Synteny: conservation of the order of blocks between two sets of chromosomes that are being compared with each other
Mammals’ genomes present rearrangements of sequences
RNA-Seq in this regard does not show where the gene was.


First generation

Sanger sequencing, based on dideoxynucleotide chain termination. Dideoxynucleotides are chain-elongation inhibitors of DNA polymerase (good accuracy and read length)

Second Generation (PCR needed)

Illumina’s MiSeq and HiSeq are cheap platforms to sequence DNA. The slug data was collected using Illumina’s platform
Roche 454, expensive (slugs have some data originally sequenced on this platform)
Life Technologies Ion Torrent and Ion Proton (cheaper than 454)
ABI: SOLiD (hybridization array approach) (slugs have some data originally sequenced on this platform too)

Third Generation (Single Molecule Sequencing) Most commonly used platforms:

Helicos Biosciences: HeliScope
Pacific Biosciences: PacBio RS II (PacBio is useless, no AT, GC bias, problems with scaffolding)
	Microbial genome - 
Oxford Nanopore: MinION & GridION (error rate ~ 15%)
      Illumina reads may be used to error-correct PacBio reads, but Illumina has GC bias and the correction is computationally expensive.


Different 3’ and 5’ end adapters – fragments are flanked by the adapters
Hybridization in the flowcell (array)
Bridge amplification – proximity and PCR amplification allows the fragment to be amplified.
	Metzker 2010
A washing step takes out one of the two types
Same cluster: same sequence, sequencing primer
All four nucleotides are labeled and present in the same reaction, with a different color for each nucleotide
Incorporation by polymerase: light is released in the color of the incorporated nucleotide
Same clusters: signal
A CCD camera catches the color.
The process continues until ~100 bp
$1000 for a flowcell MinION – 400bp 1 lane 
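The color-reading step described above can be sketched as picking the brightest of the four color channels in each cycle. This is a simplified illustration under assumptions: the channel order in `CHANNELS`, the function name, and the intensity values are all made up, and real base callers do considerably more processing than this.

```python
CHANNELS = "ACGT"  # assumed order of the four color channels


def call_bases(cycles):
    """Call one base per cycle for a single cluster.

    cycles: sequence of 4-tuples, one per sequencing cycle, holding the
            intensity seen in each color channel (order given by CHANNELS).
    """
    read = []
    for intensities in cycles:
        if len(intensities) != len(CHANNELS):
            raise ValueError("expected one intensity per channel")
        # The brightest channel is taken as the incorporated nucleotide.
        brightest = max(range(len(CHANNELS)), key=lambda i: intensities[i])
        read.append(CHANNELS[brightest])
    return "".join(read)


# Made-up intensities for four cycles of one cluster.
cluster_signal = [
    (0.9, 0.1, 0.0, 0.1),
    (0.1, 0.0, 0.8, 0.2),
    (0.0, 0.1, 0.1, 0.7),
    (0.2, 0.9, 0.1, 0.0),
]
print(call_bases(cluster_signal))
```

Because every molecule in a cluster has the same sequence, the whole cluster lights up in one color per cycle, which gives a signal strong enough for the camera to pick up.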


Imagine a plate with small wells
In this technology, the objective is to make a polymerase stick to the well
Eid et al 2009
Single-molecule sequencing needs a polymerase that is fast enough
This can be imagined as something like ELISA, but with exactly one DNA molecule and one polymerase per well
Real-time sequencing technology: the time it takes to incorporate each base can be used as a measure to infer information about heterochromatin and quadruplex structures
This technology allows long reads
Indels are the main type of error in PacBio reads


A technology that has been around for ~15 years
Involves a membrane with a nanopore, formed with a protein called alpha-hemolysin
	DNA molecule goes through the pore
	A salt concentration gradient guides a single-stranded DNA molecule through the pore
	Guiding the DNA to the pore brings the molecule to an exonuclease coupled to the hemolysin
	A nucleotide is cut off and the charge changes on one side of the membrane
	Based on nucleotide charge change, the unique nucleotide that matches that change is inferred to be in the sequence.
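The inference step in the last line can be pictured as a nearest-match lookup: each measured change is assigned to the nucleotide whose expected change is closest. The sketch below only illustrates that idea. The reference levels are hypothetical numbers in arbitrary units, not real values for any pore, and the function name is made up.

```python
# Hypothetical expected signal change per nucleotide, in arbitrary units.
REFERENCE_LEVELS = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def infer_bases(measurements, levels=REFERENCE_LEVELS):
    """Assign each measured change to the nucleotide with the closest level."""
    read = []
    for value in measurements:
        base = min(levels, key=lambda b: abs(levels[b] - value))
        read.append(base)
    return "".join(read)


# Made-up noisy measurements, one per clipped nucleotide.
print(infer_bases([1.1, 3.9, 2.2, 2.8]))
```

When the noise is large compared with the spacing between levels, neighbouring nucleotides get confused, which is one way to picture the high error rate noted for this platform.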


, 2015/04/06 11:02

Posted notes and link to youtube videos - CK

lecture_notes/04-06-2015.1428641948.txt.gz · Last modified: 2015/04/09 21:59 by gepoliano