Lecture Notes 4/6/2015
Note Taker: Christopher Kan
A road map to the De Novo Assembly of the Banana Slug Genome
De Novo vs. Reference Genome
Scaffolds and Contigs
○ N50: sort the scaffolds or contigs from longest to shortest and sum their sizes until you reach 1/2 the total linear length (of the assembly, or of the genome if its size is known). The size of the last constituent part added is the N50. It’s a way to obtain a median-esque measure of assembly quality
Definition: Kmer - Short unique element of DNA of a certain length n
A priori knowledge of a genome
§ Some have low variation and high synteny - Birds
§ It took me a long time to understand how this works; these two videos helped me: Link
~ Need to wash with low concentration to ensure most cells only have one molecule
~ Random. This property is used for error correction
~ Molecule held by a molecule that clips off one nucleotide at a time - exonuclease
~ Measure the charge at the nanopore.
~ OR sequence the molecule as it goes through while it is held by a helicase
~ Some systematic errors - harder to correct
~ Can use a hairpin to run both sides of the DNA so it is effectively paired
~ No restriction on size, hypothetically
Issues with 3rd Gen
Note Taker: Gepoliano Chaves
LECTURE: ROADMAP TO THE DE NOVO ASSEMBLY OF THE BANANA SLUG GENOME
Guest lecturer: Stefan Prost
Lecturer contact: email@example.com
OVERVIEW TOPICS
• A priori information about the genome
• Sequencing strategies and platforms
• Sequencing libraries
• Raw data processing and Quality assessment
• Assembly Strategies and Tools
• Assembly quality assessment
• Further Improvement of the Assembly
• What is a finished Assembly? (There’s no finished assembly)
• Downstream processing
There are two approaches to assembling genomic reads: de novo genome assembly and reference-based assembly.
De novo Assembly
No previous genome is available to map the sequencing reads to.
Sequence reads are clustered into contigs (one read after the other), with no gaps.
Scaffolds group different contigs: a scaffold is read as one contiguous sequence, but there must be gaps.
Repeats are difficult to resolve with reads.
N50: with thousands of scaffolds, rank the contigs by length. The N50 is a kind of median of the contig lengths.
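The N50 idea can be sketched in a few lines of Python. This is a minimal sketch, not taken from the lecture: the function name `n50` and the contig lengths are made up for illustration, and it assumes the common convention of using half of the total assembly length as the threshold.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths.

    Assumption: the threshold is half of the summed lengths
    (the total assembly length), not an external genome size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The contig that takes the running sum to the halfway
        # point is the N50 contig.
        if running >= half_total:
            return length


# Hypothetical contig lengths (in bp), for illustration only.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))  # prints 70
```

Here the total length is 300, so the halfway point is 150; the two longest contigs (80 + 70) reach it, and the N50 is the length of the second one.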
Kmer = short, unique element of DNA sequence of length n.
A commonly used platform to get sequencing data is Illumina’s HiSeq. This platform allows kmers as big as 100 bp.
Reads are then mapped back to the genome.
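To make the kmer definition concrete, here is a minimal Python sketch that lists and counts all kmers of a given length in a sequence. The function name `count_kmers` and the example sequence are hypothetical, chosen only for illustration. Counting also shows that a given kmer may occur more than once in a sequence.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in a DNA sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # Slide a window of width k along the sequence, one base at a time.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Made-up example sequence, for illustration only.
counts = count_kmers("ATGATGCA", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

A sequence of length L contains L - k + 1 overlapping kmers, so the 8-base example yields six 3-mers, of which ATG appears twice.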
GENERAL INFORMATION ABOUT GENOME ASSEMBLY
As a starting point, 4 topics should be kept in mind for the assembly:
Information from other genomes
Sanger sequencing, based on dideoxynucleotide chain termination, is considered to be the first-generation sequencing technology. Dideoxynucleotides are chain-elongation inhibitors of DNA polymerase. Positive features of this technology are its good accuracy and read length.
Second Generation (PCR needed)
Illumina’s MiSeq and HiSeq are cheap platforms for sequencing DNA. The slug data for this course was collected using Illumina’s platform. Roche 454 is an expensive platform; however, some slug data was originally sequenced on it. Life Technologies manufactured the Ion Torrent and Ion Proton (cheaper than 454). ABI’s SOLiD system (hybridization array approach) was also used for sequencing the slug genome.
Third Generation (Single Molecule Sequencing)
Most commonly used platforms:
• Helicos Biosciences: HeliScope
• Pacific Biosciences: PacBio RS II (PacBio is useless, no AT, GC bias, problems with scaffolding) Microbial genome
• Oxford Nanopore: MinION & GridION (error rate ~ 15%)
Illumina may be used to error-correct PacBio reads, but Illumina has GC bias, computationally expensive
ILLUMINA SEQUENCING
Different 3’ and 5’ end adapters: fragments are flanked by the adapters.
Hybridization in the flowcell (array).
Bridge amplification: proximity and PCR amplification allow the fragment to be amplified (Metzker 2010).
A washing step takes out one of the two types.
Same cluster: same sequence, sequencing primer.
The four nucleotides are labeled, all 4 in the same reaction, with 4 different colors, one for each nucleotide.
Incorporation by polymerase: light release with colors.
Same clusters give the signal; a CCD camera catches the color.
The process continues until ~100 bp.
$1000 for a flowcell.
MinION – 400bp 1 lane
PACBIO
Imagine a plate with small wells. In this technology, the objective is to make a polymerase stick to the well (Eid et al., 2009). This can be imagined as something like ELISA, but with exactly one DNA molecule and one polymerase per well. PacBio is a real-time sequencing technology: the time it takes to incorporate each base is measured and can be used to infer information about heterochromatin and quadruplex structures. This technology allows long reads. However, indels are the main type of error in PacBio reads.
Nanopore sequencing is a technology that has been around for 15 years now. It involves a membrane with a nanopore formed by a protein called alpha-hemolysin. A salt concentration gradient guides a single-stranded DNA molecule through the pore. This guidance brings the DNA molecule to an exonuclease coupled to the hemolysin in the nanopore. The exonuclease cuts off one nucleotide at a time, and the charge changes on one side of the membrane. Based on this change in charge, the nucleotide that matches the change is inferred to be the next base in the sequence.