This shows you the differences between two versions of the page.
lecture_notes:04-06-2015 [2015/04/06 18:13] chkan Fixed formating |
lecture_notes:04-06-2015 [2015/04/10 04:59] gepoliano |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | Lecture Notes 4/6/2015 | + | Lecture Notes 4/6/2015 |
Note Taker: Christopher Kan | Note Taker: Christopher Kan | ||
Line 80: | Line 80: | ||
- | Note Taker: XXX | + | Note Taker: Gepoliano Chaves |
+ | |||
+ | LECTURE: ROADMAP TO THE DE NOVO ASSEMBLY OF THE BANANA SLUG GENOME | ||
+ | |||
+ | |||
+ | Guest lecturer: Stefan Prost | ||
+ | |||
+ | Lecturer contact: stefan.prost@berkeley.edu | ||
+ | |||
+ | OVERVIEW TOPICS | ||
+ | • A priori information about the genome | ||
+ | • Sequencing strategies and platforms | ||
+ | • Sequencing libraries | ||
+ | • Raw data processing and Quality assessment | ||
+ | • Assembly Strategies and Tools | ||
+ | • Assembly quality assessment | ||
+ | • Further Improvement of the Assembly | ||
+ | • What is a finished Assembly? | ||
+ | • (There’s no finished assembly) | ||
+ | • Downstream processing | ||
+ | |||
+ | There are two approaches to assembling genomic reads: de novo genome assembly and reference-based assembly. | ||
+ | |||
+ | De novo Assembly | ||
+ | |||
+ | No reference genome is available to map the sequencing reads against | ||
+ | Overlapping reads are merged into contigs (contiguous sequences built one read after the other), with no gaps | ||
+ | Scaffolds group and order different contigs | ||
+ | Reads from repeats are difficult to resolve | ||
+ | A scaffold is one contiguous sequence, but with gaps between its contigs. | ||
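+ | N50: with thousands of scaffolds, rank the contigs by length | ||
+ | N50 is a kind of length-weighted median of the contig lengths: the length L such that contigs of length L or longer contain at least half of the assembled bases | ||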
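The N50 idea above can be sketched in a few lines of Python. This is only an illustration: the function name `n50` and the contig lengths are made up for the example and do not come from the slug data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)   # rank contigs from longest to shortest
    half_total = sum(ordered) / 2             # half of all assembled bases
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum past half the total
        # defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths (in bp), for illustration only.
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))  # 290 bp in total; 80 + 70 = 150 >= 145, so N50 = 70
```

Because the contigs are ranked by length before summing, a few long contigs raise the N50 much more than many short ones.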
+ | |||
+ | Reference-based Assembly | ||
+ | |||
+ | K-mer = short subsequence of a DNA sequence, of length k | ||
+ | A commonly used platform for obtaining sequencing data is Illumina’s HiSeq. | ||
+ | This platform yields reads of about 100 bp, so k-mers can be as long as 100 bp | ||
+ | Reads are then mapped back to the reference genome | ||
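As a small illustration of the k-mer definition above, the sketch below counts every k-mer in a set of reads. The function name `count_kmers` and the two toy reads are assumptions made for this example; real reads would be about 100 bp long.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k (k-mer) across a list of reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Two toy reads, invented for illustration.
reads = ["ACGTAC", "CGTACG"]
kmer_counts = count_kmers(reads, 3)
for kmer, count in sorted(kmer_counts.items()):
    print(kmer, count)
```

Reads shorter than k contribute no k-mers, since the inner loop range is empty.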
+ | |||
+ | ===== GENERAL INFORMATION ABOUT GENOME ASSEMBLY ===== | ||
+ | |||
+ | |||
+ | As a starting point, four things should be kept in mind for the assembly: | ||
+ | |||
+ | • Expected Genome Size (there is previous data for the slugs) | ||
+ | • Expected repeat content | ||
+ | • Expected heterozygosity | ||
+ | • Ploidy: haploid, diploid or polyploid (this represents a serious problem); as far as I understood, there is no information about this for the banana slug. | ||
+ | Karyotype information: can we derive that from the assembly? | ||
+ | C-value = mass of the genome in picograms (pg); 1 pg is roughly 1 Gb of sequence (about 978 Mb) | ||
+ | C-values are available from www.genomesize.com/ | ||
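The C-value rule of thumb above can be turned into a quick size estimate. This sketch assumes the commonly quoted conversion of about 978 Mb per picogram; the function name `c_value_to_bp` and the example C-value of 2.0 pg are hypothetical and are not measurements for the banana slug.

```python
# Assumed conversion factor: 1 pg of DNA is about 978 million base pairs.
BP_PER_PG = 978_000_000


def c_value_to_bp(c_value_pg):
    """Convert a C-value in picograms to an approximate genome size in bp."""
    return c_value_pg * BP_PER_PG


# Hypothetical C-value, for illustration only.
example_c_value = 2.0
print(f"{c_value_to_bp(example_c_value) / 1e9:.2f} Gb")
```

An estimate like this only gives the expected order of magnitude of the assembly; repeat content and ploidy still have to be considered separately.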
+ | |||
+ | Information from other genomes | ||
+ | |||
+ | The lungfish has the largest vertebrate genome | ||
+ | Big genomes and repeats: genome size and repeat content correlate positively | ||
+ | Drosophila has a repeat content of 2%. | ||
+ | Mammalian genomes are trickier than bird genomes | ||
+ | Genome Synteny (Poelstra 2014) | ||
+ | Synteny – conservation of blocks of order within two sets of chromosomes that are being compared with each other (http://en.wikipedia.org/wiki/Synteny) | ||
+ | Mammalian genomes show sequence rearrangements | ||
+ | RNA-Seq does not help in this regard, because it does not show where a gene is located in the genome. | ||
+ | |||
+ | SEQUENCING TECHNOLOGIES | ||
+ | |||
+ | First generation | ||
+ | Sanger sequencing, based on dideoxynucleotide chain termination. Dideoxynucleotides are chain-elongation inhibitors of DNA polymerase (good accuracy and read length) | ||
+ | Second Generation (PCR needed) | ||
+ | Illumina’s MiSeq and HiSeq are cheap platforms for sequencing DNA. The slug data were collected using Illumina’s platform | ||
+ | Roche 454, expensive (some of the slug data were originally sequenced on this platform) | ||
+ | Life Technologies: Ion Torrent and Ion Proton (cheaper than 454) | ||
+ | ABI: SOLiD (hybridization array approach) (some of the slug data were originally sequenced on this platform too) | ||
+ | |||
+ | Third Generation (Single Molecule Sequencing) | ||
+ | Most commonly used platforms: | ||
+ | Helicos Biosciences: Heliscope | ||
+ | Pacific Biosciences: PacBio RS II (PacBio is useless, no AT/GC bias, problems with scaffolding) | ||
+ | Microbial genome - | ||
+ | Oxford Nanopore: MinION & GridION (error rate ~ 15%) | ||
+ | Illumina reads may be used to error-correct PacBio reads, but Illumina has a GC bias and the correction is computationally expensive. | ||
+ | |||
+ | ILLUMINA SEQUENCING | ||
+ | Different 3’ and 5’ end adapters – fragments are flanked by the adapters | ||
+ | Hybridization in the flowcell (array) | ||
+ | Bridge amplification: proximity and PCR amplification allow the fragment to be amplified. | ||
+ | Metzker 2010 | ||
+ | A washing step takes out one of the two types | ||
+ | Same cluster: same sequence, sequencing primer | ||
+ | The four nucleotides are labeled and all four are present in the same reaction, with a different color for each nucleotide | ||
+ | Incorporation by polymerase – light release with colors | ||
+ | Same clusters – signal | ||
+ | A CCD camera captures the color. | ||
+ | The process continues until ~100 bp | ||
+ | $1000 for a flowcell MinION – 400bp 1 lane | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | PACBIO | ||
+ | Imagine a plate with small wells | ||
+ | In this technology, the objective is to make a polymerase stick to the bottom of each well | ||
+ | Eid et al 2009 | ||
+ | Single molecule PCR polymerase that is fast enough | ||
+ | This can be imagined as something like an ELISA plate, but with exactly one DNA molecule and one polymerase per well | ||
+ | Real-time sequencing technology: the time it takes to incorporate each base is measured and can be used to infer information about heterochromatin and quadruplex structures | ||
+ | This technology allows long reads | ||
+ | Indels are the main type of error in PacBio reads | ||
+ | |||
+ | |||
+ | OXFORD NANOPORE | ||
+ | A technology that has been around for ~15 years | ||
+ | Involves a membrane with a nanopore, formed by a protein called alpha-hemolysin | ||
+ | The DNA molecule goes through the pore | ||
+ | A salt concentration gradient guides a single-stranded DNA molecule through the pore | ||
+ | Guiding the DNA to the pore brings the molecule to the exonuclease activity coupled to the hemolysin | ||
+ | A nucleotide is cleaved off and the charge changes on one side of the membrane surface | ||
+ | Based on the change in charge, the nucleotide that matches that change is inferred to be the next base in the sequence. | ||