lecture_notes:04-06-2015 [2015/04/06 17:32] chkan created |
lecture_notes:04-06-2015 [2015/04/10 04:59] gepoliano |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | Lecture Notes 4/6/2015 | + | Lecture Notes 4/6/2015 |
Note Taker: Christopher Kan | Note Taker: Christopher Kan | ||
- | Note Taker: XXX | + | A road map to the De Novo Assembly of the Banana Slug Genome |
+ | - Stefan Prost | ||
+ | |||
+ | De Novo vs. Reference Genome | ||
+ | - A reference can be biased by the assembly itself, e.g. some areas may not be annotated or reads are not available. | ||
+ | - De novo costs more | ||
+ | |||
+ | Scaffolds and Contigs | ||
+ | - Contigs have little to no gaps | ||
+ | - Scaffolds can have missing regions but the linear order of the contigs within each scaffold is known | ||
+ | - N50s for Scaffold and Contigs are used as quality measures. | ||
+ | ○ Sort the scaffolds or contigs from largest to smallest and sum their sizes until you reach 1/2 the total assembled length. The size of the last one added is the N50. It’s a way to obtain a median-esque measure of assembly quality | ||
+ | - Ideally # scaffolds = # chromosomes | ||
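The N50 rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `n50` and the example lengths are made up, and the threshold is taken as half of the summed contig lengths (using an estimated genome size instead would give the related NG50).

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    half = sum(lengths) / 2
    running = 0
    # Walk from the largest piece down, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            # This piece is the one that crosses the halfway mark.
            return length


# Made-up contig lengths, total 300; 80 + 70 reaches the halfway mark of 150.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))
```

A larger N50 means that half of the assembly sits in longer pieces, which is why it is read as a contiguity measure rather than a correctness measure.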
+ | |||
+ | Definition: k-mer - Short unique element of DNA of a certain length k | ||
+ | - The elements can overlap | ||
+ | - Used to summarize data by assemblers | ||
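As a rough illustration of how overlapping k-mers summarize sequence data, the sketch below slides a window of length k along a sequence and counts each k-mer. The function name `count_kmers` and the toy sequence are assumptions for the example, not something from the lecture; real assemblers use far more compact data structures.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Window start positions run from 0 to len(sequence) - k inclusive.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy sequence: its 3-mers are ATG, TGA, GAT, ATG, TGC.
counts = count_kmers("ATGATGC", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```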
+ | |||
+ | A priori knowledge of a genome | ||
+ | - Expected Genome Size | ||
+ | * C-values from www.genomesize.com | ||
+ | * C-value is the genome size in picograms | ||
+ | * 1 pg ≈ 980 Mb | ||
+ | * Depending on the clade, information from related genomes can be used to provide a priori knowledge | ||
+ | § Some have low variation and high synteny, e.g. birds | ||
+ | * 6-7 Gb becomes difficult | ||
+ | - Databases | ||
+ | * www.Gigaadb.org | ||
+ | * NCBI Genome | ||
+ | - Expected repeat content | ||
+ | * Correlated with genome size | ||
+ | * Small repeats and pseudogenes, genome duplications | ||
+ | - Expected Heterozygosity | ||
+ | - Haploid? Diploid or polyploid? | ||
+ | * No assembler can currently assemble polyploid genomes | ||
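The C-value conversion above is simple arithmetic. The sketch below uses the factor from these notes (about 980 Mb per picogram); the function name and the 2.5 pg example value are hypothetical and are not a measured banana slug C-value.

```python
MB_PER_PG = 980  # conversion factor quoted in these notes (megabases per picogram)


def c_value_to_bases(c_value_pg, mb_per_pg=MB_PER_PG):
    """Convert a C-value in picograms to an approximate genome size in bases."""
    return c_value_pg * mb_per_pg * 1_000_000


# Hypothetical C-value of 2.5 pg, for illustration only.
size = c_value_to_bases(2.5)
print(f"{size / 1e9:.2f} Gb")
```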
+ | |||
+ | Sequencing Technology | ||
+ | - 1st Gen | ||
+ | ○ Sanger | ||
+ | - 2nd Gen (PCR Needed) | ||
+ | * Illumina | ||
+ | § It took me a long time to understand how this works; these two videos helped me: [[https://www.youtube.com/playlist?list=PLfvYDg0hWvoqfF9z7bw7Zizeenj620r5c| Link]] | ||
+ | * Roche: 454 | ||
+ | * Ion Torrent | ||
+ | * ABI: SOLiD | ||
+ | - 3rd Gen (Single Molecule Sequencing) | ||
+ | * Heliscope | ||
+ | * PacBio RS II | ||
+ | § Problems | ||
+ | * Polymerase needs to be fast with low error | ||
+ | * Poor yield from cell | ||
+ | ~ Need to wash with low concentration to ensure most cells only have one molecule | ||
+ | * Insertions and deletions: a base is missed, or one hangs around and is read twice | ||
+ | ~ Random. This property is used to error correct | ||
+ | * Light emission at time of incorporation | ||
+ | * Real time, allows discernment of 3D structure of the molecule based on time between incorporations | ||
+ | * Can circularize small DNA fragments and get multiple reads ~3kb, possible to 8kb | ||
+ | |||
+ | * MinION and GridION | ||
+ | * Sequences by taking the molecule apart | ||
+ | * Nanopore allows the molecule through based on a salt gradient | ||
+ | * Sequence as the molecule goes through | ||
+ | ~ Molecule held by a molecule that clips off one nucleotide at a time - exonuclease | ||
+ | ~ Measure the charge at the nanopore. | ||
+ | * OR sequence the molecule as it goes through while it is held by a helicase | ||
+ | * Some systematic errors - Harder to correct | ||
+ | * Can use a hairpin to run both sides of the DNA so it's effectively paired | ||
+ | * No restriction on size hypothetically | ||
+ | Issues with 3rd Gen | ||
+ | - High error | ||
+ | - High cost | ||
+ | - Error correction very computationally expensive | ||
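Because the PacBio errors noted above are random, repeated passes over the same circularized fragment can be combined into a consensus. The sketch below is a deliberately simplified model, not the vendor's algorithm: it assumes the passes have already been aligned to equal length with `-` marking gaps, and it takes a per-column majority vote. The pass strings are invented.

```python
from collections import Counter


def majority_consensus(aligned_passes, gap="-"):
    """Majority vote per column over pre-aligned passes, then drop gap symbols."""
    if not aligned_passes:
        raise ValueError("need at least one pass")
    length = len(aligned_passes[0])
    if any(len(p) != length for p in aligned_passes):
        raise ValueError("passes must be aligned to the same length")
    # zip(*...) yields one tuple per alignment column.
    voted = (Counter(column).most_common(1)[0][0] for column in zip(*aligned_passes))
    return "".join(symbol for symbol in voted if symbol != gap)


# Invented example: pass 1 carries a random insertion, pass 3 a random deletion.
passes = [
    "ACGTATGCA",
    "ACGT-TGCA",
    "ACGT-TG-A",
    "ACGT-TGCA",
]
print(majority_consensus(passes))
```

The alignment step, which this sketch skips, is where most of the computational cost mentioned above would sit.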
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | Note Taker: Gepoliano Chaves | ||
+ | |||
+ | LECTURE: ROADMAP TO THE DE NOVO ASSEMBLY OF THE BANANA SLUG GENOME | ||
+ | |||
+ | |||
+ | Guest lecturer: Stefan Prost | ||
+ | |||
+ | Lecturer contact: stefan.prost@berkeley.edu | ||
+ | |||
+ | OVERVIEW TOPICS | ||
+ | • A priori information about the genome | ||
+ | • Sequencing strategies and platforms | ||
+ | • Sequencing libraries | ||
+ | • Raw data processing and Quality assessment | ||
+ | • Assembly Strategies and Tools | ||
+ | • Assembly quality assessment | ||
+ | • Further Improvement of the Assembly | ||
+ | • What is a finished Assembly? | ||
+ | • (There’s no finished assembly) | ||
+ | • Downstream processing | ||
+ | |||
+ | There are two approaches to assembling genomic reads: de novo genome assembly and reference-based assembly. | ||
+ | |||
+ | De novo Assembly | ||
+ | |||
+ | No previous genome to map the sequencing reads with | ||
+ | Sequence reads are clustered in Sequence contigs (one read after the other), no gaps | ||
+ | Scaffolds group different contigs | ||
+ | Repeat reads are difficult to resolve: reads | ||
+ | One contiguous read, but there must be gaps. | ||
+ | N50: with thousands of scaffolds, rank the contigs by size | ||
+ | N50 is a kind of median of the contig lengths | ||
+ | |||
+ | Reference-based Assembly | ||
+ | |||
+ | k-mer = short, unique element of DNA sequence of length k | ||
+ | A commonly used platform to get sequencing data is Illumina’s HiSeq; | ||
+ | This platform allows kmers as big as 100 bp | ||
+ | Reads are then mapped back to genome | ||
+ | |||
+ | ===== GENERAL INFORMATION ABOUT GENOME ASSEMBLY ===== | ||
+ | |||
+ | |||
+ | As a starting point, four topics should be kept in mind for the assembly: | ||
+ | |||
+ | • Expected Genome Size (there is previous data for the slugs) | ||
+ | • Expected repeat content | ||
+ | • Expected heterozygosity | ||
+ | • Haploid, diploid or polyploid (polyploidy represents a serious problem); as I understood it, there is no information about this for the banana slug. | ||
+ | Karyotype information: can we derive that from the assembly? | ||
+ | C-value = weight of the genome in picograms (pg); 1 pg ≈ 1 Gb of sequence | ||
+ | c-value from www.genomesize.com/ | ||
+ | |||
+ | Information from other genomes | ||
+ | |||
+ | The lungfish has the largest vertebrate genome | ||
+ | Big genomes and repeats: genome size and repeat content correlate positively | ||
+ | Drosophila has a repeat content of 2%. | ||
+ | Mammalian genomes are trickier than bird genomes | ||
+ | Genome Synteny (Poelstra 2014) | ||
+ | Synteny – conservation of blocks of order within two sets of chromosomes that are being compared with each other (http://en.wikipedia.org/wiki/Synteny) | ||
+ | Mammals’ genomes present rearrangements of sequences | ||
+ | RNA-Seq in this regard does not show where the gene was. | ||
+ | |||
+ | SEQUENCING TECHNOLOGIES | ||
+ | |||
+ | First generation | ||
+ | Sanger sequencing, based on dideoxynucleotide chain termination. Dideoxynucleotides are chain-elongation inhibitors of DNA polymerase (good accuracy and read length) | ||
+ | Second Generation (PCR needed) | ||
+ | Illumina’s MiSeq and HiSeq are cheap platforms to sequence DNA. The slug data was collected using Illumina’s platform | ||
+ | Roche 454, expensive (slugs have some data originally sequenced on this platform) | ||
+ | Life Technologies Ion Torrent and Ion Proton (cheaper than 454) | ||
+ | ABI: SOLiD (hybridization array approach) (slugs have some data originally sequenced on this platform too) | ||
+ | |||
+ | Third Generation (Single Molecule Sequencing) | ||
+ | Most commonly used platforms: | ||
+ | Helicos Biosciences: Heliscope | ||
+ | Pacific Biosciences: PacBio RS II (PacBio is useless, no AT, GC bias, problems with scaffolding) | ||
+ | Microbial genome - | ||
+ | Oxford Nanopore: MinION & GridION (error rate ~ 15%) | ||
+ | Illumina's technology may be used to error-correct PacBio reads, but Illumina has GC bias, and the correction is computationally expensive. | ||
+ | |||
+ | ILLUMINA SEQUENCING | ||
+ | Different 3’ and 5’ end adapters – fragments are flanked by the adapters | ||
+ | Hybridization in the flowcell (array) | ||
+ | Bridge amplification: proximity and PCR amplification allow the fragment to be amplified. | ||
+ | Metzker 2010 | ||
+ | A washing step takes out one of the two types | ||
+ | Same cluster: same sequence, sequencing primer | ||
+ | Four nucleotides are labeled, all 4 in the same reaction, with a different color for each nucleotide | ||
+ | Incorporation by polymerase – light release with colors | ||
+ | Same clusters – signal | ||
+ | CCD camera catches the color. | ||
+ | The process continues until ~100 bp | ||
+ | $1000 for a flowcell MinION – 400bp 1 lane | ||
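The color-per-cycle idea above can be sketched as a toy base caller: for each cycle, call the nucleotide whose color channel is brightest. The channel order in `CHANNELS` and the intensity values are assumptions made up for this example; real base callers also correct for channel cross-talk and phasing, which this sketch ignores.

```python
CHANNELS = "ACGT"  # assumed order of the four color channels (illustrative only)


def call_bases(intensities):
    """Call one base per cycle by picking the brightest of four channels."""
    read = []
    for cycle in intensities:
        if len(cycle) != len(CHANNELS):
            raise ValueError("each cycle needs one intensity per channel")
        # Index of the channel with the highest intensity in this cycle.
        brightest = max(range(len(CHANNELS)), key=lambda i: cycle[i])
        read.append(CHANNELS[brightest])
    return "".join(read)


# Made-up intensities for three cycles of a single cluster.
cycles = [
    (0.9, 0.1, 0.0, 0.1),
    (0.1, 0.2, 0.8, 0.0),
    (0.0, 0.1, 0.1, 0.7),
]
print(call_bases(cycles))
```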
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | PACBIO | ||
+ | Imagine a plate with small wells | ||
+ | In this technology, the objective is to make a polymerase stick to the well | ||
+ | Eid et al 2009 | ||
+ | Single molecule PCR polymerase that is fast enough | ||
+ | This can be imagined as something like ELISA but with exactly one DNA molecule and one polymerase per well | ||
+ | Real-time sequencing technology: the time it takes to incorporate each base can be used to infer information about heterochromatin and quadruplex structures | ||
+ | This technology allows long reads | ||
+ | Indels are the main type of error in PacBio | ||
+ | |||
+ | |||
+ | OXFORD NANOPORE | ||
+ | A technology that has been around for ~15 years | ||
+ | Involves a membrane with a nanopore, formed with a protein called alpha-hemolysin | ||
+ | The DNA molecule goes through the pore | ||
+ | A salt concentration gradient guides a single-stranded DNA molecule through the pore | ||
+ | Guiding the DNA to the pore brings the molecule to the exonuclease activity coupled to the hemolysin | ||
+ | A nucleotide is cut off and the charge changes on one side of the membrane surface | ||
+ | Based on nucleotide charge change, the unique nucleotide that matches that change is inferred to be in the sequence. |
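The last step above, inferring the nucleotide from the measured change, amounts to a nearest-match lookup. The sketch below assumes each nucleotide has one characteristic signal level; the values in `LEVELS` are hypothetical placeholders in arbitrary units, not real nanopore measurements.

```python
# Hypothetical characteristic signal level per nucleotide (arbitrary units).
LEVELS = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def call_nucleotide(signal):
    """Return the nucleotide whose reference level is closest to the signal."""
    return min(LEVELS, key=lambda base: abs(LEVELS[base] - signal))


def call_sequence(signals):
    """Call one nucleotide per measured signal."""
    return "".join(call_nucleotide(s) for s in signals)


# Made-up noisy measurements, one per cleaved nucleotide.
print(call_sequence([1.1, 3.9, 2.2, 2.8]))
```

When the noise is large compared with the spacing between levels, this lookup starts to make errors, which is one way to picture the ~15% error rate quoted in these notes.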