lecture_notes:04-06-2015

lecture_notes:04-06-2015 [2015/04/06 17:32]
chkan created
lecture_notes:04-06-2015 [2015/04/10 04:59]
gepoliano
 Lecture Notes 4/6/2015
  
 Note Taker: Christopher Kan
  
-Note Taker: XXX
+A road map to the de novo assembly of the banana slug genome
 + - Stefan Prost 
 + 
 +De novo vs. reference-based assembly
 + - A reference-based assembly can be biased by the reference assembly itself, e.g. some regions may not be annotated, or no reads may be available for them.
 + - De novo assembly costs more
 + 
 +Scaffolds and Contigs 
 + - Contigs have little to no gaps 
 + - Scaffolds can have missing regions but the linear order of the contigs within each scaffold is known 
 + - The N50s of scaffolds and contigs are used as quality measures.
 + ○ Sort the scaffolds or contigs from largest to smallest and sum their sizes until you reach 1/2 of the total assembly length. The size of the last piece added is the N50. It's a way to obtain a median-esque measure of assembly quality
 + - Ideally # scaffolds = # chromosomes 
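The N50 procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular assembler; the function name `n50` and the example lengths are made up, and the total is taken to be the sum of the piece lengths (the assembly length) rather than an estimated genome size.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest piece to the shortest, summing sizes.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            # This is the piece that takes the sum past the halfway point.
            return length


# Made-up contig lengths, for illustration only.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))
```

Because the pieces are sorted first, a few very long scaffolds raise the N50 much more than many short ones, which is why it behaves like a length-weighted median.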
 + 
 +Definition: k-mer - a short DNA subsequence of a fixed length k
 + - The elements can overlap 
 + - Used to summarize data by assemblers 
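As a rough illustration of how overlapping k-mers summarize sequence data, the sketch below counts every k-mer in a short string. The function name `count_kmers` and the example sequence are hypothetical; real assemblers use far more memory-efficient structures.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count all overlapping k-mers of length k in a sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Slide a window of width k along the sequence, one base at a time.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Made-up example sequence.
counts = count_kmers("ATGATGC", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

A sequence of length L yields L - k + 1 overlapping k-mers, so the 7-base example gives five 3-mers, one of which (ATG) occurs twice.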
 + 
 +A priori knowledge of a genome 
 + - Expected Genome Size 
 + * C-values from www.genomesize.com
 + * The C-value is the genome size in picograms
 + * A C-value of 1 pg ≈ 980 Mb
 + * Depending on the clade, information from related genomes can provide a priori knowledge
 + § Some clades have low variation and high synteny, e.g. birds
 + * Genomes of 6-7 Gb become difficult to assemble
 + - Databases
 + * www.gigadb.org
 + * NCBI Genome 
 + - Expected repeat content 
 + * Correlated with genome size 
 + * Small repeats and pseudogenes,​ genome duplications 
 + - Expected Heterozygosity 
 + - Haploid, diploid or polyploid?
 + * No assembler can currently assemble polyploid genomes
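The C-value conversion above is simple arithmetic. The sketch below uses the factor from these notes (1 pg ≈ 980 Mb); the function name and the example C-value of 2.5 pg are hypothetical and are not a measured value for any species.

```python
# Conversion factor taken from the notes: 1 pg of DNA is roughly 980 Mb.
MB_PER_PG = 980


def genome_size_mb(c_value_pg, mb_per_pg=MB_PER_PG):
    """Estimate the genome size in megabases from a C-value in picograms."""
    if c_value_pg < 0:
        raise ValueError("C-value cannot be negative")
    return c_value_pg * mb_per_pg


# Hypothetical C-value, for illustration only.
print(genome_size_mb(2.5), "Mb")
```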
 +  
 +Sequencing Technology 
 + - 1st Gen 
 + ○ Sanger 
 + - 2nd Gen (PCR needed)
 + * Illumina
 + § It took me a long time to understand how this works; these two videos helped me: [[https://www.youtube.com/playlist?list=PLfvYDg0hWvoqfF9z7bw7Zizeenj620r5c| Link]]
 + * Roche 454
 + * Ion Torrent
 + * ABI SOLiD
 + - 3rd Gen (Single Molecule Sequencing) 
 + * Heliscope
 + * PacBio RS II
 + § Problems
 + * Polymerase needs to be fast with a low error rate
 + * Poor yield per cell
 + ~ Need to wash with a low concentration to ensure most wells hold only one molecule
 + * Insertions and deletions: a base is missed, or one hangs around
 + ~ These errors are random. This property is used for error correction
 + * Light emission at the time of incorporation
 + * Real time: allows discernment of the 3D structure of the molecule based on the time between incorporations
 + * Can circularize small DNA fragments and read them multiple times: ~3 kb, possibly up to 8 kb
 +  
 + * MinION and GridION
 + * Sequences by taking the molecule apart
 + * Nanopore lets the molecule through, driven by a salt gradient
 + * Sequence as the molecule goes through
 + ~ Molecule is held by an enzyme that clips off one nucleotide at a time - an exonuclease
 + ~ Measure the charge at the nanopore.
 + * OR sequence the molecule as it goes through while it is held by a helicase
 + * Some systematic errors - harder to correct
 + * Can use a hairpin to run both strands of the DNA so it's effectively paired
 + * No restriction on size, hypothetically
 +Issues with 3rd Gen 
 + - High error 
 + - High cost 
 + - Error correction very computationally expensive 
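Because the errors are random, reading the same circularized fragment several times lets a simple vote recover the true base at each position. The sketch below is a toy version of that idea and rests on a strong assumption: the passes are already aligned and of equal length (with `-` standing in for a gap). Producing that alignment is the expensive part in practice and is not shown. The function name and the example passes are made up.

```python
from collections import Counter


def majority_consensus(passes):
    """Return the per-position majority base across pre-aligned passes."""
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("passes must be aligned to the same length")
    consensus = []
    # zip(*passes) yields one column of bases per position.
    for column in zip(*passes):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Three made-up passes over the same fragment, each with one random error.
passes = ["ACGTAC", "ACGTTC", "ACCTAC"]
print(majority_consensus(passes))
```

Systematic errors would defeat this vote, since every pass would make the same mistake at the same position.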
 +  
 + 
 + 
 + 
 + 
 +Note Taker: ​Gepoliano Chaves 
 + 
 +LECTURE: ROADMAP TO THE DE NOVO ASSEMBLY OF THE BANANA SLUG GENOME 
 + 
 + 
 +Guest lecturer: Stefan Prost 
 + 
 +Lecturer contact: stefan.prost@berkeley.edu 
 + 
 +OVERVIEW TOPICS 
 +• A priori information about the genome 
 +• Sequencing strategies and platforms 
 +• Sequencing libraries 
 +• Raw data processing and Quality assessment 
 +• Assembly Strategies and Tools 
 +• Assembly quality assessment 
 +• Further Improvement of the Assembly 
 +• What is a finished Assembly? 
 +• (There’s no finished assembly) 
 +• Downstream processing 
 + 
 +There are two approaches to assembling genomic reads: de novo genome assembly and reference-based assembly.
 + 
 +De novo Assembly 
 + 
 +No previous genome to map the sequencing reads to
 +Sequence reads are clustered into contigs (one read after the other) with no gaps
 +Scaffolds group different contigs
 +Repeat reads are difficult to resolve: reads  
 +One contiguous read, but there must be gaps.  
 +N50 – with thousands of scaffolds: rank the contigs by size
 +N50 – is a kind of median of the contig lengths
 + 
 +Reference-based Assembly 
 + 
 +k-mer = short element of DNA sequence of a fixed length k
 +A commonly used platform for obtaining sequencing data is Illumina's HiSeq;
 +this platform produces reads of ~100 bp, so k-mers can be as long as 100 bp
 +Reads are then mapped back to the genome
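One simplified way to picture mapping reads back to a genome is a k-mer index: every k-mer in the reference is stored with its positions, and the first k-mer of a read is used to look up candidate locations. The sketch below only accepts exact matches, which real mappers do not require; the function names, the reference string and the reads are all made up for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k):
    """Return reference positions where the read matches exactly."""
    if len(read) < k:
        return []
    hits = []
    # Use the read's first k-mer as a seed, then verify the full read.
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


# Made-up reference and reads.
reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, 4)
print(map_read("ACGTT", reference, index, 4))
print(map_read("GGGG", reference, index, 4))
```

The seed ACGT occurs twice in the reference, but only one of the two candidate positions survives the full-length check, which is how repeats shorter than the read get resolved.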
 + 
 +===== GENERAL INFORMATION ABOUT GENOME ASSEMBLY =====
 + 
 + 
 +As a starting point, four topics should be kept in mind for the assembly:
 + 
 +• Expected Genome Size (there is previous data for the slugs) 
 +• Expected repeat content 
 +• Expected heterozygosity  
 +• Haploid, diploid or polyploid (polyploidy represents a serious problem) – as I understood it, there's no information about that for the banana slug.
 + Karyotype information – can we derive that from the assembly?
 +        C-value = weight of the genome in picograms; 1 pg ≈ 1 Gb
 +        C-values from www.genomesize.com/
 +         
 +Information from other genomes 
 + 
 + The lungfish has the largest vertebrate genome
 + Big genomes – repeats: genome size and repeat content correlate positively
 +        Drosophila has a repeat content of 2%.
 +        Mammalian genomes are trickier than bird genomes
 +        Genome Synteny (Poelstra 2014)
 +        Synteny – conservation of the order of blocks between two sets of chromosomes that are being compared with each other (http://en.wikipedia.org/wiki/Synteny)
 + Mammalian genomes show sequence rearrangements
 + RNA-Seq, in this regard, does not show where a gene is located in the genome.
 + 
 +SEQUENCING TECHNOLOGIES 
 + 
 +First generation 
 + Sanger sequencing, based on dideoxynucleotide chain termination. Dideoxynucleotides are chain-elongation inhibitors of DNA polymerase (good accuracy and read length)
 +Second Generation (PCR needed) 
 + Illumina's MiSeq and HiSeq are cheap platforms for sequencing DNA. The slug data were collected on Illumina's platform
 + Roche 454, expensive (some of the slug data were originally sequenced on this platform)
 + Life Technologies Ion Torrent and Ion Proton (cheaper than 454)
 + ABI: SOLiD (hybridization array approach) (some of the slug data were originally sequenced on this platform too)
 + 
 +Third Generation (Single Molecule Sequencing) 
 +Most commonly used platforms:​  
 + Helicos Biosciences:​ Heliscope 
 + Pacific Biosciences: PacBio RS II (PacBio is useless, no AT/GC bias, problems with scaffolding)
 + Microbial genome -  
 + Oxford Nanopore: MinION & GridION (error rate ~ 15%)
 +        Illumina reads may be used to error-correct PacBio reads, but Illumina has a GC bias and the correction is computationally expensive.
 + 
 +ILLUMINA SEQUENCING 
 + Different 3’ and 5’ end adapters – fragments are flanked by the adapters 
 + Hybridization in the flowcell (array) 
 + Bridge amplification – proximity and PCR amplification allow the fragment to be amplified.
 + Metzker 2010 
 + A washing step takes out one of the two types 
 + Same cluster: same sequence, sequencing primer 
 + All four nucleotides are labeled and added in the same reaction, with a different color for each nucleotide
 + Incorporation by polymerase – light is released in the corresponding color
 + Same clusters – signal
 + CCD camera captures the color.
 + The process continues until ~100 bp 
 + $1000 for a flowcell MinION – 400bp 1 lane  
 +  
 + 
 + 
 + 
 + 
 +PACBIO 
 + Imagine a plate with small wells
 + In this technology, the objective is to make a polymerase stick to the well
 + Eid et al 2009
 + Single-molecule sequencing needs a polymerase that is fast enough
 + This can be imagined as something like ELISA, but with exactly one DNA molecule and one polymerase per well
 + Real-time sequencing technology: the time it takes to incorporate each base can be used to infer information about heterochromatin and quadruplex structures
 + This technology allows long reads
 + Indels are the main type of error in PacBio reads
 + 
 + 
 +OXFORD NANOPORE 
 + A technology that has been around for ~15 years
 + Involves a membrane with a nanopore, formed by a protein called alpha-hemolysin
 + The DNA molecule goes through the pore
 + A salt concentration gradient guides a single-stranded DNA molecule through the pore
 + Guiding the DNA to the pore brings it to the exonuclease coupled to the hemolysin
 + Each nucleotide is cut off and the charge changes on one side of the membrane
 + Based on the change in charge, the nucleotide that matches that change is inferred to be the next one in the sequence.
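The inference step described above can be pictured as a nearest-match lookup: each measured change is assigned to the nucleotide whose expected signal is closest. The sketch below is a toy under loud assumptions. The signal levels in `HYPOTHETICAL_LEVELS` are invented numbers in arbitrary units, not real instrument values, and the function name is made up; actual base callers are considerably more involved.

```python
# Invented signal levels in arbitrary units, for illustration only.
# These are NOT real measurements from any nanopore instrument.
HYPOTHETICAL_LEVELS = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def call_bases(signal, levels=HYPOTHETICAL_LEVELS):
    """Assign each measured value to the base with the closest expected level."""
    called = []
    for value in signal:
        # Pick the base whose expected level is nearest to the measurement.
        base = min(levels, key=lambda b: abs(levels[b] - value))
        called.append(base)
    return "".join(called)


# Made-up noisy measurements.
print(call_bases([1.1, 3.8, 2.2, 2.9]))
```

The closer together the expected levels of two bases are, the more often noise pushes a measurement to the wrong one, which is one way to picture where a high error rate comes from.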
lecture_notes/04-06-2015.txt · Last modified: 2015/04/10 05:55 by gepoliano