Differences

This shows you the differences between two versions of the page.

--- lecture_notes:04-06-2015 [2015/04/06 18:13]
chkan Fixed formating
+++ lecture_notes:04-06-2015 [2015/04/10 04:59]
gepoliano
@@ Line 1: / Line 1: @@
-Lecture Notes 4/6/2015
+ Lecture Notes 4/6/2015
 Note Taker: Christopher Kan
@@ Line 80: / Line 80: @@
-Note Taker: XXX
+Note Taker: Gepoliano Chaves
+LECTURE: ROADMAP TO THE DE NOVO ASSEMBLY OF THE BANANA SLUG GENOME
+Guest lecturer: Stefan Prost
+Lecturer contact: stefan.prost@berkeley.edu
+OVERVIEW TOPICS
+•	A priori information about the genome
+•	Sequencing strategies and platforms
+•	Sequencing libraries
+•	Raw data processing and Quality assessment
+•	Assembly Strategies and Tools
+•	Assembly quality assessment
+•	Further Improvement of the Assembly
+•	What is a finished Assembly?
+•	(There’s no finished assembly)
+•	Downstream processing
+There are two approaches to assembling genomic reads: de novo genome assembly and reference-based assembly.
+De novo Assembly
+No previous genome is available to map the sequencing reads to
+Sequence reads are merged into contigs (contiguous sequences, one read overlapping the next), with no gaps
+Scaffolds group different contigs
+Repeat reads are difficult to resolve: reads
+A scaffold is one contiguous sequence, but it may contain gaps.
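+N50 – thousands of scaffolds: rank the contigs by length
+N50 – is a kind of median of the contig lengths, weighted by length: the length of the contig at which the running total, from longest to shortest, reaches half of the assembly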
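The N50 described above can be sketched in a few lines of Python. This is a minimal illustration and not part of the lecture: the function name `n50` and the contig lengths are made up for the example, and real assemblers or assessment tools may handle edge cases differently.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Contigs are ranked from longest to shortest; the N50 is the length of
    the contig at which the running total reaches half of the assembly.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (in bp), total 300, half 150
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))  # 70, because 80 + 70 reaches 150
```

Unlike a plain median, a few long contigs dominate the result, which is why the N50 is used as a contiguity measure rather than as a measure of the typical contig.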
+Reference-based Assembly
+k-mer = short subsequence of a DNA sequence, of length k
+A commonly used platform to get sequencing data is Illumina’s HiSeq;
+This platform produces reads of about 100 bp, so k-mers can be as long as 100 bp
+Reads are then mapped back to genome
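To make the k-mer idea concrete, here is a small sketch that splits a read into all of its overlapping k-mers and counts them. The function name `count_kmers` and the toy read are assumptions for illustration only. Real k-mer counters work on millions of reads and usually also consider the reverse complement, which this sketch ignores.

```python
from collections import Counter


def count_kmers(read, k):
    """Count every overlapping substring of length k in a read."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A read of length L contains L - k + 1 k-mers (none if L < k)
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


# Toy 8 bp read; a real Illumina read would be about 100 bp
counts = count_kmers("ATGCATGC", 4)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

In this toy read the 4-mer `ATGC` occurs twice, which shows that a k-mer is not necessarily unique within a sequence.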
+===== GENERAL INFORMATION ABOUT GENOME ASSEMBLY =====
+As a starting point, four topics should be kept in mind for the assembly:
+•	Expected Genome Size (there is previous data for the slugs)
+•	Expected repeat content
+•	Expected heterozygosity
+•	Haploid, diploid or polyploid (this represents a serious problem) – as I understood, there is no information about that for the banana slug.
+	Karyotype information – can we derive that from the assembly?
+        C-value = weight of the haploid genome in picograms (pg); 1 pg ≈ 1 Gb of sequence
+        C-values are available from www.genomesize.com/
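The C-value rule of thumb above can be turned into a small conversion helper. This sketch assumes the commonly used factor of about 978 Mb per picogram, which is not stated in the lecture; the function names and the example C-value are hypothetical.

```python
# Assumed conversion factor: 1 pg of DNA is roughly 0.978 x 10^9 base pairs
BP_PER_PG = 0.978e9


def c_value_to_bp(c_value_pg):
    """Convert a C-value in picograms to an approximate genome size in bp."""
    if c_value_pg < 0:
        raise ValueError("C-value cannot be negative")
    return c_value_pg * BP_PER_PG


def c_value_to_gb(c_value_pg):
    """Same conversion, expressed in gigabases (Gb)."""
    return c_value_to_bp(c_value_pg) / 1e9


# Hypothetical C-value of 2.0 pg, for illustration only
print(f"{c_value_to_gb(2.0):.2f} Gb")
```

For rough planning (for example, estimating how much sequencing is needed for a given coverage) the simpler "1 pg is about 1 Gb" approximation is usually close enough.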
+Information from other genomes
+	The lungfish has the largest vertebrate genome
+	Big genomes – repeats: genome size and repeat content correlate positively
+        Drosophila has a repeat content of 2%.
+        Mammalian genomes are trickier than bird genomes
+        Genome Synteny (Poelstra 2014)
+        Synteny – conservation of blocks of order within two sets of chromosomes that are being compared with each other (http://en.wikipedia.org/wiki/Synteny)
+	Mammals’ genomes present rearrangements of sequences
+	RNA-Seq in this regard does not show where the gene was.
+SEQUENCING TECHNOLOGIES
+First generation
+	Sanger sequencing, based on dideoxynucleotide chain termination. Dideoxynucleotides are chain-elongation inhibitors of DNA polymerase (good accuracy and read length)
+Second Generation (PCR needed)
+	Illumina’s MiSeq and HiSeq are cheap platforms to sequence DNA. The slug data were collected using Illumina’s platform
+	Roche 454, expensive (some slug data were originally sequenced on this platform)
+	Life Technologies: Ion Torrent and Ion Proton (cheaper than 454)
+	ABI: SOLiD (hybridization array approach) (some slug data were originally sequenced on this platform too)
+Third Generation (Single Molecule Sequencing)
+Most commonly used platforms:
+	Helicos Biosciences: Heliscope
+	Pacific Biosciences: PacBio RS II (PacBio is useless, no AT/GC bias, problems with scaffolding)
+		Microbial genome -
+	Oxford Nanopore: MinION & GridION (error rate ~ 15%)
+        Illumina reads may be used to error-correct PacBio reads, but Illumina has GC bias and the correction is computationally expensive.
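GC bias is usually discussed in terms of the GC fraction of a read or a genomic window. As a minimal sketch (the function name `gc_fraction` and the example sequences are assumptions, not from the lecture), the GC fraction can be computed like this:

```python
def gc_fraction(seq):
    """Return the fraction of called bases (A, C, G, T) that are G or C."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0  # no called bases, e.g. a run of N
    return sum(base in "GC" for base in called) / len(called)


# Toy sequences for illustration
for seq in ("ATATATAT", "ATGCATGC", "GGCCGGCC"):
    print(seq, gc_fraction(seq))
```

Comparing read coverage against the GC fraction of each window is one simple way to see whether a data set is affected by GC bias.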
+ILLUMINA SEQUENCING
+	Different 3’ and 5’ end adapters – fragments are flanked by the adapters
+	Hybridization in the flowcell (array)
+	Bridge amplification – proximity and PCR amplification allow the fragment to be amplified.
+		Metzker 2010
+	A washing step takes out one of the two types
+	Same cluster: same sequence, sequencing primer
+	All four nucleotides are labeled and added in the same reaction, with a different color for each nucleotide
+	Incorporation by polymerase – light release with colors
+	Same clusters – signal
+	A CCD camera catches the color.
+	The process continues until ~100 bp
+	$1000 for a flowcell MinION – 400bp 1 lane
+PACBIO
+	Imagine a plate with small wells
+	In this technology, the objective is to make a polymerase stick to the well
+	Eid et al 2009
+	Single molecule PCR polymerase that is fast enough
+	This can be imagined as something like ELISA, but with exactly one DNA molecule and one polymerase per well
+	Real-time sequencing technology: the time it takes to incorporate each base is measured and can be used to infer information about heterochromatin and quadruplex structures
+	This technology allows long reads
+	Indels are the main type of error that happens in PacBio
+OXFORD NANOPORE
+	A technology that has been around for ~15 years
+	Involves a membrane with a nanopore, formed with a protein called alpha-hemolysin
+		The DNA molecule goes through the pore
+		A salt concentration gradient guides a single-stranded DNA molecule through the pore
+		Guiding the DNA to the pore brings the DNA molecule to the exonuclease coupled to the hemolysin
+		The nucleotide is cut and the charge changes on one side of the membrane surface
+		Based on the change in charge, the nucleotide that matches that change is inferred to be in the sequence.

Banana Slug Genomics
