Differences

This shows you the differences between two versions of the page.

--- lecture_notes:04-06-2015 [2015/04/06 17:32]
chkan created
+++ lecture_notes:04-06-2015 [2015/04/10 04:59]
gepoliano
@@ Line 1: / Line 1: @@
-Lecture Notes 4/6/2015
+ Lecture Notes 4/6/2015
 Note Taker: Christopher Kan
-Note Taker: XXX
+A road map to the De Novo Assembly of the Banana Slug Genome
+	- Stefan Prost
+De Novo vs. Reference Genome
+	- A reference can be biased by its own assembly, e.g. some areas may not be annotated or reads may not be available.
+	- De novo costs more
+Scaffolds and Contigs
+	- Contigs have little to no gaps
+	- Scaffolds can have missing regions but the linear order of the contigs within each scaffold is known
+	- N50s for Scaffold and Contigs are used as quality measures.
+		○ Sort the scaffolds or contigs from longest to shortest and sum their sizes until you reach 1/2 of the total assembly length. The size of the last scaffold or contig added is the N50. It’s a way to obtain a median-like measure of assembly quality
+	- Ideally # scaffolds = # chromosomes
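The N50 rule described above can be sketched in a few lines of Python. This is a minimal illustration, not what any particular assembler reports: the function name `n50` and the example lengths are made up for this note, and the total is taken over the assembly itself rather than over an estimated genome size.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    # Longest first, then accumulate until half of the total is reached.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            # The last piece added is the N50.
            return length


# Hypothetical contig lengths (total 300, so the halfway point is 150).
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))
```

With these lengths the running sum reaches 150 after the 80 and the 70, so the N50 is 70.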
+Definition: Kmer - Short unique element of DNA of a certain length n
+	- The elements can overlap
+	- Used to summarize data by assemblers
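To make the overlapping k-mer idea concrete, here is a small sketch that counts every k-mer in a sequence with a sliding window. The function name `count_kmers` and the toy sequence are assumptions for illustration; real assemblers use far more memory-efficient structures.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count all overlapping substrings of length k in a DNA sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # Slide a window of width k one base at a time, so k-mers overlap.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy sequence, not real data.
counts = count_kmers("ATGATGA", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

A sequence of length L yields L - k + 1 k-mers, which is why the seven-base toy sequence gives five 3-mers.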
+A priori knowledge of a genome
+	- Expected Genome Size
+		* C-values from www.genomesize.com
+		* C-value is the genome size in picograms
+		* 1 pg of DNA ≈ 980 Mb (megabases)
+		* Depending on the clade, information from related genomes can be used to provide a priori knowledge
+			§ Some have low variation and high synteny - Birds
+		* Genomes of 6-7 Gb become difficult to assemble
+	- Databases
+		* www.gigadb.org
+		* NCBI Genome
+	- Expected repeat content
+		* Correlated with genome size
+		* Small repeats and pseudogenes, genome duplications
+	- Expected Heterozygosity
+	- Haploid? Diploid or polyploid?
+		* No assembler can currently assemble polyploid genomes
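The C-value conversion from the genome-size bullets can be written as a one-line calculation. This is a rough sketch: the factor of 980 Mb per picogram is the rounded figure from these notes, and the 2.5 pg input is a hypothetical value, not a measured banana slug C-value.

```python
# Conversion factor taken from the lecture notes (rounded): 1 pg of DNA is about 980 Mb.
MB_PER_PG = 980


def c_value_to_megabases(c_value_pg, mb_per_pg=MB_PER_PG):
    """Convert a C-value in picograms to an approximate genome size in megabases."""
    if c_value_pg < 0:
        raise ValueError("C-value cannot be negative")
    return c_value_pg * mb_per_pg


# Hypothetical C-value of 2.5 pg, for illustration only.
size_mb = c_value_to_megabases(2.5)
print(f"{size_mb:.0f} Mb, or about {size_mb / 1000:.2f} Gb")
```

An estimate like this mainly tells you which side of the "difficult" 6-7 Gb range a genome is likely to fall on before any sequencing is done.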
+Sequencing Technology
+	- 1st Gen
+		○ Sanger
+	- 2nd Gen (PCR Needed)
+		* Illumina
+			§ It took me a long time to understand how this works; these two videos helped me: [[https://www.youtube.com/playlist?list=PLfvYDg0hWvoqfF9z7bw7Zizeenj620r5c| Link]]
+		* Roche: 454
+		* Ion Torrent
+		* ABI: SOLiD
+	- 3rd Gen (Single Molecule Sequencing)
+		* Heliscope
+		* PacBio RS II
+			§ Problems
+				* Polymerase needs to be fast with a low error rate
+				* Poor yield from cell
+					~ Need to wash with a low concentration to ensure most cells only have one molecule
+				* Insertions and deletions: a base is missed, or one hangs around
+					~ Random. This property is used for error correction
+			* Light emission at time of incorporation
+				* Real time, allows discernment of the 3D structure of the molecule based on time between incorporations
+			* Can circularize small DNA fragments and get multiple reads ~3kb, possible to 8kb
+		* MinION and GridION
+			* Sequences by taking the molecule apart
+			* Nanopore allows the molecule through based on a salt gradient
+				* Sequence as the molecule goes through
+					~ Molecule is held by an exonuclease, an enzyme that clips off one nucleotide at a time
+					~ Measure the charge at the nanopore.
+				* OR sequence the molecule as it goes through while it is held by a helicase
+			* Some systematic errors - Harder to correct
+			* Can use a hairpin to run both strands of the DNA so it's effectively paired
+			* No restriction on size hypothetically
+Issues with 3rd Gen
+	- High error
+	- High cost
+	- Error correction very computationally expensive
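Because the PacBio errors are random, reading the same molecule several times and taking a per-position majority vote removes most of them. The sketch below shows only that voting idea. It assumes the passes are already aligned and of equal length (with `-` standing in for a gap), which is the expensive part in practice; the function name `majority_consensus` and the reads are invented for illustration.

```python
from collections import Counter


def majority_consensus(passes):
    """Column-wise majority vote over equal-length, pre-aligned reads."""
    if not passes:
        raise ValueError("need at least one read")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    # zip(*passes) walks through the alignment one column at a time.
    for column in zip(*passes):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Three made-up passes over the same molecule, each with one random error.
passes = ["ACGTAC", "ACGTTC", "ACCTAC"]
print(majority_consensus(passes))
```

Systematic errors, as noted for nanopore reads, would appear in every pass at the same position, so a vote like this cannot remove them.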
+Note Taker: Gepoliano Chaves
+LECTURE: ROADMAP TO THE DE NOVO ASSEMBLY OF THE BANANA SLUG GENOME
+Guest lecturer: Stefan Prost
+Lecturer contact: stefan.prost@berkeley.edu
+OVERVIEW TOPICS
+•	A priori information about the genome
+•	Sequencing strategies and platforms
+•	Sequencing libraries
+•	Raw data processing and Quality assessment
+•	Assembly Strategies and Tools
+•	Assembly quality assessment
+•	Further Improvement of the Assembly
+•	What is a finished Assembly?
+•	(There’s no finished assembly)
+•	Downstream processing
+There are two approaches to assembling genomic reads: de novo genome assembly and reference-based assembly.
+De novo Assembly
+No previous genome to map the sequencing reads to
+Sequence reads are assembled into contigs (one read after the other), with no gaps
+Scaffolds group different contigs
+Repeat reads are difficult to resolve: reads
+One contiguous read, but there must be gaps.
+N50 – thousands of scaffolds: rank the contigs by size
+N50 – is a kind of median of the contig lengths
+Reference-based Assembly
+Kmer = short, unique element of DNA sequence of length n
+A commonly used platform to get sequencing data is Illumina’s HiSeq.
+This platform allows kmers as big as 100 bp
+Reads are then mapped back to genome
+===== GENERAL INFORMATION ABOUT GENOME ASSEMBLY =====
+As a starting point, four topics should be kept in mind for the assembly:
+•	Expected Genome Size (there is previous data for the slugs)
+•	Expected repeat content
+•	Expected heterozygosity
+•	Haploid, diploid or polyploid (this represents a serious problem) – as I understood it, there’s no information about that for the banana slug.
+	Karyotype information – can we derive that from the assembly?
+        C-value = weight of genome (picograms); 1 pg ≈ 1 Gb long
+        C-value from www.genomesize.com/
+Information from other genomes
+	The lungfish has the largest vertebrate genome
+	Big genomes – repeats: genome size and repeat content correlate positively
+        Drosophila has a repeat content of 2%.
+        Mammalian genomes are trickier than bird genomes
+        Genome Synteny (Poelstra 2014)
+        Synteny – conservation of blocks of order within two sets of chromosomes that are being compared with each other (http://en.wikipedia.org/wiki/Synteny)
+	Mammals’ genomes present rearrangements of sequences
+	RNA-Seq in this regard does not show where the gene was.
+SEQUENCING TECHNOLOGIES
+First generation
+	Sanger Sequencing, based on the dideoxynucleotide chain termination. Dideoxynucleotides are chain-elongation inhibitors of DNA polymerase (Good accuracy and read length)
+Second Generation (PCR needed)
+	Illumina’s MiSeq and HiSeq are cheap platforms to sequence DNA. The slug data was collected using Illumina’s platform
+	Roche 454, expensive (some slug data was originally sequenced on this platform)
+	Life Technologies Ion Torrent and Ion Proton (cheaper than 454)
+	ABI: SOLiD (hybridization array approach) (some slug data was originally sequenced on this platform too)
+Third Generation (Single Molecule Sequencing)
+Most commonly used platforms:
+	Helicos Biosciences: Heliscope
+	Pacific Biosciences: PacBio RS II (PacBio is useless, no AT, GC bias, problems with scaffolding)
+		Microbial genome -
+	Oxford Nanopore: MinION & GridION (error rate ~ 15%)
+        Illumina's technology may be used to error-correct PacBio reads, but Illumina has GC bias, and the correction is computationally expensive.
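GC bias means coverage depends on how GC-rich a region is, so a first sanity check is simply to compute the GC fraction of reads or windows. The helper below is a minimal sketch. The name `gc_fraction` and the example sequences are assumptions for illustration, and ambiguous bases such as N are ignored rather than counted.

```python
def gc_fraction(sequence):
    """Return the fraction of called bases (A, C, G, T) that are G or C."""
    sequence = sequence.upper()
    called = [base for base in sequence if base in "ACGT"]
    if not called:
        raise ValueError("sequence contains no A, C, G or T bases")
    return sum(base in "GC" for base in called) / len(called)


# Made-up reads: one balanced, one GC-rich, one AT-rich.
for read in ["ATGC", "GGCCGC", "ATATTA"]:
    print(read, gc_fraction(read))
```

Comparing the distribution of these values in the reads against what is expected for the genome is one simple way to spot under-represented GC-rich or AT-rich regions.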
+ILLUMINA SEQUENCING
+	Different 3’ and 5’ end adapters – fragments are flanked by the adapters
+	Hybridization in the flowcell (array)
+	Bridge amplification – proximity and PCR amplification allows the fragment to be amplified.
+		Metzker 2010
+	A washing step takes out one of the two types
+	Same cluster: same sequence, sequencing primer
+	Four nucleotides are labeled, all 4 in the same reaction, with a different color for each nucleotide
+	Incorporation by polymerase – light release with colors
+	Same clusters – signal
+	CCD camera catches the color.
+	The process continues until ~100 bp
+	$1000 for a flowcell MinION – 400bp 1 lane
+PACBIO
+	Imagine a plate with small wells
+	in this technology, the objective is to make a polymerase stick to the well
+	Eid et al 2009
+	Single molecule PCR polymerase that is fast enough
+	This can be imagined as something like ELISA, but with exactly one DNA molecule and one polymerase per well
+	Real-time sequencing technology: the time it takes to incorporate each base is recorded and can be used to infer information about heterochromatin and quadruplex structures
+	This technology allows long reads
+	Indels are the main error type in PacBio reads
+OXFORD NANOPORE
+	A technology that has been around for ~15 years
+	Involves a membrane with a nanopore, formed by a protein called alpha-hemolysin
+		DNA molecule goes through the pore
+		A salt concentration gradient guides a single-stranded DNA molecule through the pore
+		Guiding the DNA to the pore brings it to the exonuclease coupled to the hemolysin
+		A nucleotide is cut off and the charge changes on one side of the membrane
+		Based on the charge change, the nucleotide that matches that change is inferred to be in the sequence.

Banana Slug Genomics
