lecture_notes:04-07-2010

Differences

This shows you the differences between two versions of the page.

lecture_notes:04-07-2010 [2010/04/08 01:40]
galt created
lecture_notes:04-07-2010 [2015/09/14 11:40] (current)
68.180.230.228 ↷ Links adapted because of a move operation
Line 1: Line 1:
-Communicate to Jeff and Jenny +==== Class Business ====
-Offload something since they were'​nt +
-there on Monday.+
  
-Make a review articles page at a high level +Communicate about offloading assembler installation to Jeff and Jenny since they weren'​t there on Monday. 
-with citations. People can comment.+ 
 +Make a review articles page at a high level with citations. People can comment.
  
 Use the forum to discuss things.
-Each person must sign up for the forum independently. +  * Each person must sign up for the forum independently. 
-Forum works better than email, +  ​* ​Forum works better than email, because you can go back later to that subject. 
-because you can go back later to that subject. +  * Email has immediate impact, but not so easily searchable.
-email has immediate impact, but not so easily searchable.+
  
-People should read the de-novo assemblers review paper so that +People should read the de-novo assemblers review paper so that they will be ready for Friday's lecture.
-they will be ready Friday'​s lecture. ​  +  ​* ​(This has been added to new review articles page) 
-(This has been added to new review articles page) +  ​* ​Discusses Overlap and de-Bruijn graphs.
-Discusses Overlap and de-Bruijn graphs.+
  
-454 Newbler assembler is entirely proprietary.+454 Newbler assembler is entirely proprietary and almost nothing is known about how it works internally. (The only description is in the supplementary material of the original 454 method paper ((Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005.|http://dx.doi.org/10.1038/nature03959)).)
  
-Find out how much memory each tool needs. +Christy Hightower wants more feedback on the tools, to say good/bad. Feedback should be added to the wiki [[lecture_notes:04-02-2010|lecture notes for her lecture]].
-Does it need a cluster or just a single machine? etc.+
  
-Do not run anything on the headnode ​for campusRocks+[[https://​banana-slug.soe.ucsc.edu/​feed.php|RSS feed]] ​for wiki
-Learn how to use sungrid ​to tell it how to run it on the node.+  * Shows recent changes ​to the wiki.  
 +  * See what others have been doing lately. 
 +  * Good way to keep up with changes without having ​to scan every wiki page.
  
-We should all have access now to campusrocks.+===Guest lecturers coming up:=== 
 +  * Mon 19 Apr Dan Zerbino on Velvet. 
 +  * Fri 23 Apr, Janet Leonard and John Pearse on Slug biology.
  
-There'​s a link to some documentation on sunGrid.+We will talk Friday about graph representations.
  
-Can ssh to a machine ​in the campusrocks grid directly to run small things.+==== Running on Campus Rocks ==== 
 +  * Find out how much memory each tool needs. 
 +  * Does it need a cluster or just a single machine? etc.
  
-Some of the data is up now.+Do not run anything on the headnode for campusRocks. 
 +Learn how to use sungrid to tell it how to run it on (one of) the nodes. Alternatively,​ use the [[http://​campusrocks.soe.ucsc.edu/​ganglia/​|status page]] to find an idle node and ssh to it directly. 
 +The [[archive:​computer_resources:​campusrocks|campusrocks page]] has a link to some documentation on sungrid.
  
-David Bernick and Kevin have been fussing with the data. +We should all have access now to campusrocks. If you don't, contact tech staff (IT request).
-He had latest draft 4c. Have all the inversions ​Pyrobaculum.+
  
-Can test assemblers to see how well they work on the small genome.+For testing, there is some Pyrobaculum data on campusrocks now (or soon).  
 +  * David Bernick and Kevin have been fussing with the data. 
 +  * He had latest draft 4c. Have all the inversions. 
 +  * Can test assemblers to see how well they work on the small genome
 +  * 454 and Solid reads. 
 +  * Go ahead and try running it. (Remember: not on the head-node.) 
 +  * Start comparing the different assembly techniques.
  
-We have 454 and Solid reads. 
-Go ahead and try running it, but not on the head-node. 
-Start comparing the different assembly techniques. 
  
-Christy Hightower wants more feedback on the tools, +==== Lower-level Data ====
-to say good/​bad. ​ Can add feedback to the wiki lecture +
-notes on the lib lecture.+
  
-RSS feed for wiki. +=== Instruments === 
-Click the orange triangle upper-right of start page. +  * Sanger capillary 
-Shows recent changes to the wiki. +  * 454 
-See what others have been doing lately.+  * Solid 
 +  * Illumina 
 +  * Ion Torrent
  
-Guest lecturers coming up.+Sanger creates a trace. 454, Solid and Illumina take images with camera. Ion torrent uses direct chip pH measurement.
  
-Mon week after, Dan Zerbino. +=== Traces === 
-Slug biology. +  * 4 1-D traces (wiggles), overlapping; one for each of ACGT.
- +  * Each trace tells what there is at a position.
-We will talk friday about graph representations. +  * Peaks are broadened and end of read is worse than beginning.
- +  * Can get several in a row that are spread out, making it difficult to tell how many you have.
-So today let's talk about lower-level data +  * NCBI has large archives of trace data for abandoned projects.
-Sanger capillary +  ​* ​Have a terminator on each seq. 
-454 +  
-Solid +=== Images === 
-Illumina +  * The image files are enormous (TB's of data) and require a great deal of image processing. 
-IonTorrent +  * After processing, the raw images are almost never kept.
- +  * Images are typically monochrome, but SOLiD uses 4 fluorophores at the same time.
-454,​solid,​illumina,​ takes images with camera. +  * De-convolution problems there too. Spots may overlap.
-ion torrent uses direct chip ph measurement. +
- +
-The image files are enormous and require a great deal +
-of image processing which cooks them way down. +
- +
-For Sanger, you get a trace. +
-4 1-D wiggles overlapping. +
-ACGT +
-Each trace tells what there is at a position. +
-Peaks are broadenedend of read worse than beginning, +
-Can get several in a row that are spread. +
-Trace archives ​at NIH for public genome archives that never got finished+
-Have a terminator on each seq. +
-Their problem with homopolymers ​ +
-is at end of reads with broad peaks merging into eachother.+
  
-Images are typically monochrome. 
-(but SOLiD use 4 flourophores at the same time) 
-De-convolution problems there too. Spots may overlap. 
-Images are usually discarded, TB's of data. 
 Ion Torrent has direct electronic readout, no images.
  
-Base-calling +=== Base-calling ​=== 
-AGCT, quality score. +  * For each position, turn image data into a base (AGCT) and a quality score.
-but quality ​means something different on each  +  * Quality ​means something different on each platform and sometimes even each instrument (Sanger)
-platform and sometimes even each instrument (Sanger).+    (Correction to what I said in lecture: quality values are **supposed** to be -10 log<​sub>​10</​sub>​ P(error), but calibration is sometimes not very accurate. --- //​[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/04/09 07:18//) 
 +  * May have initial (known) sequences that are used to calibrate quality.
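
A minimal Python sketch of the quality scale mentioned in the correction above, assuming the standard definition Q = -10 log10 P(error). The function names are illustrative only and do not come from any sequencing toolkit; real instruments may be poorly calibrated against this ideal.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a quality value Q into the error probability it is supposed to encode."""
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) into a quality value."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(q, phred_to_error_prob(q))
```

Under this definition Q20 corresponds to an error probability of about 1 in 100 and Q30 to about 1 in 1000.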
  
-May have initial images that are used to calibrate.+=== Spaces ===
  
-BASE-space (ACGT fasta file) +  * Base-space (A/C/G/T)
-Color-space (di-nucleotides,​ used only by SOLiD) +
- One major reason they did this was to avoid a patent.  +
- More independence in the sequencing errors.+
  
-Flow-space (get from sequnecing ​by synthesis +  * Color-space (One of four colors corresponding to the change ​from previous base) 
-with ordinary nucleotides,​ get multiple copies  +    * Used by SOLiD 
-added in a single step).+  * Flow-space (A/C/G/T and length of repeat)
  
-Ion Torrent like 454 is flow-space. ​ The hydrogen 
-ions are more linear, but still has issues. 
  
-Sort of like run-length encoding. +== Base-space ==  ​ 
-You say what base and then how many times it was found in a row+  * Often in fasta file
-Alignments in flow-space are possible.+  * Used by Illumina.
  
 +== Flow-space == 
 +    * Used by sequencing-by-synthesis methods (454, Ion torrent)
 +    * Multiple copies of the same nucleotide (a homopolymer run) are added in a single step and you get an (imperfect) signal of how many.
 +    * Signal gets worse (less specific) for higher values.
 +    * Analogous to run length encoding
 +    * Often not integer values.
 +    * Ion Torrent is more linear than 454, but still has issues.
 +    * Alignments in flow-space are possible.
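
The run-length analogy can be sketched in Python as follows. This is a simplification under stated assumptions: real flowgrams follow a fixed flow order (so they include zero-length flows) and carry non-integer signal values, neither of which is modeled here. The function names are hypothetical.

```python
from itertools import groupby


def to_run_lengths(seq: str) -> list[tuple[str, int]]:
    """Collapse a base-space string into (base, run length) pairs."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


def from_run_lengths(runs: list[tuple[str, int]]) -> str:
    """Expand (base, run length) pairs back into a base-space string."""
    return "".join(base * count for base, count in runs)


if __name__ == "__main__":
    runs = to_run_lengths("AAACGGT")
    print(runs)                    # [('A', 3), ('C', 1), ('G', 2), ('T', 1)]
    print(from_run_lengths(runs))  # AAACGGT
```

An over-call or under-call of a homopolymer changes only one run length in this representation, which is one reason alignment in flow-space can be attractive.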
  
 +== Color-space ==
 +  * 4 colors, numbered 0 to 3.
  
-Color-space +^ number ^ binary ^ color  ^ meaning                     ^ transitions           ^
-colors 0 to 3. +| 0      | 00     | blue   | same base                   | (A->A C->C G->G T->T) |
-0 00 blue means same as previous base (AA CC GG TT) +| 1      | 01     | green  | non-complement transversion | (A->C C->A G->T T->G) |
-1 01 green means (AC CA GT TG) +| 2      | 10     | yellow | transition                  | (A->G C->T G->A T->C) |
-2 10 yellow means (AG GA CT TC) +| 3      | 11     | red    | complement                  | (A->T C->G G->C T->A) |
-3 11 red means switch to compliment (AT TA CG GC) +
  
-XOR is associative and commutative. +  * See /​cse/​faculty/​karplus/​pluck/​scripts/​map-colorspace 
-This XOR operation is also works brilliantly with the Klein four group +  * One major reason they used this was to avoid a patent.  
-for the bases A C G T. +  * Allows more independence in the sequencing errors. 
-You get from one base to a color, or vice versa with XOR. +  * Binary representations are useful. 
- +    * XOR is associative and commutative. 
-A 0 00 +    * This XOR operation also works brilliantly with the Klein four group for the bases A C G T.
-C 1 01 +    ​* ​You get from one base to a color, or vice versa with XOR. 
-G 2 10 +      ​* ​A 0 00 
-T 3 11+      ​* ​C 1 01 
 +      ​* ​G 2 10 
 +      ​* ​T 3 11
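
As a sketch of the XOR rule above, the following Python encodes a base-space string into colors using the 2-bit codes A=00, C=01, G=10, T=11 from these notes. The names `BASE_TO_BITS`, `color`, and `encode` are made up for illustration and are not taken from the map-colorspace script.

```python
# 2-bit codes for the bases, as listed in the notes.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def color(prev_base: str, next_base: str) -> int:
    """Color of a dinucleotide: XOR of the two base codes (0 to 3)."""
    return BASE_TO_BITS[prev_base] ^ BASE_TO_BITS[next_base]


def encode(seq: str) -> list[int]:
    """Colors for each overlapping dinucleotide of a base-space sequence."""
    return [color(a, b) for a, b in zip(seq, seq[1:])]


if __name__ == "__main__":
    print(encode("ATCG"))  # [3, 2, 3]
```

A sequence of n bases gives n-1 colors, so the first base has to be known separately to recover the bases.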
  
 The di-nucleotide is simply saying,
Line 139: Line 133:
 Remember, it's really a series like this:
 ATCG is measured as chain of dinucleotide colors
-A -- T    color3 == color 11 +  ​A -- T    color3 == color 11 
-T -- C    color2 == color 10 +  T -- C    color2 == color 10 
-C -- G    color3 == color 11+  C -- G    color3 == color 11
 Each nucleotide in the final sequence is used
-as the right have of one dinucleotide, and then+as the right half of one dinucleotide, and then
 the left half of the next dinucleotide.
 The first letter A is given
 (from the last base of the first primer if it is a read).
 So the data actually appears something like this:
-(A) 3 2 3 +  ​(A) 3 2 3 
-or +  or 
-(00) 11 10 11+  (00) 11 10 11
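
Going the other way, a read given as a known first base plus colors can be turned back into bases by XORing each color into the running base code. This is a minimal Python sketch using the same 2-bit codes as the notes; the function name `decode` is hypothetical.

```python
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"  # index is the 2-bit code


def decode(first_base: str, colors: list[int]) -> str:
    """Rebuild the base-space sequence from the known first base and the colors."""
    current = BASE_TO_BITS[first_base]
    bases = [first_base]
    for c in colors:
        current ^= c  # next base code = previous base code XOR color
        bases.append(BITS_TO_BASE[current])
    return "".join(bases)


if __name__ == "__main__":
    print(decode("A", [3, 2, 3]))  # ATCG
```

If the known first base comes from the primer rather than the read itself, it would be dropped from the result after decoding.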
  
 If you are on base G (10) and your next color is red (11),
Line 159: Line 153:
  
 Indel in colorspace.
-A C G A C A A +  ​A C G A C A A 
-   ​drop out GAC +     ​drop out GAC 
- 1 3 2 1 1 0 +   ​1 3 2 1 1 0 
-A C A A   (GAC deleted, 4 colors become 1 new color) +  A C A A   (GAC deleted, 4 colors become 1 new color) 
-  1 1 0 +    1 1 0 
-  so 3 2 1 1 will become 3 xor 2 xor 1 xor 1  +    so 3 2 1 1 will become 3 xor 2 xor 1 xor 1  
-  = 11 xor 10 xor 01 xor 01 = 01 == 1  +    = 11 xor 10 xor 01 xor 01 = 01 == 1  
- which is the same as C-->A which is correct.+   ​which is the same as C-->A which is correct.
 Take the region that's changing,
 and the exclusive-or of it all together.
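
The deletion example can be checked with a short Python sketch. It assumes the A=0, C=1, G=2, T=3 codes used in these notes; `encode` and `collapse` are illustrative names, not functions from any existing tool.

```python
from functools import reduce
from operator import xor

BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq: str) -> list[int]:
    """Colors for each overlapping dinucleotide of a base-space sequence."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def collapse(colors: list[int]) -> int:
    """XOR a run of colors together into the single color that replaces them."""
    return reduce(xor, colors, 0)


if __name__ == "__main__":
    before = encode("ACGACAA")  # [1, 3, 2, 1, 1, 0]
    after = encode("ACAA")      # GAC deleted: [1, 1, 0]
    # The four colors spanning the deleted bases collapse into one color.
    rebuilt = before[:1] + [collapse(before[1:5])] + before[5:]
    print(rebuilt == after)     # True
```

The same check applies to any deletion: the colors spanning the removed bases XOR together to the single color that joins the flanking bases.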
Line 173: Line 167:
 SNP changes colors, two changes together.
  
-A G C G   (C --> T) +  ​A G C G   (C --> T) 
-    +   ​2 3 3 
-       +     ​1 1 
- +   
-A G C A   (C --> T) +  A G C A   (C --> T) 
-   3 1 +   ​2 3 1 
-       3+     ​1 3
    
 XOR of that region has to take this base to that base.
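
One way to use this rule, sketched in Python under simplifying assumptions (the read is already aligned to the reference in color-space, both have the same length, and no indels are present): two adjacent color mismatches whose XOR is unchanged are consistent with a SNP, while a lone mismatch looks like a read error. The function name and labels are hypothetical.

```python
def classify_mismatches(ref: list[int], read: list[int]) -> list[tuple[int, str]]:
    """Label color mismatches as 'snp_candidate' (adjacent pair, XOR preserved)
    or 'isolated_error' (anything else). Returns (index, label) pairs."""
    if len(ref) != len(read):
        raise ValueError("reference and read must have the same length")
    calls = []
    i = 0
    while i < len(ref):
        if ref[i] == read[i]:
            i += 1
            continue
        pair_ok = (
            i + 1 < len(ref)
            and ref[i + 1] != read[i + 1]
            and (ref[i] ^ ref[i + 1]) == (read[i] ^ read[i + 1])
        )
        if pair_ok:
            calls.append((i, "snp_candidate"))
            i += 2  # both colors of the pair are explained by one base change
        else:
            calls.append((i, "isolated_error"))
            i += 1
    return calls


if __name__ == "__main__":
    # AGCG -> AGTG from the notes: colors 2 3 3 become 2 1 1.
    print(classify_mismatches([2, 3, 3], [2, 1, 1]))  # [(1, 'snp_candidate')]
    # A single changed color cannot be explained by one base change.
    print(classify_mismatches([2, 3, 3], [2, 1, 3]))  # [(1, 'isolated_error')]
```

This only tests consistency; a pair of adjacent read errors can still mimic a SNP, so real callers would also weigh coverage and quality.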
Line 190: Line 184:
  
  
-Note that when an error happens, all bases in the read down-stream +Note that when an error happens, all bases in the read down-stream will be wrong in base-space. ​ This is the reason that people bother to try to use color-space,​ because then the error stays localized.
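
A small Python illustration of why a single color error corrupts everything downstream once the read is translated to base-space. It uses the 2-bit codes from these notes; the example read and the position of the error are made up.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def decode(first_base: str, colors: list[int]) -> str:
    """Translate a first base plus colors into base-space."""
    current = BASE_TO_BITS[first_base]
    out = [first_base]
    for c in colors:
        current ^= c
        out.append(BITS_TO_BASE[current])
    return "".join(out)


if __name__ == "__main__":
    true_colors = [3, 2, 3, 0, 2]  # colors for ATCGGA
    bad_colors = [3, 1, 3, 0, 2]   # one miscalled color at index 1
    print(decode("A", true_colors))  # ATCGGA
    print(decode("A", bad_colors))   # ATGCCT
```

In color-space the two reads differ at one position; after decoding, every base from the error onward differs.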
-will be wrong in base-space. ​ This is the reason that people bother +
-to try to use color-space,​ because then the error stays localized.+
  
-When doing SNP calling, want to know if it is a SNP or a read-error. +When doing SNP calling, want to know if it is a SNP or a read-error. The read-errors are typically independent. But the SNP will have coordinated changes. Either a larger change, mismapping, error, or something else. SOLID makes a big deal out of this. Not, however, useful for other non-SNP-calling things. Even with millions of reads, you can get false-positive SNPs at a low error rate.
-The read-errors are independent typically. ​  +
-But the SNP will have coordinated changes. +
-Either a larger change, mismapping, error, or something else. +
-SOLID makes a big deal out of this. +
-Not however useful for other non-snp-calling things. +
-Even with millions of reads, you can get false-positive SNPs at a low error rate.+
  
-Base-space and color-space comes with quality. +=== Quality ===
-Flowspace does not have such? have to check. +
-SFS format is the flowspace format for input  +
-into the newbler assembler. ​  Does it have any +
-independent quality measurement?​+
  
-A large number of the assemblers throw away the quality data. +Base-space, flow-space, and color-space all come with quality ​scores.
-Or only use it later. ​ Some use it to just throw away reads with +
-low quality.+
  
-Sanger fails because of electrophoresis,​ not the sanger chemistry itself, +SFF format is the flowspace format for input into the Newbler assembler. It has quality scores for each base using standard -10 log<​sub>​10</​sub>​ probability. 
-as far as getting long reads Out to about 1000 bases.+([[http://​www.ncbi.nlm.nih.gov/​Traces/​trace.cgi?​cmd=show&​f=formats&​m=doc&​s=format#​sff|SFF format]])
  
-454 quality ​drops off also.  ​Synthesis starts ​to get out of phase.+A large number of the assemblers throw away the quality ​data or only use it later.  ​Some use it to just throw away reads with low quality.
  
-Solid - lose yield on ligation. ​ Missing ligations. +== Reasons for quality dropoff == 
- +  * Sanger fails because of electrophoresis, not the Sanger chemistry itself, as far as getting long reads. Reads go out to about 1000 bases.
-Illumina problem with frequent washing.  removes template. Kevin thinks.+  * 454 synthesis starts to get out of phase.
 +  * Solid loses yield on ligation (missing ligations).
 +  * Illumina has a problem with frequent washing, which removes template (Kevin thinks).
  
  
 +=== Memory ===
 How do you represent this stuff in memory?
-Two bits per base. +Two bits per base (four possible values).
 With color-space, can choose them to fit what they should be.
-If read is not too-variable length, can fit in 64-bit integer.+If read length is not too variable, can fit 32 bases into a 64-bit integer.
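
A minimal Python sketch of the two-bits-per-base packing. Python integers are unbounded, so the 32-base limit is enforced explicitly to mimic a 64-bit word; the read length has to be stored separately. The names `pack` and `unpack` are illustrative.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"
MAX_BASES = 32  # 32 bases * 2 bits = 64 bits


def pack(seq: str) -> int:
    """Pack up to 32 bases into one integer, first base in the highest bits."""
    if len(seq) > MAX_BASES:
        raise ValueError("at most 32 bases fit in a 64-bit word")
    word = 0
    for base in seq:
        word = (word << 2) | CODE[base]
    return word


def unpack(word: int, length: int) -> str:
    """Recover the bases; the length is needed because leading A's are zero bits."""
    return "".join(BASES[(word >> (2 * i)) & 0b11] for i in range(length - 1, -1, -1))


if __name__ == "__main__":
    w = pack("ACGT")
    print(w, bin(w))     # 27 0b11011
    print(unpack(w, 4))  # ACGT
```

The same layout works for colors, since both bases and colors take values 0 to 3.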
  
 SOLID produces cs-fasta file. (cs = colorspace)
 It is a T (the last base of the first adapter?)
-T 00100 ...+  ​T 00100 ...
  
 Sometimes we want to do matching directly in colorspace.
Line 237: Line 220:
 So this helps avoid problems that would otherwise happen.
  
-Sometimes don't know what strand you are working on. +Sometimes don't know what strand you are working on. To get reverse-complement equivalent in color-space all you have to do is the reversal. ​ No complementing is needed.
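
The reversal claim can be checked with a short Python sketch. Complementing a base is the same as XORing its code with 11, and that cancels out inside each dinucleotide color, so only the order changes. The helper names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode(seq: str) -> list[int]:
    """Colors for each overlapping dinucleotide of a base-space sequence."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def reverse_complement(seq: str) -> str:
    """Reverse-complement in base-space."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


if __name__ == "__main__":
    seq = "ACGACAA"
    forward = encode(seq)
    reverse = encode(reverse_complement(seq))
    print(forward)                   # [1, 3, 2, 1, 1, 0]
    print(reverse)                   # [0, 1, 1, 2, 3, 1]
    print(reverse == forward[::-1])  # True
```

Note that this applies to the colors only; the known leading base of a read still differs between the two strands.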
-To get reverse-complement equivalent in color-space +
-all you have to do is the reversal. ​ No complementing is needed.+
  
-One thing you can do when mapping is handle both strands. +One thing you can do when mapping is handle both strands. But you still have to hash the reversed colorspace too, so don't save memory
-But you still have to hash the reversed colorspace too, so don't save memory+
 in searches.  Hashing a genome takes a lot of space.
 +
 +==== Final Business ====
  
 Journal club papers should be fairly short.
lecture_notes/04-07-2010.1270716036.txt.gz · Last modified: 2010/04/08 01:40 by galt