lecture_notes:04-07-2010

Differences

This shows you the differences between two versions of the page.

lecture_notes:04-07-2010 [2010/04/08 09:25]
galt
lecture_notes:04-07-2010 [2015/09/14 18:40] (current)
68.180.230.228 ↷ Links adapted because of a move operation
Line 1: Line 1:
-== Class Business == +==== Class Business ====
  
-Communicate to Jeff and Jenny +Communicate about offloading assembler installation to Jeff and Jenny since they weren't there on Monday.
-Offload something ​since they were'nt +
-there on Monday.+
  
-Make a review articles page at a high level +Make a review articles page at a high level with citations. People can comment.
-with citations. People can comment.+
  
 Use the forum to discuss things.
-Each person must sign up for the forum independently. +  * Each person must sign up for the forum independently. 
-Forum works better than email, +  ​* ​Forum works better than email, because you can go back later to that subject. 
-because you can go back later to that subject. +  * Email has immediate impact, but not so easily searchable.
-email has immediate impact, but not so easily searchable.+
  
-People should read the de-novo assemblers review paper so that +People should read the de-novo assemblers review paper so that they will be ready for Friday's lecture.
-they will be ready Friday'​s lecture. ​  +  ​* ​(This has been added to new review articles page) 
-(This has been added to new review articles page) +  ​* ​Discusses Overlap and de-Bruijn graphs.
-Discusses Overlap and de-Bruijn graphs.+
  
-454 Newbler assembler is entirely proprietary. +454 Newbler assembler is entirely proprietary and almost nothing is known about how it works internally. (The only description is in the supplementary material of the original 454 method paper ((Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005.|http://dx.doi.org/10.1038/nature03959)).)
  
-Find out how much memory each tool needs. +Christy Hightower wants more feedback on the tools, to say good/bad. Feedback should be added to the wiki [[lecture_notes:04-02-2010|lecture notes for her lecture]].
-Does it need a cluster or just a single machine? etc.+
  
-Do not run anything on the headnode for campusRocks +[[https://banana-slug.soe.ucsc.edu/feed.php|RSS feed]] for wiki
-Learn how to use sungrid to tell it how to run it on the node. +  * Shows recent changes to the wiki.
 +  * See what others have been doing lately. 
 +  * Good way to keep up with changes without having ​to scan every wiki page.
  
-We should all have access now to campusrocks.+===Guest lecturers coming up:=== 
 +  * Mon 19 Apr Dan Zerbino on Velvet. 
 +  * Fri 23 Apr, Janet Leonard and John Pearse on Slug biology.
  
-There'​s a link to some documentation on sunGrid.+We will talk Friday about graph representations.
  
-Can ssh to a machine in the campusrocks grid directly to run small things. +==== Running on Campus Rocks ====
 +  * Find out how much memory each tool needs.
 +  * Does it need a cluster or just a single machine? etc.
  
-Some of the data is up now. +Do not run anything on the headnode for campusRocks.
 +Learn how to use sungrid to run jobs on (one of) the nodes. Alternatively, use the [[http://campusrocks.soe.ucsc.edu/ganglia/|status page]] to find an idle node and ssh to it directly.
 +The [[archive:​computer_resources:​campusrocks|campusrocks page]] has a link to some documentation on sungrid.
  
-David Bernick and Kevin have been fussing with the data. +We should all have access now to campusrocks. If you don't, contact tech staff (IT request).
-He had latest draft 4c. Have all the inversions ​Pyrobaculum.+
  
-Can test assemblers to see how well they work on the small genome.+For testing, there is some Pyrobaculum data on campusrocks now (or soon).  
 +  * David Bernick and Kevin have been fussing with the data. 
 +  * He had latest draft 4c. Have all the inversions. 
 +  * Can test assemblers to see how well they work on the small genome
 +  * 454 and Solid reads. 
 +  * Go ahead and try running it. (Remember: not on the head-node.) 
 +  * Start comparing the different assembly techniques.
  
-We have 454 and Solid reads. 
-Go ahead and try running it, but not on the head-node. 
-Start comparing the different assembly techniques. 
  
-Christy Hightower wants more feedback on the tools, +==== Lower-level Data ====
-to say good/​bad. ​ Can add feedback to the wiki lecture +
-notes on the lib lecture.+
  
-RSS feed for wiki. +=== Instruments === 
-Click the orange triangle upper-right of start page. +  * Sanger capillary 
-Shows recent changes to the wiki. +  * 454 
-See what others have been doing lately.+  * Solid 
 +  * Illumina 
 +  * Ion Torrent
  
-Guest lecturers coming up. +Sanger creates a trace. 454, Solid and Illumina take images with a camera. Ion Torrent uses direct chip pH measurement.
  
-Mon week after Dan Zerbino +=== Traces ===
-Slug biology.+  * 4 1-D traces (wiggles) overlapping;​ one for each of ACGT. 
 +  * Each trace tells what there is at a position. 
 +  * Peaks are broadened, and the end of a read is worse than the beginning.
 +  * Several identical bases in a row can produce spread-out peaks, making it difficult to tell how many there are.
 +  * NCBI has large archives of trace data for abandoned projects. 
 +  * Have a terminator on each seq. 
 +  
 +=== Images === 
 +  * The image files are enormous (TB's of data) and require a great deal of image processing. 
 +  * After processing, the raw images are almost never kept.
 +  * Images are typically monochrome, but SOLiD uses 4 fluorophores at the same time.
 +  * De-convolution problems there too. Spots may overlap.
  
-We will talk friday about graph representations.+Ion Torrent has direct electronic readout, no images.
  
-So today let's talk about +=== Base-calling ===
-== Lower-level Data ==+  * For each position, turn image data into a base (AGCT) and a quality score. 
 +  * Quality means something different on each platform and sometimes even each instrument (Sanger). 
 +    (Correction to what I said in lecture: quality values are **supposed** to be -10 log<​sub>​10</​sub>​ P(error), but calibration is sometimes not very accurate. --- //​[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/04/09 07:18//) 
 +  * May have initial (known) sequences that are used to calibrate quality.
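
The quality definition in the correction above can be sketched in a few lines of Python. This is only an illustration of the nominal formula Q = -10 log10 P(error); the function names are made up for this example, and real instruments may be calibrated differently, as noted above.

```python
import math


def phred_from_error_prob(p_error):
    """Nominal quality score: Q = -10 * log10(P(error))."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_prob_from_phred(q):
    """Inverse mapping: P(error) = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)
```

Under this definition a quality of 20 corresponds to a 1-in-100 chance that the base call is wrong, and a quality of 30 to 1 in 1000.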
  
-  Sanger capillary +=== Spaces ===
-  454 +
-  Solid +
-  Illumina +
-  Ion Torrent+
  
-454,​solid,​illumina take images with camera. +  * Base-space (A/C/G/T)
-Ion torrent uses direct chip ph measurement.+
  
-The image files are enormous and require a great deal +  * Color-space (One of four colors corresponding to the change from previous base) 
-of image processing which cooks them way down.+    * Used by SOLiD 
 +  * Flow-space (A/C/G/T and length ​of repeat)
  
-For Sanger, you get a trace. 
-4 1-D wiggles overlapping. 
-ACGT 
-Each trace tells what there is at a position. 
-Peaks are broadened, end of read worse than beginning, 
-Can get several in a row that are spread. 
-Trace archives at NIH for public genome archives that never got finished. 
-Have a terminator on each seq. 
-Their problem with homopolymers ​ 
-is at end of reads with broad peaks merging into eachother. 
- 
-Images are typically monochrome. 
-(but SOLiD use 4 flourophores at the same time) 
-De-convolution problems there too. Spots may overlap. 
-Images are usually discarded, TB's of data. 
-Ion Torrent has direct electronic readout, no images. 
- 
-== Base-calling == 
-AGCT, quality score. 
-but quality means something different on each  
-platform and sometimes even each instrument (Sanger). 
- 
-May have initial images that are used to calibrate. 
- 
-== Spaces == 
- 
-  BASE-space (ACGT fasta file) 
-  Color-space (di-nucleotides,​ used only by SOLiD) 
-  Flow-space (454, Ion torrent) 
  
 +== Base-space ==  ​
 +  * Often in fasta file.
 +  * Used by Illumina.
  
 == Flow-space ==
- +    * Used by sequencing-by-synthesis methods (454, Ion Torrent)
-Get from sequencing by synthesis +    * Multiple copies of the same homo-nucleotide are added in a single step, and you get an (imperfect) signal of how many.
-with ordinary nucleotides, get multiple copies +    * Signal gets worse (less specific) for higher values.
-of same homo-nucleotide added in a single step. +    * Analogous to run-length encoding
- +    * Often not integer values. 
-Ion Torrent ​like 454 is flow-space. ​ The hydrogen +    ​* ​Ion Torrent is more linear ​than 454, but still has issues. 
-ions are more linear, but still has issues. +    ​* ​Alignments in flow-space are possible.
- +
-Sort of like run-length encoding. +
-You say what base and then how many times it was found in a row. +
-Alignments in flow-space are possible. +
- +
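
A rough sketch of the run-length idea behind flow-space, in Python. The flow order "TACG" and the function names are assumptions made for this illustration, and the encoder produces an ideal, noise-free flowgram; real signals are non-integer, which is why the decoder rounds.

```python
from itertools import cycle


def to_flowgram(seq, flow_order="TACG"):
    """Ideal flow-space encoding: one (base, run length) pair per flow."""
    if any(base not in flow_order for base in seq):
        raise ValueError("read contains a base that is never flowed")
    flows = []
    pos = 0
    for base in cycle(flow_order):
        if pos >= len(seq):
            break
        run = 0
        # Every consecutive copy of the flowed base is added in this one step.
        while pos < len(seq) and seq[pos] == base:
            run += 1
            pos += 1
        flows.append((base, run))
    return flows


def from_flowgram(flows):
    """Call bases by rounding each (possibly non-integer) signal."""
    return "".join(base * round(signal) for base, signal in flows)
```

A flow whose base does not match the next base of the read simply records a run of zero, so the flowgram is usually longer than the list of runs in the read.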
  
 == Color-space ==
 +  * 4 colors, numbered 0 to 3.
  
-One major reason they did this was to avoid a patent. +^ number ^ binary ^ color  ^ meaning                     ^ transitions           ^
-More independence in the sequencing errors. +| 0      | 00     | blue   | same base                   | (A->A C->C G->G T->T) |
- +| 1      | 01     | green  | non-complement transversion | (A->C C->A G->T T->G) |
-colors 0 to 3. +| 2      | 10     | yellow | transition                  | (A->G C->T G->A T->C) |
-  0 00 blue   means (AA CC GG TT) +| 3      | 11     | red    | complement                  | (A->T C->G G->C T->A) |
-  1 01 green  means (AC CA GT TG)
-  2 10 yellow means (AG CT GA TC)
-  3 11 red    means (AT CG GC TA)
- +
-XOR is associative and commutative. +
-This XOR operation is also works brilliantly with the Klein four group +
-for the bases A C G T+
-You get from one base to a color, or vice versa with XOR.+
  
-  A 0 00 +  ​* See /​cse/​faculty/​karplus/​pluck/​scripts/​map-colorspace 
-  C 1 01 +  * One major reason they used this was to avoid a patent.  
-  G 2 10 +  * Allows more independence in the sequencing errors. 
-  T 3 11+  * Binary representations are useful. 
 +    * XOR is associative and commutative. 
 +    * This XOR operation also works brilliantly with the Klein four-group for the bases A C G T.
 +    * You get from one base to a color, or vice versa with XOR. 
 +      * A 0 00 
 +      ​* ​C 1 01 
 +      ​* ​G 2 10 
 +      ​* ​T 3 11
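
The XOR relationship in the list above can be sketched as follows, using the two-bit codes A=0, C=1, G=2, T=3 from the notes. The function names are made up for this illustration; this is not the map-colorspace script mentioned above.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colors(bases):
    """Color for each adjacent pair of bases: XOR of their two-bit codes."""
    codes = [BASE_CODE[b] for b in bases]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def decode_colors(first_base, colors):
    """Recover base-space from a known first base plus the colors."""
    code = BASE_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # previous base XOR color gives the next base
        bases.append(CODE_BASE[code])
    return "".join(bases)
```

For example, `encode_colors("ATGGC")` gives `[3, 1, 0, 3]`, and decoding those colors starting from "A" returns the original bases.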
  
 The di-nucleotide is simply saying,
Line 202: Line 184:
  
  
-Note that when an error happens, all bases in the read down-stream +Note that when a color error happens, all bases down-stream in the read will be wrong once decoded to base-space. This is the reason that people bother to try to use color-space, because then the error stays localized.
-will be wrong in base-space. ​ This is the reason that people bother +
-to try to use color-space,​ because then the error stays localized. +
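
A small Python illustration of why a single color error is localized in color-space but not in base-space. The colors here are made-up example data, and the two-bit codes A=0, C=1, G=2, T=3 are the ones used elsewhere in these notes.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def decode_colors(first_base, colors):
    """Decode colors to bases by XOR-ing each color onto the previous base."""
    code = BASE_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color
        bases.append("ACGT"[code])
    return "".join(bases)


true_colors = [3, 1, 0, 3, 2, 2]
bad_colors = list(true_colors)
bad_colors[2] ^= 1  # a single miscalled color

good_bases = decode_colors("A", true_colors)
bad_bases = decode_colors("A", bad_colors)

# One mismatch in color-space, but every decoded base after it is wrong.
color_mismatches = sum(a != b for a, b in zip(true_colors, bad_colors))
base_mismatches = sum(a != b for a, b in zip(good_bases, bad_bases))
```

Comparing in color-space shows one mismatch; comparing the decoded bases shows a mismatch at every position from the error to the end of the read.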
- +
-When doing SNP calling, want to know if it is a SNP or a read-error. ​  +
-The read-errors are independent typically. ​  +
-But the SNP will have coordinated changes. +
-Either a larger change, mismapping, error, or something else. +
-SOLID makes a big deal out of this. +
-Not however useful for other non-SNP-calling things. +
-Even with millions of reads, you can get false-positive SNPs at a low error rate. +
- +
-== Quality ==+
  
-Base-space and color-space comes with quality scores. +When doing SNP calling, you want to know whether a difference is a SNP or a read error. Read errors are typically independent, but a SNP will have coordinated changes. Anything else is either a larger change, a mismapping, an error, or something else. SOLiD makes a big deal out of this. It is not, however, useful for other non-SNP-calling things. With millions of reads, even a low error rate can produce false-positive SNPs.
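
To make "coordinated changes" concrete, here is a minimal Python sketch comparing a reference with a read that carries one SNP. The sequences and the helper name `changed_color_positions` are invented for this example; a real SNP caller would do much more.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colors(bases):
    """Color for each adjacent pair of bases: XOR of their two-bit codes."""
    codes = [BASE_CODE[b] for b in bases]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def changed_color_positions(reference, read):
    """Indices where the color-space encodings of two sequences differ."""
    pairs = zip(encode_colors(reference), encode_colors(read))
    return [i for i, (a, b) in enumerate(pairs) if a != b]


reference = "ACGTACGT"
snp_read = "ACGTTCGT"  # single base change (A->T) at index 4

snp_changes = changed_color_positions(reference, snp_read)
```

An interior SNP changes the two adjacent colors that touch the changed base, whereas an isolated read error changes only one color.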
-Flowspace does not have such? have to check. +
-SFS format is the flowspace format ​for input  +
-into the Newbler assembler  Does it have any +
-independent quality measurement?​+
  
-A large number of the assemblers throw away the quality data. +=== Quality ===
-Or only use it later. ​ Some use it to just throw away reads with +
-low quality.+
  
-Sanger fails because of electrophoresis, not the sanger chemistry itself, +Base-space, flow-space, and color-space all come with quality scores.
-as far as getting long reads. ​ Out to about 1000 bases.+
  
-454 quality drops off also. Synthesis starts to get out of phase. +SFF format is the flowspace format for input into the Newbler assembler. It has quality scores for each base using the standard -10 log<sub>10</sub> P(error).
 +([[http://​www.ncbi.nlm.nih.gov/​Traces/​trace.cgi?​cmd=show&​f=formats&​m=doc&​s=format#​sff|SFF format]])
  
-Solid - lose yield on ligation.  ​Missing ligations.+A large number of the assemblers throw away the quality data or only use it later.  ​Some use it to just throw away reads with low quality.
  
-Illumina problem with frequent washing removes template. Kevin thinks.+== Reasons for quality dropoff == 
 +  * Sanger: long reads are limited by electrophoresis, not by the Sanger chemistry itself. Reads go out to about 1000 bases.
 +  * 454: synthesis starts to get out of phase.
 +  * Solid: loses yield on ligation. Missing ligations.
 +  * Illumina: frequent washing removes template (Kevin thinks).
  
  
-== Memory == +=== Memory ===
 How do you represent this stuff in memory?
-Two bits per base. +Two bits per base (four possible values).
 With color-space, can choose them to fit what they should be.
-If read is not too-variable length, can fit in 64-bit integer. +If read is not too-variable length, can fit 32 bases into a 64-bit integer.
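
A minimal sketch of the two-bits-per-base packing, in Python. The function names are made up for this illustration. Python integers are unbounded, so the 64-bit limit is enforced only by the length check; in C the word would be a `uint64_t`.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_bases(bases):
    """Pack up to 32 bases into one integer that fits in 64 bits."""
    if len(bases) > 32:
        raise ValueError("at most 32 bases fit in a 64-bit word")
    word = 0
    for base in bases:
        word = (word << 2) | BASE_CODE[base]
    return word


def unpack_bases(word, length):
    """Unpack; the read length must be stored separately."""
    shifts = range(2 * (length - 1), -1, -2)
    return "".join("ACGT"[(word >> shift) & 3] for shift in shifts)
```

Because A is coded as 00, leading A's vanish into the word's leading zeros, which is why the length has to be known and why reads of similar length are the convenient case.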
  
 SOLID produces cs-fasta file. (cs = colorspace)
Line 252: Line 220:
 So this helps avoid problems that would otherwise happen.
  
-Sometimes don't know what strand you are working on. +Sometimes you don't know what strand you are working on. To get the reverse-complement equivalent in color-space, all you have to do is reverse the colors. No complementing is needed.
-To get reverse-complement equivalent in color-space +
-all you have to do is the reversal. ​ No complementing is needed.+
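
The reverse-complement property can be checked with a short Python sketch. Complementing a base is XOR with 3 (binary 11) in the A=0, C=1, G=2, T=3 coding, and that cancels out when two adjacent bases are XOR-ed, so only the reversal remains. The read and the function names are made up for this illustration.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode_colors(bases):
    """Color for each adjacent pair of bases: XOR of their two-bit codes."""
    codes = [BASE_CODE[b] for b in bases]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def reverse_complement(bases):
    """Base-space reverse complement."""
    return "".join(COMPLEMENT[b] for b in reversed(bases))


read = "ATGGCTC"
colors_of_revcomp = encode_colors(reverse_complement(read))
reversed_colors = encode_colors(read)[::-1]
# The two lists are equal: reversing the colors is all that is needed.
```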
  
-One thing you can do when mapping is handle both strands. +One thing you can do when mapping is handle both strands. But you still have to hash the reversed colorspace too, so don't save memory
-But you still have to hash the reversed colorspace too, so don't save memory+
 in searches.  Hashing a genome takes a lot of space.
  
-== Final Business == +==== Final Business ====
  
 Journal club papers should be fairly short.
lecture_notes/04-07-2010.1270718744.txt.gz · Last modified: 2010/04/08 09:25 by galt