lecture_notes:04-07-2010

Differences

This shows you the differences between two versions of the page.

lecture_notes:04-07-2010 [2010/04/08 02:14]
galt
lecture_notes:04-07-2010 [2015/09/14 11:40] (current)
68.180.230.228 ↷ Links adapted because of a move operation
Line 1: Line 1:
-== Class Business == +==== Class Business ====
  
-Communicate to Jeff and Jenny +Communicate about offloading assembler installation to Jeff and Jenny since they weren't there on Monday.
-Offload something ​since they were'nt +
-there on Monday.+
  
-Make a review articles page at a high level +Make a review articles page at a high level with citations. People can comment.
-with citations. People can comment.+
  
 Use the forum to discuss things. Use the forum to discuss things.
-Each person must sign up for the forum independently. +  * Each person must sign up for the forum independently. 
-Forum works better than email, +  ​* ​Forum works better than email, because you can go back later to that subject. 
-because you can go back later to that subject. +  * Email has immediate impact, but not so easily searchable.
-email has immediate impact, but not so easily searchable.+
  
-People should read the de-novo assemblers review paper so that +People should read the de-novo assemblers review paper so that they will be ready for Friday's lecture.
-they will be ready Friday'​s lecture. ​  +  ​* ​(This has been added to new review articles page) 
-(This has been added to new review articles page) +  ​* ​Discusses Overlap and de-Bruijn graphs.
-Discusses Overlap and de-Bruijn graphs.+
  
-454 Newbler assembler is entirely proprietary. +454 Newbler assembler is entirely proprietary and almost nothing is known about how it works internally. The only description is in the supplementary material of the original 454 method paper ((Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005.|http://dx.doi.org/10.1038/nature03959)).
  
-Find out how much memory each tool needs. +Christy Hightower wants more feedback on the tools, to say good/bad. Feedback should be added to the wiki [[lecture_notes:04-02-2010|lecture notes for her lecture]].
-Does it need a cluster or just a single machine? etc.+
  
-Do not run anything on the headnode for campusRocks. +[[https://banana-slug.soe.ucsc.edu/feed.php|RSS feed]] for wiki.
-Learn how to use sungrid to tell it how to run it on the node. +  * Shows recent changes to the wiki.
 +  * See what others have been doing lately. 
 +  * Good way to keep up with changes without having ​to scan every wiki page.
  
-We should all have access now to campusrocks.+===Guest lecturers coming up:=== 
 +  * Mon 19 Apr Dan Zerbino on Velvet. 
 +  * Fri 23 Apr, Janet Leonard and John Pearse on Slug biology.
  
-There'​s a link to some documentation on sunGrid.+We will talk Friday about graph representations.
  
-Can ssh to a machine ​in the campusrocks grid directly to run small things.+==== Running on Campus Rocks ==== 
 +  * Find out how much memory each tool needs. 
 +  * Does it need a cluster or just a single machine? etc.
  
-Some of the data is up now. +Do not run anything on the headnode for campusRocks.
 +Learn how to use sungrid to run jobs on (one of) the nodes. Alternatively, use the [[http://campusrocks.soe.ucsc.edu/ganglia/|status page]] to find an idle node and ssh to it directly.
 +The [[archive:computer_resources:campusrocks|campusrocks page]] has a link to some documentation on sungrid.
  
-David Bernick and Kevin have been fussing with the data. +We should all have access now to campusrocks. If you don't, contact tech staff (IT request).
-He had latest draft 4c. Have all the inversions ​Pyrobaculum.+
  
-Can test assemblers to see how well they work on the small genome. +For testing, there is some Pyrobaculum data on campusrocks now (or soon).
 +  * David Bernick and Kevin have been fussing with the data. 
 +  * He had latest draft 4c. Have all the inversions. 
 +  * Can test assemblers to see how well they work on the small genome
 +  * 454 and Solid reads. 
 +  * Go ahead and try running it. (Remember: not on the head-node.) 
 +  * Start comparing the different assembly techniques.
  
-We have 454 and Solid reads. 
-Go ahead and try running it, but not on the head-node. 
-Start comparing the different assembly techniques. 
  
-Christy Hightower wants more feedback on the tools, +==== Lower-level Data ====
-to say good/​bad. ​ Can add feedback to the wiki lecture +
-notes on the lib lecture.+
  
-RSS feed for wiki. +=== Instruments === 
-Click the orange triangle upper-right of start page. +  * Sanger capillary 
-Shows recent changes to the wiki. +  * 454 
-See what others have been doing lately.+  * Solid 
 +  * Illumina 
 +  * Ion Torrent
  
-Guest lecturers coming up. +Sanger creates a trace. 454, Solid and Illumina take images with a camera. Ion Torrent uses direct chip pH measurement.
  
-Mon week after, Dan Zerbino +=== Traces ===
-Slug biology.+  * 4 1-D traces (wiggles) overlapping;​ one for each of ACGT. 
 +  * Each trace tells what there is at a position. 
 +  * Peaks are broadened and end of a read is worse than beginning. 
 +  * Can get several in a row that are spread out making it difficult to tell how many you have. 
 +  * NCBI has large archives of trace data for abandoned projects. 
 +  * Have a terminator on each seq. 
 +  
 +=== Images === 
 +  * The image files are enormous (terabytes of data) and require a great deal of image processing.
 +  * After processing, the raw images are almost never kept.
 +  * Images are typically monochrome, but SOLiD uses 4 fluorophores at the same time.
 +  * De-convolution problems there too. Spots may overlap.
  
-We will talk friday about graph representations.+Ion Torrent has direct electronic readout, no images.
  
-So today let's talk about +=== Base-calling ===
-== Lower-level Data ==+  * For each position, turn image data into a base (AGCT) and a quality score. 
 +  * Quality means something different on each platform and sometimes even each instrument (Sanger). 
 +    (Correction to what I said in lecture: quality values are **supposed** to be -10 log<sub>10</sub> P(error), but calibration is sometimes not very accurate. --- //[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/04/09 07:18//)
 +  * May have initial (known) sequences that are used to calibrate quality.
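
The nominal quality definition in the correction above, Q = -10 log<sub>10</sub> P(error), can be sketched in a few lines of Python. The function names here are illustrative only, and the sketch assumes a perfectly calibrated score, which the notes warn is often not the case on real instruments.

```python
import math

def phred_from_error(p_error: float) -> float:
    """Nominal quality score for a given probability that the base call is wrong."""
    return -10.0 * math.log10(p_error)

def error_from_phred(q: float) -> float:
    """Inverse mapping: error probability implied by a quality score."""
    return 10.0 ** (-q / 10.0)

# Each step of 10 in Q is a tenfold drop in the nominal error probability.
for q in (10, 20, 30, 40):
    print(q, error_from_phred(q))
```

Under this definition Q20 corresponds to a 1-in-100 chance of a wrong call and Q30 to 1-in-1000.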
  
-  Sanger capillary +=== Spaces ===
-  454 +
-  Solid +
-  Illumina +
-  Ion Torrent+
  
-454,​solid,​illumina take images with camera. +  * Base-space (A/C/G/T)
-Ion torrent uses direct chip ph measurement.+
  
-The image files are enormous and require a great deal +  * Color-space (One of four colors corresponding to the change from previous base) 
-of image processing which cooks them way down.+    * Used by SOLiD 
 +  * Flow-space (A/C/G/T and length ​of repeat)
  
-For Sanger, you get a trace. 
-4 1-D wiggles overlapping. 
-ACGT 
-Each trace tells what there is at a position. 
-Peaks are broadened, end of read worse than beginning, 
-Can get several in a row that are spread. 
-Trace archives at NIH for public genome archives that never got finished. 
-Have a terminator on each seq. 
-Their problem with homopolymers ​ 
-is at end of reads with broad peaks merging into eachother. 
- 
-Images are typically monochrome. 
-(but SOLiD use 4 flourophores at the same time) 
-De-convolution problems there too. Spots may overlap. 
-Images are usually discarded, TB's of data. 
-Ion Torrent has direct electronic readout, no images. 
- 
-== Base-calling == 
-AGCT, quality score. 
-but quality means something different on each  
-platform and sometimes even each instrument (Sanger). 
- 
-May have initial images that are used to calibrate. 
- 
-== Spaces == 
- 
-  BASE-space (ACGT fasta file) 
-  Color-space (di-nucleotides,​ used only by SOLiD) 
-  Flow-space (454, Ion torrent) 
  
 +== Base-space ==  ​
 +  * Often in fasta file.
 +  * Used by Illumina.
  
 == Flow-space ==  == Flow-space == 
- +    * Used by sequencing-by-synthesis methods (454, Ion Torrent)
-Get from sequencing by synthesis +    * Multiple of the same homo-nucleotide are added in a single step and you get an (imperfect) signal of how many.
-with ordinary nucleotides, get multiple copies +    * Signal gets worse (less specific) for higher values.
-of same homo-nucleotide added in a single step. +    * Analogous to run length encoding 
- +    * Often not integer values. 
-Ion Torrent ​like 454 is flow-space. ​ The hydrogen +    ​* ​Ion Torrent is more linear ​than 454, but still has issues. 
-ions are more linear, but still has issues. +    ​* ​Alignments in flow-space are possible.
- +
-Sort of like run-length encoding. +
-You say what base and then how many times it was found in a row. +
-Alignments in flow-space are possible. +
- +
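
The run-length analogy for flow-space can be illustrated with a small Python sketch. It is a simplification under stated assumptions: real flow data follows a fixed nucleotide flow order and has platform-specific noise, whereas here the flows are just hypothetical (base, signal) pairs and the caller simply rounds each signal to the nearest integer.

```python
from itertools import groupby

def to_runs(seq):
    """Run-length encode a base sequence into (base, count) pairs."""
    return [(base, len(list(group))) for base, group in groupby(seq)]

def from_runs(runs):
    """Expand (base, count) pairs back into a base sequence."""
    return "".join(base * count for base, count in runs)

def call_runs(flows):
    """Turn noisy (base, signal) flows into integer runs by rounding.

    Flows whose signal rounds to zero mean the base was not incorporated.
    """
    return [(base, round(signal)) for base, signal in flows if round(signal) > 0]

# Hypothetical noisy signals: T absent, then AAA, C, GG.
flows = [("T", 0.1), ("A", 2.9), ("C", 1.2), ("G", 2.1)]
print(from_runs(call_runs(flows)))
```

Rounding is where the homopolymer problem shows up: a signal of 5.5 is much harder to call than 1.1, which matches the note that the signal gets less specific for higher values.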
  
 == Color-space == == Color-space ==
 +  * 4 colors, numbered 0 to 3.
  
-One major reason they did this was to avoid a patent. ​ +^ number ^ binary ^ color  ^ meaning ​                    ^ transitions ​          ^ 
-More independence in the sequencing errors.+| 0      | 00     | blue   | same base                   | (A->A C->C G->G T->T) | 
 +| 1      | 01     | green  | non-complement transversion | (A->C C->A G->T T->G) | 
 +| 2      | 10     | yellow | transition ​                 | (A->G C->T G->A T->C) | 
 +| 3      | 11     | red    | complement ​                 | (A->T C->G G->C T->A) |
  
-colors 0 to 3 +  * See /cse/faculty/karplus/pluck/scripts/map-colorspace
-  ​0 00 blue   means (AA CC GG TT) +  * One major reason they used this was to avoid a patent.  
-  ​1 01 green  means (AC CA GT TG) +  ​* Allows more independence in the sequencing errors. 
-  2 10 yellow means (AG CT GA TC) +  ​* Binary representations are useful. 
-  3 11 red    means (AT CG GC TA) +    * XOR is associative and commutative.
- +    * This XOR operation also works brilliantly with the Klein four group for the bases A C G T.
-XOR is associative and commutative. +    ​* ​You get from one base to a color, or vice versa with XOR. 
-This XOR operation is also works brilliantly with the Klein four group +      ​* ​A 0 00 
-for the bases A C G T. +      ​* ​C 1 01 
-You get from one base to a color, or vice versa with XOR. +      ​* ​G 2 10 
- +      ​* ​T 3 11
-  ​A 0 00 +
-  C 1 01 +
-  G 2 10 +
-  T 3 11+
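
A minimal Python sketch of the XOR relationship, using the two-bit codes listed above (A=0, C=1, G=2, T=3). The function names are illustrative and are not taken from the map-colorspace script mentioned in the notes.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def encode_colors(seq):
    """Color of each dinucleotide = XOR of the codes of its two bases."""
    codes = [CODE[base] for base in seq]
    return [left ^ right for left, right in zip(codes, codes[1:])]

def decode_colors(first_base, colors):
    """Recover the bases from a known first base plus the color sequence."""
    code = CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color          # next base = previous base XOR color
        bases.append(BASES[code])
    return "".join(bases)

colors = encode_colors("ATGGC")
print(colors)
print(decode_colors("A", colors))
```

Because XOR is its own inverse, the same operation converts base to color and color back to base; only the first base has to be known.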
  
 The di-nucleotide is simply saying, The di-nucleotide is simply saying,
Line 155: Line 137:
   C -- G    color3 == color 11   C -- G    color3 == color 11
 Each nucleotide in the final sequence is used Each nucleotide in the final sequence is used
-as the right have of one dinucleotide, and then +as the right half of one dinucleotide, and then
 the left half of the next dinucleotide. the left half of the next dinucleotide.
 The first letter A is given  The first letter A is given 
Line 202: Line 184:
  
  
-Note that when an error happens, all bases in the read down-stream +Note that when an error happens, all bases in the read down-stream will be wrong in base-space. ​ This is the reason that people bother to try to use color-space,​ because then the error stays localized.
-will be wrong in base-space. ​ This is the reason that people bother +
-to try to use color-space,​ because then the error stays localized.+
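
The difference between a read error and a SNP can be seen in a small Python sketch. It assumes the two-bit codes A=0, C=1, G=2, T=3 from the notes; the sequences and helper names are made up for illustration.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def encode_colors(seq):
    codes = [CODE[base] for base in seq]
    return [left ^ right for left, right in zip(codes, codes[1:])]

def decode_colors(first_base, colors):
    code = CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color
        bases.append(BASES[code])
    return "".join(bases)

def diff_positions(a, b):
    """Indices at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

reference = "ATGGC"
ref_colors = encode_colors(reference)

# A single color read error: one color differs, but decoding to
# base-space corrupts every base downstream of it.
bad_colors = list(ref_colors)
bad_colors[1] = 2
decoded = decode_colors("A", bad_colors)
read_error_colors = diff_positions(ref_colors, bad_colors)
read_error_bases = diff_positions(reference, decoded)

# A real single-base change (SNP) alters two adjacent colors.
snp_colors = encode_colors("ATCGC")
snp_color_changes = diff_positions(ref_colors, snp_colors)

print(read_error_colors, read_error_bases, snp_color_changes)
```

An isolated color mismatch is therefore more likely a read error, while two adjacent, consistent color mismatches are the signature expected from a SNP.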
  
-When doing SNP calling, want to know if it is a SNP or a read-error. +When doing SNP calling, want to know if it is a SNP or a read-error. The read-errors are typically independent. But the SNP will have coordinated changes. Either a larger change, mismapping, error, or something else. SOLID makes a big deal out of this. Not however useful for other non-SNP-calling things. Even with millions of reads, you can get false-positive SNPs at a low error rate.
-The read-errors are independent typically. ​  +
-But the SNP will have coordinated changes. +
-Either a larger change, mismapping, error, or something else. +
-SOLID makes a big deal out of this. +
-Not however useful for other non-snp-calling things. +
-Even with millions of reads, you can get false-positive SNPs at a low error rate.+
  
-== Quality == +=== Quality ===
  
-Base-space and color-space ​comes with quality. +Base-space, flow-space, ​and color-space ​all come with quality ​scores.
-Flowspace does not have such? have to check. +
-SFS format is the flowspace format for input  +
-into the newbler assembler. ​  Does it have any +
-independent quality measurement?​+
  
-A large number of the assemblers throw away the quality data. +SFF format is the flowspace format for input into the Newbler assembler. It has quality scores for each base using the standard -10 log<sub>10</sub> error probability.
-Or only use it later. Some use it to just throw away reads with +([[http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#sff|SFF format]])
-low quality.+
  
-Sanger fails because ​of electrophoresis,​ not the sanger chemistry itself, +A large number ​of the assemblers throw away the quality data or only use it later.  ​Some use it to just throw away reads with low quality.
-as far as getting long reads.  ​Out to about 1000 bases.+
  
-454 quality ​drops off also.  ​Synthesis ​starts to get out of phase. +== Reasons for quality ​dropoff == 
- +  * Sanger fails because of electrophoresis,​ not the sanger chemistry itself, as far as getting long reads.  ​Out to about 1000 bases. 
-Solid - lose yield on ligation. ​ Missing ligations. +  * 454 synthesis ​starts to get out of phase. 
- +  ​* ​Solid loses yield on ligation. ​ Missing ligations. 
-Illumina problem with frequent washing removes template. Kevin thinks. +  * Illumina: frequent washing removes template (Kevin thinks).
  
  
 +=== Memory ===
 How do you represent this stuff in memory? How do you represent this stuff in memory?
-Two bits per base. +Two bits per base (four possible values)
 With color-space,​ can choose them to fit what they should be. With color-space,​ can choose them to fit what they should be.
-If read is not too-variable length, can fit in 64-bit integer. +If read is not too-variable length, can fit 32 bases into a 64-bit integer.
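
The two-bits-per-base packing can be sketched in Python as follows. This is only an illustration of the arithmetic (32 bases x 2 bits = 64 bits); the function names are hypothetical, and Python integers are unbounded, so the 64-bit limit is enforced explicitly.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    """Pack up to 32 bases into one integer that fits in 64 bits."""
    if len(seq) > 32:
        raise ValueError("at most 32 bases fit in a 64-bit word")
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]   # append two bits per base
    return value

def unpack(value, length):
    """Recover the bases; the length must be stored separately."""
    bases = []
    for shift in range(2 * (length - 1), -1, -2):
        bases.append(BASES[(value >> shift) & 3])
    return "".join(bases)

word = pack("ACGT")
print(word, unpack(word, 4))
```

The packed word does not record its own length (leading A's are just zero bits), which is one reason fixed or nearly fixed read lengths make this representation convenient.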
  
 SOLID produces cs-fasta file. (cs = colorspace) SOLID produces cs-fasta file. (cs = colorspace)
 It is a T (the last base of the first adapter?) It is a T (the last base of the first adapter?)
-T 00100 ...+  ​T 00100 ...
  
 Sometimes we want to do matching directly in colorspace. Sometimes we want to do matching directly in colorspace.
Line 251: Line 220:
 So this helps avoid problems that would otherwise happen. So this helps avoid problems that would otherwise happen.
  
-Sometimes don't know what strand you are working on. +Sometimes don't know what strand you are working on. To get reverse-complement equivalent in color-space all you have to do is the reversal. ​ No complementing is needed.
-To get reverse-complement equivalent in color-space +
-all you have to do is the reversal. ​ No complementing is needed.+
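
A short Python check of the reversal property, assuming the codes A=0, C=1, G=2, T=3 from the notes. Complementing a base is the same as XOR-ing its code with 3, so the XOR of two adjacent bases is unchanged by complementing both; only the order of the colors flips. The helper names are illustrative.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def encode_colors(seq):
    codes = [CODE[base] for base in seq]
    return [left ^ right for left, right in zip(codes, codes[1:])]

def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))

seq = "ATGGC"
forward = encode_colors(seq)
reverse_strand = encode_colors(reverse_complement(seq))

# The colors of the reverse-complement strand are the forward colors reversed.
print(forward, reverse_strand)
```

This covers only the colors themselves; the known leading base of a cs-fasta read still has to be handled separately.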
  
-One thing you can do when mapping is handle both strands. +One thing you can do when mapping is handle both strands. But you still have to hash the reversed colorspace too, so don't save memory
-But you still have to hash the reversed colorspace too, so don't save memory+
 in searches. ​ Hashing a genome takes a lot of space. in searches. ​ Hashing a genome takes a lot of space.
 +
 +==== Final Business ====
  
 Journal club papers should be fairly short. Journal club papers should be fairly short.
lecture_notes/04-07-2010.1270718044.txt.gz · Last modified: 2010/04/08 02:14 by galt