lecture_notes:04-07-2010

Differences

This shows you the differences between two versions of the page.

lecture_notes:04-07-2010 [2010/04/08 09:02]
galt
lecture_notes:04-07-2010 [2010/04/11 16:01]
karplus changed log10 to -10 log<sub>10</sub>
Line 1: Line 1:
-== Class Business ==+==== Class Business ​====
  
-Communicate to Jeff and Jenny +Communicate about offloading assembler installation to Jeff and Jenny since they weren't there on Monday.
-Offload something ​since they were'nt +
-there on Monday.+
  
-Make a review articles page at a high level +Make a review articles page at a high level with citations. People can comment.
-with citations. People can comment.+
  
 Use the forum to discuss things. Use the forum to discuss things.
-Each person must sign up for the forum independently. +  * Each person must sign up for the forum independently. 
-Forum works better than email, +  ​* ​Forum works better than email, because you can go back later to that subject. 
-because you can go back later to that subject. +  * Email has immediate impact, but not so easily searchable.
-email has immediate impact, but not so easily searchable.+
  
-People should read the de-novo assemblers review paper so that +People should read the de-novo assemblers review paper so that they will be ready for Friday's lecture.
-they will be ready Friday'​s lecture. ​  +  ​* ​(This has been added to new review articles page) 
-(This has been added to new review articles page) +  ​* ​Discusses Overlap and de-Bruijn graphs.
-Discusses Overlap and de-Bruijn graphs.+
  
-454 Newbler assembler is entirely proprietary.+454 Newbler assembler is entirely proprietary ​and almost nothing is known on how it works internally.
  
-Find out how much memory each tool needs. +Christy Hightower wants more feedback on the tools, to say good/bad. Feedback should be added to the wiki [[lecture_notes:04-02-2010|lecture notes for her lecture]].
-Does it need a cluster or just a single machine? etc.+
  
-Do not run anything on the headnode ​for campusRocks+[[https://​banana-slug.soe.ucsc.edu/​feed.php|RSS feed]] ​for wiki
-Learn how to use sungrid ​to tell it how to run it on the node.+  * Shows recent changes ​to the wiki.  
 +  * See what others have been doing lately. 
 +  * Good way to keep up with changes without having ​to scan every wiki page.
  
-We should all have access now to campusrocks.+===Guest lecturers coming up:=== 
 +  * Mon 19 Apr Dan Zerbino on Velvet. 
 +  * Fri 23 Apr, Janet Leonard and John Pearse on Slug biology.
  
-There'​s a link to some documentation on sunGrid.+We will talk Friday about graph representations.
  
-Can ssh to a machine ​in the campusrocks grid directly to run small things.+==== Running on Campus Rocks ==== 
 +  * Find out how much memory each tool needs. 
 +  * Does it need a cluster or just a single machine? etc.
  
-Some of the data is up now.+Do not run anything on the headnode for campusRocks. 
 +Learn how to use sungrid to tell it how to run it on (one of) the nodes. Alternatively,​ use the [[http://​campusrocks.soe.ucsc.edu/​ganglia/​|status page]] to find an idle node and ssh to it directly. 
 +The [[computer_resources:​campusrocks|campusrocks page]] has a link to some documentation on sungrid.
  
-David Bernick and Kevin have been fussing with the data. +We should all have access now to campusrocks. If you don't, contact tech staff (IT request).
-He had latest draft 4c. Have all the inversions ​Pyrobaculum.+
  
-Can test assemblers to see how well they work on the small genome.+For testing, there is some Pyrobaculum data on campusrocks now (or soon).  
 +  * David Bernick and Kevin have been fussing with the data. 
 +  * He had latest draft 4c. Have all the inversions. 
 +  * Can test assemblers to see how well they work on the small genome
 +  * 454 and Solid reads. 
 +  * Go ahead and try running it. (Remember: not on the head-node.) 
 +  * Start comparing the different assembly techniques.
  
-We have 454 and Solid reads. 
-Go ahead and try running it, but not on the head-node. 
-Start comparing the different assembly techniques. 
  
-Christy Hightower wants more feedback on the tools, +==== Lower-level Data ====
-to say good/​bad. ​ Can add feedback to the wiki lecture +
-notes on the lib lecture.+
  
-RSS feed for wiki. +=== Instruments === 
-Click the orange triangle upper-right of start page. +  * Sanger capillary 
-Shows recent changes to the wiki. +  * 454 
-See what others have been doing lately.+  * Solid 
 +  * Illumina 
 +  * Ion Torrent
  
-Guest lecturers coming up.+Sanger creates a trace. 454, Solid and Illumina take images with camera. Ion torrent uses direct chip pH measurement.
  
-Mon week after, Dan Zerbino+=== Traces ===
-Slug biology.+  * 4 1-D traces (wiggles) overlapping;​ one for each of ACGT. 
 +  * Each trace tells what there is at a position. 
 +  * Peaks are broadened and end of a read is worse than beginning. 
 +  * Can get several in a row that are spread out making it difficult to tell how many you have. 
 +  * NCBI has large archives of trace data for abandoned projects. 
 +  * Have a terminator on each seq. 
 +  
 +=== Images === 
 +  * The image files are enormous (TB's of data) and require a great deal of image processing. 
 +  * After processing the raw images are almost never kept. 
 +  * Images are typically monochrome, but SOLiD uses 4 fluorophores at the same time.
 +  * De-convolution problems there too. Spots may overlap.
  
-We will talk friday about graph representations.+Ion Torrent has direct electronic readout, no images.
  
-So today let's talk about  +=== Base-calling ===
-== Lower-level Data ==+  * For each position, turn image data into a base (AGCT) and a quality score. 
 +  * Quality means something different on each platform and sometimes even each instrument (Sanger). 
 +    (Correction to what I said in lecture: quality values are **supposed** to be -10 log<​sub>​10</​sub>​ P(error), but calibration is sometimes not very accurate. --- //​[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/04/09 07:18//) 
 +  * May have initial (known) sequences that are used to calibrate quality.
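
The nominal quality definition in the correction above, Q = -10 log<sub>10</sub> P(error), can be sketched as a pair of conversions. This is only an illustration of the formula; the function names are made up for this example, and real instrument scores may be poorly calibrated, as noted.

```python
import math

def phred_from_error_prob(p_error):
    """Nominal quality score Q = -10 * log10(P(error))."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)

def error_prob_from_phred(q):
    """Inverse conversion: P(error) = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

if __name__ == "__main__":
    # An error probability of 1 in 1000 corresponds to roughly Q30.
    print(phred_from_error_prob(0.001))
    # Q20 corresponds to roughly a 1 in 100 chance of a wrong base call.
    print(error_prob_from_phred(20))
```

So each step of 10 in quality is a factor of 10 in the claimed error probability.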
  
-  Sanger capillary +=== Spaces ===
-  454 +
-  Solid +
-  Illumina +
-  IonTorrent+
  
-454,​solid,​illumina take images with camera. +  * Base-space (A/C/G/T)
-Ion torrent uses direct chip ph measurement.+
  
-The image files are enormous and require a great deal +  * Color-space (One of four colors corresponding to the change from previous base) 
-of image processing which cooks them way down.+    * Used by SOLiD 
 +  * Flow-space (A/C/G/T and length ​of repeat)
  
-For Sanger, you get a trace. 
-4 1-D wiggles overlapping. 
-ACGT 
-Each trace tells what there is at a position. 
-Peaks are broadened, end of read worse than beginning, 
-Can get several in a row that are spread. 
-Trace archives at NIH for public genome archives that never got finished. 
-Have a terminator on each seq. 
-Their problem with homopolymers ​ 
-is at end of reads with broad peaks merging into eachother. 
  
-Images are typically monochrome. +== Base-space ==   
-(but SOLiD use 4 flourophores at the same time) +  ​* Often in fasta file. 
-De-convolution problems there too. Spots may overlap. +  ​* Used by Illumina.
-Images are usually discarded, TB's of data. +
-Ion Torrent has direct electronic readout, no images. +
- +
-== Base-calling ​== +
-AGCT, quality score. +
-but quality means something different on each  +
-platform and sometimes even each instrument (Sanger). +
- +
-May have initial images that are used to calibrate. +
- +
-== SPACES == +
- +
-  ​BASE-space (ACGT fasta file) +
-  ​Color-space (di-nucleotides,​ used only by SOLiD) +
-  Flow-space (454, Ion torrent)+
  
 == Flow-space ==  == Flow-space == 
-Get from sequencing by synthesis +    * Used by sequencing-by-synthesis methods (454, Ion torrent)
-with ordinary nucleotides, get multiple copies  +    * Multiple of the same homo-nucleotide are added in a single step and you get an (imperfect) signal of how many.
-of same homo-nucleotide added in a single step. +    * Signal gets worse (less specific) for higher values. 
- +    * Analogous to run length encoding 
-Ion Torrent ​like 454 is flow-space. ​ The hydrogen +    * Often not integer values. 
-ions are more linear, but still has issues. +    ​* ​Ion Torrent is more linear ​than 454, but still has issues. 
- +    ​* ​Alignments in flow-space are possible.
-Sort of like run-length encoding. +
-You say what base and then how many times it was found in a row. +
-Alignments in flow-space are possible. +
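
The run-length-encoding analogy for flow-space can be sketched as follows. This is an idealized illustration, not how 454 or Ion Torrent software actually stores flowgrams: real flows follow a fixed nucleotide order and include zero-length signals, and the function names and the round-to-nearest rule here are assumptions made for the example.

```python
from itertools import groupby

def to_run_lengths(seq):
    """Collapse a base-space string into (base, run length) pairs."""
    return [(base, len(list(run))) for base, run in groupby(seq)]

def from_run_lengths(runs):
    """Expand (base, run length) pairs back into a base-space string."""
    return "".join(base * n for base, n in runs)

def call_run_length(signal):
    """Round a noisy, non-integer flow signal to the nearest run length (assumed rule)."""
    return max(0, int(signal + 0.5))

if __name__ == "__main__":
    runs = to_run_lengths("AAACGGT")
    print(runs)                    # [('A', 3), ('C', 1), ('G', 2), ('T', 1)]
    print(from_run_lengths(runs))  # AAACGGT
    print(call_run_length(2.6))    # 3
```

A signal like 2.6 is easy to call, but for long homopolymers the gap between, say, 7 and 8 copies is small relative to the noise, which is why the signal gets less specific for higher values.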
- +
  
 == Color-space == == Color-space ==
 +  * 4 colors, numbered 0 to 3.
  
-One major reason they did this was to avoid a patent. ​ +^ number ^ binary ^ color  ^ meaning ​                    ^ transitions ​          ^ 
-More independence in the sequencing errors.+| 0      | 00     | blue   | same base                   | (A->A C->C G->G T->T) | 
 +| 1      | 01     | green  | non-complement transversion | (A->C C->A G->T T->G) | 
 +| 2      | 10     | yellow | transition ​                 | (A->G C->T G->A T->C) | 
 +| 3      | 11     | red    | complement ​                 | (A->T C->G G->C T->A) |
  
-colors 0 to 3+  * See /​cse/​faculty/​karplus/​pluck/​scripts/​map-colorspace 
-  ​0 00 blue means same as previous base (AA CC GG TT) +  * One major reason they used this was to avoid a patent.  
-  ​1 01 green means (AC CA GT TG) +  ​* Allows more independence in the sequencing errors. 
-  2 10 yellow means (AG GA CT TC) +  ​* Binary representations are useful. 
-  3 11 red means switch to compliment (AT TA CG GC) +    ​* ​XOR is associative and commutative. 
- +    ​* ​This XOR operation is also works brilliantly with the Klein four group for the bases A C G T. 
-XOR is associative and commutative. +    ​* ​You get from one base to a color, or vice versa with XOR. 
-This XOR operation is also works brilliantly with the Klein four group +      ​* ​A 0 00 
-for the bases A C G T. +      ​* ​C 1 01 
-You get from one base to a color, or vice versa with XOR. +      ​* ​G 2 10 
- +      ​* ​T 3 11
-  ​A 0 00 +
-  C 1 01 +
-  G 2 10 +
-  T 3 11+
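
The XOR relationship between bases and colors can be sketched like this, using the 2-bit codes listed above (A=00, C=01, G=10, T=11). The function names are made up for this illustration, and the sketch ignores the leading primer base that a real SOLiD read carries.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"

def encode_colors(seq):
    """Color i is the XOR of the 2-bit codes of bases i and i+1."""
    codes = [BASE_TO_CODE[b] for b in seq]
    return [left ^ right for left, right in zip(codes, codes[1:])]

def decode_colors(first_base, colors):
    """Recover bases from a known first base: next base = previous base XOR color."""
    code = BASE_TO_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)

if __name__ == "__main__":
    colors = encode_colors("ATGGC")
    print(colors)                      # [3, 1, 0, 3]
    print(decode_colors("A", colors))  # ATGGC
```

Because XOR is its own inverse, the same operation converts base to color and color back to base.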
  
 The di-nucleotide is simply saying, The di-nucleotide is simply saying,
Line 153: Line 137:
   C -- G    color3 == color 11   C -- G    color3 == color 11
 Each nucleotide in the final sequence is used Each nucleotide in the final sequence is used
-as the right have of one dinucleotide,​ and then+as the right half of one dinucleotide,​ and then
 the left half of the next dinucleotide. the left half of the next dinucleotide.
 The first letter A is given  The first letter A is given 
Line 200: Line 184:
  
  
-Note that when an error happens, all bases in the read down-stream +Note that when an error happens, all bases in the read down-stream will be wrong in base-space. ​ This is the reason that people bother to try to use color-space,​ because then the error stays localized.
-will be wrong in base-space. ​ This is the reason that people bother +
-to try to use color-space,​ because then the error stays localized.+
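
A small sketch of why one color error corrupts everything downstream once the read is translated to base-space. It assumes the A=0, C=1, G=2, T=3 coding from the notes; the helper names and the example read are invented for illustration.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"

def decode_colors(first_base, colors):
    """Translate colors to bases: next base = previous base XOR color."""
    code = BASE_TO_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)

def mismatch_positions(a, b):
    """Indices at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

if __name__ == "__main__":
    true_colors = [3, 1, 0, 3]
    bad_colors = [3, 0, 0, 3]          # single miscalled color at index 1
    good = decode_colors("A", true_colors)
    bad = decode_colors("A", bad_colors)
    print(good, bad)                                    # ATGGC ATTTA
    print(mismatch_positions(true_colors, bad_colors))  # [1]
    print(mismatch_positions(good, bad))                # [2, 3, 4]
```

The error is one position in color-space, but every base after it is wrong in base-space, since each decoded base depends on all earlier colors.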
  
-When doing SNP calling, want to know if it is a SNP or a read-error.   +When doing SNP calling, want to know if it is a SNP or a read-error. The read-errors are independent typically. But the SNP will have coordinated changes. Either a larger change, mismapping, error, or something else. SOLID makes a big deal out of this. Not however useful for other non-SNP-calling things. Even with millions of reads, you can get false-positive SNPs at a low error rate.
-The read-errors are independent typically. ​  +
-But the SNP will have coordinated changes. +
-Either a larger change, mismapping, error, or something else. +
-SOLID makes a big deal out of this. +
-Not however useful for other non-snp-calling things. +
-Even with millions of reads, you can get false-positive SNPs at a low error rate.+
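
The "coordinated changes" of a SNP can be illustrated with a toy comparison: a single-base substitution changes two adjacent colors, whereas an isolated read error changes one. This is a sketch under the A=0, C=1, G=2, T=3 coding from the notes; the sequences and function names are made up, and a real SNP caller does much more than this.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_colors(seq):
    """Color i is the XOR of the 2-bit codes of bases i and i+1."""
    codes = [BASE_TO_CODE[b] for b in seq]
    return [left ^ right for left, right in zip(codes, codes[1:])]

def differing_positions(colors_a, colors_b):
    """Indices at which two equal-length color strings differ."""
    return [i for i, (x, y) in enumerate(zip(colors_a, colors_b)) if x != y]

if __name__ == "__main__":
    reference = "ATGGC"
    sample = "ATAGC"                   # SNP: G -> A at base index 2
    ref_colors = encode_colors(reference)
    snp_colors = encode_colors(sample)
    print(ref_colors)                  # [3, 1, 0, 3]
    print(snp_colors)                  # [3, 3, 2, 3]
    print(differing_positions(ref_colors, snp_colors))  # [1, 2]
```

Two adjacent color differences that are consistent with one base change look like a SNP; a single isolated color difference looks like a read error.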
  
-Base-space and color-space comes with quality. +=== Quality ===
-Flowspace does not have such? have to check. +
-SFS format is the flowspace format for input  +
-into the newbler assembler. ​  Does it have any +
-independent quality measurement?​+
  
-A large number of the assemblers throw away the quality data. +Base-space, flow-space, and color-space all come with quality ​scores.
-Or only use it later. ​ Some use it to just throw away reads with +
-low quality.+
  
-Sanger fails because of electrophoresis,​ not the sanger chemistry itself, +SFF format is the flowspace format for input into the Newbler assembler. It has quality scores for each base using standard -10 log<​sub>​10</​sub>​ probability. 
-as far as getting long reads Out to about 1000 bases.+([[http://​www.ncbi.nlm.nih.gov/​Traces/​trace.cgi?​cmd=show&​f=formats&​m=doc&​s=format#​sff|SFF format]])
  
-454 quality ​drops off also.  ​Synthesis starts ​to get out of phase.+A large number of the assemblers throw away the quality ​data or only use it later.  ​Some use it to just throw away reads with low quality.
  
-Solid - lose yield on ligation. ​ Missing ligations. +== Reasons for quality dropoff == 
- +  * Sanger fails because of electrophoresis,​ not the sanger chemistry itself, as far as getting long reads. ​ Out to about 1000 bases. 
-Illumina problem with frequent washing removes template. Kevin thinks.+  * 454 synthesis starts to get out of phase. 
 +  * Solid loses yield on ligation. ​ Missing ligations. 
 +  ​* ​Illumina problem with frequent washing removes template. Kevin thinks.
  
  
 +=== Memory ===
 How do you represent this stuff in memory? How do you represent this stuff in memory?
-Two bits per base. +Two bits per base (four possible values)
 With color-space,​ can choose them to fit what they should be. With color-space,​ can choose them to fit what they should be.
-If read is not too-variable length, can fit in 64-bit integer.+If read is not too-variable length, can fit 32 bases into a 64-bit integer.
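
A minimal sketch of the two-bits-per-base packing, assuming the A=0, C=1, G=2, T=3 coding used earlier in these notes. Python integers are unbounded, so the 32-base limit is enforced explicitly to mimic a 64-bit word; the function names are hypothetical.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"
MAX_BASES = 32  # 32 bases * 2 bits = 64 bits

def pack_bases(seq):
    """Pack up to 32 bases into one integer, first base in the highest bits."""
    if len(seq) > MAX_BASES:
        raise ValueError("at most 32 bases fit in a 64-bit word")
    word = 0
    for base in seq:
        word = (word << 2) | BASE_TO_CODE[base]
    return word

def unpack_bases(word, length):
    """Recover `length` bases; the length must be stored separately."""
    bases = []
    for shift in range(2 * (length - 1), -1, -2):
        bases.append(CODE_TO_BASE[(word >> shift) & 3])
    return "".join(bases)

if __name__ == "__main__":
    word = pack_bases("ACGT")
    print(word)                   # 27, i.e. binary 00 01 10 11
    print(unpack_bases(word, 4))  # ACGT
```

The packed word does not record how many bases it holds (leading A's are all zero bits), which is one reason fixed or nearly fixed read lengths make this representation convenient.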
  
 SOLID produces cs-fasta file. (cs = colorspace) SOLID produces cs-fasta file. (cs = colorspace)
 It is a T (the last base of the first adapter?) It is a T (the last base of the first adapter?)
-T 00100 ...+  ​T 00100 ...
  
 Sometimes we want to do matching directly in colorspace. Sometimes we want to do matching directly in colorspace.
Line 247: Line 220:
 So this helps avoid problems that would otherwise happen. So this helps avoid problems that would otherwise happen.
  
-Sometimes don't know what strand you are working on. +Sometimes don't know what strand you are working on. To get reverse-complement equivalent in color-space all you have to do is the reversal. ​ No complementing is needed.
-To get reverse-complement equivalent in color-space +
-all you have to do is the reversal. ​ No complementing is needed.+
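
The claim that reverse-complementing only reverses the colors can be checked directly. Complementing a base is XOR with 3 under the A=0, C=1, G=2, T=3 coding, and that cancels out in the XOR of two neighbours. The sketch below uses made-up function names and ignores the leading primer base of a real cs-fasta read.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def encode_colors(seq):
    """Color i is the XOR of the 2-bit codes of bases i and i+1."""
    codes = [BASE_TO_CODE[b] for b in seq]
    return [left ^ right for left, right in zip(codes, codes[1:])]

def reverse_complement(seq):
    """Base-space reverse complement."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

if __name__ == "__main__":
    seq = "ATGGC"
    forward = encode_colors(seq)
    reverse = encode_colors(reverse_complement(seq))
    print(forward)                   # [3, 1, 0, 3]
    print(reverse)                   # [3, 0, 1, 3]
    print(reverse == forward[::-1])  # True
```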
  
-One thing you can do when mapping is handle both strands. +One thing you can do when mapping is handle both strands. But you still have to hash the reversed colorspace too, so don't save memory
-But you still have to hash the reversed colorspace too, so don't save memory+
 in searches. ​ Hashing a genome takes a lot of space. in searches. ​ Hashing a genome takes a lot of space.
 +
 +==== Final Business ====
  
 Journal club papers should be fairly short. Journal club papers should be fairly short.
lecture_notes/04-07-2010.txt · Last modified: 2015/09/14 18:40 by 68.180.230.228