This shows you the differences between two versions of the page.
lecture_notes:04-07-2010 [2010/04/08 08:40] galt created |
lecture_notes:04-07-2010 [2010/04/11 16:01] karplus changed log10 to -10 log<sub>10</sub> |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | Communicate to Jeff and Jenny | + | ==== Class Business ==== |
- | Offload something since they were'nt | + | |
- | there on Monday. | + | |
- | Make a review articles page at a high level | + | Communicate about offloading assembler installation to Jeff and Jenny since they weren't there on Monday. |
- | with citations. People can comment. | + | |
+ | Make a review articles page at a high level with citations. People can comment. | ||
Use the forum to discuss things. | Use the forum to discuss things. | ||
- | Each person must sign up for the forum independently. | + | * Each person must sign up for the forum independently. |
- | Forum works better than email, | + | * Forum works better than email, because you can go back later to that subject. |
- | because you can go back later to that subject. | + | * Email has immediate impact, but not so easily searchable. |
- | email has immediate impact, but not so easily searchable. | + | |
- | People should read the de-novo assemblers review paper so that | + | People should read the de-novo assemblers review paper so that they will be ready for Friday's lecture. |
- | they will be ready Friday's lecture. | + | * (This has been added to new review articles page) |
- | (This has been added to new review articles page) | + | * Discusses Overlap and de-Bruijn graphs. |
- | Discusses Overlap and de-Bruijn graphs. | + | |
- | 454 Newbler assembler is entirely proprietary. | + | 454 Newbler assembler is entirely proprietary and almost nothing is known about how it works internally. |
- | Find out how much memory each tool needs. | + | Christy Hightower wants more feedback on the tools, to say good/bad. Feedback should be added to the wiki [[lecture_notes:04-02-2010|lecture notes for her lecture]]. |
- | Does it need a cluster or just a single machine? etc. | + | |
- | Do not run anything on the headnode for campusRocks. | + | [[https://banana-slug.soe.ucsc.edu/feed.php|RSS feed]] for wiki. |
- | Learn how to use sungrid to tell it how to run it on the node. | + | * Shows recent changes to the wiki. |
+ | * See what others have been doing lately. | ||
+ | * Good way to keep up with changes without having to scan every wiki page. | ||
- | We should all have access now to campusrocks. | + | ===Guest lecturers coming up:=== |
+ | * Mon 19 Apr Dan Zerbino on Velvet. | ||
+ | * Fri 23 Apr, Janet Leonard and John Pearse on Slug biology. | ||
- | There's a link to some documentation on sunGrid. | + | We will talk Friday about graph representations. |
- | Can ssh to a machine in the campusrocks grid directly to run small things. | + | ==== Running on Campus Rocks ==== |
+ | * Find out how much memory each tool needs. | ||
+ | * Does it need a cluster or just a single machine? etc. | ||
- | Some of the data is up now. | + | Do not run anything on the headnode for campusRocks. |
+ | Learn how to use sungrid to tell it how to run it on (one of) the nodes. Alternatively, use the [[http://campusrocks.soe.ucsc.edu/ganglia/|status page]] to find an idle node and ssh to it directly. | ||
+ | The [[computer_resources:campusrocks|campusrocks page]] has a link to some documentation on sungrid. | ||
- | David Bernick and Kevin have been fussing with the data. | + | We should all have access now to campusrocks. If you don't, contact tech staff (IT request). |
- | He had latest draft 4c. Have all the inversions. Pyrobaculum. | + | |
- | Can test assemblers to see how well they work on the small genome. | + | For testing, there is some Pyrobaculum data on campusrocks now (or soon). |
+ | * David Bernick and Kevin have been fussing with the data. | ||
+ | * He had latest draft 4c. Have all the inversions. | ||
+ | * Can test assemblers to see how well they work on the small genome. | ||
+ | * 454 and Solid reads. | ||
+ | * Go ahead and try running it. (Remember: not on the head-node.) | ||
+ | * Start comparing the different assembly techniques. | ||
- | We have 454 and Solid reads. | ||
- | Go ahead and try running it, but not on the head-node. | ||
- | Start comparing the different assembly techniques. | ||
- | Christy Hightower wants more feedback on the tools, | + | ==== Lower-level Data ==== |
- | to say good/bad. Can add feedback to the wiki lecture | + | |
- | notes on the lib lecture. | + | |
- | RSS feed for wiki. | + | === Instruments === |
- | Click the orange triangle upper-right of start page. | + | * Sanger capillary |
- | Shows recent changes to the wiki. | + | * 454 |
- | See what others have been doing lately. | + | * Solid |
+ | * Illumina | ||
+ | * Ion Torrent | ||
- | Guest lecturers coming up. | + | Sanger creates a trace. 454, Solid and Illumina take images with a camera. Ion Torrent uses direct chip pH measurement. |
- | Mon week after, Dan Zerbino. | + | === Traces === |
- | Slug biology. | + | * 4 1-D traces (wiggles) overlapping; one for each of ACGT. |
- | + | * Each trace tells what there is at a position. | |
- | We will talk friday about graph representations. | + | * Peaks are broadened and end of a read is worse than beginning. |
- | + | * Can get several in a row that are spread out making it difficult to tell how many you have. | |
- | So today let's talk about lower-level data | + | * NCBI has large archives of trace data for abandoned projects. |
- | Sanger capillary | + | * Have a terminator on each seq. |
- | 454 | + | |
- | Solid | + | === Images === |
- | Illumina | + | * The image files are enormous (TB's of data) and require a great deal of image processing. |
- | IonTorrent | + | * After processing the raw images are almost never kept. |
- | | + | * Images are typically monochrome, but SOLiD uses 4 fluorophores at the same time. |
- | 454,solid,illumina, takes images with camera. | + | * De-convolution problems there too. Spots may overlap. |
- | ion torrent uses direct chip ph measurement. | + | |
- | + | ||
- | The image files are enormous and require a great deal | + | |
- | of image processing which cooks them way down. | + | |
- | + | ||
- | For Sanger, you get a trace. | + | |
- | 4 1-D wiggles overlapping. | + | |
- | ACGT | + | |
- | Each trace tells what there is at a position. | + | |
- | Peaks are broadened, end of read worse than beginning, | + | |
- | Can get several in a row that are spread. | + | |
- | Trace archives at NIH for public genome archives that never got finished. | + | |
- | Have a terminator on each seq. | + | |
- | Their problem with homopolymers | + | |
- | is at end of reads with broad peaks merging into eachother. | + | |
- | Images are typically monochrome. | ||
- | (but SOLiD use 4 flourophores at the same time) | ||
- | De-convolution problems there too. Spots may overlap. | ||
- | Images are usually discarded, TB's of data. | ||
Ion Torrent has direct electronic readout, no images. | Ion Torrent has direct electronic readout, no images. | ||
- | Base-calling | + | === Base-calling === |
- | AGCT, quality score. | + | * For each position, turn image data into a base (AGCT) and a quality score. |
- | but quality means something different on each | + | * Quality means something different on each platform and sometimes even each instrument (Sanger). |
- | platform and sometimes even each instrument (Sanger). | + | (Correction to what I said in lecture: quality values are **supposed** to be -10 log<sub>10</sub> P(error), but calibration is sometimes not very accurate. --- //[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/04/09 07:18//) |
+ | * May have initial (known) sequences that are used to calibrate quality. | ||
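The quality definition in the correction above can be sketched in a few lines. This is only an illustration of the nominal formula Q = -10 log<sub>10</sub> P(error); the function names are made up for this example, and real instruments may be poorly calibrated against it.

```python
import math


def phred_from_error_prob(p_error: float) -> float:
    """Nominal quality value: Q = -10 * log10(P(error))."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_prob_from_phred(q: float) -> float:
    """Inverse mapping: P(error) = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# One error in a thousand base calls is nominally Q30.
print(phred_from_error_prob(0.001))
print(error_prob_from_phred(20))
```

So each 10 points of quality is supposed to mean a tenfold drop in error probability.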
- | May have initial images that are used to calibrate. | + | === Spaces === |
- | BASE-space (ACGT fasta file) | + | * Base-space (A/C/G/T) |
- | Color-space (di-nucleotides, used only by SOLiD) | + | |
- | One major reason they did this was to avoid a patent. | + | |
- | More independence in the sequencing errors. | + | |
- | Flow-space (get from sequnecing by synthesis | + | * Color-space (One of four colors corresponding to the change from previous base) |
- | with ordinary nucleotides, get multiple copies | + | * Used by SOLiD |
- | added in a single step). | + | * Flow-space (A/C/G/T and length of repeat) |
- | Ion Torrent like 454 is flow-space. The hydrogen | ||
- | ions are more linear, but still has issues. | ||
- | Sort of like run-length encoding. | + | == Base-space == |
- | You say what base and then how many times it was found in a row. | + | * Often in fasta file. |
- | Alignments in flow-space are possible. | + | * Used by Illumina. |
+ | == Flow-space == | ||
+ | * Used by sequencing-by-synthesis methods (454, Ion Torrent) | ||
+ | * Multiple copies of the same nucleotide are added in a single step, and you get an (imperfect) signal of how many. | ||
+ | * Signal gets worse (less specific) for higher values. | ||
+ | * Analogous to run length encoding | ||
+ | * Often not integer values. | ||
+ | * Ion Torrent is more linear than 454, but still has issues. | ||
+ | * Alignments in flow-space are possible. | ||
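A rough sketch of the run-length idea behind flow-space, assuming a fixed cyclic flow order of T, A, C, G (an assumption for illustration; check the instrument documentation for the real order). The function names are hypothetical. Going from signals back to bases is done here by simple rounding, which is exactly where long homopolymers go wrong.

```python
def bases_to_flows(seq: str, flow_order: str = "TACG") -> list[int]:
    """Ideal (noise-free) flow values: one homopolymer length per nucleotide flow."""
    if any(b not in flow_order for b in seq):
        raise ValueError("sequence contains a base that is not in the flow order")
    flows = []
    i = 0  # position in the sequence
    k = 0  # index of the current flow
    while i < len(seq):
        base = flow_order[k % len(flow_order)]
        run = 0
        while i < len(seq) and seq[i] == base:
            run += 1
            i += 1
        flows.append(run)  # 0 means nothing was incorporated on this flow
        k += 1
    return flows


def flows_to_bases(flows: list[float], flow_order: str = "TACG") -> str:
    """Call bases from (possibly non-integer) flow signals by rounding."""
    out = []
    for k, signal in enumerate(flows):
        count = max(int(round(signal)), 0)
        out.append(flow_order[k % len(flow_order)] * count)
    return "".join(out)


print(bases_to_flows("TTACCCG"))
print(flows_to_bases([2.1, 0.9, 2.8, 1.2]))
```

A signal of 2.8 still rounds to 3 here, but a signal near 6.5 would be ambiguous, which matches the point above that the signal is less specific for higher values.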
+ | == Color-space == | ||
+ | * 4 colors, numbered 0 to 3. | ||
- | Color-space | + | ^ number ^ binary ^ color ^ meaning ^ transitions ^ |
- | colors 0 to 3. | + | | 0 | 00 | blue | same base | (A->A C->C G->G T->T) | |
- | 0 00 blue means same as previous base (AA CC GG TT) | + | | 1 | 01 | green | non-complement transversion | (A->C C->A G->T T->G) | |
- | 1 01 green means (AC CA GT TG) | + | | 2 | 10 | yellow | transition | (A->G C->T G->A T->C) | |
- | 2 10 yellow means (AG GA CT TC) | + | | 3 | 11 | red | complement | (A->T C->G G->C T->A) | |
- | 3 11 red means switch to compliment (AT TA CG GC) | + | |
- | XOR is associative and commutative. | + | * See /cse/faculty/karplus/pluck/scripts/map-colorspace |
- | This XOR operation is also works brilliantly with the Klein four group | + | * One major reason they used this was to avoid a patent. |
- | for the bases A C G T. | + | * Allows more independence in the sequencing errors. |
- | You get from one base to a color, or vice versa with XOR. | + | * Binary representations are useful. |
- | + | * XOR is associative and commutative. | |
- | A 0 00 | + | * This XOR operation also works brilliantly with the Klein four group for the bases A C G T. |
- | C 1 01 | + | * You get from one base to a color, or vice versa with XOR. |
- | G 2 10 | + | * A 0 00 |
- | T 3 11 | + | * C 1 01 |
+ | * G 2 10 | ||
+ | * T 3 11 | ||
The di-nucleotide is simply saying, | The di-nucleotide is simply saying, | ||
Line 139: | Line 133: | ||
Remember, it's really a series like this: | Remember, it's really a series like this: | ||
ATCG is measured as chain of dinucleotide colors | ATCG is measured as chain of dinucleotide colors | ||
- | A -- T color3 == color 11 | + | A -- T color3 == color 11 |
- | T -- C color2 == color 10 | + | T -- C color2 == color 10 |
- | C -- G color3 == color 11 | + | C -- G color3 == color 11 |
Each nucleotide in the final sequence is used | Each nucleotide in the final sequence is used | ||
- | as the right have of one dinucleotide, and then | + | as the right half of one dinucleotide, and then |
the left half of the next dinucleotide. | the left half of the next dinucleotide. | ||
The first letter A is given | The first letter A is given | ||
(from the last base of the first primer if it is a read). | (from the last base of the first primer if it is a read). | ||
So the data actually appears something like this: | So the data actually appears something like this: | ||
- | (A) 3 2 3 | + | (A) 3 2 3 |
- | or | + | or |
- | (00) 11 10 11 | + | (00) 11 10 11 |
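The encoding above can be sketched as follows, using the 2-bit codes from the notes (A=00, C=01, G=10, T=11). The function names are made up for this example and are not from any SOLiD tool.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colors(seq: str) -> list[int]:
    """Color of each dinucleotide = XOR of the 2-bit codes of its two bases."""
    codes = [BASE_TO_CODE[b] for b in seq]
    return [left ^ right for left, right in zip(codes, codes[1:])]


def decode_colors(first_base: str, colors: list[int]) -> str:
    """Walk the colors from the known first base: next base = current base XOR color."""
    code = BASE_TO_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)


print(encode_colors("ATCG"))          # the (A) 3 2 3 example
print(decode_colors("A", [3, 2, 3]))
```

Decoding needs the given first base; the colors alone are consistent with four different base sequences.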
If you are on base G (10) and your next color is red (11), | If you are on base G (10) and your next color is red (11), | ||
Line 159: | Line 153: | ||
Indel in colorspace. | Indel in colorspace. | ||
- | A C G A C A A | + | A C G A C A A |
- | drop out GAC | + | drop out GAC |
- | 1 3 2 1 1 0 | + | 1 3 2 1 1 0 |
- | A C A A (GAC deleted, 4 colors become 1 new color) | + | A C A A (GAC deleted, 4 colors become 1 new color) |
- | 1 1 0 | + | 1 1 0 |
- | so 3 2 1 1 will become 3 xor 2 xor 1 xor 1 | + | so 3 2 1 1 will become 3 xor 2 xor 1 xor 1 |
- | = 11 xor 10 xor 01 xor 01 = 01 == 1 | + | = 11 xor 10 xor 01 xor 01 = 01 == 1 |
- | which is the same as C-->A which is correct. | + | which is the same as C-->A which is correct. |
Take the region that's changing, | Take the region that's changing, | ||
and the exclusive-or of it all together. | and the exclusive-or of it all together. | ||
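The deletion example can be checked mechanically. This is a small sketch with hypothetical helper names, using the same A=0, C=1, G=2, T=3 codes as the notes.

```python
from functools import reduce
from operator import xor

BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colors(seq: str) -> list[int]:
    """Color of each dinucleotide = XOR of the codes of its two bases."""
    codes = [BASE_TO_CODE[b] for b in seq]
    return [left ^ right for left, right in zip(codes, codes[1:])]


def collapse_colors(colors: list[int], start: int, stop: int) -> list[int]:
    """Replace colors[start:stop] with the single color given by their XOR."""
    merged = reduce(xor, colors[start:stop], 0)
    return colors[:start] + [merged] + colors[stop:]


before = encode_colors("ACGACAA")      # colors before the deletion
after = collapse_colors(before, 1, 5)  # deleting GAC merges colors 3 2 1 1
print(before, after, encode_colors("ACAA"))
```

The collapsed color list matches the colors of the shortened sequence, because the XOR over a region depends only on the bases at its two ends.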
Line 173: | Line 167: | ||
SNP changes colors, two changes together. | SNP changes colors, two changes together. | ||
- | A G C G (C --> T) | + | A G C G (C --> T) |
- | 2 3 3 | + | 2 3 3 |
- | 1 1 | + | 1 1 |
- | + | ||
- | A G C A (C --> T) | + | A G C A (C --> T) |
- | 2 3 1 | + | 2 3 1 |
- | 1 3 | + | 1 3 |
XOR of that region has to take this base to that base. | XOR of that region has to take this base to that base. | ||
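One way to read the rule above: an isolated SNP changes two adjacent colors, and because the flanking bases are unchanged, the XOR of those two colors stays the same. The check below is a deliberately simplified sketch (the function name is invented, and real SNP callers do much more than this).

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colors(seq: str) -> list[int]:
    """Color of each dinucleotide = XOR of the codes of its two bases."""
    codes = [BASE_TO_CODE[b] for b in seq]
    return [left ^ right for left, right in zip(codes, codes[1:])]


def looks_like_snp(ref_colors: list[int], read_colors: list[int], i: int) -> bool:
    """True if colors i and i+1 both differ from the reference but their XOR is unchanged."""
    both_changed = (ref_colors[i] != read_colors[i]
                    and ref_colors[i + 1] != read_colors[i + 1])
    same_xor = (ref_colors[i] ^ ref_colors[i + 1]) == (read_colors[i] ^ read_colors[i + 1])
    return both_changed and same_xor


ref = encode_colors("AGCG")   # 2 3 3
snp = encode_colors("AGTG")   # 2 1 1, the C --> T example
single_error = [2, 1, 3]      # only one color differs from the reference
print(looks_like_snp(ref, snp, 1), looks_like_snp(ref, single_error, 1))
```

A single changed color fails the check, which is why it is more likely a read error than a SNP.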
Line 190: | Line 184: | ||
- | Note that when an error happens, all bases in the read down-stream | + | Note that when an error happens, all bases in the read down-stream will be wrong in base-space. This is the reason that people bother to try to use color-space, because then the error stays localized. |
- | will be wrong in base-space. This is the reason that people bother | + | |
- | to try to use color-space, because then the error stays localized. | + | |
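To see why one color error ruins everything downstream in base-space, here is a short sketch (helper names are made up) that decodes the same read with and without a single wrong color.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def decode_colors(first_base: str, colors: list[int]) -> str:
    """Next base = current base XOR color, starting from the known first base."""
    code = BASE_TO_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)


true_colors = [1, 3, 2, 1, 1, 0]   # colors of ACGACAA
bad_colors = [1, 3, 0, 1, 1, 0]    # the third color misread (2 read as 0)

truth = decode_colors("A", true_colors)
decoded = decode_colors("A", bad_colors)
mismatches = [i for i, (x, y) in enumerate(zip(truth, decoded)) if x != y]
print(truth, decoded, mismatches)
```

Every base after the bad color is off by the same XOR offset, so the whole tail of the read is wrong in base-space while only one position is wrong in color-space.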
- | When doing SNP calling, want to know if it is a SNP or a read-error. | + | When doing SNP calling, want to know if it is a SNP or a read-error. The read-errors are independent typically. But the SNP will have coordinated changes. Either a larger change, mismapping, error, or something else. SOLID makes a big deal out of this. It is not, however, useful for tasks other than SNP calling. Even with millions of reads, you can get false-positive SNPs at a low error rate. |
- | The read-errors are independent typically. | + | |
- | But the SNP will have coordinated changes. | + | |
- | Either a larger change, mismapping, error, or something else. | + | |
- | SOLID makes a big deal out of this. | + | |
- | Not however useful for other non-snp-calling things. | + | |
- | Even with millions of reads, you can get false-positive SNPs at a low error rate. | + | |
- | Base-space and color-space comes with quality. | + | === Quality === |
- | Flowspace does not have such? have to check. | + | |
- | SFS format is the flowspace format for input | + | |
- | into the newbler assembler. Does it have any | + | |
- | independent quality measurement? | + | |
- | A large number of the assemblers throw away the quality data. | + | Base-space, flow-space, and color-space all come with quality scores. |
- | Or only use it later. Some use it to just throw away reads with | + | |
- | low quality. | + | |
- | Sanger fails because of electrophoresis, not the sanger chemistry itself, | + | SFF format is the flowspace format for input into the Newbler assembler. It has quality scores for each base using standard -10 log<sub>10</sub> probability. |
- | as far as getting long reads. Out to about 1000 bases. | + | ([[http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#sff|SFF format]]) |
- | 454 quality drops off also. Synthesis starts to get out of phase. | + | A large number of the assemblers throw away the quality data or only use it later. Some use it to just throw away reads with low quality. |
- | Solid - lose yield on ligation. Missing ligations. | + | == Reasons for quality dropoff == |
- | | + | * Sanger fails because of electrophoresis, not the Sanger chemistry itself, as far as getting long reads. Out to about 1000 bases. |
- | Illumina problem with frequent washing. removes template. Kevin thinks. | + | * 454 synthesis starts to get out of phase. |
+ | * Solid loses yield on ligation. Missing ligations. | ||
+ | * Illumina has a problem with frequent washing, which removes template (Kevin thinks). | ||
+ | === Memory === | ||
How do you represent this stuff in memory? | How do you represent this stuff in memory? | ||
- | Two bits per base. | + | Two bits per base (four possible values). |
With color-space, can choose them to fit what they should be. | With color-space, can choose them to fit what they should be. | ||
- | If read is not too-variable length, can fit in 64-bit integer. | + | If read is not too-variable length, can fit 32 bases into a 64-bit integer. |
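A minimal sketch of the two-bits-per-base packing, assuming the A=0, C=1, G=2, T=3 codes used earlier in these notes. Python integers are unbounded, so the 32-base limit is enforced by hand here to mimic a 64-bit word; the function names are hypothetical.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def pack_2bit(seq: str) -> int:
    """Pack up to 32 bases into one integer, first base in the highest bits."""
    if len(seq) > 32:
        raise ValueError("at most 32 bases fit in 64 bits")
    value = 0
    for base in seq:
        value = (value << 2) | BASE_TO_CODE[base]
    return value


def unpack_2bit(value: int, length: int) -> str:
    """Recover the bases; the length must be stored separately."""
    shifts = range(2 * (length - 1), -1, -2)
    return "".join(CODE_TO_BASE[(value >> shift) & 3] for shift in shifts)


packed = pack_2bit("ATCG")
print(packed, bin(packed), unpack_2bit(packed, 4))
```

The length has to be kept alongside the packed word, since leading A's are indistinguishable from padding zeros. The same packing works for color-space reads, since colors are also 2-bit values.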
SOLID produces cs-fasta file. (cs = colorspace) | SOLID produces cs-fasta file. (cs = colorspace) | ||
It is a T (the last base of the first adapter?) | It is a T (the last base of the first adapter?) | ||
- | T 00100 ... | + | T 00100 ... |
Sometimes we want to do matching directly in colorspace. | Sometimes we want to do matching directly in colorspace. | ||
Line 237: | Line 220: | ||
So this helps avoid problems that would otherwise happen. | So this helps avoid problems that would otherwise happen. | ||
- | Sometimes don't know what strand you are working on. | + | Sometimes don't know what strand you are working on. To get reverse-complement equivalent in color-space all you have to do is the reversal. No complementing is needed. |
- | To get reverse-complement equivalent in color-space | + | |
- | all you have to do is the reversal. No complementing is needed. | + | |
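The reverse-complement claim is easy to check: with the A=0, C=1, G=2, T=3 codes, complementing is XOR with 3 on both bases of a dinucleotide, which cancels out in the color. A small sketch with made-up helper names:

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode_colors(seq: str) -> list[int]:
    """Color of each dinucleotide = XOR of the codes of its two bases."""
    codes = [BASE_TO_CODE[b] for b in seq]
    return [left ^ right for left, right in zip(codes, codes[1:])]


def reverse_complement(seq: str) -> str:
    """Base-space reverse complement."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


seq = "ACGACAA"
forward = encode_colors(seq)
reverse_strand = encode_colors(reverse_complement(seq))
print(forward, reverse_strand, reverse_strand == forward[::-1])
```

Note that this covers the colors only; the known leading base of a read still differs between the two strands.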
- | One thing you can do when mapping is handle both strands. | + | One thing you can do when mapping is handle both strands. But you still have to hash the reversed colorspace too, so don't save memory |
- | But you still have to hash the reversed colorspace too, so don't save memory | + | |
in searches. Hashing a genome takes a lot of space. | in searches. Hashing a genome takes a lot of space. | ||
+ | |||
+ | ==== Final Business ==== | ||
Journal club papers should be fairly short. | Journal club papers should be fairly short. |