This shows you the differences between two versions of the page.
lecture_notes:04-07-2010 [2010/04/08 09:14] galt |
lecture_notes:04-07-2010 [2010/04/09 14:21] karplus added correction for mistake I made in class about meaning of quality |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | == Class Business == | + | ==== Class Business ==== |
- | Communicate to Jeff and Jenny | + | Communicate about offloading assembler installation to Jeff and Jenny since they weren't there on Monday. |
- | Offload something since they were'nt | + | |
- | there on Monday. | + | |
- | Make a review articles page at a high level | + | Make a review articles page at a high level with citations. People can comment. |
- | with citations. People can comment. | + | |
Use the forum to discuss things. | Use the forum to discuss things. | ||
- | Each person must sign up for the forum independently. | + | * Each person must sign up for the forum independently. |
- | Forum works better than email, | + | * Forum works better than email, because you can go back later to that subject. |
- | because you can go back later to that subject. | + | * Email has immediate impact, but not so easily searchable. |
- | email has immediate impact, but not so easily searchable. | + | |
- | People should read the de-novo assemblers review paper so that | + | People should read the de-novo assemblers review paper so that they will be ready for Friday's lecture. |
- | they will be ready Friday's lecture. | + | * (This has been added to new review articles page) |
- | (This has been added to new review articles page) | + | * Discusses Overlap and de-Bruijn graphs. |
- | Discusses Overlap and de-Bruijn graphs. | + | |
- | 454 Newbler assembler is entirely proprietary. | + | 454 Newbler assembler is entirely proprietary and almost nothing is known about how it works internally. |
- | Find out how much memory each tool needs. | + | Christy Hightower wants more feedback on the tools, to say good/bad. Feedback should be added to the wiki [[lecture_notes:04-02-2010|lecture notes for her lecture]]. |
- | Does it need a cluster or just a single machine? etc. | + | |
- | Do not run anything on the headnode for campusRocks. | + | [[https://banana-slug.soe.ucsc.edu/feed.php|RSS feed]] for wiki. |
- | Learn how to use sungrid to tell it how to run it on the node. | + | * Shows recent changes to the wiki. |
+ | * See what others have been doing lately. | ||
+ | * Good way to keep up with changes without having to scan every wiki page. | ||
- | We should all have access now to campusrocks. | + | ===Guest lecturers coming up:=== |
+ | * Mon 19 Apr Dan Zerbino on Velvet. | ||
+ | * Fri 23 Apr, Janet Leonard and John Pearse on Slug biology. | ||
- | There's a link to some documentation on sunGrid. | + | We will talk Friday about graph representations. |
- | Can ssh to a machine in the campusrocks grid directly to run small things. | + | ==== Running on Campus Rocks ==== |
+ | * Find out how much memory each tool needs. | ||
+ | * Does it need a cluster or just a single machine? etc. | ||
- | Some of the data is up now. | + | Do not run anything on the headnode for campusRocks. |
+ | Learn how to use sungrid to tell it how to run it on (one of) the nodes. Alternatively, use the [[http://campusrocks.soe.ucsc.edu/ganglia/|status page]] to find an idle node and ssh to it directly. | ||
+ | The [[computer_resources:campusrocks|campusrocks page]] has a link to some documentation on sungrid. | ||
- | David Bernick and Kevin have been fussing with the data. | + | We should all have access now to campusrocks. If you don't, contact tech staff (IT request). |
- | He had latest draft 4c. Have all the inversions. Pyrobaculum. | + | |
- | Can test assemblers to see how well they work on the small genome. | + | For testing, there is some Pyrobaculum data on campusrocks now (or soon). |
+ | * David Bernick and Kevin have been fussing with the data. | ||
+ | * He had latest draft 4c. Have all the inversions. | ||
+ | * Can test assemblers to see how well they work on the small genome. | ||
+ | * 454 and Solid reads. | ||
+ | * Go ahead and try running it. (Remember: not on the head-node.) | ||
+ | * Start comparing the different assembly techniques. | ||
- | We have 454 and Solid reads. | ||
- | Go ahead and try running it, but not on the head-node. | ||
- | Start comparing the different assembly techniques. | ||
- | Christy Hightower wants more feedback on the tools, | + | ==== Lower-level Data ==== |
- | to say good/bad. Can add feedback to the wiki lecture | + | |
- | notes on the lib lecture. | + | |
- | RSS feed for wiki. | + | === Instruments === |
- | Click the orange triangle upper-right of start page. | + | * Sanger capillary |
- | Shows recent changes to the wiki. | + | * 454 |
- | See what others have been doing lately. | + | * Solid |
+ | * Illumina | ||
+ | * Ion Torrent | ||
- | Guest lecturers coming up. | + | Sanger creates a trace. 454, Solid and Illumina take images with a camera. Ion Torrent uses direct on-chip pH measurement. |
- | Mon week after, Dan Zerbino. | + | === Traces === |
- | Slug biology. | + | * 4 1-D traces (wiggles) overlapping; one for each of ACGT. |
+ | * Each trace tells what there is at a position. | ||
+ | * Peaks are broadened and end of a read is worse than beginning. | ||
+ | * Can get several in a row that are spread out making it difficult to tell how many you have. | ||
+ | * NCBI has large archives of trace data for abandoned projects. | ||
+ | * Have a terminator on each seq. | ||
+ | |||
+ | === Images === | ||
+ | * The image files are enormous (TB's of data) and require a great deal of image processing. | ||
+ | * After processing, the raw images are almost never kept. | ||
+ | * Images are typically monochrome, but SOLiD uses 4 fluorophores at the same time. | ||
+ | * De-convolution problems there too. Spots may overlap. | ||
- | We will talk friday about graph representations. | + | Ion Torrent has direct electronic readout, no images. |
- | So today let's talk about | + | === Base-calling === |
- | == Lower-level Data == | + | * For each position, turn image data into a base (AGCT) and a quality score. |
+ | * Quality means something different on each platform and sometimes even each instrument (Sanger). | ||
+ | (Correction to what I said in lecture: quality values are **supposed** to be -10 log<sub>10</sub> P(error), but calibration is sometimes not very accurate. --- //[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/04/09 07:18//) | ||
+ | * May have initial (known) sequences that are used to calibrate quality. | ||
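The nominal definition in the correction above, Q = -10 log10 P(error), can be turned into a two-way conversion. This is a minimal sketch, assuming perfectly calibrated quality values (which, as noted, real instruments do not always deliver). The function names are made up for illustration.

```python
import math

def phred_to_error_prob(q):
    """Nominal error probability for quality value q: P = 10^(-q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Nominal quality value for error probability p: Q = -10 log10(p)."""
    return -10 * math.log10(p)

# Q20 corresponds to a nominal 1% error rate, Q30 to 0.1%.
q20 = phred_to_error_prob(20)
q_for_one_in_a_thousand = error_prob_to_phred(0.001)
```

So a base called with quality 20 is nominally wrong 1 time in 100, and every additional 10 points means a tenfold lower error rate.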
- | Sanger capillary | + | === Spaces === |
- | 454 | + | |
- | Solid | + | |
- | Illumina | + | |
- | Ion Torrent | + | |
- | 454,solid,illumina take images with camera. | + | * Base-space (A/C/G/T) |
- | Ion torrent uses direct chip ph measurement. | + | |
- | The image files are enormous and require a great deal | + | * Color-space (One of four colors corresponding to the change from previous base) |
- | of image processing which cooks them way down. | + | * Used by SOLiD |
+ | * Flow-space (A/C/G/T and length of repeat) | ||
- | For Sanger, you get a trace. | ||
- | 4 1-D wiggles overlapping. | ||
- | ACGT | ||
- | Each trace tells what there is at a position. | ||
- | Peaks are broadened, end of read worse than beginning, | ||
- | Can get several in a row that are spread. | ||
- | Trace archives at NIH for public genome archives that never got finished. | ||
- | Have a terminator on each seq. | ||
- | Their problem with homopolymers | ||
- | is at end of reads with broad peaks merging into eachother. | ||
- | |||
- | Images are typically monochrome. | ||
- | (but SOLiD use 4 flourophores at the same time) | ||
- | De-convolution problems there too. Spots may overlap. | ||
- | Images are usually discarded, TB's of data. | ||
- | Ion Torrent has direct electronic readout, no images. | ||
- | |||
- | == Base-calling == | ||
- | AGCT, quality score. | ||
- | but quality means something different on each | ||
- | platform and sometimes even each instrument (Sanger). | ||
- | |||
- | May have initial images that are used to calibrate. | ||
- | |||
- | == Spaces == | ||
- | |||
- | BASE-space (ACGT fasta file) | ||
- | Color-space (di-nucleotides, used only by SOLiD) | ||
- | Flow-space (454, Ion torrent) | ||
+ | == Base-space == | ||
+ | * Often in fasta file. | ||
+ | * Used by Illumina. | ||
== Flow-space == | == Flow-space == | ||
- | + | * Used by sequencing-by-synthesis methods (454, Ion torrent) | |
- | Get from sequencing by synthesis | + | * Multiple copies of the same nucleotide (a homopolymer run) are added in a single step and you get an (imperfect) signal of how many. |
- | with ordinary nucleotides, get multiple copies | + | * Signal gets worse (less specific) for higher values. |
- | of same homo-nucleotide added in a single step. | + | * Analogous to run length encoding |
- | + | * Often not integer values. | |
- | Ion Torrent like 454 is flow-space. The hydrogen | + | * Ion Torrent is more linear than 454, but still has issues. |
- | ions are more linear, but still has issues. | + | * Alignments in flow-space are possible. |
- | + | ||
- | Sort of like run-length encoding. | + | |
- | You say what base and then how many times it was found in a row. | + | |
- | Alignments in flow-space are possible. | + | |
- | + | ||
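The run-length view of flow-space described above can be sketched as follows. This is an idealized illustration, not how any vendor's software works: the cyclic flow order "TACG" is an assumption, the function names are hypothetical, and real flow values are noisy non-integers that have to be rounded to a run length.

```python
def to_flows(seq, flow_order="TACG"):
    """Encode a base-space sequence as ideal (base, run length) flows.

    Nucleotides are flowed cyclically in flow_order; each flow records how
    many copies of that base were incorporated (0 if none).
    """
    if any(b not in flow_order for b in seq):
        raise ValueError("sequence contains a base outside the flow order")
    flows = []
    i = 0  # position in seq
    k = 0  # index of the current flow
    while i < len(seq):
        base = flow_order[k % len(flow_order)]
        run = 0
        while i < len(seq) and seq[i] == base:
            run += 1
            i += 1
        flows.append((base, run))
        k += 1
    return flows

def from_flows(flows):
    """Decode (base, run length) flows back to base-space."""
    return "".join(base * run for base, run in flows)

def call_runs(signals):
    """Round noisy, non-integer flow signals to the nearest run length."""
    return [int(s + 0.5) for s in signals]

flows = to_flows("TTAGGG")          # [('T', 2), ('A', 1), ('C', 0), ('G', 3)]
runs = call_runs([2.1, 0.9, 0.1, 2.7])  # [2, 1, 0, 3]
```

The rounding step is where long homopolymers go wrong: the larger the true run length, the harder it is to tell a signal of, say, 7.4 from 6.6.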
== Color-space == | == Color-space == | ||
+ | * 4 colors, numbered 0 to 3. | ||
- | One major reason they did this was to avoid a patent. | + | ^ number ^ binary ^ color ^ meaning ^ transitions ^ |
- | More independence in the sequencing errors. | + | | 0 | 00 | blue | same base | (A->A C->C G->G T->T) | |
+ | | 1 | 01 | green | non-complement transversion | (A->C C->A G->T T->G) | | ||
+ | | 2 | 10 | yellow | transition | (A->G C->T G->A T->C) | | ||
+ | | 3 | 11 | red | complement | (A->T C->G G->C T->A) | | ||
- | colors 0 to 3. | + | * See /cse/faculty/karplus/pluck/scripts/map-colorspace |
- | 0 00 blue means (AA CC GG TT) | + | * One major reason they used this was to avoid a patent. |
- | 1 01 green means (AC CA GT TG) | + | * Allows more independence in the sequencing errors. |
- | 2 10 yellow means (AG CT GA TC) | + | * Binary representations are useful. |
- | 3 11 red means (AT CG GC TA) | + | * XOR is associative and commutative. |
- | + | * This XOR operation also works brilliantly with the Klein four group for the bases A C G T. |
- | XOR is associative and commutative. | + | * You get from one base to a color, or vice versa with XOR. |
- | This XOR operation is also works brilliantly with the Klein four group | + | * A 0 00 |
- | for the bases A C G T. | + | * C 1 01 |
- | You get from one base to a color, or vice versa with XOR. | + | * G 2 10 |
- | + | * T 3 11 | |
- | A 0 00 | + | |
- | C 1 01 | + | |
- | G 2 10 | + | |
- | T 3 11 | + | |
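With the 2-bit codes above (A=00, C=01, G=10, T=11), the color table is just XOR of adjacent bases. Here is a minimal sketch of converting between base-space and color-space. The function names are made up for illustration and are not taken from the map-colorspace script.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"

def encode_colors(seq):
    """Color i is the XOR of the codes of bases i and i+1."""
    codes = [BASE_TO_BITS[b] for b in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]

def decode_colors(first_base, colors):
    """Rebuild the bases from a known first base and the color sequence."""
    code = BASE_TO_BITS[first_base]
    bases = [first_base]
    for c in colors:
        code ^= c  # previous base XOR color gives the next base
        bases.append(BITS_TO_BASE[code])
    return "".join(bases)

colors = encode_colors("ACGT")      # [1, 3, 1]
seq = decode_colors("A", colors)    # "ACGT"
```

Because XOR is its own inverse, the same operation goes from (base, color) to the next base and from (base, next base) to the color.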
The di-nucleotide is simply saying, | The di-nucleotide is simply saying, | ||
Line 155: | Line 137: | ||
C -- G color3 == color 11 | C -- G color3 == color 11 | ||
Each nucleotide in the final sequence is used | Each nucleotide in the final sequence is used | ||
- | as the right have of one dinucleotide, and then | + | as the right half of one dinucleotide, and then |
the left half of the next dinucleotide. | the left half of the next dinucleotide. | ||
The first letter A is given | The first letter A is given | ||
Line 202: | Line 184: | ||
- | Note that when an error happens, all bases in the read down-stream | + | Note that when an error happens, all bases in the read down-stream will be wrong in base-space. This is the reason that people bother to try to use color-space, because then the error stays localized. |
- | will be wrong in base-space. This is the reason that people bother | + | |
- | to try to use color-space, because then the error stays localized. | + | |
- | When doing SNP calling, want to know if it is a SNP or a read-error. | + | When doing SNP calling, want to know if it is a SNP or a read-error. The read-errors are typically independent. But the SNP will have coordinated changes. Either a larger change, mismapping, error, or something else. SOLID makes a big deal out of this. Not however useful for other non-SNP-calling things. Even with millions of reads, you can get false-positive SNPs at a low error rate. |
- | The read-errors are independent typically. | + | |
- | But the SNP will have coordinated changes. | + | |
- | Either a larger change, mismapping, error, or something else. | + | |
- | SOLID makes a big deal out of this. | + | |
- | Not however useful for other non-snp-calling things. | + | |
- | Even with millions of reads, you can get false-positive SNPs at a low error rate. | + | |
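A small sketch of the two effects described above, using the XOR encoding (A=0, C=1, G=2, T=3). The example read and the helper names are made up for illustration. A single color read-error corrupts every downstream base after decoding, while a true SNP shows up as two adjacent color changes.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"

def encode_colors(seq):
    codes = [BASE_TO_BITS[b] for b in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]

def decode_colors(first_base, colors):
    code = BASE_TO_BITS[first_base]
    bases = [first_base]
    for c in colors:
        code ^= c
        bases.append(BITS_TO_BASE[code])
    return "".join(bases)

def mismatches(a, b):
    """Positions at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

reference = "ACGTACGT"
ref_colors = encode_colors(reference)            # [1, 3, 1, 3, 1, 3, 1]

# Read error: one color is misread (index 2: 1 -> 0).
bad_colors = ref_colors[:2] + [0] + ref_colors[3:]
decoded = decode_colors("A", bad_colors)         # "ACGGCATG"
base_errors = mismatches(reference, decoded)     # [3, 4, 5, 6, 7]

# True SNP: base 3 changes T -> G; two adjacent colors change together.
snp_colors = encode_colors("ACGGACGT")
color_changes = mismatches(ref_colors, snp_colors)   # [2, 3]
```

This is why an isolated single-color mismatch against the reference is more likely a read error than a SNP.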
- | == Quality == | + | === Quality === |
- | Base-space and color-space comes with quality. | + | * Base-space and color-space come with quality scores. |
- | Flowspace does not have such? have to check. | + | * Flowspace does not have such? have to check. |
- | SFS format is the flowspace format for input | + | * SFS format is the flowspace format for input into the Newbler assembler. Does it have any independent quality measurement? |
- | into the newbler assembler. Does it have any | + | |
- | independent quality measurement? | + | |
- | A large number of the assemblers throw away the quality data. | + | A large number of the assemblers throw away the quality data or only use it later. Some use it to just throw away reads with low quality. |
- | Or only use it later. Some use it to just throw away reads with | + | |
- | low quality. | + | |
- | Sanger fails because of electrophoresis, not the sanger chemistry itself, | + | == Reasons for quality dropoff == |
- | as far as getting long reads. Out to about 1000 bases. | + | * Sanger read length is limited by electrophoresis, not by the Sanger chemistry itself. Reads go out to about 1000 bases. |
- | + | * 454 synthesis starts to get out of phase. | |
- | 454 quality drops off also. Synthesis starts to get out of phase. | + | * Solid loses yield on ligation. Missing ligations. |
- | + | * Illumina problem with frequent washing removes template. Kevin thinks. | |
- | Solid - lose yield on ligation. Missing ligations. | + | |
- | + | ||
- | Illumina problem with frequent washing removes template. Kevin thinks. | + | |
+ | === Memory === | ||
How do you represent this stuff in memory? | How do you represent this stuff in memory? | ||
- | Two bits per base. | + | Two bits per base (four possible values). |
With color-space, can choose them to fit what they should be. | With color-space, can choose them to fit what they should be. | ||
- | If read is not too-variable length, can fit in 64-bit integer. | + | If reads are not too variable in length, can fit 32 bases into a 64-bit integer. |
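A minimal sketch of the two-bits-per-base packing, assuming the A=0, C=1, G=2, T=3 codes used for color-space. The function names are hypothetical. Python integers are unbounded, so the 32-base limit is checked explicitly to mimic a 64-bit word.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"

def pack(seq):
    """Pack up to 32 bases into one integer that fits in 64 bits."""
    if len(seq) > 32:
        raise ValueError("at most 32 bases fit into 64 bits")
    word = 0
    for b in seq:
        word = (word << 2) | BASE_TO_BITS[b]  # first base ends up in the high bits
    return word

def unpack(word, length):
    """Recover `length` bases; the length must be stored separately."""
    bases = []
    for shift in range(2 * (length - 1), -1, -2):
        bases.append(BITS_TO_BASE[(word >> shift) & 3])
    return "".join(bases)

word = pack("ACGT")      # 0b00011011 == 27
seq = unpack(word, 4)    # "ACGT"
```

The length has to be kept alongside the word because leading A's are all-zero bits, which is one reason fixed-length reads are convenient here.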
SOLID produces cs-fasta file. (cs = colorspace) | SOLID produces cs-fasta file. (cs = colorspace) | ||
It is a T (the last base of the first adapter?) | It is a T (the last base of the first adapter?) | ||
- | T 00100 ... | + | T 00100 ... |
Sometimes we want to do matching directly in colorspace. | Sometimes we want to do matching directly in colorspace. | ||
Line 251: | Line 219: | ||
So this helps avoid problems that would otherwise happen. | So this helps avoid problems that would otherwise happen. | ||
- | Sometimes don't know what strand you are working on. | + | Sometimes don't know what strand you are working on. To get reverse-complement equivalent in color-space all you have to do is the reversal. No complementing is needed. |
- | To get reverse-complement equivalent in color-space | + | |
- | all you have to do is the reversal. No complementing is needed. | + | |
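The reversal claim can be checked directly. With A=0, C=1, G=2, T=3, complementing a base is XOR with 3 (binary 11), and that cancels out when two adjacent complemented bases are XORed to give a color. A short sketch with made-up helper names:

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"

def encode_colors(seq):
    codes = [BASE_TO_BITS[b] for b in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]

def reverse_complement(seq):
    """Complement is XOR with 3: A<->T, C<->G."""
    return "".join(BITS_TO_BASE[BASE_TO_BITS[b] ^ 3] for b in reversed(seq))

seq = "AACGTTGCA"
forward_colors = encode_colors(seq)
revcomp_colors = encode_colors(reverse_complement(seq))

# The colors of the reverse complement are the forward colors reversed.
same = revcomp_colors == forward_colors[::-1]    # True
```

Note that this covers only the colors. The known leading base of the read is a base-space symbol and still has to be handled separately.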
- | One thing you can do when mapping is handle both strands. | + | One thing you can do when mapping is handle both strands. But you still have to hash the reversed colorspace too, so don't save memory |
- | But you still have to hash the reversed colorspace too, so don't save memory | + | |
in searches. Hashing a genome takes a lot of space. | in searches. Hashing a genome takes a lot of space. | ||
+ | |||
+ | ==== Final Business ==== | ||
Journal club papers should be fairly short. | Journal club papers should be fairly short. |