Banana Slug Genomics


== Class Business ==

Communicate to Jeff and Jenny Offload something since they weren't there on Monday.

Make a review articles page at a high level, with citations. People can comment.

Use the forum to discuss things. Each person must sign up for the forum independently. The forum works better than email because you can go back later to a subject. Email has immediate impact but is not as easily searchable.

People should read the de novo assemblers review paper so that they will be ready for Friday's lecture. (This has been added to the new review articles page.) It discusses overlap and de Bruijn graphs. The 454 Newbler assembler is entirely proprietary.

Find out how much memory each tool needs. Does it need a cluster or just a single machine? etc.

Do not run anything on the head node for campusrocks. Learn how to use SunGrid to tell it how to run a job on a node. We should all have access to campusrocks now. There's a link to some documentation on SunGrid. You can ssh to a machine in the campusrocks grid directly to run small things.

Some of the data is up now. David Bernick and Kevin have been fussing with the data. He had latest draft 4c. Have all the inversions. Pyrobaculum. Can test assemblers to see how well they work on the small genome. We have 454 and SOLiD reads. Go ahead and try running them, but not on the head node. Start comparing the different assembly techniques.

Christy Hightower wants more feedback on the tools, to say good/bad. Can add feedback to the wiki lecture notes on the lib lecture.

RSS feed for the wiki: click the orange triangle at the upper right of the start page. It shows recent changes to the wiki. See what others have been doing lately.

Guest lecturers coming up. Mon week after, Dan Zerbino. Slug biology.

We will talk Friday about graph representations. So today let's talk about lower-level data.

== Lower-level Data ==

Sanger capillary, 454, SOLiD, Illumina, Ion Torrent.

454, SOLiD, and Illumina take images with a camera. Ion Torrent uses direct on-chip pH measurement.
The image files are enormous and require a great deal of image processing, which cooks them way down.

For Sanger, you get a trace: four overlapping 1-D wiggles, one each for A, C, G, T. Each trace tells what there is at a position. Peaks are broadened, the end of a read is worse than the beginning, and you can get several in a row that are spread. Trace archives at NIH for public genome archives that never got finished. Have a terminator on each seq. Their problem with homopolymers is at the end of reads, with broad peaks merging into each other.

Images are typically monochrome (but SOLiD uses 4 fluorophores at the same time). De-convolution problems there too. Spots may overlap. Images are usually discarded: TBs of data. Ion Torrent has direct electronic readout, no images.

== Base-calling ==

A, G, C, T plus a quality score, but quality means something different on each platform and sometimes even on each instrument (Sanger). May have initial images that are used to calibrate.

== Spaces ==

Base-space (ACGT fasta file)
Color-space (di-nucleotides, used only by SOLiD)
Flow-space (454, Ion Torrent)

== Flow-space ==

Get from sequencing by synthesis with ordinary nucleotides: multiple copies of the same nucleotide (a homopolymer run) get added in a single step. Ion Torrent, like 454, is flow-space. The hydrogen-ion signal is more linear, but it still has issues. Sort of like run-length encoding: you say what base and then how many times it was found in a row. Alignments in flow-space are possible.

== Color-space ==

One major reason they did this was to avoid a patent. More independence in the sequencing errors. Colors 0 to 3:

0 00 blue means (AA CC GG TT)
1 01 green means (AC CA GT TG)
2 10 yellow means (AG CT GA TC)
3 11 red means (AT CG GC TA)

XOR is associative and commutative. This XOR operation also works brilliantly with the Klein four-group for the bases A C G T. You get from one base to a color, or vice versa, with XOR.
A 0 00
C 1 01
G 2 10
T 3 11

The di-nucleotide is simply saying: if I am at base B1 and XOR with the color C, I will get base B2, the other end of my di-nucleotide. One can define the entire SOLiD color-space dinucleotide array by simply asking, what color lets me XOR with my first nucleotide to get my second?

Remember, it's really a series like this. ATCG is measured as a chain of dinucleotide colors:

A -- T color3 == color 11
T -- C color2 == color 10
C -- G color3 == color 11

Each nucleotide in the final sequence is used as the right half of one dinucleotide, and then the left half of the next dinucleotide. The first letter A is given (from the last base of the first primer if it is a read). So the data actually appears something like this:

(A) 3 2 3 or (00) 11 10 11

If you are on base G (10) and your next color is red (11), then your next base is simply the XOR operation, so 10 XOR 11 = 01 = C. As you move along a read or a string in color-space, you can simply keep XOR-ing with the next color to get the next base!

Indel in color-space:

A C G A C A A (drop out GAC)
1 3 2 1 1 0
A C A A (GAC deleted, 4 colors become 1 new color)
1 1 0

So 3 2 1 1 will become 3 xor 2 xor 1 xor 1 = 11 xor 10 xor 01 xor 01 = 01 == 1, which is the same as C-->A, which is correct. Take the region that's changing, and take the exclusive-or of it all together. Can do a lot of work directly in color-space.

A SNP changes colors, two changes together.

A G C G (C --> T) 2 3 3 1 1
A G C A (C --> T) 2 3 1 1 3

The XOR of that region has to take this base to that base. An indel maintains the two things across there; the regions are different lengths. Can do more complicated stuff, such as a 5-long thing replaced by a 3-long thing.

Note that when a color error happens, all bases downstream in the read will be wrong in base-space. This is the reason that people bother to try to use color-space, because then the error stays localized. When doing SNP calling, want to know if it is a SNP or a read-error.
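The XOR walk and the region-XOR trick described above can be sketched in a few lines of Python. This is a minimal illustration, not SOLiD's own software; the function names `encode`, `decode`, and `region_color` are made up for this example, and the only thing taken from the notes is the coding A=0, C=1, G=2, T=3.

```python
from functools import reduce
from operator import xor

# Two-bit base codes from the notes: A=00, C=01, G=10, T=11.
BASES = "ACGT"

def encode(seq):
    """Base-space string -> (first base, list of colors).

    Each color is the XOR of two adjacent base codes.
    (Hypothetical helper name, for illustration only.)
    """
    codes = [BASES.index(b) for b in seq]
    return seq[0], [a ^ b for a, b in zip(codes, codes[1:])]

def decode(first_base, colors):
    """Walk along the read, XOR-ing each color into the current base."""
    current = BASES.index(first_base)
    out = [first_base]
    for c in colors:
        current ^= c
        out.append(BASES[current])
    return "".join(out)

def region_color(colors):
    """Collapse a run of colors into the single color that spans the region."""
    return reduce(xor, colors, 0)

print(encode("ATCG"))                # first base A, colors 3 2 3
print(decode("A", [3, 2, 3]))        # back to base-space
print(region_color([3, 2, 1, 1]))    # the indel example: 4 colors collapse to 1
print(decode("A", [3, 1, 3]))        # one wrong color changes every later base
```

The last call shows why a single color error is so damaging if you translate to base-space first: every base after the bad color is decoded from the wrong starting point.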
The read-errors are typically independent. But the SNP will have coordinated changes. Either a larger change, mismapping, error, or something else. SOLiD makes a big deal out of this. Not, however, useful for other non-SNP-calling things. Even with millions of reads, you can get false-positive SNPs at a low error rate.

== Quality ==

Base-space and color-space come with quality scores. Flow-space does not have such? Have to check. SFF format is the flow-space format for input into the Newbler assembler. Does it have any independent quality measurement?

A large number of the assemblers throw away the quality data, or only use it later. Some use it just to throw away reads with low quality.

Sanger fails because of electrophoresis, not the Sanger chemistry itself, as far as getting long reads. Out to about 1000 bases. 454 quality drops off also: synthesis starts to get out of phase. SOLiD: lose yield on ligation, missing ligations. Illumina: problem with frequent washing removing template, Kevin thinks.

== Memory ==

How do you represent this stuff in memory? Two bits per base. With color-space, can choose them to fit what they should be. If the read length is not too variable, a read can fit in a 64-bit integer.

SOLiD produces a cs-fasta file (cs = colorspace). It is a T (the last base of the first adapter?):

T 00100 ...

Sometimes we want to do matching directly in color-space. Therefore Kevin throws away the first base, and the first digit "0" above. Then he has a 24-color read instead of a 25-base read. But now he can match entirely in color-space. So this helps avoid problems that would otherwise happen.

Sometimes you don't know what strand you are working on. To get the reverse-complement equivalent in color-space, all you have to do is the reversal. No complementing is needed. One thing you can do when mapping is handle both strands. But you still have to hash the reversed color-space too, so you don't save memory in searches. Hashing a genome takes a lot of space.
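As a rough check of the two claims above (reverse-complement is just reversal in color-space, and a 24-color read fits in a 64-bit word at two bits per color), here is a small Python sketch. The helper names `to_colors`, `reverse_complement`, `pack`, and `unpack` are hypothetical, and the packing order (first color in the highest bits) is an arbitrary choice for this example, not something taken from any particular tool.

```python
# Two-bit base codes from the notes: A=00, C=01, G=10, T=11.
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def to_colors(seq):
    """Colors for a base-space string: XOR of adjacent base codes."""
    codes = [BASES.index(b) for b in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]

def reverse_complement(seq):
    """Ordinary base-space reverse complement."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def pack(colors):
    """Pack colors, two bits each, into one integer (first color ends up highest).

    Python integers are unbounded, so the 64-bit limit is only checked
    by the caller here; in C this would be a uint64_t.
    """
    word = 0
    for c in colors:
        word = (word << 2) | c
    return word

def unpack(word, n):
    """Recover n colors from a packed integer."""
    return [(word >> (2 * i)) & 3 for i in reversed(range(n))]

seq = "ACGACAA"
forward = to_colors(seq)
reverse = to_colors(reverse_complement(seq))
print(forward, reverse)              # the second list is the first one reversed

read = [3, 2, 3]
print(pack(read), unpack(pack(read), len(read)))
print(pack([3] * 24) < 2 ** 64)      # 24 colors need only 48 bits
```

Complementing flips both bits of every base, so the XOR of two neighbours is unchanged; only the order of the colors reverses. That is why both strands can be handled by hashing the read and its reversed color string.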
== Final Business ==

Journal club papers should be fairly short. Give a 10-minute summary. Will Nader want all 3 lectures next week? Or include some time for Journal Club. Start being ready to do papers middle of next week.
