Class Business

Communicate about offloading assembler installation to Jeff and Jenny since they weren't there on Monday.

Make a review articles page at a high level with citations. People can comment.

Use the forum to discuss things.

People should read the de novo assemblers review paper so that they will be ready for Friday's lecture.

The 454 Newbler assembler is entirely proprietary, and almost nothing is known about how it works internally. The only description is in the supplementary material of the original 454 method paper 1).

Christy Hightower wants more feedback on the tools, to say good/bad. Feedback should be added to the wiki lecture notes for her lecture.

RSS feed for wiki.

Guest lecturers coming up:

We will talk Friday about graph representations.

Running on Campus Rocks

Do not run anything on the head node of campusrocks. Learn how to use sungrid to have it run your job on one of the nodes. Alternatively, use the status page to find an idle node and ssh to it directly. The campusrocks page has a link to some documentation on sungrid.

We should all have access to campusrocks now. If you don't, contact tech staff (IT request).

For testing, there is some Pyrobaculum data on campusrocks now (or soon).

Lower-level Data

Instruments

Sanger sequencing creates a trace. 454, SOLiD, and Illumina take images with a camera. Ion Torrent measures pH directly on the chip.

Traces

Images

Ion Torrent has direct electronic readout, no images.

Base-calling

(Correction to what I said in lecture: quality values are supposed to be -10 log10 P(error), but calibration is sometimes not very accurate. — Kevin Karplus 2010/04/09 07:18)
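As a small illustration of the formula in the correction above, the conversion between an error probability and a quality value can be sketched as follows. The function names are made up for this example, and real base-callers may be calibrated differently, as the note says.

```python
import math

def phred_quality(p_error: float) -> float:
    """Quality value for an error probability: Q = -10 * log10(P(error))."""
    return -10.0 * math.log10(p_error)

def error_probability(quality: float) -> float:
    """Inverse mapping: P(error) = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)

# A 1-in-1000 chance of a wrong call corresponds to a quality of about 30.
print(phred_quality(0.001))
print(error_probability(20))
```

So each step of 10 in quality means a tenfold drop in the claimed error probability.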

Spaces

Base-space
Flow-space
Color-space
number  binary  color   meaning                       transitions
0       00      blue    same base                     A→A C→C G→G T→T
1       01      green   non-complement transversion   A→C C→A G→T T→G
2       10      yellow  transition                    A→G C→T G→A T→C
3       11      red     complement                    A→T C→G G→C T→A

Each color encodes a dinucleotide. With the bases encoded as A=00, C=01, G=10, T=11, if I am at base B1 and XOR it with the color C, I get base B2, the other end of the dinucleotide. The entire SOLiD color-space dinucleotide table can be defined by asking: what color do I XOR with my first nucleotide to get my second?

Remember, a read is really a series like this: ATCG is measured as a chain of dinucleotide colors.

A -- T    color 3 == binary 11
T -- C    color 2 == binary 10
C -- G    color 3 == binary 11

Each nucleotide in the final sequence is used as the right half of one dinucleotide, and then the left half of the next dinucleotide. The first letter A is given (from the last base of the first primer if it is a read). So the data actually appears something like this:

(A) 3 2 3
or
(00) 11 10 11

If you are on base G (10) and your next color is red (11), then your next base is simply the XOR operation, so 10 XOR 11 = 01 = C. As you move along a read or a string in color-space, you can simply keep xor-ing with the next color to get the next base!
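The XOR walk described above can be sketched in a few lines. This assumes the two-bit encoding A=00, C=01, G=10, T=11 implied by the examples above; the function names are hypothetical and not from any SOLiD tool.

```python
# Two-bit base codes implied by the examples in these notes (assumption).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"  # index 0..3 maps back to the base letter

def encode(bases):
    """Return (first base, colors); each color is the XOR of two adjacent bases."""
    codes = [BASE_TO_BITS[b] for b in bases]
    return bases[0], [left ^ right for left, right in zip(codes, codes[1:])]

def decode(first_base, colors):
    """Rebuild the bases by XOR-ing each color onto the current base."""
    current = BASE_TO_BITS[first_base]
    bases = [first_base]
    for color in colors:
        current ^= color
        bases.append(BITS_TO_BASE[current])
    return "".join(bases)

print(encode("ATCG"))          # the (A) 3 2 3 example
print(decode("A", [3, 2, 3]))  # back to ATCG
print(decode("G", [3]))        # G followed by red
```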

Indel in color-space:

A C G A C A A    (original)
 1 3 2 1 1 0
A C A A          (GAC deleted; the 4 colors 3 2 1 1 become 1 new color)
 1 1 0

The colors 3 2 1 1 become 3 xor 2 xor 1 xor 1
  = 11 xor 10 xor 01 xor 01 = 01 == 1,
which is the color for C→A, as it should be.

Take the region that is changing and XOR all of its colors together; the result must be the same before and after the change. A lot of work can be done directly in color-space.
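The deletion example above can be checked mechanically. This is only a sketch, again assuming the encoding A=00, C=01, G=10, T=11; the helper names are invented for the illustration.

```python
from functools import reduce
from operator import xor

BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def colors(bases):
    """Colors of a base string: XOR of each adjacent pair of base codes."""
    codes = [BASE_TO_BITS[b] for b in bases]
    return [left ^ right for left, right in zip(codes, codes[1:])]

def xor_all(values):
    """XOR a run of colors together (0 for an empty run)."""
    return reduce(xor, values, 0)

before = colors("ACGACAA")  # 1 3 2 1 1 0
after = colors("ACAA")      # 1 1 0

# The four colors spanning C[GAC]A collapse into the single color for C->A.
collapsed = xor_all(before[1:5])
print(before, after, collapsed)
```

The slice `before[1:5]` is the run 3 2 1 1 from the example, and its XOR equals the one color that replaces it after the deletion.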

A SNP changes two adjacent colors together.

A G C G   (C --> T)
 2 3 3
   1 1

A G C A   (C --> T)
 2 3 1
   1 3

The XOR of the colors over the changed region still has to take the base before the region to the base after it.

An indel preserves that same XOR across the region; only the lengths of the two regions differ.

More complicated changes work the same way, e.g. a 5-long region replaced by a 3-long one.
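The two SNP examples above can be reproduced with a short sketch. It assumes the encoding A=00, C=01, G=10, T=11, and the function names are hypothetical.

```python
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def colors(bases):
    """Colors of a base string: XOR of each adjacent pair of base codes."""
    codes = [BASE_TO_BITS[b] for b in bases]
    return [left ^ right for left, right in zip(codes, codes[1:])]

def changed_positions(ref_colors, alt_colors):
    """Indices at which two equal-length color strings differ."""
    return [i for i, (r, a) in enumerate(zip(ref_colors, alt_colors)) if r != a]

# Both examples from the notes: the C is replaced by a T.
for ref, alt in [("AGCG", "AGTG"), ("AGCA", "AGTA")]:
    ref_c, alt_c = colors(ref), colors(alt)
    changed = changed_positions(ref_c, alt_c)
    # The two changed colors are adjacent, and their XOR is unchanged.
    print(ref_c, alt_c, changed, ref_c[1] ^ ref_c[2], alt_c[1] ^ alt_c[2])
```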

Note that when a color-call error happens, all bases downstream of it in the read will be wrong once the read is translated to base-space. This is the reason that people bother to work in color-space: there the error stays localized.

When doing SNP calling, you want to know whether a difference is a SNP or a read error. Read errors are typically independent, but a SNP produces coordinated changes. Changes that are not coordinated this way are either a larger change, a mismapping, a read error, or something else. SOLiD makes a big deal out of this. It is not, however, useful for things other than SNP calling. Even at a low error rate, millions of reads can still produce false-positive SNPs.
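To see the downstream damage concretely, here is a small sketch that miscalls one color and decodes the read anyway. The read ATCGGA and the miscalled position are made up for the illustration, and the encoding A=00, C=01, G=10, T=11 is assumed.

```python
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"

def decode(first_base, colors):
    """Rebuild the bases by XOR-ing each color onto the current base."""
    current = BASE_TO_BITS[first_base]
    bases = [first_base]
    for color in colors:
        current ^= color
        bases.append(BITS_TO_BASE[current])
    return "".join(bases)

def mismatches(a, b):
    """Indices at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

true_colors = [3, 2, 3, 0, 2]  # colors of the hypothetical read ATCGGA
bad_colors = [3, 0, 3, 0, 2]   # same read with the color at index 1 miscalled

true_bases = decode("A", true_colors)
bad_bases = decode("A", bad_colors)

print(mismatches(true_colors, bad_colors))  # one wrong color
print(true_bases, bad_bases)
print(mismatches(true_bases, bad_bases))    # every base after the error is wrong
```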

Quality

Base-space, flow-space, and color-space all come with quality scores.

SFF is the flow-space format used as input to the Newbler assembler. It has quality scores for each base, using the standard -10 log10 P(error) scale. (SFF format)

A large number of the assemblers throw away the quality data or only use it later. Some use it to just throw away reads with low quality.

Reasons for quality dropoff

Memory

How do you represent this stuff in memory? Two bits per base (four possible values). With color-space, the two-bit values can be chosen to fit what the colors should be. If the reads do not vary too much in length, 32 bases fit into one 64-bit integer.

SOLiD produces a cs-fasta file (cs = color-space). Each read starts with a T (the last base of the first adapter?), followed by the colors:
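A minimal sketch of the two-bits-per-base packing follows. The function names and the choice to put the first base in the highest bits are assumptions made for the example; Python integers are unbounded, so the 32-base limit is enforced by hand to mimic a 64-bit word.

```python
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"
MAX_BASES = 32  # 32 bases * 2 bits = 64 bits

def pack(bases):
    """Pack up to 32 bases into one integer, first base in the highest bits."""
    if len(bases) > MAX_BASES:
        raise ValueError("more than 32 bases will not fit in 64 bits")
    word = 0
    for base in bases:
        word = (word << 2) | BASE_TO_BITS[base]
    return word

def unpack(word, length):
    """Recover `length` bases from a packed integer."""
    return "".join(
        BITS_TO_BASE[(word >> (2 * i)) & 0b11] for i in range(length - 1, -1, -1)
    )

packed = pack("ACGT")
print(bin(packed), unpack(packed, 4))
```

The length has to be stored separately, because leading A's pack to zero bits; that is one reason fixed or nearly fixed read lengths are convenient.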

T 00100 ...

Sometimes we want to do matching directly in color-space. For that, Kevin throws away the first base and the first digit (the “0” above), which leaves a 24-color read instead of a 25-base read. The read can then be matched entirely in color-space, which avoids problems that would otherwise happen.

Sometimes you don't know which strand you are working on. To get the reverse-complement equivalent in color-space, all you have to do is reverse the colors. No complementing is needed.

One thing you can do when mapping is handle both strands. But you still have to hash the reversed color-space read too, so you don't save memory in searches. Hashing a genome takes a lot of space.
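The trimming step described above might look like the sketch below. It is not Kevin's actual code: the function name is hypothetical, the input is assumed to be a single read line consisting of one leading base followed by digits 0 to 3, and anything else is rejected.

```python
def internal_colors(cs_read):
    """Drop the leading adapter base and the first color of a color-space read.

    The first color links the adapter base to the first base of the read,
    so only the remaining colors describe the read itself.
    """
    compact = "".join(cs_read.split())  # tolerate spaces as in "T 00100"
    leading_base, digits = compact[0], compact[1:]
    if leading_base not in "ACGT":
        raise ValueError("read must start with a base")
    if any(d not in "0123" for d in digits):
        raise ValueError("colors must be digits 0 to 3")
    return [int(d) for d in digits[1:]]

print(internal_colors("T 00100"))
```

With 25 color digits after the leading base, this leaves the 24 colors mentioned above.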
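The reason no complementing is needed is that complementing XORs every base code with 11, and that cancels out in each dinucleotide. A quick check, assuming the encoding A=00, C=01, G=10, T=11 and using hypothetical helper names:

```python
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def colors(bases):
    """Colors of a base string: XOR of each adjacent pair of base codes."""
    codes = [BASE_TO_BITS[b] for b in bases]
    return [left ^ right for left, right in zip(codes, codes[1:])]

def reverse_complement(bases):
    """Reverse-complement in base-space."""
    return "".join(COMPLEMENT[b] for b in reversed(bases))

read = "ACGACAA"
forward = colors(read)
reverse = colors(reverse_complement(read))

# The colors of the reverse complement are the forward colors read backwards.
print(forward, reverse, reverse == forward[::-1])
```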

Final Business

Journal club papers should be fairly short. Give a 10-minute summary.

Will Nader want all 3 lectures next week? Or include some time for Journal Club. Be ready to start presenting papers by the middle of next week.

1) Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005. http://dx.doi.org/10.1038/nature03959