User Tools

Site Tools


lecture_notes:04-30-2010

Logistics

Presentations

Some people haven't done presentations or put much up on the wiki pages.

I want everyone to signup for a time when they will be presenting one of the assemblers. For those that haven't done it, please sign up. Can always do a mapper like Bowtie or BWA. even if we run out of tools that work. Should add a mappers section on the tools page.

Next week will be student presentations. We need volunteers to do presentations. Especially the people who haven't yet presented on an assembler or mapper.

Presentations should include the algorithms, and wiki tools page should include the paper. Makes a difference as to how to apply it to our data.

Not necessarily going to take an hour per assembly. 2 students per day would be fine. If it takes longer, that's fine.

Volunteers for Monday: Installing, Running, How does it work in general? Find out which have gotten done and which have not. Mark off which have been done on the list. (apr 5)

Shorty maybe should be on the list. The idea of using the matepaired or paired end data to make a group in a cluster and then to assemble that. Let's add Shorty to the list of tools.

Maybe we don't have anyone ready for Monday, you need to be ready to do it in the next two weeks. You can tell me or send me email.

Literature

Also, people should do the literature search stuff for slug biology. Can put scope on biology page. Want it to be relevant as it's not a model organism for very many people. Other Ariolimax stuff that might be genus specific should be ok.

Slug locomotion might not even matter what the genus is. For taxonomy, the ones that distinguish Ariolimax are probably better. We may need to put in new subheaders to group papers by topic, e.g. locomotion.

People need to start putting in more stuff on the wiki, read the paper, put up a paragraph about what's interesting in it about dolichophallus. Jan's bibliography has already been put up on the wiki page.

Future years of this class will use the wiki pages.

Doing some reading so we can categorize. Everyone should read a couple of the papers. Let's get most of them read, know what kind of information exists out there.

Just like we should all read at least a couple of different papers on assemblers.

Genome Browser

Patricia. She has automated it more than the main genome browser group. There are different types of browsers. Might be interesting to have someone in browser group do a presentation on it.

Homework for Wednesday

Homework assignments, as people have not been contributing equally to the wiki. Those of you who took David Bernick's class will have already done it. Different version of newbler produced different assembly. Order and orient the contigs. Try to do it.

Do the contig alignments from the .join file.
(Kevin had already re-done the map-colorspace5.)

map-colorspace5

bme235/assemblies/Pog/map-colorspace5 just the mappings of the solid pairs to the 31 contigs made by newbler. Several files were done by the mapcolorspace program that Kevin wrote. The trim*files are the output. There are files called “delete” and “invert”. “delete” weren't an appropriate distance to make a read. <350 or >5000 were not a good read, “invert” if mapped to opposite strands. Could examine these to look for deletions or inversions.

trim9-cross.rdb mapped to different contigs. not filtered for where, e.g. ends that it mapped. Ambiguous stuff was filtered out.

trim9-out is summary of mapping. There were indexed with two different kmers in color space, length 15 and 11. These are the trim parameters. trim9 off 24 leaves 15. trim3 off 24 leaves 21? Longer kmer will reduce number of ambiguous things, but reduces coverage.

48% of the reads mapped. Maybe I should run it with the improved version of mapcolorspace. I tried to map it more agressively.

Looks for the first 15mer, if it is unique, it knows where it is. If it exists more than once, it maps the other end. The aggressive approach allows for a small indel. Before it required two that mapped to the same place. More aggressive mapping can have more mistakes. But did not seem to cause a lot more trouble in the output. Multiply mapped regions will happen in Pog,1000bases that are in two places. Won't find if error every 10 bases. Made bugfixes yesterday, needs to install it and run it. It made no sense that 23% of the reads were mapping nowhere. Now it is lower, about 0.5%. Matepair uses stuff on either side of the adapter. The two things on either end of the adapters are just stuff. Some fraction of the reads are bogus. Essentially all the reported deletions are this case of double-adaptor junk. The correct answer is on lowelab, and not on campusrocks. On an earlier version, we were seeing a strand-bias. SOCs was one that had that problem. It was an algorithmic defect. Comparing the F3 to R3 counts, can see if there is bias. The error rates are 1-1.5%. The assembled pieces have nearly .01% (1/10000).

He finds putatives snps and small indels. Most are wrong, finds a handful that are real.

The data we will use is in the .joins file. Has all the reads that crossed the contig boundary. Looks for which are in the appropriate direction and on the right ends.

The counts for each contig run a python script over all the trim counts. Then it looks to see if it is mapping forward or backwards on the last part of the contig. See if I stuck these two together would it work sticking them together as a good read.

Contiglengths.rdb can give an estimate of coverage.

Back to .join.
3 2 is the same as -2 -3
-7 20 === -20 7
If you flip the pair, the direction an order switch.
If he has - - , he just flips it to have + +.
Chose to put the lower-number contig first.

The last column is useful but misleading. It is an estimate of the gap between the contigs. If this pair were the optimal length (e.g. 2200) where would

      "-----------"
   "-----------"
"-------|  |------"
   "----------"
     "---------"

If you pretend they were all at the average length, how big do you expect this gap to be? Sometimes if there are no reads in the gap, it really skews the number. It is dubious, but sometimes tells you something. Can be an extra ordering hint. The trick is to take these edges and try to make a larger genome or scaffold. Pog goes pretty well, but where it falls apart is where you have repeats. The same thing going in two edges. 30*31/2 Don't care about things with a small number of reads spanning. Can ignore stuf below 100 and call it noise. Just ignore the noise. So only the first 50 lines tell you anything. I will re-run with the latest version, it should get even sharper.

Significance values are tough because of various problems. In one case, ended up with 40 contigs and only 3 could be properly mapped.

For inversions, flip and look for reads across boundaries. Can make a correction if you know how mappable it is, although we don't know that here.

Some of the short contigs are highly mappable.

Take that data and try to come up with the mapping, knowing that some things are duplicated, that there are two circular genomes, the big Pog and the virus particle.

When you are mapping the scaffold do you do it by hand or not. Kevin has done it for Pog once and H. Pyl a couple of times. It's very simple, but has a couple of places that it goes. The problem is it has two ways to go, and they're both right.

Sometimes you have to make judgment call, and then check with mappability. We don't have a full agorithm for the assembly. If you ignore data or make it up, it might work, but hard to automate. Can not do this with 1000 contigs. Probably wouldn't have enough solid data for that anyway.

Anyway, give it a try orienting the .join file.

Estimating Genome Size

slug/newbler-assembly1/

Kevin tried to estimate the genome size. It is somewhat bogus, but the best estimate we have so far. Maybe later with the Illumina data.

We only have 138M 454 slug data. Other molluscs are 0.7 to 7 G. The smallest estimate is 5-6x the data that we have. We won't get 20x coverage. We should at least have 2x coverage. Illumina data is much more voluminous. But won't be able to do it with newbler alone. It assembled 9k contigs. 2.9M stuff in contigs. N50 size 615. Inferred read error - If I treat the assembly as correct, how error prone are the reads. Any differences between assembly and reads are sequencing errors (or mapping errors) the error rate is fairly high. 20x higher than the underlying platform. This indicates more of an assembly error than a sequencing error. 4.7% is not believable. He told it the coverage was low. If it aligns at all, it's stuck together. This is mis-assembled because the 4.7% read err is bogus.

But what can we do with the junk assembly? We did get somestuff piling up. We could look at the number of singletons. We could look at where it was able to overlap. What's the probability? We know how much sequence we have. newbler estimates 60M, which is just wrong.

Let's try doing some computations.

g = genome size
c = coverage
cg = # bases sequenced 138M (1.384e6)
n = number of reads 500k (0.5e6)
avg read length = cg/n
p = prob. that some specific base is in a particular read.
  = (cg/n)/g = c / n
p = prob. that base not seen = 1-p
base never seen = (1-p)^n
  = (1-c/n)^n = e^-c.

This does not depend on the genome size. I don't know the coverage. This alone doesn't help. Expected number of them == g*e^-c. The ones that have been seen once, and more than once? P seen exactly once = P (1-P)^(n-1). (n times)

n * P/(1-p) * e^-c.
1-p will be approx 1.
So this is c*e^-c.
The expected number is c*g*e^-c.

We do know cg at least since that's the number of bases. What about bases seen more than once? Either do bases 2x, or grather than 1. Greater than one is just all the rest, so that's

1-e^-c - c*e^-c

So expected number is

g*(1-(1+c)*e^-c)

FYI base seen twice is

N2/2 * p2 * e^-c == cce^-c/2.

Number of bases in final assembly is 2.9M. we know cg, and we know. Hard to do with closed-form solution. We can do it numerically with gnuplot. Got coverage of about 4.3% Leads to an estimated genome size of 3.2G.

The most bogus part was the assumption that any time it saw two things it made an overlap. Maybe all we're seeing is the repeat regions.

The number of aliged bases/2 estimates that this is 0.7 G. This seems unlikely. Not knowing detailed info about newbler behavior makes it difficult.

One other estimate based on singletons. What fraction of your reads are singleton reads? What's the probability? Gets a length of 2.75G which is smaller than the smallest known mollusc. Kevin doesn't believe this either.

We can do the same sort of estimate from any sort of data. If you know more about the assemblers algorithm, and Illumina data, should get a better estimate. Right now it's too sparse. Almost never hits. Even just getting the right answer within a factor of 10 was encouraging.

Second set of lecture notes

From Jonathan Magasin.

Today's lecture covered files generated by Kevin's map-colorspace5
script which will be necessary for the homework assignment: ordering
the Pog contigs. Also covered: How to roughly estimate banana slug
genome size.

Homework assignment: assemble Pog

From the newer newbler we have a different set of contigs, thirty-one
of them. We also have mate pairs from SOLiD mapped to newbler
assembly 5. The assignment is to order and orient them the contigs.
Kevin has posted his solution in the assembly directory
(assemblies/Pog/map-colorspace5).

'delete' and 'invert' files

The trim9 files are output from Kevin's map-colorspace5 script. Most
of them are not for us. The 'delete' files are for mate pairs where
both mapped but not at the appropriate distance: closer than 350
bases, or more than 5000 bases apart. These files can be studied to
find massive deletions.

'invert' files are for mate pairs that were on opposite strands (at
any distance) and are useful for studying inversions. However
inversions are not so useful because they will not be within a single
contig.

The 'between contigs' files are for when the reads mapped to different
contigs. Only cases with unique mappings are in these files. fixme:
What file(s) is this?

trim9.out

The trim9.out file is a summary of what happened during mapping.
Kmers are colorspace kmers. These kmers are the lengths after
trimming off nine. 'uniquly mapped' means one for the R3 and one for
the F3. Note that some of the summary stats are bugs (the 'wrong
range' line) that have since been fixed. [Kevin reran the script of
Friday.]

Compared to earlier mapping, Kevin has increased the agressiveness.

The deletions ('wrong range' line) appear to be from reads taken from
clones with multiple adapters.

Strand biases (algorithmic) were checked for in the forward and
backward counts of mapped reads.

Tonight Kevin [reran] the mapping script. The new output will
include error rates by position. Sol requested these be called
mismatches rather than errors. Kevin said they are estimates of
sequencing error. They are in colorspace, and exclude indels. In
reality they are differences between the reads and assemblies, not
errors. But for all practical purposes when there is a mismatch
between a read and the assembly the error is in the read.)

trim9.joins

The data we'll use is not in trim9.out, rather it is in trim9.joins.
trim9.joins is computed from trim9-cross.rdb (which has reads that
crossed a contig boundary).

Looking at the first line: How many reads support contig 22 following
contig 3? 34K. The contigs lengths are there to see if a contig is
short enough that mate pairs might span them. E.g. what orderings are
consistent with the listed contig orders: 2→3→4, or maybe 3 is very
short so we have a mate pair that jumps over 3.

A minus sign appearing before a contig name means to take the
reverse-complement of that contig. (And contig3 followed by contig22
is the same as -contig22 followed by -contig3.)

The last number on each line indicates if the mate pair were the
optimal length (optimal is the peak of length distribution). That
number is the expected length of the gap between the contigs. It is
very rough.

The file is sorted most-counts to fewest. The stuff at bottom of the
file is noise. Only about first fifty lines are helpful.

Long mate pairs are very helpful for disambiguating, and wish we had
them for H.pylori.

Your task

Take the trim9.joins file and try to order and orient the contigs,
knowing that some of them are duplicated, and that there is a virus in
the mix and there should be one large chromosome.

You could leave a comment if you were logged in.
lecture_notes/04-30-2010.txt · Last modified: 2010/05/05 14:05 by jmagasin