Today Shyamini discussed the PCAP assembler and her results from applying it
to the Pog 454 data.
For the SOLiD paired-end data for H. pylori, some bases have 1000 or more
reads (not k-mers), and some have 186,000, identical over all 25 bases! Such
reads comprise about 3% of the data. What's going on? These reads are not
from the species Nader's lab is sequencing, so he thinks they are
contaminants in the reagents. The reads have no major BLAST hits, so Jenny
proposed they might be sizing ladders and thus would not be in GenBank. Kevin
may ask Invitrogen for their ladder sequences; it would be nice if they
published them.
Reagents are often quite dirty. One reason sequencing reagents are so
expensive is that they have to be ultrapurified.
The Pog data (454) only had four sequences with over 1000 reads.
Lesson: The first thing to do with any biological data is to look
for the unknowns (stuff that should not be there). They had removed
plasmids and transposons. After mapping, see what you mapped.
Here, the problem was not that they had reads mapping to multiple
genome locations but rather that reads not from the genome mapped to
it.
Kevin wrote a script for identifying such reads, something like 'findshortreads'.
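As a rough illustration (this is not Kevin's script, and the file name
reads.fa is just a placeholder), a duplicate-read check could look something
like this in Python:

    # Count exact-duplicate reads in a FASTA file and report any sequence that
    # shows up suspiciously often (e.g. 1000+ identical copies), since such
    # reads are candidate contaminants or sizing-ladder sequences.
    from collections import Counter

    def overrepresented_reads(fasta_path, min_count=1000):
        counts = Counter()
        seq = []
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    if seq:
                        counts["".join(seq)] += 1
                    seq = []
                else:
                    seq.append(line.strip().upper())
            if seq:
                counts["".join(seq)] += 1
        return [(s, n) for s, n in counts.most_common() if n >= min_count]

    for seq, n in overrepresented_reads("reads.fa"):
        print(n, seq)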
Shyamini's detailed slides are posted; this section supplements them with
discussion from the presentation.
PCAP is an overlap consensus assembler.
Phase 1: Banded dynamic programming is like Smith-Waterman, but instead of
filling an n^2 matrix it uses seed hits that you know line up and stays near
the diagonal between them. It loses some sensitivity – it can't find really
remote homologs – but for sequences with just a few base differences (3 or 4)
it is appropriate. BDP is used here not just for efficiency but because we
want sequences to be nearly identical. (BDP would be too limited for GenBank
searches; e.g., it would not handle inserts of 60 bases.) Don't use BDP if
your short reads are eukaryotic mRNA, because you will lose the introns. BDP
is great for de novo assembly of contigs.
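To make the banded idea concrete, here is a minimal toy sketch (my own code,
not PCAP's implementation) of a banded global-alignment score in Python: only
cells within `band` of the main diagonal get filled, so two nearly identical
sequences cost O(n * band) instead of O(n^2):

    NEG = float("-inf")   # cells outside the band are unreachable

    def banded_score(a, b, band=4, match=1, mismatch=-1, gap=-2):
        n, m = len(a), len(b)
        prev = [gap * j if j <= band else NEG for j in range(m + 1)]
        for i in range(1, n + 1):
            cur = [NEG] * (m + 1)
            if i <= band:
                cur[0] = gap * i
            for j in range(max(1, i - band), min(m, i + band) + 1):
                diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
            prev = cur
        return prev[m]

    print(banded_score("ACGTACGTACGT", "ACGTACCTACGT"))  # one mismatch -> 10

With a narrow band, a 60-base insert would push the true alignment path
outside the band, which is exactly why BDP is fine for near-identical reads
but wrong for GenBank-style searches.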
Phase 3: PCAP reads the FASTA and qual files.
PCAP - major programs: The bclean default depth parameter of 75 is
a problem for us since for Pog we have 60x coverage. We should
increase it to 200 or 300. (We want to get rid of things that occur
10x in the genome, not twice.)
It might be good to look at the distribution of read depths by position and
see whether it is uniform (it should be, but in practice it won't be). One
could get this info from the ACE file (if not using the 'nobig' option).
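Here is a hedged sketch of pulling the depth distribution out of an ACE file,
assuming the usual phrap/consed-style layout where CO lines name contigs, AF
lines give each read's padded start position, and RD lines give its padded
length (PCAP's ACE output may differ in detail):

    from collections import Counter, defaultdict

    def depth_histograms(ace_path):
        starts = {}                    # read name -> (contig, padded start)
        lengths = {}                   # read name -> padded length
        contig = None
        with open(ace_path) as ace:
            for line in ace:
                fields = line.split()
                if not fields:
                    continue
                if fields[0] == "CO":      # CO <contig> <bases> <reads> ...
                    contig = fields[1]
                elif fields[0] == "AF":    # AF <read> <U|C> <padded start>
                    starts[fields[1]] = (contig, int(fields[3]))
                elif fields[0] == "RD":    # RD <read> <padded length> ...
                    lengths[fields[1]] = int(fields[2])
        depth = defaultdict(Counter)       # contig -> position -> read depth
        for read, (ctg, start) in starts.items():
            for pos in range(start, start + lengths.get(read, 0)):
                depth[ctg][pos] += 1
        return depth

Plotting a histogram of the per-position depths should then show how far the
distribution is from uniform.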
PCAP - minor programs: bpair has some expectations about file names, so
Shyamini could not use it.
Output files:
Kevin suggested the class create a table of how all these tools have done on
Pog: number of contigs, maximum contig length, N50, …
Last slide, a summary of output files: 10% was thrown out, perhaps because
the bclean coverage parameter was 75? It took 2 hours to run PCAP.
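For the suggested comparison table, the contig count, maximum length, and N50
are easy to compute from a list of contig lengths; a hypothetical helper (the
function name and example lengths are made up):

    def assembly_stats(lengths):
        lengths = sorted(lengths, reverse=True)
        half = sum(lengths) / 2
        running, n50 = 0, 0
        for contig_len in lengths:
            running += contig_len
            if running >= half:
                # N50: length such that contigs this long or longer
                # hold at least half the assembled bases
                n50 = contig_len
                break
        return {"contigs": len(lengths), "max": lengths[0], "N50": n50}

    print(assembly_stats([500, 400, 300, 200, 100]))
    # {'contigs': 5, 'max': 500, 'N50': 400}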
autopcap is handy, but it had some expectations about the path… Aha! The
first line in autopcap sets a path variable to '..'. Moral: for
(non-algorithmic) problems like this it can be easier to check the source
than the documentation :)
Kevin remarked that it would be interesting to run PCAP with the coverage
parameter raised from 75; he expects it should do about as well as Velvet. It
looks promising so far… Can it handle SOLiD data, though? Shyamini said that
would require quality info.
MPI on the cluster is not working well, so we may not be able to use it to
take advantage of PCAP's parallelism. With much pain John got it to work (he
had to make a machine list, set various environment variables, … all without
benefit of documentation). Perhaps a talk for next Friday (please)…
It's very helpful when running these assemblers to be able to recognize
symptoms that the machine is thrashing, i.e. spending so much time swapping
pages in and out of memory that it is doing no useful work. Kevin's tips: run
'top' to check CPU and memory usage. Of the CPU time (user, sys, idle), 98%
or more of your job should be 'user'. If idle goes way up, the job is waiting
for disk access, so swapping could be a problem. If both system and idle
times go way up while user time goes way down, that's thrashing. If memory
usage rises at the same time, you're definitely thrashing. Kill the process;
it will never complete, since page faults slow things down by factors of
hundreds or thousands.
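The same symptoms can be watched programmatically; a hedged sketch using the
psutil library (not one of the class tools) that flags the combination of
collapsing user time and rising memory pressure:

    import psutil

    def watch(period=5, user_floor=90.0, mem_ceiling=95.0):
        while True:
            cpu = psutil.cpu_times_percent(interval=period)  # user/sys/idle %
            mem = psutil.virtual_memory()
            swap = psutil.swap_memory()
            print(f"user={cpu.user:.0f}% sys={cpu.system:.0f}% "
                  f"idle={cpu.idle:.0f}% mem={mem.percent:.0f}% "
                  f"swap={swap.percent:.0f}%")
            if cpu.user < user_floor and mem.percent > mem_ceiling and swap.percent > 50:
                print("likely thrashing -- consider killing the job")

    # watch()   # run alongside the assembler and keep an eye on the output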
Nice-ing your job is a good thing and does not slow it down if you are the
only one on the machine. Use qdel on the Sun Grid Engine queue to remove jobs
causing problems.
Kevin also discussed SSH set-up: it's magic ;) Copy someone else's ritual.
There is info in Kevin's .ssh directories (SOE and campusrocks).
ssh-copy-id may be helpful for setting up keys and permissions; the latter is
often error-prone if done by hand.
qlogin is an alternative to ssh.