Differences

This shows you the differences between two versions of the page.

--- lecture_notes:05-12-2010 [2010/05/18 01:26]
svasili
+++ lecture_notes:05-12-2010 [2010/05/20 04:15] (current)
jmagasin
@@ Line 1: / Line 1: @@
-  * A brief outline of PCAP program, and the results obtained from PCAP assembly on Pog 454 data were shown by Shyamini.
+Today Shyamini discussed the PCAP assembler and her results\\
+applying it to Pog 454 data.
+==== Resources from Shyamini====
   * Presentation slides in pdf format are posted {{:lecture_notes:pcap_ppt.pdf|here}}.
   * Article - "Generating Genome Assembly with PCAP" describes the major and minor programs used to run PCAP, their default parameters, the output files, and a brief description of the program. The link to the article is {{:lecture_notes:pcap.pdf|here}}.
@@ Line 5: / Line 9: @@
+==== Kevin discusses contaminants found when assembling H. pylori ====
+For the SOLiD paired end data for //H.pylori//, some bases have 1000
+or more **reads** (not kmers), some bases having 186,000, identical over all\\
+bases!  Such reads comprise about 3% of the data.  What's going on?  These
+reads are not from species Nader's lab is sequencing and\\
+so he thinks they are contaminants in the reagents.  The reads have no
+major blast hits, so Jenny proposed they might be sizing ladders\\
+and so would not be in GenBank.  Kevin may ask Invitrogen for their ladder
+sequences.  It would be nice if they published them.
+Reagents are often quite dirty.  One reason sequencing reagents are
+so expensive is because have to ultrapurify them.
+The Pog data (454) only had four sequences with over 1000 reads.
+Lesson: The first thing to do with any biological data is to //look
+for the unknowns// (stuff that should not be there).  They had removed\\
+plasmids and transposons.  After mapping, //see what you mapped.//
+Here, the problem was not that they had reads mapping to multiple\\
+genome locations but rather that reads not from the genome mapped to
+it.
+Kevin wrote a script for identifying such reads, something like 'findshortreads'.
+==== PCAP presentation by Shyamini ====
+//Shyamini's detailed slides are posted.  This section supplements
+with discussion during the presentation.//
+PCAP is an overlap consensus assembler.
+**Phase 1**: Banded dynamic programming is like Smith-Waterman but
+instead of an n<sup>2</sup> matrix it uses seed hits that you know
+line\\
+up and stays on the diagonal between them.  It loses some
+sensitivity -- can't find really remote homologs -- but for sequences\\
+with just a few base differences (3 or 4) it is appropriate.  BDP is
+used here not just for efficiency but because we want sequences\\
+to be
+nearly identical.  (BDP would be too limited for GenBank searches,
+//e.g.// it would not handle inserts of 60 bases.)  Don't use BDP if\\
+your short reads are eukaryotic mRNA because you will lose the
+introns.  BDP is great for //de novo// assembly of contigs.
+**Phase 3**:  PCAP reads the FASTA and qual files.
+**PCAP - major programs**: The bclean default depth parameter of 75 is
+a problem for us since for Pog we have 60x coverage.  We should\\
+increase it to 200 or 300.  (We want to get rid of things that occur
+x in the genome, not twice.)
+It might be good to look at the distribution of read depths by
+position and see if it is uniform (hopefully, but not in practice).\\
+Actually, could get this info from the ACE file (if not using the
+'nobig' option).
+**PCAP - minor programs**: bpair has some expectation for file names
+so Shyamini could not use it.
+**Output files**:
+Kevin suggested the class create a table for how all these tools have
+done on Pog: number of contigs, maximum contig length, N50., ...
+**Last slide, a summary of output files**: 10% thrown out perhaps
+because of the bclean coverage parameter being 75?  It took 2hrs to
+run PCAP.
+autopcap is handy but it had some expectations about the path.... Aha!
+First line in autopcap sets a path variable to '..'.  Moral: For\\
+(non-algorithmic) problems like this it can be easier to check out the
+source than the documentation:)
+Kevin remarked that it would be interesting to run PCAP raising the
+coverage from 75 and expects it should run about as well as Velvet.\\
+It looks promising so far...  Can it handle SOLiD data though?
+Shyamini said that would require quality info.
+==== Practical discussion ====
+MPI on the cluster is not working well so maybe cannot use it to take
+advantage of PCAP's parallelism.  With much pain John got it to work\\
+(had to make a machine list, set various environment variables,
+... without benefit of documentation).  Perhaps a talk for next Friday
+(please)...
+It's very helpful when running these assemblers to be able to
+recognize symptoms that the machine is thrashing, //i.e.// that the machine\\
+is spending so much time swapping pages in/out of memory that
+it is doing no useful work.  Kevin's tips: Run 'top' to check CPU and\\
+memory usage.  For CPU usage (user, sys, idle), 98% or more of your
+job should be 'user.'  If the job's idle goes way up it is waiting for\\
+disk access so swapping //could// be a problem.  If both the system
+and idle times go way up and user time goes way down, thrashing.  If\\
+at the same time memory usage goes up, then you're definitely
+thrashing.  So kill the process.  It will never complete since page\\
+faults kill speed by a factor of 100s or 1000s.
+Nice-ing your job is a good thing and does not slow down your job if
+you are the only one on the machine.  Use qdel on Sun to remove jobs\\
+causing problems.
+Kevin also discussed SSH set-up:  It's magic;)  Copy someone else's ritual
+There is info in Kevin's .ssh directories (SOE and campusrocks).
+ssh-copy-id may be helpful for setting up keys and permissions, the
+latter part often error-prone if done by hand.
+qlogin is an alternative to ssh.

Banana Slug Genomics

User Tools

Site Tools

Differences

Page Tools