User Tools

Site Tools



This shows you the differences between two versions of the page.

Link to this comparison view

Next revision
Previous revision
lecture_notes:05-12-2010 [2010/05/18 01:12]
svasili created
lecture_notes:05-12-2010 [2010/05/20 04:15] (current)
Line 1: Line 1:
-A brief outline of PCAP algorithm ​and the results ​obtained from PCAP assembly of Pog 454 data were shown.+Today Shyamini discussed the PCAP assembler ​and her results\\ 
 +applying it to Pog 454 data.
-Presentation slides in pdf format are posted {{:​lecture_notes:​pcap_ppt.pdf|here}}+==== Resources from Shyamini====
 +  * Presentation slides in pdf format are posted {{:​lecture_notes:​pcap_ppt.pdf|here}}.
 +  * Article - "​Generating Genome Assembly with PCAP" describes the major and minor programs used to run PCAP, their default parameters, the output files, and a brief description of the program. The link to the article is {{:​lecture_notes:​pcap.pdf|here}}. ​
 +  * Another article - "PCAP: A Whole-Genome Assembly Program"​ published in Genome Research 2003, describes the algorithm behind the program PCAP. Link to the article is [[http://​​content/​13/​9/​2164.full#​T1|here]].
 +==== Kevin discusses contaminants found when assembling H. pylori ====
 +For the SOLiD paired end data for //​H.pylori//,​ some bases have 1000
 +or more **reads** (not kmers), some bases having 186,000, identical over all\\
 +25 bases! ​ Such reads comprise about 3% of the data.  What's going on?  These
 +reads are not from species Nader'​s lab is sequencing and\\
 +so he thinks they are contaminants in the reagents. ​ The reads have no
 +major blast hits, so Jenny proposed they might be sizing ladders\\
 +and so would not be in GenBank. ​ Kevin may ask Invitrogen for their ladder
 +sequences. ​ It would be nice if they published them.
 +Reagents are often quite dirty. ​ One reason sequencing reagents are
 +so expensive is because have to ultrapurify them.
 +The Pog data (454) only had four sequences with over 1000 reads.
 +Lesson: The first thing to do with any biological data is to //look
 +for the unknowns// (stuff that should not be there). ​ They had removed\\
 +plasmids and transposons. ​ After mapping, //see what you mapped.//
 +Here, the problem was not that they had reads mapping to multiple\\
 +genome locations but rather that reads not from the genome mapped to
 +Kevin wrote a script for identifying such reads, something like '​findshortreads'​.
 +==== PCAP presentation by Shyamini ====
 +//​Shyamini'​s detailed slides are posted. ​ This section supplements
 +with discussion during the presentation.//​
 +PCAP is an overlap consensus assembler.
 +**Phase 1**: Banded dynamic programming is like Smith-Waterman but
 +instead of an n<​sup>​2</​sup>​ matrix it uses seed hits that you know
 +up and stays on the diagonal between them.  It loses some
 +sensitivity -- can't find really remote homologs -- but for sequences\\
 +with just a few base differences (3 or 4) it is appropriate. ​ BDP is
 +used here not just for efficiency but because we want sequences\\
 +to be
 +nearly identical. ​ (BDP would be too limited for GenBank searches,
 +//e.g.// it would not handle inserts of 60 bases.) ​ Don't use BDP if\\
 +your short reads are eukaryotic mRNA because you will lose the
 +introns. ​ BDP is great for //de novo// assembly of contigs.
 +**Phase 3**:  PCAP reads the FASTA and qual files.
 +**PCAP - major programs**: The bclean default depth parameter of 75 is
 +a problem for us since for Pog we have 60x coverage. ​ We should\\
 +increase it to 200 or 300.  (We want to get rid of things that occur
 +10x in the genome, not twice.)
 +It might be good to look at the distribution of read depths by
 +position and see if it is uniform (hopefully, but not in practice).\\
 +Actually, could get this info from the ACE file (if not using the
 +'​nobig'​ option).
 +**PCAP - minor programs**: bpair has some expectation for file names
 +so Shyamini could not use it.
 +**Output files**: ​
 +Kevin suggested the class create a table for how all these tools have
 +done on Pog: number of contigs, maximum contig length, N50., ...
 +**Last slide, a summary of output files**: 10% thrown out perhaps
 +because of the bclean coverage parameter being 75?  It took 2hrs to
 +run PCAP.
 +autopcap is handy but it had some expectations about the path.... Aha!
 +First line in autopcap sets a path variable to '​..'​. ​ Moral: For\\
 +(non-algorithmic) problems like this it can be easier to check out the
 +source than the documentation:​)
 +Kevin remarked that it would be interesting to run PCAP raising the
 +coverage from 75 and expects it should run about as well as Velvet.\\
 +It looks promising so far... ​ Can it handle SOLiD data though?
 +Shyamini said that would require quality info.
 +==== Practical discussion ====
 +MPI on the cluster is not working well so maybe cannot use it to take
 +advantage of PCAP's parallelism. ​ With much pain John got it to work\\
 +(had to make a machine list, set various environment variables,
 +... without benefit of documentation). ​ Perhaps a talk for next Friday
 +It's very helpful when running these assemblers to be able to
 +recognize symptoms that the machine is thrashing, //i.e.// that the machine\\
 +is spending so much time swapping pages in/out of memory that
 +it is doing no useful work.  Kevin'​s tips: Run '​top'​ to check CPU and\\
 +memory usage. ​ For CPU usage (user, sys, idle), 98% or more of your
 +job should be '​user.' ​ If the job's idle goes way up it is waiting for\\
 +disk access so swapping //could// be a problem. ​ If both the system
 +and idle times go way up and user time goes way down, thrashing. ​ If\\
 +at the same time memory usage goes up, then you're definitely
 +thrashing. ​ So kill the process. ​ It will never complete since page\\
 +faults kill speed by a factor of 100s or 1000s.
 +Nice-ing your job is a good thing and does not slow down your job if
 +you are the only one on the machine. ​ Use qdel on Sun to remove jobs\\
 +causing problems.
 +Kevin also discussed SSH set-up: ​ It's magic;​) ​ Copy someone else's ritual
 +There is info in Kevin'​s .ssh directories (SOE and campusrocks).
 +ssh-copy-id may be helpful for setting up keys and permissions,​ the
 +latter part often error-prone if done by hand.
 +qlogin is an alternative to ssh.
lecture_notes/05-12-2010.1274145123.txt.gz · Last modified: 2010/05/18 01:12 by svasili