User Tools

Site Tools


lecture_notes:05-12-2010

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Last revision Both sides next revision
lecture_notes:05-12-2010 [2010/05/18 01:26]
svasili
lecture_notes:05-12-2010 [2010/05/20 04:14]
jmagasin
Line 1: Line 1:
-  * A brief outline of PCAP program, ​and the results ​obtained from PCAP assembly on Pog 454 data were shown by Shyamini.+Today Shyamini discussed the PCAP assembler ​and her results\\ 
 +applying it to Pog 454 data. 
 + 
 +==== Resources from Shyamini==== 
   * Presentation slides in pdf format are posted {{:​lecture_notes:​pcap_ppt.pdf|here}}.   * Presentation slides in pdf format are posted {{:​lecture_notes:​pcap_ppt.pdf|here}}.
   * Article - "​Generating Genome Assembly with PCAP" describes the major and minor programs used to run PCAP, their default parameters, the output files, and a brief description of the program. The link to the article is {{:​lecture_notes:​pcap.pdf|here}}. ​   * Article - "​Generating Genome Assembly with PCAP" describes the major and minor programs used to run PCAP, their default parameters, the output files, and a brief description of the program. The link to the article is {{:​lecture_notes:​pcap.pdf|here}}. ​
Line 5: Line 9:
  
  
 +==== Kevin discusses contaminants found when assembling H. pylori ====
 +
 +For the SOLiD paired end data for //​H.pylori//,​ some bases have 1000
 +or more **reads** (not kmers), some bases having 186,000, identical over all\\
 +25 bases! ​ Such reads comprise about 3% of the data.  What's going on?  These
 +reads are not from species Nader'​s lab is sequencing and\\
 +so he thinks they are contaminants in the reagents. ​ The reads have no
 +major blast hits, so Jenny proposed they might be sizing ladders\\
 +and so would not be in GenBank. ​ Kevin may ask Invitrogen for their ladder
 +sequences. ​ It would be nice if they published them.
 +
 +Reagents are often quite dirty. ​ One reason sequencing reagents are
 +so expensive is because have to ultrapurify them.
 +
 +The Pog data (454) only had four sequences with over 1000 reads.
 +
 +Lesson: The first thing to do with any biological data is to //look
 +for the unknowns// (stuff that should not be there). ​ They had removed\\
 +plasmids and transposons. ​ After mapping, //see what you mapped.//
 +Here, the problem was not that they had reads mapping to multiple\\
 +genome locations but rather that reads not from the genome mapped to
 +it.
 +
 +Kevin wrote a script for identifying such reads, something like '​findshortreads'​.
 +
 +==== PCAP presentation by Shyamini ====
 +
 +//​Shyamini'​s detailed slides are posted. ​ This section supplements
 +with discussion during the presentation.//​
 +
 +PCAP is an overlap consensus assembler.
 +
 +**Phase 1**: Banded dynamic programming is like Smith-Waterman but
 +instead of an n<​sup>​2</​sup>​ matrix it uses seed hits that you know
 +line\\
 +up and stays on the diagonal between them.  It loses some
 +sensitivity -- can't find really remote homologs -- but for sequences\\
 +with just a few base differences (3 or 4) it is appropriate. ​ BDP is
 +used here not just for efficiency but because we want sequences\\
 +to be
 +nearly identical. ​ (BDP would be too limited for GenBank searches,
 +//e.g.// it would not handle inserts of 60 bases.) ​ Don't use BDP if\\
 +your short reads are eukaryotic mRNA because you will lose the
 +introns. ​ BDP is great for //de novo// assembly of contigs.
 +
 +
 +**Phase 3**:  PCAP reads the FASTA and qual files.
 +
 +**PCAP - major programs**: The bclean default depth parameter of 75 is
 +a problem for us since for Pog we have 60x coverage. ​ We should\\
 +increase it to 200 or 300.  (We want to get rid of things that occur
 +10x in the genome, not twice.)
 +
 +It might be good to look at the distribution of read depths by
 +position and see if it is uniform (hopefully, but not in practice).\\
 +Actually, could get this info from the ACE file (if not using the
 +'​nobig'​ option).
 +
 +**PCAP - minor programs**: bpair has some expectation for file names
 +so Shyamini could not use it.
 +
 +**Output files**: ​
 +
 +Kevin suggested the class create a table for how all these tools have
 +done on Pog: number of contigs, maximum contig length, N50., ...
 +
 +**Last slide, a summary of output files**: 10% thrown out perhaps
 +because of the bclean coverage parameter being 75?  It took 2hrs to
 +run PCAP.
 +
 +
 +autopcap is handy but it had some expectations about the path.... Aha!
 +First line in autopcap sets a path variable to '​..'​. ​ Moral: For\\
 +(non-algorithmic) problems like this it can be easier to check out the
 +source than the documentation:​)
 +
 +Kevin remarked that it would be interesting to run PCAP raising the
 +coverage from 75 and expects it should run about as well as Velvet.\\
 +It looks promising so far... ​ Can it handle SOLiD data though?
 +Shyamini said that would require quality info.
 +
 +==== Practical discussion ====
 +
 +MPI on the cluster is not working well so maybe cannot use it to take
 +advantage of PCAP's parallelism. ​ With much pain John got it to work\\
 +(had to make a machine list, set various environment variables,
 +... without benefit of documentation). ​ Perhaps a talk for next Friday
 +(please)...
 +
 +It's very helpful when running these assemblers to be able to
 +recognize symptoms that the machine is thrashing, //i.e.// that the\\
 +machine\\
 +is spending so much time swapping pages in/out of memory that
 +it is doing no useful work.  Kevin'​s tips: Run '​top'​ to check CPU and\\
 +memory usage. ​ For CPU usage (user, sys, idle), 98% or more of your
 +job should be '​user.' ​ If the job's idle goes way up it is waiting for\\
 +disk access so swapping //could// be a problem. ​ If both the system
 +and idle times go way up and user time goes way down, thrashing. ​ If\\
 +at the same time memory usage goes up, then you're definitely
 +thrashing. ​ So kill the process. ​ It will never complete since page\\
 +faults kill speed by a factor of 100s or 1000s.
 +
 +Nice-ing your job is a good thing and does not slow down your job if
 +you are the only one on the machine. ​ Use qdel on Sun to remove jobs\\
 +causing problems.
 +
 +Kevin also discussed SSH set-up: ​ It's magic;​) ​ Copy someone else's ritual
 +There is info in Kevin'​s .ssh directories (SOE and campusrocks).
 +
 +ssh-copy-id may be helpful for setting up keys and permissions,​ the
 +latter part often error-prone if done by hand.
  
 +qlogin is an alternative to ssh.
lecture_notes/05-12-2010.txt · Last modified: 2010/05/20 04:15 by jmagasin