This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision | ||
lecture_notes:05-12-2010 [2010/05/18 01:26] svasili |
lecture_notes:05-12-2010 [2010/05/20 04:15] (current) jmagasin |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | * A brief outline of PCAP program, and the results obtained from PCAP assembly on Pog 454 data were shown by Shyamini. | + | Today Shyamini discussed the PCAP assembler and her results\\ |
+ | applying it to Pog 454 data. | ||
+ | |||
+ | ==== Resources from Shyamini==== | ||
* Presentation slides in pdf format are posted {{:lecture_notes:pcap_ppt.pdf|here}}. | * Presentation slides in pdf format are posted {{:lecture_notes:pcap_ppt.pdf|here}}. | ||
* Article - "Generating Genome Assembly with PCAP" describes the major and minor programs used to run PCAP, their default parameters, the output files, and a brief description of the program. The link to the article is {{:lecture_notes:pcap.pdf|here}}. | * Article - "Generating Genome Assembly with PCAP" describes the major and minor programs used to run PCAP, their default parameters, the output files, and a brief description of the program. The link to the article is {{:lecture_notes:pcap.pdf|here}}. | ||
Line 5: | Line 9: | ||
+ | ==== Kevin discusses contaminants found when assembling H. pylori ==== | ||
+ | |||
+ | For the SOLiD paired end data for //H.pylori//, some bases have 1000 | ||
+ | or more **reads** (not kmers), some bases having 186,000, identical over all\\ | ||
+ | 25 bases! Such reads comprise about 3% of the data. What's going on? These | ||
+ | reads are not from species Nader's lab is sequencing and\\ | ||
+ | so he thinks they are contaminants in the reagents. The reads have no | ||
+ | major blast hits, so Jenny proposed they might be sizing ladders\\ | ||
+ | and so would not be in GenBank. Kevin may ask Invitrogen for their ladder | ||
+ | sequences. It would be nice if they published them. | ||
+ | |||
+ | Reagents are often quite dirty. One reason sequencing reagents are | ||
+ | so expensive is because have to ultrapurify them. | ||
+ | |||
+ | The Pog data (454) only had four sequences with over 1000 reads. | ||
+ | |||
+ | Lesson: The first thing to do with any biological data is to //look | ||
+ | for the unknowns// (stuff that should not be there). They had removed\\ | ||
+ | plasmids and transposons. After mapping, //see what you mapped.// | ||
+ | Here, the problem was not that they had reads mapping to multiple\\ | ||
+ | genome locations but rather that reads not from the genome mapped to | ||
+ | it. | ||
+ | |||
+ | Kevin wrote a script for identifying such reads, something like 'findshortreads'. | ||
+ | |||
+ | ==== PCAP presentation by Shyamini ==== | ||
+ | |||
+ | //Shyamini's detailed slides are posted. This section supplements | ||
+ | with discussion during the presentation.// | ||
+ | |||
+ | PCAP is an overlap consensus assembler. | ||
+ | |||
+ | **Phase 1**: Banded dynamic programming is like Smith-Waterman but | ||
+ | instead of an n<sup>2</sup> matrix it uses seed hits that you know | ||
+ | line\\ | ||
+ | up and stays on the diagonal between them. It loses some | ||
+ | sensitivity -- can't find really remote homologs -- but for sequences\\ | ||
+ | with just a few base differences (3 or 4) it is appropriate. BDP is | ||
+ | used here not just for efficiency but because we want sequences\\ | ||
+ | to be | ||
+ | nearly identical. (BDP would be too limited for GenBank searches, | ||
+ | //e.g.// it would not handle inserts of 60 bases.) Don't use BDP if\\ | ||
+ | your short reads are eukaryotic mRNA because you will lose the | ||
+ | introns. BDP is great for //de novo// assembly of contigs. | ||
+ | |||
+ | |||
+ | **Phase 3**: PCAP reads the FASTA and qual files. | ||
+ | |||
+ | **PCAP - major programs**: The bclean default depth parameter of 75 is | ||
+ | a problem for us since for Pog we have 60x coverage. We should\\ | ||
+ | increase it to 200 or 300. (We want to get rid of things that occur | ||
+ | 10x in the genome, not twice.) | ||
+ | |||
+ | It might be good to look at the distribution of read depths by | ||
+ | position and see if it is uniform (hopefully, but not in practice).\\ | ||
+ | Actually, could get this info from the ACE file (if not using the | ||
+ | 'nobig' option). | ||
+ | |||
+ | **PCAP - minor programs**: bpair has some expectation for file names | ||
+ | so Shyamini could not use it. | ||
+ | |||
+ | **Output files**: | ||
+ | |||
+ | Kevin suggested the class create a table for how all these tools have | ||
+ | done on Pog: number of contigs, maximum contig length, N50., ... | ||
+ | |||
+ | **Last slide, a summary of output files**: 10% thrown out perhaps | ||
+ | because of the bclean coverage parameter being 75? It took 2hrs to | ||
+ | run PCAP. | ||
+ | |||
+ | |||
+ | autopcap is handy but it had some expectations about the path.... Aha! | ||
+ | First line in autopcap sets a path variable to '..'. Moral: For\\ | ||
+ | (non-algorithmic) problems like this it can be easier to check out the | ||
+ | source than the documentation:) | ||
+ | |||
+ | Kevin remarked that it would be interesting to run PCAP raising the | ||
+ | coverage from 75 and expects it should run about as well as Velvet.\\ | ||
+ | It looks promising so far... Can it handle SOLiD data though? | ||
+ | Shyamini said that would require quality info. | ||
+ | |||
+ | ==== Practical discussion ==== | ||
+ | |||
+ | MPI on the cluster is not working well so maybe cannot use it to take | ||
+ | advantage of PCAP's parallelism. With much pain John got it to work\\ | ||
+ | (had to make a machine list, set various environment variables, | ||
+ | ... without benefit of documentation). Perhaps a talk for next Friday | ||
+ | (please)... | ||
+ | |||
+ | It's very helpful when running these assemblers to be able to | ||
+ | recognize symptoms that the machine is thrashing, //i.e.// that the machine\\ | ||
+ | is spending so much time swapping pages in/out of memory that | ||
+ | it is doing no useful work. Kevin's tips: Run 'top' to check CPU and\\ | ||
+ | memory usage. For CPU usage (user, sys, idle), 98% or more of your | ||
+ | job should be 'user.' If the job's idle goes way up it is waiting for\\ | ||
+ | disk access so swapping //could// be a problem. If both the system | ||
+ | and idle times go way up and user time goes way down, thrashing. If\\ | ||
+ | at the same time memory usage goes up, then you're definitely | ||
+ | thrashing. So kill the process. It will never complete since page\\ | ||
+ | faults kill speed by a factor of 100s or 1000s. | ||
+ | |||
+ | Nice-ing your job is a good thing and does not slow down your job if | ||
+ | you are the only one on the machine. Use qdel on Sun to remove jobs\\ | ||
+ | causing problems. | ||
+ | |||
+ | Kevin also discussed SSH set-up: It's magic;) Copy someone else's ritual | ||
+ | There is info in Kevin's .ssh directories (SOE and campusrocks). | ||
+ | |||
+ | ssh-copy-id may be helpful for setting up keys and permissions, the | ||
+ | latter part often error-prone if done by hand. | ||
+ | qlogin is an alternative to ssh. |