Today Shyamini discussed the PCAP assembler and her results from applying it to Pog 454 data.

==== Resources from Shyamini ====
  * Presentation slides in PDF format are posted {{:lecture_notes:pcap_ppt.pdf|here}}.
  * The article "Generating Genome Assembly with PCAP" gives a brief description of PCAP and covers the major and minor programs used to run it, their default parameters, and the output files. It is posted {{:lecture_notes:pcap.pdf|here}}.

==== Kevin discusses contaminants found when assembling H. pylori ====

In the SOLiD paired-end data for //H. pylori//, some positions are covered by 1,000 or more **reads** (not k-mers), and some by 186,000 reads that are identical over all 25 bases! Such reads comprise about 3% of the data. What's going on? These reads are not from the species Nader's lab is sequencing, so he thinks they are contaminants in the reagents. The reads have no major BLAST hits, so Jenny proposed they might be sizing ladders, which would not be in GenBank. Kevin may ask Invitrogen for their ladder sequences. It would be nice if they published them.

Reagents are often quite dirty. One reason sequencing reagents are so expensive is that they have to be ultrapurified.

The Pog data (454) had only four sequences with over 1,000 reads.

Lesson: the first thing to do with any biological data is to //look for the unknowns// (stuff that should not be there). They had removed plasmids and transposons. After mapping, //see what you mapped.// Here, the problem was not that reads mapped to multiple genome locations, but rather that reads not from the genome mapped to it.

Kevin wrote a script for identifying such reads, something like 'findshortreads'.
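
The script itself is not posted, but the core idea can be sketched in a few lines: count how many times each exact read sequence occurs and flag the wildly overrepresented ones. This is only my illustration (the file name and cutoff are made up), not 'findshortreads' itself:

<code python>
# Minimal sketch (not Kevin's script): count exact duplicate reads in a
# FASTA file and report sequences that are suspiciously overrepresented.
from collections import Counter

def read_counts(fasta_path):
    counts = Counter()
    seq_parts = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if seq_parts:
                    counts["".join(seq_parts)] += 1
                    seq_parts = []
            else:
                seq_parts.append(line.strip().upper())
        if seq_parts:
            counts["".join(seq_parts)] += 1
    return counts

counts = read_counts("hpylori_solid_reads.fasta")   # hypothetical path
for seq, n in counts.most_common(20):
    if n >= 1000:                                   # arbitrary cutoff
        print(n, seq)
</code>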
| + | |||
| + | ==== PCAP presentation by Shyamini ==== | ||
| + | |||
| + | //Shyamini's detailed slides are posted. This section supplements | ||
| + | with discussion during the presentation.// | ||
| + | |||
| + | PCAP is an overlap consensus assembler. | ||
| + | |||
**Phase 1**: Banded dynamic programming (BDP) is like Smith-Waterman, but instead of filling an n<sup>2</sup> matrix it uses seed hits that you know line up and stays on the diagonal between them. It loses some sensitivity (it can't find really remote homologs), but for sequences that differ by just a few bases (3 or 4) it is appropriate. BDP is used here not just for efficiency but because we want the sequences to be nearly identical. (BDP would be too limited for GenBank searches; //e.g.// it would not handle inserts of 60 bases.) Don't use BDP if your short reads are eukaryotic mRNA, because you will lose the introns. BDP is great for //de novo// assembly of contigs.
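
To make the banded idea concrete, here is a toy sketch (my own illustration, not PCAP's code) of a global alignment that only fills cells within a fixed distance of the diagonal, so the work is O(n x band) instead of O(n<sup>2</sup>):

<code python>
# Toy banded global aligner. Only cells within `band` of the main
# diagonal are filled. Assumes |len(a) - len(b)| <= band.

def banded_align(a, b, band=4, match=1, mismatch=-2, gap=-3):
    """Score of a global alignment of a and b restricted to the band."""
    n, m = len(a), len(b)
    prev = {0: 0}                          # row 0: leading gaps
    for j in range(1, min(band, m) + 1):
        prev[j] = prev[j - 1] + gap
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            best = float("-inf")
            if j - 1 in cur:               # b[j-1] aligned to a gap
                best = max(best, cur[j - 1] + gap)
            if j in prev:                  # a[i-1] aligned to a gap
                best = max(best, prev[j] + gap)
            if j >= 1 and j - 1 in prev:   # match or mismatch
                s = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, prev[j - 1] + s)
            cur[j] = best
        prev = cur
    return prev[m]

print(banded_align("ACGTACGTAC", "ACGTACCTAC"))   # one mismatch -> 7
</code>

With a band of a few cells this handles the 3-4 base differences mentioned above, but by design it could never accommodate something like a 60-base insert.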
| + | |||
| + | |||
| + | **Phase 3**: PCAP reads the FASTA and qual files. | ||
| + | |||
| + | **PCAP - major programs**: The bclean default depth parameter of 75 is | ||
| + | a problem for us since for Pog we have 60x coverage. We should\\ | ||
| + | increase it to 200 or 300. (We want to get rid of things that occur | ||
| + | 10x in the genome, not twice.) | ||
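
Rough arithmetic behind that choice (my numbers, and an assumed interpretation of the parameter as a simple read-depth cutoff):

<code python>
# Back-of-the-envelope check: at ~60x coverage, a region present in c
# copies should show up at roughly c * 60x depth.
coverage = 60
for copies in (1, 2, 5, 10):
    print(copies, "copies -> ~", copies * coverage, "x expected depth")
# A cutoff of 75 flags almost everything above single copy (~60x), while
# 200-300 keeps 2-copy regions (~120x) and still removes ~10-copy
# repeats (~600x).
</code>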
| + | |||
| + | It might be good to look at the distribution of read depths by | ||
| + | position and see if it is uniform (hopefully, but not in practice).\\ | ||
| + | Actually, could get this info from the ACE file (if not using the | ||
| + | 'nobig' option). | ||
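
A minimal sketch of that depth calculation (my illustration; it assumes the read placements, i.e. (start, length) pairs such as those recoverable from an ACE file's AF/RD lines, have already been parsed, which is the fiddly part):

<code python>
from collections import Counter

def depth_distribution(placements, contig_length):
    """Per-position read depth from (start, length) read placements
    (1-based starts), plus a histogram of depths."""
    depth = [0] * (contig_length + 2)
    for start, length in placements:          # difference-array trick
        lo = max(1, start)
        hi = min(contig_length, start + length - 1)
        if lo <= hi:
            depth[lo] += 1
            depth[hi + 1] -= 1
    for pos in range(1, contig_length + 1):   # prefix sums -> depth
        depth[pos] += depth[pos - 1]
    per_pos = depth[1:contig_length + 1]
    return per_pos, Counter(per_pos)

# Toy usage: three 5 bp reads on a 10 bp contig.
per_pos, hist = depth_distribution([(1, 5), (3, 5), (6, 5)], 10)
print(per_pos)   # [1, 1, 2, 2, 2, 2, 2, 1, 1, 1]
</code>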
| + | |||
| + | **PCAP - minor programs**: bpair has some expectation for file names | ||
| + | so Shyamini could not use it. | ||
| + | |||
| + | **Output files**: | ||
| + | |||
| + | Kevin suggested the class create a table for how all these tools have | ||
| + | done on Pog: number of contigs, maximum contig length, N50., ... | ||
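
For reference, N50 (which would go in that table) is computed as below; this is the standard definition sketched in a few lines, not any particular tool's output:

<code python>
def n50(contig_lengths):
    """N50: the length L such that contigs of length >= L cover at
    least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2.0
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0   # empty input

print(n50([100, 80, 60, 40, 20]))   # total 300, half 150 -> returns 80
</code>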
| + | |||
| + | **Last slide, a summary of output files**: 10% thrown out perhaps | ||
| + | because of the bclean coverage parameter being 75? It took 2hrs to | ||
| + | run PCAP. | ||
| + | |||
| + | |||
| + | autopcap is handy but it had some expectations about the path.... Aha! | ||
| + | First line in autopcap sets a path variable to '..'. Moral: For\\ | ||
| + | (non-algorithmic) problems like this it can be easier to check out the | ||
| + | source than the documentation:) | ||
| + | |||
| + | Kevin remarked that it would be interesting to run PCAP raising the | ||
| + | coverage from 75 and expects it should run about as well as Velvet.\\ | ||
| + | It looks promising so far... Can it handle SOLiD data though? | ||
| + | Shyamini said that would require quality info. | ||
| + | |||
| + | ==== Practical discussion ==== | ||
| + | |||
| + | MPI on the cluster is not working well so maybe cannot use it to take | ||
| + | advantage of PCAP's parallelism. With much pain John got it to work\\ | ||
| + | (had to make a machine list, set various environment variables, | ||
| + | ... without benefit of documentation). Perhaps a talk for next Friday | ||
| + | (please)... | ||
| + | |||
| + | It's very helpful when running these assemblers to be able to | ||
| + | recognize symptoms that the machine is thrashing, //i.e.// that the machine\\ | ||
| + | is spending so much time swapping pages in/out of memory that | ||
| + | it is doing no useful work. Kevin's tips: Run 'top' to check CPU and\\ | ||
| + | memory usage. For CPU usage (user, sys, idle), 98% or more of your | ||
| + | job should be 'user.' If the job's idle goes way up it is waiting for\\ | ||
| + | disk access so swapping //could// be a problem. If both the system | ||
| + | and idle times go way up and user time goes way down, thrashing. If\\ | ||
| + | at the same time memory usage goes up, then you're definitely | ||
| + | thrashing. So kill the process. It will never complete since page\\ | ||
| + | faults kill speed by a factor of 100s or 1000s. | ||
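
The same 'top'-style checks can be automated. Below is a rough sketch (not anything Kevin showed) using the third-party psutil package, with thresholds chosen arbitrarily for illustration:

<code python>
import psutil   # third-party: pip install psutil

def looks_like_thrashing(interval=5, samples=3):
    """Heuristic version of the 'top' checks: low user time, high
    system/idle/iowait, high memory use, and ongoing swap activity.
    Thresholds are arbitrary guesses, not tuned values."""
    before = psutil.swap_memory()
    hits = 0
    for _ in range(samples):
        cpu = psutil.cpu_times_percent(interval=interval)
        mem = psutil.virtual_memory()
        low_user = cpu.user < 50
        waiting = (cpu.system + cpu.idle
                   + getattr(cpu, "iowait", 0.0)) > 50
        if low_user and waiting and mem.percent > 90:
            hits += 1
    after = psutil.swap_memory()
    swapping = (after.sin > before.sin) or (after.sout > before.sout)
    return hits == samples and swapping

if __name__ == "__main__":
    print("thrashing?", looks_like_thrashing())
</code>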
| + | |||
| + | Nice-ing your job is a good thing and does not slow down your job if | ||
| + | you are the only one on the machine. Use qdel on Sun to remove jobs\\ | ||
| + | causing problems. | ||
| + | |||
| + | Kevin also discussed SSH set-up: It's magic;) Copy someone else's ritual | ||
| + | There is info in Kevin's .ssh directories (SOE and campusrocks). | ||
| + | |||
| + | ssh-copy-id may be helpful for setting up keys and permissions, the | ||
| + | latter part often error-prone if done by hand. | ||
| + | qlogin is an alternative to ssh. | ||