This shows you the differences between two versions of the page.
|
lecture_notes:04-28-2010 [2010/05/02 05:22] galt |
lecture_notes:04-28-2010 [2010/05/02 16:22] (current) karplus added workaround for campusrocks filesystem problem |
||
|---|---|---|---|
| Line 3: | Line 3: | ||
| === Misc Notes: === | === Misc Notes: === | ||
| - | **campusrocks is broken!** | + | **campusrocks is broken!** The head node has the file system mounted as /campusdata, but the client nodes have it mounted as /campus. The workaround is to use the trick in assemblies/Pog/map-colorspace5/Makefile |
| + | <code> | ||
| + | CWD ?= $(subst campusdata,campus,$(shell pwd)) | ||
| + | </code> | ||
| + | Then instead of | ||
| + | <code> | ||
| + | qsub -cwd | ||
| + | </code> use | ||
| + | <code> | ||
| + | qsub -wd ${CWD} | ||
| + | </code> | ||
| //Pog// has 2 repeats: ~1k & 1.1k \\ | //Pog// has 2 repeats: ~1k & 1.1k \\ | ||
| use makefiles, not shell scripts! | use makefiles, not shell scripts! | ||
| - | SOLiD data formats:\\ | + | **Sanger quality info**\\ |
| + | Kevin found the location of the Sanger qual info.\\ | ||
| + | .as or something like that.\\ | ||
| + | 3 different files from 3 different runs.\\ | ||
| + | |||
| + | **SOLiD data formats**:\\ | ||
| .csfasta = colorspace with numbers\\ | .csfasta = colorspace with numbers\\ | ||
| .de = changes #s to letters (0123 -> ACGT) but it’s colors not numbers! very confusing.\\ | .de = changes #s to letters (0123 -> ACGT) but it’s colors not numbers! very confusing.\\ | ||
| Line 47: | Line 63: | ||
| ${EUSRC}/assembly/Assemble.pl pogreads.fasta 25\\ | ${EUSRC}/assembly/Assemble.pl pogreads.fasta 25\\ | ||
| - | result: \\ | + | **Result**: \\ |
| ~2k contigs which create a 2x long genome… suspicious \\ | ~2k contigs which create a 2x long genome… suspicious \\ | ||
| are contigs overlapping? \\ | are contigs overlapping? \\ | ||
| Line 99: | Line 115: | ||
| === Celera Assembler: === | === Celera Assembler: === | ||
| - | needs qual info (need this from Sanger reads, too) \\ | + | **Result**: \\ |
| - | ... so can't run unless you have the .qual files | + | Celera on Pog 454 got 2.4M genome. 386 contigs. Max size 34k.\\ |
| + | Needs quality information also, even for the Sanger reads \\ | ||
| + | So can't run unless you have the .qual files | ||
| - | seemed to have a script to convert Illumina -> their format… but not released yet | + | They have a script for converting Illumina (Solexa) reads into their format, but it has not been released yet.\\ |
| + | Their next release is supposedly soon (May 1st). | ||
| - | result: \\ | + | They have settings for running under Sun Grid Engine, but they did not work,\\ |
| - | with 454 data alone: 386 contigs \\ | + | so he turned them off. |
| - | (newbler: ~40 contigs) \\ | + | |
| - | took about 50min | + | How noisy is the solid data? (Kevin)\\ |
| + | On the stuff that maps completely, about 1.5% err rate.\\ | ||
| + | The ones that didn't map cleanly had error-rate 2.5%.\\ | ||
| + | Error rate goes up at the end.\\ | ||
| + | There were some fluidics problems with the reads at some base positions. | ||
| + | Took about 50 minutes for all.\\ | ||
| + | For comparison, Newbler took 18 minutes and produced 31 non-overlapping contigs. | ||
| - | === Mira === | + | Just qsub them with no arguments, and it runs everything. ("Them"? "it"? What does this sentence mean? FIXME --- //[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/05/02 09:14//) |
| - | needs datafile named pog_in.[format].fa \\ | ||
| - | sff_extract script to create .qual files | ||
| - | created 30 contigs >=500 (largest contig 640k) \\ | + | === MIRA === |
| - | but... upon mapping to the reference genome, \\ | + | |
| + | Mostly used the default settings. | ||
| + | |||
| + | mira-assembly1/ | ||
| + | |||
| + | Running is easy. | ||
| + | Parameters: fasta denovo, and tell it which instruments the reads came from (e.g. 454). | ||
| + | |||
| + | Needs datafile named pog_in.[format].fa \\ | ||
| + | uses sff_extract script to create .fasta and .fasta.qual files \\ | ||
| + | and also the traceinfo_in.454.xml file. | ||
| + | |||
| + | Time: 1 hour plus. | ||
| + | |||
| + | Created 621 contigs, 30 larger than 500. (largest contig 640k) \\ | ||
| + | The 500 cutoff is probably too large.\\ | ||
| + | 100 might be more reasonable.\\ | ||
| + | Total consensus size is good.\\ | ||
| + | But... upon mapping to the reference genome, \\ | ||
| it turns out that while it is making big contigs, it's producing a chimeric assembly, in which the contigs join genomic regions that are not truly adjacent. | it turns out that while it is making big contigs, it's producing a chimeric assembly, in which the contigs join genomic regions that are not truly adjacent. | ||
| - | it’s getting bigger contigs because it’s joining them incorrectly! \\ | + | It’s getting bigger contigs because it’s joining them incorrectly! \\ |
| - | this is very bad; worse even than a lot of small contigs \\ | + | This is very bad; worse even than a lot of small contigs \\ |
| + | |||
| + | Not DBG (de Bruijn graph) based. Should find out more about how it actually works.\\ | ||
| + | Good to know how it works so you know what to do with the parameters. | ||
| + | |||
| + | Newbler may be able to take fasta+qual file. | ||
| + | |||
| + | It might be worth fussing with Mira's parameters a bit more if it looks like | ||
| + | it is doing a good job. | ||
| + | |||
| + | Mira probably can't handle large genomes due to memory. | ||
| + | Mira has a tool to estimate memory required. | ||
| + | For a 3.2G genome it will need 1.1TB of RAM. | ||
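
The campusrocks workaround under Misc Notes comes down to rewriting the working directory before handing it to qsub. Here is a minimal shell sketch of the same substitution, for use outside a Makefile. The helper name `to_client_path` and the script name `job.sh` are hypothetical placeholders, not taken from the Makefile mentioned in the notes, and the sketch only prints the qsub command rather than submitting anything.

```shell
#!/bin/sh
# Hypothetical helper (not from the original Makefile): rewrite a head-node
# path under /campusdata to the client-node mount point /campus.
# Like $(subst campusdata,campus,...), it replaces every occurrence.
to_client_path() {
    printf '%s\n' "$1" | sed 's/campusdata/campus/g'
}

# Working directory as the client nodes would see it.
CWD=$(to_client_path "$(pwd)")

# job.sh is a placeholder script name; print the command instead of running it.
echo "would run: qsub -wd $CWD job.sh"
```

A path that is already under /campus contains no "campusdata" and passes through unchanged, so the helper should be safe to call from either kind of node.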