This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
lecture_notes:04-28-2010 [2010/04/29 21:31] learithe created |
lecture_notes:04-28-2010 [2010/05/02 16:22] karplus added workaround for campusrocks filesystem problem |
||
---|---|---|---|
Line 1: | Line 1: | ||
+ | John St. John's lecture on EULER-SR and Celera; Michael Cusack's lecture on MIRA | ||
+ | |||
=== Misc Notes: === | === Misc Notes: === | ||
- | **campusrocks is broken!** | + | **campusrocks is broken!** The head node has the file system mounted as /campusdata, but the client nodes have it mounted as /campus. The workaround is to use the trick in assemblies/Pog/map-colorspace5/Makefile |
+ | <code> | ||
+ | CWD ?= $(subst campusdata,campus,$(shell pwd)) | ||
+ | </code> | ||
+ | Then instead of | ||
+ | <code> | ||
+ | qsub -cwd | ||
+ | </code> use | ||
+ | <code> | ||
+ | qsub -wd ${CWD} | ||
+ | </code> | ||
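The same path rewrite can be sketched outside of Make. This is only an illustration of the workaround above, assuming the mount names given in these notes (/campusdata on the head node, /campus on the clients); the function names and the script name `run.sh` are hypothetical.

```python
import os
from typing import List, Optional


def client_path(path: str) -> str:
    """Rewrite a head-node path to the client-node mount point.

    Mirrors Make's $(subst campusdata,campus,...), which replaces
    every occurrence of the substring, not just the first.
    """
    return path.replace("campusdata", "campus")


def qsub_command(script: str, cwd: Optional[str] = None) -> List[str]:
    """Build (but do not run) a qsub argument list that uses -wd instead of -cwd."""
    if cwd is None:
        cwd = os.getcwd()
    return ["qsub", "-wd", client_path(cwd), script]


print(qsub_command("run.sh", "/campusdata/user/assemblies/Pog"))
```

The returned list could be handed to `subprocess.run` on the head node; it is only built here so the sketch can be tried without a cluster.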
//Pog// has 2 repeats: ~1k & 1.1k \\ | //Pog// has 2 repeats: ~1k & 1.1k \\ | ||
use makefiles, not shell scripts! | use makefiles, not shell scripts! | ||
- | SOLiD data formats:\\ | + | **Sanger quality info**\\ |
+ | Kevin found the location of the Sanger qual info.\\ | ||
+ | .as or something like that.\\ | ||
+ | 3 different files from 3 different runs.\\ | ||
+ | |||
+ | **SOLiD data formats**:\\ | ||
.csfasta = colorspace with numbers\\ | .csfasta = colorspace with numbers\\ | ||
.de = double-encoded: changes #s to letters (0123 -> ACGT), but the letters are still colors, not bases! very confusing.\\ | .de = double-encoded: changes #s to letters (0123 -> ACGT), but the letters are still colors, not bases! very confusing.\\ | ||
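To make the difference between double-encoded colors and real bases concrete, here is a minimal sketch. It assumes the standard SOLiD two-base color code (with A=0, C=1, G=2, T=3, a color is the XOR of two adjacent base codes) and a read that starts with a known primer base; the function names are made up for illustration, and missing-color calls (".") are not handled.

```python
BASES = "ACGT"


def double_encode(colors: str) -> str:
    """Relabel color digits as letters (0123 -> ACGT); the result is still colors."""
    return colors.translate(str.maketrans("0123", "ACGT"))


def decode_colorspace(read: str) -> str:
    """Decode a colorspace read such as 'T0120' into base space.

    The first character is the known primer base; each color digit is the
    XOR of the codes of two adjacent bases.
    """
    prev = BASES.index(read[0])
    decoded = []
    for color in read[1:]:
        prev ^= int(color)
        decoded.append(BASES[prev])
    return "".join(decoded)


print(double_encode("0120"))       # letters that are really colors
print(decode_colorspace("T0120"))  # actual bases
```

The two printed strings differ, which is the point: a double-encoded file looks like DNA but cannot be read as a base sequence.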
Line 12: | Line 30: | ||
- | === Euler === | + | Kevin mapped the SOLiD reads onto the newbler assembly to join the contigs
+ | and found a bug in the python script that maps the SOLiD reads.
+ | Detected because there were no joining reads | ||
+ | for the two that joined the extrachromosomal reads. | ||
+ | There was a sign error in one of my tests. | ||
+ | Re-did colorspace mapping on newbler5 assembly. | ||
+ | May still have a bug, since one gap is covered by 10 thousand reads
+ | whereas the gap on the other side is covered by only 200 reads.
+ | Will be looking to see if there is another bug. | ||
+ | If you have mate-pair data, it's good to have software
+ | to check for correct answers. The Pog mate-pair data can be used
+ | to compare against other assembly tools.
- | ran well first time (it ran, at least) \\ | + | === Euler-SR === |
- | have to run it where you installed it \\ | + | |
- | no makefiles \\ | + | |
- | result: \\ | + | SR == Short Reads |
+ | |||
+ | Euler-SR is a short-read De Bruijn Graph assembler | ||
+ | that can use long reads and mate-pairs. | ||
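As a reminder of what a De Bruijn graph is, here is a minimal sketch. It is not Euler-SR's implementation: it ignores reverse complements, error correction, and mate-pairs, and the function name is made up. Nodes are (k-1)-mers and every k-mer in a read contributes one edge.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a De Bruijn graph: each k-mer links its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


graph = de_bruijn(["ACGT", "CGTA"], 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Overlapping reads share k-mers, so they land on the same edges without any pairwise alignment; repeats longer than k show up as nodes with several outgoing edges.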
+ | |||
+ | **euler-sr-assembly1/**\\ | ||
+ | Ran on 454 data with the Sanger data concatenated into one file. | ||
+ | |||
+ | Have to set up env vars.\\ | ||
+ | No make install options.\\ | ||
+ | Things are mixed up.\\ | ||
+ | You have to run it where you installed it \\ | ||
+ | ${EUSRC}\\ | ||
+ | |||
+ | It ran well the first time (it ran, at least) \\ | ||
+ | |||
+ | ${EUSRC}/assembly/Assemble.pl pogreads.fasta 25\\ | ||
+ | |||
+ | **Result**: \\ | ||
~2k contigs which create a 2x long genome… suspicious \\ | ~2k contigs which create a 2x long genome… suspicious \\ | ||
are contigs overlapping? \\ | are contigs overlapping? \\ | ||
//find out:// \\ | //find out:// \\ | ||
- | check blat_strict_match (blat alignment to reference genome) \\ | + | check contig-blat_strict_match (blat alignment to reference genome) \\ |
look for "Q name" (contigs) which match to the same "T start" positions on the reference genome \\ | look for "Q name" (contigs) which match to the same "T start" positions on the reference genome \\ | ||
//answer:// yes, the contigs appear to overlap a lot; the double coverage arises because they overlap completely | //answer:// yes, the contigs appear to overlap a lot; the double coverage arises because they overlap completely | ||
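The check described above can be sketched as a small script. This assumes the blat output is in tab-separated PSL format, where column 10 is "Q name", column 14 is "T name", and column 16 is "T start" (1-based column numbers); the function name is hypothetical. It groups contigs by identical target start, exactly as the notes suggest, and does not test for partial overlap.

```python
from collections import defaultdict


def contigs_sharing_tstart(psl_lines):
    """Return {(t_name, t_start): [q_names]} for starts hit by more than one contig."""
    hits = defaultdict(set)
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip PSL header lines and anything malformed.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        q_name, t_name, t_start = fields[9], fields[13], int(fields[15])
        hits[(t_name, t_start)].add(q_name)
    return {key: sorted(names) for key, names in hits.items() if len(names) > 1}
```

In use, the lines would come from iterating over the opened blat output file; any key in the result is a reference position where two or more contigs begin.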
+ | |||
+ | There is one 91k contig.\\ | ||
Things to try to improve the run: \\ | Things to try to improve the run: \\ | ||
- | - longer k-mers \\ | + | - longer k-mers; increasing to 31 should be easy \\
- increase frequency threshold (help make up for read errors, maybe?) \\ | - increase frequency threshold (help make up for read errors, maybe?) \\ | ||
+ | - throw out the tiny contigs, reduce your cutoff. | ||
+ | |||
+ | Does have an option to do some simple quality filtering on the reads\\ | ||
+ | if quality data such as fastq is used?\\ | ||
+ | -minmult: look at how many things map to this area;\\
+ | if fewer than this many map, throw it out.\\
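One way to read the multiplicity threshold is as a filter on k-mer counts: k-mers seen fewer than a given number of times are treated as likely read errors and dropped. The sketch below illustrates that idea only; it is an assumption about what the option does, not Euler-SR's code, and the function name is made up.

```python
from collections import Counter


def solid_kmers(reads, k, minmult):
    """Count all k-mers in the reads and keep those seen at least minmult times."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer: n for kmer, n in counts.items() if n >= minmult}


print(solid_kmers(["ACGT", "ACGA", "ACGT"], 3, 2))
```

Raising the threshold removes more error k-mers but also removes real k-mers from low-coverage regions, which is the trade-off behind the "increase frequency threshold" suggestion.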
+ | |||
+ | Error-correct reads, construct repeat graph,\\ | ||
+ | simplify repeat graph with mate-reads\\
+ | Error correction by threading.\\ | ||
+ | Tries to make minimal corrections to beginnings of reads,\\ | ||
+ | uses those to make the kmers. Later threads the full readlength through. | ||
"Error Correction via threading" \\ | "Error Correction via threading" \\ | ||
Line 36: | Line 95: | ||
- perhaps this is where it went wrong? \\ | - perhaps this is where it went wrong? \\ | ||
+ | Mate reads.\\ | ||
+ | Multiple paths of similar length are hard to disambiguate.\\ | ||
+ | You can use multiple matepairs and bootstrap analysis.\\ | ||
+ | Use the paths with the highest probability.\\ | ||
+ | |||
+ | Pog repeats aside:\\ | ||
+ | There are several large homologous regions on opposite strands\\ | ||
+ | in Pog data that are kinds of repeats. \\ | ||
+ | They are at both ends of the area that inverts.\\ | ||
+ | Inversion happens by homologous matching, then swapping by two strands.\\ | ||
+ | Like a sloppy integrase. | ||
+ | |||
+ | |||
+ | **Solid data.**\\ | ||
+ | Used the regular base-space data in colorspace_input.fa (not double-encoded).\\ | ||
Tried to run on just the SOLiD data… started on Sunday, but still running (Wed) \\ | Tried to run on just the SOLiD data… started on Sunday, but still running (Wed) \\ | ||
Line 41: | Line 115: | ||
=== Celera Assembler: === | === Celera Assembler: === | ||
- | needs qual info (need this from Sanger reads, too) \\ | + | **Result**: \\ |
- | ... so can't run unless you have the .qual files | + | Celera on Pog 454 got 2.4M genome. 386 contigs. Max size 34k.\\ |
+ | Needs quality information also, even for the Sanger reads \\ | ||
+ | So can't run unless you have the .qual files | ||
- | seemed to have a script to convert Illumina -> their format… but not released yet | + | There is a script for converting Illumina (Solexa) reads into their format, but it is not released yet.\\
+ | Their next release is supposedly soon (May 1st). | ||
- | result: \\ | + | They have settings for running on Sun Grid Engine, but they did not work,\\
- | with 454 data alone: 386 contigs \\ | + | so he turned them off.
- | (newbler: ~40 contigs) \\ | + | |
- | took about 50min | + | How noisy is the solid data? (Kevin)\\ |
+ | On the stuff that maps completely, about 1.5% err rate.\\ | ||
+ | The ones that didn't map cleanly had error-rate 2.5%.\\ | ||
+ | Error rate goes up at the end.\\ | ||
+ | Had some fluidics problems in the reads at some base positions.
+ | Took about 50 minutes for all.\\ | ||
+ | For comparison, Newbler took 18 minutes and produced 31 non-overlapping contigs.
- | === Mira === | + | Just qsub them with no arguments, and it runs everything. ("Them"? "it"? What does this sentence mean? FIXME --- //[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/05/02 09:14//) |
- | needs datafile named pog_in.[format].fa \\ | ||
- | sff_extract script to create .qual files | ||
- | created 30 contigs >=500 (largest contig 640k) \\ | + | === MIRA === |
- | but... upon mapping to the reference genome, \\ | + | |
+ | Mostly used the default settings. | ||
+ | |||
+ | mira-assembly1/ | ||
+ | |||
+ | Running is easy. | ||
+ | Parameters: fasta denovo, and tell it which instruments the data came from (e.g. 454).
+ | |||
+ | Needs datafile named pog_in.[format].fa \\ | ||
+ | uses sff_extract script to create .fasta and .fasta.qual files \\ | ||
+ | and also the traceinfo_in.454.xml file. | ||
+ | |||
+ | Time: 1 hour plus. | ||
+ | |||
+ | Created 621 contigs, 30 larger than 500. (largest contig 640k) \\ | ||
+ | The 500 cutoff is probably too large.\\
+ | 100 might be more reasonable.\\
+ | Total consensus size is good.\\
+ | But... upon mapping to the reference genome, \\ | ||
it turns out that while it is making big contigs, it's producing a chimeric assembly, in which the contigs join genomic regions that are not truly adjacent. | it turns out that while it is making big contigs, it's producing a chimeric assembly, in which the contigs join genomic regions that are not truly adjacent. | ||
- | it’s getting bigger contigs because it’s joining them incorrectly! \\ | + | It’s getting bigger contigs because it’s joining them incorrectly! \\ |
- | this is very bad; worse even than a lot of small contigs \\ | + | This is very bad; worse even than a lot of small contigs \\ |
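The effect of the contig-length cutoff discussed above (500 versus 100) is easy to check from a list of contig lengths. This is a minimal sketch; the function name is hypothetical and the lengths in the example are made up for illustration, not taken from the MIRA run.

```python
def contig_summary(lengths, cutoff):
    """Summarize contigs of at least `cutoff` bases."""
    kept = [n for n in lengths if n >= cutoff]
    return {
        "total_contigs": len(lengths),
        "kept": len(kept),
        "kept_bases": sum(kept),
        "largest": max(kept, default=0),
    }


# Made-up contig lengths, for illustration only.
example_lengths = [640000, 1200, 450, 120, 80]
print(contig_summary(example_lengths, 500))
print(contig_summary(example_lengths, 100))
```

Note that a summary like this says nothing about correctness: a chimeric assembly scores well on contig size, so the lengths still have to be checked against a mapping to the reference.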
+ | |||
+ | MIRA is not a De Bruijn graph (DBG) assembler. Should find out more about how it actually works.\\
+ | Good to know how it works so you know what to do with the parameters. | ||
+ | |||
+ | Newbler may be able to take fasta+qual file. | ||
+ | |||
+ | It might be worth fussing with MIRA's parameters a bit more if it looks like
+ | it is doing a good job.
+ | |||
+ | Mira probably can't handle large genomes due to memory. | ||
+ | Mira has a tool to estimate memory required. | ||
+ | For a 3.2G genome it will need 1.1 TB of RAM.