Differences

This shows you the differences between two versions of the page.

--- lecture_notes:04-28-2010 [2010/05/02 05:20]
galt
+++ lecture_notes:04-28-2010 [2010/05/02 16:22] (current)
karplus added workaround for campusrocks filesystem problem
@@ Line 3: / Line 3: @@
 === Misc Notes: ===
-**campusrocks is broken!**
+**campusrocks is broken!**  The head node has the file system mounted as /campusdata, but the client nodes have it mounted as /campus.  The workaround is to use the trick in assemblies/Pog/map-colorspace5/Makefile:
+<code>
+CWD ?= $(subst campusdata,campus,$(shell pwd))
+</code>
+Then instead of
+<code>
+        qsub -cwd
+</code> use
+<code>
+        qsub -wd ${CWD}
+</code>
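The same path rewrite can be sketched outside of make. This is a minimal Python sketch, assuming the only difference between the head-node and client-node paths is the `campusdata` versus `campus` component. The helper names `client_visible_cwd` and `qsub_args` and the script name `job.sh` are hypothetical and not part of the original Makefile.

```python
import os

def client_visible_cwd(path=None):
    """Mirror $(subst campusdata,campus,$(shell pwd)):
    replace every occurrence of 'campusdata' in the path with 'campus'."""
    if path is None:
        path = os.getcwd()
    return path.replace("campusdata", "campus")

def qsub_args(script, path=None):
    """Build the argument list for 'qsub -wd <dir> <script>'.
    The command is only constructed here, not executed."""
    return ["qsub", "-wd", client_visible_cwd(path), script]

# Hypothetical example path on the head node.
print(qsub_args("job.sh", "/campusdata/user/assemblies/Pog"))
```

A path that already starts with /campus is left unchanged, so the helper should be safe to call from either kind of node.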
 //Pog// has 2 repeats: ~1k & 1.1k \\
 use makefiles, not shell scripts!
-SOLiD data formats:\\
+**Sanger quality info**\\
+Kevin found the location of the Sanger qual info.\\
+.as or something like that.\\
+different files from 3 different runs.\\
+**SOLiD data formats**:\\
 .csfasta = colorspace with numbers\\
 .de = changes #s to letters (0123 -> ACGT) but it’s colors not numbers! very confusing.\\
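Why the lettered form is confusing can be shown with a small sketch. It assumes the standard two-bit color code (A=0, C=1, G=2, T=3, where each color is the XOR of two adjacent bases) and a known first base. The function names are hypothetical, and whether a given tool keeps or drops the leading primer base is not covered here.

```python
BASES = "ACGT"

def double_encode(colors):
    """Relabel color digits 0123 as the letters ACGT.
    The result is still a color sequence, not a base sequence."""
    return "".join(BASES[int(c)] for c in colors)

def decode_colors(first_base, colors):
    """Translate colors to real bases, given the base preceding the first color.
    Each color is the XOR of the 2-bit codes of two adjacent bases."""
    prev = BASES.index(first_base)
    out = []
    for c in colors:
        prev ^= int(c)          # next base code = previous base code XOR color
        out.append(BASES[prev])
    return "".join(out)

colors = "0123"
print(double_encode(colors))       # letters that are really colors
print(decode_colors("T", colors))  # actual bases following a leading T
```

The two printed strings differ, which is the point: the letters in the double-encoded form must not be read as nucleotides.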
@@ Line 47: / Line 63: @@
 ${EUSRC}/assembly/Assemble.pl pogreads.fasta 25\\
-result: \\
+**Result**: \\
 ~2k contigs which create a 2x long genome… suspicious \\
 are contigs overlapping? \\
@@ Line 63: / Line 79: @@
 Does have an option to do some simple quality filtering on the reads\\
-if quality data such as fastq is used.\\
+if quality data such as fastq is used?\\
 -minmult look at how many things map to this area,\\
 if less than this many things, throw it out.\\
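A multiplicity cutoff of this kind can be sketched as follows. This is only an illustration, assuming the "things" being counted are k-mers, which the notes do not confirm. The function name `solid_kmers` and the parameter `k` are hypothetical, and this is not the assembler's actual implementation.

```python
from collections import Counter

def solid_kmers(reads, k, minmult):
    """Count every k-mer across all reads and keep those seen at least
    minmult times; k-mers seen fewer times are thrown out."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= minmult}

reads = ["ACGTA", "ACGTC", "TTTTT"]
print(sorted(solid_kmers(reads, k=4, minmult=2)))
```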
 Error-correct reads, construct repeat graph,\\
-simplifiy repepat graph with mate-reads\\
+simplify repeat graph with mate-reads\\
 Error correction by threading.\\
 Tries to make minimal corrections to beginnings of reads,\\
@@ Line 99: / Line 115: @@
 === Celera Assembler: ===
-needs qual info (need this from Sanger reads, too) \\
+**Result**: \\
-... so can't run unless you have the .qual files
+Celera on Pog 454 got 2.4M genome.  386 contigs.  Max size 34k.\\
+Needs quality information also, even for the Sanger reads \\
+So can't run unless you have the .qual files
-seemed to have a script to convert Illumina -> their format… but not released yet
+Script for converting Illumina (Solexa) reads into their format but not released yet.\\
+Their next release is supposedly soon (May 1st).
-result: \\
+They have settings for running under Sun Grid Engine, but it did not work,\\
-with 454 data alone: 386 contigs \\
+so he turned it off.
-(newbler: ~40 contigs) \\
-took about 50min
+How noisy is the solid data? (Kevin)\\
+On the stuff that maps completely, about 1.5% err rate.\\
+The ones that didn't map cleanly had error-rate 2.5%.\\
+Error rate goes up at the end.\\
+Had some fluidics reads problems at some base positions.
+Took about 50 minutes for all.\\
+For comparison, Newbler took 18 minutes and produced 31 non-overlapping contigs.
-=== Mira ===
+Just qsub them with no arguments, and it runs everything. ("Them"? "it"? What does this sentence mean? FIXME  --- //[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/05/02 09:14//)
-needs datafile named pog_in.[format].fa \\
-sff_extract script to create .qual files
-created 30 contigs >=500 (largest contig 640k) \\
+=== MIRA ===
-but... upon mapping to the reference genome,  \\
+Mostly used the default settings.
+mira-assembly1/
+Running is easy.
+Parameters: fasta denovo, tell it which instruments it has (e.g. 454 etc).
+Needs datafile named pog_in.[format].fa \\
+uses sff_extract script to create .fasta and .fasta.qual files \\
+and also the traceinfo_in.454.xml file.
+Time: 1 hour plus.
+Created 621 contigs, 30 larger than 500. (largest contig 640k) \\
+The 500 cutoff is probably too large.\\
+might be more reasonable.\\
+Total consensus size is good.\\
+But... upon mapping to the reference genome,  \\
 it turns out that while it is making big contigs, it's producing a chimeric assembly, in which the contigs join genomic regions that are not truly adjacent.
-it’s getting bigger contigs because it’s joining them incorrectly! \\
+It’s getting bigger contigs because it’s joining them incorrectly! \\
-this is very bad; worse even than a lot of small contigs \\
+This is very bad; worse even than a lot of small contigs \\
+Not DBG.  Should find out more about how it actually works.\\
+Good to know how it works so you know what to do with the parameters.
+Newbler may be able to take fasta+qual file.
+Mira might be worth fussing with on the parameters a bit more if it looks like
+it is doing a good job.
+Mira probably can't handle large genomes due to memory.
+Mira has a tool to estimate memory required.
+For a 3.2G genome it will need 1.1 TB of RAM.

Banana Slug Genomics
