Differences

This shows you the differences between two versions of the page.

--- lecture_notes:04-28-2010 [2010/04/29 21:31]
learithe created
+++ lecture_notes:04-28-2010 [2010/05/02 16:22]
karplus added workaround for campusrocks filesystem problem
@@ Line 1: / Line 1: @@
+John St. John's lecture on EULER-SR and Celera; Michael Cusack's lecture on MIRA
 === Misc Notes: ===
-**campusrocks is broken!**
+**campusrocks is broken!**  The head node has the file system mounted as /campusdata, but the client nodes have it mounted as /campus.  The workaround is to use the trick in assemblies/Pog/map-colorspace5/Makefile
+<code>
+CWD ?= $(subst campusdata,campus,$(shell pwd))
+</code>
+Then instead of
+<code>
+        qsub -cwd
+</code> use
+<code>
+        qsub -wd ${CWD}
+</code>
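The same path rewrite can be sketched outside of make. This is a minimal Python illustration, assuming only what the notes state (the head node sees /campusdata, the client nodes see /campus). The function names and the example paths are hypothetical and are not part of the Pog Makefile.

```python
def client_path(head_node_path):
    """Rewrite a head-node path so it is valid on the client nodes.

    Mirrors $(subst campusdata,campus,...): every occurrence of
    "campusdata" is replaced, not just the first one.
    """
    return head_node_path.replace("campusdata", "campus")


def qsub_command(head_node_cwd, script):
    """Build a qsub argument list that uses -wd instead of -cwd.

    `script` is a hypothetical job script name used for illustration.
    """
    return ["qsub", "-wd", client_path(head_node_cwd), script]


if __name__ == "__main__":
    # Hypothetical working directory as seen from the head node.
    print(" ".join(qsub_command("/campusdata/user/assemblies/Pog", "job.sh")))
```

Like the make `subst` call, this replaces the string anywhere in the path, so a directory whose own name contains "campusdata" would also be rewritten.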
 //Pog// has 2 repeats: ~1k & 1.1k \\
 use makefiles, not shell scripts!
-SOLiD data formats:\\
+**Sanger quality info**\\
+Kevin found the location of the Sanger qual info.\\
+.as or something like that.\\
+different files from 3 different runs.\\
+**SOLiD data formats**:\\
 .csfasta = colorspace with numbers\\
 .de = changes #s to letters (0123 -> ACGT) but it’s colors not numbers! very confusing.\\
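To make the "colors, not bases" point concrete, here is a small Python sketch. It assumes the standard two-base color code (with A, C, G, T numbered 0 to 3, each color is the XOR of two adjacent bases). The function names are hypothetical, and mapping "." to "N" in the double-encoded form is an assumption for illustration.

```python
BASES = "ACGT"

# Double encoding only relabels the colors 0123 as the letters ACGT.
# The result LOOKS like DNA but is still a color sequence.
# Mapping "." (missing color) to "N" is an assumption.
_DOUBLE_ENCODE = str.maketrans("0123.", "ACGTN")


def double_encode(colors):
    """Relabel color digits as letters (0123 -> ACGT)."""
    return colors.translate(_DOUBLE_ENCODE)


def decode_colors(primer_base, colors):
    """Translate colors into real bases, starting from the known primer base.

    Each color is the XOR of the indices of two adjacent bases, so the next
    base is recovered as previous_base XOR color.
    """
    prev = BASES.index(primer_base)
    out = []
    for c in colors:
        prev ^= int(c)
        out.append(BASES[prev])
    return "".join(out)


if __name__ == "__main__":
    print(double_encode("0123"))       # letters that are still colors
    print(decode_colors("T", "0123"))  # actual bases following the primer T
```

The two outputs differ, which is why a double-encoded file must not be treated as base-space sequence.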
@@ Line 12: / Line 30: @@
-=== Euler ===
+Kevin mapped SOLiD reads onto the Newbler assembly to join the contigs and
+found a bug in the Python script that maps the SOLiD reads.
+Detected because there were no joining reads
+for the two that joined the extrachromosomal reads.
+There was a sign error in one of my tests.
+Re-did colorspace mapping on newbler5 assembly.
+May still have a bug since one gap is covered by 10 thousand reads
+whereas the other side has one that is only covered by 200 reads.
+Will be looking to see if there is another bug.
+If you have mate-pair data, it's good to have software
+to check for correct answers.  Pog matepair data,
+compare to other assembly tools.
-ran well first time (it ran, at least)  \\
+=== Euler-SR ===
-have to run it where you installed it \\
-no makefiles \\
-result: \\
+SR == Short Reads
+Euler-SR is a short-read de Bruijn graph assembler
+that can use long reads and mate-pairs.
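For readers new to de Bruijn graphs, the following toy Python sketch shows the basic construction: each k-mer in a read becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. This is not Euler-SR's implementation. It ignores reverse complements, errors, and mate-pairs, and the function name is hypothetical.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a toy de Bruijn graph from reads.

    Nodes are (k-1)-mers; every k-mer contributes one edge
    prefix -> suffix. Returns a plain dict of node -> set of successors.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    for node, successors in sorted(de_bruijn_edges(["ACGT", "CGTA"], 3).items()):
        print(node, "->", sorted(successors))
```

Reads that overlap share k-mers, so they end up on the same path through the graph without any explicit pairwise overlap step.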
+**euler-sr-assembly1/**\\
+Ran on 454 data with the Sanger data concatenated into one file.
+Have to set up env vars.\\
+No make install options.\\
+Things are mixed up.\\
+You have to run it where you installed it \\
+${EUSRC}\\
+It ran well the first time (it ran, at least)  \\
+${EUSRC}/assembly/Assemble.pl pogreads.fasta 25\\
+**Result**: \\
 ~2k contigs which create a 2x long genome… suspicious \\
 are contigs overlapping? \\
 //find out:// \\
-check blat_strict_match  (blat alignment to reference genome) \\
+check contig-blat_strict_match  (blat alignment to reference genome) \\
 look for "Q name" (contigs) which match to the same "T start" positions on the reference genome \\
 //answer://yes, appear to overlap a lot – double coverage because they totally overlap
+There is one 91k contig.\\
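The overlap check described above can be sketched in Python. This assumes the blat output is in the standard tab-separated PSL format (21 columns, with "Q name" in column 10, "T name" in column 14 and "T start" in column 16). The function name is hypothetical, and only exact ties on "T start" are reported.

```python
from collections import defaultdict

# 0-based column indices in a PSL line.
Q_NAME, T_NAME, T_START = 9, 13, 15


def contigs_by_target_start(psl_lines):
    """Group contig names by the reference position where they start.

    Returns {(t_name, t_start): [contig names]} for positions hit by
    more than one contig. Header lines and malformed lines are skipped.
    """
    groups = defaultdict(set)
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 21 or not fields[0].isdigit():
            continue  # PSL header or non-alignment line
        groups[(fields[T_NAME], int(fields[T_START]))].add(fields[Q_NAME])
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    import sys
    # Usage (file name taken from the notes): python script.py contig-blat_strict_match
    with open(sys.argv[1]) as handle:
        for (t_name, t_start), names in sorted(contigs_by_target_start(handle).items()):
            print(t_name, t_start, ",".join(names))
```

Contigs that overlap without sharing an identical start would need an interval comparison on "T start" and "T end" instead of this exact-match grouping.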
 Things to try to improve the run: \\
-- longer k-mers \\
+- longer k-mers, increasing to 31 should be easy \\
 - increase frequency threshold (help make up for read errors, maybe?) \\
+- throw out the tiny contigs, reduce your cutoff.
+Does have an option to do some simple quality filtering on the reads\\
+if quality data such as fastq is used?\\
+-minmult look at how many things map to this area,\\
+if less than this many things, throw it out.\\
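A minimal Python sketch of a multiplicity cutoff follows, assuming that -minmult acts on k-mer counts (the notes only say "how many things map to this area", so this reading is an assumption). The function names are hypothetical and this is not Euler-SR code.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequent_kmers(reads, k, min_mult):
    """Keep only k-mers seen at least min_mult times.

    Rare k-mers are more likely to come from read errors, so raising
    min_mult trades sensitivity in low-coverage regions for a cleaner graph.
    """
    return {kmer for kmer, n in count_kmers(reads, k).items() if n >= min_mult}


if __name__ == "__main__":
    reads = ["ACGTA", "ACGTT", "ACGTA"]
    print(sorted(frequent_kmers(reads, 4, 2)))
```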
+Error-correct reads, construct repeat graph,\\
+simplify repeat graph with mate-reads\\
+Error correction by threading.\\
+Tries to make minimal corrections to beginnings of reads,\\
+uses those to make the kmers.   Later threads the full readlength through.
 "Error Correction via threading" \\
@@ Line 36: / Line 95: @@
 - perhaps this is where it went wrong? \\
+Mate reads.\\
+Multiple paths of similar length are hard to disambiguate.\\
+You can use multiple matepairs and bootstrap analysis.\\
+Use the paths with the highest probability.\\
+Pog repeats aside:\\
+There are several large homologous regions on opposite strands\\
+in Pog data that are kinds of repeats.  \\
+They are at both ends of the area that inverts.\\
+Inversion happens by homologous matching, then swapping by two strands.\\
+Like a sloppy integrase.
+**SOLiD data.**\\
+Used the regular base-space data in colorspace_input.fa (not double-encoded).\\
 Tried to run on just the SOLiD data… started on Sunday, but still running (Wed) \\
@@ Line 41: / Line 115: @@
 === Celera Assembler: ===
-needs qual info (need this from Sanger reads, too) \\
+**Result**: \\
-... so can't run unless you have the .qual files
+Celera on Pog 454 got 2.4M genome.  386 contigs.  Max size 34k.\\
+Needs quality information also, even for the Sanger reads \\
+So can't run unless you have the .qual files
-seemed to have a script to convert Illumina -> their format… but not released yet
+Script for converting Illumina (Solexa) reads into their format but not released yet.\\
+Their next release is supposedly soon (May 1st).
-result: \\
+They have settings for running on Sun Grid Engine, but that did not work,\\
-with 454 data alone: 386 contigs \\
+so he turned it off.
-(newbler: ~40 contigs) \\
-took about 50min
+How noisy is the SOLiD data? (Kevin)\\
+On the stuff that maps completely, about 1.5% err rate.\\
+The ones that didn't map cleanly had error-rate 2.5%.\\
+Error rate goes up at the end.\\
+Had some fluidics reads problems at some base positions.
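One way to see "error rate goes up at the end" and the bad base positions is to tabulate mismatches per read position. The Python sketch below assumes ungapped alignments supplied as (read, reference) string pairs in the same space (both color or both base). The function name and the input layout are hypothetical and are not Kevin's mapping script.

```python
def error_rate_by_position(aligned_pairs):
    """Return the mismatch fraction at each read position.

    aligned_pairs: iterable of (read, reference_segment) strings,
    aligned without gaps and starting at the same position.
    """
    mismatches, totals = [], []
    for read, ref in aligned_pairs:
        for i, (r, t) in enumerate(zip(read, ref)):
            if i == len(totals):
                # First time this position is seen: extend the tallies.
                totals.append(0)
                mismatches.append(0)
            totals[i] += 1
            if r != t:
                mismatches[i] += 1
    return [m / n for m, n in zip(mismatches, totals)]


if __name__ == "__main__":
    pairs = [("ACGT", "ACGT"), ("ACGA", "ACGT")]
    print(error_rate_by_position(pairs))
```

A position with a fluidics problem would show up as an isolated spike, while end-of-read degradation shows up as a rising trend.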
+Took about 50 minutes for all.\\
+For comparison, Newbler took 18 minutes and produced 31 non-overlapping contigs.
-=== Mira ===
+Just qsub them with no arguments, and it runs everything. ("Them"? "it"? What does this sentence mean? FIXME  --- //[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/05/02 09:14//)
-needs datafile named pog_in.[format].fa \\
-sff_extract script to create .qual files
-created 30 contigs >=500 (largest contig 640k) \\
+=== MIRA ===
-but... upon mapping to the reference genome,  \\
+Mostly used the default settings.
+mira-assembly1/
+Running is easy.
+Parameters: fasta denovo, tell it which instruments it has (e.g. 454 etc).
+Needs datafile named pog_in.[format].fa \\
+uses sff_extract script to create .fasta and .fasta.qual files \\
+and also the traceinfo_in.454.xml file.
+Time: 1 hour plus.
+Created 621 contigs, 30 larger than 500. (largest contig 640k) \\
+The 500 cutoff is probably too large.\\
+might be more reasonable.\\
+Total consensus size is good.\\
+But... upon mapping to the reference genome,  \\
 it turns out that while it is making big contigs, it's producing a chimeric assembly, in which the contigs join genomic regions that are not truly adjacent.
-it’s getting bigger contigs because it’s joining them incorrectly! \\
+It’s getting bigger contigs because it’s joining them incorrectly! \\
-this is very bad; worse even than a lot of small contigs \\
+This is very bad; worse even than a lot of small contigs \\
+Not DBG.  Should find out more about how it actually works.\\
+Good to know how it works so you know what to do with the parameters.
+Newbler may be able to take fasta+qual file.
+It might be worth tuning MIRA's parameters a bit more if it looks like
+it is doing a good job.
+MIRA probably can't handle large genomes due to memory.
+MIRA has a tool to estimate memory required.
+For a 3.2G genome it will need 1.1 TB of RAM.

Banana Slug Genomics
