lecture_notes:04-28-2010

Differences

This shows you the differences between two versions of the page.

lecture_notes:04-28-2010 [2010/05/02 05:20]
galt
lecture_notes:04-28-2010 [2010/05/02 16:22] (current)
karplus added workaround for campusrocks filesystem problem
Line 3: Line 3:
 === Misc Notes: === === Misc Notes: ===
  
-**campusrocks is broken!**+**campusrocks is broken!** ​ The head node has the file system mounted as /​campusdata,​ but the client nodes have it mounted as /​campus. ​ The workaround is to use the trick in assemblies/​Pog/​map-colorspace5/​Makefile 
 +<​code>​ 
 +CWD ?= $(subst campusdata,​campus,​$(shell pwd)) 
 +</​code>​ 
 +Then instead of  
 +<​code>​ 
 +        qsub -cwd 
 +</​code>​ use  
 +<​code>​ 
 +        qsub -wd ${CWD} 
 +</​code>​ 
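The same path rewrite can be sketched outside of make. This is only a minimal illustration in Python, assuming the /campusdata and /campus mount points described above. The function names `client_cwd` and `qsub_command`, the script name `job.sh`, and the `example` directory are hypothetical.

```python
import os


def client_cwd(path=None):
    """Rewrite a head-node path so that it is valid on the client nodes."""
    if path is None:
        path = os.getcwd()
    # Same textual substitution as $(subst campusdata,campus,...) in the
    # Makefile: every occurrence of "campusdata" becomes "campus".
    return path.replace("campusdata", "campus")


def qsub_command(script, path=None):
    """Build a qsub argument list that uses -wd instead of -cwd."""
    return ["qsub", "-wd", client_cwd(path), script]


if __name__ == "__main__":
    # Hypothetical head-node directory, used only for illustration.
    head_path = "/campusdata/example/assemblies/Pog/map-colorspace5"
    print(" ".join(qsub_command("job.sh", head_path)))
```

Like the Makefile version, this is a plain string substitution, so it would also rewrite "campusdata" if it appeared deeper in the path.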
  
 //Pog// has 2 repeats: ~1k & 1.1k \\ //Pog// has 2 repeats: ~1k & 1.1k \\
 use makefiles, not shell scripts! use makefiles, not shell scripts!
  
-SOLiD data formats:\\+**Sanger quality info**\\ 
 +Kevin found the location of the Sanger qual info.\\ ​  
 +.as or something like that.\\ 
 +3 different files from 3 different runs.\\ 
 + 
 +**SOLiD data formats**:\\
 .csfasta = colorspace with numbers\\ .csfasta = colorspace with numbers\\
 .de = changes #s to letters (0123 -> ACGT) but it’s colors not numbers! very confusing.\\ .de = changes #s to letters (0123 -> ACGT) but it’s colors not numbers! very confusing.\\
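To make the difference between the two representations concrete, here is a small Python sketch. It assumes the standard SOLiD two-base encoding, in which each color is the XOR of the 2-bit codes (A=0, C=1, G=2, T=3) of two adjacent bases, and a read that starts with a known primer base. The function names are hypothetical, and missing colors (written as "." in some files) are not handled.

```python
# 2-bit codes for bases; a color is the XOR of the codes of two adjacent bases.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def double_encode(colors):
    """Relabel colors 0123 as letters ACGT. The letters are still colors."""
    return "".join(CODE_TO_BASE[int(c)] for c in colors)


def decode_colorspace(read):
    """Decode a read given as a primer base followed by color digits."""
    prev = BASE_TO_CODE[read[0]]
    bases = []
    for color in read[1:]:
        # Each color says how the next base differs from the previous one.
        prev ^= int(color)
        bases.append(CODE_TO_BASE[prev])
    return "".join(bases)


if __name__ == "__main__":
    read = "T0123"
    print(double_encode(read[1:]))  # ACGT (colors written as letters)
    print(decode_colorspace(read))  # TGAT (actual bases)
```

The two outputs differ for the same read, which is why the letter form is easy to misread as base space.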
Line 47: Line 63:
 ${EUSRC}/​assembly/​Assemble.pl pogreads.fasta 25\\ ${EUSRC}/​assembly/​Assemble.pl pogreads.fasta 25\\
  
-result: \\+**Result**: \\
 ~2k contigs which create a 2x long genome… suspicious \\ ~2k contigs which create a 2x long genome… suspicious \\
 are contigs overlapping?​ \\ are contigs overlapping?​ \\
Line 63: Line 79:
  
 Does have an option to do some simple quality filtering on the reads\\ Does have an option to do some simple quality filtering on the reads\\
-if quality data such as fastq is used.\\+if quality data such as fastq is used?\\
 -minmult look at how many things map to this area,\\ -minmult look at how many things map to this area,\\
 if less than this many things, throw it out.\\ if less than this many things, throw it out.\\
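As a rough illustration of a multiplicity cutoff like -minmult, the Python sketch below counts k-mers across reads and keeps only those seen at least a given number of times. It is not the assembler's actual implementation. The function names, the choice of k-mers as the counted unit, and the toy reads are all assumptions.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer that occurs in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_by_multiplicity(counts, min_mult):
    """Keep only k-mers seen at least min_mult times."""
    return {kmer: n for kmer, n in counts.items() if n >= min_mult}


if __name__ == "__main__":
    # Toy reads; the third one differs at a single position.
    reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
    kept = filter_by_multiplicity(kmer_counts(reads, 4), 2)
    print(sorted(kept))  # ['ACGT', 'CGTA', 'GTAC']
```

The k-mers that occur only in the odd read fall below the cutoff and are dropped, which is the intended effect of such a filter.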
  
 Error-correct reads, construct repeat graph,\\ Error-correct reads, construct repeat graph,\\
-simplifiy repepat graph with mate-reads\\+simplify repeat graph with mate-reads\\
 Error correction by threading.\\ Error correction by threading.\\
 Tries to make minimal corrections to beginnings of reads,\\ Tries to make minimal corrections to beginnings of reads,\\
Line 99: Line 115:
 === Celera Assembler: === === Celera Assembler: ===
  
-needs qual info (need this from Sanger reads, too) \\ +**Result**: ​\\ 
-... so can't run unless you have the .qual files+Celera on the Pog 454 data produced a 2.4M genome in 386 contigs, with a maximum contig size of 34k.\\
 +Needs quality information also, even for the Sanger reads \\ 
 +So can't run unless you have the .qual files
  
-seemed to have a script to convert Illumina -> their format… but not released yet+There is a script for converting Illumina (Solexa) reads into their format, but it has not been released yet.\\
 +Their next release is supposedly soon (May 1st).
  
-result: \\ +They have settings for running on Sun Grid Engine, but it did not work,\\
-with 454 data alone: 386 contigs \\ +so he turned it off.
-(newbler: ~40 contigs) \\+
  
-took about 50min+How noisy is the solid data? (Kevin)\\ 
 +On the stuff that maps completely, ​about 1.5% err rate.\\ 
 +The ones that didn't map cleanly had error-rate 2.5%.\\ 
 +Error rate goes up at the end.\\ 
 +Had some fluidics reads problems at some base positions.
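One way to see an error rate that rises toward the end of the read is to tabulate mismatches by read position. The Python sketch below is illustrative only. It assumes reads already aligned to the reference with no gaps, and the function name and toy data are hypothetical, not the data discussed in class.

```python
def per_position_error_rate(pairs):
    """Return the mismatch rate at each read position.

    pairs: list of (read, reference) strings that are already aligned
    without gaps, so position i of the read faces position i of the reference.
    """
    mismatches = []
    totals = []
    for read, ref in pairs:
        for i, (a, b) in enumerate(zip(read, ref)):
            if i == len(totals):
                # First time this position is seen: start its counters.
                totals.append(0)
                mismatches.append(0)
            totals[i] += 1
            if a != b:
                mismatches[i] += 1
    return [m / t for m, t in zip(mismatches, totals)]


if __name__ == "__main__":
    # Toy colorspace reads paired with the reference colors they map to.
    pairs = [
        ("0123", "0123"),
        ("0120", "0123"),
        ("0113", "0123"),
        ("0121", "0123"),
    ]
    print(per_position_error_rate(pairs))  # [0.0, 0.0, 0.25, 0.5]
```

A single bad position that stands out from its neighbors would suggest a per-cycle problem such as the fluidics issue mentioned above, rather than the general rise at the end of the read.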
  
 +Took about 50 minutes for all.\\
 +For comparison, Newbler took 18 minutes and produced 31 non-overlapping contigs.
  
-=== Mira ===+Just qsub them with no arguments, and it runs everything. ("​Them"?​ "​it"?​ What does this sentence mean? FIXME  --- //​[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/05/02 09:14//)
  
-needs datafile named pog_in.[format].fa \\ 
-sff_extract script to create .qual files 
  
-created 30 contigs >=500 (largest contig 640k) \\ +=== MIRA === 
-but... upon mapping to the reference genome, ​ \\+ 
 +Mostly used the default settings. 
 + 
 +mira-assembly1/​ 
 + 
 +Running is easy. 
 +Parameters: fasta, denovo, and which instruments the reads came from (e.g. 454).
 + 
 +Needs datafile named pog_in.[format].fa \\ 
 +uses sff_extract script to create .fasta and .fasta.qual files \\ 
 +and also the traceinfo_in.454.xml file. 
 + 
 +Time: 1 hour plus. 
 + 
 +Created 621 contigs, 30 larger than 500 (largest contig 640k) \\
 +The 500 cutoff is probably too large.\\
 +100 might be more reasonable.\\
 +Total consensus size is good.\\
 +But... upon mapping to the reference genome, ​ \\
 it turns out that while it is making big contigs, it's producing a chimeric assembly, in which the contigs join genomic regions that are not truly adjacent. it turns out that while it is making big contigs, it's producing a chimeric assembly, in which the contigs join genomic regions that are not truly adjacent.
-it’s getting bigger contigs because it’s joining them incorrectly! \\ +It’s getting bigger contigs because it’s joining them incorrectly! \\ 
-this is very bad; worse even than a lot of small contigs \\+This is very bad; worse even than a lot of small contigs \\ 
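The effect of the length cutoff on the contig counts reported above can be checked with a few lines of Python. This is a minimal sketch, with a hypothetical function name and toy contig lengths rather than the actual MIRA output.

```python
def contig_summary(lengths, min_len):
    """Summarize the contigs whose length is at least min_len."""
    kept = [n for n in lengths if n >= min_len]
    return {
        "count": len(kept),
        "total": sum(kept),
        "largest": max(kept, default=0),
    }


if __name__ == "__main__":
    # Toy contig lengths, for illustration only.
    lengths = [640, 120, 90, 510, 75]
    print(contig_summary(lengths, 500))
    print(contig_summary(lengths, 100))
```

Comparing the summaries at cutoffs of 500 and 100 shows how many contigs, and how much total sequence, the stricter cutoff hides. Neither number reveals whether a large contig is chimeric; that still requires mapping to the reference.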
 + 
 +Not DBG.  Should find out more about how it actually works.\\ 
 +Good to know how it works so you know what to do with the parameters. 
 + 
 +Newbler may be able to take fasta+qual file. 
 + 
 +Mira might be worth fussing with on the parameters a bit more if it looks like 
 +it is doing a good job. 
 + 
 +Mira probably can't handle large genomes due to memory. 
 +Mira has a tool to estimate memory required. 
 +For a 3.2G genome it will need 1.1TB of RAM.
  
  
lecture_notes/04-28-2010.1272777608.txt.gz · Last modified: 2010/05/02 05:20 by galt