lecture_notes:04-28-2010

Differences

This shows you the differences between two versions of the page.

lecture_notes:04-28-2010 [2010/04/29 21:31]
learithe created
lecture_notes:04-28-2010 [2010/05/02 16:22] (current)
karplus added workaround for campusrocks filesystem problem
Line 1: Line 1:
 +John St. John's lecture on EULER-SR and Celera; Michael Cusack's lecture on MIRA
 +
 === Misc Notes: ===
  
-**campusrocks is broken!**
 +**campusrocks is broken!** The head node has the file system mounted as /campusdata, but the client nodes have it mounted as /campus. The workaround is to use the trick in assemblies/Pog/map-colorspace5/Makefile
 +<code>
 +CWD ?= $(subst campusdata,campus,$(shell pwd))
 +</code>
 +Then instead of
 +<code>
 +        qsub -cwd
 +</code> use
 +<code>
 +        qsub -wd ${CWD}
 +</code>
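The same path rewrite can be sketched in Python, for job-submission scripts that are not driven by make. This is only an illustration of the workaround above: the function names and the job script name are hypothetical, and the only qsub option used is the ''-wd'' already shown in the notes.

```python
from typing import List, Optional
import os


def client_cwd(path: str) -> str:
    """Rewrite a head-node path to the client-node mount point.

    Mirrors $(subst campusdata,campus,...) in the Makefile: every
    occurrence of "campusdata" is replaced, not just the first one.
    """
    return path.replace("campusdata", "campus")


def qsub_command(script: str, cwd: Optional[str] = None) -> List[str]:
    """Build a qsub argument list that uses -wd instead of -cwd.

    `script` is a hypothetical job script name; `cwd` defaults to the
    current working directory as seen on the head node.
    """
    if cwd is None:
        cwd = os.getcwd()
    return ["qsub", "-wd", client_cwd(cwd), script]
```

The returned list can be handed to ''subprocess.run'' on the head node; building it as a list avoids shell-quoting problems with odd directory names.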
  
 //Pog// has 2 repeats: ~1k & 1.1k \\
 use makefiles, not shell scripts!
  
-SOLiD data formats:\\
 +**Sanger quality info**\\
 +Kevin found the location of the Sanger qual info.\\
 +.as or something like that.\\ 
 +3 different files from 3 different runs.\\ 
 + 
 +**SOLiD data formats**:\\
 .csfasta = colorspace with numbers\\
 .de = changes #s to letters (0123 -> ACGT), but the letters still stand for colors, not bases! very confusing.\\
Line 12: Line 30:
  
  
-=== Euler ===
 +Kevin mapped newbler to join the contigs
 +found a bug in the python script to map the solid reads. 
 +Detected because there were no joining reads 
 +for the two that joined the extrachromosomal reads. 
 +There was a sign error in one of my tests. 
 +Re-did colorspace mapping on newbler5 assembly. 
 +May still have a bug since one gap is covered by 10 thousand reads 
 +whereas the other side has one that is only covered by 200 reads. 
 +Will be looking to see if there is another bug. 
 +If you have mate-pair data, it's good to have software
 +to check for correct answers. Pog matepair data,
 +compare to other assembly tools.
  
-ran well first time (it ran, at least) \\
-have to run it where you installed it \\
-no makefiles \\
 +=== Euler-SR ===
 
-result: \\
 +SR == Short Reads
 + 
 +Euler-SR is a short-read De Bruijn Graph assembler 
 +that can use long reads and mate-pairs. 
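For readers who have not seen one, here is a toy de Bruijn graph built from reads. It is a generic textbook sketch, not Euler-SR's code: it ignores reverse complements, sequencing errors, and mate-pairs, and the function name is made up.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a toy de Bruijn graph from reads.

    Nodes are (k-1)-mers; each k-mer in a read adds an edge from its
    prefix (first k-1 letters) to its suffix (last k-1 letters).
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)
```

An assembler then walks paths through this graph to spell out contigs; repeats longer than k show up as nodes with more than one way in or out, which is where long reads and mate-pairs help.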
 + 
 +**euler-sr-assembly1/**\\
 +Ran on 454 data with the Sanger data concatenated into one file. 
 + 
 +Have to set up env vars.\\ 
 +No make install options.\\ 
 +Things are mixed up.\\ 
 +You have to run it where you installed it \\ 
 +${EUSRC}\\ 
 + 
 +It ran well the first time (it ran, at least) \\
 +
 +${EUSRC}/assembly/Assemble.pl pogreads.fasta 25\\
 + 
 +**Result**: \\
 ~2k contigs which create a 2x long genome… suspicious \\
 are contigs overlapping? \\
 //find out:// \\
-check blat_strict_match (blat alignment to reference genome) \\
 +check contig-blat_strict_match (blat alignment to reference genome) \\
 look for "Q name" (contigs) which match to the same "T start" positions on the reference genome \\
 //answer:// yes, appear to overlap a lot – double coverage because they totally overlap
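The check above can be automated. This sketch assumes the blat output is in the standard tab-separated PSL layout (21 columns, with Q name in column 10, T name in column 14, and T start in column 16, counting from 1); the function name is hypothetical.

```python
from collections import defaultdict


def shared_target_starts(psl_lines):
    """Group contigs (Q name) by where they start on the reference.

    Returns {(t_name, t_start): [q_names...]} for every reference start
    position hit by more than one distinct contig.
    """
    by_start = defaultdict(set)
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip PSL header lines: alignment rows have 21 columns and
        # begin with an integer match count.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        q_name, t_name, t_start = fields[9], fields[13], int(fields[15])
        by_start[(t_name, t_start)].add(q_name)
    return {key: sorted(names)
            for key, names in by_start.items() if len(names) > 1}
```

Exact matches on T start only catch contigs that begin at the same base; comparing the T start to T end intervals would also catch partial overlaps.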
 +
 +There is one 91k contig.\\
  
 Things to try to improve the run: \\
-- longer k-mers \\
 +- longer k-mers, increasing to 31 should be easy \\
 - increase frequency threshold (help make up for read errors, maybe?) \\
 +- throw out the tiny contigs, reduce your cutoff.
 +
 +Does have an option to do some simple quality filtering on the reads\\
 +if quality data such as fastq is used?\\
 +-minmult: look at how many things map to this area;\\
 +if fewer than this many, throw it out.\\
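One common way to apply a minimum-multiplicity cutoff is to count k-mers across all reads and keep only those seen often enough. This sketch shows that general idea; it is an assumption about what the option does, not a description of Euler-SR's internals, and the names are made up.

```python
from collections import Counter


def solid_kmers(reads, k, min_mult):
    """Return the k-mers that occur at least `min_mult` times in `reads`.

    K-mers below the cutoff are treated as likely sequencing errors
    and discarded.
    """
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= min_mult}
```

Raising the cutoff removes more error k-mers but also drops true k-mers from low-coverage regions, so it trades false joins for gaps.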
 +
 +Error-correct reads, construct repeat graph,\\
 +simplify repeat graph with mate-reads\\
 +Error correction by threading.\\
 +Tries to make minimal corrections to beginnings of reads,\\
 +uses those to make the kmers. Later threads the full read length through.
 
 "Error Correction via threading" \\
Line 36: Line 95:
 - perhaps this is where it went wrong? \\
  
 +Mate reads.\\
 +Multiple paths of similar length are hard to disambiguate.\\
 +You can use multiple matepairs and bootstrap analysis.\\
 +Use the paths with the highest probability.\\
 +
 +Pog repeats aside:\\
 +There are several large homologous regions on opposite strands\\
 +in the Pog data that are a kind of repeat.\\
 +They are at both ends of the area that inverts.\\
 +Inversion happens by homologous matching, then swapping of the two strands.\\
 +Like a sloppy integrase.
 +
 +
 +**SOLiD data.**\\
 +Used the regular base-space data in colorspace_input.fa (not double-encoded).\\
 Tried to run on just the SOLiD data… started on Sunday, but still running (Wed) \\
  
Line 41: Line 115:
 === Celera Assembler: ===
  
-needs qual info (need this from Sanger reads, too) \\
-... so can't run unless you have the .qual files
 +**Result**: \\
 +Celera on Pog 454 data got a 2.4M genome: 386 contigs, max size 34k.\\
 +Needs quality information also, even for the Sanger reads, \\
 +so can't run unless you have the .qual files.
  
-seemed to have a script to convert Illumina -> their format… but not released yet
 +There is a script for converting Illumina (Solexa) reads into their format, but it is not released yet.\\
 +Their next release is supposedly soon (May 1st).
  
-result: \\
-with 454 data alone: 386 contigs \\
-(newbler: ~40 contigs) \\
 +They have settings for running on Sun Grid Engine, but that did not work,\\
 +so he turned it off.
  
-took about 50min
 +How noisy is the SOLiD data? (Kevin)\\
 +On the reads that map completely, about 1.5% error rate.\\
 +The ones that didn't map cleanly had an error rate of 2.5%.\\
 +Error rate goes up at the end.\\
 +Had some fluidics problems in the reads at some base positions.
  
 +Took about 50 minutes for all.\\
 +For comparison, Newbler took 18 minutes and produced 31 non-overlapping contigs.
  
-=== Mira ===
 +Just qsub them with no arguments, and it runs everything. ("Them"? "it"? What does this sentence mean? FIXME  --- //[[karplus@soe.ucsc.edu|Kevin Karplus]] 2010/05/02 09:14//)
  
-needs datafile named pog_in.[format].fa \\ 
-sff_extract script to create .qual files 
  
-created 30 contigs >=500 (largest contig 640k) \\
-but... upon mapping to the reference genome, \\
 +=== MIRA ===
 +
 +Mostly used the default settings. 
 + 
 +mira-assembly1/
 + 
 +Running is easy. 
 +Parameters: fasta denovo, tell it which instruments it has (e.g. 454 etc). 
 + 
 +Needs datafile named pog_in.[format].fa \\ 
 +uses sff_extract script to create .fasta and .fasta.qual files \\ 
 +and also the traceinfo_in.454.xml file. 
 + 
 +Time: 1 hour plus. 
 + 
 +Created 621 contigs, 30 larger than 500 (largest contig 640k) \\
 +The 500 cutoff is probably too large.\\
 +100 might be more reasonable.\\
 +Total consensus size is good.\\
 +But... upon mapping to the reference genome, \\
 it turns out that while it is making big contigs, it's producing a chimeric assembly, in which the contigs join genomic regions that are not truly adjacent.
-it’s getting bigger contigs because it’s joining them incorrectly! \\
-this is very bad; worse even than a lot of small contigs \\
 +It’s getting bigger contigs because it’s joining them incorrectly! \\
 +This is very bad; worse even than a lot of small contigs \\
 + 
 +Not a de Bruijn graph (DBG) assembler. Should find out more about how it actually works.\\
 +Good to know how it works so you know what to do with the parameters. 
 + 
 +Newbler may be able to take fasta+qual file. 
 + 
 +Mira might be worth fussing with on the parameters a bit more if it looks like 
 +it is doing a good job. 
 + 
 +Mira probably can't handle large genomes due to memory.
 +Mira has a tool to estimate memory required.
 +For a 3.2G genome it will need 1.1 TB of RAM.
  
  
lecture_notes/04-28-2010.1272576713.txt.gz · Last modified: 2010/04/29 21:31 by learithe