archive:computer_resources:assemblies

Differences

This shows you the differences between two versions of the page.

archive:computer_resources:assemblies [2010/04/22 12:29]
karplus Added T-shirt k-mer suggestion.
archive:computer_resources:assemblies [2010/04/28 20:11]
galt
Line 1: Line 1:
 ====== assemblies/ ======
 This directory has a subdirectory for each organism.
 +
 +===== test/ =====
 +
 +For test assemblies provided by the tool makers to check that installation is correct.
 +  * velvet
 +    * velvet-assembly1/ A first attempt to use velvet to assemble its own simulated TEST 100kb genome with short 35b reads and long 100b reads. Assembly worked very well in the end. Was reminded that when using short reads, must keep k small to get adequate coverage.
 +      * 6th try, k=21 cov=19 -shortPaired -long
 +      * Final graph has 9 nodes and n50 of 99975, max 99975, total 100143, using 137863/144858 reads
 +      * RESULTS: excellent! (TEST DATA ONLY)
 +
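The n50 figure quoted above can be illustrated with a minimal Python sketch. It uses the common definition (the length of the node at which the running total of lengths, sorted longest first, reaches half of the overall total); whether velvet computes its reported n50 in exactly this way is not confirmed here. The eight small node lengths are hypothetical, chosen only so that 9 nodes sum to the reported total of 100143.

```python
def n50(lengths):
    """Return the N50 of a collection of node/contig lengths.

    Common definition: sort lengths longest first and return the length
    at which the running sum first reaches half of the total.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50 is undefined for an empty list of lengths")


# Hypothetical node lengths: the 99975 maximum is from the notes above;
# the eight nodes of length 21 are made up so the total is 100143.
test_nodes = [99975] + [21] * 8
print(sum(test_nodes), n50(test_nodes))
```

With one node holding nearly the whole genome, n50 equals the maximum node length, which matches the pattern reported for the test data.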
  
 ===== Pog/ =====
Line 21: Line 31:
     * mira-assembly1/
   * velvet
-    * velvet-assembly1/ A first attempt to use velvet to assemble its own simulated TEST 100kb genome with short 35b reads and long 100b reads. Assembly worked very well in the end. Was reminded that when using short reads, must keep k small to get adequate coverage.\\
-      * 6th try, k=21 cov=19 -shortPaired -long\\
-      * Final graph has 9 nodes and n50 of 99975, max 99975, total 100143, using 137863/144858 reads\\
-      * RESULTS: excellent! (TEST DATA ONLY)\\
-    * velvet-assembly2/ Assembling Pog 454 long reads with velvet. After very poor results with default settings, eventually started to get good results by getting the expected coverage (60) and cutoff (13) correct. It took a long time to try different parameter settings. Also using the long reads as both short and long reads gave substantially better results. Because these were long reads, we could set k up to 31. Also tried with specially compiled version of velvet that could use k > 31, but cannot report any improvement yet. Given that the average read is 370b, it should have been able to support longer k-values.
-      * Best results so far:\\
-      * Final graph has 1755 nodes and n50 of 41723, max {contig size} 142286, total {genome size} 2468925.
-    * velvet-assembly3/ Assembling Pog Solid mate-paired 26b reads with velvet in double-encoded colorspace. Not had time to optimize parameters yet. Did get up to a max-contig size of 95k.
 +    * velvet-assembly1/ Assembling Pog 454 long reads with velvet. After very poor results with default settings, eventually started to get good results by getting the expected coverage (60) and cutoff (13) correct. It took a long time to try different parameter settings. Also using the long reads as both short and long reads gave substantially better results. Because these were long reads, we could set k up to 31. Also tried with specially compiled version of velvet that could use k > 31, but cannot report any improvement yet. Given that the average read is 370b, it should have been able to support longer k-values. Best results so far:\\
 +      * Final graph has 2176 nodes (195 contigs over 62b) and n50 of 224364, max {contig size} 680241, total {genome size} 2481051, using 778249/782604 reads
 +    * velvet-assembly2/ Assembling Pog Solid mate-paired 25b reads with velvet in double-encoded colorspace (24 DE base reads). Not had time to optimize parameters yet. Did get up to a max-contig size of 95k.
       * velveth_de out 21 -shortPaired /campusdata/BME235/data/Pog/solid_run/paired/output/doubleEncoded_input.de
       * velvetg_de out -exp_cov 50 -cov_cutoff 13 -ins_length 2200
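The remarks above about keeping k small for short reads, and about long reads supporting larger k, follow from the relation between base coverage and k-mer coverage given in the Velvet manual: Ck = C * (L - k + 1) / L, where C is base coverage, L is read length, and k is the hash length. The sketch below is only an illustration of that formula; the base coverage of 50 fed into it is an assumed value, not a measurement from these assemblies.

```python
def kmer_coverage(base_coverage, read_length, k):
    """k-mer coverage from base coverage: Ck = C * (L - k + 1) / L."""
    if k > read_length:
        raise ValueError("k cannot exceed the read length")
    return base_coverage * (read_length - k + 1) / read_length


# Assumed base coverage, for illustration only.
ASSUMED_BASE_COVERAGE = 50.0

# Read lengths taken from the notes: 35b short reads, 370b average 454 reads.
for read_length in (35, 370):
    for k in (21, 31):
        ck = kmer_coverage(ASSUMED_BASE_COVERAGE, read_length, k)
        print(f"L={read_length} k={k} k-mer coverage={ck:.1f}")
```

For 35b reads, raising k from 21 to 31 cuts the number of k-mers per read from 15 to 5, so k-mer coverage drops sharply; for 370b reads the same change has little effect.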
Line 37: Line 42:
  
 ===== slug/ =====
-  * newbler-assembly1/ first attempt at de novo assembly using Newbler, using all the reads from 454_run1 and 454_run2. This assembly of 499,873 reads including 138,351,643 bases produced only 2,910,773 bases assembled into 8,963 contigs. From this low assembly number, I estimate the coverage to be about 0.043x and the genome size to be about 3.2E9 basepairs. (See the README file for the calculation.) Much of the assembly is low-complexity regions (repetitions of short repeats (GA)*, (TA)*, (TTC)*, (AC)*, (TAG)*, (CGAA)*, (TATC)*, (CAA)*, ... ). The most common 14-mer that is not a repeat of a short k-mer is TAGTTTACAGCTTG (so that is what we should put on the T-shirt).
 +  * newbler-assembly1/ first attempt at de novo assembly using Newbler, using all the reads from 454_run1 and 454_run2.
 +    * This assembly of 499,873 reads including 138,351,643 bases produced only 2,910,773 bases assembled into 8,963 contigs.
 +    * The longest contig is 5783 bases.
 +    * From the total number of bases in the assembly, I estimate the coverage to be about 0.043x and the genome size to be about 3.2E9 basepairs. (See the README file for the calculation.)
 +    * Much of the assembly is low-complexity regions (repetitions of short repeats (AT)*, (AAG)*, (AG)*, (AC)*, (AGT)*, (AGAT)*, (ACAT)*, (AAC)*, (AACG)*, ... ).  ​ 
 +    * The most common 14-mer that is not a repeat of a short k-mer is TAGTTTACAGCTTG (so that is what we should put on the T-shirt). 
 +  * newbler-mapping1-lottia/​ tries to do a reference-based assembly with the //Lottia gigantea// genome as a reference. 
 +    * The reference has 4475 contigs with 359,512,207 bases. 
 +    * The output has 183 contigs with 29,389 bases. 
 +    * The longest contig is only 644 bases, far too small to be of much use.
 +  * newbler-mapping2-seahare/​ tries to do a reference-based assembly with the //Aplysia californica//​ genome as a reference. 
 +    * The sea hare reference has 8767 contigs, comprising 715,806,041 bases. 
 +    * The output has 2664 contigs, comprising 443,648 bases (still less than the de novo assembly). ​  
 +    * The longest contig is only 1876 bases. 
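The coverage and genome-size estimates quoted for newbler-assembly1 can be cross-checked with a short Python sketch. It does not reproduce the calculation in the README, which is not shown here; it only assumes the usual relation coverage = total read bases / genome size and confirms that the two reported figures agree with each other under that assumption.

```python
def coverage(total_read_bases, genome_size):
    """Average sequencing depth: total bases read divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return total_read_bases / genome_size


# Figures taken from the notes above.
TOTAL_READ_BASES = 138_351_643   # bases in the 499,873 reads
ESTIMATED_GENOME_SIZE = 3.2e9    # estimated genome size in basepairs

depth = coverage(TOTAL_READ_BASES, ESTIMATED_GENOME_SIZE)
print(f"coverage = {depth:.3f}x")
```

The result rounds to 0.043x, consistent with the estimate stated in the notes.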
archive/computer_resources/assemblies.txt · Last modified: 2015/09/02 16:53 by 92.247.181.31