archive:computer_resources:assemblies

Differences

This shows you the differences between two versions of the page.

archive:computer_resources:assemblies [2010/06/09 19:13] svasili
archive:computer_resources:assemblies [2015/09/02 16:53] 92.247.181.31 ↷ Links adapted because of a move operation
Line 19: Line 19:
     * newbler-assembly2/ is a second de novo assembly using Newbler, starting from the cleaned reads of newbler-clean1/sff_cleaned/no_Hyp.sff.  It gets 42 contigs and 2,449,409 bases.
     * newbler-assembly3/ starts from the same sff file as newbler-assembly2/ but raises the expected coverage to 60 (close to actual coverage).  It gets 41 contigs and 2,449,426 bases, still more than the old version of Newbler got after similar cleaning.  The contigs have been mapped to the finished genome (using megablast, blastn, blat, and pluck-scripts/find-dna-differences). All the contigs map cleanly to the finished genome. If a contig maps to more than one place, find-dna-differences may (incorrectly) report it as not mapping.
-    * map-colorspace3/ uses the [[bioinformatic_tools:pluck-scripts|pluck-scripts]] script map-colorspace to map the SOLiD mate-pair reads onto the contigs of the newbler-assembly3/ run.  The intent is to find which contigs join to which others.  The numbering starts with 3, not 1, so that the map-colorspace directories correspond to the newbler-assembly directories that they are mapping onto.
+    * map-colorspace3/ uses the [[archive:bioinformatic_tools:pluck-scripts|pluck-scripts]] script map-colorspace to map the SOLiD mate-pair reads onto the contigs of the newbler-assembly3/ run.  The intent is to find which contigs join to which others.  The numbering starts with 3, not 1, so that the map-colorspace directories correspond to the newbler-assembly directories that they are mapping onto.
     * newbler-partial3/ assembled the partially-assembled reads of newbler-assembly3/ to see if any extended or connected contigs.  Seven of the 131 new contigs could be used to extend newbler-assembly3/ contigs, but none spanned two contigs.
     * newbler-assembly4/ starts from the same sff file as newbler-assembly2/ and newbler-assembly3/ but adds the contigs of newbler-partial3/ as extra reads.  This did not help, getting 45 contigs and 2,449,287 bases.
Line 29: Line 29:
     * euler-sr-assembly1/
   * mira
-    * mira-assembly1/
+    * [[archive:computer_resources:assemblies:mira:pog:mira-assembly1]]
+    * [[archive:computer_resources:assemblies:mira:pog:mira-assembly2]]
   * velvet
     * velvet-assembly1/ Assembling Pog 454 long reads with velvet.  After very poor results with default settings, eventually started to get good results by getting the expected coverage (60) and cutoff (13) correct.  It took a long time to try different parameter settings. Also, using the long reads as both short and long reads gave substantially better results.  Because these were long reads, we could set k up to 31.  We also tried a specially compiled version of velvet that could use k > 31, but cannot report any improvement yet.  Given that the average read is 370 bases, it should have been able to support longer k-values.  Best results so far:\\
Line 99: Line 100:
     * so ran with filling -R to get 12k maxcontig.
     * Then ran the scaffolding steps with 200bp insert size.
-    * For all steps, used low default cutoffs since our 10x coverage
-    * is not high.  21k max scaffold size.  Estimated
-    * genome size is around 3G.  The 4 steps are
-    * 1. pregraph (3.5 to 4.5 hours for 30 to 60 cpus)
-    * 2. contig (1.3 hours)
-    * 3. map (0.6 hours with 60cpus) - paired ends
-    * 4. scaff (1 hour with 60cpus)
+    * For all steps, used low default cutoffs since our 10x coverage is not high.  21k max scaffold size.
+    * Estimated genome size is around 3G.
+    * The 4 steps are
+      - pregraph (3.5 to 4.5 hours for 30 to 60 cpus)
+      - contig (1.3 hours)
+      - map (0.6 hours with 60cpus) - paired ends
+      - scaff (1 hour with 60cpus)
+  * barcode-of-life/ attempt to assemble the mitochondrial genome, documented on its own page: [[computer_resources:assemblies:mitochondrion]]
+  * SOAPdenovo-assembly2/ Assembly with new + old Illumina and 454 data.
+    * SOAPdenovo 1.05 - can handle gzipped fastq files.
+    * Runs with k27, 31, 47, and 63 so far.  47 was the best overall.  63 got the longest contig (~14.9kb).
+    * Run parameters:
+      - pregraph:
+        * lowest count size of 2 (-d 2)
+      - contig:
+        * solve tiny repeats on (-R)
+      - map:
+        * all default
+      - scaff:
+        * intra-scaffold gap closure on (-F)
+    * Statistics for each kmer size assembly (using Illumina and 454 data, using both for contig and scaffolding):
+      * k31:
+         * 1,298,372 scaffolds from 4,814,226 contigs sum up 632,702,276bp, with average length 487, 0 gaps filled
+         * 3,611,844 scaffolds&singleton sum up 1,133,413,022bp, with average length 313
+         * the longest is 10,340bp, scaffold N50 is 442 bp, scaffold N90 is 148 bp
+      * k47:
+         * 871,819 scaffolds from 5,306,463 contigs sum up 530,762,874bp, with average length 608, 0 gaps filled
+         * 4,203,195 scaffolds&singleton sum up 1,296,678,043bp, with average length 308
+         * the longest is 14,750bp, scaffold N50 is 458 bp, scaffold N90 is 140 bp
+      * k63:
+         * 270,887 scaffolds from 4,022,505 contigs sum up 139,720,415bp, with average length 515, 0 gaps filled
+         * 3,710,532 scaffolds&singleton sum up 690,332,560bp, with average length 186
+         * the longest is 14,897bp, scaffold N50 is 232 bp, scaffold N90 is 112 bp
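
The scaffold N50 and N90 figures quoted for each k are summary statistics over the scaffold lengths. The following is a minimal sketch of the usual definition (sort lengths from longest to shortest and report the length at which the running total first reaches 50% or 90% of all bases). The function name `nx` and the toy lengths are illustrative assumptions, not taken from the page, and SOAPdenovo's own bookkeeping (for example, which sequences it counts) may differ in detail.

```python
def nx(lengths, percent):
    """Return the Nx statistic for a list of sequence lengths.

    Nx is the length L such that sequences of length >= L together
    hold at least `percent` percent of all bases (N50 -> percent=50).
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return ordered[-1]


# Toy scaffold lengths (made up for illustration; total is 30 bases).
toy_lengths = [10, 8, 6, 4, 2]
toy_n50 = nx(toy_lengths, 50)
toy_n90 = nx(toy_lengths, 90)
print("N50:", toy_n50, "N90:", toy_n90)
```

Because N90 needs more of the total to be covered, it is never larger than N50 for the same set of lengths, which matches the pattern in the figures listed for k31, k47, and k63.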
  
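
The run parameters listed for SOAPdenovo-assembly2/ could be collected into one command per stage. The sketch below only builds and prints the command lines; it does not run anything. Only `-d 2`, `-R`, and `-F` come from the page. The other flags (`-s`, `-K`, `-o`, `-g`, `-p`) are recalled from the SOAPdenovo usage text and should be checked against the installed version, and the binary name, config file name, output prefix, and thread count are hypothetical placeholders.

```python
import shlex


def soapdenovo_steps(binary, config, prefix, kmer, threads):
    """Build argument lists for the four SOAPdenovo stages.

    Flags other than -d 2, -R and -F are assumptions to verify
    against the installed SOAPdenovo version.
    """
    k, p = str(kmer), str(threads)
    return [
        # pregraph: lowest count size of 2 (-d 2)
        [binary, "pregraph", "-s", config, "-K", k, "-d", "2", "-o", prefix, "-p", p],
        # contig: solve tiny repeats on (-R)
        [binary, "contig", "-g", prefix, "-R"],
        # map: all default
        [binary, "map", "-s", config, "-g", prefix, "-p", p],
        # scaff: intra-scaffold gap closure on (-F)
        [binary, "scaff", "-g", prefix, "-F", "-p", p],
    ]


# Hypothetical names: adjust the binary, config file and prefix to the real run.
steps = soapdenovo_steps("SOAPdenovo", "assembly.config", "asm_k47", 47, 60)
for argv in steps:
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping each stage as an argument list makes it straightforward to time the stages separately, as was done for the earlier run.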
archive/computer_resources/assemblies.txt · Last modified: 2015/09/02 16:53 by 92.247.181.31