archive:computer_resources:assemblies

Differences

This shows you the differences between two versions of the page.

archive:computer_resources:assemblies [2011/05/29 23:48]
karplus [slug/] added barcode-of-life description
archive:computer_resources:assemblies [2015/09/02 16:53]
92.247.181.31 ↷ Links adapted because of a move operation
Line 19: Line 19:
     * newbler-assembly2/ is a second de novo assembly using Newbler, starting from the cleaned reads of newbler-clean1/sff_cleaned/no_Hyp.sff.  It gets 42 contigs and 2,449,409 bases.
     * newbler-assembly3/ starts from the same sff file as newbler-assembly2/ but raises the expected coverage to 60 (close to actual coverage).  It gets 41 contigs and 2,449,426 bases, still more than the old version of Newbler got after similar cleaning.  The contigs have been mapped to the finished genome (using megablast, blastn, blat, and pluck-scripts/find-dna-differences). All the contigs map cleanly to the finished genome. If a contig maps to more than one place, find-dna-differences may (incorrectly) report it as not mapping.
-    * map-colorspace3/ uses the [[bioinformatic_tools:pluck-scripts|pluck-scripts]] script map-colorspace to map the SOLiD mate-pair reads onto the contigs of the newbler-assembly3/ run.  The intent is to find what contigs join to what other ones.  The numbering starts with 3, not 1, so that the map-colorspace directories correspond to the newbler-assembly directories that they are mapping onto.
+    * map-colorspace3/ uses the [[archive:bioinformatic_tools:pluck-scripts|pluck-scripts]] script map-colorspace to map the SOLiD mate-pair reads onto the contigs of the newbler-assembly3/ run.  The intent is to find what contigs join to what other ones.  The numbering starts with 3, not 1, so that the map-colorspace directories correspond to the newbler-assembly directories that they are mapping onto.
     * newbler-partial3/ assembled the partially-assembled reads of newbler-assembly3/ to see if any extended or connected contigs.  Seven of the 131 new contigs could be used to extend newbler-assembly3/ contigs, but none spanned 2 contigs.
     * newbler-assembly4/ starts from the same sff file as newbler-assembly2/ and newbler-assembly3/ but adds the contigs of newbler-partial3/ as extra reads.  This did not help, getting 45 contigs and 2,449,287 bases.
Line 29: Line 29:
     * euler-sr-assembly1/
   * mira
-    * [[computer_resources:assemblies:mira:pog:mira-assembly1|]]
-    * [[computer_resources:assemblies:mira:pog:mira-assembly2|]]
+    * [[archive:computer_resources:assemblies:mira:pog:mira-assembly1]]
+    * [[archive:computer_resources:assemblies:mira:pog:mira-assembly2]]
   * velvet
     * velvet-assembly1/ Assembling Pog 454 long reads with velvet.  After very poor results with default settings, eventually started to get good results by getting the expected coverage (60) and cutoff (13) correct.  It took a long time to try different parameter settings. Also, using the long reads as both short and long reads gave substantially better results.  Because these were long reads, we could set k up to 31.  Also tried with a specially compiled version of velvet that could use k > 31, but cannot report any improvement yet.  Given that the average read is 370 bases, it should have been able to support longer k-values.  Best results so far:\\
Line 100: Line 100:
     * so ran with filling -R to get 12k maxcontig.
     * Then ran the scaffolding steps with 200bp insert size.
-    * For all steps, used low default cutoffs since our 10x coverage
-    * is not high.  21k max scaffold size.  Estimated
-    * genome size is around 3G.  The 4 steps are
+    * For all steps, used low default cutoffs since our 10x coverage is not high.  21k max scaffold size.
+    * Estimated genome size is around 3G.
+    * The 4 steps are
       - pregraph (3.5 to 4.5 hours for 30 to 60 cpus)
       - contig (1.3 hours)
       - map (0.6 hours with 60 cpus) - paired ends
       - scaff (1 hour with 60 cpus)
-  * barcode-of-life/ attempt to assemble the mitochondrial genome, with particular emphasis on the gene for mitochondrial cytochrome c oxidase subunit I protein I (CO1), which is used for the "barcode of life". [[http://www.boldsystems.org/|BOLD (barcode of life database)]]
-      * Started with a search of SOAPdenovo-assembly1/k31/soapSlug.scafSeq for scaffolds that matched examples from other mollusks.
-      * Looked for 454 reads that extended or joined contigs in scaffold
-      * Repeated (sometimes using more sensitive searches) until no more credible scaffolds from the SOAPdenovo-assembly1/k31/ assembly nor 454 reads were found.
-      * Next step (not done yet, as of 29 May 2011) is to find all Illumina reads that map to the mitochondrial draft and assemble them.
+  * barcode-of-life/ attempt to assemble the mitochondrial genome, documented on its own page: [[computer_resources:assemblies:mitochondrion]]
+  * SOAPdenovo-assembly2/ Assembly with new + old Illumina and 454 data.
+    * SOAPdenovo 1.05 - can handle gzipped fastq files.
+    * Runs with k27, 31, 47, and 63 so far.  47 was the best overall.  63 got the longest contig (~14.9kb).
+    * Run parameters:
+      - pregraph:
+        * lowest count size of 2 (-d 2)
+      - contig:
+        * solve tiny repeats on (-R)
+      - map:
+        * all default
+      - scaff:
+        * intra-scaffold gap closure on (-F)
+    * Statistics for each kmer size assembly (using Illumina and 454 data, using both for contig and scaffolding)
+      * k31:
+         * 1,298,372 scaffolds from 4,814,226 contigs sum up 632,702,276bp, with average length 487, 0 gaps filled
+         * 3,611,844 scaffolds&singleton sum up 1,133,413,022bp, with average length 313
+         * the longest is 10,340bp, scaffold N50 is 442 bp, scaffold N90 is 148 bp
+      * k47:
+         * 871,819 scaffolds from 5,306,463 contigs sum up 530,762,874bp, with average length 608, 0 gaps filled
+         * 4,203,195 scaffolds&singleton sum up 1,296,678,043bp, with average length 308
+         * the longest is 14,750bp, scaffold N50 is 458 bp, scaffold N90 is 140 bp
+      * k63:
+         * 270,887 scaffolds from 4,022,505 contigs sum up 139,720,415bp, with average length 515, 0 gaps filled
+         * 3,710,532 scaffolds&singleton sum up 690,332,560bp, with average length 186
+         * the longest is 14,897bp, scaffold N50 is 232 bp, scaffold N90 is 112 bp
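For readers comparing the N50 and N90 figures above, here is a minimal Python sketch of the usual definition: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches 50% (or 90%) of all bases. The function name `nx` and the toy lengths are illustrative assumptions; SOAPdenovo's own statistics may differ in detail, for example in whether singleton contigs are counted.

```python
def nx(lengths, percent):
    """Return the Nx length: the length L such that sequences of length >= L
    together contain at least `percent` percent of all bases."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length
    return ordered[-1]  # only reached if percent > 100


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up data, 54 bases total
    print("N50:", nx(toy_lengths, 50))
    print("N90:", nx(toy_lengths, 90))
```

An N90 far below the N50, as in the statistics above, indicates that a large share of the assembled bases sits in very short scaffolds.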
archive/computer_resources/assemblies.txt · Last modified: 2015/09/02 16:53 by 92.247.181.31