archive:computer_resources:assemblies

Differences

This shows you the differences between two versions of the page.

archive:computer_resources:assemblies [2011/06/02 19:26]
eyliaw
archive:computer_resources:assemblies [2015/09/02 16:53]
92.247.181.31 ↷ Links adapted because of a move operation
Line 19: Line 19:
     * newbler-assembly2/ is a second de novo assembly using Newbler, starting from the cleaned reads of newbler-clean1/sff_cleaned/no_Hyp.sff. It gets 42 contigs and 2,449,409 bases.
     * newbler-assembly3/ starts from the same sff file as newbler-assembly2/ but raises the expected coverage to 60 (close to actual coverage).  It gets 41 contigs and 2,449,426 bases, still more than the old version of Newbler got after similar cleaning.  The contigs have been mapped to the finished genome (using megablast, blastn, blat, and pluck-scripts/find-dna-differences). All the contigs map cleanly to the finished genome. If contigs map to more than one place, find-dna-differences may (incorrectly) report them as not mapping.
-    * map-colorspace3/ uses the [[bioinformatic_tools:pluck-scripts|pluck-scripts]] script map-colorspace to map the SOLiD mate-pair reads onto the contigs of the newbler-assembly3/ run.  The intent is to find what contigs join to what other ones.  The numbering starts with 3, not 1, so that the map-colorspace directories correspond to the newbler-assembly directories that they are mapping onto.
+    * map-colorspace3/ uses the [[archive:bioinformatic_tools:pluck-scripts|pluck-scripts]] script map-colorspace to map the SOLiD mate-pair reads onto the contigs of the newbler-assembly3/ run.  The intent is to find what contigs join to what other ones.  The numbering starts with 3, not 1, so that the map-colorspace directories correspond to the newbler-assembly directories that they are mapping onto.
     * newbler-partial3/ assembled the partially-assembled reads of newbler-assembly3/ to see if any extended or connected contigs.  Seven of the 131 new contigs could be used to extend newbler-assembly3/ contigs, but none spanned 2 contigs.
     * newbler-assembly4/ starts from the same sff file as newbler-assembly2/ and newbler-assembly3/ but adds the contigs of newbler-partial3/ as extra reads.  This did not help, getting 45 contigs and 2,449,287 bases.
Line 29: Line 29:
     * euler-sr-assembly1/
   * mira
-    * [[computer_resources:assemblies:mira:pog:mira-assembly1|]]
-    * [[computer_resources:assemblies:mira:pog:mira-assembly2|]]
+    * [[archive:computer_resources:assemblies:mira:pog:mira-assembly1]]
+    * [[archive:computer_resources:assemblies:mira:pog:mira-assembly2]]
   * velvet
     * velvet-assembly1/ Assembling Pog 454 long reads with velvet.  After very poor results with default settings, eventually started to get good results by getting the expected coverage (60) and cutoff (13) correct.  It took a long time to try different parameter settings. Also, using the long reads as both short and long reads gave substantially better results.  Because these were long reads, we could set k up to 31.  Also tried with a specially compiled version of velvet that could use k > 31, but cannot report any improvement yet.  Given that the average read is 370 bases, it should have been able to support longer k-values.  Best results so far:\\
Line 100: Line 100:
     * so ran with filling -R to get 12k maxcontig.
     * Then ran the scaffolding steps with 200bp insert size.
-    * For all steps, used low default cutoffs since our 10x coverage
-    * is not high.  21k max scaffold size.  Estimated
-    * genome size is around 3G.  The 4 steps are
+    * For all steps, used low default cutoffs since our 10x coverage is not high.  21k max scaffold size.
+    * Estimated genome size is around 3G.
+    * The 4 steps are
       - pregraph (3.5 to 4.5 hours for 30 to 60 cpus)
       - contig (1.3 hours)
       - map (0.6 hours with 60cpus) - paired ends
       - scaff (1 hour with 60cpus)
-  * barcode-of-life/ attempt to assemble the mitochondrial genome, with particular emphasis on the gene for mitochondrial cytochrome c oxidase subunit I protein I (CO1), which is used for the "barcode of life". [[http://www.boldsystems.org/|BOLD (barcode of life database)]]
-      * Started with a search of SOAPdenovo-assembly1/k31/soapSlug.scafSeq for scaffolds that matched examples from other mollusks.
-      * Looked for 454 reads that extended or joined contigs in scaffold
-      * Repeated (sometimes using more sensitive searches) until no more credible scaffolds from the SOAPdenovo-assembly1/k31/ assembly nor 454 reads were found.
-      * The 454 coverage of the mitochondrion is so slight as to be nearly useless, so instead we can iterate:
-        - find all Illumina reads that map to the mitochondrial draft, using BWA
-        - assemble them using SOAPdenovo.
-      * It looks like the Illumina reads have about 228x coverage of the mitochondrion, but coverage is patchy, and it seems to be difficult to close the circle (at least with SOAPdenovo).
-      * I have an almost complete mitochondrial genome, and I'm hoping that a few more iterations or some tricky assembly will close it into a clean circular genome.
+  * barcode-of-life/ attempt to assemble the mitochondrial genome, documented on its own page: [[computer_resources:assemblies:mitochondrion]]
   * SOAPdenovo-assembly2/ Assembly with new + old Illumina and 454 data.
     * SOAPdenovo 1.05 - can handle gzipped fastq files.
     * Runs with k27, 31, 47, and 63 so far.  47 was the best overall.  63 got the longest contig (~14.9kb).
     * Run parameters:
-      pregraph:
-        lowest count size of 2 (-d 2)
-      contig:
-        solve tiny repeats on (-R)
-      map:
-        all default
-      scaff:
-        intra-scaffold gap closure on (-F)
+      pregraph:
+        lowest count size of 2 (-d 2)
+      contig:
+        solve tiny repeats on (-R)
+      map:
+        all default
+      scaff:
+        intra-scaffold gap closure on (-F)
+    * Statistics for each kmer size assembly (using illumina and 454 data, using both for contig and scaffolding):
+      * k31:
+         * 1,298,372 scaffolds from 4,814,226 contigs sum up 632,702,276bp, with average length 487, 0 gaps filled
+         * 3,611,844 scaffolds&singleton sum up 1,133,413,022bp, with average length 313
+         * the longest is 10,340bp, scaffold N50 is 442 bp, scaffold N90 is 148 bp
+      * k47:
+         * 871,819 scaffolds from 5,306,463 contigs sum up 530,762,874bp, with average length 608, 0 gaps filled
+         * 4,203,195 scaffolds&singleton sum up 1,296,678,043bp, with average length 308
+         * the longest is 14,750bp, scaffold N50 is 458 bp, scaffold N90 is 140 bp
+      * k63:
+         * 270,887 scaffolds from 4,022,505 contigs sum up 139,720,415bp, with average length 515, 0 gaps filled
+         * 3,710,532 scaffolds&singleton sum up 690,332,560bp, with average length 186
+         * the longest is 14,897bp, scaffold N50 is 232 bp, scaffold N90 is 112 bp
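
For readers comparing the scaffold N50 and N90 figures across k values, here is a minimal sketch of how such length-weighted statistics are commonly computed. It is not taken from SOAPdenovo: the function name `nx` and the toy lengths are illustrative assumptions, and the page does not say whether SOAPdenovo computes its figures over scaffolds alone or over scaffolds plus singletons.

```python
def nx(lengths, percent):
    """Return the Nx length (N50 for percent=50, N90 for percent=90).

    Nx is the length L such that sequences of length >= L together
    contain at least `percent` percent of all bases.
    `nx` is a hypothetical helper name, not part of any assembler.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")

    ordered = sorted(lengths, reverse=True)  # longest first
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return ordered[-1]  # not reached for valid input; kept as a safeguard


# Toy scaffold lengths (made up for illustration, not from any assembly here).
toy_lengths = [10, 8, 6, 4, 2]
print("N50:", nx(toy_lengths, 50))
print("N90:", nx(toy_lengths, 90))
```

Because N50 weights by bases rather than by sequence count, a run with many short singletons can have a low N50 even when its longest scaffold is the longest of all runs, which is the pattern the k63 numbers show.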
archive/computer_resources/assemblies.txt · Last modified: 2015/09/02 16:53 by 92.247.181.31