archive:computer_resources:assemblies

Differences

This shows you the differences between two versions of the page.

archive:computer_resources:assemblies [2011/06/03 21:37]
eyliaw [slug/]
archive:computer_resources:assemblies [2015/09/02 16:53] (current)
92.247.181.31 ↷ Links adapted because of a move operation
Line 19: Line 19:
     * newbler-assembly2/ is a second de novo assembly using Newbler, starting from the cleaned reads of newbler-clean1/sff_cleaned/no_Hyp.sff.  It gets 42 contigs and 2,449,409 bases.
     * newbler-assembly3/ starts from the same sff file as newbler-assembly2/ but raises the expected coverage to 60 (close to actual coverage).  It gets 41 contigs and 2,449,426 bases, still more than the old version of Newbler got after similar cleaning.  The contigs have been mapped to the finished genome (using megablast, blastn, blat, and pluck-scripts/find-dna-differences). All the contigs map cleanly to the finished genome. If a contig maps to more than one place, find-dna-differences may (incorrectly) report it as not mapping.
-    * map-colorspace3/ uses the [[bioinformatic_tools:pluck-scripts|pluck-scripts]] script map-colorspace to map the SOLiD mate-pair reads onto the contigs of the newbler-assembly3/ run.  The intent is to find what contigs join to what other ones.  The numbering starts with 3, not 1, so that the map-colorspace directories correspond to the newbler-assembly directories that they are mapping onto.
+    * map-colorspace3/ uses the [[archive:bioinformatic_tools:pluck-scripts|pluck-scripts]] script map-colorspace to map the SOLiD mate-pair reads onto the contigs of the newbler-assembly3/ run.  The intent is to find what contigs join to what other ones.  The numbering starts with 3, not 1, so that the map-colorspace directories correspond to the newbler-assembly directories that they are mapping onto.
     * newbler-partial3/ assembled the partially-assembled reads of newbler-assembly3/ to see if any extended or connected contigs.  Seven of the 131 new contigs could be used to extend newbler-assembly3/ contigs, but none spanned 2 contigs.
     * newbler-assembly4/ starts from the same sff file as newbler-assembly2/ and newbler-assembly3/ but adds the contigs of newbler-partial3/ as extra reads.  This did not help, getting 45 contigs and 2,449,287 bases.
Line 29: Line 29:
     * euler-sr-assembly1/
   * mira
-    * [[computer_resources:assemblies:mira:pog:mira-assembly1|]]
+    * [[archive:computer_resources:assemblies:mira:pog:mira-assembly1]]
-    * [[computer_resources:assemblies:mira:pog:mira-assembly2|]]
+    * [[archive:computer_resources:assemblies:mira:pog:mira-assembly2]]
   * velvet
     * velvet-assembly1/ Assembling Pog 454 long reads with velvet.  After very poor results with default settings, eventually started to get good results by getting the expected coverage (60) and cutoff (13) correct.  It took a long time to try different parameter settings. Also, using the long reads as both short and long reads gave substantially better results.  Because these were long reads, we could set k up to 31.  Also tried with a specially compiled version of velvet that could use k > 31, but cannot report any improvement yet.  Given that the average read is 370b, it should have been able to support longer k-values.  Best results so far:\\
Line 100: Line 100:
     * so ran with filling -R to get 12k maxcontig.
     * Then ran the scaffolding steps with 200bp insert size.
-    * For all steps, used low default cutoffs since our 10x coverage
-    * is not high.  21k max scaffold size.  Estimated
-    * genome size is around 3G.  The 4 steps are
+    * For all steps, used low default cutoffs since our 10x coverage is not high.  21k max scaffold size.
+    * Estimated genome size is around 3G.
+    * The 4 steps are
       - pregraph (3.5 to 4.5 hours for 30 to 60 cpus)
       - contig (1.3 hours)
       - map (0.6 hours with 60cpus) - paired ends
       - scaff (1 hour with 60cpus)
-  * barcode-of-life/ attempt to assemble the mitochondrial genome, with particular emphasis on the gene for mitochondrial cytochrome c oxidase subunit I protein I (CO1), which is used for the "barcode of life". [[http://www.boldsystems.org/|BOLD (barcode of life database)]]
-      * Started with a search of SOAPdenovo-assembly1/k31/soapSlug.scafSeq for scaffolds that matched examples from other mollusks.
-      * Looked for 454 reads that extended or joined contigs in scaffold
-      * Repeated (sometimes using more sensitive searches) until no more credible scaffolds from the SOAPdenovo-assembly1/k31/ assembly nor 454 reads were found.
-      * The 454 coverage of the mitochondrion is so slight as to be nearly useless, so instead we can iterate:
-        - find all Illumina reads that map to the mitochondrial draft, using BWA
-        - assemble them using SOAPdenovo.
-      * It looks like the Illumina reads have about 228x coverage of the mitochondrion, but coverage is patchy, and it seems to be difficult to close the circle (at least with SOAPdenovo).
-      * We have an almost complete mitochondrial genome, and I'm hoping that a few more iterations or some tricky assembly will close it into a clean circular genome.
-      * It turns out that a lot of the hard hand work and iterated searching to assemble the mitochondrion was not necessary, as the SOAPdenovo-assembly2/k63_w_454_contigs/ assembly now has a 14960-long contig (not scaffold!) which is an almost-full-length mitochondrion, roughly as good as the best I've managed to assemble so far.  I'll combine it with my efforts and see if I can eke out a few more bases.
+  * barcode-of-life/ attempt to assemble the mitochondrial genome, documented on its own page: [[computer_resources:assemblies:mitochondrion]]
   * SOAPdenovo-assembly2/ Assembly with new + old Illumina and 454 data.
     * SOAPdenovo 1.05 - can handle gzipped fastq files.
Line 131: Line 122:
     * Statistics for each kmer size assembly (using illumina and 454 data, using both for contig and scaffolding):
       * k31:
-          1298372 scaffolds from 4814226 contigs sum up 632702276bp, with average length 487, 0 gaps filled
-          3611844 scaffolds&singleton sum up 1133413022bp, with average length 313
-          the longest is 10340bp,scaffold N50 is 442 bp, scaffold N90 is 148 bp
+         * 1,298,372 scaffolds from 4,814,226 contigs sum up 632,702,276bp, with average length 487, 0 gaps filled
+         * 3,611,844 scaffolds&singleton sum up 1,133,413,022bp, with average length 313
+         * the longest is 10,340bp,scaffold N50 is 442 bp, scaffold N90 is 148 bp
       * k47:
-          871819 scaffolds from 5306463 contigs sum up 530762874bp, with average length 608, 0 gaps filled
-          4203195 scaffolds&singleton sum up 1296678043bp, with average length 308
-          the longest is 14750bp,scaffold N50 is 458 bp, scaffold N90 is 140 bp
+         * 871,819 scaffolds from 5,306,463 contigs sum up 530,762,874bp, with average length 608, 0 gaps filled
+         * 4,203,195 scaffolds&singleton sum up 1,296,678,043bp, with average length 308
+         * the longest is 14,750bp,scaffold N50 is 458 bp, scaffold N90 is 140 bp
       * k63:
-          270887 scaffolds from 4022505 contigs sum up 139720415bp, with average length 515, 0 gaps filled
-          3710532 scaffolds&singleton sum up 690332560bp, with average length 186
-          the longest is 14897bp,scaffold N50 is 232 bp, scaffold N90 is 112 bp
+         * 270,887 scaffolds from 4,022,505 contigs sum up 139,720,415bp, with average length 515, 0 gaps filled
+         * 3,710,532 scaffolds&singleton sum up 690,332,560bp, with average length 186
+         * the longest is 14,897bp,scaffold N50 is 232 bp, scaffold N90 is 112 bp
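
The scaffold N50 and N90 figures in the statistics above are taken from the assembler's own summary lines. As a minimal sketch of how such a figure can be recomputed from a list of scaffold lengths, assuming the common definition (the length L such that scaffolds of length L or more hold at least the given percentage of all bases): the function name nx and the toy lengths are illustrative assumptions, not part of SOAPdenovo or pluck-scripts, and the assembler may handle edge cases differently.

```python
def nx(lengths, percent):
    """Return the Nx length for a list of sequence lengths.

    Nx is taken here as the length L such that sequences of length >= L
    together contain at least `percent` percent of the total bases
    (percent=50 gives N50, percent=90 gives N90).
    """
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


# Toy data (hypothetical lengths, not taken from any assembly on this page).
toy_lengths = [10, 8, 6, 4, 2]
n50 = nx(toy_lengths, 50)
n90 = nx(toy_lengths, 90)
print("N50 =", n50, "N90 =", n90)
```

For a real run, the lengths would come from the scaffold FASTA file of the assembly in question. A low N50 relative to the longest scaffold, as in the k31, k47, and k63 summaries, indicates that most of the assembled bases sit in short pieces.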
  
archive/computer_resources/assemblies.1307137039.txt.gz · Last modified: 2011/06/03 21:37 by eyliaw