archive:computer_resources:assemblies

Differences

This shows you the differences between two versions of the page.

archive:computer_resources:assemblies [2010/04/28 20:11]
galt
archive:computer_resources:assemblies [2010/05/19 20:49]
galt
Line 23: Line 23:
     * newbler-assembly4/ starts from the same sff file as newbler-assembly2/ and newbler-assembly3/ but adds the contigs of newbler-partial3/ as extra reads.  This did not help, getting 45 contigs and 2,449,287 bases.
     * newbler-assembly5/ starts from the same sff file as newbler-assembly2,3,4 but adds 45 Sanger reads totalling 44,187 bases from PCR reactions (mainly designed to test contig-join hypotheses). It gets 31 contigs and 2,451,007 bases.
-    * map-colorspace5/ maps the SOLiD mate-pair data onto the contigs of newbler-assembly5/  Other than some problems placing contig4 and the ece insertions, we can reconstruct some pretty large chunks of the genome from the mate-pair ends.
+    * map-colorspace5/ maps the SOLiD mate-pair data onto the contigs of newbler-assembly5/  Other than some problems placing contig4 and the ece insertions, we can reconstruct some pretty large chunks of the genome from the mate-pair ends. This directory contains the trim9.joins file, which is needed for doing the **homework** to attempt to reconstruct the genome.
   * euler
     * euler-assembly1/
Line 38: Line 38:
       * Final graph has 3602 nodes and n50 of 4851, max 94854, total 1767903, using 28785664/61262410 reads
   * SOAPdenovo
-    * SOAPdenovo-assembly1/ Assembling Pog 454 long reads with SOAPdenovo.  After being simply unable to get any version of the program to read a FASTA file despite documentation examples, I finally found a utility sff2fastq that made it possible to run SOAPdenovo on Pog 454 fastq.  I have not had time to optimize parameters yet.  The largest contig made with default params was just 4k.
+    * SOAPdenovo-assembly1/ Assembling Pog 454 long reads with SOAPdenovo.  After being simply unable to get any version of the program to read a FASTA file despite documentation examples, I finally found a utility sff2fastq that made it possible to run SOAPdenovo on Pog 454 fastq.  I have not had time to optimize parameters yet.  The largest contig made with default params was just 4k.  Later raised cutoff to 12 and got maxcontig of 70k.  Could not run the scaffold step because it crashed, probably because it was written for short 52bp solexa reads and the long 454 reads are messing it up.
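The assembly summaries above report an n50 alongside the maximum and total lengths. As a rough illustration of what that statistic measures, here is a minimal sketch that computes an N50 from a list of contig lengths. The function name `n50` and the sample lengths are made up for illustration, and individual assemblers may define the value slightly differently (for example, counting in k-mers rather than bases), so this is not necessarily the exact calculation behind the figures quoted above.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is taken here as the length L of the shortest contig in the
    smallest set of longest contigs that together cover at least half
    of the total assembled length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


if __name__ == "__main__":
    # Hypothetical contig lengths, not taken from any assembly on this page.
    example = [2, 3, 4, 5, 6]
    print(n50(example))
```

A single long contig can raise the maximum without moving the N50 much, which is why the summaries list both numbers.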
  
  
Line 56: Line 56:
     * The output has 2664 contigs, comprising 443,648 bases (still less than the de novo assembly).
     * The longest contig is only 1876 bases.
 +  * SOAPdenovo-assembly1/ First run of SOAPdenovo on Illumina paired ends.
 +    * SOAPdenovo requires fastq input files.
 +    * It was used to assemble the Panda genome by BGI.
 +    * Used kolossus, which has 1TB and 64 cpus.
 +    * Ran with k=31 and k=23.  k=31 was better (9k maxcontig), so ran with filling -R to get 12k maxcontig.
 +    * Then ran the scaffolding steps with 200bp insert size.
 +    * For all steps, used low default cutoffs since our 10x coverage is not high.  21k max scaffold size.
 +    * Estimated genome size is around 3G.
 +    * The 4 steps are:
 +      * 1. pregraph (3.5 to 4.5 hours for 30 to 60 cpus)
 +      * 2. contig (1.3 hours)
 +      * 3. map (0.6 hours with 60 cpus) - paired ends
 +      * 4. scaff (1 hour with 60 cpus)
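The four steps listed above can be sketched as a small script that builds the command lines without running them. This is only a sketch under assumptions: the binary name `SOAPdenovo`, the file names `pog.config`, `pog_k31`, `reads_1.fq` and `reads_2.fq`, and the config fields are hypothetical or recalled from SOAPdenovo 1.x usage, so check them against the help output of the installed version. Only k=31, the `-R` filling option, the 200bp insert size and the 60 cpus come from the notes above.

```python
import shlex


def soapdenovo_config(q1, q2, avg_ins=200):
    """Return text for a one-library config file (field names are assumed)."""
    return "\n".join([
        "[LIB]",
        f"avg_ins={avg_ins}",   # insert size reported in the notes
        "reverse_seq=0",        # assumed: forward-reverse paired ends
        "asm_flags=3",          # assumed: use reads for contigs and scaffolds
        f"q1={q1}",             # fastq input, mate 1
        f"q2={q2}",             # fastq input, mate 2
        "",
    ])


def soapdenovo_commands(config, prefix, kmer=31, cpus=60, binary="SOAPdenovo"):
    """Build the four step commands as argument lists, in run order."""
    return [
        [binary, "pregraph", "-s", config, "-K", str(kmer),
         "-p", str(cpus), "-o", prefix],
        [binary, "contig", "-g", prefix, "-R"],          # -R: filling option
        [binary, "map", "-s", config, "-g", prefix, "-p", str(cpus)],
        [binary, "scaff", "-g", prefix, "-p", str(cpus)],
    ]


if __name__ == "__main__":
    # Hypothetical file names; nothing is executed, the commands are only printed.
    print(soapdenovo_config("reads_1.fq", "reads_2.fq"))
    for command in soapdenovo_commands("pog.config", "pog_k31"):
        print(shlex.join(command))
```

Adding up the step times reported above, one full pass takes roughly 6.4 to 7.4 hours, with pregraph accounting for more than half of it.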
  
archive/computer_resources/assemblies.txt · Last modified: 2015/09/02 16:53 by 92.247.181.31