assemblies/

This directory has a subdirectory for each organism.

test/

For test assemblies provided by the tool makers to check that installation is correct.

  • velvet
    • velvet-assembly1/ A first attempt to use velvet to assemble its own simulated 100kb test genome from short 35b reads and long 100b reads. The assembly worked very well in the end. Was reminded that with short reads, k must be kept small to get adequate coverage.
      • 6th try, k=21 cov=19 -shortPaired -long
      • Final graph has 9 nodes and n50 of 99975, max 99975, total 100143, using 137863/144858 reads
      • RESULTS: excellent! (TEST DATA ONLY)
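
The n50 figures quoted on this page summarize how contiguous an assembly is. As a rough illustration, the usual definition (the largest length L such that sequences of length at least L add up to half of the total) can be sketched as below. The function name nx and the toy lengths are made up for this example, and individual assemblers may differ in details such as units or which sequences they count, so this is not guaranteed to reproduce any tool's reported number.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx statistic for a list of sequence lengths.

    Nx is the largest length L such that sequences of length >= L
    together cover at least `fraction` of the total length.
    fraction=0.5 gives N50, fraction=0.9 gives N90.
    """
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length


# Hypothetical contig lengths, not taken from any assembly on this page.
toy_lengths = [6, 5, 4, 3, 2]
print("N50:", nx(toy_lengths))
print("N90:", nx(toy_lengths, 0.9))
```

With the toy lengths the total is 20, so the N50 is reached once the two longest contigs (6 + 5 = 11) pass the halfway mark of 10.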

Pog/

Pyrobaculum oguniense assemblies

  • Newbler (plus map-colorspace)
    • newbler-assembly1/ is an attempt to do a de novo assembly using the 454 tools (Newbler) version 2.3, starting with the entire set of reads (including any contaminants). This resulted in 43 contigs and 2,449,932 bases.
    • newbler-clean1/ does not create an assembly; instead, it is an attempt to remove contaminant reads from the Pog 454 data by removing reads that map to Helicobacter pylori genomes. The results are in newbler-clean1/sff_cleaned/no_Hyp.sff
    • newbler-assembly2/ is a second de novo assembly using Newbler, starting from the cleaned reads of newbler-clean1/sff_cleaned/no_Hyp.sff. It gets 42 contigs and 2,449,409 bases.
    • newbler-assembly3/ starts from the same sff file as newbler-assembly2/ but raises the expected coverage to 60 (close to actual coverage). It gets 41 contigs and 2,449,426 bases, still more than the old version of Newbler got after similar cleaning. The contigs have been mapped to the finished genome (using megablast, blastn, blat, and pluck-scripts/find-dna-differences). All the contigs map cleanly to the finished genome. If contigs map to more than one place, find-dna-differences may (incorrectly) report it as not mapping.
    • map-colorspace3/ uses the pluck-scripts script map-colorspace to map the SOLiD mate-pair reads onto the contigs of the newbler-assembly3/ run. The intent is to find which contigs join to which others. The numbering starts with 3, not 1, so that the map-colorspace directories correspond to the newbler-assembly directories that they are mapping onto.
    • newbler-partial3/ assembled the partially-assembled reads of newbler-assembly3/ to see if any extended or connected contigs. Seven of the 131 new contigs could be used to extend newbler-assembly3/ contigs, but none spanned 2 contigs.
    • newbler-assembly4/ starts from the same sff file as newbler-assembly2/ and newbler-assembly3/ but adds the contigs of newbler-partial3/ as extra reads. This did not help, getting 45 contigs and 2,449,287 bases.
    • newbler-assembly5/ starts from the same sff file as newbler-assembly2,3,4 but adds 45 Sanger reads totalling 44,187 bases from PCR reactions (mainly designed to test contig-join hypotheses). It gets 31 contigs and 2,451,007 bases.
    • map-colorspace5/ maps the SOLiD mate-pair data onto the contigs of newbler-assembly5/. Other than some problems placing contig4 and the ece insertions, we can reconstruct some pretty large chunks of the genome from the mate-pair ends. This directory contains the trim9.joins file, which is needed for doing the homework to attempt to reconstruct the genome.
  • euler
    • euler-assembly1/
  • euler-sr
    • euler-sr-assembly1/
  • mira
  • velvet
    • velvet-assembly1/ Assembling Pog 454 long reads with velvet. After very poor results with default settings, eventually started to get good results by getting the expected coverage (60) and coverage cutoff (13) correct. It took a long time to try different parameter settings. Using the long reads as both short and long reads also gave substantially better results. Because these were long reads, we could set k up to 31. Also tried a specially compiled version of velvet that could use k > 31, but cannot report any improvement yet. Given that the average read is 370b, it should have been able to support longer k-values. Best results so far:
      • Final graph has 2176 nodes {195 contigs over 62b} and n50 of 224364, max {contig size} 680241, total {genome size} 2481051, using 778249/782604 reads
    • velvet-assembly1a/ Assembling Pog 454 long reads with velvet. This is just a re-run of velvet-assembly1 with the best parameters found, using Kevin's makefile adapted to the velvet data.
    • velvet-assembly2/ Assembling Pog SOLiD mate-paired 25b reads with velvet in double-encoded colorspace (24 DE base reads). Have not had time to optimize parameters yet. Did get up to a max contig size of 95k.
      • velveth_de out 21 -shortPaired /campusdata/BME235/data/Pog/solid_run/paired/output/doubleEncoded_input.de
      • velvetg_de out -exp_cov 50 -cov_cutoff 13 -ins_length 2200
      • Final graph has 3602 nodes and n50 of 4851, max 94854, total 1767903, using 28785664/61262410 reads
  • SOAPdenovo
    • SOAPdenovo-assembly1/ Assembling Pog 454 long reads with SOAPdenovo. After being simply unable to get any version of the program to read a FASTA file despite documentation examples, I finally found a utility sff2fastq that made it possible to run SOAPdenovo on Pog 454 fastq. I have not had time to optimize parameters yet. The largest contig made with default params was just 4k. Later raised cutoff to 12 and got maxcontig of 70k. Could not run the scaffold step because there are no paired libs in this data set.
  • Ray
    • Ray-assembly1/ Assembling Pog 454 long reads with Ray, a parallel implementation of the OpenAssembler. This software seems to be Canadian. It took 3 hours to run, and the output was not very good, with a max contig size of about 12k. Sadly, there are no parameters to tweak.
  • ABySS
    • abyss-assembly1/ Assembling Pog 454 long reads with ABySS. The best parameters found were k-mer size 36 and coverage cutoff 15.
      • ABYSS -k 36 -c 15 both.fq
      • Total size: mean 1844.8 sd 3479.7 min 36 (1179) max 32566 (556) median 204
  • PCAP
    • pcap-assembly1/ Assembled Pog 454 long reads with PCAP. Sanger reads are not included. The first run used default parameters, but it was necessary to increase the minimum depth coverage for repeats before we got anything good. The results below are from a run with the minimum depth coverage for repeats set to 200 and the rest of the parameters unmodified.
      • faSize contigs.bases info:
        • 2506151 bases (8 N's 2506143 real 2506143 upper 0 lower) in 219 sequences in 1 files
        • Total size: mean 11443.6 sd 65849.3 min 56 (Contig174.1) max 611479 (Contig0.1) median 195
        • N count: mean 0.0 sd 0.2
        • U count: mean 11443.6 sd 65849.3
      • Using Kevin's makefile, the blat alignments showed large contigs that looked basically correct, except for contig 8.
      • However, many of them overlapped, unlike the Newbler output. This may have been due to a difference in the way Newbler and PCAP tried to handle the mixed population in the sample, where 3 inverting regions are found with various frequencies.
      • Also, a cutoff should probably be applied somewhere after the 17th largest contig, because most of the rest of the 219 were small contigs that probably represent noise.
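
The PCAP note above suggests keeping only the largest contigs and treating the rest as noise. A minimal sketch of that kind of filter follows. The helper names read_fasta and largest_contigs and the toy records are invented for illustration, and the cutoff is something to choose by inspecting the length distribution rather than a fixed rule.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def largest_contigs(records, keep):
    """Return the `keep` longest records, longest first."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)[:keep]


# Toy input; the names only imitate PCAP-style contig names.
toy = ">Contig0.1\nACGTACGTAC\n>Contig1.1\nACG\n>Contig2.1\nACGT\nAC\n"
top = largest_contigs(read_fasta(toy), keep=2)
print([(name, len(seq)) for name, seq in top])
```

For a real contigs file one would read the text from disk and pass keep=17 (or whatever cutoff the length distribution suggests) instead of the toy values.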

slug/

  • newbler-assembly1/ first attempt at de novo assembly using Newbler, using all the reads from 454_run1 and 454_run2.
    • This assembly of 499,873 reads including 138,351,643 bases produced only 2,910,773 bases assembled into 8,963 contigs.
    • The longest contig is 5783 bases.
    • From the total number of bases in the assembly, I estimate the coverage to be about 0.043x and the genome size to be about 3.2E9 basepairs. (See the README file for the calculation.)
    • Much of the assembly is low-complexity regions (repetitions of short repeats (AT)*, (AAG)*, (AG)*, (AC)*, (AGT)*, (AGAT)*, (ACAT)*, (AAC)*, (AACG)*, … ).
    • The most common 14-mer that is not a repeat of a short k-mer is TAGTTTACAGCTTG (so that is what we should put on the T-shirt).
  • newbler-mapping1-lottia/ tries to do a reference-based assembly with the Lottia gigantea genome as a reference.
    • The reference has 4475 contigs with 359,512,207 bases.
    • The output has 183 contigs with 29,389 bases.
    • The longest contig is only 644 bases—way too small to be of much use.
  • newbler-mapping2-seahare/ tries to do a reference-based assembly with the Aplysia californica genome as a reference.
    • The sea hare reference has 8767 contigs, comprising 715,806,041 bases.
    • The output has 2664 contigs, comprising 443,648 bases (still less than the de novo assembly).
    • The longest contig is only 1876 bases.
  • SOAPdenovo-assembly1/ First run of SOAPdenovo on Illumina paired ends.
    • SOAPdenovo requires fastq input files.
    • It was used by BGI to assemble the panda genome.
    • Used kolossus, which has 1TB and 64 cpus.
    • Ran with k=31 and k=23. k=31 was better (9k maxcontig), so ran it with filling -R to get 12k maxcontig.
    • Then ran the scaffolding steps with 200bp insert size.
    • For all steps, used low default cutoffs since our 10x coverage is not high. 21k max scaffold size.
    • Estimated genome size is around 3G.
    • The 4 steps are
      1. pregraph (3.5 to 4.5 hours for 30 to 60 cpus)
      2. contig (1.3 hours)
      3. map (0.6 hours with 60cpus) - paired ends
      4. scaff (1 hour with 60cpus)
  • barcode-of-life/ attempt to assemble the mitochondrial genome, documented on its own page: mitochondrion
  • SOAPdenovo-assembly2/ Assembly with new + old Illumina and 454 data.
    • SOAPdenovo 1.05, which can handle gzipped fastq files.
    • Runs with k=27, 31, 47, and 63 so far. k=47 was the best overall. k=63 got the longest contig (~14.9kb).
    • Run parameters:
      1. pregraph:
        • lowest count size of 2 (-d 2)
      2. contig:
        • solve tiny repeats on (-R)
      3. map:
        • all default
      4. scaff:
        • intra-scaffold gap closure on (-F)
    • Statistics for each k-mer size assembly (using both Illumina and 454 data for contigs and scaffolding):
      • k31:
        • 1,298,372 scaffolds from 4,814,226 contigs sum up 632,702,276bp, with average length 487, 0 gaps filled
        • 3,611,844 scaffolds&singleton sum up 1,133,413,022bp, with average length 313
        • the longest is 10,340bp, scaffold N50 is 442 bp, scaffold N90 is 148 bp
      • k47:
        • 871,819 scaffolds from 5,306,463 contigs sum up 530,762,874bp, with average length 608, 0 gaps filled
        • 4,203,195 scaffolds&singleton sum up 1,296,678,043bp, with average length 308
        • the longest is 14,750bp, scaffold N50 is 458 bp, scaffold N90 is 140 bp
      • k63:
        • 270,887 scaffolds from 4,022,505 contigs sum up 139,720,415bp, with average length 515, 0 gaps filled
        • 3,710,532 scaffolds&singleton sum up 690,332,560bp, with average length 186
        • the longest is 14,897bp, scaffold N50 is 232 bp, scaffold N90 is 112 bp
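
The slug newbler-assembly1 entry above quotes 138,351,643 read bases, a coverage of about 0.043x, and a genome size of about 3.2E9 basepairs. The README calculation is not reproduced here. The sketch below only checks that those figures agree under the standard relation coverage = total read bases / genome size. The function names are made up for this example.

```python
def coverage(total_read_bases, genome_size):
    """Average depth of coverage: read bases divided by genome size."""
    return total_read_bases / genome_size


def genome_size_from_coverage(total_read_bases, depth):
    """Invert the same relation to estimate genome size from coverage."""
    return total_read_bases / depth


read_bases = 138_351_643   # total bases in the 454 reads (from the page)
genome_estimate = 3.2e9    # genome size estimate (from the page)

print(f"coverage: {coverage(read_bases, genome_estimate):.3f}x")
print(f"genome size at 0.043x: {genome_size_from_coverage(read_bases, 0.043):.2e} bp")
```

Either quantity follows from the other once the total number of read bases is fixed, so an independent estimate of one of them is still needed. That is presumably what the README calculation supplies.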
archive/computer_resources/assemblies.txt · Last modified: 2015/09/02 16:53 by 92.247.181.31