lecture_notes:04-20-2015

Differences

This shows you the differences between two versions of the page.

lecture_notes:04-20-2015 [2015/04/21 09:24]
gepoliano created
lecture_notes:04-20-2015 [2015/04/25 03:03]
calef [User experience]
Line 1:
 ======Team 1 report: assembly with Meraculous======
 =====Basic features=====
-Published by the US Department of energy. Meraculous was initially designed for haploid assembly, but currently supports diploid assembly as well. The advantages of this assembler include multi-threaded and parallelized computation, absence of error-correction for faster processing, paired-end short reads compatibility (e.g., Illumina), efficient and conservative traversal of subgraphs of the de Bruijn graph, selection of kmer set, production of a set of maximal linear sub-paths of the de Bruijn graph, alignment of reads to assembly in order to identify useful read-pair information and closure of gaps. Meraculous has been used to assemble //Pichia stipitis// genome, producing 15.4 Mb genome, 75 bp paired reads with 425x coverage. As a result, 95% of the genome was covered and an N50 101 kb was obtained.
+Published by the Joint Genome Institute, part of the US Department of Energy. Meraculous was initially designed for haploid assembly, but currently supports diploid assembly as well. The advantages of this assembler include multi-threaded and parallelized computation, absence of error correction for faster processing, compatibility with paired-end short reads (e.g., Illumina), efficient and conservative traversal of subgraphs of the de Bruijn graph, selection of the kmer set, production of a set of maximal linear sub-paths of the de Bruijn graph, and alignment of reads to the assembly in order to identify useful read-pair information and close gaps. Meraculous has been used to assemble the 15.4 Mb //Pichia stipitis// genome using 75 bp paired reads with 425x coverage. The resulting assembly covered 95% of the genome and had an N50 of 101 kb.
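The N50 and coverage figures quoted above can be made concrete with a small calculation. The following is only a sketch: the function names `n50` and `reads_for_coverage` are hypothetical helpers written for these notes, not part of Meraculous, and N50 is taken in its usual sense (the length L such that sequences of length L or longer hold at least half of the assembled bases).

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths.

    Sequences are visited from longest to shortest; N50 is the length of
    the sequence at which the running total first reaches half of the
    total assembled length. Returns 0 for an empty list.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def reads_for_coverage(genome_size, coverage, read_length):
    """Approximate number of reads needed for a given average coverage.

    Uses coverage = (number of reads * read length) / genome size.
    """
    return coverage * genome_size / read_length


# Toy contig lengths, invented for illustration only.
toy_n50 = n50([2, 3, 4, 5, 6])

# Figures from the notes: 15.4 Mb genome, 425x coverage, 75 bp reads.
approx_reads = reads_for_coverage(15_400_000, 425, 75)
```

With the figures from the notes, `approx_reads` comes out at roughly 87 million reads, which gives a sense of how much data a 425x data set represents for a genome of this size.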
 =====Meraculous algorithm=====
  
   * Counts occurrences of each kmer in the data set.
-  * Removes kmers whose frequency are below the user threshold.
+  * Removes kmers whose frequency is below a threshold provided by the user.
   * For each kmer, counts the number of high-quality single-base extensions
-  * Classifies the 5' and 3' ends of each kmer
+  * Classifies the 5' and 3' ends of each kmer as U, F, or X, corresponding to having exactly one (unique), multiple (fork), or zero high-quality single-base extensions
   * Stores the extensions of kmers with a classification in a hash
-  * Removes non-reciprocal linkages between kmers
+  * Removes non-reciprocal U-U extensions between kmers (i.e. an extension where the end of one kmer is marked as U but the matching end of the other is marked F).
+  * Stores the linear subgraph of U-U extensions
   * Selects kmers at random and extends outwards to produce contigs
   * Aligns all reads to contigs via BLAST
   * Assembles contigs into scaffolds using paired-end data
   * Searches unaligned reads as potential gap-closers using mate-pair data
- 
 =====Meraculous limitations=====
-  * The assembler relies on data with high quality in order to avoid error correction
+  * The assembler relies on high-quality data in order to avoid error correction, and also requires high coverage
   * Initial release did not support polyploid genome assembly because it allowed only linear subgraphs of the de Bruijn graph
-  * Low memory footprint
+  * High disk space usage
 =====User experience=====
   * Requires an array of other scripts in other languages
   * Most of the high-level scripts are written in Perl
-  * Tested the program in small dataset and obtained contigs
+  * Runs from a shell script and a user-provided config file
+  * SGE-aware, handles qsub and job monitoring
+  * Pipeline is well subdivided; running the program produces intermediate files and executables, allowing the user to suspend, resume, or restart the run from any step in the pipeline
+  * Thorough error logging for each step in the algorithm
+  * Tested the program with the packaged test data and obtained contigs
 =====Installation=====
-  * Main issue was get all dependencies together
+  * Main issues were a new version of GCC and getting all the dependencies together (~16 hrs)
-  * There was one non-standard perl mode needed
+  * There was one non-standard Perl module needed
+  * Files with carriage returns
   * Some scripts contain errors, but they aren't hard to fix.
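For the carriage-return issue noted above, one possible fix is to rewrite the affected files with Unix line endings before running them. The sketch below is an assumption about how this could be handled, not the procedure the team used; `normalize_newlines` and `fix_file` are hypothetical helper names.

```python
from pathlib import Path


def normalize_newlines(data):
    """Convert Windows (CRLF) line endings in a bytes object to Unix (LF)."""
    return data.replace(b"\r\n", b"\n")


def fix_file(path):
    """Rewrite the file at path in place if it contains CRLF line endings.

    Returns True if the file was changed, False otherwise.
    """
    path = Path(path)
    original = path.read_bytes()
    fixed = normalize_newlines(original)
    if fixed != original:
        path.write_bytes(fixed)
        return True
    return False
```

Working on bytes rather than decoded text avoids guessing the encoding of the scripts being repaired.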
 =====Running Meraculous=====
lecture_notes/04-20-2015.txt · Last modified: 2015/04/25 03:07 by calef