lecture_notes:04-20-2015

Differences

This shows you the differences between two versions of the page.

lecture_notes:04-20-2015 [2015/04/25 02:55]
calef [Basic features]
lecture_notes:04-20-2015 [2015/04/25 03:07] (current)
calef [Error correction]
Line 5: Line 5:
  
   * Counts occurrences of each kmer in the data set.
-  * Removes kmers whose frequency are below the user threshold.
+  * Removes kmers whose frequency is below a threshold provided by the user.
   * For each kmer, counts the number of high-quality single-base extensions
-  * Classifies the 5' and 3' ends of each kmer
+  * Classifies the 5' and 3' ends of each kmer as U, F, or X, corresponding to having exactly one, multiple, or zero high-quality single-base extensions
   * Stores the extensions of kmers with a classification in a hash
-  * Removes non-reciprocal linkages between kmers
+  * Removes non-reciprocal U-U extensions between kmers (i.e. an extension where the end of one kmer is marked as U but the other is marked F).
+  * Stores the linear subgraph of U-U extensions
   * Selects kmers at random and extends outwards to produce contigs
   * Aligns all reads to contigs via BLAST
   * Assembles contigs into scaffolds using paired-end data
   * Searches unaligned reads as potential gap-closers using mate-pair data
-
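The counting, filtering, classification, and extension steps listed above can be sketched in a few lines of Python. This is only an illustration under heavy simplifying assumptions, not the Meraculous implementation: it works on the forward strand only, classifies only the 3' end, uses raw extension counts in place of base-quality scores, and skips the reciprocity check. The function names and the `min_count` / `min_ext` parameters are hypothetical.

```python
from collections import Counter, defaultdict

BASES = "ACGT"

def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def classify_right_ends(reads, k, min_count=2, min_ext=2):
    """Return {kmer: (code, base)} for the 3' end of each retained k-mer.

    U = exactly one well-supported extension, F = fork (several),
    X = no well-supported extension. Support is approximated by a
    count threshold (assumption: no quality scores are available).
    """
    counts = count_kmers(reads, k)
    solid = {km for km, c in counts.items() if c >= min_count}
    ext = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k):
            km = read[i:i + k]
            if km in solid:
                ext[km][read[i + k]] += 1
    table = {}
    for km in solid:
        good = [b for b in BASES if ext[km][b] >= min_ext]
        if len(good) == 1:
            table[km] = ("U", good[0])
        elif len(good) > 1:
            table[km] = ("F", None)
        else:
            table[km] = ("X", None)
    return table

def extend_right(seed, table):
    """Walk rightwards from a seed k-mer while the 3' end is U."""
    contig, km, seen = seed, seed, {seed}
    while table.get(km, ("X", None))[0] == "U":
        nxt = km[1:] + table[km][1]
        if nxt in seen or nxt not in table:
            break  # stop on a cycle or on a k-mer that was filtered out
        contig += nxt[-1]
        seen.add(nxt)
        km = nxt
    return contig
```

For example, `extend_right("ACGT", classify_right_ends(["ACGTTGCA"] * 3, k=4))` walks through the unique extensions until it reaches a k-mer with no extension. A fork (F) stops the walk, which is why only linear stretches of the graph end up in contigs.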
 =====Meraculous limitations=====
-  * The assembler relies on data with high quality in order to avoid error correction
+  * The assembler relies on high-quality data in order to avoid error correction; it also requires high coverage
   * Initial release did not support polyploid genome assembly because it allowed only linear subgraphs of the de Bruijn graph
-  * Low memory footprint
+  * High disk space usage
 =====User experience=====
   * Requires an array of other scripts in other languages
   * Most of the high-level scripts are written in Perl
-  * Tested the program in small dataset and obtained contigs
+  * Tested the program with the packaged test data and obtained contigs
 =====Installation=====
-  * Main issue was get all dependencies together
+  * Main issues were a new version of GCC and getting all the dependencies together (~16 hrs)
-  * There was one non-standard perl mode needed
+  * There was one non-standard Perl module needed
+  * Files with carriage returns
   * Some scripts contain errors but they aren't hard to fix.
 =====Running Meraculous=====
-  * Execute run_meraculous.sh scripts along with the configuration file
+  * Execute the run_meraculous.sh script along with a user-provided configuration file
   * Configuration file contains info on where the data is and what format it comes in
-  * It creates a timestamped folder that includes directories containing results of each step and executables to modify the run
+  * Creates a timestamped folder that includes directories containing results of each step and executables to suspend, resume, or restart the run from that step
-  * Then you can check the errors that made a run fail and resume the run
+  * Thorough error-logging at each step, allowing you to check the errors that made a run fail and then resume the run after fixing the errors
-  * Logs are informative
+  * SGE-aware, handles qsub and monitoring jobs
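The per-step directories, error logs, and resume behaviour described above follow a common checkpointing pattern. The sketch below shows that pattern in general terms. It is not how run_meraculous.sh is written: the function names, the `done` marker file, and the `log.txt` file name are all assumptions made for illustration.

```python
import time
from pathlib import Path

def new_run_dir(base="."):
    """Build a timestamped run directory path (hypothetical naming scheme)."""
    return Path(base) / time.strftime("run_%Y-%m-%d_%Hh%Mm%Ss")

def run_pipeline(stages, run_dir, start_at=None):
    """Run (name, func) stages in order, one subdirectory per stage.

    A finished stage leaves a 'done' marker and is skipped on the next
    call, so a failed run resumes at the stage that failed. Passing
    start_at forces a restart from that stage onwards.
    Returns the name of the failing stage, or None on success.
    """
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    names = [name for name, _ in stages]
    force_from = names.index(start_at) if start_at else len(names)
    for idx, (name, func) in enumerate(stages):
        stage_dir = run_dir / name
        stage_dir.mkdir(exist_ok=True)
        marker = stage_dir / "done"
        if marker.exists() and idx < force_from:
            continue  # already completed in an earlier attempt
        marker.unlink(missing_ok=True)
        try:
            func(stage_dir)
        except Exception as exc:
            # Record the error next to the stage output so it can be inspected.
            (stage_dir / "log.txt").write_text(f"{name} failed: {exc}\n")
            return name
        marker.touch()
    return None
```

After a failure, the log in the failing stage's directory says what went wrong. Once the cause is fixed, calling `run_pipeline` again with the same `run_dir` picks up at that stage without repeating earlier ones.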
 =====Overall impression=====
   * Straightforward to figure out what went wrong, requiring just a basic understanding of Perl
Line 38: Line 39:
   * Logs are very useful
 =====Error correction=====
-  * Meraculous requires error correction and adapter removal. Trimming is unnecessary.
+  * Meraculous requires error correction and adapter removal. Trimming is unnecessary, as low-quality reads are ignored during contig formation
-  * High error rates stop the assembler. Need to be removed.
+  * High error rates bog down the assembler. Errors need to be removed.
   * Kmer size chosen directly affects assembly quality
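One way to see why kmer size matters is to count how many distinct kmers survive a frequency threshold at several values of k on the same reads. The sketch below does only that. It is a rough illustration and not the statistical model used by any k-selection tool; the function names and the `min_count` parameter are hypothetical.

```python
from collections import Counter

def solid_kmer_count(reads, k, min_count=2):
    """Number of distinct k-mers seen at least min_count times."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return sum(1 for c in counts.values() if c >= min_count)

def compare_k(reads, ks, min_count=2):
    """Map each candidate k to its count of retained k-mers."""
    return {k: solid_kmer_count(reads, k, min_count) for k in ks}
```

If k is too small, repeats collapse many positions onto the same kmer. If k is too large, few kmers reach the threshold and none fit once k exceeds the read length, so the retained count drops toward zero.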
 =====KmerGenie=====
lecture_notes/04-20-2015.1429930524.txt.gz · Last modified: 2015/04/25 02:55 by calef