Differences

This shows you the differences between two versions of the page.

--- lecture_notes:04-20-2015 [2015/04/25 02:55]
calef [Basic features]
+++ lecture_notes:04-20-2015 [2015/04/25 03:07] (current)
calef [Error correction]
@@ Line 5: / Line 5: @@
   * Counts occurrences of each kmer in the data set.
-  * Removes kmers whose frequency are below the user threshold.
+  * Removes kmers whose frequency is below a threshold provided by the user.
   * For each kmer, counts the number of high-quality single-base extensions
-  * Classifies the 5' and 3' ends of each kmer
+  * Classifies the 5' and 3' ends of each kmer as U, F, or X, corresponding to having exactly one (unique), multiple (fork), or zero high-quality single-base extensions
   * Stores the extensions of kmers with a classification in a hash
-  * Removes non-reciprocal linkages between kmers
+  * Removes non-reciprocal U-U extensions between kmers (i.e. an extension where the end of one mer is marked as U but the other is marked F).
+  * Stores the linear subgraph of U-U extensions
   * Selects kmers at random and extends outwards to produce contigs
   * Aligns all reads to contigs via BLAST
   * Assembles contigs into scaffolds using paired-end data
   * Searches unaligned reads as potential gap-closers using mate-pair data
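
The first few steps in the list above (counting kmers, filtering by a user threshold, classifying each end) can be illustrated with a minimal sketch. This is not Meraculous code. The function names and the `min_count` and `min_ext` parameters are hypothetical. The sketch also assumes that U means exactly one extension, F a fork with several, and X none, which is the reading consistent with the U-U linkage items. As a further simplification, an extension counts as "high-quality" when it is seen at least `min_ext` times, whereas the real assembler uses base quality scores. Reverse complements are ignored.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count occurrences of every kmer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_kmers(counts, min_count):
    """Keep only kmers seen at least min_count times (the user threshold)."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


def classify_ends(reads, kept, k, min_ext):
    """Return {kmer: (code_5prime, code_3prime)} for each kept kmer.

    Assumption: an extension base is accepted when it is observed at
    least min_ext times; base quality scores are not modelled here.
    """
    left = defaultdict(Counter)   # bases seen immediately before the kmer
    right = defaultdict(Counter)  # bases seen immediately after the kmer
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in kept:
                continue
            if i > 0:
                left[kmer][read[i - 1]] += 1
            if i + k < len(read):
                right[kmer][read[i + k]] += 1

    def code(tally):
        # U: exactly one accepted extension, F: several, X: none
        n = sum(1 for c in tally.values() if c >= min_ext)
        return "X" if n == 0 else "U" if n == 1 else "F"

    return {kmer: (code(left[kmer]), code(right[kmer])) for kmer in kept}


if __name__ == "__main__":
    reads = ["ACGTAC", "ACGTAC", "ACGTTT"]
    kept = filter_kmers(count_kmers(reads, 3), min_count=2)
    for kmer, ends in sorted(classify_ends(reads, kept, 3, min_ext=1).items()):
        print(kmer, ends)
```

Raising `min_ext` turns a fork caused by a rare, possibly erroneous base back into a unique extension, which is one way to see why the thresholds matter for contig length.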
 =====Meraculous limitations=====
-  * The assembler relies on data with high quality in order to avoid error correction
+  * The assembler relies on high-quality data in order to avoid error correction, and also requires high coverage
   * Initial release did not support polyploid genome assembly because it allowed only linear subgraphs of the de Bruijn graph
-  * Low memory footprint
+  * High disk space usage
 =====User experience=====
   * Requires an array of other scripts in other languages
   * Most of the high-level scripts are written in Perl
-  * Tested the program in small dataset and obtained contigs
+  * Tested the program with the packaged test data and obtained contigs
 =====Installation=====
-  * Main issue was get all dependencies together
+  * Main issues were a new version of GCC and getting all the dependencies together (~16 hrs)
-  * There was one non-standard perl mode needed
+  * There was one non-standard perl module needed
+  * Files with carriage returns
   * Some scripts contain errors, but they aren't hard to fix.
 =====Running Meraculous=====
-  * Execute run_meraculous.sh scripts along with the configuration file
+  * Execute the run_meraculous.sh script along with a user-provided configuration file
   * Configuration file contains info on where the data is and what format it comes in
-  * It creates a timestamped folder that includes directories containing results of each step and executables to modify the run
+  * Creates a timestamped folder that includes directories containing results of each step and executables to suspend, resume, or restart the run from that step
-  * Then you can check the errors that made a run fail and resume the run
+  * Thorough error-logging at each step, allowing you to check the errors that made a run fail and then resume the run after fixing the errors
-  * Logs are informative
+  * SGE-aware, handles qsub and monitoring jobs
 =====Overall impression=====
   * Straightforward to figure out what went wrong; this requires only a basic understanding of Perl
@@ Line 38: / Line 39: @@
   * Logs are very useful
 =====Error correction=====
-  * Meraculous requires error correction and adapter removal. Trimming is unnecessary.
+  * Meraculous requires error correction and adapter removal. Trimming is unnecessary, as low quality reads are ignored during contig formation.
-  * High error rates stop the assembler. Need to be removed.
+  * High error rates bog down the assembler, so errors need to be removed.
   * Kmer size chosen directly affects assembly quality
 =====KmerGenie=====

Banana Slug Genomics
