This shows you the differences between two versions of the page.
|
lecture_notes:04-20-2015 [2015/04/25 02:55] calef [Basic features] |
lecture_notes:04-20-2015 [2015/04/25 03:07] (current) calef [Error correction] |
||
|---|---|---|---|
| Line 5: | Line 5: | ||
| * Counts occurrences of each kmer in the data set. | * Counts occurrences of each kmer in the data set. | ||
| - | * Removes kmers whose frequency are below the user threshold. | + | * Removes kmers whose frequency is below a threshold provided by the user. |
| * For each kmer, counts the number of high-quality single-base extensions | * For each kmer, counts the number of high-quality single-base extensions | ||
| - | * Classifies the 5' and 3' ends of each kmer | + | * Classifies the 5' and 3' ends of each kmer as X, U, or F, corresponding to having zero, one, or multiple high-quality single-base extensions |
| * Stores the extensions of kmers with a classification in a hash | * Stores the extensions of kmers with a classification in a hash | ||
| - | * Removes non-reciprocal linkages between kmers | + | * Removes non-reciprocal U-U extensions between kmers (i.e. an extension where the end of one kmer is marked as U but the other is marked as F). |
| + | * Stores the linear subgraph of U-U extensions | ||
| * Selects kmers at random and extends outwards to produce contigs | * Selects kmers at random and extends outwards to produce contigs | ||
| * Aligns all reads to contigs via BLAST | * Aligns all reads to contigs via BLAST | ||
| * Assembles contigs into scaffolds using paired-end data | * Assembles contigs into scaffolds using paired-end data | ||
| * Searches unaligned reads as potential gap-closers using mate-pair data | * Searches unaligned reads as potential gap-closers using mate-pair data | ||
| - | |||
| =====Meraculous limitations===== | =====Meraculous limitations===== | ||
| - | * The assembler relies on data with high quality in order to avoid error correction | + | * The assembler relies on high-quality data in order to avoid error correction; it also requires high coverage |
| * Initial release did not support polyploid genome assembly because it allowed only linear subgraphs of the de Bruijn graph | * Initial release did not support polyploid genome assembly because it allowed only linear subgraphs of the de Bruijn graph | ||
| - | * Low memory footprint | + | * High disk space usage |
| =====User experience===== | =====User experience===== | ||
| * Requires an array of other scripts in other languages | * Requires an array of other scripts in other languages | ||
| * Most of the high-level scripts are written in Perl | * Most of the high-level scripts are written in Perl | ||
| - | * Tested the program in small dataset and obtained contigs | + | * Tested the program with the packaged test data and obtained contigs |
| =====Installation===== | =====Installation===== | ||
| - | * Main issue was get all dependencies together | + | * Main issues were the new version of GCC and getting all the dependencies together (~16 hrs) |
| - | * There was one non-standard perl mode needed | + | * There was one non-standard perl module needed |
| + | * Files with carriage returns | ||
| * Some scripts contain errors, but they aren't hard to fix. | * Some scripts contain errors, but they aren't hard to fix. | ||
| =====Running Meraculous===== | =====Running Meraculous===== | ||
| - | * Execute run_meraculous.sh scripts along with the configuration file | + | * Execute the run_meraculous.sh script along with a user-provided configuration file |
| * Configuration file contains info on where the data is and what format it comes in | * Configuration file contains info on where the data is and what format it comes in | ||
| - | * It creates a timestamped folder that includes directories containing results of each step and executables to modify the run | + | * Creates a timestamped folder that includes directories containing results of each step and executables to suspend, resume, or restart the run from that step |
| - | * Then you can check the errors that made a run fail and resume the run | + | * Thorough error-logging at each step, allowing you to check the errors that made a run fail and then resume the run after fixing the errors |
| - | * Logs are informative | + | * SGE-aware: handles qsub submission and job monitoring |
| =====Overall impression===== | =====Overall impression===== | ||
| * Straightforward to figure out what went wrong; requires only a basic understanding of Perl | * Straightforward to figure out what went wrong; requires only a basic understanding of Perl | ||
| Line 38: | Line 39: | ||
| * Logs are very useful | * Logs are very useful | ||
| =====Error correction===== | =====Error correction===== | ||
| - | * Meraculous requires error correction and adapter removal. Trimming is unnecessary. | + | * Meraculous requires error correction and adapter removal. Trimming is unnecessary, as low-quality reads are ignored during contig formation. |
| - | * High error rates stop the assembler. Need to be removed. | + | * High error rates bog down the assembler, so the errors need to be removed. |
| * Kmer size chosen directly affects assembly quality | * Kmer size chosen directly affects assembly quality | ||
| =====KmerGenie===== | =====KmerGenie===== | ||