lecture_notes:04-04-2011




====== Overview of Assembly ======

Kevin outlined the processes involved in assembling a genome.

  * Clean up the reads
  * Cluster reads and build contigs
  * Order and orient contigs

===== Clean Up Reads =====

There are two separate and distinct parts of data clean-up: error correction and contaminant removal.

=== Error Correction ===

  * May not be necessary for all types of data (Sanger/454).
  * Can be done before or after contig assembly.
    * Before: K-mer counting.
    * After: Map reads to consensus sequences from contigs.

== K-mer Counting ==

  * Count the number of occurrences of each K-mer in the reads.
  * Remove reads, or correct individual bases, where K-mers have low counts.
  * K-mer size must be large enough that the counts are not trivial (short K-mers recur by chance), but small enough to fit memory constraints.

=== Contaminant Removal ===

  * Contamination can come from many sources:
    * Human (dust)
    * Bacterial
    * Viral (hard to remove)
  * Use BLAST to remove sequences that are unexpected.
    * Expensive to run.
    * BLAST contigs instead of individual reads.
  * Strategies:
    * Look for specific contaminants (human, //E. coli//).
    * Examine ribosomal sequences to identify possible contaminants.
    * Look for things we would not expect to see (e.g. eukaryotic sequence in prokaryotes or vice versa).
  * Once a contig is identified as a contaminant:
    * Remove the contig and the reads that map to it.
    * Rebuild the contigs.
  * We risk removing parts of the target genome that are very similar to the contaminant.
    * Example from class: nitrogen-fixing genes common to two bacterial strains.

===== Cluster Reads and Build Contigs =====

  * Build a graph for the reads.
    * De Bruijn graph
      * Typically used for short reads.
    * Overlap consensus
      * Generally used for longer reads.
      * Larger memory requirements.
  * Reads that don't quite fit can be error corrected.
  * Ideally use high-quality data (Sanger, 454). The recent trend is to use cheaper data.
  * **Result**: Contigs

===== Order and Orient Contigs =====

  * Iterative process:
    * Use new data when available.
    * Map reads to the draft to improve the draft version.
    * Leftover reads are sent back for clustering.
  * Mate-pair data is useful for bridging contigs that are adjacent but do not overlap (due to missing data, repeat sequence, etc.).
  * **Result**: Scaffolds

====== Homework ======

Learn about the [[http://www.cbcb.umd.edu/software/jellyfish/|Jellyfish]] tool for K-mer counting. Try running it on the //Pyrobaculum// data. Use different parameters and monitor its memory usage. Fill in the wiki page for [[bioinformatic_tools:jellyfish|Jellyfish]].
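
As a warm-up for the homework, the K-mer counting idea from the error-correction notes can be sketched in a few lines of Python. This is only an illustration of what a K-mer counter computes, not how Jellyfish is implemented. The function names, the toy reads, and the ''min_count'' threshold are all made up for this example. A plain in-memory ''Counter'' like this would not scale to real read sets, which is why memory usage is worth monitoring with a dedicated tool.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (length-k substring) across all reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def low_count_kmers(counts, min_count):
    """Return the k-mers seen fewer than min_count times (candidate errors)."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Toy reads, invented for illustration; the last one carries a one-base change.
reads = ["ACGTACGT", "CGTACGTA", "ACGTACGT", "ACGAACGT"]
counts = count_kmers(reads, k=4)
suspect = low_count_kmers(counts, min_count=2)
```

Here every K-mer that overlaps the changed base in the last read occurs only once, so it ends up in ''suspect'', while K-mers shared between reads have higher counts. With a very small ''k'' almost every K-mer would be frequent and nothing would stand out; with a large ''k'' the number of distinct K-mers, and so the size of the table, grows.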

lecture_notes/04-04-2011.1301977444.txt.gz · Last modified: 2011/04/05 04:24 by svohr