lecture_notes:04-21-2010




===== April 21st, 2010 =====

==== Logistics-related notes ====

  * There will be a guest lecture on Friday, April 23rd, 2010. Please do not be late!
  * By Friday, April 23rd, 2010, all published sources that contain information pertaining to the banana slug should be added to the [[:banana_slug_biology|Banana Slug Biology]] page. **Is this the right page?**
    * Expect 40-50 different resources (papers, books, PhD theses).
    * The bibliography must be annotated, though annotations are not required this Friday.
  * As you run assembly tools, keep the [[computer_resources:assemblies|assemblies/]] wiki page updated!

==== Continuation of Newbler Assembler methods ====

  * By running Newbler with the settings described last Friday, Kevin obtained 43 contigs and ~2.4 megabases of DNA.
    * This result is not bad compared to most other assemblers.

=== Overview of directories ===

  * Newbler-clean
    * Not attempting to assemble data.
    * Used to remove contamination from H. pylori, which was sequenced on the same machine in the same run.
    * If you have a reference genome for your contaminants, you can try to clean them out computationally by mapping reads to it and removing the matches.
    * Makefile overview
      * Sets up a new mapping.
      * Maps reads from the 454 Pog data to the contaminant genome.
      * Interesting output: ReadStatus.txt
        * Contains the mapping status of each read (mapped, unmapped, partially mapped, too short).
      * Reads mapped to contaminants should be filtered out of the 454 data and should not be used for assembly.
      * Creates a new SFF file (using the sfffile utility) with the unmapped data in the file "no_Hyp.sff".
      * It may be beneficial to hang onto "too short" reads for use in assemblers that utilize shorter reads.
    * You can also use Megablast and NCBI's Taxonomy Report to identify contaminants.
  * Newbler-assembly2
    * Uses clean data (no H. pylori) from newbler-clean1.
    * One less small contig, mostly the same.
  * Newbler
    * Changed expected coverage to 60x. This number is closer to the real coverage, based on better estimates of genome size.
    * Still reported 41 contigs (or maybe one less?).
    * Good news: all contigs map to the reference genome.
  * Map-colorspace3 (Map-colorspace directories begin indexing at 3 rather than 1)
    * The scripts run in this directory were originally intended for finding inversions from mate-pair reads.
    * New features have been added since then.
      * Now looks for reads between contigs.
      * Tries to orient contigs.
  * Newbler-partial3
    * Attempts to use only partially mapped/unmapped reads.
    * The plan is to later map contigs from this assembly to the full contigs from before (extend edges?).
    * Megablast, blastn, blat, and find dna differences are four methods for mapping partially mapped reads onto contigs created by Newbler.
    * Megablast and blastn showed very similar results.
    * blat
      * Output is a psl file.
      * Nice results.
      * Very sensitive but also very slow.
      * Shows matches down to 14 bases in size.
      * Can handle introns and splice sites.
      * May be too sensitive for some purposes.
      * There are some parameters you can tweak that will alter the output.
      * Kevin thought it would be best to keep blat's default parameters and filter the output.
      * A script was written to filter the output and find the best matches.
      * For a resequencing project: take all contigs, map them with blat, sort by start site, and build a scaffold according to the order of the start sites of each contig.
    * Find dna differences
      * Shows differences between contigs and a reference genome in a human-readable format.
  * Newbler-assembly4
    * Adds partial3 contigs to the full search as reads.
    * Didn't seem to help much.
    * The resulting output had fewer total bases and more contigs than originally, so it is perhaps worse.
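The resequencing recipe in the blat notes (map contigs, sort by start site, read the scaffold order off the sorted list) can be sketched in Python. This is only an illustration, not Kevin's filtering script: the names ''Hit'', ''parse_psl'' and ''order_contigs'' are hypothetical, and "best match" is assumed here to mean the hit with the most matching bases per contig. The column positions follow the standard 21-column PSL layout.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    contig: str   # query name (PSL column 10, qName)
    matches: int  # number of matching bases (PSL column 1)
    strand: str   # "+" or "-" (PSL column 9)
    t_start: int  # start on the reference (PSL column 16, tStart)
    t_end: int    # end on the reference (PSL column 17, tEnd)


def parse_psl(lines):
    """Yield Hit records from PSL text, skipping header and blank lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Data rows have 21 tab-separated columns and begin with an integer.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        yield Hit(
            contig=fields[9],
            matches=int(fields[0]),
            strand=fields[8],
            t_start=int(fields[15]),
            t_end=int(fields[16]),
        )


def order_contigs(hits):
    """Keep one hit per contig (assumption: most matches wins),
    then sort the survivors by reference start position."""
    best = {}
    for hit in hits:
        current = best.get(hit.contig)
        if current is None or hit.matches > current.matches:
            best[hit.contig] = hit
    return sorted(best.values(), key=lambda h: h.t_start)
```

Keeping only the highest-scoring hit is one simple way to discard the very short (down to 14 bases) matches that blat reports with default parameters. A real filter would probably also need to handle contigs that legitimately map to more than one place.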
  * Newbler-assembly5
    * Utilized Sanger reads produced by David Bernick to join contigs.
      * 45 Sanger reads total.
      * Sanger reads were created by running PCR on conjectured contig join sites.
      * Two possible methods for guessing contig join sites:
        * Use mate-pair data from the SOLiD run as evidence to create primers.
        * David Bernick used known CRISPR repeats. Some contigs showed the start of a repeat, while others showed the end.
    * After adding the Sanger reads, the final assembly went down to 31 contigs.
    * In contig-length.rdb
      * Most contigs have ~0.16 reads/base.
      * The reads/base metric was used because it is simpler to compute than coverage.
      * Some contigs had higher values: ~0.32 reads/base (qty. 3) and ~0.44 reads/base.
        * Could 0.32 reads/base indicate two occurrences of the contig in the genome?
        * Could a contig at 0.44 reads/base occur three times in the genome?
      * All 0.16 reads/base contigs occurred a single time in the genome.
      * Two of the three 0.32 reads/base contigs appeared twice in the genome.
      * The third 0.32 reads/base contig was short, and its value was within the expected deviation in reads/base for a sequencing run.
      * The 0.44 reads/base contig occurred twice, not three times.
    * Fewer contigs and more bases.
  * Map-colorspace5
    * map-colorspace has many parameters.
      * You can list all the parameters by issuing the command "map-colorspace --help".
    * This command produces lots of output files (~2 per contig), which may prove problematic with assemblies that contain many contigs.
    * Set the length parameters using a histogram of reads mapped to contigs.
    * The merge_cross option merges all cross products into a single file rather than (number of contigs)^2 files.
    * Scaffold FASTA and color qualities are thrown out in this method.
    * If F3 or R3 maps, the program aggressively tries to map the paired read.
    * Output
      * Uniquely mapped: both the F3 and R3 (forward and reverse) reads map to the same contig.
      * Multiply mapped: can map to multiple contigs.
      * Wrong range, only F3 or R3 map.
      * Summary file
        * Maps F3 and R3 from the end of one contig to the beginning of another.
        * Shows that contigs are near, not touching.
        * Short contigs may be skipped in paired-end reads.
  * QUESTION: Can we make a single connected graph using this data and all possible paths?
    * Sometimes, if the data is particularly coherent. It is not guaranteed.
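The reads/base reasoning from the contig-length.rdb notes can be written out as a small worked example. This is a rough sketch under stated assumptions: the function names and the ''mean_read_length'' parameter are hypothetical, and the single-copy rate of 0.16 reads/base is taken from the notes above.

```python
def estimate_copies(reads_per_base, single_copy_rate=0.16):
    """Naive copy-number guess: ratio to the single-copy rate, rounded.

    Only a hint. Short contigs and run-to-run variation can push the
    ratio away from the true number of occurrences in the genome.
    """
    return max(1, round(reads_per_base / single_copy_rate))


def approx_coverage(reads_per_base, mean_read_length):
    """Approximate coverage as reads/base times the mean read length.

    mean_read_length is a value the caller has to supply; it is not
    given in the notes.
    """
    return reads_per_base * mean_read_length


if __name__ == "__main__":
    for rate in (0.16, 0.32, 0.44):
        print(rate, "reads/base -> naive copy estimate", estimate_copies(rate))
```

For 0.44 reads/base the ratio is 2.75, so the naive guess rounds up to three copies, whereas the notes record that this contig actually occurred twice. Likewise, one of the 0.32 reads/base contigs was a short contig within the expected deviation. The ratio is therefore better treated as a prompt for checking against the reference than as an answer.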
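The closing question (can the mate-pair links be joined into a single connected graph?) can be explored with a plain connectivity check. The sketch below assumes the links have already been extracted as pairs of contig names, for example from the summary file. The input format and the function name ''connected_components'' are hypothetical and are not part of map-colorspace.

```python
from collections import defaultdict, deque


def connected_components(contigs, links):
    """Group contigs into components joined by mate-pair links.

    contigs: iterable of contig names.
    links:   iterable of (contig_a, contig_b) pairs, one per pair of
             contigs bridged by F3/R3 reads (assumed input format).
    """
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    components = []
    for start in contigs:
        if start in seen:
            continue
        # Breadth-first search from each contig not yet visited.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components
```

If the function returns exactly one component, every contig is reachable from every other through the links. More than one component means the mate-pair data alone does not connect the assembly, which matches the "not guaranteed" answer above. Short contigs that paired reads skip over would show up as isolated components.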

lecture_notes/04-21-2010.1272054143.txt.gz · Last modified: 2010/04/23 20:22 by mpcusack