===== April 21st, 2010 =====

==== Logistic related notes ====

  * There will be a guest lecture on Friday, April 23rd, 2010. Please do not be late!
  * By Friday, April 23rd, 2010, all published sources that contain information pertaining to the banana slug should be added to the [[:banana_slug_biology|Banana Slug Biology]] page. **Is this the right page?**
    * Expect 40-50 different resources (papers, books, PhD theses).
    * The bibliography must be annotated, though annotations are not required this Friday.

==== Continuation of Newbler Assembler methods ====

  * By running Newbler using the settings described last Friday, Kevin obtained 43 contigs and ~2.4 megabases of DNA.
    * This result is not bad compared to most other assemblers.

=== Overview of directories ===

  * Newbler-clean
    * Does not attempt to assemble data.
    * Used to remove contamination from H. pylori, which was sequenced in the same run.
    * If you know your contaminants, you can try to clean them out computationally.
    * Makefile overview
      * Sets up a new mapping
      * Maps reads from the 454 Pog data to the contaminant genome
      * The file of interest is ReadStatus.txt
        * Contains the status of each read (mapped, unmapped, partially mapped, too short)
        * Reads mapped to contaminants should be filtered out of the 454 data and should not be used for assembly.
      * Creates a new SFF file containing the unmapped data, "no_Hyp.sff"
      * It may be beneficial to hang onto "too short" reads for use in assemblers which utilize shorter reads.
    * Removing the H. pylori reads eliminated one contig from the Newbler assembly.
    * You can also use Megablast and NCBI's Taxonomy Report to identify contaminants.
  * Newbler-assembly3 didn't work
    * Changed expected coverage to 60x. This number is closer to the final coverage.
    * Still reported 41 contigs
  * Map-colorspace3 (Map-colorspace directories begin indexing at 3 rather than 1)
    * The scripts run in this directory were originally intended for finding inversions from mate-pair reads.
    * New features have been added since then.
    * Now looks for reads between contigs
  * Newbler-partial3 attempts to use partially mapped reads to join contigs
    * Megablast, blastn, blat, and "find dna differences" are four methods for mapping partially mapped reads onto contigs created by Newbler.
    * Megablast and blastn showed very similar results
    * blat
      * Output is a psl file
      * Nice results
      * Very sensitive but also very slow
      * Shows matches down to 14 bases in size
      * Can handle introns and splice sites
      * May be too sensitive for some purposes
      * There are some parameters you can tweak which will alter the output.
        * Kevin thought it would be best to keep blat's default parameters and filter the output.
        * A script was written to filter the output and find the best matches.
      * For a resequencing project, take all contigs, map them with blat, sort by start site, and build a scaffold according to the order of the start sites of each contig.
    * Find dna differences
      * Shows differences between contigs and a reference genome in a human-readable format
  * Newbler-assembly4 failed
    * Used partially mapped reads to form contigs
    * Didn't seem to help much
    * The resulting output had fewer total bases and more contigs than the original assembly.
  * Newbler-assembly5
    * Utilized Sanger reads produced by David Bernick to join contigs
      * 45 Sanger reads total
      * Sanger reads were created by running PCR on conjectured contig join sites.
      * Two possible methods for guessing contig join sites
        * Use mate-pair data from the SOLiD run as evidence to create primers
        * David Bernick used known CRISPR repeats. Some contigs showed the start of a repeat, while others showed the end.
    * After adding the Sanger reads, the final assembly went down to 31 contigs
    * In contig-length.rdb
      * Most contigs have ~0.16 reads/base
      * The reads/base metric was used because it is simpler to compute than coverage.
      * Some contigs had higher values: ~0.32 reads/base (qty. 3) and ~0.44 reads/base
        * Could 0.32 reads/base indicate two occurrences of the contig in the genome?
        * Could 0.44 reads/base indicate three occurrences in the genome?
      * All 0.16 reads/base contigs occurred a single time in the genome
      * Two of the three 0.32 reads/base contigs appeared twice in the genome.
      * The third 0.32 reads/base contig was short, and its value was within the expected deviation in reads/base for a sequencing run
      * The 0.44 reads/base contig occurred twice, not three times
  * Map-colorspace5
    * Makefile has many parameters.
      * You can list all the parameters by issuing the command "mapcolorspace --help"
    * This command produces many output files (~2 per contig), which may prove problematic for assemblies that contain many contigs
    * Set length parameters using a histogram of reads mapped to contigs
    * The merge_cross option merges all cross products into a single file rather than (number of contigs)^2 files
    * Scaffold fasta and color qualities are thrown out in this method
    * If F3 or R3 maps, the program aggressively tries to map the paired read.
    * Output
      * Uniquely mapped: both the F3 and R3 (forward and reverse) reads map to the same contig
      * Multiply mapped: can map to multiple contigs
      * Wrong range, only F3 or R3 map
    * Summary file
      * Maps F3 and R3 from the end of one contig to the beginning of another
      * Shows that contigs are near, not touching.
      * Short contigs may be skipped in paired-end reads
    * QUESTION: Can we make a single connected graph using this data and all possible paths?
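
The resequencing recipe in the blat notes (map all contigs, sort by start site, order the scaffold accordingly) can be sketched roughly as follows. This is only an illustration, not the filtering script mentioned in the notes: the function names are hypothetical, and "best match" is assumed here to mean the alignment with the most matching bases, which may differ from the criterion actually used. The column positions follow the standard 21-column PSL layout.

```python
from collections import namedtuple

# One blat alignment: which contig (query) hit which reference (target), and where.
Hit = namedtuple("Hit", "contig target t_start t_end matches")


def parse_psl(lines):
    """Yield one Hit per PSL alignment row, skipping header and blank lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Data rows have 21 tab-separated columns and start with the match count.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        yield Hit(
            contig=fields[9],          # qName
            target=fields[13],         # tName
            t_start=int(fields[15]),   # tStart
            t_end=int(fields[16]),     # tEnd
            matches=int(fields[0]),    # number of matching bases
        )


def best_hits(hits):
    """Keep one hit per contig (assumption: the one with the most matches)."""
    best = {}
    for hit in hits:
        current = best.get(hit.contig)
        if current is None or hit.matches > current.matches:
            best[hit.contig] = hit
    return best


def scaffold_order(hits):
    """Return contig names ordered by target name, then by start site."""
    chosen = best_hits(hits).values()
    ordered = sorted(chosen, key=lambda h: (h.target, h.t_start))
    return [h.contig for h in ordered]
```

With a real psl file this could be called as ''scaffold_order(parse_psl(open("contigs.psl")))'', where ''contigs.psl'' is a placeholder file name.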
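
The reads/base reasoning from contig-length.rdb can be written out as a small calculation. This is a sketch under the assumption that ~0.16 reads/base is the single-copy baseline, as the notes suggest; the function names are made up for illustration.

```python
def reads_per_base(n_reads, contig_length):
    """Reads/base for one contig: number of reads divided by contig length."""
    return n_reads / contig_length


def estimated_copies(rpb, single_copy_rpb=0.16):
    """Naive copy-number guess: ratio to the assumed single-copy baseline."""
    return rpb / single_copy_rpb


# The three values discussed in the notes.
for rpb in (0.16, 0.32, 0.44):
    ratio = estimated_copies(rpb)
    print(f"{rpb:.2f} reads/base -> ratio {ratio:.2f}, rounded guess {round(ratio)}")
```

The ratio for 0.44 reads/base is about 2.75, so naive rounding suggests three copies, yet the notes report that this contig occurred only twice. Likewise, one of the 0.32 contigs was short and simply within the normal deviation. The ratio is a hint to check against the assembly, not a count.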
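
One possible way to start on the closing question is to treat each contig as a node and each mate-pair link between two contigs as an edge, then check whether everything falls into one connected component. The sketch below assumes the links have already been extracted from the summary file as pairs of contig names; that input format and all names here are hypothetical, not the actual mapcolorspace output.

```python
from collections import deque


def build_graph(contigs, links):
    """Undirected graph: contigs are nodes, mate-pair links are edges."""
    graph = {name: set() for name in contigs}
    for a, b in links:
        if a == b:
            continue  # a pair mapping within one contig joins nothing
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def connected_components(graph):
    """Return the groups of contigs that are linked together, via BFS."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in sorted(graph[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


def is_single_graph(graph):
    """True when every contig can be reached from every other contig."""
    return len(connected_components(graph)) == 1
```

If more than one component comes back, the contigs in different components have no mate-pair evidence joining them. Enumerating all possible paths through a connected component is a separate and harder step that this sketch does not attempt.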