
April 21st, 2010

  • There will be a guest lecture on Friday, April 23rd, 2010. Please do not be late!
  • By Friday, April 23rd, 2010, all published sources that contain information about the banana slug should be added to the Banana Slug Biology page. Is this the right page?
    • Expect 40-50 different resources (papers, books, PhD theses).
    • The bibliography must be annotated, though annotations are not required this Friday.
    • As you run assembly tools, keep the assemblies/ wiki page updated!

Continuation of Newbler Assembler methods

  • By running Newbler using the settings described last Friday, Kevin obtained 43 contigs and ~2.4 megabases of DNA.
    • This result is not bad compared to most other assemblers.
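The two figures quoted above (contig count and total bases) can be recomputed from any assembler's contig FASTA file, which makes it easier to compare runs. The following is a minimal sketch, assuming a standard FASTA layout in which header lines begin with ">". The function name and the toy sequences are hypothetical and are not taken from the lecture.

```python
from io import StringIO


def contig_stats(fasta_lines):
    """Return (number of contigs, total bases) for FASTA-formatted lines."""
    n_contigs = 0
    total_bases = 0
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            n_contigs += 1  # each header line starts a new contig
        else:
            total_bases += len(line)  # sequence lines contribute bases
    return n_contigs, total_bases


# Toy input standing in for a real contig file (made-up sequences).
example = StringIO(">contig1\nACGTACGT\nACG\n>contig2\nGGCC\n")
print(contig_stats(example))
```

For a real run, the same function could be passed an open file handle for the assembler's contig FASTA output.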

Overview of directories

  • Newbler-clean
    • Not attempting to assemble data.
    • Used to remove contamination from H. pylori, which was sequenced on the same machine in the same run.
    • If you have a reference genome for your contaminants, you can try to clean them out computationally by mapping and removing matches.
    • Makefile Overview
      • Set up a new mapping
      • Maps reads from 454 Pog data to the contaminant genome
    • Interesting output: ReadStatus.txt
      • Contains the mapping status of each read (mapped, unmapped, partially mapped, too short)
      • Reads mapped to contaminants should be filtered out of the 454 data and should not be used for assembly.
    • Create a new SFF file (using sfffile utility) with unmapped data in file “no_Hyp.sff”
    • It may be beneficial to hang onto “too short” reads for use in assemblers which utilize shorter reads.
    • You can also use Megablast and NCBI's Taxonomy Report to identify contaminants.
  • Newbler-assembly2
    • Uses clean data (no H. pylori) from newbler-clean1
    • One fewer small contig; otherwise mostly the same
  • Newbler *
    • Changed expected coverage to 60x. This number is closer to the real coverage, based on better estimates of genome size.
    • Still reported 41 contigs (or maybe one fewer?)
    • Good news: all contigs map to the reference genome
  • Map-colorspace3 (Map-colorspace directories begin indexing at 3 rather than 1)
    • The scripts run in this directory were originally intended for finding inversions from mate-pair reads.
    • New features have been added since then.
    • Now looks for reads between contigs
    • Tries to orient contigs
  • Newbler-partial3
    • attempts to use only partially mapped/unmapped reads
    • Plan is to later map contigs from this assembly to the full contigs from before (extend edges?)
    • Four methods were tried for mapping partially mapped reads onto the contigs created by Newbler: Megablast, blastn, blat, and find dna differences.
      • Megablast and blastn showed very similar results
      • blat
        • output is a psl file
        • Nice results
        • Very sensitive but also very slow
        • Shows matches down to 14 bases in size
        • Can handle introns and splice sites
        • May be too sensitive for some purposes
        • There are some parameters you can tweak which will alter the output.
          • Kevin thought it would be best to keep blat's default parameters and filter output.
        • A script was written to filter output and find the best matches.
        • For a resequencing project, take all contigs, map with blat, sort by start site and build a scaffold according to the order of the start sites of each contig.
      • Find dna differences
        • Shows differences between contigs and a reference genome in a human-readable format
  • Newbler-assembly4
    • Adds partial3 contigs to full search as reads
    • Didn't seem to help much
    • The resulting output had fewer total bases and more contigs than the original assembly, so it is perhaps worse.
  • Newbler-assembly5
    • Utilized Sanger reads produced by David Bernick to join contigs
    • 45 Sanger reads total
      • Sanger reads were created by running PCR on conjectured contig join sites.
      • Two possible methods for guessing contig join sites
        • Use mate-pair data from the SOLiD run as evidence to create primers
        • David Bernick used known CRISPR repeats. Some contigs showed the start of a repeat, while others showed the end.
      • After adding Sanger reads, the final assembly went down to 31 contigs
      • In contig-length.rdb
        • Most contigs have ~0.16 reads/base
          • The reads/base metric was used because it is simpler to compute than coverage.
        • Some contigs had higher values: ~0.32 reads/base (three contigs) and ~0.44 reads/base
          • Could 0.32 reads/base indicate two occurrences of the contig in the genome?
          • Could 0.44 reads/base indicate three occurrences in the genome?
        • All 0.16 reads/base contigs occurred a single time in the genome
        • Two of the three 0.32 reads/base contigs appeared twice in the genome.
        • The third 0.32 reads/base contig was short, and its value was within the expected deviation in reads/base for a sequencing run
        • 0.44 reads/base occurred twice in the genome, not three times
    • Fewer contigs and more bases
  • Map-colorspace5
    • map-colorspace has many parameters.
      • You can list all the parameters by issuing the command “map-colorspace --help”
    • This command produces many output files (~2 per contig), which may prove problematic for assemblies that contain many contigs.
    • Set length parameters using a histogram of reads mapped to contigs
    • The merge_cross option merges all cross products into a single file rather than (number of contigs)^2 files
    • Scaffold FASTA and color qualities are thrown out in this method
    • If F3 or R3 maps, the program aggressively tries to map the paired read.
    • Output
      • Uniquely mapped: both the F3 and R3 (forward and reverse) reads map to the same contig
      • Multiply mapped: reads can map to multiple contigs
      • Wrong range, only F3 or R3 maps
    • Summary File
      • Maps F3 and R3 from the end of one contig to the beginning of another
      • Shows that contigs are near, not touching.
      • Short contigs may be skipped over by paired-end reads
    • QUESTION: Can we make a single connected graph using this data and all possible paths? ANSWER: Sometimes, if the data is particularly coherent, but it is not guaranteed.
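One way to explore the question above is to treat each contig as a node and each pair of contigs linked by mate-pair reads as an edge, then check how many connected components result. The following is a minimal sketch, assuming the links have already been extracted into simple (contig, contig) pairs. The function name, the input layout, and the example contig names are hypothetical and do not reflect the actual map-colorspace output format.

```python
from collections import defaultdict, deque


def connected_components(contigs, links):
    """Group contigs into components joined by mate-pair links.

    contigs: iterable of contig names
    links:   iterable of (contig_a, contig_b) pairs (assumed input layout)
    """
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    components = []
    for start in contigs:
        if start in seen:
            continue
        # Breadth-first search from each contig not yet visited.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(component))
    return components


# Made-up example: c4 has no links, so the graph is not fully connected.
parts = connected_components(["c1", "c2", "c3", "c4"], [("c1", "c2"), ("c2", "c3")])
print(len(parts) == 1, parts)
```

A single component means every contig can be reached from every other through some chain of links. It does not by itself give an unambiguous ordering or orientation of the contigs.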
lecture_notes/04-21-2010.txt · Last modified: 2015/09/02 18:17 by 104.144.27.91