Banana Slug Genomics

April 21st, 2010

Logistics-related notes

There will be a guest lecture on Friday, April 23rd, 2010. Please do not be late!
By Friday, April 23rd, 2010, all published sources that contain information pertaining to the banana slug should be added to the Banana Slug Biology page. Is this the right page?
- Expect 40-50 different resources (papers, books, PhD theses)
- The bibliography must be annotated, though annotations are not required this Friday.
- As you run assembly tools, keep the assemblies/ wiki page updated!

Continuation of Newbler Assembler methods

By running Newbler using the settings described last Friday, Kevin obtained 43 contigs and ~2.4 megabases of DNA.
- This measure is not bad compared to most other assemblers

Overview of directories

Newbler-clean
- Does not attempt to assemble data.
- Used to remove contamination from H. pylori, which was sequenced on the same machine in the same run.
- If you have a reference genome for your contaminants, you can try to clean them out computationally by mapping and removing matches.
- Makefile Overview
  - Set up a new mapping
  - Maps reads from 454 Pog data to the contaminant genome
- Interesting output: ReadStatus.txt
  - Contains mapping status of each read (Mapped, unmapped, partially mapped, too short)
  - Reads mapped to contaminants should be filtered out of the 454 data and should not be used for assembly.
- Create a new SFF file (using sfffile utility) with unmapped data in file “no_Hyp.sff”
- It may be beneficial to hang onto “too short” reads for use in assemblers which utilize shorter reads.
- You can also use Megablast and NCBI's Taxonomy Report to identify contaminants.
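
The ReadStatus.txt filtering step described above might be sketched as follows. This is a minimal sketch, not the class Makefile: it assumes a tab-separated file with one header line, the read accession in the first column, and the mapping status in the second. The status label `Unmapped`, the column layout, and the function name `select_reads` are assumptions for illustration, so check them against your own ReadStatus.txt.

```python
import csv
import io


def select_reads(status_lines, keep_statuses=("Unmapped",)):
    """Return accessions of reads whose status is in keep_statuses.

    Assumes tab-separated lines (accession, status, ...) after one header
    line. This layout is an assumption, not a documented format.
    """
    reader = csv.reader(status_lines, delimiter="\t")
    next(reader, None)  # skip the header line
    keep = {s.lower() for s in keep_statuses}
    return [
        row[0]
        for row in reader
        if len(row) >= 2 and row[1].strip().lower() in keep
    ]


# Hypothetical example data, not real output from the Pog run.
example = io.StringIO(
    "Accno\tRead Status\n"
    "read1\tFull\n"
    "read2\tUnmapped\n"
    "read3\tPartial\n"
    "read4\tTooShort\n"
)
accessions = select_reads(example)
print("\n".join(accessions))
```

The resulting accession list is the kind of input the sfffile utility could use to build the cleaned SFF file. Passing `keep_statuses=("Unmapped", "TooShort")` would also retain the too-short reads for assemblers that can use them.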
Newbler-assembly2
- Uses clean data (no H. pylori) from newbler-clean1
- One fewer small contig; otherwise mostly the same
Newbler *
- Changed expected coverage to 60x. This number is closer to the real coverage, based on better estimates of genome size.
- Still reported 41 contigs (or maybe one fewer?)
- Good news: all contigs map to the reference genome
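
As a rough illustration of where an expected-coverage figure comes from: coverage is approximately the total number of sequenced bases divided by the genome size. The sketch below uses made-up numbers chosen only so that the result is 60x; they are not the actual read totals from this run, and the function name is hypothetical.

```python
def expected_coverage(total_read_bases, genome_size):
    """Approximate coverage: sequenced bases divided by genome size."""
    return total_read_bases / genome_size


# Illustrative values only: a ~2.4 Mb genome and a hypothetical 144 Mb of reads.
coverage = expected_coverage(144_000_000, 2_400_000)
print(f"expected coverage: {coverage:.0f}x")
```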
Map-colorspace3 (Map-colorspace directories begin indexing at 3 rather than 1)
- The scripts run in this directory were originally intended for finding inversions from mate-pair reads.
- New features have been added since then.
- Now looks for reads between contigs
- Tries to orient contigs
Newbler-partial3
- attempts to use only partially mapped/unmapped reads
- Plan is to later map contigs from this assembly to the full contigs from before (extend edges?)
- Four methods were tried for mapping partially mapped reads onto the contigs created by Newbler: Megablast, blastn, blat, and find dna differences.
  - Megablast and blastn showed very similar results
  - blat
    - output is a psl file
    - Nice results
    - Very sensitive but also very slow
    - Shows matches down to 14 bases in size
    - Can handle introns and splice sites
    - May be too sensitive for some purposes
    - There are some parameters you can tweak which will alter the output.
      - Kevin thought it would be best to keep blat's default parameters and filter output.
    - A script was written to filter output and find the best matches.
    - For a resequencing project, take all contigs, map with blat, sort by start site and build a scaffold according to the order of the start sites of each contig.
  - Find dna differences
    - Shows differences between contigs and a reference genome in a human readable format
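
The filter-then-order idea from the blat notes above (keep the best match per contig, then sort contigs by start site) might look roughly like this. It is a sketch, not Kevin's script: it relies only on the standard 21-column PSL layout (column 1 = matches, 10 = query name, 14 = target name, 16 = target start, 17 = target end), and "best" is assumed here to mean the hit with the most matching bases. The function names and example rows are hypothetical.

```python
def parse_psl(lines):
    """Yield (matches, q_name, t_name, t_start, t_end) from PSL lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Data rows have 21 columns and begin with an integer match count;
        # header lines and blank lines are skipped.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        yield (int(fields[0]), fields[9], fields[13],
               int(fields[15]), int(fields[16]))


def order_contigs(lines):
    """Keep each query's highest-match hit, then sort by target and start."""
    best = {}
    for matches, q_name, t_name, t_start, t_end in parse_psl(lines):
        if q_name not in best or matches > best[q_name][0]:
            best[q_name] = (matches, t_name, t_start, t_end)
    return sorted(best, key=lambda q: (best[q][1], best[q][2]))


def make_row(matches, q_name, t_name, t_start, t_end):
    """Build a hypothetical PSL row; unused columns are filled with 0."""
    fields = ["0"] * 21
    fields[0], fields[9], fields[13] = str(matches), q_name, t_name
    fields[15], fields[16] = str(t_start), str(t_end)
    return "\t".join(fields)


rows = [
    "psLayout version 3",
    make_row(800, "contigA", "ref", 2000, 2800),
    make_row(20, "contigB", "ref", 5000, 5020),   # weak, spurious hit
    make_row(900, "contigB", "ref", 100, 1000),   # best hit for contigB
]
print(order_contigs(rows))
```

Filtering after the fact, rather than tightening blat's parameters, matches the approach described in the notes: blat runs with its defaults and the script decides which hits to trust.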
Newbler-assembly4
- Adds partial3 contigs to full search as reads
- Didn't seem to help much
- The resulting output had fewer total bases and more contigs than originally - perhaps worse.
Newbler-assembly5
- Utilized Sanger reads produced by David Bernick to join contigs
- 45 Sanger reads total
  - Sanger reads were created by running PCR on conjectured contig join sites.
  - Two possible methods for guessing contig join sites
    - Use mate-pair data from the SOLiD run as evidence to create primers
    - David Bernick used known CRISPR repeats. Some contigs showed the start of a repeat, while others showed the end.
  - After adding Sanger reads, the final assembly went down to 31 contigs
  - In contig-length.rdb
    - Most contigs have ~0.16 reads/base
      - The reads/base metric was used because it is simpler to compute than coverage.
    - Some contigs had higher reads/base: ~0.32 (3 contigs) and ~0.44
      - Could 0.32 reads/base indicate two occurrences of the contig in the genome?
      - Could a 0.44 reads/base contig occur three times in the genome?
    - All 0.16 reads/base contigs occurred a single time in the genome.
    - Two of the three 0.32 reads/base contigs appeared twice in the genome.
    - The third 0.32 reads/base contig was short, and its reads/base was within the expected deviation for a sequencing run.
    - The 0.44 reads/base contig occurred twice, not three times.
- Fewer contigs and more bases
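
The reads/base reasoning above can be sketched as a small calculation: divide each contig's read count by its length, then compare against the typical single-copy value (~0.16 in these notes). This is only an illustration with hypothetical function names; as the results above show, the ratio is a hint rather than a reliable copy count.

```python
def reads_per_base(num_reads, contig_length):
    """Number of reads assigned to a contig divided by its length in bases."""
    return num_reads / contig_length


def relative_depth(rpb, single_copy_rpb=0.16):
    """Depth relative to a typical single-copy contig (~0.16 in the notes)."""
    return rpb / single_copy_rpb


for rpb in (0.16, 0.32, 0.44):
    print(f"{rpb:.2f} reads/base -> {relative_depth(rpb):.2f}x single-copy depth")
```

A ratio near 2 is consistent with the 0.32 contigs that appeared twice. The 0.44 contig gives a ratio closer to 3 yet occurred only twice, and one short 0.32 contig was single-copy, so any candidate repeat still has to be checked against the genome. Converting reads/base to coverage would additionally require multiplying by the mean read length.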
Map-colorspace5
- map-colorspace has many parameters.
  - You can list all the parameters by issuing the command “map-colorspace --help”
- Lots of output files using this command (~2 per contig). May prove problematic with assemblies that contain many contigs
- Set length parameters using a histogram of reads mapped to contigs
- merge_cross option merges all cross products into a single file rather than (number of contigs)^2 files
- Scaffold FASTA and color qualities are thrown out in this method
- If F3 or R3 maps, the program aggressively tries to map the paired read.
- Output
  - Uniquely mapped: both the F3 and R3 (forward and reverse) reads map to the same contig
  - Multiply mapped: reads can map to multiple contigs
  - Wrong range, only F3 or R3 map
- Summary File
  - Maps F3 and R3 from the end of one contig to the beginning of another
  - Shows that contigs are near, not touching.
  - Short contigs may be skipped over by paired-end reads
- QUESTION: Can we make a single connected graph using this data and all possible paths? Sometimes, if the data is particularly coherent, but it is not guaranteed.
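
One way to check the question above for a given data set is to treat contigs as nodes and each contig-to-contig link from the summary file as an edge, then count connected components. The sketch below uses a small union-find structure. The contig names and links are hypothetical, and the real summary file would need its own parser.

```python
def count_components(contigs, links):
    """Count connected components among contigs joined by mate-pair links.

    Every contig named in links is assumed to appear in contigs.
    """
    parent = {c: c for c in contigs}

    def find(c):
        # Walk up to the root, shortening the path along the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in links:
        parent[find(a)] = find(b)
    return len({find(c) for c in contigs})


# Hypothetical contigs and links for illustration only.
contigs = ["c1", "c2", "c3", "c4"]
partial_links = [("c1", "c2"), ("c2", "c3")]
print(count_components(contigs, partial_links))                   # c4 is isolated
print(count_components(contigs, partial_links + [("c3", "c4")]))  # all joined
```

A result of 1 means the links form a single connected graph. It does not by itself give a unique ordering of the contigs, since multiply mapped reads can add conflicting edges.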
