Lecture Notes for May 28, 2010


  1. High level polishing of genome with Newbler
  2. Description of scripts in ~karplus/pluck/ dir



Map in separate dirs
New mapping instead of new assembly
-rst (repeat score threshold) applies when a read could map to multiple places: it decides when the placement is considered unique and when it is just a repeat you cannot handle
-rst 0 was used, which is not the default
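
To make the idea of a repeat score threshold concrete, here is a minimal sketch in Python. It assumes the threshold is compared against the gap between the best and second-best alignment scores of a read. The function name, the score values, and the exact comparison are hypothetical illustrations, not Newbler's actual implementation.

```python
def classify_read(scores, rst):
    """Classify a read as 'unmapped', 'unique', or 'repeat'.

    scores: alignment scores of every candidate placement (hypothetical values).
    rst:    repeat score threshold. Assumption: the best placement must beat
            the runner-up by more than rst for the read to count as unique.
    """
    if not scores:
        return "unmapped"
    ranked = sorted(scores, reverse=True)
    if len(ranked) == 1:
        return "unique"
    gap = ranked[0] - ranked[1]
    return "unique" if gap > rst else "repeat"


# Two equally good placements: treated as a repeat even with a threshold of 0.
print(classify_read([95, 95], rst=0))
# A small gap is enough when the threshold is 0 ...
print(classify_read([95, 90], rst=0))
# ... but not with a stricter (larger) threshold.
print(classify_read([95, 90], rst=12))
```

Under this reading, a threshold of 0 accepts any read whose best placement is strictly better than the alternatives, so fewer reads are discarded as repeats.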
Played around with different ways of mapping
Used in makefile for postmapping:

  • megablast, blastn: never look at it now
  • blat, blat-strict-match: do use and look at

Mapping subdirectory with a lot of stuff in it:
Look at: ./454NewblerMetrics

  • Tells how many reads and bases there were and how many reads managed to map
  • In this specific example, not many reads managed to map; the statistics are of the same sort as for the de novo assembly
  • Look at the bottom line for allContigMetrics first to see how the mapping performed
  • Mapping sometimes works well and sometimes does not work at all, depending on whether the organism has a high mutation rate


More recent mapping
Plasmid cleaned up but not the transposon
Typical transposon length is 2.5 kb
About 2,000 of the reads mapped to the transposon
The reported 4% read error means there is a 4% difference between the NCBI version of the transposon and the reads being mapped to it, not necessarily a 4% error rate in the sequencing
consensusAccuracy is 93%; it is usually higher (98%-99%), so it may be that the transposon is highly variable
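
The point about the 4% figure can be illustrated with a small Python sketch that computes the percent difference between an aligned read and a reference. The function name and the toy sequences are made up for illustration; the sketch assumes the two strings are already aligned and of equal length.

```python
def percent_difference(read, reference):
    """Percentage of aligned positions where read and reference disagree.

    Assumes both strings are already aligned and have equal length.
    """
    if len(read) != len(reference):
        raise ValueError("aligned sequences must have equal length")
    if not read:
        return 0.0
    mismatches = sum(1 for a, b in zip(read, reference) if a != b)
    return 100.0 * mismatches / len(read)


# Toy 25-base reference and a read that differs at one position.
reference = "ACGT" * 6 + "A"
read = reference[:11] + "A" + reference[12:]  # reference has 'T' at index 11

print(percent_difference(read, reference))  # 1 mismatch in 25 bases -> 4.0
```

The number by itself does not say whether the mismatch comes from a sequencing error or from a real difference between the sequenced organism and the reference.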
Tells you how each read was handled
Status of reads: Unmapped, Full, Partial, Repeat, Chimeric (across different sequences)
Some mapped partially to the transposon (e.g. at end of transposon and on chromosome)

  • If a read had mapped partially in the middle instead, that would have been an incorrect mapping
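
A quick way to get an overview of how reads were handled is to tally the statuses listed above. The Python sketch below assumes the per-read information has already been parsed into (read name, status) pairs; the function name and the example records are hypothetical, and the actual file layout is not described in these notes.

```python
from collections import Counter

# The five statuses named in the notes.
STATUSES = ("Unmapped", "Full", "Partial", "Repeat", "Chimeric")


def tally_statuses(records):
    """Return {status: (count, percent of all reads)} for the known statuses.

    records: iterable of (read_name, status) pairs (assumed, pre-parsed input).
    """
    counts = Counter(status for _, status in records)
    total = sum(counts.values())
    return {
        s: (counts[s], 100.0 * counts[s] / total if total else 0.0)
        for s in STATUSES
    }


# Hypothetical example records.
example = [
    ("read1", "Full"),
    ("read2", "Full"),
    ("read3", "Partial"),
    ("read4", "Unmapped"),
]
for status, (n, pct) in tally_statuses(example).items():
    print(f"{status:9s} {n:3d} {pct:5.1f}%")
```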


  • Actual completed mapping


  • Check to see the quality of the mapping


Second attempt at mapping
Similar error rate to the first attempt
Looks like things are mapping properly
Error rate is low because 454 provides cleaner data for Newbler to map

  • Mapped to the assembly, at least 30% match


  • Reads that did not map


Started over, removing the plasmids and transposon
Mapping contigs of v24 to v19 assembly
Worried that the previous assembly may have bugs in it (e.g. a path may be missed because a path from the old assembly is still there)
Did it without looking at the previous work that Jenny had done, to be confident of having independently arrived at the same conclusion
Went from 54 to 51 contigs
Did a more thorough job of cleaning out the plasmid in this assembly
One thing to notice is that there is a large variation in reads per base

  • Some contigs are really short, so a lot of reads overlap there
  • Should be reads per base + an amount to compensate
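
One possible reading of "an amount to compensate" is sketched below in Python: a read of length r can overlap a contig of length L from L + r - 1 start positions, so dividing the read count by that number instead of by L removes most of the inflation seen on short contigs. This interpretation, the function names, and the read length of 400 are assumptions for illustration, not something stated in the lecture.

```python
def reads_per_base(n_reads, contig_len):
    """Plain reads per base: inflated for short contigs."""
    return n_reads / contig_len


def compensated_reads_per_base(n_reads, contig_len, mean_read_len):
    """Reads per possible overlapping start position.

    Assumption: a read of length r overlaps a contig of length L from
    L + r - 1 start positions, so r - 1 is the amount added to compensate.
    """
    return n_reads / (contig_len + mean_read_len - 1)


# Hypothetical contigs with the same underlying read density.
for name, n_reads, length in [("short", 50, 100), ("long", 10040, 100000)]:
    plain = reads_per_base(n_reads, length)
    fixed = compensated_reads_per_base(n_reads, length, mean_read_len=400)
    print(f"{name:5s} plain={plain:.3f} compensated={fixed:.3f}")
```

In this toy case the plain figure differs about fivefold between the two contigs, while the compensated figure is nearly the same for both.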

If there are big spikes, suspect that they may be parasites

  • e.g. transposases, integrases, etc.
  • In this case, just repeat mappings
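
Spotting big spikes can be partly automated. The Python sketch below flags contigs whose reads per base is far above the median; the function name, the factor of 3, and the example values are arbitrary assumptions for illustration, and any flagged contig still needs to be inspected by hand.

```python
from statistics import median


def find_spikes(coverage, factor=3.0):
    """Return names of contigs whose reads per base exceeds factor * median.

    coverage: dict mapping contig name -> reads per base (assumed input).
    factor:   arbitrary cutoff chosen for this sketch.
    """
    if not coverage:
        return []
    typical = median(coverage.values())
    return sorted(name for name, c in coverage.items() if c > factor * typical)


# Hypothetical reads-per-base values.
example = {
    "contig01": 20.0,
    "contig02": 22.0,
    "contig03": 19.0,
    "contig04": 95.0,
    "contig05": 21.0,
}
print(find_spikes(example))  # contig04 is well above 3x the median of 21.0
```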


trim0 = all bases in a read have to match
trimX = all but X bases have to match
trim9 = good compromise this time between too many matches in mapping vs. too few
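
The arithmetic behind the trim settings can be written as a small Python sketch. It assumes 24-base reads, which is consistent with the 24-mer to 15-mer figure noted for trim9; the function names are hypothetical and do not come from the scripts in the pluck directory.

```python
def required_matches(read_len, trim):
    """Number of bases that must match under trimX: read_len - X."""
    if not 0 <= trim <= read_len:
        raise ValueError("trim must be between 0 and the read length")
    return read_len - trim


def passes_trim(read_len, matched_bases, trim):
    """True if enough bases matched for the read to count as mapped."""
    return matched_bases >= required_matches(read_len, trim)


print(required_matches(24, 0))   # trim0: all 24 bases must match
print(required_matches(24, 9))   # trim9: 24 - 9 = 15 bases must match
print(passes_trim(24, 18, 9))    # 18 matched bases is enough under trim9
print(passes_trim(24, 18, 0))    # but not under trim0
```

A smaller X demands longer exact matches and yields fewer mappings; a larger X allows more mappings, including spurious ones.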

  • Trim to 15-mers from 24-mers
  • ./Makefile
  • --min_SNP = report a SNP if it is found more than X times
  • --min_length = min length for
  • --peak_length = peak length for
  • --merge_cross = put in trim9_cross.rdb if
  • --supress_reads = suppress reads that have
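
The count threshold described for min_SNP can be sketched as follows in Python. This only illustrates the idea of reporting a variant once it has been seen often enough; the data layout and function name are hypothetical, and the sketch assumes "above X times" means strictly more than X.

```python
def report_snps(observations, min_snp):
    """Return the (position, base) variants seen more than min_snp times.

    observations: dict mapping (position, base) -> number of reads showing
                  that variant (assumed input layout).
    min_snp:      count threshold; strictly-greater comparison is an assumption.
    """
    return sorted(key for key, n in observations.items() if n > min_snp)


# Hypothetical variant counts.
example = {
    (1042, "T"): 7,
    (2210, "G"): 1,
    (3305, "A"): 3,
}
print(report_snps(example, min_snp=2))
```

A variant seen only once or twice is more likely to be a read error than a real SNP, which is why a threshold of this kind is useful.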

Look at your data, don't just accept the program output. See if it matches your expectations and investigate anything that looks suspicious.



  • File that was semi-automatically generated that joins the contigs together
lecture_notes/05-28-2010.txt · Last modified: 2010/05/28 22:36 by cbrumbau