Using Mate-Paired data to fix Newbler assembly of H.pylori


Kevin assembled the H.pylori 454 data using Newbler's mapping assembly. To fix the Newbler assembly, he used SOLiD mate-paired data. The work is on version 16 of the H.pylori assembly, which still has bugs. Kevin also has software that checks for inversions.

Newbler output

  • Newbler takes the reads and outputs a number of contigs with small gaps between them. It breaks the assembly into separate contigs wherever a repeat is encountered.
  • The reads not used in the assembly are those Newbler considers not fully mapped, partially mapped, chimeric, or repeats.
  • All these unused reads are in the sff file.
  • De novo assembly is done on these unused reads and some chunks of these reads are mapped as repeats.
  • These repeats can occur at several places in the chromosome and can be seen as contig → repeat → contig → repeat ...
  • There might be two copies of the same repeat occurring in tandem. These can be viewed as a loop in the De Bruijn graph.
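
The tandem-repeat remark can be illustrated with a toy example. This is only a sketch, not what Newbler does internally: the sequences, the value of k, and the function names de_bruijn_edges and has_cycle are all made up for illustration. It builds a De Bruijn graph from one sequence and checks whether the graph contains a loop.

```python
from collections import defaultdict

def de_bruijn_edges(seq, k):
    """Build a De Bruijn graph: each k-mer links its (k-1)-mer prefix to its suffix."""
    graph = defaultdict(set)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        graph[kmer[:-1]].add(kmer[1:])
    return graph

def has_cycle(graph):
    """Depth-first search that reports whether the directed graph has any loop."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    colour = defaultdict(int)

    def visit(node):
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            if colour[nxt] == GREY:       # back edge: we returned to the current path
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(graph))

# Hypothetical toy sequences: the repeat unit GGCAT appears once, then twice in tandem.
single = "TTAC" + "GGCAT" + "CCTA"
tandem = "TTAC" + "GGCAT" + "GGCAT" + "CCTA"

print(has_cycle(de_bruijn_edges(single, 4)))  # False
print(has_cycle(de_bruijn_edges(tandem, 4)))  # True
```

With one copy of the repeat the graph is a simple path; with two copies in tandem the path re-enters the node GGC, which is the loop mentioned above.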

SOLiD mate-paired data

  • Mapped the mate-pair data to the 454 data to see where the anomalies are.
  • trim9 is used for mapping.
  • trim9-good-length.hist has the length distribution of the mate-pair reads.
  • Good reads are assumed to have lengths between 1500 and 3500, as outside that interval the probability of noise is high.
  • Mate-paired reads were clustered into good reads and not-good reads.
  • Not-good reads are further classified into 3 bins: between chromosome and plasmid, bad reads, and within plasmids.
  • The interesting good reads are those within the chromosome.
  • There might be deletions or inversions of reads.
  • Suppose the same mate-paired reads R and F (R → F on the forward strand, F → R on the reverse strand) map to multiple places: adjacent to each other at one location, but with a huge gap between them at another. Then the adjacent copy may really belong to the region with the huge gap, and may have been moved or copied from there to the other place.
  • The deletions are listed in a file called 'del'.
  • There might be inversions of reads, where R goes in one direction and F goes in the opposite direction. These inversions might be due to mis-assembly or to inversions in biological populations.
  • The inversions are listed in a file called 'inv'.
  • The histogram of trimp9-chrom-v16_center100 (in bins of 100) gives the distribution of the mate-paired data.
  • Looking at the largest values is useful, as interesting things might be happening there.
  • Spikes are interesting to investigate. Zoom in to figure out the spikes and extract the reads that fell in that region.
  • Showed an example of a 1000 base pair insert in the middle that needs to be removed. The reads were still classified as good reads because they were within the < 3500 interval.
  • Gaps in the histogram represent the repeat regions.
  • How to find where the 1000 base insert is? Look at the flanking regions of these repeats obtained from mapping.
  • The histogram of the 720k region shows two peaks, drawn as two dots. Zooming in on the regions around the peaks revealed gaps that could hold the 1000 base insert.
  • Rechecked this result by looking at the unused data.
  • 3 more regions were found to have this 1000 base insert.
  • Overall, 7 peaks were observed: 2 are homologous ends, 4 have the 1000 base insert, and 1 still needs to be figured out.
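
The classification and histogram steps above can be sketched roughly as follows. This is not Kevin's actual software. The tuple layout (pos_r, strand_r, pos_f, strand_f), the category names, and the use of the pair midpoint for binning are assumptions made for illustration; only the 1500-3500 interval and the bin width of 100 come from the notes.

```python
from collections import Counter

GOOD_MIN, GOOD_MAX = 1500, 3500   # "good" length interval from the notes
BIN_WIDTH = 100                   # histogram bin width from the notes

def classify_pair(pos_r, strand_r, pos_f, strand_f):
    """Label one mapped mate pair (hypothetical record layout)."""
    if strand_r != strand_f:
        # R and F point in opposite directions: possible inversion.
        return "inversion_candidate"
    length = abs(pos_f - pos_r)
    if GOOD_MIN <= length <= GOOD_MAX:
        return "good"
    # Same strand but unexpected separation: possible deletion or insert.
    return "length_outlier"

def center_histogram(pairs):
    """Count pair midpoints per bin; keys are bin start coordinates (assumed binning)."""
    counts = Counter()
    for pos_r, _strand_r, pos_f, _strand_f in pairs:
        center = (pos_r + pos_f) // 2
        counts[(center // BIN_WIDTH) * BIN_WIDTH] += 1
    return counts

# Made-up example pairs, not real H.pylori data.
pairs = [
    (1000, "+", 3000, "+"),    # length 2000, same strand
    (1050, "+", 3010, "+"),    # length 1960, same strand
    (5000, "+", 5400, "-"),    # strands disagree
    (8000, "-", 20000, "-"),   # length 12000, same strand
]

labels = [classify_pair(*p) for p in pairs]
good = [p for p, label in zip(pairs, labels) if label == "good"]

print(labels)
# ['good', 'good', 'inversion_candidate', 'length_outlier']
print(center_histogram(good).most_common(1))
# [(2000, 2)]
```

Bins with unusually high counts from most_common would be the spikes to zoom in on. Note that a pair spanning an extra 1000 base insert can still fall inside the good interval, as in the example from the notes, so length alone does not flag it.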

Things to be done

  • H.pylori has an important region called the pathogenicity island: -7 → -42 → -7. This region is highly polymorphic.
  • PCR needs to be done on this region.
  • First take out the polymorphisms caused by mis-assembly, then think about the biological polymorphisms.
  • This region could be a misplaced piece and needs to be moved.

Volunteers for next week's presentations

  1. John Kim
  2. Shyamini Vasili
  3. Jenny Draper
  4. Jeff Long
  5. Thomas
  6. Michael Cusack

There is no class on Friday, May 14th, due to a time conflict with the Graduate Research Symposium (2-5pm).

