Mapped the mate-pair data to the 454 data to see where the anomalies are.
trim9 is used for mapping.
trim9-good-length.hist has the length distribution of the mate-pair reads.
Good reads are assumed to be between 1500 and 3500; beyond that interval the probability of noise is high.
Classified the mate-paired reads into good reads and not-good reads.
Not-good reads are further classified into 3 bins: between chromosome and plasmid, bad reads, and within plasmids.
The interesting good reads are the ones within the chromosome.
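A minimal sketch of this classification, under stated assumptions. The 1500 to 3500 window is from the notes; the MatePair fields, the "plasmid" name prefix, and the rule that "bad" is whatever is left over are hypothetical and not taken from the actual pipeline.

```python
from dataclasses import dataclass

# Separation window for "good" pairs, taken from the notes.
GOOD_MIN, GOOD_MAX = 1500, 3500


@dataclass
class MatePair:
    # All field names are hypothetical, chosen for illustration only.
    name: str
    replicon_r: str  # sequence the R mate mapped to
    pos_r: int       # mapped position of the R mate
    replicon_f: str  # sequence the F mate mapped to
    pos_f: int       # mapped position of the F mate


def is_plasmid(replicon: str) -> bool:
    # Assumption: plasmid sequences are named with a "plasmid" prefix.
    return replicon.startswith("plasmid")


def classify(pair: MatePair) -> str:
    """Return 'good' or one of the three not-good bins."""
    same_replicon = pair.replicon_r == pair.replicon_f
    separation = abs(pair.pos_f - pair.pos_r)
    if same_replicon and GOOD_MIN <= separation <= GOOD_MAX:
        return "good"
    r_plasmid = is_plasmid(pair.replicon_r)
    f_plasmid = is_plasmid(pair.replicon_f)
    if r_plasmid != f_plasmid:
        return "chromosome-plasmid"
    if r_plasmid and f_plasmid:
        return "within-plasmids"
    # Assumption: everything else goes into the "bad reads" bin.
    return "bad"
```

For example, classify(MatePair("a", "chrom", 1000, "chrom", 3000)) falls in the good bin, since both mates are on the same sequence and 2000 apart.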
There might be deletions or inversions of reads.
If the same mate-paired reads R, F (where R → F on the forward strand and F → R on the reverse strand) occur in multiple places, adjacent to each other at one location and with a huge gap between them at the other, then the adjacent copy is probably part of the reads with the huge gap between them, and was moved or copied from that place to some other place.
The deletions are listed in a file called 'del'.
There might be inversions of reads, where R goes in one direction and F goes in the opposite direction. These inversions might be due to mis-assembly or to inversions in biological populations.
The inversions are listed in a file called 'inv'.
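The two checks above could be sketched roughly as follows. This is only an illustration: the function names, the "+"/"-" strand encoding, and the adjacent_max and gap_min thresholds are assumptions, since the notes do not say how large a "huge gap" is or how the 'del' and 'inv' files were actually produced.

```python
def expected_orientation(strand_r, pos_r, strand_f, pos_f):
    """True when the pair follows R -> F on '+' or F -> R on '-'."""
    if strand_r != strand_f:
        return False
    return pos_r < pos_f if strand_r == "+" else pos_f < pos_r


def is_inversion_candidate(strand_r, strand_f):
    """R and F point in opposite directions."""
    return strand_r != strand_f


def is_deletion_candidate(separations, adjacent_max, gap_min):
    """separations: R-to-F distance at each place the pair mapped.

    Flags a pair that is adjacent at one location and far apart at another.
    Both thresholds are caller-supplied assumptions; adjacent_max should be
    smaller than gap_min so one placement cannot satisfy both tests.
    """
    has_adjacent = any(s <= adjacent_max for s in separations)
    has_gap = any(s >= gap_min for s in separations)
    return has_adjacent and has_gap
```

With made-up numbers, is_deletion_candidate([2500, 40000], adjacent_max=3500, gap_min=10000) flags the pair, while a pair mapped only once at separation 2500 is not flagged.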
Histogram of trimp9-chrom-v16_center100 (in bins of 100) gives the distribution of the mate-paired data.
Looking at the largest values is useful, as interesting things might be happening there.
Spikes are interesting to investigate. Zoom in to figure out the spikes and extract the reads that fell in that region.
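A rough sketch of the binning and spike extraction, assuming each read is reduced to a single position. The bin size of 100 is from the notes; the spike rule (count above a multiple of the median bin count) and all function names are assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def bin_counts(positions, bin_size=100):
    """Count positions per bin, keyed by the bin's start coordinate."""
    return Counter((p // bin_size) * bin_size for p in positions)


def spike_bins(counts, factor=5.0):
    """Bins whose count exceeds factor times the median non-empty bin count.

    The factor is an arbitrary illustrative threshold.
    """
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(b for b, n in counts.items() if n > factor * typical)


def reads_in_region(reads, start, end):
    """Names of reads whose position falls in [start, end).

    reads is an iterable of (name, position) tuples.
    """
    return [name for name, pos in reads if start <= pos < end]
```

Once a spike bin is found, reads_in_region can be called with that bin's start and start + 100 to pull out the reads for a closer look.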
Showed an example where there is a 1000 base pair insert in the middle that needs to be removed. Although these reads are bad because of the insert, they were classified as good reads since they fell within the < 3500 interval.
Gaps in the histogram represent the repeat regions.
How to find where the 1000 base insert is? By looking at the flanking regions of these repeats obtained from the mapping.
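One way to locate such gaps and their flanks, sketched under assumptions: counts is a mapping from bin start to read count (bins of 100, as in the histogram), start is aligned to a bin boundary, and the flank width is chosen by the user. None of these names come from the original analysis.

```python
def empty_runs(counts, start, end, bin_size=100):
    """Return (gap_start, gap_end) for each maximal run of empty bins in [start, end)."""
    runs = []
    run_start = None
    for b in range(start, end, bin_size):
        if counts.get(b, 0) == 0:
            if run_start is None:
                run_start = b  # a new gap opens at this bin
        elif run_start is not None:
            runs.append((run_start, b))  # gap closes at the first non-empty bin
            run_start = None
    if run_start is not None:
        runs.append((run_start, end))  # gap runs to the end of the window
    return runs


def flanks(gap, width):
    """Left and right flanking intervals of a gap, each `width` long."""
    gap_start, gap_end = gap
    left = (max(0, gap_start - width), gap_start)
    right = (gap_end, gap_end + width)
    return left, right
```

The reads mapped in the two flanking intervals are the ones to inspect when checking whether a gap hides the insert.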
Histogram of the 720k region shows two peaks, shown as two dots. Zoomed in to look at the regions where the peaks were. Examining these regions showed gaps that could contain the 1000 base insert.
Rechecked this result by looking at the unused data.
There were 3 more regions found to have this 1000 base insert.
Overall 7 peaks were observed: 2 are homologous ends, 4 have the 1000 base insert, and 1 still needs to be figured out.