archive:bioinformatic_tools:bwa

Differences

This shows you the differences between two versions of the page.

archive:bioinformatic_tools:bwa [2011/05/13 03:01]
svohr created
archive:bioinformatic_tools:bwa [2011/05/20 18:53]
svohr
Line 14: Line 14:
  
 There are two options for the algorithm. The default option, ''is'', is relatively fast and works on genomes smaller than 2GB. The other algorithm, ''bwtsw'', is slower and less accurate but works on longer reads and works with larger databases. There are two options for the algorithm. The default option, ''is'', is relatively fast and works on genomes smaller than 2GB. The other algorithm, ''bwtsw'', is slower and less accurate but works on longer reads and works with larger databases.
 +
  
 Next, the reads are aligned to the reference using the ''​aln''​ command. Next, the reads are aligned to the reference using the ''​aln''​ command.
Line 28: Line 29:
 </​code>​ </​code>​
  
 +===== Quirks =====
 +The SAM-formatted alignments include a column that the BWA manual labels "inferred insert length", but the SAM specification describes it as the "template length": the distance from the leftmost mapped base to the rightmost mapped base. The second description seems to
 +match the output of BWA. However, some template lengths do not appear to be calculated correctly.
 +
 +<​code>​
 +bc07_1.fastq:​
 +@HWUSI-EAS1722:​4:​66:​6286:​18215#​CAGATC/​1
 +AGCAGTCGTCGTGGTATGCCTGGATGTTACAGCAGTCGTCGTGGTATGACTGGATGTTACAGCAGTCGTCGTGGTATGACTGGATGTTACAGCAGTCGTCGTGGTATGACTGGAT
 +
 +bc07_2.fastq:​
 +@HWUSI-EAS1722:​4:​66:​6286:​18215#​CAGATC/​2
 +CACGACGACTGCTGTAACATCCAGGCATACCACGACGACTGCTGTAACATCCAGGCATACCACGACGACAGCTATAACATACACTCATACCACGA
 +</​code>​
 +
 +For example, these two reads make up a pair that overlaps.
 +
 +<​code>​
 +...ACATCCAGTCATACCACGACGACTGCTGTAACATCCAGGCATACCACGACGACTGCT
 +                 ​CACGACGACTGCTGTAACATCCAGGCATACCACGACGACTGCTGTAACATCCAGGCATACCACGACGACAGCTATAACATACACTCATACCACGA
 +</​code>​
 +
 +Instead of the total template length, BWA reports the length of the overlap (43 bases here).
 +<​code>​
 +HWUSI-EAS1722:​4:​66:​6286:​18215#​CAGATC 81 GAZ7HUX03HIJAL 272 23 115M = 344 -43
 +HWUSI-EAS1722:​4:​66:​6286:​18215#​CAGATC 161 GAZ7HUX03HIJAL 344 25 95M = 272 43
 +</​code>​
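The two numbers can be checked by hand from the POS and CIGAR columns of the records above. The following is a minimal sketch, not BWA's own calculation: the helper names ''ref_length'' and ''pair_lengths'' are hypothetical, and it assumes both reads map to the same reference.

```python
import re

def ref_length(cigar):
    """Number of reference bases consumed by a CIGAR string (ops M, D, N, =, X)."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MDN=X")

def pair_lengths(pos1, cigar1, pos2, cigar2):
    """Return (outer_span, overlap) for two reads on the same reference.

    Positions are 1-based leftmost mapped bases, as in the SAM POS column.
    """
    end1 = pos1 + ref_length(cigar1) - 1  # rightmost mapped base of read 1
    end2 = pos2 + ref_length(cigar2) - 1  # rightmost mapped base of read 2
    outer_span = max(end1, end2) - min(pos1, pos2) + 1
    overlap = max(0, min(end1, end2) - max(pos1, pos2) + 1)
    return outer_span, overlap

# POS and CIGAR values copied from the two SAM records above.
outer_span, overlap = pair_lengths(272, "115M", 344, "95M")
print(outer_span, overlap)  # 167 43
```

Under these assumptions the leftmost-to-rightmost span is 167 bases, while the reported value of 43 matches the overlap.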
 +
 +This explains the incorrect short lengths found in our histograms. It does not appear to affect pairs that do not overlap, and most of the overlapping pairs should be merged using SeqPrep anyway.
 +
 +
 +====== Determining Paired-End Insert Size ======
 +BWA was used to estimate the distribution of insert sizes in the Illumina runs for banana slug. The 454 reads were used as the reference and the Illumina reads were mapped onto them. The distribution of the insert lengths can be inferred from the pairs that map onto the same 454 read. This is possible because our insert sizes are smaller than the size of the 454 reads.
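As a rough illustration of this approach, the sketch below tallies the TLEN column (column 9) for pairs whose mate maps to the same reference, which SAM marks with ''='' in the RNEXT column. It is not the script used for the plots on this page. The function names are hypothetical and the example records are made up, not taken from our runs.

```python
from collections import Counter

def template_length_counts(sam_lines):
    """Count positive TLEN values for pairs mapped to the same reference."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):      # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:          # not a complete alignment record
            continue
        rnext, tlen = fields[6], int(fields[8])
        # Each pair has one positive and one negative TLEN; count it once.
        if rnext == "=" and tlen > 0:
            counts[tlen] += 1
    return counts

def mean_length(counts):
    """Mean of the tallied lengths, weighted by how often each was seen."""
    total = sum(counts.values())
    return sum(length * n for length, n in counts.items()) / total

# Made-up records for illustration only.
example = [
    "@SQ\tSN:read454_1\tLN:500",
    "pairA\t99\tread454_1\t10\t30\t100M\t=\t150\t240\t*\t*",
    "pairA\t147\tread454_1\t150\t30\t100M\t=\t10\t-240\t*\t*",
    "pairB\t65\tread454_1\t10\t30\t100M\tread454_2\t5\t0\t*\t*",
]
counts = template_length_counts(example)
print(counts, mean_length(counts))
```

Pairs whose ends land on different 454 reads are skipped, since no length can be inferred for them.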
 +
 +Here are the frequencies of each inferred insert length in the SAM file from the paired-end alignments for Illumina run 2. The mean inferred insert size is 258 bases for the barcode 7 reads and 138 bases for the barcode 8 reads. Note that this differs considerably from the estimates of 411 bp for barcode 7 and 372 bp for barcode 8 on the [[computer_resources:data|computer_resources:data]] page, which were based on bioanalyzer results for the DNA library. What explains the discrepancy? Is it different definitions of the length (including neither, one, or both reads in the length)? Why does the barcode 8 graph cut off so abruptly? (Overlapping reads?) If the "inferred insert length" here is the gap between the reads, then we need to add 200 for the read lengths to get the full DNA length, giving 458 and 338, which are fairly close to the numbers reported by the bioanalyzer, but that would not explain the cutoff at 100. If the inferred insert length here is the difference between the start positions of the two reads on the same strand, we would have to add 100 for the read length, getting 358 and 238, which seem a bit low.
 +
 +{{:​bioinformatic_tools:​run2_insert_size_histogram.png|}}
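The candidate definitions discussed above can be compared with some quick arithmetic. This sketch assumes 100-base reads, as the paragraph above does, and uses only the means and bioanalyzer estimates quoted there. The function name ''candidate_fragment_lengths'' is hypothetical.

```python
READ_LENGTH = 100  # assumption: read length used in the reasoning above

mean_reported = {"bc07": 258, "bc08": 138}  # mean values from the SAM files
bioanalyzer = {"bc07": 411, "bc08": 372}    # library estimates, in bp

def candidate_fragment_lengths(value, read_length=READ_LENGTH):
    """Full fragment length implied by each possible meaning of the value."""
    return {
        "gap between reads": value + 2 * read_length,
        "start-to-start distance": value + read_length,
        "template length": value,
    }

for barcode, value in mean_reported.items():
    for meaning, length in candidate_fragment_lengths(value).items():
        diff = length - bioanalyzer[barcode]
        print(f"{barcode} {meaning}: {length} bp ({diff:+d} vs bioanalyzer)")
```

None of the three readings matches the bioanalyzer figures for both barcodes, which is why the definition needs to be checked against the SAM records themselves.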
 +
 +So what exactly is the "inferred insert length" being plotted here? After looking at the SAM format specification and some of the entries in the SAM files, it appears we are actually looking at the "template length": the combined length of both end reads plus the insert between them.
 +
 +===== After SeqPrep =====
 +We ran [[bioinformatic_tools:seqprep|SeqPrep]] on run 2 to remove the Illumina adapter sequences and merge pairs that overlapped, then mapped the remaining pairs to the 454 reference. SeqPrep removed most of the barcode 8 pairs that had mapped previously, but left most of the previously mapped barcode 7 pairs unchanged.
 +
 +{{:​bioinformatic_tools:​run2_seqprep_template_size_histogram.png|}}
 +
 +These histograms show the mapped lengths for the paired-end templates and the lengths of merged reads from SeqPrep, along with the 454 read length distribution for comparison. In each of these, we can see the distinct range for the SeqPrep merged reads and the split between merged and unmerged pairs. Lengths less than 90 may be incorrect. The higher frequency of these in run 1 can be explained by its higher coverage.
 +
 +In the merged lengths for both run 1 and run 2 barcode 8 there is a gap of 10 lengths (66-75 for run 1, 105-114 for run 2 bc08) where no reads were observed. This may be an artifact of SeqPrep and the read lengths.
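Gaps like these can be located automatically from a table of length counts. The sketch below is one way to do it. The function name ''missing_runs'' is hypothetical, and the counts are toy values chosen to mimic the run 1 gap, not our data.

```python
def missing_runs(counts, min_run=2):
    """Return (first, last) ranges of consecutive lengths with no reads.

    Only gaps between the smallest and largest observed lengths are reported.
    """
    observed = sorted(length for length, n in counts.items() if n > 0)
    runs = []
    for prev, nxt in zip(observed, observed[1:]):
        if nxt - prev - 1 >= min_run:  # at least min_run lengths are missing
            runs.append((prev + 1, nxt - 1))
    return runs

# Toy counts: lengths 60-65 and 76-79 observed, nothing in between.
toy_counts = {length: 5 for length in list(range(60, 66)) + list(range(76, 80))}
print(missing_runs(toy_counts))  # [(66, 75)]
```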
 +
 +{{:​bioinformatic_tools:​run1_seqprep_histogram.png|}}
 +
 +{{:​bioinformatic_tools:​run2_bc07_seqprep_histogram.png|}}
  
 +{{:​bioinformatic_tools:​run2_bc08_seqprep_histogram.png|}}
archive/bioinformatic_tools/bwa.txt · Last modified: 2015/09/04 09:06 by 68.180.228.52