**Distributed construction of an FM index from multiple input files**

If your data set consists of multiple files, you can construct the FM-index for each file separately and then merge the indices to obtain an index of the entire data set. This requires much less memory than constructing an index from a single file containing the entire data set. For example, suppose your data consists of four files: s_1_1.fastq, s_1_2.fastq, s_2_1.fastq and s_2_2.fastq.

We begin by constructing an index of each file individually:

sga index s_1_1.fastq
sga index s_1_2.fastq
sga index s_2_1.fastq
sga index s_2_2.fastq

Then we merge the indices together in pairs until a single index remains:

sga merge -p merged1 s_1_1.fastq s_1_2.fastq
sga merge -p merged2 s_2_1.fastq s_2_2.fastq
sga merge -p final merged1.fa merged2.fa
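
The same index-then-merge pattern extends to any number of input files. The following is a minimal sketch, not part of SGA itself: a hypothetical helper function, plan_merges, that only prints the commands and never runs them. It assumes, following the merged1.fa and merged2.fa names above, that `sga merge -p PREFIX` writes its merged reads to PREFIX.fa. The roundN_M prefixes are invented for illustration, and merging a leftover .fastq file with a .fa file (which happens when the file count is odd) is an assumption worth checking against your SGA version.

```shell
#!/bin/sh
# Print (do not run) the sga commands for a pairwise merge of N read files.
# plan_merges and the roundN_M prefixes are hypothetical, for illustration only.
# Variables are global because POSIX sh has no "local".
plan_merges() {
    # Step 1: index every input file on its own.
    for f in "$@"; do
        echo "sga index $f"
    done

    # Step 2: merge neighbouring files in pairs until one index remains.
    round=1
    while [ "$#" -gt 1 ]; do
        n=$#          # number of files at the start of this round
        count=$#      # files from this round not yet handled
        pair=0
        while [ "$count" -gt 1 ]; do
            pair=$((pair + 1))
            if [ "$n" -eq 2 ]; then
                prefix=final                    # the last merge of all
            else
                prefix="round${round}_${pair}"
            fi
            echo "sga merge -p $prefix $1 $2"
            shift 2
            # Assumption: the merged reads are written to PREFIX.fa.
            set -- "$@" "$prefix.fa"
            count=$((count - 2))
        done
        # With an odd count, the unpaired file moves to the next round unmerged.
        if [ "$count" -eq 1 ]; then
            last=$1
            shift
            set -- "$@" "$last"
        fi
        round=$((round + 1))
    done
}
```

Calling `plan_merges s_1_1.fastq s_1_2.fastq s_2_1.fastq s_2_2.fastq` prints the four index commands followed by three merge commands, the last of which uses the prefix final. The merges within one round do not depend on each other, so they can be run on separate machines. File names containing spaces are not handled.
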

The final index can then be used in other steps of the pipeline, for instance to error-correct the original sequence files:

sga correct -p final s_1_1.fastq
sga correct -p final s_1_2.fastq
sga correct -p final s_2_1.fastq
sga correct -p final s_2_2.fastq
contributors/team_3/merging_indexes.txt · Last modified: 2015/07/28 06:01 by ceisenhart