#! /bin/bash -x

# Example assembly of 100bp C. elegans data set. The only argument
# this script takes is the overlap length used for the final contig assembly.

# We assume the data has been downloaded from the SRA and converted to FASTQ files.
# Set IN1 and IN2 to the paths to the data on your filesystem.

# Parameters

# Overlap parameter used for the final assembly. This is the only argument
# to the script

# The number of threads to use

# To save memory, we index $D reads at a time then merge the indices together

# Correction k-mer value

# The minimum k-mer coverage for the filter step. Each 27-mer
# in the reads must be seen at least this many times

# Overlap parameter used for FM-merge. This value must be no greater than the minimum
# overlap value you wish to try for the assembly step.

# Parameter for the small repeat resolution algorithm

# The number of pairs required to link two contigs into a scaffold

# The minimum length of contigs to include in a scaffold

# Distance estimate tolerance when resolving scaffold sequences

# Turn off collapsing bubbles around indels
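
# Sketch (assumption): the comments above name the script's parameters, but no
# assignments appear in this listing. The block below is one plausible way to
# define them. Every value is an illustrative placeholder rather than the
# original setting, except CK=41, which matches the 41-mer correction step.
# The input file names and SGA_BIN=sga (sga found on the PATH) are also
# assumptions.

# Paired-end input files (hypothetical names based on the SRR065390 accession)
IN1=SRR065390_1.fastq
IN2=SRR065390_2.fastq

# Name or path of the sga executable
SGA_BIN=sga

# Overlap length for the final assembly, taken from the command line
OL=$1

CPU=8                  # number of threads
D=4000000              # reads indexed per batch
CK=41                  # correction k-mer value
COV_FILTER=2           # minimum k-mer coverage for the filter step
MOL=55                 # FM-merge overlap; keep it <= the smallest OL you try
R=10                   # small repeat resolution parameter
MIN_PAIRS=5            # pairs required to link two contigs
MIN_LENGTH=200         # minimum contig length for scaffolding
SCAFFOLD_TOLERANCE=1   # distance estimate tolerance
MAX_GAP_DIFF=0         # 0 is assumed to disable indel bubble collapsing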

# First, preprocess the data to remove ambiguous basecalls
$SGA_BIN preprocess --pe-mode 1 -o SRR065390.fastq $IN1 $IN2

# Error correction
# Build the index that will be used for error correction
# As the error corrector does not require the reverse BWT, suppress
# construction of the reversed index
$SGA_BIN index -a ropebwt -t $CPU --no-reverse SRR065390.fastq

# Perform error correction with a 41-mer.
# The k-mer cutoff parameter is learned automatically
$SGA_BIN correct -k $CK --discard --learn -t $CPU -o $CK.fastq SRR065390.fastq

# Contig assembly

# Index the corrected data.
$SGA_BIN index -a ropebwt -t $CPU $CK.fastq

# Remove exact-match duplicates and reads with low-frequency k-mers
$SGA_BIN filter -x $COV_FILTER -t $CPU --homopolymer-check --low-complexity-check $CK.fastq

# Merge simple, unbranched chains of vertices
$SGA_BIN fm-merge -m $MOL -t $CPU -o merged.k$CK.fa $CK.filter.pass.fa

# Build an index of the merged sequences
$SGA_BIN index -d 1000000 -t $CPU merged.k$CK.fa

# Remove any substrings that were generated from the merge process
$SGA_BIN rmdup -t $CPU merged.k$CK.fa

# Compute the structure of the string graph
$SGA_BIN overlap -m $MOL -t $CPU merged.k$CK.rmdup.fa

# Perform the contig assembly without bubble popping
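# Sketch (assumption): no assemble command appears for this step in the
# listing, so the lines below are a guess at one plausible form, not the
# original. They assume that the overlap step wrote merged.k$CK.rmdup.asqg.gz,
# that "sga assemble" writes <prefix>-contigs.fa and <prefix>-graph.asqg.gz,
# and that OL, R and MAX_GAP_DIFF hold the final overlap, repeat resolution and
# indel bubble parameters described at the top of the script.
# CTGS and GRAPH are the names the scaffolding steps below refer to.
CTGS=assemble.m$OL-contigs.fa
GRAPH=assemble.m$OL-graph.asqg.gz
$SGA_BIN assemble -m $OL -g $MAX_GAP_DIFF -r $R -o assemble.m$OL merged.k$CK.rmdup.asqg.gz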

# Scaffolding/Paired end resolution

# Realign reads to the contigs
~/work/devel/sga/src/bin/sga-align --name $CTGS $IN1 $IN2

# Make contig-contig distance estimates
~/work/devel/sga/src/bin/ -n $MIN_PAIRS --prefix libPE

# Make contig copy number estimates
~/work/devel/sga/src/bin/ -m $MIN_LENGTH > libPE.astat

# Build the scaffolds from the distance and copy number estimates
$SGA_BIN scaffold -m $MIN_LENGTH --pe -a libPE.astat -o scaffolds.n$MIN_PAIRS.scaf $CTGS

# Convert the scaffolds to FASTA
$SGA_BIN scaffold2fasta -m $MIN_LENGTH -a $GRAPH -o scaffolds.n$MIN_PAIRS.fa -d $SCAFFOLD_TOLERANCE --use-overlap --write-unplaced scaffolds.n$MIN_PAIRS.scaf
archive/c-elegans_example_shell_script.txt · Last modified: 2015/07/18 20:34 by ceisenhart