RNA scaffolding

The RNA scaffolding is currently in progress. Please contact Chris Eisenhart with questions.

The current pipeline can be broken down into three major steps: data processing, transcriptome assembly, and genome scaffolding.

The corresponding data files and wet lab procedures are documented online here.

Data processing

The raw data are paired-end reads in FASTQ format. The reads are first processed with fastUniq to remove PCR duplicates. Low-quality sequences are then removed with an in-house program (fastqQuality).
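The two processing steps above can be illustrated with a minimal Python sketch. It is not the code of fastUniq or fastqQuality, whose internals are not described on this page. It assumes that a duplicate is a read pair whose two sequences both exactly match an earlier pair, that qualities are Phred+33 encoded, and that a pair is dropped when either mate has a mean quality below a threshold. The function names and the threshold of 20 are hypothetical.

```python
from statistics import mean


def dedup_pairs(pairs):
    """Keep the first occurrence of each (read 1, read 2) sequence pair.

    Assumption: a duplicate is an exact match on both mate sequences.
    """
    seen = set()
    kept = []
    for r1, r2 in pairs:
        key = (r1["seq"], r2["seq"])
        if key in seen:
            continue
        seen.add(key)
        kept.append((r1, r2))
    return kept


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return mean(ord(c) - offset for c in qual)


def filter_pairs(pairs, min_mean_q=20):
    """Drop a pair if either mate falls below the (hypothetical) threshold."""
    return [
        (r1, r2)
        for r1, r2 in pairs
        if mean_phred(r1["qual"]) >= min_mean_q
        and mean_phred(r2["qual"]) >= min_mean_q
    ]


# Toy input: three read pairs, the second an exact duplicate of the first,
# the third with a low-quality second mate.
example_pairs = [
    ({"seq": "ACGT", "qual": "IIII"}, {"seq": "TTGA", "qual": "IIII"}),
    ({"seq": "ACGT", "qual": "IIII"}, {"seq": "TTGA", "qual": "IIII"}),
    ({"seq": "GGCA", "qual": "IIII"}, {"seq": "CCAT", "qual": "####"}),
]
unique_pairs = dedup_pairs(example_pairs)
clean_pairs = filter_pairs(unique_pairs)
```

Holding every distinct pair in a set is what makes exact deduplication memory-hungry on a full data set; a production tool would need a more compact representation.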

Transcriptome assembly

The transcriptome assembly is being done with Trinity.

Genome scaffolding

The transcriptome is mapped to the full genomic assembly with BLAT. The resulting PSL file is then passed to L_RNA_scaffolder to scaffold the genome.
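To show what kind of information the PSL file carries into the scaffolding step, here is a rough Python sketch that lists transcripts with alignments to more than one genomic sequence. Such transcripts are the sort of evidence an RNA-based scaffolder can use to link contigs. This is not how L_RNA_scaffolder itself is implemented. It assumes the standard 21-column tab-separated PSL layout (column 1 is the match count, column 10 the query name, column 14 the target name); the function name and the `min_matches` cutoff are hypothetical.

```python
from collections import defaultdict

# Zero-based column indices in the standard 21-column PSL layout.
PSL_MATCHES = 0
PSL_Q_NAME = 9
PSL_T_NAME = 13
PSL_COLUMNS = 21


def transcripts_spanning_targets(psl_lines, min_matches=50):
    """Map each transcript to the targets it hits, keeping only those
    that align to more than one target sequence.

    min_matches is a hypothetical cutoff to ignore very short hits.
    """
    targets = defaultdict(set)
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip header lines and anything that is not a full PSL record.
        if len(fields) < PSL_COLUMNS or not fields[PSL_MATCHES].isdigit():
            continue
        if int(fields[PSL_MATCHES]) < min_matches:
            continue
        targets[fields[PSL_Q_NAME]].add(fields[PSL_T_NAME])
    return {q: sorted(t) for q, t in targets.items() if len(t) > 1}
```

A usage example would be `transcripts_spanning_targets(open("transcripts_vs_genome.psl"))`, where the file name is a placeholder.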

Current progress

I am running a small subset of the RNA-seq data through the pipeline to optimize the options and system usage. So far, one full run has been completed (through L_RNA_scaffolder, generating a new FASTA assembly file), and seven partial runs have been completed (through the transcriptome assembly). I am still deciding what data processing is needed; in particular, I am debating whether to run a RAM-expensive deduplication to ensure that all duplicates are removed. The partial runs have been using more than 130 GB of RAM at their peak, which means that without optimization the full run will crash even our terabyte-RAM machines.

Currently I have one undergraduate from UC Berkeley, Daren Liu, working on the pipeline. Daren has been assisting me by writing a program for FASTQ deduplication and a program for generating FASTA statistics.

Using Daren's FASTA statistics program, we are working on optimizing the transcriptome assembly. Currently we are focusing on getting the highest N50 without major losses of bases, while keeping RAM usage in mind.
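For readers unfamiliar with the metric, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total bases. The sketch below computes it, along with the total base count, from FASTA text. It is an illustration only and is not Daren's program; the function names are hypothetical.

```python
def fasta_lengths(lines):
    """Return the length of each sequence in FASTA-formatted lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Toy assembly with two sequences.
example_fasta = [">a", "ACGT", "AC", ">b", "GG"]
example_lengths = fasta_lengths(example_fasta)
example_total = sum(example_lengths)
example_n50 = n50(example_lengths)
```

Reporting the total base count alongside N50 matters here, because N50 can be raised simply by discarding short contigs, which is the loss of bases the optimization is trying to avoid.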

Discussion

2015/08/31 18:15

Why do you think that deduplication is important for scaffolding? We're not trying to quantitate the mRNA transcripts. Uneven coverage should not be a major concern.

2015/09/01 09:22

My rationale is to keep the de Bruijn graph smaller during the transcriptome assembly. For the transcriptome assembly we don't really want duplicates/repeats, since I am shooting for a single assembly. If we were doing differential expression analysis, then I would leave the repeats/duplicates in.

2015/09/05 09:27

Error correction using the k-mers of the shotgun DNA sequencing may be a better approach for keeping the de Bruijn graph small.

post-assembly_analysis/2015/rna_scaffolding.txt · Last modified: 2015/08/31 14:11 by ceisenhart