The RNA scaffolding pipeline is currently in progress. Please contact Chris Eisenhart with questions.
The current pipeline can be broken down into three major steps: data processing, transcriptome assembly, and genome scaffolding.
The corresponding data files and wet lab procedures are documented online here.
The raw data are paired-end FASTQ files. The data are first processed with fastUniq to remove PCR duplicates, then low-quality sequences are removed with an in-house program (fastqQuality).
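To illustrate the deduplication step, here is a minimal sketch of the kind of exact-duplicate removal fastUniq performs on paired-end data. The function names and the simple 4-line FASTQ parsing are illustrative, not fastUniq's actual implementation, and real data would be streamed rather than held in lists.

```python
def read_fastq(path):
    """Yield (header, sequence, quality) records from a 4-line FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            seq = fh.readline().rstrip()
            fh.readline()  # skip the '+' separator line
            qual = fh.readline().rstrip()
            yield header, seq, qual

def dedup_pairs(reads1, reads2):
    """Keep only the first occurrence of each (read1, read2) sequence pair.

    A read pair is a PCR-duplicate candidate when both mates' sequences
    match an earlier pair exactly; comparison is by sequence only, so
    headers and quality strings are ignored.
    """
    seen = set()
    kept1, kept2 = [], []
    for r1, r2 in zip(reads1, reads2):
        key = (r1[1], r2[1])
        if key not in seen:
            seen.add(key)
            kept1.append(r1)
            kept2.append(r2)
    return kept1, kept2
```

Note the `seen` set holds one entry per unique pair, which is where the RAM cost of exhaustive deduplication comes from on large datasets.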
The transcriptome assembly is being done with Trinity.
The transcriptome is mapped to the full genomic assembly with BLAT. The resulting PSL file is passed to L_RNA_scaffolder to scaffold the genome.
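For reference, BLAT's PSL output is a 21-column tab-separated format; a sketch of pulling out the fields relevant to scaffolding might look like the following. The `high_identity` pre-filter and its 0.95 threshold are illustrative assumptions, not part of the L_RNA_scaffolder workflow itself.

```python
def parse_psl_line(line):
    """Parse one tab-separated PSL alignment line into the fields we need:
    matched bases, strand, query (transcript) name and size, target name."""
    f = line.rstrip("\n").split("\t")
    return {
        "matches": int(f[0]),
        "strand": f[8],
        "qName": f[9],
        "qSize": int(f[10]),
        "tName": f[13],
    }

def high_identity(records, min_fraction=0.95):
    """Keep alignments whose matched bases cover at least min_fraction of
    the transcript, a simple pre-filter before scaffolding."""
    return [r for r in records if r["matches"] >= min_fraction * r["qSize"]]
```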
I am working with a small subset of the RNA-seq data, running it through the pipeline to optimize the options and system usage. Currently one full run has been completed (finishing L_RNA_scaffolder and generating a new FASTA assembly file), and seven partial runs have been completed (finishing the transcriptome assembly). I am still deciding what data processing is needed; in particular, I am debating whether to run a RAM-expensive deduplication to ensure that all duplicates are removed. The partial runs have used 130+ GB of RAM at their peak, which means that without optimization the full run will crash even our terabyte-RAM machines.
Currently I have one undergraduate from UC Berkeley, Daren Liu, working on the pipeline. Daren has been assisting me by writing a program for FASTQ deduplication and a program for generating FASTA statistics.
Using Daren's FASTA statistics program, we are working on optimizing the transcriptome assembly. Currently we are focusing on achieving the highest N50 without a major loss of bases, while keeping RAM usage in check.
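The N50 metric we are optimizing can be computed as below: sort contig lengths in descending order and report the length at which the running total first covers half of all assembled bases. This is a generic sketch, not Daren's program.

```python
def n50(contig_lengths):
    """N50: the largest length L such that contigs of length >= L
    together cover at least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input
```

Because N50 rewards long contigs regardless of total assembly size, we track total bases alongside it; an assembly can raise its N50 simply by dropping short contigs, which is the "loss of bases" we want to avoid.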