contributors:team_5_page

Differences

This shows you the differences between two versions of the page.

contributors:team_5_page [2015/05/16 03:01]
ceisenhart
contributors:team_5_page [2015/09/04 14:56]
157.55.39.159 ↷ Links adapted because of a move operation
Line 10: Line 10:
 Discovar //de novo// is a next-generation sequence assembly program. It was developed by the Broad Institute and released in late 2014. Discovar //de novo// is designed for 250 bp Illumina reads with PCR duplicates and adaptor sequences removed. The following webpage contains the manual as provided by the Broad Institute (http://www.broadinstitute.org/software/discovar/blog/):
  
-[[Discovar //de novo// manual:Discovar de novo manual]].
+[[contributors:team_5:discovar_de_novo_manual]].
  
  
Line 17: Line 17:
 The raw data was received as FASTQ pairs. Each pair contains a forward and a reverse read. These pairs are run through Skewer to remove adaptor sequences, then through FastUniq to remove PCR duplicates. Next, the forward and reverse reads are merged into a single unaligned BAM file.
  
-{{:flowchart5.4.2015.png}}
+{{:fastqtobam.png}}
  
 All unaligned BAM files are then passed into Discovar //de novo//. The output is an assembly in .fasta format plus Discovar //de novo// visualization files. The .fasta file can then be re-scaffolded with a scaffolding program (see next workflow). The finished file can be displayed in the UCSC Genome Browser.
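As a rough sketch of this step, the helper below builds a Discovar //de novo// command line from a list of unaligned BAM files. The `READS=` and `OUT_DIR=` arguments follow the usage described in the Broad Institute manual; the function name, file names, and output directory are hypothetical placeholders, not the paths used for the runs on this page.

```python
def discovar_command(bam_files, out_dir):
    """Build the argument list for one Discovar de novo run.

    bam_files: unaligned BAM files (hypothetical names in the example below).
    out_dir:   directory for the assembly output (also hypothetical).
    """
    if not bam_files:
        raise ValueError("at least one BAM file is required")
    # Several input files are passed as one comma-separated READS= value.
    reads = ",".join(bam_files)
    return ["DiscovarDeNovo", f"READS={reads}", f"OUT_DIR={out_dir}"]


if __name__ == "__main__":
    # Placeholder inputs; only prints the command, does not run it.
    cmd = discovar_command(["SW018.bam", "SW019.bam"], "discovar_out")
    print(" ".join(cmd))
```

Returning a list rather than a single string means the command can be handed to `subprocess.run` without shell quoting issues.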
Line 28: Line 28:
  
 ====FastQC of adapter-trimmed and PCR duplicate-removed data====
-After removing adapters and PCR duplicates, we ran FastQC on two of the libraries. In general, read quality decreases in the last base positions. Also, read 2 of the SW019 library shows problems in the per-tile sequence quality. Below are the PDF files with the FastQC results for the adapter-trimmed and PCR duplicate-removed libraries. The protocol we used to run FastQC is at this link: [[fastqc:fastqc]]. The path to the files is: campusdata/gchaves/fastqc_trimmed_PCR_duplicates.
+After removing adapters and PCR duplicates, we ran FastQC on two of the libraries. In general, read quality decreases in the last base positions. Also, read 2 of the SW019 library shows problems in the per-tile sequence quality. Below are the PDF files with the FastQC results for the adapter-trimmed and PCR duplicate-removed libraries. The protocol we used to run FastQC is at this link: [[archive:fastqc:fastqc]]. The path to the files is: campusdata/gchaves/fastqc_trimmed_PCR_duplicates.
  
 {{:sw018_adaptertrimmed_dup..._r1.pdf| SW018_R1}}
Line 43: Line 43:
 The FASTQ to BAM conversion was performed using the Picard toolset. Specifically, the fastqToSam.jar file was used to prepare the BAM files.
  
-[[team_5_page:fastqToSamCommands | FastqToSam commands]]
+[[contributors:team_5:fastqtosamcommands| FastqToSam commands]]
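The exact commands are on the linked page. As a hedged illustration only, the sketch below assembles a Picard FastqToSam invocation for one read pair using the tool's `FASTQ=`, `FASTQ2=`, `OUTPUT=`, and `SAMPLE_NAME=` arguments. The function name, jar path, and file names are assumptions made for the example.

```python
def fastq_to_sam_command(jar, fastq1, fastq2, out_bam, sample_name):
    """Build a Picard FastqToSam command for one paired-end library.

    jar:         path to the FastqToSam jar (installation-specific, assumed).
    fastq1/2:    forward and reverse FASTQ files (hypothetical names).
    out_bam:     unaligned BAM file to write.
    sample_name: sample name recorded in the BAM read group header.
    """
    return [
        "java", "-jar", jar,
        f"FASTQ={fastq1}",    # first read of each pair
        f"FASTQ2={fastq2}",   # second read of each pair
        f"OUTPUT={out_bam}",
        f"SAMPLE_NAME={sample_name}",
    ]


if __name__ == "__main__":
    # Placeholder inputs; only prints the command, does not run it.
    cmd = fastq_to_sam_command(
        "FastqToSam.jar", "SW019_R1.fastq", "SW019_R2.fastq",
        "SW019.unaligned.bam", "SW019",
    )
    print(" ".join(cmd))
```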
  
  
Line 50: Line 50:
 The run logs are stored as .txt files. The full logs can be seen on the wiki here:
 | Run log | Data used |
-|[[team_5_page:1PerRun | 1% data ]]| (Pre Skewer and FastUniq) MiSeq data SW019_S1_L001, HiSeq data SW018_S1_L007, HiSeq data SW019_S2_L008 |
+|[[contributors:team_5:1perrun| 1% data ]]| (Pre Skewer and FastUniq) MiSeq data SW019_S1_L001, HiSeq data SW018_S1_L007, HiSeq data SW019_S2_L008 |
-|[[team_5_page:5PerRun | 5% data]]| (Pre Skewer and FastUniq) MiSeq data SW019_S1_L001, HiSeq data SW018_S1_L007, HiSeq data SW019_S2_L008 |
+|[[contributors:team_5:5perrun| 5% data]]| (Pre Skewer and FastUniq) MiSeq data SW019_S1_L001, HiSeq data SW018_S1_L007, HiSeq data SW019_S2_L008 |
-|[[team_5_page:10PerRun | 10% data]]| (Pre Skewer and FastUniq) MiSeq data SW019_S1_L001, HiSeq data SW018_S1_L007, HiSeq data SW019_S2_L008 |
+|[[contributors:team_5:10perrun| 10% data]]| (Pre Skewer and FastUniq) MiSeq data SW019_S1_L001, HiSeq data SW018_S1_L007, HiSeq data SW019_S2_L008 |
-|[[team_5_page:50PerRun | 50% data]]| (Post Skewer and FastUniq) MiSeq data SW019_S1_L001, HiSeq data SW018_S1_L007, HiSeq data SW019_S2_L008 |
+|[[contributors:team_5:50perrun| 50% data]]| (Post Skewer and FastUniq) MiSeq data SW019_S1_L001, HiSeq data SW018_S1_L007, HiSeq data SW019_S2_L008 |
-|[[team_5_page:50PerRunUCSF | 50% data UCSF]]| (Post Skewer and FastUniq) UCSF SW018 and SW019 data |
+|[[contributors:team_5:50perrunucsf| 50% data UCSF]]| (Post Skewer and FastUniq) UCSF SW018 and SW019 data |
+|[[contributors:team_5:fullrun1| Full data run 1]] | (Post Skewer and FastUniq) 100% of the MiSeq SW019, UCSF SW019, and 50% BS-tag datasets |
+|[[contributors:team_5:kollosusfullrun| Kolossus full run]] | (Post Skewer and FastUniq) 100% of the MiSeq SW019, UCSF SW019, UCSF SW018, BS-tag, BS-MK datasets |
  
 The logs are very large; the important statistics have been gathered and are compared below.
 Note that MPL1 stands for the mean length of the first read in a pair up to the first error.
-| | 1% run | 5% run | 10% run | 50% run | 50% UCSF run |
+| | 1% run | 5% run | 10% run | 50% run | 50% UCSF run | FullRun1 | Kolossus full run |
-| Total runtime | 1.75 hours | 1.53 hours | 2.4 hours | 8.53 hours | 14.9 hours |
+| Total runtime | 1.75 hours | 1.53 hours | 2.4 hours | 8.53 hours | 14.9 hours | 24.2 hours | 103 hours |
-| Peak memory use | 43.92 GB | 78.10 GB | 151.05 GB | 220.11 GB | 184.09 GB |
+| Peak memory use | 43.92 GB | 78.10 GB | 151.05 GB | 220.11 GB | 184.09 GB | 246.03 GB | 583.25 GB |
-| Bases in 1kb+ scaffolds | 75,233 | 592,685 | 1,476,875 | 101,397,871 | 1,528,625,509 |
+| Bases in 1kb+ scaffolds | 75,233 | 592,685 | 1,476,875 | 101,397,871 | 1,528,625,509 | 1,849,167,875 | 1,885,373,341 |
-| Bases in 10kb+ scaffolds | 10,572 | 11,088 | 168,543 | 151,417 | 137,959,107 |
+| Bases in 10kb+ scaffolds | 10,572 | 11,088 | 168,543 | 151,417 | 137,959,107 | 972,798,485 | 1,106,140,476 |
-| MPL1 | 2 | 2 | 3 | 7 | 156 |
+| MPL1 | 2 | 2 | 3 | 7 | 156 | 169 | 169 |
-| Contig N50 | 2,622 | 2,067 | 2,563 | 1,489 | 3,979 |
+| Contig N50 | 2,622 | 2,067 | 2,563 | 1,489 | 3,979 | 9,513 | 10,427 |
-| Scaffold N50 | 2,622 | 2,067 | 2,563 | 1,489 | 3,979 |
+| Scaffold N50 | 2,622 | 2,067 | 2,563 | 1,489 | 3,979 | 10,634 | 12,549 |
+| Coverage | | | | | 16x | 47x | 80x |
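For readers unfamiliar with the N50 and "bases in Nkb+ scaffolds" rows, the sketch below shows one common way to compute them from a list of sequence lengths. It is an illustration under the standard N50 definition, not the code Discovar //de novo// uses to produce its log; the function names are made up for the example.

```python
def n50(lengths):
    """Return the N50: the length L of the sequence at which, going from
    longest to shortest, the running total first reaches half of all bases."""
    if not lengths:
        raise ValueError("no lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


def bases_in_long_sequences(lengths, min_len):
    """Sum the bases held in sequences of at least min_len bases."""
    return sum(length for length in lengths if length >= min_len)


if __name__ == "__main__":
    toy = [2, 2, 2, 3, 3, 4, 8, 8]  # toy lengths, not real scaffold data
    print(n50(toy))
    print(bases_in_long_sequences([500, 1000, 12000], 1000))
```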
  
 ====Fasta assemblies====
Line 77: Line 80:
  
  
- ​**NOTE for all statistics listed below the scaffolds and contigs are identical** 
 |Assembly name| Bytes | Total bases | # scafs | Av. scaf len | Longest scaf | Scaf N50 | # Scaf > 5Kb | Bases in 10kb+ scafs|
 |1%run| 463K | 448,486 | 2,558 | 350 | 5,385 | 2,622| | 10,572 |
Line 83: Line 85:
 |10%run| 7.5 M | 7,382,612 | 38,195 | 386 | 11,911 | 2,563 | | 168,543 |
 |50%run| 137 M | 137,695,736 | 273,653 | 1,006 | 19,658 | 1,489 | | 151,417 |
-|UCSF50%run | 1.9 G | 1,839,371,352 | 3,094,953 | 1,261 | 34,857 | 3,979 | 138,226 | 137,959,107 |
+|UCSF50%run | 1.9 G | 1,839,371,352 | 1,126,557 | 1,632 | 55,757 | 3,979 | 80,721 | 137,959,107 |
+|firstFullRun | 2.2G | 2,245,788,654 | 1,450,447 | 1,548 | 153,999 | 10,634 | 118,545 | 972,798,485 |
+|Kolossus full run | 2.4G | 2,395,797,282 | 1,843,153 | 1299 | 129,831 | 12,549 | 113,978 | 1,106,140,476 |
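The columns in this table can be recomputed from any of the .fasta assemblies. The following is a minimal sketch, assuming a plain multi-line FASTA file; the function names and the tiny in-memory example are hypothetical, and this is not the script that was used to fill in the table.

```python
import io


def read_fasta_lengths(handle):
    """Yield the sequence length of each record in a FASTA stream."""
    length = None  # None until the first '>' header is seen
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarize(lengths, long_cutoff=5000):
    """Compute total bases, scaffold count, mean and longest length,
    and the number of scaffolds longer than long_cutoff."""
    lengths = list(lengths)
    total = sum(lengths)
    return {
        "total_bases": total,
        "n_scaffolds": len(lengths),
        "mean_length": total // len(lengths) if lengths else 0,
        "longest": max(lengths, default=0),
        "n_longer_than_cutoff": sum(1 for n in lengths if n > long_cutoff),
    }


if __name__ == "__main__":
    # Two toy records stand in for a real assembly file.
    demo = io.StringIO(">s1\nACGT\nAC\n>s2\nACGTACGTAC\n")
    print(summarize(read_fasta_lengths(demo)))
```

The mean here is total bases divided by scaffold count, rounded down. Applied to the firstFullRun and Kolossus rows, that division (2,245,788,654 / 1,450,447 and 2,395,797,282 / 1,843,153) reproduces the listed averages of 1,548 and 1299.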
  
-The latest run assembled 138,226 scaffolds longer than 5,000 base pairs. This accounts for roughly 30% of the banana slug genome. The scaffolds longer than 5,000 base pairs were separated and put at the following location:
-
-/campusdata/BME235/S15_assemblies/DiscovarDeNovo/UCSF50%run/bigContigs.fa
+The absolute path to our latest assembly in .fasta format is:
 
+/campusdata/BME235/S15_assemblies/DiscovarDeNovo/KolossusAssembly/discovarDeNovoKolossusAssembly.fasta
  
 Looking at the 10% run, the majority of scaffolds generated are quite short (<1kb).
Line 97: Line 100:
  
 {{:histogram_for_ucsf_50_run.png?200|}}
-
-The banana slug genome is estimated to be 2.1 billion bases (2,800 million); our latest run has assembled just under 2 billion bases!
  
 ====Post assembly scaffolding====
 The program SSPACE (documentation below) was used to scaffold the assembly with mate pair data. The UCSF SW041 and SW042 mate pair libraries were used to generate the library.txt file.
  
-[[team_5_page:SSPaceSummaryFile | SSpace summary file ]]
+[[contributors:team_5:sspacesummaryfile| SSpace summary file UCSF 50% ]]
+
+[[contributors:team_5:sspacesummaryfile2| SSpace summary file firstFullRun ]]
  
  
Line 109: Line 113:
 The .fasta assemblies were run through BLAST. The results are below:
  
-[[team_5_page:10%Blast | 10% BLAST results ]]
+[[contributors:team_5:10_blast| 10% BLAST results ]]
  
 There seems to be a very high sequence identity with Notopygos (http://sv.wikipedia.org/wiki/Notopygos)
  
-[[ team_5_page:50%_UCSF_Blast | 50% UCSF Data BLAST results ]]
+[[contributors:team_5:50_ucsf_blast| 50% UCSF Data BLAST results ]]
 ====UCSC genome browser hub====
  
 See instructions for setting up the hub here:
-[[banana_slug_genome_browser |Banana slug browser ]]
+[[post-assembly_analysis:banana_slug_genome_browser|Banana slug browser ]]
  
  
contributors/team_5_page.txt · Last modified: 2015/09/04 14:56 by 157.55.39.159