# Banana Slug Genomics

lecture_notes:04-10-2015

De novo Assembly III. Guest lecturer: Stefan Prost, stefan.prost@berkeley.edu

Best rated assemblers

#Error Correction (EC) using the k-mer spectrum

```. BLESS
. SGA
. Musket```
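The k-mer spectrum that these tools rely on can be sketched in a few lines. This is only an illustration of the idea, not how BLESS, SGA or Musket are implemented; the function names and the toy reads are made up for this example. K-mers that occur only once are the usual suspects for sequencing errors.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


# Toy reads (made up); the last read carries a single error (T -> G).
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGGACGT"]
counts = kmer_counts(reads, k=4)
spectrum = kmer_spectrum(counts)

# k-mers seen exactly once are likely to contain the error.
rare = sorted(kmer for kmer, n in counts.items() if n == 1)
print(rare)
```

Here the four singleton k-mers all overlap the erroneous base, which is the signal a spectrum-based corrector would act on.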

#EC Using k-mer Counts

`. RACER`

#Adapter and Low Quality Base Trimming

```. Skewer (Jiang et al. 2014)
. Trimmomatic (Bolger et al. 2014)```
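As a rough sketch of what adapter and low-quality trimming do to a single read, assuming exact adapter matches and a fixed Phred cutoff of 20: Skewer and Trimmomatic use more elaborate matching and windowing, so this is not their algorithm. The function names, the adapter prefix and the quality values are illustrative assumptions.

```python
def trim_adapter(read, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    position = read.find(adapter)
    return read if position == -1 else read[:position]


def trim_low_quality_tail(read, qualities, min_quality=20):
    """Drop trailing bases until the last remaining base reaches min_quality."""
    end = len(read)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return read[:end], qualities[:end]


# Example adapter prefix and read (both assumed for illustration).
adapter = "AGATCGGAAG"
raw_read = "ACGTACGT" + adapter
clipped = trim_adapter(raw_read, adapter)

# Hypothetical Phred scores for the eight remaining bases.
scores = [30, 32, 31, 30, 28, 25, 12, 9]
trimmed, trimmed_scores = trim_low_quality_tail(clipped, scores)
print(clipped, trimmed)
```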

#Contamination Filtering

```. Blast
. Allpaths-LG
. Removing Low Frequency k-mers
. Discarding Scaffolds shorter than 1kb```
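The last item, discarding scaffolds shorter than 1 kb, is simple enough to sketch directly. The parser below is a minimal assumption-laden FASTA reader (no validation), and `drop_short_scaffolds` is a hypothetical helper name, not part of any of the tools listed.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_short_scaffolds(records, min_length=1000):
    """Keep only scaffolds with at least min_length bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# Two made-up scaffolds: 1200 bp and 400 bp.
fasta_text = ">scaffold_1\n" + "ACGT" * 300 + "\n>scaffold_2\n" + "ACGT" * 100 + "\n"
kept = drop_short_scaffolds(parse_fasta(fasta_text))
print([name for name, _ in kept])
```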

#de Bruijn Graph

```. Most assemblers use k-mers to create a de Bruijn graph
. “In graph theory, an n-dimensional De Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has m^n vertices, consisting of all possible length-n sequences of the given symbols; the same symbol may appear multiple times in a sequence” - http://en.wikipedia.org/wiki/De_Bruijn_graph```
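A minimal sketch of how a de Bruijn graph can be built from reads, assuming the common assembly convention that nodes are (k-1)-mers and every k-mer observed in a read contributes one edge. Real assemblers add error pruning, reverse complements and compact storage, none of which is shown here; the reads are made up. For DNA (m = 4) and n = 3 the full graph in the definition above would have 4^3 = 64 vertices, of which the toy reads touch seven.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix (they overlap by k-2).
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


# Two overlapping toy reads (made up for illustration).
graph = de_bruijn_graph(["ATGGCGT", "GGCGTGC"], k=4)
for node, successors in graph.items():
    print(node, "->", successors)
```

Following the edges from ATG spells out ATGGCGTGC, the sequence both reads were drawn from.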

```. Allpaths-LG
. ABySS (good assembler for a first guess: fast but not very accurate, and it must be run separately for every k-mer size)
. SOAPdenovo
. Platanus
. SSAKE
. SGA
. Celera```

#ABySS Output Files

```. unitigs.fa
. “A unitig is a special kind of contig. Ideally, it is fully consistent with all the data including reads, overlaps, and mate constraints. In practice, unitigs can only be consistent with most of the data. Conceptually, a unitig is a high-confidence contig. Maximal unitigs should contain either (1) unique sequence up to repeat boundaries, with less than a read-length of repeat on each end, or (2) nearly the full extent of a genomic repeat.” - http://wgs-assembler.sourceforge.net/wiki/index.php/Celera_Assembler_Terminology
. contigs.fa
. scaffolds.fa```

#N50

```. N50: at least 50% of the bases are in scaffolds of this size or larger
. A contig N50 is calculated by first ordering every contig by length from longest to shortest.  Next, starting from the longest contig, the lengths of each contig are summed, until this running sum equals one-half of the total length of all contigs in the assembly.  The contig N50 of the assembly is the length of the shortest contig in this list (Yandell and Ence 2012).```
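The N50 procedure described above can be written out as a short function. This is a sketch that treats "equals one-half" as "reaches or exceeds one-half", since the running sum rarely hits the halfway point exactly; the contig lengths are made up.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running_sum = 0
    for length in ordered:
        running_sum += length
        # The first contig that pushes the sum to half the total is the N50.
        if running_sum >= half_total:
            return length
    raise ValueError("n50() needs at least one length")


contig_lengths = [80, 70, 50, 40, 30, 20, 10]  # made-up lengths, total 300
result = n50(contig_lengths)
print(result)
```

With these lengths the running sum reaches 150 (half of 300) after the contigs of length 80 and 70, so the N50 is 70.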

#Assembly Quality Assessment

```. By annotation: CEGMA (Parra et al. 2007) looks for a set of core conserved genes in the assembled sequence to measure quality. (http://korflab.ucdavis.edu/datasets/cegma/)
. Using RNA transcripts: Baa.pl (Ryan 2013/4, arXiv)
. De novo likelihood-based measures (LAP; Ghodsi et al. 2013)
. Feature Response Curve
. Mate-pair orientations and separations
. Repeat content by k-mer analysis
. Depth-of-coverage
. Correlated polymorphism in the read alignments```

#Gap Filling

```. GapCloser (Luo et al. 2012)
. Most gap closing is done in the first cycle (run CEGMA > GapCloser > CEGMA again)
. GapFiller (Nadalin et al. 2011)```

#Resolving Mis-Assemblies

```. REAPR (Hunt et al. 2013), better than NxRepair
. Run CEGMA to figure out the best assembly and then run REAPR
. NxRepair (Murphy et al. 2014)```

#Genome Merging

```. Metassembler (Wences and Schatz 2014, arXiv)
. Ranks assemblies by N50
. Merges the different assemblies to form an optimal assembly
. GAM-NGS (Vicedomini et al. 2013)```

. Read clipping marks the quality of the read, either by soft masking (low-quality bases are written in lowercase) or by hard masking (low-quality bases are replaced by N)
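The difference between soft and hard masking can be shown on one read. This is a sketch that assumes per-base Phred scores and a cutoff of 20; both the cutoff and the function names are illustrative choices, not taken from the lecture.

```python
def soft_mask(sequence, qualities, min_quality=20):
    """Lowercase every base whose quality score is below min_quality."""
    return "".join(
        base.lower() if q < min_quality else base.upper()
        for base, q in zip(sequence, qualities)
    )


def hard_mask(sequence, qualities, min_quality=20):
    """Replace every base whose quality score is below min_quality with N."""
    return "".join(
        "N" if q < min_quality else base.upper()
        for base, q in zip(sequence, qualities)
    )


sequence = "ACGTAC"
qualities = [35, 30, 12, 8, 33, 31]  # hypothetical Phred scores
soft = soft_mask(sequence, qualities)
hard = hard_mask(sequence, qualities)
print(soft, hard)
```

Soft masking keeps the original base recoverable (ACgtAC), while hard masking discards it (ACNNAC).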

. Number of scaffolds should be close to the number of chromosomes (ideal, but rarely achieved)

. Contig N50 should be close to or larger than 2x the average gene size

#Downstream Processing

```. Repeat annotation
. Gene annotation
. Mapping to get Diploid Genome```

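. Efasta: extended FASTA notation in which braces list the alternative sequences at a position, e.g. AT{G,C}A stands for both ATGA and ATCA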
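To make the brace notation concrete, the sketch below expands an Efasta-style string into every plain sequence it encodes. It assumes non-nested braces with comma-separated alternatives, and `expand_efasta` is a hypothetical helper written for these notes, not a function from any assembler.

```python
import itertools
import re


def expand_efasta(sequence):
    """List every plain sequence encoded by brace alternatives such as {G,C}."""
    # re.split with a capture group returns fixed text at even indices and
    # the brace contents at odd indices.
    parts = re.split(r"\{([^{}]*)\}", sequence)
    choices = [
        part.split(",") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


variants = expand_efasta("AT{G,C}A")
print(variants)
```

The number of expanded sequences grows multiplicatively with the number of brace groups, so full expansion is only practical for short stretches.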