Genome Annotation
Repeats
Masking:
Done to facilitate conventional gene annotation efforts.
Helps avoid false SNP calls and mapping ambiguities.
Hard Masking: replacing repeats with Ns {ACGTNNNNNNNNNATGG}
Soft Masking: replacing repeats with lowercase {ACGTtagtagtagATGG}
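Sketch of the two masking styles above, assuming repeat locations are already known as 0-based, end-exclusive (start, end) intervals. The function name mask and the interval convention are illustrative choices, not taken from any particular tool.

```python
def mask(seq, repeats, mode="soft"):
    """Mask repeat intervals in seq.

    repeats: list of (start, end) intervals, 0-based and end-exclusive
             (an assumed convention for this sketch).
    mode:    "soft" lowercases repeat bases; "hard" replaces them with N.
    """
    if mode not in ("soft", "hard"):
        raise ValueError("mode must be 'soft' or 'hard'")
    chars = list(seq.upper())  # unmasked sequence is kept uppercase
    for start, end in repeats:
        # Clamp the interval to the sequence bounds.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N" if mode == "hard" else chars[i].lower()
    return "".join(chars)


seq = "ACGTTAGTAGTAGATGG"
repeats = [(4, 13)]  # the TAGTAGTAG run
hard = mask(seq, repeats, "hard")
soft = mask(seq, repeats, "soft")
print(hard)
print(soft)
```

Soft masking keeps the underlying bases, so the masking can be undone or ignored by downstream tools; hard masking discards them.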
Repeat Annotation:
Types of Repeats:
low-complexity sequence: microsatellites, homopolymers, etc.
Transposable Elements:
* class 1: retrotransposons; “copy & paste”; LTRs, LINEs, SINEs
* class 2: DNA transposons; “cut & paste”; subclass 1 and subclass 2
Repeat Content
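One simple way to estimate repeat content, assuming a soft-masked sequence in which lowercase bases mark repeats: take the fraction of lowercase bases. The function name repeat_content is hypothetical; real reports (e.g. from a repeat masker) also break the total down by repeat class.

```python
def repeat_content(masked_seq):
    """Fraction of bases that are soft-masked (lowercase) in masked_seq.

    Assumes lowercase means "repeat". N bases are counted in the total
    but not as repeats, since they may also represent assembly gaps.
    """
    total = len(masked_seq)
    if total == 0:
        return 0.0
    masked = sum(1 for base in masked_seq if base.islower())
    return masked / total


fraction = repeat_content("ACGTtagtagtagATGG")
print(f"{fraction:.1%} of bases are soft-masked")
```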
Tools
Homology: RepeatMasker
de novo: RepeatModeler, WindowMasker, RepeatScout, Piler
de novo from reads: REPdenovo, TEDNA
NOTE: de novo tools run the risk of false positives: highly conserved protein-coding genes can be flagged as repeats.
Gene Annotation
Evidence-driven Annotation
Ab initio Gene Prediction
doesn't require evidence data
requires training for organism of interest
most predictors report only the single most likely CDS
does not report UTRs (incomplete gene models)
does not accommodate alternative splice forms (isoforms)
requires high-quality assembly (scaffold N50 ≈ avg gene size)
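For reference, a minimal sketch of how scaffold N50 can be computed from a list of scaffold lengths, so it can be compared against average gene size. The function name n50 and the example lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


scaffolds = [100, 80, 50, 30, 20, 10, 10]  # illustrative lengths, total 300
print(n50(scaffolds))
```

If N50 is well below the typical gene length, many genes will be split across scaffolds and ab initio predictors will tend to report fragments.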
Combined Approach
Annotation Metrics
Sensitivity, specificity, accuracy, AED
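AED (Annotation Edit Distance) = 1 - ACC = 1 - 0.5 × (sensitivity + specificity)
Here specificity is the fraction of the annotation supported by evidence, as is conventional in gene prediction.
AED ranges from 0 (annotation agrees fully with the evidence) to 1 (no supporting evidence).
AED is useful for identifying low-quality, inconsistent annotations (which can be manually curated later).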
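A base-level sketch of the AED formula above, assuming the annotation and its supporting evidence are each represented as a set of covered genomic positions. The set representation and the function name aed are simplifications for illustration; annotation pipelines work on exon coordinates from aligned evidence.

```python
def aed(annotation, evidence):
    """Annotation edit distance between two sets of genomic positions.

    annotation: positions covered by the annotated gene model
    evidence:   positions covered by aligned evidence (e.g. ESTs, proteins)
    Returns a value in [0, 1]; 0 means perfect agreement.
    """
    if not annotation or not evidence:
        return 1.0  # nothing to compare: treat as unsupported
    overlap = len(annotation & evidence)
    sensitivity = overlap / len(evidence)    # evidence bases recovered by the annotation
    specificity = overlap / len(annotation)  # annotation bases supported by evidence
    return 1 - 0.5 * (sensitivity + specificity)


annotation = set(range(100, 200))
evidence = set(range(150, 250))
print(aed(annotation, evidence))   # half of each set overlaps
print(aed(annotation, annotation)) # identical sets
```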
Tools
Pipelines: MAKER2, PASA, Ensembl, NCBI
Evidence Mapping: BLAST/BLAT, Exonerate (computationally expensive)
Ab initio gene predictors: AUGUSTUS, SNAP, GeneMark
Choosers and Combiners: JIGSAW, GLEAN
Visualization & Curation: Artemis, Apollo, JBrowse, IGV