Genome Annotation
Repeat Annotation
Masking
Done to facilitate conventional gene annotation efforts.
Helps avoid false SNP calls and mapping ambiguities.
Hard masking: repeats are replaced with Ns, e.g. ACGTNNNNNNNNNATGG
Soft masking: repeats are converted to lowercase, e.g. ACGTtagtagtagATGG
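The two masking styles can be sketched in a few lines of Python. This is only an illustration: the function names and the 0-based, end-exclusive interval convention are assumptions made here, not the behavior of any particular masking tool.

```python
def _in_repeat(pos, repeats):
    """True if pos falls inside any (start, end) interval; end is exclusive."""
    return any(start <= pos < end for start, end in repeats)


def soft_mask(seq, repeats):
    """Lowercase the bases inside repeat intervals, uppercase everything else."""
    return "".join(
        base.lower() if _in_repeat(i, repeats) else base.upper()
        for i, base in enumerate(seq)
    )


def hard_mask(seq, repeats):
    """Replace the bases inside repeat intervals with N."""
    return "".join(
        "N" if _in_repeat(i, repeats) else base.upper()
        for i, base in enumerate(seq)
    )


def soft_to_hard(seq):
    """Convert a soft-masked sequence to a hard-masked one."""
    return "".join("N" if base.islower() else base for base in seq)


sequence = "ACGTTAGTAGTAGATGG"
repeat_intervals = [(4, 13)]  # hypothetical coordinates of the TAG repeat

print(soft_mask(sequence, repeat_intervals))
print(hard_mask(sequence, repeat_intervals))
```

Soft masking is reversible (uppercasing restores the sequence), whereas hard masking discards the underlying bases, which is why `soft_to_hard` is possible but the reverse is not.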
Repeat Annotation
Types of Repeats
Transposable Elements
class 1: retrotransposons; “copy & paste”; LTRs, LINEs, SINEs
class 2: DNA transposons; “cut & paste”; subclass 1 and subclass 2
Repeat Content
Homology: RepeatMasker
de novo: RepeatModeler, WindowMasker, RepeatScout, PILER
de novo from reads: REPdenovo, TEDNA
NOTE: de novo tools run the risk of false positives from highly conserved protein-coding genes.
Gene Annotation
Evidence-driven Annotation
Ab initio Gene Prediction
doesn't require evidence data
requires training for organism of interest
most tools find only the single most likely CDS
most do not report UTRs (incomplete gene models)
most do not accommodate alternative splice forms
requires high-quality assembly (scaffold N50 ≈ avg gene size)
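As a rough way to check the assembly rule of thumb above, scaffold N50 can be computed from a list of scaffold lengths and compared with an average gene size. The sketch below uses made-up numbers; the scaffold lengths and the average gene size are hypothetical values, not data from any real assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that scaffolds of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths (bp) and average gene size (bp).
scaffold_lengths = [80_000, 70_000, 50_000, 40_000, 30_000, 20_000]
average_gene_size = 25_000

scaffold_n50 = n50(scaffold_lengths)
print("scaffold N50:", scaffold_n50)
print("N50 at least average gene size:", scaffold_n50 >= average_gene_size)
```

If the N50 falls well below the average gene size, many genes are likely to be split across scaffolds, which is the situation the rule of thumb is meant to flag.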
Combined Approach
Annotation Metrics
Sensitivity, specificity, accuracy, AED
AED = 1 - ACC = 1 - 0.5 × (sensitivity + specificity)
AED is useful for identifying low-quality or inconsistent annotations (these can be manually curated later)
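The AED formula above can be worked through at the base level. The sketch below assumes that sensitivity is the fraction of evidence bases covered by the annotation and that specificity is the fraction of annotated bases supported by evidence (the gene-prediction usage of the term); the function names and coordinates are illustrative only.

```python
def covered_bases(intervals):
    """Set of base positions covered by (start, end) intervals; end is exclusive."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def aed(annotation, evidence):
    """Annotation edit distance: 0 means perfect agreement, 1 means no support."""
    annotated = covered_bases(annotation)
    supported = covered_bases(evidence)
    if not annotated or not supported:
        return 1.0  # nothing to compare, treated here as no agreement
    shared = len(annotated & supported)
    sensitivity = shared / len(supported)  # evidence bases recovered
    specificity = shared / len(annotated)  # annotated bases backed by evidence
    return 1 - 0.5 * (sensitivity + specificity)


# Hypothetical gene model with two exons; evidence supports only part of exon 2.
gene_model = [(100, 200), (300, 400)]
aligned_evidence = [(100, 200), (300, 350)]

print(aed(gene_model, aligned_evidence))
```

In this example sensitivity is 1.0 and specificity is 0.75, so the AED is 0.125. Gene models with values near 1 are the ones worth flagging for manual curation.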