====== Genome Annotation ======

===== Repeats =====

Masking:
  * Done to facilitate conventional gene annotation efforts.
  * Helps avoid false SNP calls and mapping ambiguities.
  * Hard masking: replacing repeats with Ns (e.g. ACGTNNNNNNNNNATGG)
  * Soft masking: replacing repeats with lowercase letters (e.g. ACGTtagtagtagATGG)

Repeat Annotation:
  * Different types of repeats can be studied along with their levels of activity (evolutionary analyses).

Types of Repeats:
  * Low-complexity sequence: microsatellites, homopolymers, etc.
  * Transposable elements:
    * Class 1: retrotransposons; "copy & paste"; LTRs, LINEs, SINEs
    * Class 2: DNA transposons; "cut & paste"; subclass 1 and subclass 2

Repeat Content
  * Does not necessarily correlate with genome size.
  * Some correlation is seen within the same group of organisms.

Tools
  * Homology: RepeatMasker
  * De novo: RepeatModeler, WindowMasker, RepeatScout, PILER
  * De novo from reads: REPdenovo, TEDNA
  * NOTE: de novo tools run the risk of false positives from highly conserved protein-coding genes.

===== Gene Annotation =====

Evidence-driven Annotation
  * Uses protein information, ESTs, and **RNA-Seq**.

Ab initio Gene Prediction
  * Doesn't require evidence data.
  * Requires training for the organism of interest.
  * Most tools find the single most likely CDS.
  * Do not report UTRs (incomplete gene models).
  * Does not accommodate alternative splice forms.
  * Requires a high-quality assembly (scaffold N50 ≈ average gene size).

Combined Approach
  * The challenge is collating different gene models and sources of evidence.

Annotation Metrics
  * Sensitivity, specificity, accuracy (ACC), Annotation Edit Distance (AED)
  * AED = 1 - ACC = 1 - 0.5 × (sensitivity + specificity)
  * AED is useful for flagging low-quality annotations that are inconsistent with the evidence (these can be manually curated later).

Tools
  * Pipelines: MAKER2, PASA, Ensembl, NCBI
  * Evidence mapping: BLAST/BLAT, Exonerate (computationally expensive)
  * Ab initio gene predictors: Augustus, SNAP, GeneMark
  * Choosers and combiners: JIGSAW, GLEAN
  * Visualization & curation: Artemis, Apollo, JBrowse, IGV
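
The AED formula listed under Annotation Metrics can be sketched in Python as follows. This is a minimal sketch, not the implementation of any annotation pipeline. It assumes the gene-prediction convention in which sensitivity is the fraction of evidence bases covered by the annotation and specificity is the fraction of annotated bases supported by evidence. The function names and the 1-based inclusive interval format are hypothetical choices made for illustration.

```python
def interval_bases(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of base positions."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(annotation, evidence):
    """Return AED = 1 - 0.5 * (sensitivity + specificity) at the base level.

    Assumption: sensitivity = overlap / evidence bases,
                specificity = overlap / annotated bases.
    An annotation or evidence set with no bases is scored as 1.0 (no support).
    """
    ann = interval_bases(annotation)
    ev = interval_bases(evidence)
    if not ann or not ev:
        return 1.0
    overlap = len(ann & ev)
    sensitivity = overlap / len(ev)
    specificity = overlap / len(ann)
    return 1.0 - 0.5 * (sensitivity + specificity)


if __name__ == "__main__":
    # Annotated exon at 1-100, aligned evidence at 51-150: half of each overlaps.
    print(aed([(1, 100)], [(51, 150)]))
```

With these inputs the overlap is 50 bases out of 100 on each side, so sensitivity and specificity are both 0.5 and the AED is 0.5. An AED of 0 means the annotation matches the evidence exactly, and an AED of 1 means it has no evidence support.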