lecture_notes:05-27-2015 [2015/05/28 11:56] emfeal created |
lecture_notes:05-27-2015 [2015/05/28 12:27] emfeal |
====== Genome Annotation ======
===== Repeat Annotation =====
==== Masking ====
  * Done to facilitate conventional gene annotation efforts.
  * Helps avoid false SNP calls and mapping ambiguities.
  * Hard masking: replacing repeats with Ns, e.g. ACGTNNNNNNNNNATGG
  * Soft masking: replacing repeats with lowercase letters, e.g. ACGTtagtagtagATGG
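
The two masking styles above can be sketched in a few lines of Python. This is only an illustration: the function name `mask_repeats` and the repeat coordinates are hypothetical, and in practice the intervals would come from a repeat-finding tool rather than being typed in by hand.

```python
def mask_repeats(seq, repeats, hard=True):
    """Mask half-open [start, end) repeat intervals in seq.

    hard=True replaces repeat bases with N (hard masking);
    hard=False lowercases them instead (soft masking).
    """
    bases = list(seq.upper())
    for start, end in repeats:
        for i in range(start, min(end, len(bases))):
            bases[i] = "N" if hard else bases[i].lower()
    return "".join(bases)


seq = "ACGTTAGTAGTAGATGG"
repeats = [(4, 13)]  # hypothetical coordinates of the TAG repeat (0-based, end-exclusive)

hard_masked = mask_repeats(seq, repeats, hard=True)   # ACGT, then nine Ns, then ATGG
soft_masked = mask_repeats(seq, repeats, hard=False)  # ACGT, then tagtagtag, then ATGG
print(hard_masked)
print(soft_masked)
```

Soft masking keeps the original bases recoverable (uppercasing restores them), whereas hard masking discards them.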
==== Repeat Annotation ====
  * Different types of repeats can be studied along with their levels of activity (evolutionary analyses).
=== Types of Repeats ===
  * Low-complexity sequence: microsatellites, homopolymers, etc.
== Transposable Elements ==
  * Class 1: retrotransposons; "copy & paste"; LTR elements, LINEs, SINEs
  * Class 2: DNA transposons; "cut & paste"; subclass 1 and subclass 2
=== Repeat Content ===
  * Does not necessarily correlate with genome size.
  * There is some correlation within the same group.
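
As a rough illustration of how repeat content might be estimated, the sketch below computes the fraction of bases that are lowercase in a soft-masked sequence. The function name `repeat_fraction` is hypothetical, and excluding N bases from the denominator is an assumption made here, not something the notes prescribe.

```python
def repeat_fraction(soft_masked_seq):
    """Fraction of called (non-N) bases that are lowercase, i.e. soft-masked."""
    # Assumption: N/n bases are gaps or unknowns and are left out of the total.
    called = [b for b in soft_masked_seq if b.upper() != "N"]
    if not called:
        return 0.0
    masked = sum(1 for b in called if b.islower())
    return masked / len(called)


fraction = repeat_fraction("ACGTtagtagtagATGG")  # 9 masked bases out of 17
print(f"repeat content: {fraction:.1%}")
```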
=== Tools ===
  * Homology-based: RepeatMasker
  * De novo: RepeatModeler, WindowMasker, RepeatScout, Piler
  * De novo from reads: REPdenovo, TEDNA
  * NOTE: de novo tools run the risk of false positives from highly conserved protein-coding genes.
===== Gene Annotation =====
==== Evidence-driven Annotation ====
  * Uses protein information, ESTs, and **RNA-Seq**.
==== Ab initio Gene Prediction ====
  * Doesn't require evidence data.
  * Requires training for the organism of interest.
  * Most tools find only the single most likely CDS.
  * UTRs are not reported (incomplete gene model).
  * Spliceoforms are not accommodated.
  * Requires a high-quality assembly (scaffold N50 ≈ average gene size).
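
The assembly-quality rule of thumb above depends on scaffold N50: the length L such that scaffolds of length L or longer contain at least half of the assembled bases. A minimal sketch of that calculation follows; the scaffold lengths and the average gene size are made-up values for illustration only.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


scaffold_lengths = [80_000, 50_000, 30_000, 20_000, 10_000, 10_000]  # hypothetical assembly
average_gene_size = 25_000  # hypothetical value for the organism of interest

scaffold_n50 = n50(scaffold_lengths)
print(scaffold_n50)
print("N50 at least average gene size:", scaffold_n50 >= average_gene_size)
```

When N50 is well below the average gene size, many genes will be split across scaffolds, which is why ab initio predictors struggle on fragmented assemblies.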
==== Combined Approach ====
  * The challenge is collating different gene models and sources of evidence.
==== Annotation Metrics ====
  * Sensitivity, specificity, accuracy (ACC), Annotation Edit Distance (AED)
  * AED = 1 - ACC = 1 - 0.5 × (sensitivity + specificity)
  * AED is useful for identifying low-quality, inconsistent annotations (which can be manually curated later).
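
The AED formula above can be worked through on a toy example. The sketch below assumes one common nucleotide-level convention: sensitivity is the fraction of evidence bases covered by the prediction, and specificity is the fraction of predicted bases supported by evidence. The function names and the exon coordinates are hypothetical.

```python
def covered_bases(intervals):
    """Set of positions covered by half-open (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def annotation_edit_distance(predicted, evidence):
    """Return (sensitivity, specificity, AED) for two sets of exon intervals."""
    pred = covered_bases(predicted)
    evid = covered_bases(evidence)
    if not pred or not evid:
        return 0.0, 0.0, 1.0  # no overlap is possible, so maximal distance
    overlap = len(pred & evid)
    sensitivity = overlap / len(evid)  # assumed: share of evidence bases predicted
    specificity = overlap / len(pred)  # assumed: share of predicted bases supported
    acc = 0.5 * (sensitivity + specificity)
    return sensitivity, specificity, 1 - acc


predicted_exons = [(100, 200), (300, 400)]  # hypothetical gene model
evidence_exons = [(100, 200), (300, 450)]   # hypothetical aligned evidence

sn, sp, aed = annotation_edit_distance(predicted_exons, evidence_exons)
print(f"sensitivity={sn:.2f} specificity={sp:.2f} AED={aed:.2f}")
```

An AED of 0 means the annotation agrees perfectly with its evidence, and an AED of 1 means there is no support at all, so sorting gene models by AED surfaces the candidates for manual curation.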
+ | |||
+ | |||