6. Gene and variant annotations

Learning outcome

After this chapter, the students can explain the concepts “gene annotation” and “variant annotation” and their roles in analyses of population genetic data. They can evaluate the pitfalls of different automated prediction approaches and choose a suitable method for their own data.

Neutral and non-neutral sites

Early population genetics built on Charles Darwin’s work, and natural selection had a central role. With selection assumed to be omnipresent, the existence of genetic variation and alternative alleles had to be explained by balancing selection and other active forces. The neutral theory by Motoo Kimura challenged this view and proposed that much of the genetic variation is neutral and that the changes are driven by random processes. This started a long-lasting war between selectionists and neutralists.

The Wright-Fisher population model and the coalescence model assume that the sites studied are neutral and that the changes in allele frequencies are the result of a stochastic process and not affected by selection. Many of the analyses on this course are based on this assumption. The assumption of neutrality doesn’t necessarily need to be strict: we can assume that a majority of sites are evolving roughly neutrally and that the sites under strong directional selection are rare enough not to significantly affect the analyses.

Although we have assumed “rough neutrality”, we have not specifically defined what makes the process “neutral” and how we know that the sites are evolving “neutrally”. One reason for the omission is that showing specific sites to evolve neutrally is difficult. Showing that specific sites evolve non-neutrally is slightly easier, and predicting that certain sites probably do not evolve neutrally is even easier.

The easiest criterion for predicting that a site is under selection (and not evolving neutrally) is to show that it has a biological function based on the DNA sequence; many (but not necessarily all) changes in the DNA sequence break this biological function and affect the fitness of the individual carrying the mutation, thus triggering selection on the mutation.

What then is a biological function? The Encyclopedia Of DNA Elements (ENCODE) was a major undertaking to understand the human genome. One of the controversial claims of the final publication (ref and ref) was that 80% of the human genome has a biochemical function. This claim was strongly criticised by evolutionary biologists (ref, ref and ref). There has been little progress since then, and it is still unclear what counts as a real “function” and when an “effect” is mere noise.

Gene annotation

Although we mostly assume neutral evolution on this course, sometimes the function is of interest, or even the main target of the study. One of the easiest genomic functions to predict is the location of genes, although even there the boundary between the “classical” protein-coding genes and the “new” non-coding genes, i.e. genome regions expressed in a gene-like manner, is getting complicated.

The locations of genes can be predicted in many ways. Protein-coding genes must have open reading frames (ORFs) and, if multi-exonic, proper splice sites. These patterns can be detected computationally, but the resulting predictions are rough and noisy. In practice, de novo inference is done from biological data, the spliced transcripts (or peptides) generated by the organisms. Messenger RNAs (mRNAs) can be extracted and sequenced similarly to the genome’s DNA sequence; these sequencing reads can then be mapped to the genome, taking into account the exon structure. A downside of this approach is that not all genes are expressed everywhere, all the time and in equal quantities. This is tackled by combining data from multiple tissues (in the case of an animal, e.g. muscle, liver, testis, brain) and increasing the sequencing depth.
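The computational detection of open reading frames mentioned above can be sketched in a few lines. This is only a minimal illustration under simplifying assumptions: the start codon is ATG, the stop codons are those of the standard code, only the forward strand is scanned and splicing is ignored. The function name `find_orfs` and the example sequence are hypothetical and not part of any tool used on this course.

```python
# Stop codons of the standard genetic code (assumption: nuclear genes).
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs (0-based, end-exclusive) of ATG...stop
    stretches on the forward strand, scanning all three reading frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a new ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close the ORF
    return sorted(orfs)

example = "CCATGGCTTGGTAACC"                   # hypothetical toy sequence
for start, end in find_orfs(example):
    print(start, end, example[start:end])
```

Even on real data such a scan would report many spurious ORFs, which is one reason why predictions based on sequence patterns alone are rough and noisy.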

Even with good resources, it is extremely difficult (or impossible) to cover all genes and alternative transcripts of a complex organism. Furthermore, the transcriptome of a species is not fixed: in addition to allelic differences within genes, there are differences in gene presence/absence and gene copy number between populations and individuals.

In A beginner’s guide to eukaryotic genome annotation, the authors write:

Although the terms ‘gene prediction’ and ‘gene annotation’ are often used as if they are synonyms, they are not. With a few exceptions, gene predictors find the single most likely coding sequence (CDS) of a gene and do not report untranslated regions (UTRs) or alternatively spliced variants. Gene prediction is therefore a somewhat misleading term. A more accurate description might be ‘canonical CDS prediction’.

Gene prediction versus gene annotation.

Gene annotations, conversely, generally include UTRs, alternative splice isoforms and have attributes such as evidence trails. The figure shows a genome annotation and its associated evidence. Terms in parentheses are the names of commonly used software tools for assembling particular types of evidence. Note that the gene annotation (shown in blue) captures both alternatively spliced forms and the 5′ and 3′UTRs suggested by the evidence. By contrast, the gene prediction that is generated by SNAP (shown in green) is incorrect as regards the gene’s 5′ exons and start-of-translation site and, like most gene-predictors, it predicts only a single transcript with no UTR.

Gene annotation is thus a more complex task than gene prediction. A pipeline for genome annotation must not only deal with heterogeneous types of evidence in the form of the expressed sequence tags (ESTs), RNA-seq data, protein homologies and gene predictions, but it must also synthesize all of these data into coherent gene models and produce an output that describes its results in sufficient detail for these outputs to become suitable inputs to genome browsers and annotation databases.

On this course, we do not do de novo gene prediction but use homology and transfer the gene predictions from another species, the three-spined stickleback, which is included in the major genome databases.

Structure of a gene

The gene structure is described in the Wikipedia article like this:

Schematic view of eukaryote gene structure.

In the uppermost row, the different elements are indicated with different colours and, in principle, our genome annotation could include all of these. In practice, the gene annotation often consists of the red blocks only. The grey blocks, “introns”, can then be derived from these, as well as the proximity to a gene, e.g. “upstream of gene” (sometimes called “promoter”), “downstream of gene” etc.
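How the introns and the “upstream” and “downstream” labels follow from the exon coordinates alone can be sketched as below. The exon coordinates, the flank size and the function names are hypothetical, and the sketch assumes a gene on the forward strand with 1-based, inclusive coordinates; for a gene on the reverse strand, “upstream” and “downstream” would be swapped.

```python
def introns_from_exons(exons):
    """Exons are (start, end) pairs, 1-based and inclusive.
    Introns are the gaps between consecutive exons."""
    exons = sorted(exons)
    introns = []
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        if start2 > end1 + 1:
            introns.append((end1 + 1, start2 - 1))
    return introns

def classify(pos, exons, flank=1000):
    """Label a position relative to a forward-strand gene (assumption)."""
    exons = sorted(exons)
    gene_start, gene_end = exons[0][0], exons[-1][1]
    if any(s <= pos <= e for s, e in exons):
        return "exon"
    if any(s <= pos <= e for s, e in introns_from_exons(exons)):
        return "intron"
    if gene_start - flank <= pos < gene_start:
        return "upstream"
    if gene_end < pos <= gene_end + flank:
        return "downstream"
    return "intergenic"

exons = [(100, 200), (300, 400), (550, 700)]   # hypothetical gene
print(introns_from_exons(exons))
for pos in (150, 250, 50, 750, 5000):
    print(pos, classify(pos, exons, flank=100))
```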

This figure lacks details, however, and especially in more complex organisms, genes often produce multiple different transcripts, some of which may be non-functional and directed immediately to a decay pathway. The figures below show the alternative transcripts of the human BRCA2 gene:

BRCA2 transcripts

BRCA2 transcripts

Such accurate annotations are not available for many other species.

Annotation lift-over

There are different ways of transferring the information from a closely related model species to the target species. The people behind the UCSC Genome Browser developed a tool called liftOver to convert coordinates in one genome to those in another genome, and called this transfer “lifting”. If one has a description of homologous regions in the two genomes, the procedure is fairly simple. Let’s assume that a region in ctg2078 in species_A is homologous to a region in chr3 in species_B. We could then write conversion tables like these to convert a coordinate in one species to another:

ctg2078 19679   chr3    207555
ctg2078 19680   chr3    207556
ctg2078 19681   chr3    207557
ctg2078 19682   chr3    207558
ctg2078 19683   chr3    207559
ctg2078 19684   chr3    207560
ctg2078 19685   chr3    207561
ctg2078 19686   chr3    207562
ctg2078 19687   chr3    207563
ctg2078 19688   chr3    207564
ctg2078 19689   chr3    207566
ctg2078 19690   chr3    207567
ctg2078 19691   chr3    207568
ctg2078 19692   chr3    207569
ctg2078 19693   chr3    207570
ctg2078 19694   chr3    207571
ctg2078 19695   chr3    207572

From species_A (cols 1,2) to species_B (cols 3,4)

chr3    207555  ctg2078 19679
chr3    207556  ctg2078 19680
chr3    207557  ctg2078 19681
chr3    207558  ctg2078 19682
chr3    207559  ctg2078 19683
chr3    207560  ctg2078 19684
chr3    207561  ctg2078 19685
chr3    207562  ctg2078 19686
chr3    207563  ctg2078 19687
chr3    207564  ctg2078 19688
chr3    207565  NA      NA
chr3    207566  ctg2078 19689
chr3    207567  ctg2078 19690
chr3    207568  ctg2078 19691
chr3    207569  ctg2078 19692
chr3    207570  ctg2078 19693
chr3    207571  ctg2078 19694
chr3    207572  ctg2078 19695

From species_B (cols 1,2) to species_A (cols 3,4)

To convert “ctg2078:19685” to “chr3:207561”, we just need to find the row where the first two columns match and then read the new coordinate from the last two columns. To do it the other way round, the columns are swapped (second table). Note that chr3:207565 has no counterpart in species_A and is therefore marked NA.

Space-wise, this would be very inefficient, however. We can see that most sites follow in order and there is only one small gap in the other genome. If the genome were large, this kind of index file would be several gigabytes in size and very slow to read into computer memory.

The same task is done in a more compact way by using the Chain Format. The one above is an artificial example, but the same information could be stored e.g. in this manner:

chain 1200 ctg2078 56789 + 19679 19696 chr3 5456798 + 207555 207573 1
10 0 1
7

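(Many of the numbers above are completely arbitrary.) The header line gives the chain score, followed by the name, size, strand, start and end of the region in each of the two genomes. The line “10 0 1” describes an aligned block of 10 sites, followed by a gap of 0 sites in the first genome and 1 site in the second genome; the final line “7” is the last aligned block of 7 sites.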
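To see how little is needed to recover the full conversion table from a chain, here is a minimal sketch that expands the chain above into aligned blocks and lifts single positions. It is not how liftOver is implemented: the function names are hypothetical, only forward-strand chains are handled, and the positions are treated as being in the same coordinate system as the start values in the chain header. The header in the code carries an identifier field (1) at the end, as in the output of chainSwap.

```python
CHAIN = """\
chain 1200 ctg2078 56789 + 19679 19696 chr3 5456798 + 207555 207573 1
10 0 1
7
"""

def parse_chain(text):
    """Expand one forward-strand chain into (start_A, start_B, size) blocks."""
    lines = [line.split() for line in text.strip().splitlines()]
    header = lines[0]
    pos_a, pos_b = int(header[5]), int(header[10])   # start in each genome
    blocks = []
    for fields in lines[1:]:
        size = int(fields[0])
        blocks.append((pos_a, pos_b, size))
        pos_a += size
        pos_b += size
        if len(fields) == 3:                          # gaps before next block
            pos_a += int(fields[1])
            pos_b += int(fields[2])
    return blocks

def lift(pos, blocks):
    """Convert a position in genome A to genome B; None if not in any block."""
    for start_a, start_b, size in blocks:
        if start_a <= pos < start_a + size:
            return start_b + (pos - start_a)
    return None

blocks = parse_chain(CHAIN)
swapped = [(b, a, size) for a, b, size in blocks]     # B-to-A direction

print(lift(19685, blocks))      # ctg2078:19685 in species_A
print(lift(207565, swapped))    # chr3:207565 falls in the gap
```

Two lines of block data replace the 17 rows of the table; for real genomes with long collinear stretches the saving is correspondingly larger.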

There’s a program chainSwap that can swap the direction of the chain:

chain 1200 chr3 5456798 + 207555 207573 ctg2078 56789 + 19679 19696 1
10  1   0
7

In practice, the chains are of course much longer and more complex, but they are very space-efficient.

With a lift-over chain, the positions in one genome can be converted to positions in another genome. Note, though, that some genome regions may be unique to one species, may have evolved beyond recognition or may have been duplicated, and a chain file rarely includes all positions. If sites are missing from the chain, the coordinates at those sites cannot be transferred.

A downside of this approach is that the positions in the genome are transferred without any check whether the regions defined in the files show any similarity in the two genomes; it is enough that the start and end positions of the elements are inferred to be homologous. Some “lift-over” programs check that the element transferred to a new genome is exactly as long as the original element: this is often unnecessarily strict and e.g. gene regions tolerate indels within exons and can be correct even with length changes.

Alignment of gene or protein sequences only

The generation of a lift-over chain file used to be a bit cumbersome and, as it requires a full genome alignment, computationally heavy. LiftOff provides an alternative approach and only needs to align the regions that are lifted. LiftOff uses the gene annotation for species_A to extract the gene sequences from species_A’s genome, and then aligns these gene sequences to species_B’s genome. From the resulting alignments, it computes the location of the genes and their exons in species_B’s genome and outputs the information in the structured format of a gene annotation.

A variant of this approach is not to use DNA (gene) sequences but protein sequences. There are a few programs that can align protein sequences to genomic sequences, considering the alternative reading frames and exon splicing in the process. The most efficient of these methods is miniprot. The program can output the alignments in a structured format but unfortunately, this format is not directly compatible with all downstream processing tools.

Using the gene structure figure, LiftOff takes the middle row from one species as input and aligns it to the top row of another species; from the alignment, it then calculates the block boundaries. On the other hand, miniprot takes the bottom row from one species as input and aligns it to the top row of another species; from the alignment, it then resolves the block boundaries.

Schematic view of eukaryote gene structure.

A potentially serious downside of these approaches is that they can’t be used to transfer the locations of small elements, e.g. transcription factor binding sites, from one genome to another: to get reliable alignments, the elements have to be unique and long enough and, in the case of miniprot, protein-coding.

Variant annotation

Types of consequences

Given the annotation of genes, the location of VCF variants relative to the genes can be easily resolved. Ensembl provides a comprehensive description of the variants contained in their Variation database; the consequence types are visualised below and listed in detail in Ensembl’s documentation of calculated variant consequences:

Variant consequences as defined by Ensembl.

The terms in the figure are listed in the table below, shown in order of severity (more severe to less severe) as estimated by Ensembl.

SO term | SO description | Display term | IMPACT
transcript_ablation | A feature ablation whereby the deleted region includes a transcript feature | Transcript ablation | HIGH
splice_acceptor_variant | A splice variant that changes the 2 base region at the 3’ end of an intron | Splice acceptor variant | HIGH
splice_donor_variant | A splice variant that changes the 2 base region at the 5’ end of an intron | Splice donor variant | HIGH
stop_gained | A sequence variant whereby at least one base of a codon is changed, resulting in a premature stop codon, leading to a shortened transcript | Stop gained | HIGH
frameshift_variant | A sequence variant which causes a disruption of the translational reading frame, because the number of nucleotides inserted or deleted is not a multiple of three | Frameshift variant | HIGH
stop_lost | A sequence variant where at least one base of the terminator codon (stop) is changed, resulting in an elongated transcript | Stop lost | HIGH
start_lost | A codon variant that changes at least one base of the canonical start codon | Start lost | HIGH
transcript_amplification | A feature amplification of a region containing a transcript | Transcript amplification | HIGH
feature_elongation | A sequence variant that causes the extension of a genomic feature, with regard to the reference sequence | Feature elongation | HIGH
feature_truncation | A sequence variant that causes the reduction of a genomic feature, with regard to the reference sequence | Feature truncation | HIGH
inframe_insertion | An inframe non synonymous variant that inserts bases into in the coding sequence | Inframe insertion | MODERATE
inframe_deletion | An inframe non synonymous variant that deletes bases from the coding sequence | Inframe deletion | MODERATE
missense_variant | A sequence variant, that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved | Missense variant | MODERATE
protein_altering_variant | A sequence_variant which is predicted to change the protein encoded in the coding sequence | Protein altering variant | MODERATE
splice_donor_5th_base_variant | A sequence variant that causes a change at the 5th base pair after the start of the intron in the orientation of the transcript | Splice donor 5th base variant | LOW
splice_region_variant | A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron | Splice region variant | LOW
splice_donor_region_variant | A sequence variant that falls in the region between the 3rd and 6th base after splice junction (5’ end of intron) | Splice donor region variant | LOW
splice_polypyrimidine_tract_variant | A sequence variant that falls in the polypyrimidine tract at 3’ end of intron between 17 and 3 bases from the end (acceptor -3 to acceptor -17) | Splice polypyrimidine tract variant | LOW
incomplete_terminal_codon_variant | A sequence variant where at least one base of the final codon of an incompletely annotated transcript is changed | Incomplete terminal codon variant | LOW
start_retained_variant | A sequence variant where at least one base in the start codon is changed, but the start remains | Start retained variant | LOW
stop_retained_variant | A sequence variant where at least one base in the terminator codon is changed, but the terminator remains | Stop retained variant | LOW
synonymous_variant | A sequence variant where there is no resulting change to the encoded amino acid | Synonymous variant | LOW
coding_sequence_variant | A sequence variant that changes the coding sequence | Coding sequence variant | MODIFIER
mature_miRNA_variant | A transcript variant located with the sequence of the mature miRNA | Mature miRNA variant | MODIFIER
5_prime_UTR_variant | A UTR variant of the 5’ UTR | 5 prime UTR variant | MODIFIER
3_prime_UTR_variant | A UTR variant of the 3’ UTR | 3 prime UTR variant | MODIFIER
non_coding_transcript_exon_variant | A sequence variant that changes non-coding exon sequence in a non-coding transcript | Non coding transcript exon variant | MODIFIER
intron_variant | A transcript variant occurring within an intron | Intron variant | MODIFIER
NMD_transcript_variant | A variant in a transcript that is the target of NMD | NMD transcript variant | MODIFIER
non_coding_transcript_variant | A transcript variant of a non coding RNA gene | Non coding transcript variant | MODIFIER
coding_transcript_variant | A transcript variant of a protein coding gene | Coding transcript variant | MODIFIER
upstream_gene_variant | A sequence variant located 5’ of a gene | Upstream gene variant | MODIFIER
downstream_gene_variant | A sequence variant located 3’ of a gene | Downstream gene variant | MODIFIER
TFBS_ablation | A feature ablation whereby the deleted region includes a transcription factor binding site | TFBS ablation | MODIFIER
TFBS_amplification | A feature amplification of a region containing a transcription factor binding site | TFBS amplification | MODIFIER
TF_binding_site_variant | A sequence variant located within a transcription factor binding site | TF binding site variant | MODIFIER
regulatory_region_ablation | A feature ablation whereby the deleted region includes a regulatory region | Regulatory region ablation | MODIFIER
regulatory_region_amplification | A feature amplification of a region containing a regulatory region | Regulatory region amplification | MODIFIER
regulatory_region_variant | A sequence variant located within a regulatory region | Regulatory region variant | MODIFIER
intergenic_variant | A sequence variant located in the intergenic region, between genes | Intergenic variant | MODIFIER
sequence_variant | A sequence_variant is a non exact copy of a sequence_feature or genome exhibiting one or more sequence_alteration | Sequence variant | MODIFIER
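Because the terms are ordered by severity, a variant that receives several consequence terms (for example from overlapping transcripts) can be summarised by the most severe one. The sketch below shows the idea with a small subset of the terms, kept in the same relative order as in the table; the function name `most_severe` is hypothetical and this is not the implementation used by Ensembl.

```python
# A subset of SO terms, ordered from more severe to less severe
# (same relative order as in the Ensembl table above).
SEVERITY_ORDER = [
    "transcript_ablation",
    "splice_acceptor_variant",
    "splice_donor_variant",
    "stop_gained",
    "frameshift_variant",
    "stop_lost",
    "start_lost",
    "missense_variant",
    "synonymous_variant",
    "intron_variant",
    "upstream_gene_variant",
    "intergenic_variant",
]
RANK = {term: rank for rank, term in enumerate(SEVERITY_ORDER)}

def most_severe(consequences):
    """Return the term with the smallest rank, i.e. the most severe one."""
    return min(consequences, key=RANK.__getitem__)

# A variant that is intronic in one transcript and missense in another:
print(most_severe(["intron_variant", "missense_variant", "synonymous_variant"]))
```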


Coding changes and genetic codes

Traditionally, much of the interest has focused on coding sequence (CDS) variants and, among those, on synonymous and non-synonymous changes. A significant reason for that is again their relative simplicity and therefore their suitability for mathematical modelling.

Synonymous variants affect CDS but do not change the amino acid that the affected codon codes for. In micro-organisms, the efficiency of gene transcription and translation is often optimised to such extent that even the availability of transfer RNAs for different codons and the efficiency of their incorporation matters. In such cases, synonymous codons may have fitness differences and not be selectively neutral. In higher organisms, synonymous variants, as such, are considered to be approximately neutral; however, the gene region with a variant may have other functions than just protein coding, and a neutral variant may be linked to a nearby variant under selection and thus indirectly affected by selection.

Non-synonymous variants – also called missense variants – change the codon such that the encoded amino acid changes. The amino acids differ greatly in their properties and the severity of the change is typically not reported.

The synonymous and missense mutations are naturally a consequence of the redundancies in the genetic code; several codons code for the same amino acid:

Standard genetic code

Synonymous changes are mostly in the codon third position, but changes in the codon first position can also be synonymous. In addition to that, the amino-acid changes caused by first and third position changes tend to be less dramatic than those caused by second position changes. As an example of that, we can study the scores in the BLOSUM tables. The series of BLOSUM tables (used e.g. by BLAST) are generated from homologous protein sequences that have diverged by a certain amount. Each amino-acid pair gets a log-odds score that reflects how frequently the pair is observed in the data compared to the chance of seeing the two together at random: \[ \text{LogOddsRatio} = 2 \log_2 \left(\frac{P(O)}{P(E)}\right) \] where \(P(O)\) is the observed probability of the pair and \(P(E)\) is the probability expected by chance.
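A small numerical example may help in reading the scores. The probabilities below are invented for illustration only and are not taken from any real BLOSUM data; the function name `log_odds_score` is likewise hypothetical. The scores are rounded to integers, as they are in the tables.

```python
import math

def log_odds_score(p_observed, p_expected):
    """Score = 2 * log2(P(O) / P(E)), rounded to the nearest integer."""
    return round(2 * math.log2(p_observed / p_expected))

# Hypothetical pair seen 8 times more often than expected by chance:
print(log_odds_score(0.08, 0.01))
# Hypothetical pair seen 4 times less often than expected by chance:
print(log_odds_score(0.005, 0.02))
# Pair seen exactly as often as expected by chance:
print(log_odds_score(0.01, 0.01))
```

With this scaling, each doubling of the observed-to-expected ratio adds two units to the score, a ratio of one gives zero, and pairs that are rarer than expected get negative scores.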

The scores are highly positive for pairs that are seen frequently, such as Leucine (L) matched with Leucine that has the score 6:

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  7 -3 -3 -3 -1 -2 -2  0 -3 -3 -3 -1 -2 -4 -1  2  0 -5 -4 -1
R -3  9 -1 -3 -6  1 -1 -4  0 -5 -4  3 -3 -5 -3 -2 -2 -5 -4 -4
N -3 -1  9  2 -5  0 -1 -1  1 -6 -6  0 -4 -6 -4  1  0 -7 -4 -5
D -3 -3  2 10 -7 -1  2 -3 -2 -7 -7 -2 -6 -6 -3 -1 -2 -8 -6 -6
C -1 -6 -5 -7 13 -5 -7 -6 -7 -2 -3 -6 -3 -4 -6 -2 -2 -5 -5 -2
Q -2  1  0 -1 -5  9  3 -4  1 -5 -4  2 -1 -5 -3 -1 -1 -4 -3 -4
E -2 -1 -1  2 -7  3  8 -4  0 -6 -6  1 -4 -6 -2 -1 -2 -6 -5 -4
G  0 -4 -1 -3 -6 -4 -4  9 -4 -7 -7 -3 -5 -6 -5 -1 -3 -6 -6 -6
H -3  0  1 -2 -7  1  0 -4 12 -6 -5 -1 -4 -2 -4 -2 -3 -4  3 -5
I -3 -5 -6 -7 -2 -5 -6 -7 -6  7  2 -5  2 -1 -5 -4 -2 -5 -3  4
L -3 -4 -6 -7 -3 -4 -6 -7 -5  2  6 -4  3  0 -5 -4 -3 -4 -2  1
K -1  3  0 -2 -6  2  1 -3 -1 -5 -4  8 -3 -5 -2 -1 -1 -6 -4 -4
M -2 -3 -4 -6 -3 -1 -4 -5 -4  2  3 -3  9  0 -4 -3 -1 -3 -3  1
F -4 -5 -6 -6 -4 -5 -6 -6 -2 -1  0 -5  0 10 -6 -4 -4  0  4 -2
P -1 -3 -4 -3 -6 -3 -2 -5 -4 -5 -5 -2 -4 -6 12 -2 -3 -7 -6 -4
S  2 -2  1 -1 -2 -1 -1 -1 -2 -4 -4 -1 -3 -4 -2  7  2 -6 -3 -3
T  0 -2  0 -2 -2 -1 -2 -3 -3 -2 -3 -1 -1 -4 -3  2  8 -5 -3  0
W -5 -5 -7 -8 -5 -4 -6 -6 -4 -5 -4 -6 -3  0 -7 -6 -5 16  3 -5
Y -4 -4 -4 -6 -5 -3 -5 -6  3 -3 -2 -4 -3  4 -6 -3 -3  3 11 -3
V -1 -4 -5 -6 -2 -4 -4 -6 -5  4  1 -4  1 -2 -4 -3  0 -5 -3  7

If we have a closer look at the scores for Leucine pairing with any other amino acid, we can see that the highest scores are for amino acids whose codons differ from those of Leucine by a change in the codon first position:

AA          Leu  Met  Ile  Val  Phe  Tyr  Ala  Cys  Thr  Arg  Gln  Lys  Ser  Trp  His  Pro  Asn  Glu  Asp  Gly
score(Leu)  6    3    2    1    0    -2   -3   -3   -3   -4   -4   -4   -4   -4   -5   -5   -6   -6   -7   -7

Those amino acids are also chemically the most similar.
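The claim about codon positions can be checked directly from the standard genetic code. The sketch below lists, for each codon position, the amino acids that can be reached from any Leucine codon by changing a single base at that position. The 64-letter string is the standard code written out in T, C, A, G order; the function name is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def neighbours_by_position(amino_acid):
    """For codon positions 1-3, collect the amino acids reachable from any
    codon of `amino_acid` by a single-base change at that position."""
    reach = {1: set(), 2: set(), 3: set()}
    for codon, aa in CODE.items():
        if aa != amino_acid:
            continue
        for pos in range(3):
            for base in BASES:
                if base != codon[pos]:
                    mutant = codon[:pos] + base + codon[pos + 1:]
                    reach[pos + 1].add(CODE[mutant])
    return reach

for pos, aas in neighbours_by_position("L").items():
    print(pos, "".join(sorted(aas)))
```

First-position changes lead to Phe, Ile, Met and Val (or stay Leu), which are exactly the amino acids with the highest scores in the row above, whereas second-position changes lead to the low-scoring His, Pro, Gln, Arg, Ser and Trp or to a stop codon.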

If studying something other than the nuclear DNA of common animals or plants, it is important to consider the genetic code. The most frequently needed alternative code is that for the vertebrate mitochondrion:

Genetic code for vertebrate mitochondrion

The difference is not huge but, as most of the affected codons are stop codons in one of the two codes, highly significant:

          TGA  ATA  AGA  AGG
Standard  *    I    R    R
Vert_Mito W    M    *    *
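The four differences can be recovered by comparing the two codes codon by codon. In the sketch below, the two 64-letter strings are the standard code and the vertebrate mitochondrial code (NCBI translation tables 1 and 2) written out by hand in T, C, A, G order; the function name is hypothetical.

```python
BASES = "TCAG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

# Amino acids for the codons TTT, TTC, TTA, TTG, TCT, ... GGG.
STANDARD  = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERT_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"

def code_differences(code_a, code_b):
    """Return {codon: (amino acid in code_a, amino acid in code_b)}
    for the codons that are translated differently."""
    return {codon: (x, y)
            for codon, x, y in zip(CODONS, code_a, code_b)
            if x != y}

for codon, (std, mito) in code_differences(STANDARD, VERT_MITO).items():
    print(codon, std, mito)
```

Translating a mitochondrial gene with the standard code would thus truncate the protein at the first TGA codon and read through the true AGA and AGG stop codons.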


The NCBI website lists more than two dozen different genetic codes, with table numbers running up to 33!

Combined effects

When inferring the impact of a variant, it is crucial to note that nearby variants may jointly have a different consequence than any one of them alone. Let’s assume that we have three genomic positions that together form the codon “TAT” and code for Tyrosine, and an individual is heterozygous at the second and third position:

chr    1234    T   .   0/0
chr    1235    A   G   0/1
chr    1236    T   G   0/1

It is now important to know if the two alternative alleles are inherited together:

chr    1234    T   .   0|0
chr    1235    A   G   0|1
chr    1236    T   G   0|1

or separately:

chr    1234    T   .   0|0
chr    1235    A   G   1|0
chr    1236    T   G   0|1

If they are inherited together, the alternative codon is “TGG” and codes for Tryptophan. If they are inherited separately, the segregating codons are “TGT” and “TAG”, which code for Cysteine and a stop codon. Even if the VCF data are not specifically phased (i.e., the co-inheritance of alleles resolved), the phase of many nearby variants is automatically resolved from the sequencing data: if the variants are in the same read, they must come from the same parental chromosome.
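The two scenarios can be reproduced with a few lines of code that build the two haplotype codons from the reference codon and the phased genotypes and then translate them. This is only a sketch of the reasoning, not of how any annotation program works internally; the function name and the input layout (offset within the codon, alternative base, phased genotype) are assumptions made for the example.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def haplotype_codons(ref_codon, variants):
    """variants: (offset in codon, ALT base, phased genotype such as '0|1').
    Returns the codon carried by each of the two haplotypes."""
    haplotypes = [list(ref_codon), list(ref_codon)]
    for offset, alt, genotype in variants:
        for hap, allele in enumerate(genotype.split("|")):
            if allele == "1":
                haplotypes[hap][offset] = alt
    return ["".join(h) for h in haplotypes]

together = haplotype_codons("TAT", [(1, "G", "0|1"), (2, "G", "0|1")])
separate = haplotype_codons("TAT", [(1, "G", "1|0"), (2, "G", "0|1")])

print(together, [CODE[c] for c in together])
print(separate, [CODE[c] for c in separate])
```

Annotating each variant on its own would report two missense changes in both cases and miss the stop codon that appears only when the alleles are on different haplotypes.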

One of the variant annotation programs is bcftools csq and it considers all nearby variants together. The program website gives examples of the impact of nearby variants:

Three types of compound variants that lead to incorrect consequence prediction when handled in a localized manner each separately rather than jointly.

In addition to adjacent variants causing multinucleotide changes, indels and variants separated by introns can also have combined effects.

Structure of a gene annotation

A gene can code for multiple transcripts and each transcript may have multiple exons, some of which code for a protein. To be able to infer the impact of a coding variant, the annotation program has to understand the position of each exon in the whole. For that, the gene annotation has to be hierarchical and structured correctly. While the element type (“gene”, “mRNA”, “exon”, “CDS” etc.) is in the third column of a GFF file, the relationships between individual elements are in the ninth column. Below is a minimal example by bcftools csq for the description of CDS → mRNA → gene relationships:

# The program looks for "CDS", "exon", "three_prime_UTR" and "five_prime_UTR" lines,    
# looks up their parent transcript (determined from the "Parent=transcript:" attribute),    
# the gene (determined from the transcript's "Parent=gene:" attribute), and the biotype    
# (the most interesting is "protein_coding").    
#    
# Empty and commented lines are skipped, the following GFF columns are required
#   1.  chromosome
#   2.  IGNORED
#   3.  type (CDS, exon, three_prime_UTR, five_prime_UTR, gene, transcript, etc.)
#   4.  start of the feature (1-based)
#   5.  end of the feature (1-based)
#   6.  IGNORED
#   7.  strand (+ or -)
#   8.  phase (0, 1, 2 or .)
#   9.  semicolon-separated attributes (see below)
#
# Attributes required for
#   gene lines:
#   - ID=gene:
#   - biotype=
#   - Name=      [optional]
#
#   transcript lines:
#   - ID=transcript:
#   - Parent=gene:
#   - biotype=
#
#   other lines (CDS, exon, five_prime_UTR, three_prime_UTR):
#   - Parent=transcript:
#
# Supported biotypes:
#   - see the function gff_parse_biotype() in bcftools/csq.c

1   ignored_field  gene            21  2148  .   -   .   ID=gene:GeneId;biotype=protein_coding;Name=GeneName
1   ignored_field  transcript      21  2148  .   -   .   ID=transcript:TranscriptId;Parent=gene:GeneId;biotype=protein_coding
1   ignored_field  three_prime_UTR 21  2054  .   -   .   Parent=transcript:TranscriptId
1   ignored_field  exon            21  2148  .   -   .   Parent=transcript:TranscriptId
1   ignored_field  CDS             21  2148  .   -   1   Parent=transcript:TranscriptId
1   ignored_field  five_prime_UTR  210 2148  .   -   .   Parent=transcript:TranscriptId
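How a program can follow the CDS → transcript → gene relationships through the ninth column can be sketched as below. This is not the parser of bcftools csq, only an illustration with hypothetical function names; the input repeats four of the example lines above, written with tab characters as GFF files are tab-separated.

```python
GFF_TEXT = "\n".join([
    "1\tignored_field\tgene\t21\t2148\t.\t-\t.\t"
    "ID=gene:GeneId;biotype=protein_coding;Name=GeneName",
    "1\tignored_field\ttranscript\t21\t2148\t.\t-\t.\t"
    "ID=transcript:TranscriptId;Parent=gene:GeneId;biotype=protein_coding",
    "1\tignored_field\texon\t21\t2148\t.\t-\t.\tParent=transcript:TranscriptId",
    "1\tignored_field\tCDS\t21\t2148\t.\t-\t1\tParent=transcript:TranscriptId",
])

def parse_attributes(field):
    """Turn 'key=value;key=value' into a dictionary."""
    return dict(item.split("=", 1) for item in field.split(";") if item)

def build_hierarchy(gff_text):
    """Return gene names, transcript-to-gene links and the features
    (exon, CDS, UTRs) attached to each transcript."""
    genes, transcripts, parts = {}, {}, {}
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip empty and commented lines
        cols = line.split("\t")
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if ftype == "gene":
            genes[attrs["ID"]] = attrs.get("Name")
        elif ftype == "transcript":
            transcripts[attrs["ID"]] = attrs["Parent"]
        else:
            parts.setdefault(attrs["Parent"], []).append((ftype, start, end))
    return genes, transcripts, parts

genes, transcripts, parts = build_hierarchy(GFF_TEXT)
for transcript_id, features in parts.items():
    gene_id = transcripts[transcript_id]
    print(genes[gene_id], gene_id, transcript_id, features)
```

If a Parent attribute is missing or points to an identifier that does not exist, the chain from CDS to gene breaks, which is why the annotation has to be structured correctly before variant consequences can be computed.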

It is notable that one variant position can be part of many gene products and its impact on different products can vary. In principle, a genomic region could be a part of two overlapping genes in different reading frames, and a variant could then be synonymous (and neutral) in one product but cause a drastic effect in the other. With many overlapping gene annotations, the variant annotation can be very complex.

Other predictions

In addition to mechanistic predictions, there are many empirical measures of the severity of a variant. Unsurprisingly, these are best developed for humans; however, Ensembl provides measures based on evolutionary conservation for many other species and these can often be transferred to closely related species. The description of Ensembl Pathogenicity predictions gives more details.


Take-home message

Gene annotations allow inferring the impact of genomic variants on the gene products. For well-studied systems such as humans, the gene annotations are comprehensive and the impact of a variant can be predicted precisely. However, the predictions are mechanical and the empirical evidence may give a conflicting signal, e.g. indicating strong evolutionary conservation of a site with no apparent impact based on the annotation model.

Variant annotations are extremely sensitive to the correctness of the gene annotations. If the reading frame is wrong, all predictions of variant effects are likely wrong.