Annotating GWAS Hits with Gene Information • gwasplot

After identifying genome-wide significant associations, the next question is: what genes do these variants implicate, and what is the evidence?

gwasplot provides three complementary annotation functions:

Function	Source	Works offline?	Output
`find_nearest_gene()`	Bundled Ensembl reference	Yes	Closest protein-coding gene + distance
`annotate_with_l2g()`	Open Targets L2G model	No (API or ~700 MB download)	Likely causal gene + L2G score
`annotate_variants_ensembl()`	Ensembl VEP REST API	No	Predicted functional consequence

Variant IDs use the chr_pos_ref_alt format throughout (e.g. "chr1_55039974_A_G"). Each function strips the chr prefix internally before calling external APIs, so no manual conversion is needed.

Example variants

We use four lead variants from well-studied GWAS loci (GRCh38 coordinates).

library(gwasplot)
#> ℹ Setting duckdb_max_memory to 7GB, using 80% of available system memory.
#> ℹ Change this with options(duckdb_max_memory = 'XGB')
#> 
#> Attaching package: 'gwasplot'
#> The following object is masked from 'package:stats':
#> 
#>     qqplot

known_hits <- data.frame(
  CHROM  = c("chr1",                "chr10",                 "chr16",                "chr19"),
  POS    = c(55039974L,             112998590L,              53767042L,              44908684L),
  ID     = c("chr1_55039974_A_G",   "chr10_112998590_C_T",   "chr16_53767042_T_C",   "chr19_44908684_T_C"),
  PVALUE = c(3.2e-45,               2.4e-38,                 1.7e-67,                9.8e-32)
)
known_hits
#>   CHROM       POS                  ID  PVALUE
#> 1  chr1  55039974   chr1_55039974_A_G 3.2e-45
#> 2 chr10 112998590 chr10_112998590_C_T 2.4e-38
#> 3 chr16  53767042  chr16_53767042_T_C 1.7e-67
#> 4 chr19  44908684  chr19_44908684_T_C 9.8e-32

Nearest gene

find_nearest_gene() uses a DuckDB range join against a bundled Ensembl protein-coding gene reference. No network connection is required.

annotated <- find_nearest_gene(known_hits)
#> ℹ Starting gene annotation...
#> ℹ Loading data into in-memory database
#> ✔ Loading data into in-memory database [361ms]
#> 
#> ℹ Creating optimised gene intervals
#> ✔ Creating optimised gene intervals [53ms]
#> 
#> ℹ Finding nearest genes
#> ✔ Gene annotation completed in 0.51 seconds
#> ℹ Finding nearest genes
✔ Finding nearest genes [46ms]
annotated[, c("ID", "gene_name", "gene_id", "distance")]
#>                    ID gene_name         gene_id distance
#> 1   chr1_55039974_A_G     PCSK9 ENSG00000169174        0
#> 2 chr10_112998590_C_T    TCF7L2 ENSG00000148737        0
#> 3  chr16_53767042_T_C       FTO ENSG00000140718        0
#> 4  chr19_44908684_T_C      APOE ENSG00000130203        0

A distance of 0 means the variant falls inside the gene body. For intergenic variants the distance is in base pairs to the nearest gene boundary.

Open Targets Locus-to-Gene scores

The Open Targets L2G model scores each gene at a locus (0–1) by combining eQTL colocalisation, chromatin accessibility, and distance evidence. Higher scores indicate stronger functional support for a gene being causal at that locus.

annotate_with_l2g() accepts the same data frame and looks up the ID column. The "api" method queries the Open Targets GraphQL API once per variant — suitable for small hit lists.

hits_l2g <- annotate_with_l2g(annotated, method = "api")
hits_l2g[, c("ID", "gene_name", "l2g_gene_name", "l2g_score")]

#>                    ID gene_name l2g_gene_name l2g_score
#> 1   chr1_55039974_A_G     PCSK9         PCSK9     0.913
#> 2 chr10_112998590_C_T    TCF7L2        TCF7L2     0.847
#> 3  chr16_53767042_T_C       FTO           FTO     0.621
#> 4  chr19_44908684_T_C      APOE          APOE     0.965

The l2g_gene_name column is the L2G model’s top-ranked gene and may differ from the nearest gene — most notably at pleiotropic loci where the functional target is not the nearest gene.

For larger hit lists, method = "bulk" downloads the full L2G parquet table (~700 MB, cached locally after the first run) and queries it via DuckDB:

hits_l2g <- annotate_with_l2g(annotated, method = "bulk")

Ensembl VEP consequences

annotate_variants_ensembl() calls the Ensembl VEP REST API to predict the functional consequence of each variant on all overlapping transcripts. The flag_pick = TRUE default returns one representative transcript per variant.

consequences <- annotate_variants_ensembl(known_hits$ID, verbose = TRUE)
consequences[, c("ID", "gene_symbol", "most_severe_consequence", "impact", "hgvsp")]

#>                    ID gene_symbol most_severe_consequence   impact              hgvsp
#> 1   chr1_55039974_A_G       PCSK9        missense_variant MODERATE  ENSP…:p.Ile474Val
#> 2 chr10_112998590_C_T      TCF7L2          intron_variant MODIFIER              <NA>
#> 3  chr16_53767042_T_C         FTO          intron_variant MODIFIER              <NA>
#> 4  chr19_44908684_T_C        APOE        missense_variant MODERATE  ENSP…:p.Cys130Arg

The two missense variants (PCSK9, APOE) have "MODERATE" impact. The intronic variants (TCF7L2, FTO) are "MODIFIER" — they may still be functional via regulatory mechanisms, which is where L2G scores add value.

Combining all annotations

All three functions return a data frame with the original columns intact, so they compose naturally with |>. The IDs match across all outputs since they all use the same chr_pos_ref_alt format:

library(dplyr)

full_annotated <- known_hits |>
  find_nearest_gene() |>
  annotate_with_l2g(method = "api") |>
  left_join(
    consequences |> select(ID, most_severe_consequence, impact, hgvsp),
    by = "ID"
  )

full_annotated |>
  select(ID, gene_name, l2g_gene_name, l2g_score, most_severe_consequence, impact)