
Get Started with glyfun
glyfun.Rmdglyfun is a package for functional enrichment analysis
of glycoproteomics data. Why does glycoproteomics need a dedicated tool?
For transcriptomics and proteomics, functional enrichment analysis is
usually fairly straightforward. For glycoproteomics, however, we need to
account for both macro- and micro-heterogeneity. Macro-heterogeneity
means that one protein can carry multiple glycosylation sites, while
micro-heterogeneity means that each site can carry multiple glycan
structures. As a result, a single protein may have several dysregulated
glycosylation events across different sites and glycans. That extra
layer of complexity makes traditional enrichment workflows harder to
apply directly.
In general, glyfun answers two questions:
- Which functions are enriched for proteins with dysregulated glycosylation?
- Which functions are enriched for proteins with a specific dysregulated glycosylation event?
The following sections walk through both perspectives and show when each one is useful.
Prerequisites: To follow this vignette, you should
be familiar with the core glycoverse packages,
especially glyexp and
glystats. For the
glycan-centric enrichment analysis in the second half, you will also
need some familiarity with glydet. Experience with
clusterProfiler is helpful too, since glyfun
returns standard enrichment result objects that you can inspect and
visualize with that ecosystem.
Approach 1: Protein-Centric Enrichment Analysis
Let’s start with the first question. We first run differential analysis on the glycoforms.
library(glyexp)
library(glyclean)
#>
#> Attaching package: 'glyclean'
#> The following object is masked from 'package:stats':
#>
#> aggregate
library(glystats)
exp <- auto_clean(real_experiment) |> filter_obs(group %in% c("H", "C"))
#>
#> ── Normalizing data ──
#>
#> ℹ No QC samples found. Using default normalization method based on experiment type.
#> ℹ Experiment type is "glycoproteomics". Using `normalize_median()`.
#> ✔ Normalization completed.
#>
#> ── Removing variables with too many missing values ──
#>
#> ℹ No QC samples found. Using all samples.
#> ℹ Applying preset "discovery"...
#> ℹ Total removed: 24 (0.56%) variables.
#> ✔ Variable removal completed.
#>
#> ── Imputing missing values ──
#>
#> ℹ No QC samples found. Using default imputation method based on sample size.
#> ℹ Sample size <= 30, using `impute_sample_min()`.
#> ✔ Imputation completed.
#>
#> ── Aggregating data ──
#>
#> ℹ Aggregating to "gfs" level
#> ✔ Aggregation completed.
#>
#> ── Normalizing data again ──
#>
#> ℹ No QC samples found. Using default normalization method based on experiment type.
#> ℹ Experiment type is "glycoproteomics". Using `normalize_median()`.
#> ✔ Normalization completed.
#>
#> ── Correcting batch effects ──
#>
#> ℹ Batch column not found in sample_info. Skipping batch correction.
#> ✔ Batch correction completed.
dea_res <- gly_limma(exp)
#> ℹ Ref Group: "H"
#> ℹ Test Group: "C"
get_tidy_result(dea_res)
#> # A tibble: 3,979 × 14
#> variable protein glycan_composition glycan_structure protein_site gene
#> <chr> <chr> <comp> <struct> <int> <chr>
#> 1 P08185-176-He… P08185 Hex(5)HexNAc(4)Ne… NeuAc(??-?)Hex(… 176 SERP…
#> 2 P04196-344-He… P04196 Hex(5)HexNAc(4)Ne… NeuAc(??-?)Hex(… 344 HRG
#> 3 P04196-344-He… P04196 Hex(5)HexNAc(4) Hex(??-?)HexNAc… 344 HRG
#> 4 P04196-344-He… P04196 Hex(5)HexNAc(4)Ne… NeuAc(??-?)Hex(… 344 HRG
#> 5 P10909-291-He… P10909 Hex(6)HexNAc(5) Hex(??-?)HexNAc… 291 CLU
#> 6 P04196-344-He… P04196 Hex(5)HexNAc(4)Ne… NeuAc(??-?)Hex(… 344 HRG
#> 7 P04196-345-He… P04196 Hex(5)HexNAc(4) Hex(??-?)HexNAc… 345 HRG
#> 8 P04196-344-He… P04196 Hex(5)HexNAc(4)dH… dHex(??-?)Hex(?… 344 HRG
#> 9 P04196-344-He… P04196 Hex(4)HexNAc(3) Hex(??-?)HexNAc… 344 HRG
#> 10 P04196-344-He… P04196 Hex(4)HexNAc(4)Ne… NeuAc(??-?)Hex(… 344 HRG
#> # ℹ 3,969 more rows
#> # ℹ 8 more variables: log2fc <dbl>, AveExpr <dbl>, t <dbl>, p_val <dbl>,
#> # p_adj <dbl>, b <dbl>, ref_group <chr>, test_group <chr>Over Representation Analysis (ORA)
Now that we have differential results for glycoforms, we can ask which functions are enriched among proteins with dysregulated glycosylation. Here, a protein is included if it has at least one dysregulated glycoform. The workflow has three steps:
- Filter the glycoforms with p-value and log2FC cutoff.
- Get the unique proteins with dysregulated glycoforms.
- Perform Over Representation Analysis (ORA) for these proteins.
glyfun wraps these steps in a set of convenience
functions:
# Gene Ontology (GO) enrichment analysis
enrich_ora_go(dea_res, p_cutoff = 0.05, log2fc_cutoff = 1)
# KEGG pathway enrichment analysis
enrich_ora_kegg(dea_res, p_cutoff = 0.05, log2fc_cutoff = 1)
# Reactome pathway enrichment analysis
enrich_ora_reactome(dea_res, p_cutoff = 0.05, log2fc_cutoff = 1)
# WikiPathways enrichment analysis
enrich_ora_wp(dea_res, p_cutoff = 0.05, log2fc_cutoff = 1)
# Disease Ontology (DO) enrichment analysis
enrich_ora_do(dea_res, p_cutoff = 0.05, log2fc_cutoff = 1)
# DisGeNET enrichment analysis
enrich_ora_disgenet(dea_res, p_cutoff = 0.05, log2fc_cutoff = 1)Use p_cutoff and log2fc_cutoff to define
which glycoforms count as dysregulated.
Under the hood, these functions call the corresponding
clusterProfiler functions and return an
enrichResult object. That means you can convert the result
to a tibble with as_tibble(), or visualize it with
dotplot().
go_res <- enrich_ora_go(dea_res)
# Convert to tibble
tibble::as_tibble(go_res)
# Dotplot
clusterProfiler::dotplot(go_res)Gene Set Enrichment Analysis (GSEA)
You can also perform Gene Set Enrichment Analysis (GSEA). Unlike ORA, GSEA does not require a hard cutoff to define dysregulated glycoforms. Instead, it works from a ranked protein list, often based on statistics such as log2FC.
This raises one practical question: one protein can have multiple
glycoforms with different log2FC values, so how should it be ranked?
glyfun handles this by aggregating glycoform-level
statistics into one value per protein. The default is the median, but
you can also choose other methods such as the mean or maximum.
# GSEA for Gene Ontology (GO)
enrich_gsea_go(dea_res, aggr = "median")
# GSEA for KEGG pathway
enrich_gsea_kegg(dea_res, aggr = "median")
# GSEA for Reactome pathway
enrich_gsea_reactome(dea_res, aggr = "median")
# GSEA for WikiPathways
enrich_gsea_wp(dea_res, aggr = "median")
# GSEA for Disease Ontology (DO)
enrich_gsea_do(dea_res, aggr = "median")
# GSEA for DisGeNET
enrich_gsea_disgenet(dea_res, aggr = "median")These functions also support several ranking methods. By default,
glyfun uses "signed_log10p", which combines
the sign of log2FC with the p-value. Other options include
"log2fc", "abs_log2fc", "log10p",
and "log2fc_log10p". See the GSEA function documentation
for the full details.
Approach 2: Glycan-Centric Enrichment Analysis
Note: You should be familiar with the glydet package to understand this section.
The protein-centric approach is useful and easy to interpret, but it hides important glycan-level information. Saying that a protein has “dysregulated glycosylation” can be too broad. Often, we want a more specific question: which functions are associated with particular glycosylation changes? For example, we might ask which functions are linked to increased fucosylation or decreased sialylation.
To do that, we first calculate glycosylation traits with
glydet.
library(glydet)
trait_exp <- derive_traits(exp)
trait_exp |>
get_var_info() |>
dplyr::distinct(trait, explanation)
#> # A tibble: 14 × 2
#> trait explanation
#> <chr> <chr>
#> 1 TM Proportion of high-mannose glycans among all glycans.
#> 2 TH Proportion of hybrid glycans among all glycans.
#> 3 TC Proportion of complex glycans among all glycans.
#> 4 MM Abundance-weighted mean of mannose count within high-mannose glycans.
#> 5 CA2 Proportion of bi-antennary glycans within complex glycans.
#> 6 CA3 Proportion of tri-antennary glycans within complex glycans.
#> 7 CA4 Proportion of tetra-antennary glycans within complex glycans.
#> 8 TF Proportion of fucosylated glycans among all glycans.
#> 9 TFc Proportion of core-fucosylated glycans among all glycans.
#> 10 TFa Proportion of arm-fucosylated glycans among all glycans.
#> 11 TB Proportion of glycans with bisecting GlcNAc among all glycans.
#> 12 GS Abundance-weighted mean of degree of sialylation per galactose among a…
#> 13 AG Abundance-weighted mean of degree of galactosylation per antenna among…
#> 14 TS Proportion of sialylated glycans among all glycans.The result contains several trait types, including TF
for fucosylation and GS for sialylation. These traits are
calculated site by site, so each glycosylation site on each protein has
its own trait values.
Next, we run differential analysis on the trait experiment.
trait_dea_res <- gly_limma(trait_exp)
#> ℹ Ref Group: "H"
#> ℹ Test Group: "C"
#> Warning: Zero sample variances detected, have been offset away from zero
#> Warning in splines::ns(covariate, df = splinedf, intercept = TRUE): shoving
#> 'interior' knots matching boundary knots to insideOver Representation Analysis (ORA)
As before, ORA uses hard cutoffs to identify dysregulated features.
The difference is that enrichment is now performed separately for each
trait. For example, for the TF trait, we filter
statistically significant glycosites and then run enrichment analysis on
the proteins with dysregulated TF. The same process is
repeated for GS and the other traits. In
glyfun, this is called glycan-centric enrichment.
glyfun provides the following functions for
glycan-centric ORA:
# Glycan-centric ORA for Gene Ontology (GO)
enrich_gc_ora_go(trait_dea_res, p_cutoff = 0.05, log2fc_cutoff = 1)
# Glycan-centric ORA for KEGG pathway
enrich_gc_ora_kegg(trait_dea_res, p_cutoff = 0.05, log2fc_cutoff = 1)
# Glycan-centric ORA for Reactome pathway
enrich_gc_ora_reactome(trait_dea_res, p_cutoff = 0.05, log2fc_cutoff = 1)
# Glycan-centric ORA for WikiPathways
enrich_gc_ora_wp(trait_dea_res, p_cutoff = 0.05, log2fc_cutoff = 1)
# Glycan-centric ORA for Disease Ontology (DO)
enrich_gc_ora_do(trait_dea_res, p_cutoff = 0.05, log2fc_cutoff = 1)
# Glycan-centric ORA for DisGeNET
enrich_gc_ora_disgenet(trait_dea_res, p_cutoff = 0.05, log2fc_cutoff = 1)These functions return compareClusterResult objects,
which can also be visualized with dotplot().
Trait differential results are not the only valid input for glycan-centric enrichment. You can also use raw glycoform experiments or motif quantification experiments. With raw glycoform experiments, enrichment is performed for each glycan composition or structure; with motif quantification experiments, it is performed for each motif. Because this can trigger many enrichment analyses, raw glycoform-level input can be slow and is usually not the first option to try.
Gene Set Enrichment Analysis (GSEA)
GSEA can also be performed in a glycan-centric manner. The idea mirrors glycan-centric ORA: rank proteins separately for each trait, then run GSEA for each ranked list.
# Glycan-centric GSEA for Gene Ontology (GO)
enrich_gc_gsea_go(trait_dea_res, aggr = "median")
# Glycan-centric GSEA for KEGG pathway
enrich_gc_gsea_kegg(trait_dea_res, aggr = "median")
# Glycan-centric GSEA for Reactome pathway
enrich_gc_gsea_reactome(trait_dea_res, aggr = "median")
# Glycan-centric GSEA for WikiPathways
enrich_gc_gsea_wp(trait_dea_res, aggr = "median")
# Glycan-centric GSEA for Disease Ontology (DO)
enrich_gc_gsea_do(trait_dea_res, aggr = "median")
# Glycan-centric GSEA for DisGeNET
enrich_gc_gsea_disgenet(trait_dea_res, aggr = "median")Try these workflows on your own data and see which biological patterns emerge.