
Working with glyexp
with-exp.Rmd
When Motifs Meet Experiments: A Perfect Partnership 🤝
The real power of glymotif
shines brightest when it
joins forces with other tools in the glycoverse
ecosystem.
If you’re already using glyread
to import your
glycoproteomics results and glyexp
to manage your
experimental data, you’re in for a treat! glymotif
provides
some incredibly useful functions to perform experiment
manipulation about motifs—think of it as adding a new lens to your
analytical microscope. 🔬
Important note: This vignette assumes you’re
familiar with the glyexp
package. If you haven’t met it
yet, we highly recommend checking out its introduction
first. Trust us—it’s worth the detour! 🚀
library(glymotif)
library(glyexp)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:glyexp':
#>
#> select_var
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
Meet Our Star Player: Real Glycoproteomics Data 🌟
Time to roll up our sleeves and dive into the good stuff! 💪 Let’s
work with a real-world dataset that will showcase what
glymotif
can do when it teams up with actual experimental
data. We’ll use glyread::read_pglyco3_pglycoquant()
to load
our data into a proper glyexp::experiment()
object — think
of it as our data’s new home where everything is organized and ready for
analysis. We will also use glyclean::auto_clean()
to
preprocess the data.
library(glyread)
library(glyclean)
#>
#> Attaching package: 'glyclean'
#> The following object is masked from 'package:stats':
#>
#> aggregate
exp <- glyread::read_pglyco3_pglycoquant(
"glycopeptides.list",
sample_info = "sample_info.csv",
glycan_type = "N"
) |>
mutate_obs(sample = stringr::str_split_i(sample, "-", -1)) |>
auto_clean()
#> ℹ Reading data
#> ✔ Reading data [437ms]
#>
#> ℹ Parsing glycan compositions and structures
#> ✔ Parsing glycan compositions and structures [3.1s]
#>
#> ℹ Packing experiment
#> ✔ Packing experiment [20ms]
#>
#> ℹ Performing protein inference
#> ✔ Performing protein inference [337ms]
#>
#> ℹ Normalizing data (Median)
#> ✔ Normalizing data (Median) [128ms]
#>
#> ℹ Removing variables with >50% missing values
#> ✔ Removing variables with >50% missing values [18ms]
#>
#> ℹ Imputing missing values
#> ℹ Sample size <= 30, using sample minimum imputation
#> ℹ Imputing missing values✔ Imputing missing values [24ms]
#>
#> ℹ Aggregating data
#> ✔ Aggregating data [145ms]
#>
#> ℹ Normalizing data again
#> ℹ Detecting batch effects using ANOVA for 263 variables...
#> ℹ Normalizing data again✔ Batch effect detection completed. 1 out of 263 variables show significant batch effects (p < 0.05).
#> ℹ Normalizing data againℹ Batch effects detected in 0.4% of variables (<=10%). Skipping batch correction.
#> ℹ Normalizing data again✔ Normalizing data again [683ms]
exp
#>
#> ── Experiment ──────────────────────────────────────────────────────────────────
#> ℹ Expression matrix: 12 samples, 263 variables
#> ℹ Sample information fields: group and batch
#> ℹ Variable information fields: protein, gene, glycan_composition, glycan_structure, and protein_site
Now, let’s peek under the hood and see what treasures we’re working with! 👀
get_var_info(exp)
#> # A tibble: 263 × 6
#> variable protein gene glycan_composition glycan_structure protein_site
#> <chr> <chr> <chr> <comp> <structure> <int>
#> 1 V1 P08185 SERPINA6 Hex(5)HexNAc(4)NeuAc… NeuAc(??-?)Hex(… 176
#> 2 V2 P04196 HRG Hex(5)HexNAc(4)NeuAc… NeuAc(??-?)Hex(… 344
#> 3 V3 P04196 HRG Hex(5)HexNAc(4) Hex(??-?)HexNAc… 344
#> 4 V4 P04196 HRG Hex(5)HexNAc(4)NeuAc… NeuAc(??-?)Hex(… 344
#> 5 V5 P10909 CLU Hex(6)HexNAc(5) Hex(??-?)HexNAc… 291
#> 6 V6 P04196 HRG Hex(5)HexNAc(4)NeuAc… NeuAc(??-?)Hex(… 344
#> 7 V7 P04196 HRG Hex(5)HexNAc(4) Hex(??-?)HexNAc… 345
#> 8 V8 P04196 HRG Hex(5)HexNAc(4)dHex(… dHex(??-?)Hex(?… 344
#> 9 V9 P04196 HRG Hex(4)HexNAc(3) Hex(??-?)HexNAc… 344
#> 10 V10 P04196 HRG Hex(4)HexNAc(4)NeuAc… NeuAc(??-?)Hex(… 344
#> # ℹ 253 more rows
get_sample_info(exp)
#> # A tibble: 12 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 C_1 C 1
#> 2 C_2 C 2
#> 3 C_3 C 3
#> 4 H_1 H 1
#> 5 H_2 H 2
#> 6 H_3 H 3
#> 7 M_1 M 1
#> 8 M_2 M 2
#> 9 M_3 M 3
#> 10 Y_1 Y 1
#> 11 Y_2 Y 2
#> 12 Y_3 Y 3
What we have here is a beautiful N-glycoproteomics dataset featuring 500 PSMs (Peptide Spectrum Matches) across 12 samples — a perfect playground for motif analysis! 🎮
Pro tip: 💡 In real-world data analysis, you’ll
definitely want to use glyclean
to perform data
preprocessing before diving into any analysis. Think of it as washing
your vegetables before cooking—essential for the best results! 🥕
Adding Motif Annotations to an Experiment 🏷️
Here’s where things get interesting! 🤔 We know that the variable information tibble contains all the juicy details about each glycoform - the proteins, sites, and glycan structures. But what if we want to sprinkle some motif magic into the mix? What if we want to add more columns that tell us about the motifs hiding in our glycans?
The simple approach: One motif at a time (very intuitive!) 🎯
exp |>
mutate_var(n_hex = have_motif(glycan_structure, "Hex(a1-")) |>
get_var_info() |>
select(variable, protein, glycan_structure, n_hex)
#> # A tibble: 263 × 4
#> variable protein glycan_structure n_hex
#> <chr> <chr> <structure> <lgl>
#> 1 V1 P08185 NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[NeuAc(??-?)… FALSE
#> 2 V2 P04196 NeuAc(??-?)Hex(??-?)HexNAc(??-?)[HexNAc(??-?)]Hex(??-… FALSE
#> 3 V3 P04196 Hex(??-?)HexNAc(??-?)Hex(??-?)[Hex(??-?)HexNAc(??-?)H… FALSE
#> 4 V4 P04196 NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[Hex(??-?)He… FALSE
#> 5 V5 P10909 Hex(??-?)HexNAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[H… FALSE
#> 6 V6 P04196 NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[NeuAc(??-?)… FALSE
#> 7 V7 P04196 Hex(??-?)HexNAc(??-?)Hex(??-?)[Hex(??-?)HexNAc(??-?)H… FALSE
#> 8 V8 P04196 dHex(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[dHex(??-?)He… FALSE
#> 9 V9 P04196 Hex(??-?)HexNAc(??-?)Hex(??-?)[Hex(??-?)]Hex(??-?)Hex… FALSE
#> 10 V10 P04196 NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?… FALSE
#> # ℹ 253 more rows
The tempting approach: Multiple motifs at once (you might be tempted to do this…) 🤷♀️
# Don't do this
exp |>
mutate_var(
n_hex = have_motif(glycan_structure, "Hex(a1-"),
n_hexna = have_motif(glycan_structure, "HexNAc(a1-"),
n_dhex = have_motif(glycan_structure, "dHex(a1-")
)
Hold on there, speed racer! 🛑 While this approach works, it’s not
very efficient, and your computer won’t thank you for it. Here’s why:
those three separate calls to have_motif()
all perform
time-consuming validations and conversions on the same set of glycan
structures. It’s like washing the same dishes three times instead of
doing them all at once! 🍽️ Plus, there’s a lot of repetitive typing. You
have to type have_motif
and glycan_structure
three times — talk about finger fatigue! 😴
The smart approach: Use the
add_motifs_lgl()
or add_motifs_int()
functions
instead! ⚡ They might look like simple syntactic sugar, but they’re
actually optimized powerhouses designed specifically for this exact
scenario:
exp2 <- exp |>
add_motifs_lgl(c(motif1 = "Hex(??-", motif2 = "HexNAc(??-", motif3 = "dHex(??-"))
Voilà! 🎉 The motif annotations are now seamlessly integrated into your variable information tibble.
exp2 |>
get_var_info() |>
select(variable, motif1, motif2, motif3)
#> # A tibble: 263 × 4
#> variable motif1 motif2 motif3
#> <chr> <lgl> <lgl> <lgl>
#> 1 V1 TRUE TRUE FALSE
#> 2 V2 TRUE TRUE FALSE
#> 3 V3 TRUE TRUE FALSE
#> 4 V4 TRUE TRUE FALSE
#> 5 V5 TRUE TRUE FALSE
#> 6 V6 TRUE TRUE FALSE
#> 7 V7 TRUE TRUE FALSE
#> 8 V8 TRUE TRUE TRUE
#> 9 V9 TRUE TRUE FALSE
#> 10 V10 TRUE TRUE FALSE
#> # ℹ 253 more rows
But wait, what can you actually do with these shiny new columns? 🤔 The possibilities are endless, but here’s a tantalizing example to get your creative juices flowing:
# You can perform pathway enrichment on all glycoproteins containing some motif:
exp2 |>
filter_var(motif1 == TRUE) |>
enrich_reactome() # from the `glystats` package
The Art of Motif Quantification in Experiments 📊
Now we’re entering truly exciting territory! 🚀 Meet
quantify_motifs()
— another incredibly useful tool in our
arsenal. This function takes a glyexp::experiment()
object
and returns a brand new one with motif quantifications. But what does
that actually mean? 🤷♂️
Here’s the beautiful concept: when we quantify a motif in a glycoproteomic context, we’re measuring the abundance of motifs on each glycosite in each sample. This information elegantly combines two key pieces: the motif count on each site and the abundance of that site. It’s like knowing not just how many sports cars are in a parking lot, but also how often each parking spot is used! 🚗
motifs <- c(
"NeuAc(??-?)Hex(??-?)HexNAc(??-",
"Hex(??-?)HexNAc(??-",
"HexNAc(??-",
"NeuAc(??-?)Hex(??-?)[dHex(??-?)]HexNAc(??-",
"Hex(??-?)[dHex(??-?)]HexNAc(??-",
"dHex(??-?)HexNAc(??-"
)
motif_exp <- quantify_motifs(exp, motifs, alignments = "terminal", ignore_linkages = TRUE)
# Pro tip: Our motifs don't have linkage information,
# but we can still set `ignore_linkages` to TRUE manually to turbocharge the calculation! ⚡
Here’s the beautiful part: motif_exp
is also a
glyexp::experiment()
! 🎭 Think of it this way: if we
describe a typical glycoproteomics experiment()
as “the
quantities of glycans on different glycosites in different samples”,
then motif_exp
is “the quantities of motifs on different
glycosites in different samples”. It’s the same elegant structure, just
viewed through a different analytical lens! 🔍
Let’s explore our newly minted experiment and see what treasures it holds:
# First, let's peek at the variable information
print(get_var_info(motif_exp), n = 12)
#> # A tibble: 174 × 5
#> variable protein gene protein_site motif
#> <chr> <chr> <chr> <int> <chr>
#> 1 V1 P08185 SERPINA6 176 NeuAc(??-?)Hex(??-?)HexNAc(??-
#> 2 V2 P08185 SERPINA6 176 Hex(??-?)HexNAc(??-
#> 3 V3 P08185 SERPINA6 176 HexNAc(??-
#> 4 V4 P08185 SERPINA6 176 NeuAc(??-?)Hex(??-?)[dHex(??-?)]HexNA…
#> 5 V5 P08185 SERPINA6 176 Hex(??-?)[dHex(??-?)]HexNAc(??-
#> 6 V6 P08185 SERPINA6 176 dHex(??-?)HexNAc(??-
#> 7 V7 P04196 HRG 344 NeuAc(??-?)Hex(??-?)HexNAc(??-
#> 8 V8 P04196 HRG 344 Hex(??-?)HexNAc(??-
#> 9 V9 P04196 HRG 344 HexNAc(??-
#> 10 V10 P04196 HRG 344 NeuAc(??-?)Hex(??-?)[dHex(??-?)]HexNA…
#> 11 V11 P04196 HRG 344 Hex(??-?)[dHex(??-?)]HexNAc(??-
#> 12 V12 P04196 HRG 344 dHex(??-?)HexNAc(??-
#> # ℹ 162 more rows
# And now, the expression matrix (where the magic numbers live!)
get_expr_mat(motif_exp)[1:12, ]
#> C_1 C_2 C_3 H_1 H_2 H_3
#> V1 152014.3 1066076 102104.8 313757.6 48186.51 1058981
#> V2 0.0 0 0.0 0.0 0.00 0
#> V3 0.0 0 0.0 0.0 0.00 0
#> V4 0.0 0 0.0 0.0 0.00 0
#> V5 0.0 0 0.0 0.0 0.00 0
#> V6 0.0 0 0.0 0.0 0.00 0
#> V7 6345221568.8 7206105753 7191439575.7 140324912.5 533431894.56 300715555
#> V8 610187320.7 687432028 618492715.7 1672944.5 816325.57 4082485
#> V9 301461764.4 469696174 342199332.7 10748745.5 499654575.62 292347836
#> V10 204401527.3 361718038 551686432.8 42321892.3 11721625.17 1939821
#> V11 0.0 0 0.0 0.0 0.00 0
#> V12 226547981.2 387992287 581512073.9 42387630.9 11785577.54 1978999
#> M_1 M_2 M_3 Y_1 Y_2 Y_3
#> V1 14084377 8630328 7735835 16851418 11241393 739108.8
#> V2 0 0 0 0 0 0.0
#> V3 0 0 0 0 0 0.0
#> V4 0 0 0 0 0 0.0
#> V5 0 0 0 0 0 0.0
#> V6 0 0 0 0 0 0.0
#> V7 5747809593 5202035052 3736151840 1652226063 3236759193 2298155309.7
#> V8 951853984 668247296 733802918 233115428 301886646 314476664.4
#> V9 309613176 97558541 215055020 75713539 84626608 48423869.5
#> V10 291843379 161976038 257232685 148760800 139522555 149173688.3
#> V11 0 0 0 0 0 0.0
#> V12 324153530 177427606 282567373 160099412 145271893 160356962.6
Pretty familiar, right? 😊 It’s just like the experiment you know and love, except now we’re quantifying motifs rather than glycans.
Here’s the best part: this means all your favorite functions
in glystats
can be used on this experiment! 🎉 It’s like
having a universal remote that works with your new TV: everything just
clicks! 📺
library(glystats)
gly_anova(motif_exp)
gly_pca(motif_exp)
# And others...