Working with glyexp • glymotif

When Motifs Meet Experiments: A Perfect Partnership 🤝

The real power of glymotif shines brightest when it joins forces with other tools in the glycoverse ecosystem. If you’re already using glyread to import your glycoproteomics results and glyexp to manage your experimental data, you’re in for a treat! glymotif provides some incredibly useful functions to perform experiment manipulation about motifs—think of it as adding a new lens to your analytical microscope. 🔬

Important note: This vignette assumes you’re familiar with the glyexp package. If you haven’t met it yet, we highly recommend checking out its introduction first. Trust us—it’s worth the detour! 🚀

library(glymotif)
library(glyexp)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:glyexp':
#> 
#>     select_var
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Meet Our Star Player: Real Glycoproteomics Data 🌟

Time to roll up our sleeves and dive into the good stuff! 💪 Let’s work with a real-world dataset that will showcase what glymotif can do when it teams up with actual experimental data. We’ll use glyread::read_pglyco3_pglycoquant() to load our data into a proper glyexp::experiment() object — think of it as our data’s new home where everything is organized and ready for analysis. We will also use glyclean::auto_clean() to preprocess the data.

library(glyread)
library(glyclean)
#> 
#> Attaching package: 'glyclean'
#> The following object is masked from 'package:stats':
#> 
#>     aggregate

exp <- glyread::read_pglyco3_pglycoquant(
  "glycopeptides.list",
  sample_info = "sample_info.csv",
  glycan_type = "N"
) |> 
  mutate_obs(sample = stringr::str_split_i(sample, "-", -1)) |>
  auto_clean()
#> ℹ Reading data
#> ✔ Reading data [437ms]
#> 
#> ℹ Parsing glycan compositions and structures
#> ✔ Parsing glycan compositions and structures [3.1s]
#> 
#> ℹ Packing experiment
#> ✔ Packing experiment [20ms]
#> 
#> ℹ Performing protein inference
#> ✔ Performing protein inference [337ms]
#> 
#> ℹ Normalizing data (Median)
#> ✔ Normalizing data (Median) [128ms]
#> 
#> ℹ Removing variables with >50% missing values
#> ✔ Removing variables with >50% missing values [18ms]
#> 
#> ℹ Imputing missing values
#> ℹ Sample size <= 30, using sample minimum imputation
#> ℹ Imputing missing values✔ Imputing missing values [24ms]
#> 
#> ℹ Aggregating data
#> ✔ Aggregating data [145ms]
#> 
#> ℹ Normalizing data again
#> ℹ Detecting batch effects using ANOVA for 263 variables...
#> ℹ Normalizing data again✔ Batch effect detection completed. 1 out of 263 variables show significant batch effects (p < 0.05).
#> ℹ Normalizing data againℹ Batch effects detected in 0.4% of variables (<=10%). Skipping batch correction.
#> ℹ Normalizing data again✔ Normalizing data again [683ms]
exp
#> 
#> ── Experiment ──────────────────────────────────────────────────────────────────
#> ℹ Expression matrix: 12 samples, 263 variables
#> ℹ Sample information fields: group and batch
#> ℹ Variable information fields: protein, gene, glycan_composition, glycan_structure, and protein_site

Now, let’s peek under the hood and see what treasures we’re working with! 👀

get_var_info(exp)
#> # A tibble: 263 × 6
#>    variable protein gene     glycan_composition    glycan_structure protein_site
#>    <chr>    <chr>   <chr>    <comp>                <structure>             <int>
#>  1 V1       P08185  SERPINA6 Hex(5)HexNAc(4)NeuAc… NeuAc(??-?)Hex(…          176
#>  2 V2       P04196  HRG      Hex(5)HexNAc(4)NeuAc… NeuAc(??-?)Hex(…          344
#>  3 V3       P04196  HRG      Hex(5)HexNAc(4)       Hex(??-?)HexNAc…          344
#>  4 V4       P04196  HRG      Hex(5)HexNAc(4)NeuAc… NeuAc(??-?)Hex(…          344
#>  5 V5       P10909  CLU      Hex(6)HexNAc(5)       Hex(??-?)HexNAc…          291
#>  6 V6       P04196  HRG      Hex(5)HexNAc(4)NeuAc… NeuAc(??-?)Hex(…          344
#>  7 V7       P04196  HRG      Hex(5)HexNAc(4)       Hex(??-?)HexNAc…          345
#>  8 V8       P04196  HRG      Hex(5)HexNAc(4)dHex(… dHex(??-?)Hex(?…          344
#>  9 V9       P04196  HRG      Hex(4)HexNAc(3)       Hex(??-?)HexNAc…          344
#> 10 V10      P04196  HRG      Hex(4)HexNAc(4)NeuAc… NeuAc(??-?)Hex(…          344
#> # ℹ 253 more rows

get_sample_info(exp)
#> # A tibble: 12 × 3
#>    sample group batch
#>    <chr>  <chr> <dbl>
#>  1 C_1    C         1
#>  2 C_2    C         2
#>  3 C_3    C         3
#>  4 H_1    H         1
#>  5 H_2    H         2
#>  6 H_3    H         3
#>  7 M_1    M         1
#>  8 M_2    M         2
#>  9 M_3    M         3
#> 10 Y_1    Y         1
#> 11 Y_2    Y         2
#> 12 Y_3    Y         3

What we have here is a beautiful N-glycoproteomics dataset featuring 500 PSMs (Peptide Spectrum Matches) across 12 samples — a perfect playground for motif analysis! 🎮

Pro tip: 💡 In real-world data analysis, you’ll definitely want to use glyclean to perform data preprocessing before diving into any analysis. Think of it as washing your vegetables before cooking—essential for the best results! 🥕

Adding Motif Annotations to an Experiment 🏷️

Here’s where things get interesting! 🤔 We know that the variable information tibble contains all the juicy details about each glycoform - the proteins, sites, and glycan structures. But what if we want to sprinkle some motif magic into the mix? What if we want to add more columns that tell us about the motifs hiding in our glycans?

The simple approach: One motif at a time (very intuitive!) 🎯

exp |> 
  mutate_var(n_hex = have_motif(glycan_structure, "Hex(a1-")) |>
  get_var_info() |>
  select(variable, protein, glycan_structure, n_hex)
#> # A tibble: 263 × 4
#>    variable protein glycan_structure                                       n_hex
#>    <chr>    <chr>   <structure>                                            <lgl>
#>  1 V1       P08185  NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[NeuAc(??-?)… FALSE
#>  2 V2       P04196  NeuAc(??-?)Hex(??-?)HexNAc(??-?)[HexNAc(??-?)]Hex(??-… FALSE
#>  3 V3       P04196  Hex(??-?)HexNAc(??-?)Hex(??-?)[Hex(??-?)HexNAc(??-?)H… FALSE
#>  4 V4       P04196  NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[Hex(??-?)He… FALSE
#>  5 V5       P10909  Hex(??-?)HexNAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[H… FALSE
#>  6 V6       P04196  NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[NeuAc(??-?)… FALSE
#>  7 V7       P04196  Hex(??-?)HexNAc(??-?)Hex(??-?)[Hex(??-?)HexNAc(??-?)H… FALSE
#>  8 V8       P04196  dHex(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[dHex(??-?)He… FALSE
#>  9 V9       P04196  Hex(??-?)HexNAc(??-?)Hex(??-?)[Hex(??-?)]Hex(??-?)Hex… FALSE
#> 10 V10      P04196  NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?… FALSE
#> # ℹ 253 more rows

The tempting approach: Multiple motifs at once (you might be tempted to do this…) 🤷‍♀️

# Don't do this
exp |>
  mutate_var(
    n_hex = have_motif(glycan_structure, "Hex(a1-"),
    n_hexna = have_motif(glycan_structure, "HexNAc(a1-"),
    n_dhex = have_motif(glycan_structure, "dHex(a1-")
  )

Hold on there, speed racer! 🛑 While this approach works, it’s not very efficient, and your computer won’t thank you for it. Here’s why: those three separate calls to have_motif() all perform time-consuming validations and conversions on the same set of glycan structures. It’s like washing the same dishes three times instead of doing them all at once! 🍽️ Plus, there’s a lot of repetitive typing. You have to type have_motif and glycan_structure three times — talk about finger fatigue! 😴

The smart approach: Use the add_motifs_lgl() or add_motifs_int() functions instead! ⚡ They might look like simple syntactic sugar, but they’re actually optimized powerhouses designed specifically for this exact scenario:

exp2 <- exp |> 
  add_motifs_lgl(c(motif1 = "Hex(??-", motif2 = "HexNAc(??-", motif3 = "dHex(??-"))

Voilà! 🎉 The motif annotations are now seamlessly integrated into your variable information tibble.

exp2 |>
  get_var_info() |>
  select(variable, motif1, motif2, motif3)
#> # A tibble: 263 × 4
#>    variable motif1 motif2 motif3
#>    <chr>    <lgl>  <lgl>  <lgl> 
#>  1 V1       TRUE   TRUE   FALSE 
#>  2 V2       TRUE   TRUE   FALSE 
#>  3 V3       TRUE   TRUE   FALSE 
#>  4 V4       TRUE   TRUE   FALSE 
#>  5 V5       TRUE   TRUE   FALSE 
#>  6 V6       TRUE   TRUE   FALSE 
#>  7 V7       TRUE   TRUE   FALSE 
#>  8 V8       TRUE   TRUE   TRUE  
#>  9 V9       TRUE   TRUE   FALSE 
#> 10 V10      TRUE   TRUE   FALSE 
#> # ℹ 253 more rows

But wait, what can you actually do with these shiny new columns? 🤔 The possibilities are endless, but here’s a tantalizing example to get your creative juices flowing:

# You can perform pathway enrichment on all glycoproteins containing some motif:
exp2 |>
  filter_var(motif1 == TRUE) |>
  enrich_reactome()  # from the `glystats` package

The Art of Motif Quantification in Experiments 📊

Now we’re entering truly exciting territory! 🚀 Meet quantify_motifs() — another incredibly useful tool in our arsenal. This function takes a glyexp::experiment() object and returns a brand new one with motif quantifications. But what does that actually mean? 🤷‍♂️

Here’s the beautiful concept: when we quantify a motif in a glycoproteomic context, we’re measuring the abundance of motifs on each glycosite in each sample. This information elegantly combines two key pieces: the motif count on each site and the abundance of that site. It’s like knowing not just how many sports cars are in a parking lot, but also how often each parking spot is used! 🚗

motifs <- c(
  "NeuAc(??-?)Hex(??-?)HexNAc(??-",
  "Hex(??-?)HexNAc(??-",
  "HexNAc(??-",
  "NeuAc(??-?)Hex(??-?)[dHex(??-?)]HexNAc(??-",
  "Hex(??-?)[dHex(??-?)]HexNAc(??-",
  "dHex(??-?)HexNAc(??-"
)
motif_exp <- quantify_motifs(exp, motifs, alignments = "terminal", ignore_linkages = TRUE)
# Pro tip: Our motifs don't have linkage information,
# but we can still set `ignore_linkages` to TRUE manually to turbocharge the calculation! ⚡

Here’s the beautiful part: motif_exp is also a glyexp::experiment()! 🎭 Think of it this way： if we describe a typical glycoproteomics experiment() as “the quantities of glycans on different glycosites in different samples”, then motif_exp is “the quantities of motifs on different glycosites in different samples”. It’s the same elegant structure, just viewed through a different analytical lens! 🔍

Let’s explore our newly minted experiment and see what treasures it holds:

# First, let's peek at the variable information
print(get_var_info(motif_exp), n = 12)
#> # A tibble: 174 × 5
#>    variable protein gene     protein_site motif                                 
#>    <chr>    <chr>   <chr>           <int> <chr>                                 
#>  1 V1       P08185  SERPINA6          176 NeuAc(??-?)Hex(??-?)HexNAc(??-        
#>  2 V2       P08185  SERPINA6          176 Hex(??-?)HexNAc(??-                   
#>  3 V3       P08185  SERPINA6          176 HexNAc(??-                            
#>  4 V4       P08185  SERPINA6          176 NeuAc(??-?)Hex(??-?)[dHex(??-?)]HexNA…
#>  5 V5       P08185  SERPINA6          176 Hex(??-?)[dHex(??-?)]HexNAc(??-       
#>  6 V6       P08185  SERPINA6          176 dHex(??-?)HexNAc(??-                  
#>  7 V7       P04196  HRG               344 NeuAc(??-?)Hex(??-?)HexNAc(??-        
#>  8 V8       P04196  HRG               344 Hex(??-?)HexNAc(??-                   
#>  9 V9       P04196  HRG               344 HexNAc(??-                            
#> 10 V10      P04196  HRG               344 NeuAc(??-?)Hex(??-?)[dHex(??-?)]HexNA…
#> 11 V11      P04196  HRG               344 Hex(??-?)[dHex(??-?)]HexNAc(??-       
#> 12 V12      P04196  HRG               344 dHex(??-?)HexNAc(??-                  
#> # ℹ 162 more rows

# And now, the expression matrix (where the magic numbers live!)
get_expr_mat(motif_exp)[1:12, ]
#>              C_1        C_2          C_3         H_1          H_2       H_3
#> V1      152014.3    1066076     102104.8    313757.6     48186.51   1058981
#> V2           0.0          0          0.0         0.0         0.00         0
#> V3           0.0          0          0.0         0.0         0.00         0
#> V4           0.0          0          0.0         0.0         0.00         0
#> V5           0.0          0          0.0         0.0         0.00         0
#> V6           0.0          0          0.0         0.0         0.00         0
#> V7  6345221568.8 7206105753 7191439575.7 140324912.5 533431894.56 300715555
#> V8   610187320.7  687432028  618492715.7   1672944.5    816325.57   4082485
#> V9   301461764.4  469696174  342199332.7  10748745.5 499654575.62 292347836
#> V10  204401527.3  361718038  551686432.8  42321892.3  11721625.17   1939821
#> V11          0.0          0          0.0         0.0         0.00         0
#> V12  226547981.2  387992287  581512073.9  42387630.9  11785577.54   1978999
#>            M_1        M_2        M_3        Y_1        Y_2          Y_3
#> V1    14084377    8630328    7735835   16851418   11241393     739108.8
#> V2           0          0          0          0          0          0.0
#> V3           0          0          0          0          0          0.0
#> V4           0          0          0          0          0          0.0
#> V5           0          0          0          0          0          0.0
#> V6           0          0          0          0          0          0.0
#> V7  5747809593 5202035052 3736151840 1652226063 3236759193 2298155309.7
#> V8   951853984  668247296  733802918  233115428  301886646  314476664.4
#> V9   309613176   97558541  215055020   75713539   84626608   48423869.5
#> V10  291843379  161976038  257232685  148760800  139522555  149173688.3
#> V11          0          0          0          0          0          0.0
#> V12  324153530  177427606  282567373  160099412  145271893  160356962.6

Pretty familiar, right? 😊 It’s just like the experiment you know and love, except now we’re quantifying motifs rather than glycans.

Here’s the best part: this means all your favorite functions in glystats can be used on this experiment! 🎉 It’s like having a universal remote that works with your new TV： everything just clicks! 📺

library(glystats)

gly_anova(motif_exp)
gly_pca(motif_exp)
# And others...