Skip to contents

When Motifs Meet Experiments: A Perfect Partnership 🤝

The real power of glymotif shines brightest when it joins forces with other tools in the glycoverse ecosystem. If you’re already using glyread to import your glycoproteomics results and glyexp to manage your experimental data, you’re in for a treat! glymotif provides some incredibly useful functions to perform experiment manipulation about motifs—think of it as adding a new lens to your analytical microscope. 🔬

Important note: This vignette assumes you’re familiar with the glyexp package. If you haven’t met it yet, we highly recommend checking out its introduction first. Trust us—it’s worth the detour! 🚀

library(glymotif)
library(glyexp)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:glyexp':
#> 
#>     select_var
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Meet Our Star Player: Real Glycoproteomics Data 🌟

Time to roll up our sleeves and dive into the good stuff! 💪 Let’s work with a real-world dataset that will showcase what glymotif can do when it teams up with actual experimental data. We will use real_experiment from the glyexp package, an serum N-glycoproteomics study with 12 samples. Firstly, let’s use glyclean::auto_clean() to preprocess the data.

library(glyclean)
#> 
#> Attaching package: 'glyclean'
#> The following object is masked from 'package:stats':
#> 
#>     aggregate

exp <- auto_clean(real_experiment)
#>  Normalizing data (Median)
#>  Normalizing data (Median) [160ms]
#> 
#>  Removing variables with >50% missing values
#>  Removing variables with >50% missing values [45ms]
#> 
#>  Imputing missing values
#>  Sample size <= 30, using sample minimum imputation
#>  Imputing missing values Imputing missing values [24ms]
#> 
#>  Aggregating data
#>  Aggregating data [886ms]
#> 
#>  Normalizing data again
#>  Normalizing data again [16ms]
exp
#> 
#> ── Experiment ──────────────────────────────────────────────────────────────────
#>  Expression matrix: 12 samples, 3880 variables
#>  Sample information fields: group <chr>
#>  Variable information fields: protein <chr>, gene <chr>, glycan_composition <comp>, glycan_structure <struct>, protein_site <int>

Now, let’s peek under the hood and see what treasures we’re working with! 👀

get_var_info(exp)
#> # A tibble: 3,880 × 6
#>    variable protein gene     glycan_composition     
#>    <chr>    <chr>   <chr>    <comp>                 
#>  1 V1       P08185  SERPINA6 Hex(5)HexNAc(4)NeuAc(2)
#>  2 V2       P04196  HRG      Hex(5)HexNAc(4)NeuAc(1)
#>  3 V3       P04196  HRG      Hex(5)HexNAc(4)        
#>  4 V4       P04196  HRG      Hex(5)HexNAc(4)NeuAc(1)
#>  5 V5       P10909  CLU      Hex(6)HexNAc(5)        
#>  6 V6       P04196  HRG      Hex(5)HexNAc(4)NeuAc(2)
#>  7 V7       P04196  HRG      Hex(5)HexNAc(4)        
#>  8 V8       P04196  HRG      Hex(5)HexNAc(4)dHex(2) 
#>  9 V9       P04196  HRG      Hex(4)HexNAc(3)        
#> 10 V10      P04196  HRG      Hex(4)HexNAc(4)NeuAc(1)
#> # ℹ 3,870 more rows
#> # ℹ 2 more variables: glycan_structure <struct>, protein_site <int>
get_sample_info(exp)
#> # A tibble: 12 × 2
#>    sample group
#>    <chr>  <chr>
#>  1 C1     C    
#>  2 C2     C    
#>  3 C3     C    
#>  4 H1     H    
#>  5 H2     H    
#>  6 H3     H    
#>  7 M1     M    
#>  8 M2     M    
#>  9 M3     M    
#> 10 Y1     Y    
#> 11 Y2     Y    
#> 12 Y3     Y

What we have here is a beautiful N-glycoproteomics dataset featuring 500 PSMs (Peptide Spectrum Matches) across 12 samples — a perfect playground for motif analysis! 🎮

Pro tip: 💡 In real-world data analysis, you’ll definitely want to use glyclean to perform data preprocessing before diving into any analysis. Think of it as washing your vegetables before cooking—essential for the best results! 🥕

Adding Motif Annotations to an Experiment 🏷️

Here’s where things get interesting! 🤔 We know that the variable information tibble contains all the juicy details about each glycoform - the proteins, sites, and glycan structures. But what if we want to sprinkle some motif magic into the mix? What if we want to add more columns that tell us about the motifs hiding in our glycans?

The simple approach: One motif at a time (very intuitive!) 🎯

exp |> 
  mutate_var(n_hex = have_motif(glycan_structure, "Hex(a1-")) |>
  get_var_info() |>
  select(variable, protein, glycan_structure, n_hex)
#> # A tibble: 3,880 × 4
#>    variable protein
#>    <chr>    <chr>  
#>  1 V1       P08185 
#>  2 V2       P04196 
#>  3 V3       P04196 
#>  4 V4       P04196 
#>  5 V5       P10909 
#>  6 V6       P04196 
#>  7 V7       P04196 
#>  8 V8       P04196 
#>  9 V9       P04196 
#> 10 V10      P04196 
#> # ℹ 3,870 more rows
#> # ℹ 2 more variables: glycan_structure <struct>, n_hex <lgl>

The tempting approach: Multiple motifs at once (you might be tempted to do this…) 🤷‍♀️

# Don't do this
exp |>
  mutate_var(
    n_hex = have_motif(glycan_structure, "Hex(a1-"),
    n_hexna = have_motif(glycan_structure, "HexNAc(a1-"),
    n_dhex = have_motif(glycan_structure, "dHex(a1-")
  )

Hold on there, speed racer! 🛑 While this approach works, it’s not very efficient, and your computer won’t thank you for it. Here’s why: those three separate calls to have_motif() all perform time-consuming validations and conversions on the same set of glycan structures. It’s like washing the same dishes three times instead of doing them all at once! 🍽️ Plus, there’s a lot of repetitive typing. You have to type have_motif and glycan_structure three times — talk about finger fatigue! 😴

The smart approach: Use the add_motifs_lgl() or add_motifs_int() functions instead! ⚡ They might look like simple syntactic sugar, but they’re actually optimized powerhouses designed specifically for this exact scenario:

exp2 <- exp |> 
  add_motifs_lgl(c(motif1 = "Hex(??-", motif2 = "HexNAc(??-", motif3 = "dHex(??-"))

Voilà! 🎉 The motif annotations are now seamlessly integrated into your variable information tibble.

exp2 |>
  get_var_info() |>
  select(variable, motif1, motif2, motif3)
#> # A tibble: 3,880 × 4
#>    variable motif1 motif2 motif3
#>    <chr>    <lgl>  <lgl>  <lgl> 
#>  1 V1       TRUE   TRUE   FALSE 
#>  2 V2       TRUE   TRUE   FALSE 
#>  3 V3       TRUE   TRUE   FALSE 
#>  4 V4       TRUE   TRUE   FALSE 
#>  5 V5       TRUE   TRUE   FALSE 
#>  6 V6       TRUE   TRUE   FALSE 
#>  7 V7       TRUE   TRUE   FALSE 
#>  8 V8       TRUE   TRUE   TRUE  
#>  9 V9       TRUE   TRUE   FALSE 
#> 10 V10      TRUE   TRUE   FALSE 
#> # ℹ 3,870 more rows

But wait, what can you actually do with these shiny new columns? 🤔 The possibilities are endless, but here’s a tantalizing example to get your creative juices flowing:

# You can perform pathway enrichment on all glycoproteins containing some motif:
exp2 |>
  filter_var(motif1 == TRUE) |>
  gly_enrich_reactome()  # from the `glystats` package

The Art of Motif Quantification in Experiments 📊

Now we’re entering truly exciting territory! 🚀 Meet quantify_motifs() — another incredibly useful tool in our arsenal. This function takes a glyexp::experiment() object and returns a brand new one with motif quantifications. But what does that actually mean? 🤷‍♂️

Here’s the beautiful concept: when we quantify a motif in a glycoproteomic context, we’re measuring the abundance of motifs on each glycosite in each sample. This information elegantly combines two key pieces: the motif count on each site and the abundance of that site. It’s like knowing not just how many sports cars are in a parking lot, but also how often each parking spot is used! 🚗

motifs <- c(
  "NeuAc(??-?)Hex(??-?)HexNAc(??-",
  "Hex(??-?)HexNAc(??-",
  "HexNAc(??-",
  "NeuAc(??-?)Hex(??-?)[dHex(??-?)]HexNAc(??-",
  "Hex(??-?)[dHex(??-?)]HexNAc(??-",
  "dHex(??-?)HexNAc(??-"
)
motif_exp <- quantify_motifs(exp, motifs, alignments = "terminal", ignore_linkages = TRUE)
# Pro tip: Our motifs don't have linkage information,
# but we can still set `ignore_linkages` to TRUE manually to turbocharge the calculation! ⚡

Here’s the beautiful part: motif_exp is also a glyexp::experiment()! 🎭 Think of it this way: if we describe a typical glycoproteomics experiment() as “the quantities of glycans on different glycosites in different samples”, then motif_exp is “the quantities of motifs on different glycosites in different samples”. It’s the same elegant structure, just viewed through a different analytical lens! 🔍

Let’s explore our newly minted experiment and see what treasures it holds:

# First, let's peek at the variable information
print(get_var_info(motif_exp), n = 12)
#> # A tibble: 1,644 × 5
#>    variable protein protein_site motif                                     gene 
#>    <chr>    <chr>          <int> <chr>                                     <chr>
#>  1 V1       A6NJW9            49 NeuAc(??-?)Hex(??-?)HexNAc(??-            CD8B2
#>  2 V2       A6NJW9            49 Hex(??-?)HexNAc(??-                       CD8B2
#>  3 V3       A6NJW9            49 HexNAc(??-                                CD8B2
#>  4 V4       A6NJW9            49 NeuAc(??-?)Hex(??-?)[dHex(??-?)]HexNAc(?… CD8B2
#>  5 V5       A6NJW9            49 Hex(??-?)[dHex(??-?)]HexNAc(??-           CD8B2
#>  6 V6       A6NJW9            49 dHex(??-?)HexNAc(??-                      CD8B2
#>  7 V7       O14786           150 NeuAc(??-?)Hex(??-?)HexNAc(??-            NRP1 
#>  8 V8       O14786           150 Hex(??-?)HexNAc(??-                       NRP1 
#>  9 V9       O14786           150 HexNAc(??-                                NRP1 
#> 10 V10      O14786           150 NeuAc(??-?)Hex(??-?)[dHex(??-?)]HexNAc(?… NRP1 
#> 11 V11      O14786           150 Hex(??-?)[dHex(??-?)]HexNAc(??-           NRP1 
#> 12 V12      O14786           150 dHex(??-?)HexNAc(??-                      NRP1 
#> # ℹ 1,632 more rows
# And now, the expression matrix (where the magic numbers live!)
get_expr_mat(motif_exp)[1:12, ]
#>           C1       C2       C3       H1       H2       H3       M1         M2
#> V1   4518941  3597488  9291116  1652455 17718.79 18276.18 10131349  4903136.0
#> V2         0        0        0        0     0.00     0.00        0        0.0
#> V3         0        0        0        0     0.00     0.00        0        0.0
#> V4   4518941  3597488  9291116  1652455 17718.79 18276.18 10131349  4903136.0
#> V5         0        0        0        0     0.00     0.00        0        0.0
#> V6  13556823 10792465 27873349  4957366 53156.37 54828.54 30394046 14709407.9
#> V7   8358832 12224955 13576169 13126524 17718.79 18276.18 16723916   904568.5
#> V8   8358832 12224955 13576169 13126524 17718.79 18276.18 16723916   904568.5
#> V9         0        0        0        0     0.00     0.00        0        0.0
#> V10        0        0        0        0     0.00     0.00        0        0.0
#> V11        0        0        0        0     0.00     0.00        0        0.0
#> V12  8358832 12224955 13576169 13126524 17718.79 18276.18 16723916   904568.5
#>          M3         Y1       Y2       Y3
#> V1  2876921   697401.2  3718507  3636137
#> V2        0        0.0        0        0
#> V3        0        0.0        0        0
#> V4  2876921   697401.2  3718507  3636137
#> V5        0        0.0        0        0
#> V6  8630764  2092203.5 11155520 10908410
#> V7  3096274 10234681.7  9805953 12449337
#> V8  3096274 10234681.7  9805953 12449337
#> V9        0        0.0        0        0
#> V10       0        0.0        0        0
#> V11       0        0.0        0        0
#> V12 3096274 10234681.7  9805953 12449337

Pretty familiar, right? 😊 It’s just like the experiment you know and love, except now we’re quantifying motifs rather than glycans.

Here’s the best part: this means all your favorite functions in glystats can be used on this experiment! 🎉 It’s like having a universal remote that works with your new TV: everything just clicks! 📺

library(glystats)

gly_anova(motif_exp)
gly_pca(motif_exp)
# And others...