
Working with glyexp
with-exp.Rmd
When Motifs Meet Experiments: A Perfect Partnership 🤝
The real power of glymotif
shines brightest when it
joins forces with other tools in the glycoverse
ecosystem.
If you’re already using glyread
to import your
glycoproteomics results and glyexp
to manage your
experimental data, you’re in for a treat! glymotif
provides
some incredibly useful functions to perform experiment
manipulation about motifs—think of it as adding a new lens to your
analytical microscope. 🔬
Important note: This vignette assumes you’re
familiar with the glyexp
package. If you haven’t met it
yet, we highly recommend checking out its introduction
first. Trust us—it’s worth the detour! 🚀
library(glymotif)
library(glyexp)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:glyexp':
#>
#> select_var
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
Meet Our Star Player: Real Glycoproteomics Data 🌟
Time to roll up our sleeves and dive into the good stuff! 💪 Let’s
work with a real-world dataset that will showcase what
glymotif
can do when it teams up with actual experimental
data. We will use real_experiment
from the
glyexp
package, an serum N-glycoproteomics study with 12
samples. Firstly, let’s use glyclean::auto_clean()
to
preprocess the data.
library(glyclean)
#>
#> Attaching package: 'glyclean'
#> The following object is masked from 'package:stats':
#>
#> aggregate
exp <- auto_clean(real_experiment)
#> ℹ Normalizing data (Median)
#> ✔ Normalizing data (Median) [160ms]
#>
#> ℹ Removing variables with >50% missing values
#> ✔ Removing variables with >50% missing values [45ms]
#>
#> ℹ Imputing missing values
#> ℹ Sample size <= 30, using sample minimum imputation
#> ℹ Imputing missing values✔ Imputing missing values [24ms]
#>
#> ℹ Aggregating data
#> ✔ Aggregating data [886ms]
#>
#> ℹ Normalizing data again
#> ✔ Normalizing data again [16ms]
exp
#>
#> ── Experiment ──────────────────────────────────────────────────────────────────
#> ℹ Expression matrix: 12 samples, 3880 variables
#> ℹ Sample information fields: group <chr>
#> ℹ Variable information fields: protein <chr>, gene <chr>, glycan_composition <comp>, glycan_structure <struct>, protein_site <int>
Now, let’s peek under the hood and see what treasures we’re working with! 👀
get_var_info(exp)
#> # A tibble: 3,880 × 6
#> variable protein gene glycan_composition
#> <chr> <chr> <chr> <comp>
#> 1 V1 P08185 SERPINA6 Hex(5)HexNAc(4)NeuAc(2)
#> 2 V2 P04196 HRG Hex(5)HexNAc(4)NeuAc(1)
#> 3 V3 P04196 HRG Hex(5)HexNAc(4)
#> 4 V4 P04196 HRG Hex(5)HexNAc(4)NeuAc(1)
#> 5 V5 P10909 CLU Hex(6)HexNAc(5)
#> 6 V6 P04196 HRG Hex(5)HexNAc(4)NeuAc(2)
#> 7 V7 P04196 HRG Hex(5)HexNAc(4)
#> 8 V8 P04196 HRG Hex(5)HexNAc(4)dHex(2)
#> 9 V9 P04196 HRG Hex(4)HexNAc(3)
#> 10 V10 P04196 HRG Hex(4)HexNAc(4)NeuAc(1)
#> # ℹ 3,870 more rows
#> # ℹ 2 more variables: glycan_structure <struct>, protein_site <int>
get_sample_info(exp)
#> # A tibble: 12 × 2
#> sample group
#> <chr> <chr>
#> 1 C1 C
#> 2 C2 C
#> 3 C3 C
#> 4 H1 H
#> 5 H2 H
#> 6 H3 H
#> 7 M1 M
#> 8 M2 M
#> 9 M3 M
#> 10 Y1 Y
#> 11 Y2 Y
#> 12 Y3 Y
What we have here is a beautiful N-glycoproteomics dataset featuring 500 PSMs (Peptide Spectrum Matches) across 12 samples — a perfect playground for motif analysis! 🎮
Pro tip: 💡 In real-world data analysis, you’ll
definitely want to use glyclean
to perform data
preprocessing before diving into any analysis. Think of it as washing
your vegetables before cooking—essential for the best results! 🥕
Adding Motif Annotations to an Experiment 🏷️
Here’s where things get interesting! 🤔 We know that the variable information tibble contains all the juicy details about each glycoform - the proteins, sites, and glycan structures. But what if we want to sprinkle some motif magic into the mix? What if we want to add more columns that tell us about the motifs hiding in our glycans?
The simple approach: One motif at a time (very intuitive!) 🎯
exp |>
mutate_var(n_hex = have_motif(glycan_structure, "Hex(a1-")) |>
get_var_info() |>
select(variable, protein, glycan_structure, n_hex)
#> # A tibble: 3,880 × 4
#> variable protein
#> <chr> <chr>
#> 1 V1 P08185
#> 2 V2 P04196
#> 3 V3 P04196
#> 4 V4 P04196
#> 5 V5 P10909
#> 6 V6 P04196
#> 7 V7 P04196
#> 8 V8 P04196
#> 9 V9 P04196
#> 10 V10 P04196
#> # ℹ 3,870 more rows
#> # ℹ 2 more variables: glycan_structure <struct>, n_hex <lgl>
The tempting approach: Multiple motifs at once (you might be tempted to do this…) 🤷♀️
# Don't do this
exp |>
mutate_var(
n_hex = have_motif(glycan_structure, "Hex(a1-"),
n_hexna = have_motif(glycan_structure, "HexNAc(a1-"),
n_dhex = have_motif(glycan_structure, "dHex(a1-")
)
Hold on there, speed racer! 🛑 While this approach works, it’s not
very efficient, and your computer won’t thank you for it. Here’s why:
those three separate calls to have_motif()
all perform
time-consuming validations and conversions on the same set of glycan
structures. It’s like washing the same dishes three times instead of
doing them all at once! 🍽️ Plus, there’s a lot of repetitive typing. You
have to type have_motif
and glycan_structure
three times — talk about finger fatigue! 😴
The smart approach: Use the
add_motifs_lgl()
or add_motifs_int()
functions
instead! ⚡ They might look like simple syntactic sugar, but they’re
actually optimized powerhouses designed specifically for this exact
scenario:
exp2 <- exp |>
add_motifs_lgl(c(motif1 = "Hex(??-", motif2 = "HexNAc(??-", motif3 = "dHex(??-"))
Voilà! 🎉 The motif annotations are now seamlessly integrated into your variable information tibble.
exp2 |>
get_var_info() |>
select(variable, motif1, motif2, motif3)
#> # A tibble: 3,880 × 4
#> variable motif1 motif2 motif3
#> <chr> <lgl> <lgl> <lgl>
#> 1 V1 TRUE TRUE FALSE
#> 2 V2 TRUE TRUE FALSE
#> 3 V3 TRUE TRUE FALSE
#> 4 V4 TRUE TRUE FALSE
#> 5 V5 TRUE TRUE FALSE
#> 6 V6 TRUE TRUE FALSE
#> 7 V7 TRUE TRUE FALSE
#> 8 V8 TRUE TRUE TRUE
#> 9 V9 TRUE TRUE FALSE
#> 10 V10 TRUE TRUE FALSE
#> # ℹ 3,870 more rows
But wait, what can you actually do with these shiny new columns? 🤔 The possibilities are endless, but here’s a tantalizing example to get your creative juices flowing:
# You can perform pathway enrichment on all glycoproteins containing some motif:
exp2 |>
filter_var(motif1 == TRUE) |>
gly_enrich_reactome() # from the `glystats` package
The Art of Motif Quantification in Experiments 📊
Now we’re entering truly exciting territory! 🚀 Meet
quantify_motifs()
— another incredibly useful tool in our
arsenal. This function takes a glyexp::experiment()
object
and returns a brand new one with motif quantifications. But what does
that actually mean? 🤷♂️
Here’s the beautiful concept: when we quantify a motif in a glycoproteomic context, we’re measuring the abundance of motifs on each glycosite in each sample. This information elegantly combines two key pieces: the motif count on each site and the abundance of that site. It’s like knowing not just how many sports cars are in a parking lot, but also how often each parking spot is used! 🚗
motifs <- c(
"NeuAc(??-?)Hex(??-?)HexNAc(??-",
"Hex(??-?)HexNAc(??-",
"HexNAc(??-",
"NeuAc(??-?)Hex(??-?)[dHex(??-?)]HexNAc(??-",
"Hex(??-?)[dHex(??-?)]HexNAc(??-",
"dHex(??-?)HexNAc(??-"
)
motif_exp <- quantify_motifs(exp, motifs, alignments = "terminal", ignore_linkages = TRUE)
# Pro tip: Our motifs don't have linkage information,
# but we can still set `ignore_linkages` to TRUE manually to turbocharge the calculation! ⚡
Here’s the beautiful part: motif_exp
is also a
glyexp::experiment()
! 🎭 Think of it this way: if we
describe a typical glycoproteomics experiment()
as “the
quantities of glycans on different glycosites in different samples”,
then motif_exp
is “the quantities of motifs on different
glycosites in different samples”. It’s the same elegant structure, just
viewed through a different analytical lens! 🔍
Let’s explore our newly minted experiment and see what treasures it holds:
# First, let's peek at the variable information
print(get_var_info(motif_exp), n = 12)
#> # A tibble: 1,644 × 5
#> variable protein protein_site motif gene
#> <chr> <chr> <int> <chr> <chr>
#> 1 V1 A6NJW9 49 NeuAc(??-?)Hex(??-?)HexNAc(??- CD8B2
#> 2 V2 A6NJW9 49 Hex(??-?)HexNAc(??- CD8B2
#> 3 V3 A6NJW9 49 HexNAc(??- CD8B2
#> 4 V4 A6NJW9 49 NeuAc(??-?)Hex(??-?)[dHex(??-?)]HexNAc(?… CD8B2
#> 5 V5 A6NJW9 49 Hex(??-?)[dHex(??-?)]HexNAc(??- CD8B2
#> 6 V6 A6NJW9 49 dHex(??-?)HexNAc(??- CD8B2
#> 7 V7 O14786 150 NeuAc(??-?)Hex(??-?)HexNAc(??- NRP1
#> 8 V8 O14786 150 Hex(??-?)HexNAc(??- NRP1
#> 9 V9 O14786 150 HexNAc(??- NRP1
#> 10 V10 O14786 150 NeuAc(??-?)Hex(??-?)[dHex(??-?)]HexNAc(?… NRP1
#> 11 V11 O14786 150 Hex(??-?)[dHex(??-?)]HexNAc(??- NRP1
#> 12 V12 O14786 150 dHex(??-?)HexNAc(??- NRP1
#> # ℹ 1,632 more rows
# And now, the expression matrix (where the magic numbers live!)
get_expr_mat(motif_exp)[1:12, ]
#> C1 C2 C3 H1 H2 H3 M1 M2
#> V1 4518941 3597488 9291116 1652455 17718.79 18276.18 10131349 4903136.0
#> V2 0 0 0 0 0.00 0.00 0 0.0
#> V3 0 0 0 0 0.00 0.00 0 0.0
#> V4 4518941 3597488 9291116 1652455 17718.79 18276.18 10131349 4903136.0
#> V5 0 0 0 0 0.00 0.00 0 0.0
#> V6 13556823 10792465 27873349 4957366 53156.37 54828.54 30394046 14709407.9
#> V7 8358832 12224955 13576169 13126524 17718.79 18276.18 16723916 904568.5
#> V8 8358832 12224955 13576169 13126524 17718.79 18276.18 16723916 904568.5
#> V9 0 0 0 0 0.00 0.00 0 0.0
#> V10 0 0 0 0 0.00 0.00 0 0.0
#> V11 0 0 0 0 0.00 0.00 0 0.0
#> V12 8358832 12224955 13576169 13126524 17718.79 18276.18 16723916 904568.5
#> M3 Y1 Y2 Y3
#> V1 2876921 697401.2 3718507 3636137
#> V2 0 0.0 0 0
#> V3 0 0.0 0 0
#> V4 2876921 697401.2 3718507 3636137
#> V5 0 0.0 0 0
#> V6 8630764 2092203.5 11155520 10908410
#> V7 3096274 10234681.7 9805953 12449337
#> V8 3096274 10234681.7 9805953 12449337
#> V9 0 0.0 0 0
#> V10 0 0.0 0 0
#> V11 0 0.0 0 0
#> V12 3096274 10234681.7 9805953 12449337
Pretty familiar, right? 😊 It’s just like the experiment you know and love, except now we’re quantifying motifs rather than glycans.
Here’s the best part: this means all your favorite functions
in glystats
can be used on this experiment! 🎉 It’s like
having a universal remote that works with your new TV: everything just
clicks! 📺
library(glystats)
gly_anova(motif_exp)
gly_pca(motif_exp)
# And others...