
Get Started with glyexp
Picture this: you’re knee-deep in omics experiments (especially the fascinating world of glycomics and glycoproteomics), and you’re juggling three types of data like a lab virtuoso:
- Expression data - the actual measurements of your biological molecules (glycans, glycopeptides, and their friends)
- Molecular annotations - the ID cards for your molecules (structures, sequences, you name it)
- Experimental metadata - the story behind your samples (time points, treatments, experimental conditions)
Here’s where glyexp swoops in to save the day! 🦸‍♀️
The experiment() class is your new best friend - think of it as a smart container that keeps all three data types organized and talking to each other. No more scattered spreadsheets or lost annotations!
Why should you care? Every package in the glycoverse ecosystem speaks experiment() fluently. It’s like having a universal translator for your glycomics workflow - everything just clicks together.
library(glyexp)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:glyexp':
#>
#> select_var
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(conflicted)
# Resolve function conflicts - prefer glyexp version over deprecated dplyr version
# `dplyr::select_var` is deprecated anyway, so we can safely override it
conflicts_prefer(glyexp::select_var)
#> [conflicted] Will prefer glyexp::select_var over
#> any other package.
Your First Steps into the Glycoverse
Let’s dive in with our trusty toy experiment - think of it as your training wheels before you tackle the real deal.
toy_exp <- toy_experiment()
toy_exp
#>
#> ── Experiment ──────────────────────────────────────────────────────────────────
#> ℹ Expression matrix: 6 samples, 4 variables
#> ℹ Sample information fields: group and batch
#> ℹ Variable information fields: protein, peptide, and glycan_composition
Look at that beautiful summary! When you print an experiment() object, it’s like getting a snapshot of your entire experimental world - variables, observations, and all the metadata that makes your data meaningful.
Now, let’s peek under the hood. You can extract the three core components faster than you can say “glycosylation”:
🧬 The Expression Matrix - Your Data’s Heart and Soul
get_expr_mat(toy_exp)
#> S1 S2 S3 S4 S5 S6
#> V1 1 5 9 13 17 21
#> V2 2 6 10 14 18 22
#> V3 3 7 11 15 19 23
#> V4 4 8 12 16 20 24
This matrix is where the magic happens - rows are your variables (molecules), columns are your observations (samples), and the numbers tell your biological story.
🏷️ Variable Information - Meet Your Molecules
get_var_info(toy_exp)
#> # A tibble: 4 × 4
#> variable protein peptide glycan_composition
#> <chr> <chr> <chr> <chr>
#> 1 V1 PRO1 PEP1 H5N2
#> 2 V2 PRO2 PEP2 H5N2
#> 3 V3 PRO3 PEP3 N3N2
#> 4 V4 PRO3 PEP4 N3N2
Think of this as your molecular address book - every variable gets its own detailed profile.
📋 Sample Information - Know Your Experiments
get_sample_info(toy_exp)
#> # A tibble: 6 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
#> 4 S4 B 2
#> 5 S5 B 1
#> 6 S6 B 2
And this? This is your experimental diary - tracking every condition, timepoint, and treatment.
Here’s the cool part: Notice how the “variable” column in get_var_info() and the “sample” column in get_sample_info() perfectly match the row and column names in your expression matrix? That’s no accident!
These are the index columns - the secret sauce that keeps everything synchronized. They’re like the GPS coordinates that ensure your data stays connected no matter what transformations you throw at it.
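To make the index-column idea concrete, the sketch below checks the alignment directly. It uses only the accessors shown above; the two comparisons simply encode the claim that the index columns mirror the matrix row and column names, and are meant as an illustration rather than something you need in everyday code.

```r
library(glyexp)

toy_exp <- toy_experiment()
expr_mat <- get_expr_mat(toy_exp)

# The "variable" index column should mirror the matrix row names ...
vars_match <- identical(rownames(expr_mat), get_var_info(toy_exp)$variable)
# ... and the "sample" index column should mirror the matrix column names.
samples_match <- identical(colnames(expr_mat), get_sample_info(toy_exp)$sample)

c(variables = vars_match, samples = samples_match)
```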
Data Wrangling Made Easy - dplyr Meets glyexp
If you’ve ever used dplyr (and who hasn’t?), you’re already 90% of the way there! 🎉
For many of the dplyr functions you know and love, glyexp gives you two specialized versions:
- _obs() functions: work on your sample metadata
- _var() functions: work on your variable annotations
Let’s see this in action. Want to focus on just group “A” samples?
subset_exp <- filter_obs(toy_exp, group == "A")
Let’s check what happened to our sample info:
get_sample_info(subset_exp)
#> # A tibble: 3 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
Beautiful! But here’s where the magic really shines - check out the expression matrix:
get_expr_mat(subset_exp)
#> S1 S2 S3
#> V1 1 5 9
#> V2 2 6 10
#> V3 3 7 11
#> V4 4 8 12
🎪 Ta-da! The expression matrix automatically filtered itself to match! It’s like having a well-trained assistant who anticipates your every move.
This is filter_obs() in a nutshell: “Hey, filter my sample info this way, and oh yeah, make sure everything else follows suit.” And it does, flawlessly.
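If you’re curious what “everything else follows suit” means mechanically, here is a conceptual sketch in base R. It is not glyexp’s actual implementation, just an illustration of the idea: filter the sample table, then use the surviving values of the index column to subset the matrix columns. The objects expr_mat, sample_info, kept, and sub_mat are hypothetical stand-ins that mimic the toy experiment.

```r
# Conceptual sketch only: NOT how glyexp is implemented internally.
# Stand-ins that mimic the toy experiment's matrix and sample information.
expr_mat <- matrix(
  1:24,
  nrow = 4,
  dimnames = list(paste0("V", 1:4), paste0("S", 1:6))
)
sample_info <- data.frame(
  sample = paste0("S", 1:6),
  group = rep(c("A", "B"), each = 3),
  batch = c(1, 2, 1, 2, 1, 2)
)

# Step 1: filter the sample metadata.
kept <- sample_info[sample_info$group == "A", ]

# Step 2: use the index column to keep the matching matrix columns.
sub_mat <- expr_mat[, kept$sample, drop = FALSE]

colnames(sub_mat)
```

Doing both steps by hand every time is exactly the bookkeeping that filter_obs() takes off your plate.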
Variable filtering works the same way:
toy_exp |>
filter_obs(group == "A") |>
filter_var(glycan_composition == "H5N2") |>
get_expr_mat()
#> S1 S2 S3
#> V1 1 5 9
#> V2 2 6 10
Notice how these functions support the pipe operator (|>)? That’s the dplyr DNA in action!
The pattern is simple: glyexp functions are just like their dplyr cousins, but with two superpowers:
- They expect and return experiment() objects (keeping your data ecosystem intact)
- They treat those index columns like precious cargo (no accidental deletions here!)
Complete dplyr Function Reference
Here’s your complete toolkit of supported dplyr-style functions. These functions orchestrate seamless coordination between all three data types - expression matrix, sample information, and variable information - ensuring everything stays perfectly synchronized:
dplyr Function | For Samples (_obs) | For Variables (_var) | What It Does
---|---|---|---
filter() | filter_obs() | filter_var() | Subset rows based on conditions
select() | select_obs() | select_var() | Choose specific columns
arrange() | arrange_obs() | arrange_var() | Reorder rows by column values
mutate() | mutate_obs() | mutate_var() | Create/modify columns
rename() | rename_obs() | rename_var() | Rename columns
slice() | slice_obs() | slice_var() | Select rows by position
slice_head() | slice_head_obs() | slice_head_var() | Select first n rows
slice_tail() | slice_tail_obs() | slice_tail_var() | Select last n rows
slice_sample() | slice_sample_obs() | slice_sample_var() | Select random rows
slice_max() | slice_max_obs() | slice_max_var() | Select rows with highest values
slice_min() | slice_min_obs() | slice_min_var() | Select rows with lowest values
The magic ingredient? Every single one of these functions automatically updates the expression matrix to match your metadata operations. Filter out half your samples? The matrix follows suit. Rearrange your variables? The matrix dances to the same tune.
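As a small illustration of the table, here is a sketch using two functions that this vignette lists but never calls. It assumes mutate_obs() and arrange_obs() accept the same arguments as dplyr::mutate() and dplyr::arrange(); treat the exact signatures as an assumption. The column name batch_label is made up for the example.

```r
library(glyexp)
library(dplyr)

toy_exp <- toy_experiment()

reordered <- toy_exp |>
  # Add a derived column to the sample information
  # (assumed to follow dplyr::mutate() semantics).
  mutate_obs(batch_label = paste0("batch_", batch)) |>
  # Reorder samples by batch, highest first
  # (assumed to follow dplyr::arrange() semantics).
  arrange_obs(desc(batch))

get_sample_info(reordered)

# If the matrix follows the metadata as described above,
# its columns should appear in the same new sample order.
colnames(get_expr_mat(reordered))
```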
What about other dplyr functions? For functions not directly supported (like distinct(), pull(), count(), etc.), simply extract the tibble first and go wild:
# Extract the tibble, then use any dplyr function you want
toy_exp |>
get_sample_info() |>
distinct(group)
toy_exp |>
get_var_info() |>
pull(protein) |>
unique()
toy_exp |>
get_sample_info() |>
count(group)
The Sacred Index Columns - Handle with Care
Remember those index columns we mentioned? Here’s the golden rule: Don’t mess with them directly!
Think of them as the foundation of your data house - you can redecorate all you want, but don’t touch the support beams.
Want to select specific columns from your sample info? Easy:
toy_exp |>
select_obs(group) |>
get_sample_info()
#> # A tibble: 6 × 2
#> sample group
#> <chr> <chr>
#> 1 S1 A
#> 2 S2 A
#> 3 S3 A
#> 4 S4 B
#> 5 S5 B
#> 6 S6 B
See how the “sample” column (our trusty index) stuck around? That’s glyexp being protective of your data integrity.
Even when you try to be sneaky, it’s got your back:
toy_exp |>
select_obs(-starts_with("sample")) |>
get_sample_info()
#> # A tibble: 6 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
#> 4 S4 B 2
#> 5 S5 B 1
#> 6 S6 B 2
Nice try, but that index column isn’t going anywhere! 😄
Slicing and Dicing - Matrix-Style Subsetting
Want to subset your experiment? Think matrix indexing, but smarter:
subset_exp <- toy_exp[, 1:3]
This grabs the first 3 samples, and like a good butler, updates everything else accordingly:
get_expr_mat(subset_exp)
#> S1 S2 S3
#> V1 1 5 9
#> V2 2 6 10
#> V3 3 7 11
#> V4 4 8 12
get_sample_info(subset_exp)
#> # A tibble: 3 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
Both the expression matrix and sample info are perfectly in sync. It’s like they’re dancing to the same tune!
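The example above only indexes columns. Here is a sketch of the row-wise counterpart. It assumes the first index selects variables the same way the second selects samples, which “think matrix indexing” suggests but this vignette does not demonstrate; small_exp is a name made up for the example.

```r
library(glyexp)

toy_exp <- toy_experiment()

# Assumed: rows (variables) come first, columns (samples) second,
# exactly as with an ordinary matrix.
small_exp <- toy_exp[1:2, 1:3]

dim(get_expr_mat(small_exp))
get_var_info(small_exp)
```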
When You Need to Break Free - The Tibble Escape Hatch
The glycoverse ecosystem is pretty comprehensive, but we know there are times when you need to venture beyond our cozy world. When that moment comes, as_tibble() is your bridge to the broader R universe:
as_tibble(toy_exp)
#> # A tibble: 24 × 8
#> sample group batch variable protein peptide glycan_composition value
#> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <int>
#> 1 S1 A 1 V1 PRO1 PEP1 H5N2 1
#> 2 S2 A 2 V1 PRO1 PEP1 H5N2 5
#> 3 S3 A 1 V1 PRO1 PEP1 H5N2 9
#> 4 S4 B 2 V1 PRO1 PEP1 H5N2 13
#> 5 S5 B 1 V1 PRO1 PEP1 H5N2 17
#> 6 S6 B 2 V1 PRO1 PEP1 H5N2 21
#> 7 S1 A 1 V2 PRO2 PEP2 H5N2 2
#> 8 S2 A 2 V2 PRO2 PEP2 H5N2 6
#> 9 S3 A 1 V2 PRO2 PEP2 H5N2 10
#> 10 S4 B 2 V2 PRO2 PEP2 H5N2 14
#> # ℹ 14 more rows
This transforms your experiment() into a beautiful, tidy tibble (what the cool kids call “long format”). Every row is an observation, every column is a variable - the gold standard for data analysis (see “what is tidy data”).
Pro tip: These tibbles can get really long (think novel-length), especially with all that rich metadata. Smart analysts filter their experiments first:
toy_exp |>
filter_var(glycan_composition == "H5N2") |>
select_obs(group) |>
select_var(-glycan_composition) |>
as_tibble()
#> # A tibble: 12 × 6
#> sample group variable protein peptide value
#> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 S1 A V1 PRO1 PEP1 1
#> 2 S2 A V1 PRO1 PEP1 5
#> 3 S3 A V1 PRO1 PEP1 9
#> 4 S4 B V1 PRO1 PEP1 13
#> 5 S5 B V1 PRO1 PEP1 17
#> 6 S6 B V1 PRO1 PEP1 21
#> 7 S1 A V2 PRO2 PEP2 2
#> 8 S2 A V2 PRO2 PEP2 6
#> 9 S3 A V2 PRO2 PEP2 10
#> 10 S4 B V2 PRO2 PEP2 14
#> 11 S5 B V2 PRO2 PEP2 18
#> 12 S6 B V2 PRO2 PEP2 22
Much more manageable, right?
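Once you are in tibble land, ordinary dplyr verbs take over. The sketch below is one possible next step, not a prescribed workflow: it computes the mean value per variable within each group from the long tibble. The names long_tbl, group_means, and mean_value are made up for the example.

```r
library(glyexp)
library(dplyr)

long_tbl <- toy_experiment() |>
  filter_var(glycan_composition == "H5N2") |>
  as_tibble()

# Mean value per variable within each group, using plain dplyr.
group_means <- long_tbl |>
  group_by(variable, group) |>
  summarise(mean_value = mean(value), .groups = "drop")

group_means
```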
Building Your Own Data Empire 🏗️
Ready to graduate from toy experiments to the real deal? Time to build your very own experiment() object!
Think of it like assembling a puzzle - you need three perfect pieces that fit together seamlessly:
🧩 Piece 1: Expression Matrix - Your numerical treasure trove
🧩 Piece 2: Sample Information - The story behind each column
🧩 Piece 3: Variable Information - The identity cards for each row
The Secret Language of Column Names 🏷️
Here’s where things get deliciously organized! The glycoverse ecosystem has its own little naming convention - think of it as a secret handshake between packages.
When a function desperately needs to find your batch information, it’ll go hunting for a column named batch in your sample data. Smart, right? And here’s the beautiful part: if you’re a rebel who likes naming things differently, these functions come with escape hatches (like batch_col = "my_weird_batch_name"). But honestly? Life’s easier when you speak the native tongue from day one! 😉
🗂️ The VIP Column Names - Sample Information Edition:
- sample: Our beloved index column superstar!
- group: Your experimental conditions/treatments (psst… make it a factor!)
- batch: Because batch effects are real, and we need to track them (also a factor, please!)
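Since both group and batch are best stored as factors, here is a minimal sketch of preparing a sample information tibble that way. The sample sheet and its values are hypothetical, and putting the reference condition first in the levels is just one reasonable choice, not something glyexp is known to require.

```r
library(tibble)
library(dplyr)

# Hypothetical sample sheet; in practice this might come from a CSV file.
sample_info <- tibble(
  sample = c("S1", "S2", "S3", "S4"),
  group = c("control", "treated", "control", "treated"),
  batch = c(1, 1, 2, 2)
)

sample_info <- sample_info |>
  mutate(
    # Explicit levels put the reference condition first.
    group = factor(group, levels = c("control", "treated")),
    # Batch numbers are labels, not quantities, so store them as a factor too.
    batch = factor(batch)
  )

levels(sample_info$group)
```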
🧬 The A-List Columns - Variable Information Edition:
- variable: The other half of our dynamic index duo!
- protein: Your protein’s formal name
- gene: The genetic blueprint behind it all
- peptide: The specific sequence doing the heavy lifting
- protein_site: Where exactly that glycan decided to park itself on the protein
- peptide_site: The precise peptide address for glycan attachment
- glycan_composition: Your glycan’s molecular recipe (make it a proper glyrepr::glycan_composition())
- glycan_structure: The full architectural blueprint (should be glyrepr::glycan_structure())
🎁 Here’s a little secret: If you’re using glyread to birth your experiment() objects, it’s like having a personal assistant - it’ll handle all the variable information columns for you! You just need to worry about getting your sample information tibble dressed up properly. Talk about division of labor! 💪
But wait, there’s more! You’ll also need to tell experiment() what kind of scientific story you’re telling:
🔬 Experiment Type - Are you diving into pure glycomics (“glycomics”) or exploring the protein-glycan dance (“glycoproteomics”)?
🍃 Glycan Type - Are you studying N-linked (“N”) or O-linked (“O”) glycans?
These metadata fields help other glycoverse packages understand your data context and provide the right analysis tools.
Once you have these five elements ready, creating an experiment() is as easy as saying “glycosylation”!
library(tibble)
# Step 1: Craft your sample story
sample_info <- tibble(
sample = c("sample1", "sample2", "sample3"),
group = c("A", "B", "A")
)
# Step 2: Define your molecular cast
var_info <- tibble(
variable = c("variable1", "variable2", "variable3"),
glycan_composition = c("H3N2", "H4N2", "H5N2")
)
# Step 3: Generate some exciting (fake) data
expr_mat <- matrix(runif(9, 0, 100), nrow = 3, ncol = 3)
rownames(expr_mat) <- var_info$variable
colnames(expr_mat) <- sample_info$sample
# Step 4: The magic moment - bring it all together! ✨
# Don't forget to specify your experiment type and glycan type!
exp <- experiment(
expr_mat = expr_mat,
sample_info = sample_info,
var_info = var_info,
exp_type = "glycomics", # "glycomics" or "glycoproteomics"
glycan_type = "N" # "N" or "O" linked glycans
)
exp
#>
#> ── Experiment ──────────────────────────────────────────────────────────────────
#> ℹ Expression matrix: 3 samples, 3 variables
#> ℹ Sample information fields: group
#> ℹ Variable information fields: glycan_composition
Voilà! 🎉 You’ve just created your first custom experiment() object! Notice how all the pieces click together perfectly - the row names match your variable IDs, the column names align with your sample IDs, and everything is beautifully synchronized.
Need to add more metadata? You can pass additional information through the ... parameter:
exp_with_metadata <- experiment(
expr_mat = expr_mat,
sample_info = sample_info,
var_info = var_info,
exp_type = "glycoproteomics",
glycan_type = "O",
instrument = "Orbitrap Fusion",
analysis_date = "2023-12-01",
lab = "Glycoverse Research Lab"
)
This extra metadata gets stored in the experiment’s meta_data element (here, exp_with_metadata$meta_data) and can be used by other glycoverse packages for analysis-specific functionality.
Pro tip: In real life, your expression matrix and variable information might come from software like pGlyco3, and your sample info from a separate CSV file. No matter the source, as long as those index columns match up, experiment() will happily bring them together into one harmonious data structure!
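When the pieces come from different sources, it can help to confirm that the index columns really do match before handing everything over. The base R sketch below uses hypothetical inputs and makes a conservative choice: this vignette does not say whether experiment() reorders mismatched inputs for you, so the sample sheet is aligned to the matrix explicitly.

```r
# Hypothetical inputs: a matrix from quantification software and a
# sample sheet whose rows happen to be in a different order.
expr_mat <- matrix(
  c(10, 20, 30, 40, 50, 60),
  nrow = 2,
  dimnames = list(
    c("variable1", "variable2"),
    c("sample1", "sample2", "sample3")
  )
)
sample_info <- data.frame(
  sample = c("sample3", "sample1", "sample2"),
  group = c("A", "A", "B")
)

# Fail early if the two sources do not describe the same set of samples.
stopifnot(setequal(colnames(expr_mat), sample_info$sample))

# Reorder the sample sheet so its rows follow the matrix columns.
sample_info <- sample_info[match(colnames(expr_mat), sample_info$sample), ]
rownames(sample_info) <- NULL

sample_info
```

The same check works for the variable side: compare rownames(expr_mat) against the variable column of your variable information.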
Pro tip again: If you are using pGlyco3 or other software for glycopeptide identification and quantification, you can try the glyread package, designed to create experiment() objects from the output of annotation software.
Standing on the Shoulders of Giants
Designing experiment() wasn’t done in a vacuum - we learned from some amazing predecessors:
SummarizedExperiment 📊
The granddaddy of omics data containers from Bioconductor. Solid as a rock for RNA-seq, but not quite “tidy” enough for our taste.
tidySummarizedExperiment 🧹
A brilliant attempt to bring tidy principles to SummarizedExperiment from the tidySummarizedExperiment package. We love the concept, but felt that cramming everything into one tibble doesn’t quite capture the mental model of separated data types.
massdataset 🔬
Our closest cousin! The massdataset package gets so many things right - tidy operations, clean data separation, perfect for mass spec data. We especially admire its data processing history tracking (reproducibility FTW!).
But here’s our twist: while object-oriented programming has its merits, we believe most R users think functionally. Your code is your reproducibility trail - elegant, transparent, and familiar to every R user.
Our Philosophy 💭
We chose the functional programming path because it feels like home to R users. No hidden states, no mysterious transformations - just clear, chainable functions that do exactly what they say on the tin.
Huge thanks to all the developers who paved this road. glyexp exists because of your groundbreaking work! 🙏