Get Started with glyexp • glyexp

Picture this: you’re knee-deep in omics experiments (especially the fascinating world of glycomics and glycoproteomics), and you’re juggling three types of data like a lab virtuoso:

Expression data - the actual measurements of your biological molecules (glycans, glycopeptides, and their friends)
Molecular annotations - the ID cards for your molecules (structures, sequences, you name it)
Experimental metadata - the story behind your samples (time points, treatments, experimental conditions)

Here’s where glyexp swoops in to save the day! 🦸‍♀️

The experiment() class is your new best friend - think of it as a smart container that keeps all three data types organized and talking to each other. No more scattered spreadsheets or lost annotations!

Why should you care? Every package in the glycoverse ecosystem speaks experiment() fluently. It’s like having a universal translator for your glycomics workflow - everything just clicks together.

library(glyexp)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:glyexp':
#> 
#>     select_var
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(conflicted)

# Resolve function conflicts - prefer glyexp version over deprecated dplyr version
# `dplyr::select_var` is deprecated anyway, so we can safely override it
conflicts_prefer(glyexp::select_var)
#> [conflicted] Will prefer glyexp::select_var over
#> any other package.

Your First Steps into the Glycoverse

Let’s dive in with our trusty toy experiment - think of it as your training wheels before you tackle the real deal.

toy_exp <- toy_experiment()
toy_exp
#> 
#> ── Experiment ──────────────────────────────────────────────────────────────────
#> ℹ Expression matrix: 6 samples, 4 variables
#> ℹ Sample information fields: group and batch
#> ℹ Variable information fields: protein, peptide, and glycan_composition

Look at that beautiful summary! When you print an experiment() object, it’s like getting a snapshot of your entire experimental world - variables, observations, and all the metadata that makes your data meaningful.

Now, let’s peek under the hood. You can extract the three core components faster than you can say “glycosylation”:

🧬 The Expression Matrix - Your Data’s Heart and Soul

get_expr_mat(toy_exp)
#>    S1 S2 S3 S4 S5 S6
#> V1  1  5  9 13 17 21
#> V2  2  6 10 14 18 22
#> V3  3  7 11 15 19 23
#> V4  4  8 12 16 20 24

This matrix is where the magic happens - rows are your variables (molecules), columns are your observations (samples), and the numbers tell your biological story.

🏷️ Variable Information - Meet Your Molecules

get_var_info(toy_exp)
#> # A tibble: 4 × 4
#>   variable protein peptide glycan_composition
#>   <chr>    <chr>   <chr>   <chr>             
#> 1 V1       PRO1    PEP1    H5N2              
#> 2 V2       PRO2    PEP2    H5N2              
#> 3 V3       PRO3    PEP3    N3N2              
#> 4 V4       PRO3    PEP4    N3N2

Think of this as your molecular address book - every variable gets its own detailed profile.

📋 Sample Information - Know Your Experiments

get_sample_info(toy_exp)
#> # A tibble: 6 × 3
#>   sample group batch
#>   <chr>  <chr> <dbl>
#> 1 S1     A         1
#> 2 S2     A         2
#> 3 S3     A         1
#> 4 S4     B         2
#> 5 S5     B         1
#> 6 S6     B         2

And this? This is your experimental diary - tracking every condition, timepoint, and treatment.

Here’s the cool part: Notice how the “variable” column in get_var_info() and the “sample” column in get_sample_info() perfectly match the row and column names in your expression matrix? That’s no accident!

These are the index columns - the secret sauce that keeps everything synchronized. They’re like the GPS coordinates that ensure your data stays connected no matter what transformations you throw at it.

Data Wrangling Made Easy - dplyr Meets glyexp

If you’ve ever used dplyr (and who hasn’t?), you’re already 90% of the way there! 🎉

For every dplyr function you know and love, glyexp gives you two specialized versions:

_obs() functions: work on your sample metadata
_var() functions: work on your variable annotations

Let’s see this in action. Want to focus on just group “A” samples?

subset_exp <- filter_obs(toy_exp, group == "A")

Let’s check what happened to our sample info:

get_sample_info(subset_exp)
#> # A tibble: 3 × 3
#>   sample group batch
#>   <chr>  <chr> <dbl>
#> 1 S1     A         1
#> 2 S2     A         2
#> 3 S3     A         1

Beautiful! But here’s where the magic really shines - check out the expression matrix:

get_expr_mat(subset_exp)
#>    S1 S2 S3
#> V1  1  5  9
#> V2  2  6 10
#> V3  3  7 11
#> V4  4  8 12

🎪 Ta-da! The expression matrix automatically filtered itself to match! It’s like having a well-trained assistant who anticipates your every move.

This is filter_obs() in a nutshell: “Hey, filter my sample info this way, and oh yeah, make sure everything else follows suit.” And it does, flawlessly.

Variable filtering works the same way:

toy_exp |>
  filter_obs(group == "A") |>
  filter_var(glycan_composition == "H5N2") |>
  get_expr_mat()
#>    S1 S2 S3
#> V1  1  5  9
#> V2  2  6 10

Notice how these functions support the pipe operator (|>)? That’s the dplyr DNA in action!

The pattern is simple: glyexp functions are just like their dplyr cousins, but with two superpowers:

They expect and return experiment() objects (keeping your data ecosystem intact)
They treat those index columns like precious cargo (no accidental deletions here!)

Complete dplyr Function Reference

Here’s your complete toolkit of supported dplyr-style functions. These functions orchestrate seamless coordination between all three data types - expression matrix, sample information, and variable information - ensuring everything stays perfectly synchronized:

dplyr Function	For Samples (`_obs`)	For Variables (`_var`)	What It Does
`filter()`	`filter_obs()`	`filter_var()`	Subset rows based on conditions
`select()`	`select_obs()`	`select_var()`	Choose specific columns
`arrange()`	`arrange_obs()`	`arrange_var()`	Reorder rows by column values
`mutate()`	`mutate_obs()`	`mutate_var()`	Create/modify columns
`rename()`	`rename_obs()`	`rename_var()`	Rename columns
`slice()`	`slice_obs()`	`slice_var()`	Select rows by position
`slice_head()`	`slice_head_obs()`	`slice_head_var()`	Select first n rows
`slice_tail()`	`slice_tail_obs()`	`slice_tail_var()`	Select last n rows
`slice_sample()`	`slice_sample_obs()`	`slice_sample_var()`	Select random rows
`slice_max()`	`slice_max_obs()`	`slice_max_var()`	Select rows with highest values
`slice_min()`	`slice_min_obs()`	`slice_min_var()`	Select rows with lowest values

The magic ingredient? Every single one of these functions automatically updates the expression matrix to match your metadata operations. Filter out half your samples? The matrix follows suit. Rearrange your variables? The matrix dances to the same tune.

What about other dplyr functions? For functions not directly supported (like distinct(), pull(), count(), etc.), simply extract the tibble first and go wild:

# Extract the tibble, then use any dplyr function you want
toy_exp |>
  get_sample_info() |>
  distinct(group)

toy_exp |>
  get_var_info() |>
  pull(protein) |>
  unique()

toy_exp |>
  get_sample_info() |>
  count(group)

The Sacred Index Columns - Handle with Care

Remember those index columns we mentioned? Here’s the golden rule: Don’t mess with them directly!

Think of them as the foundation of your data house - you can redecorate all you want, but don’t touch the support beams.

Want to select specific columns from your sample info? Easy:

toy_exp |>
  select_obs(group) |>
  get_sample_info()
#> # A tibble: 6 × 2
#>   sample group
#>   <chr>  <chr>
#> 1 S1     A    
#> 2 S2     A    
#> 3 S3     A    
#> 4 S4     B    
#> 5 S5     B    
#> 6 S6     B

See how the “sample” column (our trusty index) stuck around? That’s glyexp being protective of your data integrity.

Even when you try to be sneaky, it’s got your back:

toy_exp |>
  select_obs(-starts_with("sample")) |>
  get_sample_info()
#> # A tibble: 6 × 3
#>   sample group batch
#>   <chr>  <chr> <dbl>
#> 1 S1     A         1
#> 2 S2     A         2
#> 3 S3     A         1
#> 4 S4     B         2
#> 5 S5     B         1
#> 6 S6     B         2

Nice try, but that index column isn’t going anywhere! 😄

Slicing and Dicing - Matrix-Style Subsetting

Want to subset your experiment? Think matrix indexing, but smarter:

subset_exp <- toy_exp[, 1:3]

This grabs the first 3 samples, and like a good butler, updates everything else accordingly:

get_expr_mat(subset_exp)
#>    S1 S2 S3
#> V1  1  5  9
#> V2  2  6 10
#> V3  3  7 11
#> V4  4  8 12

get_sample_info(subset_exp)
#> # A tibble: 3 × 3
#>   sample group batch
#>   <chr>  <chr> <dbl>
#> 1 S1     A         1
#> 2 S2     A         2
#> 3 S3     A         1

Both the expression matrix and sample info are perfectly in sync. It’s like they’re dancing to the same tune!

When You Need to Break Free - The Tibble Escape Hatch

The glycoverse ecosystem is pretty comprehensive, but we know there are times when you need to venture beyond our cozy world. When that moment comes, as_tibble() is your bridge to the broader R universe:

as_tibble(toy_exp)
#> # A tibble: 24 × 8
#>    sample group batch variable protein peptide glycan_composition value
#>    <chr>  <chr> <dbl> <chr>    <chr>   <chr>   <chr>              <int>
#>  1 S1     A         1 V1       PRO1    PEP1    H5N2                   1
#>  2 S2     A         2 V1       PRO1    PEP1    H5N2                   5
#>  3 S3     A         1 V1       PRO1    PEP1    H5N2                   9
#>  4 S4     B         2 V1       PRO1    PEP1    H5N2                  13
#>  5 S5     B         1 V1       PRO1    PEP1    H5N2                  17
#>  6 S6     B         2 V1       PRO1    PEP1    H5N2                  21
#>  7 S1     A         1 V2       PRO2    PEP2    H5N2                   2
#>  8 S2     A         2 V2       PRO2    PEP2    H5N2                   6
#>  9 S3     A         1 V2       PRO2    PEP2    H5N2                  10
#> 10 S4     B         2 V2       PRO2    PEP2    H5N2                  14
#> # ℹ 14 more rows

This transforms your experiment() into a beautiful, tidy tibble (what the cool kids call “long format”). Every row is an observation, every column is a variable - the gold standard for data analysis (see what is tidy data).

Pro tip: These tibbles can get really long (think novel-length), especially with all that rich metadata. Smart analysts filter their experiments first:

toy_exp |>
  filter_var(glycan_composition == "H5N2") |>
  select_obs(group) |>
  select_var(-glycan_composition) |>
  as_tibble()
#> # A tibble: 12 × 6
#>    sample group variable protein peptide value
#>    <chr>  <chr> <chr>    <chr>   <chr>   <int>
#>  1 S1     A     V1       PRO1    PEP1        1
#>  2 S2     A     V1       PRO1    PEP1        5
#>  3 S3     A     V1       PRO1    PEP1        9
#>  4 S4     B     V1       PRO1    PEP1       13
#>  5 S5     B     V1       PRO1    PEP1       17
#>  6 S6     B     V1       PRO1    PEP1       21
#>  7 S1     A     V2       PRO2    PEP2        2
#>  8 S2     A     V2       PRO2    PEP2        6
#>  9 S3     A     V2       PRO2    PEP2       10
#> 10 S4     B     V2       PRO2    PEP2       14
#> 11 S5     B     V2       PRO2    PEP2       18
#> 12 S6     B     V2       PRO2    PEP2       22

Much more manageable, right?

Building Your Own Data Empire 🏗️

Ready to graduate from toy experiments to the real deal? Time to build your very own experiment() object!

Think of it like assembling a puzzle - you need three perfect pieces that fit together seamlessly:

🧩 Piece 1: Expression Matrix - Your numerical treasure trove
🧩 Piece 2: Sample Information - The story behind each column
🧩 Piece 3: Variable Information - The identity cards for each row

The Secret Language of Column Names 🏷️

Here’s where things get deliciously organized! The glycoverse ecosystem has its own little naming convention - think of it as a secret handshake between packages.

When a function desperately needs to find your batch information, it’ll go hunting for a column named batch in your sample data. Smart, right? And here’s the beautiful part: if you’re a rebel who likes naming things differently, these functions come with escape hatches (like batch_col = "my_weird_batch_name"). But honestly? Life’s easier when you speak the native tongue from day one! 😉

🗂️ The VIP Column Names - Sample Information Edition:

sample: Our beloved index column superstar!
group: Your experimental conditions/treatments (psst… make it a factor!)
batch: Because batch effects are real, and we need to track them (also a factor, please!)

🧬 The A-List Columns - Variable Information Edition:

variable: The other half of our dynamic index duo!
protein: Your protein’s formal name
gene: The genetic blueprint behind it all
peptide: The specific sequence doing the heavy lifting
protein_site: Where exactly that glycan decided to park itself on the protein
peptide_site: The precise peptide address for glycan attachment
glycan_composition: Your glycan’s molecular recipe (make it a proper glyrepr::glycan_composition())
glycan_structure: The full architectural blueprint (should be glyrepr::glycan_structure())

🎁 Here’s a little secret: If you’re using glyread to birth your experiment() objects, it’s like having a personal assistant - it’ll handle all the variable information columns for you! You just need to worry about getting your sample information tibble dressed up properly. Talk about division of labor! 💪

But wait, there’s more! You’ll also need to tell experiment() what kind of scientific story you’re telling:

🔬 Experiment Type - Are you diving into pure glycomics (“glycomics”) or exploring the protein-glycan dance (“glycoproteomics”)?
🍃 Glycan Type - Are you studying N-linked (“N”) or O-linked (“O”) glycans?

These metadata fields help other glycoverse packages understand your data context and provide the right analysis tools.

Once you have these five elements ready, creating an experiment() is as easy as saying “glycosylation”!

library(tibble)

# Step 1: Craft your sample story
sample_info <- tibble(
  sample = c("sample1", "sample2", "sample3"),
  group = c("A", "B", "A")
)

# Step 2: Define your molecular cast
var_info <- tibble(
  variable = c("variable1", "variable2", "variable3"),
  glycan_composition = c("H3N2", "H4N2", "H5N2")
)

# Step 3: Generate some exciting (fake) data
expr_mat <- matrix(runif(9, 0, 100), nrow = 3, ncol = 3)
rownames(expr_mat) <- var_info$variable
colnames(expr_mat) <- sample_info$sample

# Step 4: The magic moment - bring it all together! ✨
# Don't forget to specify your experiment type and glycan type!
exp <- experiment(
  expr_mat = expr_mat,
  sample_info = sample_info,
  var_info = var_info,
  exp_type = "glycomics",      # "glycomics" or "glycoproteomics"
  glycan_type = "N"            # "N" or "O" linked glycans
)

exp
#> 
#> ── Experiment ──────────────────────────────────────────────────────────────────
#> ℹ Expression matrix: 3 samples, 3 variables
#> ℹ Sample information fields: group
#> ℹ Variable information fields: glycan_composition

Voilà! 🎉 You’ve just created your first custom experiment() object! Notice how all the pieces click together perfectly - the row names match your variable IDs, the column names align with your sample IDs, and everything is beautifully synchronized.

Need to add more metadata? You can pass additional information through the ... parameter:

exp_with_metadata <- experiment(
  expr_mat = expr_mat,
  sample_info = sample_info,
  var_info = var_info,
  exp_type = "glycoproteomics",
  glycan_type = "O",
  instrument = "Orbitrap Fusion",
  analysis_date = "2023-12-01",
  lab = "Glycoverse Research Lab"
)

This extra metadata gets stored in exp$meta_data and can be used by other glycoverse packages for analysis-specific functionality.

Pro tip: In real life, your expression matrix and variable information might come from a software like pGlyco3, and your sample info from a separate csv file. No matter the source, as long as those index columns match up, experiment() will happily bring them together into one harmonious data structure!

Pro tip again: If you are using pGlyco3 or other softwares for glycopeptide identification and quantification, you can try the glyread package, designed to create experiment()s from the output of annotation softwares.

Standing on the Shoulders of Giants

Designing experiment() wasn’t done in a vacuum - we learned from some amazing predecessors:

SummarizedExperiment 📊
The granddaddy of omics data containers from Bioconductor. Solid as a rock for RNA-seq, but not quite “tidy” enough for our taste.

tidySummarizedExperiment 🧹
A brilliant attempt to bring tidy principles to SummarizedExperiment from the tidySummarizedExperiment package. We love the concept, but felt that cramming everything into one tibble doesn’t quite capture the mental model of separated data types.

massdataset 🔬
Our closest cousin! The massdataset package gets so many things right - tidy operations, clean data separation, perfect for mass spec data. We especially admire its data processing history tracking (reproducibility FTW!).

But here’s our twist: while object-oriented programming has its merits, we believe most R users think functionally. Your code is your reproducibility trail - elegant, transparent, and familiar to every R user.

Our Philosophy 💭
We chose the functional programming path because it feels like home to R users. No hidden states, no mysterious transformations - just clear, chainable functions that do exactly what they say on the tin.

Huge thanks to all the developers who paved this road. glyexp exists because of your groundbreaking work! 🙏