
dplyr-Style Functions: Data Harmony in Action
Source: vignettes/dplyr-style-functions.Rmd
This vignette describes glyexp’s dplyr-style functions for synchronized data manipulation.
When working with multi-table datasets, filtering one table can desynchronize your data from other components. Rearranging another table can break carefully established relationships.
glyexp’s dplyr-style functions address this by tracking the connection between your expression matrix, sample information, and variable annotations. When you transform one component, the others stay in sync.
Note: These functions only work with experiment() objects. They cannot be used on regular data.frames, tibbles, or other data structures.
library(glyexp)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(conflicted)
conflicts_prefer(glyexp::select_var)
#> [conflicted] Will prefer glyexp::select_var over
#> any other package.
conflicts_prefer(dplyr::filter)
#> [conflicted] Will prefer dplyr::filter over any
#> other package.
Core Philosophy: One Action, Three Updates
glyexp’s dplyr-style functions work on three components:
- Expression Matrix: Numerical data
- Sample Info: Experimental metadata
- Variable Info: Molecular annotations
In traditional data analysis, filtering samples requires manually updating all related tables. glyexp’s dplyr-style functions handle this synchronization automatically.
Here’s an example:
toy_exp <- toy_experiment
print(toy_exp)
#>
#> ── Others Experiment ───────────────────────────────────────────────────────────
#> ℹ Expression matrix: 6 samples, 4 variables
#> ℹ Sample information fields: group <chr>, batch <dbl>
#> ℹ Variable information fields: protein <chr>, peptide <chr>, glycan_composition <chr>
Two Flavors: _obs() and _var()
Every dplyr-style function in glyexp comes in two variants:
- _obs() functions: Work on sample information in experiment() objects
- _var() functions: Work on variable annotations in experiment() objects
Both variants automatically update the expression matrix to maintain synchronization.
These functions require an experiment() object as input
and return an experiment() object as output. For standard
tibbles or data.frames, use regular dplyr functions directly.
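For comparison, here is a rough sketch of what a sample filter looks like when done by hand in base R. The objects expr_mat and sample_info are hypothetical stand-ins built for this illustration; they are not glyexp objects, and the code is not how glyexp is implemented.

```r
# Hypothetical stand-in data (plain matrix and data.frame, not an experiment())
expr_mat <- matrix(
  1:24,
  nrow = 4,
  dimnames = list(paste0("V", 1:4), paste0("S", 1:6))
)
sample_info <- data.frame(
  sample = paste0("S", 1:6),
  group = rep(c("A", "B"), each = 3),
  stringsAsFactors = FALSE
)

# Step 1: filter the sample table
kept <- sample_info[sample_info$group == "A", ]

# Step 2: remember to subset the matrix columns to the same samples
expr_kept <- expr_mat[, kept$sample, drop = FALSE]

dim(expr_kept)
#> [1] 4 3
```

Forgetting step 2 leaves a three-row sample table next to a six-column matrix, which is the kind of mismatch the _obs() and _var() functions are meant to rule out.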
Filtering
Filtering is the most common operation:
Sample-Based Filtering with filter_obs()
Say you want to focus only on group “A” samples:
# Before filtering - let's see what we have
get_sample_info(toy_exp)
#> # A tibble: 6 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
#> 4 S4 B 2
#> 5 S5 B 1
#> 6 S6 B 2
# Filter for group A samples only
filtered_exp <- filter_obs(toy_exp, group == "A")
get_sample_info(filtered_exp)
#> # A tibble: 3 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
Check the expression matrix:
# Original matrix dimensions:
dim(get_expr_mat(toy_exp))
#> [1] 4 6
# Original matrix:
get_expr_mat(toy_exp)
#> S1 S2 S3 S4 S5 S6
#> V1 1 5 9 13 17 21
#> V2 2 6 10 14 18 22
#> V3 3 7 11 15 19 23
#> V4 4 8 12 16 20 24
# Filtered expression matrix - automatically updated!
# Filtered matrix dimensions:
dim(get_expr_mat(filtered_exp))
#> [1] 4 3
# Filtered matrix:
get_expr_mat(filtered_exp)
#> S1 S2 S3
#> V1 1 5 9
#> V2 2 6 10
#> V3 3 7 11
#> V4 4 8 12
The expression matrix is automatically filtered to match the remaining samples.
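The same check can be made programmatically. The sketch below uses only functions shown in this vignette and assumes toy_experiment is available as a dataset, as in the earlier examples. If the matrix columns and the sample index agree, identical() should return TRUE.

```r
library(glyexp)

# Filter samples, then compare the matrix column names with the sample index
filtered_exp <- filter_obs(toy_experiment, group == "A")

in_sync <- identical(
  colnames(get_expr_mat(filtered_exp)),
  get_sample_info(filtered_exp)$sample
)
in_sync
```

A check like this can be handy in tests or at the end of a long pipeline.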
Variable-Based Filtering with filter_var()
Now let’s filter variables and watch the same magic happen:
# Filter for specific glycan compositions
var_filtered_exp <- filter_var(toy_exp, glycan_composition == "H5N2")
get_var_info(var_filtered_exp)
#> # A tibble: 2 × 4
#> variable protein peptide glycan_composition
#> <chr> <chr> <chr> <chr>
#> 1 V1 PRO1 PEP1 H5N2
#> 2 V2 PRO2 PEP2 H5N2
# The expression matrix rows automatically follow suit!
get_expr_mat(var_filtered_exp)
#> S1 S2 S3 S4 S5 S6
#> V1 1 5 9 13 17 21
#> V2 2 6 10 14 18 22
The matrix rows are automatically reduced to match the filtered variables! This is the core power of glyexp: you think about your metadata, and the expression data follows your lead.
Chaining Filters
Both samples and variables can be filtered by chaining operations:
double_filtered <- toy_exp |>
  filter_obs(group == "A") |>
  filter_var(glycan_composition %in% c("H5N2", "N3N2"))
# Final dimensions after double filtering:
dim(get_expr_mat(double_filtered))
#> [1] 2 3
get_expr_mat(double_filtered)
#> S1 S2 S3
#> V1 1 5 9
#> V2 2 6 10
As shown above, the functions work with the pipe operator.
Index Columns: Guardians of Data Integrity
Index columns (like “sample” and “variable”) are essential for maintaining data relationships. Removing them would break synchronization.
Let’s see this protection in action:
Attempting to Remove Index Columns
# Try to select everything EXCEPT the sample index column
protective_exp <- select_obs(toy_exp, -sample)
#> Error:
#> ! You should not explicitly select or deselect the "sample" column in
#> `sample_info`.
#> ℹ The "sample" column will be handled by `select_obs()` or `select_var()`
#> automatically.
get_sample_info(protective_exp)
#> Error:
#> ! object 'protective_exp' not found
glyexp throws an error to protect data integrity. The same applies to variable information:
# Same protection for variable info
protective_var_exp <- select_var(toy_exp, -variable)
#> Error:
#> ! You should not explicitly select or deselect the "variable" column in
#> `var_info`.
#> ℹ The "variable" column will be handled by `select_obs()` or `select_var()`
#> automatically.
get_var_info(protective_var_exp)
#> Error:
#> ! object 'protective_var_exp' not found
Similarly, glyexp throws an error to protect the “variable” column from being removed.
Why This Protection Matters
Without index columns, an experiment() object would lose
its ability to:
- Keep expression matrix and metadata synchronized
- Validate data consistency
- Enable seamless subsetting operations
- Work with other glycoverse packages
Index columns are essential for maintaining data relationships.
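To illustrate why the index matters, here is a small base R sketch. The objects are hypothetical toy data rather than an experiment(), and the code only demonstrates the general idea of name-based alignment, not glyexp internals.

```r
# Hypothetical toy data: matrix columns and table rows are in different orders
expr_mat <- matrix(
  1:6,
  nrow = 2,
  dimnames = list(c("V1", "V2"), c("S1", "S2", "S3"))
)
sample_info <- data.frame(
  sample = c("S3", "S1", "S2"),
  group = c("B", "A", "A"),
  stringsAsFactors = FALSE
)

# With the index column, the matrix can be realigned to the table by name
aligned <- expr_mat[, sample_info$sample, drop = FALSE]
colnames(aligned)
#> [1] "S3" "S1" "S2"

# Without it, only row position remains, and nothing links rows to columns
no_index <- sample_info["group"]
names(no_index)
#> [1] "group"
```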
Complete Function Reference
glyexp provides dplyr-style equivalents for common data manipulation
functions. Each function comes in both _obs() and
_var() variants, and all automatically maintain matrix
synchronization.
These functions are methods specifically for
experiment() objects.
Core Data Manipulation Functions
| Standard dplyr | Sample Operations | Variable Operations | Description |
|---|---|---|---|
| filter() | filter_obs() | filter_var() | Subset with sync |
| select() | select_obs() | select_var() | Choose with protection |
| arrange() | arrange_obs() | arrange_var() | Sort with order |
| mutate() | mutate_obs() | mutate_var() | Create with consistency |
| rename() | rename_obs() | rename_var() | Rename with safety |
Advanced Slicing Functions
| Standard dplyr | Sample Operations | Variable Operations | Description |
|---|---|---|---|
| slice() | slice_obs() | slice_var() | Position-based selection |
| slice_head() | slice_head_obs() | slice_head_var() | Top n with sync |
| slice_tail() | slice_tail_obs() | slice_tail_var() | Bottom n with sync |
| slice_sample() | slice_sample_obs() | slice_sample_var() | Random with consistency |
| slice_max() | slice_max_obs() | slice_max_var() | Highest values with order |
| slice_min() | slice_min_obs() | slice_min_var() | Lowest values with order |
Joining Functions
| Standard dplyr | Sample Operations | Variable Operations | Description |
|---|---|---|---|
| left_join() | left_join_obs() | left_join_var() | Add new columns from another table (left join) |
| inner_join() | inner_join_obs() | inner_join_var() | Add new columns from another table (inner join) |
| semi_join() | semi_join_obs() | semi_join_var() | Filter rows from another table (semi join) |
| anti_join() | anti_join_obs() | anti_join_var() | Filter rows from another table (anti join) |
Function-by-Function Examples
Selection
# Select specific columns from sample info
selected_exp <- select_obs(toy_exp, group, batch)
get_sample_info(selected_exp)
#> # A tibble: 6 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
#> 4 S4 B 2
#> 5 S5 B 1
#> 6 S6 B 2
# Select columns from variable info (notice the index protection!)
var_selected_exp <- select_var(toy_exp, glycan_composition)
get_var_info(var_selected_exp)
#> # A tibble: 4 × 2
#> variable glycan_composition
#> <chr> <chr>
#> 1 V1 H5N2
#> 2 V2 H5N2
#> 3 V3 H3N2
#> 4 V4 H3N2
Use dplyr-style helpers like starts_with(), ends_with(), and contains():
# Select columns starting with "glycan"
helper_exp <- select_var(toy_exp, starts_with("glycan"))
get_var_info(helper_exp)
#> # A tibble: 4 × 2
#> variable glycan_composition
#> <chr> <chr>
#> 1 V1 H5N2
#> 2 V2 H5N2
#> 3 V3 H3N2
#> 4 V4 H3N2
Arrangement
# Arrange samples by batch and group
arranged_exp <- arrange_obs(toy_exp, batch, group)
get_sample_info(arranged_exp)
#> # A tibble: 6 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S3 A 1
#> 3 S5 B 1
#> 4 S2 A 2
#> 5 S4 B 2
#> 6 S6 B 2
Note how the expression matrix columns are rearranged to match:
# Expression matrix columns follow the new sample order
get_expr_mat(arranged_exp)
#> S1 S3 S5 S2 S4 S6
#> V1 1 9 17 5 13 21
#> V2 2 10 18 6 14 22
#> V3 3 11 19 7 15 23
#> V4 4 12 20 8 16 24
Mutation
# Add a new calculated column to sample info
mutated_exp <- mutate_obs(
  toy_exp,
  group_batch = paste(group, batch, sep = "_")
)
get_sample_info(mutated_exp)
#> # A tibble: 6 × 4
#> sample group batch group_batch
#> <chr> <chr> <dbl> <chr>
#> 1 S1 A 1 A_1
#> 2 S2 A 2 A_2
#> 3 S3 A 1 A_1
#> 4 S4 B 2 B_2
#> 5 S5 B 1 B_1
#> 6 S6 B 2 B_2
# Create a complexity score for variables
complex_exp <- mutate_var(
  toy_exp,
  complexity = nchar(glycan_composition)
)
get_var_info(complex_exp)
#> # A tibble: 4 × 5
#> variable protein peptide glycan_composition complexity
#> <chr> <chr> <chr> <chr> <int>
#> 1 V1 PRO1 PEP1 H5N2 4
#> 2 V2 PRO2 PEP2 H5N2 4
#> 3 V3 PRO3 PEP3 H3N2 4
#> 4 V4 PRO3 PEP4 H3N2 4
Slicing
# Take the first 2 samples
head_exp <- slice_head_obs(toy_exp, n = 2)
get_sample_info(head_exp)
#> # A tibble: 2 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
# Expression matrix automatically adjusts
get_expr_mat(head_exp)
#> S1 S2
#> V1 1 5
#> V2 2 6
#> V3 3 7
#> V4 4 8
# Sample randomly from variables
set.seed(123) # For reproducibility
random_exp <- slice_sample_var(toy_exp, n = 3)
get_var_info(random_exp)
#> # A tibble: 3 × 4
#> variable protein peptide glycan_composition
#> <chr> <chr> <chr> <chr>
#> 1 V3 PRO3 PEP3 H3N2
#> 2 V4 PRO3 PEP4 H3N2
#> 3 V1 PRO1 PEP1 H5N2
Renaming
# Rename columns in sample info
renamed_exp <- rename_obs(toy_exp, experimental_group = group)
get_sample_info(renamed_exp)
#> # A tibble: 6 × 3
#> sample experimental_group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
#> 4 S4 B 2
#> 5 S5 B 1
#> 6 S6 B 2
The index column “sample” remains protected, but everything else can be renamed freely.
Joining
These functions can be useful if you have additional information
stored in a separate tibble, and you want to add it to your
experiment() object.
# Join additional sample-level information into sample info
more_sample_info <- tibble::tibble(
  sample = c("S1", "S2", "S3", "S4", "S5", "S6"),
  age = c(20, 21, 22, 23, 24, 25),
  gender = c("M", "F", "M", "F", "M", "F")
)
joined_exp <- left_join_obs(toy_exp, more_sample_info, by = "sample")
get_sample_info(joined_exp)
#> # A tibble: 6 × 5
#> sample group batch age gender
#> <chr> <chr> <dbl> <dbl> <chr>
#> 1 S1 A 1 20 M
#> 2 S2 A 2 21 F
#> 3 S3 A 1 22 M
#> 4 S4 B 2 23 F
#> 5 S5 B 1 24 M
#> 6 S6 B 2 25 F
You might have noticed that there are no equivalents for dplyr::right_join() and dplyr::full_join(). This is by design: joining functions in glyexp should only be used to add new information to your experiment() object. right_join() and full_join() can add observations to the resulting tibbles, which is not suitable for experiment() objects because the new rows would have no matching data in the expression matrix.
For the same reason, the relationship parameter is fixed
to “many-to-one” for all joining functions in glyexp. You
probably don’t need to know this, but if you do, check out the
documentation of dplyr::left_join() for more details.
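To see what a “many-to-one” constraint guards against, here is a sketch using plain dplyr on hypothetical data frames. It assumes dplyr 1.1.0 or later, where the relationship argument was introduced, and it illustrates dplyr’s behavior only; how glyexp reports the problem may differ.

```r
library(dplyr) # assumes dplyr >= 1.1.0 for the `relationship` argument

sample_info <- data.frame(
  sample = c("S1", "S2"),
  group = c("A", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical extra table in which S1 appears twice
extra <- data.frame(
  sample = c("S1", "S1", "S2"),
  age = c(20, 21, 22),
  stringsAsFactors = FALSE
)

# An unconstrained left join duplicates the S1 row
plain <- left_join(sample_info, extra, by = "sample")
nrow(plain)
#> [1] 3

# With relationship = "many-to-one", the same join is rejected
strict <- tryCatch(
  left_join(sample_info, extra, by = "sample", relationship = "many-to-one"),
  error = function(e) "join rejected"
)
strict
#> [1] "join rejected"
```

A duplicated sample row would have no unique column in the expression matrix, so rejecting such joins keeps the one-row-per-sample invariant intact.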
Advanced Patterns: Chaining for Complex Operations
The real power emerges when you chain multiple operations together. Here are some patterns:
Pattern 1: Filter → Select → Arrange
complex_pipeline <- toy_exp |>
  filter_obs(group == "A") |>
  select_obs(group, batch) |>
  arrange_obs(desc(batch)) |>
  filter_var(protein == "PRO1") |>
  select_var(glycan_composition, protein)
print("Final pipeline result:")
#> [1] "Final pipeline result:"
print(complex_pipeline)
#>
#> ── Others Experiment ───────────────────────────────────────────────────────────
#> ℹ Expression matrix: 3 samples, 1 variables
#> ℹ Sample information fields: group <chr>, batch <dbl>
#> ℹ Variable information fields: glycan_composition <chr>, protein <chr>
Pattern 2: Mutate → Filter → Slice
analytical_pipeline <- toy_exp |>
  mutate_var(composition_length = nchar(glycan_composition)) |>
  filter_var(composition_length >= 4) |>
  slice_max_var(composition_length, n = 3)
get_var_info(analytical_pipeline)
#> # A tibble: 4 × 5
#> variable protein peptide glycan_composition composition_length
#> <chr> <chr> <chr> <chr> <int>
#> 1 V1 PRO1 PEP1 H5N2 4
#> 2 V2 PRO2 PEP2 H5N2 4
#> 3 V3 PRO3 PEP3 H3N2 4
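#> 4 V4 PRO3 PEP4 H3N2 4
Note that four rows are returned even though n = 3: all values tie at 4, and dplyr::slice_max() keeps ties by default (with_ties = TRUE), a behavior slice_max_var() appears to share.
Pattern 3: Random Sampling for Testing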
# Create a smaller dataset for testing
set.seed(456)
test_exp <- toy_exp |>
  slice_sample_obs(n = 3) |>
  slice_sample_var(n = 4)
print("Test dataset dimensions:")
#> [1] "Test dataset dimensions:"
print(test_exp)
#>
#> ── Others Experiment ───────────────────────────────────────────────────────────
#> ℹ Expression matrix: 3 samples, 4 variables
#> ℹ Sample information fields: group <chr>, batch <dbl>
#> ℹ Variable information fields: protein <chr>, peptide <chr>, glycan_composition <chr>
When dplyr-Style Functions Cannot Help
Sometimes you need functionality beyond what glyexp’s dplyr-style functions provide. Extract the tibbles and use any dplyr function you want.
Why Doesn’t glyexp Implement All dplyr Functions?
glyexp only implements functions that preserve the synchronized multi-table structure of experiment() objects. Functions like count(), distinct(), summarise(), and pull() return aggregated or reduced results (summary tables, deduplicated rows, or plain vectors) that no longer map one-to-one onto the expression matrix. For these operations, extract the relevant tibble and use standard dplyr functions:
# For complex aggregations
toy_exp |>
  get_sample_info() |>
  count(group)
#> # A tibble: 2 × 2
#> group n
#> <chr> <int>
#> 1 A 3
#> 2 B 3
# For distinct values
toy_exp |>
  get_var_info() |>
  distinct(protein) |>
  pull(protein)
#> [1] "PRO1" "PRO2" "PRO3"
# For advanced filtering with multiple conditions
complex_filter_conditions <- toy_exp |>
  get_sample_info() |>
  filter(group == "A", batch == 2) |>
  pull(sample)
# Then use the results to subset your experiment
filtered_by_complex <- filter_obs(toy_exp, sample %in% complex_filter_conditions)
Common Pitfalls and How to Avoid Them
Pitfall 1: Using glyexp Functions on Non-Experiment Objects
This won’t work:
library(tibble)
regular_tibble <- tibble(group = c("A", "B"), value = c(1, 2))
filter_obs(regular_tibble, group == "A")
#> Error in `filter_info_data()`:
#> ! is_experiment(exp) is not TRUE
Do this instead:
regular_tibble <- tibble(group = c("A", "B"), value = c(1, 2))
filter(regular_tibble, group == "A")
#> # A tibble: 1 × 2
#> group value
#> <chr> <dbl>
#> 1 A 1
filtered_exp <- filter_obs(toy_exp, group == "A")
get_sample_info(filtered_exp)
#> # A tibble: 3 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
Pitfall 2: Forgetting the Synchronization
Don’t do this (only the extracted tibble is filtered; toy_exp and its expression matrix still contain all six samples):
sample_info <- get_sample_info(toy_exp)
filtered_samples <- filter(sample_info, group == "A")
Do this instead:
filtered_exp <- filter_obs(toy_exp, group == "A")
Pitfall 3: Trying to Remove Index Columns
This won’t work as expected:
select_obs(toy_exp, -sample)
#> Error:
#> ! You should not explicitly select or deselect the "sample" column in
#> `sample_info`.
#> ℹ The "sample" column will be handled by `select_obs()` or `select_var()`
#> automatically.
Embrace the protection:
clean_exp <- select_obs(toy_exp, group, batch)
get_sample_info(clean_exp)
#> # A tibble: 6 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
#> 4 S4 B 2
#> 5 S5 B 1
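#> 6 S6 B 2
Pitfall 4: Mismatched Operations
Don’t mix operations inappropriately. Here, glycan_composition is a variable information column, so it is not available to arrange_obs(), which works on sample information:
arrange_obs(toy_exp, glycan_composition)
Use the right function for the right data: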
arranged_by_composition <- arrange_var(toy_exp, glycan_composition)
get_var_info(arranged_by_composition)
#> # A tibble: 4 × 4
#> variable protein peptide glycan_composition
#> <chr> <chr> <chr> <chr>
#> 1 V3 PRO3 PEP3 H3N2
#> 2 V4 PRO3 PEP4 H3N2
#> 3 V1 PRO1 PEP1 H5N2
#> 4 V2 PRO2 PEP2 H5N2
Performance Considerations
glyexp’s dplyr-style functions are designed to be fast, safe, and consistent.
For large datasets, consider:
- Filtering early in your pipeline to reduce data size
- Using select_obs() and select_var() to keep only needed columns
- Chaining operations efficiently to minimize intermediate copies
# Efficient pipeline: filter first, then manipulate
efficient_pipeline <- toy_exp |>
  filter_obs(group == "A") |> # Reduce samples early
  filter_var(protein == "PRO1") |> # Reduce variables early
  select_obs(group) |> # Keep only needed sample columns
  select_var(glycan_composition) # Keep only needed variable columns
Philosophy Behind the Design
glyexp’s dplyr-style functions embody a simple philosophy:
“Think about your metadata, and let the data follow.”
This design means:
- Mental Model Alignment: Think in terms of samples and variables, not matrix indices
- Error Prevention: Automatic synchronization prevents common data analysis mistakes
- Familiar Syntax: If you know dplyr, you already know most of glyexp
- Composability: Functions chain together naturally for complex analyses
Summary
glyexp’s dplyr-style functions are experiment-specific data
manipulators designed exclusively for experiment() objects.
They provide:
- Automatic Synchronization: Operations on metadata automatically update the expression matrix
- Index Column Protection: Critical relationship columns are protected from deletion
- Familiar Syntax: Standard dplyr operations with multi-table awareness
- Type-Aware Operations: _obs() for samples, _var() for variables
Start with filter_obs() and select_var(),
then build complex pipelines.