
dplyr-Style Functions: Data Harmony in Action
Welcome to the world of synchronized data manipulation!
If you've ever worked with multi-table datasets, you know the pain: filter one table, and suddenly your data is out of sync. Rearrange another, and your carefully crafted relationships crumble like a house of cards.
Enter glyexp's dplyr-style functions - your new data harmony conductors!
These aren't just regular dplyr functions with a fancy wrapper. They're relationship-aware data manipulators, designed specifically for glyexp::experiment() objects, that understand the intricate dance between your expression matrix, sample information, and variable annotations. When you transform one piece, everything else follows in perfect synchronization.
Important Note: These functions only work with experiment() objects - you cannot use them on regular data.frames, tibbles, or other data structures. They are purpose-built for the synchronized data model that experiment() provides.
library(glyexp)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:glyexp':
#>
#> select_var
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(conflicted)
conflicts_prefer(glyexp::select_var)
#> [conflicted] Will prefer glyexp::select_var over
#> any other package.
conflicts_prefer(dplyr::filter)
#> [conflicted] Will prefer dplyr::filter over any
#> other package.
The Core Philosophy: One Action, Three Updates
Imagine you're the conductor of a three-piece orchestra:
- First violin (Expression Matrix): Your numerical data
- Second violin (Sample Info): Your experimental metadata
- Viola (Variable Info): Your molecular annotations
In traditional data analysis, when you want the first violin to play a solo (filter samples), you have to manually cue each instrument. Miss a beat, and your symphony turns into chaos.
glyexp's dplyr-style functions are different. They're like having a magical conductor's baton - wave it once, and all three instruments respond in perfect harmony!
Let's see this magic in action:
toy_exp <- toy_experiment()
print(toy_exp)
#>
#> ── Experiment ──────────────────────────────────────────────────────────────────
#> ℹ Expression matrix: 6 samples, 4 variables
#> ℹ Sample information fields: group and batch
#> ℹ Variable information fields: protein, peptide, and glycan_composition
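For contrast, here is a minimal base-R sketch of the manual bookkeeping that the "cue each instrument" analogy describes. The objects expr_mat and sample_info are hypothetical stand-ins built to mirror the toy data; they are not glyexp internals, and this is not how glyexp implements synchronization.

```r
# Hypothetical stand-ins mirroring the toy experiment (not glyexp internals)
expr_mat <- matrix(
  1:24, nrow = 4,
  dimnames = list(paste0("V", 1:4), paste0("S", 1:6))
)
sample_info <- data.frame(
  sample = paste0("S", 1:6),
  group  = rep(c("A", "B"), each = 3),
  batch  = c(1, 2, 1, 2, 1, 2)
)

# Step 1: filter the sample table
kept <- sample_info[sample_info$group == "A", ]

# Step 2: remember to subset the matrix columns by hand,
# using the index column as the link between the two tables
expr_sub <- expr_mat[, kept$sample, drop = FALSE]

dim(expr_sub)
#> [1] 4 3
```

Forgetting step 2 is exactly the out-of-sync situation described above; the _obs() and _var() functions exist so that there is no second step to forget.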
The Two Flavors: _obs() and _var()
Every dplyr-style function in glyexp comes in two delicious flavors:
- _obs() functions: Work on sample information (observations, i.e. the columns of the expression matrix) in experiment() objects
- _var() functions: Work on variable annotations (variables, i.e. the rows of the expression matrix) in experiment() objects
But here's the beautiful part - both flavors automatically update the expression matrix to maintain perfect synchronization!
Reminder: These specialized functions require an experiment() object as input and return an experiment() object as output. They cannot be used with standard tibbles or data.frames - for those, use the regular dplyr functions directly.
Filtering: The Art of Selective Attention
Let's start with the most common operation - filtering your data.
Sample-Based Filtering with filter_obs()
Say you want to focus only on group "A" samples:
# Before filtering - let's see what we have
get_sample_info(toy_exp)
#> # A tibble: 6 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
#> 4 S4 B 2
#> 5 S5 B 1
#> 6 S6 B 2
# Filter for group A samples only
filtered_exp <- filter_obs(toy_exp, group == "A")
get_sample_info(filtered_exp)
#> # A tibble: 3 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
Beautiful! But here's where the magic happens - check the expression matrix:
# Original matrix dimensions:
dim(get_expr_mat(toy_exp))
#> [1] 4 6
# Original matrix:
get_expr_mat(toy_exp)
#> S1 S2 S3 S4 S5 S6
#> V1 1 5 9 13 17 21
#> V2 2 6 10 14 18 22
#> V3 3 7 11 15 19 23
#> V4 4 8 12 16 20 24
# Filtered expression matrix - automatically updated!
# Filtered matrix dimensions:
dim(get_expr_mat(filtered_exp))
#> [1] 4 3
# Filtered matrix:
get_expr_mat(filtered_exp)
#> S1 S2 S3
#> V1 1 5 9
#> V2 2 6 10
#> V3 3 7 11
#> V4 4 8 12
Ta-da! The expression matrix automatically filtered its columns to match the remaining samples! No manual intervention, no risk of mismatched data - just pure, synchronized harmony.
Variable-Based Filtering with filter_var()
Now let's filter variables and watch the same magic happen:
# Filter for specific glycan compositions
var_filtered_exp <- filter_var(toy_exp, glycan_composition == "H5N2")
get_var_info(var_filtered_exp)
#> # A tibble: 2 × 4
#> variable protein peptide glycan_composition
#> <chr> <chr> <chr> <chr>
#> 1 V1 PRO1 PEP1 H5N2
#> 2 V2 PRO2 PEP2 H5N2
# The expression matrix rows automatically follow suit!
get_expr_mat(var_filtered_exp)
#> S1 S2 S3 S4 S5 S6
#> V1 1 5 9 13 17 21
#> V2 2 6 10 14 18 22
The matrix rows automatically reduced to match the filtered variables! This is the core power of glyexp - you think about your metadata, and the expression data follows your lead.
Chaining Filters: The Symphony Continues
Want to filter both samples and variables? Chain them together like a beautiful melody:
double_filtered <- toy_exp |>
  filter_obs(group == "A") |>
  # The toy data contains only these two compositions, so all four variables are kept
  filter_var(glycan_composition %in% c("H5N2", "N3N2"))
# Final dimensions after double filtering:
dim(get_expr_mat(double_filtered))
#> [1] 4 3
get_expr_mat(double_filtered)
#> S1 S2 S3
#> V1 1 5 9
#> V2 2 6 10
#> V3 3 7 11
#> V4 4 8 12
Notice the pipe-friendly design? That's the dplyr DNA in action - familiar syntax, powerful results!
The Sacred Index Columns: Guardians of Data Integrity
Here's where glyexp really shines: index column protection. These special columns (like "sample" and "variable") are the backbone of your data relationships. Lose them, and your carefully orchestrated data symphony falls apart.
Let's see this protection in action:
Attempting to Remove Index Columns (Spoiler: It Won't Work!)
# Try to select everything EXCEPT the sample index column
protective_exp <- select_obs(toy_exp, -sample)
#> Error in `select_obs()`:
#> ! You should not explicitly select or deselect the "sample" column in
#> `sample_info`.
#> ℹ The "sample" column will be handled by `select_obs()` automatically.
get_sample_info(protective_exp)
#> Error: object 'protective_exp' not found
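Did you see that? Instead of silently dropping the column, glyexp stops with an informative error and protects our data integrity by preventing the operation entirely. Because the call failed, protective_exp was never created, which is why the follow-up get_sample_info() call errors too.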
# Same protection for variable info
protective_var_exp <- select_var(toy_exp, -variable)
#> Error in `select_var()`:
#> ! You should not explicitly select or deselect the "variable" column in
#> `var_info`.
#> ℹ The "variable" column will be handled by `select_var()` automatically.
get_var_info(protective_var_exp)
#> Error: object 'protective_var_exp' not found
Similarly, glyexp throws an error to protect the "variable" column from being removed!
Why This Protection Matters
Without index columns, your experiment() object would lose its ability to:
- Keep expression matrix and metadata synchronized
- Validate data consistency
- Enable seamless subsetting operations
- Work with other glycoverse packages
Think of index columns as the GPS coordinates of your data - remove them, and you're lost in a sea of unconnected numbers!
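To make the "GPS coordinates" idea concrete, here is a small base-R sketch of the kind of consistency check that index columns make possible. The helper is_synced() and the stand-in tables are hypothetical illustrations, not part of glyexp and not a description of its internal validation.

```r
# Hypothetical stand-ins for the three tables (not glyexp internals)
expr_mat <- matrix(
  1:24, nrow = 4,
  dimnames = list(paste0("V", 1:4), paste0("S", 1:6))
)
sample_info <- data.frame(
  sample = paste0("S", 1:6),
  group  = rep(c("A", "B"), each = 3)
)
var_info <- data.frame(variable = paste0("V", 1:4))

# Hypothetical helper: TRUE only when both index columns line up with the matrix
is_synced <- function(mat, samples, vars) {
  identical(colnames(mat), as.character(samples$sample)) &&
    identical(rownames(mat), as.character(vars$variable))
}

is_synced(expr_mat, sample_info, var_info)
#> [1] TRUE

# Drop the index column and the link can no longer be verified
no_index <- sample_info[, "group", drop = FALSE]
is_synced(expr_mat, no_index, var_info)
#> [1] FALSE
```

Once the "sample" column is gone, nothing ties a metadata row to a matrix column any more, which is why select_obs() and select_var() refuse to drop it.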
The Complete Function Family Tree
glyexp provides dplyr-style equivalents for the data manipulation functions that preserve the experiment structure. Each function comes in both _obs() and _var() flavors, and all automatically maintain matrix synchronization.
Technical Note: All these functions are written specifically for experiment() objects. Unlike generic dplyr functions that work on various data types, these functions expect and return experiment() objects exclusively:
Core Data Manipulation Functions
Standard dplyr | Sample Operations | Variable Operations | Magic Power |
---|---|---|---|
filter() | filter_obs() | filter_var() | Subset with sync |
select() | select_obs() | select_var() | Choose with protection |
arrange() | arrange_obs() | arrange_var() | Sort with order |
mutate() | mutate_obs() | mutate_var() | Create with consistency |
rename() | rename_obs() | rename_var() | Rename with safety |
Advanced Slicing Functions
Standard dplyr | Sample Operations | Variable Operations | Specialty |
---|---|---|---|
slice() | slice_obs() | slice_var() | Position-based selection |
slice_head() | slice_head_obs() | slice_head_var() | Top n with sync |
slice_tail() | slice_tail_obs() | slice_tail_var() | Bottom n with sync |
slice_sample() | slice_sample_obs() | slice_sample_var() | Random with consistency |
slice_max() | slice_max_obs() | slice_max_var() | Highest values with order |
slice_min() | slice_min_obs() | slice_min_var() | Lowest values with order |
Deep Dive: Function-by-Function Examples
Let's explore each function family with hands-on examples!
Selection: Choosing Your Data Wisely
# Select specific columns from sample info
selected_exp <- select_obs(toy_exp, group, batch)
get_sample_info(selected_exp)
#> # A tibble: 6 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
#> 4 S4 B 2
#> 5 S5 B 1
#> 6 S6 B 2
# Select columns from variable info (notice the index protection!)
var_selected_exp <- select_var(toy_exp, glycan_composition)
get_var_info(var_selected_exp)
#> # A tibble: 4 × 2
#> variable glycan_composition
#> <chr> <chr>
#> 1 V1 H5N2
#> 2 V2 H5N2
#> 3 V3 N3N2
#> 4 V4 N3N2
Pro tip: Use dplyr-style helpers like starts_with(), ends_with(), and contains():
# Select columns starting with "glycan"
helper_exp <- select_var(toy_exp, starts_with("glycan"))
get_var_info(helper_exp)
#> # A tibble: 4 × 2
#> variable glycan_composition
#> <chr> <chr>
#> 1 V1 H5N2
#> 2 V2 H5N2
#> 3 V3 N3N2
#> 4 V4 N3N2
Arrangement: Putting Things in Order
# Arrange samples by batch and group
arranged_exp <- arrange_obs(toy_exp, batch, group)
get_sample_info(arranged_exp)
#> # A tibble: 6 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S3 A 1
#> 3 S5 B 1
#> 4 S2 A 2
#> 5 S4 B 2
#> 6 S6 B 2
The magic moment: Check how the expression matrix columns rearranged to match!
# Expression matrix columns follow the new sample order
get_expr_mat(arranged_exp)
#> S1 S3 S5 S2 S4 S6
#> V1 1 9 17 5 13 21
#> V2 2 10 18 6 14 22
#> V3 3 11 19 7 15 23
#> V4 4 12 20 8 16 24
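As a rough illustration of what "columns follow the new sample order" means, the same reordering can be sketched in base R. The stand-in objects below are hypothetical and only mirror the toy data; glyexp's actual implementation may differ.

```r
# Hypothetical stand-ins mirroring the toy experiment (not glyexp internals)
expr_mat <- matrix(
  1:24, nrow = 4,
  dimnames = list(paste0("V", 1:4), paste0("S", 1:6))
)
sample_info <- data.frame(
  sample = paste0("S", 1:6),
  group  = rep(c("A", "B"), each = 3),
  batch  = c(1, 2, 1, 2, 1, 2)
)

# Sort the sample table by batch, then group
ord <- order(sample_info$batch, sample_info$group)
sample_sorted <- sample_info[ord, ]

# Reorder the matrix columns through the index column, not by position
expr_sorted <- expr_mat[, sample_sorted$sample, drop = FALSE]

colnames(expr_sorted)
#> [1] "S1" "S3" "S5" "S2" "S4" "S6"
```

Indexing the matrix by sample name rather than by position is the design choice that keeps the two tables aligned even after several reorderings.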
Mutation: Creating New Insights
# Add a new calculated column to sample info
mutated_exp <- mutate_obs(
  toy_exp,
  group_batch = paste(group, batch, sep = "_")
)
get_sample_info(mutated_exp)
#> # A tibble: 6 × 4
#> sample group batch group_batch
#> <chr> <chr> <dbl> <chr>
#> 1 S1 A 1 A_1
#> 2 S2 A 2 A_2
#> 3 S3 A 1 A_1
#> 4 S4 B 2 B_2
#> 5 S5 B 1 B_1
#> 6 S6 B 2 B_2
# Create a complexity score for variables
complex_exp <- mutate_var(
  toy_exp,
  complexity = nchar(glycan_composition)
)
get_var_info(complex_exp)
#> # A tibble: 4 × 5
#> variable protein peptide glycan_composition complexity
#> <chr> <chr> <chr> <chr> <int>
#> 1 V1 PRO1 PEP1 H5N2 4
#> 2 V2 PRO2 PEP2 H5N2 4
#> 3 V3 PRO3 PEP3 N3N2 4
#> 4 V4 PRO3 PEP4 N3N2 4
Slicing: Precision Subsetting
# Take the first 2 samples
head_exp <- slice_head_obs(toy_exp, n = 2)
get_sample_info(head_exp)
#> # A tibble: 2 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
# Expression matrix automatically adjusts
get_expr_mat(head_exp)
#> S1 S2
#> V1 1 5
#> V2 2 6
#> V3 3 7
#> V4 4 8
# Sample randomly from variables
set.seed(123) # For reproducibility
random_exp <- slice_sample_var(toy_exp, n = 3)
get_var_info(random_exp)
#> # A tibble: 3 × 4
#> variable protein peptide glycan_composition
#> <chr> <chr> <chr> <chr>
#> 1 V3 PRO3 PEP3 N3N2
#> 2 V4 PRO3 PEP4 N3N2
#> 3 V1 PRO1 PEP1 H5N2
Renaming: Clarity Through Better Names
# Rename columns in sample info
renamed_exp <- rename_obs(toy_exp, experimental_group = group)
get_sample_info(renamed_exp)
#> # A tibble: 6 × 3
#> sample experimental_group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
#> 4 S4 B 2
#> 5 S5 B 1
#> 6 S6 B 2
Notice: The index column "sample" remains untouchable, but everything else can be renamed freely!
Advanced Patterns: Chaining for Complex Operations
The real power emerges when you chain multiple operations together. Here are some advanced patterns:
Pattern 1: Filter → Select → Arrange
complex_pipeline <- toy_exp |>
  filter_obs(group == "A") |>
  select_obs(group, batch) |>
  arrange_obs(desc(batch)) |>
  filter_var(protein == "PRO1") |>
  select_var(glycan_composition, protein)
print("Final pipeline result:")
#> [1] "Final pipeline result:"
print(complex_pipeline)
#>
#> ── Experiment ──────────────────────────────────────────────────────────────────
#> ℹ Expression matrix: 3 samples, 1 variables
#> ℹ Sample information fields: group and batch
#> ℹ Variable information fields: glycan_composition and protein
Pattern 2: Mutate → Filter → Slice
analytical_pipeline <- toy_exp |>
  mutate_var(composition_length = nchar(glycan_composition)) |>
  filter_var(composition_length >= 4) |>
  slice_max_var(composition_length, n = 3)
get_var_info(analytical_pipeline)
#> # A tibble: 4 × 5
#> variable protein peptide glycan_composition composition_length
#> <chr> <chr> <chr> <chr> <int>
#> 1 V1 PRO1 PEP1 H5N2 4
#> 2 V2 PRO2 PEP2 H5N2 4
#> 3 V3 PRO3 PEP3 N3N2 4
#> 4 V4 PRO3 PEP4 N3N2 4
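Four rows come back even though n = 3 was requested. A likely explanation is tie handling: in plain dplyr, slice_max() keeps tied rows by default (with_ties = TRUE), and all four variables here share the same composition_length. The sketch below shows that behavior on an ordinary data frame; var_tbl is a hypothetical stand-in, and whether slice_max_var() forwards a with_ties argument is an assumption not confirmed by this article.

```r
library(dplyr)

# Hypothetical stand-in for the variable table after the mutate step
var_tbl <- data.frame(
  variable = paste0("V", 1:4),
  composition_length = c(4L, 4L, 4L, 4L)
)

# Default: rows tied with the n-th largest value are all kept
nrow(slice_max(var_tbl, composition_length, n = 3))
#> [1] 4

# Ties dropped: exactly n rows are returned
nrow(slice_max(var_tbl, composition_length, n = 3, with_ties = FALSE))
#> [1] 3
```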
Pattern 3: Random Sampling for Testing
# Create a smaller dataset for testing
set.seed(456)
test_exp <- toy_exp |>
  slice_sample_obs(n = 3) |>
  slice_sample_var(n = 4)
print("Test dataset dimensions:")
#> [1] "Test dataset dimensions:"
print(test_exp)
#>
#> ── Experiment ──────────────────────────────────────────────────────────────────
#> ℹ Expression matrix: 3 samples, 4 variables
#> ℹ Sample information fields: group and batch
#> ℹ Variable information fields: protein, peptide, and glycan_composition
When dplyr-Style Functions Can't Help: The Escape Hatch
Sometimes you need functionality that goes beyond what glyexp's dplyr-style functions provide. No problem! Since glyexp's dplyr-style functions only work with experiment() objects, when you need standard dplyr functionality, simply extract the tibbles and use any dplyr function you want.
Why Doesn't glyexp Implement All dplyr Functions?
The philosophy is simple: glyexp only implements functions that preserve the synchronized multi-table structure of experiment() objects.
Functions like count(), distinct(), summarise(), and pull() return summary tables or plain vectors that no longer map one-to-one onto the expression matrix, so they break the original data relationships. For these operations, extract the relevant tibble and use standard dplyr functions:
# For complex aggregations
toy_exp |>
  get_sample_info() |>
  count(group)
#> # A tibble: 2 × 2
#> group n
#> <chr> <int>
#> 1 A 3
#> 2 B 3
# For distinct values
toy_exp |>
  get_var_info() |>
  distinct(protein) |>
  pull(protein)
#> [1] "PRO1" "PRO2" "PRO3"
# For advanced filtering with multiple conditions
complex_filter_conditions <- toy_exp |>
  get_sample_info() |>
  filter(group == "A", batch == 2) |>
  pull(sample)
# Then use the results to subset your experiment
filtered_by_complex <- filter_obs(toy_exp, sample %in% complex_filter_conditions)
Common Pitfalls and How to Avoid Them
Pitfall 1: Using glyexp Functions on Non-Experiment Objects
This won't work:
# glyexp functions only work on experiment() objects!
library(tibble)
regular_tibble <- tibble(group = c("A", "B"), value = c(1, 2))
filter_obs(regular_tibble, group == "A") # Error: not an experiment object!
#> Error in filter_info_data(exp = exp, info_field = "sample_info", id_column = "sample", : is_experiment(exp) is not TRUE
Do this instead:
# Use regular dplyr functions for regular data structures
regular_tibble <- tibble(group = c("A", "B"), value = c(1, 2))
filter(regular_tibble, group == "A") # Works perfectly!
#> # A tibble: 1 × 2
#> group value
#> <chr> <dbl>
#> 1 A 1
# Use glyexp functions only with experiment objects
filtered_exp <- filter_obs(toy_exp, group == "A") # This works!
get_sample_info(filtered_exp)
#> # A tibble: 3 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
Pitfall 2: Forgetting the Synchronization Magic
Don't do this:
# This breaks synchronization!
sample_info <- get_sample_info(toy_exp)
filtered_samples <- filter(sample_info, group == "A")
# Now you have filtered sample info but the original expression matrix!
Do this instead:
# This maintains synchronization
filtered_exp <- filter_obs(toy_exp, group == "A")
# Everything stays in sync!
Pitfall 3: Trying to Remove Index Columns
This won't work:
# Index column protection prevents this - will throw an error!
select_obs(toy_exp, -sample)
#> Error in `select_obs()`:
#> ! You should not explicitly select or deselect the "sample" column in
#> `sample_info`.
#> ℹ The "sample" column will be handled by `select_obs()` automatically.
Embrace the protection:
# Select the columns you want, let glyexp protect the index
clean_exp <- select_obs(toy_exp, group, batch)
get_sample_info(clean_exp)
#> # A tibble: 6 × 3
#> sample group batch
#> <chr> <chr> <dbl>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
#> 4 S4 B 2
#> 5 S5 B 1
#> 6 S6 B 2
# "sample" column automatically included for data integrity
Pitfall 4: Mismatched Operations
Don't mix operations inappropriately:
# This doesn't make sense - you can't arrange sample info by variable properties
arrange_obs(toy_exp, glycan_composition) # glycan_composition is in var_info!
Use the right function for the right data:
# Arrange variables by their glycan composition
arranged_by_composition <- arrange_var(toy_exp, glycan_composition)
get_var_info(arranged_by_composition)
#> # A tibble: 4 × 4
#> variable protein peptide glycan_composition
#> <chr> <chr> <chr> <chr>
#> 1 V1 PRO1 PEP1 H5N2
#> 2 V2 PRO2 PEP2 H5N2
#> 3 V3 PRO3 PEP3 N3N2
#> 4 V4 PRO3 PEP4 N3N2
Performance Considerations: Speed Meets Safety
glyexp's dplyr-style functions are designed to be:
- Fast: Built on top of highly optimized dplyr functions
- Safe: Index column protection prevents data corruption
- Consistent: Automatic synchronization eliminates manual errors
For large datasets, consider:
- Filtering early in your pipeline to reduce data size
- Using select_obs() and select_var() to keep only needed columns
- Chaining operations efficiently to minimize intermediate copies
# Efficient pipeline: filter first, then manipulate
efficient_pipeline <- toy_exp |>
  filter_obs(group == "A") |>        # Reduce samples early
  filter_var(protein == "PRO1") |>   # Reduce variables early
  select_obs(group) |>               # Keep only needed sample columns
  select_var(glycan_composition)     # Keep only needed variable columns
The Philosophy Behind the Design
glyexp's dplyr-style functions embody a simple but powerful philosophy:
"Think about your metadata, and let the data follow."
This design choice means:
- Mental Model Alignment: You think in terms of samples and variables, not matrix indices
- Error Prevention: Automatic synchronization prevents the most common data analysis mistakes
- Familiar Syntax: If you know dplyr, you already know 90% of glyexp
- Composability: Functions chain together naturally for complex analyses
Summary
glyexp's dplyr-style functions are experiment-specific data manipulators designed exclusively for experiment() objects. They provide four key capabilities:
- Automatic Synchronization - Operations on metadata automatically update the expression matrix
- Index Column Protection - Critical relationship columns are protected from deletion
- Familiar Syntax - Standard dplyr operations with multi-table awareness
- Type-Aware Operations - _obs() for samples, _var() for variables
Start simple with filter_obs() and select_var(), then build complex pipelines!