
Power User Guide: Efficient Glycan Manipulation
Welcome to the Advanced Zone! 🚀
Ready to unlock the full potential of glyrepr? This vignette is for those who want to peek under the hood and master the art of efficient glycan computation. If you’re writing custom functions for glycan analysis or building the next great glycomics tool, you’re in the right place!
Fair warning: This guide assumes you’re comfortable with R programming and graph theory concepts. If you’re just getting started, check out our “Getting Started with glyrepr” vignette first.
The Secret Superpower: Unique Structure Optimization
Before we dive into the smap functions, let’s understand why they exist and why they’re game-changing for glycan analysis.
The Problem: Glycan Computation is Expensive 💸
Working with glycan structures means working with graphs, and graph operations are computationally expensive. When you’re analyzing thousands of glycans from a large-scale study, this becomes a real bottleneck.
The Solution: Work Smart, Not Hard 🧠
glyrepr implements a clever optimization called unique structure storage. Instead of storing thousands of identical graphs, it stores each unique graph once and keeps track of which positions in the vector refer to it.
Let’s see this in action:
# Our test data: some common glycan structures
iupacs <- c(
  "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-", # N-glycan core
  "Gal(b1-3)GalNAc(a1-", # O-glycan core 1
  "Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-", # O-glycan core 2
  "Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man", # Branched mannose
  "GlcNAc6Ac(b1-4)Glc3Me(a1-" # With decorations
)
struc <- as_glycan_structure(iupacs)
# Now let's create a realistic dataset with lots of repetition
large_struc <- rep(struc, 1000) # 5,000 total structures
large_struc
#> <glycan_structure[5000]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [2] Gal(b1-3)GalNAc(a1-
#> [3] Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-
#> [4] Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(?1-
#> [5] GlcNAc6Ac(b1-4)Glc3Me(a1-
#> [6] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [7] Gal(b1-3)GalNAc(a1-
#> [8] Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-
#> [9] Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(?1-
#> [10] GlcNAc6Ac(b1-4)Glc3Me(a1-
#> ... (4990 more not shown)
#> # Unique structures: 5
Notice that magical “# Unique structures: 5”? That’s your performance booster right there!
Let’s verify this optimization is real:
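One way to convince yourself, without relying on glyrepr internals, is to mimic the idea in base R. The sketch below is an illustration only: small adjacency matrices act as hypothetical stand-ins for graphs, and the names `g_a`, `g_b`, `keys`, `uniq_graphs`, and `idx` are made up for this example rather than taken from the package.

```r
# Two toy "graphs" stored as adjacency matrices (stand-ins for igraph objects).
g_a <- matrix(c(0, 1,
                0, 0), nrow = 2, byrow = TRUE)
g_b <- matrix(c(0, 1, 1,
                0, 0, 0,
                0, 0, 0), nrow = 3, byrow = TRUE)

# A redundant collection: 2,000 elements, but only 2 distinct graphs.
graphs <- rep(list(g_a, g_b), 1000)

# Build a text key per graph so that duplicates can be detected.
keys <- vapply(
  graphs,
  function(g) paste(nrow(g), paste(g, collapse = ""), sep = ":"),
  character(1)
)

# Deduplicated representation: unique graphs plus one integer index per position.
uniq_graphs <- graphs[!duplicated(keys)]
idx <- match(keys, unique(keys))

length(graphs)       # number of positions in the vector
length(uniq_graphs)  # number of graphs that actually need to be stored

# The original collection can be rebuilt exactly from the two pieces.
restored <- uniq_graphs[idx]
identical(restored, graphs)
```

The two pieces, a short list of unique graphs and a cheap integer index, carry the same information as the full list. That is the essence of the optimization described above, although the package's real storage layout may differ.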
Enter the smap Universe 🌌
Now here’s the problem: if you try to use regular lapply() or purrr::map() functions on glycan structures, you’ll hit a wall:
# This won't work and will throw an error
tryCatch(
  purrr::map_int(large_struc, ~ igraph::vcount(.x)),
  error = function(e) cat("💥 Error:", rlang::cnd_message(e))
)
#> 💥 Error: ℹ In index: 1.
#> Caused by error in `ensure_igraph()`:
#> ! Must provide a graph object (provided wrong object type).
Why does this fail? Because purrr functions don’t understand the internal structure optimization of glycan_structure objects: the elements they iterate over are not igraph objects, so igraph::vcount() rejects them.
The smap Family to the Rescue!
The smap functions (think “structure map”) are glycan-aware, drop-in replacements for purrr functions. They understand the unique structure optimization and work directly with the underlying graph objects.
# This works beautifully!
vertex_counts <- smap_int(large_struc, ~ igraph::vcount(.x))
vertex_counts[1:10]
#> [1] 5 2 3 5 2 5 2 3 5 2
The “s” stands for “structure”: these functions operate on the underlying igraph objects that represent your glycan structures.
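Conceptually, a deduplicating map can be sketched in a few lines of base R. This is not glyrepr's implementation, only an illustration under the assumption that elements can be compared for equality. `dedup_map_int()` is a hypothetical helper, and plain strings stand in for graphs.

```r
# Hypothetical helper: apply `f` once per unique element, then expand the
# results back to the original positions.
dedup_map_int <- function(x, f) {
  uniq <- unique(x)
  res <- vapply(uniq, f, integer(1), USE.NAMES = FALSE)
  res[match(x, uniq)]
}

# 2,000 elements, only 2 unique (strings stand in for glycan graphs).
x <- rep(c("Gal(b1-3)GalNAc(a1-", "Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-"), 1000)

# Any integer-valued function will do; here, the number of characters.
fast <- dedup_map_int(x, nchar)
slow <- vapply(x, nchar, integer(1), USE.NAMES = FALSE)

# Both approaches give the same answer for every position.
identical(fast, slow)
```

The caller sees one result per element, exactly as with an ordinary map, while the function itself only runs on the unique values.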
The Complete smap Toolkit 🛠️
The smap family provides glycan-aware equivalents for the core purrr mapping and predicate functions:
purrr | smap | purrr | smap |
---|---|---|---|
map() | smap() | map2() | smap2() |
map_lgl() | smap_lgl() | map2_lgl() | smap2_lgl() |
map_int() | smap_int() | map2_int() | smap2_int() |
map_dbl() | smap_dbl() | map2_dbl() | smap2_dbl() |
map_chr() | smap_chr() | map2_chr() | smap2_chr() |
some() | ssome() | pmap() | spmap() |
every() | severy() | pmap_*() | spmap_*() |
none() | snone() | imap() | simap() |
imap_*() | simap_*() | | |
Simple rule: Prefix the purrr name with s. Replace map with smap, pmap with spmap, and imap with simap; the predicates some, every, and none become ssome, severy, and snone. Everything else works exactly like purrr!
Let’s Put Them to Work!
Count vertices in each structure:
vertex_counts <- smap_int(large_struc, igraph::vcount)
summary(vertex_counts)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 2.0 2.0 3.0 3.4 5.0 5.0
Find structures with more than 4 vertices:
has_many_vertices <- smap_lgl(large_struc, ~ igraph::vcount(.x) > 4)
sum(has_many_vertices)
#> [1] 2000
Get the degree sequence of each structure:
degree_sequences <- smap(large_struc, ~ igraph::degree(.x))
degree_sequences[1:3] # Show first 3
#> [[1]]
#> 1 2 3 4 5
#> 1 2 3 1 1
#>
#> [[2]]
#> 1 2
#> 1 1
#>
#> [[3]]
#> 1 2 3
#> 2 1 1
Check if any structure has isolated vertices:
Verify all structures are connected:
severy(large_struc, ~ igraph::is_connected(.x))
#> [1] TRUE
Beyond Basic smap()
Quick examples of the extended family:
# smap2: Iterate over structures and a second vector in parallel
thresholds <- c(3, 4, 5)
large_enough <- smap2_lgl(struc[1:3], thresholds, function(g, threshold) {
  igraph::vcount(g) >= threshold
})
large_enough
#> [1] TRUE FALSE FALSE
# simap: Include position information
indexed_report <- simap_chr(large_struc[1:3], function(g, i) {
  paste0("#", i, ": ", igraph::vcount(g), " vertices")
})
indexed_report
#> [1] "#1: 5 vertices" "#2: 2 vertices" "#3: 3 vertices"
⚠️ Performance Warning: simap functions don’t benefit from the unique structure optimization! Since each element has a different index, the combination of (structure, index) is always unique, breaking the deduplication that makes other smap functions fast. Use simap only when you truly need position information.
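The effect is easy to reproduce with a toy model. Assuming deduplication is keyed on the combination of all arguments (as the warning above implies), pairing each element with its index leaves nothing to deduplicate, whereas pairing it with a second argument that repeats alongside it still does. Strings stand in for structures here, and all variable names are illustrative.

```r
x <- rep(c("A", "B", "C", "D", "E"), 1000)  # 5,000 elements, 5 unique

# Keys for a plain map: only the element itself matters.
n_map <- length(unique(x))

# Keys for an index-aware map: (element, index) pairs, one per position.
n_imap <- nrow(unique(data.frame(x = x, i = seq_along(x))))

# Keys for a two-argument map whose second argument repeats with the first.
y <- rep(c(3, 4, 5, 2, 4), 1000)
n_map2 <- nrow(unique(data.frame(x = x, y = y)))

# Number of distinct function calls each variant would need:
# 5 for the plain map, 5,000 with the index, 5 with the repeating argument.
c(map = n_map, imap = n_imap, map2 = n_map2)
```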
Performance: The Magic of Deduplication ⚡
The beauty of smap functions lies in automatic deduplication:
# Create a large dataset with high redundancy
huge_struc <- rep(struc, 5000) # 25,000 structures, only 5 unique
cat("Dataset size:", length(huge_struc), "structures\n")
#> Dataset size: 25000 structures
cat("Unique structures:", length(attr(huge_struc, "structures")), "\n")
#> Unique structures: 5
cat("Redundancy factor:", length(huge_struc) / length(attr(huge_struc, "structures")), "x\n")
#> Redundancy factor: 5000 x
library(tictoc)
# Optimized approach: smap only processes 5 unique structures
tic("smap_int (optimized)")
vertex_counts_optimized <- smap_int(huge_struc, igraph::vcount)
toc()
#> smap_int (optimized): 0.002 sec elapsed
# Naive approach: extract all graphs and process each one
tic("Naive approach (all graphs)")
all_graphs <- get_structure_graphs(huge_struc) # Extracts all 25,000 graphs
vertex_counts_naive <- purrr::map_int(all_graphs, igraph::vcount)
toc()
#> Naive approach (all graphs): 0.227 sec elapsed
# Verify results are equivalent (though data types may differ)
all.equal(vertex_counts_optimized, vertex_counts_naive)
#> [1] TRUE
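The higher the redundancy, the bigger the performance gain! In the benchmark above, with 5,000-fold redundancy, smap_int() finished in 0.002 seconds versus 0.227 seconds for the naive approach, a difference of roughly 100x. In real glycoproteomics datasets with repeated structures, this optimization can provide speedups of about 10x.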
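To make the link between redundancy and work concrete, the sketch below counts how often a function is actually evaluated with and without deduplication. It is a base R illustration with strings as stand-ins for structures; `counted_nchar()` and `calls` are names invented for this example.

```r
x <- rep(c("A", "BB", "CCC", "DDDD", "EEEEE"), 5000)  # 25,000 elements, 5 unique

calls <- 0L
counted_nchar <- function(s) {
  calls <<- calls + 1L  # record every evaluation
  nchar(s)
}

# Naive: one evaluation per element.
naive <- vapply(x, counted_nchar, integer(1), USE.NAMES = FALSE)
naive_calls <- calls

# Deduplicated: one evaluation per unique element, then expand by index.
calls <- 0L
uniq <- unique(x)
dedup <- vapply(uniq, counted_nchar, integer(1), USE.NAMES = FALSE)[match(x, uniq)]
dedup_calls <- calls

identical(naive, dedup)                      # same results
c(naive = naive_calls, dedup = dedup_calls)  # 25,000 evaluations versus 5
```

The number of evaluations drops from the vector length to the number of unique elements, so the saving grows with redundancy, provided the function itself dominates the cost rather than the bookkeeping around it.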
Advanced Patterns and Tips 💡
Working with Complex Functions
The function you pass to smap must accept an igraph object as its first argument. You can use purrr-style lambda notation:
# Calculate clustering coefficient for each structure
clustering_coeffs <- smap_dbl(large_struc, ~ igraph::transitivity(.x, type = "global"))
summary(clustering_coeffs)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 0 0 0 0 0 0 2000
Combining Multiple Metrics
# Create a comprehensive analysis
structure_analysis <- smap(large_struc, function(g) {
  list(
    vertices = igraph::vcount(g),
    edges = igraph::ecount(g),
    # if/else is the idiomatic choice for a single (scalar) condition
    diameter = if (igraph::is_connected(g)) igraph::diameter(g) else NA_real_,
    clustering = igraph::transitivity(g, type = "global")
  )
})
# Convert to a more usable format
analysis_df <- do.call(rbind, lapply(structure_analysis, data.frame))
head(analysis_df)
#> vertices edges diameter clustering
#> 1 5 4 3 0
#> 2 2 1 1 NaN
#> 3 3 2 1 0
#> 4 5 4 2 0
#> 5 2 1 1 NaN
#> 6 5 4 3 0
When to Use smap Functions
Use smap functions when:
- ✅ You need to apply igraph-based functions to glycan structures
- ✅ You want maximum performance with datasets containing repeated structures
- ✅ You’re building custom glycan analysis pipelines
Stick with regular R functions when:
- ❌ Working with compositions
- ❌ Operating on string representations
⚠️ Special note on simap: While simap functions are convenient for position-aware operations, they don’t provide performance benefits over regular imap functions. The inclusion of index information breaks the unique structure optimization, making each (structure, index) pair unique even when structures are identical.
Real-World Example: Custom Motif Detection
Here’s how you might build a custom glycan analysis pipeline using smap functions:
# Custom motif detector
detect_branching <- function(g) {
  degrees <- igraph::degree(g)
  any(degrees >= 3)
}
# Apply to large dataset - blazingly fast due to unique structure optimization
has_branching <- smap_lgl(large_struc, detect_branching)
cat("Structures with branching:", sum(has_branching), "out of", length(large_struc), "\n")
#> Structures with branching: 2000 out of 5000
# Use smap2 to check structures against complexity thresholds
complexity_thresholds <- rep(c(3, 4, 5, 2, 4), 1000) # Thresholds for each structure
meets_threshold <- smap2_lgl(large_struc, complexity_thresholds, function(g, threshold) {
  igraph::vcount(g) >= threshold
})
cat("Structures meeting complexity threshold:", sum(meets_threshold), "out of", length(large_struc), "\n")
#> Structures meeting complexity threshold: 2000 out of 5000
Final Thoughts: You’re Now a Power User! 🎉
Congratulations! You now understand the core optimization that makes glyrepr blazingly fast and how to leverage it with the smap family of functions.
Key takeaways:
- 🧠 Unique structure optimization is the secret sauce behind glyrepr’s performance
- 🚀 smap functions are drop-in replacements for purrr that understand glycan structures
- ⚡ Performance gains are dramatic with large datasets containing repeated structures
- 🛠️ Use smap for structures, regular R functions for everything else
You’re now equipped to build the next generation of glycomics analysis tools. Go forth and analyze! 🌟
Session Information
sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] tictoc_1.2.1 lobstr_1.1.2 glyrepr_0.5.0
#>
#> loaded via a namespace (and not attached):
#> [1] jsonlite_2.0.0 dplyr_1.1.4 compiler_4.5.1 tidyselect_1.2.1
#> [5] stringr_1.5.1 jquerylib_0.1.4 systemfonts_1.2.3 textshaping_1.0.1
#> [9] yaml_2.3.10 fastmap_1.2.0 R6_2.6.1 generics_0.1.4
#> [13] igraph_2.1.4 knitr_1.50 backports_1.5.0 checkmate_2.3.2
#> [17] tibble_3.3.0 rstackdeque_1.1.1 desc_1.4.3 bslib_0.9.0
#> [21] pillar_1.10.2 rlang_1.1.6 cachem_1.1.0 stringi_1.8.7
#> [25] xfun_0.52 fs_1.6.6 sass_0.4.10 cli_3.6.5
#> [29] pkgdown_2.1.3 magrittr_2.0.3 digest_0.6.37 lifecycle_1.0.4
#> [33] prettyunits_1.2.0 vctrs_0.6.5 evaluate_1.0.4 glue_1.8.0
#> [37] ragg_1.4.0 rmarkdown_2.29 purrr_1.0.4 tools_4.5.1
#> [41] pkgconfig_2.0.3 htmltools_0.5.8.1