
Getting Started with glystats
glystats.Rmd
When we talk about omics data analysis, we’re usually diving into
exciting territories like differential expression analysis, PCA,
survival analysis, and much more! 🧬 glystats
brings all
these powerful analyses together under one unified, user-friendly
interface.
🎯 Key Feature: glystats
offers two
levels of interfaces to fit your workflow:
-
gly_xxx()
: seamlessly works with [glyexp::experiment()], the beating heart of theglycoverse
ecosystem 💓 -
gly_xxx_()
: flexible enough to work with general inputs like matrices or data frames
This vignette focuses on the gly_xxx()
interface. New to
[glyexp::experiment()]? No worries! Check out its introduction
first to get up to speed.
library(glystats)
library(glyexp)
library(glyread)
library(glyclean)
#>
#> Attaching package: 'glyclean'
#> The following object is masked from 'package:stats':
#>
#> aggregate
🔍 Meet Your Data
Let’s start by exploring our demo dataset:
# Here we use `glyread` to read in the data and use `glyclean` to perform preprocessing
exp <- read_pglyco3_pglycoquant("glycopeptides.list", sample_info = "sample_info.csv") |> auto_clean()
#> ℹ Reading data
#> ℹ Finding leader proteins
#> ✔ Finding leader proteins [97ms]
#>
#> ℹ Reading dataℹ Parsing glycan compositions and structures
#> ✔ Parsing glycan compositions and structures [253ms]
#>
#> ℹ Reading data✔ Reading data [782ms]
#>
#> ℹ Normalizing data (Median)
#> ✔ Normalizing data (Median) [134ms]
#>
#> ℹ Removing variables with >50% missing values
#> ✔ Removing variables with >50% missing values [20ms]
#>
#> ℹ Imputing missing values
#> ℹ Sample size <= 30, using sample minimum imputation
#> ℹ Imputing missing values✔ Imputing missing values [15ms]
#>
#> ℹ Aggregating data
#> ✔ Aggregating data [100ms]
#>
#> ℹ Normalizing data again
#> ✔ Normalizing data again [14ms]
exp
#>
#> ── Experiment ──────────────────────────────────────────────────────────────────
#> ℹ Expression matrix: 12 samples, 221 variables
#> ℹ Sample information fields: group
#> ℹ Variable information fields: protein, gene, glycan_composition, and protein_site
Look at that! 🎉 We’ve got a [glyexp::experiment()] packed with 12 samples and 263 glycoforms. That’s plenty of data to work with!
get_var_info(exp)
#> # A tibble: 221 × 5
#> variable protein gene glycan_composition protein_site
#> <chr> <chr> <chr> <comp> <int>
#> 1 V1 P08185 SERPINA6 Hex(5)HexNAc(4)NeuAc(2) 176
#> 2 V2 P04196 HRG Hex(5)HexNAc(4)NeuAc(1) 344
#> 3 V3 P04196 HRG Hex(5)HexNAc(4) 344
#> 4 V4 P10909 CLU Hex(6)HexNAc(5) 291
#> 5 V5 P04196 HRG Hex(5)HexNAc(4)NeuAc(2) 344
#> 6 V6 P04196 HRG Hex(5)HexNAc(4) 345
#> 7 V7 P04196 HRG Hex(5)HexNAc(4)dHex(2) 344
#> 8 V8 P04196 HRG Hex(4)HexNAc(3) 344
#> 9 V9 P04196 HRG Hex(4)HexNAc(4)NeuAc(1) 344
#> 10 V10 P10909 CLU Hex(5)HexNAc(4) 291
#> # ℹ 211 more rows
The variable information tibble is like a detailed ID card for each glycoform 🆔 - it contains everything you need to know: the protein, glycosylation site, and glycan structures.
get_sample_info(exp)
#> # A tibble: 12 × 2
#> sample group
#> <chr> <chr>
#> 1 20241224-LXJ-Nglyco-H_1 H
#> 2 20241224-LXJ-Nglyco-H_2 H
#> 3 20241224-LXJ-Nglyco-H_3 H
#> 4 20241224-LXJ-Nglyco-M_1 M
#> 5 20241224-LXJ-Nglyco-M_2 M
#> 6 20241224-LXJ-Nglyco-M_3 M
#> 7 20241224-LXJ-Nglyco-Y_1 Y
#> 8 20241224-LXJ-Nglyco-Y_2 Y
#> 9 20241224-LXJ-Nglyco-Y_3 Y
#> 10 20241224-LXJ-Nglyco-C_1 C
#> 11 20241224-LXJ-Nglyco-C_2 C
#> 12 20241224-LXJ-Nglyco-C_3 C
Our sample information tibble features a crucial “group” column 🏷️. Here’s the key: “H” = healthy, “M” = hepatitis, “Y” = cirrhosis, and “C” = hepatocellular carcinoma. This gives us a perfect setup for comparative analysis!
🚀 One Interface to Rule Them All
Here’s where glystats
really shines! ✨ Every function
follows a simple, intuitive naming pattern: gly_
+ analysis
name. Think gly_ttest()
for t-tests, gly_pca()
for PCA, and so on.
Pro tip: leverage RStudio’s auto-completion to discover all available functions! 💡
Let’s dive into action with an ANOVA analysis to identify differentially expressed glycoforms:
anova_res <- gly_anova(exp)
#> ℹ Number of groups: 4
#> ℹ Groups: "C", "H", "M", and "Y"
Boom! 💥 Analysis complete in just one line! gly_anova()
intelligently follows the glycoverse
naming conventions,
automatically detecting the “group” column in your sample info and
fitting an ANOVA model for each glycoform.
📋 Understanding Your Results
All glystats
functions return a consistent,
well-structured list with two key components:
-
tidy_result
: Clean, analysis-ready tibbles in tidy format 📊 -
raw_result
: The original result objects from underlying statistical functions 🔧
For gly_anova()
, the tidy_result
contains
two informative tibbles: main_test
and
post_hoc_test
.
anova_res$tidy_result
#> $main_test
#> # A tibble: 221 × 13
#> variable protein gene glycan_composition protein_site term df sumsq
#> <chr> <chr> <chr> <comp> <int> <chr> <dbl> <dbl>
#> 1 V1 P08185 SERPINA6 Hex(5)HexNAc(4)Neu… 176 group 3 68.3
#> 2 V2 P04196 HRG Hex(5)HexNAc(4)Neu… 344 group 3 161.
#> 3 V3 P04196 HRG Hex(5)HexNAc(4) 344 group 3 126.
#> 4 V4 P10909 CLU Hex(6)HexNAc(5) 291 group 3 22.9
#> 5 V5 P04196 HRG Hex(5)HexNAc(4)Neu… 344 group 3 472.
#> 6 V6 P04196 HRG Hex(5)HexNAc(4) 345 group 3 59.8
#> 7 V7 P04196 HRG Hex(5)HexNAc(4)dHe… 344 group 3 208.
#> 8 V8 P04196 HRG Hex(4)HexNAc(3) 344 group 3 108.
#> 9 V9 P04196 HRG Hex(4)HexNAc(4)Neu… 344 group 3 9.20
#> 10 V10 P10909 CLU Hex(5)HexNAc(4) 291 group 3 86.6
#> # ℹ 211 more rows
#> # ℹ 5 more variables: meansq <dbl>, statistic <dbl>, p_value <dbl>,
#> # p_adj <dbl>, post_hoc <chr>
#>
#> $post_hoc_test
#> # A tibble: 506 × 9
#> variable protein gene glycan_composition protein_site group1 group2 p_value
#> <chr> <chr> <chr> <comp> <int> <chr> <chr> <dbl>
#> 1 V1 P08185 SERP… Hex(5)HexNAc(4)Ne… 176 NA NA NA
#> 2 V2 P04196 HRG Hex(5)HexNAc(4)Ne… 344 C H 2.47e-7
#> 3 V2 P04196 HRG Hex(5)HexNAc(4)Ne… 344 C M 9.99e-1
#> 4 V2 P04196 HRG Hex(5)HexNAc(4)Ne… 344 C Y 6.28e-2
#> 5 V2 P04196 HRG Hex(5)HexNAc(4)Ne… 344 H M 2.34e-7
#> 6 V2 P04196 HRG Hex(5)HexNAc(4)Ne… 344 H Y 1.09e-6
#> 7 V2 P04196 HRG Hex(5)HexNAc(4)Ne… 344 M Y 5.36e-2
#> 8 V3 P04196 HRG Hex(5)HexNAc(4) 344 C H 2.41e-4
#> 9 V3 P04196 HRG Hex(5)HexNAc(4) 344 C M 5.10e-1
#> 10 V3 P04196 HRG Hex(5)HexNAc(4) 344 C Y 6.12e-1
#> # ℹ 496 more rows
#> # ℹ 1 more variable: p_adj <dbl>
Notice something cool? 😎 gly_anova()
thoughtfully adds
back all the descriptive columns from your variable tibble. Want to
control this behavior? Just use the add_info
parameter!
The raw_result
houses two lists of models - one for the
main test, one for post hoc comparisons:
names(anova_res$raw_result)
#> [1] "main_test" "post_hoc_test"
🔧 Maximum Flexibility Mode
Feeling constrained by [glyexp::experiment()]? Fear not! 🦸♀️ Every
gly_xxx()
function comes with a flexible
gly_xxx_()
sibling that works with standard R objects. For
instance, gly_anova_()
happily accepts plain matrices:
expr_mat <- get_expr_mat(exp)
groups <- factor(get_sample_info(exp)$group)
anova_res2 <- gly_anova_(expr_mat, groups)
This adaptability makes glystats
a perfect team player
🤝 - it seamlessly integrates into existing analysis pipelines and
workflows, no matter what data structures you’re already using!
🎪 The Complete Analytical Arsenal
Ready to explore the full power of glystats
? Here’s your
complete toolkit for glycomics and glycoproteomics data analysis:
-
🔬 Differential Expression Analysis:
-
gly_ttest()
: Two-sample t-test -
gly_wilcox()
: Wilcoxon rank sum test -
gly_anova()
: One-way ANOVA -
gly_kruskal()
: Kruskal-Wallis rank sum test -
gly_limma()
: Linear models for microarray data (limma) -
gly_fold_change()
: Calculate fold change
-
-
📐 Dimensionality Reduction:
-
gly_pca()
: Principal component analysis -
gly_tsne()
: t-distributed stochastic neighbor embedding (t-SNE) -
gly_umap()
: Uniform manifold approximation and projection (UMAP) -
gly_oplsda()
: Orthogonal partial least squares discriminant analysis (OPLS-DA) -
gly_plsda()
: Partial least squares discriminant analysis (PLS-DA)
-
-
🧩 Clustering:
-
gly_kmeans()
: K-means clustering -
gly_hclust()
: Hierarchical clustering -
gly_consensus_clustering()
: Consensus clustering -
gly_wgcna()
: Weighted gene co-expression network analysis (WGCNA)
-
-
⏱️ Survival Analysis:
-
gly_cox()
: Cox proportional hazards model
-
-
🎯 Enrichment Analysis:
-
gly_enrich_go()
: Gene Ontology enrichment analysis -
gly_enrich_kegg()
: KEGG enrichment analysis -
gly_enrich_reactome()
: Reactome enrichment analysis
-
- 🔧 Additional Tools:
🚀 What’s Next on Your Journey?
Ready to dive deeper into the glycoverse
? Here’s your
roadmap to success:
📥 Data Import: Start with glyread to seamlessly import your data into
glyexp::experiment()
objects🧹 Data Preprocessing: Use glyclean to polish and prepare your data for analysis
📊 Statistical Analysis: You’re here! Use
glystats
to unlock powerful insights from your glycomics data🎨 Visualization: Stay tuned! We’re crafting an amazing
glyvis
package for stunning data visualizations
Happy analyzing! 🎉✨