Getting Started with glystats • glystats

When we talk about omics data analysis, we’re usually diving into exciting territories like differential expression analysis, PCA, survival analysis, and much more! 🧬 glystats brings all these powerful analyses together under one unified, user-friendly interface.

🎯 Key Feature: glystats offers two levels of interfaces to fit your workflow:

gly_xxx(): seamlessly works with [glyexp::experiment()], the beating heart of the glycoverse ecosystem 💓
gly_xxx_(): flexible enough to work with general inputs like matrices or data frames

This vignette focuses on the gly_xxx() interface. New to [glyexp::experiment()]? No worries! Check out its introduction first to get up to speed.

library(glystats)
library(glyexp)
library(glyread)
library(glyclean)
#> 
#> Attaching package: 'glyclean'
#> The following object is masked from 'package:stats':
#> 
#>     aggregate
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:glyexp':
#> 
#>     select_var
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

🔍 Meet Your Data

Let’s start by exploring our demo dataset:

# Here we use `glyread` to read in the data and use `glyclean` to perform preprocessing
exp <- read_pglyco3_pglycoquant("glycopeptides.list", sample_info = "sample_info.csv") |> auto_clean()
#> ℹ Reading data
#> ℹ Finding leader proteins
#> ✔ Finding leader proteins [82ms]
#> 
#> ℹ Reading dataColumn group converted to <factor>.ℹ Parsing glycan compositions and structures
#> Column group converted to <factor>.✔ Parsing glycan compositions and structures [277ms]
#> 
#> ℹ Reading data✔ Reading data [772ms]
#> 
#> ℹ Normalizing data (Median)
#> ✔ Normalizing data (Median) [130ms]
#> 
#> ℹ Removing variables with >50% missing values
#> ✔ Removing variables with >50% missing values [19ms]
#> 
#> ℹ Imputing missing values
#> ℹ Sample size <= 30, using sample minimum imputation
#> ℹ Imputing missing values✔ Imputing missing values [15ms]
#> 
#> ℹ Aggregating data
#> ✔ Aggregating data [193ms]
#> 
#> ℹ Normalizing data again
#> ✔ Normalizing data again [14ms]
exp
#> 
#> ── Glycoproteomics Experiment ──────────────────────────────────────────────────
#> ℹ Expression matrix: 12 samples, 221 variables
#> ℹ Sample information fields: group <fct>
#> ℹ Variable information fields: protein <chr>, glycan_composition <comp>, protein_site <int>, gene <chr>

Look at that! 🎉 We’ve got a [glyexp::experiment()] packed with 12 samples and 263 glycoforms. That’s plenty of data to work with!

get_var_info(exp)
#> # A tibble: 221 × 5
#>    variable protein glycan_composition      protein_site gene    
#>    <chr>    <chr>   <comp>                         <int> <chr>   
#>  1 V1       P08185  Hex(5)HexNAc(4)NeuAc(2)          176 SERPINA6
#>  2 V2       P04196  Hex(5)HexNAc(4)NeuAc(1)          344 HRG     
#>  3 V3       P04196  Hex(5)HexNAc(4)                  344 HRG     
#>  4 V4       P10909  Hex(6)HexNAc(5)                  291 CLU     
#>  5 V5       P04196  Hex(5)HexNAc(4)NeuAc(2)          344 HRG     
#>  6 V6       P04196  Hex(5)HexNAc(4)                  345 HRG     
#>  7 V7       P04196  Hex(5)HexNAc(4)dHex(2)           344 HRG     
#>  8 V8       P04196  Hex(4)HexNAc(3)                  344 HRG     
#>  9 V9       P04196  Hex(4)HexNAc(4)NeuAc(1)          344 HRG     
#> 10 V10      P10909  Hex(5)HexNAc(4)                  291 CLU     
#> # ℹ 211 more rows

The variable information tibble is like a detailed ID card for each glycoform 🆔 - it contains everything you need to know: the protein, glycosylation site, and glycan structures.

get_sample_info(exp)
#> # A tibble: 12 × 2
#>    sample                  group
#>    <chr>                   <fct>
#>  1 20241224-LXJ-Nglyco-H_1 H    
#>  2 20241224-LXJ-Nglyco-H_2 H    
#>  3 20241224-LXJ-Nglyco-H_3 H    
#>  4 20241224-LXJ-Nglyco-M_1 M    
#>  5 20241224-LXJ-Nglyco-M_2 M    
#>  6 20241224-LXJ-Nglyco-M_3 M    
#>  7 20241224-LXJ-Nglyco-Y_1 Y    
#>  8 20241224-LXJ-Nglyco-Y_2 Y    
#>  9 20241224-LXJ-Nglyco-Y_3 Y    
#> 10 20241224-LXJ-Nglyco-C_1 C    
#> 11 20241224-LXJ-Nglyco-C_2 C    
#> 12 20241224-LXJ-Nglyco-C_3 C

Our sample information tibble features a crucial “group” column 🏷️. Here’s the key: “H” = healthy, “M” = hepatitis, “Y” = cirrhosis, and “C” = hepatocellular carcinoma. This gives us a perfect setup for comparative analysis!

🚀 One Interface to Rule Them All

Here’s where glystats really shines! ✨ Every function follows a simple, intuitive naming pattern: gly_ + analysis name. Think gly_ttest() for t-tests, gly_pca() for PCA, and so on.

Pro tip: leverage RStudio’s auto-completion to discover all available functions! 💡

Let’s dive into action with an ANOVA analysis to identify differentially expressed glycoforms:

anova_res <- gly_anova(exp)
#> ℹ Number of groups: 4
#> ℹ Groups: "C", "H", "M", and "Y"
#> ℹ Pairwise comparisons will be performed, with levels coming first as reference groups.

Boom! 💥 Analysis complete in just one line! gly_anova() intelligently follows the glycoverse naming conventions, automatically detecting the “group” column in your sample info and fitting an ANOVA model for each glycoform.

📋 Understanding Your Results

All glystats functions return a consistent, well-structured list with two key components:

tidy_result: Clean, analysis-ready tibbles in tidy format 📊
raw_result: The original result objects from underlying statistical functions 🔧

For gly_anova(), the tidy_result contains two informative tibbles: main_test and post_hoc_test.

You can use get_tidy_result() to get the tidy result tibble:

get_tidy_result(anova_res, "main_test")
#> # A tibble: 221 × 13
#>    variable term     df sumsq meansq statistic   p_val  p_adj post_hoc   protein
#>    <chr>    <chr> <dbl> <dbl>  <dbl>     <dbl>   <dbl>  <dbl> <chr>      <chr>  
#>  1 V1       group     3 68.3   22.8      6.72  0.0141  0.0527 NA         P08185 
#>  2 V10      group     3 86.6   28.9      7.86  0.00906 0.0408 C_vs_H;H_… P10909 
#>  3 V100     group     3 44.2   14.7      0.527 0.676   0.755  NA         P01871 
#>  4 V101     group     3  8.57   2.86     4.00  0.0520  0.122  NA         P01871 
#>  5 V102     group     3  5.66   1.89     0.367 0.779   0.819  NA         P01871 
#>  6 V103     group     3  3.32   1.11     6.41  0.0160  0.0547 NA         P01871 
#>  7 V104     group     3 14.0    4.68     6.76  0.0139  0.0527 NA         P01871 
#>  8 V105     group     3  7.37   2.46     1.51  0.284   0.409  NA         P01871 
#>  9 V106     group     3  5.72   1.91     1.86  0.215   0.335  NA         P10909 
#> 10 V107     group     3 10.8    3.60     0.829 0.514   0.625  NA         P01871 
#> # ℹ 211 more rows
#> # ℹ 3 more variables: glycan_composition <comp>, protein_site <int>, gene <chr>

Notice something cool? 😎 gly_anova() thoughtfully adds back all the descriptive columns from your variable tibble. Want to control this behavior? Just use the add_info parameter!

The raw_result houses two lists of models - one for the main test, one for post hoc comparisons:

names(get_raw_result(anova_res))
#> [1] "main_test"     "post_hoc_test"

get_tidy_result() and get_raw_result() are useful to be used in pipes:

exp |>
  gly_anova() |>
  get_tidy_result("main_test") |>
  filter(p_adj < 0.05)
#> ℹ Number of groups: 4
#> ℹ Groups: "C", "H", "M", and "Y"
#> ℹ Pairwise comparisons will be performed, with levels coming first as reference groups.
#> # A tibble: 57 × 13
#>    variable term     df  sumsq meansq statistic   p_val   p_adj post_hoc protein
#>    <chr>    <chr> <dbl>  <dbl>  <dbl>     <dbl>   <dbl>   <dbl> <chr>    <chr>  
#>  1 V10      group     3  86.6   28.9       7.86 9.06e-3 4.08e-2 C_vs_H;… P10909 
#>  2 V11      group     3 176.    58.7      20.6  4.02e-4 5.93e-3 C_vs_H;… P04196 
#>  3 V111     group     3 225.    75.0      46.9  2.01e-5 4.94e-4 C_vs_H;… P10909 
#>  4 V114     group     3   9.68   3.23      8.09 8.31e-3 3.83e-2 C_vs_H;… P01871 
#>  5 V12      group     3  81.2   27.1       9.82 4.66e-3 2.72e-2 C_vs_H;… P13671 
#>  6 V123     group     3  10.9    3.62      9.72 4.80e-3 2.72e-2 C_vs_H;… P10909 
#>  7 V124     group     3   9.14   3.05     11.0  3.25e-3 2.40e-2 C_vs_H;… P02790 
#>  8 V128     group     3 309.   103.      146.   2.56e-7 1.62e-5 C_vs_H;… P01860 
#>  9 V129     group     3 273.    90.9      11.2  3.14e-3 2.40e-2 H_vs_M;… P01860 
#> 10 V13      group     3 158.    52.7     141.   2.92e-7 1.62e-5 C_vs_H;… P04196 
#> # ℹ 47 more rows
#> # ℹ 3 more variables: glycan_composition <comp>, protein_site <int>, gene <chr>

🔧 Maximum Flexibility Mode

Feeling constrained by [glyexp::experiment()]? Fear not! 🦸‍♀️ Every gly_xxx() function comes with a flexible gly_xxx_() sibling that works with standard R objects. For instance, gly_anova_() happily accepts plain matrices:

expr_mat <- get_expr_mat(exp)
groups <- factor(get_sample_info(exp)$group)
anova_res2 <- gly_anova_(expr_mat, groups)

This adaptability makes glystats a perfect team player 🤝 - it seamlessly integrates into existing analysis pipelines and workflows, no matter what data structures you’re already using!

🎪 The Complete Analytical Arsenal

Ready to explore the full power of glystats? Here’s your complete toolkit for glycomics and glycoproteomics data analysis:

🔬 Differential Expression Analysis:
- gly_ttest(): Two-sample t-test
- gly_wilcox(): Wilcoxon rank sum test
- gly_anova(): One-way ANOVA
- gly_kruskal(): Kruskal-Wallis rank sum test
- gly_limma(): Linear models for microarray data (limma)
- gly_fold_change(): Calculate fold change
📐 Dimensionality Reduction:
- gly_pca(): Principal component analysis
- gly_tsne(): t-distributed stochastic neighbor embedding (t-SNE)
- gly_umap(): Uniform manifold approximation and projection (UMAP)
- gly_oplsda(): Orthogonal partial least squares discriminant analysis (OPLS-DA)
- gly_plsda(): Partial least squares discriminant analysis (PLS-DA)
🧩 Clustering:
- gly_kmeans(): K-means clustering
- gly_hclust(): Hierarchical clustering
- gly_consensus_clustering(): Consensus clustering
- gly_wgcna(): Weighted gene co-expression network analysis (WGCNA)
⏱️ Survival Analysis:
- gly_cox(): Cox proportional hazards model
🎯 Enrichment Analysis:
- gly_enrich_go(): Gene Ontology enrichment analysis
- gly_enrich_kegg(): KEGG enrichment analysis
- gly_enrich_reactome(): Reactome enrichment analysis
🔧 Additional Tools:
- gly_cor(): Correlation analysis
- gly_roc(): Receiver operating characteristic (ROC) analysis

🚀 What’s Next on Your Journey?

Ready to dive deeper into the glycoverse? Here’s your roadmap to success:

📥 Data Import: Start with glyread to seamlessly import your data into glyexp::experiment() objects
🧹 Data Preprocessing: Use glyclean to polish and prepare your data for analysis
📊 Statistical Analysis: You’re here! Use glystats to unlock powerful insights from your glycomics data
🎨 Visualization: Stay tuned! We’re crafting an amazing glyvis package for stunning data visualizations

Happy analyzing! 🎉✨