Skip to contents

When we talk about omics data analysis, we’re usually diving into exciting territories like differential expression analysis, PCA, survival analysis, and much more! 🧬 glystats brings all these powerful analyses together under one unified, user-friendly interface.

🎯 Key Feature: glystats offers two levels of interfaces to fit your workflow:

  • gly_xxx(): seamlessly works with [glyexp::experiment()], the beating heart of the glycoverse ecosystem 💓
  • gly_xxx_(): flexible enough to work with general inputs like matrices or data frames

This vignette focuses on the gly_xxx() interface. New to [glyexp::experiment()]? No worries! Check out its introduction first to get up to speed.

library(glystats)
library(glyexp)
library(glyread)
library(glyclean)
#> 
#> Attaching package: 'glyclean'
#> The following object is masked from 'package:stats':
#> 
#>     aggregate

🔍 Meet Your Data

Let’s start by exploring our demo dataset:

# Here we use `glyread` to read in the data and use `glyclean` to perform preprocessing
exp <- read_pglyco3_pglycoquant("glycopeptides.list", sample_info = "sample_info.csv") |> auto_clean()
#>  Reading data
#>  Finding leader proteins
#>  Finding leader proteins [97ms]
#> 
#>  Reading data Parsing glycan compositions and structures
#>  Parsing glycan compositions and structures [253ms]
#> 
#>  Reading data Reading data [782ms]
#> 
#>  Normalizing data (Median)
#>  Normalizing data (Median) [134ms]
#> 
#>  Removing variables with >50% missing values
#>  Removing variables with >50% missing values [20ms]
#> 
#>  Imputing missing values
#>  Sample size <= 30, using sample minimum imputation
#>  Imputing missing values Imputing missing values [15ms]
#> 
#>  Aggregating data
#>  Aggregating data [100ms]
#> 
#>  Normalizing data again
#>  Normalizing data again [14ms]
exp
#> 
#> ── Experiment ──────────────────────────────────────────────────────────────────
#>  Expression matrix: 12 samples, 221 variables
#>  Sample information fields: group
#>  Variable information fields: protein, gene, glycan_composition, and protein_site

Look at that! 🎉 We’ve got a [glyexp::experiment()] packed with 12 samples and 263 glycoforms. That’s plenty of data to work with!

get_var_info(exp)
#> # A tibble: 221 × 5
#>    variable protein gene     glycan_composition      protein_site
#>    <chr>    <chr>   <chr>    <comp>                         <int>
#>  1 V1       P08185  SERPINA6 Hex(5)HexNAc(4)NeuAc(2)          176
#>  2 V2       P04196  HRG      Hex(5)HexNAc(4)NeuAc(1)          344
#>  3 V3       P04196  HRG      Hex(5)HexNAc(4)                  344
#>  4 V4       P10909  CLU      Hex(6)HexNAc(5)                  291
#>  5 V5       P04196  HRG      Hex(5)HexNAc(4)NeuAc(2)          344
#>  6 V6       P04196  HRG      Hex(5)HexNAc(4)                  345
#>  7 V7       P04196  HRG      Hex(5)HexNAc(4)dHex(2)           344
#>  8 V8       P04196  HRG      Hex(4)HexNAc(3)                  344
#>  9 V9       P04196  HRG      Hex(4)HexNAc(4)NeuAc(1)          344
#> 10 V10      P10909  CLU      Hex(5)HexNAc(4)                  291
#> # ℹ 211 more rows

The variable information tibble is like a detailed ID card for each glycoform 🆔 - it contains everything you need to know: the protein, glycosylation site, and glycan structures.

get_sample_info(exp)
#> # A tibble: 12 × 2
#>    sample                  group
#>    <chr>                   <chr>
#>  1 20241224-LXJ-Nglyco-H_1 H    
#>  2 20241224-LXJ-Nglyco-H_2 H    
#>  3 20241224-LXJ-Nglyco-H_3 H    
#>  4 20241224-LXJ-Nglyco-M_1 M    
#>  5 20241224-LXJ-Nglyco-M_2 M    
#>  6 20241224-LXJ-Nglyco-M_3 M    
#>  7 20241224-LXJ-Nglyco-Y_1 Y    
#>  8 20241224-LXJ-Nglyco-Y_2 Y    
#>  9 20241224-LXJ-Nglyco-Y_3 Y    
#> 10 20241224-LXJ-Nglyco-C_1 C    
#> 11 20241224-LXJ-Nglyco-C_2 C    
#> 12 20241224-LXJ-Nglyco-C_3 C

Our sample information tibble features a crucial “group” column 🏷️. Here’s the key: “H” = healthy, “M” = hepatitis, “Y” = cirrhosis, and “C” = hepatocellular carcinoma. This gives us a perfect setup for comparative analysis!

🚀 One Interface to Rule Them All

Here’s where glystats really shines! ✨ Every function follows a simple, intuitive naming pattern: gly_ + analysis name. Think gly_ttest() for t-tests, gly_pca() for PCA, and so on.

Pro tip: leverage RStudio’s auto-completion to discover all available functions! 💡

Let’s dive into action with an ANOVA analysis to identify differentially expressed glycoforms:

anova_res <- gly_anova(exp)
#>  Number of groups: 4
#>  Groups: "C", "H", "M", and "Y"

Boom! 💥 Analysis complete in just one line! gly_anova() intelligently follows the glycoverse naming conventions, automatically detecting the “group” column in your sample info and fitting an ANOVA model for each glycoform.

📋 Understanding Your Results

All glystats functions return a consistent, well-structured list with two key components:

  • tidy_result: Clean, analysis-ready tibbles in tidy format 📊
  • raw_result: The original result objects from underlying statistical functions 🔧

For gly_anova(), the tidy_result contains two informative tibbles: main_test and post_hoc_test.

anova_res$tidy_result
#> $main_test
#> # A tibble: 221 × 13
#>    variable protein gene     glycan_composition  protein_site term     df  sumsq
#>    <chr>    <chr>   <chr>    <comp>                     <int> <chr> <dbl>  <dbl>
#>  1 V1       P08185  SERPINA6 Hex(5)HexNAc(4)Neu…          176 group     3  68.3 
#>  2 V2       P04196  HRG      Hex(5)HexNAc(4)Neu…          344 group     3 161.  
#>  3 V3       P04196  HRG      Hex(5)HexNAc(4)              344 group     3 126.  
#>  4 V4       P10909  CLU      Hex(6)HexNAc(5)              291 group     3  22.9 
#>  5 V5       P04196  HRG      Hex(5)HexNAc(4)Neu…          344 group     3 472.  
#>  6 V6       P04196  HRG      Hex(5)HexNAc(4)              345 group     3  59.8 
#>  7 V7       P04196  HRG      Hex(5)HexNAc(4)dHe…          344 group     3 208.  
#>  8 V8       P04196  HRG      Hex(4)HexNAc(3)              344 group     3 108.  
#>  9 V9       P04196  HRG      Hex(4)HexNAc(4)Neu…          344 group     3   9.20
#> 10 V10      P10909  CLU      Hex(5)HexNAc(4)              291 group     3  86.6 
#> # ℹ 211 more rows
#> # ℹ 5 more variables: meansq <dbl>, statistic <dbl>, p_value <dbl>,
#> #   p_adj <dbl>, post_hoc <chr>
#> 
#> $post_hoc_test
#> # A tibble: 506 × 9
#>    variable protein gene  glycan_composition protein_site group1 group2  p_value
#>    <chr>    <chr>   <chr> <comp>                    <int> <chr>  <chr>     <dbl>
#>  1 V1       P08185  SERP… Hex(5)HexNAc(4)Ne…          176 NA     NA     NA      
#>  2 V2       P04196  HRG   Hex(5)HexNAc(4)Ne…          344 C      H       2.47e-7
#>  3 V2       P04196  HRG   Hex(5)HexNAc(4)Ne…          344 C      M       9.99e-1
#>  4 V2       P04196  HRG   Hex(5)HexNAc(4)Ne…          344 C      Y       6.28e-2
#>  5 V2       P04196  HRG   Hex(5)HexNAc(4)Ne…          344 H      M       2.34e-7
#>  6 V2       P04196  HRG   Hex(5)HexNAc(4)Ne…          344 H      Y       1.09e-6
#>  7 V2       P04196  HRG   Hex(5)HexNAc(4)Ne…          344 M      Y       5.36e-2
#>  8 V3       P04196  HRG   Hex(5)HexNAc(4)             344 C      H       2.41e-4
#>  9 V3       P04196  HRG   Hex(5)HexNAc(4)             344 C      M       5.10e-1
#> 10 V3       P04196  HRG   Hex(5)HexNAc(4)             344 C      Y       6.12e-1
#> # ℹ 496 more rows
#> # ℹ 1 more variable: p_adj <dbl>

Notice something cool? 😎 gly_anova() thoughtfully adds back all the descriptive columns from your variable tibble. Want to control this behavior? Just use the add_info parameter!

The raw_result houses two lists of models - one for the main test, one for post hoc comparisons:

names(anova_res$raw_result)
#> [1] "main_test"     "post_hoc_test"

🔧 Maximum Flexibility Mode

Feeling constrained by [glyexp::experiment()]? Fear not! 🦸‍♀️ Every gly_xxx() function comes with a flexible gly_xxx_() sibling that works with standard R objects. For instance, gly_anova_() happily accepts plain matrices:

expr_mat <- get_expr_mat(exp)
groups <- factor(get_sample_info(exp)$group)
anova_res2 <- gly_anova_(expr_mat, groups)

This adaptability makes glystats a perfect team player 🤝 - it seamlessly integrates into existing analysis pipelines and workflows, no matter what data structures you’re already using!

🎪 The Complete Analytical Arsenal

Ready to explore the full power of glystats? Here’s your complete toolkit for glycomics and glycoproteomics data analysis:

🚀 What’s Next on Your Journey?

Ready to dive deeper into the glycoverse? Here’s your roadmap to success:

  1. 📥 Data Import: Start with glyread to seamlessly import your data into glyexp::experiment() objects

  2. 🧹 Data Preprocessing: Use glyclean to polish and prepare your data for analysis

  3. 📊 Statistical Analysis: You’re here! Use glystats to unlock powerful insights from your glycomics data

  4. 🎨 Visualization: Stay tuned! We’re crafting an amazing glyvis package for stunning data visualizations

Happy analyzing! 🎉✨