Skip to contents

When we talk about omics data analysis, we’re usually diving into exciting territories like differential expression analysis, PCA, survival analysis, and much more! 🧬 glystats brings all these powerful analyses together under one unified, user-friendly interface.

🎯 Key Feature: glystats offers two levels of interfaces to fit your workflow:

  • gly_xxx(): seamlessly works with [glyexp::experiment()], the beating heart of the glycoverse ecosystem πŸ’“
  • gly_xxx_(): flexible enough to work with general inputs like matrices or data frames

This vignette focuses on the gly_xxx() interface. New to [glyexp::experiment()]? No worries! Check out its introduction first to get up to speed.

library(glystats)
library(glyexp)
library(glyread)
library(glyclean)
#> 
#> Attaching package: 'glyclean'
#> The following object is masked from 'package:stats':
#> 
#>     aggregate
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:glyexp':
#> 
#>     select_var
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

πŸ” Meet Your Data

Let’s start by exploring our demo dataset:

# Here we use `glyread` to read in the data and use `glyclean` to perform preprocessing
exp <- read_pglyco3_pglycoquant("glycopeptides.list", sample_info = "sample_info.csv") |> auto_clean()
#> β„Ή Reading data
#> β„Ή Finding leader proteins
#> βœ” Finding leader proteins [82ms]
#> 
#> β„Ή Reading dataColumn group converted to <factor>.β„Ή Parsing glycan compositions and structures
#> Column group converted to <factor>.βœ” Parsing glycan compositions and structures [277ms]
#> 
#> β„Ή Reading dataβœ” Reading data [772ms]
#> 
#> β„Ή Normalizing data (Median)
#> βœ” Normalizing data (Median) [130ms]
#> 
#> β„Ή Removing variables with >50% missing values
#> βœ” Removing variables with >50% missing values [19ms]
#> 
#> β„Ή Imputing missing values
#> β„Ή Sample size <= 30, using sample minimum imputation
#> β„Ή Imputing missing valuesβœ” Imputing missing values [15ms]
#> 
#> β„Ή Aggregating data
#> βœ” Aggregating data [193ms]
#> 
#> β„Ή Normalizing data again
#> βœ” Normalizing data again [14ms]
exp
#> 
#> ── Glycoproteomics Experiment ──────────────────────────────────────────────────
#> β„Ή Expression matrix: 12 samples, 221 variables
#> β„Ή Sample information fields: group <fct>
#> β„Ή Variable information fields: protein <chr>, glycan_composition <comp>, protein_site <int>, gene <chr>

Look at that! πŸŽ‰ We’ve got a [glyexp::experiment()] packed with 12 samples and 263 glycoforms. That’s plenty of data to work with!

get_var_info(exp)
#> # A tibble: 221 Γ— 5
#>    variable protein glycan_composition      protein_site gene    
#>    <chr>    <chr>   <comp>                         <int> <chr>   
#>  1 V1       P08185  Hex(5)HexNAc(4)NeuAc(2)          176 SERPINA6
#>  2 V2       P04196  Hex(5)HexNAc(4)NeuAc(1)          344 HRG     
#>  3 V3       P04196  Hex(5)HexNAc(4)                  344 HRG     
#>  4 V4       P10909  Hex(6)HexNAc(5)                  291 CLU     
#>  5 V5       P04196  Hex(5)HexNAc(4)NeuAc(2)          344 HRG     
#>  6 V6       P04196  Hex(5)HexNAc(4)                  345 HRG     
#>  7 V7       P04196  Hex(5)HexNAc(4)dHex(2)           344 HRG     
#>  8 V8       P04196  Hex(4)HexNAc(3)                  344 HRG     
#>  9 V9       P04196  Hex(4)HexNAc(4)NeuAc(1)          344 HRG     
#> 10 V10      P10909  Hex(5)HexNAc(4)                  291 CLU     
#> # β„Ή 211 more rows

The variable information tibble is like a detailed ID card for each glycoform πŸ†” - it contains everything you need to know: the protein, glycosylation site, and glycan structures.

get_sample_info(exp)
#> # A tibble: 12 Γ— 2
#>    sample                  group
#>    <chr>                   <fct>
#>  1 20241224-LXJ-Nglyco-H_1 H    
#>  2 20241224-LXJ-Nglyco-H_2 H    
#>  3 20241224-LXJ-Nglyco-H_3 H    
#>  4 20241224-LXJ-Nglyco-M_1 M    
#>  5 20241224-LXJ-Nglyco-M_2 M    
#>  6 20241224-LXJ-Nglyco-M_3 M    
#>  7 20241224-LXJ-Nglyco-Y_1 Y    
#>  8 20241224-LXJ-Nglyco-Y_2 Y    
#>  9 20241224-LXJ-Nglyco-Y_3 Y    
#> 10 20241224-LXJ-Nglyco-C_1 C    
#> 11 20241224-LXJ-Nglyco-C_2 C    
#> 12 20241224-LXJ-Nglyco-C_3 C

Our sample information tibble features a crucial β€œgroup” column 🏷️. Here’s the key: β€œH” = healthy, β€œM” = hepatitis, β€œY” = cirrhosis, and β€œC” = hepatocellular carcinoma. This gives us a perfect setup for comparative analysis!

πŸš€ One Interface to Rule Them All

Here’s where glystats really shines! ✨ Every function follows a simple, intuitive naming pattern: gly_ + analysis name. Think gly_ttest() for t-tests, gly_pca() for PCA, and so on.

Pro tip: leverage RStudio’s auto-completion to discover all available functions! πŸ’‘

Let’s dive into action with an ANOVA analysis to identify differentially expressed glycoforms:

anova_res <- gly_anova(exp)
#> β„Ή Number of groups: 4
#> β„Ή Groups: "C", "H", "M", and "Y"
#> β„Ή Pairwise comparisons will be performed, with levels coming first as reference groups.

Boom! πŸ’₯ Analysis complete in just one line! gly_anova() intelligently follows the glycoverse naming conventions, automatically detecting the β€œgroup” column in your sample info and fitting an ANOVA model for each glycoform.

πŸ“‹ Understanding Your Results

All glystats functions return a consistent, well-structured list with two key components:

  • tidy_result: Clean, analysis-ready tibbles in tidy format πŸ“Š
  • raw_result: The original result objects from underlying statistical functions πŸ”§

For gly_anova(), the tidy_result contains two informative tibbles: main_test and post_hoc_test.

You can use get_tidy_result() to get the tidy result tibble:

get_tidy_result(anova_res, "main_test")
#> # A tibble: 221 Γ— 13
#>    variable term     df sumsq meansq statistic   p_val  p_adj post_hoc   protein
#>    <chr>    <chr> <dbl> <dbl>  <dbl>     <dbl>   <dbl>  <dbl> <chr>      <chr>  
#>  1 V1       group     3 68.3   22.8      6.72  0.0141  0.0527 NA         P08185 
#>  2 V10      group     3 86.6   28.9      7.86  0.00906 0.0408 C_vs_H;H_… P10909 
#>  3 V100     group     3 44.2   14.7      0.527 0.676   0.755  NA         P01871 
#>  4 V101     group     3  8.57   2.86     4.00  0.0520  0.122  NA         P01871 
#>  5 V102     group     3  5.66   1.89     0.367 0.779   0.819  NA         P01871 
#>  6 V103     group     3  3.32   1.11     6.41  0.0160  0.0547 NA         P01871 
#>  7 V104     group     3 14.0    4.68     6.76  0.0139  0.0527 NA         P01871 
#>  8 V105     group     3  7.37   2.46     1.51  0.284   0.409  NA         P01871 
#>  9 V106     group     3  5.72   1.91     1.86  0.215   0.335  NA         P10909 
#> 10 V107     group     3 10.8    3.60     0.829 0.514   0.625  NA         P01871 
#> # β„Ή 211 more rows
#> # β„Ή 3 more variables: glycan_composition <comp>, protein_site <int>, gene <chr>

Notice something cool? 😎 gly_anova() thoughtfully adds back all the descriptive columns from your variable tibble. Want to control this behavior? Just use the add_info parameter!

The raw_result houses two lists of models - one for the main test, one for post hoc comparisons:

names(get_raw_result(anova_res))
#> [1] "main_test"     "post_hoc_test"

get_tidy_result() and get_raw_result() are useful to be used in pipes:

exp |>
  gly_anova() |>
  get_tidy_result("main_test") |>
  filter(p_adj < 0.05)
#> β„Ή Number of groups: 4
#> β„Ή Groups: "C", "H", "M", and "Y"
#> β„Ή Pairwise comparisons will be performed, with levels coming first as reference groups.
#> # A tibble: 57 Γ— 13
#>    variable term     df  sumsq meansq statistic   p_val   p_adj post_hoc protein
#>    <chr>    <chr> <dbl>  <dbl>  <dbl>     <dbl>   <dbl>   <dbl> <chr>    <chr>  
#>  1 V10      group     3  86.6   28.9       7.86 9.06e-3 4.08e-2 C_vs_H;… P10909 
#>  2 V11      group     3 176.    58.7      20.6  4.02e-4 5.93e-3 C_vs_H;… P04196 
#>  3 V111     group     3 225.    75.0      46.9  2.01e-5 4.94e-4 C_vs_H;… P10909 
#>  4 V114     group     3   9.68   3.23      8.09 8.31e-3 3.83e-2 C_vs_H;… P01871 
#>  5 V12      group     3  81.2   27.1       9.82 4.66e-3 2.72e-2 C_vs_H;… P13671 
#>  6 V123     group     3  10.9    3.62      9.72 4.80e-3 2.72e-2 C_vs_H;… P10909 
#>  7 V124     group     3   9.14   3.05     11.0  3.25e-3 2.40e-2 C_vs_H;… P02790 
#>  8 V128     group     3 309.   103.      146.   2.56e-7 1.62e-5 C_vs_H;… P01860 
#>  9 V129     group     3 273.    90.9      11.2  3.14e-3 2.40e-2 H_vs_M;… P01860 
#> 10 V13      group     3 158.    52.7     141.   2.92e-7 1.62e-5 C_vs_H;… P04196 
#> # β„Ή 47 more rows
#> # β„Ή 3 more variables: glycan_composition <comp>, protein_site <int>, gene <chr>

πŸ”§ Maximum Flexibility Mode

Feeling constrained by [glyexp::experiment()]? Fear not! πŸ¦Έβ€β™€οΈ Every gly_xxx() function comes with a flexible gly_xxx_() sibling that works with standard R objects. For instance, gly_anova_() happily accepts plain matrices:

expr_mat <- get_expr_mat(exp)
groups <- factor(get_sample_info(exp)$group)
anova_res2 <- gly_anova_(expr_mat, groups)

This adaptability makes glystats a perfect team player 🀝 - it seamlessly integrates into existing analysis pipelines and workflows, no matter what data structures you’re already using!

πŸŽͺ The Complete Analytical Arsenal

Ready to explore the full power of glystats? Here’s your complete toolkit for glycomics and glycoproteomics data analysis:

πŸš€ What’s Next on Your Journey?

Ready to dive deeper into the glycoverse? Here’s your roadmap to success:

  1. πŸ“₯ Data Import: Start with glyread to seamlessly import your data into glyexp::experiment() objects

  2. 🧹 Data Preprocessing: Use glyclean to polish and prepare your data for analysis

  3. πŸ“Š Statistical Analysis: You’re here! Use glystats to unlock powerful insights from your glycomics data

  4. 🎨 Visualization: Stay tuned! We’re crafting an amazing glyvis package for stunning data visualizations

Happy analyzing! πŸŽ‰βœ¨