Skip to contents

When we talk about omics data analysis, we’re usually diving into exciting territories like differential expression analysis, PCA, survival analysis, and much more! 🧬 glystats brings all these powerful analyses together under one unified, user-friendly interface.

🎯 Key Feature: glystats offers two levels of interfaces to fit your workflow:

  • gly_xxx(): seamlessly works with [glyexp::experiment()], the beating heart of the glycoverse ecosystem πŸ’“
  • gly_xxx_(): flexible enough to work with general inputs like matrices or data frames

This vignette focuses on the gly_xxx() interface. New to [glyexp::experiment()]? No worries! Check out its introduction first to get up to speed.

library(glystats)
library(glyexp)
library(glyread)
library(glyclean)
#> 
#> Attaching package: 'glyclean'
#> The following object is masked from 'package:stats':
#> 
#>     aggregate
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:glyexp':
#> 
#>     select_var
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

πŸ” Meet Your Data

Let’s start by exploring our demo dataset:

# Here we use `glyread` to read in the data and use `glyclean` to perform preprocessing
exp <- read_pglyco3_pglycoquant("glycopeptides.list", sample_info = "sample_info.csv") |> auto_clean()
#> β„Ή Reading data
#> β„Ή Finding leader proteins
#> βœ” Finding leader proteins [84ms]
#> 
#> β„Ή Reading dataColumn group converted to <factor>.β„Ή Parsing glycan compositions and structures
#> Column group converted to <factor>.βœ” Parsing glycan compositions and structures [448ms]
#> 
#> β„Ή Reading dataβœ” Reading data [948ms]
#> 
#> 
#> ── Normalizing data ──
#> 
#> β„Ή No QC samples found. Using default normalization method based on experiment type.
#> β„Ή Experiment type is "glycoproteomics". Using `normalize_median()`.
#> βœ” Normalization completed.
#> 
#> ── Removing variables with too many missing values ──
#> 
#> β„Ή No QC samples found. Using all samples.
#> β„Ή Applying preset "discovery"...
#> β„Ή Total removed: 2 (0.67%) variables.
#> βœ” Variable removal completed.
#> 
#> ── Imputing missing values ──
#> 
#> β„Ή No QC samples found. Using default imputation method based on sample size.
#> β„Ή Sample size <= 30, using `impute_sample_min()`.
#> βœ” Imputation completed.
#> 
#> ── Aggregating data ──
#> 
#> β„Ή Aggregating to "gf" level
#> βœ” Aggregation completed.
#> 
#> ── Normalizing data again ──
#> 
#> β„Ή No QC samples found. Using default normalization method based on experiment type.
#> β„Ή Experiment type is "glycoproteomics". Using `normalize_median()`.
#> βœ” Normalization completed.
#> 
#> ── Correcting batch effects ──
#> 
#> β„Ή Batch column  not found in sample_info. Skipping batch correction.
#> βœ” Batch correction completed.
exp
#> 
#> ── Glycoproteomics Experiment ──────────────────────────────────────────────────
#> β„Ή Expression matrix: 12 samples, 225 variables
#> β„Ή Sample information fields: group <fct>
#> β„Ή Variable information fields: protein <chr>, glycan_composition <comp>, protein_site <int>, gene <chr>

Look at that! πŸŽ‰ We’ve got a [glyexp::experiment()] packed with 12 samples and 263 glycoforms. That’s plenty of data to work with!

get_var_info(exp)
#> # A tibble: 225 Γ— 5
#>    variable                        protein glycan_composition protein_site gene 
#>    <glue>                          <chr>   <comp>                    <int> <chr>
#>  1 P08185-N176-Hex(5)HexNAc(4)Neu… P08185  Hex(5)HexNAc(4)Ne…          176 SERP…
#>  2 P04196-N344-Hex(5)HexNAc(4)Neu… P04196  Hex(5)HexNAc(4)Ne…          344 HRG  
#>  3 P04196-N344-Hex(5)HexNAc(4)     P04196  Hex(5)HexNAc(4)             344 HRG  
#>  4 P10909-N291-Hex(6)HexNAc(5)     P10909  Hex(6)HexNAc(5)             291 CLU  
#>  5 P04196-N344-Hex(5)HexNAc(4)Neu… P04196  Hex(5)HexNAc(4)Ne…          344 HRG  
#>  6 P04196-N345-Hex(5)HexNAc(4)     P04196  Hex(5)HexNAc(4)             345 HRG  
#>  7 P04196-N344-Hex(5)HexNAc(4)dHe… P04196  Hex(5)HexNAc(4)dH…          344 HRG  
#>  8 P04196-N344-Hex(4)HexNAc(3)     P04196  Hex(4)HexNAc(3)             344 HRG  
#>  9 P04196-N344-Hex(4)HexNAc(4)Neu… P04196  Hex(4)HexNAc(4)Ne…          344 HRG  
#> 10 P10909-N291-Hex(5)HexNAc(4)     P10909  Hex(5)HexNAc(4)             291 CLU  
#> # β„Ή 215 more rows

The variable information tibble is like a detailed ID card for each glycoform πŸ†” - it contains everything you need to know: the protein, glycosylation site, and glycan structures.

get_sample_info(exp)
#> # A tibble: 12 Γ— 2
#>    sample                  group
#>    <chr>                   <fct>
#>  1 20241224-LXJ-Nglyco-H_1 H    
#>  2 20241224-LXJ-Nglyco-H_2 H    
#>  3 20241224-LXJ-Nglyco-H_3 H    
#>  4 20241224-LXJ-Nglyco-M_1 M    
#>  5 20241224-LXJ-Nglyco-M_2 M    
#>  6 20241224-LXJ-Nglyco-M_3 M    
#>  7 20241224-LXJ-Nglyco-Y_1 Y    
#>  8 20241224-LXJ-Nglyco-Y_2 Y    
#>  9 20241224-LXJ-Nglyco-Y_3 Y    
#> 10 20241224-LXJ-Nglyco-C_1 C    
#> 11 20241224-LXJ-Nglyco-C_2 C    
#> 12 20241224-LXJ-Nglyco-C_3 C

Our sample information tibble features a crucial β€œgroup” column 🏷️. Here’s the key: β€œH” = healthy, β€œM” = hepatitis, β€œY” = cirrhosis, and β€œC” = hepatocellular carcinoma. This gives us a perfect setup for comparative analysis!

πŸš€ One Interface to Rule Them All

Here’s where glystats really shines! ✨ Every function follows a simple, intuitive naming pattern: gly_ + analysis name. Think gly_ttest() for t-tests, gly_pca() for PCA, and so on.

Pro tip: leverage RStudio’s auto-completion to discover all available functions! πŸ’‘

Let’s dive into action with an ANOVA analysis to identify differentially expressed glycoforms:

anova_res <- gly_anova(exp)
#> β„Ή Number of groups: 4
#> β„Ή Groups: "C", "H", "M", and "Y"
#> β„Ή Pairwise comparisons will be performed, with levels coming first as reference groups.

Boom! πŸ’₯ Analysis complete in just one line! gly_anova() intelligently follows the glycoverse naming conventions, automatically detecting the β€œgroup” column in your sample info and fitting an ANOVA model for each glycoform.

πŸ“‹ Understanding Your Results

All glystats functions return a consistent, well-structured list with two key components:

  • tidy_result: Clean, analysis-ready tibbles in tidy format πŸ“Š
  • raw_result: The original result objects from underlying statistical functions πŸ”§

For gly_anova(), the tidy_result contains two informative tibbles: main_test and post_hoc_test.

You can use get_tidy_result() to get the tidy result tibble:

get_tidy_result(anova_res, "main_test")
#> # A tibble: 225 Γ— 13
#>    variable     protein glycan_composition protein_site gene  term     df  sumsq
#>    <glue>       <chr>   <comp>                    <int> <chr> <chr> <dbl>  <dbl>
#>  1 P08185-N176… P08185  Hex(5)HexNAc(4)Ne…          176 SERP… group     3  67.7 
#>  2 P04196-N344… P04196  Hex(5)HexNAc(4)Ne…          344 HRG   group     3 161.  
#>  3 P04196-N344… P04196  Hex(5)HexNAc(4)             344 HRG   group     3 126.  
#>  4 P10909-N291… P10909  Hex(6)HexNAc(5)             291 CLU   group     3  23.1 
#>  5 P04196-N344… P04196  Hex(5)HexNAc(4)Ne…          344 HRG   group     3 472.  
#>  6 P04196-N345… P04196  Hex(5)HexNAc(4)             345 HRG   group     3  60.0 
#>  7 P04196-N344… P04196  Hex(5)HexNAc(4)dH…          344 HRG   group     3 208.  
#>  8 P04196-N344… P04196  Hex(4)HexNAc(3)             344 HRG   group     3 109.  
#>  9 P04196-N344… P04196  Hex(4)HexNAc(4)Ne…          344 HRG   group     3   9.66
#> 10 P10909-N291… P10909  Hex(5)HexNAc(4)             291 CLU   group     3  87.0 
#> # β„Ή 215 more rows
#> # β„Ή 5 more variables: meansq <dbl>, statistic <dbl>, p_val <dbl>, p_adj <dbl>,
#> #   post_hoc <chr>

Notice something cool? 😎 gly_anova() thoughtfully adds back all the descriptive columns from your variable tibble. Want to control this behavior? Just use the add_info parameter!

The raw_result houses two lists of models - one for the main test, one for post hoc comparisons:

names(get_raw_result(anova_res))
#> [1] "main_test"     "post_hoc_test"

get_tidy_result() and get_raw_result() are useful to be used in pipes:

exp |>
  gly_anova() |>
  get_tidy_result("main_test") |>
  filter(p_adj < 0.05)
#> β„Ή Number of groups: 4
#> β„Ή Groups: "C", "H", "M", and "Y"
#> β„Ή Pairwise comparisons will be performed, with levels coming first as reference groups.
#> # A tibble: 60 Γ— 13
#>    variable      protein glycan_composition protein_site gene  term     df sumsq
#>    <glue>        <chr>   <comp>                    <int> <chr> <chr> <dbl> <dbl>
#>  1 P04196-N344-… P04196  Hex(5)HexNAc(4)Ne…          344 HRG   group     3 161. 
#>  2 P04196-N344-… P04196  Hex(5)HexNAc(4)             344 HRG   group     3 126. 
#>  3 P04196-N344-… P04196  Hex(5)HexNAc(4)Ne…          344 HRG   group     3 472. 
#>  4 P04196-N344-… P04196  Hex(5)HexNAc(4)dH…          344 HRG   group     3 208. 
#>  5 P10909-N291-… P10909  Hex(5)HexNAc(4)             291 CLU   group     3  87.0
#>  6 P04196-N344-… P04196  Hex(5)HexNAc(4)dH…          344 HRG   group     3 176. 
#>  7 P13671-N855-… P13671  Hex(5)HexNAc(4)dH…          855 C6    group     3  81.1
#>  8 P04196-N344-… P04196  Hex(4)HexNAc(3)dH…          344 HRG   group     3 159. 
#>  9 P04196-N344-… P04196  Hex(5)HexNAc(4)dH…          344 HRG   group     3 150. 
#> 10 P01019-N161-… P01019  Hex(5)HexNAc(4)Ne…          161 AGT   group     3  46.4
#> # β„Ή 50 more rows
#> # β„Ή 5 more variables: meansq <dbl>, statistic <dbl>, p_val <dbl>, p_adj <dbl>,
#> #   post_hoc <chr>

πŸ”§ Maximum Flexibility Mode

Feeling constrained by [glyexp::experiment()]? Fear not! πŸ¦Έβ€β™€οΈ Every gly_xxx() function comes with a flexible gly_xxx_() sibling that works with standard R objects. For instance, gly_anova_() happily accepts plain matrices:

expr_mat <- get_expr_mat(exp)
groups <- factor(get_sample_info(exp)$group)
anova_res2 <- gly_anova_(expr_mat, groups)

This adaptability makes glystats a perfect team player 🀝 - it seamlessly integrates into existing analysis pipelines and workflows, no matter what data structures you’re already using!

πŸŽͺ The Complete Analytical Arsenal

Ready to explore the full power of glystats? Here’s your complete toolkit for glycomics and glycoproteomics data analysis:

πŸš€ What’s Next on Your Journey?

Ready to dive deeper into the glycoverse? Here’s your roadmap to success:

  1. πŸ“₯ Data Import: Start with glyread to seamlessly import your data into glyexp::experiment() objects

  2. 🧹 Data Preprocessing: Use glyclean to polish and prepare your data for analysis

  3. πŸ“Š Statistical Analysis: You’re here! Use glystats to unlock powerful insights from your glycomics data

  4. 🎨 Visualization: Stay tuned! We’re crafting an amazing glyvis package for stunning data visualizations

Happy analyzing! πŸŽ‰βœ¨