
Getting Started with glystats
glystats.RmdWhen we talk about omics data analysis, weβre usually diving into
exciting territories like differential expression analysis, PCA,
survival analysis, and much more! 𧬠glystats brings all
these powerful analyses together under one unified, user-friendly
interface.
π― Key Feature: glystats offers two
levels of interfaces to fit your workflow:
-
gly_xxx(): seamlessly works with [glyexp::experiment()], the beating heart of theglycoverseecosystem π -
gly_xxx_(): flexible enough to work with general inputs like matrices or data frames
This vignette focuses on the gly_xxx() interface. New to
[glyexp::experiment()]? No worries! Check out its introduction
first to get up to speed.
library(glystats)
library(glyexp)
library(glyread)
library(glyclean)
#>
#> Attaching package: 'glyclean'
#> The following object is masked from 'package:stats':
#>
#> aggregate
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:glyexp':
#>
#> select_var
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, unionπ Meet Your Data
Letβs start by exploring our demo dataset:
# Here we use `glyread` to read in the data and use `glyclean` to perform preprocessing
exp <- read_pglyco3_pglycoquant("glycopeptides.list", sample_info = "sample_info.csv") |> auto_clean()
#> βΉ Reading data
#> βΉ Finding leader proteins
#> β Finding leader proteins [82ms]
#>
#> βΉ Reading dataColumn group converted to <factor>.βΉ Parsing glycan compositions and structures
#> Column group converted to <factor>.β Parsing glycan compositions and structures [277ms]
#>
#> βΉ Reading dataβ Reading data [772ms]
#>
#> βΉ Normalizing data (Median)
#> β Normalizing data (Median) [130ms]
#>
#> βΉ Removing variables with >50% missing values
#> β Removing variables with >50% missing values [19ms]
#>
#> βΉ Imputing missing values
#> βΉ Sample size <= 30, using sample minimum imputation
#> βΉ Imputing missing valuesβ Imputing missing values [15ms]
#>
#> βΉ Aggregating data
#> β Aggregating data [193ms]
#>
#> βΉ Normalizing data again
#> β Normalizing data again [14ms]
exp
#>
#> ββ Glycoproteomics Experiment ββββββββββββββββββββββββββββββββββββββββββββββββββ
#> βΉ Expression matrix: 12 samples, 221 variables
#> βΉ Sample information fields: group <fct>
#> βΉ Variable information fields: protein <chr>, glycan_composition <comp>, protein_site <int>, gene <chr>Look at that! π Weβve got a [glyexp::experiment()] packed with 12 samples and 263 glycoforms. Thatβs plenty of data to work with!
get_var_info(exp)
#> # A tibble: 221 Γ 5
#> variable protein glycan_composition protein_site gene
#> <chr> <chr> <comp> <int> <chr>
#> 1 V1 P08185 Hex(5)HexNAc(4)NeuAc(2) 176 SERPINA6
#> 2 V2 P04196 Hex(5)HexNAc(4)NeuAc(1) 344 HRG
#> 3 V3 P04196 Hex(5)HexNAc(4) 344 HRG
#> 4 V4 P10909 Hex(6)HexNAc(5) 291 CLU
#> 5 V5 P04196 Hex(5)HexNAc(4)NeuAc(2) 344 HRG
#> 6 V6 P04196 Hex(5)HexNAc(4) 345 HRG
#> 7 V7 P04196 Hex(5)HexNAc(4)dHex(2) 344 HRG
#> 8 V8 P04196 Hex(4)HexNAc(3) 344 HRG
#> 9 V9 P04196 Hex(4)HexNAc(4)NeuAc(1) 344 HRG
#> 10 V10 P10909 Hex(5)HexNAc(4) 291 CLU
#> # βΉ 211 more rowsThe variable information tibble is like a detailed ID card for each glycoform π - it contains everything you need to know: the protein, glycosylation site, and glycan structures.
get_sample_info(exp)
#> # A tibble: 12 Γ 2
#> sample group
#> <chr> <fct>
#> 1 20241224-LXJ-Nglyco-H_1 H
#> 2 20241224-LXJ-Nglyco-H_2 H
#> 3 20241224-LXJ-Nglyco-H_3 H
#> 4 20241224-LXJ-Nglyco-M_1 M
#> 5 20241224-LXJ-Nglyco-M_2 M
#> 6 20241224-LXJ-Nglyco-M_3 M
#> 7 20241224-LXJ-Nglyco-Y_1 Y
#> 8 20241224-LXJ-Nglyco-Y_2 Y
#> 9 20241224-LXJ-Nglyco-Y_3 Y
#> 10 20241224-LXJ-Nglyco-C_1 C
#> 11 20241224-LXJ-Nglyco-C_2 C
#> 12 20241224-LXJ-Nglyco-C_3 COur sample information tibble features a crucial βgroupβ column π·οΈ. Hereβs the key: βHβ = healthy, βMβ = hepatitis, βYβ = cirrhosis, and βCβ = hepatocellular carcinoma. This gives us a perfect setup for comparative analysis!
π One Interface to Rule Them All
Hereβs where glystats really shines! β¨ Every function
follows a simple, intuitive naming pattern: gly_ + analysis
name. Think gly_ttest() for t-tests, gly_pca()
for PCA, and so on.
Pro tip: leverage RStudioβs auto-completion to discover all available functions! π‘
Letβs dive into action with an ANOVA analysis to identify differentially expressed glycoforms:
anova_res <- gly_anova(exp)
#> βΉ Number of groups: 4
#> βΉ Groups: "C", "H", "M", and "Y"
#> βΉ Pairwise comparisons will be performed, with levels coming first as reference groups.Boom! π₯ Analysis complete in just one line! gly_anova()
intelligently follows the glycoverse naming conventions,
automatically detecting the βgroupβ column in your sample info and
fitting an ANOVA model for each glycoform.
π Understanding Your Results
All glystats functions return a consistent,
well-structured list with two key components:
-
tidy_result: Clean, analysis-ready tibbles in tidy format π -
raw_result: The original result objects from underlying statistical functions π§
For gly_anova(), the tidy_result contains
two informative tibbles: main_test and
post_hoc_test.
You can use get_tidy_result() to get the tidy result
tibble:
get_tidy_result(anova_res, "main_test")
#> # A tibble: 221 Γ 13
#> variable term df sumsq meansq statistic p_val p_adj post_hoc protein
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 V1 group 3 68.3 22.8 6.72 0.0141 0.0527 NA P08185
#> 2 V10 group 3 86.6 28.9 7.86 0.00906 0.0408 C_vs_H;H_β¦ P10909
#> 3 V100 group 3 44.2 14.7 0.527 0.676 0.755 NA P01871
#> 4 V101 group 3 8.57 2.86 4.00 0.0520 0.122 NA P01871
#> 5 V102 group 3 5.66 1.89 0.367 0.779 0.819 NA P01871
#> 6 V103 group 3 3.32 1.11 6.41 0.0160 0.0547 NA P01871
#> 7 V104 group 3 14.0 4.68 6.76 0.0139 0.0527 NA P01871
#> 8 V105 group 3 7.37 2.46 1.51 0.284 0.409 NA P01871
#> 9 V106 group 3 5.72 1.91 1.86 0.215 0.335 NA P10909
#> 10 V107 group 3 10.8 3.60 0.829 0.514 0.625 NA P01871
#> # βΉ 211 more rows
#> # βΉ 3 more variables: glycan_composition <comp>, protein_site <int>, gene <chr>Notice something cool? π gly_anova() thoughtfully adds
back all the descriptive columns from your variable tibble. Want to
control this behavior? Just use the add_info parameter!
The raw_result houses two lists of models - one for the
main test, one for post hoc comparisons:
names(get_raw_result(anova_res))
#> [1] "main_test" "post_hoc_test"get_tidy_result() and get_raw_result() are
useful to be used in pipes:
exp |>
gly_anova() |>
get_tidy_result("main_test") |>
filter(p_adj < 0.05)
#> βΉ Number of groups: 4
#> βΉ Groups: "C", "H", "M", and "Y"
#> βΉ Pairwise comparisons will be performed, with levels coming first as reference groups.
#> # A tibble: 57 Γ 13
#> variable term df sumsq meansq statistic p_val p_adj post_hoc protein
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 V10 group 3 86.6 28.9 7.86 9.06e-3 4.08e-2 C_vs_H;β¦ P10909
#> 2 V11 group 3 176. 58.7 20.6 4.02e-4 5.93e-3 C_vs_H;β¦ P04196
#> 3 V111 group 3 225. 75.0 46.9 2.01e-5 4.94e-4 C_vs_H;β¦ P10909
#> 4 V114 group 3 9.68 3.23 8.09 8.31e-3 3.83e-2 C_vs_H;β¦ P01871
#> 5 V12 group 3 81.2 27.1 9.82 4.66e-3 2.72e-2 C_vs_H;β¦ P13671
#> 6 V123 group 3 10.9 3.62 9.72 4.80e-3 2.72e-2 C_vs_H;β¦ P10909
#> 7 V124 group 3 9.14 3.05 11.0 3.25e-3 2.40e-2 C_vs_H;β¦ P02790
#> 8 V128 group 3 309. 103. 146. 2.56e-7 1.62e-5 C_vs_H;β¦ P01860
#> 9 V129 group 3 273. 90.9 11.2 3.14e-3 2.40e-2 H_vs_M;β¦ P01860
#> 10 V13 group 3 158. 52.7 141. 2.92e-7 1.62e-5 C_vs_H;β¦ P04196
#> # βΉ 47 more rows
#> # βΉ 3 more variables: glycan_composition <comp>, protein_site <int>, gene <chr>π§ Maximum Flexibility Mode
Feeling constrained by [glyexp::experiment()]? Fear not! π¦ΈββοΈ Every
gly_xxx() function comes with a flexible
gly_xxx_() sibling that works with standard R objects. For
instance, gly_anova_() happily accepts plain matrices:
expr_mat <- get_expr_mat(exp)
groups <- factor(get_sample_info(exp)$group)
anova_res2 <- gly_anova_(expr_mat, groups)This adaptability makes glystats a perfect team player
π€ - it seamlessly integrates into existing analysis pipelines and
workflows, no matter what data structures youβre already using!
πͺ The Complete Analytical Arsenal
Ready to explore the full power of glystats? Hereβs your
complete toolkit for glycomics and glycoproteomics data analysis:
-
π¬ Differential Expression Analysis:
-
gly_ttest(): Two-sample t-test -
gly_wilcox(): Wilcoxon rank sum test -
gly_anova(): One-way ANOVA -
gly_kruskal(): Kruskal-Wallis rank sum test -
gly_limma(): Linear models for microarray data (limma) -
gly_fold_change(): Calculate fold change
-
-
π Dimensionality Reduction:
-
gly_pca(): Principal component analysis -
gly_tsne(): t-distributed stochastic neighbor embedding (t-SNE) -
gly_umap(): Uniform manifold approximation and projection (UMAP) -
gly_oplsda(): Orthogonal partial least squares discriminant analysis (OPLS-DA) -
gly_plsda(): Partial least squares discriminant analysis (PLS-DA)
-
-
π§© Clustering:
-
gly_kmeans(): K-means clustering -
gly_hclust(): Hierarchical clustering -
gly_consensus_clustering(): Consensus clustering -
gly_wgcna(): Weighted gene co-expression network analysis (WGCNA)
-
-
β±οΈ Survival Analysis:
-
gly_cox(): Cox proportional hazards model
-
-
π― Enrichment Analysis:
-
gly_enrich_go(): Gene Ontology enrichment analysis -
gly_enrich_kegg(): KEGG enrichment analysis -
gly_enrich_reactome(): Reactome enrichment analysis
-
- π§ Additional Tools:
π Whatβs Next on Your Journey?
Ready to dive deeper into the glycoverse? Hereβs your
roadmap to success:
π₯ Data Import: Start with glyread to seamlessly import your data into
glyexp::experiment()objectsπ§Ή Data Preprocessing: Use glyclean to polish and prepare your data for analysis
π Statistical Analysis: Youβre here! Use
glystatsto unlock powerful insights from your glycomics dataπ¨ Visualization: Stay tuned! Weβre crafting an amazing
glyvispackage for stunning data visualizations
Happy analyzing! πβ¨