
Perform automatic data preprocessing on glycoproteomics or glycomics data. This function applies an intelligent preprocessing pipeline that includes normalization, missing value handling, imputation, aggregation (for glycoproteomics data), and batch effect correction.

For glycomics data, this function calls these functions in sequence:

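For glycoproteomics data, this function runs these steps in sequence, as the log in the Examples section shows: normalization, removal of variables with too many missing values, imputation, aggregation, a second normalization, and batch effect correction.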

Usage

auto_clean(
  exp,
  group_col = "group",
  batch_col = "batch",
  qc_name = "QC",
  normalize_to_try = NULL,
  impute_to_try = NULL,
  remove_preset = "discovery",
  batch_prop_threshold = 0.3,
  check_batch_confounding = TRUE,
  batch_confounding_threshold = 0.4
)

Arguments

exp

A glyexp::experiment() containing glycoproteomics or glycomics data.

group_col

The column name in sample_info for groups. Default is "group". Can be NULL when no group information is available.

batch_col

The column name in sample_info for batches. Default is "batch". Can be NULL when no batch information is available.

qc_name

The name of QC samples in the group_col column. Default is "QC". Only used when group_col is not NULL.

normalize_to_try

Normalization functions to try. A list. Default includes:

impute_to_try

Imputation functions to try. A list. Default includes:

remove_preset

The preset for removing variables. Default is "discovery". Available presets:

  • "simple": removes variables with more than 50% missing values.

  • "discovery": more lenient. Removes variables with more than 80% missing values, while requiring less than 50% missing values in at least one group.

  • "biomarker": stricter. Removes variables with more than 40% missing values, and requires less than 60% missing values in every group.
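
The three presets can be read as simple rules on missing-value proportions. The sketch below is one interpretation in base R, not the package's implementation: `keep_by_preset()` is a hypothetical helper, and the exact handling of boundary values (for example, exactly 50% missing) is an assumption.

```r
# Hypothetical helper, not part of the package.
# `mat`: variables x samples numeric matrix, NA marks a missing value.
# `groups`: group label of each sample (one per column of `mat`).
keep_by_preset <- function(mat, groups,
                           preset = c("discovery", "simple", "biomarker")) {
  preset <- match.arg(preset)
  overall <- rowMeans(is.na(mat))  # missing proportion per variable
  # Missing proportion per variable within each group (variables x groups)
  by_group <- sapply(split(seq_len(ncol(mat)), groups), function(idx) {
    rowMeans(is.na(mat[, idx, drop = FALSE]))
  })
  by_group <- matrix(by_group, nrow = nrow(mat))
  switch(preset,
    simple    = overall <= 0.5,
    discovery = overall <= 0.8 & apply(by_group < 0.5, 1, any),
    biomarker = overall <= 0.4 & apply(by_group < 0.6, 1, all)
  )
}

mat <- rbind(
  v1 = c(1, 2, 3, 4, 5, 6),        # nothing missing
  v2 = c(NA, NA, NA, 4, 5, 6),     # 50% missing, complete in group B
  v3 = c(NA, NA, 3, NA, NA, 6),    # 67% missing in both groups
  v4 = c(NA, NA, NA, NA, NA, 6)    # 83% missing overall
)
groups <- c("A", "A", "A", "B", "B", "B")

keep_simple    <- keep_by_preset(mat, groups, "simple")
keep_discovery <- keep_by_preset(mat, groups, "discovery")
keep_biomarker <- keep_by_preset(mat, groups, "biomarker")
```

In this toy data, `v2` survives "simple" and "discovery" because it is complete in group B, but fails "biomarker" because it exceeds the 40% overall limit.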

batch_prop_threshold

The minimum proportion of variables that must show significant batch effects for batch correction to be performed. Default is 0.3 (30%).
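
To make the threshold concrete, here is a minimal base R sketch of the decision it describes. The statistical test (a one-way ANOVA per variable), the 0.05 significance level, and the strict `>` comparison are all assumptions for illustration; the page does not state how the package detects batch effects. `batch_p_value()` is a hypothetical helper.

```r
# Hypothetical helper: p-value of a one-way ANOVA of one variable against batch.
batch_p_value <- function(x, batch) {
  summary(aov(x ~ batch))[[1]][["Pr(>F)"]][1]
}

batch <- factor(c("b1", "b1", "b1", "b2", "b2", "b2"))
mat <- rbind(
  v1 = c(1.0, 1.1, 0.9, 5.0, 5.1, 4.9),  # strong shift between batches
  v2 = c(2.0, 2.2, 1.8, 8.0, 8.2, 7.8),  # strong shift between batches
  v3 = c(1, 2, 3, 1, 2, 3),              # identical batch means
  v4 = c(3, 1, 2, 2, 3, 1)               # identical batch means
)

p_values <- apply(mat, 1, batch_p_value, batch = batch)
prop_affected <- mean(p_values < 0.05)    # alpha = 0.05 is an assumption
needs_correction <- prop_affected > 0.3   # 0.3 is the batch_prop_threshold default
```

Here two of the four variables differ clearly between batches, so the affected proportion is 0.5 and the sketch would go on to correct batch effects.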

check_batch_confounding

Whether to check for confounding between batch and group variables. Defaults to TRUE.

batch_confounding_threshold

The threshold on Cramér's V above which batch and group variables are considered highly confounded. Only used when check_batch_confounding is TRUE. Defaults to 0.4.
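
For reference, Cramér's V for two categorical variables is sqrt(chi2 / (n * (k - 1))), where chi2 is the Pearson chi-squared statistic, n is the number of samples, and k is the smaller of the two level counts. The base R sketch below computes it with a hypothetical helper, `cramers_v()`; how the package computes the statistic internally is not stated on this page.

```r
# Hypothetical helper: Cramér's V between two categorical vectors.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  k <- min(nrow(tab), ncol(tab))
  if (k < 2) return(NA_real_)  # undefined when either variable has one level
  # Pearson chi-squared without continuity correction; small-count warnings
  # are suppressed because only the statistic is needed here.
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (k - 1))))
}

# Each group measured in its own batch: fully confounded
group_a <- c("A", "A", "A", "B", "B", "B")
batch_a <- c("1", "1", "1", "2", "2", "2")
v_confounded <- cramers_v(group_a, batch_a)

# Each group split evenly across batches: no association
group_b <- c("A", "A", "A", "A", "B", "B", "B", "B")
batch_b <- c("1", "1", "2", "2", "1", "1", "2", "2")
v_balanced <- cramers_v(group_b, batch_b)

highly_confounded <- v_confounded > 0.4  # 0.4 is the documented default
```

The first design gives a V of 1 and the second a V of 0, so only the first would exceed the default threshold of 0.4.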

Value

A modified glyexp::experiment() object.

Examples

library(glyexp)
exp <- real_experiment
auto_clean(exp)
#> 
#> ── Normalizing data ──
#> 
#> No QC samples found. Using default normalization method based on experiment
#> type.
#> Experiment type is "glycoproteomics". Using `normalize_median()`.
#> 
#> ── Removing variables with too many missing values ──
#> 
#> No QC samples found. Using all samples.
#> Applying preset "discovery"...
#> Total removed: 24 (0.56%) variables.
#> 
#> ── Imputing missing values ──
#> 
#> No QC samples found. Using default imputation method based on sample size.
#> Sample size <= 30, using `impute_sample_min()`.
#> 
#> ── Aggregating data ──
#> 
#> Aggregating to "gfs" level
#> 
#> ── Normalizing data again ──
#> 
#> No QC samples found. Using default normalization method based on experiment
#> type.
#> Experiment type is "glycoproteomics". Using `normalize_median()`.
#> 
#> ── Correcting batch effects ──
#> 
#>  Batch column  not found in sample_info. Skipping batch correction.
#> 
#> ── Glycoproteomics Experiment ──────────────────────────────────────────────────
#>  Expression matrix: 12 samples, 3979 variables
#>  Sample information fields: group <fct>
#>  Variable information fields: protein <chr>, glycan_composition <glyrpr_c>, glycan_structure <glyrpr_s>, protein_site <int>, gene <chr>