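Perform automatic data preprocessing on glycoproteomics or glycomics data. This function applies a preprocessing pipeline that selects methods automatically based on the data (for example, whether QC samples are present, the experiment type, and the sample size). The pipeline includes normalization, removal of variables with too many missing values, imputation, aggregation (for glycoproteomics data), and batch effect correction.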

For glycomics data, this function calls these functions in sequence:

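For glycoproteomics data, this function runs these steps in sequence: normalization, removal of variables with too many missing values, imputation, aggregation, a second normalization, and batch effect correction.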

Usage

auto_clean(
  exp,
  group_col = "group",
  batch_col = "batch",
  qc_name = "QC",
  normalize_to_try = NULL,
  impute_to_try = NULL,
  remove_preset = "discovery",
  batch_prop_threshold = 0.3,
  check_batch_confounding = TRUE,
  batch_confounding_threshold = 0.4,
  standardize_variable = TRUE
)

Arguments

exp

A glyexp::experiment() containing glycoproteomics or glycomics data.

group_col

The column name in sample_info for groups. Default is "group". Can be NULL when no group information is available.

batch_col

The column name in sample_info for batches. Default is "batch". Can be NULL when no batch information is available.

qc_name

The name of QC samples in the group_col column. Default is "QC". Only used when group_col is not NULL. Can be NULL when no QC samples are available.

normalize_to_try

Normalization functions to try. A list. Default includes:

impute_to_try

Imputation functions to try. A list. Default includes:

remove_preset

The preset for removing variables. Default is "discovery". Available presets:

  • "simple": remove variables with more than 50% missing values.

  • "discovery": more lenient; remove variables with more than 80% missing values, but require less than 50% missing values in at least one group.

  • "biomarker": stricter; remove variables with more than 40% missing values, and require less than 60% missing values in every group.
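
The three presets can be read as simple rules on per-variable missing-value proportions. The base R sketch below illustrates one possible reading of them; `keep_variables()` is a hypothetical helper written for this illustration and is not part of the package, and the exact boundary handling (for example, whether exactly 50% missing counts as "more than 50%") is an assumption.

```r
# Hypothetical helper, NOT the package's implementation: one reading of the
# three presets on a variables-by-samples matrix with NA as missing values.
keep_variables <- function(mat, groups,
                           preset = c("simple", "discovery", "biomarker")) {
  preset <- match.arg(preset)
  groups <- as.factor(groups)
  # Overall proportion of missing values per variable (row).
  overall <- rowMeans(is.na(mat))
  # Proportion of missing values per variable within each group,
  # arranged as a variables-by-groups matrix.
  by_group <- sapply(levels(groups), function(g) {
    rowMeans(is.na(mat[, groups == g, drop = FALSE]))
  })
  by_group <- matrix(by_group, nrow = nrow(mat))
  switch(preset,
    simple    = overall <= 0.5,
    discovery = overall <= 0.8 & apply(by_group < 0.5, 1, any),
    biomarker = overall <= 0.4 & apply(by_group < 0.6, 1, all)
  )
}

# Toy data: 5 variables, 6 samples in two groups of 3.
mat <- rbind(
  v1 = c(1, 2, 3, 4, 5, 6),        # no missing values
  v2 = c(NA, NA, NA, 4, 5, 6),     # 50% overall; complete in group B
  v3 = c(NA, NA, 3, NA, NA, 6),    # 67% overall; 67% in each group
  v4 = c(NA, NA, NA, NA, NA, 6),   # 83% overall
  v5 = c(NA, NA, NA, NA, 5, 6)     # 67% overall; 33% in group B
)
groups <- c("A", "A", "A", "B", "B", "B")

kept_simple    <- keep_variables(mat, groups, "simple")
kept_discovery <- keep_variables(mat, groups, "discovery")
kept_biomarker <- keep_variables(mat, groups, "biomarker")
```

In this toy example, `v5` survives only under "discovery", because it is mostly missing overall but well observed in one group.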

batch_prop_threshold

The proportion of variables that must show significant batch effects to perform batch correction. Default is 0.3 (30%).
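
As a rough illustration of how such a threshold could drive the decision, the base R sketch below tests each variable for a batch effect and compares the proportion of significant variables with the threshold. The helper `should_correct_batch()`, the use of one-way ANOVA, the 0.05 significance level, and the `>=` comparison are all assumptions made for this sketch; the source does not state which test or comparison the package uses.

```r
# Hypothetical helper, NOT the package's implementation.
should_correct_batch <- function(mat, batch,
                                 prop_threshold = 0.3, alpha = 0.05) {
  batch <- as.factor(batch)
  # One-way ANOVA p-value for the batch term, per variable (row).
  p_values <- apply(mat, 1, function(x) {
    summary(aov(x ~ batch))[[1]][["Pr(>F)"]][1]
  })
  prop_significant <- mean(p_values < alpha)
  list(
    prop_significant = prop_significant,
    correct = prop_significant >= prop_threshold
  )
}

# Toy data: two variables shift strongly between batches, two do not.
batch <- rep(c("b1", "b2"), each = 4)
mat <- rbind(
  shifted1 = c(1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0),
  shifted2 = c(5.0, 5.2, 4.8, 5.0, 8.0, 8.1, 7.9, 8.0),
  flat1    = c(1, 2, 3, 4, 1, 2, 3, 4),
  flat2    = c(2, 4, 1, 3, 3, 1, 4, 2)
)

res <- should_correct_batch(mat, batch)  # default threshold of 0.3
res_strict <- should_correct_batch(mat, batch, prop_threshold = 0.6)
```

Here half of the variables show a batch effect, so correction would be triggered at the default threshold of 0.3 but not at 0.6.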

check_batch_confounding

Whether to check for confounding between batch and group variables. Default is TRUE.

batch_confounding_threshold

The threshold for Cramér's V above which batch and group variables are considered highly confounded. Only used when check_batch_confounding is TRUE. Default is 0.4.
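
For readers unfamiliar with Cramér's V, the base R sketch below computes it from a batch-by-group contingency table using the standard formula V = sqrt(chi-squared / (n * (k - 1))), where k is the smaller of the number of rows and columns. `cramers_v()` is a hypothetical helper for illustration only; whether the package applies a bias correction or a strict `>` comparison against the threshold is not stated in the source and is assumed here.

```r
# Hypothetical helper: uncorrected Cramér's V between two categorical vectors.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  # Pearson chi-squared statistic without continuity correction;
  # warnings about small expected counts are suppressed for the toy data.
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab))
  as.numeric(sqrt(chi2 / (n * (k - 1))))
}

group <- c("A", "A", "B", "B", "A", "A", "B", "B")

# Balanced design: every group appears equally in every batch.
batch_balanced <- c(1, 1, 1, 1, 2, 2, 2, 2)
v_balanced <- cramers_v(group, batch_balanced)

# Fully confounded design: each batch contains exactly one group.
batch_confounded <- c(1, 1, 2, 2, 1, 1, 2, 2)
v_confounded <- cramers_v(group, batch_confounded)

# Assumed decision rule for this sketch, using the default threshold.
highly_confounded <- v_confounded > 0.4
```

A balanced design gives V = 0 and a fully confounded design gives V = 1, so the default threshold of 0.4 sits between these extremes.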

standardize_variable

Whether to call glyexp::standardize_variable() after aggregation. Set to FALSE to skip network calls for faster testing. Default is TRUE.

Value

A modified glyexp::experiment() object.

Examples

library(glyexp)
exp <- real_experiment
auto_clean(exp)
#> 
#> ── Normalizing data ──
#> 
#>  No QC samples found. Using default normalization method based on experiment type.
#>  Experiment type is "glycoproteomics". Using `normalize_median()`.
#>  Normalization completed.
#> 
#> ── Removing variables with too many missing values ──
#> 
#>  No QC samples found. Using all samples.
#>  Applying preset "discovery"...
#>  Total removed: 24 (0.56%) variables.
#>  Variable removal completed.
#> 
#> ── Imputing missing values ──
#> 
#>  No QC samples found. Using default imputation method based on sample size.
#>  Sample size <= 30, using `impute_sample_min()`.
#>  Imputation completed.
#> 
#> ── Aggregating data ──
#> 
#>  Aggregating to "gfs" level
#>  Aggregation completed.
#> 
#> ── Normalizing data again ──
#> 
#>  No QC samples found. Using default normalization method based on experiment type.
#>  Experiment type is "glycoproteomics". Using `normalize_median()`.
#>  Normalization completed.
#> 
#> ── Correcting batch effects ──
#> 
#>  Batch column  not found in sample_info. Skipping batch correction.
#>  Batch correction completed.
#> 
#> ── Glycoproteomics Experiment ──────────────────────────────────────────────────
#>  Expression matrix: 12 samples, 3979 variables
#>  Sample information fields: group <fct>
#>  Variable information fields: protein <chr>, glycan_composition <comp>, glycan_structure <struct>, protein_site <int>, gene <chr>