Perform automatic data preprocessing on glycoproteomics or glycomics data. This function applies a standardized preprocessing pipeline that includes normalization, missing value handling, imputation, aggregation (for glycoproteomics data), and batch effect correction.

Usage

auto_clean(exp)

Arguments

exp

A glyexp::experiment() containing glycoproteomics or glycomics data.

Value

A modified glyexp::experiment() object.

Details

The preprocessing pipeline differs based on the experiment type:

For Glycoproteomics Data:

  1. Median normalization

  2. Remove variables with >50% missing values

  3. Automatic imputation (method depends on sample size)

  4. Automatic aggregation (gfs level if structure available, otherwise gf level)

  5. Final median normalization

  6. Intelligent batch effect correction
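
As a rough illustration of the first two steps, the sketch below applies median normalization and the 50% missing-value filter to a plain numeric matrix in base R. The toy matrix, the variable names, and the exact definition of median normalization (dividing each sample by its own median) are assumptions made for illustration; they are not taken from the package internals.

```r
# Hypothetical toy data: rows are variables, columns are samples.
expr <- matrix(
  c(10, 20, 30, 40,
    NA,  4, NA,  8,
     2, NA, NA, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("v1", "v2", "v3"), c("s1", "s2", "s3", "s4"))
)

# Median normalization (assumed definition): divide each sample by its median,
# so every sample ends up with a median of 1.
sample_medians <- apply(expr, 2, median, na.rm = TRUE)
normalized <- sweep(expr, 2, sample_medians, "/")

# Remove variables with more than 50% missing values.
# A variable with exactly 50% missing values is kept.
missing_prop <- rowMeans(is.na(normalized))
filtered <- normalized[missing_prop <= 0.5, , drop = FALSE]
```

In this toy case `v3` is missing in three of four samples and is dropped, while `v2` sits exactly at the 50% threshold and is retained.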

For Glycomics Data:

  1. Median quotient normalization

  2. Remove variables with >50% missing values

  3. Automatic imputation (method depends on sample size)

  4. Total area normalization

  5. Intelligent batch effect correction
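
The two glycomics-specific normalizations can be sketched in base R as follows. This assumes the usual formulation of median quotient normalization (scale each sample by the median of its ratios to a reference profile) and of total area normalization (scale each sample to sum to 1); the package may differ in details such as the choice of reference. The toy matrix is hypothetical.

```r
# Hypothetical toy data: rows are glycans, columns are samples.
# Samples s1 and s3 are the same profile as s2 at different dilutions.
expr <- matrix(
  c(1, 2,  4,
    2, 4,  8,
    3, 6, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2", "s3"))
)

# Median quotient normalization (assumed formulation):
# 1. Reference profile: per-variable median across samples.
reference <- apply(expr, 1, median, na.rm = TRUE)
# 2. Quotients of each sample to the reference (recycled down the columns).
quotients <- expr / reference
# 3. Per-sample dilution factor: median of that sample's quotients.
dilution <- apply(quotients, 2, median, na.rm = TRUE)
# 4. Divide each sample by its dilution factor.
mqn <- sweep(expr, 2, dilution, "/")

# Total area normalization: scale each sample so its values sum to 1.
total_area <- sweep(mqn, 2, colSums(mqn, na.rm = TRUE), "/")
```

After the quotient step the three samples become identical, which is the intended effect when samples differ only by a dilution factor.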

Automatic Imputation Strategy:

  • <=30 samples: Sample minimum imputation

  • 31-100 samples: Minimum probability imputation

  • >100 samples: MissForest imputation
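
The sample-size rule above can be written as a small dispatch function. The sketch below is illustrative only: `choose_imputation()`, `sketch_sample_min()`, and the returned labels are hypothetical names, not package functions, and only the simplest of the three methods (sample minimum) is implemented.

```r
# Hypothetical helper: map the number of samples to an imputation method label.
choose_imputation <- function(n_samples) {
  stopifnot(is.numeric(n_samples), length(n_samples) == 1, n_samples >= 1)
  if (n_samples <= 30) {
    "sample_min"
  } else if (n_samples <= 100) {
    "min_prob"
  } else {
    "miss_forest"
  }
}

# Hypothetical sketch of sample minimum imputation: replace each missing value
# with the smallest observed value in the same sample (column).
sketch_sample_min <- function(mat) {
  col_min <- apply(mat, 2, min, na.rm = TRUE)
  na_idx <- which(is.na(mat), arr.ind = TRUE)
  mat[na_idx] <- col_min[na_idx[, "col"]]
  mat
}

toy <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 3)
imputed <- sketch_sample_min(toy)
```

The boundaries follow the list above: 30 samples still uses sample minimum imputation, and 100 samples still uses minimum probability imputation.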

Automatic Aggregation Strategy (Glycoproteomics Only):

  • If glycan_structure column exists: Aggregate to "gfs" level

  • If no glycan_structure column: Aggregate to "gf" level
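
A minimal sketch of this decision, followed by a simple aggregation, is shown below. Only the `glycan_structure` column name comes from the description above. The other column names (`protein`, `protein_site`, `glycan_composition`), the grouping key, and the use of summation to combine rows are assumptions for illustration and may not match what the package does.

```r
# Hypothetical variable information and intensities (rows are variables).
var_info <- data.frame(
  protein = c("P1", "P1", "P1"),
  protein_site = c(10, 10, 10),
  glycan_composition = c("H5N2", "H5N2", "H5N4"),
  glycan_structure = c("A", "A", "B"),
  stringsAsFactors = FALSE
)
expr <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3,
               dimnames = list(NULL, c("s1", "s2")))

# Pick the aggregation level based on whether structures are available.
level <- if ("glycan_structure" %in% colnames(var_info)) "gfs" else "gf"

# Build a grouping key (assumed columns); add the structure at the "gfs" level.
key_cols <- c("protein", "protein_site", "glycan_composition")
if (level == "gfs") key_cols <- c(key_cols, "glycan_structure")
key <- do.call(paste, c(var_info[key_cols], sep = "|"))

# Assumed aggregation: sum the intensities of rows that share a key.
aggregated <- rowsum(expr, group = key)
```

Here the first two rows share a key and collapse into one, so three variables become two.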

Intelligent Batch Effect Correction:

The function first detects batch effects using ANOVA. Batch correction is only performed if more than 10% of variables show significant batch effects (p < 0.05). If a group column exists in the sample information, it will be used as a covariate in both detection and correction to preserve biological variation. The batches are defined by the batch column in the sample information tibble.
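
The detection rule can be sketched in base R as below. This is an approximation under stated assumptions: the simulated data, the helper `batch_p_value()`, and the specific model (`lm(y ~ group + batch)` with a sequential ANOVA table) are illustrative choices, since the exact model used internally is not described here. Only the thresholds (p < 0.05, more than 10% of variables) and the `batch` and `group` column names come from the text above.

```r
set.seed(1)

# Hypothetical sample information: two batches, groups balanced within batch.
sample_info <- data.frame(
  batch = factor(rep(c("b1", "b2"), each = 6)),
  group = factor(rep(c("ctrl", "case"), times = 6))
)

# Simulated intensities: 20 variables x 12 samples, with a strong batch shift
# added to the first 5 variables.
n_vars <- 20
expr <- matrix(rnorm(n_vars * nrow(sample_info)), nrow = n_vars)
in_b2 <- sample_info$batch == "b2"
expr[1:5, in_b2] <- expr[1:5, in_b2] + 5

# Hypothetical helper: p-value for the batch term, with group as a covariate.
batch_p_value <- function(y, info) {
  df <- data.frame(y = y, group = info$group, batch = info$batch)
  fit <- lm(y ~ group + batch, data = df)
  anova(fit)["batch", "Pr(>F)"]
}

p_values <- apply(expr, 1, batch_p_value, info = sample_info)

# Correct only if more than 10% of variables show a significant batch effect.
prop_significant <- mean(p_values < 0.05)
needs_correction <- prop_significant > 0.10
```

Placing `group` before `batch` in the formula means the batch term is tested after accounting for group differences, which mirrors the idea of using the group column as a covariate.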

Examples

if (FALSE) { # \dontrun{
# For glycoproteomics data
cleaned_exp <- auto_clean(glycoprot_exp)

# For glycomics data
cleaned_exp <- auto_clean(glycomics_exp)
} # }