
Automatic Data Preprocessing
auto_clean.Rd
Perform automatic data preprocessing on glycoproteomics or glycomics data. This function applies a standardized preprocessing pipeline that includes normalization, missing value handling, imputation, aggregation (for glycoproteomics data), and batch effect correction.
Arguments
- exp
A glyexp::experiment() object containing glycoproteomics or glycomics data.
Value
A modified glyexp::experiment() object.
Details
The preprocessing pipeline differs based on the experiment type:
For Glycoproteomics Data:
1. Median normalization
2. Remove variables with >50% missing values
3. Automatic imputation (method depends on sample size)
4. Automatic aggregation (gfs level if structure available, otherwise gf level)
5. Final median normalization
6. Intelligent batch effect correction
For Glycomics Data:
1. Median quotient normalization
2. Remove variables with >50% missing values
3. Automatic imputation (method depends on sample size)
4. Total area normalization
5. Intelligent batch effect correction
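
The normalization and filtering steps above can be illustrated with a minimal base R sketch. It is not the package's implementation: it assumes a plain numeric matrix with variables in rows and samples in columns, uses one common variant of median normalization (dividing each sample by its own median), and the helper names median_normalize(), filter_missing(), and total_area_normalize() are hypothetical. Median quotient normalization is not shown.

```r
# Toy expression matrix: rows are variables, columns are samples (assumed layout).
expr <- matrix(
  c(10,  20, NA,
    NA,  NA,  6,
    30,  60,  9,
    50, 100, 15),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("V", 1:4), paste0("S", 1:3))
)

# Median normalization (one common variant): divide each sample by its median.
median_normalize <- function(x) {
  sweep(x, 2, apply(x, 2, median, na.rm = TRUE), "/")
}

# Drop variables whose proportion of missing values exceeds `max_missing`.
filter_missing <- function(x, max_missing = 0.5) {
  x[rowMeans(is.na(x)) <= max_missing, , drop = FALSE]
}

# Total area normalization: scale each sample so its observed values sum to 1.
total_area_normalize <- function(x) {
  sweep(x, 2, colSums(x, na.rm = TRUE), "/")
}

normalized <- median_normalize(expr)
filtered <- filter_missing(normalized)  # V2 is missing in 2 of 3 samples and is dropped
```

In this toy matrix only V2 exceeds the 50% threshold, so three variables remain after filtering.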
Automatic Imputation Strategy:
<=30 samples: Sample minimum imputation
31-100 samples: Minimum probability imputation
>100 samples: MissForest imputation
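
The sample-size rule above can be sketched as a small selector, together with a simple version of sample minimum imputation. This is an illustration under assumptions, not the package's code: the function names and the returned labels ("sample_min", "min_prob", "miss_forest") are hypothetical, and samples are assumed to be the columns of a numeric matrix.

```r
# Pick an imputation method from the number of samples (thresholds from the text).
# The returned labels are hypothetical names used only in this sketch.
choose_imputation <- function(n_samples) {
  if (n_samples <= 30) {
    "sample_min"
  } else if (n_samples <= 100) {
    "min_prob"
  } else {
    "miss_forest"
  }
}

# Sample minimum imputation: replace each missing value with the smallest
# observed value in the same sample (column).
impute_sample_min <- function(x) {
  for (j in seq_len(ncol(x))) {
    missing <- is.na(x[, j])
    x[missing, j] <- min(x[, j], na.rm = TRUE)
  }
  x
}

x <- matrix(c(5, NA, 3, 8, 2, NA), nrow = 3)  # 3 variables, 2 samples
method <- choose_imputation(ncol(x))
imputed <- impute_sample_min(x)
```

A sample with no observed values at all would need separate handling, which this sketch omits.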
Automatic Aggregation Strategy (Glycoproteomics Only):
If a glycan_structure column exists: aggregate to the "gfs" level.
If there is no glycan_structure column: aggregate to the "gf" level.
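
The rule above amounts to a column check. A minimal sketch follows, assuming the glycan_structure column lives in a variable information table represented here as an ordinary data frame; choose_aggregation_level() is a hypothetical helper, not a package function.

```r
# Choose the aggregation level from the columns of a variable information table.
choose_aggregation_level <- function(var_info) {
  if ("glycan_structure" %in% colnames(var_info)) "gfs" else "gf"
}

# Toy tables; the column names other than glycan_structure are illustrative.
with_structure <- data.frame(variable = "V1", glycan_structure = "placeholder")
without_structure <- data.frame(variable = "V1")

level_a <- choose_aggregation_level(with_structure)
level_b <- choose_aggregation_level(without_structure)
```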
Intelligent Batch Effect Correction:
The function first detects batch effects using ANOVA. Batch correction is performed only if more than 10% of variables show significant batch effects (p < 0.05).
If a group column exists in the sample information, it is used as a covariate in both detection and correction to preserve biological variation.
Batches are defined by the batch column in the sample information tibble.
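
The detection step can be approximated in base R by fitting a per-variable linear model and reading the ANOVA p-value for the batch term. This is a sketch under assumptions rather than the package's implementation: detect_batch_fraction() is a hypothetical helper, the input is assumed to be a complete numeric matrix with variables in rows, and no multiple-testing adjustment is applied. The correction itself is not shown.

```r
# Fraction of variables with a significant batch term (p < alpha).
# If `group` is supplied, it enters the model first, so the batch term is
# tested after accounting for group.
detect_batch_fraction <- function(x, batch, group = NULL, alpha = 0.05) {
  p_values <- apply(x, 1, function(values) {
    df <- data.frame(value = values, batch = factor(batch))
    if (is.null(group)) {
      fit <- lm(value ~ batch, data = df)
    } else {
      df$group <- factor(group)
      fit <- lm(value ~ group + batch, data = df)
    }
    anova(fit)["batch", "Pr(>F)"]
  })
  mean(p_values < alpha)
}

batch <- rep(c("A", "B"), each = 4)
group <- rep(c("x", "y"), times = 4)
x <- rbind(
  shifted = c(1, 2, 1, 3, 11, 12, 11, 13),  # large shift between batches
  stable  = c(1, 2, 3, 5, 1, 2, 3, 5)       # identical in both batches
)

fraction <- detect_batch_fraction(x, batch, group)
needs_correction <- fraction > 0.10  # 10% rule from the text
```

Here one of the two toy variables shows a batch effect, so the fraction is above the 10% threshold and correction would be triggered.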