
Automatic Data Preprocessing
Perform automatic data preprocessing on glycoproteomics or glycomics data. This function applies an intelligent preprocessing pipeline that includes normalization, missing value handling, imputation, aggregation (for glycoproteomics data), and batch effect correction.
For glycomics data, this function calls these functions in sequence:
For glycoproteomics data, this function calls these functions in sequence:
Usage
auto_clean(
exp,
group_col = "group",
batch_col = "batch",
qc_name = "QC",
normalize_to_try = NULL,
impute_to_try = NULL,
remove_preset = "discovery",
batch_prop_threshold = 0.3,
check_batch_confounding = TRUE,
batch_confounding_threshold = 0.4
)
Arguments
- exp
A glyexp::experiment() containing glycoproteomics or glycomics data.
- group_col
The column name in sample_info for groups. Default is "group". Can be NULL when no group information is available.
- batch_col
The column name in sample_info for batches. Default is "batch". Can be NULL when no batch information is available.
- qc_name
The name of QC samples in the group_col column. Default is "QC". Only used when group_col is not NULL.
- normalize_to_try
Normalization functions to try. A list. Default includes:
normalize_median(): median normalization
normalize_median_abs(): absolute median normalization
normalize_total_area(): total area normalization
normalize_quantile(): quantile normalization
normalize_loessf(): LoessF normalization
normalize_loesscyc(): LoessCyc normalization
normalize_rlr(): RLR normalization
normalize_rlrma(): RLRMA normalization
normalize_rlrmacyc(): RLRMAcyc normalization
- impute_to_try
Imputation functions to try. A list. Default includes:
impute_zero(): zero imputation
impute_sample_min(): sample-wise minimum imputation
impute_half_sample_min(): half sample-wise minimum imputation
impute_sw_knn(): sample-wise KNN imputation
impute_fw_knn(): feature-wise KNN imputation
impute_bpca(): BPCA imputation
impute_ppca(): PPCA imputation
impute_svd(): SVD imputation
impute_min_prob(): minimum probability imputation
impute_miss_forest(): MissForest imputation
- remove_preset
The preset for removing variables. Default is "discovery". Available presets:
"simple": remove variables with more than 50% missing values.
"discovery": more lenient. Removes variables with more than 80% missing values, while requiring less than 50% missing values in at least one group.
"biomarker": stricter. Removes variables with more than 40% missing values, and requires less than 60% missing values in every group.
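The thresholds of the three presets can be illustrated with a small base R sketch. `keep_by_preset()` is a hypothetical helper written for this illustration and is not part of the package. It assumes a matrix with variables in rows and samples in columns, reads each preset as "overall missing proportion plus a per-group condition", and treats values exactly at a threshold as passing; the actual implementation may differ on any of these points.

```r
# Hypothetical helper illustrating the documented thresholds; not the package's code.
# `mat`: numeric matrix, variables in rows, samples in columns, NA = missing.
# `groups`: vector giving the group of each sample (column).
keep_by_preset <- function(mat, groups,
                           preset = c("simple", "discovery", "biomarker")) {
  preset <- match.arg(preset)
  miss <- is.na(mat)
  overall <- rowMeans(miss)  # overall missing proportion per variable
  # Missing proportion per variable within each group (variables x groups).
  by_group <- sapply(split(seq_len(ncol(mat)), groups), function(idx) {
    rowMeans(miss[, idx, drop = FALSE])
  })
  by_group <- matrix(by_group, nrow = nrow(mat))
  switch(preset,
    simple = overall <= 0.5,
    discovery = overall <= 0.8 & apply(by_group, 1, function(p) any(p < 0.5)),
    biomarker = overall <= 0.4 & apply(by_group, 1, function(p) all(p < 0.6))
  )
}

mat <- rbind(
  v1 = c(1, 2, 3, 4, 5, 6),      # no missing values
  v2 = c(NA, NA, NA, 4, 5, 6),   # 50% missing overall, all of it in group A
  v3 = c(NA, NA, 3, NA, NA, 6),  # 67% missing overall and in each group
  v4 = c(NA, NA, NA, NA, NA, 6)  # 83% missing overall
)
groups <- c("A", "A", "A", "B", "B", "B")

keep_simple <- keep_by_preset(mat, groups, "simple")
keep_discovery <- keep_by_preset(mat, groups, "discovery")
keep_biomarker <- keep_by_preset(mat, groups, "biomarker")
```

In this toy data, v2 survives "simple" and "discovery" because group B is complete, but fails "biomarker" because its overall missing proportion exceeds 40%.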
- batch_prop_threshold
The proportion of variables that must show significant batch effects to perform batch correction. Default is 0.3 (30%).
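As a rough sketch of how such a threshold might be applied: given one p-value per variable from some test of batch effect, correction is triggered only when the share of significant variables exceeds the threshold. The helper `should_correct_batch()`, the 0.05 significance cutoff, and the strict "greater than" comparison are assumptions made for illustration; this page does not state which test or cutoff the function uses.

```r
# Hypothetical decision rule; the test producing the p-values and the 0.05
# cutoff are assumptions, not taken from the documentation.
should_correct_batch <- function(p_values, prop_threshold = 0.3, alpha = 0.05) {
  prop_significant <- mean(p_values < alpha, na.rm = TRUE)
  prop_significant > prop_threshold
}

# 2 of 10 variables significant: below the default 30% threshold.
p_few <- c(0.001, 0.02, 0.14, 0.20, 0.35, 0.50, 0.61, 0.70, 0.88, 0.95)
# 5 of 10 variables significant: above the default 30% threshold.
p_many <- c(0.001, 0.002, 0.01, 0.02, 0.04, 0.50, 0.61, 0.70, 0.88, 0.95)

correct_few <- should_correct_batch(p_few)
correct_many <- should_correct_batch(p_many)
```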
- check_batch_confounding
Whether to check for confounding between batch and group variables. Defaults to TRUE.
- batch_confounding_threshold
The threshold for Cramer's V above which batch and group variables are considered highly confounded. Only used when check_batch_confounding is TRUE. Defaults to 0.4.
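For readers unfamiliar with Cramer's V, the following base R sketch computes it from the chi-squared statistic of the batch-by-group contingency table. `cramers_v()` is a hypothetical helper for illustration, not a function of the package, and it assumes both variables have at least two levels. How auto_clean() computes the statistic internally is not stated on this page.

```r
# Cramer's V for two categorical vectors: sqrt(chi2 / (n * (min(r, c) - 1))).
# Hypothetical helper; assumes both x and y have at least two levels.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  # No continuity correction, so the statistic matches the textbook formula.
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

group <- rep(c("case", "control"), each = 6)
batch_balanced <- rep(c("b1", "b2"), times = 6)   # both groups split evenly across batches
batch_confounded <- rep(c("b1", "b2"), each = 6)  # batch coincides with group

v_balanced <- cramers_v(group, batch_balanced)      # no association
v_confounded <- cramers_v(group, batch_confounded)  # complete association
```

With the default batch_confounding_threshold of 0.4, the balanced design falls below the threshold and the fully confounded design falls above it.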
Value
A modified glyexp::experiment() object.
Examples
library(glyexp)
exp <- real_experiment
auto_clean(exp)
#>
#> ── Normalizing data ──
#>
#> No QC samples found. Using default normalization method based on experiment
#> type.
#> Experiment type is "glycoproteomics". Using `normalize_median()`.
#>
#> ── Removing variables with too many missing values ──
#>
#> No QC samples found. Using all samples.
#> Applying preset "discovery"...
#> Total removed: 24 (0.56%) variables.
#>
#> ── Imputing missing values ──
#>
#> No QC samples found. Using default imputation method based on sample size.
#> Sample size <= 30, using `impute_sample_min()`.
#>
#> ── Aggregating data ──
#>
#> Aggregating to "gfs" level
#>
#> ── Normalizing data again ──
#>
#> No QC samples found. Using default normalization method based on experiment
#> type.
#> Experiment type is "glycoproteomics". Using `normalize_median()`.
#>
#> ── Correcting batch effects ──
#>
#> ℹ Batch column not found in sample_info. Skipping batch correction.
#>
#> ── Glycoproteomics Experiment ──────────────────────────────────────────────────
#> ℹ Expression matrix: 12 samples, 3979 variables
#> ℹ Sample information fields: group <fct>
#> ℹ Variable information fields: protein <chr>, glycan_composition <glyrpr_c>, glycan_structure <glyrpr_s>, protein_site <int>, gene <chr>
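A further call with non-default arguments might look like the sketch below. It uses only arguments documented above: batch_col = NULL because the example data has no batch column, and the stricter "biomarker" preset. The guard is an addition for this sketch so the code is skipped when glyexp is not installed or auto_clean() is not attached; no output is shown because it depends on the installed package versions.

```r
# Run only when the example data and auto_clean() are available in the session.
cleaned <- NULL
if (requireNamespace("glyexp", quietly = TRUE) &&
    exists("auto_clean", mode = "function")) {
  cleaned <- auto_clean(
    glyexp::real_experiment,
    group_col = "group",
    batch_col = NULL,            # no batch information available
    remove_preset = "biomarker"  # stricter missing-value filter
  )
}
```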