Automatic Data Preprocessing

Perform automatic data preprocessing on glycoproteomics or glycomics data. This function applies an intelligent preprocessing pipeline that includes normalization, missing value handling, imputation, aggregation (for glycoproteomics data), and batch effect correction.

For glycomics data, this function calls these functions in sequence:

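For glycoproteomics data, this function runs these steps in sequence, as reflected in the example output below: normalization, removal of variables with too many missing values, imputation, aggregation, a second normalization, and batch effect correction.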

Usage

auto_clean(
  exp,
  group_col = "group",
  batch_col = "batch",
  qc_name = "QC",
  normalize_to_try = NULL,
  impute_to_try = NULL,
  remove_preset = "discovery",
  batch_prop_threshold = 0.3,
  check_batch_confounding = TRUE,
  batch_confounding_threshold = 0.4
)

Arguments

exp

A glyexp::experiment() containing glycoproteomics or glycomics data.

group_col

The column name in sample_info for groups. Default is "group". Can be NULL when no group information is available.

batch_col

The column name in sample_info for batches. Default is "batch". Can be NULL when no batch information is available.

qc_name

The name of QC samples in the group_col column. Default is "QC". Only used when group_col is not NULL.

normalize_to_try

Normalization functions to try. A list. Default includes:

normalize_median(): median normalization
normalize_median_abs(): absolute median normalization
normalize_total_area(): total area normalization
normalize_quantile(): quantile normalization
normalize_loessf(): LoessF normalization
normalize_loesscyc(): LoessCyc normalization
normalize_rlr(): RLR normalization
normalize_rlrma(): RLRMA normalization
normalize_rlrmacyc(): RLRMAcyc normalization
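
As a rough illustration of the simplest option in this list, median normalization can be sketched in base R. This is not the package's implementation: `median_normalize_sketch()` is a hypothetical helper, and both the matrix layout (variables in rows, samples in columns) and the choice to divide each sample by its own median are assumptions made for the example.

```r
# Hypothetical helper, not part of glyclean: divide each sample (column)
# by its own median so that every sample ends up with a median of 1.
median_normalize_sketch <- function(mat) {
  sample_medians <- apply(mat, 2, median, na.rm = TRUE)
  sweep(mat, 2, sample_medians, "/")
}

# Toy matrix with made-up values: 3 variables (rows) x 2 samples (columns).
toy <- matrix(
  c(1, 2, 3, 10, 20, NA),
  nrow = 3,
  dimnames = list(paste0("V", 1:3), c("S1", "S2"))
)
normalized <- median_normalize_sketch(toy)
```

After this step the two toy samples share the same median, while missing values stay missing.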

impute_to_try

Imputation functions to try. A list. Default includes:

impute_zero(): zero imputation
impute_sample_min(): sample-wise minimum imputation
impute_half_sample_min(): half sample-wise minimum imputation
impute_sw_knn(): sample-wise KNN imputation
impute_fw_knn(): feature-wise KNN imputation
impute_bpca(): BPCA imputation
impute_ppca(): PPCA imputation
impute_svd(): SVD imputation
impute_min_prob(): minimum probability imputation
impute_miss_forest(): MissForest imputation
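
To make one of these strategies concrete, sample-wise minimum imputation can be sketched in base R. `sample_min_impute_sketch()` is a hypothetical helper written for this illustration, not the package's code, and it assumes samples are stored in columns.

```r
# Hypothetical helper, not part of glyclean: replace each missing value
# with the smallest observed value in the same sample (column).
sample_min_impute_sketch <- function(mat) {
  for (j in seq_len(ncol(mat))) {
    missing <- is.na(mat[, j])
    # Skip columns with nothing to impute or nothing observed.
    if (any(missing) && !all(missing)) {
      mat[missing, j] <- min(mat[, j], na.rm = TRUE)
    }
  }
  mat
}

# Toy matrix with made-up values: 3 variables (rows) x 2 samples (columns).
toy <- matrix(c(5, NA, 3, NA, 8, 4), nrow = 3)
imputed <- sample_min_impute_sketch(toy)
```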

remove_preset

The preset for removing variables. Default is "discovery". Available presets:

"simple": removes variables with more than 50% missing values.
"discovery": more lenient. Removes variables with more than 80% missing values, and requires each retained variable to have less than 50% missing values in at least one group.
"biomarker": stricter. Removes variables with more than 40% missing values, and requires each retained variable to have less than 60% missing values in every group.
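
The difference between the presets is easier to see on a toy matrix. The base R sketch below implements one reading of the "simple" and "discovery" rules. The helpers `keep_simple()` and `keep_discovery()` are hypothetical, the matrix layout (variables in rows, samples in columns) is an assumption, and the package may handle boundary cases differently.

```r
# Hypothetical helper: proportion of missing values per variable (row).
missing_prop <- function(mat) rowMeans(is.na(mat))

# "simple"-style rule: keep variables with at most 50% missing values.
keep_simple <- function(mat) missing_prop(mat) <= 0.5

# "discovery"-style rule (one reading of the description): keep variables
# with at most 80% missing values overall AND less than 50% missing
# values in at least one group.
keep_discovery <- function(mat, groups) {
  overall_ok <- missing_prop(mat) <= 0.8
  per_group <- sapply(unique(groups), function(g) {
    rowMeans(is.na(mat[, groups == g, drop = FALSE]))
  })
  # Force a variables x groups matrix, even for a single variable.
  per_group <- matrix(per_group, nrow = nrow(mat))
  overall_ok & apply(per_group < 0.5, 1, any)
}

# Toy data with made-up values: 3 variables x 6 samples in two groups.
groups <- rep(c("A", "B"), each = 3)
toy <- rbind(
  V1 = c(1, 2, 3, 4, 5, 6),       # complete
  V2 = c(NA, NA, NA, NA, 5, 6),   # mostly missing, well observed in B
  V3 = c(NA, NA, 1, NA, NA, 2)    # mostly missing in both groups
)
simple_keep <- keep_simple(toy)
discovery_keep <- keep_discovery(toy, groups)
```

In this toy case V2 is dropped by the "simple" rule but kept by the "discovery" rule, because it is well observed in group B.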

batch_prop_threshold

The proportion of variables that must show significant batch effects to perform batch correction. Default is 0.3 (30%).
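
The decision this threshold controls can be sketched as follows. The sketch assumes one batch-effect p-value per variable is already available; how the package computes those p-values, the 0.05 significance level, and the strict `>` comparison are all assumptions, and `should_correct_batch()` is a hypothetical helper.

```r
# Hypothetical helper, not part of glyclean: decide whether batch
# correction is warranted from per-variable batch-effect p-values.
should_correct_batch <- function(p_values, prop_threshold = 0.3, alpha = 0.05) {
  prop_significant <- mean(p_values < alpha, na.rm = TRUE)
  prop_significant > prop_threshold
}

# Made-up p-values for 10 variables; 4 of them fall below 0.05.
p_values <- c(0.001, 0.02, 0.04, 0.20, 0.50, 0.70, 0.90, 0.03, 0.60, 0.80)
correct_default <- should_correct_batch(p_values)
correct_strict <- should_correct_batch(p_values, prop_threshold = 0.5)
```

With 40% of the toy variables significant, correction would go ahead at the default threshold of 0.3 but not at a threshold of 0.5.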

check_batch_confounding

Whether to check for confounding between batch and group variables. Default is TRUE.

batch_confounding_threshold

The threshold of Cramér's V above which batch and group variables are considered highly confounded. Only used when check_batch_confounding is TRUE. Default is 0.4.
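
For readers unfamiliar with the statistic, Cramér's V for two categorical variables can be computed in base R from the chi-squared statistic of their contingency table. This is a minimal sketch using the standard formula, not the package's internal code, and `cramers_v()` is a hypothetical helper.

```r
# Hypothetical helper: Cramér's V = sqrt(chi^2 / (n * (min(r, c) - 1))),
# where r and c are the numbers of rows and columns of the table.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  # Small toy tables trigger an approximation warning that is irrelevant here.
  chi_sq <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab))
  unname(sqrt(chi_sq / (n * (k - 1))))
}

# Made-up designs with 8 samples in two groups.
group <- rep(c("case", "control"), each = 4)
batch_confounded <- rep(c("b1", "b2"), each = 4)  # batch mirrors group
batch_balanced <- rep(c("b1", "b2"), times = 4)   # batches spread evenly

v_confounded <- cramers_v(group, batch_confounded)
v_balanced <- cramers_v(group, batch_balanced)
```

A fully confounded design gives a value of 1 and a balanced design gives 0, so a threshold of 0.4 flags designs that lean clearly toward the confounded end.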

Value

A modified glyexp::experiment() object.

Examples

library(glyexp)
exp <- real_experiment
auto_clean(exp)
#> 
#> ── Normalizing data ──
#> 
#> No QC samples found. Using default normalization method based on experiment
#> type.
#> Experiment type is "glycoproteomics". Using `normalize_median()`.
#> 
#> ── Removing variables with too many missing values ──
#> 
#> No QC samples found. Using all samples.
#> Applying preset "discovery"...
#> Total removed: 24 (0.56%) variables.
#> 
#> ── Imputing missing values ──
#> 
#> No QC samples found. Using default imputation method based on sample size.
#> Sample size <= 30, using `impute_sample_min()`.
#> 
#> ── Aggregating data ──
#> 
#> Aggregating to "gfs" level
#> 
#> ── Normalizing data again ──
#> 
#> No QC samples found. Using default normalization method based on experiment
#> type.
#> Experiment type is "glycoproteomics". Using `normalize_median()`.
#> 
#> ── Correcting batch effects ──
#> 
#> ℹ Batch column  not found in sample_info. Skipping batch correction.
#> 
#> ── Glycoproteomics Experiment ──────────────────────────────────────────────────
#> ℹ Expression matrix: 12 samples, 3979 variables
#> ℹ Sample information fields: group <fct>
#> ℹ Variable information fields: protein <chr>, glycan_composition <glyrpr_c>, glycan_structure <glyrpr_s>, protein_site <int>, gene <chr>
