
Defining Custom Traits
custom-traits.Rmd
This vignette provides a comprehensive guide for defining custom
derived traits using glydet
. Before proceeding, please
ensure you have read the “Get Started with glydet” vignette and are
familiar with the fundamental concepts of derived traits and
meta-properties.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(glydet)
library(glyexp)
#>
#> Attaching package: 'glyexp'
#> The following object is masked from 'package:dplyr':
#>
#> select_var
library(glyclean)
#>
#> Attaching package: 'glyclean'
#> The following object is masked from 'package:stats':
#>
#> aggregate
library(glyrepr)
exp <- auto_clean(real_experiment)
#> ℹ Normalizing data (Median)
#> ✔ Normalizing data (Median) [137ms]
#>
#> ℹ Removing variables with >50% missing values
#> ✔ Removing variables with >50% missing values [22ms]
#>
#> ℹ Imputing missing values
#> ℹ Sample size <= 30, using sample minimum imputation
#> ℹ Imputing missing values✔ Imputing missing values [26ms]
#>
#> ℹ Aggregating data
#> ✔ Aggregating data [872ms]
#>
#> ℹ Normalizing data again
#> ✔ Normalizing data again [19ms]
Custom Traits
glydet
provides three trait factory functions for
creating custom derived traits:
-
prop()
for calculating the abundance proportion of a specific glycan subset -
ratio()
for computing the abundance ratio between two glycan subsets -
wmean()
for calculating the weighted mean of a quantitative property, weighted by glycan abundances
All derived traits in glydet
are constructed using these
three factory functions. The definitions of all built-in traits
demonstrate their usage:
-
TM
:prop(Tp == "highmannose")
-
TH
:prop(Tp == "hybrid")
-
TC
:prop(Tp == "complex")
-
MM
:wmean(nM, within = (Tp == "highmannose"))
-
CA2
:prop(nA == 2, within = (Tp == "complex"))
-
CA3
:prop(nA == 3, within = (Tp == "complex"))
-
CA4
:prop(nA == 4, within = (Tp == "complex"))
-
TF
:prop(nF > 0)
-
TFc
:prop(nFc > 0)
-
TFa
:prop(nFa > 0)
-
TB
:prop(B)
-
SG
:wmean(nS / nG)
-
GA
:wmean(nG / nA)
-
TS
:prop(nS > 0)
These definitions utilize meta-properties as building blocks. For
instance, T
represents glycan type, nM
denotes
the number of mannose residues, and so forth. To retrieve all available
built-in meta-properties, use names(all_mp_fns())
.
The complete list of built-in meta-properties is provided below:
Name | Function | Type | Description |
---|---|---|---|
T |
n_glycan_type() |
factor | Type of the glycan, either “complex”, “hybrid”, “highmannose”, or “pausimannose” |
B |
has_bisecting() |
logical | Whether the glycan has a bisecting GlcNAc |
nA |
n_antennae() |
integer | Number of antennae |
nF |
n_fuc() |
integer | Number of fucoses |
nFc |
n_core_fuc() |
integer | Number of core fucoses |
nFa |
n_arm_fuc() |
integer | Number of arm fucoses |
nG |
n_gal() |
integer | Number of galactoses |
nGt |
n_terminal_gal() |
integer | Number of terminal galactoses |
nS |
n_sia() |
integer | Number of sialic acids |
nM |
n_man() |
integer | Number of mannoses |
The following sections provide detailed explanations of each factory function.
Trait Factories
prop()
prop()
creates proportion traits, which represent the
most common type of derived traits. These traits calculate the relative
abundance of a specific glycan subset within a defined population.
Examples include the proportion of core-fucosylated glycans within all
glycans, or the proportion of tetra-antennary glycans within the complex
glycan subset.
prop()
accepts an expression that evaluates to a logical
vector. Both built-in and custom meta-properties (covered later) can be
referenced within the expression.
For example, the proportion of core-fucosylated glycans within all glycans is defined as:
prop(nFc > 0)
Since nFc
is an integer meta-property representing the
number of core fucoses, the expression nFc > 0
evaluates
to a logical vector, creating a valid trait definition.
Consider this simpler example:
prop(B)
This demonstrates a straightforward trait definition. Since
B
(bisecting GlcNAc presence) is already a logical
meta-property, prop(B)
constitutes a valid trait
definition.
A more complex example demonstrates compound logical operations:
prop(nS > 0 & nFa > 0)
This trait calculates the proportion of glycans containing both
sialic acid and arm fucose. Both nS > 0
and
nFa > 0
represent logical expressions, combined using
the logical AND operator (&
). Any R logical operator
(including |
, !
, etc.) can be utilized within
these expressions.
Arithmetic calculations can also be incorporated within expressions.
For example, the CF
trait is defined as:
prop(nF > 0)
We know that nFc + nFa
is equivalent to nF
,
so the above definition is equivalent to:
prop(nFc + nFa > 0)
As a general principle, any R expression that evaluates to a logical
vector is valid for use with prop()
.
The traits described above calculate proportions relative to the
entire glycan population. To customize the denominator (reference
population), the within
parameter can be employed. For
example, calculating the proportion of bi-antennary glycans within the
complex glycan subset (CA2
) requires this
approach.
The within
parameter accepts an expression that
evaluates to a logical vector, following the same syntax as the primary
parameter.
The CA2
trait is defined as follows:
prop(nA == 2, within = (Tp == "complex"))
Note that parentheses around Tp == "complex"
are
optional. Consequently, this alternative definition is equally
valid:
prop(nA == 2, within = Tp == "complex")
Combining these two parameters enables the creation of sophisticated trait definitions. For instance, the proportion of sialylated glycans within the subset of core-fucosylated tetra-antennary glycans is defined as:
prop(nS > 0, within = (nFc > 0 & nA == 4))
Having mastered the usage of prop()
, examine the
definitions of built-in proportion traits to reinforce your
understanding.
Practice exercises:
- Define a trait calculating the proportion of glycans containing terminal galactose.
- Define a trait calculating the proportion of glycans with terminal galactose but no sialic acid.
- Define a trait calculating the proportion of glycans with exactly two antennae and no bisecting GlcNAc.
- Define a trait calculating the proportion of glycans with bisecting GlcNAc within the bi-antennary glycan subset.
Solutions are provided at the end of this vignette.
ratio()
ratio()
creates ratio traits that represent the quotient
of total abundances between two glycan groups. Examples include the
ratio of complex to hybrid glycans, or the ratio of bisecting to
non-bisecting glycans.
ratio()
accepts two expressions that evaluate to logical
vectors, similar to prop()
.
For example, the ratio of complex to hybrid glycans is defined as:
ratio(Tp == "complex", Tp == "hybrid")
The first expression defines the numerator, while the second expression defines the denominator.
The within
parameter can be employed to apply
restrictions to both numerator and denominator. For example, the ratio
of bisecting to non-bisecting glycans within the bi-antennary subset is
defined as:
ratio(B, !B, within = (nA == 2))
This represents syntactic sugar for the more verbose:
ratio(B & (nA == 2), (!B) & (nA == 2))
The within
parameter provides clearer semantics and
reduces redundancy.
Note that prop()
represents a special case of
ratio()
, where prop(cond, within)
is
mathematically equivalent to
ratio(cond & within, within)
. The following two
definitions are functionally identical:
# using prop()
prop(nFc > 0, within = (Tp == "complex"))
# using ratio()
ratio(nFc > 0 & Tp == "complex", Tp == "complex")
However, prop()
is recommended when appropriate for
enhanced readability, as it provides more intuitive semantics that align
with natural language. The prop()
example above can be
interpreted as “the proportion of core-fucosylated glycans within all
complex glycans,” whereas the ratio()
version requires
additional cognitive processing.
Practice exercises:
- Define a trait calculating the ratio of core-fucosylated to non-core-fucosylated glycans.
- Define a trait calculating the ratio of tetra-antennary to tri-antennary glycans within the complex glycan subset.
- Redefine the
CA2
trait usingratio()
.
wmean()
wmean()
creates weighted-mean traits, which represent
the most sophisticated yet powerful type of derived traits. These traits
calculate the abundance-weighted average of quantitative properties
across glycan populations. Examples include the average number of
antennae across all glycans, or the average degree of sialylation per
galactose across the entire glycan repertoire.
Before exploring wmean()
usage, it is essential to
understand the weighted-mean concept.
Consider three glycans: G1
, G2
, and
G3
. Their respective antennae counts are 1, 2, and 3, with
relative abundances of 50%, 20%, and 30%. The weighted-mean number of
antennae is calculated as:
(1 * 0.5 + 2 * 0.2 + 3 * 0.3) / (0.5 + 0.2 + 0.3)
#> [1] 1.8
This value represents the abundance-weighted average degree of branching across the glycan population.
This concept provides substantial analytical power, enabling the calculation of numerous biologically meaningful traits.
Consider another example: the average degree of sialylation per
antenna across all glycans. Using the same three glycans
(G1
, G2
, and G3
) with antennae
counts of 1, 2, and 3, sialic acid counts of 1, 1, and 3, respectively,
and relative abundances of 50%, 20%, and 30%.
First, calculate the sialylation degree per antenna for each glycan (sialic acid count divided by antenna count):
c(1/1, 1/2, 3/3)
#> [1] 1.0 0.5 1.0
Subsequently, incorporate the abundance weighting:
(1/1 * 0.5 + 1/2 * 0.2 + 3/3 * 0.3) / (0.5 + 0.2 + 0.3)
#> [1] 0.9
This yields the abundance-weighted average degree of sialylation per antenna across the glycan population.
Having established the weighted-mean concept and mastered
prop()
and ratio()
fundamentals, we can now
examine wmean()
implementation.
wmean()
accepts expressions that evaluate to numeric
vectors, contrasting with the logical vectors required by
prop()
and ratio()
.
For example, the average number of antennae across all glycans is defined as:
wmean(nA)
The average degree of sialylation per antenna across all glycans is defined as:
wmean(nS / nA)
The within
parameter can be utilized to restrict the
weighted-mean calculation to specific glycan subsets. For example, the
average number of antennae within complex glycans is defined as:
wmean(nA, within = (Tp == "complex"))
Practice exercises:
- Define a trait calculating the average degree of sialylation per antenna.
- Define a trait calculating the average number of arm fucoses.
Using Custom Traits
Having learned how to define custom traits, we can now implement them.
derive_traits()
includes a trait_fns
parameter that accepts a named list of derived trait functions. Note
that the three factory functions described above return functions as
their output.
R supports functional programming paradigms, treating functions as first-class objects. Functions can be passed as arguments to other functions and returned from function calls.
While understanding this concept enhances R proficiency, it is not
prerequisite for usage—simply define traits as a named list and provide
it to the trait_fns
parameter.
The following example examines sialylation degree within glycan subsets of varying antenna counts:
my_traits <- list(
A2SG = wmean(nS / nA, within = (nA == 2)),
A3SG = wmean(nS / nA, within = (nA == 3)),
A4SG = wmean(nS / nA, within = (nA == 4))
)
derive_traits(exp, trait_fns = my_traits)
The identifiers “A2SG”, “A3SG”, and “A4SG” represent the derived trait names.
When trait_fns
is omitted, derive_traits()
internally invokes basic_traits()
to utilize built-in trait
functions.
Validating Trait Definitions
How can you ensure your trait definitions are accurate and
meaningful? The explain_trait()
function provides an
intuitive way to validate your trait definitions by generating
human-readable explanations.
This function helps you:
- Verify that your trait logic matches your intended analysis
- Understand complex trait expressions at a glance
- Debug and refine trait definitions before deployment
Let’s examine the A2SG
trait defined earlier as an
example:
explain_trait(wmean(nS / nA, within = (nA == 2)))
#> [1] "Abundance-weighted mean of degree of sialylation per antenna within bi-antennary glycans."
The function interprets your trait expression and returns a clear, natural language description of what the trait calculates. This validation step is particularly valuable when working with complex trait definitions or when collaborating with team members who need to understand your analytical approach.
Custom Meta-Properties
The preceding examples utilized exclusively built-in meta-properties. As the building blocks of derived traits, understanding how to define custom meta-properties will bring you more flexibility.
glydet
provides two means of defining custom
meta-properties:
- By providing meta-property functions to the
mp_fns
parameter ofderive_traits()
. - By using columns in the variable information tibble as meta-properties.
Defining Custom Meta-Properties with Functions
Meta-property functions must accept a
glyrepr::glycan_structure()
vector and return a vector of
corresponding meta-property values.
For example, the built-in meta-property B
is defined as
follows:
# Simplified version for illustration purposes
function(glycans) {
motif <- "HexNAc(??-?)Hex(??-?)HexNAc(??-?)HexNAc(??-"
glymotif::have_motif(glycans, motif, alignment = "core")
}
This implementation utilizes glymotif::have_motif()
to
determine motif presence within glycan structures. Meta-property
function definitions offer considerable flexibility. Any tools from the
glycoverse
ecosystem or broader R environment can be
employed, provided the function accepts a
glyrepr::glycan_structure()
vector and returns an atomic
vector. All built-in meta-properties utilize the glymotif
package for motif matching and counting operations. For detailed
information, consult the glymotif
documentation.
The following example demonstrates custom meta-property creation:
my_mp_fns <- list(
# Number of Lewis antigens
nLe = ~ glymotif::count_motif(.x, "Hex(??-?)[dHex(??-?)]HexNAc(??-"),
# Number of poly-LacNAc units
nPl = ~ glymotif::count_motif(.x, "Hex(??-?)HexNAc(??-?)Hex(??-?)HexNAc(??-")
)
Note that purrr-style lambda functions provide a concise syntax for meta-property function definitions.
Traits can be defined using these new meta-properties:
my_traits <- list(
# Average number of Lewis antigens per antenna
LeA = wmean(nLe / nA),
# Average number of poly-LacNAc units per glycan
Pl = wmean(nPl)
)
These expressions incorporate both custom meta-properties
(nLe
and nPl
) and the built-in meta-property
nA
.
Implementation of these custom meta-properties and traits proceeds as follows:
derive_traits(exp, trait_fns = my_traits, mp_fns = c(my_mp_fns, all_mp_fns()))
#>
#> ── Traitproteomics Experiment ──────────────────────────────────────────────────
#> ℹ Expression matrix: 12 samples, 548 variables
#> ℹ Sample information fields: group <chr>
#> ℹ Variable information fields: protein <chr>, protein_site <int>, trait <chr>, gene <chr>
Ensure that custom meta-properties are combined with built-in meta-properties when the latter are required within trait expressions.
Let’s see another example about sialic acid linkage types. Neu5Ac can have two different linkage types: a2-3 and a2-6. We can define two custom meta-properties to count the number of sialic acids with each linkage type:
sia_mp_fns <- list(
# Number of a2-3 sialic acids
nL = ~ glymotif::count_motif(.x, "NeuAc(a2-3)Hex(??-"),
# Number of a2-6 sialic acids
nE = ~ glymotif::count_motif(.x, "NeuAc(a2-6)Hex(??-")
)
And define two traits to calculate the degree of a2-3 and a2-6 sialylation per galactose:
sia_traits <- list(
# Average degree of a2-3 sialylation per galactose
LG = wmean(nL / nG),
# Average degree of a2-6 sialylation per galactose
EG = wmean(nE / nG)
)
And calculate the traits:
derive_traits(exp, trait_fns = sia_traits, mp_fns = c(sia_mp_fns, all_mp_fns()))
#>
#> ── Traitproteomics Experiment ──────────────────────────────────────────────────
#> ℹ Expression matrix: 12 samples, 548 variables
#> ℹ Sample information fields: group <chr>
#> ℹ Variable information fields: protein <chr>, protein_site <int>, trait <chr>, gene <chr>
This is an example of how you can violate the ambiguity assumption of built-in meta-properties and derived traits, by introducing more sophisticated and linkage-awared meta-properties.
Defining Custom Meta-Properties with Columns
The first method is handy if all information you need is already
encoded in glycan structures. However, there are situations that you
need to rely on other meta-data. In this case, you can use columns in
the variable information tibble as meta-properties directly, by
specifying the mp_cols
parameter.
For example, let’s use the sialic acid linkage type example above. You might have used special derivatization methods to differentiate the two linkage types. This information might end up in a column in the variable information tibble, not directly in the glycan structures.
# Here we assume all sialic acids are a2-6
exp2 <- exp |>
mutate_var(
n_a26_sia = count_mono(glycan_structure, "NeuAc"),
n_a23_sia = 0L
)
exp2 |>
get_var_info() |>
filter(n_a26_sia > 0) |>
pull(glycan_structure)
#> <glycan_structure[2526]>
#> [1] NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)HexNAc(??-
#> [2] NeuAc(??-?)Hex(??-?)HexNAc(??-?)[HexNAc(??-?)]Hex(??-?)[Hex(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)HexNAc(??-
#> [3] NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[Hex(??-?)HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)HexNAc(??-
#> [4] NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)HexNAc(??-
#> [5] NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)HexNAc(??-
#> [6] NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[NeuAc(??-?)Hex(??-?)[dHex(??-?)]HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)HexNAc(??-
#> [7] NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[NeuAc(??-?)Hex(??-?)[dHex(??-?)]HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)HexNAc(??-
#> [8] NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[Hex(??-?)]Hex(??-?)HexNAc(??-?)[dHex(??-?)]HexNAc(??-
#> [9] NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[Hex(??-?)HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)[dHex(??-?)]HexNAc(??-
#> [10] NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)[NeuAc(??-?)Hex(??-?)HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)HexNAc(??-
#> ... (2516 more not shown)
#> # Unique structures: 526
See? No linkage information about sialic acids is in the glycan
structures. But we have two columns: n_a26_sia
and
n_a23_sia
storing this information.
Now we can define two traits to calculate the degree of a2-3 and a2-6 sialylation per galactose:
sia_traits <- list(
# Average degree of a2-3 sialylation per galactose
LG = wmean(nL / nG),
# Average degree of a2-6 sialylation per galactose
EG = wmean(nE / nG)
)
And use mp_cols
to tell derive_traits()
to
use these columns as meta-properties:
derive_traits(exp2, trait_fns = sia_traits, mp_cols = c(nL = "n_a23_sia", nE = "n_a26_sia"))
#>
#> ── Traitproteomics Experiment ──────────────────────────────────────────────────
#> ℹ Expression matrix: 12 samples, 548 variables
#> ℹ Sample information fields: group <chr>
#> ℹ Variable information fields: protein <chr>, protein_site <int>, trait <chr>, gene <chr>, n_a23_sia <int>