
Creating Experiments
This vignette will guide you through creating an experiment() object from scratch. Think of it as your step-by-step recipe for building the foundation of your glycomics analysis!
Quick tip: If you’re already using glyread to import your data, you can focus on Step 1 (sample information) and breeze through the rest.
What You’ll Need
Creating an experiment() object is like assembling a puzzle – you need three key pieces to make everything fit together perfectly:
- Expression Matrix: Your numeric data arranged with variables as rows and samples as columns (the heart of your dataset!)
- Sample Information: A tibble containing all the juicy details about your samples – groups, batches, demographics, you name it
- Variable Information: A tibble describing your variables – proteins, peptides, glycan compositions, and more
You’ll also sprinkle in some experiment metadata to complete the picture.
Ready? Let’s dive in!
Step 1: Getting Your Samples Organized
Let’s start with the sample information – think of this as your sample’s ID card plus all the important details about each one.
The golden rule here is simple: your first column must be named sample, and every entry needs to be unique (no duplicates allowed!). These are your sample identifiers – the names that tie everything together.
For the remaining columns, you have complete freedom! Add any information that matters for your analysis: groups, batches, patient demographics, treatment conditions – whatever helps tell your story.
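The two rules above are easy to check before going further. Here is a minimal base R sketch, assuming a made-up table called candidate (not part of this vignette) that deliberately contains a duplicated identifier:

```r
# Hypothetical sample table with a deliberate duplicate ("S2")
candidate <- data.frame(
  sample = c("S1", "S2", "S2"),
  group = c("A", "A", "B"),
  stringsAsFactors = FALSE
)

# Rule 1: the first column must be named "sample"
first_col_ok <- identical(names(candidate)[1], "sample")

# Rule 2: every identifier must be unique; collect the offenders
dup_ids <- unique(candidate$sample[duplicated(candidate$sample)])

first_col_ok
dup_ids
```

An empty dup_ids means the identifiers are safe to use.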
Pro Tips for Column Names
The glycoverse family has some favorite column names that make everything work seamlessly:
- group: Your experimental conditions or treatments (and yes, make it a factor!). This is your star player – most glystats functions rely on this column.
- batch: For tracking sample batches (another factor, please!). Essential for glyclean::remove_batch_effect(). No batch column? No problem – glyclean::auto_clean() will simply skip batch correction.
Note: We’re constantly expanding this list, so stay tuned for more conventions!
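If your group and batch values arrive as plain character or numeric vectors (say, from a spreadsheet), base R's factor() converts them. The sketch below uses made-up values; group_raw and batch_raw are hypothetical names, not part of any glycoverse API:

```r
# Hypothetical raw columns as they might arrive from a spreadsheet
group_raw <- c("control", "case", "control", "case")
batch_raw <- c(1, 1, 2, 2)

# Explicit levels fix the level order instead of relying on alphabetical sorting
group <- factor(group_raw, levels = c("control", "case"))

# Without explicit levels, factor() sorts the unique values
batch <- factor(batch_raw)

levels(group)
levels(batch)
```

Setting levels explicitly is worth the extra typing: base R modelling functions treat the first level as the reference, so the order decides which condition is the baseline.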
Let’s build our sample information table:
library(tibble)  # provides tibble()

sample_info <- tibble(
  sample = c("S1", "S2", "S3", "S4", "S5", "S6"),
  group = factor(c("A", "A", "A", "B", "B", "B"), levels = c("A", "B")),
  batch = factor(c(1, 2, 1, 2, 1, 2), levels = c(1, 2))
)
sample_info
#> # A tibble: 6 × 3
#> sample group batch
#> <chr> <fct> <fct>
#> 1 S1 A 1
#> 2 S2 A 2
#> 3 S3 A 1
#> 4 S4 B 2
#> 5 S5 B 1
#> 6 S6 B 2
Step 2: Describing Your Variables
Now for the variable information – this is where you describe what each measurement actually represents.
Good news for glycoproteomics folks: If you’re using glyread, sit back and relax! It automatically extracts all this information from your software output files.
For glycomics experiments: You’ll likely need to build this manually, but don’t worry – glycomics data structures are much more straightforward than their glycoproteomics cousins.
The Variable Naming Game
Just like with samples, your first column must be named variable with unique values. Here’s the thing though – these names don’t need to be fancy or meaningful. In fact, glyread happily uses simple names like “V1”, “V2”, “V3” by default, and that works perfectly fine!
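If you are building the table by hand, placeholder names of this kind can be generated in one line of base R. A small sketch, where n_variables is a hypothetical count of measured variables:

```r
# Hypothetical number of measured variables
n_variables <- 3

# Sequential placeholder names: "V1", "V2", "V3"
variable <- paste0("V", seq_len(n_variables))

variable
# anyDuplicated() returns 0 when all names are unique
anyDuplicated(variable) == 0
```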
Column Conventions by Experiment Type
For glycomics experiments, these columns are your friends:
- glycan_composition: Your glycan composition as a glyrepr::glycan_composition() object
- glycan_structure: Your glycan structure as a glyrepr::glycan_structure() object
(You could use character strings that glyparse::auto_parse() understands, but we recommend the parsed objects to avoid repetitive parsing later)
For glycoproteomics experiments, you’ll want these additional columns:
- protein: UniProt accession (character)
- protein_site: Glycosylation site position on the protein (integer)
- gene: Gene name (character) – optional but nice to have
- peptide: Peptide sequence (character) – optional
- peptide_site: Peptide site position for glycan attachment (integer) – optional
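To make the column types concrete, here is a rough sketch of the glycoproteomics-specific columns. Every accession, gene name, sequence, and position below is an invented placeholder. The sketch uses a base R data.frame to stay dependency-free (tibble::as_tibble() would turn it into a tibble) and leaves out the glycan columns:

```r
# All values are invented placeholders, not real proteins or peptides
gp_var_info <- data.frame(
  variable = c("GP1", "GP2"),
  protein = c("P00001", "P00002"),   # placeholder accession strings
  protein_site = c(52L, 103L),       # integer, note the L suffix
  gene = c("GENE1", "GENE2"),        # optional
  peptide = c("AANGTK", "LNVSR"),    # optional
  peptide_site = c(3L, 2L),          # optional; position within the peptide
  stringsAsFactors = FALSE
)

# Sanity check: the residue at peptide_site in each placeholder peptide
site_residue <- substr(
  gp_var_info$peptide,
  gp_var_info$peptide_site,
  gp_var_info$peptide_site
)
site_residue
```

Writing the positions with the L suffix keeps them as integers, not doubles, matching the types listed above.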
Let’s create our variable information table:
var_info <- tibble(
  variable = c("V1", "V2", "V3"),
  glycan_composition = glyrepr::glycan_composition(
    c(GalNAc = 1),
    c(Gal = 1, GalNAc = 1),
    c(Gal = 1, GalNAc = 1, GlcNAc = 1)
  ),
  glycan_structure = glyrepr::as_glycan_structure(c(
    "GalNAc(a1-",
    "Gal(b1-3)GalNAc(a1-",
    "Gal(b1-3)[GlcNAc(a1-6)]GalNAc(a1-"
  ))
)
var_info
#> # A tibble: 3 × 3
#> variable glycan_composition glycan_structure
#> <chr> <comp> <struct>
#> 1 V1 GalNAc(1) GalNAc(a1-
#> 2 V2 Gal(1)GalNAc(1) Gal(b1-3)GalNAc(a1-
#> 3 V3 Gal(1)GlcNAc(1)GalNAc(1) Gal(b1-3)[GlcNAc(a1-6)]GalNAc(a1-
Step 3: Building Your Expression Matrix
Time for the main event – your expression matrix! This is where all your actual measurements live.
The layout is straightforward: variables as rows, samples as columns.
You don’t need to transform your data beforehand, log-transformation in particular. glycoverse functions will handle any data transformation internally if needed.
The Matching Game
Here’s the important bit: your row names must match the variable column from your variable information, and your column names must match the sample column from your sample information. Think of it as connecting the dots between your data and metadata.
Bonus: The order doesn’t need to be perfect – the functions are smart enough to line everything up correctly!
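The matching rule boils down to "same set of names, any order". The base R sketch below illustrates the idea on toy data. It is not how experiment() is implemented, and the names variables, samples, and m are made up for the example:

```r
# Toy identifiers standing in for the variable and sample columns
variables <- c("V1", "V2", "V3")
samples <- c("S1", "S2", "S3")

# A matrix whose rows and columns are in a different order than the metadata
m <- matrix(
  1:9,
  nrow = 3,
  dimnames = list(c("V3", "V1", "V2"), c("S2", "S3", "S1"))
)

# The names match as sets, regardless of order
names_match <- setequal(rownames(m), variables) &&
  setequal(colnames(m), samples)

# Indexing by name lines the matrix up with the metadata order
aligned <- m[variables, samples]

names_match
aligned
```

Since the order is handled for you, this manual alignment should not be needed in practice; the set comparison is still a handy way to catch typos in row or column names early.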
Let’s create our expression matrix:
# Create a simple matrix with 3 variables and 6 samples
# (no seed is set, so your values will differ from the output below)
expr_mat <- matrix(
  rnorm(18, mean = 10, sd = 2), # Some realistic-looking data
  nrow = 3,
  ncol = 6
)
rownames(expr_mat) <- var_info$variable
colnames(expr_mat) <- sample_info$sample
expr_mat
#> S1 S2 S3 S4 S5 S6
#> V1 7.199913 9.988857 6.356365 9.434589 14.130050 6.273977
#> V2 10.510634 11.243105 9.505349 8.892601 6.738021 8.955975
#> V3 5.125473 12.296823 9.511601 11.257964 11.024854 9.894796
Step 4: Bringing It All Together
Congratulations! You’ve got all the pieces of the puzzle. Now let’s assemble your experiment() object:
exp <- experiment(
  expr_mat, sample_info, var_info,
  exp_type = "glycomics",
  glycan_type = "O"  # the example glycans are O-GalNAc structures
)
exp
#>
#> ── Glycomics Experiment ────────────────────────────────────────────────────────
#> ℹ Expression matrix: 6 samples, 3 variables
#> ℹ Sample information fields: group <fct>, batch <fct>
#> ℹ Variable information fields: glycan_composition <comp>, glycan_structure <struct>
Don’t forget the metadata: You’ll need to specify your experiment type ("glycomics" or "glycoproteomics") and glycan type (like "N" for N-linked, "O" for O-linked, etc.). These little details help downstream functions understand exactly what they’re working with.
And there you have it – your very own experiment() object, ready for all the exciting analyses that await! 🎉