
Get Started with glyread
The Great Glycoproteomics Data Chaos 📊
Picture this: you’ve just finished running your glycoproteomics experiment through your favorite identification software—maybe pGlyco3, MSFragger-Glyco, or Byonic. You’re excited to dive into the results, but then reality hits: you’re staring at a massive spreadsheet with 120+ columns, cryptic column names, and data scattered everywhere like confetti after a celebration gone wrong!
Sound familiar? Welcome to the wonderful world of glycoproteomics data formats!
Each software tool speaks its own unique “dialect”—what one calls “Proteins,” another might call “Protein.Accessions,” and yet another prefers “UniProt_IDs.” It’s like trying to have a conversation at the Tower of Babel, but for data scientists.
Enter glyread—your universal translator! 🌟
Think of glyread as your personal data butler who speaks fluent “software chaos” and translates everything into beautiful, organized, analysis-ready data. No more wrestling with column names, no more manual reformatting, no more pulling your hair out over inconsistent formats. Just clean, tidy data that’s ready for the fun part: actual analysis!
🎯 Important Note: All functions in glyread return a glyexp::experiment() object, the lingua franca of the glycoverse ecosystem. If you haven’t met this elegant data structure yet, we highly recommend taking a quick detour to its introduction first. Trust us, it’s worth the journey! 🚀
A Real-World Example: Taming the pGlyco3 + pGlycoQuant Beast 🐉
Let’s dive into a classic glycoproteomics workflow that’ll make you appreciate glyread’s magic!
The Setup: You’re using pGlyco3 (the glycopeptide identification wizard 🧙) paired with pGlycoQuant (the quantification maestro 🎼). This dynamic duo is incredibly powerful, but their output? Well, let’s just say it’s… comprehensive.
Here’s what a typical pGlycoQuant output file looks like:
readr::read_tsv("glycopeptides.list")
#> New names:
#> • `` -> `...120`
#> Rows: 500 Columns: 120
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (13): GlySpec, PepSpec, RawName, Peptide, Mod, Glycan(H,N,A,F), GlycanC...
#> dbl (105): Scan, RT, PrecursorMH, PrecursorMZ, Charge, Rank, PeptideMH, GlyI...
#> lgl (2): Empty_Separator, ...120
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 500 × 120
#> GlySpec PepSpec RawName Scan RT PrecursorMH PrecursorMZ Charge Rank
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 20241224-LX… 202412… 202412… 266 58.1 2880. 961. 3 1
#> 2 20241224-LX… 202412… 202412… 2870 621. 3537. 885. 4 1
#> 3 20241224-LX… 202412… 202412… 2878 622. 3246. 812. 4 1
#> 4 20241224-LX… 202412… 202412… 2884 623. 3537. 708. 5 1
#> 5 20241224-LX… 202412… 202412… 3015 642. 2932. 978. 3 1
#> 6 20241224-LX… 202412… 202412… 3079 652. 3829. 958. 4 1
#> 7 20241224-LX… 202412… 202412… 3118 657. 3537. 885. 4 1
#> 8 20241224-LX… 202412… 202412… 3121 657. 3829. 767. 5 1
#> 9 20241224-LX… 202412… 202412… 3129 658. 3537. 708. 5 1
#> 10 20241224-LX… 202412… 202412… 3137 659. 3246. 812. 4 1
#> # ℹ 490 more rows
#> # ℹ 111 more variables: Peptide <chr>, Mod <chr>, PeptideMH <dbl>,
#> # `Glycan(H,N,A,F)` <chr>, GlycanComposition <chr>, PlausibleStruct <chr>,
#> # GlyID <dbl>, GlyFrag <chr>, GlyMass <dbl>, GlySite <dbl>, TotalScore <dbl>,
#> # PepScore <dbl>, GlyScore <dbl>, CoreMatched <dbl>, MassDeviation <dbl>,
#> # PPM <dbl>, GlyIonRatio <dbl>, byIonRatio <dbl>, czIonRatio <dbl>,
#> # GlyDecoy <dbl>, PepDecoy <dbl>, Ion_163.06 <dbl>, Ion_366.14 <dbl>, …
That’s 120 columns of pure, unadulterated data chaos! While this comprehensive output is fantastic for software interoperability, it’s like trying to find a needle in a haystack when you just want to get to your analysis.
What you actually need (the needle in our haystack):
- 🧬 Glycoform descriptions: proteins, sites, glycan compositions, glycan structures
- 📊 Quantification results: the actual numbers from pGlycoQuant
The old way: Manually wrestling with this 120-column monster, writing custom parsing scripts, debugging column name mismatches, and generally questioning your life choices.
The glyread way: One function call. Seriously. ✨
Meet read_pglyco3_pglycoquant(), your new best friend:
exp <- read_pglyco3_pglycoquant("glycopeptides.list", sample_info = "sample_info.csv")
#> ℹ Reading data
#> ℹ Performing protein inference
#> ✔ Performing protein inference [94ms]
#>
#> ℹ Parsing glycan compositions and structures
#> ✔ Parsing glycan compositions and structures [3.2s]
#>
#> ✔ Reading data [3.6s]
exp
#>
#> ── Experiment ──────────────────────────────────────────────────────────────────
#> ℹ Expression matrix: 12 samples, 298 variables
#> ℹ Sample information fields: group
#> ℹ Variable information fields: peptide, peptide_site, protein, protein_site, gene, glycan_composition, and glycan_structure
Ta-da! Look at that beautiful transformation! From 120-column chaos to organized elegance in one line of code.
Let’s peek at what treasures we’ve extracted:
🏷️ Variable Information - Meet Your Glycopeptides:
get_var_info(exp)
#> # A tibble: 298 × 8
#> variable peptide peptide_site protein protein_site gene glycan_composition
#> <chr> <chr> <int> <chr> <int> <chr> <comp>
#> 1 GP1 JKTQGK 1 P08185 176 SERP… Hex(5)HexNAc(4)Ne…
#> 2 GP2 HSHNJJSS… 5 P04196 344 HRG Hex(5)HexNAc(4)Ne…
#> 3 GP3 HSHNJJSS… 5 P04196 344 HRG Hex(5)HexNAc(4)
#> 4 GP4 HSHNJJSS… 5 P04196 344 HRG Hex(5)HexNAc(4)Ne…
#> 5 GP5 HJSTGCLR 2 P10909 291 CLU Hex(6)HexNAc(5)
#> 6 GP6 HSHNJJSS… 5 P04196 344 HRG Hex(5)HexNAc(4)Ne…
#> 7 GP7 HSHNJJSS… 6 P04196 345 HRG Hex(5)HexNAc(4)
#> 8 GP8 HSHNJJSS… 5 P04196 344 HRG Hex(5)HexNAc(4)dH…
#> 9 GP9 HSHNJJSS… 5 P04196 344 HRG Hex(4)HexNAc(3)
#> 10 GP10 HSHNJJSS… 5 P04196 344 HRG Hex(4)HexNAc(4)Ne…
#> # ℹ 288 more rows
#> # ℹ 1 more variable: glycan_structure <structure>
📋 Sample Information - Know Your Experiments:
get_sample_info(exp)
#> # A tibble: 12 × 2
#> sample group
#> <chr> <chr>
#> 1 20241224-LXJ-Nglyco-H_1 H
#> 2 20241224-LXJ-Nglyco-H_2 H
#> 3 20241224-LXJ-Nglyco-H_3 H
#> 4 20241224-LXJ-Nglyco-M_1 M
#> 5 20241224-LXJ-Nglyco-M_2 M
#> 6 20241224-LXJ-Nglyco-M_3 M
#> 7 20241224-LXJ-Nglyco-Y_1 Y
#> 8 20241224-LXJ-Nglyco-Y_2 Y
#> 9 20241224-LXJ-Nglyco-Y_3 Y
#> 10 20241224-LXJ-Nglyco-C_1 C
#> 11 20241224-LXJ-Nglyco-C_2 C
#> 12 20241224-LXJ-Nglyco-C_3 C
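The sample_info argument used earlier points to a CSV file that this page does not show. The following is a minimal sketch of what such a file might look like, assuming it needs a `sample` column whose values match the sample names in the quantification output, plus any grouping columns such as `group`. The layout is inferred from the table above rather than taken from glyread's documentation, and the file path is a temporary placeholder.

```r
# Hypothetical sample sheet; the column layout is inferred from the
# get_sample_info() output above, not from glyread's documentation.
groups <- c("H", "M", "Y", "C")
sample_info <- data.frame(
  sample = paste0(
    "20241224-LXJ-Nglyco-",
    rep(groups, each = 3), "_", rep(1:3, times = 4)
  ),
  group = rep(groups, each = 3),
  stringsAsFactors = FALSE
)

# Write to a temporary file so the sketch does not touch your project folder.
sample_info_path <- tempfile(fileext = ".csv")
write.csv(sample_info, sample_info_path, row.names = FALSE)

# Read it back to confirm the round trip keeps one row per sample.
sample_info_check <- read.csv(sample_info_path, stringsAsFactors = FALSE)
```

A path like `sample_info_path` is what you would then pass as `sample_info`, provided the real function accepts this layout.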
📊 Expression Matrix - Your Data’s Heart and Soul:
get_expr_mat(exp)[1:5, ]
#> 20241224-LXJ-Nglyco-H_1 20241224-LXJ-Nglyco-H_2 20241224-LXJ-Nglyco-H_3
#> GP1 31054.12 NA 457398.3
#> GP2 NA 136556.05 NA
#> GP3 NA 15717.87 427312.0
#> GP4 285613.66 268250.71 2621248.6
#> GP5 27588555.39 19527065.26 32930089.6
#> 20241224-LXJ-Nglyco-M_1 20241224-LXJ-Nglyco-M_2 20241224-LXJ-Nglyco-M_3
#> GP1 7616346 7391049 6267864
#> GP2 22675686 16675442 114423292
#> GP3 10813133 9746325 25348175
#> GP4 993255511 1099069766 1106268049
#> GP5 32500720 NA 26346060
#> 20241224-LXJ-Nglyco-Y_1 20241224-LXJ-Nglyco-Y_2 20241224-LXJ-Nglyco-Y_3
#> GP1 23059718 15010885 740942
#> GP2 115717950 90594397 55977605
#> GP3 33607210 21284262 28608146
#> GP4 547660361 753702172 556784303
#> GP5 21780632 19862189 12764805
#> 20241224-LXJ-Nglyco-C_1 20241224-LXJ-Nglyco-C_2 20241224-LXJ-Nglyco-C_3
#> GP1 NA NA 10655.62
#> GP2 305840564 428631806 16064212.10
#> GP3 32885077 35418588 6372648.51
#> GP4 669332806 696696106 112287653.59
#> GP5 25946392 18860878 4316119.03
The magic? All the essential information for downstream analysis has been carefully extracted, cleaned, and packaged into a beautiful glyexp::experiment() object. No more data archaeology, just clean, analysis-ready data! ✨
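To make the structure concrete, here is a minimal base-R sketch of the three linked tables printed above. It uses a plain list, not the real glyexp::experiment() class, and `check_alignment()` is a hypothetical helper named only for this illustration. The one assumption it encodes is the alignment visible in the output: matrix rows correspond to the `variable` column and matrix columns to the `sample` column.

```r
# Toy stand-in for an experiment: NOT the real glyexp class.
samples <- c("20241224-LXJ-Nglyco-H_1", "20241224-LXJ-Nglyco-H_2")

expr_mat <- matrix(
  c(31054.12, NA,
    NA, 136556.05),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("GP1", "GP2"), samples)
)
var_info <- data.frame(
  variable = c("GP1", "GP2"),
  protein = c("P08185", "P04196"),
  protein_site = c(176L, 344L),
  stringsAsFactors = FALSE
)
sample_info <- data.frame(
  sample = samples,
  group = c("H", "H"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when the three tables line up by name and order.
check_alignment <- function(expr_mat, sample_info, var_info) {
  identical(rownames(expr_mat), var_info$variable) &&
    identical(colnames(expr_mat), sample_info$sample)
}

toy_exp <- list(expr_mat = expr_mat, sample_info = sample_info, var_info = var_info)
aligned <- check_alignment(toy_exp$expr_mat, toy_exp$sample_info, toy_exp$var_info)
```

Keeping the three tables aligned by hand is exactly the bookkeeping that a dedicated container takes off your plate.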
Peek Under the Hood: The Magic Explained 🔧✨
Curious about what just happened? Let’s lift the hood and see the sophisticated machinery that read_pglyco3_pglycoquant() runs behind the scenes. Spoiler alert: it’s doing a lot of heavy lifting so you don’t have to!
The 8-Step Data Transformation Dance:
- Smart File Reading: Reads your data with intelligent column type detection—no more “everything is a character” surprises!
- Column Cleaning Magic: Extracts clean UniProt accessions from messy “Proteins” columns and handles all those pesky formatting inconsistencies.
- Protein Inference Intelligence: Runs a sophisticated “parsimony” algorithm to resolve protein assignment ambiguities—because biology is complicated, but your data doesn’t have to be!
- 📊 PSM-to-Glycopeptide Aggregation: Intelligently combines PSM-level quantification into meaningful glycopeptide-level measurements.
- Sample Information Validation: Cross-checks sample names and ensures everything matches up perfectly.
- 🎯 Data Extraction: Carefully extracts variable information and expression matrix while maintaining data integrity.
- 🧬 Glycan Parsing Power: Leverages the amazing glyparse package to parse glycan compositions and structures into proper data types.
- Final Assembly: Packages everything into a beautiful, analysis-ready experiment() object.
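Two of the steps above, column cleaning and PSM-to-glycopeptide aggregation, can be sketched in a few lines of base R. This is a toy illustration, not glyread's implementation. It assumes protein entries arrive in UniProt FASTA-header form (`sp|ACCESSION|NAME`) and that PSM intensities are combined by summing; neither detail is stated on this page. The `extract_accession()` helper and the column names of `psms` are hypothetical.

```r
# Hypothetical helper: pull the accession out of "sp|P08185|CBG_HUMAN".
# Entries without a "|" are returned unchanged.
extract_accession <- function(x) {
  ifelse(
    grepl("|", x, fixed = TRUE),
    sub("^[^|]*\\|([^|]+)\\|.*$", "\\1", x),
    x
  )
}

# Toy PSM table: two PSMs for the same glycopeptide, one for another.
psms <- data.frame(
  protein_raw = c("sp|P08185|CBG_HUMAN", "sp|P08185|CBG_HUMAN", "sp|P04196|HRG_HUMAN"),
  protein_site = c(176L, 176L, 344L),
  glycan_composition = c("Hex(5)HexNAc(4)", "Hex(5)HexNAc(4)", "Hex(5)HexNAc(4)"),
  sample = c("S1", "S1", "S1"),
  intensity = c(100, 50, 30),
  stringsAsFactors = FALSE
)
psms$protein <- extract_accession(psms$protein_raw)

# Collapse PSMs that share protein, site, glycan and sample.
# Summation is an assumption made for this sketch.
glycopeptides <- aggregate(
  intensity ~ protein + protein_site + glycan_composition + sample,
  data = psms,
  FUN = sum
)
```

Here the two PSMs for P08185 at site 176 collapse into one glycopeptide row, while the P04196 PSM stays on its own.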
The result? What used to take hours of custom scripting now happens in seconds, with bulletproof reliability!
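For readers curious about the protein inference step, one common way to approximate parsimony is a greedy set cover: repeatedly pick the protein that explains the most still-unexplained peptides. The sketch below shows that idea only. It is not necessarily the algorithm glyread runs, `parsimony_greedy()` is a hypothetical name, and the accessions P1 to P4 are placeholders.

```r
# Greedy set-cover sketch of parsimony-style protein inference.
# candidates: named list with one character vector of accessions per peptide.
parsimony_greedy <- function(candidates) {
  stopifnot(all(lengths(candidates) > 0))
  uncovered <- names(candidates)
  chosen <- character(0)
  while (length(uncovered) > 0) {
    # Count how many unexplained peptides each protein could explain.
    counts <- table(unlist(candidates[uncovered], use.names = FALSE))
    top <- names(counts)[as.vector(counts) == max(counts)]
    # Break ties alphabetically so the result is deterministic.
    best <- sort(top)[1]
    chosen <- c(chosen, best)
    covered <- vapply(candidates[uncovered], function(p) best %in% p, logical(1))
    uncovered <- uncovered[!covered]
  }
  # Assign each peptide to the earliest-chosen protein among its candidates.
  assignment <- vapply(candidates, function(p) chosen[chosen %in% p][1], character(1))
  list(proteins = chosen, assignment = assignment)
}

candidates <- list(
  pepA = c("P1", "P2"),
  pepB = c("P2"),
  pepC = c("P2", "P3"),
  pepD = c("P4")
)
result <- parsimony_greedy(candidates)
```

In this toy input, P2 alone explains three peptides, so P1 and P3 are never needed; P4 is kept because nothing else explains pepD.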
Meet the Whole glyread Family!
One function down, but there’s so much more! glyread is like a Swiss Army knife for glycoproteomics data: it speaks the language of virtually every major identification and quantification workflow out there. 🔧
🌟 The Complete Toolkit:
- read_byonic_byologic(): The dynamic duo of Byonic identification + Byologic quantification
- read_byonic_pglycoquant(): Byonic’s precision meets pGlycoQuant’s power
- read_msfragger(): MSFragger-Glyco’s all-in-one identification and quantification magic
- read_pglyco3(): Pure pGlyco3 with its built-in quantification capabilities
- read_pglyco3_pglycoquant(): The workflow we just explored, pGlyco3 + pGlycoQuant perfection
Coming Soon to a Package Near You: We’re actively working on supporting Glyco-Decipher, GPQuest, and other exciting tools. The glycoproteomics software landscape is evolving rapidly, and glyread is evolving right alongside it! 🚀
💡 Pro Tip: No matter which workflow you choose, you can expect the same consistent, clean data format on the other side. It’s like having a universal remote for your glycoproteomics data!
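If you switch between workflows often, a small lookup can remind you which reader goes with which tool combination. This is only a convenience sketch: the `readers` table and the `reader_for()` helper are hypothetical and not part of glyread, and only the function names themselves come from the list above.

```r
# Hypothetical lookup from workflow to the reader name listed above.
readers <- c(
  "byonic+byologic"     = "read_byonic_byologic",
  "byonic+pglycoquant"  = "read_byonic_pglycoquant",
  "msfragger-glyco"     = "read_msfragger",
  "pglyco3"             = "read_pglyco3",
  "pglyco3+pglycoquant" = "read_pglyco3_pglycoquant"
)

# Hypothetical helper: returns the reader's name as a string.
reader_for <- function(identification, quantification = NULL) {
  key <- tolower(paste(c(identification, quantification), collapse = "+"))
  if (!key %in% names(readers)) {
    stop("No reader listed for workflow: ", key)
  }
  unname(readers[key])
}

reader_name <- reader_for("pGlyco3", "pGlycoQuant")
```

The helper returns a name rather than calling anything, so it runs without glyread installed; with the package loaded, `match.fun(reader_name)` would fetch the function itself.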
The Universal Data Language: Consistent Columns Across All Functions
Here’s the beautiful part: no matter which glyread function you use, you’ll always get the same consistent, predictable data structure. It’s like having a universal translator that ensures everyone speaks the same language!
🏷️ Your Variable Information Columns - The Standard Cast:
- peptide 🧬: The peptide sequence (character) - your protein’s building blocks
- peptide_site: Glycosylation site on the peptide (integer) - where the magic happens
- protein 🏷️: UniProt accession (character) - your protein’s official ID card
- protein_site 🎯: Glycosylation site on the protein (integer) - the full-length coordinates
- gene 🧬: Gene symbol (character) - the genetic blueprint behind it all
- glycan_composition: A glyrepr::glycan_composition() object - what’s in your glycan
- glycan_structure: A glyrepr::glycan_structure() object - how it’s all connected
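The relationship between peptide_site and protein_site is simple arithmetic, sketched below under the assumption that both are 1-based positions. The helper `to_protein_site()` is hypothetical, and the peptide start position of 340 is back-calculated from the table shown earlier (344 - 5 + 1), not looked up in a sequence database. The sketch also notes that, in the pGlyco3 output shown earlier, the letter J sits exactly at the reported peptide_site.

```r
# Hypothetical helper: convert a 1-based site on the peptide to a
# 1-based site on the protein, given where the peptide starts.
to_protein_site <- function(peptide_start, peptide_site) {
  as.integer(peptide_start + peptide_site - 1L)
}

# Start position 340 is an assumption back-calculated from the table.
hrg_sites <- to_protein_site(340L, c(5L, 6L))

# In the example table, "J" appears at the reported peptide_site.
j_positions <- as.integer(regexpr("J", c("JKTQGK", "HJSTGCLR"), fixed = TRUE))
```

With these inputs, `hrg_sites` comes out as 344 and 345, matching the protein_site values in the table, and `j_positions` is 1 and 2, matching the peptide_site values of those two peptides.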
🎯 Special Note on Those Last Two: The glycan_composition and glycan_structure columns aren’t just plain text: they’re sophisticated data types from the amazing glyrepr package. Think of them as smart objects that know how to do glycan math!
💡 Pro Tip: Getting familiar with glyrepr objects is absolutely worth the investment. They unlock a whole world of glycan analysis capabilities that would be impossible with plain text. Trust us on this one!
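To see why structured compositions beat plain text, consider what it takes to count residues from a string by hand. The sketch below is a toy base-R parser, not how glyparse or glyrepr work internally, and `parse_composition()` is a hypothetical name. It assumes compositions are written as repeated `Name(count)` tokens, as in the table shown earlier.

```r
# Toy parser: turn "Hex(5)HexNAc(4)" into a named integer vector.
# Real glyrepr objects do much more; this only illustrates the idea.
parse_composition <- function(x) {
  # Find every "Name(count)" token, e.g. "Hex(5)".
  tokens <- regmatches(x, gregexpr("[A-Za-z0-9]+\\([0-9]+\\)", x))[[1]]
  counts <- as.integer(sub("^.*\\(([0-9]+)\\)$", "\\1", tokens))
  names(counts) <- sub("\\(.*$", "", tokens)
  counts
}

comp <- parse_composition("Hex(5)HexNAc(4)")
n_residues <- sum(comp)  # total monosaccharide count
n_hexnac <- comp[["HexNAc"]]
```

Once the counts are numbers rather than characters, questions like "how many HexNAc residues?" become a lookup instead of a regular expression.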
Your Glycoverse Adventure Awaits! 🚀✨
Congratulations! You’ve just mastered the art of taming glycoproteomics data chaos with glyread. But this is just the beginning of your journey through the glycoverse ecosystem!
🎯 Your Next Destinations:
- glyclean: Your data preprocessing powerhouse! Think of it as Marie Kondo for glycoproteomics data—it’ll help you clean, filter, and organize your experiments until they spark joy.
- glymotif: The motif hunting expedition! Discover hidden patterns and recurring structural themes in your glycan data. It’s like being a detective, but for sugar molecules!
- 📊 glyexp: Your experimental data command center! Master the experiment() object and unlock the full power of synchronized data manipulation.
The Best Part? Since you’re now fluent in experiment() objects thanks to glyread, all these packages will feel like natural extensions of your workflow. It’s like learning a new language and suddenly being able to read an entire library!
Happy glycan hunting! 🧬🎯
Remember: Great glycoproteomics analysis starts with great data import. You’ve got the tools—now go make some discoveries! 💫