Skip to contents

Your Universal Glycan Text Translator 🔄

Welcome to the world of glycan text parsing! If you’ve ever worked with glycan data from different sources, you know the frustration: every database, software tool, and research group seems to have their own way of representing glycan structures in text format.

That’s where glyparse comes to the rescue! 🚀

Think of glyparse as your universal glycan translator — it can read glycan structures written in many different “languages” and convert them all into a unified format that your computer can understand and work with.

Note: All functions in glyparse return glyrepr::glycan_structure objects. If you are unfamiliar with glyrepr, you can read the documentation here.

The Babel Tower of Glycan Text Formats 🗼

Before we dive in, let’s see what we’re dealing with. Here’s the same N-glycan core structure written in different formats:

Format Example Where You’ll See It
IUPAC-condensed Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc Literature, UniCarbKB
IUPAC-short Mana3(Mana6)Manb4GlcNAcb4GlcNAc Literature, UniCarbKB
IUPAC-extended alpha-D-Man-(1->3)-[alpha-D-Man-(1->6)]-beta-D-Man-(1->4)-beta-D-GlcNAc-(1->4)-D-GlcNAc Literature, UniCarbKB
GlycoCT Complex multi-line format Literature, GlycomeDB
WURCS WURCS=2.0/3,5,4/[...]/1-1-2-3-3/a4-b1_b4-c1... Literature, GlyTouCan
pGlyco (N(N(H(H(H))))) pGlyco software results
StrucGP A2B2C1D1E2fedcba StrucGP software results

Confusing, right? 😵‍💫 glyparse understands them all!

Your Parsing Toolkit 🛠️

glyparse provides seven specialized parsers, each optimized for a specific format:

All parsers follow the same pattern:

Let’s start with the IUPAC formats.

IUPAC-Condensed: The Literature Standard

This format is widely used in scientific literature and databases like UniCarbKB.

Want to know more about IUPAC-condensed format? Check this out!

# Single structure
iupac_condensed <- "Neu5Ac(a2-3)Gal(b1-4)[Fuc(a1-3)]GlcNAc(b1-3)Gal(b1-4)Glc"
structure <- parse_iupac_condensed(iupac_condensed)
structure
#> <glycan_structure[1]>
#> [1] Neu5Ac(a2-3)Gal(b1-4)[Fuc(a1-3)]GlcNAc(b1-3)Gal(b1-4)Glc(?1-
#> # Unique structures: 1
# Multiple structures at once
glycans <- c(
  "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc",  # N-glycan core
  "Gal(b1-3)GalNAc",                                    # O-glycan core 1
  "Neu5Ac(a2-3)Gal(b1-3)[GlcNAc(b1-6)]GalNAc"         # O-glycan core 2
)

structures <- parse_iupac_condensed(glycans)
structures
#> <glycan_structure[3]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(?1-
#> [2] Gal(b1-3)GalNAc(?1-
#> [3] Neu5Ac(a2-3)Gal(b1-3)[GlcNAc(b1-6)]GalNAc(?1-
#> # Unique structures: 3

IUPAC-Short: Literature’s Favorite

This compact format is popular in research papers because it saves space:

# The same structures in short format
iupac_short <- c(
  "Mana3(Mana6)Manb4GlcNAcb4GlcNAc",
  "Galb3GalNAc", 
  "Neu5Aca3Galb3(GlcNAcb6)GalNAc"
)

short_structures <- parse_iupac_short(iupac_short)
short_structures
#> <glycan_structure[3]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(?1-
#> [2] Gal(b1-3)GalNAc(?1-
#> [3] Neu5Ac(a2-3)Gal(b1-3)[GlcNAc(b1-6)]GalNAc(?1-
#> # Unique structures: 3

Notice how much more compact this is! The parser is smart enough to infer common linkage positions (like Neu5Ac always being a2-linked).

IUPAC-Extended: The Formal One

This verbose format includes full chemical names and stereochemistry:

iupac_extended <- paste0(
  "α-D-Manp-(1→3)[α-D-Manp-(1→6)]-β-D-Manp-(1→4)",
  "-β-D-GlcpNAc-(1→4)-β-D-GlcpNAc-(1→"
)
extended_structure <- parse_iupac_extended(iupac_extended)
extended_structure
#> <glycan_structure[1]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> # Unique structures: 1

Part 2: Database Formats — The Heavy Hitters 💪

GlycoCT: The Precision Format

GlycoCT is used in literature for precise representation and in databases like GlycomeDB. It’s more complex but extremely precise:

glycoct <- paste0(
  "RES\n",
  "1b:b-dglc-HEX-1:5\n",
  "2b:b-dgal-HEX-1:5\n", 
  "3b:a-dgal-HEX-1:5\n",
  "LIN\n",
  "1:1o(4+1)2d\n",
  "2:2o(3+1)3d"
)

glycoct_structure <- parse_glycoct(glycoct)
glycoct_structure
#> <glycan_structure[1]>
#> [1] Gal(a1-3)Gal(b1-4)Glc(b1-
#> # Unique structures: 1

WURCS: The Complex Structure Format

WURCS (Web3 Unique Representation of Carbohydrate Structures) is used in literature for complex structures and in databases like GlyTouCan:

wurcs <- paste0(
  "WURCS=2.0/3,3,2/",
  "[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/",
  "1-2-3/a4-b1_b3-c1"
)

wurcs_structure <- parse_wurcs(wurcs)
wurcs_structure
#> <glycan_structure[1]>
#> [1] Man(a1-3)Man(b1-4)Glc(b1-
#> # Unique structures: 1

Part 3: Software-Specific Formats — The Specialists 🔬

pGlyco Format: Proteomics Tool

If you work with glycoproteomics, you might encounter pGlyco’s parenthetical notation:

pglyco <- "(N(F)(N(H(H(N))(H(N(H))))))"
pglyco_structure <- parse_pglyco_struc(pglyco)
pglyco_structure
#> <glycan_structure[1]>
#> [1] Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)[dHex(??-?)]HexNAc(??-
#> # Unique structures: 1

This cryptic notation actually represents a complex N-glycan:

  • N = HexNAc
  • F = Fuc
  • H = Hex (Man or Gal)

StrucGP Format: Alphabetical System

StrucGP uses a letter-based encoding system:

strucgp <- "A2B2C1D1E2F1fedD1E2edcbB5ba"
strucgp_structure <- parse_strucgp_struc(strucgp)
strucgp_structure
#> <glycan_structure[1]>
#> [1] Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)[dHex(??-?)]HexNAc(??-
#> # Unique structures: 1

The Bottom Line 🎯

glyparse transforms the chaos of glycan text formats into order. No matter where your glycan data comes from—databases, literature, or software tools—you can now parse it into a unified format for analysis.

Next steps:

  • Explore the glyrepr package for structure manipulation
  • Try glymotif for motif analysis of your parsed structures
  • Use glyexp for experimental data analysis
  • Check out the rest of the glycoverse ecosystem!

Happy parsing! 🧬✨