GlycoDisease is a comprehensive database of glycoprotein-disease associations with
machine learning-based disease predictions. It integrates knowledge extracted from
PubMed literature by large language models and provides a unified resource for
studying the role of glycoproteins in human diseases.
Knowledge Extraction: LLM-based extraction from PubMed abstracts
Protein Embeddings: ESM-2 (esm2_t12_35M_UR50D; 480 dimensions → 100 PCA components)
Glycan Features: 38 features from UniProt and GlyGen
Prediction Model: XGBoost classifier (AUROC: 0.885, AUPRC: 0.510)
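
The feature dimensions listed above can be illustrated with a small sketch. It assumes, without confirmation from the project, that the 100 PCA components and the 38 glycan features are concatenated into a single 138-dimensional vector per protein before being passed to the classifier. The data here is random, the function names `fit_pca` and `build_features` are hypothetical, and PCA is done with a plain NumPy SVD rather than whatever library GlycoDisease actually uses.

```python
import numpy as np

EMBED_DIM = 480   # hidden size of esm2_t12_35M_UR50D
PCA_DIM = 100     # components kept after PCA
N_GLYCAN = 38     # glycan features from UniProt and GlyGen


def fit_pca(X, n_components):
    """Fit PCA with an SVD; return the column mean and a projection matrix."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T


def build_features(embeddings, glycan, mean, components):
    """Project embeddings into PCA space and append the glycan features.

    Concatenation is an assumption made for this sketch.
    """
    reduced = (embeddings - mean) @ components
    return np.hstack([reduced, glycan])


# Synthetic stand-ins for real per-protein embeddings and glycan features.
rng = np.random.default_rng(0)
n_proteins = 500  # must be at least PCA_DIM to obtain 100 components
embeddings = rng.normal(size=(n_proteins, EMBED_DIM))
glycan = rng.normal(size=(n_proteins, N_GLYCAN))

mean, components = fit_pca(embeddings, PCA_DIM)
features = build_features(embeddings, glycan, mean, components)
print(features.shape)  # (500, 138)
```

In a real pipeline the PCA mean and components would be fitted on the training split only and then reused for validation and test proteins, so that no information leaks into the reported AUROC and AUPRC.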