Table 1 Classification performance and identified motifs for common lectins

| Lectin | Conc. (μg/ml) | AUC (Validation) | AUC (Train) | Top Motif* |
|---|---|---|---|---|
| Agaricus bisporus agglutinin (ABA) | 100 | 0.934 (0.034) | 0.947 (0.006) | (*3,4,6)GlcNAc α |
| Concanavalin A (Con A) | 10 | 0.971 (0.031) | 0.982 (0.015) | Man α1-3(*2,4)Man |
| Dolichos biflorus agglutinin (DBA) | 100 | 0.839 (0.069) | 0.897 (0.042) | (*3,4,6)GalNAc |
| Human DC-SIGN tetramer | 200 | 0.841 (0.062) | 0.955 (0.026) | Man α1-3(Man α1-6)(*2,4)Man α |
| Griffonia simplicifolia Lectin I isolectin B4 (GSL I-B4) | 10 | 0.867 (0.061) | 0.953 (0.014) | (*2,3,4,6)Gal α1-3Gal β |
| Influenza hemagglutinin (HA) (A/Puerto Rico/8/34) (H1N1) | 200 | 0.913 (0.105) | 0.973 (0.023) | (*8,9)Neu5Ac α |
| Influenza HA (A/harbor seal/Massachusetts/1/2011) (H3N8) | 200 | 0.959 (0.028) | 0.958 (0.007) | (*8,9)Neu5Ac α2-3(*2,4,6)Gal |
| Jacalin | 1 | 0.882 (0.055) | 0.896 (0.009) | (*4,6)GalNAc α/β |
| Lens culinaris agglutinin (LCA) | 10 | 0.964 (0.032) | 0.976 (0.008) | Man α1-3Man α |
| Maackia amurensis lectin I (MAL-I) | 10 | 0.833 (0.035) | 0.848 (0.053) | (*2,4,6)Gal β1-4(*3,6)GlcNAc α/β |
| Maackia amurensis lectin II (MAL-II) | 10 | 0.718 (0.078) | 0.814 (0.074) | Gal β1-3GalNAc α |
| Phaseolus vulgaris erythroagglutinin (PHA-E) | 10 | 0.959 (0.018) | 0.975 (0.009) | (*2,4,6)Gal β1-4(*3,6)GlcNAc β1-2Man α1-3(Man α1-6)Man |
| Phaseolus vulgaris leucoagglutinin (PHA-L) | 10 | 0.914 (0.126) | 0.967 (0.030) | GlcNAc β1-6(*3,4)Man |
| Peanut agglutinin (PNA) | 10 | 0.914 (0.048) | 0.943 (0.021) | (*2,3,4,6)Gal β1-3GalNAc |
| Pisum sativum agglutinin (PSA) | 10 | 0.890 (0.053) | 0.929 (0.028) | Man α1-3(*2,4)Man |
| Ricinus communis agglutinin I (RCA I/RCA120) | 10 | 0.953 (0.026) | 0.958 (0.008) | (*2,3,4,6)Gal β1-4(*3,6)GlcNAc |
| Soybean agglutinin (SBA) | 10 | 0.875 (0.061) | 0.938 (0.026) | (*3,4,6)GalNAc |
| Sambucus nigra agglutinin (SNA) | 10 | 0.950 (0.060) | 0.979 (0.010) | Neu5Ac α2-6Gal β1-4GlcNAc |
| Ulex europaeus agglutinin I (UEA I) | 100 | 0.861 (0.049) | 0.895 (0.042) | (*3)Fuc |
| Wheat germ agglutinin (WGA) | 1 | 0.882 (0.021) | 0.901 (0.004) | GlcNAc β1-3Gal β1-4(*3,6)GlcNAc β1-3(*2,4,6)Gal β1-4(*3,6)GlcNAc |
  1. Model performance was assessed using stratified 5-fold cross-validation. Area under the curve (AUC) values were calculated for both the validation and training folds and are shown as mean (s.d.). The top motif is defined as the feature with the highest coefficient in the logistic regression classification model, and is shown for a single test/training split. Experimentally determined lectin specificities and associated citations are provided in Additional file 7.
  2. *Note: Motifs are written in a modified CFG linear text nomenclature. A set of parentheses containing connection types preceded by an asterisk indicates restricted connection types for the residue that follows. For example, a GlcNAc motif with restricted connections on C3 and C4 is indicated by (*3,4)GlcNAc.
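
The evaluation described in the first footnote can be illustrated with a minimal sketch in standard-library Python. It is not the authors' code: the function names (`stratified_folds`, `auc`, `summarize`, `top_motif`) are hypothetical, model fitting is left out, and the use of the sample standard deviation for "s.d." is an assumption. The sketch only shows how samples might be dealt into stratified folds, how an AUC can be computed from scores, how per-fold values become "mean (s.d.)", and how the top motif is read off a set of coefficients.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def auc(labels, scores):
    """ROC AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(values):
    """Format per-fold values as 'mean (s.d.)' (sample s.d. is an assumption)."""
    return f"{statistics.mean(values):.3f} ({statistics.stdev(values):.3f})"


def top_motif(coefficients):
    """Return the feature (motif) name with the largest coefficient."""
    return max(coefficients, key=coefficients.get)
```

In use, each fold would serve once as the validation set while a logistic regression model is fitted on the remaining four; `auc` would then be applied to both parts, `summarize` to the five resulting values, and `top_motif` to the fitted coefficients of one split.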
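
The restricted-connection notation described in the note can be read mechanically. The following is a rough sketch, assuming that a restriction always has the form `(*n,n,...)` immediately before a residue name made of letters and digits; the pattern and the function name `restricted_positions` are illustrative and are not part of any published tool.

```python
import re

# "(*3,4)GlcNAc": an asterisk-led list of carbon positions, then the residue.
RESTRICTION = re.compile(r"\(\*(\d+(?:,\d+)*)\)\s*([A-Za-z][A-Za-z0-9]*)")


def restricted_positions(motif):
    """Return (residue, restricted positions) pairs found in a motif string."""
    return [
        (residue, tuple(int(p) for p in positions.split(",")))
        for positions, residue in RESTRICTION.findall(motif)
    ]


# Example from the note: GlcNAc restricted on C3 and C4.
print(restricted_positions("(*3,4)GlcNAc"))
```

Ordinary branch parentheses such as `(Man α1-6)` carry no asterisk, so the pattern skips them and reports only the restricted residues.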