Table 1 Performance of classification models based on different features and training algorithms

From: Distinguishing crystallographic from biological interfaces in protein complexes: role of intermolecular contacts and energetics for classification

 

| Set | Training Features | Bagging | Random Forest | Adaptive Boosting | Gradient Boosting | Neural Network | Average |
|---|---|---|---|---|---|---|---|
| S1 | BSA | 0.74 (0.51) | 0.74 (0.51) | 0.81 (0.43) | 0.81 (0.41) | 0.55 (0.50) | 0.73 (0.47) |
| S2 | RCs | 0.86 (0.50) | 0.86 (0.50) | 0.85 (0.51) | 0.86 (0.50) | 0.85 (0.54) | 0.86 (0.51) |
| S3 | CC, CP, CA, PP, AP, AA | 0.89 (0.67) | 0.90 (0.70) | 0.89 (0.69) | 0.89 (0.67) | 0.89 (0.67) | 0.89 (0.68) |
| S4 | CC, CP, CA, PP, AP, AA, ANIS, CNIS, PNIS | 0.90 (0.69) | 0.90 (0.69) | 0.89 (0.66) | 0.89 (0.67) | 0.89 (0.67) | 0.89 (0.68) |
| S5 | CC, CP, CA, PP, AP, AA, LD, G, A, L, M, F, W, K, Q, E, S, P, V, I, C, Y, H, R, N, D, T | 0.92 (0.74) | 0.92 (0.73) | 0.91 (0.74) | 0.92 (0.71) | 0.91 (0.77) | 0.92 (0.74) |
| S6 | CC, CP, CA, PP, AP, AA, ANIS, CNIS, PNIS, LD, G, A, L, M, F, W, K, Q, E, S, P, V, I, C, Y, H, R, N, D, T | 0.92 (0.73) | 0.92 (0.75) | 0.91 (0.74) | 0.93 (0.70) | 0.92 (0.76) | 0.92 (0.74) |
| E1 | HS | 0.76 (0.59) | 0.76 (0.59) | 0.83 (0.62) | 0.82 (0.62) | 0.82 (0.59) | 0.80 (0.60) |
| E2 | Eelec, Evdw, Edes | 0.87 (0.64) | 0.87 (0.61) | 0.87 (0.62) | 0.87 (0.62) | 0.85 (0.68) | 0.87 (0.63) |
| C | CC, CP, CA, PP, AP, AA, ANIS, CNIS, PNIS, LD, G, A, L, M, F, W, K, Q, E, S, P, V, I, C, Y, H, R, N, D, T, Eelec, Evdw, Edes | 0.92 (0.72) | 0.93 (0.73) | 0.92 (0.74) | 0.93 (0.72) | 0.90 (0.77) | 0.92 (0.74) |

  1. Accuracy values calculated according to Eq. 2 in “Methods”
  2. Predictive accuracies are reported for the classification models tested. Nine sets of features were used to train predictive models, based on structural properties (S1, S2, S3, S4, S5, S6), energetics (E1, E2) and a combination of structure and energetics (C). For each set of training features, five machine learning algorithms were used for training (Bagging, Random Forest, Adaptive Boosting, Gradient Boosting and Neural Network). For each trained model, the accuracy on the Many dataset [34], reported as the average over a 10-fold cross validation, and the accuracy on the independent DC dataset [15] (numbers in brackets) are given; a code sketch of this evaluation protocol follows below
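The evaluation protocol described in note 2 can be sketched as follows. This is a minimal illustration using scikit-learn with default hyperparameters, not the authors' actual code: the `evaluate` helper, the feature matrices and the classifier settings are assumptions for demonstration only.

```python
# Minimal sketch of the protocol in note 2, assuming scikit-learn defaults.
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

MODELS = {
    "Bagging": BaggingClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Adaptive Boosting": AdaBoostClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Neural Network": MLPClassifier(max_iter=1000),
}

def evaluate(X_many, y_many, X_dc, y_dc):
    """X_*: per-interface feature matrices (e.g. the S6 or C feature sets);
    y_*: binary labels (biological vs. crystallographic interface)."""
    for name, clf in MODELS.items():
        # Accuracy on the Many dataset: mean over 10-fold cross-validation.
        cv_acc = cross_val_score(clf, X_many, y_many, cv=10,
                                 scoring="accuracy").mean()
        # Accuracy on the DC dataset (bracketed values in Table 1):
        # train on the full Many set, then test on the independent DC set.
        dc_acc = clf.fit(X_many, y_many).score(X_dc, y_dc)
        print(f"{name}: Many CV = {cv_acc:.2f}, DC = {dc_acc:.2f}")
```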