Mining protein loops using a structural alphabet and statistical exceptionality

BMC Bioinformatics

Table 1 Quantification of the structural word extraction from the non-redundant data set.

Words	1: seen more than 30 times	2: under-represented	3: non-significant	4: over-represented
Number of words	3310	166	2214	930
(%)	(11.7%)	(5.0%)	(66.9%)	(28.1%)
Number of fragments	249953	11435	129781	108737
(%)	(60.2%)	(4.6%)	(51.9%)	(43.5%)
Nb fragments/word	75.5	68.9	58.6	116.9*
All-loop coverage rate	72.7%	5.1%	46.5%	40.2%
Short-loop coverage rate	70.3%	4.4%	38.9%	39.3%
Long-loop coverage rate	74.9%	5.7%	53.9%	41.1%
Loops containing at least one word	84.8%	9.8%	60.3%	58.2%
Short loops containing at least one word	79.7%	6.1%	48.1%	49.4%
Long loops containing at least one word	97.8%	19.1%	90.9%	80.4%

Columns: 1, words seen more than 30 times; 2, under-represented words; 3, non-significant words; 4, over-represented words. '*': significantly higher occurrence according to a Kruskal-Wallis test. Coverage rates are given on a per-structural-letter basis. Numbers within brackets give the percentage of words/fragments relative to the 28274 words/415071 fragments of the whole data set (column 1) and relative to the 3310 words/249953 fragments in W set_≥30 (columns 2 to 4).
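
The bracketed percentages and the "Nb fragments/word" row follow directly from the counts in Table 1, so they can be re-derived as a consistency check. The snippet below is a minimal sketch of that arithmetic: the counts are copied from the table, while the dictionary keys and function names (`pct`, `summarize`) are illustrative assumptions and not part of the original study.

```python
# Counts copied from Table 1. Key names are hypothetical labels for columns 1 to 4.
TOTAL_WORDS = 28274        # words in the whole data set
TOTAL_FRAGMENTS = 415071   # fragments in the whole data set

WORDS = {"seen_30": 3310, "under": 166, "non_significant": 2214, "over": 930}
FRAGMENTS = {"seen_30": 249953, "under": 11435,
             "non_significant": 129781, "over": 108737}


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal as in the table."""
    return round(100.0 * part / whole, 1)


def summarize():
    """Return {column: (word %, fragment %, fragments per word)}."""
    rows = {}
    for col in WORDS:
        # Column 1 is expressed relative to the whole data set;
        # columns 2 to 4 are expressed relative to column 1.
        if col == "seen_30":
            word_base, frag_base = TOTAL_WORDS, TOTAL_FRAGMENTS
        else:
            word_base, frag_base = WORDS["seen_30"], FRAGMENTS["seen_30"]
        rows[col] = (
            pct(WORDS[col], word_base),
            pct(FRAGMENTS[col], frag_base),
            round(FRAGMENTS[col] / WORDS[col], 1),
        )
    return rows


if __name__ == "__main__":
    for col, (w_pct, f_pct, per_word) in summarize().items():
        print(f"{col}: {w_pct}% of words, {f_pct}% of fragments, "
              f"{per_word} fragments/word")
```

The three word classes in columns 2 to 4 partition column 1, so their word and fragment counts should sum to the column 1 totals; the same dictionaries can be used to confirm this.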

ISSN: 1471-2105