The Biodiversity Informatics Potential Index
© BioMed Central Ltd 2011
Published: 15 December 2011
Biodiversity informatics is a relatively new discipline extending computer science in the context of biodiversity data, and its development to date has not been uniform throughout the world. Digitizing effort and capacity building are costly, and ways should be found to prioritize them rationally. The proposed 'Biodiversity Informatics Potential (BIP) Index' seeks to fulfill such a prioritization role. We propose that the potential for biodiversity informatics be assessed through three concepts: (a) the intrinsic biodiversity potential (the biological richness or ecological diversity) of a country; (b) the capacity of the country to generate biodiversity data records; and (c) the availability of technical infrastructure in a country for managing and publishing such records.
Broadly, the techniques used to construct the BIP Index were rank correlation, multiple regression analysis, principal components analysis and optimization by linear programming. We built the BIP Index by finding a parsimonious set of country-level human, economic and environmental variables that best predicted the availability of primary biodiversity data accessible through the Global Biodiversity Information Facility (GBIF) network, and constructing an optimized model with these variables. The model was then applied to all countries for which sufficient data existed, to obtain a score for each country. Countries were ranked according to that score.
Many of the current GBIF participants ranked highly in the BIP Index, although some of them seemed not to have realized their biodiversity informatics potential. The BIP Index attributed low ranking to most non-participant countries; however, a few of them scored highly, suggesting that these would be high-return new participants if encouraged to contribute towards the GBIF mission of free and open access to biodiversity data.
The BIP Index could potentially help in (a) identifying countries most likely to contribute to filling gaps in digitized biodiversity data; (b) assisting countries potentially in need (for example mega-diverse) to mobilize resources and collect data that could be used in decision-making; and (c) allowing identification of which biodiversity informatics-resourced countries could afford to assist countries lacking in biodiversity informatics capacity, and which data-rich countries should benefit most from such help.
Progress in biodiversity informatics (methodologies and tools extending contemporary computer science and informatics principles in the context of biodiversity data) is not homogeneous throughout the world, with the differences apparently due more to the economic status of countries than to their estimated biodiversity richness, as is the case for data availability in literature. Digitizing all available data already existing in analog form or locked in unavailable databases has been shown to be impractical [2, 4, 5]. Therefore, digitizing efforts, related informatics infrastructure development and capacity building, being limited, should be both prioritized and encouraged.
help identify countries or economies most likely to be able to contribute to filling gaps in digitized data, as well as being most likely to absorb, implement and reliably build required informatics infrastructure and capacity in biodiversity informatics;
provide a prioritization mechanism, by integrating a number of parameters that might be related to the state of biodiversity informatics in individual countries: infrastructure capacity (financial, human and technical resources), data accessibility, and fitness for use of accessible data;
help countries, especially those with the most need (for example mega-diverse countries, or those whose biodiversity is most endangered), to mobilize resources and collect data that could be used in decision-making; and
be used as an equalizing measure involved in any biodiversity informatics compensation mechanisms across countries; for instance, the BIP Index might allow identification of countries with a high level of biodiversity informatics resources that could afford to invest some of those resources in countries lacking them, in an efficient way that would be most likely to produce useful, quality data after initial capacity building.
The intrinsic biodiversity potential of a country (broadly, its biological or ecological richness and factors favoring it), related to its physical, biological and environmental characteristics.
The raw data generation potential, producing basic data records (specimens, samples, observations), and
The quality data generation potential, producing biodiversity value-added records by generating additional data enhancing their fitness for use.
The availability of technical infrastructure in a country for hosting, managing and sharing biodiversity data records, both produced in the country as a result of its own biodiversity potential and data generation capacity, or existing in the country as a result of research efforts directed towards other countries.
The capacity to generate primary biodiversity data, and
The capacity to discover, curate and make available such data for public access.
In this context,
Primary biodiversity data are documented events manifesting the occurrence of an identified biological entity in a definite space and time;
Primary biodiversity data are atomized into primary biodiversity records (PBRs) that can be hosted by the country generating them, or by any other country; and
'Hosting' here means that a facility in a country makes the PBRs accessible to any interested party, following the principles of free and open access to data.
With these definitions in mind, the BIP Index is a composite of a number of country-level indicator variables (data, statistics or indexes representing any measurable, scalable or ordered concept that are available as a single measure for a country) that can predict the state of biodiversity informatics in countries.
Dimensions. To identify adequate variables, some response variables or known proxies for the state of biodiversity informatics were needed. Predictor variables could then be compared with the proxies wherever cases could be found, and a general model could be derived that would be applicable to the remaining countries.
Number of PBRs occurring in each country (whether published by that same country or by another country), hereinafter DAT, as an indicator of the potential raw biodiversity data produced in that country.
Number of geo-referenced PBRs occurring in each country (whether published by that same country or by another country), hereinafter GRF, as an indicator of the higher quality biodiversity data produced in that country.
Number of PBRs made public by a country (whether occurring in that same country or in another country), hereinafter HOST, as an indicator of the technical hosting capacity of that country.
Number of different taxa, generally at the species level, listed in the PBRs occurring in a country, hereinafter SPCS, as an indicator of the potential raw biodiversity data existing in that country.
The BIP Index is a composite of predictions for these four dimensions based on the predictor variables, tested against these dimensions known from current GBIF participant countries.
DAT and GRF are closely related variables (GRF being a subset of DAT) and in the final BIP Index formulation, these two dimensions are weighted and amalgamated into one, yielding the three-dimensional vector that forms the current version of the BIP Index. Further, SPCS can be combined with the DAT-GRF dimension into the 'data generation' axis, theoretically orthogonal (but not uncorrelated) to the 'data hosting' axis represented by HOST. In theory, a country with rich biodiversity (SPCS) and large biomass-related size (DAT-GRF) should have a higher potential to produce biodiversity data, other parameters being equal.
Predictor variables. The BIP Index attempts to explain the response variables from a relatively small set of meaningful predictor variables. Thus, much of the work in developing the BIP Index was choosing which predictors, from many available, would contribute to the formulation of the BIP Index and which predictors would have little or no predicting power and could be discarded.
Economic power indicators, which may underlie efforts at directing resources towards research and obtaining data. These can in turn be related to sociological indicators, as well as raw power. Example indicators are: gross domestic product (GDP), purchasing power parity (PPP), per-capita income (PCI) and economic models; geographical indicators such as size and exclusive economic zone (EEZ); social indicators such as population, percentage literacy, percentage employment and Gini coefficient.
Data potential indicators. Biodiversity richness, as measured through appropriate proxies that may result in data: higher biodiversity or larger relative natural areas might mean more potential data. Conversely, reduced biodiversity through soil use may reduce data expectation. Example parameters are: species richness and diversity, hotspots, ecological footprint, number of endemic species and number of collections.
Informatics capacity. Data availability can be enhanced by economic power, but databasing and sharing depend on information technology capacity. Example indicators are: digital opportunity index (DOI), educational level and bandwidth per capita.
Many predictor variables were naturally correlated with intrinsic country variables related to its 'size'. For instance, the total amount of parkland surface in a large country could naturally be larger than that of a smaller country. Therefore, those variables that would acquire a different meaning when taking into account some basic feature of the country were relativized into derived variables, by dividing them by the country's size, population, or gross domestic product (GDP). Some variables with skewed distributions were also log-normalized. Derived variables were added to the database.
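The derivation of relativized and log-normalized variables can be illustrated with a minimal sketch. It assumes that 'log-normalized' means a natural-log transform of strictly positive values; the function names, the field names and the figures are hypothetical and are not taken from the BIP Index database.

```python
import math

def relativize(value, size):
    """Divide a raw country total by a size measure (area, population or GDP)."""
    if value is None or size is None or size == 0:
        return None  # a missing input stays missing; nothing is imputed
    return value / size

def log_transform(value):
    """Natural log of a strictly positive, right-skewed variable (assumed transform)."""
    if value is None or value <= 0:
        return None
    return math.log(value)

# Hypothetical figures: protected area and total area, both in km^2.
countries = {
    "A": {"protected_km2": 50_000.0, "area_km2": 1_000_000.0},
    "B": {"protected_km2": 5_000.0, "area_km2": 20_000.0},
}
for row in countries.values():
    row["protected_share"] = relativize(row["protected_km2"], row["area_km2"])
    row["log_area"] = log_transform(row["area_km2"])
```

In this toy case the larger country A holds ten times more protected area than B in absolute terms, yet its relativized share is lower, which is the change of meaning that motivated the derived variables.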
Human welfare and social development indicators: DVH
Economic development indicators: DVE
Information technology indicators: ICT
Resource availability and power indicators: PWR
Financial power indicators: PWF
Biological diversity data indicators: BIO
Ecological, environmental and human impact indicators: ENV
Physical characteristics of country: GEO
Population size and features: POP
3,695 variables were identified as related to the development of countries or societies in category 1, which can be described as 'human indicators', dependent on human development. In addition, 202 variables related specifically to the technical infrastructure needed for informatics development. 1,093 variables were identified in category 2. Some of these may have been influenced by human development, but by themselves may evolve independently. Collectively, they describe the 'environment' that may in turn drive (or compose) biodiversity and therefore be related to the existence of data, irrespective of whether the data have been discovered or not. Category 3 includes variables related to the 'size' or 'weight' (such as area, GDP, or population) of the country that can be used to relativize other variables. 95 variables belonged to category 3.
Some of the variables were in turn composite indexes or ranks calculated from other variables. The main sources for these potential indicator variables were:
The Food and Agriculture Organization of the United Nations 
The Global Biodiversity Information Facility 
The Global Footprint Network 
The International Telecommunications Union 
The International Union for the Conservation of Nature 
The Legatum Institute 
The New Economics Forum 
The United Nations Development Program 
The United Nations Environment Program 
The World Bank's World Development Index Database 
The World Resources Institute 
The World Values Survey Network 
Furthermore, response (biodiversity informatics) data were also collected, including literature, meta-analyses of GBIF data, and results from at least two Task Group provisional reports: the Content Needs Assessment (CNA) Task Group (AHA, VC, and DP Faith, personal communication) and the Global Strategy and Action Plan for Mobilization of Natural History Collections Data (GSAP-NHC) Task Group.
Most variables were collected from the sources through organized queries, or in some cases digitized from semi-digital sources. Whenever possible or available, time series were collated as selected annual data. The time span ranged from 1990 to the latest available data, with a majority of series including data from 1990, 1995, 2000, and all the years in the 21st century up to 2008 or even 2009 for a few variables. In all, the collection included some 36,700 annual datasets under scrutiny.
As the different sources provided data in different formats, all data had to be compiled into a manageable data format. A database was constructed with a common field structure to accommodate data from disparate sources in a way amenable to analysis. The table-like sources were converted into a vector file, where each record was an individual datum with attributes relating its source, type, variable name, year, and country. This file, containing over 4 million records for primary (not derived) variables, including missing values, became the base source.
The next step was to reorganize the data into time series and variables. From the base source, tables of country versus latest available variable (or country versus year versus variable) were produced as needed and a working file containing the latest available data from selected variables for each country, as well as the derivative variables, was created. This 800,000-record table was the one effectively subjected to statistical analysis (Figure 3) and is available online as a CSV file.
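The two data-preparation steps just described (flattening table-like sources into one datum per record, then extracting the latest available value per country and variable) might look like the following sketch. The record layout follows the attributes named in the text; the class name, function names, source label and values are hypothetical.

```python
from typing import Dict, Iterator, NamedTuple, Optional, Tuple

class Datum(NamedTuple):
    """One record of the vector file: a single value with its attributes."""
    source: str
    var_type: str
    variable: str
    year: int
    country: str
    value: Optional[float]  # None marks a missing value, which is kept

def melt(source: str, var_type: str, variable: str,
         table: Dict[str, Dict[int, Optional[float]]]) -> Iterator[Datum]:
    """Flatten a country-by-year table into individual records."""
    for country, by_year in table.items():
        for year, value in by_year.items():
            yield Datum(source, var_type, variable, year, country, value)

def latest_available(records) -> Dict[Tuple[str, str], Tuple[int, float]]:
    """Map (country, variable) to the (year, value) of its most recent non-missing datum."""
    latest = {}
    for r in records:
        if r.value is None:
            continue
        key = (r.country, r.variable)
        if key not in latest or r.year > latest[key][0]:
            latest[key] = (r.year, r.value)
    return latest

# Hypothetical table-like source: two countries, two years, one missing cell.
wide = {"AAA": {2000: 1.5, 2005: None}, "BBB": {2000: 3.0, 2005: 3.2}}
records = list(melt("hypothetical-source", "primary", "example_var", wide))
working = latest_available(records)
```

Keeping missing values as explicit records in the base file, and skipping them only when the working table is built, lets the latest usable year differ from country to country.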
Although the constructed database contained country- and year-specific data that theory suggested could have had some meaning (either known or potential) for the drivers or dimensions of the BIP Index, there was no point in including too many variables in the index. If there were too many missing values, for instance, meaningful inference could be prevented. Besides, the purpose of the BIP Index was not only to predict biodiversity informatics capacity, but also to provide some insight on what factors were important and what were not. Therefore, an initial filtering of variables was made by discarding those not significantly correlated with at least one of the dimensions (Figure 2).
As a majority of variables and all response variables showed non-normal distributions, and many resisted statistical renormalization attempts, Spearman's rank correlation was chosen as the filter. Two groups of variables were discarded: those with non-significant correlations, and those with significant correlations but a Spearman's rank correlation coefficient < 0.5 ('low-response' variables). Correlations were computed pair-wise, using all possible data pairs for each predictor-response pair. About 50% of the variables were thus discarded. The remaining variables were replaced by their ranks and normalized (rescaled) to lie between 0 (lowest rank of the set) and 1 (highest rank); the normalization was of the type:
x(n) = [X - X(min)]/[X(max) - X(min)]. (1)
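A minimal sketch of the rank-correlation filter and of equation (1) applied to ranks follows. It uses only the Python standard library, treats ties by average ranks, and omits the significance test. Comparing the absolute value of the coefficient with 0.5 is an assumption, as are all function names and the example data.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman's rho over the pairs in which both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    xs, ys = zip(*pairs)
    return pearson(average_ranks(xs), average_ranks(ys))

def keep_predictor(x, y, threshold=0.5):
    """Retain a predictor whose |rho| reaches the threshold (assumed reading)."""
    return abs(spearman(x, y)) >= threshold

def rescaled_ranks(values):
    """Equation (1) applied to ranks: lowest rank maps to 0, highest to 1."""
    r = average_ranks(values)
    lo, hi = min(r), max(r)
    if hi == lo:
        return [0.0] * len(r)  # degenerate case: every value is tied
    return [(v - lo) / (hi - lo) for v in r]

# Hypothetical predictor with one missing value, and a response variable.
predictor = [10.0, 200.0, None, 35.0, 3000.0]
response = [5, 80, 12, 20, 900]
rho = spearman(predictor, response)
```

Because only ranks enter the calculation, the heavy skew in the example predictor has no effect on the coefficient.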
Initial number of predictor variables
A known problem in correlating a set of predictor variables with a set of response variables is the effect of high correlations between predictors that may appear, lending these predictors undue weight. In multiple regression models, this is known as collinearity. To remove this effect, highly correlated predictor variables were substituted by a composite created from a principal components analysis (PCA), which was also tested by regression against the response variables.
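As an illustration of replacing collinear predictors with a single composite, the sketch below extracts the first principal component of a few correlated variables by power iteration on their covariance matrix. It is a simplified stand-in for a full PCA, not the authors' procedure: it assumes the variables are positively correlated (so that the all-ones starting vector is not orthogonal to the first component), and the function name and data are hypothetical.

```python
def first_component(columns, iterations=200):
    """Return (loadings, scores) of the first principal component.

    columns: list of equal-length lists, one per correlated variable.
    """
    n = len(columns[0])
    p = len(columns)
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    # Sample covariance matrix of the centered variables.
    cov = [[sum(a * b for a, b in zip(centered[i], centered[j])) / (n - 1)
            for j in range(p)] for i in range(p)]
    # Power iteration: repeated multiplication converges to the leading eigenvector.
    w = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    # Composite score of each country: projection onto the first component.
    scores = [sum(w[j] * centered[j][i] for j in range(p)) for i in range(n)]
    return w, scores

# Two hypothetical, perfectly collinear predictors for four countries.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]
loadings, composite = first_component([a, b])
```

The composite then enters the regression as a single variable, so the shared signal of the collinear predictors is counted once rather than several times.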
The missing values for the variables were also a cause of major concern. The prevalence of missing data forced the index to use available data only, rather than the usual sum of components found in common multiple regression models. As Inboden and Streeter explain, ideally all variables contributing to a composite index should have data, as the index would otherwise lack a component. There are three possible approaches to solve this: data imputation (missing data are substituted by a reasonable imputation), flexible indexing (the contribution of each variable to the index for a country is weighted according to the number of variables for which data are available), or discarding the variable. In the BIP Index, variables with excessive missing data were discarded either totally or from the country's index, and imputations were not made, but the indexes were weighted according to the number of variables available for each country. For the final composite BIP Index, a measure of the degree to which the missing variables may have affected the result is provided, and countries with excess missing variables were not issued a BIP Index ranking.
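Flexible indexing can be sketched as a weighted mean taken only over the variables that have data for a given country, reported together with the share of missing variables. Renormalizing by the sum of the available weights is one plausible reading of the weighting described above, and the cut-off `max_missing_share` is a hypothetical parameter; the text does not state the threshold actually used.

```python
def flexible_index(values, weights, max_missing_share=0.5):
    """Weighted mean over available variables, plus the share that was missing.

    values:  {variable: rescaled rank, or None if missing}
    weights: {variable: weight (for example a regression coefficient)}
    Returns (score, missing_share); score is None when too much is missing.
    """
    present = {k: v for k, v in values.items() if v is not None}
    missing_share = 1 - len(present) / len(values)
    if missing_share > max_missing_share:
        return None, missing_share  # too many gaps: no index is issued
    total_weight = sum(abs(weights[k]) for k in present)
    if total_weight == 0:
        return None, missing_share
    score = sum(weights[k] * v for k, v in present.items()) / total_weight
    return score, missing_share

# Hypothetical country with one of three variables missing.
score, missing = flexible_index(
    {"a": 0.2, "b": None, "c": 0.8},
    {"a": 1.0, "b": 2.0, "c": 3.0},
)
```

No value is imputed for variable `b`; its weight simply drops out of the denominator, and the returned missing share indicates how much the score may have been affected.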
Multiple regression analyses (MRA) were used to obtain an approximate idea of the degree to which variations in the rank of the predictor variables, for instance number of endemic species, might correlate with variations in the rank of the response variables such as amount of digitally available data. The MRA coefficients thus became the initial parameters of the model, which could also be further adjusted empirically at a later stage (Figure 2).
Thirty-six step-wise MRAs were performed, one for each driver against each dimension (nine drivers by four dimensions). Only significantly correlated variables were retained in the model. For each retained variable, the regression coefficient c_i was saved for use in the model as a weight factor for the ith variable in the model, x_i.
where x_i is each of the s variables used directly in the driver, w_aj is each of the n correlated variables that are replaced by the jth PC, z_aj is the weight assigned to w_aj within the jth PC, and c_i, c_j are the regression coefficients of the variables or PCs against the response variables.
Final number of predictor variables
To each driver for each dimension, a coefficient f_dk was given to weight the driver within the final BIP Index: a higher coefficient would mean a higher importance of that driver in that dimension, relative to other drivers in the same dimension. For instance, if the coefficient for driver DVH was low for dimension GRF, that would mean that DVH variables would have little impact on the GRF capacity. Although in theory this coefficient could be chosen arbitrarily on the basis of judgment, in the BIP Index the drivers' coefficients were found by linear programming (LP) so as to obtain the highest possible correlation between the drivers and the response variables.
The initial, seed values of the coefficients for the LP optimization process were those of the MRA coefficients for each driver. Drivers were combined and the resulting BIP Index dimension was tested against the corresponding response variable: for instance, all nine drivers for DAT were weighted by their coefficients (resulting from the corresponding MRA), and then these coefficients f_dk were made to fluctuate in a Monte Carlo loop by random walk. On each loop, the correlation coefficient was reevaluated and the new values of f_dk were retained if the correlation had increased. The loop was repeated until no improvement was observed in the correlation coefficient.
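The coefficient search can be sketched as a simple accept-if-better random walk, following the Monte Carlo loop described in the text rather than a linear-programming solver. The step size, the stopping rule (`patience` consecutive failures, capped by `max_iter`), the use of Pearson correlation on already rank-transformed scores, and all names and data are assumptions made for illustration.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def combine(drivers, coeffs):
    """Weighted sum of driver scores for each country."""
    n = len(drivers[0])
    return [sum(c * d[i] for c, d in zip(coeffs, drivers)) for i in range(n)]

def random_walk_fit(drivers, response, seed_coeffs,
                    step=0.05, patience=500, max_iter=20000, rng=None):
    """Perturb the coefficients at random; keep a move only if correlation rises."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    best = list(seed_coeffs)
    best_r = pearson(combine(drivers, best), response)
    stale = 0
    for _ in range(max_iter):
        if stale >= patience:
            break  # no improvement for `patience` consecutive trials
        trial = [c + rng.uniform(-step, step) for c in best]
        try:
            r = pearson(combine(drivers, trial), response)
        except ZeroDivisionError:
            r = best_r  # degenerate trial (zero variance): treat as no gain
        if r > best_r:
            best, best_r, stale = trial, r, 0
        else:
            stale += 1
    return best, best_r

# Hypothetical rescaled scores of two drivers and one response for five countries.
drivers = [[0.1, 0.4, 0.5, 0.9, 0.3], [0.8, 0.2, 0.6, 0.1, 0.5]]
response = [0.2, 0.3, 0.6, 0.7, 0.4]
seed = [0.5, 0.5]  # stands in for the MRA coefficients
start_r = pearson(combine(drivers, seed), response)
fitted, best_r = random_walk_fit(drivers, response, seed)
```

Because a move is accepted only when the correlation increases, the result can never be worse than the seed, but it may settle in a local optimum; restarting from several seeds would be one way to check its stability.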
Once the coefficients for drivers were found by LP (each driver, in turn, being a combination of predictor variables or PCA scores of variables), a BIP Index dimension was found as an average of drivers available for such dimension.
The final BIP Index score, used to rank the countries, was a combination of the four predicted dimensions M, obtained by weighted Euclidean distance of SPCS, HOST, and the weighted average of GRF and DAT. To attribute relative importance to each dimension, another coefficient e_a was applied to each dimension. This coefficient was entirely arbitrary and based solely on expert judgment, and actually constitutes a tuning factor for the BIP Index that allows it to stress any of the concept groups in it: data generation, or data hosting. Although we have judged the four dimensions as shown below (see 'Overall BIP Index'), stressing data publishing and intrinsic biodiversity potential more than raw data generation capacity, other uses of the BIP Index may seek to rank countries according to this capacity using appropriate e_a coefficients.
and D_dk is as in equation (2).
Additional file 1 shows the set of variables selected by rank correlation, MRA and PCA for each driver in each dimension. Beta is the corrected regression coefficient for the variable, or the PCA score on component 1 of the corresponding PCA. (In the model, PCA scores have been transformed to percentages of PCA scores; they should not be compared directly with regression coefficients for the raw variables.) Coefficients are applicable to the standardized ranks of variables.
Table of coefficients
The overall BIP Index for a country has been defined as the average Euclidean distance to the origin of the dimensions in the BIP Index (DAT, GRF, HOST, SPCS). In the current formulation of the BIP Index, these dimensions have been assigned the following coefficients: DAT: 0.1; GRF: 0.2; HOST: 0.4; SPCS: 0.3.
Therefore, a country is a point in a four-dimensional space, the dimensions being the four BIP Index components multiplied by their importance coefficients.
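Taking this four-dimensional description at face value, the overall score can be sketched as the Euclidean distance to the origin of the four dimension scores after each has been multiplied by its importance coefficient. The amalgamation of DAT and GRF into one dimension and any averaging step are not modeled; the function names and the dimension scores are hypothetical, and only the four coefficients come from the text.

```python
import math

# Importance coefficients stated in the text for the current formulation.
IMPORTANCE = {"DAT": 0.1, "GRF": 0.2, "HOST": 0.4, "SPCS": 0.3}

def bip_score(dims, importance=IMPORTANCE):
    """Distance to the origin of the importance-weighted dimension scores."""
    return math.sqrt(sum((importance[k] * dims[k]) ** 2 for k in importance))

def rank_countries(dims_by_country):
    """Country names ordered from highest to lowest score."""
    return sorted(dims_by_country,
                  key=lambda c: bip_score(dims_by_country[c]),
                  reverse=True)

# Hypothetical predicted dimension scores on the 0 to 1 scale.
example = {
    "hoster": {"DAT": 0.0, "GRF": 0.0, "HOST": 1.0, "SPCS": 0.0},
    "generator": {"DAT": 1.0, "GRF": 0.0, "HOST": 0.0, "SPCS": 0.0},
}
ranking = rank_countries(example)
```

With these coefficients a country that is strong only in hosting outranks one that is strong only in raw data generation, which reflects the stated emphasis on data publishing; other choices of coefficients would change the ranking.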
The plot also shows the potential for data share equalization. Countries in the bottom right region of the plot are not likely to produce many data, but could host data from large potential data-generator countries in the top left part of the plot that may lack this capacity.
Rank of selected countries according to their Biodiversity Informatics Potential Index
KOREA, REPUBLIC OF
IRAN (ISLAMIC REPUBLIC OF)
UNITED ARAB EMIRATES
SLOVAKIA (SLOVAK REPUBLIC)
SYRIAN ARAB REPUBLIC
TRINIDAD AND TOBAGO
LIBYAN ARAB JAMAHIRIYA
MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF
CENTRAL AFRICAN REPUBLIC
BOSNIA AND HERZEGOVINA
MOLDOVA, REPUBLIC OF
It should be noted once again that the BIP Index is calculated on standardized ranks of the variables. Therefore, relative differences in the BIP Index between countries do not translate into a measure of potential other than for the specific purpose of ranking the countries according to this scale.
To the best of our knowledge, the BIP Index as presented here is the first ever attempt at developing and prototyping a metric for (a) assessing progress to date, (b) rationalizing future investment and (c) ensuring uniform progress in the field of biodiversity informatics. During the conceptualization and prototyping exercise, we have tried to ensure that all possible parameters and factors that would affect such an index, and for which data could be found, were taken into consideration. Nevertheless, we recognize that arguments can always be put forward in favor of inclusion of some additional factors and omission of some existing ones. Thus, the BIP Index is and will continue to be a complex, evolving exercise. This is mainly because a multitude of factors influence the relevance, robustness and acceptance of such an index. In the future, three key aspects will improve the relevance, robustness and acceptance of the BIP Index: (i) validation, (ii) indicator robustness and (iii) increased attention to and investment in biodiversity informatics.
This being the first BIP Index, its outcomes and inferences drawn from it need to be tested and verified in biodiversity rich (especially mega-diverse), developing and under-developed regions, as well as data-rich countries. This will help in realizing the relative fitness of the BIP Index, and identifying parameters that will further strengthen the index. It is therefore essential that feedback be received from the stakeholder communities and experts involved in development of similar indices on the significance and usability of such an index, before the next version of the BIP Index. Specific inputs on the methodology adopted, inclusion and/or omission of parameters will be extremely useful in enhancing the robustness and usefulness of the BIP Index.
The present version of the BIP Index has been developed by drawing data from multiple sources. Thus, granularity and temporal scales of these data resources vary from one another. As evident from preceding sections, normalization of such heterogeneous and multi-varied indicators is a daunting task, which makes developing an index of this nature a complex process. During this exercise we felt the need for increased accessibility to key data and parameters that might influence the BIP Index, especially data on the state of the art of biodiversity information and biodiversity informatics in non-GBIF countries, because a mechanism to access such data from these nations is currently lacking. Thus, accessibility to more up-to-date, accurate data on various parameters will help in developing a stable, credible and representative BIP Index.
Biodiversity informatics as a scientific discipline is in its relatively early stages, and is not recognized as a mainstream discipline on an equal footing in all regions of the globe. Furthermore, it receives a varied degree of scientific and socio-political attention in different regions. Thus, the global investment in biodiversity informatics is unequal. We believe that outcomes and inferences of the BIP Index will encourage a rationalization and harmonization process of increased yet uniform attention and investment in biodiversity informatics, especially in the regions with high potential to make rapid progress. This will generate more data on parameters that influence BIP Index development and its robustness.
We therefore hypothesize that the relevance, robustness and acceptance of the BIP Index are directly proportional to validation, indicator robustness, and attention to and investment in biodiversity informatics.
A further issue is our choice of countries as units for developing the BIP Index. Our choice of a 'country-based BIP Index' is intentional because attention and investment in biodiversity informatics is determined and influenced by nations on the basis of several considerations and not by the sub-disciplines, ecosystem focus or priorities.
Finally, there is a need for furthering development and communication of this and subsequent versions of the BIP Index by the GBIF. We believe that GBIF, being the inter-governmental initiative in the area of biodiversity informatics, is the natural venue to support the development of the BIP Index. As GBIF aims to be the foremost global resource for biodiversity information, it requires a mechanism and/or instrument to (a) assess the state of the art of biodiversity informatics, (b) suggest the potential of countries to strengthen, advance and benefit from investment in biodiversity informatics, and (c) harmonize global progress in biodiversity informatics. We believe that the BIP Index provides one such comprehensive mechanism that can encourage countries in strengthening, investing and collaborating to ensure that biodiversity information is freely and openly accessible to anyone, anytime and anywhere for the benefit of science, society and a sustainable future.
Improved discovery and accessibility of biodiversity data helps to address both scientific and social issues. Furthermore, it is essential for informed decisions for sustainable development of biotic resources and the ecosystems that harbor them. However, this calls for uniform spread and accessibility of such data. Unfortunately, our progress in biodiversity informatics to date is not uniform across the globe. We do not yet have a mechanism to measure our progress in biodiversity informatics that can encourage countries in making demand-driven and deterministic investment in achieving uniform progress in biodiversity informatics. We believe that such uniform progress will help to reduce the existing imbalance in access to biodiversity data.
The BIP Index could potentially help in identifying countries most likely to contribute to filling gaps in digitized biodiversity data; assist countries potentially in need (for example mega-diverse countries) to mobilize resources and collect data that could be used in decision-making; and allow identification of which biodiversity-informatics-resourced countries could afford to assist countries lacking in biodiversity informatics capacity.
Further investigations in stabilizing and enriching the BIP Index are essential. Following validation, appropriate parameterization is likely to be essential during the next version of the BIP Index to ascertain or enhance its robustness. This will certainly require a number of iterations of the BIP Index in years to come. Given the political attention and trend of increased investment in biodiversity science, the BIP Index will help in rationalizing such an investment, leading to better understanding of the state and progress in the area of biodiversity informatics. The BIP Index should prove a useful tool for local to global initiatives such as the Intergovernmental Panel on Climate Change (IPCC), the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES), the Convention on Biological Diversity (CBD), and Group on Earth Observations Biodiversity Observation Network (GEO-BON). As the BIP Index proves useful in harmonizing the generation, discovery, publishing and accessibility of biodiversity data, it can potentially form an essential mechanism in the science-policy-society interface for biodiversity.
All authors are grateful to the University of Navarra and to the Global Biodiversity Information Facility, and to Tim Hirsch for comments.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 15, 2011: Data publishing framework for primary biodiversity data. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S15. Publication of the supplement was supported by the Global Biodiversity Information Facility.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.