SEDE-GPS: socio-economic data enrichment based on GPS information

Background: Microbes are essential components of all ecosystems because they drive many biochemical processes and act as primary producers. In freshwater ecosystems, the biodiversity and composition of microbial communities can be used as indicators of environmental quality. Recently, some environmental features that influence microbial ecosystems have been identified. However, the impact of human action on lake microbiomes is not well understood. This is, in part, because environmental data, albeit theoretically accessible, are not easily available.
Results: In this work, we present SEDE-GPS, a tool that gathers data relevant to the environment of a user-provided GPS coordinate. To this end, it accesses a list of public and corporate databases and aggregates the information in a single file, which can be used for further analysis. To showcase the use of SEDE-GPS, we enriched a lake microbial ecology sequencing dataset with around 18,000 socio-economic, climate, and geographic features. The sources of SEDE-GPS are public databases such as Eurostat, the Climate Data Center, and OpenStreetMap, as well as corporate sources such as Twitter. Using machine learning and feature selection methods, we were able to identify features in the data provided by SEDE-GPS that can be used to predict lake microbiome alpha diversity.
Conclusion: The results presented in this study show that SEDE-GPS is a handy and easy-to-use tool for comprehensive data enrichment for studies of ecology and other processes that are affected by environmental features. Furthermore, we present lists of environmental, socio-economic, and climate features that are predictive for microbial biodiversity in lake ecosystems. These lists indicate that human action has a major impact on lake microbiomes.
SEDE-GPS and its source code are available for download at http://SEDE-GPS.heiderlab.de. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2419-4) contains supplementary material, which is available to authorized users.


Background
The global positioning system (GPS), established in 1972 and made publicly available in 2000, allows for the exact identification of every spot on the surface of the earth [1]. Consequently, when studying geographically localized objects or processes such as ecosystems, their location can easily be specified using GPS coordinates.
Many natural processes are strongly influenced by characteristics of their surroundings. For example, it is known that chemical composition, the size of different habitats, and socio-economic features such as human population size can influence the (microbial) biodiversity in ecosystems [2][3][4][5]. Therefore, having access to environmental characteristics and including them in analyses is crucial when trying to understand natural processes.
*Correspondence: dominik.heider@uni-marburg.de; Faculty of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35032 Marburg (Lahn), Germany. Full list of author information is available at the end of the article.
In the current study, we describe the novel tool SEDE-GPS (Socio-economic data enrichment based on GPS information), which can be used to enrich data sets with data from public and publicly available corporate databases based on user-specified GPS information. The current version of SEDE-GPS accesses Open Street Map (OSM), the Climate Data Center (CDC), Eurostat, and Twitter. SEDE-GPS has an easy-to-use graphical user interface and enables researchers to enrich their data with environmental and socio-economic information based on GPS information. This may lead to new insights into the influence of environmental and socio-economic features on a wide range of processes.
As an exemplary use case of SEDE-GPS, we use it to identify features that have an impact on microbial biodiversity. To this end, we calculate different alpha diversity metrics from a sequencing dataset sampled from a set of alpine lakes in Austria. We then use feature selection and machine learning methods to determine features from the output of SEDE-GPS that can be used to predict these alpha diversity metrics. Our results show that microbial Eukaryotes and Prokaryotes are impacted by different environmental features. Nevertheless, for both domains, the area and number of city structures (or lack thereof) and other human-related features carry high predictive power.

Implementation
SEDE-GPS can be used via both a graphical user interface (GUI) and a command line interface. As main input, SEDE-GPS takes a list of at least one GPS coordinate. Additionally, SEDE-GPS needs a set of parameters specifying which databases will be queried and restrictions on the subfields to be downloaded. In the GUI, these parameters can be selected via mouse click; in the command line version, however, they need to be specified in a config file. The output of the different modules implemented in SEDE-GPS is temporarily saved and removed after being merged into a final output file in CSV format. This is necessary because the output of SEDE-GPS can be too large for regular-sized memory.
In the following, we will discuss the sources for data enrichment currently used by SEDE-GPS (Fig. 1).

OSM: Land use statistics
Open Street Map (OSM) is a community-generated, worldwide map. It is used by SEDE-GPS to gather information on land use in the area that surrounds a given GPS position [6]. An area with a user-defined perimeter is extracted from the relevant map tiles of the OSM database. As OSM maps are represented in Mercator projection, SEDE-GPS compensates for latitudinal distortion. From this map excerpt, the relative number of pixels covered by different map legend objects is calculated by thresholding for their respective colors. This yields the fraction of the area around the user-provided GPS position that is covered by, e.g., forests, city structures, or bodies of water.
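The two steps described above, compensating for the latitude-dependent scale of the Mercator projection and thresholding pixel colors, can be sketched as follows. This is a minimal illustration and not the SEDE-GPS implementation: it assumes 256-pixel Web Mercator tiles (for which the ground resolution at zoom 0 is about 156543.03392 m per pixel at the equator), and the function names, the tolerance value, and any legend color passed in are hypothetical.

```python
import math

# Ground resolution of Web Mercator at zoom 0 on the equator for
# 256-pixel tiles, in meters per pixel (assumption: standard tile size).
EQUATOR_M_PER_PX = 156543.03392


def meters_per_pixel(lat_deg, zoom):
    """Ground resolution at a given latitude and zoom level.

    The cos(latitude) factor compensates for the Mercator distortion:
    one pixel covers less ground the further away it is from the equator.
    """
    return EQUATOR_M_PER_PX * math.cos(math.radians(lat_deg)) / (2 ** zoom)


def radius_in_pixels(radius_m, lat_deg, zoom):
    """Convert a radius in meters into a radius in map pixels."""
    return radius_m / meters_per_pixel(lat_deg, zoom)


def color_fraction(pixels, target_rgb, tolerance=10):
    """Fraction of pixels whose RGB channels all lie within
    `tolerance` of the (hypothetical) legend color `target_rgb`."""
    if not pixels:
        return 0.0
    hits = sum(
        1
        for pixel in pixels
        if all(abs(c - t) <= tolerance for c, t in zip(pixel, target_rgb))
    )
    return hits / len(pixels)
```

For instance, `radius_in_pixels(1000, 47.0, 14)` gives the pixel radius that corresponds to 1 km at 47° N, and `color_fraction` applied to the pixels inside that radius gives the covered fraction for one legend color.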

OSM: POIs
In addition to the map itself, OSM also hosts a database that contains the locations of specific points of interest (POIs), such as special buildings or touristically relevant objects [6]. This module queries the OSM API and counts the number of the different POIs in a perimeter of user-defined size around the GPS coordinates. As the OSM API reacts to queries slowly, this module is the largest contributor to the runtime of SEDE-GPS. Therefore, for larger analyses, it is advisable to manually download the so-called planetfile from OSM and to use it as an additional input for SEDE-GPS.
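Counting POIs around a coordinate requires a bounding box, as also mentioned in the description of the modules. A minimal sketch of such a computation is given below. It assumes a spherical earth with a mean radius of 6,371 km and is only meant for small radii away from the poles; the function names and the POI tuple layout are hypothetical and do not reflect the OSM API or the SEDE-GPS code.

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6371000.0  # mean earth radius (spherical approximation)


def bounding_box(lat_deg, lon_deg, radius_m):
    """Return (south, west, north, east) of a box that encloses a circle
    of `radius_m` meters around the given coordinate."""
    dlat = math.degrees(radius_m / EARTH_RADIUS_M)
    # Meridians converge towards the poles, so the same distance spans
    # more degrees of longitude at higher latitudes.
    dlon = math.degrees(
        radius_m / (EARTH_RADIUS_M * math.cos(math.radians(lat_deg)))
    )
    return (lat_deg - dlat, lon_deg - dlon, lat_deg + dlat, lon_deg + dlon)


def count_pois(pois, bbox):
    """Count POIs per type inside the box.

    `pois` is an iterable of (poi_type, lat, lon) tuples (hypothetical
    layout, e.g. parsed from a local extract of the POI database).
    """
    south, west, north, east = bbox
    counts = Counter()
    for poi_type, lat, lon in pois:
        if south <= lat <= north and west <= lon <= east:
            counts[poi_type] += 1
    return counts
```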

Eurostat: detailed regional statistics
The Eurostat database contains highly detailed governmentally collected data from the EU and EFTA member states [7]. Its regional database provides statistics on the economic and social composition of centrally defined NUTS (Nomenclature des unités territoriales statistiques) regions. This module first determines the NUTS region that corresponds to the user-specified GPS position by querying the Google Maps database for the postal code of the GPS position. With around 17,500 features, this module's output represents 99.4% of all features gathered by SEDE-GPS.

CDC: European climate data
Via the CDC, an FTP server maintained by the Deutscher Wetterdienst (DWD), it is possible to publicly and freely access European climate data that dates back to 2010 [8]. The data has an interpolated spatial resolution of 5 km and a temporal resolution of a day or a month. This module requires a date as additional input and calculates average values of, e.g., temperature or windiness for the specified day, month, and/or year.

Twitter
The short messages sent out by users of Twitter (so-called tweets) can be location-tagged, and their number can be used to estimate tourist interest in a POI. The Twitter module of SEDE-GPS collects and counts tweets sent from a user-specified perimeter around the GPS coordinates. Twitter limits the access to its data so that SEDE-GPS can access all tweets that were sent in the last 7 days, but can only send 75 queries per 15 min. For a large number of GPS coordinates, this module will, therefore, require a long runtime.
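The effect of this rate limit on the runtime can be estimated with a short calculation. The sketch below assumes, purely for illustration, that every coordinate and radius needs exactly one query and that all queries of a 15-min window can be sent at once; the function name is hypothetical.

```python
import math


def min_wait_minutes(n_queries, per_window=75, window_min=15):
    """Lower bound on the waiting time imposed by the rate limit.

    The first `per_window` queries can be sent immediately; every
    further block of `per_window` queries has to wait for a new window.
    """
    if n_queries <= 0:
        return 0
    windows = math.ceil(n_queries / per_window)
    return (windows - 1) * window_min
```

Under these assumptions, 39 coordinates queried with three radii each (117 queries) would have to wait for at least one additional window, i.e. `min_wait_minutes(117)` returns 15, whereas 1000 queries would require a minimum of 195 min of waiting.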

Calculation of alpha diversity indices
The sequence data analyzed in the current study was taken from [9,10] (Additional file 1). It stems from a set of alpine Austrian lakes, which were sampled in order to study the change of the microbial ecosystems of three different lakes over time [9] and the difference in microbiome composition over many lakes [10]. 16S and 18S SSU rRNA sequences were sequenced using a 454 deep-sequencing amplicon approach [9,10]. In the current study, only samples that were taken in August 2006 and contain more than 1000 sequences were analyzed. 16S and 18S rRNA sequences were analyzed separately.
Fig. 1 (caption fragment) This data can then be used in analyses of any geo-located process. Due to the huge number of features present in the dataset after data enrichment with SEDE-GPS, we recommend including a feature selection step before using the gathered data for model construction, e.g., based on machine learning. Data sources are represented by their respective logos, which were taken from Wikimedia (https://commons.wikimedia.org/wiki/Main_Page)
In order to estimate biodiversity within the samples, we calculated four different alpha diversity indices, namely Shannon's entropy H, Simpson diversity D, Simpson evenness E, and the Chao1 estimator C, at the maximum possible sequencing depth with QIIME [11]. These indices describe the mean species richness or diversity at the local level [12]. In their definitions, R is the number of species, n_i the number of individuals in species i, N the total number of individuals, S_1 the number of singletons (i.e., the number of species with only one individual), and S_2 the number of doubletons (i.e., the number of species with exactly two individuals).
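For orientation, commonly used textbook forms of the four indices, written with the symbols defined above, are given below. These are assumptions about the general form only: the variants actually computed by QIIME may differ, for example in the base of the logarithm used for Shannon's entropy or in a bias correction of the Chao1 estimator.

```latex
\begin{align}
H &= -\sum_{i=1}^{R} \frac{n_i}{N} \ln \frac{n_i}{N} \\
D &= 1 - \sum_{i=1}^{R} \left( \frac{n_i}{N} \right)^{2} \\
E &= \frac{1}{R \, \sum_{i=1}^{R} \left( n_i / N \right)^{2}} \\
C &= R + \frac{S_1^{2}}{2 S_2}
\end{align}
```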

Feature selection and feature evaluation
Before using the output of SEDE-GPS for machine learning, we employed a feature selection step. To this end, features containing missing values and features with low variance (e.g., with more than 25% zeroes) were discarded. Next, we used the R package EFS (Ensemble Feature Selection) to rank the remaining features according to their importance. EFS is an ensemble learning feature selection method that corrects for biases of the single methods when weighting features [13,14]. Although EFS has been developed for feature selection in classification studies, we used an adapted version of EFS that can be used for regression studies. The stability of the features gathered over multiple runs of EFS was assessed by calculating the mean pairwise distance between the feature lists. To this end, we calculated Kendall's τ and the Jaccard distance using the R packages kendall and philentropy [15,16]. For two ranked lists of observations x and y of length n, Kendall's τ is defined as
$$\tau = \frac{c - d}{n(n-1)/2}$$
with c being the number of concordant pairs of observations (x_i, y_i) and (x_j, y_j), i.e., pairs with x_i < x_j and y_i < y_j, and d the number of discordant pairs, where i and j are indices in the lists x and y. The Jaccard distance d_J for two lists x and y is defined as
$$d_J(x, y) = 1 - \frac{|x \cap y|}{|x \cup y|}$$
Therefore, for two feature lists with a maximum distance, the Jaccard distance would assume a value of 1 and Kendall's τ a value of −1. These values were calculated from feature lists that contain the 50 features that were ranked most important by EFS.
Sets of correlating features were determined using Spearman correlation with a correlation coefficient cutoff of 0.7.
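The two stability measures can be sketched in a few lines. The snippet below is an illustration in Python rather than the R packages used in the study. It computes the Jaccard distance on the sets of feature names and Kendall's τ (in its simplest form, without correction for ties) on the rank positions of the features shared by both lists; restricting τ to the shared features is an assumption made here, and all function names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(list_a, list_b):
    """1 - |intersection| / |union| of two feature lists."""
    set_a, set_b = set(list_a), set(list_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def kendall_tau(x, y):
    """Kendall's tau for two equally long rank sequences without ties:
    (concordant - discordant) / (n * (n - 1) / 2)."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def tau_for_feature_lists(list_a, list_b):
    """Kendall's tau on the rank positions of features present in both
    ranked lists (assumption: features missing from one list are ignored)."""
    rank_a = {name: rank for rank, name in enumerate(list_a)}
    rank_b = {name: rank for rank, name in enumerate(list_b)}
    shared = [name for name in list_a if name in rank_b]
    return kendall_tau(
        [rank_a[name] for name in shared],
        [rank_b[name] for name in shared],
    )
```

Averaging these values over all pairs of feature lists from repeated runs gives one stability estimate per dataset.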

Machine learning
We trained and evaluated eleven different machine learning models (as implemented in the R package caret [17]) using a leave-one-out cross-validation (LOOCV) scheme. These models included generalized linear models (glmnet), Bayesian lasso (blasso), support vector machines (svmLinear and svmRadial), k-nearest neighbors (knn), regression trees (CART: rpart, bagged CART: treebag), random forests (rf), and stochastic and extreme gradient boosting (gbm and xgbTree). Models were evaluated by comparing the predicted values for all iterations to the real alpha diversity values, resulting in R² values. Confidence intervals for the models' performance were calculated from the distribution of R² values obtained from 1000 bootstrap resamples of the pairs of predicted and observed target variables. Their distributions were visualized using boxplots.
The machine learning models were tested for overfitting using a permutation test. To this end, the target variable was permuted and, after feature selection with EFS, machine learning models were trained using the same approach as described above. R² values were calculated and collected for 1000 repetitions of this procedure. Finally, the number of times t that the resulting R² value was larger than or equal to the R² value obtained with the unpermuted target variable was counted. Significance in terms of a p value was calculated as p = t/1000.
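The evaluation scheme (LOOCV predictions, an R² value, and a bootstrap distribution of R² over pairs of predicted and observed values) can be sketched as follows. This is not the caret-based code of the study: a one-feature least-squares line stands in for the actual models, R² is computed as 1 − SS_res/SS_tot (an assumption; other definitions exist), and all function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def loocv_predictions(xs, ys):
    """Predict every sample with a model trained on all other samples."""
    predictions = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(slope * xs[i] + intercept)
    return predictions


def r_squared(observed, predicted):
    """1 - SS_res / SS_tot; NaN if the observed values are constant."""
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot if ss_tot else float("nan")


def bootstrap_r2(observed, predicted, n_boot=1000, seed=0):
    """R^2 values of resampled (observed, predicted) pairs."""
    rng = random.Random(seed)
    pairs = list(zip(observed, predicted))
    values = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        obs, pred = zip(*sample)
        value = r_squared(obs, pred)
        if value == value:  # skip NaN from degenerate resamples
            values.append(value)
    return values


def percentile_interval(values, lower=0.025, upper=0.975):
    """Simple percentile interval of a non-empty list of values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]
```

Calling `percentile_interval(bootstrap_r2(ys, loocv_predictions(xs, ys)))` then yields an approximate 95% interval for the R² of the stand-in model.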
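The permutation test described above can be written generically around any scoring function. In the sketch below, `score_fn` is a hypothetical callable that stands for the complete pipeline (feature selection, model training, and evaluation) and returns one R² value; the squared Pearson correlation is used only as a cheap stand-in for demonstration purposes.

```python
import random


def permutation_p_value(score_fn, features, target, n_perm=1000, seed=0):
    """p = t / n_perm, where t counts permutations whose score is larger
    than or equal to the score obtained with the unpermuted target."""
    rng = random.Random(seed)
    observed = score_fn(features, target)
    t = 0
    for _ in range(n_perm):
        shuffled = list(target)
        rng.shuffle(shuffled)
        if score_fn(features, shuffled) >= observed:
            t += 1
    return t / n_perm


def squared_correlation(xs, ys):
    """Squared Pearson correlation (stand-in for a full model pipeline)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)
```

If no permutation reaches the observed score, the function returns 0, which corresponds to a p value below the resolution of the test (p < 1/n_perm).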

Data enrichment using SEDE-GPS
SEDE-GPS is structured modularly, with every module querying a certain database or API and, if necessary, performing data pre- and postprocessing steps (Table 1). The modules that query the Open Street Map (OSM) databases, e.g., have to account for the fact that their maps are in a Pseudo-Mercator projection or calculate a bounding box for counting of POIs. Some of the APIs queried by SEDE-GPS limit the number of queries that are handled in a certain amount of time (Twitter) or answer intentionally slowly (OSM). Similarly, the number of features provided by the different modules varies greatly, with Eurostat contributing by far the highest number of features (Table 1).
In order to showcase the use of SEDE-GPS, we set out to identify features that are predictive for the microbial biodiversity in a set of 39 alpine Austrian lakes. From these lakes, water samples were taken from which both 16S and 18S rRNA were sequenced, and the geo-location of the sampling was recorded using GPS [9,10]. These GPS coordinates were used as an input for SEDE-GPS, with all modules enabled, using radii of 1, 2, and 5 km and the date of sampling as additional input for modules for which this is necessary. This resulted in around 17,900 features.
The resulting dataset was observed to be highly sparse, with especially the output of the Eurostat and Twitter modules showing a high degree of sparsity. Furthermore, a very small number of features contained missing values, which we attributed to errors either in the databases or in the communication with the API. Therefore, features that contained any missing values, or zeroes for more than a third of the instances, were discarded. This procedure reduced the number of features per lake to around 1,200.
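The filtering rule described above (discard features with any missing value or with zeroes in more than a third of the instances) can be sketched as follows. The table layout, a dictionary that maps feature names to lists of values with `None` marking missing entries, and the function name are assumptions made for this illustration.

```python
def filter_features(table, max_zero_fraction=1 / 3):
    """Keep only features without missing values and with at most
    `max_zero_fraction` zeroes.

    `table` maps feature name -> list of values, one per sample;
    missing values are represented by None (hypothetical layout).
    """
    kept = {}
    for name, values in table.items():
        if not values:
            continue  # nothing measured for this feature
        if any(value is None for value in values):
            continue  # discard features with missing values
        zero_fraction = sum(1 for value in values if value == 0) / len(values)
        if zero_fraction > max_zero_fraction:
            continue  # discard sparse features
        kept[name] = values
    return kept
```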

Calculation of biodiversity metrics
The 16S and 18S rRNA sequencing datasets were processed separately using a QIIME pipeline [11]. Samples that contained fewer than 1000 sequences were discarded, which led to differing numbers of lakes for which Eukaryotic and Prokaryotic biodiversity data were available. As biodiversity indicators, four different alpha diversity metrics (Shannon's entropy, Simpson diversity, Simpson evenness, and the Chao1 estimator) were calculated after rarefaction ("Methods" section). We used multiple different metrics as they each measure biodiversity in specific ways and therefore emphasize different species distribution characteristics [18][19][20]. As the alpha diversity metrics were calculated for 16S and 18S rRNA separately, this resulted in a maximum of eight different biodiversity indicators for each lake.
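Rarefaction, i.e. subsampling every sample to the same number of sequences before computing diversity metrics, can be sketched as below. This is a simplified illustration and not the QIIME implementation; the depth of 1000 in the usage note mirrors the sample-size threshold mentioned above but is otherwise an assumption, as are the function name and the data layout.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample a sample to exactly `depth` sequences without replacement.

    `counts` maps taxon -> number of sequences. Returns a new dictionary
    of counts, or None if the sample has fewer than `depth` sequences
    and therefore has to be discarded.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    # Expand the counts into one entry per sequence, then draw `depth`
    # of them at random without replacement.
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    rarefied = {}
    for taxon in rng.sample(reads, depth):
        rarefied[taxon] = rarefied.get(taxon, 0) + 1
    return rarefied
```

For example, `rarefy({"taxon_a": 600, "taxon_b": 500, "taxon_c": 10}, 1000)` returns counts that sum to 1000, while a sample with only 10 sequences yields `None`.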

Identification of important features using EFS
In order to find features in the output of SEDE-GPS that are predictive for lake microbial biodiversity, we used the R package EFS (Ensemble Feature Selection) and the eight alpha diversity metrics as target variables in separate analyses [13,14]. EFS is an ensemble feature selection method that assigns weights to the features in an unbiased manner according to their predictiveness for the target value. Using the average weight of the features as cutoff, features below this cutoff were discarded. To verify that the selected features are descriptive and were not selected due to overfitting, eleven different machine learning models were trained to predict the eight alpha diversity values from the EFS-selected SEDE-GPS features. The models showed profound differences in performance (Table 2), with xgbTree showing near-perfect performance for all target variables (Fig. 2). In order to confirm that the performance of the models is not due to overfitting, we performed a permutation test for the four best-performing machine learning models. For all target variables and machine learning models, this resulted in a p value of less than 0.001.
(Table footnote: Runtime means and standard deviations were calculated from ten measurements.)
Taken together, these results show that the features selected by EFS were not selected due to overfitting but are helpful for predicting alpha diversity metrics for prokaryotes and microbial eukaryotes in lakes.
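The cutoff rule used for the EFS weights, keeping only features whose weight is not below the average weight, amounts to a one-line filter. The sketch below assumes that the weights are available as a dictionary from feature name to weight; this layout, the function name, and the feature names in the usage note are hypothetical.

```python
def select_above_mean(weights):
    """Discard all features whose weight lies below the mean weight."""
    if not weights:
        return {}
    cutoff = sum(weights.values()) / len(weights)
    return {name: w for name, w in weights.items() if w >= cutoff}
```

With the weights `{"streets": 4.0, "forest": 1.0, "tweets": 1.0}`, the mean is 2.0 and only `"streets"` is retained.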

Stability and importance of features
Because leave-one-out cross-validation (LOOCV) was used to train and validate the machine learning models, multiple weighted feature lists were calculated for every target variable. Overfitting of EFS would have resulted in drastically different feature weights in the LOOCV iterations. In order to show that EFS did not overfit in the analyses presented here, we assessed the stability of the features selected in the LOOCV iterations using both Kendall's τ and the Jaccard distance as feature list distance measures. These results show that the features selected by EFS exhibit a high degree of stability and that the feature selection is not the result of overfitting (Fig. 3).
When manually examining selected features, it is important to keep in mind that the first step of feature selection in EFS is correlation based. This means that from sets of features that correlate, only the most descriptive feature is kept in the feature set. Therefore, for datasets processed with EFS, each feature label must be viewed as a stand-in for a set of correlating features. Table 3 shows the five most important features for predicting the different alpha diversity metrics, with each feature name being representative of a set of correlating features (Table S1). This examination was limited to five features per target variable because both the average feature weight and the stability of the feature position decrease quickly with increasing rank of the feature (Fig. 4, Additional file 3: Figure S1). The resulting feature lists for Prokaryotes and microbial Eukaryotes show major differences, while using different alpha diversity metrics results, especially for Prokaryotes, in similar feature lists (Table 3).

SEDE-GPS
In this paper, we present SEDE-GPS, which can be used to drastically increase the number of features for datasets that contain GPS-located samples. Accessing four different data sources via five modules, it provides around 18,000 numerical features that contain socio-economic, geographic, and climate information (Table 1).
Currently, due to the choice of databases SEDE-GPS queries, this tool has a number of limitations. Both the CDC and Eurostat modules return only data for GPS coordinates in Europe, while the OSM modules and Twitter module will work for any GPS coordinate. Similarly, the databases queried by SEDE-GPS do not contain meaningful data for most marine GPS coordinates. In the future, we seek to overcome these limitations by including more data sources and thus extending SEDE-GPS both to new regions and to new data types and formats.
Fig. 2 Performance of machine learning models predicting microbial lake alpha diversity based on the output of SEDE-GPS. Stars represent the performance of models trained on the respective dataset, box plots represent confidence intervals of R² values gathered from the respective model. Models were trained on the output of SEDE-GPS after feature selection and evaluated using LOOCV ("Methods" section). Only results for the four best-performing models are shown; for the others, see Table 2
Fig. 3 Stability of feature lists over LOOCV iterations. Jaccard distances and Kendall's τ were calculated for pairs of feature lists for the 50 most important features of each dataset. Dots and error bars represent average values and standard deviations of values, respectively. At maximum distance, the Jaccard distance and Kendall's τ would assume a value of 1 and −1, respectively. Both feature lists are rather stable; however, the feature lists of the Prokaryote datasets are more stable than their Eukaryote counterparts
Similarly, the specific limitations and peculiarities of the databases currently used by SEDE-GPS are important for the interpretation of their data. OSM contains user-generated and user-curated information, which might be of inconsistent, albeit high, quality or level of detail [6]. Eurostat, as a governmentally curated database, on the other hand, exhibits a level of detail that is generally lower than that of OSM, as it can only be queried for defined NUTS regions [7]. As these regions are of widely differing sizes, one might want to normalize data gathered from Eurostat to the area of the respective NUTS region. We decided not to implement this normalization step in SEDE-GPS, as postprocessing steps not accessible to the user might introduce unwanted artifacts. The information gathered from Twitter comes with multiple caveats: For one, only very few processes will be directly influenced by the number of messages sent via Twitter, and this number will thus, in most cases, function as a proxy for other information. Additionally, the number of tweets will show a certain amount of variance over time, with the amount of variance possibly also being location-dependent.
Because of a rate limitation in API queries, both the OSM modules and the Twitter module are the biggest contributors to SEDE-GPS's runtime, especially for datasets with many GPS coordinates. It would be possible to speed up the OSM modules by reading the data from a so-called planetfile (an image of the OSM databases) instead of using API queries. This is, currently, not implemented in SEDE-GPS, as the planetfile is very large and a speed improvement would, therefore, only exist for very large GPS datasets.
Central to the design of SEDE-GPS is the fact that it does not perform any field-specific data postprocessing. Therefore, the output of SEDE-GPS can be used for studies in a wide variety of scientific fields. Nevertheless, for some applications, postprocessing steps might be advisable.

Microbial ecology
In this study, we showcase the use of SEDE-GPS for microbial ecology. From the output of SEDE-GPS and using machine learning methods, we were able to identify features that can be used as predictors of both Eukaryote and Prokaryote alpha diversity in a set of alpine lakes. Implicitly, in this study, we assumed that environmental features have a bigger impact on microbial biodiversity than historical contingencies and recent events. We acknowledge that this notion, succinctly formulated as "everything is everywhere, but the environment selects", is highly debated [21][22][23][24]. Furthermore, we do not take into account that the composition of microbial communities can be majorly influenced by recent events or the microenvironment of the sampling position [25,26]. These assumptions are necessary because the dataset analyzed here does not contain multiple samples that were collected at different time points for each of the lakes. However, we are not aware of such an ecological microbial sequencing dataset with a quality, geographic extensiveness, and uniformity of sample preparation comparable to the one we analyzed here.
The features we identified as most predictive for microbial biodiversity differed greatly between Eukaryotes and Prokaryotes, supporting the notion that microorganisms from these domains have different ecological roles [21,24,27,28]. In contrast to this, the most predictive features for the different alpha diversity indices calculated from Prokaryotic sequences show a high degree of similarity. This indicates that the alpha diversity metrics used in this study essentially capture the same central distribution characteristics of the composition, at least for this domain of life.
Recently, many studies identified environmental and geographic features such as temperature, pH, climate, ion and nutrient concentration, and elevation-related environmental parameters as major drivers of the composition of lake microbiomes [4,10,21,[29][30][31]. Some of these features were also identified as highly impactful in our analysis (Table 3), albeit somewhat hidden under feature labels such as "Climate, Demography, City Structures" for temperature or "Economy (parking, GDP, Agrarian structures), Population" for nutrient concentration. While this clearly is a consequence of the field-agnostic nature of the data provided by SEDE-GPS, it might also point to possible sources for impact on biodiversity.
Therefore, our results also suggest that human action has a direct or indirect impact on lake microbiome composition. Although an impact of urbanization on biodiversity is well known for other areas of ecology [32][33][34][35], this is the first time, to our knowledge, that it has been described for microorganisms. Surprisingly, our results suggest that urbanization has a positive effect on Prokaryote biodiversity, as, e.g., the area of the environment covered by streets correlates positively with all biodiversity indices used in this study (Table 3). The negative impact of forest area might therefore stem from the fact that areas covered with forests cannot also be urban regions. Importantly, one should not fall into the trap of assuming that a higher biodiversity necessarily signifies a well-functioning ecosystem [20] and take the results presented here to mean that more streets would improve lake ecosystems. Nevertheless, these results indicate that the processes that govern microbial ecology are very different from those that govern the ecology of larger organisms [9,21,28].
Further analyses will be needed to solidify the results of this study. In part, this is due to the fact that the samples and lakes included in this analysis are limited in number and are geographically close to each other [22,24,25,36]. Therefore, for a more thorough analysis, larger datasets from more variable sites would be necessary, as currently only available from large-scale environmental sequencing efforts such as the Earth Microbiome Project [37] or the 1000 Springs Project [28,38]. Nevertheless, on the basis of the results presented here, experiments can be designed in order to illuminate the mechanistic and causal relationships between environmental features and microbial biodiversity.

Conclusion
This study shows how to use SEDE-GPS to enhance datasets that contain only scarce information on the environment of geo-located, observed processes. Analysing the output of SEDE-GPS leads to the identification of environmental, socio-economic, and climate features that influence the studied process. These results can then act as a basis for further hypothesis-driven research projects. SEDE-GPS is available at http://www.SEDE-GPS.heiderlab.de.