 Methodology
 Open Access
 Published:
How to balance the bioinformatics data: pseudonegative sampling
BMC Bioinformatics volume 20, Article number: 695 (2019)
Abstract
Background
Imbalanced datasets are commonly encountered in bioinformatics classification problems, that is, the number of negative samples is much larger than that of positive samples. Particularly, the data imbalance phenomena will make us underestimate the performance of the minority class of positive samples. Therefore, how to balance the bioinformatic data becomes a very challenging and difficult problem.
Results
In this study, we propose a new data sampling approach, called pseudonegative sampling, which can be effectively applied to handle the case that: negative samples greatly dominate positive samples. Specifically, we design a supervised learning method based on a maxrelevance minredundancy criterion beyond Pearson correlation coefficient (MMPCC), which is used to choose pseudonegative samples from the negative samples and view them as positive samples. In addition, MMPCC uses an incremental searching technique to select optimal pseudonegative samples to reduce the computation cost. Consequently, the discovered pseudonegative samples have strong relevance to positive samples and less redundancy to negative ones.
Conclusions
To validate the performance of our method, we conduct experiments base on four UCI datasets and three real bioinformatics datasets. According to the experimental results, we clearly observe the performance of MMPCC is better than other sampling methods in terms of Sensitivity, Specificity, Accuracy and the Mathew’s Correlation Coefficient. This reveals that the pseudonegative samples are particularly helpful to solve the imbalance dataset problem. Moreover, the gain of Sensitivity from the minority samples with pseudonegative samples grows with the improvement of prediction accuracy on all dataset.
Background
The work is motivated by the realworld requirement in bioinformatic data processing: it is very common that negative samples greatly dominate positive samples, and this phenomena is called data imbalance problem. In general, we cannot achieve genetic data mining with limited positive samples. So, we think that: whether we could use positive samples by mixing pseudonegative data (which is classified to be negative data, but they are similar to positive samples with the maximum relevance and they have the minimum redundancy with negative samples) to predict the categories of samples. Because of the lack of enough positive samples, the biologist cannot perform experiments. Consequently, some positive samples cannot be identified or categorised as negative samples which can be viewed defined as pseudonegative samples. So how to select these pseudonegative samples will be an alternative method to solve the imbalanced data problem in bioinformatics.
In the postgenome era, with the wide application of various highthroughput technologies, biological data has rapidly increased [1, 2]. Machine learning technology can be applied to discovery important information for understand complex biological processes from largescale biological data [3–9]. However, imbalanced data is a very common phenomenon in the real dataset (where the positive sample is the minority class). Many bioinformatics applications require class imbalance learning, such as gene expression data [10, 11], proteinDNA binding data [12, 13], N^{6}methylation sites in mRNAs [14], splice sites prediction [15], prediction of microRNAs [16], prediction of protein interaction [17–21], transcription factor binding sites prediction [22, 23] and so on. In this scenario, the performance of the minority classes can be greatly underestimated [24].
To the best of our knowledge, researchers have proposed some strategies to degrade the influence of imbalance data. These existing methods can be classified into datalevel approaches and algorithmiclevel approaches [25, 26]. In regard of datalevel approaches, resampling techniques are employed to balance the sample space w.r.t. an imbalanced dataset in order to alleviate the negative effect of the skewed distribution of samples in the learning process. Resampling methods are very commonlyused approach because they are independent of classifiers. Resampling techniques can be classified into three categories depending on the method used to balance the proportion of positive and negative samples: (1) oversampling: eliminating the negative effect of skewed distribution by generating new samples of minority class. Two widelyused approaches to generate minority samples are Random OverSampling (ROS) which randomly duplicate the minority samples, and SMOTE. (2) Undersampling: balance the data by discard the samples from the majority class. The simplest yet most effective method is Random UnderSampling (RUS) which involved the random elimination of majority class examples [27]. RUS deals with the class imbalance problems in an effectively fashion. (3) Hybrid methods: these are a combination of the oversampling and undersampling method. The commonlyused algorithmiclevel approach is costsensitive learning method which assigns higher costs to the minority class [28, 29].
However, RUS often loses some important classification information and ROS is timeconsuming and often results in the phenomenon of overfitting. So, it is essential to propose advanced data sampling approaches to maintain the structure of groups and generate new data according to its underlying distribution.
To overcome the problems caused by the imbalanced bioinformatic data, we first propose the pseudonegative sampling approach based on Maxrelevance and Minredundancy Pearson correlation coefficient (called MMPCC). In the MMPCC approach, Pearson correlation coefficients are used to measure the similarity between positive and negative samples and the coefficients are learned from positive and negative samples based on the maxrelevance and minredundancy criteria. The new algorithm can discover the pseudonegative samples which may be viewed as positive samples, but their labels are negative. This proposed sampling approach aims at alleviating the imbalanced ratio. The experiments are applied on two UCI data and three reallife bioinformatics data.
Contribution: The original contributions of this study can be summarized as follows.
1) We propose a concept of pseudonegative samples and present a pseudonegative sampling method which is based on the maxrelevance and minredundancy Pearson correlation coefficient in supervised learning. In particular, both positive and negative samples are taken into full consideration in order to find optimal pseudonegative samples.
2) We use an incremental searching method for calculating the coefficient of positive and negative samples, which can avoid the high computational cost in selecting the subsets of pseudonegative samples.
3) We conduct extensive experiments and the results demonstrate the advantage of the MMPCC method for handling the imbalanced bioinformatic data.
Methods
Pseudonegative sampling method
Although pseudonegative samples are viewed to be negative, but they are similar to positive samples with the maximum relevance and they have the minimum redundancy with negative samples. The key idea of pseudonegative sampling approach is to select a subset from the negative samples and classify them into positive class by the method of maxrelevance and minredundancy on Pearson correlation coefficient in the phase of training. The formal definition of pseudonegative samples is given as follows.
Definition 1 (Pseudonegative samples).
Given a positive data set S^{+}=\(\{(x^{+}_{1},y^{+}_{1}),(x^{+}_{2},y^{+}_{2}),...,(x^{+}_{m},y^{+}_{m})\}\), a negative data set S^{−}=\(\{(x^{}_{1},y^{}_{1}),(x^{}_{2},y^{}_{2}),...,(x^{}_{n},y^{}_{n})\}\), then a pseudonegative data set is represented by S^{∗}=\(\{(x^{*}_{1},y^{*}_{1}),(x^{*}_{2},y^{*}_{2}),...,(x^{*}_{l},y^{*}_{l})\}\), where m is the total number of positive data, n is the total number of negative data, m≪n, and l is the number of pseudonegative samples.
The purpose of our method is to identify the pseudonegative sample set S^{∗} (which might contain l samples) based on S^{+} and S^{−}, where l<m.
One of the famous sequential search methods is the incremental sample search algorithm, and we employ it in the study. To achieve the incremental sample searching, the pseudonegative sample set starts from \(S^{*}_{0}=\varnothing \), and a quantitative criterion \(Q(S^{*}_{i})\) is used to measure the similarity of samples in \(S^{*}_{i}\).
In each round of searching, a sample S^{∗′} would be added in the sample set \(S^{*}_{k}\).
where
\(Q(S_{i}^{*})\) plays an important role in the sample selection, which can be defined with different requirements. The validation accuracy is utilized to evaluate the new sample subsets. In this study, the metric of Eq. 3 is utilized to evaluate the similarity of samples in \(S_{k1}^{*}\) and S^{∗′}, and the corresponding quantitative criterion is given by the following equation:
where S^{∗′} is a potential pseudonegative sample and \(S^{*}_{k1}\) is the pseudonegative sample set, and A represents the validation accuracy.
In this study, we employ the Pearson correlation coefficient between samples in order to select a new sample. \(Q(S_{i}^{*})\) can be transformed to be the following equation:
The details of calculating the Pearson correlation coefficient are given in the following.
Maxrelevance and minredundancy on pearson correlation coefficient
Pearson correlation coefficient (PCC) [30] is defined on the covariance matrix, which is a method to evaluate the strength of the relationship between two vectors. In general, the coefficient between two vectors α_{i} and α_{j} is defined as follows:
According to the maxrelevance, PCC beyond negative sample and positive sample are formalized as follows:
where \(S^{}_{i} \in S^{}, i \in N, S^{+}_{j} \in S^{+}\) and j∈M agreeing with the maxrelevance criterion. The most relevant feature set can be obtained by maximizing \(D(S^{}_{i},S^{+}_{j})\).
Based on the minredundancy criterion, the samples could be selected by the following equation:
where \(S^{}_{i} \in S^{}\) and \(S^{*}_{k}\in S^{*}\),
In terms of incremental search method, an operator Ψ(D,R) is defined in Equation 10 in order to optimize the maxrelevance and minredundancy information. The best selected sample S^{∗′} is given as follows:
Assume we have the sample subsets \(S^{*}_{k1}\) which have k1 samples. In the next step of searching, the k^{th} sample is obtained from the sample subsets \(\{S^{}S^{*}_{k1}\}\). Then, \(S^{*}_{k}\) can be calculated by Eq. 12 based on Ψ(D,R).
where \(S^{}_{i}\in \{S^{}S^{*}_{k1}\}\) and \(S^{+}_{j} \in S^{+}\).
The proposed pseudonegative sampling algorithm
Based on the aforementioned preliminaries, we propose a pseudonegative sampling algorithm based on the maxrelevance and minredundancy on Pearson correlation coefficient, which is called MMPCC. The detail of the MMPCC algorithm is presented in Algorithm 1 and the flow chart is shown in Fig. 1.
As described in Algorithm 1, the selected pseudonegative samples can be updated step by step. Firstly, the maxrelevance between the negative sample and the positive sample is calculated by Equation 7 in order to choose candidate pseudonegative samples. Then, the new selected sample will be identified based on the minredundancy of samples in the selected pseudonegative subsets by Equation 9. Lastly, the new sample will be identified to be pseudonegative sample by Equation 12.
It is worthwhile to note that l is specified by experts in order to determine how many pseudonegative samples should be inserted into the positive sample set.
The computational complexity of MMPCC, MAXR and MINR includes two parts: the computation of similarity matrices and the computation of sample ranking. The operator ψ_{MAXR} can be obtained via Equation 7, the operator ψ_{MINR} can be calculated by Equation 9 and the MMPCC model be figured out by Equation 12.
As for MAXR, the computation of Pearson correlation coefficient of all pairwise negative data and positive data requires the complexity of O(n∗m∗f), where n is the number of negative data, m is the number of positive data and f is the number of attributes of each data. As for MINR, the computation complexity is O(n∗l∗f), where l is the number of pseudonegative samples. Therefore, the computation complexity of MMPCC is the sum of MAXR and MINR, that is, O(n∗m∗f+n∗l∗f).
Classification methods
Random forests
The classifier of Random forests [31, 32] is an ensemble learning method, which works by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set.
Neural networks
A neural network [33] is composed of several simple "neurons", and the output of a neuron will be the input of another. The connections of the biological neuron are modeled as weights. A positive weight reflects an excitatory connection, while negative values mean inhibitory connections. All inputs are modified by a weight and summed. This activity is referred as a linear combination. Finally, an activation function controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or it could be 1 and 1.
AdaBoost
AdaBoost, short for "Adaptive Boosting", which is a general ensemble method [34]. It focuses on classification problems and aims to convert a set of weak classifiers into a strong one. The final equation for classification can be represented by:
where f_{m} represent the m^{th} weak classifier and θ_{m} is the corresponding weight. It is exactly the weighted combination of M weak classifiers.
Discriminant analysis
Discriminant analysis(DA) is one of the classification methods. The basic idea is that: two or more clusters or populations are priori known and one or more observations are classified into one of the known populations according to the measure characteristics [35]. Let X is a qdimensional vector representing an observation from one of several possible classes. If the category is unknown, X can be classified using the discriminant analysis approach. Alternatively, it can be used to characterize the difference between classes via a discriminant function.
Datasets
In order to evaluate the prediction performance of MMPCC on pseudonegative sampling, we compare it with the stateoftheart prediction methods. In experiments, we use four UCI Repository datasets [36] and three real bioinformatic datasets. Table 1 introduces the detail of the datasets.
From Table 1, we can see that the number of attributes of each dataset is 9, 3, 10, 49, 180, 180 and 25, respectively. We use all attributes of each dataset in MMPCC. In MMPCC, the Pearson correlation coefficient is used to calculate the similarity between negative and positive samples in Equation 7, and is also applied in Equation 10 and Equation 12. Additionally, the coefficient between two vectors α_{i} and α_{j} in Equation 5 is obtained by all attributes of each dataset.
In Table 1, Positive represents the number of positive samples, Negative represents the number of negative samples, and Ratio = Negative Numbers / Positive Numbers.
More specifically, the first UCI datasets, Contraceptive Method Choice (CMC) contains 333 minority samples and 1140 majority samples, and the number of attributes is 9. The second UCI datasets, Haberman’s Survival Data, contains 81 minority samples and 225 majority samples, and the number of attribute is 3. The third dataset Solar Flare records the number of solar flares. Each attribute calculates the number of a certain type of Solar Flare within 24 hours. Each instance represents the number of all types of flares in an active region on the sun. The data contains 69 minority classes and 1320 majority classes, with 10 attributes. The fourth datasets Oil contains 41 minority classes and 896 majority classes, including 49 attributes.
The first bioinformatic datasets, SNP data [37], included 183 positive samples and 2891 negative samples, and the number of attributes is 25. The second bioinformatic datasets, PDNA543 [38], consists of 543 protein sequences, which are all related into the PDB (Protein Data Bank) before October 10, 2014. There are 9,549 DNAbinding residues as positive samples and 134,995 nonbinding residues as negative samples in PDNA543. The third bioinformatic datasets, PDNA316, is constructed by Si et al 39], which has 316 DNAbinding protein chains and 5,609 binding residues and 67,109 nonbinding residues.
Evaluation metrics
In this study, four metrics are used to evaluate the performance of different classifiers, including Sensitivity (Sen), Specificity (Spe), Accuracy (Acc)and the Mathew’s Correlation Coefficient (MCC). They are calculated according to the following equations:
where TP is the number of true positives TN is the number of true negatives, FP is the number of false positives, FN is the number of false negatives, P is the number of positives, and N is the number of negatives.
Sensitivity indicates how well the test predicts the true positives, Specificity measures how well the test predicts the true negatives, Accuracy is expected to measure how well the test predicts both true positives and negatives, and MCC considers true and false positives and negatives. So, the higher the values of these evaluation metrics, the better the results.
Results
The purpose of the evaluation is to examine the effectiveness of our proposed MMPCC method on selecting the pseudonegative samples. Four sets of experiments are conducted. Experiment 1 compares the different percentage of pseudonegative sampling on two UCI datasets. Experiment 2 compares the different percentage of pseudonegative sampling on three bioinformatic datasets. Experiment 3 compares MMPCC with the maxrelevance and the minredundancy methods on the PDNA316 dataset, which aims to evaluate the relation between the relevance and the redundancy. For simplicity, the maxrelevance method is represented by MAXR and the minredundancy method is represented by MINR. Experiment 4 compares MMPCC with other sampling methods on the bioinformatic datasets.
In experiments, fivefold crossvalidation is used to train the dataset. In order to give comprehensive results, Discriminant Analysis, AdaBoost, Random Forest and Neural Networks are employed for classification. We use DA, Adaboost, RF and NN to represent these four classifiers in the experiments, respectively.
Experiment 1: experiments on UCI datasets
This set of experiments examines the contribution of different percentage of pseudonegative sampling on the UCI datasets [36]. The results are shown in Table 2 and Table 3. As mentioned previously, we use the metrics of Sen, Spe, Acc and MCC.
Table 2 presents the performance of different classifiers on the CMC dataset, where the percentage of pseudonegative samples changes from 0% to around 50%. 0% means the dataset is not used pseudonegative sampling. We can see that the performance is improved with larger percentage of pseudonegative samples, where the Random Forest method achieve 28.19%, 39.22%, 43.94%, 50.87%, 56.45% and 62% for Sen when the percentage of pseudonegative samples is fixed to 0%, 10%, 20%, 30%, 40% and 50%, respectively. In addition, the Acc value is 78.2 %,78.75%, 78.41%, 78.48%, 79.57% and 79.63% and the MCC value is 0.27, 0.369, 0.404, 0.448, 0.505 and 0.532. The performance of different evaluation metrics show a trend of increasing with a higher percentage of pseudonegative samples, which agrees with the realworld situation that: if we add more positive samples, the classifier will have better performance.
Furthermore, the Neural networks method achieves 27.01%, 40.92%, 47.28%, 53.39%, 54.94% and 61.02% for Sen when the percentage of pseudonegative samples is fixed to 0%, 10%, 20%, 30%, 40% and 50%, respectively. Moreover, the MCC value is 0.161, 0.302, 0.368, 0.439, 0.439 and 0.505. For Discriminant analysis method, the Sen values are increased by 9.38%, 17.6%, 37.35%, 52.46%, 59.46% and 66.78% and the MCC values are increased by 0.156, 0.198, 0.351, 0.438, 0.485 and 0.530 on different percentage of pseudonegative samples, respectively. Similarly, the performance of the AdaBoost classifier obtain improvement on Sen and MCC, which demonstrates the effectiveness of the proposed pseudonegative sampling method.
Table 3 shows similar results on different metrics as Table 2, which verify that pseudonegative sampling is very useful in classify the imbalance data and can obtain good performance of classification. Furthermore, the results indicates that pseudonegative samples can be viewed as positive samples and be used to classify objects. For the instability of MMPCC, the results are often not unique in Table 3. There are three reasons about this issue: Firstly, four classification methods were employed, DA, RF, NN and AdaBoost in this study. Different machine learning method has different character, so the experiment results have little instability. Secondly, the value of Sensitivity and Specificity has little instability, but the value of MCC is more stable in most of experiments. As the Sensitivity and Specificity are the singular assessment metrics, MCC considers true and false positives and negatives and is generally regarded as a balanced measure. MCC can be used even if the class size is very different. Finally, the performance of different evaluation metrics shows a trend of increasing with a higher percentage of pseudonegative samples.
Experiment 2: experiments on realLife bioinformatic datasets
In this section, we demonstrate the effectiveness of the proposed method, MMPCC, on the real bioinformatic datasets, including PDNA543 [38], PDNA316 [39] and SNP data [37]. The results are given in Fig. 1, Fig. 2 and Fig. 3.
Position Specific Scoring Matrix (PSSM) was used to extract the features from protein sequences of PDNA543 and PDNA316. PSSM is a very important type of evolutionary feature, which is obtained by running the PSIBLAST program to search the SwissProt database via three iteration, with 10^{−3} as the Evalue cutoff for multiple sequence alignment. In PSSM, there are 20 scores for each sequence position and each score implies the conservation degree of a specific residue type on that position. For each data instance, all the scaled scores in PSSM are used as its evolution features. In this study, we use the window size with 9 residues, and then obtain a vector of normalized PSSM scores whose dimensions of features are 9 ×20=180.
Figure 2 shows the classification performance on PDNA543 dataset under different percentage of pseudonegative samples, where RFSen and NNSen represent the Sensitivity value of RF and NN classifiers and RFMCC and NNMCC represent the MCC value of RF and NN classifiers.
The Sen and MCC metric of NN increase with the percentage of pseudonegative samples changing from 0% to 50%. When the percentage of pseudonegative samples changes from 0% to 30%, the Sen and MCC of RF algorithm keep unchanged. However, when the percentage of pseudonegative samples is above 30%, there is a clear trend that RF has better performance as the percentage of pseudonegative samples grows.
Figure 3 illustrates the classification performance on the PDNA316 dataset under different percentage of pseudonegative samples. The performance of RF is better than NN when the percentage is 0% and %10 in terms of Sen and MCC. When the percentage is above 20%, the performance of NN increases drastically and is better than RF, which shows that adding more pseudonegative samples could help greatly improve the performance of classification. However, the performance of RF is almost unchanged. This is because the pseudonegative samples has little effect on the RF algorithm in this dataset.
Figure 4 shows the classification performance for data SNP on different percentage of pseudonegative samples. The Sen of NN grows rapidly among different percentages of pseudonegative samples and the MCC of NN gradually increases when the percentage changes from 0% to 30%, and then the fluctuate is small from 40% to 50%. We can find that the Sen and MCC of RF grows as the percentage of pseudonegative samples gradually increases.
Generally speaking, this set of experiments illustrated that the pseudonegative samples are very important and can be used to improve the effectiveness of classification.
Experiment 3: comparison of mMPCC, mAXR and mINR on the pDNA316 datasets
In this section, we employ the fivefold crossvalidation to estimate the prediction performance of the proposed MMPCC method on four metrics. We compared MMPCC with other sampling methods including MAXR (maxrelevance method based on Equation 7) and MINR (the minredundancy method based on Equation 9) [30]. In experiments, the PDNA316 dataset is employed to evaluate the effectiveness of MMPCC. The comparison results are shown in Fig. 5.
According to Fig. 5, it is straightforward to find that MMPCC outperforms the MAXR and MINR method in terms of Sen, Spe, Acc and MCC in the RF and NN classifiers. From Fig. 5(a), the pseudonegative samples have a big influence on the Sen value. The Sen value of MMPCC is significantly better than MAXR and MINR, when NN is used as a classifier. For the RF classifier, MAXR is the best one when more pseudonegative samples are added. By Fig. 5(b), with the increases of the percentage of pseudonegative samples, the Spe value of MMPCC is very stable on RF and NN. This can be explained by the reason that some pseudonegative samples are still negative ones. In addition, the Sen value can be improved with the cost of degradation of Spe value. Figure 5(c) demonstrates that the MMPCC method is the most stable method on Acc in the RF classifier. Figure 5(d) shows that the MCC value of MMPCC significantly outperforms the MAXR and the MINR methods. The performance of MAXR is better than MINR. The experimental results indicate that MMPCC attempts to utilize more representative samples and find the pseudonegative samples (which can be viewed as positive samples) from the majority negative samples.
Experiment 4: comparison of mMPCC and classical sampling methods on bioinformatic datasets
In order to verify the advantage of our method, we also compare the prediction performance of MMPCC with other classical oversampling method, i.e., SMOTE method [40], on the PDNA316 dataset.
SMOTE is an oversampling approach in which the minority class is oversampled by creating “synthetic” examples rather than by oversampling with replacement. The minority class is oversampled by taking each minority class sample and introducing synthetic examples along the line segments joining any of the k minority class nearest neighbors. Depending on the amount of oversampling required, neighbors from the k nearest neighbors are randomly chosen. In order to compare the performance of the algorithm, we use the default value 5 nearest neighbors the same as the reference [40]. The results of comparison performance are shown in Table 4. Because neural network can learn and model the relationships between inputs and outputs that are nonlinear and complex, and make generalizations and inferences. The runtime performance of random forest is quite good, and they are commonlyused to deal with the unbalanced and missing data.
According to Table 4, we can observe that MMPCC outperforms the SMOTE method in terms of all evaluation metrics. Taking MCC as an example, the MMPCC value in the NN classifier under different percentages of pseudonegative samples are 0.312, 0.405, 0.464, 0.513 and 0.543, respectively, and the improvements are 0.152, 0.205, 0.248, 0.27 and 0.277, respectively compared to the SMOTE method. For other three evaluation metrics, the MMPCC method outperforms the SMOTE sampling method as well. As for RF classifier, Table 4 shows that the performance of MMPCC is better than that of the SMOTE method. As shown in Table 4, with the increase of percentage, the MCC value of the MMPCC in the RF classifier are 0.333, 0.337, 0.351, 0.363 and 0.367, respectively, and the improvements are 0.098, 0.091, 0.101, 0.105 and 0.109, respectively over the SMOTE method. This is due to the fact that a number of duplicated or artificial samples were introduced by oversampling techniques for largescale imbalanced data. But for MMPCC, there is no manmade duplicated data. In terms of the MMPCC sampling method, the pseudonegative sampling technique helps identify more useful samples from the negative class which is often neglected, so it performs better than the SMOTE sampling method.
Experiment 5: experiments on highly imbalance ratio datasets
In order to validate the performance of the proposed method on highly imbalance Ratio datasets, the comparative evaluation on two UCI datasets, Solar Flare and Oil, are performed. The dataset Solar Flare contains 69 minority classes and 1320 majority classes; with 10 attributes, and the Ratio is 19.1. The Oil dataset contains 41 minority classes and 896 majority classes, including 49 attributes, and the Ratio is 21.9.
Table 5 demonstrates the classification results of the Solar Flare dataset with highly imbalance Ratio. Overall, the performance is increased with a larger percentage of pseudonegative samples.For example, the random forest method obtain 1.43%, 4.00%, 13.53%, 25.03%, 20.68% and 32.57% for Sen as the percentage of pseudonegative samples is fixed to 0%, 10%, 20%, 30%, 40% and 50%, respectively. Moreover, the MCC value is 0.01, 0.06, 0.23, 0.39, 0.32 and 0.44. For the neural networks method, the Sen values are increased from 7.25%, 8.01%, 20.88%, 32.03%, 28.33% to 35.48% and the MCC values are increased from 0.05, 0.06, 0.24, 0.33, 0.28 to 0.38 on different percentage of pseudonegative samples. We can conclude that the performances of different evaluation metrics show a significant improvement with a higher percentage of pseudonegative samples, even in the situation of highly imbalance Ratio.
Table 6 demonstrates the classification results of the Oil dataset with highly imbalance Ratio. From the Table 6, the random forest method achieves 14.50%, 19.60%, 33.53%, 39.83%, 50.36% and 49.76% for Sen when the percentage of pseudonegative samples is fixed to 0%, 10%, 20%, 30%, 40% and 50%, respectively. In addition, the MCC value is 0.27, 0.32, 0.43, 0.50, 0.63 and 0.59. For the neural networks method, the Sen values are increased from 52.18%, 51.95%, 41.26%, 45.83%, 54.96% to 48.09% and the MCC values are increased from 0.58, 0.54, 0.48, 0.51, 0.55 to 0.51 with different percentage of pseudonegative samples. It indicates that the proposed method is prone to improve the discrimination of minority class while retains the considerable stability.
Furthermore, Fig. 6 shows the classification performance on the Solar Flare dataset under different percentage of pseudonegative samples. From Fig. 6(a), the Sen metric of neural network increase with the percentage of pseudonegative samples changing from 0% to 50%. Even there is little fluctuation from 40% to 50%. It maybe the distribution of original dataset is unclear. In the future, we will consider how to choose the percentage of pseudonegative samples automatically. For MCC performance, similar phenomenon can be obtained from Fig. 6(b).
Figure 7 shows the tendency of Oil dataset with highly imbalance Ratio in neural network and random forest classification. We can see that Sen and MCC of random forest gradually increase when the percentage changes from 0% to 50% in Fig. 7(a). However, the value of Sen and MCC of neural network has some fluctuate from 0% to 50%. It indicated that random forest is more stability of the proposed method for this dataset. Similar trends of MCC performance can be obtained from Fig. 7(b).
Discussion
Here we designed a supervised learning method based on maxrelevance and minredundant criterion beyond Pearson correlation coefficient and tested on four UCI datasets and three real bioinformatics datasets. Our results indicated that MMPCC is better than other sampling methods in terms of several evaluation metrics. The performance of different evaluation metrics shows a trend of increasing with a higher percentage of pseudonegative samples. On the other hand, different machine learning method has different character, so the experiment results have little instability. We also observed that MMPCC method can have good performance even in the situation of highly imbalance Ratio. This reveals that pseudonegative samples are good at solving the imbalance dataset problem.
Conclusions
In this study, we propose a new sampling method, which is called pseudonegative sampling, to handle the imbalanced classification problem based on Pearson correlation coefficient which integrates the maxrelevant and minredundant. In addition, an incremental searching method is used to find the target sample with little cost of computation. The experimental results demonstrate the superior performance of our method compared to other algorithms for imbalanced classification problems.
In future, we will apply the proposed MMPCC algorithm in more realworld bioinformatic applications with largescale imbalanced data. We will investigate the possibility of extending the MMPCC method to handle multipleclassification problem. Furthermore, we will use the stateoftheart machine learning methods [41–46] to handle the imbalanced classification problem.
Availability of data and materials
All relevant data are included in this published article and its additional files. The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Abbreviations
 Acc:

Accuracy
 CMC:

Contraceptive method choice
 FN:

The number of false negative
 FP:

The number of false positive
 MAXR:

Maxrelevance method
 MCC:

Mathews correlation coefficient
 MINR:

Minredundancy method
 MMPCC:

Maxrelevance and minredundancy Pearson correlation coefficient
 PCC:

Pearson correlation coefficient
 Pre:

Precision
 PSSM:

Position specific scoring matrix
 Sen:

Sensitivity
 Spe:

Specificity
 TN:

The number of true negatives
 TP:

The number of true positive
References
Greene CS, Tan J, Ung M, Moore JH, Cheng C. Big data bioinformatics. J Cell Physiol. 2014; 229(12):1896–900.
Greene AC, Giffin KA, Greene CS, Moore JH. Adapting bioinformatics curricula for big data. Brief Bioinforma. 2015; 17(1):43–50.
Zhang Y, Cao X, Sheng Z. Genemo: a search engine for webbased functional genomic data. Nucleic Acids Res. 2016; 44(Web Server issue):122–7.
Zhang Y, Pu Y, Zhang H, Cong Y, Zhou J. An extended fractional kalman filter for inferring gene regulatory networks using timeseries data. Chemometrics Intell Lab Syst. 2014; 138:57–63.
Liu B, Weng F, Huang DS, Chou KC. iro3wpseknc: Identify dna replication origins by threewindowbased pseknc. Bioinformatics. 2018; 34(18):3086–93.
Chuai G, Ma H, Yan J, Chen M, Hong N, Xue D, Zhou C, Zhu C, Chen K, Duan B, et al. Deepcrispr: optimized crispr guide rna design by deep learning. Genome Biol. 2018; 19(1):80.
Liu B, Yang F, Huang DS, Chou KC. ipromoter2l: a twolayer predictor for identifying promoters and their types by multiwindowbased pseknc. Bioinformatics. 2017; 34(1):33–40.
Yuan L, Zhu L, Guo WL, Zhou X, Zhang Y, Huang Z, Huang DS. Nonconvex penalty based lowrank representation and sparse regression for eqtl mapping. IEEE/ACM Trans Comput Biol Bioinforma (TCBB). 2017; 14(5):1154–64.
Zhu L, Zhang HB, Huang DS. Direct auc optimization of regulatory motifs. Bioinformatics. 2017; 33(14):243–51.
Yu H, Ni J, Zhao J. Acosampling: An ant colony optimizationbased undersampling method for classifying imbalanced dna microarray data. Neurocomputing. 2013; 101(2):309–18.
Deng S, Yuan J, Huang D, Wang Z. Sfaps: An r package for structure/function analysis of protein sequences based on informational spectrum method. In: 2013 IEEE International Conference on Bioinformatics and Biomedicine. Washington: IEEE: 2014. p. 29–34.
Zhang Y, Qiao S, Ji S, Zhou J. Ensemblecnn: Predicting dna binding sites in protein sequences by an ensemble deep learning method. In: 14th International Conference on Intelligent Computing. Berlin: SpringerVerlag: 2018. p. 301–6.
Hu J, Li Y, Zhang M, Yang X, Shen HB, Yu DJ. Predicting proteindna binding residues by weightedly combining sequencebased features and boosting multiple svms. IEEE/ACM Trans Comput Biol Bioinforma. 2017; 14(6):1389–98.
Zhao Z, Peng H, Lan C, Zheng Y, Fang L, Li J. Imbalance learning for the prediction of n6methylation sites in mrnas. BMC Genomics. 2018; 19(1):574.
Du X, Yao Y, Diao Y, Zhu H, Zhang Y, Li S. Deepss: Exploring splice site motif through convolutional neural network directly from dna sequence. IEEE Access. 2018; 6:32958–78.
Liu B, Li J, Cairns MJ. Identifying mirnas, targets and functions. Brief Bioinforma. 2012; 15(1):1–19.
Zhang Y, Zhang D, Mi G, Ma D, Li G, Guo Y, Li M, Zhu M. Using ensemble methods to deal with imbalanced data in predicting proteinprotein interactions. Computat Biol Chem. 2012; 36(2):36–41.
Zhu L, Deng SP, You ZH, Huang DS. Identifying spurious interactions in the proteinprotein interaction networks using local similarity preserving embedding. IEEE/ACM Trans Comput Biol Bioinforma (TCBB). 2017; 14(2):345–52.
Huang DS, Zhang L, Han K, Deng S, Yang K, Zhang H. Prediction of proteinprotein interactions based on proteinprotein correlation using least squares regression. Curr Protein Peptide Sci. 2014; 15(6):553–60.
You ZH, Lei YK, Gui J, Huang DS, Zhou X. Using manifold embedding for assessing and predicting protein interactions from highthroughput experimental data. Bioinformatics. 2010; 26(21):2744–51.
Xia JF, Zhao XM, Huang DS. Predicting protein–protein interactions from protein sequences using meta predictor. Amino Acids. 2010; 39(5):1595–9.
Shen Z, Bao W, Huang DS. Recurrent neural network for predicting transcription factor binding sites. Sci Rep. 2018; 8(1):15270.
Guo WL, Huang DS. An efficient method to transcription factor binding sites imputation via simultaneous completion of multiple matrices with positional consistency. Mole BioSystems. 2017; 13(9):1827–37.
Dan Y, Xu S, Yang W, Sun C, Yu H. A review of class imbalance learning methods in bioinformatics. Curr Bioinforma. 2015; 10(4):360–9.
Guo H, Li Y, Shang J, Gu M, Huang Y, Gong B. Learning from classimbalanced data: Review of methods and applications. Expert Syst Appl. 2017; 73:220–39.
Liu B, Li K, Huang DS, Chou KC. ienhancerel: identifying enhancers and their strength with ensemble learning approach. Bioinformatics. 2018; 34(22):3835–42.
Hassan AR, Haque MA. An expert system for automated identification of obstructive sleep apnea from singlelead ecg using random under sampling boosting. Neurocomputing. 2017; 235:122–30.
Khan SH, Hayat M, Bennamoun M, Sohel FA, Togneri R. Costsensitive learning of deep feature representations from imbalanced data. IEEE Trans Neural Netw Learn Syst. 2017; 29(8):3573–87.
Khan SH, Hayat M, Bennamoun M, Sohel FA, Togneri R. Costsensitive learning of deep feature representations from imbalanced data. IEEE Trans Neural Netw Learn Syst. 2017; 29(8):3573–87.
Jin X, Bo T, He H, Hong M. Semisupervised feature selection based on relevance and redundancy criteria. IEEE Trans Neural Netw Learn Syst. 2016; 28(9):1974–84.
Pons T, Vazquez M, Mateyhernandez ML, Brunak S, Valencia A, Izarzugaza JM. Kinmutrf: a random forest classifier of sequence variants in the human protein kinase superfamily. Bmc Genomics. 2016; 17(2):396.
Wang X, Lin P, Ho JW. Discovery of celltype specific dna motif grammar in cisregulatory elements using random forest. BMC Genomics. 2018; 19(1):929.
Dutta S, Madan S, Parikh H, Sundar D. An ensemble micro neural network approach for elucidating interactions between zinc finger proteins and their target dna. Bmc Genomics. 2016; 17(Suppl 13):1033.
GutiRrezTobal GC, lvarez D, Del CF, Hornero R. Utility of adaboost to detect sleep apneahypopnea syndrome from singlechannel airflow. IEEE Trans Biomed Engineer. 2015; 63(3):636–46.
Jin X, Zhao M, Chow TWS, Pecht M. Motor bearing fault diagnosis using trace ratio linear discriminant analysis. IEEE Trans Ind Electron. 2013; 61(5):2441–51.
Asuncion A. Uci machine learning repository. 2013. https://archive.ics.uci.edu/ml/index.php.
Quan Z, Guo M, Yang L, Jun, Wang. A classification method for classimbalanced data and its application on bioinformatics. J Comput Res Dev. 2010; 47(8):1407–14.
Hu J, Li Y, Zhang M, Yang X, Shen HB, Yu DJ. Predicting proteindna binding residues by weightedly combining sequencebased features and boosting multiple svms. IEEE/ACM Trans Comput Biol Bioinforma. 2017; 14(6):1389–98.
Si J, Zhang Z, Lin B, Schroeder M, Huang B. Metadbsite: a meta approach to improve protein dnabinding sites prediction. Bmc Syst Biol. 2011; 5(1):7.
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. Smote: synthetic minority oversampling technique. J Artif Intell Res. 2002; 16(1):321–57.
Qiao S, Han N, Wang J, Li R, Gutierrez LA, Wu X. Predicting longterm trajectories of connected vehicles via the prefixprojection technique. IEEE Trans Intell Trans Syst. 2018; 19(7):2305–15.
Qiao S, Shen D, Wang X, Han N, Zhu W. A selfadaptive parameter selection trajectory prediction approach via hidden Markov models. IEEE Trans Intell Trans Syst. 2015; 16(1):284–96.
Qiao S, Han N, Zhu W, Gutierrez LA. TraPlan: an effective threeinone trajectoryprediction model in transportation networks. IEEE Trans Intell Trans Syst. 2015; 16(3):1188–98.
Qiao S, Han N, Gao Y, Li R, Huang J, Guo J, Gutierrez LA, Wu X. A fast parallel community discovery model on complex networks through approximate optimization. IEEE Trans Knowl Data Engineer. 2018; 30(9):1638–51.
Qiao S, Tang C, Jin H, Long T, Dai S, Ku Y, Chau M. PutMode: prediction of uncertain trajectories in moving objects databases. Appl Intell. 2010; 33(3):370–86.
Qiao S, Han N, Zhou J, Li R, Jin C, Gutierrez LA. Socialmix: A familiaritybased and preferenceaware location suggestion approach. Engineer Appl Artif Intell. 2018; 68:192–204.
Acknowledgements
Not applicable.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 25, 2019: Proceedings of the 2018 International Conference on Intelligent Computing (ICIC 2018) and Intelligent Computing and Biomedical Informatics (ICBI) 2018 conference: bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume20supplement25.
Funding
Publication costs are funded by the National Natural Science Foundation of China under Grants (Nos. 61702058, 61772091, 61802035, 71701026, 61962006), the China Postdoctoral Science Foundation funded project (No. 2017M612948), the Scientific Research Foundation for Education Department of Sichuan Province under Grant (No. 18ZA0098), the Innovative Research Team Construction Plan in Universities of Sichuan Province under Grant (No. 18TD0027), the Natural Science Foundation of Guangxi under Grant (No. 2018GXNSFDA138005), the Sichuan Science and Technology Program under Grant (Nos. 2018JY0448, 2019YFG0106, 2019YFS0067), the Scientific Research Foundation for Advanced Talents of Chengdu University of Information Technology under Grant (Nos. KYTZ201717, KYTZ201715, KYTZ201750), the Scientific Research Foundation for Young Academic Leaders of Chengdu University of Information Technology under Grant (Nos. J201706, J201701). Guangdong Province Key Laboratory of Popular High Performance Computers under Grant (No. 2017B030314073).
Author information
Affiliations
Contributions
Zhang YQ, Qiao SJ and Zhou JL conceived the study and developed the approach. Lu RZ, Han N and Liu DX carried out and optimized the experiments. All authors contributed to result interpretation. All authors contributed to the drafting and revision of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zhang, Y., Qiao, S., Lu, R. et al. How to balance the bioinformatics data: pseudonegative sampling. BMC Bioinformatics 20, 695 (2019). https://doi.org/10.1186/s1285901932694
Published:
DOI: https://doi.org/10.1186/s1285901932694
Keywords
 Imbalanced data
 Pseudonegative sampling
 Pearson correlation coefficients
 Maxrelevance
 Minredundancy