How to balance the bioinformatics data: pseudo-negative sampling

Background Imbalanced datasets are commonly encountered in bioinformatics classification problems; that is, the number of negative samples is much larger than that of positive samples. In particular, data imbalance causes the performance on the minority class of positive samples to be underestimated. Therefore, how to balance bioinformatic data is a challenging problem. Results In this study, we propose a new data sampling approach, called pseudo-negative sampling, which can be applied effectively when negative samples greatly dominate positive samples. Specifically, we design a supervised learning method based on a max-relevance min-redundancy criterion on the Pearson correlation coefficient (MMPCC), which is used to choose pseudo-negative samples from the negative samples and view them as positive samples. In addition, MMPCC uses an incremental searching technique to select optimal pseudo-negative samples, which reduces the computation cost. Consequently, the discovered pseudo-negative samples have strong relevance to positive samples and less redundancy to negative ones. Conclusions To validate the performance of our method, we conduct experiments based on four UCI datasets and three real bioinformatics datasets. According to the experimental results, the performance of MMPCC is better than that of other sampling methods in terms of Sensitivity, Specificity, Accuracy and the Matthews Correlation Coefficient. This reveals that the pseudo-negative samples are particularly helpful for solving the imbalanced dataset problem. Moreover, the gain in Sensitivity obtained by adding pseudo-negative samples to the minority samples grows with the improvement of prediction accuracy on all datasets.


Background
This work is motivated by a real-world requirement in bioinformatic data processing: negative samples very commonly dominate positive samples, a phenomenon called the data imbalance problem. In general, genetic data mining cannot be achieved with limited positive samples. We therefore ask whether positive samples can be mixed with pseudo-negative data (data that are labelled negative but that have maximum relevance to the positive samples and minimum redundancy with the negative samples) to predict the categories of samples. Because there are not enough positive samples, biologists cannot perform experiments. Consequently, some positive samples are not identified, or are categorised as negative samples; these can be viewed as pseudo-negative samples. Selecting these pseudo-negative samples is therefore an alternative way to solve the imbalanced data problem in bioinformatics.
To the best of our knowledge, researchers have proposed several strategies to reduce the influence of imbalanced data. These existing methods can be classified into data-level approaches and algorithmic-level approaches [25,26]. In data-level approaches, re-sampling techniques are employed to balance the sample space of an imbalanced dataset in order to alleviate the negative effect of the skewed distribution of samples on the learning process. Re-sampling methods are very commonly used because they are independent of the classifier. Re-sampling techniques can be classified into three categories depending on the method used to balance the proportion of positive and negative samples: (1) Over-sampling: eliminating the negative effect of the skewed distribution by generating new samples of the minority class. Two widely used approaches to generate minority samples are Random Over-Sampling (ROS), which randomly duplicates minority samples, and SMOTE. (2) Under-sampling: balancing the data by discarding samples from the majority class. The simplest yet most effective method is Random Under-Sampling (RUS), which involves the random elimination of majority class examples [27]. RUS deals with class imbalance problems effectively. (3) Hybrid methods: these combine over-sampling and under-sampling. The commonly used algorithmic-level approach is cost-sensitive learning, which assigns higher costs to the minority class [28,29].
However, RUS often loses some important classification information and ROS is time-consuming and often results in the phenomenon of overfitting. So, it is essential to propose advanced data sampling approaches to maintain the structure of groups and generate new data according to its underlying distribution.
To overcome the problems caused by imbalanced bioinformatic data, we propose a pseudo-negative sampling approach based on the max-relevance and min-redundancy Pearson correlation coefficient (called MMPCC). In the MMPCC approach, Pearson correlation coefficients are used to measure the similarity between positive and negative samples, and the coefficients are learned from positive and negative samples based on the max-relevance and min-redundancy criteria. The new algorithm can discover pseudo-negative samples, which may be viewed as positive samples although their labels are negative. The proposed sampling approach aims at reducing the imbalance ratio. The experiments are conducted on four UCI datasets and three real-life bioinformatics datasets.
Contribution: The original contributions of this study can be summarized as follows.
1) We propose the concept of pseudo-negative samples and present a pseudo-negative sampling method which is based on the max-relevance and min-redundancy Pearson correlation coefficient in supervised learning. In particular, both positive and negative samples are taken into full consideration in order to find optimal pseudo-negative samples.
2) We use an incremental searching method for calculating the coefficient of positive and negative samples, which can avoid the high computational cost in selecting the subsets of pseudo-negative samples.
3) We conduct extensive experiments and the results demonstrate the advantage of the MMPCC method for handling the imbalanced bioinformatic data.

Pseudo-negative sampling method
Although pseudo-negative samples are labelled negative, they have maximum relevance to the positive samples and minimum redundancy with the negative samples. The key idea of the pseudo-negative sampling approach is to select a subset of the negative samples and, in the training phase, assign them to the positive class by applying the max-relevance and min-redundancy method on the Pearson correlation coefficient. The formal definition of pseudo-negative samples is given as follows. Definition 1 (Pseudo-negative samples). Given a positive sample set $S^+ = \{S^+_1, \dots, S^+_m\}$ and a negative sample set $S^- = \{S^-_1, \dots, S^-_n\}$, the pseudo-negative sample set is a subset $S^* = \{S^*_1, \dots, S^*_l\} \subset S^-$, where $m$ is the total number of positive data, $n$ is the total number of negative data, $m \ll n$, and $l$ is the number of pseudo-negative samples. The purpose of our method is to identify the pseudo-negative sample set $S^*$ (which might contain $l$ samples) based on $S^+$ and $S^-$, where $l < m$.
One of the best-known sequential search methods is the incremental sample search algorithm, which we employ in this study. To achieve incremental sample searching, the pseudo-negative sample set starts from $S^*_0 = \emptyset$, and a quantitative criterion $Q(S^*_i)$ is used to measure the similarity of samples in $S^*_i$. In each round of searching, a sample $S^*$ is added to the sample set, giving $S^*_k = S^*_{k-1} \cup \{S^*\}$. Here $Q(S^*_i)$ plays an important role in the sample selection and can be defined according to different requirements. The validation accuracy is utilized to evaluate the new sample subsets. In this study, the metric of Eq. 3 is utilized to evaluate the similarity of the samples in $S^*_{k-1}$ and $S^*$; in the corresponding quantitative criterion, $S^*$ is a potential pseudo-negative sample, $S^*_{k-1}$ is the pseudo-negative sample set, and $A$ represents the validation accuracy.
In this study, we employ the Pearson correlation coefficient between samples in order to select a new sample, and $Q(S^*_i)$ is re-expressed in terms of this coefficient. The details of calculating the Pearson correlation coefficient are given in the following.

Max-relevance and min-redundancy on pearson correlation coefficient
The Pearson correlation coefficient (PCC) [30] is defined on the covariance matrix and evaluates the strength of the relationship between two vectors. In general, the coefficient between two vectors $\alpha_i$ and $\alpha_j$ is defined as follows: $$\rho(\alpha_i, \alpha_j) = \frac{\operatorname{cov}(\alpha_i, \alpha_j)}{\sqrt{\operatorname{var}(\alpha_i)\,\operatorname{var}(\alpha_j)}}$$ According to the max-relevance criterion, the PCC between a negative sample and a positive sample is formalized as $D(S^-_i, S^+_j)$ (Equation 7), where $S^-_i \in S^-$, $i \in N$, $S^+_j \in S^+$ and $j \in M$. The most relevant sample set can be obtained by maximizing $D(S^-_i, S^+_j)$.
Based on the min-redundancy criterion, samples are selected by minimizing the redundancy $R$ among the samples in the selected pseudo-negative subset (Equation 9). For the incremental search method, an operator $(D, R)$ is defined in Equation 10 in order to jointly optimize the max-relevance and min-redundancy information, and the best selected sample $S^*$ is given by Equation 12. Assume we have the sample subset $S^*_{k-1}$, which contains $k-1$ samples. In the next step of searching, the $k$th sample is obtained from the sample subset $\{S^- - S^*_{k-1}\}$. Then, $S^*_k$ can be calculated by Eq. 12 based on $(D, R)$.
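The quantities used by the two criteria can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `relevance` and `redundancy`, the use of a plain average over the compared samples, and the convention of returning 0 for a constant vector are choices made here, and the paper's Equations 7 and 9 may aggregate the coefficients differently.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: treat a constant vector as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def relevance(candidate, positives):
    """Assumed relevance D: mean PCC between a negative candidate and all positives."""
    return sum(pearson(candidate, p) for p in positives) / len(positives)


def redundancy(candidate, selected):
    """Assumed redundancy R: mean PCC between a candidate and the already selected samples."""
    if not selected:
        return 0.0
    return sum(pearson(candidate, s) for s in selected) / len(selected)
```

A candidate that correlates strongly with the positive samples but weakly with the samples already chosen scores well on both criteria.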

The proposed pseudo-negative sampling algorithm
Based on the aforementioned preliminaries, we propose a pseudo-negative sampling algorithm based on max-relevance and min-redundancy on the Pearson correlation coefficient, which is called MMPCC. The details of the MMPCC algorithm are presented in Algorithm 1 and the flow chart is shown in Fig. 1.
As described in Algorithm 1, the selected pseudo-negative samples are updated step by step. Firstly, the max-relevance between the negative sample and the positive sample is calculated by Equation 7 in order to choose candidate pseudo-negative samples. Then, the newly selected sample is identified based on the min-redundancy of the samples in the selected pseudo-negative subsets by Equation 9. Lastly, the new sample is identified as a pseudo-negative sample by Equation 12.
It is worthwhile to note that l is specified by experts in order to determine how many pseudo-negative samples should be inserted into the positive sample set.
The computational complexity of MMPCC, MAXR and MINR includes two parts: the computation of similarity matrices and the computation of sample ranking. The operator $\psi_{\mathrm{MAXR}}$ can be obtained via Equation 7, the operator $\psi_{\mathrm{MINR}}$ can be calculated by Equation 9, and the MMPCC model can be computed by Equation 12.
As for MAXR, the computation of Pearson correlation coefficient of all pairwise negative data and positive data requires the complexity of O(n * m * f ), where n is the number of negative data, m is the number of positive data and f is the number of attributes of each data. As for MINR, the computation complexity is O(n * l * f ), where l is the number of pseudo-negative samples. Therefore, the computation complexity of MMPCC is the sum of MAXR and MINR, that is, O(n * m * f + n * l * f ).

Algorithm 1 Pseudo-negative sampling by Max-relevance and Min-redundancy on Pearson Correlation Coefficient
l is the number of pseudo-negative samples.
1: Initialize the target sample subset $S^*_0 = \emptyset$ and the available sample subset $S^-_\alpha = S^- - S^*_l$.
2: for k = 1 to l do
3:   for each $S^-_j$ in $S^-_\alpha$ and each $S^+_i$ in $S^+$ do
4:     search for the new sample $S^*_k$ according to:
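The incremental search can be sketched as a greedy loop. The snippet below is a minimal illustration, not the authors' implementation: the function name `mmpcc_select` is hypothetical, and combining the two criteria as relevance minus redundancy (the difference form commonly used in max-relevance min-redundancy selection) is an assumption, since the selection rule itself is not legible in the listing.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: constant vectors count as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def mmpcc_select(positives, negatives, l):
    """Greedily choose l indices of `negatives` to treat as pseudo-negative samples."""
    available = list(range(len(negatives)))
    selected = []
    # Relevance to the positive set does not change between rounds, so compute it once.
    rel = {
        i: sum(pearson(negatives[i], p) for p in positives) / len(positives)
        for i in available
    }
    for _ in range(min(l, len(negatives))):
        def score(i):
            if selected:
                red = sum(pearson(negatives[i], negatives[j]) for j in selected) / len(selected)
            else:
                red = 0.0  # nothing selected yet, so there is no redundancy
            return rel[i] - red  # assumed combination: relevance minus redundancy
        best = max(available, key=score)
        selected.append(best)
        available.remove(best)
    return selected


positives = [[1, 2, 3], [2, 4, 7]]
negatives = [[1, 2, 4], [3, 2, 1], [5, 5, 6], [0, 1, 2]]
picks = mmpcc_select(positives, negatives, l=2)
pseudo_negatives = [negatives[i] for i in picks]  # these would be relabelled as positive for training
print(picks)
```

Caching the relevance term keeps each round proportional to the number of remaining negatives times the number of samples already selected, which mirrors the split between the MAXR and MINR costs discussed above.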

Random forests
Random forests [31,32] is an ensemble learning method that works by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
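The aggregation step can be illustrated as follows. This sketch covers only how per-tree outputs are combined, assuming the trees have already been trained; the function names are hypothetical.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Return the most frequent class label among the per-tree predictions."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_outputs):
    """Return the mean of the per-tree numeric predictions."""
    return mean(tree_outputs)


print(forest_classify(["pos", "neg", "pos", "pos", "neg"]))
print(forest_regress([1.0, 2.0, 3.0]))
```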

Neural networks
A neural network [33] is composed of several simple "neurons", and the output of one neuron is the input of another. The connections of the biological neuron are modeled as weights. A positive weight reflects an excitatory connection, while a negative weight reflects an inhibitory connection. All inputs are modified by a weight and summed. This activity is referred to as a linear combination. Finally, an activation function controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or between -1 and 1.
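A single neuron of this kind can be sketched as a weighted sum followed by an activation function. The bias term and the choice of sigmoid (range 0 to 1) and tanh (range -1 to 1) are assumptions made for illustration; the text does not specify which activation is used.

```python
import math


def neuron(inputs, weights, bias=0.0, activation="sigmoid"):
    """Weighted sum of the inputs passed through an activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # linear combination
    if activation == "sigmoid":
        return 1.0 / (1.0 + math.exp(-z))  # output in (0, 1)
    if activation == "tanh":
        return math.tanh(z)  # output in (-1, 1)
    raise ValueError(f"unknown activation: {activation}")


# A positive weight pushes the output up (excitatory); a negative one pushes it down (inhibitory).
print(neuron([1.0, 1.0], [2.0, -0.5]))
print(neuron([1.0, 1.0], [2.0, -0.5], activation="tanh"))
```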

AdaBoost
AdaBoost, short for "Adaptive Boosting", is a general ensemble method [34]. It focuses on classification problems and aims to convert a set of weak classifiers into a strong one. The final equation for classification can be represented by $$F(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \theta_m f_m(x)\right),$$ where $f_m$ represents the $m$th weak classifier and $\theta_m$ is the corresponding weight. It is exactly the weighted combination of $M$ weak classifiers.
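The weighted combination of weak classifiers can be sketched as below. The snippet shows only the prediction step and assumes labels in {-1, +1} and weights that have already been learned; the threshold classifiers and weight values are made up for illustration.

```python
def adaboost_predict(x, weak_classifiers, weights):
    """Sign of the weighted sum of weak-classifier outputs (each output is -1 or +1)."""
    total = sum(theta * f(x) for f, theta in zip(weak_classifiers, weights))
    return 1 if total >= 0 else -1


# Three hypothetical decision stumps on a single numeric feature.
stumps = [
    lambda x: 1 if x > 2 else -1,
    lambda x: 1 if x > 5 else -1,
    lambda x: 1 if x > 1 else -1,
]
thetas = [0.5, 0.3, 0.4]

print(adaboost_predict(3, stumps, thetas))
print(adaboost_predict(0, stumps, thetas))
```

For x = 3 the first and third stumps vote +1 and outweigh the second, so the combined prediction is +1 even though one weak classifier disagrees.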

Discriminant analysis
Discriminant analysis (DA) is a classification method. The basic idea is that two or more clusters or populations are known a priori, and one or more observations are classified into one of the known populations according to their measured characteristics [35]. Let X be a q-dimensional vector representing an observation from one of several possible classes. If the category is unknown, X can be classified using the discriminant analysis approach. Alternatively, DA can be used to characterize the difference between classes via a discriminant function.

Datasets
In order to evaluate the prediction performance of MMPCC on pseudo-negative sampling, we compare it with the state-of-the-art prediction methods. In experiments, we use four UCI Repository datasets [36] and three real bioinformatic datasets. Table 1 introduces the detail of the datasets. From Table 1, we can see that the number of attributes of each dataset is 9, 3, 10, 49, 180, 180 and 25, respectively. We use all attributes of each dataset in MMPCC. In MMPCC, the Pearson correlation coefficient is used to calculate the similarity between negative and positive samples in Equation 7, and is also applied in Equation 10 and Equation 12. Additionally, the coefficient between two vectors α i and α j in Equation 5 is obtained by all attributes of each dataset.
In Table 1, Positive represents the number of positive samples, Negative represents the number of negative samples, and Ratio = Negative Numbers / Positive Numbers.
More specifically, the first UCI dataset, Contraceptive Method Choice (CMC), contains 333 minority samples and 1140 majority samples, and the number of attributes is 9. The second UCI dataset, Haberman's Survival Data, contains 81 minority samples and 225 majority samples, and the number of attributes is 3. The third dataset, Solar Flare, records the number of solar flares. Each attribute counts the number of a certain type of solar flare within 24 hours. Each instance represents the number of all types of flares in an active region on the sun. The data contains 69 minority samples and 1320 majority samples, with 10 attributes. The fourth dataset, Oil, contains 41 minority samples and 896 majority samples, with 49 attributes.
The first bioinformatic dataset, SNP data [37], includes 183 positive samples and 2891 negative samples, and the number of attributes is 25. The second bioinformatic dataset, PDNA-543 [38], consists of 543 protein sequences, all of which were released in the PDB (Protein Data Bank) before October 10, 2014. There are 9,549 DNA-binding residues as positive samples and 134,995 non-binding residues as negative samples in PDNA-543. The third bioinformatic dataset, PDNA-316, was constructed by Si et al. [39]; it has 316 DNA-binding protein chains with 5,609 binding residues and 67,109 non-binding residues.
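As a worked example of the Ratio definition, the snippet below recomputes Negative / Positive from the sample counts quoted in the text. The dictionary keys are shorthand labels chosen here, and the printed values are derived from those counts rather than copied from Table 1.

```python
# (positive, negative) sample counts as stated in the dataset descriptions
counts = {
    "CMC": (333, 1140),
    "Haberman": (81, 225),
    "Solar Flare": (69, 1320),
    "Oil": (41, 896),
    "SNP": (183, 2891),
    "PDNA-543": (9549, 134995),
    "PDNA-316": (5609, 67109),
}

# Ratio = number of negative samples / number of positive samples
ratios = {name: neg / pos for name, (pos, neg) in counts.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

Every ratio is well above 1, and the Solar Flare and Oil datasets are the most imbalanced of the seven.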

Evaluation metrics
In this study, four metrics are used to evaluate the performance of different classifiers: Sensitivity (Sen), Specificity (Spe), Accuracy (Acc) and the Matthews Correlation Coefficient (MCC). They are calculated according to the following equations: $$\mathrm{Sen} = \frac{TP}{TP + FN}, \quad \mathrm{Spe} = \frac{TN}{TN + FP}, \quad \mathrm{Acc} = \frac{TP + TN}{P + N},$$ $$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$ where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, FN is the number of false negatives, P is the number of positives, and N is the number of negatives. Sensitivity indicates how well the test predicts the true positives, Specificity measures how well the test predicts the true negatives, Accuracy measures how well the test predicts both true positives and true negatives, and MCC considers true and false positives and negatives. For all four metrics, higher values indicate better results.
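The four metrics can be computed from a confusion matrix as in the following sketch. It uses the standard definitions of Sensitivity, Specificity, Accuracy and MCC; the function name and the example counts are illustrative, and the code assumes both classes are present so that no denominator of Sen or Spe is zero.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, Specificity, Accuracy and MCC from confusion-matrix counts."""
    sen = tp / (tp + fn)  # fraction of positives that are recovered
    spe = tn / (tn + fp)  # fraction of negatives that are recovered
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # convention: 0 when undefined
    return {"Sen": sen, "Spe": spe, "Acc": acc, "MCC": mcc}


scores = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```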

Results
The purpose of the evaluation is to examine the effectiveness of our proposed MMPCC method in selecting pseudo-negative samples. Five sets of experiments are conducted. Experiment 1 compares different percentages of pseudo-negative sampling on two UCI datasets. Experiment 2 compares different percentages of pseudo-negative sampling on three bioinformatic datasets. Experiment 3 compares MMPCC with the max-relevance and the min-redundancy methods on the PDNA-316 dataset, which aims to evaluate the relation between the relevance and the redundancy. For simplicity, the max-relevance method is denoted by MAXR and the min-redundancy method is denoted by MINR. Experiment 4 compares MMPCC with other sampling methods on the bioinformatic datasets. Experiment 5 evaluates the method on two UCI datasets with a high imbalance ratio, Solar Flare and Oil.
In the experiments, five-fold cross-validation is used for training and evaluation. In order to give comprehensive results, Discriminant Analysis, AdaBoost, Random Forest and Neural Networks are employed for classification. We use DA, AdaBoost, RF and NN to denote these four classifiers in the experiments, respectively.
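A five-fold split can be sketched with the standard library alone. The generator below is an illustrative assumption about the protocol: the text does not say whether the folds are shuffled or stratified by class, and the function name and seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(23, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so every sample is used for evaluation once and for training four times.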

Experiment 1: experiments on UCI datasets
This set of experiments examines the contribution of different percentages of pseudo-negative sampling on the UCI datasets [36]. The results are shown in Table 2 and Table 3. As mentioned previously, we use the metrics of Sen, Spe, Acc and MCC. The performance of the AdaBoost classifier improves on Sen and MCC, which demonstrates the effectiveness of the proposed pseudo-negative sampling method. Table 3 shows results similar to those of Table 2 on the different metrics, which verifies that pseudo-negative sampling is very useful in classifying imbalanced data and can achieve good classification performance. Furthermore, the results indicate that pseudo-negative samples can be viewed as positive samples and used to classify objects. The results of MMPCC in Table 3 are not always consistent, and there are three points to note about this instability. Firstly, four classification methods were employed in this study: DA, RF, NN and AdaBoost. Different machine learning methods have different characteristics, so the experimental results show a little instability. Secondly, the values of Sensitivity and Specificity show a little instability, but the value of MCC is more stable in most of the experiments. Whereas Sensitivity and Specificity are single-aspect assessment metrics, MCC considers true and false positives and negatives and is generally regarded as a balanced measure; MCC can be used even if the class sizes are very different. Finally, the performance on the different evaluation metrics tends to increase with a higher percentage of pseudo-negative samples.

Experiment 2: experiments on real-Life bioinformatic datasets
In this section, we demonstrate the effectiveness of the proposed method, MMPCC, on the real bioinformatic datasets, including PDNA-543 [38], PDNA-316 [39] and SNP data [37]. The results are given in Fig. 2, Fig. 3 and Fig. 4. The Position Specific Scoring Matrix (PSSM) was used to extract the features from the protein sequences of PDNA-543 and PDNA-316. PSSM is a very important type of evolutionary feature, which is obtained by running the PSI-BLAST program to search the SwissProt database with three iterations and $10^{-3}$ as the E-value cutoff for multiple sequence alignment. In PSSM, there are 20 scores for each sequence position, and each score implies the conservation degree of a specific residue type at that position. For each data instance, all the scaled scores in PSSM are used as its evolutionary features. In this study, we use a window size of 9 residues and thus obtain a vector of normalized PSSM scores with 9×20=180 feature dimensions. Figure 2 shows the classification performance on the PDNA-543 dataset under different percentages of pseudo-negative samples, where RF-Sen and NN-Sen represent the Sensitivity values of the RF and NN classifiers, and RF-MCC and NN-MCC represent the MCC values of the RF and NN classifiers.
The Sen and MCC metrics of NN increase as the percentage of pseudo-negative samples changes from 0% to 50%. When the percentage of pseudo-negative samples changes from 0% to 30%, the Sen and MCC of the RF algorithm remain unchanged. However, when the percentage of pseudo-negative samples is above 30%, there is a clear trend that RF performs better as the percentage of pseudo-negative samples grows.
Figure 3 illustrates the classification performance on the PDNA-316 dataset under different percentages of pseudo-negative samples. The performance of RF is better than that of NN when the percentage is 0% and 10% in terms of Sen and MCC. When the percentage is above 20%, the performance of NN increases drastically and is better than that of RF, which shows that adding more pseudo-negative samples can greatly improve classification performance. However, the performance of RF is almost unchanged. This is because the pseudo-negative samples have little effect on the RF algorithm on this dataset.
Figure 4 shows the classification performance on the SNP data under different percentages of pseudo-negative samples. The Sen of NN grows rapidly across the different percentages of pseudo-negative samples, and the MCC of NN gradually increases when the percentage changes from 0% to 30% and then fluctuates only slightly from 40% to 50%. The Sen and MCC of RF grow as the percentage of pseudo-negative samples gradually increases.
Generally speaking, this set of experiments illustrated that the pseudo-negative samples are very important and can be used to improve the effectiveness of classification.

Experiment 3: comparison of MMPCC, MAXR and MINR on the PDNA-316 dataset
In this section, we employ five-fold cross-validation to estimate the prediction performance of the proposed MMPCC method on the four metrics. We compare MMPCC with other sampling methods, namely MAXR (the max-relevance method based on Equation 7) and MINR (the min-redundancy method based on Equation 9) [30]. In the experiments, the PDNA-316 dataset is employed to evaluate the effectiveness of MMPCC. The comparison results are shown in Fig. 5.
Fig. 4 Performance comparison of RF and NN classifiers on SNP data under different percentages of pseudo-negative samples
According to Fig. 5, MMPCC outperforms the MAXR and MINR methods in terms of Sen, Spe, Acc and MCC with the RF and NN classifiers. From Fig. 5(a), the pseudo-negative samples have a large influence on the Sen value. The Sen value of MMPCC is significantly better than that of MAXR and MINR when NN is used as the classifier. For the RF classifier, MAXR is the best one when more pseudo-negative samples are added. From Fig. 5(b), as the percentage of pseudo-negative samples increases, the Spe value of MMPCC remains very stable with RF and NN. This can be explained by the fact that some pseudo-negative samples are still negative ones. In addition, the Sen value can be improved at the cost of a degradation in the Spe value. Figure 5(c) demonstrates that MMPCC is the most stable method on Acc with the RF classifier. Figure 5(d) shows that the MCC value of MMPCC significantly outperforms the MAXR and MINR methods. The performance of MAXR is better than that of MINR. The experimental results indicate that MMPCC attempts to utilize more representative samples and find the pseudo-negative samples (which can be viewed as positive samples) among the majority negative samples.

Experiment 4: comparison of MMPCC and classical sampling methods on bioinformatic datasets
In order to verify the advantage of our method, we also compare the prediction performance of MMPCC with that of a classical over-sampling method, SMOTE [40], on the PDNA-316 dataset.
SMOTE is an over-sampling approach in which the minority class is over-sampled by creating "synthetic" examples rather than by over-sampling with replacement. The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining it to any of its k nearest minority class neighbors. Depending on the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen. In order to compare the performance of the algorithms, we use the default value of 5 nearest neighbors, the same as in reference [40]. The comparison results are shown in Table 4. Neural networks are used because they can learn and model relationships between inputs and outputs that are nonlinear and complex, and can make generalizations and inferences. Random forests have good runtime performance and are commonly used to deal with unbalanced and missing data.
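The interpolation step of SMOTE can be sketched as follows. This is a simplified illustration rather than the reference implementation: it draws the base sample at random instead of iterating over every minority sample, and the function name, seed and toy data are assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating towards nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by Euclidean distance (x itself excluded)
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to the neighbour
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic


minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]]
new_samples = smote(minority, n_synthetic=6, k=2, seed=1)
print(len(new_samples))
```

Each synthetic point lies on a segment between two real minority samples, which is why SMOTE introduces artificial data, whereas pseudo-negative sampling reuses samples that already exist in the negative class.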
According to Table 4, MMPCC outperforms the SMOTE method in terms of all evaluation metrics. As shown in Table 4, as the percentage increases, the MCC values of MMPCC with the RF classifier are 0.333, 0.337, 0.351, 0.363 and 0.367, respectively, and the improvements over the SMOTE method are 0.098, 0.091, 0.101, 0.105 and 0.109, respectively. This is due to the fact that over-sampling techniques introduce a number of duplicated or artificial samples for large-scale imbalanced data, whereas MMPCC introduces no man-made duplicated data. The pseudo-negative sampling technique helps identify more useful samples from the negative class, which is often neglected, so MMPCC performs better than the SMOTE sampling method.

Experiment 5: experiments on highly imbalance ratio datasets
In order to validate the performance of the proposed method on datasets with a high imbalance Ratio, a comparative evaluation is performed on two UCI datasets, Solar Flare and Oil. The Solar Flare dataset contains 69 minority samples and 1320 majority samples, with 10 attributes, and the Ratio is 19.1. The Oil dataset contains 41 minority samples and 896 majority samples, with 49 attributes. Fig. 6 shows the classification performance on the Solar Flare dataset under different percentages of pseudo-negative samples. From Fig. 6(a), the Sen metric of the neural network increases as the percentage of pseudo-negative samples changes from 0% to 50%, although there is a little fluctuation from 40% to 50%. This may be because the distribution of the original dataset is unclear. In the future, we will consider how to choose the percentage of pseudo-negative samples automatically. For MCC, a similar phenomenon can be observed in Fig. 6(b). Figure 7 shows the trends on the Oil dataset, which has a high imbalance Ratio, for the neural network and random forest classifiers. We can see that the Sen and MCC of random forest gradually increase when the percentage changes from 0% to 50% in Fig. 7(a). However, the Sen and MCC values of the neural network fluctuate somewhat from 0% to 50%. This indicates that random forest is more stable than the neural network under the proposed method for this dataset. Similar trends in MCC performance can be observed in Fig. 7(b).

Discussion
Here we designed a supervised learning method based on the max-relevance and min-redundancy criterion on the Pearson correlation coefficient and tested it on four UCI datasets and three real bioinformatics datasets. Our results indicate that MMPCC is better than other sampling methods in terms of several evaluation metrics. The performance on the different evaluation metrics tends to increase with a higher percentage of pseudo-negative samples. On the other hand, different machine learning methods have different characteristics, so the experimental results show a little instability. We also observed that the MMPCC method can perform well even when the imbalance Ratio is high. This reveals that pseudo-negative samples are helpful for solving the imbalanced dataset problem.

Conclusions
In this study, we propose a new sampling method, called pseudo-negative sampling, to handle the imbalanced classification problem; it is based on the Pearson correlation coefficient and integrates the max-relevance and min-redundancy criteria. In addition, an incremental searching method is used to find the target samples at little computational cost. The experimental results demonstrate the superior performance of our method compared to other algorithms for imbalanced classification problems.
In the future, we will apply the proposed MMPCC algorithm to more real-world bioinformatic applications with large-scale imbalanced data. We will investigate the possibility of extending the MMPCC method to handle multi-class classification problems. Furthermore, we will use state-of-the-art machine learning methods [41][42][43][44][45][46] to handle the imbalanced classification problem.