The general framework of the proposed model is shown in Fig. 1. It consists of five phases: ASD dataset preparation, gene ontology enrichment, gene pre-classification, classification, and evaluation. In the first phase, the dataset is collected from the Simons Foundation Autism Research Initiative (SFARI) gene database, a database specialized in autism research that identifies candidate autism genes. Secondly, genes are annotated using gene ontology, and the similarity between genes is calculated using different similarity functions such as Resnik, Wang, Relevance, and the proposed hybrid gene similarity (HGS) function. Then the class distribution is resampled to be balanced before classification. In the fourth phase, ASD genes are predicted using Random Forest (RF) [39], Support Vector Machine (SVM) [40], Naive Bayes (NB) [41], K-nearest neighbor (KNN) [42], Adaptive Boosting (AdaBoost), and Gradient Boosting classifiers. Finally, all classifiers are evaluated using k-fold cross-validation, and their performance is measured using precision, recall, F-measure, and accuracy.
ASD dataset preparation
The Simons Foundation Autism Research Initiative (SFARI) gene database https://gene.sfari.org/ is used to build the dataset for the proposed model. SFARI contains all genes associated with ASD, classified as in Fig. 2. Each gene has an evidence score that reflects how strongly it is associated with the development of autism. SFARI genes are categorized into seven categories based on this evidence score. Genes with the highest confidence of a relation to ASD belong to category one, and genes with less confidence than category one, which may still be strong ASD candidates, belong to category two. Categories three and four have the lowest evidence as ASD candidate genes. Category five has an indirect relationship with ASD, and category six contains genes with no supporting evidence for ASD. Therefore, in this research, categories one, two, three, and four are used for the analysis. In addition, the database marks syndromic genes in a separate column; these genes are associated with syndromes whose symptoms or signs may correlate with ASD. During dataset preparation, only syndromic genes that belong to categories one, two, three, and four participate in the analysis. The SFARI categories one and two are treated as the highest confidence genes (HCG) and categories three and four as the lowest confidence genes (LCG).
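To make this selection concrete, the following minimal sketch (Python with pandas) filters an exported SFARI gene table into the HCG and LCG groups. The file name and column names ("gene-score", "gene-symbol") are assumptions and may differ between database releases.

```python
import pandas as pd

# Illustrative sketch: load an exported SFARI gene table (column names assumed).
sfari = pd.read_csv("SFARI-Gene_genes.csv")

# Keep only evidence categories 1-4; syndromic genes are retained only when
# they also carry one of these category scores (rows without a 1-4 score drop out).
sfari = sfari[sfari["gene-score"].isin([1, 2, 3, 4])]

# Split into the two confidence groups used later for classification.
hcg = sfari[sfari["gene-score"].isin([1, 2])]["gene-symbol"].tolist()  # highest confidence
lcg = sfari[sfari["gene-score"].isin([3, 4])]["gene-symbol"].tolist()  # lowest confidence
```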
Gene ontology enrichment
ASD genes are enriched using gene ontology (GO) [15] to calculate the functional similarity between genes. Gene annotation means that each gene is annotated with terms extracted from the GO database. The gene ontology is constructed as a hierarchical graph that annotates genes with terms. Each term in GO is represented by a node, and the relations between terms are represented by edges. Each term belongs to one of the following three categories, which describe the different functions:
-
Molecular Function Gene Ontology (MFGO).
-
Biological Process Gene Ontology (BPGO).
-
Cellular Component Gene Ontology (CCGO).
The gene ontology consists of three core branches. The first, molecular function, describes the activity itself, regardless of why or where these actions happen. The biological process branch describes the relation between the initial configurations and the final product, ignoring the mechanism of the process itself. The third, cellular component, describes the location relative to the entire cell structure.
The proposed model focuses on the biological process branch of gene ontology for the analysis. A gene functional similarity matrix must be built to classify the candidate ASD genes. Measuring the similarity between genes amounts to measuring the semantic similarity between their terms: if the terms of two genes are semantically similar, the genes are also likely to be similar in their functions. Different gene functional similarity methods are used, such as Resnik [43], Relevance [44], and Wang [45]. Resnik and Relevance are information content (IC)-based methods, which utilize the information in the ontology corpus file to measure the semantic similarity between two genes. Wang's method depends on the structure of GO, so it is considered a graph-based method.
Resnik's method is based on the information content of terms, defined as the negative logarithm of the probability of a term, as in Eq. 1.
$$\begin{aligned} IC(t) = - \log (Pro(t)) \end{aligned}$$
(1)
Pro(t) is the probability of term t, computed from the number of children of t relative to the total number of terms in the GO corpus, as in Eq. 2. IC is inversely related to the probability of a term: the more rarely a term appears in the corpus, the more information content it carries.
$$\begin{aligned} Pro(t) = \dfrac{\textit{Number Of t}_{\textit{Children}}}{\textit{Total Num of Terms in the Corpus}} \end{aligned}$$
(2)
After that, the semantic similarity between two terms is calculated using the information content of their most informative common ancestor (MICA), as in Eq. 3.
$$\begin{aligned} similarity_{Resnik}(t_{1},t_{2}) = IC(MICA) \end{aligned}$$
(3)
The Relevance method also depends on IC calculations, as in Eq. 4.
$$\begin{aligned} Relevance = \dfrac{2*IC(MICA)(1-Pro(MICA))}{IC(t_{1})+IC(t_{2})} \end{aligned}$$
(4)
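For illustration, the two IC-based similarities can be sketched as follows. This assumes the per-term probabilities Pro(t) of Eq. 2 and an ancestor lookup have already been precomputed from the GO corpus; the function and variable names are illustrative, not part of the proposed implementation.

```python
import math

def ic(pro_t: float) -> float:
    """Information content of a term, Eq. 1: IC(t) = -log(Pro(t))."""
    return -math.log(pro_t)

def mica(t1, t2, ancestors, pro):
    """Most informative common ancestor: the shared ancestor with the highest IC.

    ancestors maps each term to the set of its ancestors (including itself);
    pro maps each term to its probability Pro(t) from Eq. 2.
    """
    common = ancestors[t1] & ancestors[t2]
    return max(common, key=lambda t: ic(pro[t]))

def resnik(t1, t2, ancestors, pro):
    """Eq. 3: Resnik similarity is the IC of the MICA."""
    return ic(pro[mica(t1, t2, ancestors, pro)])

def relevance(t1, t2, ancestors, pro):
    """Eq. 4: Relevance similarity, which scales Resnik by 1 - Pro(MICA)."""
    m = mica(t1, t2, ancestors, pro)
    return 2 * ic(pro[m]) * (1 - pro[m]) / (ic(pro[t1]) + ic(pro[t2]))
```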
Wang's method, in Eq. 5, calculates the similarity between gene terms based on the positions of these terms in the GO directed graph and their linkage with their ancestors. Therefore, Wang's method considers the is-a and part-of edge relations.
$$\begin{aligned} similarity_{Wang}(X,Y) = \dfrac{\sum _{t \in T_{X} \cap T_{Y}}\left( S_{X}(t)+ S_{Y}(t)\right) }{SV(X)+SV(Y)} \end{aligned}$$
(5)
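A minimal sketch of Eq. 5 follows, assuming the semantic contributions S_X(t) and S_Y(t) of every term in each DAG have already been computed (e.g., by a routine such as Algo. 2); the dictionary-based representation is an assumption made for illustration.

```python
def wang_similarity(s_x: dict, s_y: dict) -> float:
    """Eq. 5: Wang similarity between terms X and Y.

    s_x / s_y map every term in the DAG of X (resp. Y) to its semantic
    contribution S_X(t) / S_Y(t), assumed to be precomputed.
    """
    common = set(s_x) & set(s_y)                       # T_X ∩ T_Y
    sv_x, sv_y = sum(s_x.values()), sum(s_y.values())  # SV(X), SV(Y)
    return sum(s_x[t] + s_y[t] for t in common) / (sv_x + sv_y)
```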
A hybrid gene similarity (HGS) function is proposed to measure the similarity between two ASD genes. HGS uses Wang's method as its basis while taking into account the number of children of each term, given its ancestor and descendant nodes. Algo. 1 and Algo. 2 illustrate the steps of the HGS method for measuring the similarity between two genes. The method uses the GO graph to count the children nodes of a term rather than using its IC value, and integrates this count with the Wang method.
The gene functional similarity matrix, i.e., the semantic similarity between genes, should be calculated before gene classification. Figure 3 shows how the semantic similarity is measured using the terms annotated from GO. Algo. 1 illustrates the steps to build “TermSimM”, which contains the semantic similarity values between the terms of two genes. Then the average best-matching strategy [29] is used to combine the semantic similarities between gene ontology terms. First, all annotated terms of two genes \(g_{1}\) and \(g_{2}\) are extracted. Each term of \(g_{1}\) is compared with all terms of \(g_{2}\), as in Fig. 3. For each term, the directed acyclic graph (DAG) is extracted from GO. The DAG of x, as in Algo. 1, is the term x with its ancestor terms \(T_{x}\) and the edges \(E_{x}\) between these terms. GO is represented in three branches (MFGO, BPGO, CCGO); our experiment involves only the BPGO branch. After that, the contributed semantic value of each term is calculated using the steps in Algo. 2, which follows the semantic contribution function of the Wang method with a different weight function. The weight \(W_{e}\) in the Wang method reflects the semantic value of a term's edges. Studies in [45, 46] found that the number of children of a specific term is negatively related to its IC value. Therefore, the semantic weight function (\(w_{e}\)) assigns different values to the constant d depending on the type of edge: d equals 0.3 for the part-of relation and 0.4 for the is-a relation. The constant c is set to 0.67, the value that gives a suitable correlation with the other methods. Hence, HGS builds on Wang's method using the number of an ancestor's children rather than the information content of ancestor terms, which makes it faster than the IC-based methods when computing the similarity between two genes.
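Only the final aggregation step is sketched below: combining the TermSimM matrix into a single gene-gene similarity with the average best-matching strategy [29]. The use of NumPy and the function name are illustrative, and the full HGS weight computation of Algo. 1 and Algo. 2 is not reproduced here.

```python
import numpy as np

def best_match_average(term_sim: np.ndarray) -> float:
    """Average best-matching strategy [29]: collapse the TermSimM matrix
    (rows = terms of g1, columns = terms of g2) into one gene-gene score."""
    row_best = term_sim.max(axis=1)   # best match in g2 for each term of g1
    col_best = term_sim.max(axis=0)   # best match in g1 for each term of g2
    return (row_best.sum() + col_best.sum()) / (len(row_best) + len(col_best))
```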
Gene pre-classification process
The SFARI autism spectrum database suffers from an unbalanced class distribution, where the majority class is negative (Not ASD) and the minority class is positive (ASD). Dealing with the dataset as it is results in misleadingly high accuracy, biases the machine learning classifiers, and neglects the minority class. Therefore, resampling the class distribution is the best choice for dealing with this problem. Resampling techniques either delete some examples randomly from the majority class (random undersampling) or duplicate some examples from the minority class (random oversampling). To avoid overfitting, random undersampling randomly skips examples from the majority class until the dataset becomes balanced, as in Eq. 6.
$$\begin{aligned} PrecUnder = \dfrac{\textit{num of positive instances}}{\textit{num of negative instances}}*100 \end{aligned}$$
(6)
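A minimal sketch of the undersampling step is given below, assuming binary labels with 1 = ASD and 0 = Not ASD. It draws majority-class examples at random until the two classes are equal in size, which corresponds to a PrecUnder of 100% in Eq. 6.

```python
import numpy as np

def random_undersample(X, y, seed=42):
    """Randomly drop majority-class (Not ASD) rows until both classes are balanced."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)   # minority: ASD
    neg_idx = np.flatnonzero(y == 0)   # majority: Not ASD
    keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    keep = np.concatenate([pos_idx, keep_neg])
    rng.shuffle(keep)
    return X[keep], y[keep]
```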
Classification
Baseline classifiers
Different machine learning classification techniques are used to evaluate the proposed model, such as Naive Bayes (NB) [41], Support Vector Machine (SVM) [40], K-nearest neighbors (KNN) [42], and Random Forest (RF) [39]. The inputs for this phase are two functional similarity matrices, one for the highest confidence genes (HCG) and one for the lowest confidence genes (LCG). Therefore, NB, SVM, KNN, and RF are applied to both HCG and LCG. Naive Bayes is a Bayesian classification technique based on calculating conditional probability using Bayes' theorem. The NB method is fast, accurate, and suitable for high-dimensional data, but it assumes that all features are independent, which does not hold in most applications.
Support Vector Machine is a supervised machine learning technique that treats its predictors as dependent features. SVM draws a separating line to split the input data into groups and then uses this line to predict which side new data falls on. SVM seeks the most suitable placement of the hyperplane separating the data into classes, which gives high performance. There are two types of SVM, linear SVM and radial SVM. SVM works well with low-dimensional data.
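As a sketch of how the four baselines can be applied to the HCG or LCG similarity matrix, the scikit-learn snippet below runs each classifier with k-fold cross-validation; the hyperparameter values are illustrative defaults, not necessarily those used in the reported experiments.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

def evaluate_baselines(X, y, folds=10):
    """Run the four baseline classifiers on a gene functional similarity
    matrix X (HCG or LCG) with ASD / Not ASD labels y."""
    baselines = {
        "NB": GaussianNB(),
        "SVM": SVC(kernel="rbf"),   # radial SVM; kernel="linear" for linear SVM
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    }
    for name, clf in baselines.items():
        scores = cross_validate(clf, X, y, cv=folds,
                                scoring=["precision", "recall", "f1", "accuracy"])
        print(name, {k: v.mean() for k, v in scores.items() if k.startswith("test_")})
```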
Ensemble learning techniques
Boosting is one of the ensemble learning techniques utilized to enhance the performance of the proposed model for predicting ASD genes. It is an iterative technique for building a strong learner from a set of weak learners: each model corrects the errors of the previous one sequentially, so the second weak learner attempts to correct the errors of the first, and so on. Two different boosting algorithms are used to build a more accurate model for predicting autism genes.
Adaptive Boosting M1 (AdaBoost) is the simplest boosting technique, as shown in Fig. 4. It uses decision stumps as weak learners and aggregates them into a stronger one, enhancing the predictive performance of the model. The steps of the AdaBoost algorithm are given in Algo. 3. In the beginning, all training samples are given equal weights, one divided by the total number of samples, which indicates that all samples are equally important. After that, in each iteration of building a new decision stump (DS), these weights are updated to guide the construction of the next stump. The total error and alpha have an inverse relationship: as the total error decreases, alpha increases and the weak learner (DS) has more influence on the prediction. The total error is the summation of the weights of the incorrectly classified instances. The idea of AdaBoost is to minimize the loss function; the exponential loss function gives more weight to misclassified instances and less weight to correctly classified ones. The algorithm keeps building decision stumps until either the specified number of trees is reached or the error becomes zero. Finally, the output is the strong learner prediction, which is the weighted summation of the hypotheses of all weak learners.
The value of alpha may be positive or negative (a sketch of this update is given after the list):
-
Positive alpha means that the predicted class label is equal to the actual sample class, which indicates that the samples are correctly classified. Accordingly, the weights for these samples are decreased.
-
Negative alpha means that the predicted class label and the actual sample class are unequal, indicating that the samples are misclassified. Accordingly, the weights for these samples are increased so that the next weak learner (decision stump) does not repeat these misclassifications.
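The weight update described above can be sketched as follows; this is a generic AdaBoost.M1-style update with labels in {-1, +1} and a small constant added to avoid division by zero, not a verbatim transcription of Algo. 3.

```python
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost weight update for a fitted decision stump.

    weights        : current sample weights (initially 1 / n_samples)
    y_true, y_pred : labels in {-1, +1}
    Returns the stump weight alpha and the renormalized sample weights.
    """
    miss = (y_true != y_pred)
    total_error = weights[miss].sum()                     # sum of misclassified weights
    alpha = 0.5 * np.log((1 - total_error) / (total_error + 1e-10))
    weights = weights * np.exp(-alpha * y_true * y_pred)  # up-weight mistakes, down-weight hits
    return alpha, weights / weights.sum()
```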
Gradient Boosting is another, more recent boosting algorithm that aims to form a strong learner from weak learners using a gradient-based iterative procedure. The gradient algorithm minimizes a loss function, which must be differentiable. Figure 5 shows the main process of the proposed regularized gradient boosting gene prediction model. HEC-ASD based on gradient boosting depends on four components for enhancing the prediction of ASD genes, as follows:
-
Loss function, which measures the efficiency of the proposed model in classifying new genes as the difference between the predicted value and the actual observed value.
-
Weak learners, which individually have low accuracy and high error, are used in the training phase; decision stumps are utilized as the weak learners.
-
Additive model, which means that the model works sequentially, adding trees (weak learners) iteratively. In each iteration, the loss function should decrease so that a stronger learner model is formed.
-
Regularization parameters, which are used to regulate the loss function and prevent overfitting or underfitting. These parameters are the number of trees, the learning rate, the maximum depth, and the lambda (L2) regularization. The learning rate shrinks the iterative gradient steps. Lambda (L2) regularization is a hyperparameter that controls the degree of regularization.
HEC-ASD, based on gradient boosting, utilizes the log loss function to minimize the total prediction error using Eq. 7, where \(y_{i}\) is the actual observed class value and \(p(y_{i})\) is the predicted probability.
$$\begin{aligned} logloss = -\frac{1}{N}\sum \limits _{i=1}^{N}\left[ y_{i}*\log (p(y_{i}))+(1-y_{i})*\log (1-p(y_{i}))\right] \end{aligned}$$
(7)
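As an illustrative sketch of the regularized gradient boosting setup, the snippet below uses an XGBoost-style classifier, since the parameters listed above (number of trees, learning rate, maximum depth, lambda L2) map directly onto it. The parameter values and the train/test split are assumptions, not the tuned configuration of HEC-ASD; the log loss of Eq. 7 is evaluated on the held-out genes.

```python
from xgboost import XGBClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

def train_regularized_gbm(X, y):
    """Fit a regularized gradient boosting classifier on gene similarity
    features X with ASD labels y and report the log loss (Eq. 7)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    model = XGBClassifier(
        n_estimators=200,      # number of trees
        learning_rate=0.1,     # shrinks each gradient step
        max_depth=3,           # maximum tree depth
        reg_lambda=1.0,        # lambda (L2) regularization
        objective="binary:logistic",
    )
    model.fit(X_train, y_train)
    print("log loss:", log_loss(y_test, model.predict_proba(X_test)[:, 1]))
    return model
```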