Identifying Alzheimer’s disease-related proteins by LRRGD

Background Alzheimer’s disease (AD) imposes a heavy burden on society and on families. Diagnosing AD early and discovering new drug targets are therefore crucial, and both could be advanced by identifying AD-related proteins. Because biological experiments are time-consuming and expensive, researchers have turned to developing more advanced algorithms to identify AD-related proteins. Results We first proposed the hypothesis that “similar diseases share similar related proteins”. Five similarity calculation methods were therefore introduced to find other diseases that are similar to AD. The proteins related to these diseases were then obtained from a public data set. Finally, these proteins served as the features of each disease and were used to map to its similarity to AD. We developed a novel method, ‘LRRGD’, which combines Logistic Regression (LR) and Gradient Descent (GD) and borrows the idea of Random Forest (RF). LR is used to regress features to similarities. Borrowing the idea of RF, hundreds of LR models were built by randomly selecting 40 features (proteins) each time. GD is used to find the optimal result. To avoid the drawback of local optimal solutions, a good initial value is chosen using known AD-related proteins. In total, 376 proteins were found to be related to AD. Conclusion Of the 376 proteins, 308 are novel. Three case studies were carried out to support the effectiveness of our method. These 308 proteins could give researchers a basis for biological experiments aimed at the treatment and diagnosis of AD.


Background
Alzheimer's disease (AD) [1] has become the greatest threat to the elderly. At present, there is no effective drug for AD. Many studies have reported that neurodegenerative diseases such as AD are closely related to aging diseases and can interact with them [2,3]. Many scholars have reported that the abnormal behavior of specific proteins is a key cause of AD [4,5]. This is because the main pathological features of AD patients are the large number of beta amyloid (A beta) deposits formed outside the neurons in the cortex and hippocampus, and the neurofibrillary tangles (NFT), with tau protein as the main component, formed inside neurons [6,7].
Recently, finding alternative ways to diagnose AD has become a hot topic [8]. Ray et al. found 18 plasma proteins with high specificity in AD patients. They then found that these proteins were associated with Aβ and tau levels in cerebrospinal fluid (CSF). Since then, the Human Discovery Multi-Analyte Profile (MAP) has become a popular tool for identifying plasma analytes. However, these exciting results raise a major issue: the protein panels are hard to reproduce [8]. Gisslen M et al. [9] found that the correlation between CSF and plasma NFL was stronger than that for tau. Olsson B et al. [10] confirmed this view and found that NFL was increased in the CSF of both AD and MCI patients. Studies have found this phenomenon in serum and plasma samples as well [11]. O'Bryant et al. [12] used a serum-based algorithm to distinguish AD from Parkinson's disease and cross-validated this algorithm. At present, biological experiments and bioinformatics methods are the most widely used approaches. Lista et al. [13] reviewed mass spectrometry-based blood biomarkers of AD. They concluded that about 20 proteins may be potential biomarkers of AD. They also emphasized that the molecular changes of neurodegenerative diseases such as AD may begin 20 years before the onset of clinical symptoms.
Complex protein interactions can be studied through protein-protein interaction (PPI) networks [2,14,15]. Most PPI networks are built from relationships between genes. Shubhabrata et al. [16] used the dense module searching (DMS) method to integrate gene-wide association results into a PPI network and identified candidate genes or sub-networks for AD. However, most protein networks are static, with highly averaged and idealized structures. In fact, as external conditions change, some proteins are degraded while others are translated [17]. This results in new protein interactions appearing and old ones disappearing.
Based on prior knowledge of protein interactions and biology, some researchers use machine learning [17,18] and pattern classification methods [19] to predict disease-related protein interactions. Machine learning methods include Bayesian networks [20], Markov models [21], Random Forest [22] and Support Vector Machines [23]. Barber et al. [24] used Simulated Annealing (SA) to select the proteins most relevant to AD and Random Forest (RF) to classify patients based on these proteins. The best model trained on serum data predicted disease status with an AUC of 0.66. When trained on serum data and tested on CSF data, the AUC was 0.77. However, machine learning methods usually need negative samples, which in practice are hard to obtain. Therefore, in this paper, we treat the identification of AD-related proteins as a regression problem, which makes it unnecessary to obtain negative sets. This can greatly improve the accuracy of recognition and reduce the false positive rate.

Data collection and database content

Disease ontology
A total of 3524 diseases were downloaded from the Disease Ontology (DO), an authoritative resource that contains comprehensive disease-related knowledge [25]. Each disease or disease concept is a node in DO, and each node has an ID. Nodes are linked by subordinate relationships. The similarity between AD and other diseases can be obtained from DO using similarity calculation methods.

UniProt
UniProt [26] consists of three parts. The UniProt Knowledgebase (UniProtKB) is the central access point for protein sequence, function, classification and cross-reference information. The UniProt Reference Clusters (UniRef) database combines closely related protein sequences into a single record to improve search speed; it currently comprises three sub-libraries defined by sequence similarity, namely UniRef100, UniRef90 and UniRef50. The UniProt Archive (UniParc) is a repository that records the history of all protein sequences. Users can query the database by text, search it with the BLAST program, or download data directly by FTP. All known disease-related proteins can be obtained from UniProt.

Gene ontology
Gene ontology (GO) is one of the most successful ontologies in the field of biomedicine. It provides a standard and accurate set of terms for describing molecular function, biological process and other related information, and it is widely used in biomedical research.
Resnik's method and Lin's method rest on the same principle. Both calculate similarity from GO terms, but Resnik's method uses the information content (IC) of the most informative common ancestor (MICA) of two terms. Wang's method improves on Resnik's method by considering multiple common ancestors. PSB considers the associations of GO terms. SemFunSim integrates semantic and gene functional associations to calculate similarity. Since it is hard to tell which method is the best, all five are used. The similarity between each of the 3524 diseases and AD is calculated with every method, so each disease gets 5 different similarity values, and we add these five values together as the final similarity. Figure 1 shows all the similarities between the 3524 diseases and AD that are higher than 1. The similarities of 2663 of the 3524 diseases are lower than 1, so they are not shown in Fig. 1. Since 99% of the diseases have a similarity below 3.5, we set 3.5 as the threshold so as to retain only the small number of diseases most associated with AD.
In the end, 34 diseases remain. Table 1 lists their names and their similarity to AD.
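The summing and thresholding described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the disease names and score values are hypothetical, and treating the 3.5 threshold as inclusive is an assumption.

```python
def combined_similarity(scores):
    """Sum the five per-method similarity values of one disease."""
    if len(scores) != 5:
        raise ValueError("expected one score per similarity method (5 in total)")
    return sum(scores)


def select_similar(all_scores, threshold=3.5):
    """Keep diseases whose summed similarity to AD reaches the threshold.

    Whether the cut-off is inclusive is an assumption made for this sketch.
    """
    totals = {d: combined_similarity(s) for d, s in all_scores.items()}
    return {d: t for d, t in totals.items() if t >= threshold}


# Hypothetical scores, one value per similarity method.
example = {
    "disease_a": [2.0, 0.5, 0.5, 0.5, 0.5],      # sums to 4.0, kept
    "disease_b": [0.5, 0.25, 0.25, 0.25, 0.25],  # sums to 1.5, dropped
}
selected = select_similar(example)
```

Here `selected` keeps only `disease_a`, mirroring how the 3524 diseases are reduced to the few most similar to AD.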

Extracting features
First, the names of the 34 diseases were obtained from their DO IDs. Then, we retrieved the proteins related to these 34 diseases from UniProt. To ensure the accuracy of the results, only human, reviewed proteins were selected.
We excluded two diseases: DOID: 936 'brain disease' and DOID: 14332 'postencephalitic Parkinson disease'. Brain disease is related to more than 2000 proteins; it is a large group of diseases that includes AD. Postencephalitic Parkinson disease has no related information in UniProt, so we removed it from the data as well. Therefore, 32 diseases are left, and we obtained the proteins related to these 32 diseases from UniProt. Figure 2 shows the number of proteins for each disease. AD is related to 299 proteins. In total, the 33 diseases (the 32 similar diseases plus AD) are related to 2827 proteins. Some of the 2827 proteins are duplicated, which indicates that similar diseases share similar proteins. After removing the redundant proteins, 1608 distinct proteins are left; that is, 43.1% of the proteins are redundant. Therefore, there are likely to be proteins that are not yet known to be related to AD but are known to act in diseases similar to AD.
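The redundancy figure follows directly from the two counts, 2827 disease-protein entries and 1608 distinct proteins:

```latex
\[
\frac{2827 - 1608}{2827} = \frac{1219}{2827} \approx 0.431
\]
```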
As mentioned above, proteins are the features used to map to similarity. The feature dimension is therefore 1608, and each disease corresponds to a $1608 \times 1$ feature vector.
Each protein has a weight that represents its relationship with AD. These weights are updated iteratively so that the features map to the similarities, which yields each protein's relationship with AD.
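One way to picture the feature vectors is as binary indicators over the pooled, de-duplicated protein list. The sketch below assumes a 0/1 encoding (1 if the protein is annotated to the disease); the protein identifiers and disease names are hypothetical.

```python
def build_feature_vectors(disease_proteins):
    """Return the sorted distinct proteins and one 0/1 vector per disease."""
    proteins = sorted(set().union(*disease_proteins.values()))
    index = {p: i for i, p in enumerate(proteins)}
    vectors = {}
    for disease, related in disease_proteins.items():
        vec = [0] * len(proteins)
        for p in related:
            vec[index[p]] = 1  # this protein is annotated to this disease
        vectors[disease] = vec
    return proteins, vectors


# Hypothetical annotations: duplicates across diseases collapse into one feature.
annotations = {
    "AD": {"P1", "P2"},
    "disease_a": {"P2", "P3"},
    "disease_b": {"P3", "P4"},
}
proteins, vectors = build_feature_vectors(annotations)
```

With the real data the protein list would have 1608 entries, so every disease vector would have length 1608.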

Map features to similarity by logistic regression
Firstly, we normalized all diseases' similarity. All similarities are transformed into a number between 0 and 1.
The similarity between AD and itself should be the maximum for every method. For Resnik's method the maximum is 4, and for each of the other four methods it is 1. Therefore, the maximum summed similarity is 8. All other diseases' similarities can then be normalized by eq. (1).
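One normalization consistent with a maximum of 8 is division by that maximum. This form is an assumption made for illustration and may differ from eq. (1); the symbols $s_d$ and $s_{\max}$ are introduced here only for this sketch.

```latex
\[
\tilde{s}_d = \frac{s_d}{s_{\max}}, \qquad s_{\max} = 4 + 4 \times 1 = 8
\]
```

Under this assumption the AD-to-AD similarity maps to 1, and the screening threshold of 3.5 maps to $3.5 / 8 = 0.4375$.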
The 32 diseases are the 32 samples, and the 1608 proteins are the 1608 feature dimensions. This is a typical high-dimension, small-sample problem, which LR alone can hardly solve. Therefore, we borrowed the idea of Random Forest (RF). Each time, 40 features (proteins) are randomly selected to build a model, and they are put back after the model is built. We selected 40 features because $\sqrt{1608} \approx 40$, which is the typical way to choose the number of features in RF. We repeat this 400 times so that each protein is selected about 10 times on average (400 × 40 / 1608 ≈ 10).
Each time a model is built, GD is used to find the optimal result. Since GD tends to converge to a local optimum rather than the global optimum, we used the known AD-related proteins to set the initial value of the iteration. In this way, the initial value is expected to be close to the global optimal solution, so the solution can be reached with fewer iterations. Figure 3 shows the workflow of selecting features and building models.
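The repeated random selection can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function names and the fixed seed are assumptions, and features are drawn without replacement within one model but returned to the pool between models.

```python
import math
import random


def draw_feature_subsets(n_features=1608, n_models=400, seed=0):
    """Draw one random subset of about sqrt(n_features) indices per model."""
    k = round(math.sqrt(n_features))  # 40 for 1608 features
    rng = random.Random(seed)         # fixed seed only for reproducibility
    return [rng.sample(range(n_features), k) for _ in range(n_models)]


def mean_selections(subsets, n_features):
    """Average number of times a single feature is drawn across all models."""
    return sum(len(s) for s in subsets) / n_features


subsets = draw_feature_subsets()
average = mean_selections(subsets, 1608)  # 400 * 40 / 1608, roughly 9.95
```

Because the draws are random, individual proteins are selected more or less often than the average of about 10.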
The workflow of LR is shown in Table 2.
Through the above steps, we can build a logistic regression function: $h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}$. The input x is the 1608 proteins of each disease, and the output $h_\theta(x)$ is the similarity between that disease and AD.
The similarity between a disease and AD is not itself the result we want. However, if we can find a suitable weight for each protein such that the similarity between AD and itself is 1, the weights are reasonable, and we can obtain the AD-related proteins from them.
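The logistic function itself is simple to evaluate. The sketch below computes it for a weight vector and a feature vector; the function name `hypothesis` and the numerically stable branching are choices made here, not details taken from the paper.

```python
import math


def hypothesis(theta, x):
    """Logistic function 1 / (1 + exp(-theta . x)) for one feature vector."""
    z = sum(t * xi for t, xi in zip(theta, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow when z is a large negative number
    return e / (1.0 + e)


# With all weights at zero the predicted similarity is 0.5.
baseline = hypothesis([0.0, 0.0, 0.0], [1, 0, 1])
```

Larger positive weights on the proteins a disease carries push its predicted similarity towards 1.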

Find AD-related proteins by gradient descent
Therefore, Gradient Descent (GD) is introduced to solve the model obtained by LR.
GD is a kind of optimization method. The work flow of GD is shown in Table 3.
Through the above steps, the feature matrix of AD-related proteins is obtained. A 1 in the matrix indicates that the protein is related to AD. Figure 4 shows our workflow. First, the similarity between AD and other diseases is calculated, which gives the diseases similar to AD. The proteins related to these diseases are then obtained from UniProt. Finally, LR is used to build the models, and GD is used to obtain the optimal results.
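A compact sketch of fitting one such model by gradient descent is given below. Several details are assumptions made for illustration: the cross-entropy loss, the stopping rule (change in loss below epsilon), the step size, and the toy data. The initial weights stand in for prior knowledge of which proteins are already known to be AD-related.

```python
import math


def predict(theta, x):
    """Logistic prediction for one sample."""
    z = sum(t * xi for t, xi in zip(theta, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def loss(theta, X, y):
    """Cross-entropy averaged over the m samples (assumed loss form)."""
    tiny = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict(theta, xi), tiny), 1.0 - tiny)
        total -= yi * math.log(p) + (1.0 - yi) * math.log(1.0 - p)
    return total / len(X)


def gradient(theta, X, y):
    """Gradient of the averaged cross-entropy with respect to theta."""
    m = len(X)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = predict(theta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij / m
    return grad


def gradient_descent(X, y, theta0, k=0.5, epsilon=1e-6, max_iter=10000):
    """Move against the gradient until the loss changes by less than epsilon."""
    theta = list(theta0)
    previous = loss(theta, X, y)
    for _ in range(max_iter):
        step = gradient(theta, X, y)
        theta = [t - k * g for t, g in zip(theta, step)]
        current = loss(theta, X, y)
        if abs(previous - current) < epsilon:
            break
        previous = current
    return theta


# Toy data: 3 diseases, 3 proteins, normalized similarities as targets.
X = [[1, 1, 0], [0, 1, 1], [1, 0, 0]]
y = [0.9, 0.4, 0.7]
theta0 = [1.0, 0.0, 0.0]  # hypothetical start: only protein 0 is known to be AD-related
theta = gradient_descent(X, y, theta0)
```

In the full procedure this fit would be repeated for each of the 400 random 40-protein subsets.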

A. Data process B. Result
Table 2 Work flow of LR
Step 2. Construct the loss function, where y is the true similarity and m is the number of samples.

Table 3 Work flow of GD
Step 1. Find the descent direction.
Step 2. Move x: $x = x - k\nabla$, where k is the descent rate.
Step 3. Repeat step 2 until the stopping condition is satisfied, where ε is any constant.
Since 400 models are built by LR, 400 sets of results are obtained. Each protein has about 10 chances to be selected as a feature, and each time the algorithm judges whether it is related to AD. Therefore, a protein can be judged AD-related at most 10 times and at least 0 times. Figure 5 shows how many times each protein is judged to be related to AD.
As shown in Fig. 5, more than 500 proteins are unrelated to AD: the algorithm never identifies them as AD-related. In contrast, about 50 proteins are identified as AD-related 10 times.
Fig. 5 Times that proteins are thought to be related to AD
Seven times is set as the threshold for selecting AD-related proteins. If the algorithm judges a protein to be related to AD more than 7 times, the protein is considered AD-related; otherwise, it is not. There are 376 such proteins.
Figure 6 shows the proportion of newly discovered proteins and known proteins.
As we can see, 18% of the 376 proteins are known AD-related proteins. Most of the proteins are associated with AD-like diseases but are not yet known to be associated with AD.
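The vote counting can be sketched as below. The data are hypothetical, and reading the threshold as strictly "more than 7 times" follows the wording of the text rather than any published code.

```python
from collections import Counter


def count_votes(model_results):
    """Count, per protein, how many models judged it to be AD-related."""
    votes = Counter()
    for related in model_results:
        votes.update(set(related))  # one vote per model at most
    return votes


def select_proteins(votes, threshold=7):
    """Keep proteins judged AD-related more than `threshold` times."""
    return sorted(p for p, n in votes.items() if n > threshold)


# Hypothetical outcomes of 17 models: P1 gets 10 votes, P2 gets 8, P3 gets 7.
model_results = [["P1", "P2"]] * 8 + [["P1"]] * 2 + [["P3"]] * 7
votes = count_votes(model_results)
selected = select_proteins(votes)
```

Here `P3` is dropped because 7 votes do not exceed the threshold, while `P1` and `P2` are kept.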
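The proportions follow from the counts reported in this paper, 376 identified proteins of which 308 are novel:

```latex
\[
376 - 308 = 68, \qquad \frac{68}{376} \approx 0.181, \qquad \frac{308}{376} \approx 0.819
\]
```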

C. Case study
Three case studies are done to verify our method's effectiveness. We selected three novel proteins from 308 novel AD-related proteins.

SUMO-conjugating enzyme UBC9
In UniProt, there is no information about the relationship between this protein and AD. Our method identifies a strong correlation between UBC9 and AD (10 times). Several studies have found that UBC9 plays an important role in AD because its function is associated with the aggregation of beta-amyloid protein (Aβ). It can interact with target proteins and change their localization, activity, or stability. LE McMillan et al. [31] demonstrated this in 2011.

Kinesin light chain 2 (KLC2)
APP is a gene known to be important in AD. KLC2 can interact with APP and is therefore considered to be related to AD. Kamal et al. [32] reported that KLC2 can affect the transport of APP into axons. The study by S Matsuda et al. [33] also demonstrated that KLC2 causes AD by affecting APP.

Kinesin heavy chain isoform 5C (KIF5C)
KIF5A shows a pan-neuronal distribution in the nervous system. KIF5B plays an important role in the maintenance of motor neurons rather than in their formation. D Sepulveda-Falla et al. [34] found that KIF5C is highly related to familial AD and neurodegeneration.

Discussion
Identifying AD-related proteins can help us treat and diagnose AD better, and it saves researchers considerable time and money. Carrying out biological experiments in order of priority is an efficient way to understand the mechanism of AD.
Here we proposed a method to identify AD-related proteins based on the hypothesis that similar diseases share similar proteins. There is little doubt that proteins contribute to the similarity of symptoms between diseases.
Therefore, the first step is to calculate the similarity between other diseases and AD. We used 5 methods in total to obtain the similarity. A threshold of 3.5 was set to screen for the diseases most related to AD, which left 34 diseases. Then, we downloaded the proteins related to these diseases from UniProt. For the reason mentioned in method section C, 2 diseases were excluded.
Fig. 6 The proportion of known AD-related proteins to novel AD-related proteins
Then we aggregated the proteins that correspond to these diseases. Each protein is a one-dimensional feature, and we try to map these features to similarity. Because this is a small-sample, high-dimensional problem, LR alone is not enough to solve it. Here, we borrowed the idea of RF: 40 features are randomly selected each time to build a model by LR. Then, GD is used to find the optimal result. After the 400 models were built, we summarized all the results and set 7 as the threshold to screen for AD-related proteins.
Finally, we obtained 376 proteins that are related to AD, of which 308 are novel. We selected three of them for case studies to support the effectiveness of our method.

Conclusions
Identification of disease-related proteins is essential for developing new drugs and understanding pathogenesis. In view of the shortcomings of current machine learning methods and protein interaction networks, we propose a regression method that avoids both the need to obtain negative samples and the inability of static networks to change dynamically. It provides a new way to identify disease-related proteins, namely, transforming a classification or clustering problem into a regression problem.
This paper proposes the hypothesis that similar diseases share similar proteins. A total of 2827 proteins were obtained by searching UniProt for the proteins related to the 32 diseases, but only 1608 of them are distinct, which supports this hypothesis: similar diseases share many duplicated proteins.
In terms of algorithmic innovation, we combine LR with the idea of RF to address the small-sample, high-dimension problem. To overcome the tendency of GD to fall into a local optimum, we choose a reasonable initial iteration value based on known AD-related proteins.
The results show that this method has practical value and is helpful for further research. With our method, more disease-related proteins can be found.