TLGP: a flexible transfer learning algorithm for gene prioritization based on heterogeneous source domain

Background Gene prioritization (gene ranking) aims to obtain the centrality of genes, which is critical for cancer diagnosis and therapy since key genes correspond to biomarkers or drug targets. Great efforts have been devoted to the gene ranking problem by exploring the similarity between candidate and known disease-causing genes. However, when the number of disease-causing genes is limited, these methods are not applicable, largely because of their low accuracy. In practice, the number of known disease-causing genes for cancers, particularly for rare cancers, is very limited. Therefore, there is a critical need to design effective and efficient algorithms for gene ranking with limited prior disease-causing genes. Results In this study, we propose a transfer learning based algorithm for gene prioritization (called TLGP) in a cancer (target domain) without disease-causing genes by transferring knowledge from other cancers (source domain). The underlying assumption is that knowledge shared by similar cancers improves the accuracy of gene prioritization. Specifically, TLGP first quantifies the similarity between the target and source domain by calculating the affinity matrix for genes. Then, TLGP automatically learns a fusion network for the target cancer by fusing the affinity matrix, pathogenic genes and genomic data of source cancers. Finally, genes in the target cancer are prioritized. The experimental results indicate that the learnt fusion network is more reliable than the gene co-expression network, implying that transferring knowledge from other cancers improves the accuracy of network construction. Moreover, TLGP outperforms state-of-the-art approaches in terms of accuracy, improving it by at least 5%. Conclusion The proposed model and method provide an effective and efficient strategy for gene ranking by integrating genomic data from various cancers.


Background
Genes are basic units of organisms that execute critical biological processes to sustain life. DNA mutations change the sequences of genes, resulting in variations of gene structure and function that give rise to cancers [1]. Therefore, genes serve as biomarkers for cancer diagnosis and as drug targets, which are the foundation of cancer therapy [2,3]. Identifying pathogenic genes is of great significance for revealing the underlying mechanisms of cancers because it helps biological researchers handle mountains of public and private omics data and maximize the yield of downstream biological validation.
Pathogenic gene detection corresponds to the gene prioritization problem, which aims to rank genes according to their importance, where important genes are more likely to be pathogenic. Great efforts have been devoted to gene ranking, and the resulting methods can be categorized into two groups, i.e., biological experiment-based and computation-based approaches. Methods of the first category validate the functions and structure of genes through biological experiments to select pathogenic genes. The advantage of experiment-based methods is their accuracy, whereas their drawback is that they are time-consuming and expensive. To overcome these issues, computation-based methods provide an alternative to experiment-based methods by using machine learning techniques to predict possible pathogenic genes from the genomic data of cancers. The underlying assumption of computation-based algorithms is that genes with similar structure have similar biological functions and patterns [4][5][6].
Many algorithms have been developed for gene ranking [7][8][9][10][11][12][13][14][15][16], and they differ in how they define and measure the similarity between pathogenic and non-pathogenic genes. The most intuitive and straightforward strategy is to calculate the distance between pathogenic and non-pathogenic genes in terms of features [8]. If a candidate gene is very close to pathogenic genes, it is reasonable to consider the candidate gene pathogenic. The key factor behind the similarity strategy is how to construct the features for genes, and algorithms employ various types of features; for example, PROSPECTR [17] explores sequence-based features. However, feature similarity approaches are criticized for their low accuracy because they only explore the relation between a pair of genes. To solve this problem, many classification algorithms have been adopted to predict pathogenic genes, including rule-based decision trees [18] and support vector machines (SVM) [19]. These algorithms significantly outperform the feature similarity strategy since they make use of the features of all genes. To further improve the performance of algorithms, Moreau et al. [20] suggest that it is promising to integrate complex and heterogeneous data to identify the most interesting genes for biological validation from candidates.
Even though the classification-based methods achieve excellent performance on gene prioritization, they require a large number of positive and negative samples to ensure the reliability of classifiers. When the training set is insufficient, these algorithms are criticized for their low accuracy. Furthermore, they cannot explore the indirect relations among genes. Networks are a powerful tool for characterizing and describing complex systems, and have been successfully applied to social analysis [21][22][23][24] and biology [25][26][27][28][29][30][31][32]. Therefore, great efforts, such as CIPHER [4], MDGC [7], PageRank [9], DNRC [12], ToppGene [13], RWRH [14], MRF [15], and IBNPKATZ [16], have been devoted to gene prioritization with the immediate purpose of improving the accuracy of prediction by exploring the topological structure of cancer networks. Compared with the classification-based methods, network-based methods have two advantages. First, network-based algorithms do not require a large training set to rank genes. Second, these algorithms can explore the indirect relations among genes by exploiting the topological structure of networks, such as short paths and percolation. The network-based methods differ in how they make use of the topological structure of networks. For example, IBNPKATZ [16] prioritizes genes by combining the Katz index and network projection. RWRH [14] relies on the heterogeneous network structure and adopts a random walk to exploit gene-phenotype relationships. MRF [15] employs genes and subnetworks to explore gene-disease relations. PRINCE [32] adopts information propagation on networks to rank genes, which precisely predicts disease-causing genes.
Even though network-based and similarity-based approaches have been successfully applied to gene prioritization, their performance is not desirable when the number of pathogenic genes is limited. Even worse, these algorithms are not applicable when the number of pathogenic genes falls below a threshold. However, the number of known pathogenic genes for many complex diseases, particularly rare diseases, is small because current knowledge of them is limited. Recently, transfer learning [33][34][35][36] has been used to overcome this problem by transferring knowledge from source domains to a target domain with limited labelled objects, which significantly improves the performance of algorithms. More specifically, different from traditional machine learning techniques, transfer learning aims to transfer knowledge from previous tasks to a target task when the latter has little high-quality training data. This is also one of the major motivations of this study.
To improve the accuracy of gene ranking, we propose a novel transfer learning algorithm (called TLGP) for gene prioritization with few or even no pathogenic genes in the target cancer, which transfers knowledge from cancers in the source domains. The target cancer only comprises the gene expression profile, whereas both the gene expression profiles and the pathogenic genes of cancers exist in the source domain. As shown in Fig. 1, TLGP consists of four components: affinity matrix construction, dimension reduction in the source domain, fusion network construction, and gene prioritization on the fusion network. Specifically, TLGP constructs the affinity matrix to quantify the similarity of genes among various cancers. To obtain knowledge from the source cancers, we employ dimension reduction to learn a low-dimensional representation of genes in the source cancers, in which pathogenic and non-pathogenic genes are well separated. Then, TLGP automatically transfers knowledge from the source domain into the target cancer and learns the gene similarity network for the target cancer, which is more reliable than the network based only on the gene expression profile of the target cancer. Finally, we prioritize genes in the target cancer using a typical gene ranking algorithm.
In summary, the contributions of this study are as follows.
• A novel transfer learning algorithm for gene ranking is proposed, where the knowledge from other cancers can be transferred to the target cancer to improve the accuracy of algorithms. The TLGP algorithm also offers an alternative for integrative analysis of heterogeneous genomic data.
• The proposed algorithm extends the application of algorithms for gene prioritization because it works well on cancers with no or limited pathogenic genes. It also serves as a flexible framework for gene prioritization.
• The experimental results demonstrate that the proposed algorithm significantly improves the accuracy of gene prioritization.

Results and discussion
A comparative study is performed to validate the performance of the proposed algorithm.

Data and setting
We select breast and lung cancers as target and source domains, respectively. The pathogenic and non-pathogenic genes for breast and lung cancer are derived from COSMIC. The RNA-seq expression profiles of breast and lung cancer are downloaded from TCGA, where FPKM (Fragments Per Kilobase of transcript per Million fragments mapped) is used. The protein interaction network is downloaded from BioGRID. The pathogenic gene list for breast cancer is used as the benchmark to test the accuracy of algorithms.
To validate the performance of the proposed algorithm on gene prioritization, six state-of-the-art approaches, including SSC [30], CIPHER [4], PRINCE [32], MDGC [7] and PageRank [9], are selected for comparison. These algorithms are selected because they achieve excellent performance on gene prioritization by using various strategies to exploit the topological structure of networks. For example, SSC [30] defines the similarity on the protein interaction network and uses a random walk on the global network to detect disease-related genes, while CIPHER [4] constructs a regression model under the assumption that two closer genes in the molecular interaction network tend to cause similar phenotypes. SSC and CIPHER only explore the local information of networks to prioritize genes, while PRINCE [32] and PageRank [9] rank genes by using a random walk to explore the global information of networks, with the underlying assumption that genes that cause similar diseases tend to be close in the protein interaction network. MDGC [7] is a multi-view clustering method that generalizes the single-view discriminative K-means, and then prioritizes genes by making use of the degree of known disease genes and statistical methods. All these algorithms run on the protein interaction networks to rank genes with the default values of parameters.
To measure the accuracy of algorithms, we check the number of pathogenic genes among the top k genes.
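The top-k measure can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the study: the function name `top_k_accuracy`, the gene identifiers, and the toy scores are all hypothetical, and the result is reported as a fraction (the form plotted as a percentage in Fig. 3).

```python
def top_k_accuracy(scores, pathogenic, k):
    """Fraction of the k highest-scoring genes that are known pathogenic genes.

    scores: dict mapping gene name -> prioritization score (higher = more important).
    pathogenic: set of known pathogenic gene names (the benchmark list).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Sort genes by descending score; ties are broken by gene name for determinism.
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    top = ranked[:k]
    hits = sum(1 for g in top if g in pathogenic)
    return hits / len(top)


# Toy example with made-up scores (hypothetical, not data from the study).
toy_scores = {"G1": 0.9, "G2": 0.8, "G3": 0.4, "G4": 0.1}
toy_pathogenic = {"G1", "G3"}
print(top_k_accuracy(toy_scores, toy_pathogenic, k=2))
```

In this toy case the two top-ranked genes are G1 and G2, of which only G1 is in the benchmark list, so the measure is one half.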

Fusion network is more enriched by protein interactions
TLGP extracts knowledge from lung cancer and transfers it to breast cancer to construct the gene fusion network. Thus, it is natural to ask how the learned fusion network differs from the gene co-expression network based on the gene expression profiles, i.e., which one is better.
To address this issue, experimentally validated protein interactions are selected as the gold standard to measure the quality of the fusion network. We check the percentage of edges in the fusion and gene co-expression networks that overlap with the protein interactions. Since both the fusion and co-expression networks are weighted, we select the edges in each network whose weights are greater than a predefined threshold. The percentages of edges overlapping with the protein interactions for the fusion and co-expression networks at various thresholds are shown in Fig. 2. The threshold is defined as α × the mean of the edge weights in the network; the red bars denote the percentage for the fusion network constructed by TLGP and the blue bars that for the gene co-expression network. Fig. 2 shows that the edges in the fusion network are more enriched in protein interactions than those in the gene co-expression network at all thresholds. Specifically, when α=1.2, 2.8% of the edges in the fusion network overlap with protein interactions, compared with only 1.9% for the gene co-expression network. These results indicate that the fusion network is more reliable than the gene co-expression network, implying that transferring knowledge from other cancers improves the accuracy of network construction. There are two possible reasons why the fusion network constructed by TLGP is more reliable than the gene co-expression network. First, the integrative analysis of the gene expression and the pathogenic gene list removes noise in the source cancer. Second, the knowledge in the source cancer is transferred to the fusion network, thereby improving its quality. Figure 2 demonstrates that the proposed algorithm can remove noise in genomic data and construct a reliable fusion network. We then ask whether the constructed fusion network can improve the accuracy of gene prioritization.
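The overlap check described above can be sketched as follows. This is an illustrative reading of the procedure under stated assumptions: edges are treated as undirected, the threshold is α times the mean weight over the edges supplied, and the function name `overlap_fraction` and the toy network are hypothetical rather than taken from the study.

```python
def overlap_fraction(weights, ppi_edges, alpha):
    """Fraction of retained edges that are also protein interactions.

    weights: dict mapping an edge (u, v) -> weight in the weighted network.
    ppi_edges: iterable of (u, v) pairs taken as the gold standard.
    alpha: multiplier; an edge is retained if its weight > alpha * mean weight.
    """
    if not weights:
        return 0.0
    threshold = alpha * sum(weights.values()) / len(weights)
    # Edges are undirected, so (u, v) and (v, u) are compared as the same edge.
    ppi = {frozenset(e) for e in ppi_edges}
    kept = [e for e, w in weights.items() if w > threshold]
    if not kept:
        return 0.0
    hits = sum(1 for e in kept if frozenset(e) in ppi)
    return hits / len(kept)


# Toy weighted network and toy gold standard (hypothetical values).
toy_weights = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.7, ("C", "D"): 0.6}
toy_ppi = [("B", "A"), ("C", "D")]
for a in (0.5, 1.0):
    print(a, overlap_fraction(toy_weights, toy_ppi, a))
```

Running the same function on the fusion network and on the co-expression network for a range of α values would give the two series of bars compared in Fig. 2.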
To comprehensively test the performance of the proposed algorithm, we use two types of gene lists, namely pathogenic and cancer causal genes, to evaluate the algorithms. The percentage of top k genes that overlap with the known pathogenic genes is shown in Fig. 3, where panel a shows the accuracy of the algorithms with k=100 and panel b with k=200. Fig. 3a shows that the accuracy of TLGP is significantly higher than that of the others. CIPHER is inferior to TLGP, but it is much more precise than SSC, MDGC, and PRINCE. The SSC algorithm performs the worst because it only exploits the local topology of networks, which fails to characterize the centrality of genes in the networks. Specifically, the accuracy of TLGP is 38.0%, which is 7% higher than that of when the top 100 genes are selected. There are two reasons why TLGP significantly outperforms the others. First, TLGP integrates heterogeneous genomic data for gene prioritization, thereby providing a better strategy to characterize the centrality of cancer-related genes. Second, TLGP transfers knowledge from the source cancer to the target cancer, which improves the reliability and accuracy of the fusion network. The comparison between TLGP and PRINCE further demonstrates that the transfer learning strategy can significantly improve the accuracy of gene prioritization. Figure 3b shows the accuracy of the algorithms with k=200, where a similar tendency is observed.
The proposed algorithm adopts PRINCE for gene prioritization. We therefore ask whether the excellent performance of TLGP is attributable to the PRINCE algorithm [32] rather than to the fusion network. We apply two algorithms, PRINCE [32] and PageRank [9], to the fusion and gene co-expression networks. The results are presented in Fig. 4, where panels a1 and a2 contain the accuracy of PRINCE on these two types of networks, and panels b1 and b2 those of PageRank.
Both algorithms achieve much better performance on the fusion network than on the gene co-expression network. These results imply the superiority of the proposed algorithm for gene prioritization.
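To make the network-ranking step concrete, the sketch below runs a generic PageRank power iteration on a weighted adjacency matrix, which is how a ranker can be applied unchanged to either the fusion or the co-expression network. It is an assumption-laden illustration: the damping factor 0.85, the uniform teleportation, and the function name `pagerank` are choices made here, and the variant used in [9] or the propagation scheme of PRINCE [32] may differ.

```python
def pagerank(adj, damping=0.85, tol=1e-10, max_iter=1000):
    """Generic PageRank by power iteration (illustrative, not the method of [9]).

    adj: square list of lists of nonnegative edge weights; adj[i][j] is the
         weight of the edge from node i to node j (symmetric if undirected).
    Returns a list of scores that sums to 1.
    """
    n = len(adj)
    out = [sum(row) for row in adj]  # total outgoing weight of each node
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Rank held by nodes without edges is redistributed uniformly.
        dangling = sum(rank[i] for i in range(n) if out[i] == 0)
        new = [(1.0 - damping) / n + damping * dangling / n] * n
        for i in range(n):
            if out[i] == 0:
                continue
            for j in range(n):
                if adj[i][j]:
                    new[j] += damping * rank[i] * adj[i][j] / out[i]
        # Stop when the total change between iterations is negligible.
        if sum(abs(a - b) for a, b in zip(new, rank)) < tol:
            return new
        rank = new
    return rank


# Toy star network: node 0 is connected to nodes 1, 2 and 3 (hypothetical).
star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
scores = pagerank(star)
print(sorted(range(len(scores)), key=lambda i: -scores[i]))
```

On the toy star network the hub receives the highest score and the three leaves tie, which is the kind of centrality ordering the top-k evaluation then consumes.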

Performance on ranking pathogenic genes
The above experiments evaluate the percentage of top k genes that overlap with the pathogenic genes, which is insufficient to fully validate the performance of algorithms for gene prioritization. Here, we investigate the uniquely identified pathogenic genes, i.e., the pathogenic genes among the top k genes that are discovered by only one specific algorithm. To make a comprehensive comparison, we compare TLGP with the others to investigate whether the proposed method ranks the pathogenic genes effectively. These results further demonstrate that the proposed algorithm can identify pathogenic genes of breast cancer that cannot be discovered by the other algorithms, indicating the superiority of TLGP for gene prioritization. A possible reason is that the functions of some pathogenic genes are so complex that they cannot be fully characterized by one type of genomic data. TLGP integrates heterogeneous genomic data, improving the accuracy of prediction.
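The notion of uniquely identified pathogenic genes can be sketched with plain set operations. The function name `uniquely_identified`, the method labels, and the gene identifiers below are hypothetical; the sketch only assumes that each method contributes its list of top k genes.

```python
def uniquely_identified(top_lists, pathogenic, method):
    """Pathogenic genes in `method`'s top-k list that appear in no other list.

    top_lists: dict mapping method name -> list of that method's top-k genes.
    pathogenic: set of known pathogenic genes.
    """
    own = set(top_lists[method]) & set(pathogenic)
    others = set()
    for name, genes in top_lists.items():
        if name != method:
            others |= set(genes)
    return own - others


# Toy top-3 lists for three hypothetical methods.
toy_top = {
    "method_a": ["G1", "G2", "G5"],
    "method_b": ["G2", "G3", "G6"],
    "method_c": ["G2", "G3", "G4"],
}
toy_known = {"G1", "G2", "G3", "G4"}
print(sorted(uniquely_identified(toy_top, toy_known, "method_a")))
```

Here G1 is the only benchmark gene found exclusively by the first method, while G5 is excluded because it is not in the benchmark list.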

Parameter sensitivity
Finally, we investigate how the parameters affect the performance of the proposed algorithm. Two parameters are involved: the number of features for dimension reduction, and the parameter λ that determines the importance of the penalty. TLGP empirically selects the best values for the parameters. We then investigate how the penalty parameter λ affects the performance of TLGP. Fig. 6b shows how the accuracy of the proposed algorithm changes as λ increases from 0.01 to 15. TLGP achieves its best performance when λ ∈ [0.01, 5], and its accuracy decreases as λ increases from 5 to 15. The reason is that when the value of λ is large, the penalty dominates the objective, resulting in low accuracy. In this study, we set λ = 1.

Conclusions
Gene ranking is one of the fundamental problems in bioinformatics and is critical for cancer diagnosis and therapy. The existing algorithms make use of networks and cancer-causing genes to predict the centrality of genes. However, these algorithms are criticized for their low accuracy when the number of cancer-causing genes is limited. Furthermore, these algorithms cannot be applied to gene prioritization when no known cancer-causing gene is available. In practice, the number of known cancer-causing genes for many cancers is limited, particularly for rare diseases. To solve this problem, we propose a transfer learning based algorithm for gene prioritization with no pathogenic genes in the target cancer, where knowledge in the source cancers is incorporated into the target cancer to improve the performance of algorithms. The experimental results demonstrate that the proposed algorithm significantly outperforms the current algorithms on gene ranking. The proposed algorithm also has some limitations, which will be addressed by further research:
• The gene expression profiles in the source and target cancers have the same distributions because they are generated by using the same platform. How to transfer knowledge for heterogeneous genomic data from the source domain to the target domain, such as gene expression in the source domain and methylation data in the target domain, is also promising for further improving the performance of gene ranking.
• In this study, only one source cancer is adopted for transfer learning. How to transfer knowledge from multiple source domains is also a critical problem for gene prioritization.
Designing effective and efficient algorithms to address the above two issues would be promising for gene prioritization.

Methods
In this section, the objective function, optimization, and analysis of the algorithm are addressed in turn.

Preliminaries
Before describing the details of TLGP, let us introduce some notations that are widely used in the following subsections. In this study, matrices are denoted by capital letters, and vectors by bold lowercase letters. The gene expression profiles are given as a matrix $X$ whose element in the $i$th row and $j$th column is $x_{ij}$, where each row denotes a gene and each column corresponds to a patient. The $i$th row ($j$th column) of $X$ is denoted by $x_{i.}$ ($x_{.j}$). Let $X^{[s]}$ and $X^{[t]}$ be the gene expression profiles of the source and target cancer, respectively. Let the binary vector $\mathbf{y} = \{y_1, \ldots, y_n\}$ be an indicator of the pathogenic genes in the source cancer, where $y_i = 1$ if the $i$th gene is pathogenic, and 0 otherwise. Given an undirected and weighted network $G = (V, E)$ with vertex set $V = \{v_1, \ldots, v_n\}$ ($n$ is the number of nodes) and edge set $E = \{(v_i, v_j)\}$, the weighted adjacency matrix $W = (w_{ij})_{n \times n}$ is constructed, where element $w_{ij}$ denotes the weight on edge $(v_i, v_j)$. If $G$ is an unweighted network, $w_{ij}$ is 1 if $v_i$ and $v_j$ are connected, and 0 otherwise. Let $w_{i.}$ denote the $i$th row of $W$. Gene prioritization seeks a function
$$\psi : V \mapsto \mathbb{R}^{+}, \tag{1}$$
that assigns a score to each gene.

Objective function
The overview of the proposed algorithm is shown in Fig. 1, which consists of the affinity matrix construction, dimension reduction in source domain, fusion network construction and gene ranking. The ultimate goal of TLGP is to learn a reliable and fused network for genes, where the heterogeneous genomic data from the source and target domains are integrated by using transfer learning. In transfer learning, two critical techniques are involved, i.e., how to extract knowledge from source domain and how to transfer knowledge to target domain, which are also two factors for the objective function of the proposed algorithm.
To transfer knowledge from the source cancer, we need to quantify the similarity between the source and target cancer because it decides where the knowledge can be extracted. The purpose of domain adaptation is to use labeled data in the source domain to improve the performance of the target task when the target domain is similar to the source domain. However, when the distributions of the source and target domain differ greatly, the performance of transfer learning is undesirable. To solve this problem, many methods [37][38][39][40] explore how to narrow the difference in the distribution of features between the two domains through some transformations. For example, TCA [37] assumes that the marginal distributions of the source and target domains are different but that there exists a mapping function $\phi(\cdot)$ that projects the two domains into a common space in which the discrepancy is minimized. JDA [38] considers that both the marginal and the conditional distributions of the source and target domains are different and proposes to iteratively use pseudo labels to approximate the true labels.
In this study, the distributions of the source and target cancer differ greatly because both the gene expression profile and the pathogenic genes are involved in the source cancer, whereas the target cancer only has the expression data. Therefore, we need to integrate the gene expression and the pathogenic gene list. However, it is difficult to integrate genomic data, particularly heterogeneous data [41]. To solve this problem, we use the pathogenic gene list to adjust the gene expression profiles under the assumption that pathogenic and non-pathogenic genes have different expression patterns. Thus, we expect to learn a representation of $X^{[s]}$, denoted by $A$, such that the expression profiles of pathogenic and non-pathogenic genes are well separated, which can improve the accuracy of algorithms. LMNN [42] is adopted for this purpose; it obtains a new representation of the gene expression profiles of the source cancer using a projection matrix $H^{[s]} \in \mathbb{R}^{k \times r}$ by minimizing the approximation error between the expression data and the representation (Eq. (2)), where $A \in \mathbb{R}^{n \times r}$ is the new representation of $X^{[s]}$.
Then, we consider how to transfer knowledge between the source and target cancer based on the gene expression profiles by constructing the affinity matrix $S \in \mathbb{R}^{n \times n}$, where element $s_{ij}$ denotes the absolute value of the Pearson coefficient between $x^{[s]}_{i.}$ and $x^{[t]}_{j.}$. The underlying assumption is that genes with the same or similar functions have the same or similar expression patterns. Thus, if a pair of genes have similar expression patterns in the source and target cancer, we have enough reason to believe that they share knowledge. If the $i$th gene in the target cancer is similar to the $j$th gene in the source cancer in terms of gene expression, we can transfer knowledge between them. One issue that must be solved before transferring knowledge is to quantify how similar they are, because this determines how much information can be transferred. The expression profile of the $i$th gene must be consistent with the representation in Eq. (2). We learn a projection matrix $U$ to measure the distance between them, i.e.,
$$\| x^{[t]}_{i.} U - a_{j.} \|^{2}, \tag{3}$$
where $a_{j.}$ is the $j$th row of $A$, and $\|A\|$ is the Frobenius norm of $A$. However, Eq. (3) quantifies the similarity only in terms of gene expression profiles, ignoring the gene similarity $S$. Actually, the shared knowledge for transferring is also determined by the similarity of the gene pair. Thus, we weight the distance in Eq. (3) by the similarity matrix $S$, which is rewritten as
$$s_{ij} \| x^{[t]}_{i.} U - a_{j.} \|^{2}. \tag{4}$$
Analogously, we expect that $w_{ij}$ in the fused network receives a heavy weight if the corresponding gene pair has similar expression profiles in the target domain (Eq. (5)). By combining Eqs. (4) and (5), we obtain the objective function in Eq. (6), where $\Phi(w_{i,j})$ is a penalty item and the parameter $\lambda$ controls the importance of the penalty item (how $\lambda$ affects the performance is investigated in the experiments). The criterion for $\Phi(w_{i,j})$ is that it is close to 0 when there exists a strong connection between the $i$th and $j$th genes, and close to 1 otherwise. Here, we set it as $(\sqrt{w_{i,j}} - 1)^{2}$.
In the next subsection, we derive the optimization rules for the minimization problem in Eq. (6).

Optimization
Equation (6) involves two variables, $U$ and $W$, because the matrix $A$ is learned by using LMNN [42]. However, it is difficult to optimize Eq. (6) directly because of its nonconvexity. An iterative strategy is employed to optimize Eq. (6), where one variable is updated while the other is fixed. The iteration continues until the algorithm converges.
Fixing $U$, we obtain the update rule for $w_{i,j}$. When $W$ is fixed, the second item of the objective function can be formulated as a matrix trace in Eq. (8), where $L_W$ is the Laplacian matrix of $W$. Furthermore, the first item of Eq. (6) can also be transformed into a matrix trace in Eq. (9), where $D$ refers to the degree matrix of $S$. Substituting Eqs. (8) and (9), the objective function is rewritten in trace form, and its partial derivative with respect to $U$ is deduced. According to the KKT condition, by setting this partial derivative to 0, we obtain the update rule for $U$. After obtaining the fused network $W$, typical algorithms for gene prioritization, such as PRINCE [32], are applied to rank genes in the target cancer. The procedure of TLGP is presented in Algorithm 1.
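The building blocks of the objective that are stated explicitly in the text, namely the affinity matrix, the weighted distance of Eq. (4), and the penalty item, can be sketched as follows. This is not the TLGP implementation: the function names are hypothetical, the update rules for $U$ and $W$ are not reproduced, and the sketch assumes that source and target profiles have the same number of samples so that the Pearson coefficient between a source row and a target row is defined.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant profile: correlation undefined, treated as 0 here
    return cov / math.sqrt(vx * vy)


def affinity_matrix(X_s, X_t):
    """s_ij = |Pearson(i-th source gene profile, j-th target gene profile)|."""
    return [[abs(pearson(xs, xt)) for xt in X_t] for xs in X_s]


def weighted_distance(x_t_i, U, a_j, s_ij):
    """Eq. (4): s_ij * ||x_t_i U - a_j||^2, with U given as a list of rows."""
    r = len(a_j)
    proj = [sum(x_t_i[p] * U[p][q] for p in range(len(x_t_i))) for q in range(r)]
    return s_ij * sum((proj[q] - a_j[q]) ** 2 for q in range(r))


def penalty(w_ij):
    """Penalty item (sqrt(w_ij) - 1)^2: 0 for w_ij = 1, 1 for w_ij = 0."""
    return (math.sqrt(w_ij) - 1.0) ** 2


# Toy profiles: 2 genes x 3 samples in each domain (hypothetical values).
X_source = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
X_target = [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
S = affinity_matrix(X_source, X_target)
print(S)
print(weighted_distance([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], 0.5))
print(penalty(1.0), penalty(0.0))
```

Because the absolute value is taken, perfectly anti-correlated profiles receive the same affinity as perfectly correlated ones, which matches the definition of $s_{ij}$ given above.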

Algorithm analysis
On the space complexity, the expression profiles of the source and target domains require $O(nm)$ space, where $m$ is the maximum of the numbers of samples in the source and target cancer, i.e., $m = \max\{d^{[s]}, d^{[t]}\}$. The fusion matrix $W$ and the similarity matrix $S$ require $O(n^{2})$ space. Therefore, the overall space complexity is $O(n^{2} + nm) = O(n^{2})$ because $m \ll n$, demonstrating that the proposed method is efficient in terms of space complexity. On the time complexity, updating $W$ takes $O(n^{2})$ time, and updating $U$ takes $O(n^{2} m)$ time. Thus, the total running time is $O(l(n^{2} + n^{2} m)) = O(n^{2} l m)$, where $l$ is the number of iterations. It is the same as that of nonnegative matrix factorization [43].