A heterogeneous graph convolutional attention network method for classification of autism spectrum disorder

Background Autism spectrum disorder (ASD) is a serious developmental disorder of the brain. Recently, various deep learning methods based on functional magnetic resonance imaging (fMRI) data have been developed for the classification of ASD. Among them, graph neural networks, which generalize deep neural network models to graph structured data, have shown great advantages. However, in graph neural methods, because the graphs constructed are homogeneous, the phenotype information of the subjects cannot be fully utilized. This affects the improvement of the classification performance. Methods To fully utilize the phenotype information, this paper proposes a heterogeneous graph convolutional attention network (HCAN) model to classify ASD. By combining an attention mechanism and a heterogeneous graph convolutional network, important aggregated features can be extracted in the HCAN. The model consists of a multilayer HCAN feature extractor and a multilayer perceptron (MLP) classifier. First, a heterogeneous population graph was constructed based on the fMRI and phenotypic data. Then, a multilayer HCAN is used to mine graph-based features from the heterogeneous graph. Finally, the extracted features are fed into an MLP for the final classification. Results The proposed method is assessed on the autism brain imaging data exchange (ABIDE) repository. In total, 871 subjects in the ABIDE I dataset are used for the classification task. The best classification accuracy of 82.9% is achieved. Compared to the other methods using exactly the same subjects in the literature, the proposed method achieves superior performance to the best reported result. Conclusions The proposed method can effectively integrate heterogeneous graph convolutional networks with a semantic attention mechanism so that the phenotype features of the subjects can be fully utilized. Moreover, it shows great potential in the diagnosis of brain functional disorders with fMRI data.


Background
Autism spectrum disorder (ASD) is a developmental disability that can cause significant social, communication and behavioral challenges [1]. ASD has attracted great attention from neuroscientists and clinical scientists, who hope to clarify its pathogenic mechanism and find an effective treatment [2]. For children with ASD, early identification and intervention are important since they may mitigate disease severity and improve the patients' quality of life. However, owing to the complexity and heterogeneity of ASD, no effective biomarkers have been found so far. Diagnosis is mainly based on the interaction between individuals and clinicians [3,4], and many children do not receive a final diagnosis until they are much older.
In the past decade, functional magnetic resonance imaging (fMRI), a promising neuroimaging technique, has been widely used for studying interregional functional connectivity (FC) in the human brain. In fMRI studies, FC is defined as the temporal correlation of blood oxygen level dependent signals measured in different brain regions. It is used to identify potential neuroimaging biomarkers for the diagnosis of neurological diseases [5,6]. Abnormalities have been found in specific functional connections in the brains of subjects with ASD. For instance, Monk et al. [7] discovered that intrinsic connectivity within the default network is altered in ASD subjects, and that connectivity between these structures is related to specific ASD symptoms. Therefore, effective modelling of brain functional connectivity from fMRI data is conducive to the identification of biomarkers for ASD.
Based on fMRI data, many machine learning and deep learning methods have been proposed for ASD classification. Feng et al. [8] summarized the progress of ASD classification work with the Autism Brain Imaging Data Exchange (ABIDE) dataset in the last three years. Kong et al. [9] proposed an ASD-assisted diagnosis method based on a deep neural network (DNN). Mostafa et al. [10] proposed diagnosing ASD based on eigenvalues of brain networks and linear discriminant analysis (LDA). Ahmed et al. [11] designed a single volume image generator that converts individual fMRI images into a series of 2-dimensional images, and then used an improved convolutional neural network to classify the generated images. Guo et al. [12] proposed a sparse autoencoder based feature selection method and developed a DNN-based classification model for distinguishing ASD patients from typically developed controls. Heinsfeld et al. [13] extracted low-dimensional features from training samples with two stacked denoising autoencoders, and then used a multilayer perceptron (MLP) to classify ASD, achieving a classification accuracy of 70% on the ABIDE dataset. Eslami et al. [14] proposed a framework called ASD-DiagNet to classify ASD by using only fMRI data. Hu et al. [15] proposed an interpretable fully connected neural network (FCNN) to identify ASD participants from fMRI data and obtained an accuracy of 69.81%. Liu et al. [16] improved ASD classification using dynamic functional connectivity (DFC) and multitask feature selection. They used a multikernel support vector machine (SVM) learning method and achieved an accuracy of 76.8% on the ABIDE I dataset. Brahim and Farrugia [17] presented an approach based on the graph Fourier transform (GFT) and SVM for the analysis of resting-state fMRI. Yin et al. [18] employed an autoencoder (AE) to learn advanced features from fMRI data, and then trained a DNN with the learned features, achieving a classification accuracy of 76.2%. Haghighat et al. [19] proposed an age-dependent connectivity-based ASD computer aided diagnosis system using resting-state fMRI. Wang et al. [20] proposed a multisite clustering and nested feature extraction (MC-NFE) method for fMRI-based ASD detection. Experimental results on 609 subjects from the ABIDE database suggest that MC-NFE outperforms several state-of-the-art methods in ASD detection.
Recently, graph neural networks, which generalize deep neural network models to graph structured data, have shown great advantages in model training and classification tasks [21]. Researchers have tried to classify ASD data using graph models. In 2017, Parisot et al. [22] constructed a population graph using fMRI and phenotypic data, in which nodes and arc weights are associated with image-based feature vectors and phenotypic data, respectively. Then they applied a graph convolutional network (GCN) with the population graph as input to classify ASD. The results showed that integrating phenotypic data in classification tasks was beneficial. In 2018, Parisot et al. [23] further studied the impact of different feature selection strategies on the classification of ASD. They used a GCN in a semisupervised manner for node classification and achieved a classification accuracy of 70.4% on the ABIDE dataset. Rakhimberdina et al. [24] proposed a population graph-based multimodel ensemble to classify patients with ASD and healthy controls (HCs). Compared with using a single model, the proposed method obtained higher accuracy on the ABIDE dataset. Jiang et al. [25] proposed a hierarchical GCN framework to learn graph feature embeddings for ASD classification. In this framework, the network topology information and the associations between subjects are considered at the same time. Li et al. [26] proposed a graph neural network framework (BrainGNN) to analyse functional magnetic resonance images and discover neurological biomarkers for ASD. Wen et al. [27] presented a prior brain structure learning-guided multiview graph convolutional neural network to learn common features for ASD classification. In our previous work [28], a combination of deep feature selection and GCN was proposed to classify ASD. First, the deep feature selection method of [29] was used to select the functional connection features of fMRI data. Then, a GCN was used to classify 871 subjects in the ABIDE I dataset, and a classification accuracy of 79.5% was achieved, which is currently the highest.
As brain connectivity graphs are irregular graph structures, GCNs are well suited to handle such data. Thus, the classification performance of the above methods is significantly improved compared to traditional machine learning methods. However, in the above graph-based models for ASD classification, the constructed graphs are all homogeneous (i.e., only one type of node and one type of arc are constructed): the imaging features are mapped into node feature vectors, while the phenotype features are mapped into arc weights. Since arc weights are scalars, they cannot fully represent the phenotype features. Therefore, the performance still suffers from the limitation that every edge in the graph carries a single aggregated weight and the phenotypic data are not fully used. To solve this problem, this paper further investigates using graph neural networks to distinguish ASD patients from healthy controls. The goal of the present work is to fuse the fMRI and phenotype information of subjects into a graph neural network so that better classification performance and more accurate diagnosis can be achieved.
To make full use of the non-imaging phenotype information of the subjects, a heterogeneous population graph based on the fMRI and phenotypic data is constructed. At the same time, an attention mechanism is introduced so that different weights can be learned and important aggregated features can be extracted. Based on the heterogeneous graph, GCN and attention mechanism, a heterogeneous graph convolutional attention network (HCAN) for the classification of ASD is proposed. This work is inspired by the heterogeneous graph attention network (HAN) for node classification [30]. Different from homogeneous graphs, heterogeneous graphs have multiple types of nodes and arcs. In HCAN, different phenotype features are mapped into different types of arcs; thus, richer hidden information is contained.
The main contributions of this work are summarized as follows.
• A heterogeneous graph construction method is proposed for the ABIDE dataset. The heterogeneous graph contains not only imaging data features but also rich phenotypic data features.
• Based on the heterogeneous graph, a heterogeneous graph convolutional attention network for ASD classification is proposed. With the attention mechanism, the importance of phenotype information can be fully considered.
• On the ABIDE dataset, the proposed method achieves the best classification accuracy of 82.9%, which is the new state of the art and significantly outperforms previous approaches.
The rest of the paper is organized as follows. In Sect. 2, the ABIDE dataset and the preprocessing of the data are introduced. In Sect. 3, the proposed HCAN method is described, including the construction of a heterogeneous graph, the heterogeneous graph convolutional network, the semantic attention network, and the model loss function. In Sect. 4, numerical results are shown, and the proposed method is compared with other methods in the literature. Finally, conclusions are drawn in Sect. 5.

Data and preprocessing
This paper carries out research on the challenging public ABIDE I dataset [31], which aggregates data from 17 different international collection sites and shares neuroimaging and phenotype data of 1112 subjects. In the experiment, 871 subjects (403 ASD patients and 468 healthy controls) who met the imaging quality and atypical information criteria were used. The related phenotypic data of these subjects, including 'Age', 'Handedness', and 'Sex', are shown in Table 1.
The preprocessed data of the 871 subjects were downloaded from the Preprocessed Connectomes Project (http://preprocessed-connectomes-project.org/). Data preprocessing was performed using the configurable pipeline for the analysis of connectomes. According to the Harvard-Oxford atlas, there are 111 ROIs in the brain [32]. The mean time series for each ROI was calculated. Then the distance correlation coefficients between the mean time series of different ROIs were calculated to obtain a 111 × 111 functional connection matrix. Finally, the 6105 elements in the upper triangular part of the matrix (excluding the diagonal) were extracted to form a functional connection feature vector.
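The vectorization step can be illustrated with a minimal sketch in plain Python. The function name `upper_triangle_vector` and the toy matrix are hypothetical and only serve to show why 111 ROIs yield 111 × 110 / 2 = 6105 features; how the connectivity values themselves are computed is not covered here.

```python
def upper_triangle_vector(matrix):
    """Flatten the strictly upper-triangular part of a square matrix, row by row."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square")
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


# Toy 4-ROI symmetric connectivity matrix (values are illustrative only).
toy_fc = [
    [1.0, 0.2, 0.5, 0.1],
    [0.2, 1.0, 0.3, 0.4],
    [0.5, 0.3, 1.0, 0.6],
    [0.1, 0.4, 0.6, 1.0],
]
toy_vector = upper_triangle_vector(toy_fc)  # 4 * 3 / 2 = 6 entries

# With 111 ROIs the same rule gives 111 * 110 / 2 features per subject.
n_rois = 111
n_features = n_rois * (n_rois - 1) // 2
```

Because the connectivity matrix is symmetric, the lower triangle carries no additional information, and the diagonal (self-connectivity) is constant, so neither is kept.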

The proposed method
In this section, the proposed HCAN method for the classification of ASD is introduced. The architecture of the proposed HCAN model is shown in Fig. 1, which includes a multilayer HCAN and an MLP. The input of the model is fMRI and phenotypic data, while the output is the prediction result (i.e., the probability of ASD) for each sample.
For a specified classification task, the HCAN model works as follows. First, a heterogeneous population graph is constructed from the fMRI and phenotypic data. Then, the heterogeneous graph is processed by a multilayer HCAN to extract fused features with semantic information. Next, the fused features pass through a dropout layer for regularization and are fed into an MLP with softmax to output the prediction results. The structure of an HCAN layer is shown in Fig. 2. Each HCAN layer consists of a heterogeneous graph convolutional network (HGCN) and a semantic attention network (SAN).
The proposed method is described in detail in the following three parts: the construction of a heterogeneous population graph, the HCAN model, and the loss function of the model.

Heterogeneous graph construction
Different from homogeneous graphs, a heterogeneous graph is a special type of information network that involves multiple types of nodes or multiple types of arcs [33].
Definition 1 ([33]) A heterogeneous graph $G = (V, E)$ consists of a node set $V$ and an arc set $E$. Moreover, there are two mappings $\varphi : V \to Q$ and $\psi : E \to S$, where $Q$ is the set of node types, $S$ is the set of arc types, and $|Q| + |S| > 2$.
In a heterogeneous graph, two nodes can be connected through different semantic paths. These paths are called meta-paths.
Definition 2 ([34]) For a heterogeneous graph $G$, a meta-path $\Phi$ is defined as a path of the form
$$Q_1 \xrightarrow{S_1} Q_2 \xrightarrow{S_2} \cdots \xrightarrow{S_l} Q_{l+1},$$
which describes a composite relation $S = S_1 \circ S_2 \circ \cdots \circ S_l$ between node types $Q_1$ and $Q_{l+1}$, where $\circ$ refers to the composition operator on relations.
In a heterogeneous graph, the relations defined by different meta-paths are different, and they can be used to analyse the composite connections and meanings between different nodes. Given a meta-path, the neighbors of a node are defined as all the other nodes on the path. A set of neighbors based on a meta-path contains structure information and specific semantics.

Fig. 2 The structure of an HCAN layer. Each HCAN layer consists of a heterogeneous graph convolutional network (HGCN) and a semantic attention network
This paper constructs a heterogeneous population graph of the ABIDE dataset, in which image-based functional connection features are contained in the nodes, while non-image phenotype features are contained in the arcs. Only one type of node (i.e., sample nodes) is constructed, and there is a one-to-one correspondence between the nodes and the samples. Each node contains the image-based feature vector of a sample: the functional connection feature vector after feature selection is used as the feature vector of the sample node.
Once the sample nodes are set, they are connected by different arcs according to the non-image phenotype features of the samples. Specifically, for a given non-image phenotype feature, the samples with the same attribute value are connected. Therefore, the number of arc types is equal to the number of non-image phenotype features involved. In this work, three types of arcs based on 'site', 'sex', and 'handedness' are constructed. For example, for the phenotype feature 'sex', all the samples with the sex 'male' are connected to each other, all the samples with the sex 'female' are connected to each other, and those connections are regarded as arcs of the 'sex' type. All the arcs are undirected and unweighted, which forms an undirected unweighted heterogeneous graph. Figure 3 shows the construction of a heterogeneous population graph based on the ABIDE dataset, in which red, blue, and green are used to distinguish the three types of arcs based on 'site', 'sex', and 'handedness', respectively.
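The arc construction rule can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in the experiments; the function `build_typed_arcs`, the dictionary-based subject records, and the toy attribute values are assumptions made for the example.

```python
from itertools import combinations


def build_typed_arcs(subjects, arc_types):
    """Return {arc_type: set of undirected arcs (i, j) with i < j}.

    Two subjects are joined by an arc of a given type when they share the
    same value of the corresponding phenotype feature.
    """
    arcs = {}
    for arc_type in arc_types:
        # Group subject indices by their value for this phenotype feature.
        groups = {}
        for idx, subject in enumerate(subjects):
            groups.setdefault(subject[arc_type], []).append(idx)
        # Fully connect each group; arcs are undirected and unweighted.
        arcs[arc_type] = {
            pair
            for members in groups.values()
            for pair in combinations(members, 2)
        }
    return arcs


# Hypothetical toy records; the values are illustrative, not ABIDE data.
toy_subjects = [
    {"site": "A", "sex": "male", "handedness": "right"},
    {"site": "A", "sex": "female", "handedness": "right"},
    {"site": "B", "sex": "male", "handedness": "left"},
    {"site": "B", "sex": "male", "handedness": "right"},
]
typed_arcs = build_typed_arcs(toy_subjects, ["site", "sex", "handedness"])
```

In this toy graph, subjects 2 and 3 are joined by both a 'site' arc and a 'sex' arc, which is exactly the situation a single scalar arc weight in a homogeneous graph cannot distinguish.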

Heterogeneous graph convolutional networks
Graph convolutional networks are important tools for extracting features from graph data. However, standard graph convolutional networks can only be trained on homogeneous graphs. Therefore, this research designs a heterogeneous graph convolutional network (HGCN) to extract features from heterogeneous graphs. The HGCN consists of two parts: the decomposition of a heterogeneous graph and residual graph convolutional networks.
In an HGCN, the constructed heterogeneous graph is first decomposed into several homogeneous graphs based on the meta-paths. Then, for each homogeneous graph, an independent residual graph convolutional network is set up. Thus, for each sample node in the heterogeneous graph, different embedding vectors (representations) are obtained through the forward propagation of the different residual graph convolutional networks, and they can be integrated into a fused feature vector by a weighted sum.

Decomposition of a heterogeneous graph
In a heterogeneous graph, sample nodes are connected by different types of arcs based on meta-paths. The neighbor connections represent a certain type of relation between the samples, and connected nodes are more likely to have similar features than unconnected ones. For example, if two sample nodes are connected based on the 'node-sex-node' meta-path, then the two samples have the same 'sex' attribute. To fully use and mine the structure information and specific semantic information in a meta-path, the heterogeneous graph is decomposed into multiple homogeneous graphs based on the meta-paths.
For a specific meta-path, a homogeneous graph is obtained by connecting each node with all of its neighbor nodes in a new graph. For the ABIDE heterogeneous population graph, based on the three meta-paths 'node-sex-node', 'node-site-node', and 'node-handedness-node', three homogeneous graphs (see Fig. 4) can be obtained. Note that all the nodes in the homogeneous graphs, together with their feature vectors, are inherited from the heterogeneous graph.
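As a rough illustration of the decomposition, the sketch below turns a set of typed arcs into one symmetric 0/1 adjacency matrix per meta-path. The function `decompose_by_meta_path` and the toy arc sets are hypothetical; node features are not shown because every homogeneous graph simply reuses the same node feature matrix.

```python
def decompose_by_meta_path(num_nodes, typed_arcs):
    """Split typed arcs into one symmetric 0/1 adjacency matrix per arc type."""
    graphs = {}
    for arc_type, arcs in typed_arcs.items():
        adj = [[0] * num_nodes for _ in range(num_nodes)]
        for i, j in arcs:
            # Arcs are undirected, so both directions are set.
            adj[i][j] = 1
            adj[j][i] = 1
        graphs[arc_type] = adj
    return graphs


# Toy typed arcs over four sample nodes (illustrative only).
toy_arcs = {
    "sex": {(0, 2), (0, 3), (2, 3)},
    "site": {(0, 1), (2, 3)},
    "handedness": {(0, 1), (0, 3), (1, 3)},
}
homogeneous = decompose_by_meta_path(4, toy_arcs)
```

Each resulting matrix can then be passed to its own graph convolutional network, while the node set stays identical across the three graphs.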

Residual graph convolutional networks
For each homogeneous graph, a residual graph convolutional network is constructed to extract features. Consider an undirected unweighted graph $G = (V, E, A)$, where $V$ is a node set with $|V| = N$, $E$ is an arc set, and $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix. Let $D$ be the degree matrix and $L$ be the normalized graph Laplacian; then
$$L = I_N - D^{-1/2} A D^{-1/2},$$
where $I_N \in \mathbb{R}^{N \times N}$ is an identity matrix. $L$ can be decomposed as $L = U \Lambda U^{\top}$ with the matrix of eigenvectors $U$ and the diagonal matrix of its eigenvalues $\Lambda$. Suppose that each node $i$ in the graph contains only a one-dimensional feature $x_i$; then the signal vector formed over all the nodes is $x \in \mathbb{R}^{N}$. Spectral convolution on a graph is defined as the multiplication of the signal $x$ with a filter (convolution kernel function) $g_\theta = \mathrm{diag}(\theta)$, parameterized by $\theta \in \mathbb{R}^{N}$, in the Fourier domain:
$$g_\theta \star x = U g_\theta U^{\top} x.$$
In view of the high computational complexity of graph convolution operations, the Chebyshev polynomial expansion method can be applied to approximate the convolution kernel function $g_\theta$. Usually, a first-order Chebyshev approximation is adopted. Thus, the convolution operation of a graph signal can be approximated as follows:
$$g_{\theta'} \star x \approx \theta' \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} x,$$
where $\theta'$ is a convolution kernel parameter, $\tilde{A} = A + I_N$, $\tilde{D}$ is a diagonal matrix, and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. At this point, the graph convolution expression of the one-dimensional signal on the graph is obtained. Since each node may contain multiple features, i.e., the signal on a node is multi-channel, the one-dimensional signal $x$ is generalized to a $C$-channel signal $X \in \mathbb{R}^{N \times C}$. Suppose there are $F$ convolution kernels (the number of convolution kernels is also referred to as the hidden size of an HCAN layer); the convolution operation for $X$ is as follows:
$$Z = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta,$$
where $\Theta \in \mathbb{R}^{C \times F}$ is a matrix of convolution kernel parameters, and $Z \in \mathbb{R}^{N \times F}$ is the convolved signal matrix.

Fig. 4 Decomposition of a heterogeneous graph into homogeneous graphs based on meta-paths
Therefore, the graph convolutional network has the following layer-wise propagation rule:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right),$$
where $H^{(l)} \in \mathbb{R}^{N \times D}$ is the output of the $l$th layer of the network ($H^{(0)} = X$), $\sigma$ denotes an activation function such as $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$, and $W^{(l)}$ is the trainable network parameter of the $l$th layer. Considering that the graph convolutional network is difficult to train, a residual connection is added to the graph convolutional network; thus, the above layer-wise propagation rule is changed to
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right) + H^{(l)} M,$$
where $M$ is a linear transformation matrix. When the dimensions of $H^{(l)}$ and $H^{(l+1)}$ are the same, $M$ is an identity matrix.
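A single residual graph convolution step can be sketched in plain Python as below. This is only an illustration under stated assumptions: the helper names are hypothetical, the residual term is added after the activation and applied as a right-multiplication of the layer input by M, and dense list-of-lists matrices are used for readability rather than efficiency.

```python
import math


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Symmetrically normalize A + I by the degrees of A + I."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def residual_gcn_layer(adj, h, w, m=None):
    """One layer: ReLU(A_hat @ H @ W) + H @ M (assumed residual placement).

    If m is None, M is taken to be the identity, which requires that the
    input and output feature dimensions match.
    """
    a_hat = normalized_adjacency(adj)
    conv = matmul(matmul(a_hat, h), w)
    activated = [[max(0.0, v) for v in row] for row in conv]
    skip = h if m is None else matmul(h, m)
    if len(skip[0]) != len(activated[0]):
        raise ValueError("residual branch and convolution output sizes differ")
    return [[a + s for a, s in zip(ra, rs)] for ra, rs in zip(activated, skip)]


# Two connected nodes with 2-dimensional features and identity weights.
toy_adj = [[0, 1], [1, 0]]
toy_h = [[1.0, 0.0], [0.0, 1.0]]
toy_w = [[1.0, 0.0], [0.0, 1.0]]
toy_out = residual_gcn_layer(toy_adj, toy_h, toy_w)
```

In the toy case the normalized adjacency averages the two nodes' features, and the skip branch adds each node's own input back, so a node keeps its identity even after neighborhood smoothing.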

Semantic attention networks
For each sample node, after forward propagation through the heterogeneous graph convolutional network, three embedding vectors are obtained. Each embedding vector contains a piece of specific semantic information, which is related to its corresponding meta-path. Since the importance of that semantic information to the classification task is difficult to determine, a semantic-level attention network is constructed to learn the importance of different semantic information. Based on the three meta-paths, the attention weights for the three specific semantics are
$$(\beta_{\Phi_1}, \beta_{\Phi_2}, \beta_{\Phi_3}) = \mathrm{att}_{sem}(Z_{\Phi_1}, Z_{\Phi_2}, Z_{\Phi_3}),$$
where $Z_{\Phi_1}$, $Z_{\Phi_2}$ and $Z_{\Phi_3}$ represent the embedding vectors of all the sample nodes obtained based on meta-paths $\Phi_1$, $\Phi_2$, and $\Phi_3$, respectively, and $\mathrm{att}_{sem}(\cdot)$ represents the neural network for computing attention weights (which learns the importance of each piece of semantic information through back-propagation). The process of computing semantic attention weights is shown in Fig. 5.
Let $z^{\Phi_i}_{j\cdot}$ be the $j$th row of $Z_{\Phi_i}$, i.e., the embedding vector of node $j$ ($j \in V$) based on meta-path $\Phi_i$. It contains specific semantic information related to meta-path $\Phi_i$. In a semantic attention network, the embedding vector $z^{\Phi_i}_{j\cdot}$ is first transformed into an embedding representation of the specific semantic through a learnable nonlinear transformation
$$\tanh\left(W z^{\Phi_i}_{j\cdot} + b\right),$$
where $W$ is a weight matrix, and $b$ is an offset vector. Then, a learnable semantic-level attention vector $q$ is used to measure the importance of the specific semantic by calculating the similarity between the embedding representation $\tanh(W z^{\Phi_i}_{j\cdot} + b)$ and $q$. Next, for the specific semantic based on meta-path $\Phi_i$, the average $w_{\Phi_i}$ of those importance factors over all the nodes is calculated with
$$w_{\Phi_i} = \frac{1}{|V|} \sum_{j \in V} q^{\top} \tanh\left(W z^{\Phi_i}_{j\cdot} + b\right).$$
Furthermore, a softmax function is used to normalize $w_{\Phi_i}$ into a semantic attention weight. Suppose the semantic attention weight for meta-path $\Phi_i$ is $\beta_{\Phi_i}$; then
$$\beta_{\Phi_i} = \frac{\exp(w_{\Phi_i})}{\sum_{k=1}^{3} \exp(w_{\Phi_k})},$$
which represents the contribution of the semantic based on meta-path $\Phi_i$ to the classification task. The higher $\beta_{\Phi_i}$ is, the more important its semantic information is. For different tasks, $\beta_{\Phi_i}$ may be different. Finally, the weights $\beta_{\Phi_i}$ are used as coefficients to integrate the embedding vectors $Z_{\Phi_i}$, $i = 1, 2, 3$, into a final embedding vector $Z$:
$$Z = \sum_{i=1}^{3} \beta_{\Phi_i} Z_{\Phi_i}.$$
Vector $Z$ has the same dimension as $Z_{\Phi_1}$, $Z_{\Phi_2}$ and $Z_{\Phi_3}$. It is the output vector of an HCAN layer.

Fig. 5 Computation of attention weight $\beta_{\Phi_i}$ for embedding vector $Z_{\Phi_i}$ in a semantic attention network
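The semantic attention computation can be sketched as follows: each meta-path's node embeddings are scored by the mean of q·tanh(Wz + b) over nodes, the scores are passed through a softmax, and the embeddings are fused as a weighted sum. The function name, argument layout, and toy parameter values are assumptions made for this illustration; in the model, W, b and q would be learned by back-propagation rather than fixed by hand.

```python
import math


def semantic_attention(embeddings, w, b, q):
    """Fuse per-meta-path embeddings with softmax-normalized semantic weights.

    embeddings: list of P matrices, each N x F (one per meta-path)
    w: H x F weight matrix, b: length-H offset, q: length-H attention vector
    Returns (betas, fused) where fused is an N x F matrix.
    """
    scores = []
    for z in embeddings:
        total = 0.0
        for row in z:
            hidden = [
                math.tanh(sum(wk * x for wk, x in zip(w_row, row)) + bk)
                for w_row, bk in zip(w, b)
            ]
            total += sum(qk * hk for qk, hk in zip(q, hidden))
        scores.append(total / len(z))  # average importance over all nodes

    # Softmax over meta-paths (shifted by the max for numerical stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    betas = [e / norm for e in exps]

    n, f = len(embeddings[0]), len(embeddings[0][0])
    fused = [
        [sum(beta * z[i][k] for beta, z in zip(betas, embeddings)) for k in range(f)]
        for i in range(n)
    ]
    return betas, fused


# Toy setup: three meta-paths, two nodes, two features, hidden size 1.
toy_z = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[-1.0, 0.0], [-1.0, 0.0]],
]
toy_betas, toy_fused = semantic_attention(toy_z, w=[[1.0, 0.0]], b=[0.0], q=[1.0])
```

Because the weights are shared across nodes (one scalar per meta-path), the fused matrix keeps the same shape as each input embedding matrix.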

The model loss function
The final embedding vector $Z$ of the last HCAN layer passes through a dropout layer that drops part of the features. Then, the feature embeddings after dropout are fed into an MLP with a softmax function to output a class vector $y'$, which is the vector of predicted class values of the samples. Suppose $T$ is a set of selected nodes, $|T|$ is the number of nodes in $T$, and $Y$ is the set of classes. For node $l$, we use $y_{li}$ and $y'_{li}$ to represent its true value and predicted value for class $i$, respectively. We use the cross-entropy loss function to calculate the loss between the predicted value and the true value. Let $L_T$ be the loss of node set $T$; then it is calculated as follows:
$$L_T = -\frac{1}{|T|} \sum_{l \in T} \sum_{i \in Y} y_{li} \ln y'_{li}.$$
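A minimal sketch of the cross-entropy computation over a node set is given below, assuming one-hot true labels, softmax outputs as predictions, and averaging over the nodes in T. The function name `cross_entropy` and the toy values are hypothetical.

```python
import math


def cross_entropy(true_onehot, predicted_probs):
    """Mean cross-entropy over a set of nodes.

    true_onehot: list of one-hot label vectors (one per node in T)
    predicted_probs: list of predicted class-probability vectors
    """
    total = 0.0
    for y_true, y_pred in zip(true_onehot, predicted_probs):
        # Only the true class contributes, since the other labels are zero.
        total -= sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)
    return total / len(true_onehot)


# Two nodes, two classes (e.g. ASD vs. healthy control); values are illustrative.
toy_true = [[1, 0], [0, 1]]
toy_pred = [[0.8, 0.2], [0.4, 0.6]]
toy_loss = cross_entropy(toy_true, toy_pred)
```

The loss shrinks toward zero as the predicted probability of each node's true class approaches one, and grows without bound as it approaches zero.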

Results and discussion
In this section, the proposed method is tested on the ABIDE I dataset. FC features and non-image phenotype features of the selected subjects are used to construct a heterogeneous population graph.
For each sample node, 800 features selected from the 6105 functional connectivity features with the deep feature selection method (see [28]) are used as the node feature vector. The model is implemented in PyTorch. The model is trained on a computer with an Intel(R) Core(TM) i5-9300H CPU with 4 cores running at 4.00 GHz, 8 GB RAM, and an NVIDIA GeForce GTX 1650MQ GPU with 896 CUDA cores and 4 GB GDDR5. During model training, GPU acceleration and early stopping are used.
The parameters of the model are set as follows. The HCAN model includes two HCAN layers and an MLP. For each HCAN layer, the hidden size is 20, while in the MLP, the number of output units is 2. The Adam algorithm is used to optimize the model loss, where the learning rate is set to 0.005, and the weight decay is set to $5 \times 10^{-4}$. For the dropout layer, the dropout rate is set to 0.6.

Experiments on the ABIDE database
The proposed method is first tested on the whole dataset with 871 subjects. In the experiment, a 10-fold cross-validation scheme that mixes data from all 17 sites while keeping the proportions between the different sites is used to evaluate the model performance.
The average accuracy (ACC), sensitivity (SEN), specificity (SPE) and area under the curve (AUC) are reported. The proposed HCAN method achieves an average ACC of 82.9%, SEN of 76.7%, SPE of 86.6% and AUC of 84.6%. The running time of the 10-fold cross-validation is 256 s. Then, 5-fold cross-validation is performed on each site separately. The average ACC, SEN, SPE and AUC values are provided in Table 2. From the table, it can be seen that the SPE value of STANFORD is only 53.3% and the SEN value of SDSU is only 50%. The SEN values for both CALTECH and STANFORD are equal to 100%, which indicates that all the ASD subjects in the testing sets of these two sites were identified correctly. For CMU, it should be noted that there are only 11 subjects, and the ACC, SEN and SPE values are quite low (close to 60%). Over all the sites, the mean ACC, SEN, SPE and AUC values are 75.6%, 72.6%, 77.3% and 83.0%, respectively. In general, the proposed method performs well on the per-site datasets.
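For reference, ACC, SEN and SPE can be computed from binary predictions as in the sketch below, assuming that ASD is the positive class (label 1) and healthy controls are the negative class (label 0). The function name and the toy labels are hypothetical, AUC is omitted because it requires ranked scores rather than hard predictions, and the sketch assumes both classes are present so that no denominator is zero.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "ACC": (tp + tn) / len(pairs),  # fraction of correct predictions
        "SEN": tp / (tp + fn),          # true positive rate
        "SPE": tn / (tn + fp),          # true negative rate
    }


# Toy fold with 4 positive and 6 negative subjects (illustrative only).
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
toy_metrics = classification_metrics(toy_true, toy_pred)
```

A sensitivity of 100% on a site therefore means that no positive subject in its test folds was misclassified, regardless of how many negatives were.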

Impact of model hyperparameters
This paper carries out experiments to study the impact of the model hyperparameters on the classification performance.In the HCAN model, the following three hyperparameters, namely, the number of HCAN layers, hidden size, and dropout rate, are investigated.
First, the relationship between the number of HCAN layers and the classification performance is explored. The number of HCAN layers is gradually increased from 1 to 5 while keeping the hidden size at 20 and the dropout rate at 0.6. The accuracy and F1 score are computed. Figure 6 shows the comparative boxplots of accuracy and F1. Each boxplot displays the distribution of the data based on a five-number summary (minimum, first quartile, median, third quartile, and maximum); mean values are also shown as solid points. When the number of HCAN layers increases from 1 to 2, the model performance improves significantly, while when the number of HCAN layers continues to increase, the model performance decreases.
Then, the impact of the hidden size on the classification results is studied. The number of HCAN layers and the dropout rate are kept at 2 and 0.6, respectively. The hidden size is changed from 12 to 28 with a step size of 4. Figure 7 shows the impact of the hidden size. Up to a hidden size of 20, the model performance improves with increasing hidden size. However, once the hidden size exceeds 20, the model performance worsens.
In general, hyperparameters such as the number of layers and the hidden size are related to the model complexity: a network with more layers or a larger hidden size is more complex. It seems that when the model complexity is low, increasing it can significantly improve the model performance, but once the complexity reaches a certain degree, increasing it further causes overfitting and decreases the model performance.
Finally, the influence of the dropout rate on the model performance is investigated. Dropout can be used to improve the model performance by reducing overfitting. The dropout rate is changed from 0 to 0.8 with a step size of 0.2, while the number of HCAN layers and the hidden size are kept at 2 and 20, respectively. Figure 8 shows the change in accuracy and F1 score with the dropout rate. Both the accuracy and the F1 score achieve their highest values when the dropout rate is equal to 0.6. However, when the dropout rate exceeds 0.6, the model performance decreases significantly due to the loss of feature information.

Comparison with other methods
In our previous work [28], it was shown that the GCN method with deep feature selection is superior to some machine learning methods for the classification of ASD. In this work, the same comparisons are not repeated. Instead, to show the superior performance of our method, this paper compares the proposed method with several deep learning methods, i.e., MLP, HAN [30], GCN [28] and ASD-DiagNet [14].
To establish a fair comparison, all the above methods are run on the same computer and use the same 800 selected functional connection features. The same training and testing sets are used in the 10-fold cross-validation for all the methods. The parameters of MLP, HAN and GCN are selected by grid search. In the MLP, 3 hidden layers, 16 hidden neurons and a dropout rate of 0.2 are set; in the GCN, 1 hidden layer and a dropout rate of 0.3 are set, and the graph weight matrix is constructed as described in [28]. In the HAN model, 2 HAN layers and 1 MLP layer are used; the output vector dimension of each HAN layer is 20; the output vector dimension of the MLP layer is 2; and the dropout rate is 0.6. For the MLP, HAN and HCAN models, a learning rate of 0.005 and a weight decay of $5 \times 10^{-4}$ are used in the Adam optimizer. For ASD-DiagNet, the code from https://github.com/pcdslab/ASD-DiagNet was downloaded, and the same parameters as in [14] were used.
The average ACC, SEN, SPE and AUC values, as well as their standard deviations, are calculated. The running time of each method is also recorded. The results are listed in Table 3. From the table, it can be seen that the ACC, SEN and AUC of the HAN method are the lowest among all the methods, while its computation time is the longest. Therefore, the performance of HAN is the worst. GCN and MLP perform better than ASD-DiagNet and HAN in terms of ACC, SEN, SPE, AUC and computation time. The proposed HCAN method achieves the best performance with an average accuracy of 82.9% and an average SEN of 86.6%. It is superior to the MLP, GCN, and HAN methods. It takes 256 s for HCAN to finish the 10-fold cross-validation, which is longer than MLP (156 s) and GCN (186 s). This is because HCAN is more complex than the MLP and GCN.
In the literature, besides Shao et al. [28], other researchers, i.e., Mostafa et al. [10], Hu et al. [15], Liu et al. [16], Brahim and Farrugia [17], Yin et al. [18], Parisot et al. [22] and Rakhimberdina et al. [24], have also used the same 871 subjects (403 patients with ASD and 468 healthy controls) in the ABIDE I dataset to classify ASD patients and normal controls. Therefore, this paper also compares the proposed method with these methods and summarizes the comparative results in Table 4. In the table, 'Reference', 'Method', 'Number of ROIs' (used for constructing features), and 'Accuracy' are listed.
From Table 4, it can be concluded that the proposed method performs the best among all the above methods. To the best of our knowledge, this result is so far the best in the literature for ASD classification with the selected 871 subjects.
The experimental results show that integrating non-imaging data has an important influence on the classification performance for ASD. By using all potential phenotypic measures and introducing an attention mechanism, new important aggregated features can be extracted by the HCAN network; thus, the classification performance can be improved. It should be noted that since the GCN involved in the model can only be applied to graphs of a fixed structure, predicting new subjects requires reconstructing the graph using the phenotypic information of all the subjects. This results in a high computational cost, which is the main limitation of the proposed method.

Conclusions
In this paper, a deep learning model, namely, the heterogeneous graph convolutional attention network model, is constructed. The model is based on a heterogeneous graph and integrates a GCN and an attention mechanism. It uses rs-fMRI data and phenotypic data to classify ASD. The model can effectively extract features from a heterogeneous graph by integrating the semantic information of different meta-paths with an attention mechanism. Experimental results have shown that the proposed model outperforms other methods and reaches the current state of the art.

Fig. 1 The architecture of the HCAN model, which includes a multilayer HCAN and an MLP

Fig. 3 Construction of a heterogeneous graph with functional connection features and non-image phenotype features. Image-based functional connection features are contained in the nodes, while non-image phenotype features are contained in the arcs

Fig. 6 Impact of the HCAN layer number on the model performance

Fig. 7 Impact of the hidden size on the model performance

Fig. 8 Impact of the dropout rate on the model performance

Table 2 Average ACC, SEN, SPE and AUC values on individual site data using 5-fold cross-validation with our proposed method. Column 'ASD/HC' shows the number of subjects with ASD and healthy controls, respectively

Table 3 Comparative results of different methods on the whole ABIDE dataset with 871 subjects. The highest average values of ACC, SEN, AUC, and SPE are indicated in bold

Table 4 ASD classification on the ABIDE dataset with 871 subjects