SNP and gene networks construction and analysis from classification of copy number variations data
© Liu et al; licensee BioMed Central Ltd. 2011
Published: 27 July 2011
Skip to main content
© Liu et al; licensee BioMed Central Ltd. 2011
Published: 27 July 2011
Detection of genomic DNA copy number variations (CNVs) can provide a complete and more comprehensive view of human disease. It is interesting to identify and represent relevant CNVs from a genome-wide data due to high data volume and the complexity of interactions.
In this paper, we incorporate the DNA copy number variation data derived from SNP arrays into a computational shrunken model and formalize the detection of copy number variations as a case-control classification problem. More than 80% accuracy can be obtained using our classification model and by shrinkage, the number of relevant CNVs to disease can be determined. In order to understand relevant CNVs, we study their corresponding SNPs in the genome and a statistical software PLINK is employed to compute the pair-wise SNP-SNP interactions, and identify SNP networks based on their P-values. Our selected SNP networks are statistically significant compared with random SNP networks and play a role in the biological process. For the unique genes that those SNPs are located in, a gene-gene similarity value is computed using GOSemSim and gene pairs that have similarity values being greater than a threshold are selected to construct gene networks. A gene enrichment analysis show that our gene networks are functionally important.
Experimental results demonstrate that our selected SNP and gene networks based on the selected CNVs contain some functional relationships directly or indirectly to disease study.
Two datasets are given to demonstrate the effectiveness of the introduced method. Some statistical and biological analysis show that this shrunken classification model is effective in identifying CNVs from genome-wide data and our proposed framework has a potential to become a useful analysis tool for SNP data sets.
Copy number variation (CNV) is defined as a genomic segment range from one kilobase to several megabases in size, in which copy number differences have been observed by comparison of two or more reference genomes [1, 2]. Human beings ordinarily have two copies of each autosomal region, one per chromosome. CNVs can be caused by genomic rearrangements such as inversions, deletions and duplications.
The fact that DNA copy number variation is a widespread and common phenomenon among human beings was first discovered [3, 4] following the completion of the human genome project. The high variability of the copy number throughout the human genome have been found by investigating on 270 HapMap individuals [5, 6]. Various studies have been developed for genome-wide CNV analysis to suggest their significant role in understanding human genetics. Redon et al.  published the first comprehensive and global map of CNVs in a genome-wide scale for making the beginning of large-scale copy number analysis. Comparative genomic hybridization (CGH)  and array-based comparative genomic hybridization  can detect gains and losses of genomic segments. Based on CGH, Wang et al.  proposed a new algorithm “Cluster Along Chromosomes” (CLAC), which builds hierarchical clustering-style trees along each chromosome and selects the interesting clusters by controlling False Discovery Rate (FDR) at a certain level. Bayesian model is a popular method for CNV detection. Pique-Regi et al.  exploited the use of piecewise constant (PWC) vectors to represent genome copy number and sparse Bayesian learning (SBL) to detect CNA breakpoints. Wu et al.  implemented Bayesian segmentation approach to carry out segmentation and assigning copy number status simultaneously. Recently, Chen et al.  used the mean and variance change point model (MVCM) to detect CNVs or breakpoints with an approximate p-value for statistical testing. In addition, Oldridge et al.  quantitatively evaluated several processing and segmentation strategies when short-sequence oligonucleotide arrays are applied and provide guidelines to optimize performance based on study-specific objectives.
Besides the above mentioned CNV detection methods, high-density single nucleotide polymorphism (SNP) arrays are becoming more and more popular in CNV detection. The main reason is that the inheritance pattern and linkage disequilibrium of CNVs are similar to those of the SNPs [6, 7]. Zhao et al.  measured the locus-specific hybridization intensity by hybridizing genomic representations of both DNA to SNP arrays to detect some novel CNVs in cancer samples. Huang et al.  proposed an algorithm that used whole genome sampling analysis (WGSA) by jointly measuring perfect match intensity and discrimination ratios to identify copy number changes. An QuantiSNP algorithm  described in 2007 used Objective Bayes Hidden-Markov Model (OB-HMM) by incorporating log R ratio and B allele frequency to set certain hyperparameters as priors, and using a novel re-sampling framework to calibrate the model to a fixed false positive error rate for CNVs detection. Wang et al.  designed a tool named PennCNV, a hidden Markov model (HMM) based approach, for kilobase-resolution detection of CNVs by combining multiple sources of information together, not only the total signal intensity and allelic intensity ratio of each SNP marker, but also the neighboring distances, the allele frequency and the pedigree information.
Recently, disease classification using copy number variation data has been demonstrated by several groups [19–22]. The motivation of this kind of method is that appropriate machine learning models for disease classification using copy number variation data will be effective not only for clinical treatment, but also for identification of disease susceptibility loci. Generally, copy numbers at probe loci are used directly as features. In a CNV data set, the association between a disease and a set of relevant CNVs are investigated. Patients and normals are often categorized in groups according to their copy number changes. Thousands of CNVs in different regions of chromosomes are used to describe characteristics of patient/normal samples.
When many CNVs are used to detect the association between a disease and multiple marker genotypes, we expect in a typical data set that contains the CNV data of several thousands of CNVs in different individuals, it is common to find only several numbers of copy number positions having genetic patterns that are highly specific to each group of individuals. The CNVs are called the relevant CNVs, as opposed to the irrelevant CNVs that do not help much in identifying the group (i.e., individuals of the same type). Due to the large number of CNVs being irrelevant to each group, two individuals in the same group could have low similarity when measured by a simple similarity function that consider the characteristics of all CNVs. The groups may thus be undetectable by classification algorithms. Here, we are interested in the development of high-dimensional numerical classification algorithm that can identify group of individuals and their relevant CNVs, i.e., detect the association between a disease and multiple marker variations. In this paper, we address this problem by applying a shrunken dissimilarity measure to copy number variation values derived from genome-wide SNP genotyping data. Performance was measured via cross-validation classification accuracy. By shrinkage, the number of relevant CNVs can be determined. In order to understand relevant CNVs, we study their corresponding SNPs in the genome and a statistical software PLINK is employed to compute the pair-wise SNP-SNP interactions, and identify SNP network based on their P-values. Some statistical analysis are done to illustrate the significance of our selected SNP networks. For the unique genes that those SNPs are located in, a gene-gene similarity value is computed using GOSemSim and gene pairs that has a similarity value being greater than a threshold are selected to construct gene networks. A gene enrichment analysis are done to show that some molecular functions and biological process are significantly associated with our gene networks. The whole framework indicates that our classification model is efficient in high dimensional CNV detection and SNP and gene networks constructed afterwards have relationships directly or indirectly to disease study.
The outline of this paper is given as follows. In Section 2, we propose the LRR calculation for signal intensities, the shrunken dissimilarity measure to analyze CNV data classification and the logistic model to calculate SNP interactions. In Section 3, we present experimental results on two real CNV data sets derived from genome-wide SNP arrays, and illustrate their corresponding SNP and gene networks for statistical and biological analysis. Finally, we give concluding remarks in Section 4.
There are many available tools that can be used to do genotyping call and Log R Ratio (LRR) calculation, such as Birdsuite , CNAG , dChip [25, 26] and GLAD . We calculated LRR values using PennCNV , which is a public available software for copy number variation detection from SNP genotyping arrays. For each SNP in a genome-wide data, the raw signal intensity values for its two alleles A and B are measured and then subject to a normalization procedure. The normalized signal intensity values should reveal three clusters (AA, AB, BB) representing three distinct genotypes. A genotype calling is a genotype assignment based on these three canonical clusters.
The normalization procedure produces X and Y values for each SNP, representing the experiment-wide normalized signal intensity on alleles A and B, respectively. One additional measurement will then be calculated for each SNP, where R = X + Y refers to the total signal intensity. As for a normalized measure of total signal intensity, the LRR value for each SNP is then calculated as
LRR = log2(R observed /R expected ) (1)
where R expected is computed from linear interpolation of canonical genotype clusters . LRR refers to the logarithm (in base 2) of the total observed normalized intensity R observed of the two alleles A and B relative to the expected R expected . For normal diploid genotypes, LRR should fluctuate randomly around zero.
It is well-known that DNA microarray data, which can simultaneously measure the expression level of thousands of different genes, have been successfully used to identify genetic heterogeneity of disease. However, mircoarray data typically has a large number of genes (features) and relatively few samples (observations), meaning that conventional machine learning methods may fail when applied to such data. The Prediction Analysis for Microarrays (PAM), has recently been reported as a potential powerful tool for microarrays analysis, which is based on the technique of nearest shrunken centroid [29, 30]. Nearest shrunken centroid is the “de-noised” version of simple nearest centroid classification. The main idea of nearest centroid classification is to compare test sample to each class centroid. The class whose centroid the test sample is closest to is the predicted class for test sample. The centroid is the average value for each feature (or attribute) in each class, normalized by the pooled within-class standard deviation. The nearest shrinkage centroid classification shrinks each centroid toward the overall centroid for all classes by a certain amount, which is called a “threshold”. After the overall centroid is set to zero, this shrinkage is to move the centroid to zero by the threshold, i.e., if the value of centroid in magnitude is larger than the threshold, it is subtracted by the threshold; if it is less than the threshold, it is set to zero, so the corresponding feature disappears, and it isn’t used for the following classification. By doing so, nearest shrunken centroid can find out the minimal subset of genes which succinctly characterize each class. Shrunken centroid method is traditionally used to deal with microarray data. In this paper, we adopt this method for copy number variation values derived from SNP genotyping arrays. The normalized signal intensity Log R Ratio (LRR) value in each probe loci calculated from existing software, which we will give a detailed description in the next subsection, will be considered as a unique feature.
π k is the prior probability of class k.
As our experimental data sets are both disease-trait samples, it is feasible to test epistasis using PLINK to  detect SNP-SNP interactions, which is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analysis in a computationally efficient manner. All pairwise combinations of input SNPs can be tested using a logistic regression model, which is based on allele dosage for each SNP, A and B, and fits the model in the form of (5)
Y~b0 + b1.A + b2.B + b3.AB + e (5)
The test for interaction is based on the coefficient of b 3, therefore only considers allelic by allelic epistasis. We focus on symmetrical cases in our study, that means only unique pairs are analysed, for example, if SNP1*SNP2 is performed, SNP2*SNP1 will not be calculated again. The χ 2 statistics is applied and the odds ratio for interaction is interpreted in the standard manner: a value of 1.0 indicates no effect.
The core study of the Wellcome Trust Case Control Consortium (WTCCC) comprised an analysis of genetic signals from each of seven common human diseases (type 1 diabetes, type 2 diabetes, coronary artery disease, hypertension, bipolar disorder, rheumatoid arthritis and Croh disease) . Genotyping for type 1 diabetes (T1D) study was conducted by Affymetrix using the (“commercial”) Affymetrix 500K chip in 3504 samples with 2000 cases and 1504 controls.
The quantile normalized signal intensity data, which were generated from the Affymetrix intensity (‘CEL’) files and used as input to the Chiamo genotype calling program, was downloaded from WTCCC webpage and appropriately prepared according to the requirements of input file formats of PennCNV. A biological literature searching indicates that in chromosome 6, there are some well-known T1D related genes. Therefore, we chose chromosome 6 as an example to demonstrate our method. There are 31470 unique SNP loci in this chromosome, 1 markers have complete genotyping failure with confidence threshold of 0.01, 1042 markers do not have at least two types of genotypes, 2460 markers have abnormal patterns, so there are 27967 SNP markers have been analyzed to construct canonical clustering positions.
Distribution of LRR values for selected SNPs of T1D study.
Characteristics of SNP networks in different thresholds of T1D study.
NO. of networks
NO. of SNPs
NO. of SNP pairs
1(3), 2(4), 3(9), 4(20)
1(2), 2(3), 3(8), 4(21)
1(2), 2(2), 3(2), 4(2), 5(5)
1(1), 2(1), 3(1), 4(1), 5(4)
1(2), 2(2), 3(5)
1(1), 2(1), 3(4)
Significance of selected SNP networks in different thresholds of T1D study.
NO. of networks
1.2672 × 10–6
1.2765 × 10–7
3.1578 × 10–8
2.0741 × 10–8
5.6191 × 10–8
2.9939 × 10–8
3.5062 × 10–9
6.5545 × 10–9
3.7898 × 10–9
2.0716 × 10–8
1.9979 × 10–8
1.0436 × 10–7
5.8348 × 10–8
5.1991 × 10–9
2.1721 × 10–9
Our method can select 63 SNPs, 25 of them are located in gene coding areas, and these 25 SNPs belong to 24 unique genes, which are all shown in Table 4. After checking with NCBI, we found that some of the genes that our selected SNPs located in are directly or indirectly related to diabetes. For example, GMDS refers to GDP-mannose 4,6-dehydratase, which catalyzes the conversion of GDP-mannose to GDP-4-keto-6-deoxymannose, and it has been verified that mannose-binding lectin is a predictor of Type 1 Diabetes . MTHFD1L is an enzyme involved in THF synthesis in mitochondria, and mutations of mitochondria strongly associate with diabetes . And also for LAMA2, whose full name is laminin, alpha 2, is a major component of the basement membrane, the abnormal level of it has been reported to have significant relationship to the presence of diabetes .
Detailed gene descriptions of SNPs selected in chromosome 6 of T1D study.
runt-related transcription factor 2
nudix (nucleoside diphosphate linked moiety X)-type motif 3
G protein-coupled receptor 116
leucine rich repeat containing 16
butyrophilin, subfamily 3, member A1
chromosome 6 open reading frame 209
similar to PPP1R14B protein
pleckstrin homology domain containing, family G (with RhoGef domain) member 1
chromosome 6 open reading frame 210
hypothetical protein FLJ34503
dynein, axonemal, heavy polypeptide 8
sorting nexin 9
leucine rich repeat and fibronectin type III domain containing 2
erythrocyte membrane protein band 4.1-like 2
FYN oncogene related to SRC, FGR, YES
laminin, alpha 2 (merosin, congenital muscular dystrophy)
NADPH oxidase 3
laminin, alpha 2 (merosin, congenital muscular dystrophy)
unc-5 homolog C (C. elegans)-like
methylenetetrahydrofolate dehydrogenase (NADP+ dependent) 1-like an enzyme involved in THF synthesis in mitochondria
ribosomal protein S6 kinase, 90kDa, polypeptide 2
Gene networks formed for different threshold values of T1D study.
Number of gene pairs
Number of gene networks
We found some interesting relationships from SNP and gene networks. For example, for SNPs rs9321142 and rs9371491, which are interacted in the same SNP network under P-value of 0.1 in Figure 2, their corresponding genes are LAMA2 and MTHFD1L, but located in different gene networks in Figure 4, which means that maybe we can merge these two gene networks together, (shown with red dash line in Figure 4), and furthermore, can be consistent with existing biological functions of these two genes to diabetes. Another example, rs6912853 is interacted with rs2734990 under a very significant P-value of 9.43 x 10 – 5, but rs2734990 is located in intergenic area and do not have a record in Gene Ontology until now, maybe we can make use of rs6912853’s gene information, BTN3A1, to further analyze the inner functions of rs2734990 and extend GO afterward.
Gene Ontology terms significant associated with selected genes (P < 0.001) of T1D study.
superoxide-generating NADPH oxidase activity
response to gravity
folic acid and derivative biosynthetic process
tetrahydrofolate metabolic process
NADPH oxidase complex
oxidoreductase activity, acting on NADH or NADPH, with oxygen as acceptor
Hirschsprung (HSCR, MIM 142623), also known as aganglionic megacolon, is a congenital intestinal disease. Patients suffer from different extent of aganglionosis due to the absence of ganglion cells in the gastrointestinal tract. Trisomy 21 (Down syndrome) is the most recurrent structural variations found in HSCR, contributing to 2-10% of the cases.
This data set consists of 330 controls and 121 cases of Chinese ethnicity. In this analysis, we excluded the centromeric and telomeric regions using 500kb threshold because these regions tend to have bias in CNV calling. We transformed the probe intensities into LRR values from this SNP data set (n=3369) using PennCNV with the same parameter setting and procedure as mentioned above.
Distribution of LRR values for selected SNPs of Hirschsprung study.
As there are only 6 SNPs selected, the number is too small to construct any networks. So for this dataset, we did not do any analysis in SNP level.
Detailed descriptions of SNPs selected in chromosome 21 of Hirschsprung study.
Down syndrome cell adhesion molecule
Mutations in this gene cause enterokinase deficiency, a malabsorption disorder characterized by diarrhea and failure to thrive
T-cell lymphoma invasion and metastasis 1
Gene Ontology terms significant associated with selected genes (P < 0.0001) of Hirschsprung study.
positive regulation of axonogenesis
positive regulation of cell projection organization
ephrin receptor signaling pathway
regulation of cell projection organization
regulation of axon extension involved in axon guidance
ephrin receptor binding
In this paper, we use the method of nearest shrunken centroid for gene expression data, and apply it to tackle copy number variation data determined from genome-wide SNP arrays. The method can be implemented on a personal computer very efficiently. The relevant SNPs are selected for disease data. Experimental results are reported to show the effectiveness of our method. In particular, we find some SNPs that contain in some genes which are relevant to a particular disease. Based on the SNP and gene networks, we can find out some unknown relationships between their corresponding genes, which can be considered as an extension of existing GO knowledge. The existing Gene Ontology enrichment analysis tool also suggests that our selected genes are associated to some molecular function and biological process. In the future, we will study the following problems. Detailed biological analysis of CNVs determined from other genome-wide SNP data sets will be studied. Statistical and association study of selected SNPs can be carried out.
MN designed this study and developed the new algorithm. YL designed this study, coded the program, ran the experiments and wrote the manuscript. YF designed this study.
This work was financially supported by Research Grant Council  and Hong Kong Baptist University FRGs.
The authors would like to extend gratitude to the support by research grants, including HKU 765407M and HKU 775907M from the Hong Kong Research Grants Council, AOSPINE (AOSBRC-07-02) and AoE/M-04/04 from the University Grants Committee of Hong Kong. We are also grateful to Prof. Kathryn S.E. Cheah, Prof. Pak C. Sham, Prof. Paul K.H. Tam, Dr Maria-Merc, Garcia-Barcel, Dr YQ Song, Dr Danny Chan, Dr Kenneth Cheung and Dr Clara S. Tang for their assistance and kindness in data sharing.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 5, 2011: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2010. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S5.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.