- Methodology article
- Open Access
A mixture model with a reference-based automatic selection of components for disease classification from protein and/or gene expression levels
- Ivica Kopriva^{1}Email author and
- Marko Filipović^{1}
https://doi.org/10.1186/1471-2105-12-496
© Kopriva and Filipović; licensee BioMed Central Ltd. 2011
- Received: 29 June 2011
- Accepted: 30 December 2011
- Published: 30 December 2011
Abstract
Background
Bioinformatics data analysis often relies on a linear mixture model that represents each sample as an additive mixture of components. Properly constrained blind matrix factorization methods can extract those components from the mixture samples alone. However, automatic selection of the extracted components to be retained for classification analysis remains an open issue.
Results
The method proposed here is applied to well-studied protein and genomic datasets of ovarian, prostate and colon cancers to extract components for disease prediction. It achieves average sensitivities of 96.2% (sd = 2.7%), 97.6% (sd = 2.8%) and 90.8% (sd = 5.5%) and average specificities of 93.6% (sd = 4.1%), 99% (sd = 2.2%) and 79.4% (sd = 9.8%) in 100 independent two-fold cross-validations.
Conclusions
We propose an additive mixture model of a sample for feature extraction using, in principle, sparseness constrained factorization on a sample-by-sample basis. In contrast, existing methods factorize the complete dataset simultaneously. The sample model is composed of a reference sample, representing the control and/or case (disease) group, and a test sample. Each sample is decomposed into two or more components that are selected automatically (without using label information) as control specific, case specific and not differentially expressed (neutral). The number of components is determined by cross-validation. Automatic assignment of features (m/z ratios or genes) to a particular component is based on thresholds estimated from each sample directly. Due to the locality of the decomposition, the strength of expression of each feature can vary across samples, yet the feature will still be allocated to the related disease and/or control specific component. Since label information is not used in the selection process, the case and control specific components can be used for classification; that is not the case with standard factorization methods. Moreover, the component selected by the proposed method as disease specific can be interpreted as a sub-mode and retained for further analysis to identify potential biomarkers. Unlike standard matrix factorization methods, this can be achieved on a sample (experiment)-by-sample basis. Postulating one or more components with indifferent features enables their removal from the disease and control specific components on a sample-by-sample basis. This yields selected components with reduced complexity and, generally, increased prediction accuracy.
Keywords
- Independent Component Analysis
- Support Vector Machine Classifier
- Nonnegative Matrix Factorization
- Sparseness Constraint
- Independent Component Analysis Algorithm
Background
Methods
This section derives a sparse component analysis (SCA) approach to unsupervised decomposition of protein (mass spectra) and gene expression profiles using a novel mixture model of a sample. The model enables automatic selection of two of the extracted components as case and control specific; these are retained for classification. In what follows, the problem motivation and definition are presented first. Then, the LMM of a sample is introduced and its interpretation described. Afterwards, a two-stage implementation of the SCA algorithm is described and discussed in detail.
1.1 Problem formulation
As mentioned previously, bioinformatics problems often deal with data containing components that are imprinted in a sample by several interfering sources. As an example, a brief description of the endocrine signalling system, which secretes hormones into the blood stream, is given in [1]. Likewise, reference [21] describes how different organs imprint their substances (metabolites) into a urine sample. As pointed out in [1] and [16], disease samples are combinations of several co-regulated components (signals) originating from different sources (organs), and the disease specific component is actually "buried" within a sample. Hence, we are dealing with two problems simultaneously: a sample decomposition (component inference) problem and a classification (disease prediction) problem that builds on the decomposition. Thus, automatic selection of one or more extracted components is of practical importance. It is also important that component selection is done without the use of label information, in which case the selected components can be used for classification.
Given a set of mixture samples ${\left\{{\mathrm{x}}_{n}\in {\mathbb{R}}^{K}\right\}}_{n=1}^{N}$ arranged as row vectors of X ∈ ℝ^{N × K}, the linear mixture model (LMM) reads

X = AS,     (1)

where A ∈ ℝ^{N × M} and S ∈ ℝ^{M × K}, and M represents an unknown number of components present in a sample. Each component ${\left\{{\mathrm{s}}_{m}\in {\mathbb{R}}^{K}\right\}}_{m=1}^{M}$ is represented by a row vector of matrix S. Nonnegative relative concentration profiles ${\left\{{\mathrm{a}}_{m}\in {\mathbb{R}}_{+}^{N}\right\}}_{m=1}^{M}$ are represented by column vectors of matrix A and are associated with the particular components. Here, it will be shown how an innovative version of the LMM (1) of a sample enables automatic selection of the case (disease) and control specific components out of the ${\left\{{\mathrm{s}}_{m}\right\}}_{m=1}^{M}$ components extracted by an unsupervised factorization method: a two-stage SCA. The method will then be demonstrated on a computational model as well as on a cancer prediction problem using well-known proteomic and genomic datasets.
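As an illustration of the constrained factorization implied by (1) (not the paper's reference-based method, which is introduced below), a minimal multiplicative-update NMF in the style of Lee and Seung recovers nonnegative A and S from the mixture matrix X alone; all sizes and data here are synthetic:

```python
# Sketch: factorize X (N samples x K features) into A (N x M) and
# S (M x K) under non-negativity constraints, one of the constrained
# blind factorizations surveyed in the text. Synthetic data only.
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 4, 200, 3                       # mixtures, features, components
A_true = rng.random((N, M))               # nonnegative concentration profiles
S_true = np.abs(rng.standard_normal((M, K)))
X = A_true @ S_true                       # LMM of Eq. (1): X = A S

# multiplicative updates (Lee & Seung); small constant avoids division by 0
A = rng.random((N, M)); S = rng.random((M, K))
for _ in range(500):
    S *= (A.T @ X) / (A.T @ A @ S + 1e-12)
    A *= (X @ S.T) / (A @ S @ S.T + 1e-12)

rel_err = np.linalg.norm(X - A @ S) / np.linalg.norm(X)
print(rel_err)                            # small reconstruction residual
```

Note that without further constraints the factorization is unique only up to permutation and scaling of the components, which is why the text constrains mixing-matrix columns to unit norm.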
1.2 Novel additive linear mixture model of a sample
The LMM (1) is widely used in various bioinformatics problems [1–15]. Unless constraints are imposed on A and/or S, the matrix factorization implied by (1) is not unique. Typical constraints are non-Gaussianity and statistical independence between components, imposed by ICA algorithms [6, 18], and non-negativity and sparseness constraints, imposed by NMF algorithms [7, 11, 12, 19, 22, 23]. In addition, many ICA and NMF algorithms require the unknown number of components M to be less than or equal to the number of mixture samples N.
The reference-based LMMs pair each test sample with a reference sample:

$\left[\begin{array}{c}{\mathrm{x}}_{\mathsf{\text{control}}}\\ \mathrm{x}\end{array}\right]={\mathrm{A}}_{\mathsf{\text{control}}}{\mathrm{S}}_{\mathsf{\text{control}}}$,     (2a)

$\left[\begin{array}{c}{\mathrm{x}}_{\mathsf{\text{disease}}}\\ \mathrm{x}\end{array}\right]={\mathrm{A}}_{\mathsf{\text{disease}}}{\mathrm{S}}_{\mathsf{\text{disease}}}$.     (2b)

The first sample is a reference sample representing the control group, x_{control} ∈ ℝ^{K}, in (2a), or the case (disease) group, x_{disease} ∈ ℝ^{K}, in (2b). The second sample is the actual test sample: $\mathrm{x}\in {\left\{{\mathrm{x}}_{n}\in {\mathbb{R}}^{K}\right\}}_{n=1}^{N}$. Coefficients of the matrices ${\mathrm{A}}_{\mathsf{\text{control}}}\in {\mathbb{R}}_{+}^{2\times M}$ and ${\mathrm{A}}_{\mathsf{\text{disease}}}\in {\mathbb{R}}_{+}^{2\times M}$ in (2a) and (2b) refer to the relative concentrations at which the related components are present in the mixture samples x and x_{control} in (2a), or x and x_{disease} in (2b). The source matrices S_{control} ∈ ℝ^{M × K} and S_{disease} ∈ ℝ^{M × K} contain, as row vectors, disease and control specific components and, possibly, differentially not expressed components. The number of components M is assumed to be greater than or equal to 2. Evidently, for M = 2 the existence of differentially not expressed components is not postulated. The importance of postulating components with indifferent features is that the disease and control specific components used for classification become less complex (see also the discussion in section 1.7). These components absorb features that do not vary substantially across the sample population; such features are removed automatically from each sample. The concentration is relative because BSS methods can estimate the mixing and source matrices only up to a scaling constant. Therefore, it is customary to constrain the column vectors of the mixing matrix to unit ℓ_{2} or ℓ_{1} norm. The LMM proposed here is built upon an implicit assumption that disease specific features (m/z ratios or genes) are present in prevailing concentration in disease specific samples and in minor concentration in control specific samples. Conversely, control specific features are present in prevailing concentration in control specific samples and in minor concentration in disease specific samples.
Features that are not differentially expressed are present in similar concentrations in both control and disease specific samples. These groups of features constitute components, and the similarity of their concentration profiles enables automatic selection of the components extracted by unsupervised factorization. The assumption on the prevailing concentrations of up- and down-regulated features is to be understood in a relative sense. It is justified by the locality of the proposed method, since the components are extracted on a sample-by-sample basis. Thus, to be allocated to the same component (a case or a control specific one), a feature does not need to be expressed equally strongly in each sample. Since the LMMs (2a)/(2b) considered here comprise only two samples, the non-negative mixing vectors are confined to the first quadrant of the plane spanned by the control reference sample and the test sample (Figure 1a), or by the disease reference sample and the test sample (Figure 1b). Thus, upon decomposition of the LMM (2a) into M components, the component whose mixing vector subtends the maximal angle with respect to the axis defined by the control reference sample is selected as the disease specific component (Figure 1a), while the component whose mixing vector subtends the minimal angle with respect to that axis is selected as the control specific component. When decomposition is performed with respect to a disease reference sample, LMM (2b), the logic of the angle-based automatic selection of disease and control specific components is reversed (Figure 1b). The components not selected as disease or control specific are considered neutral, i.e. not differentially expressed. Thus, LMMs (2a)/(2b) enable automatic selection of the components extracted by unsupervised factorization of mixture samples.
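The angle-based selection rule described above can be sketched as follows. The helper below is hypothetical, and the convention that row 0 of the 2 × M mixing matrix corresponds to the reference sample (row 1 to the test sample) is an assumption of this sketch:

```python
# Sketch of angle-based automatic component selection. With a control
# reference, the mixing column with the largest angle from the reference
# axis is taken as disease specific and the smallest as control specific;
# with a disease reference the logic is reversed.
import numpy as np

def select_components(A_mix, reference="control"):
    # angle of each nonnegative mixing column against the reference axis
    angles = np.arctan2(A_mix[1], A_mix[0])          # values in [0, pi/2]
    if reference == "control":
        disease, control = int(np.argmax(angles)), int(np.argmin(angles))
    else:  # disease reference: selection logic is reversed
        disease, control = int(np.argmin(angles)), int(np.argmax(angles))
    return disease, control                          # column indices into S

A_mix = np.array([[0.9, 0.5, 0.2],    # row 0: reference-sample coefficients
                  [0.1, 0.5, 0.9]])   # row 1: test-sample coefficients
print(select_components(A_mix, "control"))   # -> (2, 0)
```

Components not returned by the rule (index 1 above) are treated as neutral, i.e. not differentially expressed.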
Unlike the selection method presented in [2], which is based on fixed thresholds that need to be determined by cross-validation, the thresholds (mixing angles) used in the method presented here are sample adaptive. The assumption that each feature is contained in the disease specific component and one of the neutral components, or in the control specific component and one of the neutral components, represents a sparseness constraint. It enables the solution of the related BSS problems through, in principle, the two-stage SCA method described in section 1.3. The sparseness constraint, however, is justified not only by mathematical reasons but also, as emphasized in [3, 6, 11, 12], by biological ones. As noted in [6], this is necessary if the underlying component (source signal) is to be indicative of ongoing biological processes in a sample (cell, tissue, serum, etc.). The same conjecture has also been used in the three-component gene discovery method in [2]. In this respect, the sparseness constrained NMF methods for microarray data analysis proposed in [7, 11, 12] assume the same working hypothesis. As discussed in [11, 12], it is the sparseness constraint that gave the obtained results their biological relevance. In microarray data analysis, enforcing a sparseness constraint is biologically justified because a sparser S gives rise to metagenes (if factorization is performed by NMF), or to expression modes (if factorization is performed by ICA), that comprise a few dominantly co-expressed genes, which may indicate good local features for a specific disease [11]. A closer interpretation of the reference-based mixture model (2a)/(2b) reveals several of its profound characteristics. Since the placement of features into each of the two or more postulated components is based on sample adaptive thresholds (the decomposition is localized), one gene (or m/z ratio) may be highly up-regulated in one sample and significantly less expressed in another.
Yet, if it is contained in prevailing concentration in both samples, it will in both cases be allocated to the component automatically selected as disease or control specific. Moreover, sample adaptive component (feature) selection allows features selected as up- (or down-) regulated in one sample to be less (or more) expressed than differentially not expressed features in another sample. Thus, the extracted components selected as disease or control specific are composed of multiple features with different expression levels and joint discriminative power, rather than of several (or even single) features only.
For disease prediction, the disease and control specific components can be used to train a classifier. The reason is that in the two LMMs (2a)/(2b) they are extracted with respect to different reference samples and thus carry different but specific information. Hence, the proposed method yields four components to be retained for classifier training. In accordance with Figure 1 they are denoted as ${\mathrm{s}}_{\mathsf{\text{controlref}}\mathsf{\text{.;}}n}^{\mathsf{\text{disease}}}$, ${\mathrm{s}}_{\mathsf{\text{controlref}}\mathsf{\text{.;}}n}^{\mathsf{\text{control}}}$, ${\mathrm{s}}_{\mathsf{\text{diseaseref}}\mathsf{\text{.;}}n}^{\mathsf{\text{control}}}$, and ${\mathrm{s}}_{\mathsf{\text{diseaseref}}\mathsf{\text{.;}}n}^{\mathsf{\text{disease}}}$, where n denotes the index of the test sample x_{n} used in the current decomposition. The components extracted from the N mixture samples form four sets of labelled feature vectors: ${\left\{{\mathrm{s}}_{\mathsf{\text{controlref}}\mathsf{\text{.;}}n}^{\mathsf{\text{disease}}},{y}_{n}\right\}}_{n=1}^{N}$, ${\left\{{\mathrm{s}}_{\mathsf{\text{controlref}}\mathsf{\text{.;}}n}^{\mathsf{\text{control}}},{y}_{n}\right\}}_{n=1}^{N}$, ${\left\{{\mathrm{s}}_{\mathsf{\text{diseaseref}}\mathsf{\text{.;}}n}^{\mathsf{\text{control}}},{y}_{n}\right\}}_{n=1}^{N}$ and ${\left\{{\mathrm{s}}_{\mathsf{\text{diseaseref}}\mathsf{\text{.;}}n}^{\mathsf{\text{disease}}},{y}_{n}\right\}}_{n=1}^{N}$. One or more classifiers can be trained on them, and the one with the highest accuracy achieved through cross-validation is selected for disease diagnosis.
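The train/validate loop over a labelled component set can be sketched as below. The paper's results use SVM classifiers; for a self-contained illustration a simple nearest-centroid rule is substituted, and the data (including the class-dependent mean shift) are synthetic assumptions:

```python
# Hedged sketch: train a classifier on one set of labelled components
# {(s_n, y_n)} and estimate accuracy on a held-out half (two-fold split,
# echoing the paper's 100 independent two-fold cross-validations).
import numpy as np

def nearest_centroid_fit(S, y):
    # one centroid per class label
    return {c: S[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(model, S):
    classes = list(model)
    d = np.stack([np.linalg.norm(S - model[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

rng = np.random.default_rng(1)
N, K = 60, 100
y = np.repeat([1, -1], N // 2)
# synthetic stand-in for a labelled component set: class-dependent shift
S_comp = rng.standard_normal((N, K)) + 0.5 * y[:, None]

idx = rng.permutation(N)
tr, te = idx[:N // 2], idx[N // 2:]
model = nearest_centroid_fit(S_comp[tr], y[tr])
acc = (nearest_centroid_predict(model, S_comp[te]) == y[te]).mean()
print(acc)
```

In the full method this loop would be run for each of the four component sets, keeping the classifier/component-set pair with the highest cross-validated accuracy.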
Selection of the unknown number of components M is a generally non-trivial problem in matrix factorization and is part of the model validation procedure. Here, M is selected through cross-validation and postulated to be 2, 3, 4 or 5, because it directly determines the number of features used for classification. This follows from the previously described interpretation of the LMMs (2a) and (2b). Since disease prediction is based on the four components selected as disease and control specific, it is important that they are composed of features with high discriminative power, i.e. features that are truly disease or control specific. A component considered here as disease or control specific (as well as neutral) can actually be composed of features (m/z ratios or genes) belonging to multiple substances (metabolites, analytes) that share similar relative concentrations. This is practically important, since it makes the decomposition much less sensitive to an underestimation of the true total number of substances present in a sample. By setting the number of substances to a predefined value M, the proposed method forces substances with similar concentrations to be linearly combined into one more complex component composed of disease, neutral or control specific features. Provided that the concentration variability of these features across the samples is small, it would suffice to select the overall number of components as M = 3 or even M = 2 (in the latter case, the existence of differentially not expressed features is not postulated at all). However, since we are dealing with biological samples, it is more realistic to expect that relative concentrations vary across the sample population. This is illustrated in Figures 1a and 1b by ellipsoids around the vectors that represent the average concentration profiles of each group of features (components).
As seen from Figure 1, some features considered neutral can, in a certain number of samples, be present in higher concentration than the features considered disease (or control) specific in the majority of samples. To partially remove such features from the disease and/or control specific components, the unknown number of components M should be increased to M = 4 or perhaps even M = 5; that is, the existence of two or three neutral components should be postulated. This is expected to yield less complex disease and control specific components, in agreement with the principle of parsimony (see also the discussion in section 1.7). The model validation presented in section 1.4 suggests that this indeed is the case when concentration variability across the samples is significant. For real-world datasets, the number of components is not known in advance. The strategy for dealing with this uncertainty is to use cross-validation and to verify whether an increased number of components M indeed contributes to increased accuracy in disease prediction.
1.3 Sparse component analysis algorithm
The proposed feature extraction/component selection method is based on a decomposition of the LMMs (2a)/(2b), comprised of two samples (a reference sample and a test sample), into M ≥ 2 components. From the BSS point of view this yields a determined BSS problem when M = 2 and an underdetermined BSS problem when M ≥ 3 [26, 27, Chapter 10 in 17]. The enabling constraint for solving underdetermined BSS problems is sparseness of the components, and the methods are known under the common name of sparse component analysis (SCA) [26–29, Chapter 10 in 17]. As commented at the beginning of section 1.2, overcomplete ICA [Chapter 16 in 18, 24, 25] basically reduces to SCA and also demands sparse sources. SCA has already been applied to microarray data analysis in [3, 6, 7, 11, 12]. It has also been used in [22, 23] to extract more than two components from two mixture samples of nuclear magnetic resonance and mass spectra. A sparseness constraint implies that each particular feature point k = 1, ..., K (m/z ratio or gene) belongs to only a few components. For the two-sample LMMs (2a)/(2b) used here, it is assumed that each feature point belongs to at most two components: either disease specific and neutral, or control specific and neutral. The biological plausibility of this assumption has been elaborated above.
Algorithmic approaches used to solve the underdetermined BSS problem associated with (2a)/(2b) belong to two main categories: (i) estimating the concentration/mixing matrix and component matrix simultaneously by minimizing the data fidelity terms ${\left\Vert \mathrm{X}-{\mathrm{A}}_{\mathsf{\text{control}}}{\mathrm{S}}_{\mathsf{\text{control}}}\right\Vert}_{F}^{2}$ or ${\left\Vert \mathrm{X}-{\mathrm{A}}_{\mathsf{\text{disease}}}{\mathrm{S}}_{\mathsf{\text{disease}}}\right\Vert}_{F}^{2}$, where X follows from the left side of (2a) or (2b); the minimization is usually done through the alternating least squares (ALS) methodology with a sparseness constraint imposed on the source matrices S_{control} and S_{disease} [19, 22, 23, 30–32]; (ii) estimating the concentration/mixing matrices first by clustering, and the source/component matrices afterwards by solving an underdetermined system of linear equations through minimization of the ℓ_{p} norm, 0 < p ≤ 1, of the column vectors s_{k} ∈ ℝ^{M} of S_{control} and S_{disease} [25–29, 33–35]. As discussed in [6], sparseness constrained minimization of the data fidelity term is sensitive to the choice of sparseness constraint. On the other side, it has been recognized in [33–35] that accurate estimation of the concentration matrix enables an accurate solution of even determined BSS problems. To this end, the selection of feature points where only a single component is present is of special importance. At these points, the feature vector and the appropriate mixing vector are collinear; for example, if feature k belongs to component m, then x_{k} ≈ a_{m}s_{mk}. Thus, clustering a set of single component points (SCPs) ought to yield an accurate estimate of the mixing matrix, whose columns are represented by the cluster centroids. It has been demonstrated in [33] that such estimation of the mixing matrix, where hierarchical clustering was used, yields a more accurate solution of the determined BSS problem, S = pinv(A)X, than the one obtained by ICA algorithms.
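The clustering stage of approach (ii) can be sketched as follows. The paper uses hierarchical clustering of SCPs; this minimal version, under stated assumptions, clusters the mixing angles of already-selected single component points with a simple 1-D k-means and returns unit ℓ2-norm cluster centroids as the columns of the estimated 2 × M mixing matrix:

```python
# Sketch: estimate the 2 x M mixing matrix from single component points.
# Each SCP lies (up to noise) along one mixing column, so clustering the
# SCP angles and taking cluster centroids recovers the columns.
import numpy as np

def estimate_mixing_matrix(X_scp, M, iters=50):
    # X_scp: 2 x P matrix of single component feature points
    ang = np.arctan2(np.abs(X_scp[1]), np.abs(X_scp[0]))  # angles in [0, pi/2]
    centers = np.linspace(ang.min(), ang.max(), M)        # initial centroids
    for _ in range(iters):                                # 1-D k-means
        lab = np.abs(ang[:, None] - centers[None, :]).argmin(axis=1)
        centers = np.array([ang[lab == m].mean() if np.any(lab == m)
                            else centers[m] for m in range(M)])
    return np.vstack([np.cos(centers), np.sin(centers)])  # unit l2 columns

# synthetic SCPs along three directions at roughly 10, 45 and 80 degrees
rng = np.random.default_rng(2)
true = np.deg2rad([10.0, 45.0, 80.0])
ang = np.repeat(true, 30) + rng.normal(0.0, 0.02, 90)
X_scp = np.vstack([np.cos(ang), np.sin(ang)]) * rng.uniform(0.5, 2.0, 90)
A_est = estimate_mixing_matrix(X_scp, M=3)
print(np.rad2deg(np.arctan2(A_est[1], A_est[0])))  # approx. 10, 45, 80
```

Scaling each SCP by an arbitrary positive amplitude (last line of the data construction) leaves its angle, and hence the estimate, unchanged, which mirrors the scale indeterminacy of BSS.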
Thus, selection of SCPs is of essential importance for accurate estimation of the mixing matrix. Such feature points are identified among the overall number of K feature points using a geometric criterion based on the observation that, at SCPs, the real and imaginary parts of the mixture samples point either in the same or in the opposite direction [33, 34]. Since protein (mass spectra) and gene expression levels are real sequences, an analytic continuation [22] of the mixture samples is constructed first, $\tilde{\mathrm{x}} = \mathrm{x} + j\mathcal{H}\left(\mathrm{x}\right)$, where $\mathcal{H}(\cdot)$ denotes the Hilbert transform. A feature point k is then selected as an SCP if

$\frac{\left|R{\left({\tilde{\mathrm{x}}}_{k}\right)}^{T}I\left({\tilde{\mathrm{x}}}_{k}\right)\right|}{\left\Vert R\left({\tilde{\mathrm{x}}}_{k}\right)\right\Vert \left\Vert I\left({\tilde{\mathrm{x}}}_{k}\right)\right\Vert }\ge \mathrm{cos}\left(\Delta \theta \right)$,

where $R\left({\tilde{\mathrm{x}}}_{k}\right)$ and $I\left({\tilde{\mathrm{x}}}_{k}\right)$ denote the real and imaginary part of ${\tilde{\mathrm{x}}}_{k}$ respectively, 'T' denotes the transpose operation, $\left\Vert R\left({\tilde{\mathrm{x}}}_{k}\right)\right\Vert$ and $\left\Vert I\left({\tilde{\mathrm{x}}}_{k}\right)\right\Vert$ denote their ℓ_{2}-norms, and Δθ stands for the angular displacement from the direction of either 0 or π radians. Evidently, Δθ determines the quality of the selected SCPs and, thus, the accuracy of the estimation of the mixing matrices A_{control} and A_{disease}. Setting Δθ to a small value (e.g., an equivalent of 1°) enforces, with overwhelming probability, the selection of feature points that contain one component only. If, however, some component is not present alone at at least one feature point, the corresponding columns of the mixing matrices may be estimated inaccurately. This problem can be alleviated by increasing the value of Δθ, in which case the selected feature points may not contain one component only, but rather one dominant component and one or more components present in small amounts.
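The SCP criterion, keeping feature points k where |R(x̃_k)^T I(x̃_k)| / (‖R(x̃_k)‖‖I(x̃_k)‖) ≥ cos Δθ, can be sketched with a numpy-only Hilbert transform (function names are illustrative):

```python
# Sketch: analytic continuation of real mixture rows via the Hilbert
# transform, then selection of feature points whose real and imaginary
# 2-vectors are collinear to within the angular tolerance dtheta.
import numpy as np

def analytic(x):
    # FFT-based analytic signal: x + j*Hilbert(x)
    K = len(x); Xf = np.fft.fft(x)
    h = np.zeros(K); h[0] = 1.0; h[1:(K + 1) // 2] = 2.0
    if K % 2 == 0:
        h[K // 2] = 1.0
    return np.fft.ifft(Xf * h)

def select_scps(X, dtheta_deg=1.0):
    Xa = np.vstack([analytic(row) for row in X])   # N x K complex
    R, I = Xa.real, Xa.imag
    num = np.abs(np.sum(R * I, axis=0))            # |R(x_k)^T I(x_k)| per k
    den = np.linalg.norm(R, axis=0) * np.linalg.norm(I, axis=0) + 1e-12
    return np.where(num / den >= np.cos(np.deg2rad(dtheta_deg)))[0]

# sanity check: a rank-1 (single-component) mixture passes at every point
rng = np.random.default_rng(3)
X1 = np.outer(np.array([0.3, 0.9]), rng.standard_normal(128))
print(len(select_scps(X1, dtheta_deg=1.0)))   # 128: every point is an SCP
```

Widening Δθ (e.g., 3° or 5°, as in the parameter grid of the algorithm below) admits points with one dominant component plus small contributions from others, trading SCP purity for coverage of all columns.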
A mixture model with a reference-based algorithm for feature extraction/component selection

Inputs. ${\left\{{\mathrm{x}}_{n}\in {\mathbb{R}}^{K},{y}_{n}\in \left\{1,-1\right\}\right\}}_{n=1}^{N}$ samples and sample labels, where K represents the number of feature points (m/z ratios or genes); x_{control} ∈ ℝ^{K} and x_{disease} ∈ ℝ^{K} representing control and disease (case) groups of samples.

Nested two-fold cross-validation. Parameters: single component points (SCPs) selection threshold in radian equivalents of Δθ ∈ {1°, 3°, 5°}; regularization constant λ ∈ {10^{-2}λ_max, 10^{-4}λ_max, 10^{-6}λ_max}; number of components M ∈ {2, 3, 4, 5}; parameters of the selected classifier.

Component selection from mixture samples.

1. $\forall \mathrm{x}\in {\left\{{\mathrm{x}}_{n}\in {\mathbb{R}}^{K}\right\}}_{n=1}^{N}$ form the linear mixture models (LMMs) (2a) and (2b).
2. For LMMs (2a)/(2b), select a set of single component points for a given Δθ.
3. On the sets of SCPs, use hierarchical clustering (other clustering methods can be used as well) to estimate the mixing matrices A_{control} and A_{disease} for a given M.
4. Estimate the source matrices S_{control} and S_{disease} by solving (3a) and (3b), respectively, for a given regularization parameter λ.
5. Use the minimal and maximal mixing angles estimated from the mixing matrices A_{control} and A_{disease} to select, following the logic illustrated in Figure 2a and Figure 2b, the disease and control specific components: ${\mathrm{s}}_{\mathsf{\text{controlref}}\mathsf{\text{.;}}n}^{\mathsf{\text{disease}}}$, ${\mathrm{s}}_{\mathsf{\text{controlref}}\mathsf{\text{.;}}n}^{\mathsf{\text{control}}}$, ${\mathrm{s}}_{\mathsf{\text{diseaseref}}\mathsf{\text{.;}}n}^{\mathsf{\text{control}}}$ and ${\mathrm{s}}_{\mathsf{\text{diseaseref}}\mathsf{\text{.;}}n}^{\mathsf{\text{disease}}}$.

End of component selection.

End of nested two-fold cross-validation.
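Step 4 above solves, per feature point, a sparseness constrained inverse problem with regularization constant λ taken from the grid above. The exact forms (3a)/(3b) are not reproduced here; assuming the standard ℓ1-regularized least-squares (lasso) form min_s ½‖x_k − As‖² + λ‖s‖₁, a minimal iterative soft-thresholding (ISTA) sketch is:

```python
# Sketch (assumed lasso form of step 4): recover a sparse column s_k of
# the source matrix from an estimated 2 x M mixing matrix A.
import numpy as np

def ista(A, x, lam, iters=5000):
    # iterative soft-thresholding for min_s 0.5||x - A s||^2 + lam||s||_1
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    s = np.zeros(A.shape[1])
    for _ in range(iters):
        g = s - (A.T @ (A @ s - x)) / L    # gradient step on the LS term
        s = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft threshold
    return s

A = np.array([[0.95, 0.60, 0.10],          # 2 x M mixing matrix (M = 3)
              [0.31, 0.80, 0.99]])
s_true = np.array([0.0, 1.3, 0.0])         # sparse: one active component
x = A @ s_true
lam_max = np.max(np.abs(A.T @ x))          # smallest lam giving the zero solution
s_hat = ista(A, x, lam=1e-2 * lam_max)
print(np.round(s_hat, 2))                  # near-sparse: dominant 2nd component
```

The λ_max scaling mirrors the parameter grid {10^{-2}λ_max, 10^{-4}λ_max, 10^{-6}λ_max}: smaller λ tracks the data more closely, larger λ enforces sparser source columns.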
Results and Discussion
Comparative performance results in ovarian cancer prediction. Sensitivities and specificities were estimated by 100 two-fold cross-validations (standard deviations are in brackets).
| Method | Sensitivity/Specificity/Accuracy |
|---|---|
| Proposed method: M = 3, Δθ = 5°, λ = 10^{-4}λ_max, linear SVM | Sensitivity: 96.2 (2.7)%; specificity: 93.6 (4.1)%; accuracy: 94.9%. Control specific component extracted with respect to a cancer reference sample. |
| Proposed method: M = 4, Δθ = 3°, λ = 10^{-6}λ_max, linear SVM | Sensitivity: 95.4 (3)%; specificity: 94 (3.7)%; accuracy: 94.7%. Control specific component extracted with respect to a cancer reference sample. |
| [1] | Sensitivity: 81.4 (7.1)%; specificity: 71.7 (6.6)% |
| [42] | Sensitivity: 100%; specificity: 95% (one partition only: 50/50 training; 66/50 test) |
| [44] | Accuracy averaged over 10 ten-fold partitions: 98–99% (sd: 0.3–0.8) |
| [13] | Sensitivity: 98%; specificity: 95%; two-fold CV with 100 partitions |
| [45] | Average error rate of 4.1% with three-fold CV |
Comparative performance results in prostate cancer prediction. Sensitivities and specificities were estimated by 100 two-fold cross-validations (standard deviations are in brackets).
| Method | Sensitivity/Specificity/Accuracy |
|---|---|
| Proposed method: M = 5, Δθ = 1°, λ = 10^{-4}λ_max, linear SVM | Sensitivity: 97.6 (2.8)%; specificity: 99 (2.2)%; accuracy: 98.3%. Control specific component extracted with respect to a cancer reference sample. |
| Proposed method: M = 4, Δθ = 1°, λ = 10^{-4}λ_max, linear SVM | Sensitivity: 97.7 (2.3)%; specificity: 98 (2.4)%; accuracy: 97.9%. Control specific component extracted with respect to a cancer reference sample. |
| [1] | Sensitivity: 86 (6.6)%; specificity: 67.8 (12.9)%; accuracy: 76.9% |
| [46] | Sensitivity: 94.7%; specificity: 75.9%; accuracy: 85.3%. 253 benign and 69 cancer samples; results obtained on an independent test set of 38 cancer and 228 benign samples. |
| [47] | Sensitivity: 97.1%; specificity: 96.8%; accuracy: 97%. 253 benign and 69 cancer samples. Cross-validation details not reported. |
| [45] | Average error rate of 28.97% on a four-class problem with three-fold cross-validation |
Comparative performance results in colon cancer prediction. Sensitivities and specificities were estimated by 100 two-fold cross-validations (standard deviations are in brackets).
| Method | Sensitivity/Specificity/Accuracy |
|---|---|
| Proposed method: M = 2, Δθ = 1°, RBF SVM (σ² = 1200, C = 1) | Sensitivity: 90.8 (5.5)%; specificity: 79.4 (9.8)%; accuracy: 85.1%. Control specific component extracted with respect to a cancer reference sample. |
| Proposed method: M = 4, Δθ = 5°, λ = 10^{-2}λ_max, RBF SVM (σ² = 1000, C = 1) | Sensitivity: 89.8 (6.2)%; specificity: 78.6 (12.8)%; accuracy: 84.2%. Control specific component extracted with respect to a control reference sample. |
| [1] | Sensitivity: 89.7 (6.4)%; specificity: 84.3 (8.4)%; accuracy: 87%. 100 two-fold cross-validations. |
| [2] | Sensitivity: 92.1 (4.7)%; specificity: 85 (10.1)%; accuracy: 88.55%. 100 two-fold cross-validations; c_u = 2.0. |
| [48] | Sensitivity: 92–95%, calculated from Figure 5. Specificity not reported. |
| [15] | Accuracy: 85%. Cross-validation details not reported. |
| [50] | Accuracy: 82.5%, ten-fold cross-validation (RFE with linear SVM). |
| [51] | Accuracy: 88.84%, two-fold cross-validation (RFE with linear SVM and optimized penalty parameter C). |
1.4 Model validation
1.5 Ovarian cancer prediction from a protein mass spectra
1.6 Prostate cancer prediction from a protein mass spectra
1.7 Colon cancer prediction from gene expression profiles
Conclusions
This work presents a feature extraction/component selection method based on an innovative additive linear mixture model of a sample (protein or gene expression levels represented respectively by mass spectra or microarray data) and sparseness constrained factorization that operates on a sample (experiment)-by-sample basis. This differs from existing methods, which factorize the complete dataset simultaneously. The sample model is comprised of a test sample and a reference sample representing the disease and/or control group. Each sample is decomposed into several components (their number is determined by cross-validation) that are selected automatically, without using label information, as disease specific, control specific and differentially not expressed. The automatic selection is based on mixing angles estimated from each sample directly. Hence, due to the locality of the decomposition, the strength of expression of each feature can vary from sample to sample, yet the feature can still be allocated to the same (disease or control specific) component in different samples. In contrast, feature allocation/selection algorithms that operate on the whole dataset simultaneously try to optimize a single threshold for the whole dataset. The selected components can be used for classification because label information is not used in the selection. Moreover, disease specific component(s) can also be used for further biomarker-related analysis; as opposed to existing matrix factorization methods, such a disease specific component can be obtained from one sample (experiment) only. By postulating one or more components with differentially not expressed features, the method yields less complex disease and control specific components composed of a smaller number of features with higher discriminative power. This has been demonstrated to improve prediction accuracy.
Moreover, decomposing a sample with one or more components with indifferent features performs (indirectly) sample adaptive preprocessing: removal of features that do not vary significantly across the sample population. The proposed feature extraction/component selection method is demonstrated on real-world proteomic datasets used for prediction of ovarian and prostate cancers, as well as on a genomic dataset used for colon cancer prediction. Results obtained by 100 two-fold cross-validations compare favourably against most of the state-of-the-art methods cited in the literature for cancer prediction on the same datasets.
Declarations
Acknowledgements
This work has been supported by Ministry of Science, Education and Sports, Republic of Croatia through Grant 098-0982903-2558. Professor Vojislav Kecman's and Dr. Ivanka Jerić's help in proofreading the manuscript is gratefully acknowledged.
Authors’ Affiliations
References
- Henneges C, Laskov P, Darmawan E, Backhaus J, Kammerer B, Zell A: A factorization method for the classification of infrared spectra. BMC Bioinformatics 2010, 11: 561. doi:10.1186/1471-2105-11-561
- Alfo M, Farcomeni A, Tardella L: A Three Component Latent Class Model for Robust Semiparametric Gene Discovery. Stat Appl in Genet and Mol Biol 2011, 10(1): Article 7.
- Schachtner R, Lutter D, Knollmüller P, Tomé AM, Theis FJ, Schmitz G, Stetter M, Vilda PG, Lang EW: Knowledge-based gene expression classification via matrix factorization. Bioinformatics 2008, 24: 1688–1697. doi:10.1093/bioinformatics/btn245
- Liebermeister W: Linear modes of gene expression determined by independent component analysis. Bioinformatics 2002, 18: 51–60. doi:10.1093/bioinformatics/18.1.51
- Lutter D, Ugocsai P, Grandl M, Orso E, Theis F, Lang EW, Schmitz G: Analyzing M-CSF dependent monocyte/macrophage differentiation: Expression modes and meta-modes derived from an independent component analysis. BMC Bioinformatics 2008, 9: 100. doi:10.1186/1471-2105-9-100
- Stadtlthanner K, Theis FJ, Lang EW, Tomé AM, Puntonet CG, Górriz JM: Hybridizing Sparse Component Analysis with Genetic Algorithms for Microarray Analysis. Neurocomputing 2008, 71: 2356–2376. doi:10.1016/j.neucom.2007.09.017
- Carmona-Saez P, Pascual-Marqui RD, Tirado F, Carazo JM, Pascual-Montano A: Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC Bioinformatics 2006, 7: 78. doi:10.1186/1471-2105-7-78
- Lee SI, Batzoglou S: Application of independent component analysis to microarrays. Genome Biol 2003, 4: R76. doi:10.1186/gb-2003-4-11-r76
- Girolami M, Breitling R: Biologically valid linear factor models of gene expression. Bioinformatics 2004, 20: 3021–3033. doi:10.1093/bioinformatics/bth354
- Brunet JP, Tamayo P, Golub TR, Mesirov JP: Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA 2004, 101: 4164–4169. doi:10.1073/pnas.0308531101
- Gao Y, Church G: Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics 2005, 21: 3970–3975. doi:10.1093/bioinformatics/bti653
- Kim H, Park H: Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 2007, 23: 1495–1502. doi:10.1093/bioinformatics/btm134
- Li L, Umbach DM, Terry P, Taylor JA: Application of the GA/KNN method to SELDI proteomics data. Bioinformatics 2004, 20: 1638–1640. doi:10.1093/bioinformatics/bth098
- Yu JS, Ongarello S, Fiedler R, Chen XW, Toffolo G, Cobelli C, Trajanoski Z: Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data. Bioinformatics 2005, 21: 2200–2209. doi:10.1093/bioinformatics/bti370
- Qiu P, Wang ZJ, Liu KJR: Ensemble dependence model for classification and prediction of cancer and normal gene expression data. Bioinformatics 2005, 21: 3114–3121. doi:10.1093/bioinformatics/bti483
- Mischak H, Coon JJ, Novak J, Weissinger EM, Schanstra J, Dominiczak AF: Capillary electrophoresis-mass spectrometry as powerful tool in biomarker discovery and clinical diagnosis: an update of recent developments. Mass Spectrom Rev 2008, 28: 703–724.
- Comon P, Jutten C: Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press; 2010.
- Hyvärinen A, Karhunen J, Oja E: Independent Component Analysis. Wiley Interscience; 2001.
- Cichocki A, Zdunek R, Phan AH, Amari SI: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Chichester: John Wiley; 2009.
- Hyvärinen A, Oja E: A fast fixed-point algorithm for independent component analysis. Neural Computation 1997, 9: 1483–1492. doi:10.1162/neco.1997.9.7.1483
- Decramer S, Gonzalez de Peredo A, Breuil B, Mischak H, Monsarrat B, Bascands JL, Schanstra JP: Urine in clinical proteomics. Mol Cell Proteomics 2008, 7: 1850–1862. doi:10.1074/mcp.R800001-MCP200
- Kopriva I, Jerić I: Blind separation of analytes in nuclear magnetic resonance spectroscopy and mass spectrometry: sparseness-based robust multicomponent analysis. Analytical Chemistry 2010, 82: 1911–1920. doi:10.1021/ac902640y
- Kopriva I, Jerić I: Multi-component Analysis: Blind Extraction of Pure Components Mass Spectra using Sparse Component Analysis. Journal of Mass Spectrometry 2009, 44: 1378–1388. doi:10.1002/jms.1627
- Hyvärinen A, Cristescu R, Oja E: A fast algorithm for estimating overcomplete ICA bases for image windows. In Proc Int Joint Conf on Neural Networks. Washington DC, USA; 1999: 894–899.
- Lewicki M, Sejnowski TJ: Learning overcomplete representations. Neural Comput 2000, 12: 337–365. doi:10.1162/089976600300015826
- Bofill P, Zibulevsky M: Underdetermined blind source separation using sparse representations. Signal Proc 2001, 81: 2353–2362. doi:10.1016/S0165-1684(01)00120-7
- Georgiev P, Theis F, Cichocki A: Sparse component analysis and blind source separation of underdetermined mixtures. IEEE Trans Neural Net 2005, 16: 992–996. doi:10.1109/TNN.2005.849840
- Li Y, Cichocki A, Amari S: Analysis of Sparse Representation and Blind Source Separation. Neural Comput 2004, 16: 1193–1234. doi:10.1162/089976604773717586
- Li Y, Amari S, Cichocki A, Ho DWC, Xie S: Underdetermined Blind Source Separation Based on Sparse Representation. IEEE Trans Signal Process 2006, 54: 423–437.
- Cichocki A, Zdunek R, Amari SI: Hierarchical ALS Algorithms for Nonnegative Matrix Factorization and 3D Tensor Factorization. LNCS 2007, 4666: 169–176.
- Kopriva I, Cichocki A: Blind decomposition of low-dimensional multi-spectral image by sparse component analysis. J of Chemometrics 2009, 23: 590–597. doi:10.1002/cem.1257
- Hoyer PO: Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 2004, 5: 1457–1469.
- Reju VG, Koh SN, Soon IY: An algorithm for mixing matrix estimation in instantaneous blind source separation. Signal Proc 2009, 89: 1762–1773. doi:10.1016/j.sigpro.2009.03.017
- Kim SG, Yoo CD: Underdetermined Blind Source Separation Based on Subspace Representation. IEEE Trans Sig Proc 2009, 57: 2604–2614.
- Naini FM, Mohimani GH, Babaie-Zadeh M, Jutten C: Estimating the mixing matrix in Sparse Component Analysis (SCA) based on partial k-dimensional subspace clustering. Neurocomputing 2008, 71: 2330–2343. doi:10.1016/j.neucom.2007.07.035
- Tibshirani R: Regression shrinkage and selection via the lasso. J Royal Statist Soc B 1996, 58: 267–288.
- Tropp JA, Wright SJ: Computational Methods for Sparse Solution of Linear Inverse Problems. Proc of the IEEE 2010, 98: 948–958.
- Beck A, Teboulle M: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J on Imag Sci 2009, 2: 183–202. doi:10.1137/080716542
- Selected publications list of Professor Amir Beck [http://ie.technion.ac.il/Home/Users/becka.html]
- Kecman V: Learning and Soft Computing: Support Vector Machines, Neural Networks and Fuzzy Logic Models. The MIT Press; 2001.
- Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition. Springer; 2009: 649–698.
- Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA: Use of proteomic patterns in serum to identify ovarian cancer. The Lancet 2002, 359: 572–577. doi:10.1016/S0140-6736(02)07746-2
- National Cancer Institute clinical proteomics program [http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp]
- Assareh A, Volkert LG: Fuzzy rule based classifier fusion for protein mass spectra based ovarian cancer diagnosis. In Proceedings of the 2009 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB'09); 2009: 193–199.
- Yang P, Zhang Z, Zhou BB, Zomaya AY: A clustering based hybrid system for biomarker selection and sample classification of mass spectrometry data. Neurocomputing 2010, 73: 2317–2331. doi:10.1016/j.neucom.2010.02.022
- Petricoin EF, Ornstein DK, Paweletz CP, Ardekani A, Hackett PS, Hitt BA, Velassco A, Trucco C, Wiegand L, Wood K, Simone CB, Levine PJ, Linehan WM, Emmert-Buck MR, Steinberg SM, Kohn EC, Liotta LA: Serum proteomic patterns for detection of prostate cancer. J Natl Canc Institute 2002, 94: 1576–1578. doi:10.1093/jnci/94.20.1576
- Xu Q, Mohamed SS, Salama MMA, Kamel M: Mass spectrometry-based proteomic pattern analysis for prostate cancer detection using neural networks with statistical significance test-based feature selection. In Proceedings of the 2009 IEEE Conference Science and Technology for Humanity (TIC-STH); 2009: 837–842.
- Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 1999, 96: 6745–6750. doi:10.1073/pnas.96.12.6745
- Data pertaining to the article 'Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays' [http://genomics-pubs.princeton.edu/oncology/affydata/index.html]
- Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA 2002, 99: 6562–6566. doi:10.1073/pnas.102102699
- Huang TM, Kecman V: Gene extraction for cancer diagnosis using support vector machines. Artificial Intelligence in Medicine 2005, 35: 185–194. doi:10.1016/j.artmed.2005.01.006
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.