UFFizi: a generic platform for ranking informative features
© Gottlieb et al; licensee BioMed Central Ltd. 2010
Received: 2 December 2009
Accepted: 3 June 2010
Published: 3 June 2010
Feature selection is an important pre-processing task in the analysis of complex data. Selecting an appropriate subset of features can improve classification or clustering and lead to better understanding of the data. An important example is that of finding an informative group of genes out of thousands that appear in gene-expression analysis. Numerous supervised methods have been suggested but only a few unsupervised ones exist. Unsupervised Feature Filtering (UFF) is such a method, based on an entropy measure of Singular Value Decomposition (SVD), ranking features and selecting a group of preferred ones.
We analyze the statistical properties of UFF and present an efficient approximation for the calculation of its entropy measure. This allows us to develop a web-tool that implements the UFF algorithm. We propose novel criteria to indicate whether a considered dataset is amenable to feature selection by UFF. Relying on formalism similar to UFF we propose also an Unsupervised Detection of Outliers (UDO) method, providing a novel definition of outliers and producing a measure to rank the "outlier-degree" of an instance.
Our methods are demonstrated on gene and microRNA expression datasets, covering viral infection disease and cancer. We apply UFFizi to select genes from these datasets and discuss their biological and medical relevance.
Statistical properties extracted from the UFF algorithm can distinguish selected features from others. UFFizi is a framework that is based on the UFF algorithm and it is applicable for a wide range of diseases. The framework is also implemented as a web-tool.
The web-tool is available at: http://adios.tau.ac.il/UFFizi
The present information age is characterized by exponentially increasing data, e.g. in numbers of documents and in records of various kinds or biological data. Improved experimental techniques, such as high throughput methods in biology, allow for the measurement of thousands of features (genes) for each instance (single gene-expression microarray per patient). This leads to a flood of data, whose analysis calls for preprocessing in order to reduce noise and enhance the signal through dimensionality reduction. This is important for both enabling the application of various categorization techniques and allowing for biological inference from the data.
Dimensionality reduction algorithms are usually categorized as extraction or selection methods. Feature extraction transforms all features into a lower dimension space, while feature selection selects a subset of the original features. A benefit of the latter is the ability to attach meaning to the selected features. This is important both for exploration of the biological reality and for preparing a more concise experimental layout. The method to be studied here is categorized as feature selection.
It is customary to divide feature selection methods into two types: supervised, in which a target function is known and one tries to rank features or optimize some objective function relative to it, and unsupervised, in which one has no additional information regarding the instances. In practice, the abundance of unlabeled data, or of data that might possess multiple possible labelings, calls for an unsupervised approach.
While supervised feature selection methods are abundant , unsupervised methods are scarce, most of them tested on labeled data . Nevertheless, unsupervised feature selection methods may play an important role even in supervised cases. Being unbiased by the labeling of the instances, unsupervised feature selection can be used as a preprocessing tool for supervised learning algorithms providing reduction of overfitting (for a comprehensive review we refer to ). As described in , feature selection from unsupervised data can be applied at three different stages: before, during and after clustering. Methods that operate before clustering are referred to as filter methods. Common methods of unsupervised feature filtering rank features according to either (1) their non-zero loadings in the first principal components , (2) their normalized range,(3) entropy or (4) variance of the feature as calculated from its values on all instances [2, 5]. All these methods estimate the importance of each feature independently of all others.
Our Unsupervised Feature Filtering (UFF) algorithm  differs from the aforementioned methods in that it ranks features based on a criterion that involves all other features. It also provides a natural cutoff for selecting the number of features. We have previously shown that UFF selects stable feature sets under perturbations . Our aim in this article is to introduce a new framework based on UFF. We (1) explore the properties of UFF and the features it selects, (2) introduce a faster approximate version, (3) suggest indicators for the ability to apply the method to certain datasets and (4) extend it by proposing a method called Unsupervised Detection of Outliers (UDO; also referred to as Unsupervised Instance Selection, UIS) for inspecting and eliminating potential outlier instances. The faster version of UFF, together with the indicators of its applicability to different datasets, enables the implementation of UFF as a web-tool. The performance of UFF is shown to surpass commonly used unsupervised filtering methods (e.g. variance, feature entropy) for the datasets used in this study. These findings are consistent with the findings reported in .
In the Results section, we explore the properties of UFF on example datasets, introduce a faster algorithm for UFF and analyze which datasets can be evaluated successfully by the UFF method. We then describe the UDO method and provide biological insights on gene and microRNA expression from a wide range of diseased states.
Results and Discussion
Analyzing and Improving UFF
In this section, we present an analysis of UFF selected features and provide improvements and extensions to the algorithm. The improvements include (i) a faster version of the algorithm and (ii) the addition of a criterion for assessing the quality of the results provided by UFF. We further extend the algorithm by introducing the Unsupervised Detection of Outliers (UDO).
Properties of selected features
The differences in projection on the principal components between the positive and negative scored features may explain the difference between our approach and the sparse-PCA approach . The latter selects genes that correlate mainly with the leading PC, while UFF prefers a wider distribution.
Finally we observe that negative score features have skewness close to zero and kurtosis close to three. Hence we conclude that negative score features possess wide Gaussian distributions, which can be regarded as bearing no indicative signal over the instances. These noisy features are discarded by UFF but selected by Variance Selection, which explains the inferior results of the latter demonstrated in 
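The skewness and kurtosis check described above can be sketched as follows. This is a minimal illustration, assuming population (biased) moments and the non-excess definition of kurtosis, under which a Gaussian gives a value near three; the function name and the synthetic feature are hypothetical and not part of UFFizi.

```python
import numpy as np

def skewness_and_kurtosis(values):
    """Return (skewness, kurtosis) of a 1-D sample from its central moments.

    Kurtosis is the non-excess form, so a Gaussian sample gives roughly 3.
    """
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # variance
    m3 = np.mean(centered ** 3)
    m4 = np.mean(centered ** 4)
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# Synthetic "noisy feature": a wide Gaussian measured over many instances.
rng = np.random.default_rng(0)
gaussian_feature = rng.normal(loc=0.0, scale=2.0, size=100_000)
skew, kurt = skewness_and_kurtosis(gaussian_feature)
```

A feature whose values over the instances give skewness near zero and kurtosis near three is, by this criterion, indistinguishable from Gaussian noise.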
An extended formulation is given in the Appendix.
Table: Comparison of running time between regular and fast UFF. Columns: Regular UFF Matlab; Fast UFF Matlab; Regular UFF C++; Fast UFF C++. Rows (datasets): Melanoma, size = [69 × 22283]; HIV, size = [40 × 22283]; Hepatitis C, size = [78 × 54675].
The quality of the approximation rests on the assumption of small perturbations. In order to test whether this assumption holds for a given dataset, we inspect the SVD entropy of the matrix, defined to lie between 0 and 1 (see Methods). For most datasets that we studied it is smaller than 0.1. Such a small entropy value indicates that only a few eigenvalues (principal components) are of importance, so that the removal of a single feature is indeed a small perturbation and the approximation (equation 2) is valid. In two of the studied datasets (GBM and OV microRNA) the SVD entropy is large (0.59 and 0.34 respectively), putting the approximation (equation 2) in doubt. In both cases one should therefore resort to the regular UFF calculation to obtain reliable results.
Fast UFF allows for the analysis of much larger datasets. Moreover it enables incorporating this algorithm in a web-based tool. Computationally, it allows for a distributed evaluation of UFF scores, once the eigenvectors of the Gram matrix C are obtained. The calculation of the SVD entropy of the matrix is incorporated into the UFFizi web tool, initiating a warning when the results of the fast UFF might deviate substantially from the regular UFF.
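To illustrate why a single eigendecomposition of the Gram matrix can suffice, the sketch below compares exact leave-one-out scores with a first-order perturbation estimate. It is an assumption-laden illustration, not the paper's equation 2: it assumes A has features in rows and instances in columns, takes C = AᵀA, and approximates the eigenvalues after removing feature i as λ_k − (a_i · v_k)². All function names are hypothetical.

```python
import numpy as np

def entropy_from_eigenvalues(eigvals, n_components):
    """Normalized entropy of eigenvalues (squared singular values)."""
    lam = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    p = lam / lam.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(n_components))

def exact_scores(A):
    """Leave-one-feature-out scores: one eigendecomposition per feature."""
    m = A.shape[1]
    full = entropy_from_eigenvalues(np.linalg.eigvalsh(A.T @ A), m)
    scores = np.empty(A.shape[0])
    for i in range(A.shape[0]):
        reduced = np.delete(A, i, axis=0)
        reduced_eigvals = np.linalg.eigvalsh(reduced.T @ reduced)
        scores[i] = full - entropy_from_eigenvalues(reduced_eigvals, m)
    return scores

def approx_scores(A):
    """First-order estimate from a single eigendecomposition of C = A^T A."""
    m = A.shape[1]
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)
    full = entropy_from_eigenvalues(eigvals, m)
    projections_sq = (A @ eigvecs) ** 2  # entry (i, k) is (a_i . v_k)^2
    scores = np.empty(A.shape[0])
    for i in range(A.shape[0]):
        perturbed = eigvals - projections_sq[i]
        scores[i] = full - entropy_from_eigenvalues(perturbed, m)
    return scores

# Synthetic low-entropy data: one dominant component plus noise.
rng = np.random.default_rng(0)
n_features, n_instances = 400, 10
signal = np.outer(rng.normal(size=n_features), rng.normal(size=n_instances))
A = 3.0 * signal + rng.normal(size=(n_features, n_instances))
exact = exact_scores(A)
approx = approx_scores(A)
correlation = np.corrcoef(exact, approx)[0, 1]
```

Because each feature's estimate depends only on the shared eigenvalues and eigenvectors, the loop in `approx_scores` could be split across workers, which is consistent with the distributed evaluation mentioned above.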
When is UFF applicable
Working with more than twenty datasets from different domains, we have found measures that separate datasets on which UFF is effective from those on which it is not. One such measure is the normalized entropy of the squares of UFF scores. This measure and a second one are formulated in the supplementary appendix. They provide a prior estimate of whether UFF selected features should be used, and are incorporated into our web-tool, providing a confidence level for relying on UFF results.
Unsupervised Detection of Outliers (UDO)
Outliers are typically defined as instances that differ significantly from other instances in the data (for extensive surveys, see [8, 9]). Detecting such outlier instances may be desirable in certain cases, e.g. when there is a suspicion of faulty or unreliable measurements or for detecting rare events. A multitude of methods for unsupervised outlier detection have been proposed. Most relate to one of two approaches: (1) model based, in which a model is fit to the data and outliers are the ones deviating from the model [10, 11], (2) Distance-based methods, which find instances lying far from all instances, nearest instances, or nearby clusters [12–18]. We present here an alternative definition and a method to detect such outliers, based on the UFF framework.
The data-matrix A contains information on instances in terms of features and features in terms of instances, and the singular values are common to both. One may therefore consider a 'leave-one-out' measure applied to instances. This is the Unsupervised Detection of Outliers (UDO) method, to be studied here. UDO identifies instances that, when removed, decrease the entropy of the dataset and thus provide a more homogeneous dataset. Recognizing these entropy-increasing instances as outliers provides a natural definition for an "outlier-degree". UDO attaches to each instance the amount of decrease of the SVD entropy, which is considered the global measure of the "outlier-degree" of each instance in the dataset. As in the UFF method, a threshold of one standard deviation (std) above the mean may be applied to assess the number of such outliers. UDO is a data-driven method that, unlike model-based methods, makes no prior assumption regarding the distribution of the data. It is therefore not restricted by small sample sizes, which prohibit valid assessment of distributions. It also differs from distance-based outlier detection schemes in that it assesses the influence of instance removal on the entire dataset rather than the mere location of the instance in feature space relative to other instances. In contrast to the Stahel-Donoho estimator , which assesses the "outlier-degree" of an instance relative to one selected direction in feature space, UDO estimates it on all eigenvectors at once. UDO in this sense emphasizes directions along which other instances are relatively comparable. We note that in datasets of relatively low SVD entropy, the correlation between the UDO ranking and the popular outlier detection method of the kth-NN ranking  is relatively high (0.61 and 0.82 for the melanoma and HIV datasets respectively, k = 5). This can be explained by noting that removal of an instance in such datasets does not alter the leading eigenvectors substantially and UDO thus selects the high-entropy instances that reside mainly farthest along these eigenvectors. In high SVD entropy datasets (e.g. the two microRNA datasets in this paper), the correlation between the two different methods is essentially zero.
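The leave-one-instance-out idea can be sketched as below. This is a minimal illustration under stated assumptions: instances are the columns of A, the entropy is computed from normalized squared singular values, and it is normalized by the logarithm of the original number of instances both before and after removal so that the two values are comparable (the normalization used by UFFizi may differ). Function names and the planted outlier are hypothetical.

```python
import numpy as np

def svd_entropy(A, n_components):
    """Entropy of the normalized squared singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    p = s ** 2 / np.sum(s ** 2)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(n_components))

def udo_scores(A):
    """Outlier-degree per instance (column): entropy decrease upon removal."""
    m = A.shape[1]
    full = svd_entropy(A, m)
    return np.array(
        [full - svd_entropy(np.delete(A, j, axis=1), m) for j in range(m)]
    )

def select_outliers(scores):
    """Instances scoring more than one std above the mean score."""
    return np.flatnonzero(scores > scores.mean() + scores.std())

# Synthetic data: 12 near-identical instances, with instance 7 replaced
# by an unrelated random profile (the planted outlier).
rng = np.random.default_rng(1)
base = rng.normal(size=100)
A = np.tile(base[:, None], (1, 12)) + 0.05 * rng.normal(size=(100, 12))
A[:, 7] = rng.normal(size=100)
scores = udo_scores(A)
outliers = select_outliers(scores)
```

In this toy case removing the planted instance eliminates a whole spectral component, so its score dominates; on real data the ranking is expected to be far less clear-cut.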
Since outlier-defining criteria and the methods implementing them are intertwined, evaluation of each method often turns into a subjective inspection of the outliers. We note that in the HIV dataset, for which we have some clinical information, the first 4 selected instances (out of 5 selected by UDO) are samples of two individuals (containing both CD4+ and CD8+ T cells). The two leading outlier instances belong to the same individual, who had an HIV infection at a very early stage (~1 month), possibly explaining the high divergence of these measurements from those of individuals with longer periods of HIV infection.
In this section we present novel results obtained by applying UFF to gene-expression and microRNA (miRNA) expression datasets.
Melanoma - UFF selected genes
The melanoma dataset is used for demonstrating the different traits of UFF. Running UFF on this dataset, we obtain 231 genes. The top ranked genes include Stratifin, Keratin 14, Keratin 1 and Loricrin, mutations in which are related to skin cancer and other skin diseases [19–22]. Enrichment analysis includes terms having Bonferroni score < 0.05. GO Enrichment analysis of the selected genes includes functions of biological processes such as ectoderm and epidermis development, homophilic cell adhesion, keratinocyte differentiation and melanin biosynthetic process. Cellular compartments enrichment includes intermediate filament, extracellular region and melanosome. Interestingly, GO molecular function enrichment show various metal ion binding, including copper, cadmium and calcium, all having relations to the tumor suppressor protein p53 [23–25]. Enriched pathways include cell communication, antigen processing and presentation and also breast cancer estrogen signaling. Human phenotype analysis reveals enrichment for palmoplantar hyperkeratosis, keratinization, skin and integument abnormalities. The list of UFF selected genes is provided in Additional file 1, Table S1. The full list of GO enrichment terms is provided in Additional file 2, Table S1.
Talantov, et al. (2005) performed clustering analysis on this dataset, using a filtered list of 15,795 genes. They did not obtain perfect separation between melanoma and benign tumors or normal tissues (obtaining Jaccard score  of 0.74). Using UFF selected genes and the Quantum Clustering algorithm , we were able to correctly split melanoma from benign tissues, while identifying two clusters in the melanoma samples similar to the ones identified by  (Jaccard score of 0.85). Of the UFF selected genes, 32 also appear among the 439 differentially expressed genes of  (p-value = e-12), as do 10 of the 33 differentially expressed genes with high fold change (p-value < e-12).
Quantum Clustering results are provided in Additional file 4, Table S1.
HIV - UFF Selected genes
Next we explored the HIV dataset. UFF selected 179 genes, enabling us to cluster the CD4+ and CD8+ samples into separate clusters with only one misclassification. In comparison, clustering the samples using all the genes produced 2 misclassifications. Among the top ranking genes we find mostly hemoglobin units, but also the specific CD4+ HIV related protein defensin  and the CD8+ HIV related CD8 antigen . GO enriched biological processes for the 179 selected genes (Bonferroni < 0.05) include immune system process, immune response, cellular defense response, antigen processing and presentation of peptide antigen via MHC class I and class II. Cellular compartments are enriched for the MHC class I and II protein complexes. Non-trivial enriched pathways include Graft-versus-host disease, natural killer cell mediated cytotoxicity and type I diabetes (Bonferroni < 10^-6). The selected genes involved in the type I diabetes pathway are usually in direct connection with either CD4+ or CD8+ T-cells. This connection is strongly supported by literature text mining (not shown). The list of selected genes is provided in Additional file 1, Table S2. Enriched terms are provided in Additional file 2, Table S2.
Similar to figure 5, Additional file 3, Figure S2 displays the performance of clustering the HIV instances using different gene sets, selected by various unsupervised feature selection methods, random selection and using all the genes, as well as comparison to a feature extraction method, selecting the first eigenvectors computed using SVD. The performance of UFF surpasses all other methods in terms of clustering results (see Methods).
Chronic hepatitis C - UFF selected genes
Additional file 3, Figure S3 compares the performance of clustering the Hepatitis-C instances using UFF selected genes with gene sets selected by various unsupervised feature selection methods, random selection and using all the features, as well as comparison to a feature extraction method, selecting the first eigenvectors computed using SVD. The performance of UFF again tops other methods in terms of clustering results.
Glioblastoma - UFF selected genes
We present results on glioblastoma multiforme (GBM) from The Cancer Genome Atlas (TCGA) project. Because of the differences between experiments, we selected features from each platform independently; pooling the platforms would favor genes that differentiate between platforms rather than between instance types (UFF was applied to AgilentG4502A_07_1 and AgilentG4502A_07_2 separately, to avoid selecting genes that perfectly separate the two platforms). The unsupervised approach displays its full strength in this case, since we do not have access to additional sample information on these datasets.
There are variations between the number of genes selected on Agilent and Affymetrix gene expression platforms (563 and 731 genes for Agilent 1 and 2 platforms, while only 140 for Affymetrix).
Table: Top 10 ranked genes for glioblastoma multiforme. Columns: Minimal UFF rank across platforms; Related to Cancer Biomarkers.
Although the Agilent and Affymetrix datasets differ widely in the number of genes selected by UFF, the highest GO enrichment terms are common to both. Both show high GO enrichment of general biological processes such as regulation of multicellular organismal process, cell proliferation and nervous system development (Bonferroni < 0.05; for nervous system development in Affymetrix, FDR < 0.05 but Bonferroni < 0.1). UFF selected genes on Affymetrix also show inflammatory response, while UFF selected genes on Agilent are enriched for cell adhesion. Both platforms are also enriched for the cellular compartment of extracellular matrix and both are highly enriched for 'signal peptide' and 'secreted' (Bonferroni < 0.0005) based on UniProt keywords. UFF selected genes on both platforms are enriched for the molecular functions of protein and receptor binding, which include various ligands such as polysaccharide, heparin and neuropeptide hormone activity binding (Agilent platform), and lipid and ferric iron binding (Affymetrix platform). Enrichment analysis is provided in Additional file 2, Table S4.
OV - UFF selected genes
We applied the same analysis performed on the glioblastoma multiforme (GBM) datasets to the ovarian serous cystadenocarcinoma (OV) dataset from TCGA. UFF selects 669 and 998 genes from the Agilent and Affymetrix platform datasets respectively. GO enrichment analysis reveals that the UFF selected genes yield GO terms very similar to those of the UFF selected genes on GBM.
190 genes are common to both Agilent and Affymetrix platforms. Table 3 lists the top 10 common genes in terms of minimal UFF rank. Additional file 5, Table S2 provides detailed explanations for Table 3. The list of UFF OV selected genes and the 190 platform-shared genes are provided in Additional file 1, Table S5.
Table 3: Top 10 ranked genes for ovarian serous (OV) cystadenocarcinoma. Columns: Minimal UFF rank across platforms; Related to Cancer Biomarkers.
Seven of the UFF selected genes are common to both GBM and OV. These are POSTN, NPTX2, GJA1, NNMT, CSRP2, SCG5 and HSPA1A, all of them related to cancer biomarkers. Additional file 5, Table S3 provides further details on the relation of these 7 common genes to cancer biomarkers. Note that POSTN appears in the top 10 selected genes in both GBM and OV datasets.
Selected miRNA for GBM and OV
We also report UFF selected microRNAs (miRNA) from TCGA microarrays for the glioblastoma (GBM) and ovarian (OV) cancers. There are 534 miRNAs in GBM, taken from 325 samples and 799 miRNAs taken from 295 OV samples. UFF selected 43 and 63 miRNAs in GBM and OV respectively.
of the selected miRNAs appear in both GBM and OV tumors. They are listed in Table 4. Additional file 6, Table S1 provides further details on relation of these miRNAs to cancer biomarkers. Selected miRNAs for GBM and OV are also listed in Additional file 6, Tables S2 and S3.
Table 4: MicroRNAs selected by UFF, common to GBM and OV. Columns: Minimal UFF rank; Related to Cancer Biomarkers.
We present an improved method, and a new web tool, that enable users to benefit from the power of UFF, an unsupervised approach that scores and ranks each feature according to its influence on the singular values distribution.
A statistical characterization of the selected features shows that our method selects features of high variance (over instances), excluding those that correlate strongly with the first principal component alone. As a result, it ignores noisy features that have Gaussian distributions. The strength of our method lies in selecting features that both capture inherent clustering of the instances and possess high variance. The combination of the two is significant in the case of biological datasets such as expression microarrays.
By studying various empirical datasets and evaluating different scoring functions we show that our approach is generic, and can identify the subset of relevant features. In contradistinction to other methods we can estimate the size of the group of selected relevant features. Furthermore, we present a novel approximation method, enabling significantly faster calculation of the UFF feature scores.
UFF is a heuristic method which exposes its strength in realistic applications. Nevertheless, not all datasets are amenable to feature selection by UFF. We propose criteria for deciding when UFF application is effective. This information is also provided in the online UFF tool. We further extend the capabilities of UFF by introducing the Unsupervised Detection of Outliers (UDO) method. UDO provides a novel definition of an "outlier-degree" of an instance and identifies such outliers in the dataset. This enables the researcher to detect rare events in the dataset or filter faulty instances before proceeding with further analysis.
Finally, we analyze various gene expression and microRNA expression datasets to show the strength of our approach and to expose interesting findings on these datasets with possible biological relevance.
Web tool: http://adios.tau.ac.il/UFFizi
We use three gene-expression microarray datasets with known labeling in order to demonstrate the performance of UFF. They were compiled from the online public repository of the National Center for Biotechnology Information/GenBank Gene Expression Omnibus (GEO) database [38, 39]. Data collections are: (i) Gene expression measurements taken from skin tissues including 7 normal skin tissues, 18 benign melanocytic lesions and 45 malignant melanoma  (series entry GSE3189); (ii) HIV dataset (series entry GSE6740), containing gene expression measurements from 20 CD4+ and 20 CD8+ T cells from HIV patients at different clinical stages; (iii) Hepatitis C (series entry GSE11190) containing gene expression measurements from 78 samples, comprising 38 blood samples and 40 liver biopsies, before and after interferon treatment of Hepatitis C (19 blood samples before and after the treatment, 21 and 19 liver biopsies before and after respectively). All these datasets are Affymetrix Human Genome U133A Array (Hepatitis C is a U133 plus 2.0 array).
In addition, we present results obtained from using UFF on The Cancer Genome Atlas (TCGA) gene-expression and microRNA (miRNA) expression datasets. These datasets are comprised of samples taken from (i) glioblastoma multiforme (GBM) and (ii) ovarian serous cystadenocarcinoma (OV) patients. Gene-expression datasets are measured using Affymetrix Human Genome U133A Arrays and Agilent G4502A_07 platforms. miRNA expression is measured using Agilent Human miRNA Microarray Rel12.0 and Agilent 8 × 15 K Human miRNA-specific platforms. Details of these datasets are specified in Additional file 7, Table S1.
Unsupervised Feature Filtering (UFF)
UFF is based on an entropy measure applied to Singular Value Decomposition (SVD). Let A denote a matrix, whose elements A_ij denote the measurement of feature i on instance j, e.g. expression of gene i under condition j. SVD decomposes the original matrix A into A = USV^T, where U and V are unitary matrices whose columns form orthonormal bases. The diagonal matrix S is composed of singular values (s_k) ordered from highest to lowest. SVD is a common technique in feature extraction. UFF uses the information contained in the singular values in order to select the features.
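For reference, the entropy measure and the leave-one-out score can be written as follows. This is a reconstruction based on the previously published UFF formulation (Varshavsky et al., 2006), not a verbatim copy of this paper's equations; N denotes the number of singular values, and the symbols V_k and CE_i are notation assumed here.

```latex
V_k = \frac{s_k^2}{\sum_{l=1}^{N} s_l^2}, \qquad
H = -\frac{1}{\log N} \sum_{k=1}^{N} V_k \log V_k, \qquad
CE_i = H\left(A_{[n \times m]}\right) - H\left(A_{[(n-1) \times m]}\right)
```

Here H lies between 0 (a single non-zero singular value) and 1 (all singular values equal), and CE_i is the contribution of feature i: the entropy of the full n × m matrix minus the entropy of the matrix with the i-th feature removed.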
Other functions may be used instead of H. They have to be monotonic and vary from a maximum, when all singular values are equal, to a minimum when there is only one singular value bigger than zero. Two such functions that we tested are the negative value of sum of squares and the geometric mean. The results using these functions are very similar to those obtained using the SVD-entropy, hence we will not elaborate further on them.
(1) Features with positive score. These features increase the entropy.
(2) Neutral features. These features have negligible influence on the entropy.
(3) Negative score features. These features decrease the entropy.
We follow the Simple Ranking (SR) method of UFF, denoting positive score features (group 1) as features whose scores lie above the mean score + one std (upper dotted line in figure 2), negative score features (group 3) as features whose scores lie below the mean score - one std (lower dotted line) and neutral features (group 2) the rest. Note that most features fall into group 2, while groups 1 and 3 represent minorities. UFF  selects group 1 as containing the most relevant features. The rationale behind this selection is that, because these features increase the entropy, they decrease redundancy. Hence one may expect that instances will be better separated in the space spanned by these features. Further analysis of this group and its comparison with the two other groups is presented in the "properties of selected features" section.
In this paper, we follow the Simple Ranking (SR) method of UFF, selecting all positive score features (group 1). Alternative UFF methods suggested in  are not shown.
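The Simple Ranking thresholds described above (mean score plus or minus one std) can be sketched as follows. The function name is hypothetical and the scores are made-up numbers for illustration; in practice they would be the UFF scores of the features.

```python
import numpy as np

def simple_ranking_groups(scores):
    """Split feature indices into (positive, neutral, negative) groups.

    Group 1 (positive): score above mean + one std.
    Group 3 (negative): score below mean - one std.
    Group 2 (neutral):  everything in between.
    """
    scores = np.asarray(scores, dtype=float)
    mean, std = scores.mean(), scores.std()
    positive = np.flatnonzero(scores > mean + std)
    negative = np.flatnonzero(scores < mean - std)
    extremes = np.concatenate([positive, negative])
    neutral = np.setdiff1d(np.arange(scores.size), extremes)
    return positive, neutral, negative

# Illustrative scores: eight neutral features, one positive, one negative.
example_scores = [0, 0, 0, 0, 0, 0, 0, 0, 5, -5]
positive, neutral, negative = simple_ranking_groups(example_scores)
```

Only the `positive` group would be retained as the selected features under the SR method.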
GO and Pathway Enrichment
Enrichment of Gene Ontology (GO), KEGG pathways and PubMed papers presented here were calculated using the DAVID [44, 45] and ToppGene tools . Verifications were also done using other tools such as Ontologizer  and GO Tree Machine .
UFF performance validation
Clustering comparison between different unsupervised feature selection methods was performed using the widely used k-means clustering algorithm. In order to provide an unbiased comparison, all feature selection methods were tested with the same input parameter k (k = 3 for the melanoma dataset, k = 2 for the HIV dataset and k = 4 for the Hepatitis-C dataset) for the k-means clustering algorithm with no additional preprocessing. The clustering was repeated 100 times for each feature selection method and each number of selected features.
Random selection was used to generate 100 different sets. Feature entropy was performed on each feature individually, using the same formalism as in equation 3. We used the Jaccard score  to measure the quality of the clustering relative to known labels.
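For readers unfamiliar with the Jaccard score in a clustering context, a minimal sketch is given below. It assumes the common pair-counting definition (pairs of instances grouped together in both partitions, divided by pairs grouped together in at least one); whether this matches the exact variant used in the paper is an assumption, and the function name is hypothetical.

```python
from itertools import combinations

def jaccard_score(true_labels, cluster_labels):
    """Pair-counting Jaccard index between known labels and a clustering."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_true and same_cluster:
            n11 += 1  # pair together in both partitions
        elif same_true:
            n10 += 1  # together in the labels only
        elif same_cluster:
            n01 += 1  # together in the clustering only
    denominator = n11 + n10 + n01
    return n11 / denominator if denominator else 1.0

perfect = jaccard_score([0, 0, 1, 1], [1, 1, 0, 0])  # cluster names are irrelevant
partial = jaccard_score([0, 0, 1, 1], [0, 0, 0, 1])
```

The score depends only on which instances are grouped together, so it is invariant to how the clusters happen to be numbered by k-means.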
Connection between projection on first principal component and negative entropy score
One can prove that in the extreme case, where a feature is lying only on the first PC, it is bound to have a negative score. We shall now prove it for the SVD-entropy function. This proof can be extended to cover also the alternative measures mentioned in the methods section (UFF sub-section).
This means that removing such a feature always leads to increase of entropy.
where Vi is the i-th eigenvector of C.
Adjusting appropriately S and K, it is easy to prove this also for the sum of squares and the geometric mean functions mentioned in the methods section.
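The claim that a feature lying only on the first PC has a negative score can be illustrated with a short first-order argument. The notation below (λ_k for the eigenvalues of C, T for their sum, p_k for the normalized eigenvalues and H̃ for the unnormalized entropy) is assumed here for illustration, and the sketch is not a reproduction of the original proof. Removing such a feature lowers only λ_1, leaving the other eigenvalues and all eigenvectors unchanged.

```latex
p_k = \frac{\lambda_k}{T}, \qquad T = \sum_k \lambda_k, \qquad
\tilde{H} = -\sum_k p_k \log p_k

\frac{\partial p_k}{\partial \lambda_1} = \frac{\delta_{k1} - p_k}{T}
\quad\Rightarrow\quad
\frac{\partial \tilde{H}}{\partial \lambda_1}
  = -\sum_k \left(\log p_k + 1\right) \frac{\delta_{k1} - p_k}{T}
  = -\frac{1}{T}\left(\log p_1 + \tilde{H}\right)

\log p_1 \;\ge\; \sum_k p_k \log p_k = -\tilde{H}
\quad\Rightarrow\quad
\frac{\partial \tilde{H}}{\partial \lambda_1} \le 0
```

The inequality holds because p_1 is the largest normalized eigenvalue, so log p_1 is at least the weighted average of all log p_k. As long as λ_1 remains the largest eigenvalue, decreasing it therefore increases the entropy, which gives the feature a negative score.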
When is UFF applicable?
Suitable datasets can then be defined as those lying below certain thresholds in both measures. We tested more than a dozen 'suitable' and ten 'not-suitable' datasets (not shown) using UFF and clustering algorithms. It seems that combining the two measures using the geometric mean provides the best test for applicability. We found 'suitable' datasets to lie below a threshold of 0.8 of the combined score.
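The combination rule can be sketched as below. This is an illustration under assumptions: the normalized entropy of the squared UFF scores is taken as the entropy of the squares rescaled to sum to one, divided by the logarithm of the number of features, and the second measure is passed in as a plain number because its definition is not reproduced in this text. All function names are hypothetical; only the geometric mean and the 0.8 threshold come from the paragraph above.

```python
import numpy as np

def normalized_entropy_of_squares(scores):
    """Entropy of the squared scores (rescaled to sum to 1), in [0, 1]."""
    squares = np.asarray(scores, dtype=float) ** 2
    p = squares / squares.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(squares.size))

def combined_applicability(measure_a, measure_b):
    """Geometric mean of the two applicability measures."""
    return float(np.sqrt(measure_a * measure_b))

def is_suitable(measure_a, measure_b, threshold=0.8):
    """A dataset is deemed 'suitable' when the combined score is below the threshold."""
    return combined_applicability(measure_a, measure_b) < threshold

# Scores of equal magnitude give the maximal normalized entropy of 1.
flat_entropy = normalized_entropy_of_squares([1.0, -1.0, 1.0, -1.0])
suitable = is_suitable(flat_entropy, 0.5)  # sqrt(1.0 * 0.5) is about 0.71
unsuitable = is_suitable(0.9, 0.9)         # geometric mean is 0.9
```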
List of abbreviations used in this paper:
UFF: Unsupervised Feature Filtering
SVD: Singular Value Decomposition
UIS: Unsupervised Instance Selection
CTD: Comparative Toxicogenomics Database
We thank Alon Kaufman and Nati Linial for stimulating discussions and suggestions. RV is a fellow member of the Sudarsky Center for Computational Biology.
Funding: A.G. is an Edmond J. Safra fellow.
This work was supported by the EU Framework VII Prospects consortia
- Saeys Y, Inza I, Larrañaga P: A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23(19):2507–2517. 10.1093/bioinformatics/btm344View ArticlePubMedGoogle Scholar
- Guyon I, Elisseeff A: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 2003, 3: 1157–1182. 10.1162/153244303322753616Google Scholar
- Dy JG, Brodley CE: Feature Selection for Unsupervised Learning. J Mach Learn Res 2004, 5: 845–889.Google Scholar
- Zou H, Hastie T, Tibshirani R: Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics 2006, 15(2):265–286. 10.1198/106186006X113430View ArticleGoogle Scholar
- Herrero J, Diaz-Uriarte R, Dopazo J: Gene expression data preprocessing. Bioinformatics 2003, 19(5):655–656. 10.1093/bioinformatics/btg040View ArticlePubMedGoogle Scholar
- Varshavsky R, Gottlieb A, Linial M, Horn D: Novel Unsupervised Feature Filtering of Biological Data. Bioinformatics 2006, 22(14):e507–513. 10.1093/bioinformatics/btl214View ArticlePubMedGoogle Scholar
- Varshavsky R, Gottlieb A, Horn D, Linial M: Unsupervised feature selection under perturbations: meeting the challenges of biological data. Bioinformatics 2007, 23(24):3343–3349. 10.1093/bioinformatics/btm528View ArticlePubMedGoogle Scholar
- Hodge V, Austin J: A Survey of Outlier Detection Methodologies. Artificial Intelligence Review 2004, 22(2):85–126. 10.1023/B:AIRE.0000045502.10941.a9View ArticleGoogle Scholar
- Zhang Y, Meratnia N, Havinga P: A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets. Technical Report TR-CTIT-07–79, Centre for Telematics and Information Technology, University of Twente, Enschede 2007.Google Scholar
- Guyon I, Matic N, Vapnik V: Advances in knowledge discovery and data mining. American Association for Artificial Intelligence Menlo Park, CA, USA; 1996.Google Scholar
- Yamanishi K, Takeuchi Ji, Williams G, Milne P: On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms. Data Mining and Knowledge Discovery 2004, 8(3):275–300. 10.1023/B:DAMI.0000023676.72185.7cView ArticleGoogle Scholar
- Donoho DL, Gasko M: Breakdown Properties of Location Estimates Based on Halfspace Depth and Projected Outlyingness. Ann Statist 1992, 20(4):1803–1827. 10.1214/aos/1176348890View ArticleGoogle Scholar
- Donoho DL: Breakdown properties of multivariate location estimators. Harvard University; 1982. PhD qualifying paper.
- Stahel WA: Breakdown of Covariance Estimators. Research Report 31, Fachgruppe für Statistik, ETH Zürich 1981.
- Maronna RA, Yohai VJ: The Behavior of the Stahel-Donoho Robust Multivariate Estimator. Journal of the American Statistical Association 1995, 90(429):330–341. 10.2307/2291158
- Ramaswamy S, Rastogi R, Shim K: Efficient algorithms for mining outliers from large data sets. Proceedings of the ACM SIGMOD Conference 2000, 29(2):427–438. 10.1145/335191.335437
- Breunig MM, Kriegel HP, Ng RT, Sander J: LOF: Identifying Density-Based Local Outliers. ACM SIGMOD conference 2000, 29(2):93–104. 10.1145/335191.335388
- Zoubi MdBA: An Effective Clustering-Based Approach for Outlier Detection. European Journal of Scientific Research 2009, 28(2):310–316.
- Herron BJ, Liddell RA, Parker A, Grant S, Kinne J, Fisher JK, Siracusa LD: A mutation in stratifin is responsible for the repeated epilation (Er) phenotype in mice. Nature Genetics 2005, 37: 1210–1212. 10.1038/ng1652
- Chan Y, Anton-Lamprecht I, Yu QC, Jäckel A, Zabel B, JPE, Fuchs E: A human keratin 14 "knockout": the absence of K14 leads to severe epidermolysis bullosa simplex and a function for an intermediate filament protein. Genes & Dev 1994, 8: 2574–2587.
- Rothnagel JA, Dominey AM, Dempsey LD, Longley MA, Greenhalgh DA, Gagne TA, Huber M, Frenk E, Hohl D, Roop DR: Mutations in the rod domains of keratins 1 and 10 in epidermolytic hyperkeratosis. Science 1992, 257: 1128–1130. 10.1126/science.257.5073.1128
- Maestrini E, Monaco AP, McGrath JA, Ishida-Yamamoto A, Camisa C, Hovnanian A, Weeks DE, Lathrop M, Uitto J, Christiano AM: A molecular defect in loricrin, the major component of the cornified cell envelope, underlies Vohwinkel's syndrome. Nature Genetics 1996, 13: 70–77. 10.1038/ng0596-70
- Verhaegh G, Richard M, Hainaut P: Regulation of p53 by metal ions and by antioxidants: dithiocarbamate down-regulates p53 DNA-binding activity by increasing the intracellular level of copper. Mol Cell Biol 1997, 17(10):5699–5706.
- Méplan C, Mann K, Hainaut P: Cadmium Induces Conformational Modifications of Wild-type p53 and Suppresses p53 Response to DNA Damage in Cultured Cells. J Biol Chem 1999, 274(44):31663–31670. 10.1074/jbc.274.44.31663
- Metcalfe S, Weeds A, Okorokov AL, Milner J, Cockman M, Pope B: Wild-type p53 protein shows calcium-dependent binding to F-actin. Oncogene 1999, 18(14):2351–2355. 10.1038/sj.onc.1202559
- Jaccard P: Nouvelles recherches sur la distribution florale. Bul Soc Vaudoise Sci Nat 1908, 44: 223–270.
- Horn D, Gottlieb A: Algorithm for data clustering in pattern recognition problems based on quantum mechanics. Physical Review Letters 2001, 88(1). 10.1103/PhysRevLett.88.018702
- Talantov D, Mazumder A, Yu XJ, Briggs T, Jiang Y, Backus J, Atkins D, Wang Y: Novel Genes Associated with Malignant Melanoma but not Benign Melanocytic Lesions. Clin Cancer Res 2005, 11(20). 10.1158/1078-0432.CCR-05-0683
- Chang TL, Vargas J Jr, DelPortillo A, Klotman ME: Dual role of α-defensin-1 in anti-HIV-1 innate immunity. J Clin Invest 2005, 115(3):765–773.
- Chu F, Tsang PH, Robez JP, Wallace JI, Bekesi JG: Increased spontaneous release of CD8 antigen from CD8+ cells reflects the clinical progression of HIV-1 infected individuals. Int Conf AIDS 1989, 5: 431.
- Hodgson PD, Renton KW: The role of nitric oxide generation in interferon-evoked cytochrome P450 down-regulation. Int J Immunopharmacol 1995, 17(12):995–1000.
- Barsoum RS: Hepatitis C virus: from entry to renal injury--facts and potentials. Nephrology Dialysis Transplantation 2007, 22(7):1840–1848. 10.1093/ndt/gfm205
- Tso CL, Shintaku P, Chen J, Liu Q, Liu J, Chen Z, Yoshimoto K, Mischel PS, Cloughesy TF, Liau LM, et al.: Primary Glioblastomas Express Mesenchymal Stem-Like Properties. Mol Cancer Res 2006, 4: 607. 10.1158/1541-7786.MCR-06-0005
- Santala M, Simojoki M, Risteli J, Risteli L, Kauppila A: Type I and Type III Collagen Metabolites as Predictors of Clinical Outcome in Epithelial Ovarian Cancer. Clinical Cancer Res 1999, 5: 4091–4096.
- Santala M, Risteli J, Risteli L, Puistola U, Kacinski BM, Stanley ER, Kauppila A: Synthesis and breakdown of fibrillar collagens: concomitant phenomena in ovarian cancer. Br J Cancer 1998, 77(11):1825–1831.
- Martorell EA, Murray PM, Peterson JJ, Menke DM, Calamia KT: Palmar fasciitis and arthritis syndrome associated with metastatic ovarian carcinoma: a report of four cases. J Hand Surg 2004, 29(4):654–660. 10.1016/j.jhsa.2004.04.012
- Lee YS, Dutta A: MicroRNAs in cancer. Annual Review of Pathology: Mechanisms of Disease 2008, 4: 199–227. 10.1146/annurev.pathol.4.110807.092222
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002, 30(1):207–210. 10.1093/nar/30.1.207
- Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res 2007, 35: D760-D765. 10.1093/nar/gkl887
- The Cancer Genome Atlas [http://tcga.cancer.gov/]
- Wall M, Rechtsteiner A, Rocha L: Singular Value Decomposition and Principal Component Analysis. In A Practical Approach to Microarray Data Analysis. Edited by: Berrar D, Dubitzky W, Granzow M. Kluwer; 2003:91–109.
- Alter O, Brown PO, Botstein D: Singular value decomposition for genome-wide expression data processing and modeling. PNAS 2000, 97(18):10101–10106. 10.1073/pnas.97.18.10101
- Devijver PA, Kittler J: Pattern recognition: a statistical approach. Englewood Cliffs, N.J: Prentice-Hall; 1982.
- Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protoc 2009, 4(1):44–57. 10.1038/nprot.2008.211
- Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biology 2003, 4: P3.
- Chen J, Bardes EE, Aronow BJ, Jegga AG: ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res 2009, 37: W305-W311. 10.1093/nar/gkp427
- Robinson PN, Wollstein A, Böhme U, Beattie B: Ontologizing gene-expression microarray data: characterizing clusters with Gene Ontology. Bioinformatics 2004, 20(6):979–981. 10.1093/bioinformatics/bth040
- Zhang B, Schmoyer D, Kirov S, Snoddy J: GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies. BMC Bioinformatics 2004, 5: 16.
- The Hellmann-Feynman theorem of quantum mechanical forces was originally proven by P. Ehrenfest, Z. Phys. 45, 455 (1927), and later discussed by Hellmann (1937) and independently rediscovered by Feynman (1939).
- Hellmann H: Einführung in die Quantenchemie. Leipzig and Vienna: Deuticke; 1937.
- Feynman RP: Forces in Molecules. Physical Review 1939, 56: 340–343. 10.1103/PhysRev.56.340
- Dahiya N, Sherman-Baust CA, Wang TL, Davidson B, Shih Ie M, Zhang Y, Wood W, Becker KG, Morin PJ: MicroRNA expression and identification of putative miRNA targets in ovarian cancer. PLoS One 2008, 3(6):e2436. 10.1371/journal.pone.0002436
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.