SPECS: a non-parametric method to identify tissue-specific molecular features for unbalanced sample groups

Background: To understand biology and differences among various tissues or cell types, one typically searches for molecular features that display characteristic abundance patterns. Several specificity metrics have been introduced to identify tissue-specific molecular features, but these either require an equal number of replicates per tissue or cannot handle replicates at all.

Results: We describe a non-parametric specificity score that is compatible with unequal sample group sizes. To demonstrate its usefulness, the specificity score was calculated on all GTEx samples, detecting known and novel tissue-specific genes. A webtool was developed to browse these results for genes or tissues of interest. An example Python implementation of SPECS is available at https://github.com/celineeveraert/SPECS. The precalculated SPECS results on the GTEx data are available through a user-friendly browser at specs.cmgg.be.

Conclusions: SPECS is a non-parametric method that identifies known and novel specifically expressed genes. In addition, SPECS could be adopted for other features and applications.


Introduction
Let the index $d$ refer to a particular sample state (e.g. disease or cancer type), i.e. $d = 1, \ldots, m_d$, with $m_d$ diseases. State $d$ has prevalence $\pi_d$ in the target population. Suppose there are $m_g$ candidate features, i.e. $g = 1, \ldots, m_g$. If no prior knowledge on the prevalences is available, set $\pi_d = 1/m_d$.
Let $Y_{gd}$ denote the outcome of feature $g$ in group $d$. With $n_{gd}$ observations in this group, the individual outcomes are denoted by $Y_{gdi}$, $i = 1, \ldots, n_{gd}$. We use the notation $Y_{g,-d}$ to denote the outcome of feature $g$ in all groups but group $d$. When the description of the method applies to a single feature, the index $g$ will be dropped.
A feature is said to be a good classifier or discriminant for a given state if its outcome distribution for that particular state shows no (or only little) overlap with the outcome distributions for the other states. This corresponds to a large AUC. The AUC is related to the probabilistic index, which is given by
$$p_{gd} = P(Y_{g,-d} < Y_{gd}) + \tfrac{1}{2}\, P(Y_{g,-d} = Y_{gd}).$$
If $p_{gd}$ is very small (close to zero) or very large (close to one), the outcome distributions in the target state and the other groups are well separated.
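To make this concrete, here is a minimal Python sketch of the two-sample probabilistic index with the tie correction; it applies equally to $Y_{g,-d}$ versus $Y_{gd}$ and to the pairwise indices $P_{kd}$ introduced below. The function name prob_index is ours, and this is a sketch rather than the reference implementation from the SPECS repository.

```python
import numpy as np

def prob_index(y_k, y_d):
    """Estimate P(Y_k < Y_d) + 0.5 * P(Y_k = Y_d) over all sample pairs."""
    y_k = np.asarray(y_k, dtype=float)
    y_d = np.asarray(y_d, dtype=float)
    # Broadcast to an (n_k, n_d) matrix of pairwise comparisons and average.
    less = (y_k[:, None] < y_d[None, :]).mean()
    ties = (y_k[:, None] == y_d[None, :]).mean()
    return less + 0.5 * ties
```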
The index $g$ will be dropped now. For computational and flexibility reasons it is convenient to write the probabilistic index as
$$p_d = \sum_{k \neq d} \frac{\pi_k}{1 - \pi_d}\, P_{kd}, \qquad P_{kd} = P(Y_k < Y_d) + \tfrac{1}{2}\, P(Y_k = Y_d),$$
where the weights follow from $\sum_{l \neq d} \pi_l = 1 - \pi_d$. When the probabilistic index is estimated directly from the pooled sample data, it may be biased because the state-specific sample sizes may not be proportional to the prevalences in the target population.
In the next few paragraphs we will outline how to estimate the probabilistic indices and how their variances can be estimated consistently. The latter is important for a proper feature selection procedure.
Hence, an estimator of $p_d$ is given by
$$\hat p_d = \sum_{k \neq d} \frac{\pi_k}{1 - \pi_d}\, \hat P_{kd}, \qquad \hat P_{kd} = \frac{1}{n_k n_d} \sum_{i=1}^{n_k} \sum_{j=1}^{n_d} I_{ki;dj},$$
with $I_{ki;dj} = \mathbb{1}(Y_{ki} < Y_{dj}) + \tfrac{1}{2}\,\mathbb{1}(Y_{ki} = Y_{dj})$. For further purposes we need the sampling distribution of $\hat p_d$. We therefore write $\hat P_d^t = (\hat P_{kd})_{k \neq d}$ and $\pi^t = (\pi_1, \ldots, \pi_{m_d})$. It is a well-known result that each of the $\hat P_{kd}$ is asymptotically normal and that they are jointly asymptotically multivariate normal. Hence $\hat p_d$ is asymptotically normal too. Its mean will later be set to 1/2 under the null hypothesis, so at this stage we only need the covariance matrix of $\hat P_d$ and a consistent estimator of this covariance matrix.
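Building on the sketch above, the prevalence-weighted estimator $\hat p_d$ can be assembled as follows. The function name specificity_score, the groups dictionary, and the uniform-prevalence default are illustrative assumptions, not the SPECS API.

```python
def specificity_score(groups, d, prev=None):
    """Prevalence-weighted probabilistic index p_hat_d for target group d.

    groups : dict mapping group label -> 1-D array of outcomes
    d      : label of the target group
    prev   : dict of prevalences pi_k; defaults to uniform 1/m_d
    """
    labels = list(groups)
    if prev is None:
        prev = {k: 1.0 / len(labels) for k in labels}
    # Weights pi_k / (1 - pi_d) over the non-target groups k != d.
    denom = 1.0 - prev[d]
    return sum(prev[k] / denom * prob_index(groups[k], groups[d])
               for k in labels if k != d)
```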
For $k = l$: $\operatorname{Cov}(\hat P_{kd}, \hat P_{kd}) = \operatorname{Var}(\hat P_{kd})$.
For $k \neq l$:
$$\operatorname{Cov}(\hat P_{kd}, \hat P_{ld}) = \frac{1}{n_k n_l n_d^2} \sum_{i=1}^{n_k} \sum_{a=1}^{n_l} \sum_{j=1}^{n_d} \operatorname{Cov}(I_{ki;dj}, I_{la;dj}).$$
Estimates for $P_{kd}$ and $P_{ld}$ are known already. Hence, for the estimation of $\operatorname{Cov}(I_{ki;dj}, I_{la;dj}) = E_{kld} - P_{kd} P_{ld}$ we only need estimates of $E_{kld} = E(I_{ki;dj} I_{la;dj})$. An empirical estimator is given by
$$\hat E_{kld} = \frac{1}{n_k n_l n_d} \sum_{i=1}^{n_k} \sum_{a=1}^{n_l} \sum_{j=1}^{n_d} I_{ki;dj} I_{la;dj}.$$
With these ingredients the covariance matrix of $\hat P_d$ can be estimated; let $\hat\Sigma$ denote this matrix. Since $\hat p_d = w^t \hat P_d$ with weights $w_k = \pi_k / (1 - \pi_d)$, an estimate of the variance of $\hat p_d$ is then given by
$$\hat\sigma_d^2 = w^t \hat\Sigma w.$$
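The plug-in variance $\hat\sigma_d^2 = w^t \hat\Sigma w$ can be sketched as below. The text specifies the cross-covariances ($k \neq l$) but leaves the diagonal entries as plain variances; for $k = l$ we substitute a standard DeLong-type consistent variance estimator for a WMW-type statistic, which is our assumption rather than the formula used in SPECS.

```python
import numpy as np

def pairwise_indicators(y_k, y_d):
    """(n_k, n_d) matrix of I_{ki;dj} = 1(Y_ki < Y_dj) + 0.5 * 1(Y_ki = Y_dj)."""
    y_k, y_d = np.asarray(y_k, float), np.asarray(y_d, float)
    return (y_k[:, None] < y_d[None, :]) + 0.5 * (y_k[:, None] == y_d[None, :])

def var_p_hat(groups, d, prev=None):
    """Estimate Var(p_hat_d) = w' Sigma_hat w for target group d."""
    labels = [k for k in groups if k != d]
    n_d = len(groups[d])
    if prev is None:
        prev = {k: 1.0 / (len(labels) + 1) for k in groups}
    w = np.array([prev[k] / (1.0 - prev[d]) for k in labels])
    # One indicator matrix I_{ki;dj} per non-target group.
    I = {k: pairwise_indicators(groups[k], groups[d]) for k in labels}
    P = {k: I[k].mean() for k in labels}
    Sigma = np.empty((len(labels), len(labels)))
    for a, k in enumerate(labels):
        for b, l in enumerate(labels):
            if k == l:
                # DeLong-type estimator (our assumption): variability of the
                # per-observation placements in group k and in group d.
                n_k = I[k].shape[0]
                Sigma[a, b] = (I[k].mean(axis=1).var(ddof=1) / n_k
                               + I[k].mean(axis=0).var(ddof=1) / n_d)
            else:
                # E_kld averaged over the shared target observations j:
                # only pairs sharing the same Y_dj are correlated.
                E_kld = (I[k].mean(axis=0) * I[l].mean(axis=0)).mean()
                Sigma[a, b] = (E_kld - P[k] * P[l]) / n_d
    return float(w @ Sigma @ w)
```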

Selection of good discriminating features
We focus on a single state, say $d$, and reintroduce the index $g$ for features. Let $\hat p_{gd}$ be the estimate of $p_{gd}$, and $\hat\sigma^2_{gd}$ its variance estimate. In an empirical Bayesian setting we assume
$$p_{gd} \sim g(\cdot), \qquad g = 1, \ldots, m_g,$$
where $g(\cdot)$ is a density function (there will be no need to specify this explicitly).
From the asymptotic normality of $\hat p_{gd}$ we can assume $\hat p_{gd} \mid p_{gd} \sim N(p_{gd}, \sigma^2_{gd})$.
Let $f_d(\cdot)$ denote the marginal distribution of $\hat p_{gd}$ over all features $g = 1, \ldots, m_g$. Tweedie's formula (Efron, 2011) eventually gives
$$E(p_{gd} \mid \hat p_{gd}) = \hat p_{gd} + \sigma^2_{gd}\, \frac{d}{d\hat p_{gd}} \log f_d(\hat p_{gd}).$$
In this expression the variance $\sigma^2_{gd}$ can be replaced by its estimate, and $f_d(\cdot)$ can be replaced by a nonparametric density estimate based on all $\hat p_{gd}$, $g = 1, \ldots, m_g$; the latter only works when $m_g$ is sufficiently large. Let $\tilde p_{gd}$ denote the resulting estimate of $E\{p_{gd} \mid (\hat p_{gd})_{g=1,\ldots,m_g}\}$.
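A minimal sketch of this correction, assuming a Gaussian kernel density estimate for $f_d$ (with the default bandwidth of scipy.stats.gaussian_kde) and a central difference for the derivative of $\log f_d$; the helper name tweedie_correct is hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

def tweedie_correct(p_hat, sigma2, eps=1e-4):
    """Empirical-Bayes shrinkage of probabilistic indices via Tweedie's formula.

    p_hat  : array of estimated indices p_hat_gd, one per feature
    sigma2 : array of their estimated variances
    """
    f = gaussian_kde(p_hat)  # nonparametric estimate of the marginal f_d
    # Central-difference approximation of d/dp log f_d at each p_hat.
    score = (f.logpdf(p_hat + eps) - f.logpdf(p_hat - eps)) / (2 * eps)
    return p_hat + sigma2 * score
```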
Features can now be selected by ranking the $\tilde p_{gd}$ and selecting the largest (closest to one) or the smallest (closest to zero). The empirical Bayes procedure protects against selection bias (Efron, 2011).
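Putting the pieces together, selection then reduces to sorting the corrected scores; the input arrays below are hypothetical outputs of the previous steps.

```python
p_tilde = tweedie_correct(p_hat, sigma2)   # p_hat, sigma2 from earlier steps
order = np.argsort(p_tilde)
most_specific = order[::-1][:25]    # p_tilde closest to one: high in state d
least_specific = order[:25]         # p_tilde closest to zero: low in state d
```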