Bayesian inference of the number of factors in geneexpression analysis: application to human virus challenge studies
 Bo Chen†^{1},
 Minhua Chen†^{1},
 John Paisley^{1},
 Aimee Zaas^{2},
 Christopher Woods^{2},
 Geoffrey S Ginsburg^{2},
 Alfred HeroIII^{3},
 Joseph Lucas^{2},
 David Dunson^{4} and
 Lawrence Carin^{1}Email author
DOI: 10.1186/1471210511552
© Chen et al; licensee BioMed Central Ltd. 2010
Received: 01 September 2009
Accepted: 09 November 2010
Published: 09 November 2010
Abstract
Background
Nonparametric Bayesian techniques have been developed recently to extend the sophistication of factor models, allowing one to infer the number of appropriate factors from the observed data. We consider such techniques for sparse factor analysis, with application to geneexpression data from three virus challenge studies. Particular attention is placed on employing the Beta Process (BP), the Indian Buffet Process (IBP), and related sparsenesspromoting techniques to infer a proper number of factors. The posterior density function on the model parameters is computed using Gibbs sampling and variational Bayesian (VB) analysis.
Results
Timeevolving geneexpression data are considered for respiratory syncytial virus (RSV), Rhino virus, and influenza, using blood samples from healthy human subjects. These data were acquired in three challenge studies, each executed after receiving institutional review board (IRB) approval from Duke University. Comparisons are made between several alternative means of performing nonparametric factor analysis on these data, with comparisons as well to sparsePCA and Penalized Matrix Decomposition (PMD), closely related nonBayesian approaches.
Conclusions
Applying the Beta Process to the factor scores, or to the singular values of a pseudoSVD construction, the proposed algorithms infer the number of factors in geneexpression data. For real data the "true" number of factors is unknown; in our simulations we consider a range of noise variances, and the proposed Bayesian models inferred the number of factors accurately relative to other methods in the literature, such as sparsePCA and PMD. We have also identified a "panviral" factor of importance for each of the three viruses considered in this study. We have identified a set of genes associated with this panviral factor, of interest for early detection of such viruses based upon the host response, as quantified via geneexpression data.
I. Background
When performing geneexpression analysis for inference of relationships between genes and conditions/phenotypes, one typically must analyze a small number of samples, each composed of expression values from tens of thousands of genes. In this setting the observed data is X ∈ ℝ^{p×n}, where each column corresponds to one of n samples, quantifying the associated geneexpression values for all p genes under investigation. We typically must address the "large p, small n" problem [1], in which often n ≪ p. Therefore, to yield reliable inference, one must impose strong restrictions on the form of the model.
When developing regression and classification models for geneexpression data, a widely employed assumption (restriction) is that the model parameters are sparse, implying that only a small subset of the genes are important for prediction. If only a small set of genes (≪ p) are responsible for differences in disease groups, then reliable inference may often be performed even when n ≪ p. Example approaches that have taken this viewpoint are lasso [2], the elastic net [3], and related Bayesian approaches [4]. In fact, sparse regression and classification algorithms are widely used in many statistics and machinelearning applications, beyond gene analysis [5–7].
where A ∈ ℝ^{p×r}, S ∈ ℝ^{r×n} and E ∈ ℝ^{p × n}; if covariates are available they may also be considered in the model [11], with none assumed here. Note that here and henceforth we assume that the geneexpression data are centered in advance of the analysis; otherwise, there should be an intercept added to the model. Considering the j th sample, x_{ j } , corresponding to the j th column of X, the model states that x_{ j } = As_{ j } + e_{ j } , where s_{ j } and e_{ j } are the j th columns of S and E, respectively.
The columns of A represent the factor "loadings", and rows of S are often called factors. To address the fact that n ≪ p, researchers have typically imposed a sparseness constraint on the columns of A [11], with the idea that each column of A should ideally (in the gene application) correspond to a biological "pathway", which should be defined by a relatively small number of correlated genes. Within Bayesian formalisms, the sparse columns of A are typically imposed via spikeslablike priors [1], [11], or alternatively via shrinkage (e.g., Studentt [6]) priors. Several nonBayesian approaches have also been introduced, including sparsePCA [12] and the related Penalized Matrix Decomposition (PMD) [13].
A problem that is receiving increased attention in factoranalysisbased approaches is a means of defining an appropriate number of factors (i.e., to infer r). The nonBayesian approaches are often sequential, and one may infer r by monitoring the error E_{ F } as a function of iteration number [12], [13]. In many previous Bayesian approaches r has just been set [11], and presumably many nonbiologicallyrelevant factor loadings are inferred. A computationally expensive reversejump MCMC approach has been developed [14], with computational efficiency improved in [15] while also considering a default robust prior specification. Perhaps the most widely employed approach [16–18] for choosing r is the Bayesian information criteria (BIC). A disadvantage is that conditioning on a fixed choice of the number of factors ignores uncertainty and the BIC is not well justified in hierarchical models, as the number of parameters is unclear.
There has been recent interest in applying nonparametric Bayesian methods [8], [9] to infer r (in fact, a posterior distribution on r), based on the observed data X. An example of recent research in this direction employs the Indian Buffet Process (IBP) [19], [20]. In this paper we also consider the Beta Process (BP), recognizing that the BP and IBP are closely linked [21], [22].
For data sets with very large p (e.g., 10,000 or more), computational efficiency is of major practical importance. In previous use of nonparametric Bayesian methods to this problem, a Gibbs sampler has typically been employed [11]. The BPbased formulation admits a relatively simple variational Bayesian (VB) [23] approximation to the posterior, which is considerably faster than Gibbs sampling. An advantage of a VB analysis, in addition to speed, is that convergence is readily monitored (for the Gibbs sampler there are typically challenges when assessing convergence). We perform a comparison of the difference in inferred model parameters, based on VB and Gibbs analysis.
The specific data on which the models are employed correspond to geneexpression data from recent viral challenge studies. Specifically, after receiving institutional review board (IRB) approvals from Duke University, we performed three separate challenge studies, in which individuals were inoculated with respiratory syncytial virus (RSV), Rhino virus, and influenza. Informed consent was used in all studies. Using blood samples collected sequentially over time, we have access to geneexpression data at preinoculation, just after inoculation, and at many additional time points up to the point of full symptoms (such data were collected on all subjects, although not all became symptomatic). Using these data, we may investigate timeevolving factor scores of samples, to examine how the response to the virus evolves with time. Of particular importance is an examination of the factors of importance for individuals who became symptomatic relative to those who did not. In the factor analysis we consider data individually for each of the three viruses (at all times), as well as for all three viruses in a single analysis (seeking panviral factors). Results are generated based on nonparametric Bayesian approaches to factor analysis, employing the Beta Process, the Indian Buffet Process, and a related sparsenessconstrained pseudoSVD construction (a Bayesian construction of sparsePCA [12]). We also make comparisons to the nonBayesian Penalized Matrix Decomposition (PMD) [13].
II. Results
A. Brief summary of models
We first provide a brief intuitive explanation of the workings of the different Bayesian models considered. These models are built around the Indian buffet process (IBP) [19], so named for the following reason. In the factor model of (1), the columns of A represent factor loadings in which the geneexpression values for sample j are expressed: x_{ i } = As_{ j } + e_{ j } . One construction of the IBP constitutes a set of candidate columns of A, and these are termed "dishes" at an Indian "buffet". Each of the n samples {x_{ j }}_{j = 1,n}correspond to "customers" at the buffet; each customer selects a subset of dishes from the buffet (i.e., selects a subset of candidate columns of A). The IBP is constructed such that the more a particular dish (column of A) is used by a subset of customers {x_{ j }}_{ j } _{= 1,}_{ n }, the more probable it is that it will be used by other customers. Thus, the IBP imposes the idea that many of the samples {x_{ j }}_{ j } _{= 1,}_{ n } will utilize the same subset of columns of A, but each sample may also utilize idiosyncratic factor loadings, representing unique characteristics of particular samples. The IBP construction does not impose a total number of factors for the data {x_{ j }}_{ j } _{= 1,n}, with this number inferred by the analysis. Thus, the IBP is a natural Bayesian method for inferring the number of factors appropriate for representing all observed data {x_{ j }}_{ j } _{= 1,n}. A convenient means of implementing the IBP employs the Beta process (BP) [21].
There are multiple ways in which one may utilize the IBP/BP within the factor model, with three such methods considered here: (i) the BP is applied to the factor scores S (termed below the BP construction), (ii) the IBP is employed on the factor loadings A [8] (termed below the IBP construction), and (iii) a BPlike construction is employed to implement a Bayesian construction of a singularvalue decomposition of X (termed below the pseudoSVD construction). To realize the approximate posterior density function for the parameters of these models, we have considered both MCMC and VB computational methods. The specifics of the BP, IBP and pseudoSVD methods, as well as computational details, are provided in Section IV.
B. Synthesized Data
The first validation example we considered was taken from [8]. In this example the genefactor connectivity matrix of an Ecoli network is employed to generate a synthetic dataset having 100 samples of 50 genes and 8 underlying factors. The data had additive white Gaussian noise with a signaltonoiseratio of 10. For this very smallscale example we considered all three Bayesian methods (BP, IBP and pseudoSVD); in each case we considered both MCMC and VB methods for inferring the posterior density function. We also considered the nonBayesian PMD and sparsePCA [13], [24]. All methods performed well in uncovering the proper number of factors, and in capturing the proper genes associated with each factor. For brevity we do not provide further details on this example. While it is worthy of consideration because it was considered in related published research [8], its smallscale nature (only 50 genes) makes it less relevant for the largescale real application we consider below. Therefore, in the next synthetic example we consider a much largerscale problem, and consequently for that problem we were unable to test against the IBP method.
The synthetic data were generated as follows. A total of p = 10, 000 features ("genes") are employed, and the expression value for these p genes was constituted using five factors (r = 5) plus a noise term E (i.e., via the model in (1)). For each of the five factors, a unique set of 50 genes were selected and were given a factorloading value of one. In addition, ten more genes were selected, with these shared among all five factors (again with unitamplitude contribution to the factor loadings). Thus, a total of 260 genes contributed nonzero loadings to at least one of the five factors. For all other genes the factorloading contribution was set to zero. The above construction defines the sparse matrix A in (1). The components of S ∈ ℝ^{r×n}, for n = 150 samples, are drawn i.i.d. from$\mathcal{N}$ (0, 1). The elements of the noise matrix E are drawn i.i.d. from $\mathcal{N}(0,{\alpha}_{0}^{1})$. The data X were then utilized within the various factoranalysis models, with the datageneration process repeated 100 independent times (100 different X), with mean and standarddeviation results presented on the inferred model parameters (discussed below), based on the 100 runs.
We consider a range of noise variances 1/α_{0} to constitute E, to address model performance as a function of the signaltonoise ratio (SNR). As one definition of SNR, one may consider the average energy contributed from a nonzero gene to a particular factor, relative to the energy in the noise contribution for that gene, from E. Based on the fact that the nonzero components of A have unit amplitude, and the components of S are drawn from $\mathcal{N}$ (0, 1), on average (across many samples) the energy contributed by a nonzero gene to a particular factor is one. The average noise energy contributed to each gene is 1/α_{0}. Hence, the ratio of these two quantities, α_{0}, may be considered as a measure of SNR. Other measures of SNR may be defined with respect to this model, each of which will be defined in terms of α_{0}.
Concerning sparsePCA [12] (and PMD, not shown), every effort was made to optimize the model parameters for this task. Our experience is that, while sparsePCA and PMD [13] are very fast algorithms, and generally quite effective, they are not as robust to noise as the Bayesian methods (the Bayesian methods are also less sensitive to parameter settings). It is possible that the sparsePCA and PMD results could be improved further if the model parameters are optimized separately for each noise level (and the Bayesian results may also be improved with such tuning). However, the model parameters were fixed for all noise variances considered (since the noise variance is often not known a priori, with the sparsePCA carefully tuned to achieve the best results in such a circumstance.
We also performed additional simulated examples of the type discussed above, the details of which are omitted for brevity. In those experiments the different genes did not have the same noise variance. The Bayesian methods, which as indicated above infer the noise variance separately for each gene, performed as well as in Figures 1 and 2. However, the sparsePCA and PMD models performed relatively poorly in this case, since they assume the same noise variance for all genes. The assumption of a constant noise variance for each gene may not be as appropriate for real data.
C. Details on Data Collections for Three Viral Challenge Studies
We considered three cohorts of healthy volunteers experimentally infected with either rhinovirus, respiratory syncytial virus (RSV) or influenza A; these three challenges were performed separately, with no overlap in the subjects. All exposures were approved by the Duke University institutional review board and conducted according to the Declaration of Helsinki. The three challenges are briefly summarized here, with further details provided in [25].
Human Rhinovirus cohort
We recruited 20 healthy volunteers via advertisement to participate in the rhinovirus challenge study through an active screening protocol at the University of Virginia (Charlottesville, VA). On the day of inoculation, 10^{6} TCID50 GMP rhinovirus (Johnson and Johnson) was inoculated intranasally. Subjects were admitted to the quarantine facility for 48 hours following rhinovirus inoculation and remained in the facility for 48 hours following inoculation. Blood was sampled into PAXGene™blood collection tubes at predetermined intervals post inoculation. Nasal lavage samples were obtained from each subject daily for rhinovirus titers to accurately gauge the success and timing of the rhinovirus inoculation. Following the 48th hour post inoculation, subjects were released from quarantine and returned for three consecutive mornings for sample acquisition and symptom score ascertainment.
Human RSV cohort
A healthy volunteer intranasal challenge with RSV A was performed in a manner similar to the rhinovirus intranasal challenge. The RSV challenge was performed at Retroscreen Virology, Ltd (Brentwood, UK) using 20 prescreened volunteers who provided informed consent. On the day of inoculation, a dose of 10^{4} TCID50 respiratory syncytial virus (RSV; serotype A) manufactured and processed under current good manufacturing practices (cGMP) by Meridian Life Sciences, Inc. (Memphis, TN USA) was inoculated intranasally per standard methods. Blood and nasal lavage collection methods were similar to the rhinovirus cohort, but continued throughout the duration of the quarantine. Due to the longer incubation period of RSV A, subjects were not released from quarantine until after the 165th hour and were negative by rapid RSV antigen detection (BinaxNow Rapid RSV Antigen; Inverness Medical Innovations, Inc).
Influenza cohort
A healthy volunteer intranasal challenge with influenza A/Wisconsin/67/2005 (H3N2) was performed at Retroscreen Virology, LTD (Brentwood, UK), using 17 prescreened volunteers who provided informed consent. On the day of inoculation, a dose of 106 TCID50 Influenza A manufactured and processed under current good manufacturing practices (cGMP) by Bayer Life Sciences, Vienna, Austria was inoculated intranasally per standard methods at a varying dose (1:10, 1:100, 1:1000, 1:10000) with four to five subjects receiving each dose. Due to the longer incubation period of influenza as compared to rhinovirus, subjects were not released from quarantine until after the 216th hour. Blood and nasal lavage collection continued throughout the duration of the quarantine. All subjects received oral oseltamivir (Roche Pharmaceuticals) 75 mg by mouth twice daily prophylaxis at day 6 following inoculation. All patients were negative by rapid antigen detection (BinaxNow Rapid Influenza Antigen; Inverness Medical Innovations, Inc) at time of discharge.
For each viral challenge, subjects had samples taken 24 hours prior to inoculation with virus (baseline), immediately prior to inoculation (prechallenge) and at set intervals following challenge. For the rhinovirus challenge, peripheral blood was taken at baseline, then at 4 hour intervals for the first 24 hours, then 6 hour intervals for the next 24 hours, then 8 hour intervals for the next 24 hours, and then 24 hour intervals for the remaining 3 days of the study. For the RSV and influenza challenges, peripheral blood was taken at baseline, then at 8 hour intervals for the initial 120 hours, and then 24 hours for the remaining 2 days of the study. All results presented here are based on geneexpression data from blood samples. For the RSV and Rhino virus cases not all blood samples were converted to gene expression values, as a costsaving measure. Hence, for these two cases the gene expression data are not sampled as finely in time as are the influenza data.
In the statistical analysis, the matrix X in (1) has columns that correspond to the n samples; n = n_{ s }n_{ t }, with n_{ s } representing the number of subjects and n_{ t } the number of sample time points. We do not impose a prior on the timedependence of the factors scores, and uncover this time dependence via the inferred posterior distribution of factor scores S.
D. Analysis of influenza data
The geneexpression data consisted of over p = 12, 000 genes, and consequently we found that the IBP approach developed in [8] was computationally intractable. We found that the VB and MCMC results were generally in good agreement for this real data, and therefore the two very distinct computational tools served to crossvalidate each other. The VB and MCMC computations also required similar CPU time (for the number of Gibbs iterations considered); while the VB analysis required far fewer iterations to converge, each iteration is significantly more expensive than that associated with the Gibbs sampler.
For brevity, we here focus exclusively on MCMC solutions when considering Bayesian analysis. Results are presented using the BP and pseudoSVD methods, as well as via PMD [13] (similar results were computed using sparsePCA [24]). We note that the design of each the experiments involves samples from the same subjects observed at multiple time points (with different subjects for the three viruses). Therefore, the assumption within the models that the samples at different times are statistically independent may warrant reconsidering in future studies. This subject has been considered in related work [26], although that research assumes a known factor structure and Gaussian latent factors.
At each time point, there are data from 17 subjects (the same individuals were sampled at a sequence of times). The horizontal axis in Figure 3 corresponds to a sequence of groups of data, proceeding in time from inoculation, with generally 17 samples per time point (all data will be released for other investigators to experiment with). The blue points correspond to samples of individuals who eventually became symptomatic, and the red points to asymptomatic individuals.
The vertical axis in these plots corresponds to the factor score associated with the respective sample. We observe in Figure 3 that Factor 1 (the factor indexing is arbitrary) provides a clear discriminator of those who will become symptomatic, particularly as time proceeds (note that the model is completely unsupervised, and therefore this discriminating power was uncovered without using label information).
Note that in each case there appears to be one factor that clearly distinguishes symptomatic vs. asymptomatic, particularly as time increases. Upon examining the important genes in each of these factors, one recognizes a high level of overlap (suggesting consistency between the different methods). Further discussion of the associated genes and their biological significance is provided in [25].
E. Panviral factors
We now consider a "panviral" analysis, in which data from all three viruses are analyzed jointly. For further conciseness, for this example we only present results for the BP applied to the factor scores; similar results were obtained with the Bayesian pseudoSVD framework and by PMD.
It is also of interest to consider Factors 1 and 2 in Figure 6. Each of the samples from the individual viruses is offset by a distinct amplitude, almost entirely independent of whether the sample was symptomatic or asymptomatic. This phenomenon associated with Factors 1 and 2 in Figure 6 is attributed to challengestudydependent offsets in the data (the geneexpression values were obtained separately for each of these studies, and the data normalized separately), which account for different normalizations of the data between the three distinct viral challenges. This underscores that not all factors have biological significance, with some a consequence of the peculiarities of geneexpression data (studydependent offsets in normalization). The other factoranalysis methods (omitted here for brevity) produced very similar normalizationrelated factors.
We have here identified many genes that are inferred to be connected with the viruses under study. It has been observed, by the medical doctors on our research team, that the inferred genes are closely aligned with relevant known pathways, with this discussed in detail in [25].
III. Conclusions
We have examined two distinct but related objectives. First, in the context of Bayesian factor analysis, we have examined three ways of inferring an appropriate number of factors. Each of these methods is based on a different means of leveraging the utility of the Beta Process, and the closely related Indian Buffet Process (IBP). In the context of such models, we have examined inference based on variational Bayesian analysis, and based on a Gibbs sampler. We have also compared these Bayesian approaches to stateoftheart nonBayesian factor models.
The second contribution of this paper is the introduction of a new set of geneexpression data, from three timeevolving viral challenge studies. These data allow one to examine the timeevolution of Rhino virus, RSV and InfluenzaA. In addition to the geneexpression data, we have also recorded clinical symptom scores, to which the geneexpression analysis may be compared. With the limited space available here, we have presented results on the Influenza data alone, and for all three viruses together (a "panviral" analysis).
Based on this study, we may make the following observations. For the number of Gibbs iterations deemed necessary, the VB and MCMC inference approaches required comparable computation time (VB was slightly faster, but not substantially). Although VB requires far fewer iterations (converges typically in 50 iterations), each VB iteration is significantly more expensive than that associated with MCMC. The advantage of using these two very distinct computational methods on the models considered is that they serve to crossvalidate each other (providing confidence in the results, when these two very different methods agree, as they generally did in the studies considered).
Of the three methods of inferring the number of factors, the IBP applied to the factor loadings works well for smallscale problems, but it is computationally intractable for the largescale viral data considered here. Applying the Beta Process to the factor scores, or to the singular values of a pseudoSVD construction, yields reliable and highquality results.
It is not our purpose to provide a detailed (perhaps philosophical) discourse on the relative merits of Bayesian and nonBayesian approaches. However, we observed that the nonBayesian Penalized Matrix Decomposition (PMD) yielded very highquality results, as long as the model parameters were set carefully via crossvalidation; very similar phenomenon was observed for the closely related sparsePCA. Both PMD and sparsePCA infer an appropriate number of factors, but one must very carefully set the stop criterion. Since PMD and sparsePCA are much faster than the Bayesian approaches, perhaps a good compromise is to use the output of these models to initialize the Gibbs sampler in a Bayesian solution (this is a subject for future research).
Concerning the viral data, it was observed that all methods were able to infer a factor that was capable of distinguishing those subjects who would become symptomatic from those who would not. It was possible to infer a "panviral" factor, that was discriminative for all viruses considered.
The evolution of the factor scores tracked well the recorded clinical symptom scores. Further, for the discriminative factor, there was a good association between the genes inferred as important and the associated biology [25] (with interpretation provided by the medical doctors on our research team).
IV. Methods
A. Basic sparse factor model
Recall the factor model in (1); r defines the number of factors responsible for the data X, and it is not known in general, and must be inferred. Within the analysis we will consider K factors (K columns of A), with K set to a value anticipated to be large relative to r. We then infer the number of columns of A needed to represent the observed data X, with this number used as an estimate of r. Since we will be performing a Bayesian analysis, we will infer a posterior density function on r. Henceforth we assume A has K columns, with the understanding that we wish to infer the r <K columns that are actually needed to represent the data.
Let a_{ k } represent the k th column of A, for k = 1, . . . , K, and e_{ j } and s_{ j } represent respectively the j th columns of E and S (with j = 1, . . . , n). Within the imposed prior, vectors e_{ j } and s_{ j } are generated as s_{ j } ~ $\mathcal{N}$(0, I_{ K }), and ${e}_{j}~\mathcal{N}(0,\text{diag}({\psi}_{1}^{1},\phantom{\rule{0.1em}{0ex}}\dots ,\phantom{\rule{0.1em}{0ex}}{\psi}_{p}^{1}))$ ; I_{ K } is the identity matrix and the precisions (ψ_{1}, . . . , ψ_{ p }) are all drawn i.i.d. from a gamma prior.
where (a, b) are selected as to strongly favor w_{ lk } → 1, δ_{0} is a distribution concentrated at zero, and l = 1, . . . , p. The advantage of (2) is that sparseness is imposed explicitly (many components of a_{ k } are exactly zero).
but now with (e, f) selected as to constitute a Studentt sharply peaked about zero. One may employ a similar construction to impose a doubleexponential (Laplace) sparsenesspromoting prior [4].
B. Beta process for inferring number of factors
The Beta Process (BP) was first developed by Hjort for survival data [28], and more recently it has found many other applications and extensions [19–21]. We here seek to provide a simple discussion of how this construction may be of interest in inferring an appropriate number of factors in factor modeling [22]. Our goal is to use the BP construction, which is closely related to the Indian buffet process (IBP) [19–21], to infer the number of factors r based on the observed data X.
We seek to link our construction explicitly to the factor model, and therefore a_{ k } is the k th candidate factor loading (column of A), and H_{0} is defined by the construction in (2) or (3), depending upon which model is used. The expression π_{ k } represents the probability that a_{ k } is used to represent any particular data sample, defined by the columns of X. The expression δ_{ ak } represents a unit point measure concentrated at a_{ k }.
where the H in BeP(H) is drawn H ~ BP(α, β, H_{0}), as defined in (4). As discussed further below, if z_{ kj } = 1 then a_{ k } is used as a factor loading to represent x_{ j }, the j th column of X; if z_{ kj } = 0, a_{ k } is not used to represent x_{ j }. In other words, B_{ j } is a sum of point measures (δ_{ ak } is a unit point measure concentrated at a_{ k }), and the binary variables z_{ kj } denote whether specific δ_{ ak } are employed within B_{ j }. More details on such constructions may be found in [21].
To make a connection to the introductory comments in Section IIA, and to relate the model to the IBP [19], we consider the above construction in the limit K → ∞. Further, we marginalize (integrate) out the probabilities (π_{1}, . . . , π_{ K } ) used to constitute the BP draw H; we retain the K candidate factor loadings {a_{ k }}_{k = 1,K}used to define A, as drawn from the BP. Recall that x_{ j } represents the j th data sample (j th column of X). We assume that the data samples ("customers") select from among "dishes" at a "buffet", with the dishes defined by {a_{ k }}_{k = 1,K}. Data sample x_{1} enters the buffet first, and selects the first ν_{1} dishes a_{1}, . . . , a_{ ν }_{1} , where ν_{1} is a random variable drawn from Possion(α/β). Therefore, the first column of S has the first ν_{1} elements as nonzero, with the remaining elements in that column set to zero. The second "customer" x_{2} then enters the buffet, and selects from among the first ν_{1} dishes; the probability that x_{2} selects a_{ k }, for each of k ∈ {1, . . . , ν_{1}}, is 1/(β + 1); i.e., z_{k 2}~ Bernoulli(1/(β + 1)), for k ∈ {1, . . . , ν_{1}}. Customer x_{2} also selects ν_{2} new dishes {a_{ ν }_{1+1} . . . , a_{ ν }_{1+ν 2}}, with ν_{2} ~ Possion(α /(β + 1)). Hence, z_{k 2}= 1 for k ∈ {ν_{1} + 1, . . . , ν_{1} + ν_{2}}, and unless stated explicitly otherwise, all other components of z_{ j }are zero. This process continues sequentially, with each x_{ j } entering the buffet in ascending order of j. Sample x_{ J } , with J ∈ {1, . . . , n} selects dishes as follows. Let ${C}_{J1}={\displaystyle {\sum}_{j=1}^{J1}{v}_{j}}$ represent the cumulative number of dishes selected off the buffet, among the previous customers {x_{1}, . . . , x_{ J }_{−1}}. Further, let m_{J−1,k}≥ 1 represent the total number of times dish a_{ k } has been selected by previous customers {x_{1}, . . . , x_{ J }_{−1}}, for k ∈ {1, . . . , C_{J−1}}. Then x_{ J } selects dish a_{ k }, k ∈ {1, . . . , C_{J−1}}, with probability m_{J−1,k}/(β + J − 1); i.e., ${z}_{k,J}~\text{Bernoulli}\left(\frac{{m}_{J1,k}}{\beta +J1}\right)$ for k ∈ {1, . . . , C_{J−1}}. Note that the more "popular" a_{ k } among the previous J − 1 customers (i.e., larger m_{J−1,k}), the more probable it is that it will be selected by x_{ J } . Additionally, x_{ J } selects new dishes a_{ k } for k ∈ {C_{J−1}+ 1, . . . , C_{J−1}+ ν_{ J }}, where ${v}_{J}~\text{Poisson}\left(\frac{\alpha}{\beta +J1}\right)$. Therefore we have z_{ k, J } = 1 for k ∈ {C_{J−1}+1, . . . , C_{J−1}+ ν_{ J }}. Thus, each new customer selects from among the dishes (factor loadings) already selected by at least one previous customer, and the more "popular" one of these dishes is, the more probable it is that the new customer will select it. Further, a new customer will also select additional dishes (factor loadings) not selected by any of the previous customers. However, note that as J increases, the draws ${v}_{J}~\text{Poisson}\left(\frac{\alpha}{\beta +J1}\right)$ are likely to be decreasing in size, since $\frac{\alpha}{\beta +J1}$ is getting smaller with increasing J. Therefore, although K → ∞, a finite subset of the candidate dishes (factor loadings) {a_{ k }}_{k = 1,K}will be used among the n customers, defined by the columns of X, thereby imposing sparseness in the use of factor loadings. This model is also fully exchangeable, in that the order of the columns of X may be permuted, with no change in the properties of the prior [19]. The model imposes that many of the n samples will share the same set of factors, but the model is flexible enough to allow idiosyncratic (sampledependent) factor usage.
In practice K is finite, and therefore it is also if interest to consider the properties of this prior for finite K. For finite K, one may show that the number of nonzero components of z_{ j } is drawn from Binomial(K, α /(α + β(K − 1))), and therefore one may set α and β to impose prior belief on the number of factors that will be important. The expected number of nonzero components in z_{ j } is αK/[α + β(K − 1)].
To complete the model specifications, note that a_{ k } from the BetaBernoulli construction above defines the k th column of the factorloading matrix A. The factorscore matrix S utilizes the binary vectors z_{ j } = (z_{1j}, . . . , z_{ Kj })^{ T } defined in (5), for j ∈ {1, . . . , n}. Specifically, we define the j th column of S as ${s}_{j}={\widehat{s}}_{j}\circ {z}_{j}$ (∘ represents a pointwise, or Hadamard product), with ${\widehat{s}}_{j}~\mathcal{N}(0,\phantom{\rule{0.1em}{0ex}}{\text{I}}_{K})$ . The vector product ${\widehat{s}}_{j}\circ {z}_{j}$ selects a subset of the components in ${\widehat{s}}_{j}$ , setting the rest to zero, and therefore the columns of S are sparse.
C. Sparse factor modeling with BP/IBP placed on factor loadings
In the above discussion the BetaBernoulli and IBP processes were presented for a specific construction of the factoranalysis model, with the goal of making the connection to the model explicit and hence clearer. However, there are alternative ways of utilizing the IBP for design of factor models. Specifically, rather than using the binary vectors to construct S, as above, they may alternatively be used to define A, with factor scores designed as in traditional factor models. This approach was considered in [8], using an Indian Buffet Process (IBP) construction (explicitly using the marginalization discussed above). A limitation of this approach is that one must perform p draws from the IBP to construct A, and typically p is very large for the geneexpression problems of interest. When presenting results in Section IIB, we discuss our experience with this model on smallscale problems, although this approach was found computationally intractable for the motivating virus studies considered in Section IID.
D. Constructing pseudo singular values
The matrix product AS in (1) is now constituted as $\sum}_{k=1}^{K}{\lambda}_{k}{\mathit{a}}_{k}{\mathit{\xi}}_{k}^{T$ . The nonzero components of λ select the columns of A used across all columns of X. As discussed in Section IV, the number of nonzero components of λ is drawn Binomial(K, α/(α + β(K − 1))), and the posterior on the number of such components provides desired information on the number of factors r. Note that this construction is like the BetaBernoulli process discussed above, in that it utilizes π_{ k } ~ Beta(α/K, βK/(K − 1)) and the Bernoulli distribution; however, it only draws the binary vector z once, and therefore there is not the idea of multiple "customers", as in the two IBPrelated formulations discussed above.
E. Computational issues, model quality and hyperparameter settings
Relative CPU times of the different models, implemented on a pc, as applied to the influenza data. the pmd method required a few minutes.
CPU Time VB (hours)  CPU Time MCMC (hours)  

BPFA  0.5  4.87 
Bayesian PseudoSVD  0.11  3.47 
FA in [11]  0.11  4.87 
To be explicit, we provide detailed hyperparameter settings for the model in (7)(14); the other models are set similarly. Specifically, α = 1, β = 1, c = 1, and d = g = h = e = f = 10^{−6}. These parameters were not optimized, and were set in the same manner for all experiments. Although the PMD model is a nonBayesian method, it also has parameter settings that must be addressed carefully; two hyperparameters need adjusting: the sparseness threshold and the stop condition [13]. In all PMD experiments, we set the sparseness threshold as 4, and the PMD iterations were terminated when the reconstruction error was smaller than 5%.
All calculations were performed on PCs with Intel Pentium Dual E2200 processors and 2.00 GB memory, and all software was written in Matlab. For the largescale analysis performed on the real data discussed above, MCMC required approximately 4 hours of CPU, while VB required 3 hours (per analysis).
Appendix: Gibbs and Variational Bayesian Analysis
where i = 1, . . . , n, j = 1, . . . , p and k = 1, . . . , K.
Gibbs sampler
The sequential update equations of the Gibbs sampler are as follows.

Sample each entry of the binary matrix, z_{ ki }. The probability of z_{ ki }= 1 is expressed as$p({z}_{ki}=1\mathit{X},\phantom{\rule{0.1em}{0ex}}{\mathit{Z}}_{ki},\phantom{\rule{0.1em}{0ex}}\mathit{A},\phantom{\rule{0.1em}{0ex}}\mathit{S},\phantom{\rule{0.1em}{0ex}}\psi )\propto \mathrm{ln}({\pi}_{k})\frac{1}{2}({A}_{k}^{T}\text{diag}(\psi ){A}_{k}{s}_{ki}^{2}2{A}_{k}^{T}\text{diag}(\psi ){X}_{i}^{k}{s}_{ki}).$

Sample π_{ k }from p(π_{ k }) = Beta(π_{ k };α',β') where $\alpha \prime ={\displaystyle {\sum}_{i=1}^{n}{z}_{ki}}+\frac{\alpha}{K}$ and $\beta \prime =n+\frac{\beta (K1)}{K}{\displaystyle {\sum}_{i=1}^{n}{z}_{ki}}$.

Sample each entry of factor loading matrix, A_{ jk }from p(A_{ jk }−) = $\mathcal{N}$ (A_{ jk }; μ_{ jk }, Σ_{ jk }) where ${\Sigma}_{jk}={\left[{\displaystyle {\sum}_{i=1}^{n}{\psi}_{j}}{s}_{ki}^{2}{z}_{ki}^{2}+{\gamma}_{jk}\right]}^{1},\phantom{\rule{0.1em}{0ex}}{\mu}_{jk}={\Sigma}_{jk}({\displaystyle {\sum}_{i=1}^{n}{\psi}_{j}}{z}_{ki}{s}_{ki}{\mathit{X}}_{ji}^{k})$ , and ${\mathit{X}}_{ji}^{k}={x}_{ji}{\displaystyle {\sum}_{l=1,l\ne k}^{K}{A}_{jl}{z}_{li}{s}_{li}}$.

Sample each column of factor score matrix, s_{ i }, from p( s_{ i }−) = $\mathcal{N}$ (s_{ i }; ξ_{ i }, Λ_{ i }) where ${\lambda}_{i}={[({A}^{T}\circ {\tilde{Z}}_{i})diag(\psi )(A\circ {\tilde{Z}}_{i}^{T})+\delta {\text{I}}_{K}]}^{1},\phantom{\rule{0.1em}{0ex}}{\xi}_{i}={\lambda}_{i}(A\circ {\tilde{Z}}_{i})\text{diag}(\psi ){x}_{i}$, and ${\tilde{Z}}_{i}=[{z}_{i},\phantom{\rule{0.1em}{0ex}}\dots ,\phantom{\rule{0.1em}{0ex}}{z}_{i}]$ with the Kdimensional vector,z_{ i }, repeated p times, 1 ≤ i ≤ n.

Sample ψ_{ j }from $p({\psi}_{j})=\text{Gamma}({\psi}_{j}\phantom{\rule{0.1em}{0ex}};\phantom{\rule{0.1em}{0ex}}{g}_{j}^{\prime},\phantom{\rule{0.1em}{0ex}}{h}_{j}^{\prime})$ where ${g}_{j}^{\prime}=g+\frac{n}{2}$ and ${h}_{j}^{\prime}=h+\frac{1}{2}{\displaystyle {\sum}_{i=1}^{N}(\Vert {x}_{ji}}{A}_{j}({z}_{i}\circ {s}_{i}){\Vert}^{2})$.

Sample γ_{jk} from p(γ_{ jk }) = Gamma (γ_{ jk }; c', d') where c' = c+ 1/2 and $d\prime =d+\frac{1}{2}{A}_{jk}^{2}$.

Sample δ from p(δ) = Gamma (δe', f') where e' = e + nK/2 and $f\prime =f+\frac{1}{2}{\displaystyle \sum _{i=1}^{n}({s}_{i}^{T}{s}_{i})}$ In the above equations expressions of the form p(γ_{ jk }−) represent the probability of γ_{ jk }conditioned on all other parameters.
Variational Bayesian inference
Note that the term p(X) is a constant with respect to Γ, and therefore $\tilde{F}(\Gamma )$ is maximized when the KullbackLeibler divergence KL(Q(Θ; Γ)p(ΘX)) is minimized. However, we cannot explicitly compute the KL divergence, since p(ΘX) is unknown. However, the denominator term in $\tilde{F}(\Gamma )$ may be computed, since p(X)p(ΘX) = p(XΘ)p(Θ), and the prior p(Θ) and likelihood function p(XΘ) are available. To make computation of $\tilde{F}(\Gamma )$ tractable, we assume Q(ΘΓ) has a factorized form Q(Θ; Γ) = Π_{ i }Q_{ i }(Θ_{i}; Γ_{ i }). With appropriate choice of Q_{ i }, the variational expression $\tilde{F}(\Gamma )$ may be evaluated analytically. The update equations are as follows.

For z_{ ki }we have $Q({z}_{ki})=\text{Bemou}11\text{i}({z}_{ki};\phantom{\rule{0.1em}{0ex}}{\rho}_{ki}^{\prime})$ where ${\rho}_{ki}^{\prime}$ is the probability of z_{ ki }= 1. We consider the following two conditions:
discussion below, the symbol < • > represents the expectation of the argument.

For π_{ k }we have $Q({\pi}_{k})=\text{Beta}({\pi}_{k}\phantom{\rule{0.1em}{0ex}};\phantom{\rule{0.1em}{0ex}}{\alpha}_{k}^{\prime},\phantom{\rule{0.1em}{0ex}}{\beta}_{k}^{\prime})$ where ${\alpha}_{k}^{\prime}={\displaystyle {\sum}_{i=1}^{n}\u3008{z}_{ki}\u3009}+\frac{\alpha}{K}$ and ${\beta}_{k}^{\prime}=n+\frac{\beta (K1)}{K}{\displaystyle {\sum}_{i=1}^{n}\u3008}{z}_{ki}\u3009$.

For A_{ jk }we have $Q({A}_{jk})=\mathcal{N}({\mathit{A}}_{jk};{\mu}_{jk},\phantom{\rule{0.1em}{0ex}}{\Sigma}_{jk})$ with ${\Sigma}_{jk}={\left[{\displaystyle {\sum}_{i=1}^{n}\u3008{\psi}_{j}\u3009\u3008{s}_{ki}^{2}\u3009\u3008{z}_{ki}\u3009+\u3008{\gamma}_{jk}\u3009}\right]}^{1}$ and${\mu}_{jk}={\Sigma}_{jk}({\displaystyle {\sum}_{i=1}^{n}\u3008}{\psi}_{j}{z}_{ki}{s}_{ki}{X}_{ji}^{k}\u3009$
, where ${\mathit{X}}_{ji}^{k}={x}_{ji}{{\displaystyle {\sum}_{l=1;l\ne k}^{K}A}}_{jl}{z}_{li}{s}_{li}$.

For s_{ i }we have $Q({s}_{i})=\mathcal{N}({s}_{i};\phantom{\rule{0.1em}{0ex}}{\xi}_{i},\phantom{\rule{0.1em}{0ex}}{\lambda}_{i})$ , with ${\lambda}_{i}={[\u3008({\mathit{A}}^{T}\circ {\tilde{\mathit{Z}}}_{i})\text{diag}(\u3008\psi \u3009)(\mathit{A}\circ {\tilde{\mathit{Z}}}_{i}^{T})\u3009+\u3008\delta \u3009I]}^{1}$ and ${\xi}_{i}={\lambda}_{i}((\u3008{\mathit{A}}^{T}\u3009\circ \u3008{\tilde{\mathit{Z}}}_{i}\u3009)\text{diag}(\u3008\psi \u3009){x}_{i})$ , where ${\tilde{Z}}_{i}=[{z}_{i},\phantom{\rule{0.1em}{0ex}}\dots ,\phantom{\rule{0.1em}{0ex}}{z}_{i}]$ is a Kdimensional vector of all zi repeated p times. In order to exactly calculate the expectation,$B=\u3008({\mathit{A}}^{T}\circ {\tilde{\mathit{Z}}}_{i})\text{diag}(\psi )(\mathit{A}\circ {\tilde{\mathit{Z}}}_{i}^{T})\u3009$
, we have to consider it as two parts. Specifically, the offdiagonal elements of B are $(\u3008{\mathit{A}}^{T}\u3009\circ \u3008{\tilde{\mathit{Z}}}_{i}\u3009)\text{diag}(\u3008\psi \u3009)(\u3008\mathit{A}\u3009\circ \u3008{\tilde{\mathit{Z}}}_{i}^{T}\u3009)$ , and the diagonal elements, ${B}_{kk}=({\displaystyle {\sum}_{j=1}^{p}({\u3008{A}_{jk}\u3009}^{2}+{\Sigma}_{jk})\u3008{\psi}_{j}\u3009)\u3008{z}_{ki}\u3009}$ , since $\u3008{z}_{ki}^{2}\u3009=\u3008{z}_{ki}\u3009$ and $\u3008{A}_{jk}^{2}\u3009={\u3008{A}_{jk}\u3009}^{2}+{\Sigma}_{jk}$ , where 1 ≤ k ≤ K, 1 ≤ j ≤ p and 1 ≤ i ≤ n.

For ψ_{ j }we have $Q({\psi}_{j})=\text{Gamma}({\psi}_{j}\phantom{\rule{0.1em}{0ex}};\phantom{\rule{0.1em}{0ex}}{g}_{j}^{\prime},\phantom{\rule{0.1em}{0ex}}{h}_{j}^{\prime})$ , where ${g}_{j}^{\prime}=g+\frac{n}{2},{h}_{j}^{\prime}=h+\frac{1}{2}{\displaystyle {\sum}_{i=1}^{n}\u3008}({x}_{ji}{A}_{j}{({z}_{i}\circ {s}_{i})}^{2}\u3009$.

For γ_{ jk }we have $Q({\gamma}_{jk})=\text{Gamma}({c}_{jk}^{\prime},\phantom{\rule{0.1em}{0ex}}{d}_{jk}^{\prime})$ , with ${c}_{jk}^{\prime}=c+1/2\phantom{\rule{0.1em}{0ex}}$ and ${d}_{jk}^{\prime}=d+\frac{1}{2}\u3008{A}_{jk}^{2}\u3009$.

For δ we have Q(δ) Gamma (e',f'), where e' = e + Kn/2 and $f\prime =f+\frac{1}{2}{\displaystyle {\sum}_{i=1}^{n}\u3008{s}_{i}^{T}{s}_{i}\u3009}$ .
Acknowledgements
The research reported here was supported under the DARPA PHD program. The research and results presented here are the contributions of the authors, and the results were not influenced in any way by the sponsor.
Notes
Declarations
Authors’ Affiliations
References
 West M: "Bayesian factor regression models in the "large p, small n" paradigm,". In Bayesian Statistics 7. Edited by: Bernardo JM, Bayarri M, Berger J, Dawid A, Heckerman D, Smith A, West M. Oxford University Press; 2003:723–732.
 Tibshirani R: "Regression shrinkage and selection via the lasso,". Journal of Royal Statistical Society Ser. B 1996, 58: 267–288.
 Zou H, Hastie T: "Regularization and variable selection via the elastic net,". Journal of Royal Statistical Society Ser. B 2005, 67: 301–320. 10.1111/j.14679868.2005.00503.xView Article
 Park T, Casella G: "The Bayesian Lasso,". Journal of the American Statistical Association 2008, 103: 681–686,. 10.1198/016214508000000337View Article
 Cristianini N, ShaweTaylor J: An Introduction to Support Vector Machines. Cambridge University Press; 2000.
 Tipping M: "Sparse Bayesian learning and the relevance vector machine,". Journal of Machine Learning Research 2001, 1: 211–244. 10.1162/15324430152748236
 Ji S, Xue Y, Carin L: "Bayesian compressive sensing,". IEEE Transactions on Signal Processing 2008., 56:
 Rai P, Daum'e H III: "The infinite hierarchical factor regression model,". Proc Conf Neural Information Proc Systems (NIPS), Vancouver, Canada 2008.
 Knowles D, Ghahramani Z: "Infinite sparse factor analysis and infinite independent components analysis,". 7th International Conference on Independent Component Analysis and Signal Separation 2007.
 Meeds E, Ghahramani Z, Neal R, Roweis S: "Modeling dyadic data with binary latent factors,". Advances in Neural Information Processing Systems 2007, 977–984.
 Carvalho C, Chang J, Lucas J, Nevins JR, Wang Q, West M: "Highdimensional sparse factor modelling: Applications in gene expression genomics,". Journal of the American Statistical Association 2008, 103: 1438–1456. 10.1198/016214508000000869View ArticlePubMedPubMed Central
 Zou H, Hastie T, Tibshirani R: "Sparse principal component analysis,". Journal of Computational and Graphical Statistics 2006, 15: 2004. 10.1198/106186006X113430View Article
 Witten D, Tibshirani R, Hastie T: "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis,". Biostatistics 2009, 10: 515–534. 10.1093/biostatistics/kxp008View ArticlePubMedPubMed Central
 Lopes H, West M: "Bayesian model assessment in factor analysis,". Statistica Sinica 2004, 14: 41–67.
 Ghosh J, Dunson D: "Bayesian model selection in factor analytic models,". In Random Effect and Latent Variable Model Selection. Edited by: Dunson D. John Wiley & Sons; 2008.
 Berger J, Ghosh J, Mukhopadhyay N: "Approximation and consistency of Bayes factors as model dimension grows,". J. Statist. Plann. Inference 2003, 112: 241258,. 10.1016/S03783758(02)003361View Article
 Press S, Shigemasu K: "A note on choosing the number of factors,". Comm Statist Theory Methods 1999, 28: 1653–1670. 10.1080/03610929908832378View Article
 Lee S, Song X: "Bayesian selection on the number of factors in a factor analysis model,". Behaviormetrika 2002, 29: 2339. 10.2333/bhmk.29.23View Article
 Griffiths T, Ghahramani Z: "Infinite latent feature models and the indian buffet process,". Advances in Neural Information Processing Systems 2005, 475–482.
 DoshiVelez F, Miller K, Gael JV, The Y: "Variational inference for the indian buffet process,". AISTATS 2009.
 Thibaux R, Jordan M: "Hierarchical beta processes and the Indian buffet process,". International Conference on Artificial Intelligence and Statistics 2007.
 Paisley J, Carin L: "Nonparametric factor analysis with beta process priors,". Int Conf Machine Learning 2009.
 Beal M: "Variational algorithms for approximate bayesian inference,". Ph.D. dissertation, Gatsby Computational Neuroscience Unit, University College London; 2003.
 Zou H, Hastie T, Tibshirani R: "Sparse principal component analysis,". Technical Report, Statistics Department, Stanford University 2004.
 Zaas AK, Chen M, Lucas J, Veldman T, Hero AO, Varkey J, Turner R, Oien C, Kingsmore S, Carin L, Woods CW, Ginsburg GS: "Peripheral blood gene expression signatures characterize symptomatic respiratory viral infection,". Cell Host & Microbe 2009, 6: 207–217. 10.1016/j.chom.2009.07.006View Article
 Dunson D: "Dynamic latent trait models for multidimensional longitudinal data,". J. Am. Statistical Ass 2003, 98: 555–563. 10.1198/016214503000000387View Article
 Ramilo O, Allman W, Chung W, Mejias A, Ardura M, Glaser C, Wittkowski KM, Piqueras B, Banchereau J, Palucka AK, Chaussabel D: "Gene expression patterns in blood leukocytes discriminate patients with acute infections,". Blood 2007, 109: 2066–2077. 10.1182/blood200602002477View ArticlePubMedPubMed Central
 Hjort NL: "Nonparametric bayes estimators based on beta processes in models for life history data,". Annals of Statistics 1990, 18(3):1259–1294. 10.1214/aos/1176347749View Article
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.