Modeling the next generation sequencing sample processing pipeline for the purposes of classification

Background A key goal of systems biology and translational genomics is to utilize high-throughput measurements of cellular states to develop expression-based classifiers for discriminating among different phenotypes. Recent developments of Next Generation Sequencing (NGS) technologies can facilitate classifier design by providing expression measurements for tens of thousands of genes simultaneously via the abundance of their mRNA transcripts. Because NGS technologies result in a nonlinear transformation of the actual expression distributions, their application can result in data that are less discriminative than would be the actual expression levels themselves, were they directly observable. Results Using state-of-the-art distributional modeling for the NGS processing pipeline, this paper studies how that pipeline, via the resulting nonlinear transformation, affects classification and feature selection. The effects of different factors are considered and NGS-based classification is compared to SAGE-based classification and classification directly on the raw expression data, which is represented by a very high-dimensional model previously developed for gene expression. As expected, the nonlinear transformation resulting from NGS processing diminishes classification accuracy; however, owing to a larger number of reads, NGS-based classification outperforms SAGE-based classification. Conclusions Having high numbers of reads can mitigate the degradation in classification performance resulting from the effects of NGS technologies. Hence, when performing a RNA-Seq analysis, using the highest possible coverage of the genome is recommended for the purposes of classification.


Background
In recent years, modern high throughput sequencing technologies have become one of the essential tools in measuring the number of transcripts of each gene in a cell population or even in individual cells. Such information could be used to detect differential gene expression due to different treatment or phenotype. In our case we are interested in using gene-expression measurements to classify phenotypes into one of two classes. The accuracy of classification will depend on the manner in which the phenomena are transformed into data by the measurement technology. We consider the effects of Next-Generation http://www.biomedcentral.com/1471-2105/14/307 genome to determine the gene-expression levels. The number of reads mapped to a gene on the reference genome defines the count data, which is a discrete measure of the gene-expression levels. Two popular models for statistical representation of the discrete NGS data are the negative binomial [4,5] and Poisson [6]. The negative binomial model is more general because it can mitigate over-dispersion issues associated with the Poisson model; however, with the relatively small number of samples available in most current NGS experiments, it is difficult to accurately estimate the dispersion parameter of the negative binomial model. Therefore, in this study we choose to model the NGS data processing pipeline through the transformation via a Poisson model, for the purposes of phenotype classification. SAGE technology produces short continuous sequences of nucleotides, called tags. After a SAGE experiment is done, one can measure the expression level of a particular region/gene of interest on the genome by counting the number of tags that map to it. SAGE is very similar to RNA-Seq in nature and in terms of statistical modeling. The SAGE data processing pipeline is traditionally modeled as a Poisson random vector [7,8]. We follow the same approach for synthetically generated SAGE-like data sets.
Our overall methodology is to generate three different types of synthetic data: (1) actual gene expression concentration, called MVN-GC, from a multivariate normal (Gaussian) model formulated to model various aspects of gene expression concentration [9]; (2) Poisson transformed MVN-GC data, called NGS-reads, with specifications that resemble NGS reads; and (3) Poisson transformed MVN-GC data, called SAGE-tags, where the characteristics of the data model SAGE data. The classification results related to these three different types of data sets indicate that MVN-GC misclassification errors are lower compared to data subjected to transformations that produce either NGS-reads or SAGE-tags data. Moreover, classification using RNA-Seq synthetic data outperforms classification using SAGE data when the number of reads is in an acceptable range for an RNA-Seq experiment. The better performance is attributed to the significantly higher genome coverage associated with the RNA-Seq technology.

Next-generation sequencing technologies
Next-Generation Sequencing refers to a class of technologies that sequence millions of short DNA fragments in parallel, with a relatively low cost. The length and number of the reads differ based on the technology. Currently, there are three major commercially available platforms/technologies for NGS: (1) Illumina, (2) Roche, and (3) Life Technologies [10,11]. The underlying chemistry and technique used in each platform is unique and affects the output. In this paper, we focus on Illumina sequencers and use the NGS term to refer to this platform. High-throughput sequencing and NGS are used interchangeably.
The specific application of NGS for RNA sequencing is called RNA-Seq [2], which is a high-throughput measurement of gene-expression levels of thousands of genes simultaneously as represented by discrete expression values for regions of interest on the genome (e.g. genes). NGS has many advantages when compared to the available microarray expression platforms. NGS does not depend on prior knowledge about regions of expression to measure a gene [11], whereas the microarray probes are designed based on known sequences [12]. The background correction, probe design and spot filtering, which are typical for microarray-based technology, are no longer problematic due to the different nature of NGS technology. RNA-Seq enables scientists to discover new splice junctions [13], due to its flexibility and independence of the pre-designed probes. The prediction of absolute copy-number variation (CNV) is another great advantage of RNA-Seq, which allows scientists to identify large segments of deletions/duplications in a genome with respect to another (a reference) genome [14]. Detecting single-nucleotide polymorphisms (SNP) is yet another application of RNA-Seq [3,15]. Furthermore, it has been shown that RNA-Seq can detect spike-in RNA controls with good accuracy and reproducibility [16].
The RNA-Seq process starts with samples that are randomly sheared to generate millions of small RNA fragments. These fragments are then converted to cDNA and the adapter sequences are ligated to their ends. The length of a fragment can vary between 30 bp -110 bp, approximately. The Illumina system provides flowcells for sequencing by synthesis and its reversible terminator chemistry [2,3,10]. A flowcell is an eight-channel glass and each channel is commonly referred to as a lane. The size selected fragments with the adapters attach to the flowcell surface inside the lanes and generate clusters of the same fragment through bridge amplification. Following the bending and attaching of both sides of the fragment to the surface, the strand duplicates. This process is repeated many times and results in a cluster of fragments. After the cluster generation step, a pool of floating nucleotides is added to the flowcell along with DNA polymerase to incorporate to the single strand fragments in each cluster. Each nucleotide incorporation makes a unique fluorescent label and the images are captured after the addition. Finally, image processing and base calling determine the base at each cycle of the sequencing. Each cluster produces a read whose length equals the number of cycles. Each RNA-Seq experiment produces millions of reads depending on the input RNA material, length of the reads, desired coverage for the reference genome, number of samples per lane, etc. Following the sequencing http://www.biomedcentral.com/1471-2105/14/307 experiment, the expression levels of the genes are estimated by mapping the reads to the reference genome. There are many algorithms developed for this task, including: ELAND [3], Bowtie [17], BWA [18], MAQ [19], SHRiMP [20], mrFast [14], mrsFAST [21], SOAP [22], etc. After the gene expressions are determined, they can be used in further analysis, such as SNP detection or detecting differentially expressed genes.
The entire RNA-Seq sample processing pipeline from generating reads to calling gene expressions can involve different sources of error, e.g. the basecalling is a probabilistic procedure and the quality scores assigned to each base of the reads are prone to small likelihoods of being wrong. The certainty of reference genomes and mapping algorithms are additional issues that need attention. Thus, the entire RNA-Seq sample processing pipeline should be considered from a probabilistic point of view. In this study, we model the above mentioned errors as a noise term in the model of the sample processing pipeline.

Discussion
In this section, we consider three specific models for "actual" gene expression data, RNA-Seq count data and SAGE tags data. The models are used to synthetically generate data for the simulation experiments described in this paper. The performances of different classification schemes are analyzed and compared across these three synthetically generated types of data.
A common assumption for modeling of the original mRNA expressions is that they follow a multivariate Gaussian distribution [9,23,24]. Starting with this assumption, we model and generate the RNA-Seq and SAGE data by applying a specific nonlinear Poisson transformation to the mRNA expression model. All data are synthetically generated according to a biologically relevant model to emulate the real experimental situations where the number of features/genes is very large, usually tens of thousands, and only a small number of sample points is available. Knowing the full distributional characteristics of the synthetic data makes it possible to measure the classification performance as described by the respective error rates.

MVN-GC model
The model proposed in [9] uses a block-based structure on the covariance matrix, which is a standard tool to model groups of interacting variables where there is negligible interaction between the groups. In genomics, genes within a block represent a distinct pathway and are correlated, whereas genes not in the same group are uncorrelated [25,26]. Sample points are drawn randomly and independently from two equally likely classes, 0 and 1, each sharing the same D features. There are also c equally likely subclasses in class 1 with different parameters for the probability distribution of the features. The subclasses model scenarios typically seen as different stages or subtypes of a cancer. Each sample point in class 1 belongs to one and only one of these subclasses. Features are categorized into two major groups: markers and non-markers. Markers resemble genes associated with a disease or condition related to the disease and they have different class-conditional distributions for the two classes.
Markers can be further categorized into two different groups: global and heterogeneous markers. Global markers take on values from D gm -dimensional Gaussian distributions with parameters (μ Heterogeneous markers, on the other hand, are divided into c subgroups of size D hm , each associated with one of c mutually exclusive subclasses within class 1. Therefore, a sample point that belongs to a specific subclass has D hm heterogeneous markers distributed as a D hmdimensional Gaussian with parameters (μ hm 1 , hm 1 ). The same D hm heterogeneous markers for the sample points belonging to other subclasses, as well as points in class 0, follow a different D hm -dimensional Gaussian distribution with parameters (μ hm 0 , hm 0 ). We assume that the global and heterogeneous markers have similar structure for the covariance matrices. Therefore, we represent the covariance matrices of these two types of markers by 0 = σ 2 0 and 1 = σ 2 1 for class 0 and class 1, respectively, where σ 2 0 and σ 2 1 can be different. For this structure, we assume that has the following block structure: where ρ is an l × l matrix, with 1 on the diagonal and ρ off the diagonal. Note that the dimension of is different for the global and heterogeneous markers. Furthermore, we assume that the mean vectors for the global markers and the heterogeneous markers possess the same structure denoted by μ 0 = m 0 × (1, 1, . . . , 1) and μ 1 = m 1 × (1, 1, . . . , 1) for class 0 and class 1, respectively, where m 0 and m 1 are scalars. The non-markers are also divided into two groups: high-variance and lowvariance non-markers. The D hv non-markers belonging to the high-variance group are uncorrelated and their distributions are described by pN(m 0 , σ 2 0 ) + (1 − p)N(m 1 , σ 2 1 ), where m 0 , m 1 , σ 2 0 and σ 2 1 take values equal to the means and variances of the markers, respectively, and p is a random value uniformly distributed over [0, 1]. The D lv lowvariance non-markers are uncorrelated and have identical one-dimensional Gaussian distributions with parameters (m 0 , σ 2 0 ) [9,27]. Figure 1 represents the block-based structure of the model. http://www.biomedcentral.com/1471-2105/14/307

NGS-reads and SAGE-tags models
In NGS-type experiments, the gene-expression levels are measured by discrete values providing the number of reads that map to the respective gene. Several statistical models have been proposed for representing NGS data. Those models are based on either negative binomial or Poisson distributions [4][5][6]. In what follows, we denote the read count for gene j for sample point i by X i,j , where it is assumed that in an NGS experiment each lane has a single biological specimen belonging to either class 0 or class 1. Furthermore, we select a model where the number of reads for each gene is generated from a Poisson distribution with a known parameter. We calculate the expected number of reads (mean of the Poisson distribution) from the generalized linear model [28]: where λ i,j is the jth gene-expression level in lane i. The term θ i,j represents technical effects that might be associated with an experiment. The term log s i is an offset where s i , referred to as "sequencing depth" in the statistical community [29], denotes a major factor in the transformation from expression levels to read data. It accounts for different total numbers of reads produced by each lane and plays an important role in normalizing the specimens across the flowcell. The trimmed mean of M values (TMM) [30], quantile normalization [28], and median count ratio [4] are three commonly used methods for estimating the sequencing depth. Equation (1) represents the expected value of reads, conditioned on the sequencing depth, based on the linear combination of the factors that affect its value: the depth of sequencing, the gene-expression level and a general noise term. Therefore, it can be used to model the expected number of reads, as the mean of the Poisson distribution in our synthetic data generation pipelines. Rewriting equation (1) yields indicating that if λ i,j and θ i,j are normally distributed, then exp(λ i,j + θ i,j ) will have a log-normal distribution. Therefore, for a given s i the mean of X i,j is log-normally distributed. This phenomenon has been previously reported for microarray studies where the means of expression levels are shown to have log-normal distributions [23,31]. Furthermore, we assume that the offset s i is random and uniformly distributed [29]. Because the term θ i,j represents the unknown technical effects associated with the experimentation, we assume that it follows a Gaussian distribution with zero mean and a variance set by the coefficient of variation (COV): COV aims to model the unknown errors that can occur during an/a NGS/SAGE experiment, including basecalling, mapping reads to the reference genome, etc. The term E[X i,j |s i ] serves as the single parameter of a Poisson distribution that models the NGS/SAGE processes by generating random non-negative integers, as the read counts or tag numbers data, having expected value equal to the right-hand side of equation (2). To generate synthetic datasets for the purposes of our simulation study we proceed as follows: for a sample point http://www.biomedcentral.com/1471-2105/14/307 i in the experiment, we first randomly generate a number s i from a uniform distribution, U(α, β), where α > 0 and β > α. Then, a random sample point, λ i,j , is drawn from the MVN-GC model and its value is perturbed with θ i,j , which is drawn randomly according to its distribution defined in (3). Using (2), the mean of the Poisson distribution is calculated for each sample point i and gene j, and a single random realization of X i,j is generated. The processes of generating count data for RNA-Seq reads and SAGE tag numbers are very similar, but the total number of reads per NGS run is significantly more than tags for a SAGE experiment. Therefore, we only change α and β to get the desired number of read counts or tags. We also assume that the SAGE experiments always have a fixed range for the total number of tags, whereas RNA-Seq has a variety of ranges for the total read counts. By having different ranges for the read counts, we can compare the performance of classification resulting from NGS-reads and SAGE-tags models under different experimental settings.

Classification schemes
The setup for the classification problem is determined by a joint feature-label probability distribution F XY , where X ∈ R D is a random feature vector and Y ∈ {0, 1} is the unknown class label of X. In the context of genomics, the feature vector is usually the expression levels of many genes and the class labels are different types or stages of disease to which sample points belong to. A classifier rule model is defined as a pair ( , ), where is a classification rule, possibly including feature selection, and is a training-data error estimation rule. In a typical classification task, a random training set S n = {(X 1 , Y 1 ), (X 2 , Y 2 ), . . . , (X n , Y n )} is drawn from F XY and the goal is to design a classifier ψ n = (S n ), which takes X as the input and outputs a label Y. The true classification error of ψ n is given by ε n = P(ψ n (X) = Y ). The error estimation rule provides an error estimate,ε n = (S n ), for ψ n .
In this study, we consider linear discriminant analysis (LDA), three nearest neighbors (3NN) and radial basis function support vector machine (RBF-SVM) as the classification rules, and report the true error of the classifiers. We implement t-test feature selection (as a part of the classification rule) before the classifier design procedure to select d ≤ D features with highest t-scores. The training set with d features is then used to design the classifier. The same d features are also used for finding the true error of the designed classifier.
LDA is the plug-in rule for the Bayes classifier when the class-conditional densities are multivariate Gaussian with a common covariance matrix [32]. The sample means and pooled sample covariance matrix are estimated from the training data S n and plugged into the discriminant function. If the classes are equally likely, LDA assigns x to class 1 if and only if (4) whereμ y is the sample mean for class y ∈ {0, 1}, and is the pooled sample covariance matrix.
3NN is a special case of kNN rule (with k = 3), which is a nonparametric classification rule based on the training data. The kNN classifier assigns a label, 0 or 1, to a point x according to the majority of the labels of the k nearest training data points to it. To avoid tied votes in binary classification problems, an odd number is usually chosen for k.
A support vector machine finds a maximal margin hyperplane for a given set of training sample points. If it is not possible to linearly separate the data, one can introduce some slack variables in the optimization procedure that allow the mislabeled sample points and solve the dual problem. Alternatively, one can transform the data and project it into a higher-dimensional space, where it becomes linearly separable. The equivalent classifier back in the original feature space will generally be non-linear [33,34]. If a Gaussian radial basis function used as the kernel function, the corresponding classifier is referred to as RBF-SVM.  and list of parameters used in the study. The analysis of the results follows in the next section.

Simulation setup and parameters
To emulate real-world scenarios involving thousands of genes with only a few sample points, we choose D = 20, 000 as the number of features and n ∈ {60, 120, 180} as the number of sample points available for each synthetic feature-label distribution. Because there is no closed form expression to calculate the true error of the designed classifiers, we generate a large independent test sample of size n t = 3000, with samples divided equally between the two classes. Because the RMS between the true and estimated error when using independent test data is bounded above by 1/2 √ n t , this test sample size provides an errorestimate RMS of less than 0.01. Once the training and test data are generated, we normalize the training data so that each feature is zero-mean unit-variance across all sample points in both classes. We also apply the same normalization coefficients from the training data to the test set. The normalized data are then used in the feature selection, classifier design and error calculation. The parameter settings for the MVN-GC model, SAGE-tags and NGS-reads are provided in Table 1.
As explained in the previous section, the datasets generated from the MVN-GC model are transformed into the NGS-reads and SAGE-tags datasets through equation (2)  and Poisson processes. We only need to properly set the parameters COV, α, and β to get the desired number of read counts or tags. We assume that the parameter COV can take on two values, 0.01 and 0.05, representing two different levels of noise and unknown errors in the experiment. In its current state, RNA-Seq technology can provide different numbers of reads per sample, depending on many factors, such as quality of the sample, the desired coverage, sample multiplexing, etc. In this study, we examine a variety of ranges for RNA-Seq experiments.
We start with a very low total number of RNA-Seq reads, which may not match the real-world experiments, however, they are necessary for comparing the SAGE-tags and NGS-reads models with similar coverage. Furthermore, demonstrating the classification results for the NGS-reads model with wide ranges introduces an extra internal variability, which makes interpretations of the results rather difficult. Table 1 lists the NGS-reads ranges we have considered in this study. In a typical SAGE experiment, one expects 50K to 100K tags [35,36]. Using trial and error, we have found that by choosing the parameters α = 2.0 and β = 3.75 in the distribution of s i , the observed number of tags usually falls within this range. Similarly, the parameters α and β are chosen to meet the range requirements in the NGS-reads model. Our goal is to study the performance of the traditional classification schemes on different sources of random samples; thus, we take a Monte-Carlo approach and generate 10, 000 random training sets of size n from the MVN-GC model, transform them to the corresponding NGS-reads and SAGE-tags samples, and apply the classification schemes to each training sample. By taking such an approach, we aim to compare the classification performance of the three pipelines, in terms of the true error of the deigned classifiers. We also study the effect of NGS-reads and SAGE-tags transformations on the performance of a simple t-test biomarker discovery method, where we report the probability that global markers are recovered when d D features are selected after the feature-selection step.

Results
The probability of recovering a certain number of global markers after a t-test feature selection can be approximated empirically by the percentage of experiments (out of 10, 000 independent experiments) that detect such a number of markers. This probability depends on the size of the training data sample, quality of features, and underlying joint probability distribution of the features. Here, we only show the results for the MVN-GC model, with d = 10, in Table 2. Tables 3,  4 Tables 2, 3, 4, 5 and 6. Another important observation in Tables 3,  4, 5 and 6 is the effect of the total number of reads and COV for the NGS-reads models. As the number of reads increases, it is more likely to pick up more global markers, until the number of reads reaches a threshold, where no further improvement is observed. Similarly, for smaller COV the probability of selecting more global markers also increases.
The true error of a designed classifier measures the generalization capability of the classifier on a future sample point. Given a set of training sample points and a classification rule, one needs the full feature-label probability distribution to calculate the true error of the classifier designed on the training set. In this study, we find the true error of a classifier with a large independent sample. Because the training sample is random, the true error is a random variable with its own probability distribution. Therefore, to demonstrate the actual performance of the designed classifiers, the estimated probability density function (PDF) of the true error for each classification rule, distribution model and all data generation models (MVN-GC, NGS-reads and SAGE-tags) is presented.
We report the true errors of the designed classifiers for two different feature-selection schemes and classification rules. In the first scheme, no feature selection is performed and the first d global markers are directly used for classification. In the second scheme, t-test feature selection is done before a classifier is designed to select the best d features. Figures 3, 4 and 5 show the salient finding of this study. We present the PDF of the true error    for different classification rules trained on 60 and 180 sample points when the correlation among features in the same block is high (ρ = 0.8) and COV= 0.05. Distributions with higher and tighter densities around lower true errors indicate better classifier performance. If the PDFs can be approximated as univariate Gaussians, then a good classification performance amounts to smaller mean and variance. In all cases, for similar sample sizes and similar settings for ρ and COV, MVN-GC outperforms the two other data types. This is not surprising since it is considered as the ground truth. Also, it is evident that a larger sample size gives better classifiers with smaller variance as illustrated by the distribution of the true error. Figures 3, 4 and 5 show that utilizing the best d features (with or without feature selection) in the model, SAGEtags and NGS-reads for small ranges of the total reads yield similar results (or even much worse when the range is smaller) in terms of the classification performance, and both are inferior when compared to the ground truth. This may lead to a conclusion that one should not expect improvements in classification performance when geneexpression levels are processed through an NGS-reads pipeline. However, our main objective here is to show that as one increases the total number of reads for the NGS-reads model, improvement can be achieved and the error rates decrease. This conclusion confirms the intuition provided by a simple calculation about the increase of the separability of two Poisson random variables with means proportional to the number of reads. However, notice that the modeling of the sequencing pipeline introduces randomness in the means of the respective Poisson parameters describing the individual gene reads. Moreover, our focus is not that much on the separability of the two classes/phenotypes but rather on the classification performance as measured by the classification error and its dependence on ground truth (MVN-GC model) sample size, sequencing depth, and feature vectors. Our goal for modeling NGS-reads with small ranges is to demonstrate the performance of SAGE and RNA-Seq, when their data is similar. This result suggests that having a larger number of reads for the RNA-Seq experiments could compensate for the errors that can occur during the NGSreads pipeline sample processing. Here we have shown the results only for COV = 0.05, since in our observations, it appears that changing COV has little effect on the distribution of the true error for both SAGE-tags and NGS-reads models, except that it slightly increases the variance of the PDFs. The effect of feature selection (right columns) on the performance is best shown when the sample size is small. In this case, the variance of the true error is so large that drawing any conclusion regarding the performance on a small sample would be futile. Another interesting observation is that, on average, http://www.     : (a,b) j X i,j ∈ (1K, 50K); (c,d) j X i,j ∈ (250K, 300K); (e,f) j X i,j ∈ (25M, 25M + 50K); (g,h) j X i,j ∈ (50M, 50M + 50K).   ∈ (50K, 100K) for SAGE-tags and the following total number of reads for NGS-reads: (a,b) j X i,j ∈ (1K, 50K); (c,d) j X i,j ∈ (250K, 300K); (e,f) j X i,j ∈ (25M, 25M + 50K); (g,h) j X i,j ∈ (50M, 50M + 50K). http://www.biomedcentral.com/1471-2105/14/307 3NN and RBF-SVM classification rules outperform LDA for both NGS-reads and SAGE-tags data, supporting the idea that using linear classifiers for these types of data is not the best way to proceed in a highly non-Gaussian model -as is the case after the data have gone through the pipeline. The best rates for the expected true error across all settings are achieved by the RBF-SVM classification rule, even for small samples, especially with small variance.

Conclusions
In this paper, we model gene-expression levels as a multivariate Gaussian distribution that statistically captures the real mRNA levels within the cells. The newly developed technologies of sequencing DNA/RNA, referred to as NGS, and their ascendant SAGE technology generate discrete measures for gene expressions. The count data from these technologies can be modeled with a Poisson distribution. The multivariate Gaussian gene expressions are transformed through a Poisson filter to model NGS and SAGE technologies. The three categories of data are subjected to feature selection and classification. The objective is to evaluate the performance of the NGS technologies in classification. Our simulations show that when the gene expressions are directly used in classification, the best performance in terms of the classification error is achieved. The NGS-reads model generates considerably higher coverage for genes and can outperform SAGE in classification, when the experiment generates large number of reads. Even though NGS still has a variety of error sources involved in its process, its high volume of reads for a specific gene can lower the chance of misclassification. Thus, it is important to use the highest possible coverage for the entire genome while performing a RNA-Seq analysis if the goal of the study is to distinguish the samples of interest. Nevertheless, one must recognize that, as is typical with nonlinear transformations, the NGS pipeline transforms the original Gaussian data in such a way as to increase classification difficulty. As more refined modeling becomes available for the NGS pipeline, further study needs to performed, as has been done in the case of the LC-MS proteomics pipeline [1], to determine which segments of the pipeline are most detrimental to classification and what, if anything, can be done to mitigate classification degradation.