- Research article
- Open Access
- Published:

# Modeling the next generation sequencing sample processing pipeline for the purposes of classification

*BMC Bioinformatics*
**volume 14**, Article number: 307 (2013)

## Abstract

### Background

A key goal of systems biology and translational genomics is to utilize high-throughput measurements of cellular states to develop expression-based classifiers for discriminating among different phenotypes. Recent developments of Next Generation Sequencing (NGS) technologies can facilitate classifier design by providing expression measurements for tens of thousands of genes simultaneously via the abundance of their mRNA transcripts. Because NGS technologies result in a nonlinear transformation of the actual expression distributions, their application can result in data that are less discriminative than would be the actual expression levels themselves, were they directly observable.

### Results

Using state-of-the-art distributional modeling for the NGS processing pipeline, this paper studies how that pipeline, via the resulting nonlinear transformation, affects classification and feature selection. The effects of different factors are considered and NGS-based classification is compared to SAGE-based classification and classification directly on the raw expression data, which is represented by a very high-dimensional model previously developed for gene expression. As expected, the nonlinear transformation resulting from NGS processing diminishes classification accuracy; however, owing to a larger number of reads, NGS-based classification outperforms SAGE-based classification.

### Conclusions

Having high numbers of reads can mitigate the degradation in classification performance resulting from the effects of NGS technologies. Hence, when performing a RNA-Seq analysis, using the highest possible coverage of the genome is recommended for the purposes of classification.

## Background

In recent years, modern high throughput sequencing technologies have become one of the essential tools in measuring the number of transcripts of each gene in a cell population or even in individual cells. Such information could be used to detect differential gene expression due to different treatment or phenotype. In our case we are interested in using gene-expression measurements to classify phenotypes into one of two classes. The accuracy of classification will depend on the manner in which the phenomena are transformed into data by the measurement technology. We consider the effects of Next-Generation Sequencing (NGS) and Serial Analysis of Gene Expression (SAGE) on gene-expression classification using currently accepted measurement modeling. The accuracy of classification problem has previously been addressed for the LC-MS proteomics pipeline, where state-of-the-art modeling is more refined, the purpose being to characterize the effect of various noise sources on classification accuracy [1].

NGS technology provides a discrete counting measurement for gene-expression levels. In particular, RNA-Seq sequences small RNA fragments (mRNA) to measure gene expression. When a gene is expressed, it produces mRNAs. The RNA-Seq experiment randomly shears and converts the RNA fragments to cDNAs, sequences them, and finally outputs the results in the form of short reads [2, 3]. After obtaining those reads, a typical part of a processing pipeline is to map them back to a reference genome to determine the gene-expression levels. The number of reads mapped to a gene on the reference genome defines the count data, which is a discrete measure of the gene-expression levels. Two popular models for statistical representation of the discrete NGS data are the negative binomial [4, 5] and Poisson [6]. The negative binomial model is more general because it can mitigate over-dispersion issues associated with the Poisson model; however, with the relatively small number of samples available in most current NGS experiments, it is difficult to accurately estimate the dispersion parameter of the negative binomial model. Therefore, in this study we choose to model the NGS data processing pipeline through the transformation via a Poisson model, for the purposes of phenotype classification.

SAGE technology produces short continuous sequences of nucleotides, called tags. After a SAGE experiment is done, one can measure the expression level of a particular region/gene of interest on the genome by counting the number of tags that map to it. SAGE is very similar to RNA-Seq in nature and in terms of statistical modeling. The SAGE data processing pipeline is traditionally modeled as a Poisson random vector [7, 8]. We follow the same approach for synthetically generated SAGE-like data sets.

Our overall methodology is to generate three different types of synthetic data: (1) actual gene expression concentration, called MVN-GC, from a multivariate normal (Gaussian) model formulated to model various aspects of gene expression concentration [9]; (2) Poisson transformed MVN-GC data, called NGS-reads, with specifications that resemble NGS reads; and (3) Poisson transformed MVN-GC data, called SAGE-tags, where the characteristics of the data model SAGE data. The classification results related to these three different types of data sets indicate that MVN-GC misclassification errors are lower compared to data subjected to transformations that produce either NGS-reads or SAGE-tags data. Moreover, classification using RNA-Seq synthetic data outperforms classification using SAGE data when the number of reads is in an acceptable range for an RNA-Seq experiment. The better performance is attributed to the significantly higher genome coverage associated with the RNA-Seq technology.

### Next-generation sequencing technologies

*Next-Generation Sequencing* refers to a class of technologies that sequence millions of short DNA fragments in parallel, with a relatively low cost. The length and number of the reads differ based on the technology. Currently, there are three major commercially available platforms/technologies for NGS: (1) Illumina, (2) Roche, and (3) Life Technologies [10, 11]. The underlying chemistry and technique used in each platform is unique and affects the output. In this paper, we focus on Illumina sequencers and use the NGS term to refer to this platform. High-throughput sequencing and NGS are used interchangeably.

The specific application of NGS for RNA sequencing is called RNA-Seq [2], which is a high-throughput measurement of gene-expression levels of thousands of genes simultaneously as represented by discrete expression values for regions of interest on the genome (e.g. genes). NGS has many advantages when compared to the available microarray expression platforms. NGS does not depend on prior knowledge about regions of expression to measure a gene [11], whereas the microarray probes are designed based on known sequences [12]. The background correction, probe design and spot filtering, which are typical for microarray-based technology, are no longer problematic due to the different nature of NGS technology. RNA-Seq enables scientists to discover new splice junctions [13], due to its flexibility and independence of the pre-designed probes. The prediction of absolute copy-number variation (CNV) is another great advantage of RNA-Seq, which allows scientists to identify large segments of deletions/duplications in a genome with respect to another (a reference) genome [14]. Detecting single-nucleotide polymorphisms (SNP) is yet another application of RNA-Seq [3, 15]. Furthermore, it has been shown that RNA-Seq can detect spike-in RNA controls with good accuracy and reproducibility [16].

The RNA-Seq process starts with samples that are randomly sheared to generate millions of small RNA fragments. These fragments are then converted to cDNA and the adapter sequences are ligated to their ends. The length of a fragment can vary between 30 bp - 110 bp, approximately. The Illumina system provides *flowcells* for *sequencing by synthesis* and its *reversible terminator chemistry*[2, 3, 10]. A flowcell is an eight-channel glass and each channel is commonly referred to as a *lane*. The size selected fragments with the adapters attach to the flowcell surface inside the lanes and generate clusters of the same fragment through *bridge amplification*. Following the bending and attaching of both sides of the fragment to the surface, the strand duplicates. This process is repeated many times and results in a *cluster* of fragments. After the cluster generation step, a pool of floating nucleotides is added to the flowcell along with DNA polymerase to incorporate to the single strand fragments in each cluster. Each nucleotide incorporation makes a unique fluorescent label and the images are captured after the addition. Finally, image processing and base calling determine the base at each cycle of the sequencing. Each cluster produces a read whose length equals the number of cycles. Each RNA-Seq experiment produces millions of reads depending on the input RNA material, length of the reads, desired coverage for the reference genome, number of samples per lane, etc. Following the sequencing experiment, the expression levels of the genes are estimated by mapping the reads to the reference genome. There are many algorithms developed for this task, including: ELAND [3], Bowtie [17], BWA [18], MAQ [19], SHRiMP [20], mrFast [14], mrsFAST [21], SOAP [22], etc. After the gene expressions are determined, they can be used in further analysis, such as SNP detection or detecting differentially expressed genes.

The entire RNA-Seq sample processing pipeline from generating reads to calling gene expressions can involve different sources of error, e.g. the basecalling is a probabilistic procedure and the quality scores assigned to each base of the reads are prone to small likelihoods of being wrong. The certainty of reference genomes and mapping algorithms are additional issues that need attention. Thus, the entire RNA-Seq sample processing pipeline should be considered from a probabilistic point of view. In this study, we model the above mentioned errors as a noise term in the model of the sample processing pipeline.

## Discussion

In this section, we consider three specific models for “actual” gene expression data, RNA-Seq count data and SAGE tags data. The models are used to synthetically generate data for the simulation experiments described in this paper. The performances of different classification schemes are analyzed and compared across these three synthetically generated types of data.

A common assumption for modeling of the original mRNA expressions is that they follow a multivariate Gaussian distribution [9, 23, 24]. Starting with this assumption, we model and generate the RNA-Seq and SAGE data by applying a specific nonlinear Poisson transformation to the mRNA expression model. All data are synthetically generated according to a biologically relevant model to emulate the real experimental situations where the number of features/genes is very large, usually tens of thousands, and only a small number of sample points is available. Knowing the full distributional characteristics of the synthetic data makes it possible to measure the classification performance as described by the respective error rates.

### MVN-GC model

The model proposed in [9] uses a block-based structure on the covariance matrix, which is a standard tool to model groups of interacting variables where there is negligible interaction between the groups. In genomics, genes within a block represent a distinct pathway and are correlated, whereas genes not in the same group are uncorrelated [25, 26]. Sample points are drawn randomly and independently from two equally likely classes, 0 and 1, each sharing the same *D* features. There are also *c* equally likely subclasses in class 1 with different parameters for the probability distribution of the features. The subclasses model scenarios typically seen as different stages or subtypes of a cancer. Each sample point in class 1 belongs to one and only one of these subclasses. Features are categorized into two major groups: markers and non-markers. Markers resemble genes associated with a disease or condition related to the disease and they have different class-conditional distributions for the two classes.

Markers can be further categorized into two different groups: global and heterogeneous markers. Global markers take on values from *D*_{gm}-dimensional Gaussian distributions with parameters $({\mu}_{0}^{\text{gm}},{\Sigma}_{0}^{\text{gm}})$ for the sample points from class 0 and $({\mu}_{1}^{\text{gm}},{\Sigma}_{1}^{\text{gm}})$ for the points from class 1. Heterogeneous markers, on the other hand, are divided into *c* subgroups of size *D*_{hm}, each associated with one of *c* mutually exclusive subclasses within class 1. Therefore, a sample point that belongs to a specific subclass has *D*_{hm} heterogeneous markers distributed as a *D*_{hm}-dimensional Gaussian with parameters $({\mu}_{1}^{\text{hm}},{\Sigma}_{1}^{\text{hm}})$. The same *D*_{hm} heterogeneous markers for the sample points belonging to other subclasses, as well as points in class 0, follow a different *D*_{hm} -dimensional Gaussian distribution with parameters $({\mu}_{0}^{\text{hm}},{\Sigma}_{0}^{\text{hm}})$. We assume that the global and heterogeneous markers have similar structure for the covariance matrices. Therefore, we represent the covariance matrices of these two types of markers by ${\Sigma}_{0}={\sigma}_{0}^{2}\Sigma $ and ${\Sigma}_{1}={\sigma}_{1}^{2}\Sigma $ for class 0 and class 1, respectively, where ${\sigma}_{0}^{2}$ and ${\sigma}_{1}^{2}$ can be different. For this structure, we assume that *Σ* has the following block structure:

where *Σ*_{
ρ
} is an *l*×*l* matrix, with 1 on the diagonal and *ρ* off the diagonal. Note that the dimension of **Σ** is different for the global and heterogeneous markers. Furthermore, we assume that the mean vectors for the global markers and the heterogeneous markers possess the same structure denoted by *μ*_{0}=*m*_{0}×(1,1,…,1) and *μ*_{1}=*m*_{1}×(1,1,…,1) for class 0 and class 1, respectively, where *m*_{0} and *m*_{1} are scalars. The non-markers are also divided into two groups: high-variance and low-variance non-markers. The *D*_{hv} non-markers belonging to the high-variance group are uncorrelated and their distributions are described by $\mathit{\text{pN}}({m}_{0},{\sigma}_{0}^{2})+(1-p)N({m}_{1},{\sigma}_{1}^{2})$, where *m*_{0}, *m*_{1}, ${\sigma}_{0}^{2}$ and ${\sigma}_{1}^{2}$ take values equal to the means and variances of the markers, respectively, and *p* is a random value uniformly distributed over [0, 1]. The *D*_{lv} low-variance non-markers are uncorrelated and have identical one-dimensional Gaussian distributions with parameters $({m}_{0},{\sigma}_{0}^{2})$[9, 27]. Figure 1 represents the block-based structure of the model.

### NGS-reads and SAGE-tags models

In NGS-type experiments, the gene-expression levels are measured by discrete values providing the number of reads that map to the respective gene. Several statistical models have been proposed for representing NGS data. Those models are based on either negative binomial or Poisson distributions [4-6]. In what follows, we denote the read count for gene *j* for sample point *i* by *X*_{i,j}, where it is assumed that in an NGS experiment each lane has a single biological specimen belonging to either class 0 or class 1. Furthermore, we select a model where the number of reads for each gene is generated from a Poisson distribution with a known parameter. We calculate the expected number of reads (mean of the Poisson distribution) from the generalized linear model [28]:

where *λ*_{i,j} is the *j*th gene-expression level in lane *i*. The term *θ*_{i,j} represents technical effects that might be associated with an experiment. The term log*s*_{
i
} is an offset where *s*_{
i
}, referred to as “sequencing depth” in the statistical community [29], denotes a major factor in the transformation from expression levels to read data. It accounts for different total numbers of reads produced by each lane and plays an important role in normalizing the specimens across the flowcell. The trimmed mean of M values (TMM) [30], quantile normalization [28], and median count ratio [4] are three commonly used methods for estimating the sequencing depth. Equation (1) represents the expected value of reads, conditioned on the sequencing depth, based on the linear combination of the factors that affect its value: the depth of sequencing, the gene-expression level and a general noise term. Therefore, it can be used to model the expected number of reads, as the mean of the Poisson distribution in our synthetic data generation pipelines. Rewriting equation (1) yields

indicating that if *λ*_{i,j} and *θ*_{i,j} are normally distributed, then exp(*λ*_{i,j}+*θ*_{i,j}) will have a log-normal distribution. Therefore, for a given *s*_{
i
} the mean of *X*_{i,j} is log-normally distributed. This phenomenon has been previously reported for microarray studies where the means of expression levels are shown to have log-normal distributions [23, 31]. Furthermore, we assume that the offset *s*_{
i
} is random and uniformly distributed [29]. Because the term *θ*_{i,j} represents the unknown technical effects associated with the experimentation, we assume that it follows a Gaussian distribution with zero mean and a variance set by the *coefficient of variation (COV)*:

COV aims to model the unknown errors that can occur during an/a NGS/SAGE experiment, including basecalling, mapping reads to the reference genome, etc. The term *E*[ *X*_{i,j}|*s*_{
i
}] serves as the single parameter of a Poisson distribution that models the NGS/SAGE processes by generating random non-negative integers, as the read counts or tag numbers data, having expected value equal to the right-hand side of equation (2).

To generate synthetic datasets for the purposes of our simulation study we proceed as follows: for a sample point *i* in the experiment, we first randomly generate a number *s*_{
i
} from a uniform distribution, *U* (*α*,*β*), where *α* > 0 and *β* > *α*. Then, a random sample point, *λ*_{i,j}, is drawn from the MVN-GC model and its value is perturbed with *θ*_{i,j}, which is drawn randomly according to its distribution defined in (3). Using (2), the mean of the Poisson distribution is calculated for each sample point *i* and gene *j*, and a single random realization of *X*_{i,j} is generated. The processes of generating count data for RNA-Seq reads and SAGE tag numbers are very similar, but the total number of reads per NGS run is significantly more than tags for a SAGE experiment. Therefore, we only change *α* and *β* to get the desired number of read counts or tags. We also assume that the SAGE experiments always have a fixed range for the total number of tags, whereas RNA-Seq has a variety of ranges for the total read counts. By having different ranges for the read counts, we can compare the performance of classification resulting from NGS-reads and SAGE-tags models under different experimental settings.

### Classification schemes

The setup for the classification problem is determined by a joint feature-label probability distribution *F*_{
X
Y
}, where $\mathbf{X}\in {R}^{D}$ is a random feature vector and *Y* ∈ {0,1} is the unknown class label of **X**. In the context of genomics, the feature vector is usually the expression levels of many genes and the class labels are different types or stages of disease to which sample points belong to. A classifier rule model is defined as a pair (*Ψ*, Ξ), where *Ψ* is a classification rule, possibly including feature selection, and Ξ is a training-data error estimation rule. In a typical classification task, a random training set ${\mathcal{S}}_{n}=\left\{\right({\mathbf{X}}_{1},{Y}_{1}),({\mathbf{X}}_{2},{Y}_{2}),\dots ,({\mathbf{X}}_{n},{Y}_{n}\left)\right\}$ is drawn from *F*_{
X
Y
} and the goal is to design a classifier ${\psi}_{n}=\Psi \left({\mathcal{S}}_{n}\right)$, which takes **X** as the input and outputs a label *Y*. The true classification error of *ψ*_{
n
} is given by *ε*_{
n
} = *P* (*ψ*_{
n
} (**X**) ≠ *Y*). The error estimation rule Ξ provides an error estimate, ${\widehat{\epsilon}}_{n}=\mathrm{\Xi}\left({\mathcal{S}}_{n}\right)$, for *ψ*_{
n
}.

In this study, we consider linear discriminant analysis (LDA), three nearest neighbors (3NN) and radial basis function support vector machine (RBF-SVM) as the classification rules, and report the true error of the classifiers. We implement *t*-test feature selection (as a part of the classification rule) before the classifier design procedure to select *d* ≤ *D* features with highest *t*-scores. The training set with *d* features is then used to design the classifier. The same *d* features are also used for finding the true error of the designed classifier.

LDA is the plug-in rule for the Bayes classifier when the class-conditional densities are multivariate Gaussian with a common covariance matrix [32]. The sample means and pooled sample covariance matrix are estimated from the training data ${\mathcal{S}}_{n}$ and plugged into the discriminant function. If the classes are equally likely, LDA assigns **x** to class 1 if and only if

where ${\stackrel{\u0304}{\mu}}_{y}$ is the sample mean for class *y* ∈ {0,1}, and $\hat{\mathbf{\Sigma}}$ is the pooled sample covariance matrix.

3NN is a special case of *k*NN rule (with *k* = 3), which is a nonparametric classification rule based on the training data. The *k*NN classifier assigns a label, 0 or 1, to a point **x** according to the majority of the labels of the *k* nearest training data points to it. To avoid tied votes in binary classification problems, an odd number is usually chosen for *k*.

A support vector machine finds a maximal margin hyperplane for a given set of training sample points. If it is not possible to linearly separate the data, one can introduce some slack variables in the optimization procedure that allow the mislabeled sample points and solve the dual problem. Alternatively, one can transform the data and project it into a higher-dimensional space, where it becomes linearly separable. The equivalent classifier back in the original feature space will generally be non-linear [33, 34]. If a Gaussian radial basis function used as the kernel function, the corresponding classifier is referred to as RBF-SVM.

### Simulation setup and parameters

Figure 2 presents an overview of the simulations performed in the study. In this section we provide the setup and list of parameters used in the study. The analysis of the results follows in the next section.

To emulate real-world scenarios involving thousands of genes with only a few sample points, we choose *D* = 20,000 as the number of features and *n* ∈ {60, 120, 180} as the number of sample points available for each synthetic feature-label distribution. Because there is no closed form expression to calculate the true error of the designed classifiers, we generate a large independent test sample of size *n*_{
t
} = 3000, with samples divided equally between the two classes. Because the RMS between the true and estimated error when using independent test data is bounded above by $1/2\sqrt{{n}_{t}}$, this test sample size provides an error-estimate RMS of less than 0.01. Once the training and test data are generated, we normalize the training data so that each feature is zero-mean unit-variance across all sample points in both classes. We also apply the same normalization coefficients from the training data to the test set. The normalized data are then used in the feature selection, classifier design and error calculation. The parameter settings for the MVN-GC model, SAGE-tags and NGS-reads are provided in Table 1.

As explained in the previous section, the datasets generated from the MVN-GC model are transformed into the NGS-reads and SAGE-tags datasets through equation (2) and Poisson processes. We only need to properly set the parameters COV, *α*, and *β* to get the desired number of read counts or tags. We assume that the parameter COV can take on two values, 0.01 and 0.05, representing two different levels of noise and unknown errors in the experiment. In its current state, RNA-Seq technology can provide different numbers of reads per sample, depending on many factors, such as quality of the sample, the desired coverage, sample multiplexing, etc. In this study, we examine a variety of ranges for RNA-Seq experiments. We start with a very low total number of RNA-Seq reads, which may not match the real-world experiments, however, they are necessary for comparing the SAGE-tags and NGS-reads models with similar coverage. Furthermore, demonstrating the classification results for the NGS-reads model with wide ranges introduces an extra internal variability, which makes interpretations of the results rather difficult. Table 1 lists the NGS-reads ranges we have considered in this study. In a typical SAGE experiment, one expects 50K to 100K tags [35, 36]. Using trial and error, we have found that by choosing the parameters *α* = 2.0 and *β* = 3.75 in the distribution of *s*_{
i
}, the observed number of tags usually falls within this range. Similarly, the parameters *α* and *β* are chosen to meet the range requirements in the NGS-reads model.

Our goal is to study the performance of the traditional classification schemes on different sources of random samples; thus, we take a Monte-Carlo approach and generate 10,000 random training sets of size *n* from the MVN-GC model, transform them to the corresponding NGS-reads and SAGE-tags samples, and apply the classification schemes to each training sample. By taking such an approach, we aim to compare the classification performance of the three pipelines, in terms of the true error of the deigned classifiers. We also study the effect of NGS-reads and SAGE-tags transformations on the performance of a simple *t*-test biomarker discovery method, where we report the probability that global markers are recovered when *d* ≪ *D* features are selected after the feature-selection step.

## Results

The probability of recovering a certain number of global markers after a *t*-test feature selection can be approximated empirically by the percentage of experiments (out of 10,000 independent experiments) that detect such a number of markers. This probability depends on the size of the training data sample, quality of features, and underlying joint probability distribution of the features. Here, we only show the results for the MVN-GC model, with *d* = 10, in Table 2. Tables 3, 4, 5 and 6 represent the corresponding results for NGS-reads for different combinations of COV and *ρ*: {0.05, 0.4}, {0.05, 0.8}, {0.1, 0.4}, {0.1, 0.8}, respectively. In each table, we also report the corresponding results for the SAGE-tags model in a row with the NGS-reads range of [ 50*K* − 100*K*]. A successful feature selection should identify as many global markers as possible, however the situation is worsened because with small sample sizes the noisy features sometimes may appear to be good among the 20,000 features. As the number of sample points increases, we expect to get better results for the feature selection and this is exactly what we see in Tables 2, 3, 4, 5 and 6. Another important observation in Tables 3, 4, 5 and 6 is the effect of the total number of reads and COV for the NGS-reads models. As the number of reads increases, it is more likely to pick up more global markers, until the number of reads reaches a threshold, where no further improvement is observed. Similarly, for smaller COV the probability of selecting more global markers also increases.

The true error of a designed classifier measures the generalization capability of the classifier on a future sample point. Given a set of training sample points and a classification rule, one needs the full feature-label probability distribution to calculate the true error of the classifier designed on the training set. In this study, we find the true error of a classifier with a large independent sample. Because the training sample is random, the true error is a random variable with its own probability distribution. Therefore, to demonstrate the actual performance of the designed classifiers, the estimated probability density function (PDF) of the true error for each classification rule, distribution model and all data generation models (MVN-GC, NGS-reads and SAGE-tags) is presented.

We report the true errors of the designed classifiers for two different feature-selection schemes and classification rules. In the first scheme, no feature selection is performed and the first *d* global markers are directly used for classification. In the second scheme, *t*-test feature selection is done before a classifier is designed to select the best *d* features. Figures 3, 4 and 5 show the salient finding of this study. We present the PDF of the true error for different classification rules trained on 60 and 180 sample points when the correlation among features in the same block is high (*ρ* = 0.8) and COV = 0.05. Distributions with higher and tighter densities around lower true errors indicate better classifier performance. If the PDFs can be approximated as univariate Gaussians, then a good classification performance amounts to smaller mean and variance. In all cases, for similar sample sizes and similar settings for *ρ* and COV, MVN-GC outperforms the two other data types. This is not surprising since it is considered as the ground truth. Also, it is evident that a larger sample size gives better classifiers with smaller variance as illustrated by the distribution of the true error.

Figures 3, 4 and 5 show that utilizing the best *d* features (with or without feature selection) in the model, SAGE-tags and NGS-reads for small ranges of the total reads yield similar results (or even much worse when the range is smaller) in terms of the classification performance, and both are inferior when compared to the ground truth. This may lead to a conclusion that one should not expect improvements in classification performance when gene-expression levels are processed through an NGS-reads pipeline. However, our main objective here is to show that as one increases the total number of reads for the NGS-reads model, improvement can be achieved and the error rates decrease. This conclusion confirms the intuition provided by a simple calculation about the increase of the separability of two Poisson random variables with means proportional to the number of reads. However, notice that the modeling of the sequencing pipeline introduces randomness in the means of the respective Poisson parameters describing the individual gene reads. Moreover, our focus is not that much on the separability of the two classes/phenotypes but rather on the classification performance as measured by the classification error and its dependence on ground truth (MVN-GC model) sample size, sequencing depth, and feature vectors. Our goal for modeling NGS-reads with small ranges is to demonstrate the performance of SAGE and RNA-Seq, when their data is similar. This result suggests that having a larger number of reads for the RNA-Seq experiments could compensate for the errors that can occur during the NGS-reads pipeline sample processing. Here we have shown the results only for COV = 0.05, since in our observations, it appears that changing COV has little effect on the distribution of the true error for both SAGE-tags and NGS-reads models, except that it slightly increases the variance of the PDFs. The effect of feature selection (right columns) on the performance is best shown when the sample size is small. In this case, the variance of the true error is so large that drawing any conclusion regarding the performance on a small sample would be futile. Another interesting observation is that, on average, 3NN and RBF-SVM classification rules outperform LDA for both NGS-reads and SAGE-tags data, supporting the idea that using linear classifiers for these types of data is not the best way to proceed in a highly non-Gaussian model - as is the case after the data have gone through the pipeline. The best rates for the expected true error across all settings are achieved by the RBF-SVM classification rule, even for small samples, especially with small variance.

## Conclusions

In this paper, we model gene-expression levels as a multivariate Gaussian distribution that statistically captures the real mRNA levels within the cells. The newly developed technologies of sequencing DNA/RNA, referred to as NGS, and their ascendant SAGE technology generate discrete measures for gene expressions. The count data from these technologies can be modeled with a Poisson distribution. The multivariate Gaussian gene expressions are transformed through a Poisson filter to model NGS and SAGE technologies. The three categories of data are subjected to feature selection and classification. The objective is to evaluate the performance of the NGS technologies in classification. Our simulations show that when the gene expressions are directly used in classification, the best performance in terms of the classification error is achieved. The NGS-reads model generates considerably higher coverage for genes and can outperform SAGE in classification, when the experiment generates large number of reads. Even though NGS still has a variety of error sources involved in its process, its high volume of reads for a specific gene can lower the chance of misclassification. Thus, it is important to use the highest possible coverage for the entire genome while performing a RNA-Seq analysis if the goal of the study is to distinguish the samples of interest. Nevertheless, one must recognize that, as is typical with nonlinear transformations, the NGS pipeline transforms the original Gaussian data in such a way as to increase classification difficulty. As more refined modeling becomes available for the NGS pipeline, further study needs to performed, as has been done in the case of the LC-MS proteomics pipeline [1], to determine which segments of the pipeline are most detrimental to classification and what, if anything, can be done to mitigate classification degradation.

## References

- 1.
Sun Y, Braga-Neto UM, Dougherty ER: Modeling and systematic analysis of the LC-MS Proteomics Pipeline. BMC Genomics. 2012, 13 (Supp 6): S2-

- 2.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226.

- 3.
Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456 (6): 53-59.

- 4.
Anders S, Huber H: Differential expression analysis for sequence count data. Genome Biol. 2010, 11 (10): R106-10.1186/gb-2010-11-10-r106.

- 5.
Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26: 139-140. 10.1093/bioinformatics/btp616.

- 6.
Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18 (9): 1509-1517. 10.1101/gr.079558.108.

- 7.
Robinson MD, Smyth GK: Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007, 23 (21): 2881-2887. 10.1093/bioinformatics/btm453.

- 8.
Robinson MD, Smyth GK: Small sample estimation of negative binomial dispersion, with applications to SAGE data. Biostat. 2008, 9 (2): 321-332.

- 9.
Hua J, Waibhav T, Dougherty ER: Performance of feature selection methods in the classification of high-dimensional data. Pattern Recognit. 2009, 42 (3): 409-424. 10.1016/j.patcog.2008.08.001.

- 10.
Mardis ER: Next-generation DNA sequencing methods. Ann Rev Genomics Human Genet. 2008, 9: 387-402. 10.1146/annurev.genom.9.081307.164359.

- 11.
Auer PL, Doerge RW: Statistical design and analysis of RNA sequencing data. Genet. 2010, 185 (2): 405-416. 10.1534/genetics.110.114983.

- 12.
Sun W: A statistical framework for eQTL mapping using RNA-seq data. Biometrics. 2011, 68: 1-11.

- 13.
Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O’Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo ML: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008, 321: 956-960. 10.1126/science.1160342.

- 14.
Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, Kitzman JO, Baker C, Malig M, Mutlu O, Sahinalp SC, Gibbs RA, Eichler EE: Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet. 2009, 41: 1061-1067. 10.1038/ng.437.

- 15.
Craig DW, Pearson JV, Szelinger S, Sekar A, Redman M, Corneveaux JJ, Pawlowski TL, Laub T, Nunn G, Stephan DA, Homer N, Huentelman MJ: Identification of genetic variants using bar-coded multiplexed sequencing. Nature Methods. 2008, 5: 887-893. 10.1038/nmeth.1251.

- 16.
Jiang L, Schlesinger F, Davis CA, Zhang Y, Li R, Salit M, Gingeras TR, Oliver B: Synthetic spike-in standards for RNA-seq experiments. Genome Res. 2011, 21 (9): 1543-1551. 10.1101/gr.121095.111.

- 17.
Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R25-10.1186/gb-2009-10-3-r25.

- 18.
Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.

- 19.
Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008, 18: 1851-1858. 10.1101/gr.078212.108.

- 20.
Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M: SHRiMP: accurate mapping of short color-space reads. PLoS Comput Biol. 2009, 5 (5): e1000386-10.1371/journal.pcbi.1000386.

- 21.
Hach F, Hormozdiari F, Alkan C, Hormozdiari F, Birol I, Eichler E, Sahinalp SC: mrsFAST: a cache-oblivious algorithm for short-read mapping. Nature Methods. 2010, 7: 576-577. 10.1038/nmeth0810-576.

- 22.
Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics. 2008, 24 (5): 713-714. 10.1093/bioinformatics/btn025.

- 23.
Attoor S, Dougherty ER, Chen Y, Bittner ML, Trent JM: Which is better for cDNA-microarray-based classification: ratios or direct intensities. Bioinformatics. 2004, 20 (16): 2513-2520. 10.1093/bioinformatics/bth272.

- 24.
Dalton LA, Dougherty ER: Application of the Bayesian MMSE error estimator for classification error to gene-expression microarray data. Bioinformatics. 2011, 27 (13): 1822-1831. 10.1093/bioinformatics/btr272.

- 25.
Dougherty ER: Validation of computational methods in genomics. Curr Genomics. 2007, 8: 1-19. 10.2174/138920207780076956.

- 26.
Shmulevich I, Dougherty ER: Genomic Signal Processing. 2007, Princeton: Princeton University Press

- 27.
Yousefi MR, Hua J, Dougherty ER: Multiple-rule bias in the comparison of classification rules. Bioinformatics. 2011, 27: 1675-1683. 10.1093/bioinformatics/btr262.

- 28.
Bullard JH, Purdom E, Hansen KD, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010, 11 (94): 1471-2105.

- 29.
Li J, Wittn DM, Johnstone IM, Tibshirani R: Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostat. 2010, 11 (94): 1471-2105.

- 30.
Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010, 11 (R25): 1471-2105.

- 31.
Hoyle DC, Rattray M, Jupp R, Brass A: Making sense of microarray data distributions. Bioinformatics. 2002, 18 (4): 576-584. 10.1093/bioinformatics/18.4.576.

- 32.
Duda RO, Hart PE, Stork DG: Pattern Classification. 2001, New York: Wiley

- 33.
Boser BE, Guyon IE, Vapnik VN: A training algorithm for optimal margin classifiers. Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory. 1992, 144-152. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.3818,

- 34.
Cortes C, Vapnik VN: Support-vector networks. Mach Learn. 1995, 20: 273-297.

- 35.
Wang SM: Understanding SAGE data. TRENDS Genet. 2006, 23: 42-50.

- 36.
Bianchetti L, Wu Y, Guerin E, Plewniak F, Poch O: SAGETTARIUS: a program to reduce the number of tags mapped to multiple transcripts and to plan SAGE sequencing stages. Nucleic Acids Res. 2007, 35 (18): e122-10.1093/nar/gkm648.

## Author information

### Affiliations

### Corresponding author

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors’ contributions

NG proposed the main idea, planned/structured the study, and worked on the simulations and manuscript. MRY contributed to the main idea, simulation, structure of the study, and the manuscript. CDJ and II helped in conceiving the study and revised the manuscript. ERD contributed to the formulation of main idea and also to the steps of the study, and revised the manuscript. All authors read and approved the final manuscript.

## Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Ghaffari, N., Yousefi, M.R., Johnson, C.D. *et al.* Modeling the next generation sequencing sample processing pipeline for the purposes of classification.
*BMC Bioinformatics* **14, **307 (2013). https://doi.org/10.1186/1471-2105-14-307

Received:

Accepted:

Published:

### Keywords

- Next Generation Sequencing
- Linear Discriminant Analysis
- Next Generation Sequencing Technology
- True Error
- Next Generation Sequencing Data