### Bayesian semi-supervised classification model

Standard MLST databases contain DNA sequences for 7 housekeeping genes shared by a pathogen species or a species group. Typical lengths of these genes vary in the range 350-500 basepairs. Let *g* = 1, ..., 7, denote the index of a single MLST gene and **x**_{ig} the observed DNA sequence for gene *g* in strain *i*. It is assumed that the sequences **x**_{ig} are aligned and of the same length *d*_{g} for all considered strains. The total set of sequences for each strain is written as **x**_{i}. Each element *x*_{ijg} in **x**_{ig} belongs to the finite alphabet {A, C, G, T}, which is uniquely mapped to a set of integers such that we get the sample space {1, 2, 3, 4} for each site *j*, *j* = 1, ..., *d*_{g}. However, to obtain a less parameter-heavy classification model, we define the sample spaces for all 3-mers in these sequences in a more parsimonious manner (for details, see below).
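For illustration, the integer coding of the alphabet might be written as follows; the particular integer assignment is an arbitrary choice for this sketch, since only the uniqueness of the mapping matters:

```python
# Hypothetical integer coding of the nucleotide alphabet; only the
# uniqueness of the mapping matters for the model, so this particular
# assignment is an arbitrary illustrative choice.
ALPHABET = {"A": 1, "C": 2, "G": 3, "T": 4}

def encode(seq):
    """Map an aligned DNA sequence to its integer representation."""
    return [ALPHABET[base] for base in seq]
```

For example, `encode("ACGT")` yields `[1, 2, 3, 4]`.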

Corander and Tang [4] introduced a Bayesian second-order Markov model for unsupervised classification of MLST sequence data, which aims at a balance between a parsimonious parametrization and an adequate representation of dependencies in observed nucleotide frequencies among neighboring sites. Such standard Markovian structures are ubiquitous in statistical modeling of DNA sequences. Here we adapt this modeling framework to a semi-supervised setting, where training data are used to pre-specify a finite set of *k*_{1} possible distinct sources of new test strains, while not excluding the possibility that some (or even all) of these have emerged from a previously unseen evolutionary group. Let *V*_{g} = {1, ..., *d*_{g}} denote the index set of the site variables *x*_{ijg} and *G*_{g} = *G*_{g}(*V*_{g}, *E*_{g}) an undirected graph on the node set *V*_{g} with the edges in set *E*_{g}. The edge set is determined by a second-order Markov structure, where for any pair of site indices {*j*, *j**}, we have {*j*, *j**} ∈ *E*_{g} if and only if |*j* - *j**| < 3. Given the standard properties of decomposable graphical models [10], such a dependence structure leads to an explicit factorization of the joint probability distribution of site patterns given a joint classification of the training and query data. To define the factorization we let *cl*(*G*_{g}) and *sep*(*G*_{g}) denote the set of cliques and the set of separators of the cliques of graph *G*_{g}, respectively. The cliques correspond to all the triplets of consecutive site indices, whereas the separators correspond to all the pairs of consecutive site indices, except the first and last pairs for each gene.
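The cliques and separators of this chain-like graph can be enumerated directly from the gene length; a minimal sketch (1-indexed sites, not the paper's implementation):

```python
def cliques_and_separators(d):
    """Cliques and separators of the second-order Markov graph on
    sites 1, ..., d of a single gene: cliques are the triplets of
    consecutive sites, and separators are the pairs shared by
    neighbouring cliques, i.e. all consecutive pairs except the
    first and the last one."""
    cliques = [(j, j + 1, j + 2) for j in range(1, d - 1)]
    separators = [(j + 1, j + 2) for j in range(1, d - 2)]
    return cliques, separators
```

For a toy gene of length 6 this gives the cliques (1,2,3), (2,3,4), (3,4,5), (4,5,6) and the separators (2,3), (3,4), (4,5).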

Assuming there are in total *n* strains in a particular query, we index them by the set of integers *N* = {1, ..., *n*}. The observed sequence data for any subset *s* ⊆ *N* of query strains is given by the collection **x**^{(s)} = {**x**_{i} : *i* ∈ *s*}, and hence **x**^{(N)} represents the entire query data set. The sequence types (STs) existing in a curated MLST database are used as labeled training data, indexed by *M* = {1, ..., *m*}. The labels are assumed to be specified by an earlier analysis of the database contents, which divides the *m* STs into *k*_{1} distinct evolutionary groups using, for instance, an unsupervised classification with the BAPS software. The labeling *T* of the training data is a joint classification of the *m* STs into *k*_{1} classes, and we use **z**_{ig}, **z**_{i}, **z**^{(s)}, **z**^{(M)} for the training data in a notation analogous to that defined above for the query data.

For a set *a*_{g} of sequence sites indexed by *V*_{g}, such that the cardinality |*a*_{g}| equals three, we let **x**_{iag}, **z**_{iag} be the corresponding 3-mers observed in gene *g* for strain *i* in the query and training data sets. Further, we let *r*_{ag} equal the total number of distinct 3-mers observed at the sites *a*_{g} in the joint collection of query and training data.

Let *S* denote a joint classification of the *n* query STs into the *k*_{1} ≥ 1 sources labeled by the training data and *k*_{2} ≥ 0 putative novel sources. Thus, *S* = {*s*_{1}, ..., *s*_{*k*1+*k*2}} defines a (possibly) partially labeled partition of *N* (semi-supervised classification), such that *s*_{1} ∪ ... ∪ *s*_{*k*1+*k*2} = *N* and *s*_{c} ∩ *s*_{c'} = ∅, for all pairs {*c*, *c'*} ranging between 1, ..., *k*_{1} + *k*_{2}. The partition is completely labeled (supervised classification) when *k*_{2} = 0, that is, when no query STs are assigned into previously unknown sources. We use 𝒮 to denote the space of possible values of the semi-supervised classification structure *S*, conditional on a user-specified upper bound for *k*_{1} + *k*_{2}.

The joint conditional likelihood of the query data given the classification *S* and the training data labeling *T* is under our Markov model defined as

$$
p(\mathbf{x}^{(N)} \mid S, T, \theta) = \prod_{c=1}^{k_1+k_2} \prod_{g=1}^{7} \frac{\prod_{a_g \in cl(G_g)} \prod_{l=1}^{r_{a_g}} \theta_{c a_g l}^{\,n_{c a_g l}}}{\prod_{b_g \in sep(G_g)} \prod_{l=1}^{r_{b_g}} \theta_{c b_g l}^{\,n_{c b_g l}}}, \tag{1}
$$

where *b*_{g} and *r*_{bg} are defined for subsets in *sep*(*G*_{g}) analogously to the subsets in *cl*(*G*_{g}), *θ*_{cagl}, *θ*_{cbgl} are the probabilities of observing the *l*th 3-mer and 2-mer, respectively, in class *c*, and *n*_{cagl}, *n*_{cbgl} are sufficient statistics corresponding to the observed counts of the *l*th 3-mer and 2-mer in class *c*. The parameter *θ* in (1) is used as a joint abbreviation for all the continuous parameters in the expression, which correspond to the probabilities of observing the particular site patterns within the classes. Notice that the probabilities *θ*_{cbgl} and counts *n*_{cbgl} are unambiguously determined by marginalization from *θ*_{cagl} and *n*_{cagl}, since each *b*_{g} is a subsequence of an *a*_{g} with cardinality equal to two, which follows from the order of the Markov model. Since the probabilities *θ*_{cagl} are unknown parameters, the training data are used for learning them for the *k*_{1} *a priori* known classes, whereas only non-informative prior distributions are used for inferences about the remaining *k*_{2} classes. Furthermore, since these probabilities are nuisance parameters with respect to the classification task, they should be integrated out when making inferences about the classification *S*.
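The clique counts and their marginalization to separator counts can be sketched in a few lines; this is an illustrative counting scheme over aligned strings with 0-indexed sites, not the paper's implementation:

```python
from collections import Counter

def sufficient_statistics(seqs):
    """3-mer counts per clique for one gene within one class, with the
    separator 2-mer counts obtained by marginalization, as in the text.
    `seqs` is a list of aligned, equal-length sequences (0-indexed sites)."""
    d = len(seqs[0])
    # One count table per clique of three consecutive sites.
    clique3 = [Counter(s[j:j + 3] for s in seqs) for j in range(d - 2)]
    # Separator tables: skip the first and last site pairs, and
    # marginalize the corresponding clique table over its third site.
    sep2 = []
    for j in range(1, d - 2):
        table = Counter()
        for trimer, count in clique3[j].items():
            table[trimer[:2]] += count
        sep2.append(table)
    return clique3, sep2
```

Because every separator is contained in a clique, the 2-mer tables never need to be counted from the data directly, mirroring the marginalization argument above.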

Assuming standard Dirichlet prior distributions that are factorized with respect to the graphs *G*_{g} for all components of *θ* [10, 11], we can derive an analytical expression for the posterior probability *p*(*S* | **z**^{(M)}, **x**^{(N)}, *T*) of *S*. The conjugate Dirichlet prior is widely adopted, in particular in bioinformatics applications, due to the computational advantage provided by analytical marginalization over the frequency (nuisance) parameters in multinomial models. The posterior of *S* equals

$$
p(S \mid \mathbf{z}^{(M)}, \mathbf{x}^{(N)}, T) = \frac{p(S \mid T)\, p(\mathbf{x}^{(N)} \mid S, \mathbf{z}^{(M)}, T)}{f(\mathbf{z}^{(M)}, \mathbf{x}^{(N)}, T)}, \tag{2}
$$

where *p*(*S* | *T*) > 0 is the prior probability of *S* and *f*(**z**^{(M)}, **x**^{(N)}, *T*) is a normalizing constant equal to the sum

$$
f(\mathbf{z}^{(M)}, \mathbf{x}^{(N)}, T) = \sum_{S \in \mathcal{S}} p(S \mid T)\, p(\mathbf{x}^{(N)} \mid S, \mathbf{z}^{(M)}, T).
$$

The marginal likelihood *p*(**x**^{(N)} | *S*, **z**^{(M)}, *T*) factorizes into two integrals, one over the parameters of the *k*_{1} classes with training data and one over the parameters of the *k*_{2} putative novel classes:

$$
p(\mathbf{x}^{(N)} \mid S, \mathbf{z}^{(M)}, T) = \int p\!\left(\mathbf{x}^{(s_1)}, \ldots, \mathbf{x}^{(s_{k_1})} \mid S, \theta\right) p(\theta \mid \mathbf{z}^{(M)}, T)\, d\theta \cdot \int p\!\left(\mathbf{x}^{(s_{k_1+1})}, \ldots, \mathbf{x}^{(s_{k_1+k_2})} \mid S, \theta\right) p(\theta)\, d\theta.
$$

In the expressions below, *m*_{cagl} is the observed count of the *l*th 3-mer from the training data in class *c* and *m*_{cbgl} is the corresponding marginalized count. The first one of the two above integrals can be written in detail as

$$
\prod_{c=1}^{k_1} \prod_{g=1}^{7} \frac{\prod_{a_g \in cl(G_g)} \int \prod_{l=1}^{r_{a_g}} \theta_{c a_g l}^{\,n_{c a_g l}}\, p(\theta_{c a_g} \mid \mathbf{z}^{(M)}, T)\, d\theta_{c a_g}}{\prod_{b_g \in sep(G_g)} \int \prod_{l=1}^{r_{b_g}} \theta_{c b_g l}^{\,n_{c b_g l}}\, p(\theta_{c b_g} \mid \mathbf{z}^{(M)}, T)\, d\theta_{c b_g}},
$$

where the clique term

$$
\int \prod_{l=1}^{r_{a_g}} \theta_{c a_g l}^{\,n_{c a_g l}}\, p(\theta_{c a_g} \mid \mathbf{z}^{(M)}, T)\, d\theta_{c a_g}
$$

equals

$$
\frac{\Gamma\!\left(\sum_{l=1}^{r_{a_g}} (m_{c a_g l} + \lambda_{c a_g l})\right)}{\prod_{l=1}^{r_{a_g}} \Gamma(m_{c a_g l} + \lambda_{c a_g l})} \int \prod_{l=1}^{r_{a_g}} \theta_{c a_g l}^{\,n_{c a_g l} + m_{c a_g l} + \lambda_{c a_g l} - 1}\, d\theta_{c a_g},
$$

which further simplifies to the expression

$$
\frac{\Gamma\!\left(\sum_{l=1}^{r_{a_g}} \lambda^{*}_{c a_g l}\right)}{\Gamma\!\left(\sum_{l=1}^{r_{a_g}} (n_{c a_g l} + \lambda^{*}_{c a_g l})\right)} \prod_{l=1}^{r_{a_g}} \frac{\Gamma(n_{c a_g l} + \lambda^{*}_{c a_g l})}{\Gamma(\lambda^{*}_{c a_g l})},
$$

where λ*_{cagl} = *m*_{cagl} + λ_{cagl}. Correspondingly, the separator term equals

$$
\frac{\Gamma\!\left(\sum_{l=1}^{r_{b_g}} (m_{c b_g l} + \lambda_{c b_g l})\right)}{\prod_{l=1}^{r_{b_g}} \Gamma(m_{c b_g l} + \lambda_{c b_g l})} \int \prod_{l=1}^{r_{b_g}} \theta_{c b_g l}^{\,n_{c b_g l} + m_{c b_g l} + \lambda_{c b_g l} - 1}\, d\theta_{c b_g},
$$

which in turn simplifies to

$$
\frac{\Gamma\!\left(\sum_{l=1}^{r_{b_g}} \lambda^{*}_{c b_g l}\right)}{\Gamma\!\left(\sum_{l=1}^{r_{b_g}} (n_{c b_g l} + \lambda^{*}_{c b_g l})\right)} \prod_{l=1}^{r_{b_g}} \frac{\Gamma(n_{c b_g l} + \lambda^{*}_{c b_g l})}{\Gamma(\lambda^{*}_{c b_g l})},
$$

where λ*_{cbgl} = *m*_{cbgl} + λ_{cbgl}. We set all the Dirichlet hyperparameters λ_{cagl} equal to a common reference value, which is a generalization of Jeffreys' prior and reflects *a priori* symmetry with respect to the 3-mer values. For a detailed discussion about such reference priors see [11]. The prior distribution of *S* is set equal to the uniform distribution in 𝒮, which is defined as *p*(*S* | *T*) = |𝒮|^{-1}, where |𝒮| refers to the cardinality of the space 𝒮. Similarly, the second integral can be written as the analogous product over the classes *c* = *k*_{1} + 1, ..., *k*_{1} + *k*_{2}, with all training counts *m*_{cagl}, *m*_{cbgl} set to zero, since the previously unseen sources lack the training data observations.
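Each clique or separator table above contributes a ratio of gamma functions, which in practice is evaluated on the log scale. A minimal sketch for a single table, assuming the query counts `n`, training counts `m` (all zeros for a novel class) and hyperparameters `lam` are given as lists aligned over the distinct 3-mers:

```python
from math import lgamma

def log_marginal(n, m, lam):
    """Log marginal likelihood contribution of one clique (or separator)
    table: query counts n_l, training counts m_l (all zeros for a novel
    class) and Dirichlet hyperparameters lambda_l, with the nuisance
    probabilities integrated out analytically."""
    post = [m_l + lam_l for m_l, lam_l in zip(m, lam)]  # lambda* = m + lambda
    upd = [n_l + p_l for n_l, p_l in zip(n, post)]      # n + lambda*
    return (lgamma(sum(post)) - lgamma(sum(upd))
            + sum(lgamma(u) - lgamma(p) for u, p in zip(upd, post)))
```

As a sanity check, with a single observation, no training data and unit hyperparameters, `log_marginal([1, 0], [0, 0], [1.0, 1.0])` equals log(1/2), as expected under a uniform prior.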

We define our joint semi-supervised classifier as the classification structure *Ŝ* corresponding to the posterior mode over the distribution specified in (2):

$$
\hat{S} = \arg\max_{S \in \mathcal{S}}\; p(S \mid \mathbf{z}^{(M)}, \mathbf{x}^{(N)}, T).
$$

Given *Ŝ*, one may calculate the conditional posterior distribution over the possible assignments of the *n* query STs according to

$$
p(i \in s_c \mid \hat{S}, \mathbf{z}^{(M)}, \mathbf{x}^{(N)}, T) = \frac{p(\hat{S}(i \in s_c) \mid \mathbf{z}^{(M)}, \mathbf{x}^{(N)}, T)}{\sum_{c'=1}^{k_1+k_2} p(\hat{S}(i \in s_{c'}) \mid \mathbf{z}^{(M)}, \mathbf{x}^{(N)}, T)},
$$

where *Ŝ*(*i* ∈ *s*_{c}) is the mode classification with the *i*th query strain re-assigned to class *c*. These probabilities reflect the local posterior uncertainty about the possible sources of the query STs, and they can be calculated in a simple manner using the above analytical expressions. In the next section it is shown how fast stochastic optimization can be used to obtain a plausible estimate of *Ŝ* in the online setting considered here.
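Since the posterior values entering this ratio are available only up to the normalizing constant and on the log scale, the assignment probabilities are conveniently computed with a log-sum-exp normalization; a sketch assuming the vector of unnormalized log posteriors over the *k*_{1} + *k*_{2} candidate classes is given:

```python
from math import exp

def assignment_probabilities(log_post):
    """Normalize the unnormalized log posterior values of the candidate
    re-assignments of one query ST over the k1 + k2 classes, using the
    log-sum-exp trick for numerical stability."""
    mx = max(log_post)
    weights = [exp(v - mx) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]
```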

### Inference algorithm

A standard Bayesian supervised classifier, for example the naive Bayes classifier [12], would treat each query ST separately and assign it to the class maximizing the posterior probability among the *k*_{1} known alternative sources. Such an approach has very modest computational complexity, and it can easily be extended to the semi-supervised classification task where any single query ST is allowed to be assigned to an additional class lacking training data. However, considering the query STs individually has the disadvantage that when multiple STs are assigned to a previously unknown evolutionary group, the classifier provides no information about whether they should be interpreted as a single group or eventually be split into multiple novel lineages. In addition, when compared to a simultaneous classification, separate classification of all query STs offers lower statistical power to detect strains from novel groups that are only modestly distinct from the *k*_{1} groups in the training data. On the other hand, simultaneous semi-supervised classification of the query STs is computationally substantially more challenging than a separate classification, since the search operators must allow for the presence of multiple novel subsets of strains. Standard Bayesian computational tools, such as the Gibbs sampler [13], provide a straightforward way to implement a simultaneous semi-supervised classifier. However, due to their notoriously slow convergence for mixture models, they do not offer a sufficiently versatile solution for an online application where query assignments are expected to be provided on a nearly real-time basis. Hanage et al. [14] analyzed a large MLST database for which they concluded that a Gibbs sampler based approach did not converge with a reasonable computational effort (~3 days on a single CPU). The same convergence issue was also explored for a different data type in [15], where a Gibbs sampler and a stochastic greedy search algorithm were compared. Therefore, for the semi-supervised classification we use the same efficient non-reversible stochastic search operators that are used for unsupervised classification of MLST data in the BAPS algorithm.

Given a set of query data and a preprocessed MLST database in which the STs are divided into *k*_{1} groups, it is first necessary to determine the total number *r*_{ag} of distinct 3-mers observed in the joint collection of query and training data for all collections of sites *a*_{g} over the genes. This requires a linear scan of the observed sequences in the query data. Additionally, pairwise Hamming sequence distances are calculated for all pairs of query STs, as these are used to guide the stochastic search for the optimal classification. Notice that the unnormalized posterior probability distribution over the possible assignments *S* of the query strains is uniquely determined by the sufficient statistics *n*_{cagl}, *m*_{cagl} and the Dirichlet prior hyperparameters λ_{cagl}. Therefore, an efficient algorithm for finding the classification that maximizes the posterior can be constructed by book-keeping the changes in the sufficient statistics implied by re-assignments of subsets of query strains. The search operators in 𝒮 that are used for improving any current state *S* of the simultaneous assignment of query STs work as follows:

1. In a random order, relocate each single ST to the class in *S* that leads to the maximal increase in the posterior probability (2). This operator explicitly considers the assignment of each ST into a new singleton class, unless that would increase the number of classes *k*_{2} beyond the user-specified upper bound.

2. Merge the STs in the two classes of *S* whose merging leads to the maximal increase in the posterior probability (2). If no putative merge increases the probability, the state of *S* is not altered. Notice that this operator applies to all classes irrespective of their size, thus including any potential singleton classes introduced by the first or third operator.

3. In a random order, split each class into two maximally homogeneous subclasses using the complete linkage clustering algorithm with Hamming distances between the query STs. If the classification *S** after a split is associated with a higher posterior probability than the current classification *S*, the split is accepted; otherwise it is rejected.

4. In a random order over the classes of *S*, simultaneous relocation of several STs from each class is attempted. The STs in a class are first sorted into decreasing order with respect to the improvement in posterior probability (2) when they are assigned one-by-one into some other class, that is, the ST associated with the largest improvement is placed first in the sorting, and so on. A candidate for a new classification structure *S** is then formed by relocating STs in this order to the class which leads to the largest increase in (2), or to the smallest decrease if no positive changes are possible. The relocation is continued either until the total change in (2) becomes positive, in which case the candidate *S** is set as the next state of the search algorithm, or until all STs in the class are relocated and the total change remains negative, in which case the candidate is rejected.
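The first of these operators can be sketched as follows; `score` stands in for the log posterior (2) evaluated on a full assignment, and class labels are assumed to be contiguous integers, both illustrative simplifications rather than the BAPS implementation:

```python
import random

def relocate_operator(assignment, n_classes, score, max_classes):
    """Sketch of operator 1: in random order, relocate each query ST to
    the class that maximally increases the posterior score, considering
    also a new singleton class while the class count stays within the
    user-specified bound. `score` stands in for the log posterior (2)."""
    order = list(range(len(assignment)))
    random.shuffle(order)
    for i in order:
        best_c, best_val = assignment[i], score(assignment)
        candidates = list(range(n_classes))
        if n_classes < max_classes:
            candidates.append(n_classes)  # try a fresh singleton class
        for c in candidates:
            trial = assignment.copy()
            trial[i] = c
            value = score(trial)
            if value > best_val:
                best_c, best_val = c, value
        assignment[i] = best_c
        n_classes = max(n_classes, best_c + 1)
    return assignment
```

In the real algorithm the score is never recomputed from scratch; the book-keeping of sufficient statistics mentioned above makes each candidate move an incremental update.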

The search algorithm uses each of the above operators in varying combinations until no improvement in (2) is achievable after two consecutive attempts. Given its efficient implementation, even in an online application the algorithm can be independently run multiple times such that the globally best classification over the runs is chosen as the final estimate of the posterior mode classification. Multiple independent searches will reduce the probability that the best classification identified among them will be considerably suboptimal, representing a local peak in the posterior distribution. Since any two classification structures can be analytically compared, the searches can even be performed on separate processors and results later combined using the batch mode interface of the BAPS software.
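The pairwise Hamming distances that guide the complete-linkage split operator can be precomputed with a simple scan; a minimal sketch over concatenated, aligned allele strings:

```python
def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def pairwise_hamming(seqs):
    """Upper-triangular table of Hamming distances between all pairs
    of query STs (concatenated allele strings assumed aligned)."""
    n = len(seqs)
    return {(i, j): hamming(seqs[i], seqs[j])
            for i in range(n) for j in range(i + 1, n)}
```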