As stated in the introduction, NMF can be seen as a DR technique, functionally similar to source extraction. This section briefly describes some of the existing NMF methods and the different alternatives for their initialisation; the choice of initialisation technique turns out to be a key factor in the success of NMF as a tumour type classification method. The specific way in which these techniques are used and interpreted in the context of MRS data analysis is also described. We then explain how the data can be labelled a posteriori, once the sources have been extracted, in order to understand the extent to which the obtained sources are able to represent the data. Finally, the way sources can be used strictly for DR is also described.
Nonnegative matrix factorisation methods for source extraction
In the standard NMF description [
18], a nonnegative matrix
V of observed data (
d ×
n, where
d is the data dimensionality and
n is the number of observations), is approximately factorised into two nonnegative matrices,
W (of dimensions
d ×
k, where
k is the number of data basis or sources, and
k <
d) and
H (of dimensions
k ×
n, each of whose columns provides the encoding of a data point: a SV spectrum in this study). The product of these two matrices provides a good approximation to the original matrix, that is,
V≈WH. The conventional approach to finding the two factors is by minimising the divergence between
V and
WH:
subject to the nonnegativity constraints mentioned above, where ‖·‖_{F} is the Frobenius norm. In this study, the following divergence minimisation methods, which cover a wide palette of algorithmic alternatives, were considered:
Multiplicative update rules (euc) [19]. The objective function is optimised with multiplicative update rules for W and H. Monotonic convergence of the algorithm can be proven [19]. These update equations preserve the nonnegativity of W and H, and constrain the columns of W to sum to unity.
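For the Euclidean objective, the multiplicative updates take the standard Lee–Seung form, H ← H ⊗ (WᵀV) ⊘ (WᵀWH) and W ← W ⊗ (VHᵀ) ⊘ (WHHᵀ). A minimal NumPy sketch (variable names, toy data and the small ε guard are ours, not the study's):

```python
import numpy as np

def nmf_euc(V, k, n_iter=500, tol=1e-5, seed=0):
    """Multiplicative-update NMF (Lee & Seung) for ||V - W H||_F^2.

    After each iteration the columns of W are rescaled to sum to one,
    as described in the text, with the scale absorbed into H so that
    the product W H is unchanged.
    """
    rng = np.random.default_rng(seed)
    d, n = V.shape
    W = rng.random((d, k))
    H = rng.random((k, n))
    eps = 1e-12                          # avoids division by zero
    prev_err = np.inf
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        s = W.sum(axis=0, keepdims=True)  # column sums of W
        W /= s + eps                      # columns of W sum to unity
        H *= s.T                          # rescale H to keep W H unchanged
        err = np.linalg.norm(V - W @ H)
        if abs(prev_err - err) < tol:     # stopping threshold (10^-5)
            break
        prev_err = err
    return W, H

# toy example: 2 nonnegative sources mixed into 6-dimensional observations
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 40))
W, H = nmf_euc(V, k=2)
```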
Alternating least squares (als). This technique alternately fixes one matrix and improves the other: W and H are updated by solving the corresponding least squares equations and then setting all negative elements in W and H to zero.
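The alternating least squares scheme can be sketched as follows; this is a minimal illustration with our own naming and toy data, not the implementation used in the study:

```python
import numpy as np

def nmf_als(V, k, n_iter=200, seed=0):
    """Alternating least squares NMF: each factor is obtained by solving
    an unconstrained least squares problem while the other is fixed,
    after which its negative elements are set to zero."""
    rng = np.random.default_rng(seed)
    d, n = V.shape
    W = rng.random((d, k))
    for _ in range(n_iter):
        # fix W, solve W H = V for H in the least squares sense, truncate
        H = np.linalg.lstsq(W, V, rcond=None)[0]
        H[H < 0] = 0
        # fix H, solve H^T W^T = V^T for W, truncate
        W = np.linalg.lstsq(H.T, V.T, rcond=None)[0].T
        W[W < 0] = 0
    return W, H

# toy example with exactly rank-2 nonnegative data
rng = np.random.default_rng(2)
V = rng.random((6, 2)) @ rng.random((2, 30))
W, H = nmf_als(V, k=2)
```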
Alternating least squares with projected gradients (alspg). The equations for W and H in the alternating least squares method above are solved here using projected gradients. For H, this entails iterating H ← P[H − α∇f(H)], where α is the step size and P[·] is a bounding function that ensures that the solution remains within the feasible region; the gradient is given by ∇_{H}f = W^{T}(WH − V). The same approach is used to calculate W.
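A single projected-gradient step for H under the Euclidean objective, assuming a fixed step size α (in practice α would typically be chosen by a line search), might look like this:

```python
import numpy as np

def pg_step_H(V, W, H, alpha=1e-3):
    """One projected-gradient update H <- P[H - alpha * grad_H f(H)],
    where f(W, H) = 0.5 * ||V - W H||_F^2 and the projection P[.]
    clips negative entries back onto the feasible (nonnegative) set."""
    grad = W.T @ (W @ H - V)          # gradient of f with respect to H
    return np.maximum(H - alpha * grad, 0.0)

# toy data, purely illustrative
rng = np.random.default_rng(3)
V = rng.random((5, 20))
W = rng.random((5, 3))
H = rng.random((3, 20))
H_new = pg_step_H(V, W, H)
```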

Alternating least squares with Optimal Brain Surgeon (OBS) [26] (alsobs) [27, 28]. Similar to alternating least squares, this algorithm alternately solves the least squares equations for W and H. The negative elements in W and H are set to zero and the rest are adjusted using the OBS method, through second-order derivatives. The update rules for W and H include terms δ_{W} and δ_{H} that act as regularisers and are responsible for eliminating the less important elements of W and H, respectively (the original OBS was used as a weight-pruning mechanism in artificial neural networks), thus readjusting the remaining elements optimally. More implementation details can be found in [28].

ConvexNMF (convex) [24]. To achieve interpretability, this method imposes the constraint that the vectors (columns) defining W must lie within the column space of V, i.e. W = VA (where A is an auxiliary adaptive weight matrix that fully determines W), so that V ≈ VAH. By restricting W to convex combinations of the columns of V we can, in fact, understand each of the bases or sources as a weighted sum of data points. Unlike the previous ones, this NMF variant applies to both nonnegative and mixed-sign data matrices. The factors H and A are updated with multiplicative rules expressed in terms of (·)^{+}, the positive part of a matrix, in which all negative values become zeros, and (·)^{−}, the negative part, in which all positive values become zeros. All the algorithms, for all initialisations, were allowed to achieve convergence, defined as a variation in the reconstruction error, from one iteration to the next, below a common small threshold of 10^{−5}.
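Rewritten in this section's notation (V ≈ VAH, with G = Hᵀ), the multiplicative updates of [24] can be sketched as below; this is our own NumPy paraphrase under the stated conventions, not reference code:

```python
import numpy as np

def convex_nmf(V, k, n_iter=300, tol=1e-5, seed=0):
    """ConvexNMF (Ding et al.): V ~ V A H with A, H >= 0, while V (and
    hence W = V A) may be of mixed sign.  The multiplicative updates use
    the positive and negative parts of V^T V."""
    rng = np.random.default_rng(seed)
    d, n = V.shape
    G = rng.random((n, k))                 # G = H^T
    A = rng.random((n, k))
    VtV = V.T @ V
    P = np.maximum(VtV, 0)                 # (V^T V)^+
    N = np.maximum(-VtV, 0)                # (V^T V)^-
    eps = 1e-12
    prev = np.inf
    for _ in range(n_iter):
        G *= np.sqrt((P @ A + G @ (A.T @ (N @ A))) /
                     (N @ A + G @ (A.T @ (P @ A)) + eps))
        A *= np.sqrt((P @ G + N @ A @ (G.T @ G)) /
                     (N @ G + P @ A @ (G.T @ G) + eps))
        err = np.linalg.norm(V - V @ A @ G.T)
        if abs(prev - err) < tol:          # common stopping threshold, 10^-5
            break
        prev = err
    return V @ A, G.T, A                   # W, H, A

# toy mixed-sign data of rank 2
rng = np.random.default_rng(4)
V = rng.standard_normal((8, 2)) @ rng.random((2, 50))
W, H, A = convex_nmf(V, k=2)
```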
Interpretation of the methods
In NMF for the analysis of MRS data, the rows of H can be understood as estimates of the concentration/abundance of the constituent signals or sources, while the columns of W are the corresponding constituent signals or sources of the spectra themselves. In conventional NMF methods (such as the first four described above), the matrices V, W and H are constrained to be nonnegative, thus permitting the interpretation of the mixing matrix entries as quantitative estimates of the amount of source tissue in the sample. Each source can, as a result, be assigned to the class (tumour type or healthy tissue) with whose template it shows the highest correlation. If nonnegativity is also imposed on the signals and sources, it is commonplace to truncate the negative values to zero, thereby losing potentially relevant information (for instance, Lactate, Alanine, and Glutamine + Glutamate (Glx) in LTE spectra, which are expected to be especially relevant for discrimination between tumour types). Some of the methods described above impose the constraint of nonnegativity only on the mixing elements representing the constituent tissue fractions. Where nonnegative signals are also required, we propose using absolute values instead, in order to reduce the loss of information from the negative peaks.
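As an illustration of this a posteriori class assignment, the snippet below correlates each extracted source with per-class template spectra; the templates, class names and data are hypothetical placeholders, not data from the study:

```python
import numpy as np

def assign_sources_to_classes(W, templates, class_names):
    """Assign each source (column of W) to the class whose template
    spectrum shows the highest Pearson correlation with it."""
    labels = []
    for j in range(W.shape[1]):
        corrs = [np.corrcoef(W[:, j], t)[0, 1] for t in templates]
        labels.append(class_names[int(np.argmax(corrs))])
    return labels

# hypothetical templates: two synthetic "mean spectra"
rng = np.random.default_rng(5)
t1, t2 = rng.random(10), rng.random(10)
# sources built as slightly noisy copies of the templates
W = np.column_stack([t1 + 0.01 * rng.random(10),
                     t2 + 0.01 * rng.random(10)])
labels = assign_sources_to_classes(W, [t1, t2], ["tumour", "healthy"])
```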
ConvexNMF, instead, enforces the nonnegativity constraint only on H, while V and W are allowed to be of mixed sign. Given that the observed MRS data are of mixed sign, their sources should also be of mixed sign. Thus, understanding W as the source spectra matrix, the sources remain intuitively interpretable and no preprocessing of the spectra is required to make them nonnegative, thus preventing any unnecessary loss of information (in the case of our database, the information in the negative peaks of the SV ^{1}H-MRS LTE spectra). As in the previous methods, H can be understood as estimates of the concentration/abundance of the constituent signals.
NMF initialisations
NMF methods unavoidably converge to local minima. As a result, the NMF bases will be different for different initialisations. In this study, six forms of initialisation were considered (with some variations depending on the method). Although a standard procedure to justify the choice of NMF initialisation does not exist, the six alternatives considered here cover a wide array of approaches: from random initialisation, to prototype-based clustering methods (K-means and Fuzzy C-Means, which provide a data density-based sample of initial data locations), and feature extraction techniques (PCA, ICA and NMF itself, which initialise the algorithm according to the basic eigenstructure of the data).
Random [all methods]: W and H are initialised as dense matrices of random values between 0 and 1.
K-means [euc, alspg, als, alsobs]: W is initialised with the cluster centroids, and H with the distances from each point (MR spectrum) to every centroid.
K-means [convex]: H is initialised as H^{(0)} = C + 0.2E, where E is a matrix with all its elements equal to one, and C = (c_{1}, . . ., c_{n}) is filled with the cluster indicators, based on the cluster index of each point, such that C_{ik} ∈ {0, 1} and the ones indicate cluster membership. A is initialised as A^{(0)} = (C + 0.2E)D^{−1}, where D is a diagonal matrix with each element being the number of points in the corresponding cluster [24].
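The K-means initialisation, for both the standard methods and ConvexNMF, can be sketched as follows; the plain Lloyd's algorithm, the dimension conventions and the toy data are our own choices:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm on the columns of X (d x n spectra)."""
    rng = np.random.default_rng(seed)
    centroids = X[:, rng.choice(X.shape[1], size=k, replace=False)].copy()
    for _ in range(n_iter):
        # k x n matrix of point-to-centroid Euclidean distances
        dists = np.linalg.norm(X[:, None, :] - centroids[:, :, None], axis=0)
        labels = dists.argmin(axis=0)
        for j in range(k):
            if np.any(labels == j):
                centroids[:, j] = X[:, labels == j].mean(axis=1)
    dists = np.linalg.norm(X[:, None, :] - centroids[:, :, None], axis=0)
    return centroids, dists.argmin(axis=0), dists

def kmeans_init(X, k, seed=0):
    """K-means NMF initialisation: W(0) <- centroids, H(0) <- distances
    (standard methods); for convex, H(0) = (C + 0.2E)^T and
    A(0) = (C + 0.2E) D^-1, with C the n x k cluster-indicator matrix and
    D the diagonal matrix of cluster sizes (dimension conventions ours)."""
    centroids, labels, dists = kmeans(X, k, seed=seed)
    n = X.shape[1]
    C = np.zeros((n, k))
    C[np.arange(n), labels] = 1.0
    E = np.ones((n, k))
    D_inv = np.diag(1.0 / np.maximum(C.sum(axis=0), 1.0))
    return centroids, dists, (C + 0.2 * E).T, (C + 0.2 * E) @ D_inv

rng = np.random.default_rng(6)
X = rng.random((5, 30))       # 30 toy "spectra" of dimension 5
W0, H0, H0c, A0 = kmeans_init(X, k=3)
```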
Fuzzy C-Means [euc, alspg, als, alsobs]: W is initialised with the cluster centres, and H with the fuzzy partition matrix (or membership function matrix), as in [29].
Fuzzy C-Means [convex]: H is initialised as H^{(0)} = C + 0.2E, where C here is filled with the fuzzy partition values, and E is a matrix with all its elements equal to one. A is initialised as A^{(0)} = (C + 0.2E)D^{−1}, where D is a diagonal matrix with each element being the number of points in each cluster.
PCA [euc, alspg, als, alsobs]: The mean vector is subtracted from the complete dataset, and this is followed by the computation of its eigenvectors and eigenvalues. The matrix W is initialised with the whitened data (the corresponding projection onto the eigenvectors), and H with the dewhitening matrix. In order to use the initial W and H matrices obtained from PCA in NMF, the negative values are truncated, as proposed in [29].
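One consistent reading of this PCA initialisation is sketched below; the exact dimension conventions are our assumption (dewhitening directions taken as W, whitened projections as H), with negatives truncated as described:

```python
import numpy as np

def pca_init(X, k):
    """PCA-based initialisation: centre the data, take the top-k
    eigenvectors U and eigenvalues lam of the covariance, use the
    dewhitening directions U * sqrt(lam) for W and the whitened
    projection for H, then truncate negative values (as in [29])."""
    d, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    vals, vecs = np.linalg.eigh(Xc @ Xc.T / n)   # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:k]
    U = vecs[:, idx]
    lam = np.maximum(vals[idx], 1e-12)           # guard tiny eigenvalues
    W0 = np.maximum(U * np.sqrt(lam), 0.0)               # d x k
    H0 = np.maximum((U / np.sqrt(lam)).T @ Xc, 0.0)      # k x n
    return W0, H0

rng = np.random.default_rng(7)
X = rng.random((6, 40))
W0, H0 = pca_init(X, k=2)
```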
PCA [convex]: H is initialised as H^{(0)} = C + 0.2E, where C is the dewhitening matrix, calculated, as in the rest of the methods, after computing PCA and truncating the negative values. For the initialisation of A, and as suggested in [24], we first compute A = H^{T}(HH^{T})^{−1} and then A^{(0)} = (A)^{+} + 0.2E⟨(A)^{+}⟩, so that the negative elements are removed, where ⟨X⟩ = ∑_{n,k}X_{n,k}/‖X‖_{0} and ‖X‖_{0} is the number of nonzero elements of X.
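The computation of A^{(0)} from an initial H can be sketched as follows (the toy H and the naming are ours):

```python
import numpy as np

def init_A_from_H(H):
    """Given an initial H (k x n), compute A = H^T (H H^T)^-1 and remove
    its negative elements as in [24]: A(0) = (A)^+ + 0.2 E <(A)^+>,
    where <X> = sum(X) / (number of nonzero elements of X)."""
    A = H.T @ np.linalg.inv(H @ H.T)
    A_pos = np.maximum(A, 0)                # (A)^+
    nnz = max(np.count_nonzero(A_pos), 1)   # ||(A)^+||_0, guarded
    mean_pos = A_pos.sum() / nnz            # <(A)^+>
    return A_pos + 0.2 * np.ones_like(A) * mean_pos

# toy full-rank H, purely illustrative
rng = np.random.default_rng(8)
H = rng.random((3, 25)) + 0.1
A0 = init_A_from_H(H)
```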
ICA [euc, alspg, als, alsobs]: The independent components extracted using FastICA are used to initialise W, and H is initialised with the resulting mixing matrix. Then, to meet the nonnegativity condition of NMF, the negative values are truncated.
ICA [convex]: H and A are initialised as in the PCA (for convex) initialisation, with the only difference that H is filled with the sources or independent components from FastICA.
NMF [euc, alspg, als, alsobs]: W is initialised with the sources extracted with NMF (als), and H with the resulting mixing matrix. In the case of the als method itself, initialising with the same method is equivalent to doubling the number of iterations, which does not necessarily mean that the results will improve.
NMF [convex]: H and A are initialised as in the PCA and FastICA (for convex) initialisations, with the only difference that H is filled with the sources from NMF (als).
In principle, we might expect the different initialisation strategies to behave as follows. Random initialisation can be considered an uninformed first estimate for NMF methods, which may lead to different outcomes under different initial conditions [29, 31]. We might expect K-means and FCM initialisations to make all methods perform better, but the results may depend on the initial selection of clusters, so different results could be obtained depending on that selection. PCA and ICA, instead, can provide a unique solution, although perhaps too biased, while, in the case of NMF, the existence of a unique solution will depend on its own initialisation. All these methods converge to local optima, so there is no guarantee that the solution obtained will be the best possible.