Collection of mosquito spectra data
We analysed mid-infrared spectra from two strains of An. arabiensis mosquitoes obtained from two different insectaries, one at the University of Glasgow, UK, and the other at the Ifakara Health Institute, Tanzania. The same data had previously been used to demonstrate the capabilities of mid-infrared spectroscopy and CNNs for distinguishing between species and determining mosquito age [10]. The insectary conditions under which the mosquitoes were reared (temperature 27 ± 1.0 °C and relative humidity 80 ± 5%) have been described elsewhere [17].
Mosquitoes were collected from day 1 to day 17 after pupal emergence at both laboratories and divided into two age classes (1–9 day-olds and 10–17 day-olds). The mosquitoes were dried over silica gel. For each chronological age in each laboratory, ~120 samples were measured by MIRS each day. The heads and thoraces of the mosquitoes were then scanned with attenuated total reflectance Fourier-transform infrared (FTIR) spectrometers (ALPHA II and Vertex 70, both equipped with a diamond ATR accessory; Bruker Optics GmbH, Ettlingen, Germany). Scanning was performed in the mid-infrared spectral range (4000–400 cm−1) at a resolution of 2 cm−1, with each sample scanned 16 times to obtain an averaged spectrum, as previously described [9, 18]. As a result, the spectral dataset contained 1665 spectral features (Fig. 1).
Data pre-processing
The spectral data were cleaned to eliminate bands of low intensity or significant atmospheric intrusion using a custom algorithm [19]. The final datasets from Ifakara and Glasgow contained 1720 and 1635 mosquito spectra, respectively. In these two datasets, the chronological age of An. arabiensis was categorised as 1–9 days old (i.e. young mosquitoes representative of those typically unable to transmit malaria) or 10–17 days old (i.e. older mosquitoes representative of those potentially able to transmit malaria) [20].
To improve the accuracy and speed of convergence of subsequent algorithms, data were standardised by centring around the mean and scaling to unit variance [21].
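As a minimal sketch, standardisation of this kind can be done with scikit-learn's StandardScaler; the data matrix below is a hypothetical stand-in for the spectral intensities:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical spectral matrix: rows = mosquito samples,
# columns = wavenumber features.
X = np.random.rand(100, 1665)

scaler = StandardScaler()        # centre to mean 0, scale to unit variance
X_std = scaler.fit_transform(X)  # fit the scaler and transform in one step
```

In practice the scaler would be fitted on the training set only and then applied to the test set, so that no information leaks from the test data into training.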
Dimensionality reduction
Principal component analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE) were used separately to reduce the dimensionality of the data [13,14,15,16]. Both PCA and t-SNE were implemented using the scikit-learn library [21].
Separately, t-SNE was used to convert high-dimensional Euclidean distances between spectral points into joint probabilities representing similarities. To reduce the data to three features, the embedded space was set to three components, as the Barnes-Hut algorithm used by t-SNE supports fewer than four components. Perplexity, which determines the effective number of nearest neighbours, was set to 30, meaning that for each point the algorithm used its 30 closest points to preserve local structure. For smaller datasets, perplexity values between 5 and 50 are thought to be optimal, as values that are too small or too large cause local variations or merged clusters, respectively [16]. The learning rate for t-SNE is generally in the range 10–1000 [21]; it was therefore set to 200.
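A minimal sketch of both reductions with scikit-learn, using the parameters described above (the input matrix is hypothetical, and the number of PCA components retained is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical standardised spectral matrix (samples x features).
X_std = np.random.rand(500, 1665)

# PCA: linear projection onto the leading principal components.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_std)

# t-SNE with three components (the Barnes-Hut implementation requires
# n_components < 4), perplexity 30 and learning rate 200.
tsne = TSNE(n_components=3, perplexity=30, learning_rate=200,
            method="barnes_hut", random_state=0)
X_tsne = tsne.fit_transform(X_std)
```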
Machine learning training
Deep learning
DL models were trained and used to classify the An. arabiensis mosquitoes into the two age classes (1–9 or 10–17 day-olds). The intensities of An. arabiensis mid-infrared spectra (matrix of features) were used as input data, while the model outputs were the mosquito age classes.
Three different deep learning models were trained: (1) a convolutional neural network (CNN) without dimensionality reduction, (2) a multi-layer perceptron (MLP) with PCA for dimensionality reduction, and (3) an MLP with t-SNE for dimensionality reduction. For all models, a softmax layer was added to transform the non-normalised outputs of the K units in the final fully connected layer into a probability distribution over the two age classes (1–9 or 10–17 days). Moreover, to compute the gradients of the networks, stochastic gradient descent was used as the optimisation algorithm [22], and categorical cross-entropy was used as the loss function.
To begin, we trained a one-dimensional CNN with four convolutional layers and one fully connected layer on the data without dimensionality reduction (Fig. 2A), so the model took 1666 training features as input. A one-dimensional CNN was used because it is effective at deriving features from fixed-length inputs (i.e. the wavenumbers of the mid-infrared spectra) and has previously been used successfully with spectral data [17]. To extract features from the spectral signals, the architecture combined convolutional, max-pooling and fully connected layers. The convolutional operations used kernel sizes (windows) of 8, 4, and 6, with a kernel window shift (stride) of either 1 or 2. For each kernel size, 16 filters were used to detect and derive features from the input data. Furthermore, given the size of the training data, the fully connected layer consisted of 50 neurons to limit the model's complexity.
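A minimal Keras sketch of such an architecture is given below. Only the kernel sizes, filter counts, dense-layer width, optimiser and loss are specified above; the activation functions, the stride assigned to each layer, and how the three kernel sizes map onto the four convolutional layers are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# 1D CNN over the full spectrum (1666 input features, 2 age classes).
model = keras.Sequential([
    layers.Input(shape=(1666, 1)),
    layers.Conv1D(16, kernel_size=8, strides=2, activation="relu"),
    layers.BatchNormalization(),
    layers.Conv1D(16, kernel_size=4, strides=1, activation="relu"),
    layers.BatchNormalization(),
    layers.Conv1D(16, kernel_size=4, strides=1, activation="relu"),
    layers.BatchNormalization(),
    layers.Conv1D(16, kernel_size=6, strides=2, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dropout(0.5),                    # regularisation, described below
    layers.Dense(50, activation="relu"),    # single fully connected layer
    layers.Dense(2, activation="softmax"),  # probabilities over age classes
])

# Stochastic gradient descent with categorical cross-entropy loss.
model.compile(optimizer=keras.optimizers.SGD(),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```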
Moreover, batch normalisation layers were added to the models to improve stability by keeping the mean activation close to 0 and the activation standard deviation close to 1. To reduce the likelihood of overfitting, dropout was applied during training, randomly and temporarily removing units from the network at a rate of 0.5 per step. Furthermore, early stopping was used to halt training once the validation loss had stopped improving for 50 rounds.
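Continuing the sketch above, training with early stopping might look as follows (the training arrays, validation split, epoch budget and batch size are hypothetical):

```python
import numpy as np
from tensorflow import keras

# Hypothetical training data matching the CNN sketch above.
X_train = np.random.rand(800, 1666, 1)
y_train = keras.utils.to_categorical(np.random.randint(0, 2, 800), 2)

# Stop once the validation loss has not improved for 50 epochs.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=50,
                                           restore_best_weights=True)

model.fit(X_train, y_train, validation_split=0.2,
          epochs=500, batch_size=32, callbacks=[early_stop])
```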
Dimensionality reduction
We trained two additional deep learning models, in this case multi-layer perceptrons (MLPs), on PCA- or t-SNE-transformed input data (Fig. 2B). Given the limited number of training features, the models consisted only of fully connected layers (n = 6) containing 500 neurons each, to ensure performance and stability. To control overfitting, the procedure was similar to that of the CNN above, except that early stopping halted training once the validation loss had stopped improving for 500 rounds.
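A corresponding MLP sketch is shown below; the input width (three reduced features), activation functions and placement of the batch normalisation and dropout layers are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# MLP on dimensionality-reduced input (here three PCA or t-SNE features).
mlp = keras.Sequential([layers.Input(shape=(3,))])
for _ in range(6):                     # six fully connected layers
    mlp.add(layers.Dense(500, activation="relu"))
    mlp.add(layers.BatchNormalization())
    mlp.add(layers.Dropout(0.5))
mlp.add(layers.Dense(2, activation="softmax"))

mlp.compile(optimizer=keras.optimizers.SGD(),
            loss="categorical_crossentropy", metrics=["accuracy"])

# As for the CNN, but with a patience of 500 epochs.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=500)
```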
Transfer learning
The Ifakara dataset was used as the source domain for pre-training the ML models. This dataset was divided into training and test sets, and estimator performance was assessed using K-fold cross-validation (k = 5) [23] (Fig. 3). We then determined what percentage of new spectra from the alternate location, as the target domain, was required for the ML models to learn the variability between the insectaries. To test the transfer learning options, either 33 or 82 spectra were randomly selected from the 1635 spectra of the Glasgow dataset, corresponding to 2% and 5% of that dataset, respectively. The learning process in this case relied on a pre-trained model (trained with the Ifakara data), avoiding the need to start training from scratch (Fig. 3). The ML models pre-trained with the Ifakara dataset were fine-tuned using the 2% or 5% subsets of the Glasgow dataset, and their outputs were compared with that of a model trained solely on the Ifakara data (i.e. no transfer learning).
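A sketch of the fine-tuning step under these assumptions follows; `model` is taken to be the CNN already trained on the Ifakara spectra, the Glasgow subset is simulated, and freezing the early layers is one common option rather than a step stated above:

```python
import numpy as np
from tensorflow import keras

# Freeze the early feature-extraction layers so that only the later
# layers adapt to the new insectary (an assumption; the study does not
# state which layers, if any, were frozen).
for layer in model.layers[:-2]:
    layer.trainable = False

model.compile(optimizer=keras.optimizers.SGD(),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Hypothetical 2% subset: 33 randomly selected Glasgow spectra.
X_glasgow_sub = np.random.rand(33, 1666, 1)
y_glasgow_sub = keras.utils.to_categorical(np.random.randint(0, 2, 33), 2)

model.fit(X_glasgow_sub, y_glasgow_sub, epochs=100, batch_size=8)
```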
Precision, recall and F1-scores were calculated from the predicted values for each age class to demonstrate the validity of the final models in predicting the unseen Glasgow data. Keras and TensorFlow version 2.0 were used for the deep learning process [24, 25].
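These per-class metrics can be obtained with scikit-learn's classification_report; the labels below are illustrative:

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted age-class labels for the unseen
# Glasgow spectra (0 = 1-9 days, 1 = 10-17 days).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Prints precision, recall and F1-score for each age class.
print(classification_report(y_true, y_pred,
                            target_names=["1-9 days", "10-17 days"]))
```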
Standard machine learning
We also compared the prediction accuracy of the CNN and MLP models with that of standard machine learning models trained on spectra transformed by PCA or t-SNE. Several algorithms were compared, including K-nearest neighbours, logistic regression, a support vector machine classifier, a random forest classifier, and a gradient boosting (XGBoost) classifier. The model with the highest accuracy score for predicting mosquito age classes was optimised further by tuning its hyper-parameters with randomised search cross-validation [21]. The cross-validation used to assess estimator performance was the same as that used for deep learning. The fine-tuned model was then used to predict mosquito age classes in the previously unseen Glasgow dataset.
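As a sketch of this tuning step with scikit-learn (the candidate model, search space and data are illustrative; the study does not report which grid was searched):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical PCA- or t-SNE-reduced features and age-class labels.
X_reduced = np.random.rand(200, 3)
y = np.random.randint(0, 2, 200)

# Illustrative hyper-parameter search space.
param_dist = {"n_estimators": [100, 300, 500],
              "max_depth": [None, 10, 20],
              "min_samples_split": [2, 5, 10]}

# Randomised search with the same 5-fold cross-validation as above.
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions=param_dist,
                            n_iter=20, cv=5, random_state=0)
search.fit(X_reduced, y)
best_model = search.best_estimator_
```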
Python version 3.8 was used for both the deep learning and standard machine learning training. All computations were performed on a computer with 32 gigabytes of random-access memory (RAM) and an octa-core central processing unit.