Automated identification of Monogeneans using digital image processing and K-nearest neighbour approaches

Background Monogeneans are flatworms (Platyhelminthes) that are primarily found on gills and skin of fishes. Monogenean parasites have attachment appendages at their haptoral regions that help them to move about the body surface and feed on skin and gill debris. Haptoral attachment organs consist of sclerotized hard parts such as hooks, anchors and marginal hooks. Monogenean species are differentiated based on their haptoral bars, anchors, marginal hooks, reproductive parts’ (male and female copulatory organs) morphological characters and soft anatomical parts. The complex structure of these diagnostic organs and also their overlapping in microscopic digital images are impediments for developing fully automated identification system for monogeneans (LNCS 7666:256-263, 2012), (ISDA; 457–462, 2011), (J Zoolog Syst Evol Res 52(2): 95–99. 2013;). In this study images of hard parts of the haptoral organs such as bars and anchors are used to develop a fully automated identification technique for monogenean species identification by implementing image processing techniques and machine learning methods. Result Images of four monogenean species namely Sinodiplectanotrema malayanus, Trianchoratus pahangensis, Metahaliotrema mizellei and Metahaliotrema sp. (undescribed) were used to develop an automated technique for identification. K-nearest neighbour (KNN) was applied to classify the monogenean specimens based on the extracted features. 50% of the dataset was used for training and the other 50% was used as testing for system evaluation. Our approach demonstrated overall classification accuracy of 90%. In this study Leave One Out (LOO) cross validation is used for validation of our system and the accuracy is 91.25%. Conclusions The methods presented in this study facilitate fast and accurate fully automated classification of monogeneans at the species level. In future studies more classes will be included in the model, the time to capture the monogenean images will be reduced and improvements in extraction and selection of features will be implemented.


Background
Parasitic organisms have categorical homogeneous morphology, hence, pattern recognition techniques can be used to identify them [1]. The monogenean species are used in this study because they are worthy taxons for investigation [2]. There might be around 25000 species of monogenean in the world while barely 4000 of them are currently known [3]. Monogeneans are very diversified in terms of morphology and they are the only flatworm clade that have advanced adaptive radiation [2], with the variation of structural designs in the attachment organs [4], which are usually used for species identification. In particular, the haptoral attachment organ is characterized by sclerotized structures such as anchors, bars, hooks, etc. The morphology of these organs are usually unique to monogenean species [5] and are used as diagnostic characters in taxonomy [6,7]. Automated classification of specimens' images to their corresponding species requires development of models and methods that are able to characterize a specie's morphology and apply this knowledge for their recognition. Automated systems should be combined with databases of images or text based information [8]. Primo Coltelli et al. [9] believe that image acquisition is the most important step in designing an automated system and capturing images should be well-focused with less complexity. The acquisition condition should be defined clearly and kept equal for all images, later labelled by expert taxonomists.
Automated systems may identify specimens at the species, genus, family or order levels using image processing, feature extraction and classification. Digital images of species, especially microscopic images, usually exhibit dust or other noise artefacts. Noise makes neighbouring pixel values cluster [10], so it should be reduced by image processing, in particular the smoothing methods of filtering. It is important to know the prevalent types of noise to be filtered so that it can be removed more efficiently. Besides this, the aim of image processing in the system is usually to transform digital images to a standard pose [11] and achieving recognizable objects on a uniform background, using segmentation. In order to facilitate the segmentation step, image artefacts should be removed and contrast as well as dynamic range has to be improved. The goal is to identify and classify objects of interest in digital images.
The performance of feature extraction and selection techniques depends on the type of a system's classifiers and the quality of the data [12]. In order to achieve a high performance classification, not all features are required to be detected. If employed classifiers are strong enough, even if some features are left undetected, the method may yield successful results [13]. In KNN classification, objects are classified according to majority vote of their neighbours, with the objects being assigned to the class which is most common amongst its k nearest neighbours. In this classification method, objects which are close to each other according to their features, are likely to belong to the same pattern class [14]. The neighbours are taken from a set of training samples where the correct classification is known. For example in identification of species of Gyroactylus genus in fish ectoparasite [15], features that were extracted by Active Shape Models (ASM) were implemented to create two linear classifiers, Linear Discriminant Analysis (LDA) and K-nearest neighbor (KNN), and two non-linear classifiers, Multilayer Perceptron (MLP) and Support Vector Machine (SVM). KNN yielded the most ideal results with a classification accuracy of 98.75%.
Many semi-automated systems have been developed for identification of biological images in different levels. In 1996, the Dinoflagellate categorization (DiCANN) system, based on neural networks [16] was developed. Subsequently, forensic identification of mammals according to their single hair patterns under a microscope was investigated by Moyo et al. [17]. Yuan et al. [18] discussed the identification of rats up to the species level from images of their tracks. Later the improvements in semi-automated systems resulted in fully automated systems. Examples of successful existing automated systems are the automated Leafhopper Identification system (ALIS) [19], the Digital Automated Identification System (DAISY) [20], the Automatic Identification and characterization of Microbial Populations (AIMS) [21], the Automated Bee Identification System (ABIS) [22], BugVisux [23], automated identification of bacteria by use of statistical methods [10], an automated identification system which estimates whiteflies, aphids and thrips densities in a greenhouse [24], Species Identification, automated (SPIDA) [25], But2fly [26], Automated Insect Identification through Concatenated Histograms of Local Appearance (AIICHLA) [13], an automated identification system for algae [9] and Automated identification of copepods [27]. In the work done by Arpah et al. (2013) [28], illustration images of haptoral bar of monogeneans were used in building a content based image retrieval system. Contrary to the previous study, in thecurrent study, we digitised microscopic specimen images to develop a monogenean species identification technique.

Methods
The study's approach followed the methodology which is detailed as follows. Figure 1 shows the system workflow.

Data collection
Monogeneans were collected from gills of killed or captured specimens. The attached tissues were removed using fine needles and placed on clean slides with a drop of water under a coverslip. Later the water was removed and four corners of coverslip were fixed. Subsequently, specimens were cleared using Ammonium pirate glycerine. Digital images of the hard anatomical structures of the monogeneans were taken using a Leica digital camera attached to Leica microscope at 40× magnification. Recognition of monogeneans is based on shape and size of their hard parts which are dorsal and ventral anchors, bars, as well as their male and female compulatory organs [29]. Thus, accurate recognition of monogeneans is very much dependant on features which are extracted from these parts. Our database consist of 23 species' images, however in this study we randomly picked 4 species : Sinodiplectanotrema malayanus, Trianchoratus pahangensis, Metahaliotrema mizellei and Metahaliotrema sp. (undescribed).
The modular image processing and analysis software package used is QWin Plus, due to its versatile architecture, designed to solve demanding quantitative analysis tasks. MATLAB R2013a was used to pre-process and extract features from monogenean digital images.

Preparation of slides of monogenean specimens
The slides of monogeneans used in this study were collected by experts since 1996. Ammonium pirate glycerine was applied to very old specimens' slides to prepare them for image acquisition. Broken and spoiled specimens were discarded during this phase.

Image acquisition & database development
In this study, only the diagnostic parts of monogeneans such as anchors, bars and copulatory organs were considered for digitisation. In some slides soft parts of specimens were damaged due to pressure of sliding during preparation of slides and this resulted to a very messy background for the selected organs which are the hard parts. Such slides were avoided in digitization as the hard parts were not easy to segment.
The quality of images depends on the model of microscope, lenses and camera specifications. The resolution of the captured images was 1044 × 772 pixels and all the images were saved in Tagged Image File format (TIF). All acquired images were indexed according to slide tags. A total of 102 images were taken from four species' and the best 80 were used in this study. According to previous studies, [30,31], we decided to use half of our digital images for training and the other half for testing the system. 10 images of each species were selected as training set and 10 were used as testing set.

Image processing
Image processing in this study includes three essential steps: 1. Image pre-processing, 2. Image segmentation and 3. Feature extraction. Matlab R [32] was used as the Image Processing Toolbox, installed on Intel(R) Xeon (R) CPU E5-1620 v2 @ 3.70GHz, 16.00GB RAM, Windows 7 Professional (64-bit) to conduct this study. Background feature minimization is an important pre-processing step in monogeneans classification. Otherwise, soft part features of monogeneans could mix with those from hard parts and the texture analysis will yield unreliable results. The image preprocessing follows as: 1. Images were converted to intensity images. 2. Filtering intensity images with the average correlation kernel of size 20 × 20. 3. Detecting the edge of the anchors and bars of monogeneans.
After detecting the edges in the images, image segmentation was performed where bars and anchors were identified and segmented from unwanted particles in the images: 1) The images were converted to binary images with threshold of zero. After creating average filter, we deduct the image from the filter. The result is an intensity image which contains negative and positive values. Therefore, pixels, greater than 0 will turn to 1 (white) and other pixels will turn to 0 (black).
2) Small particles (<1000 pixels) were excluded to ensure only the bars and anchors are segmented for feature extraction. Figure 2 shows the image processing steps for four species. 3) Features were extracted from the shape descriptors represented by the binary images of the bars and anchors, using appropriate functions in Matlab. The 10 features extracted were: Euler number, perimeter, area, area density, perimeter density, centre of bounding box, length of bounding box, width of bounding box and orientation of bounding box.

Feature selection
To increase the performance of KNN and decrease the number of unnecessary features, Linear Discriminant Analysis (LDA) was applied for feature selection. Practically, LDA as a feature dimensionality reduction technique would be a pre-step for a typical classification task. In LDA, first the d-dimensional mean vectors and scatter matrices were computed. Next, to obtain the linear discriminants the generalized eigenvalue problem was solved. Then linear discriminants for the new feature subspace was selected and finally the samples were transferred onto the new subspace. The eigenvector corresponding to a larger eigenvalue is more efficient in capturing discriminative information of the sample [33].

Classification
We applied K-nearest neighbour (KNN) classifier to the same training and test datasets. K-NN, as a non-parametric classifier, identifies the test sample by a majority vote of its neighbours which are assigned to the class that is most common among its K nearest neighbours. The KNN parameter was set to 10 in this study. The three selected features obtained from the previous stage were used as input to the KNN classifier. Four species of monogeneans were used and the vectors of image labels were prepared according to their features. Since the sample size is small in this study we applied Leave-One-Out (LOO) and 10 fold cross validation to assess how the results of our system generalize to an independent data set.

Feature selection
As a result of LDA, three embedding functions were calculated for the feature vectors. Three features were selected for training the K-nearest neighbour classifier. The feature selection method increased the classification results from 75 to 90%. Figure 3 demonstrates 3D scatter plots of the classification result for 4 species with 10 extracted features and three features after LDA.

K-nearest neighbour (KNN) Training
KNN does not make any hypothesis on the underlying data distribution. This is pretty useful in our case as the data is from the real world. Generally practical data does not follow the theoretical assumptions like for example Gaussian mixtures or linearly separable. Non parametric algorithms like KNN come to the rescue here. The trained model was constructed using 10 images of each monogenean species. Basically we tried to find the k nearest neighbour and do a majority voting. The best result with k = 10 nearest neighbours was 90%.

System evaluation
Images of four species of monogeneans namely Sinodiplectanotrema malayanus, Trianchoratus pahangensis, Metahaliotrema mizellei and Metahaliotrema sp. were used in this study. The performance of the system was evaluated by comparing the output from the trained model with the predicted result using the testing dataset as the input. The testing dataset of monogenean images was a new independent dataset not used for the training. 10 images of each species were used for testing the system. The accuracy is calculated according to the confusion matrix of the testing dataset. This means the sum of correct prediction, divided by number of all data. Based on our results, the technique presented in this study was able to recognise and identify most of the monogeneans correctly with an overall accuracy of 90%. All Metahaliotrema mizellei specimens were identified correctly; one specimen from Sinodiplectanotrema malayanus was misidentified as Trianchoratus pahangensis, two specimens of Trianchoratus pahangensis were misidentified as Metahaliotrema sp. and one specimen from Metahaliotrema sp. was misidentified as Trianchoratus pahangensis. Confusion matrix (Table 1) demonstrates the classification result.

Discussions
An automatic identification system for monogenean species is proposed in this study. The system was trained using 40 images of four species and tested by another 40 images of the same species. A total of 102 images were captured and digitised, however 22 were eliminated due to noise and overlapping of anchors and bars that might lead to misclassification [34,35]. All captured images were stored in an image database for easy retrieval by the classifier to train and test the system. K-nearest neighbour was used for detection of four species of monogeneans according to their diagnostic parts. The accuracy of automated identification in this study is 90%. Features such as Euler number, perimeter, area, area density, perimeter density, centre of bounding box, length of bounding box, width of bounding box and orientation of bounding box were extracted from anchors and bars of monogeneans. The overlapping of bars and anchors of most sample images prevented good feature extraction. Using LDA for feature selection, a new feature vector with 3 features was established. The selected features were given to KNN and as result Metahaliotrema mizellei specimens were identified correctly.
Mainly anchor and bar size in this species were totally distinct; one specimen from Sinodiplectanotrema malayanus and also one specimen from Metahaliotrema sp. were misidentified as Trianchoratus pahangensis. The shape of anchor organs' tail in Sinodiplectanotrema malayanus species and size of anchors in Metahaliotrema sp. specimens are similar to Trianchoratus pahangensis specimens. Two specimens of Trianchoratus pahangensis were misidentified as Metahaliotrema sp. The anchors of these two species are very similar in size and shape of their tails. The features are very much dependent on the size and shape.
In the future, different image processing techniques will be adopted to differentiate these two species. In some images anchors were separated and in some they were overlapping with other anchors and bars, therefore, this causes turbulence in feature extraction. We used the images to calculate the LOO and 10 fold cross validation. The error rate from LOO cross validation is 0.1 and the accuracy is 91.25%. We also divided all of the digital images to 10 subset, one subset was held out as training and the other sets were held out as testing set. The error rate resulted for 10 fold cross validation is this study is 0.1 and accuracy is 92.5%. The results from LOO and 10 fold cross validation are better than KNN classification. (Tables 1, 2 and 3).
In this study, only bars and anchors of monogeneans were used as diagnostic organs for species identification. In future studies marginal hooks and copulatory organs  can be considered. Improving the image quality could also yield better classification results.

Conclusions
Automated identification of monogenean species based on haptoral organ images of monogeneans achieved an overall accuracy of 90%. Image processing techniques were applied to automatically extract features from microscope images followed by KNN as the classifier. An automated identification method will be useful for taxonomists and nontaxonomists since sample processing time is reduced. The enhancement of image acquisition to achieve better image quality and improvement in feature extraction techniques to accommodate large datasets covering more taxa is planned for future work. In conclusion, the purpose of this study is to develop a fully automated identification system capable of identifying monogenean specimens. Eventually, this study will be integrated into a digital biological ecosystem published by the corresponding author [36]. This work will also be extended to a species classification pipeline which incorporates semantics using textual annotations and image attributes.