Towards a supervised classification of neocortical interneuron morphologies

Background The challenge of classifying cortical interneurons is yet to be solved. Data-driven classification into established morphological types may provide insight and practical value. Results We trained models using 217 high-quality morphologies of rat somatosensory neocortex interneurons reconstructed by a single laboratory and pre-classified into eight types. We quantified 103 axonal and dendritic morphometrics, including novel ones that capture features such as arbor orientation, extent in layer one, and dendritic polarity. We trained a one-versus-rest classifier for each type, combining well-known supervised classification algorithms with feature selection and over- and under-sampling. We accurately classified the nest basket, Martinotti, and basket cell types with the Martinotti model outperforming 39 out of 42 leading neuroscientists. We had moderate accuracy for the double bouquet, small and large basket types, and limited accuracy for the chandelier and bitufted types. We characterized the types with interpretable models or with up to ten morphometrics. Conclusion Except for large basket, 50 high-quality reconstructions sufficed to learn an accurate model of a type. Improving these models may require quantifying complex arborization patterns and finding correlates of bouton-related features. Our study brings attention to practical aspects important for neuron classification and is readily reproducible, with all code and data available online. Electronic supplementary material The online version of this article (10.1186/s12859-018-2470-1) contains supplementary material, which is available to authorized users.


NeuroSTR morphometrics
We computed 'standard' morphometrics with the NeuroSTR neuroanatomy library (see Table 1). These include branch length and bifurcation angles, arbor height and width, and topological features such as vertex ratio. We mainly summarized part-of-tree analyses (i.e., those computed for a section of an arbor, such as a branch or segment) by computing their average, using the median, standard deviation, or maximum statistics only when we deemed it justified (e.g., for maximum arbor distance to soma). We also computed some morphometrics specific to axonal terminal branches (e.g., mean terminal branch length).

Custom-implemented
We used 48 axonal and dendritic custom-implemented morphometrics (see Table 2).

Distribution along X and Y axes
Each neuronal reconstruction consisted of points with Euclidean coordinates, with the center of gravity of the soma located at coordinates (0, 0, 0). Thus, computing, e.g., the standard deviation along the X axis provided an estimate of arborization extent in the horizontal direction.

PCA-derived
Following Yelnik et al. (1983) we used principal component analysis (PCA) to quantify possible preferential orientation of an arbor along either the X or Y dimension. We set the Z coordinates to zero and quantified such preference with the index of axialization measure of Yelnik et al. (1983), calling it eccentricity: where s 1 and s 2 are standard deviations of the first and second principal components, respectively (thus, s 1 ≥ s 2 ≥ 0). An e towards 1 indicates a strong preference for one axis, whereas an e towards 0 indicates a circular arbor. We used the angle θ of the main axis (i.e., the first principal component) to a positive X axis passing through the center of mass to quantify the degree of radial or tangential orientation of the arbor, namely, where y and x are the loadings of the first component on the Y and X axes, respectively, and correspond to | sin θ| and | cos θ|. Thus, r is positive if the tree is ascending or descending, and close to -1 if it mainly arborizes horizontally. To reduce its magnitude for trees that did not have a preference for one of the two directions, we factored in the degree of eccentricity, e (which is always positive).

Moments along the axes (distribution around the soma)
We computed the mean, standard deviation, and the standardized mean (i.e., the ratio of the mean to the standard deviation) along the X and Y axes. The sign of y_mean and y_std_mean, for example, may help distinguish between arbors that ascend towards the pial surface or descend towards the white matter; unlike y_mean, y_std_mean is dimensionless and expresses the arborization preference in terms of the Y extent of the arbor. We also computed the means of |x| and |y| so as to not distinguish between arbors skewed towards the right or the left (or above or below) of the soma, but instead between those arborizing close and far from soma, both horizontally and vertically. The standard deviations indicate the extent along an axis and are very correlated with the height and width morphometrics. Finally, we computed the ratio of the range along an axis and the standard deviation along that axis (ratio_x and ratio_y).

Grid analysis
We split the X and Y plane into 20 µm by 20 µm squares, and computed the number of branches in each square. We recorded the number of non-empty squares (i.e., those containing at least one branch; grid_area), as an estimate of the arbor's area, and the mean (grid_mean) branch count per non-empty square. Finally, we computed the ratio of non-empty 100 µm by 100 µm squares and grid_area, to quantify arborization density (grid_density), i.e., arbors that tend to occupy a large of portion of a given 100 µm by 100 µm square.

Axon origin
In order to distinguish axons that originate from below the soma from those that originate above it, we recorded the Y coordinate of the first axonal bifurcation (axon_origin), as well as the difference between the minimal path distance from the soma among points more than 100 µm below the soma (Y < 100 µm) and those at least 100 µm above the soma (Y > 100 µm; axon_above_below); a positive value would suggest that the arborization begun on the upper side of the soma.

Laminar distribution
Since we did not know the exact location of the soma within a layer, we could only estimate axonal projection across the layers. For these estimates we relied on layer thickness data from Figure 3 of Markram et al. (2015), shown in Table 3, assuming that the thickness T l of layer l follows a Gaussian distribution, N (mt l , st l ), where mt l and st l are the mean and standard deviation of T l (given in Table 3).
The probability of an axon reaching L1 depends on axonal height above the soma, h a , and the distance D from the soma to the center of L1, c 1 . We modelled D as a sum of two independent random variables, D = D l + P , where D l is the distance from c l , the center of the soma's layer l, to c 1 , and P the position of the soma with respect to c l (considering, in both cases, only the Y dimension). Assuming layers' thicknesses are independent, D l ∼ N (md l , sd l ), where where the summation term is omitted for L2 (i.e., l = 2). Assuming that P follows N (0, mt l 4 ), the sum D l + P follows N (md l , sd 2 l + ( mt l 4 ) 2 ) and the probability l1_prob of an axon reaching L1 is that of drawing a value equal or greater than h a from this distribution. Thus, for example, for an L4 cell with its axon extending 500 µm above its soma, the probability of reaching L1 was 0.0005, whereas for one with length 700 µm was 0.6450, i.e., 65% (md 4 = 679.5; see Table 3).  Markram et al. (2015) and the estimated distance from the layer's center to the center of L1.
Layer Thickness Distance to L1 (md l ± sd l ) 1 165 ± 13 2/3 502 ± 27 333.5 ± 15 4 190 ± 7 679.5 ± 28 5 525 ± 33 1037.0 ± 33.1 6 700 ± 48 1649.5 ± 49.9 We estimated the probability p a of an arbor extending into the layer above as the probability of drawing h a from a Gaussian distribution N ( mt l 2 , mt l 4 ), where h a is the arbor's height above the soma, and mt l , as above, the mean thickness of the soma's layer l. We computed the probability p b of reaching the layer below analogously, setting it to 0 for layer L6. The probability of an arbor being translaminar (i.e., not confined to a single layer) was given by max{p a , p b }.

MC arborization pattern
To estimate axonal width in L1 (l1 width; MC cells' axons tend to spread out horizontally in this layer), we computed its width in the upper 165 µm (i.e., the thickness of L1) of its arborization and multiplied it with the probability of having reached L1 (l1_prob). In an analogous way we estimated the total number of bifurcations in L1 (as a proxy for total arbor extent in that layer). We also estimated the extent to which this arborization grew horizontally (l1_gx) and away from the soma (l1_gxa), following the assumption that the axon rises vertically approximately above the soma, and ramifies in both horizontal directions in layer L1. l1_gx is the sum of all segments' X-axis projections, whereas l1_gxa equals l1_gx minus the X-axis projections of all segments directed towards the soma (i.e., their initial X coordinate is further from the soma than their terminal coordinate).

ChC arborization pattern
Since ChC cells' axons have short vertical terminals (Markram et al., 2004;Somogyi, 1977), we counted the number of terminal branches with an extent along the Y axis < 50 µm (Somogyi (1977) reports ChC vertical terminals from 10 µm to 50 µm long) and at least twice as large as the extent along the X axis (short_vertical_terminals).

Arbor density
We quantified arbor density with a number of ratios involving arbor length as the denominator: the ratio of the number of bifurcations and arbor length (density_bifs), proportional to the inverse of branch length; the ratio of area and arbor length (density_area), and, finally, the ratio of average Euclidean distance and total length (density_dist).

Dendritic bipolarity
We quantified whether the dendrites stemmed from opposite ends of the soma and whether those ends are located along a radial (i.e., parallel to the Y) axis, as is the case with bipolar and bitufted dendrites. We did this by applying the above-described PCA-derived analysis to the dendrite insertion points on the soma's surface, after having replicated every insertion point once for each whole µm of the corresponding dendrite's length, so as to give more weight to insertion points of longer dendrites, and having set the Z coordinates to 0. A high insert.eccentricity thus indicated insertion points along an axis, rather than spread-out across the soma's surface, whereas a high insert.radial suggested that the axis was parallel to the Y axis. For cells with a single dendrite insertion point we set insert.eccentricity and insert.radial to 0.

Displaced dendritic arbor
To quantify whether the dendritic arbor was displaced (DeFelipe et al., 2013) from the axonal one, we averaged the distance to the closest axonal reconstruction point for each point of a dendritic arbor (displaced).

Overview
We considered eight separate classification scenarios, one for each interneuron class (including the basket 'superclass') versus all others joined together. Most of these classification tasks were highly imbalanced, with one class much scarcer than the other. For each task, we carried out the same learning procedure -univariate feature selection, under-and over-sampling, and classifier learning-and estimated its performance with cross-validation.
More precisely, we first standardized the predictors over the entire data set, i.e., prior to cross-validation. Crossvalidation then split the data sample into k training and test subsets. We performed feature selection and data sampling on the training data alone, and separately for each training subset. For each training subset, we performed, in the following order: 1) feature selection; 2) data under-and over-sampling; and 3) classifier learning, where we considered a number of different learning algorithms. Steps 1 and 2 were optional, giving the four combinations considered: feature selection followed by classifier learning (without data sampling); feature selection followed by data sampling and then classifier learning; classifier induction without sampling or feature selection; and finally, classifier induction without feature selection but with data sampling. We evaluated the induced classifiers on the k test subsets.
Besides the performance of supervised classification, we looked at the results of feature selection. We ranked and selected features according to the Kruskal-Wallis (KW) test and random forest balanced variable importance (RF BVI), a variation of the well-known random forest variable importance metric, defined below.

Notation and terminology
We denote the vector of n predictor variables or features with X = (X 1 , . . . , X n ) and the class variable with C. Lowercase x and c each denote a single assignment to the predictors X and the class C, respectively. In our setting, x ∈ R n while c ∈ {c 0 , c 1 }, the positive and negative classes. We have a data set D = {(x (i) , c (i) )} N 1 consisting of N instances or data points (i.e., interneurons) x (i) with their label (interneuron class) c (i) . A classifier is a function f : R n → C. A learning algorithm produces f from a training set of observed values of X and C. We may use the terms classifier, learning algorithm, and model interchangeably.

Supervised classifiers
We applied a number of state-of-the-art classifier learning algorithms (Murphy, 2012;Hastie et al., 2009). They are all listed in Table 2 in the main text, along with the abbreviations that we will use to refer to them in the following sections. Below we briefly describe them.

CART
The CART algorithm (Breiman et al., 1984) produces a decision tree by recursively partitioning the training samples according to a single predictor at a time. For each node a of the tree, CART selects the splitting predictor X j , and its threshold value t, by minimizing 'class impurity' G, where D a is a subset of D at node a, while D a jt and D a \ D a jt are the left and right splits, respectively, of D a according to X j and threshold t, where P D a is the empirical probability of class c l in D a . Deep trees can overfit the data, and options for regulating complexity include |D a |, the minimum size of D a required in order to attempt a split, and |D l |, the minimum size of some leaf node D l .

Random forest
A CART tree can overfit the training data. Besides pruning, another way to reduce variance is to use an ensemble of trees, such as the random forest classifier (RF; Breiman, 2001). One draws T bootstrap (Efron, 1979) samples (size N samples from D with replacement), and on each learns an unpruned CART tree. At each split, consider only m ≤ n randomly selected features. To make a prediction, choose the majority class among the T trees. Due to averaging over bootstrap samples, the RF is generally robust to overfitting.

AdaBoost
Like random forest, AdaBoost (Freund and Schapire, 1997) is an ensemble of classification trees. The first tree is trained in regular fashion, with all instances having the same weight. For each following tree, the weight of the instances misclassified by the previous tree is increased. This way even weak individual trees can be combined into an accurate ensemble. Parameters of the method include the depth d of the individual trees, a regularization parameter s ∈ [0, 1], and the number of trees T .

Naive Bayes
The naive Bayes (Minsky, 1961) is a simple approximation to the joint probability distribution P (C, X). It assumes that predictors are conditionally independent given the class and classifies an instance according to Here we assume that each p(X j | c) is a Gaussian probability density with mean µ j,c and variance σ 2 j,c . Albeit a simple model, the naive Bayes often performs well, generally due to its low variance.

k-nearest neighbors
kNN (Fix and Hodges, 1951) classifies an instance x according to its nearest neighbors in feature space, by choosing the most common class label among them. The number of neighbors k is a parameter to the model, with a lower value reducing bias but increasing variance (a lower k fits the training data better). The neighbors are usually identified using a variant of the Minkowski distance, such as Euclidean distance. A common extension is to predict c * by giving more importance to the points that are closer to the target point. Kernel functions are a common means of expressing such weight functions, with weights decreasing smoothly with distance from the target point x (see, e.g., Hechenbichler and Schliep, 2004).

Regularized logistic regression
According to the (binomial) logistic regression model (e.g., Hastie et al., 2009, Chapter 4), the log odds of a class c 0 and class c 1 are a linear function of x: where β 0 and β are the model's coefficients. While the β can be fit by maximum likelihood estimation, regularizing the model by shrinking them can reduce variance. The lasso (Tibshirani, 1996) regularization finds the β by maximizing: is the probability, under the model, of c (i) given x (i) , while λ specifies the degree of penalty on the magnitude of the coefficients. The lasso tends to shrink some coefficients to zero, effectively selecting, for interpretation purposes, the non-zero coefficient variables (features with β j = 0 are effectively omitted from the model). The β coefficients are straightforward to interpret: keeping all other predictors fixed, a unit increase in a standardized predictor X j increases the log-odds of the positive class by β j . Thus, the higher |β j |, the more useful is X j . For groups of correlated predictors, lasso tends to keep a single non-zero coefficient and shrink the rest to zero. Implementations such as the glmnet package (Friedman et al., 2010) can efficiently optimize λ according to the cross-validated estimate of a loss function such as classification error.

Single-layer neural network
A single-layer neural network (Bishop, 1995) models P (c | x) as a linear combination of derived features, also called hidden neurons, each of which is, in turn, a linear combination of x. With the number of derived features h = 0, the neural network corresponds to a linear model such as logistic regression. Increasing h makes the model more flexible, with a sufficiently high h allowing it to represent any piecewise continuous function. The model is trained with an algorithm that minimizes cross-entropy loss.

Support vector machine
The SVM (Boser et al., 1992;Cortes and Vapnik, 1995) finds the maximal margin hyperplane that separates the two classes. It uses kernel functions to project the data onto a higher dimensional space, where they are more likely to be linearly separable. It searches for a separating hyperplane, determined by a coefficient vector β and an intercept β 0 , by finding min β0,β,ξ is on the correct side of the hyperplane, and R > 0 is the complexity parameter, with larger values narrowing the margin and yielding less training set misclassifications, while φ maps x to a higher dimensional space. φ is given by a kernel function K such that K(x, x ) = φ(x) T φ(x ). A common example is the radial basis function, K(x, x ) = exp −γ||x − x || 2 , whose parameter γ > 0 indicates spread from the target instance x.

Linear discriminant analysis
Like multinomial regression, the LDA (Fisher, 1936;Rao, 1948) is a linear classifier, with piecewise hyperplanar decision boundaries. It assumes p(x | c l ) = N (µ l , Σ), that is, multivariate normal class-conditional distributions, with an µ l mean vector for each class c l and a shared covariance matrix Σ, equal for both classes.

Undersampling and oversampling
Most classifiers implicitly optimize classification accuracy, and with high class imbalance, this can lead to a poor prediction of the minority class. A standard technique for reducing this bias towards the majority class is to undersample or oversample the training data (He and Garcia, 2009) in order to achieve a more balanced training set.
Oversampling augments the training set with instances of the minority class. The SMOTE (Chawla et al., 2002) method creates a synthetic instance of the minority class by randomly choosing a point on the line between some minority class instance x and one of its k nearest neighbors from the minority class. Undersampling, on the other hand, involves removing a number of instances from the majority class. We combined both under-and over-sampling (as in, e.g., Estabrooks et al., 2004), using random undersampling followed by SMOTE oversampling. Finally, one has to determine the number of instances to add to (remove from) the training set; albeit a balanced training set is usually desirable, many synthetic minority class instances can lead to overfitting, while losing many majority class instances can mean losing valuable information (He and Garcia, 2009).

Feature selection
In small-sample class-imbalance settings, univariate feature selection (Guyon et al., 2006) can improve predictive performance more than over-and under-sampling, at least for the SVM (Wasikowski and Chen, 2010 (2013)), we expect it to be relatively insensitive to class imbalance. In addition, as a statistical hypothesis test, it provides a straightforward cut-point to discern relevant predictors from irrelevant ones: the p-value and a significance level, α, most commonly set to 0.05. We also used a multivariate feature ranking based on RF-derived variable importance (see below).

Kruskal-Wallis test
The null hypothesis of the Kruskal-Wallis test (Kruskal and Wallis, 1952) is that the medians of k samples are the same. In our case, these samples correspond to the k different classes. It is a non-parametric procedure and as such it does not assume that the data follow a particular distribution. Its special case for k = 2 is the Mann-Whitney-Wilcoxon test (Wilcoxon, 1945;Mann and Whitney, 1947). The test statistic H j , for some feature X j , is where r li is the rank of i-th sample in class c l ,r l· is the average rank of samples in class c l ,r is the average rank, and N l is the number of instances in class c l . H j asymptotically follows the χ 2 distribution and thus we compute the test's p-value as P (χ 2 k−1 ≥ H j ). With small N l , the χ 2 approximation is less accurate and results in reduced test power (Sheskin, 2003). We adjusted the p-values (obtained with the χ 2 test) for multiple testing by using the false discovery rate procedure (Benjamini and Hochberg, 1995).

RF variable importance
Variable importance (VI) is given by the out-of-bag (OOB) accuracy of the trees in the forest. An OOB sample for a tree t consists of instances which were not in the bootstrap subsample from which t was learned. Let a t be the percentage of correct classifications in the OOB sample for tree t, and a ptj the percentage of correct classifications after randomly permuting the values of X j in the OOB sample. Then, where a tj = a ptj = 0 if X j is not in tree t; otherwise a tj = a t ; T , as mentioned in Section 2.3.2, is the number of trees in the ensemble. Alternatively, one can compute per-class VIs by measuring changes in class-specific accuracies.
VI can loosely be interpreted as the feature's effect on accuracy and it provides a ranking of the features (obtained in a multivariate way). Useful features will have positive values whereas useless ones will have VIs around or below zero. A drawback is that the VI ranking tends to favor correlated predictors, especially for low values of m (i.e., the number of features considered at each split; see . Because the ranking is stochastic, it is important to use enough trees for it to stabilize. The above-described VI is less effective in imbalanced settings, as misclassifications due to imbalance can overcome those due to class label permutation. While Janitza et al. (2013) proposed a VI derived from the change in area under the ROC curve (Swets, 1988;Fawcett, 2006), rather than the change in accuracy, so as to balance both types of errors (i.e., false positives and false negatives), we did not use it as it is only implemented for the RF variant based on conditional inference trees (Hothorn et al., 2006). Instead, we used the arithmetic mean of the per-class VI values provided by the randomForest R package, and refer to this as the balanced VI (BVI) 1 .
Finally, given a VI-based ranking, it is not straightforward to determine the cut-point that separates useful features from useless ones. While Breiman (2001) suggests a statistical test for the purpose, it has some undesirable statistical properties (i.e., its power increases with the number of trees and decreases with sample size) and is thus not recommended  2 . Alternatives include permutation tests (Wang et al., 2010;Altmann et al., 2010) and methods based on OOB accuracy of nested RF models, corresponding to different cut-points along the ranking (Svetnik et al., 2003;Díaz-Uriarte and De Andres, 2006;Genuer et al., 2010). We used a simple heuristic and selected only features with a BVI above 0.01.

Detecting mislabelled examples
Wrong class labels may arise due to issues such as lack of relevant information, subjectivity, or simple human mistakes (Frénay and Verleysen, 2014). We followed the approach by Brodley and Friedl (1999), considering cells misclassified by different models as possibly mislabelled. It is beneficial for the models to belong to different paradigms, such as nearest neighbors and decision trees, so that their biases differ. The misclassifications correspond to cross-validation rather than resubstitution.

Unsupervised preprocessing
We standardized all predictors to zero mean and unit variance. This gives equal weight to all predictors for the kNN classifier, and allows us to interpret the magnitude of the coefficients of the linear models, while it does not affect the remaining models.

Assessing performance
On imbalanced data sets, a model can be accurate by simply predicting the majority class. We thus complemented the accuracy metric with the F-measure (Baeza-Yates and Ribeiro-Neto, 1999) score: where TP, FP, and FN are entries of the confusion matrix, namely the true positives, the false positives, and false negatives, respectively. The F-measure thus balances different aspects of positive class prediction, namely the true positives, false positive, and the false negatives. In our case, the positive class was always the class of interest, e.g., when classifying ChC versus all other types, the positive type was ChC. Except for BA, the positive class was always considered the minority class (i.e., there were more BA than non BA cells).
We estimated the accuracy and F-measure with cross-validation (CV). Because over-and under-sampling introduced stochasticity into the learning process, we repeated CV a number of times and averaged to get the final estimates. Note that, unlike for classification accuracy, one cannot get an unbiased estimate of F-measure by averaging over the k test samples (Forman and Scholz, 2010); thus, we computed the F-measure of a CV run from the full confusion matrix, obtained by aggregating the true labels and predictions from the k test folds. That is, the F-measure estimate for a run of cross-validation was not the average of k per-fold F-measure scores, but rather the single value computed from the aggregated confusion matrix.

Parameters
We set the classifiers' parameters (see Table 2 in main text) on the basis of available recommendations (Boulesteix et al., 2012;Hsu et al., 2003) or we used the defaults in the software implementation. For kNN we used k = 5, and, similarly, for CART we set |D l | = 5; while this might be too coarse-grained for ChC, as there are at most six ChC cells per training set, we sought to avoid overly complex models (with a lower k). Note that the m = √ n parameter for RF was recomputed on every training set; thus, it was adjusted each time feature selection reduced n. For RF, we set T = 2000 and chose the standard value of m = √ n.
For KW feature selection, we set the significance level α = 0.05, whereas for RF BVI ranking we selected features with BVI ≥ 0.01 and kept m = √ n, although it can yield a BVI ranking that prefers correlated features (Nicodemus et al., 2010), while we increased the number of trees, T , to 20000, as that produced stable BVI values (a higher T does not increase model variance nor presents any other drawback besides longer computation time).
For undersampling, we randomly removed up to a half of the majority class instances, keeping at least three majority class instances per each one of the minority class (thus, for the imbalance ratios minority:majority above (i.e., less pronounced than) 1:3 we did not undersample). More precisely, after undersampling there were N M , 3N m ) minority class examples. Therefore, for large imbalances (e.g., a ratio 1:10) most balancing was due to undersampling, potentially reducing imbalance down to a ratio of to 1:3, with SMOTE oversampling then further reducing the ratio towards 1:1.
We evaluated the learning procedures with stratified 10-fold cross-validation, except for ChC versus rest, where we used seven folds (in order to have at least one ChC instance in each test set). When sampling the training data, we repeated CV 10 times and averaged the results. When computing F-measure, we always considered the minority class as the positive one, except for BA versus rest, when we considered BA, the majority class, as the positive one. We looked for mislabelled cells by collected misclassifications over 30 runs of ten-fold cross validation. Tables 2 and 3 in the main text list all the parameters.

Implementation
We implemented all the data analysis and classification in the R programming language (R Core Team, 2015). We used the mlr (Bischl et al., 2015) for classifier learning and evaluation, feature selection, oversampling and undersampling, extending it to compute a global F-measure for an entire cross-validation run and adding the FDR p-value correction to the KW feature selection method. All code and data are available at https: //github.com/ComputationalIntelligenceGroup/bbp-interneurons-classify.

Morphology quality 3.1 Reconstruction differences
We found that the cells differed in mean axonal segment length and that this difference could be related to the internal ids (e.g., C010600B1 and MTC070301B_IDC) of the cells. Out of the seven different initial letters of these ids, cells whose id begun with a letter C (88 of them) had shorter, as well as thicker, axonal segments than the remaining 131 cells (see Figure 5 in main text). As branch and total arbor length did not differ between the two groups, this meant that the C-prefixed cells had fewer long and thick segments per branch, whereas the non-C cells contained more short and thin ones, suggesting that they were simply reconstructed at a finer granularity. We found that the C-prefixed cells were deposited at Neuromorpho.org repository (Ascoli et al., 2007) earlier than the non-C ones, meaning they may have been reconstructed at an earlier stage.
More morphometrics, such as axonal remote tilt angle (remote_tilt_angle.avg), arbor depth (depth), and tortuosity (tortuosity.avg), also differed between the two groups, albeit with much less statistical significance (see Table 4) than thickness and segment length. We suspect that only some of these, such as possibly tortuosity (the non-C cells have lower tortuosity, i.e., they are less straight, which seems logical given that they are broken into more segments) had been affected by differences in branch reconstruction granularity; others might have differed due to others causes, such as different proportion of interneuron types in the two groups, or the different laminar distribution (see Figure 1).  Figure 1: The branches of non-C cells' (those whose ids do not begin with a C) were less straight (i.e., their tortuosity values were lower), even after accounting for interneuron type and layer. : Morphometrics that differed between the cells whose id begins with a C and the rest, according to a Kruskal-Wallis test at α = 0.05, with the p-value corrected for multiple testing with the false discovery rate procedure (Benjamini and Hochberg, 1995).

Classification results
Figures 3 to 10 show all classifiers' F-measure for all eight classification tasks. For ChC and BTC the results depended more strongly on sampling, with some samplings providing better and other worse results (e.g., for ChC, the F-measure of RF BVI + SVM ranged from 0.13 to 0.57; see Figure 3), as in these settings the amount of removed instances was highest. Perhaps an informed, rather than random, undersampling scheme could have improved the results.

Multi-class
We carried out multi-class classification by combining one-versus-all models. Figure 11 shows the accuracies of the different methods while Table 9 is the confusion matrix of the most accurate method, RF with KW feature selection and data sampling.    Figure 4: BTC versus rest. Violin plot of 10-fold cross-validation estimates of F-measure. Above: ten CV repetitions when under-and over-sampling training data; below: a single CV repetition with no data sampling. Vertical rows of panels correspond to the feature selection methods applied: none, KW, and RF BVI.  Figure 5: DBC versus rest. Violin plot of 10-fold cross-validation estimates of F-measure. Above: ten CV repetitions when under-and over-sampling training data; below: a single CV repetition with no data sampling. Vertical rows of panels correspond to the feature selection methods applied: none, KW, and RF BVI.  Figure 6: SBC versus rest. Violin plot of 10-fold cross-validation estimates of F-measure. Above: ten CV repetitions when under-and over-sampling training data; below: a single CV repetition with no data sampling. Vertical rows of panels correspond to the feature selection methods applied: none, KW, and RF BVI.  Figure 7: NBC versus rest. Violin plot of 10-fold cross-validation estimates of F-measure. Above: ten CV repetitions when under-and over-sampling training data; below: a single CV repetition with no data sampling. Vertical rows of panels correspond to the feature selection methods applied: none, KW, and RF BVI.  Figure 9: LBC versus rest. Violin plot of 10-fold cross-validation estimates of F-measure. Above: ten CV repetitions when under-and over-sampling training data; below: a single CV repetition with no data sampling. Vertical rows of panels correspond to the feature selection methods applied: none, KW, and RF BVI.  Figure 11: Multi-class classification by combining one-versus all models. Violin plot of 7-fold cross-validation estimates of F-measure. Above: ten CV repetitions when under-and over-sampling training data; below: a single CV repetition with no data sampling. Vertical rows of panels correspond to the feature selection methods applied: none, KW, and RF BVI. Entries are missing for ADA, NNET, and RMLR as their execution did not complete.