### Simulated data

First, we simulated a dataset to illustrate that PCA is not always capable of detecting clusters in high-dimensional data. Consider three 10-variate normal populations *N*_{10}(*μ*_{i},*Σ*), where

*μ*_{1}=(−*μ*^{∗},*μ*^{∗},0,0,0,0,0,0,0,0)^{⊤},

*μ*_{2}=(*μ*^{∗},0,0,0,0,0,0,0,0,0)^{⊤},

*μ*_{3}=(0,−*μ*^{∗},0,0,0,0,0,0,0,0)^{⊤} and

\(\boldsymbol {\Sigma }=\text {diag}\left (\frac {1}{5},\frac {1}{5},1,1,1,2,2,2,2,2\right)\) with *μ*^{∗}=2. From each population we then simulated 100 samples. In order to hide the clearly visible subpopulations, we further rotated the simulated observations with a random orthogonal matrix. Note that the rotation has no impact on the performance of PCA, as the method is rotation invariant.
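As a minimal sketch, the simulation above could be reproduced with NumPy as follows (the seed and variable names are our own choices, not from the original study):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_star = 2.0
p, n_per_group = 10, 100

# Group means: non-zero only in the first two coordinates
mus = np.zeros((3, p))
mus[0, :2] = (-mu_star, mu_star)
mus[1, 0] = mu_star
mus[2, 1] = -mu_star

# Common diagonal covariance diag(1/5, 1/5, 1, 1, 1, 2, 2, 2, 2, 2)
sigma = np.diag([0.2, 0.2, 1, 1, 1, 2, 2, 2, 2, 2])

# 100 samples from each of the three populations
X = np.vstack([rng.multivariate_normal(mu, sigma, n_per_group) for mu in mus])

# Hide the groups: random orthogonal rotation via QR of a Gaussian matrix
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
X_rot = X @ Q.T
```

Since Q is orthogonal, the rotation preserves all pairwise distances, so the cluster structure is intact but no longer aligned with the original coordinate axes.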

Additional file 1: Figure S1 shows the simulated data before the rotation and Additional file 1: Figure S2 the data after it. In the latter, the groups are no longer visible.

### Chicken data

The high-density genotype data consist of 749 chickens from 4 generations. The last generation is the largest group, with 603 samples; the other generations contain 50, 46 and 50 samples. The data consist of sequence-based variation data from 7 genomic regions, covering approximately 35% of the genome. The regions were preselected based on previous studies as containing loci affecting egg-quality traits (see [15] and [16]). We used galgal4 as the reference genome.

In total there were 157,528 genotypes measured in those regions. See Additional file 1: Figure S3 for the locations of the used regions on the chicken reference genome.

In addition to the genotype data, a set of 15 different breeding values was available for all chickens. These included, among others, egg production in periods 3 to 7, egg production in periods 9 to 12, and feed intake. We use these data as a real data example and follow up on the biological findings only for one detected subgroup in order to keep the focus on the method.

### Invariant coordinate selection

To explain ICS we first need to introduce the concept of a scatter matrix. For a *p*-variate random vector **x**, any *p*×*p* matrix-valued estimator **S**(**x**) is called a scatter matrix if it is affine equivariant in the sense that

$$\mathbf{S}(\mathbf{A}\mathbf{x} + \mathbf{b}) = \mathbf{A}\mathbf{S}(\mathbf{x}) \mathbf{A}^{\top}, $$

for any full-rank *p*×*p* matrix **A** and any *p*-variate vector **b**. Clearly, the regular covariance matrix COV is a scatter matrix, but especially in the robust statistics literature many other scatter matrices have been introduced. For more details about how scatter matrices generalize the covariance matrix, and for many related references, see [17]. A scatter matrix we will use later is the so-called scatter matrix of fourth moments

$$\text{COV}_{4}(\mathbf{x}) = \frac{1}{p+2} \mathbf{E}\left(r^{2} (\mathbf{x}- \mathbf{E}(\mathbf{x}))(\mathbf{x}-\mathbf{E}(\mathbf{x}))^{\top}\right), $$

where *r*=||COV(**x**)^{−1/2}(**x**−**E**(**x**))|| and ||·|| denotes the Frobenius norm.
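As an illustration, a sample version of COV_{4} can be computed as follows (a sketch, not the authors' implementation; the sample mean and sample covariance are used as plug-in estimates):

```python
import numpy as np

def cov4(X):
    """Sample scatter matrix of fourth moments COV_4.

    X: (n, p) data matrix; returns a (p, p) matrix.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # Inverse symmetric square root of the sample covariance (whitening)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    cov_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    # r^2: squared Mahalanobis distance of each observation from the mean
    r2 = np.sum((Xc @ cov_inv_sqrt) ** 2, axis=1)
    # Average of r^2 (x - mean)(x - mean)^T, scaled by 1 / (p + 2)
    return (Xc * r2[:, None]).T @ Xc / (n * (p + 2))
```

The 1/(*p*+2) factor makes COV_{4} equal to COV at the multivariate normal model, which is what makes the comparison of the two scatters informative.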

The main idea of ICS is to compare two different scatter matrices **S**_{1}(**x**) and **S**_{2}(**x**) by solving the following eigenvector-eigenvalue problem

$$\mathbf{S}_{1}^{-1}(\mathbf{x}) \mathbf{S}_{2}(\mathbf{x}) \mathbf{B}^{\top}(\mathbf{x}) = \mathbf{B}^{\top}(\mathbf{x}) \mathbf{D}(\mathbf{x}), $$

where **D**(**x**) is the diagonal matrix containing the *p* eigenvalues of \(\mathbf {S}_{1}^{-1}(\mathbf x) \mathbf {S}_{2}(\mathbf {x})\) in decreasing order and the rows of **B**(**x**) contain the corresponding eigenvectors. For convenience of notation we will from now on write **S**_{1}(**x**)=**S**_{1}, **S**_{2}(**x**)=**S**_{2}, **B**(**x**)=**B** and **D**(**x**)=**D**.

The ICS equation above can be seen as the problem of jointly diagonalizing the two scatter matrices, i.e. finding **B** and **D** such that

$$\mathbf{B}\mathbf{S}_{1} \mathbf{B}^{\top} = \mathbf{I}_{p} \quad \text{and} \quad \mathbf{B}\mathbf{S}_{2} \mathbf{B}^{\top} = \mathbf{D}. $$
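This joint diagonalization can be sketched via a generalized eigendecomposition (a minimal illustration; the scatter functionals `S1` and `S2` are supplied by the user, and the function name is our own):

```python
import numpy as np
import scipy.linalg

def ics(X, S1, S2):
    """Jointly diagonalize two scatter matrices S1(X) and S2(X).

    Returns B (rows are eigenvectors, with B S1 B^T = I and
    B S2 B^T = D) and d, the generalized kurtosis values in
    decreasing order.
    """
    M1, M2 = S1(X), S2(X)
    # S1^{-1} S2 b = d b  is equivalent to the symmetric generalized
    # eigenproblem  S2 b = d S1 b; eigh returns ascending eigenvalues
    # with eigenvectors normalized so that V^T S1 V = I.
    d, V = scipy.linalg.eigh(M2, M1)
    order = np.argsort(d)[::-1]  # reorder to decreasing kurtosis
    return V[:, order].T, d[order]

# The invariant coordinates are then z = X @ B.T
```

Solving the symmetric generalized problem, rather than forming \(\mathbf{S}_{1}^{-1}\mathbf{S}_{2}\) explicitly, is numerically preferable when **S**_{1} is nearly singular.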

An interpretation can then be given as follows. First, **S**_{1} is used to whiten the data, i.e. to uncorrelate the variables and standardize their scales. Then PCA is performed on the whitened data using **S**_{2}. The idea is thus to see whether **S**_{2} still finds interesting structure after the second-order information, as measured by **S**_{1}, has been removed.

The transformation **B**(**x**)**x** then yields an invariant coordinate system in the sense that

$$\mathbf{B}(\mathbf{x})\mathbf{x} = \mathbf{B}(\mathbf{A}\mathbf{x}) (\mathbf{A}\mathbf{x}), $$

where equality holds up to marginal signs for any full rank *p*×*p* matrix **A**. The new vector **z**=**B**(**x**)**x** is then usually referred to as the invariant coordinates.

The univariate concept of kurtosis can be seen as the ratio of two (standardized) scale measures, and \(\mathbf {S}_{1}^{-1}\mathbf {S}_{2}\) can hence be seen as a multivariate extension of this concept. The eigenvalues contained in **D** can therefore be interpreted as generalized kurtosis values as measured by **S**_{1} and **S**_{2}. In the special case **S**_{1}=COV and **S**_{2}=COV_{4}, it can be shown that the diagonal elements of **D** are a linear function of the classical kurtosis measures of the components of **z** [18].

When searching for clusters, for example, it is well known that large clusters are often found in directions with small kurtosis, whereas outliers and small clusters are found in directions with large kurtosis. This makes invariant coordinates very suitable for finding groups, as the components are ordered according to their (generalized) kurtosis. In fact, as [5] show, in the context of mixtures of elliptical distributions with proportional scatter matrices, ICS finds Fisher’s linear discriminant subspace without knowing the group memberships. Hence, when using ICS for exploratory data analysis, most attention is usually paid to the components with extreme generalized kurtosis values, for example the first 3–5 and the last 3–5 components. For more details about ICS see [4, 5, 18, 19].

As a practical consideration we would, however, like to point out that there is no generally best combination of scatter matrices, and the performance may depend on the choice of **S**_{1} and **S**_{2}. The choice **S**_{1}=COV and **S**_{2}=COV_{4} is nevertheless well established and is, for example, also a solution to the independent component analysis (ICA) problem if **x** follows the ICA model (see e.g. [20] for further details). ICA has been applied in the context of genetic data e.g. in [21].

Furthermore, ICS is currently limited to the case *p*<*n*−1, as otherwise any two scatter matrices are proportional to each other; see [22] for details. Therefore, if *p*≥*n*−1, one can for example first perform dimension reduction using PCA: with *n* samples the *n*-th eigenvalue is zero, so ICS is applied only to a subspace which is known to have variation and is of smaller dimension than *n*−1. This is standard practice in many multivariate methods which are otherwise limited to the *p*<*n*−1 case, such as the high-dimensional noisy ICA approaches [23].
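The PCA pre-reduction step can be sketched as follows (an illustration only; `k` must be chosen smaller than *n*−1):

```python
import numpy as np

def pca_reduce(X, k):
    """Project the (n, p) data matrix X onto its first k principal
    components, so that ICS can be applied to an (n, k) matrix with
    k < n - 1."""
    Xc = X - X.mean(axis=0)
    # Thin SVD avoids forming the p x p covariance matrix when p >> n
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]  # principal component scores
```

With *p*≫*n*, at most *n*−1 singular values are non-zero, so the retained scores span the full subspace in which the data actually vary.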

### Distance measure, distance groups and statistical testing

For the simulated data, the classification based on the scatterplot matrices from PCA and ICS was made by applying the k-means algorithm to the components of interest. The classification results of the different dimension reduction methods were then evaluated using the adjusted Rand index [24]. In the real data example, the classification was made by visual inspection of the figures.
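The clustering and evaluation step might look like this (a sketch with invented, well-separated component scores, using scikit-learn):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
# Hypothetical 2-D component scores for three simulated groups
scores = np.vstack([rng.normal(m, 0.3, (100, 2)) for m in (-2, 0, 2)])
truth = np.repeat([0, 1, 2], 100)

# k-means on the selected components, evaluated against the truth
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
ari = adjusted_rand_score(truth, labels)  # 1.0 = perfect recovery
```

The adjusted Rand index is invariant to label permutations, which is what makes it suitable for comparing k-means output with the known simulation groups.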

In order to calculate the genetic distance between two different groups in a region of interest, we followed a basic approach. Assuming two subpopulations *A* and *B* have been identified in the data, we first determined at each locus *l*=1,2,… the most common genotype in each group and denote these \(\tilde {G}_{A,l}\) and \(\tilde {G}_{B,l}\), respectively. Then we compared whether these genotypes match between the two groups by setting *G*_{l}=1 if \(\tilde {G}_{A,l} = \tilde {G}_{B,l}\) and *G*_{l}=0 otherwise. Afterwards we calculated a moving average of length 1000 across the data and, within each window, the average level of agreement. Let *W*=*w*_{1},*w*_{2},… be the set of all windows of length 1000, with *w*_{1}=*l*_{1},…,*l*_{1000} and *w*_{2}=*l*_{2},…,*l*_{1001}; the average level of agreement in window *i* is then \(\bar {x}_{w_{i}}=\sum _{\forall l \in w_{i}} G_{l} / 1000\). For the sake of simplicity, we calculated the moving average also across chromosomal borders.

For all windows *w*_{i} with a level of agreement between the two subpopulations of \(\bar {x}_{w_{i}} \leq 0.4\), the distance of each individual in one group to the average of the other group was calculated. For that, we again used the most common genotype at each locus in the subpopulation, coded as 0, 1, 2, and then calculated the Manhattan distance of each individual of the standard population to it.

Testing for differences in the breeding values between the two subpopulations was done by applying a two-sided Mann-Whitney test. Significant breeding values (*p*-value ≤0.05) were further investigated with a directional test, as proposed by [25] and implemented in the R package *gMWT* [26]. The individual distance measure of the chickens from the main population to the subpopulation revealed three types of chickens: those that are genetically close (*c*), medium (*m*) and far (*f*) from the subpopulation. Let *F*_{p,c}, *F*_{p,m} and *F*_{p,f} be the distributions of the three groups for a given phenotype *p*; we then have the following two testing problems in mind

$$H_{0}: F_{p,c} = F_{p,m} = F_{p,f} \quad \text{vs} \quad H_{1}: F_{p,c} \preceq_{st} F_{p,m} \preceq_{st} F_{p,f} $$

or

$$H_{0}: F_{p,c} = F_{p,m} = F_{p,f} \quad \text{vs} \quad H_{2}: F_{p,f} \preceq_{st} F_{p,m} \preceq_{st} F_{p,c} $$

with ≼_{st} denoting the stochastic ordering of two distributions. Two distributions *F*_{1} and *F*_{2} are stochastically ordered, written *F*_{1}≼_{st}*F*_{2}, if *F*_{1}(*x*)≥*F*_{2}(*x*) ∀*x*∈*R*. These directional hypotheses were used to test for a directional relationship between the similarity group and the different phenotypes.