### Score function

Our method should find similar colors for similar symbols and distant colors for dissimilar symbols. In order to quantify this objective, differences are computed via a score function. The score function takes a combination of *CIE L*a*b** colors (*color conformation*), represented as a (*N*×3)-matrix, and returns a scalar score value. *N* is the number of symbols in the alphabet (20 for amino acids).

In the following, <*X*> denotes the arithmetic mean of all elements of a matrix: \(\left < X \right > := \frac {1}{n} \sum _{ij} X_{ij}\). For a triangular matrix, only the elements of the non-zero triangle are taken into account; *n* is the number of respective matrix elements.

#### Construction of distance matrices

First, the input substitution matrix *M* is converted into a distance matrix *D*^{′}. *D*^{′} is triangular to remove redundant entries, since the distance is symmetric.

$$ D'_{ij} = \left\{\begin{array}{ll} \left((M_{ii} - M_{ij}) + (M_{jj} - M_{ji}) \right) / 2, & \text{for} \; j \leq i \\ 0, & \text{for} \; j > i \end{array}\right. $$

For any substitution matrix *M*, *M*_{ii} should be the maximum value in the row/column *i*, otherwise a symbol would be more similar to another symbol than to itself. Consequently, the main diagonal of *D*^{′} contains only zeros.

To be agnostic about the magnitude of substitution scores in *M*, *D*^{′} is scaled, so that the average distance is 1.

$$D = \frac {D'} {\left< D' \right>} $$
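The two steps above can be sketched in NumPy; this is a minimal illustration of the formulas, not the exact implementation in Gecos:

```python
import numpy as np

def distance_matrix(M):
    """Convert a substitution matrix into a normalized distance matrix."""
    # D'_ij = ((M_ii - M_ij) + (M_jj - M_ji)) / 2 for j <= i, else 0
    diag = np.diag(M)
    D = (diag[:, np.newaxis] - M + diag[np.newaxis, :] - M.T) / 2
    D = np.tril(D)  # keep the lower triangle; the diagonal is zero anyway
    # Scale so that the average off-diagonal distance is 1
    tri = np.tril_indices_from(D, k=-1)
    return D / D[tri].mean()
```

Since *M*_{ii} is the row/column maximum, all entries of the resulting lower triangle are non-negative, and the scaling leaves their mean at exactly 1.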

#### Calculation of the perceptual difference matrix

Analogously, the triangular matrix *C* denotes the pairwise perceptual differences of the *L*a*b** colors for each pair of symbols. Our implementation uses the *CIEDE2000* formula [8], which is omitted here for brevity. However, the formula can be approximated by the Euclidean distance [7]:

$$ C_{ij} \approx \sqrt{(L^{*}_{i} - L^{*}_{j})^{2} + (a^{*}_{i} - a^{*}_{j})^{2} + (b^{*}_{i} - b^{*}_{j})^{2}} $$
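The Euclidean approximation of *C* can be written compactly for an (*N*×3) color conformation; again, this is a sketch of the approximation rather than the CIEDE2000-based implementation:

```python
import numpy as np

def difference_matrix(colors):
    """Approximate pairwise perceptual differences of L*a*b* colors
    (rows of the (N x 3) array `colors`) by Euclidean distances."""
    diff = colors[:, np.newaxis, :] - colors[np.newaxis, :, :]
    C = np.sqrt((diff ** 2).sum(axis=-1))
    return np.tril(C)  # triangular, analogous to D
```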

Note that while *D* is constant, *C* depends on the current color conformation.

In order to relate the *L*a*b** color differences in *C* to the distances in *D*, a scale factor *f*_{s} is introduced. *f*_{s} is the ratio of the average distance in *D* to the average difference in *C*:

$$ f_{s} = \frac{\left< D \right>}{\left< C \right>} = \frac{1}{\frac{1}{n} \sum_{ij} C_{ij}} = \frac{n}{\sum_{ij} C_{ij}} $$

As *C* is variable, *f*_{s} also dynamically changes during our optimization runs.

#### Score function

The score function *S*_{T} is the sum of two terms: a sum of harmonic potentials between each pair of symbols, *S*_{H}, and a linear *contrast score* *S*_{C}:

$$S_{T} = S_{H} + S_{C} $$

The harmonic potentials are used to adjust the relative color differences in accordance with the substitution matrix. The equilibrium distance of each potential corresponds to the distance in the distance matrix *D*:

$$S_{H} = \sum_{ij} \left(f_{s} C_{ij} - D_{ij} \right)^{2} $$

However, this term alone is not sufficient to create an appealing color scheme: due to the scale factor *f*_{s}, a scheme with a small average color difference would receive the same score as a scheme with a high average color difference. In consequence, low-contrast color schemes could arise. In order to favor high-contrast color schemes, the contrast score *S*_{C} is introduced as a reciprocal function of the average color difference. The *contrast factor* *f*_{c} is a user-supplied parameter weighting this term:

$$S_{C} = \frac{f_{c}}{\left< C \right>} $$
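Putting the terms together, the total score can be evaluated as follows. The sketch assumes NumPy arrays `D` and `C` holding the lower-triangular distance and difference matrices defined above, with *D* already scaled so that <*D*> = 1:

```python
import numpy as np

def total_score(D, C, contrast_factor):
    """Compute S_T = S_H + S_C from triangular matrices D and C."""
    tri = np.tril_indices_from(C, k=-1)
    mean_c = C[tri].mean()
    f_s = 1.0 / mean_c                            # <D> = 1 after scaling
    s_h = ((f_s * C[tri] - D[tri]) ** 2).sum()    # harmonic potentials
    s_c = contrast_factor / mean_c                # contrast score
    return s_h + s_c
```

Note how *S*_{C} decreases as the average color difference grows, so that among conformations with equal harmonic score the higher-contrast one wins.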

### Optimization

Finding the color conformation that minimizes the score function is a (*N*×3)-dimensional, continuous optimization problem. As the optimization landscape can be restricted via user input, the problem is non-trivially bounded and, depending on the constraints, possibly non-convex. In general, obtaining an exact solution in a non-convex, continuous problem setting is computationally hard, as shown for the example of pair potentials in atomic clusters [9]. Therefore, we resort to heuristic optimization, namely *simulated annealing* (SA) [4], as described in Algorithm 1.

The SA algorithm samples the search space, which is a vector space consisting of color space vectors \( \vec {x} = \left (L^{*}_{1}, a^{*}_{1}, b^{*}_{1}, \dots, L^{*}_{N}, a^{*}_{N}, b^{*}_{N}\right) \).

Sampling is realized based on a neighborhood definition in color space. For the space of color space vectors this neighborhood is defined by all valid colors reachable by adding perturbations of the color values drawn from a uniform distribution \(U(-\Delta \vec {x}(t), \Delta \vec {x}(t))\), excluding the user specified region.

Initially, the search space is sampled at a high temperature, so the algorithm is able to escape local minima or even jump over large sections of the search space. By gradually limiting this behavior, as specified by the annealing schedule *β*(*t*) (quantified by *τ*), the algorithm converges to a suitable solution. Due to the heuristic nature of SA, no guarantees about the quality of the found optima can be given. However, convergence towards the global optimum in infinite time is a proven property of this algorithm [10, 11]. Therefore, after a sufficiently long run time a non-optimal, but nonetheless *good* solution is found.

Typically, SA is used for combinatorial optimization problems, i.e. problems defined on a discrete search space, such as the traveling salesman problem [4]. Since its inception, SA has also been adapted to various continuous optimization problems.

As it turns out, a robust adaptation of SA to the continuous problem discussed here is achieved by annealing both the temperature and the step size.
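A generic version of this scheme can be sketched as follows. The exponential schedules, the parameter defaults, and the score function are placeholder assumptions for illustration, not the exact choices made in Gecos:

```python
import numpy as np

def simulated_annealing(score_function, x0, n_steps=10000,
                        temp0=1.0, step0=10.0, tau=1e-3, seed=0):
    """SA with simultaneous annealing of temperature and step size."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    s = score_function(x)
    x_bsf, s_bsf = x.copy(), s               # best-so-far solution/score
    for t in range(n_steps):
        beta = np.exp(tau * t) / temp0       # inverse temperature schedule
        step = step0 * np.exp(-tau * t)      # annealed step size
        # Uniform perturbation of the current conformation
        x_new = x + rng.uniform(-step, step, size=x.shape)
        s_new = score_function(x_new)
        # Metropolis criterion: always accept improvements, accept
        # deteriorations with probability exp(-beta * (s_new - s))
        if s_new < s or rng.random() < np.exp(-beta * (s_new - s)):
            x, s = x_new, s_new
            if s < s_bsf:
                x_bsf, s_bsf = x.copy(), s
    return x_bsf, s_bsf
```

A real implementation would additionally reject perturbed vectors that leave the valid *L*a*b** gamut or enter the user-excluded region, as described above.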

We run an ensemble of SA instances, i.e. multiple independent instances with different random number generator seeds. The best solution and score found during a single SA run are captured in the variables \(\vec {x}_{BSF} \) and *S*_{BSF}.

These quantities are referred to as the *best-so-far* solution and score. After the last iteration, the optimal solution is given as the minimum of the *best-so-far* solutions within the ensemble of optimizers. In our implementation we also store the ensemble maximum, i.e. the maximum over the worst solutions seen during each run. We omitted this from Algorithm 1 for simplicity, as in principle a broad variety of run features could be recorded in the same way as the *best-so-far* solution. Furthermore, the ensemble average and standard deviation are stored. These two properties are used for further quality analysis of the SA run.
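The ensemble bookkeeping can be sketched as follows. Here `run_sa` is a hypothetical stand-in for a full SA run; it merely returns a seeded random conformation and score for illustration:

```python
import numpy as np

def run_sa(seed):
    # Hypothetical stand-in for a single SA run returning
    # (best-so-far solution, best-so-far score)
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(20, 3))
    return x, float((x ** 2).sum())

def ensemble_statistics(n_runs=8):
    """Run independent optimizer instances and aggregate their results."""
    results = [run_sa(seed) for seed in range(n_runs)]
    scores = np.array([s for _, s in results])
    best_x, best_s = results[int(scores.argmin())]  # ensemble minimum
    return best_x, best_s, scores.mean(), scores.std()
```

Since the runs are independent, they can be trivially parallelized, and the mean and standard deviation indicate how reproducible the found score is across seeds.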

### Software

We have implemented the method for color scheme generation in the Python package Gecos. In addition to the more flexible Python API, the package offers a command line interface (CLI) for simple color scheme generation. Either way, the alphabet, substitution matrix and color subspace can be customized to the user's needs. By default, the software creates a color scheme for the *BLOSUM62* matrix [12]. The CLI saves the generated color scheme in JSON format, containing the RGB color code for each symbol. The JSON format can be directly used for alignment visualization in Biotite [13]. For usage in other visualization software, the color codes must be inserted into the input format of the respective program.