
Table 1 Description of parameters for various techniques used in acdc

From: acdc – Automated Contamination Detection and Confidence estimation for single-cell genome data

Method

Parameter description

Data pre-processing

Given a target of n data points (by default, \(n = 1000\)), the window width is fixed as \(w = \sum_{i} l_{i} / n\), where \(l_{i}\) is the length of contig i. The default choices of \(\Delta w = w/2\) and \(k = 4\) (tetramer frequencies) are robust. For contigs with \(l_{i} < w\), the window width is taken as large as possible (\(w = l_{i}\)).
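The windowing rule can be sketched as follows. This is a minimal illustration, not acdc's implementation: the function names `window_plan` and `kmer_frequencies` are hypothetical, \(\Delta w\) is assumed to be the shift between consecutive windows, w is rounded down to whole bases, and a trailing partial window is simply dropped.

```python
from collections import Counter


def window_plan(contig_lengths, n=1000):
    """Return (w, step, windows); each window is (contig_index, start, end)."""
    total = sum(contig_lengths)
    w = max(1, total // n)      # w = sum_i l_i / n, rounded down (assumption)
    step = max(1, w // 2)       # delta_w = w / 2, read as the window shift (assumption)
    windows = []
    for idx, length in enumerate(contig_lengths):
        if length < w:
            # Short contig: use the largest possible window, w = l_i.
            windows.append((idx, 0, length))
            continue
        # Full-width windows only; a trailing partial window is dropped (assumption).
        for start in range(0, length - w + 1, step):
            windows.append((idx, start, start + w))
    return w, step, windows


def kmer_frequencies(seq, k=4):
    """Relative frequency of every k-mer occurring in seq (k=4: tetramers)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}
```

Because consecutive windows overlap by half, the number of windows produced is roughly twice n for long contigs.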

BH-SNE

The parameter \(\theta = 0.5\) is a trade-off between speed and accuracy. We set the perplexity to \(\operatorname{perp}(n) = \log(n)^{2}\). It can be seen as an effective neighborhood size that controls the graininess of clusters. A small number of data points n receives a small perplexity, whereas with growing n the perplexity saturates.
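To give a feel for the perplexity rule, the sketch below evaluates \(\log(n)^{2}\) for a few values of n. The base of the logarithm is not stated in the table, so the natural logarithm is an assumption here, and the function name `perplexity` is hypothetical.

```python
import math


def perplexity(n):
    """perp(n) = log(n)^2; the natural logarithm is an assumption."""
    return math.log(n) ** 2


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, round(perplexity(n), 1))
```

Under this assumption the default \(n = 1000\) gives a perplexity of about 47.7, and a tenfold increase in n raises it far less than tenfold.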

DIP

The significance level α is uncritical, as α=0 in the large majority of significant cases. Furthermore, the DIP split threshold \(t_{\text{dip}}\), i.e. the percentage of data points for which multimodality was detected, can be seen as a control of detection precision. We found a default value of \(t_{\text{dip}} = 0.001\) to work very well across all tested data sets.

CC

The number of clusters found depends on the underlying graph. In acdc, the graph is constructed by connecting each data point to its \(k_{cc}\) mutual nearest neighbors. The parameter \(k_{cc}\) can be interpreted as the minimum number of data points contained in a separate cluster. To also detect very small contaminations, we use a default value of \(k_{cc} = 9\).
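The graph construction described above can be illustrated with a small brute-force sketch. It is not acdc's code: the function name `mutual_knn_components` is hypothetical, distances are assumed to be Euclidean, and an edge is assumed to exist between two points only if each is among the other's k nearest neighbors.

```python
import math


def mutual_knn_components(points, k):
    """Label the connected components of the mutual k-nearest-neighbor graph."""
    n = len(points)

    # Brute-force k nearest neighbors of every point (excluding the point itself).
    neighbors = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        neighbors.append(set(order[:k]))

    # Union-find: merge the endpoints of every mutual edge.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in neighbors[i]:
            if i in neighbors[j]:          # keep the edge only if it is mutual
                parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11)]
    print(mutual_knn_components(pts, k=2))
```

The brute-force neighbor search is quadratic in the number of points, which is acceptable for an illustration with about a thousand points but would normally be replaced by a spatial index.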

Bootstrapping

We set the number of bootstraps to \(B = 10\). Setting B to a larger number will result in more accurate confidence estimates at the cost of a longer runtime.
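The trade-off between B and accuracy can be seen in a generic bootstrap sketch. The table does not say how acdc turns bootstrap runs into a confidence value, so the rule used here (the fraction of resamples in which a detector reports contamination) is an assumption, and both `bootstrap_confidence` and the `detect` callback are hypothetical names.

```python
import random


def bootstrap_confidence(data, detect, B=10, seed=0):
    """Fraction of B bootstrap resamples for which detect(sample) is true.

    Assumption: confidence is the share of positive resamples; `detect`
    stands in for any contamination detector applied to one resample.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(B):
        # Resample the data points with replacement, keeping the original size.
        sample = [rng.choice(data) for _ in data]
        hits += bool(detect(sample))
    return hits / B
```

With \(B = 10\) the estimate can only take values in steps of 0.1; a larger B gives a finer resolution but requires one detector run per additional resample.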

Kraken

The only parameter required by Kraken is the database to be used. It can be specified as a parameter to acdc as well.

RNAmmer

16S rRNA gene sequence prediction using RNAmmer does not require any parameters.