From: acdc – Automated Contamination Detection and Confidence estimation for single-cell genome data
Method | Parameter description |
---|---|
Data pre-processing | Given a target of \(n\) data points (by default, \(n = 1000\)), the window width is fixed as \(w = \sum_{i} l_{i} / n\), where \(l_{i}\) is the length of contig \(i\). The default choices of step size \(\Delta w = w/2\) and \(k = 4\) (tetramer frequencies) are robust. For contigs with \(l_{i} < w\), the window width is taken as large as possible (\(w = l_{i}\)). |
BH-SNE | The parameter \(\theta = 0.5\) is a trade-off between speed and accuracy. We set the perplexity to \(\mathrm{perp}(n) = \lfloor \log(n)^{2} \rfloor\). The perplexity can be seen as an effective neighborhood size that controls the graininess of clusters. A small number of data points \(n\) receives a small perplexity, whereas with growing \(n\) the perplexity saturates. |
DIP | The significance level is uncritical, as it is \(\alpha = 0\) in the large majority of significant cases. Furthermore, the DIP split threshold, i.e. the percentage of data points for which multimodality was detected, can be seen as a control of detection precision. We found a default value of \(t_{dip} = 0.001\) to work very well across all tested data sets. |
CC | The number of clusters found depends on the underlying graph. In acdc, the graph is constructed by connecting each data point to its \(k_{cc}\) mutual nearest neighbors. The parameter \(k_{cc}\) can be interpreted as the minimum number of data points contained in a separate cluster. To detect even very small contamination, we use a default value of \(k_{cc} = 9\). |
Bootstrapping | We set the number of bootstraps to \(B = 10\). A larger \(B\) yields more accurate confidence estimates at the cost of a longer runtime. |
Kraken | The only parameter required by Kraken is the database to be used. It can be specified as a parameter to acdc as well. |
RNAmmer | 16S rRNA gene sequence prediction using RNAmmer does not require any parameters. |
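
The pre-processing and BH-SNE defaults in the table can be illustrated with a minimal Python sketch. It is not taken from the acdc source code: the function names `window_width`, `contig_windows` and `perplexity` are hypothetical, the window width is assumed to be truncated to an integer number of bases, and the logarithm is assumed to be the natural logarithm, which the table does not state.

```python
import math


def window_width(contig_lengths, n_target=1000):
    """Window width w = sum_i l_i / n for a target of n data points."""
    return sum(contig_lengths) / n_target


def contig_windows(length, w):
    """Yield (start, end) windows over one contig of the given length.

    Assumptions of this sketch: w is truncated to whole bases, the step
    is w/2 (the table's default), and a contig shorter than w yields a
    single window spanning the whole contig (w = l_i).
    """
    width = min(int(w), length)
    step = max(1, int(w) // 2)
    for start in range(0, length - width + 1, step):
        yield (start, start + width)


def perplexity(n):
    """perp(n) = floor(log(n)^2), assuming the natural logarithm."""
    return math.floor(math.log(n) ** 2)


if __name__ == "__main__":
    lengths = [4000, 6000, 500]           # made-up contig lengths
    w = window_width(lengths, n_target=10)
    for l in lengths:
        print(l, list(contig_windows(l, w)))
    print(perplexity(1000))
```

Under these assumptions, the contig of length 500 is shorter than the window width and contributes exactly one window covering the whole contig, while the longer contigs are covered by half-overlapping windows.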