From: Hypercluster: a flexible tool for parallelized unsupervised clustering optimization
config.yml parameter | Explanation | Example |
---|---|---|
1 input_data_folder | Path to folder in which input data can be found | /input_data |
2 input_data_files | List of prefixes of data files | ['input_data1', 'input_data2'] |
3 gold_standard_file | Gold standard file name for each input data prefix; the file must be in input_data_folder | {'input_data': 'gold_standard_file.txt'} |
4 read_csv_kwargs | pandas.read_csv keyword arguments for input data | {'test_input': {'index_col':[0]}} |
5 output_folder | Path to folder into which results should be written | /results |
6 intermediates_folder | Name of subfolder to put intermediate results | clustering_intermediates |
7 clustering_results | Name of subfolder to put aggregated results | clustering |
8 clusterer_kwargs | Additional arguments to pass to clusterers | {'KMeans': {'random_state': 8}} |
9 generate_parameters_addtl_kwargs | Additional keyword arguments for the hypercluster.AutoClusterer class | {'KMeans': {'random_search': true}} |
10 evaluations | Names of evaluation metrics to use | ['silhouette_score', 'number_clustered'] |
11 eval_kwargs | Additional kwargs per evaluation metric function | {'silhouette_score': {'random_state': 8}} |
12 metric_to_choose_best | Which metric to maximize to choose the labels | silhouette_score |
13 metric_to_compare_labels | Which metric to use to compare label results to each other | adjusted_rand_score |
14 compare_samples | Whether to make a table and figure with counts of how often each pair of samples is in the same cluster | "true" |
15 output_kwargs | pandas.DataFrame.to_csv and pandas.read_csv keyword arguments for output tables | {'evaluations': {'index_col':[0]}, 'labels': {'index_col':[0]}} |
16 heatmap_kwargs | Arguments for seaborn.heatmap for pairwise visualizations | {'vmin':-2, 'vmax':2} |
17 optimization_parameters | Which algorithms and corresponding hyperparameters to try | {'KMeans': {'n_clusters': [5, 6, 7] }} |
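
To make the table concrete, the following is a minimal Python sketch that collects a few of the example values above into a dictionary and shows how they might fit together. It does not call hypercluster. The helper `expand_grid` is hypothetical, and two points are assumptions rather than documented behavior: that the two subfolders are placed directly under `output_folder`, and that each combination of values in `optimization_parameters` corresponds to one clustering run.

```python
import itertools
import posixpath

# Example values copied from the table above. This dict only mirrors
# config.yml for illustration; it is not read by hypercluster.
config = {
    "input_data_folder": "/input_data",
    "input_data_files": ["input_data1", "input_data2"],
    "output_folder": "/results",
    "intermediates_folder": "clustering_intermediates",
    "clustering_results": "clustering",
    "clusterer_kwargs": {"KMeans": {"random_state": 8}},
    "evaluations": ["silhouette_score", "number_clustered"],
    "metric_to_choose_best": "silhouette_score",
    "optimization_parameters": {"KMeans": {"n_clusters": [5, 6, 7]}},
}

# Assumption: both subfolders live directly under output_folder.
intermediates_path = posixpath.join(
    config["output_folder"], config["intermediates_folder"]
)
results_path = posixpath.join(
    config["output_folder"], config["clustering_results"]
)


def expand_grid(optimization_parameters):
    """Hypothetical helper: list every (algorithm, hyperparameters) pair.

    Takes the Cartesian product of the value lists given for each
    algorithm, which is one plausible reading of "which algorithms and
    corresponding hyperparameters to try".
    """
    combos = []
    for algorithm, grid in optimization_parameters.items():
        names = sorted(grid)
        value_lists = [grid[name] for name in names]
        for values in itertools.product(*value_lists):
            combos.append((algorithm, dict(zip(names, values))))
    return combos


runs = expand_grid(config["optimization_parameters"])
for algorithm, params in runs:
    print(algorithm, params)
```

With the example `optimization_parameters` from the table, the sketch lists three KMeans settings, one per value of `n_clusters`.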