DeepDPM clustering example on 2D data.
On the left: DeepDPM’s predicted cluster assignments, centers, and covariances. On the right: clusters colored by the ground-truth (GT) labels, together with the net’s decision boundary.
Examples of the clusters found by DeepDPM on the ImageNet Dataset:
DeepDPM is a nonparametric deep-clustering method that, unlike most deep-clustering methods, does not require knowing the number of clusters, K; rather, it infers K as part of the overall learning. Using a split/merge framework to change the number of clusters adaptively, together with a novel loss, our proposed method outperforms existing nonparametric methods, both classical and deep.
While the few existing deep nonparametric methods lack scalability, we demonstrate the scalability of ours by being the first such method to report its performance on ImageNet.
Installation
The code runs with Python version 3.9.
Assuming Anaconda, the virtual environment can be installed using:
See the requirements.txt file for an overview of the packages in the environment we used to produce our results.
Training
Setup
Datasets and embeddings
When training on raw data (e.g., MNIST or Reuters10k), MNIST is downloaded automatically to the "data" directory. Reuters10k must be downloaded separately (it is available online) and placed in the "data" directory.
Logging
To run the following with logging enabled, edit DeepDPM.py and DeepDPM_alternations.py and insert your Neptune token and project path. Alternatively, run the scripts with the --offline flag to skip logging. Evaluation metrics are printed at the end of training in both cases.
Training models
We provide two models that can be used for clustering: DeepDPM, which clusters embedded data, and DeepDPM_alternations, which alternates between feature learning using an autoencoder (AE) and clustering using DeepDPM.
Key hyperparameters:
--gpus specifies the number of GPUs to use. E.g., use "--gpus 0" to use one GPU.
--offline runs the model without logging.
--use_labels_for_eval runs the model with ground-truth labels for evaluation (labels are not used in the training process). Do not use this flag if you do not have labels.
--dir specifies the directory where the train_data and test_data tensors are expected to be saved.
--init_k specifies the initial guess for K.
--start_computing_params specifies when to start computing the clusters’ parameters (the M-step) after initialization. When changing this value, make sure the network has had enough time to learn the initialization.
--split_merge_every_n_epochs specifies the frequency of splits and merges.
--hidden_dims specifies the AE’s hidden-layer dimensions and depth for DeepDPM_alternations.
--latent_dim specifies the dimension of the AE’s learned embeddings (the dimension of the features that will be clustered).
Please also note the NIW (Normal-Inverse-Wishart) hyperparameters and the guidelines on how to choose them, as described in the supplementary material.
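The flags above can also be combined programmatically. The following is a minimal sketch, not part of the repository, that assembles a DeepDPM command line from flags documented in this README (`--dataset` appears in the training examples). The helper name `build_deepdpm_command`, the dataset name `my_dataset`, and the path `./data/my_dataset` are hypothetical placeholders; check the scripts' argument parsers for the values they actually accept.

```python
import shlex


def build_deepdpm_command(dataset, data_dir=None, init_k=None, gpus=None,
                          offline=True, use_labels_for_eval=False,
                          script="DeepDPM.py"):
    """Return an argv list for a DeepDPM run (illustrative helper, not repo code)."""
    cmd = ["python", script, "--dataset", dataset]
    if data_dir is not None:
        # Directory expected to hold the train_data and test_data tensors.
        cmd += ["--dir", data_dir]
    if init_k is not None:
        # Initial guess for the number of clusters K.
        cmd += ["--init_k", str(init_k)]
    if gpus is not None:
        cmd += ["--gpus", str(gpus)]
    if offline:
        # Skip logging; metrics are still printed at the end of training.
        cmd.append("--offline")
    if use_labels_for_eval:
        # Only meaningful when ground-truth labels are available.
        cmd.append("--use_labels_for_eval")
    return cmd


# Hypothetical dataset name and path, used only for illustration.
cmd = build_deepdpm_command("my_dataset", data_dir="./data/my_dataset",
                            init_k=3, gpus=0)
print(shlex.join(cmd))
# python DeepDPM.py --dataset my_dataset --dir ./data/my_dataset --init_k 3 --gpus 0 --offline
```

To launch training, the resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.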
Training examples:
To generate a GIF similar to the one presented above, run:
python DeepDPM.py --dataset synthetic --log_emb every_n_epochs --log_emb_every 1
To run DeepDPM on pretrained embeddings (including custom ones):
Training on custom datasets:
DeepDPM is designed to cluster data in the feature space.
For dimensionality reduction, we suggest using UMAP, an autoencoder, or off-the-shelf unsupervised feature extractors such as MoCo, SimCLR, SwAV, etc.
If the input data is relatively low dimensional (e.g., at most 128-D), it is possible to train on the raw data.
To load custom data, create a directory that contains two files, train_data.pt and test_data.pt, holding tensors with the training and test data, respectively.
DeepDPM will load them automatically. If you have labels you wish to load for evaluation, use the --use_labels_for_eval flag.
Note that the saved models in this repo are per dataset and, in most cases, specific to that dataset. Thus, using them for custom data is not recommended.
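As a rough illustration of the custom-data layout described above, the sketch below writes `train_data.pt` and `test_data.pt` with `torch.save`. It assumes each file holds a 2-D float tensor of shape (N, D), which this README does not state explicitly; the helper name `save_custom_dataset`, the random stand-in features, and the directory name `my_dataset` are placeholders.

```python
import os
import tempfile

import torch


def save_custom_dataset(train, test, out_dir):
    """Write train_data.pt and test_data.pt into out_dir and return both paths."""
    os.makedirs(out_dir, exist_ok=True)
    if train.shape[1] > 128:
        # The README suggests reducing high-dimensional inputs before clustering.
        print("Note: features exceed 128-D; consider dimensionality reduction.")
    train_path = os.path.join(out_dir, "train_data.pt")
    test_path = os.path.join(out_dir, "test_data.pt")
    # Assumption: float tensors of shape (N, D).
    torch.save(train.float(), train_path)
    torch.save(test.float(), test_path)
    return train_path, test_path


# Stand-in features: replace with your own embeddings or raw data.
torch.manual_seed(0)
features = torch.randn(1000, 64)
out_dir = os.path.join(tempfile.mkdtemp(), "my_dataset")
train_path, test_path = save_custom_dataset(features[:800], features[800:], out_dir)

# Sanity check: reload the saved training tensor.
reloaded = torch.load(train_path)
print(tuple(reloaded.shape))
```

The directory written here is the one to pass via the `--dir` flag when launching DeepDPM.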
Inference
For loading a pretrained model from a saved checkpoint, and for an inference example, see: scripts\DeepDPM_load_from_checkpoint.py
Contributions, feature requests, suggestions, etc. are welcome.
If you use this code for your work, please cite the following:
@inproceedings{Ronen:CVPR:2022:DeepDPM,
title={DeepDPM: Deep Clustering With An Unknown Number of Clusters},
author={Ronen, Meitar and Finder, Shahaf E. and Freifeld, Oren},
booktitle={Conference on Computer Vision and Pattern Recognition},
year={2022}
}
DeepDPM: Deep Clustering With An Unknown Number of Clusters
This repo contains the official implementation of our CVPR 2022 paper, DeepDPM: Deep Clustering With An Unknown Number of Clusters.
![Examples of the clusters found by DeepDPM on the ImageNet dataset](https://www.gitlink.org.cn/api/dnrops/deepdpm/raw/ImageNet_cluster_examples/cluster_examples.jpg?raw=true&ref=main "Examples of the clusters found by DeepDPM on the ImageNet dataset")
To run DeepDPM on pretrained embeddings (including custom ones):
For example, for MNIST, run:
For the imbalanced case, set the data directory accordingly, e.g., for MNIST:
To run on STL10:
Note that there is no imbalanced version of STL10.
DeepDPM with the feature-extraction pipeline (jointly learning clustering and features):
For any questions: meitarr@post.bgu.ac.il