This repository offers a PyTorch implementation of a variant of the Deep Embedded Clustering (DEC) algorithm. The original code can be found at vlukiyanov/pt-dec. This implementation is compatible with PyTorch 1.0.0 and supports Python 3.6 and 3.7, with optional CUDA acceleration.
This follows (or attempts to follow; this implementation is unofficial) the algorithm described in "Unsupervised Deep Embedding for Clustering Analysis" by Junyuan Xie, Ross Girshick, and Ali Farhadi (https://arxiv.org/abs/1511.06335).
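The clustering objective from the paper can be sketched in a few lines. The following is a minimal, standard-library-only illustration of the soft assignment, the sharpened target distribution, and the KL loss as defined in the paper; it is not this repository's PyTorch code, and the function names and toy data are hypothetical.

```python
import math


def soft_assign(embeddings, centroids, alpha=1.0):
    """Student's t soft assignment q_ij between each embedding and each centroid."""
    q = []
    for z in embeddings:
        row = []
        for mu in centroids:
            dist_sq = sum((a - b) ** 2 for a, b in zip(z, mu))
            # Unnormalised similarity: (1 + ||z - mu||^2 / alpha) ^ (-(alpha + 1) / 2)
            row.append((1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([value / total for value in row])  # normalise over clusters
    return q


def target_distribution(q):
    """Sharpened targets p_ij: square q_ij, divide by the soft cluster frequency, renormalise."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]  # f_j = sum_i q_ij
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over all points and clusters."""
    return sum(
        p_ij * math.log(p_ij / q_ij)
        for p_row, q_row in zip(p, q)
        for p_ij, q_ij in zip(p_row, q_row)
        if p_ij > 0
    )


# Toy data (hypothetical): three 2-D embeddings and two centroids.
embeddings = [[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]]
centroids = [[0.0, 0.0], [2.0, 2.0]]

q = soft_assign(embeddings, centroids)
p = target_distribution(q)
print("loss:", kl_divergence(p, q))
```

In the full algorithm, the embeddings come from the pretrained encoder and the loss is minimised with respect to both the encoder weights and the centroids, with the targets recomputed periodically.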
To set up the environment for running this code, create a new Conda environment with Python 3.11:
conda create --name your_env_name python=3.11
Replace your_env_name with a name of your choice for the environment.
Once the environment is activated (conda activate your_env_name), run:
pip install -r requirements.txt
If Conda is not installed on your system, follow these steps to install Miniconda:
- Download the Miniconda installer:
cd ~
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
- Make the installer executable:
chmod +x Miniconda3-latest-Linux-x86_64.sh
- Run the installer:
./Miniconda3-latest-Linux-x86_64.sh
- Reload your shell configuration:
source ~/.bashrc
- Verify the installation:
conda --version
Refer to the official Miniconda documentation for more details.
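To decide whether the Miniconda steps above are needed at all, the check can be scripted. This is a small sketch using only the Python standard library; the helper name `conda_version` is hypothetical and not part of this repository.

```python
import shutil
import subprocess


def conda_version(executable="conda"):
    """Return the output of `<executable> --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some tools print their version to stderr, so fall back to it if stdout is empty.
    return (result.stdout or result.stderr).strip()


version = conda_version()
if version:
    print("Found:", version)
else:
    print("conda not found on PATH; install Miniconda first")
```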
- Transform data to Parquet format:
Run the following script to convert your data:
python tools/transform_to_parquet.py
- Generate embeddings:
Create embeddings for your data with:
python tools/generate_embeddings.py
- Train autoencoder and DEC models:
Use the main training script with customizable options:
python tcc.py [OPTIONS]
Key options:
--cuda: Use CUDA for acceleration (default: False)
--testing-mode: Run in testing mode (default: False)
--train-autoencoder: Train the autoencoder from scratch (True) or load an existing one (False) (default: True)
--sort-by-elem: Split data by "ElemDespesaTCE" and cluster each part separately (default: False)
Example usage:
python tcc.py --train-autoencoder False --sort-by-elem True
- Create the vector store:
Before running the app, generate the vector store:
python chroma_vector_store.py
- Launch the Streamlit app 🚀:
Start the application with:
streamlit run app.py
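The steps above run in a fixed order, so they can be driven from a single script. The following is a hedged sketch: the script paths and option names are taken from this README, while the helper functions (`build_tcc_command`, `pipeline_commands`, `run_pipeline`) are hypothetical. It assumes every `tcc.py` option accepts an explicit True/False value, as the example usage shows for `--train-autoencoder` and `--sort-by-elem`; only options that differ from their documented defaults are passed.

```python
import subprocess

# Documented defaults for the tcc.py options listed in this README.
TCC_DEFAULTS = {
    "--cuda": False,
    "--testing-mode": False,
    "--train-autoencoder": True,
    "--sort-by-elem": False,
}


def build_tcc_command(**overrides):
    """Build the tcc.py command, passing only options that differ from their defaults.

    Keyword names use underscores, e.g. train_autoencoder=False.
    """
    command = ["python", "tcc.py"]
    for flag, default in TCC_DEFAULTS.items():
        key = flag.lstrip("-").replace("-", "_")
        value = overrides.get(key, default)
        if value != default:
            command += [flag, str(value)]
    return command


def pipeline_commands(**tcc_overrides):
    """Return the README steps, in order, as argument lists."""
    return [
        ["python", "tools/transform_to_parquet.py"],
        ["python", "tools/generate_embeddings.py"],
        build_tcc_command(**tcc_overrides),
        ["python", "chroma_vector_store.py"],
        ["streamlit", "run", "app.py"],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


# Dry run: prints the commands without executing anything.
run_pipeline(pipeline_commands(train_autoencoder=False, sort_by_elem=True))
```

Calling `run_pipeline(..., dry_run=False)` from the repository root would execute the steps; the last command starts the Streamlit server and blocks until it is stopped.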
Other implementations of DEC:
- Original Caffe: https://github.com/piiswrong/dec
- PyTorch: https://github.com/CharlesNord/DEC-pytorch and https://github.com/eelxpeng/dec-pytorch
- Keras: https://github.com/XifengGuo/DEC-keras and https://github.com/fferroni/DEC-Keras
- MXNet: https://github.com/apache/incubator-mxnet/blob/master/example/deep-embedded-clustering/dec.py
- Chainer: https://github.com/ymym3412/DeepEmbeddedClustering