This repository contains datasets and code for reproducing the results in our paper.

ydcnanhe/Imbalanced-Data-Clustering-using-Equilibrium-K-Means

Repository Description

This repository contains datasets and code for reproducing the results in our paper

He Yudong, An Equilibrium Approach to Clustering: Surpassing Fuzzy C-Means on Imbalanced Data, IEEE Transactions on Fuzzy Systems, 2025.

He Yudong, Imbalanced Data Clustering Using Equilibrium K-Means, arXiv, 2024.

A Python package, "sklekmeans", that implements equilibrium k-means with an API compatible with scikit-learn estimators has been released. Check the repo, the documentation, or the PyPI project page for more details.

Equilibrium K-Means: A K-Means-type Clustering Algorithm for Imbalanced Data

The objective of EKM:

$$\min_{c_1,\cdots,c_K}\sum_{n=1}^N \left(\sum_{i=1}^K d^2_{in} e^{-\alpha d^2_{in}}\right)/\left(\sum_{i=1}^K e^{-\alpha d^2_{in}}\right),$$ where $c_1,\cdots,c_K$ are K centroids as centers of clusters, $d_{in}$ is the distance (usually the Euclidean distance) between $n$-th data point and $i$-th centroid, and $\alpha$ is a parameter to be tuned.
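To make the objective concrete, the following is a minimal pure-Python sketch that evaluates it for a given set of centroids. The function name `ekm_objective` and the toy data are illustrative assumptions, not part of this repository, and the code in "ekm.m" or "ekm_sklearn.py" may be organized differently.

```python
import math

def ekm_objective(points, centroids, alpha):
    """Evaluate the EKM objective (illustrative helper, not the repo's API)."""
    total = 0.0
    for x in points:
        # Squared Euclidean distances from x to every centroid.
        d2 = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids]
        # Shift by the smallest distance before exponentiating; the shift
        # cancels in the ratio and avoids underflow for large alpha.
        d2_min = min(d2)
        e = [math.exp(-alpha * (d - d2_min)) for d in d2]
        # Softmin-weighted average of the squared distances for this point.
        total += sum(d * w for d, w in zip(d2, e)) / sum(e)
    return total

# Toy data (assumed for illustration): two points, each sitting on a centroid.
points = [[0.0, 0.0], [2.0, 0.0]]
centroids = [[0.0, 0.0], [2.0, 0.0]]
print(ekm_objective(points, centroids, alpha=0.0))
print(ekm_objective(points, centroids, alpha=50.0))
```

With $\alpha=0$ every centroid is weighted equally, so each point contributes the plain average of its squared distances. As $\alpha$ grows, the weighted average concentrates on the nearest centroid and the value approaches the k-means sum of minimum squared distances.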

Optimization by a two-step iteration algorithm:

Step 1. Computing weights:

$$w_{kn}^{(\tau)}=\frac{e^{-\alpha (d_{kn}^{(\tau)})^2}}{\sum_i e^{-\alpha (d_{in}^{(\tau)})^2}} [1-\alpha((d_{kn}^{(\tau)})^2-\frac{\sum_i (d_{in}^{(\tau)})^2e^{-\alpha (d_{in}^{(\tau)})^2}}{\sum_i e^{-\alpha (d_{in}^{(\tau)})^2}})]$$
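A short check, derived only from the formula above, may help in reading these weights. The shorthand $s_{kn}$ and $\bar d_n^2$ is introduced here for illustration, and the iteration index $\tau$ is dropped for brevity. Write $s_{kn}=e^{-\alpha d_{kn}^2}/\sum_i e^{-\alpha d_{in}^2}$ and $\bar d_n^2=\sum_i s_{in} d_{in}^2$, so that $w_{kn}=s_{kn}\left[1-\alpha\left(d_{kn}^2-\bar d_n^2\right)\right]$. Since $\sum_k s_{kn}=1$,

$$\sum_{k=1}^K w_{kn}=\sum_{k=1}^K s_{kn}-\alpha\left(\sum_{k=1}^K s_{kn}d_{kn}^2-\bar d_n^2\sum_{k=1}^K s_{kn}\right)=1-\alpha\left(\bar d_n^2-\bar d_n^2\right)=1.$$

The weights of each data point therefore sum to one. Unlike ordinary membership values, an individual weight can be negative: for $\alpha>0$, $w_{kn}<0$ exactly when $d_{kn}^2>\bar d_n^2+1/\alpha$.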

Step 2. Computing weighted centroids:

$$c_k^{(\tau+1)}=\frac{\sum_n w_{kn}^{(\tau)} x_n}{\sum_n w_{kn}^{(\tau)}},$$ where $x_n$ is the $n$-th data point.

EKM terminates when the centroids cease to change (convergence) or when the maximum number of iterations is reached. One iteration of the two steps above has time complexity $O(NK^2P)$, where $P$ is the data dimension.
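The two-step iteration can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the names `sq_dist`, `ekm_weights`, and `ekm_fit`, the tolerance `tol`, the guard against a near-zero weight sum, and the toy data are all hypothetical choices made for this example.

```python
import math

def sq_dist(x, c):
    """Squared Euclidean distance between two points given as lists."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def ekm_weights(d2, alpha):
    """Step 1 for one data point: weights from its squared distances d2."""
    d2_min = min(d2)  # shift for numerical stability; cancels in the ratio
    e = [math.exp(-alpha * (d - d2_min)) for d in d2]
    z = sum(e)
    s = [v / z for v in e]                   # softmin of the distances
    avg = sum(d * p for d, p in zip(d2, s))  # softmin-weighted mean distance
    return [p * (1.0 - alpha * (d - avg)) for p, d in zip(s, d2)]

def ekm_fit(points, init_centroids, alpha, max_iter=100, tol=1e-12):
    """Alternate Step 1 and Step 2 until the centroids stop moving."""
    centroids = [list(c) for c in init_centroids]
    dim = len(points[0])
    for _ in range(max_iter):
        # Step 1: one weight per (point, centroid) pair.
        weights = [ekm_weights([sq_dist(x, c) for c in centroids], alpha)
                   for x in points]
        # Step 2: weighted centroids.
        new_centroids = []
        for k, old in enumerate(centroids):
            wk = [w[k] for w in weights]
            total = sum(wk)
            if abs(total) < 1e-12:
                # Assumed safeguard: keep the old centroid if the weights cancel.
                new_centroids.append(old)
                continue
            new_centroids.append(
                [sum(w * x[p] for w, x in zip(wk, points)) / total
                 for p in range(dim)])
        shift = max(sq_dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    # Assign each point to its nearest centroid.
    labels = [min(range(len(centroids)), key=lambda k: sq_dist(x, centroids[k]))
              for x in points]
    return centroids, labels

# Toy imbalanced data (assumed): six points near the origin, two far away.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.0], [0.5, 1.0],
          [6.0, 6.0], [6.0, 7.0]]
centroids, labels = ekm_fit(points, [[0.0, 0.0], [6.0, 6.0]], alpha=1.0)
print(centroids)
print(labels)
```

On this well-separated toy set the weights are nearly hard assignments, so the centroids settle close to the means of the two groups. The sketch favors readability over speed and loops over points and centroids explicitly.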

How To Use (Python)

The Python version is in the "python" folder. EKM is packaged in "ekm_sklearn.py" in the scikit-learn style, and examples and benchmarks are provided alongside it.

How To Use (Matlab)

Install MATLAB R2022a or later, and download this repository to a local directory.

Clustering a Dataset

EKM is implemented in the MATLAB function "ekm.m", located in the "algorithms" folder.

Below is an example of using EKM to cluster the iris dataset.

rng(0) % for reproducibility
addpath("./algorithms")
addpath("./metrics")
data = iris_dataset; % load the iris dataset
data = data(1:2,:); % only use the first two features for clustering the Iris dataset
data=data';

% normalization
for p=1:2
    data(:,p)=data(:,p)-mean(data(:,p));
    data(:,p)=data(:,p)/std(data(:,p));
end

% create labels of iris dataset
label_iris = ones(150,1);
label_iris(51:100)=2;
label_iris(101:150)=3;

% scatter plot of Iris
figure;
gscatter(data(:,1), data(:,2), label_iris);
title('Iris','FontSize',15)
xlabel('Normalized feature 1','FontSize',15)
ylabel('Normalized feature 2','FontSize',15)
legend off

% clustering by EKM
alpha=1;
K=3; % # of clusters
[label_ekm,C]=ekm(data,K);

% scatter diagram of EKM clustering
figure;
gscatter(data(:,1), data(:,2), label_ekm);
hold on
plot(C(:,1),C(:,2),'k+','MarkerSize',15,'LineWidth',3)
title('EKM clustering for Iris','FontSize',15)
xlabel('Normalized feature 1','FontSize',15)
ylabel('Normalized feature 2','FontSize',15)
legend off

Reproducing the Experiments

To replicate the experiments in the original paper, first put the "reproduction", "algorithms", and "metrics" folders in the same directory (e.g., D:/git-EKM). Open MATLAB R2022a or later, set the working directory to "D:/git-EKM", and run:

addpath("./algorithms")
addpath("./metrics")

Then change the MATLAB working directory to "D:/git-EKM/reproduction".

To replicate the clustering result on the "Ecoli" dataset, enter the following command in the Command Window:

clustering_Ecoli

After the program finishes, the generated result files and folders should appear in "./ecoli".

To see the average clustering quality, run

log_avg_best

To see the average running time and the number of iterations, run

log_time

Citation Request

If you find this repo helpful, please cite our papers:

@article{he2025equilibrium,
  title={An Equilibrium Approach to Clustering: Surpassing Fuzzy C-Means on Imbalanced Data},
  author={He, Yudong},
  journal={IEEE Transactions on Fuzzy Systems},
  year={2025},
  publisher={IEEE}
}

@article{he2024imbalanced,
  title={Imbalanced Data Clustering using Equilibrium K-Means},
  author={He, Yudong},
  journal={arXiv preprint arXiv:2402.14490},
  year={2024}
}
