Drift Detection and Embedding Tracking Framework

This project provides a framework for detecting data drift and tracking embedding distributions using various vector-based and distribution-based metrics. It supports multiple models, datasets, and drift strengths, enabling experiments and visualisations.

The workflow covers running LLM and deep-learning (DL) experiments, creating embeddings, and testing vector-based and distribution-based drift detection on synthetic and real-world streams.

Features

Drift Detection: Supports both vector-based and distribution-based metrics for detecting data drift.
Embedding Tracking: Tracks embeddings using KLL sketches, histograms, and PCA-based dimensionality reduction.
Metrics: Includes metrics such as KL divergence, Jensen-Shannon divergence, Wasserstein distance, and others.
Visualisation: Generates detailed plots for analysing drift detection results.
Extensibility: Easily add new models, datasets, or metrics.
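
As an illustration of the distribution-based metrics listed above, the following is a minimal, dependency-free sketch of KL and Jensen-Shannon divergence over normalised histograms. The function names (histogram, kl_divergence, js_divergence) and the eps smoothing are assumptions made for this example; they are not taken from metrics.py.

```python
import math


def histogram(values, bins, lo, hi):
    """Bin values into `bins` equal-width buckets over [lo, hi] and normalise."""
    if not values:
        raise ValueError("values must be non-empty")
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp out-of-range values into the first or last bucket.
        idx = max(0, min(int((v - lo) / width), bins - 1))
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps (an assumption) guards against empty bins in q."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Jensen-Shannon divergence is often the easier of the two to threshold because it is symmetric and stays finite even when the two histograms do not overlap.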

File Structure

Core Files

distribution_experiment.py: Runs experiments focused on distribution-based metrics.
drift_detection.py: Implements drift detection using both vector-based and distribution-based approaches.
embedding_tracker.py: Tracks embeddings and computes distances using various methods.
metrics.py: Defines vector-based and distribution-based metrics.
plot.py: Generates visualisations for experiment results.
utils.py: Utility functions for data loading, embedding extraction, and drift introduction.

Configuration

config.py: Contains default arguments for models, datasets, and experiment parameters.

Experiments

vector_experiment.py: Placeholder for vector-based experiments (under development).

Tests

tests/: Contains unit tests for KLL transformations and other components.

Results

Experiment results and plots are saved in the data/ directory.

Installation

Install dependencies:

pip install -r requirements.txt

Usage

Running Experiments

Run the distribution_experiment.py script to evaluate distribution-based metrics:

python distribution_experiment.py

or, for the vector-based experiments (still under development):

python vector_experiment.py

Run the drift_detection.py script to evaluate both vector-based and distribution-based metrics:

python drift_detection.py

Generating Plots

To generate visualisations for experiment results:

python plot.py

Configuration

Modify config.py to customise models, datasets, and experiment parameters.

Key Parameters

Models: Specify models in config.py (e.g., distilbert-base-uncased, google/mobilebert-uncased).
Datasets: Add datasets in config.py (e.g., ag_news).
Drift Strengths: Control the level of drift introduced during experiments.
Metrics: Choose from vector-based (e.g., Euclidean, cosine) or distribution-based (e.g., KL divergence, Wasserstein).
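
To make the parameters above concrete, here is a hedged sketch of how such defaults could be organised. The ExperimentConfig class, its field names, the drift strength values, and the experiment_grid helper are all hypothetical; only the model and dataset names come from this README, and the real config.py may be structured differently.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class ExperimentConfig:
    # Model and dataset names are the examples mentioned in this README.
    models: list = field(
        default_factory=lambda: ["distilbert-base-uncased", "google/mobilebert-uncased"]
    )
    datasets: list = field(default_factory=lambda: ["ag_news"])
    # Hypothetical values: 0.0 means no drift, 1.0 means maximum drift.
    drift_strengths: list = field(default_factory=lambda: [0.0, 0.25, 0.5, 1.0])
    # Hypothetical metric identifiers.
    metrics: list = field(
        default_factory=lambda: ["euclidean", "cosine", "kl_divergence", "wasserstein"]
    )


def experiment_grid(cfg):
    """Enumerate every (model, dataset, drift_strength) combination."""
    return list(product(cfg.models, cfg.datasets, cfg.drift_strengths))
```

With these assumed defaults, experiment_grid(ExperimentConfig()) yields one entry per model, dataset, and drift strength combination, which is a convenient unit for looping over experiment runs.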

Outputs

Results: Saved as results.json in the data/ directory.
Plots: Includes similarity trends, memory usage, and overhead comparisons.

Baseline Experiment

The baseline experiment is located in the baseline-experiment/ directory. It contains the following file:

baseline_experiment.py: The main script for running the baseline experiment. Before running it, download and unzip the Amazon dataset and place it, together with the product mapper, in the data/ directory.


Visual Overview

This section provides visualisations illustrating concepts and workflows from the framework:

Image 1: Controlled Simulation of Embedding Drift. Text data is shuffled token-wise for LLMs, while tabular features undergo incremental shifts for DeepFM.
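
The controlled drift described in Image 1 can be sketched roughly as follows. This is an assumption-laden illustration, not the code in utils.py: shuffle_tokens, shift_features, and their parameters (strength, seed, offset) are hypothetical names, and tokens are approximated by whitespace splitting rather than a model tokenizer.

```python
import random


def shuffle_tokens(text, strength, seed=0):
    """Permute a fraction `strength` (0.0 to 1.0) of whitespace tokens among themselves."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    tokens = text.split()
    rng = random.Random(seed)
    k = round(strength * len(tokens))
    # Pick k positions, then redistribute their tokens over the same positions.
    positions = rng.sample(range(len(tokens)), k)
    targets = positions[:]
    rng.shuffle(targets)
    out = tokens[:]
    for src, dst in zip(positions, targets):
        out[dst] = tokens[src]
    return " ".join(out)


def shift_features(rows, strength, offset=1.0):
    """Add an incremental shift of strength * offset to every tabular feature."""
    return [[x + strength * offset for x in row] for row in rows]
```

A strength of 0.0 leaves the input unchanged, so the same code path can produce both the reference stream and the drifted streams.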

Image 2: Embedding Drift Detection via Full and Compressed Representations. Comparison of histograms and KLL-based summaries for detecting embedding shifts.
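
The idea of comparing compressed summaries instead of full embeddings can be illustrated with quantiles. In the sketch below, exact quantiles computed from sorted samples stand in for what a KLL sketch would approximate in bounded memory; the names quantile_summary and wasserstein_from_quantiles, and the choice of k, are assumptions for this example. It operates on one scalar coordinate of the embeddings at a time, for example a single dimension.

```python
def quantile_summary(values, k=32):
    """Return k evenly spaced quantiles of `values`.

    Computed exactly here; a KLL sketch would return approximate
    quantiles without storing all of the values.
    """
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(int((i + 0.5) / k * n), n - 1)] for i in range(k)]


def wasserstein_from_quantiles(qa, qb):
    """Approximate the 1-D Wasserstein-1 distance as the mean absolute quantile gap."""
    if len(qa) != len(qb):
        raise ValueError("summaries must have the same length")
    return sum(abs(a - b) for a, b in zip(qa, qb)) / len(qa)
```

This works because the one-dimensional Wasserstein-1 distance equals the integral of the absolute difference between the two quantile functions, so a fixed-size quantile summary is enough to estimate it.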

Image 3: Distance-Based Embedding Drift Detection. Geometric shifts tracked over time using distance metrics.
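
One simple way to track geometric shifts over time is to compare the centroid of a reference batch of embeddings with the centroid of each later window. The sketch below assumes this centroid-based approach and hypothetical names (centroid, euclidean, cosine_distance, track_drift); the framework's own implementation in embedding_tracker.py may differ.

```python
import math


def centroid(vectors):
    """Mean vector of a batch of equal-length embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (na * nb)


def track_drift(reference, windows, metric=euclidean):
    """Distance from the reference centroid to each window's centroid, in order."""
    ref = centroid(reference)
    return [metric(ref, centroid(w)) for w in windows]
```

Plotting the returned list against the window index gives a drift curve; a sustained rise above the values seen on undrifted data suggests the embeddings have moved.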
