39 changes: 33 additions & 6 deletions tkyo-drift/README.md
@@ -24,7 +24,7 @@ In production, even minor changes to prompts, model weights, or input phrasing c

And it’s not just the model: user language evolves too. New slang, trending phrases, or tone shifts may emerge that your model wasn't trained on, and without observability you'll miss them.

TKYO Drift embeds each message and compares it against a configurable baseline using **Cosine similarity**, **Euclidean distance**, and scalar features like **punctuation density**, **entropy**, and more. The result is a continuous record of how your models and users behavior changes over time.
TKYO Drift embeds each message and compares it against a configurable baseline using **Cosine similarity**, **Euclidean distance**, and scalar features like **punctuation density**, **entropy**, and more. The result is a continuous record of how your model's and users' behavior changes over time.

Use it to answer questions like:

@@ -114,10 +114,14 @@ tkyoDrift(userSubmission, 'input')

5. Enjoy the benefits of having drift detection:

```
🏎️☁️☁️☁️ <- THAT GUY IS DRIFTING
```bash
npx tkyo cos
npx tkyo scalar
🏎️☁️☁️☁️ ← THAT GUY IS DRIFTING
```

This library creates a `tkyoData` folder at the project root. Add it to your `.gitignore`, as it may contain large files depending on your throughput. All logs, scalars, and binary files tkyoDrift needs to operate are placed there.

# How do you use this thing?

You can interact with this library in a couple of ways:
@@ -132,6 +136,22 @@ You can interact with this library in a couple ways;

The util folder also contains a small downloader script, `downloadTrainingData.py`, that you can run to fetch training data from Hugging Face if your workflow uses a model hosted there.

## Configuration via Environment Variables

TKYO Drift supports configuration via environment variables for deployment flexibility. You can set the following variables:

- `TEXT_LOGGING`: Set to `false` to disable logging of input text. Default is `true`.
- `OUTPUT_DIR`: Set the output directory for all drift data. Default is `./tkyoData`.

Example usage (in your shell or `.env` file):

```bash
export TEXT_LOGGING=false
export OUTPUT_DIR=/custom/path/for/tkyoData
```

If not set, the defaults in `util/config.js` will be used.
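
As a rough illustration of how these two variables resolve, here is a minimal sketch. `resolveConfig` is a hypothetical helper written for this example, not part of the library, and it skips the absolute-path resolution a real config would likely apply. It assumes that only the exact string `false` disables text logging, so values such as `FALSE` or `0` leave logging on.

```javascript
// Hypothetical helper (not part of tkyoDrift): resolves settings from an
// env-like object so the defaults and overrides are easy to see.
function resolveConfig(env) {
  return {
    // Assumption: only the exact string 'false' disables text logging.
    enableTextLogging: env.TEXT_LOGGING !== 'false',
    // Falls back to the documented default when OUTPUT_DIR is unset or empty.
    outputDir: env.OUTPUT_DIR || './tkyoData',
  };
}

// Nothing set: documented defaults apply.
console.log(resolveConfig({}));

// Both variables overridden.
console.log(resolveConfig({ TEXT_LOGGING: 'false', OUTPUT_DIR: '/custom/path/for/tkyoData' }));

// 'FALSE' is not the exact string 'false', so logging stays enabled here.
console.log(resolveConfig({ TEXT_LOGGING: 'FALSE' }).enableTextLogging);
```

Passing the environment in as an argument, rather than reading `process.env` directly, keeps the sketch easy to test.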

## One-off Ingestion

Usage: Add `tkyoDrift.js(text, ioType)` in your file, along with an import statement.
@@ -304,7 +324,7 @@ Again, the second argument is the key for the object you would like to embed and

## Logging

Results are stored in two CSV files (`COS_log.csv` & `EUC_log.csv`) with dynamic headers. Each one-off run appends one row to each file. Keep in mind that training data is not added to the log, as the assumption is that your training baseline is what we compare against to measure drift.
Results are stored in three CSV files (`COS_log.csv`, `EUC_log.csv` & `text_log.csv`) with dynamic headers. Each one-off run appends one row to each file. Keep in mind that training data is not added to the log, as the assumption is that your training baseline is what we compare against to measure drift.

### Format

@@ -320,9 +340,16 @@ For the euclidean distance log:
ID, TIMESTAMP, I/O TYPE, SEMANTIC ROLLING EUC, SEMANTIC TRAINING EUC, CONCEPT ROLLING EUC...
```

For the text input log:

```
ID, TEXT
```

- Cosine similarities and Euclidean distances are recorded per model and baseline type.
- Additional metadata like ioType, date and UUIDs are included for tracking.
- Neither the log, nor the binary files, contain your users input or AI outputs. This data is not necessary to calculate drift, and its exclusion is an intentional choice for data privacy.
- Text inputs are logged in a separate `text_log.csv` file for debugging and analysis purposes. This is separate from the drift calculation logs and binary files.
- The binary files contain only the embeddings and do not store the original text inputs or AI outputs.

Note: if you add, remove, or rename model types in the tkyoDrift tracker, the log will break, so clear any existing logs after altering the embedding model names. For example, if you rename your conceptual embedding model from "concept" to "vibes", the makeLogEntry method of the Drift Class would still write to the log, but the log parser would fail.
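
One way to catch this before it bites is to compare the header already on disk with the header your current model names would produce. The sketch below is illustrative only: `buildHeader` and `headerMatches` are hypothetical helpers, and the column pattern (a rolling and a training column per model, comma-plus-space separated) is an assumption extrapolated from the truncated format shown above.

```javascript
// Hypothetical helper: builds the header line a log would have for a given
// set of model names. The per-model column pattern is an assumption.
function buildHeader(modelNames, suffix) {
  const columns = ['ID', 'TIMESTAMP', 'I/O TYPE'];
  for (const name of modelNames) {
    const label = name.toUpperCase();
    columns.push(`${label} ROLLING ${suffix}`, `${label} TRAINING ${suffix}`);
  }
  return columns.join(', ');
}

// Hypothetical helper: true when the existing header line still matches the
// header implied by the current model names.
function headerMatches(existingHeaderLine, modelNames, suffix) {
  return existingHeaderLine.trim() === buildHeader(modelNames, suffix);
}

const existing =
  'ID, TIMESTAMP, I/O TYPE, SEMANTIC ROLLING EUC, SEMANTIC TRAINING EUC, ' +
  'CONCEPT ROLLING EUC, CONCEPT TRAINING EUC';

console.log(headerMatches(existing, ['semantic', 'concept'], 'EUC')); // true
console.log(headerMatches(existing, ['semantic', 'vibes'], 'EUC')); // false
```

If the check returns `false`, clearing or archiving the old log before the next run avoids mixing rows with mismatched columns.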

@@ -501,7 +528,7 @@ The result is a value between -1 and 1. For normalized embedding vectors (as use
- `1.0` → Identical direction (no drift)
- `0.0` → Orthogonal (maximum drift)

Normalization ensures magnitude doesnt influence the result, so only the _direction_ of the vector matters. Additionally, we are calculating the Euclidean Distance. This metric is not scale-invariant and is typically larger in magnitude. Its useful in conjunction with cosine similarity to detect both directional and magnitude-based drift.
Normalization ensures magnitude doesn't influence the result, so only the _direction_ of the vector matters. Additionally, we are calculating the Euclidean Distance. This metric is not scale-invariant and is typically larger in magnitude. It's useful in conjunction with cosine similarity to detect both directional and magnitude-based drift.
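
To make the difference between the two metrics concrete, here is a small sketch using toy 2-D vectors. The function names are illustrative and are not the library's internals; the vectors are deliberately unnormalized so the effect of magnitude is visible.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|). Depends only on direction.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Euclidean distance: sqrt(sum((a_i - b_i)^2)). Sensitive to magnitude.
function euclideanDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

const baseline = [1, 0];
const sameDirection = [3, 0]; // same direction, three times the length
const orthogonal = [0, 1]; // same length, rotated 90 degrees

console.log(cosineSimilarity(baseline, sameDirection)); // 1
console.log(euclideanDistance(baseline, sameDirection)); // 2
console.log(cosineSimilarity(baseline, orthogonal)); // 0
console.log(euclideanDistance(baseline, orthogonal)); // about 1.414
```

The first pair shows why both metrics are tracked: cosine similarity reports no change, while the Euclidean distance picks up the difference in magnitude.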

## How we get the Baseline (B)

23 changes: 23 additions & 0 deletions tkyo-drift/config.js
@@ -0,0 +1,23 @@
import path from 'path';

// TKYO Drift configuration file
//
// You can override the following settings using environment variables:
// - TEXT_LOGGING: Set to 'false' to disable text input logging (default: true)
// - OUTPUT_DIR: Set the output directory for all drift data (default: './tkyoData')
//
// The models object is static. To add or change models, edit this file directly.

export const config = {
  // List of transformer models to use for drift analysis. Edit this object to add/remove models.
  models: {
    mini: 'Xenova/all-MiniLM-L12-v2',
    e5: 'Xenova/e5-base-v2'
  },

  // Enable or disable logging of input text. Set TEXT_LOGGING=false in your environment to disable.
  enableTextLogging: process.env.TEXT_LOGGING !== 'false',

  // Output directory for all drift data. Set OUTPUT_DIR in your environment to override.
  outputDir: path.resolve(process.env.OUTPUT_DIR || './tkyoData')
};
39 changes: 39 additions & 0 deletions tkyo-drift/getHFTrainingData.py
@@ -0,0 +1,39 @@
"""
Utility module for downloading and loading datasets from Hugging Face.
This module provides functionality to download training data from Hugging Face
datasets and store them in the local cache.
"""

# Prevent __pycache__ creation, since these scripts only run on demand
import sys
sys.dont_write_bytecode = True
from datasets import load_dataset

# Default dataset to load
data_location = "SmallDoge/SmallThoughts"

def dataSetLoader(data_location):
    """
    Load a dataset from Hugging Face and store it in the local cache.

    This function downloads the specified dataset from Hugging Face and
    stores it in the user's ~/.cache folder. The dataset can then be used
    for training or evaluation purposes.

    Args:
        data_location (str): The Hugging Face dataset identifier (e.g., 'username/dataset-name')

    Returns:
        Dataset: The loaded Hugging Face dataset object

    Example:
        >>> dataset = dataSetLoader("SmallDoge/SmallThoughts")
        >>> print(dataset)
    """
    # Load the dataset named by the argument rather than a hard-coded identifier
    dataset = load_dataset(data_location)
    print(dataset)
    return dataset

# Load the default dataset when the script is run directly
if __name__ == "__main__":
    dataSetLoader(data_location)
28 changes: 8 additions & 20 deletions tkyo-drift/package-lock.json

7 changes: 2 additions & 5 deletions tkyo-drift/package.json
@@ -1,9 +1,9 @@
{
"name": "tkyodrift",
"version": "1.0.7",
"version": "1.1.0",
"description": "Lightweight CLI tool and library for detecting AI model drift using embeddings and scalar metrics. Tracks semantic, conceptual, and lexical change over time.",
"main": "./tkyoDrift.js",
"bin":{
"bin": {
"tkyo": "./tkyoDrift.js"
},
"types": "./tkyo.d.ts",
@@ -16,9 +16,6 @@
"ai-monitoring",
"embedding",
"model-drift",
"semantic-drift",
"concept-drift",
"lexical-drift",
"ai-evaluation",
"machine-learning",
"transformers",