diff --git a/tkyo-drift/README.md b/tkyo-drift/README.md index f833a78..1ecca99 100644 --- a/tkyo-drift/README.md +++ b/tkyo-drift/README.md @@ -24,7 +24,7 @@ In production, even minor changes to prompts, model weights, or input phrasing c And it’s not just the model: user language evolves too. New slang, trending phrases, or tone shifts may emerge that your model wasn't trained on and without observability, you'll miss them. -TKYO Drift embeds each message and compares it against a configurable baseline using **Cosine similarity**, **Euclidean distance**, and scalar features like **punctuation density**, **entropy**, and more. The result is a continuous record of how your model’s and users’ behavior changes over time. +TKYO Drift embeds each message and compares it against a configurable baseline using **Cosine similarity**, **Euclidean distance**, and scalar features like **punctuation density**, **entropy**, and more. The result is a continuous record of how your model's and users' behavior changes over time. Use it to answer questions like: @@ -114,10 +114,14 @@ tkyoDrift(userSubmission, 'input') 5. Enjoy the benefits of having drift detection: -``` -🏎️☁️☁️☁️ <- THAT GUY IS DRIFTING +```bash +npx tkyo cos +npx tkyo scalar +🏎️☁️☁️☁️ ← THAT GUY IS DRIFTING ``` +This library will create a tkyoData folder at the project root! Don't forget to add it to your `.gitignore`, as it may contain large files depending on your throughput. All logs, scalars, and binary files tkyoDrift needs to operate will be placed there. + # How do you use this thing? You can interact with this library in a couple ways; @@ -132,6 +136,22 @@ You can interact with this library in a couple ways; There is also a small training file downloader script in the util folder called downloadTrainingData.py that you can run to grab the training data from hugging face if you happen to be using a model for your workflow from there.
+## Configuration via Environment Variables + +TKYO Drift supports configuration via environment variables for deployment flexibility. You can set the following variables: + +- `TEXT_LOGGING`: Set to `false` to disable logging of input text. Default is `true`. +- `OUTPUT_DIR`: Set the output directory for all drift data. Default is `./tkyoData`. + +Example usage (in your shell or `.env` file): + +```bash +export TEXT_LOGGING=false +export OUTPUT_DIR=/custom/path/for/tkyoData +``` + +If not set, the defaults in `config.js` will be used. + ## One-off Ingestion Usage: Add `tkyoDrift.js(text, ioType)` in your file, along with an import statement. @@ -304,7 +324,7 @@ Again, the second argument is the key for the object you would like to embed and ## Logging -Results are stored in two CSV files (`COS_log.csv` & `EUC_log.csv`) with dynamic headers. Each one-off run appends one row to each file. Keep in mind that training data is not added to the log, as the assumption is that your training baseline is what we compare against to measure drift. +Results are stored in three CSV files (`COS_log.csv`, `EUC_log.csv` & `text_log.csv`) with dynamic headers. Each one-off run appends one row to each file. Keep in mind that training data is not added to the log, as the assumption is that your training baseline is what we compare against to measure drift. @@ -320,9 +340,16 @@ For the euclidean distance log: ID, TIMESTAMP, I/O TYPE, SEMANTIC ROLLING EUC, SEMANTIC TRAINING EUC, CONCEPT ROLLING EUC... ``` +For the text input log: + +``` +ID, TIMESTAMP, TEXT +``` + - Cosine similarities and euclidean distances are recorded per model and baseline type. - Additional metadata like ioType, date and UUIDs are included for tracking. -- Neither the log, nor the binary files, contain your users input or AI outputs. This data is not necessary to calculate drift, and its exclusion is an intentional choice for data privacy.
+- Text inputs are logged in a separate `text_log.csv` file for debugging and analysis purposes. This is separate from the drift calculation logs and binary files, and can be disabled by setting `TEXT_LOGGING=false` if you prefer not to store raw text. +- The binary files contain only the embeddings and do not store the original text inputs or AI outputs. Note: if you add or remove model types to the tkyoDrift tracker, the log will break. Please ensure you clear any existing logs after altering the embedding model names. What we mean here is that if you change your conceptual embedding model from "concept" to "vibes", when writing to the log the makeLogEntry method of the Drift Class would work, but the log parser would fail. @@ -501,7 +528,7 @@ The result is a value between -1 and 1. For normalized embedding vectors (as use - `1.0` β†’ Identical direction (no drift) - `0.0` β†’ Orthogonal (maximum drift) -Normalization ensures magnitude doesn’t influence the result, so only the _direction_ of the vector matters. Additionally, we are calculating the Euclidean Distance. This metric is not scale-invariant and is typically larger in magnitude. It’s useful in conjunction with cosine similarity to detect both directional and magnitude-based drift. +Normalization ensures magnitude doesn't influence the result, so only the _direction_ of the vector matters. Additionally, we are calculating the Euclidean Distance. This metric is not scale-invariant and is typically larger in magnitude. It's useful in conjunction with cosine similarity to detect both directional and magnitude-based drift.
## How we get the Baseline (B) diff --git a/tkyo-drift/config.js b/tkyo-drift/config.js new file mode 100644 index 0000000..effbd3a --- /dev/null +++ b/tkyo-drift/config.js @@ -0,0 +1,23 @@ +import path from 'path'; + +// TKYO Drift configuration file +// +// You can override the following settings using environment variables: +// - TEXT_LOGGING: Set to 'false' to disable text input logging (default: true) +// - OUTPUT_DIR: Set the output directory for all drift data (default: './tkyoData') +// +// The models object is static. To add or change models, edit this file directly. + +export const config = { + // List of transformer models to use for drift analysis. Edit this object to add/remove models. + models: { + mini: 'Xenova/all-MiniLM-L12-v2', + e5: 'Xenova/e5-base-v2' + }, + + // Enable or disable logging of input text. Set TEXT_LOGGING=false in your environment to disable. + enableTextLogging: process.env.TEXT_LOGGING === 'false' ? false : true, + + // Output directory for all drift data. Set OUTPUT_DIR in your environment to override. + outputDir: path.resolve(process.env.OUTPUT_DIR || './tkyoData') +}; \ No newline at end of file diff --git a/tkyo-drift/getHFTrainingData.py b/tkyo-drift/getHFTrainingData.py new file mode 100644 index 0000000..0f26869 --- /dev/null +++ b/tkyo-drift/getHFTrainingData.py @@ -0,0 +1,39 @@ +""" +Utility module for downloading and loading datasets from Hugging Face. +This module provides functionality to download training data from Hugging Face +datasets and store them in the local cache. +""" + +# Prevent _pycache_ creation, since these scripts only run on demand +import sys +sys.dont_write_bytecode = True +from datasets import load_dataset + +# Default dataset to load +data_location = "SmallDoge/SmallThoughts" + +def dataSetLoader(data_location): + """ + Load a dataset from Hugging Face and store it in the local cache. 
+ + This function downloads the specified dataset from Hugging Face and + stores it in the user's ~/.cache folder. The dataset can then be used + for training or evaluation purposes. + + Args: + data_location (str): The Hugging Face dataset identifier (e.g., 'username/dataset-name') + + Returns: + Dataset: The loaded Hugging Face dataset object + + Example: + >>> dataset = dataSetLoader("SmallDoge/SmallThoughts") + >>> print(dataset) + """ + dataset = load_dataset(data_location) + print(dataset) + return dataset + +# Load the default dataset when the script is run directly +if __name__ == "__main__": + dataSetLoader(data_location) \ No newline at end of file diff --git a/tkyo-drift/package-lock.json b/tkyo-drift/package-lock.json index 725ce30..40fec83 100644 --- a/tkyo-drift/package-lock.json +++ b/tkyo-drift/package-lock.json @@ -1,21 +1,23 @@ { - "name": "tkyodrifttest1", - "version": "1.0.0", + "name": "tkyodrift", + "version": "1.0.7", "lockfileVersion": 3, "requires": true, "packages": { "": { - "name": "tkyodrifttest1", - "version": "1.0.0", - "license": "ISC", + "name": "tkyodrift", + "version": "1.0.7", + "license": "MIT", "dependencies": { "@xenova/transformers": "^2.17.2", "chalk": "^5.4.1", "cli-table3": "^0.6.5", "fs": "^0.0.1-security", "path": "^0.12.7", - "tkyodrifttest1": "^1.0.0", "uuid": "^11.1.0" + }, + "bin": { + "tkyo": "tkyoDrift.js" } }, "node_modules/@colors/colors": { @@ -938,20 +940,6 @@ "b4a": "^1.6.4" } }, - "node_modules/tkyodrifttest1": { - "version": "1.0.0", - "resolved": "https://registry.npmjs.org/tkyodrifttest1/-/tkyodrifttest1-1.0.0.tgz", - "integrity": "sha512-475elQaD3QMC4zrVPba9k6783lVTBXm2wmjl0SGB7+S/zzw+fi+P2SmrJyzr9IDNA3DpwBNb+o6kamxFiscT+A==", - "license": "ISC", - "dependencies": { - "@xenova/transformers": "^2.17.2", - "chalk": "^5.4.1", - "cli-table3": "^0.6.5", - "fs": "^0.0.1-security", - "path": "^0.12.7", - "uuid": "^11.1.0" - } - }, "node_modules/tunnel-agent": { "version": "0.6.0", "resolved":
"https://registry.npmjs.org/tunnel-agent/-/tunnel-agent-0.6.0.tgz", diff --git a/tkyo-drift/package.json b/tkyo-drift/package.json index f0d5892..27898a4 100644 --- a/tkyo-drift/package.json +++ b/tkyo-drift/package.json @@ -1,9 +1,9 @@ { "name": "tkyodrift", - "version": "1.0.7", + "version": "1.1.0", "description": "Lightweight CLI tool and library for detecting AI model drift using embeddings and scalar metrics. Tracks semantic, conceptual, and lexical change over time.", "main": "./tkyoDrift.js", - "bin":{ + "bin": { "tkyo": "./tkyoDrift.js" }, "types": "./tkyo.d.ts", @@ -16,9 +16,6 @@ "ai-monitoring", "embedding", "model-drift", - "semantic-drift", - "concept-drift", - "lexical-drift", "ai-evaluation", "machine-learning", "transformers", diff --git a/tkyo-drift/tkyoDrift.js b/tkyo-drift/tkyoDrift.js index eb25e72..7b4880e 100755 --- a/tkyo-drift/tkyoDrift.js +++ b/tkyo-drift/tkyoDrift.js @@ -1,9 +1,17 @@ #!/usr/bin/env node + +/** + * Main entry point for the TKYO Drift CLI tool. + * This module provides the command-line interface for drift analysis, + * including cosine similarity analysis, scalar metric comparison, + * and training data processing. 
+ */ + /*@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@*::::-%@@#:..-:..+@@@@@@@@@%#++==+*%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%##########********#######%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@+:#@@@@@@=.-=+#++:..:=+*##*=..*%%@@@@%#=:@@@@@@@@@@%**@@@@@@@@@@@%#+=-::..::-==+**######%%%%%%%@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%#:.-@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @@@@@@@@@@@@@@@@@@@@@@@@@@@+%@@@@@@@@@@:.-:=#@@@@@@@@@@@@@%%=%@@@@@@%*.:-*%%%%*-*+=-:.:-=+*#%%######%%%%###***+++=====--------======+++++++****####%%%%@@%%#=--:+@@@@@@@@@@@@@@@@@@@@@@@=-@@@@@@@@@@@@@@@@@@@@@@@ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@+*@@@@@@@@@@@@@@@@@@@@@%%%@@@@@@%-%%%+..=#%%#####*+=--=+**##%%%%%#*++=--::.........................................:%@%.....:...:::-=+*#%@@@@@@@@@**:@@@@@@@@@@@@@@@@@@@@@@@ -@@@@@@@@@#%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%@@@@@%#%#:-%####%%#**+=-:. -%#. :.....::::::::::::::::::-*.@@@@@@@@@@@@@@@@@@@@@@@ +@@@@@#%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%@@@@@%#%#:-%####%%#**+=-:. -%#. :.....::::::::::::::::::-*.@@@@@@@@@@@@@@@@@@@@@@@ @@@@@@@@@@@.#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%@@%@@@%%=:%%*.. .%@:..:.....::::::::::::::::=-:@@@@@@@@@@@@@@@@@@@@@@@ @@@@@@@@@@@@=-@@*@@@@@@@@@@@@@@@@@@@@@%*==-:.-#@@@@@@@@@@@@@@@@@@@@@%:#%+.. .************=:*****:.=****+-+****===*####=--+#%@@@@%#*::::...........%%:..-. :@@@@@@@@@@@@@@@==-@@@@@@@@@@@@@@@@@@@@@@@ @@@@@@@@@@@@@=--@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%+.*@@@@@@@@@@@@@@@@@@=-%%.. ......:@@@@@@@@@@@@%.%@@@@=+@@@@%:.*@@@@#:.#@@@@+.%@@@@@@@@@@@@:.... --.=.#@-..:: +@@@@@@@@@@@@@+=+@@@@@@@@@@@@@@@@@@@@@@@ @@ -42,9 +50,10 @@ @@@@@@@@@@@@@@@@@%+:--::=****=:..::-. ...... ...:::::.......................... . @%%%####******+++++++++=============------:::::............. 
...............................::::::::::::::::::::::------=====+++++++*******#######%%%%%%@@@@@@@ @@@@@@@@@@@@@@@@@@%%%##############%%%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@*/ -import tkyoDriftSetTrainingHook from './util/tkyoDriftSetTrainingHook.js'; -import printScalarCLI from './util/printScalarCLI.js'; -import printLogCLI from './util/printLogCLI.js'; + +import tkyoDriftSetTrainingHook from './util/batchPythonHook.js'; +import printScalarCLI from './util/logPrintScalarCLI.js'; +import printLogCLI from './util/logPrintCosCLI.js'; import tkyoDrift from './util/oneOffEmb.js'; import chalk from 'chalk'; import path from 'path'; @@ -53,85 +62,125 @@ import fs from 'fs'; // Get the commands from the CLI (the first 2 are not commands) const [command, ...rest] = process.argv.slice(2); -// Only run if the command is a "tkyo" command -// if (process.argv[1] === new URL(import.meta.url).pathname) { // ! Alternative, ESM Based -if (process.argv[1].endsWith('tkyo')) { - // switch case to determine which file to invoke - switch (command) { - // ? tkyo cos - case 'cos': { - const dayArgument = rest[0] || '30'; - process.argv = ['node', 'printLogCLI.js', dayArgument]; - await printLogCLI(dayArgument); - break; - } +/** + * Main CLI handler for TKYO Drift commands. + * Processes the following commands: + * - cos: Show cosine similarity drift logs + * - scalar: Show scalar metric drift comparison + * - train: Process training data and update baselines + * - inputs: Toggle text input logging + * + * @example + * Show cosine similarity drift for last 30 days + * tkyo cos 30 + * + * Show scalar metric drift comparison + * tkyo scalar + * + * Process training data + * tkyo train ./data input input + * + * Toggle text input logging + * tkyo inputs + */ - // ? 
tkyo scalar - case 'scalar': { - await printScalarCLI(); - break; - } +async function main() { + // Only run if the command is a "tkyo" command + if (process.argv[1].endsWith('tkyo')) { + // switch case to determine which file to invoke + switch (command) { + // ? tkyo cos + case 'cos': { + const dayArgument = rest[0] || '30'; + process.argv = ['node', 'printLogCLI.js', dayArgument]; + await printLogCLI(dayArgument); + break; + } + + // ? tkyo scalar + case 'scalar': { + await printScalarCLI(); + break; + } + + // ? tkyo train + case 'train': { + const [pathToData, columnName, ioType] = rest; - // ? tkyo train - case 'train': { - const [pathToData, columnName, ioType] = rest; + // Error handle when the user doesn't provide the correct arguments + if (!pathToData || !columnName || !ioType) { + console.error( + chalk.blueBright( + 'Usage: tkyo train <pathToData> <columnName> <ioType>' + ) + ); + process.exit(1); + } - // Error handle when - if (!pathToData || !columnName || !ioType) { - console.error( - chalk.blueBright( - 'Usage: tkyo train <pathToData> <columnName> <ioType>' - ) + // If someone calls the train command, we normalize the path. + const normalizedPath = path.resolve( + process.cwd(), + pathToData.replace(/\\/g, '/') ); - process.exit(1); - } - // If someone calls the train command, we normalize the path. - const normalizedPath = path.resolve( - process.cwd(), - pathToData.replace(/\\/g, '/') - ); + // Error handle when the path does not exist, and bail out before spawning Python. + if (!fs.existsSync(normalizedPath)) { + console.error(chalk.red(`The dataSetPath provided does not exist.`)); process.exit(1); + } - // Error handle when the path does not exist. - if (!fs.existsSync(normalizedPath)) { - console.error(chalk.red(`The dataSetPath provided does not exist.`)); + await tkyoDriftSetTrainingHook(normalizedPath, columnName, ioType); + console.log(chalk.green("Job's done.")); + break; } - await tkyoDriftSetTrainingHook(normalizedPath, columnName, ioType); - console.log(chalk.green("Job's done.")); - break; - } - - // ?
help commands - default: - console.log( - chalk.gray(` -↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓ ↑↑↑ ↗↓↓↓↗ ↓↓↓ ↓↓↓ ↓↓↓↓↓↓↓↓↓↓↓↓↖ - ↑↑↑ ↑↑↑ ↗↑↑↑ ↑↑↑ ↑↑↑ ↑↑↑↑ ↖↑↑ - ↑↑↑ ↑↑↑ ↗↑↑↑ ↑↑↑ ↑↑↑ ↑↑↑ ↖↑↑ - ↑↑↑ β†‘β†‘β†‘β†‘β†‘β†‘β†‘β†˜ ↑↑↑ ↑↑↑↑ ↑↑↑ ↖↑↑ - ↖↑↑ →↑↑ β†‘β†‘β†‘β†˜ ↑↑↑↑↑↑↑↑↑↑↑↑↑ ←↑↑ ↑↑↑↗ - ↑↑↑ ↑↑↑ β†‘β†‘β†‘β†˜ ↑↑↑ ↑↑↑ ↗↑↑↓ - ↑↑↑ ↑↑↑ β†‘β†‘β†‘β†˜ ↑↑↑ ↑↑↑↑ ↗↑↑↑ - ↑↑↑ ↑↑↑ β†‘β†‘β†‘β†˜ ↑↑↑ ↑↑↑↑↑↑↑↑↑↑↑↑↑↗ + // ? help commands + default: + console.log( + chalk.gray(` +↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓ ↗↑↑ ↗↓↓↓↗ ↓↓↓ ↓↓↓ ↓↓↓↓↓↓↓↓↓↓↓↓↖ + ↗↑↑ ↗↑↑ ↗↑↑↑ ↗↑↑ ↗↑↑ ↗↑↑↑ ↖↑↑ + ↗↑↑ ↗↑↑ ↗↑↑↑ ↗↑↑ ↗↑↑ ↗↑↑ ↖↑↑ + ↗↑↑ β†‘β†‘β†‘β†‘β†‘β†‘β†‘β†˜ ↗↑↑ ↗↑↑↑ ↗↑↑ ↖↑↑ + ↖↑↑ →↑↑ β†‘β†‘β†‘β†˜ ↑↑↑↑↑↑↑↑↑↑↑↑↑ ←↑↑ ↗↑↑↓ + ↗↑↑ ↗↑↑ β†‘β†‘β†‘β†˜ ↑↑↑ ↖↑↑ ↗↑↑↓ + ↗↑↑ ↗↑↑ β†‘β†‘β†‘β†˜ ↑↑↑ ↖↑↑↑ ↗↑↑↗ + ↗↑↑ ↗↑↑ β†‘β†‘β†‘β†˜ ↑↑↑ ↖↑↑↑↑↑↑↑↑↑↑↑↑↗ Usage: ${chalk.yellowBright('tkyo')} ${chalk.white('cos')} ${chalk.blueBright( - '' - )} Show COS Drift logs for last N days + '' + )} Show COS Drift logs for last N days ${chalk.yellowBright('tkyo')} ${chalk.white( - 'scalar' - )} Show scalar drift comparison + 'scalar' + )} Show scalar drift comparison ${chalk.yellowBright('tkyo')} ${chalk.white('train')} ${chalk.blueBright( - ' ' - )} Embed dataset and update training baseline - -Readme docs in the node package or at ${chalk.blueBright( - 'https://github.com/oslabs-beta/tkyo-drift' - )} - `) - ); + ' ' + )} Embed dataset and update training baseline + +${chalk.cyanBright('Environment variables:')} + ${chalk.white('TEXT_LOGGING')} Set to 'false' to disable text input logging (default: true) + ${chalk.white('OUTPUT_DIR')} Set the output directory for all drift data (default: ./tkyoData) + ${chalk.white('You can also update the embedding models in the config.js file')} + +Readme docs are in the node package or at ${chalk.blueBright( + 'https://github.com/oslabs-beta/tkyo-drift' + )} + `) + ); + } } } +main(); + +/** + * Export the main drift analysis function for programmatic use. 
+ * This allows the drift analysis functionality to be used as a library + * in addition to the CLI interface. + * + * @example + * import tkyoDrift from 'tkyodrift'; + * await tkyoDrift("Sample text", "input"); + */ + export default tkyoDrift; diff --git a/tkyo-drift/util/tkyoDriftSetTraining.py b/tkyo-drift/util/batchEmbController.py similarity index 68% rename from tkyo-drift/util/tkyoDriftSetTraining.py rename to tkyo-drift/util/batchEmbController.py index 1955380..f4d7238 100644 --- a/tkyo-drift/util/tkyoDriftSetTraining.py +++ b/tkyo-drift/util/batchEmbController.py @@ -1,9 +1,15 @@ +""" +Controller module for batch processing of embeddings and scalar metrics. +This module coordinates the generation of embeddings and scalar metrics +for training data sets. +""" + # Prevent _pycache_ creation, since these scripts only run on demand import sys sys.dont_write_bytecode = True # Import helper function to load and embed the data -import pythonTrainingEmb -from writeSharedScalars import write_shared_scalar_metrics +import batchEmbWriter +from batchScalarWriteShared import write_shared_scalar_metrics # Allows the use of time functions @@ -14,6 +20,24 @@ import traceback def tkyoDriftSetTraining(data_set_Path, io_type, io_type_name): + """ + Process a dataset to generate embeddings and scalar metrics for training. + + This function coordinates the generation of embeddings for multiple models + and computes scalar metrics for the given dataset. It handles both the + embedding generation and scalar metric computation in a single pass. 
+ + Args: + data_set_Path (str): Path to the dataset directory + io_type (str): Type of input/output (e.g., 'input', 'output') + io_type_name (str): Name identifier for the I/O type + + Returns: + dict: A dictionary containing the status and message of the operation + + Raises: + Exception: If any error occurs during processing + """ # Starts the total function timer startTotal = time.perf_counter() @@ -30,7 +54,7 @@ def tkyoDriftSetTraining(data_set_Path, io_type, io_type_name): # Iterate through models dictionary for model_type, model_name in MODELS.items(): - pythonTrainingEmb.trainingEmb( + batchEmbWriter.trainingEmb( model_type=model_type, model_name=model_name, data_path=data_set_Path, diff --git a/tkyo-drift/util/pythonTrainingEmb.py b/tkyo-drift/util/batchEmbWriter.py similarity index 84% rename from tkyo-drift/util/pythonTrainingEmb.py rename to tkyo-drift/util/batchEmbWriter.py index 9a6ec73..640dc07 100644 --- a/tkyo-drift/util/pythonTrainingEmb.py +++ b/tkyo-drift/util/batchEmbWriter.py @@ -1,9 +1,15 @@ +""" +Module for batch processing of text embeddings using transformer models. +This module handles the generation of embeddings for both short and long texts, +with special handling for texts that exceed the model's maximum token length. +""" + # Prevent _pycache_ creation, since these scripts only run on demand import sys sys.dont_write_bytecode = True # Import helper function to create kmeans of data -import pythonKMeans +import batchMakeKMeans # This is good for vectors/matrices import numpy as np @@ -23,6 +29,23 @@ import gc def trainingEmb(model_type, model_name, data_path, io_type, io_type_name): + """ + Generate embeddings for a dataset using a specified transformer model. + + This function processes a dataset in batches, handling both short and long texts. + For long texts, it uses a chunking strategy to process them in smaller pieces. + It also computes and saves model-specific scalar metrics for each embedding. 
+ + Args: + model_type (str): Type of model (e.g., 'mini', 'e5') + model_name (str): Name of the transformer model to use + data_path (str): Path to the dataset directory + io_type (str): Type of input/output (e.g., 'input', 'output') + io_type_name (str): Name identifier for the I/O type + + Raises: + ValueError: If no .arrow files are found in the dataset directory + """ # Starts the total function timer startTotal = time.perf_counter() @@ -62,6 +85,16 @@ def trainingEmb(model_type, model_name, data_path, io_type, io_type_name): # When invoked, this will embed the current batch def embed_data(data): + """ + Embed a batch of texts, handling both short and long texts appropriately. + + Args: + data (list): List of text strings to embed + + Returns: + numpy.ndarray: Array of embeddings in the same order as input texts + """ + # Stores texts shorter than 512 tokens short_texts = [] # Stores short text positions in the batch @@ -121,6 +154,16 @@ def embed_data(data): # Handles the embeddings of a single long text def embed_long_text(text): + """ + Embed a single long text by chunking and averaging chunk embeddings. + + Args: + text (str): The text to embed + + Returns: + numpy.ndarray: The averaged embedding vector + """ + chunks = chunk_text(text, tokenizer) tokenized = tokenizer( chunks, @@ -136,6 +179,19 @@ def embed_long_text(text): # Breaks the text up into overlapping chunks def chunk_text(text, tokenizer, max_length=512, stride=256): + """ + Break a long text into overlapping chunks for processing. 
+ + Args: + text (str): The text to chunk + tokenizer: The tokenizer to use + max_length (int): Maximum length of each chunk + stride (int): Number of tokens to overlap between chunks + + Returns: + list: List of text chunks + """ + # Tokenizes each input tokens = tokenizer.encode(text, add_special_tokens=False) # Holds the tokenized chunks @@ -150,7 +206,7 @@ def chunk_text(text, tokenizer, max_length=512, stride=256): return chunks # Embed Data - print(f"Embedding {io_type}s using {model_name} for {model_type} knowledge...") + print(f"Embedding {io_type}s using {model_name}") # Initialize an empty list to store all input embeddings embeddings = [] # Set the number of examples to process at once (smaller = less memory, larger = faster) @@ -250,7 +306,7 @@ def chunk_text(text, tokenizer, max_length=512, stride=256): embeddings.astype(np.float32).tofile(f) else: print(f"You have >= 100000 {io_type} embeddings: Performing K Means analysis to filter embeddings.") - kMeansEmbedding = pythonKMeans.kMeansClustering(embeddings) + kMeansEmbedding = batchMakeKMeans.kMeansClustering(embeddings) # Assign the number of vectors for the training data num_vectors = kMeansEmbedding.shape[0] @@ -292,6 +348,17 @@ def chunk_text(text, tokenizer, max_length=512, stride=256): return def resolve_io_column(batch, io_type_name): + """ + Resolve the correct column name for I/O data in the batch. 
+ + Args: + batch: The dataset batch + io_type_name (str): Name identifier for the I/O type + + Returns: + list: List of texts from the correct column + """ + try: # ------------------------------- # Case 1: Flat column access @@ -329,7 +396,7 @@ def resolve_io_column(batch, io_type_name): if val is not None: result.append(val) except (KeyError, IndexError, TypeError): - # If something goes wrong (e.g., path doesn’t exist, value is None), skip it + # If something goes wrong (e.g., path doesn't exist, value is None), skip it print(f"[WARN] Skipping row {i}: missing nested path in {io_type_name}") return result diff --git a/tkyo-drift/util/pythonKMeans.py b/tkyo-drift/util/batchMakeKMeans.py similarity index 64% rename from tkyo-drift/util/pythonKMeans.py rename to tkyo-drift/util/batchMakeKMeans.py index 72f98f5..0a90365 100644 --- a/tkyo-drift/util/pythonKMeans.py +++ b/tkyo-drift/util/batchMakeKMeans.py @@ -1,3 +1,9 @@ +""" +Module for performing K-means clustering on embedding vectors. +This module provides functionality to cluster embedding vectors into groups +using the K-means algorithm, with automatic determination of optimal cluster count. +""" + # Prevent _pycache_ creation, since these scripts only run on demand import sys sys.dont_write_bytecode = True @@ -9,6 +15,24 @@ import time def kMeansClustering(embeddings): + """ + Perform K-means clustering on a set of embedding vectors. + + This function automatically determines the optimal number of clusters + based on the number of vectors, then performs K-means clustering to + identify centroids that represent the main patterns in the data. + + Args: + embeddings (numpy.ndarray): Array of embedding vectors to cluster + + Returns: + numpy.ndarray: Array of cluster centroids + + Note: + The number of clusters is determined by the formula: sqrt(n/2) * 10, + where n is the number of vectors. This is a heuristic that balances + the granularity of clustering with computational efficiency. 
+ """ # Starts the total function timer startTotal = time.perf_counter() diff --git a/tkyo-drift/util/tkyoDriftSetTrainingHook.js b/tkyo-drift/util/batchPythonHook.js similarity index 72% rename from tkyo-drift/util/tkyoDriftSetTrainingHook.js rename to tkyo-drift/util/batchPythonHook.js index 16c2482..58ee4ef 100644 --- a/tkyo-drift/util/tkyoDriftSetTrainingHook.js +++ b/tkyo-drift/util/batchPythonHook.js @@ -1,13 +1,31 @@ -import { spawn } from 'child_process'; +/** + * Utility function to interface with Python batch processing scripts. + * This module provides a bridge between Node.js and Python for batch processing + * of training data and embeddings. + */ + +import fs from 'fs'; import path from 'path'; import { fileURLToPath } from 'url'; -import fs from 'fs'; +import { spawn } from 'child_process'; // Full path to tkyoDriftSetTrainingHook.js const __filename = fileURLToPath(import.meta.url); // Directory containing the file (tkyo-drift) const __dirname = path.dirname(__filename); +/** + * Sets up and processes training data using Python batch processing scripts. + * This function spawns a Python process to handle batch embedding generation + * and training data setup. 
+ * + * @param {string} dataSetPath - Path to the dataset directory + * @param {string} ioType - Type of input/output (e.g., 'input', 'output') + * @param {string} ioTypeName - Name identifier for the I/O type + * @returns {Promise} The output from the Python process + * @throws {Error} If the dataset path doesn't exist or if Python process fails + */ + export default async function tkyoDriftSetTraining( dataSetPath, ioType, @@ -26,7 +44,7 @@ export default async function tkyoDriftSetTraining( ); } // Ensures we are running tkyoDriftSetTraining.py correctly - const scriptPath = path.join(__dirname, './tkyoDriftSetTraining.py'); + const scriptPath = path.join(__dirname, './batchEmbController.py'); const pyProg = spawn('python3', [ '-u', scriptPath, diff --git a/tkyo-drift/util/writeSharedScalars.py b/tkyo-drift/util/batchScalarWriteShared.py similarity index 77% rename from tkyo-drift/util/writeSharedScalars.py rename to tkyo-drift/util/batchScalarWriteShared.py index cd906b5..33699cd 100644 --- a/tkyo-drift/util/writeSharedScalars.py +++ b/tkyo-drift/util/batchScalarWriteShared.py @@ -1,14 +1,44 @@ +""" +Module for computing and writing shared scalar metrics for text data. +This module calculates various text-based metrics like character length, +entropy, word length, and punctuation density, then writes them to JSONL files. +""" + from datasets import Dataset, concatenate_datasets import os import json import numpy as np import time from datetime import datetime -from pythonTrainingEmb import resolve_io_column +from batchEmbWriter import resolve_io_column # * Writes shared scalar metrics (like character length, entropy, etc.) for training data # * One file is created per metric (e.g., ioTypeName.characterLength.training.scalar.jsonl) def write_shared_scalar_metrics(data_path, io_type, io_type_name): + """ + Compute and write shared scalar metrics for a dataset of texts. 
+
+    This function processes a dataset of texts and computes various scalar metrics
+    for each text, including:
+    - Character length
+    - Character entropy (measures repetition vs. diversity)
+    - Average word length
+    - Punctuation density
+    - Uppercase ratio
+
+    The metrics are written to separate JSONL files, one per metric type,
+    in the format: ioType.metricName.training.scalar.jsonl
+
+    Args:
+        data_path (str): Path to the dataset directory containing .arrow files
+        io_type (str): Type of input/output (e.g., 'input', 'output')
+        io_type_name (str): Name identifier for the I/O type
+
+    Note:
+        The function includes progress tracking and estimated time remaining
+        for long-running operations.
+    """
+
     # Load all `.arrow` files from the provided dataset directory
     arrow_files = [
         os.path.join(data_path, f)
diff --git a/tkyo-drift/util/downloadTrainingData.py b/tkyo-drift/util/downloadTrainingData.py
deleted file mode 100644
index 5f4a692..0000000
--- a/tkyo-drift/util/downloadTrainingData.py
+++ /dev/null
@@ -1,15 +0,0 @@
-# Prevent _pycache_ creation, since these scripts only run on demand
-import sys
-sys.dont_write_bytecode = True
-from datasets import load_dataset
-
-# ? If you are using a model on hugging face, you can use this utility to download the training data
-# The data will be stored in you ~./cache folder
-data_location = "SmallDoge/SmallThoughts"
-
-def dataSetLoader (data_location):
-    dataset = load_dataset("SmallDoge/SmallThoughts")
-    print(dataset)
-    return dataset
-
-dataSetLoader(data_location)
\ No newline at end of file
diff --git a/tkyo-drift/util/makeLogEntry.js b/tkyo-drift/util/logMakeDriftEntry.js
similarity index 75%
rename from tkyo-drift/util/makeLogEntry.js
rename to tkyo-drift/util/logMakeDriftEntry.js
index 950918e..d83d89b 100644
--- a/tkyo-drift/util/makeLogEntry.js
+++ b/tkyo-drift/util/logMakeDriftEntry.js
@@ -1,14 +1,27 @@
+/**
+ * Utility function to log drift metrics (cosine similarity or euclidean distance) to a CSV file.
+ * This function handles both the creation of new log files and appending to existing ones.
+ * The log file structure is dynamic based on the models and baseline types being used.
+ */
+
 import fs from 'fs';
 import path from 'path';
-import { OUTPUT_DIR } from './oneOffEmb.js';
+import { config } from '../config.js';
 
+/**
+ * Creates or appends a log entry for drift metrics to the appropriate CSV file.
+ *
+ * @param {string} id - Unique identifier (UUID) for the drift analysis
+ * @param {Object} mathObject - Object containing drift metrics with keys in format "modelType.ioType.baselineType"
+ * @param {string} type - Type of drift metric, either 'COS' for cosine similarity or 'EUC' for euclidean distance
+ */
 export default function makeLogEntry(id, mathObject, type) {
   let logPath = '';
 
   // Construct the destination to the log in the data folder
   if (type === 'COS') {
-    logPath = path.join(OUTPUT_DIR, 'logs', 'COS_log.csv');
-  } else {
-    logPath = path.join(OUTPUT_DIR, 'logs', 'EUC_log.csv');
+    logPath = path.join(config.outputDir, 'logs', 'COS_log.csv');
+  } else if (type === 'EUC') {
+    logPath = path.join(config.outputDir, 'logs', 'EUC_log.csv');
   }
 
   // Create a timestamp
diff --git a/tkyo-drift/util/makeErrorLogEntry.js b/tkyo-drift/util/logMakeErrorEntry.js
similarity index 58%
rename from tkyo-drift/util/makeErrorLogEntry.js
rename to tkyo-drift/util/logMakeErrorEntry.js
index 71d50c9..29890e5 100644
--- a/tkyo-drift/util/makeErrorLogEntry.js
+++ b/tkyo-drift/util/logMakeErrorEntry.js
@@ -1,11 +1,21 @@
+/**
+ * Utility function to log errors that occur during drift analysis to a CSV file.
+ * This function handles both the creation of new error log files and appending to existing ones.
+ * Errors are logged with timestamps and error messages in a structured format.
+ */
+
 import fs from 'fs';
 import path from 'path';
-import { OUTPUT_DIR } from './oneOffEmb.js';
+import { config } from '../config.js';
 
-// * Logs a structured error entry to a CSV in the data folder
+/**
+ * Creates or appends an error log entry to the error log CSV file.
+ *
+ * @param {Error} error - The error object to be logged
+ */
 export default function makeErrorLogEntry(error) {
-  // Build path to error log
-  const logPath = path.join(OUTPUT_DIR, 'logs', 'ERR_log.csv');
+  // Construct the path to the error log file
+  const logPath = path.join(config.outputDir, 'logs', 'ERR_log.csv');
 
   // Create a timestamp for when the error occurred
   const timestamp = new Date().toISOString();
diff --git a/tkyo-drift/util/logMakeInputEntry.js b/tkyo-drift/util/logMakeInputEntry.js
new file mode 100644
index 0000000..fd65363
--- /dev/null
+++ b/tkyo-drift/util/logMakeInputEntry.js
@@ -0,0 +1,55 @@
+/**
+ * Utility function to log text inputs that are being analyzed for drift detection.
+ * This creates and maintains a CSV log file that tracks the text inputs along with their
+ * unique identifiers and timestamps. This log is separate from the drift metrics log
+ * to keep the input data distinct from the analysis results.
+ */
+
+import fs from 'fs';
+import path from 'path';
+import { config } from '../config.js';
+
+/**
+ * Escapes a string for CSV format by:
+ * 1. Replacing any double quotes with two double quotes
+ * 2. Wrapping the entire string in double quotes
+ *
+ * @param {string} str - The string to escape
+ * @returns {string} - The escaped string
+ */
+function escapeCSV(str) {
+  // Replace any double quotes with two double quotes
+  const escaped = str.replace(/"/g, '""');
+  // Wrap in double quotes
+  return `"${escaped}"`;
+}
+
+/**
+ * Creates or appends a log entry for a text input being analyzed for drift.
+ *
+ * @param {string} id - Unique identifier (UUID) for the text input
+ * @param {string} text - The actual text content being analyzed
+ */
+export default function logMakeInputEntry(id, text) {
+  const logPath = path.join(config.outputDir, 'logs', 'text_log.csv');
+  const timestamp = new Date().toISOString();
+  const row = [id, timestamp, escapeCSV(text)];
+  const csvLine = row.join(',') + '\n';
+
+  const fileExists = fs.existsSync(logPath);
+
+  try {
+    if (!fileExists) {
+      // If the file doesn't exist, create it with headers
+      const headers = ['ID', 'TIMESTAMP', 'TEXT'].join(',') + '\n';
+      fs.writeFileSync(logPath, headers + csvLine);
+    } else {
+      // If the file exists, append the new entry
+      fs.appendFileSync(logPath, csvLine);
+    }
+  } catch (error) {
+    // Log any errors that occur during file operations
+    // This could be due to permissions, disk space, or file locks
+    console.error('Failed to write text log entry:', error.message);
+  }
+}
\ No newline at end of file
diff --git a/tkyo-drift/util/printLogCLI.js b/tkyo-drift/util/logPrintCosCLI.js
similarity index 84%
rename from tkyo-drift/util/printLogCLI.js
rename to tkyo-drift/util/logPrintCosCLI.js
index 3ebdb2b..0bc1a26 100644
--- a/tkyo-drift/util/printLogCLI.js
+++ b/tkyo-drift/util/logPrintCosCLI.js
@@ -1,12 +1,26 @@
+/**
+ * Utility function to print cosine similarity drift metrics in a formatted CLI table.
+ * This function reads the cosine similarity log file, processes the data, and displays
+ * it in a color-coded table showing drift metrics over a specified time period.
+ */
+
 import fs from 'fs';
 import path from 'path';
 import chalk from 'chalk';
 import Table from 'cli-table3';
-import { MODELS, OUTPUT_DIR } from './oneOffEmb.js';
-
+import { config } from '../config.js';
+
+/**
+ * Prints a formatted table of cosine similarity drift metrics to the console.
+ * The table shows average similarity scores and violation counts for each model,
+ * I/O type, and baseline combination over the specified time period.
+ *
+ * @param {string|number} arg - Number of days to look back for drift analysis (defaults to 30 if invalid)
+ * @throws {Error} If the log file doesn't exist or can't be parsed
+ */
 export default async function printLogCLI(arg) {
   // Constants & CLI Args
-  const logPath = path.join(OUTPUT_DIR, 'logs', 'COS_log.csv');
+  const logPath = path.join(config.outputDir, 'logs', 'COS_log.csv');
   const days = isNaN(parseInt(arg)) ? 30 : parseInt(arg);
   const driftThreshold = 0.8;
   const startTime = Date.now() - days * 86400000; // milliseconds in a day
@@ -16,7 +30,7 @@ export default async function printLogCLI(arg) {
     throw new Error(`No log file not found at: ${logPath}`);
   }
 
-  // Declare header and row variables so they’re accessible later
+  // Declare header and row variables so they're accessible later
   let headers, rows;
 
   try {
@@ -84,7 +98,7 @@ export default async function printLogCLI(arg) {
   // Build the table rows by model type, io type, and baseline type
   for (const ioType of ioTypes) {
-    for (const [modelType] of Object.entries(MODELS)) {
+    for (const [modelType] of Object.entries(config.models)) {
       for (const baselineType of baselineTypes) {
         const columnHeader = `${modelType.toUpperCase()} ${baselineType.toUpperCase()} COS`;
         const colIndex = headers.indexOf(columnHeader);
diff --git a/tkyo-drift/util/printScalarCLI.js b/tkyo-drift/util/logPrintScalarCLI.js
similarity index 80%
rename from tkyo-drift/util/printScalarCLI.js
rename to tkyo-drift/util/logPrintScalarCLI.js
index 1ed7ec0..f23f788 100644
--- a/tkyo-drift/util/printScalarCLI.js
+++ b/tkyo-drift/util/logPrintScalarCLI.js
@@ -1,14 +1,28 @@
+/**
+ * Utility function to print scalar metric drift analysis in a formatted CLI table.
+ * This function reads scalar metric files, compares training and rolling distributions,
+ * and displays the results in a color-coded table showing drift metrics.
+ */
+
 import fs from 'fs';
 import path from 'path';
 import chalk from 'chalk';
 import Table from 'cli-table3';
-import { compareScalarDistributions } from './compareScalarDistributions.js';
-import { loadScalarMetrics } from './loadScalarMetrics.js';
-import { OUTPUT_DIR } from './oneOffEmb.js';
+import { config } from '../config.js';
+import { loadScalarMetrics } from './scalarLoadMetrics.js';
+import { compareScalarDistributions } from './scalarCompare.js';
+
+/**
+ * Prints a formatted table of scalar metric drift analysis to the console.
+ * The table shows statistical comparisons between training and rolling data,
+ * including means, standard deviations, and Population Stability Index (PSI).
+ *
+ * @returns {Promise}
+ */
 export default async function printScalarCLI() {
-  // Define the path to where scalar .jsonl files are stored
-  const SCALAR_DIR = path.join(OUTPUT_DIR, 'scalars');
+  // Construct the path to the scalar metrics directory
+  const SCALAR_DIR = path.join(config.outputDir, 'scalars');
 
   // Define warning boolean to console log a warning if we are in hybrid mode
   let warn = false;
@@ -155,14 +169,27 @@ export default async function printScalarCLI() {
     }
   }
 
-  // Helper to color code regular values
+  /**
+   * Formats a numeric value with 2 decimal places.
+   *
+   * @param {number} val - The value to format
+   * @returns {string} - The formatted value in white
+   */
+
   function format(val) {
     if (typeof val !== 'number') return chalk.gray('n/a');
     const formatted = val.toFixed(2);
     return chalk.white(formatted);
   }
 
-  // Helper to color code delta values by severity
+  /**
+   * Formats a delta value with color coding based on its z-score.
+   *
+   * @param {number} val - The delta value to format
+   * @param {number} std - The standard deviation to use for z-score calculation
+   * @returns {string} - The formatted value in green/yellow/red based on severity
+   */
+
   function formatDelta(val, std) {
     if (typeof val !== 'number') return chalk.gray('n/a');
     const formatted = val.toFixed(2);
@@ -174,7 +201,13 @@ export default async function printScalarCLI() {
     return chalk.red(formatted); // Drifted
   }
 
-  // Helper to color code PSI values by severity
+  /**
+   * Formats a PSI value with color coding based on drift severity.
+   *
+   * @param {number} val - The PSI value to format
+   * @returns {string} - The formatted value in green/yellow/red based on severity
+   */
+
   function formatPSI(val) {
     if (typeof val !== 'number') return chalk.gray('n/a');
     const formatted = val.toFixed(3);
diff --git a/tkyo-drift/util/oneOffEmb.js b/tkyo-drift/util/oneOffEmb.js
index 1af6b12..d3a50ae 100644
--- a/tkyo-drift/util/oneOffEmb.js
+++ b/tkyo-drift/util/oneOffEmb.js
@@ -1,25 +1,51 @@
+/**
+ * Core module for drift analysis using transformer-based embeddings.
+ * This module provides the main pipeline for analyzing text drift using
+ * multiple models and metrics, including cosine similarity and euclidean distance.
+ */
+
 import fs from 'fs';
 import path from 'path';
 import { v4 } from 'uuid';
-import { DriftModel } from './DriftModel.js';
-import makeLogEntry from './makeLogEntry.js';
-import makeErrorLogEntry from './makeErrorLogEntry.js';
-import captureSharedScalarMetrics from './captureSharedScalarMetrics.js';
-
-// * Global Variables for the utilities
-// Embedding Models
-export const MODELS = {
-  // t5: 'Xenova/sentence-t5-large',
-  // bert: 'Xenova/sentence_bert',
-  mini: 'Xenova/all-MiniLM-L12-v2',
-  e5: 'Xenova/e5-base-v2',
-};
-// Log, Scalar, and Vector root output directory
-export const OUTPUT_DIR = path.resolve('./tkyoData');
-// Cache of pipeline output results, to speed up model loading
+import { config } from '../config.js';
+import { DriftModel } from './oneOffModel.js';
+import makeErrorLogEntry from './logMakeErrorEntry.js';
+import logMakeDriftEntry from './logMakeDriftEntry.js';
+import logMakeInputEntry from './logMakeInputEntry.js';
+import captureSharedScalarMetrics from './scalarCaptureShared.js';
+
+/**
+ * Cache for pipeline output results to speed up model loading.
+ * This is used to prevent reloading models on each request in warm environments.
+ * @type {Object}
+ */
+
 export const MODEL_CACHE = {};
 
-// * One Off Ingestion Pipeline Logic
+/**
+ * Main function for performing drift analysis on a text input.
+ *
+ * This function orchestrates the entire drift analysis pipeline:
+ * 1. Sets up necessary directories and validates model configuration
+ * 2. Initializes and loads transformer models
+ * 3. Generates embeddings for the input text
+ * 4. Computes scalar metrics (both shared and model-specific)
+ * 5. Saves embeddings and metrics to disk
+ * 6. Calculates drift metrics (cosine similarity and euclidean distance)
+ * 7. Logs all results
+ *
+ * @param {string} text - The text to analyze for drift
+ * @param {string} ioType - Type of input/output (e.g., 'input', 'output')
+ * @returns {Promise}
+ *
+ * @throws {Error} If model configuration is invalid
+ * @throws {Error} If there are issues with model construction or loading
+ *
+ * @example
+ * Analyze drift in an input text
+ * await tkyoDrift("Sample text to analyze", "input");
+ */
+
 export default async function tkyoDrift(text, ioType) {
   // Stopwatch START 🏎️
   // console.time('Drift Analyzer Full Run');
@@ -34,20 +60,20 @@ export default async function tkyoDrift(text, ioType) {
   try {
     // ------------- << Make Directories >> -------------
     // Check if directory exists, if not, make it.
-    if (!fs.existsSync(OUTPUT_DIR)) {
-      fs.mkdirSync(OUTPUT_DIR, { recursive: true });
+    if (!fs.existsSync(config.outputDir)) {
+      fs.mkdirSync(config.outputDir, { recursive: true });
     }
 
     // Create subdirectories for vectors, scalars, and logs
     for (const dir of subdirectories) {
-      const subdirPath = path.join(OUTPUT_DIR, dir);
+      const subdirPath = path.join(config.outputDir, dir);
       if (!fs.existsSync(subdirPath)) {
         fs.mkdirSync(subdirPath, { recursive: true });
       }
     }
 
     // Validate model config (we need the / and it's gotta be a string)
-    for (const [type, name] of Object.entries(MODELS)) {
+    for (const [type, name] of Object.entries(config.models)) {
       if (typeof name !== 'string' || !name.includes('/')) {
         throw new Error(
           `Invalid or missing model ID for "${type}" model: "${name}"`
@@ -58,7 +84,7 @@ export default async function tkyoDrift(text, ioType) {
     // ------------- << Construct Model Combinations >> -------------
     try {
       // * For each model, for each baselineType, make a model and assign to driftModels object
-      for (const [modelType, modelName] of Object.entries(MODELS)) {
+      for (const [modelType, modelName] of Object.entries(config.models)) {
         for (const baselineType of baselineTypes) {
           const key = `${modelType}.${ioType}.${baselineType}`;
           driftModels[key] = new DriftModel(
@@ -159,9 +185,14 @@ export default async function tkyoDrift(text, ioType) {
     // * Push the results to each log
     // Make shared ID and date for the cosine and Euclidean logs
     const sharedID = v4();
-    makeLogEntry(sharedID, similarityResults, 'COS');
-    makeLogEntry(sharedID, distanceResults, 'EUC');
+    logMakeDriftEntry(sharedID, similarityResults, 'COS');
+    logMakeDriftEntry(sharedID, distanceResults, 'EUC');
 
+    // Log the input text if logging is enabled
+    if (config.enableTextLogging) {
+      logMakeInputEntry(sharedID, text);
+    }
+
   // ------------- << END try/catch Error Handling >> -------------
   // * Push any errors to the error log
   // ! NOTE: This platform intentionally fails silently
diff --git a/tkyo-drift/util/DriftModel.js b/tkyo-drift/util/oneOffModel.js
similarity index 85%
rename from tkyo-drift/util/DriftModel.js
rename to tkyo-drift/util/oneOffModel.js
index 16e9822..2fede7e 100644
--- a/tkyo-drift/util/DriftModel.js
+++ b/tkyo-drift/util/oneOffModel.js
@@ -1,13 +1,33 @@
+/**
+ * Core class for handling drift analysis using a single model.
+ * This class manages the lifecycle of a drift analysis model, including
+ * model loading, embedding generation, and drift metric computation.
+ */
+
 import fs from 'fs';
 import path from 'path';
 import { error } from 'console';
+import { fileURLToPath } from 'url';
 import fsPromises from 'fs/promises';
+import { config } from '../config.js';
 import { spawn } from 'child_process';
+import { MODEL_CACHE } from './oneOffEmb.js';
 import { pipeline } from '@xenova/transformers';
-import { OUTPUT_DIR, MODEL_CACHE } from './oneOffEmb.js';
-import { fileURLToPath } from 'url';
 
+/**
+ * Class representing a drift analysis model.
+ * Handles model initialization, embedding generation, and drift metric computation.
+ */
 export class DriftModel {
+  /**
+   * Create a new DriftModel instance.
+   *
+   * @param {string} modelType - Type of model (e.g., 'mini', 'e5')
+   * @param {string} modelName - Name of the transformer model to use
+   * @param {string} ioType - Type of input/output (e.g., 'input', 'output')
+   * @param {string} baselineType - Type of baseline ('training' or 'rolling')
+   */
+
   constructor(modelType, modelName, ioType, baselineType) {
     this.baselineType = baselineType;
     this.modelType = modelType;
@@ -25,7 +45,13 @@ export class DriftModel {
     this.embeddingFilePath = null;
   }
 
-  // * Function to set the file path
+  /**
+   * Set the file paths for embeddings and scalar metrics.
+   * Handles both regular and KMeans-based training files.
+   *
+   * @throws {Error} If there's an error setting the file paths
+   */
+
   setFilePaths() {
     try {
       // ?NOTE: training baselines may use KMeans files, which are handled inside the Python logic.
@@ -35,14 +61,14 @@
       const baseName = `${this.modelType}.${this.ioType}.${this.baselineType}`;
 
       // Assemble the embedding file path (.bin file)
-      const vectorPath = path.join(OUTPUT_DIR, 'vectors', `${baseName}.bin`);
+      const vectorPath = path.join(config.outputDir, 'vectors', `${baseName}.bin`);
       const vectorKmeansPath = path.join(
-        OUTPUT_DIR,
+        config.outputDir,
         'vectors',
         `${baseName}.kmeans.bin`
       );
       const fallbackPath = path.join(
-        OUTPUT_DIR,
+        config.outputDir,
         'vectors',
         `${this.modelType}.${this.ioType}.rolling.bin`
       );
@@ -57,7 +83,7 @@ export class DriftModel {
       // Scalar metric path (.scalar.jsonl)
       this.scalarFilePath = path.join(
-        OUTPUT_DIR,
+        config.outputDir,
         'scalars',
         `${baseName}.scalar.jsonl`
       );
@@ -68,7 +94,7 @@
     }
   }
 
-  // * Function to load the embedding model
+  /**
+   * Load the embedding model using the Xenova transformer pipeline.
+   * Uses a global cache to avoid reloading the same model.
+   *
+   * @throws {Error} If there's an error loading the model
+   */
+
   async loadModel() {
     try {
       // Don't reload a model if it's loaded.
@@ -93,7 +125,7 @@
     }
  }
 
-  // * Function to make an embedding from an input/output pair
+  /**
+   * Generate an embedding for the given text.
+   * Handles both short and long texts using chunking for texts that exceed token limits.
+   *
+   * @param {string} text - The text to generate an embedding for
+   * @throws {Error} If the text is invalid or embedding generation fails
+   */
+
   async makeEmbedding(text) {
     try {
       // Validate that the text is not null/undefined/empty
@@ -184,7 +223,13 @@
     }
   }
 
-  // * Function to Save Data to file path
+  /**
+   * Save the current embedding to a binary file.
+   * Only saves for rolling baselines, not training baselines.
+   *
+   * @throws {Error} If there's an error saving the embedding
+   */
+
   async saveToBin() {
     // Skip if training — this method is only for rolling baseline
     if (this.baselineType === 'training') return;
@@ -273,7 +318,13 @@
     }
   }
 
-  // * Function to read the contents of the Bins, Build an HNSW
+  /**
+   * Read embeddings from the binary file.
+   * Handles both regular and KMeans-based training files.
+   *
+   * @throws {Error} If there's an error reading the file
+   */
+
   async readFromBin() {
     // Full path to DriftModel.js
     const __filename = fileURLToPath(import.meta.url);
@@ -294,7 +345,7 @@
       );
     }
     // Ensures we are running pythonHNSW.py correctly
-    const scriptPath = path.join(__dirname, 'pythonHNSW.py');
+    const scriptPath = path.join(__dirname, 'sharedHNSW.py');
 
     try {
       return new Promise((resolve, reject) => {
@@ -357,7 +408,13 @@
     }
   }
 
-  // * Function to get baseline value from vectorArray
+  /**
+   * Calculate the baseline embedding from the loaded vector array.
+   * Computes the mean of all vectors in the array.
+   *
+   * @throws {Error} If the vector array is not loaded or empty
+   */
+
   getBaseline() {
     try {
       // Check to make sure the vectorArray was correctly set in readFromBin
@@ -407,7 +464,13 @@
     }
   }
 
-  // * Function to get cosine similarity between baseline and embedding
+  /**
+   * Calculate the cosine similarity between the current embedding and baseline.
+   *
+   * @returns {number} The cosine similarity score
+   * @throws {Error} If embeddings are not available
+   */
+
   getCosineSimilarity() {
     try {
       // Validate the embedding and baselines both exist
@@ -448,7 +511,13 @@
     }
   }
 
-  // * Function to calculate the euclidean distance from the baseline
+  /**
+   * Calculate the euclidean distance between the current embedding and baseline.
+   *
+   * @returns {number} The euclidean distance
+   * @throws {Error} If embeddings are not available
+   */
+
   getEuclideanDistance() {
     try {
       // Validate that the embedding and baselines exist
@@ -485,7 +554,12 @@
     }
   }
 
-  // * Function to siphon PSI distribution metrics
+  /**
+   * Capture model-specific scalar metrics for the current text.
+   *
+   * @param {string} text - The text to analyze
+   */
+
   captureModelSpecificScalarMetrics(text) {
     try {
       // Skip if training — this method is only for rolling baseline
@@ -508,7 +582,12 @@
     }
   }
 
-  // * Function to write model-specific scalar metrics to separate files
+  /**
+   * Save the captured scalar metrics to a JSONL file.
+   *
+   * @throws {Error} If there's an error saving the metrics
+   */
+
   async saveScalarMetrics() {
     // Skip if training — this method is only for rolling baseline
     if (this.baselineType === 'training') return;
@@ -527,7 +606,7 @@
       // Construct the file path using: ioType.metric.modelType.baselineType.scalar.jsonl
       // Example: input.norm.semantic.rolling.scalar.jsonl
       const filePath = path.join(
-        OUTPUT_DIR,
+        config.outputDir,
         'scalars',
         `${this.ioType}.${metric}.${this.modelType}.rolling.scalar.jsonl`
       );
diff --git a/tkyo-drift/util/captureSharedScalarMetrics.js b/tkyo-drift/util/scalarCaptureShared.js
similarity index 60%
rename from tkyo-drift/util/captureSharedScalarMetrics.js
rename to tkyo-drift/util/scalarCaptureShared.js
index 8300696..9a4718e 100644
--- a/tkyo-drift/util/captureSharedScalarMetrics.js
+++ b/tkyo-drift/util/scalarCaptureShared.js
@@ -1,9 +1,24 @@
-import fsPromises from 'fs/promises';
+/**
+ * Utility functions for capturing and computing shared scalar metrics for text analysis.
+ * These metrics include character length, entropy, word length, punctuation density,
+ * and uppercase ratio, which are stored in JSONL files for drift analysis.
+ */
+
 import fs from 'fs';
 import path from 'path';
-import { OUTPUT_DIR } from './oneOffEmb.js';
+import fsPromises from 'fs/promises';
+import { config } from '../config.js';
+
+/**
+ * Captures and stores shared scalar metrics for a given text input.
+ * The metrics are written to JSONL files in the scalars directory,
+ * with separate files for each metric type.
+ *
+ * @param {string} text - The text to analyze
+ * @param {string} ioType - The type of input/output (e.g., 'input', 'output')
+ * @returns {Promise}
+ */
 
-// Calculates the shared scalar values for a given input/output pair
 export default async function captureSharedScalarMetrics(text, ioType) {
   const timestamp = new Date().toISOString();
@@ -13,7 +28,7 @@ export default async function captureSharedScalarMetrics(text, ioType) {
     Object.entries(metricSet).map(([metric, value]) => {
       // Construct the file path
       const filePath = path.join(
-        OUTPUT_DIR,
+        config.outputDir,
         'scalars',
         `${ioType}.${metric}.rolling.scalar.jsonl`
       );
@@ -31,7 +46,18 @@
   );
 }
 
-// Internal helper to calculate scalar metrics for a given string
+/**
+ * Computes various scalar metrics for a given text string.
+ *
+ * @param {string} text - The text to analyze
+ * @returns {Object} An object containing the following metrics:
+ *   - characterLength: Total number of characters
+ *   - characterEntropy: Shannon entropy of character distribution
+ *   - avgWordLength: Average length of words
+ *   - punctuationDensity: Ratio of punctuation characters
+ *   - uppercaseRatio: Ratio of uppercase letters
+ */
+
 function computeMetrics(text) {
   const metrics = {};
diff --git a/tkyo-drift/util/compareScalarDistributions.js b/tkyo-drift/util/scalarCompare.js
similarity index 62%
rename from tkyo-drift/util/compareScalarDistributions.js
rename to tkyo-drift/util/scalarCompare.js
index b12eee5..ade1b26 100644
--- a/tkyo-drift/util/compareScalarDistributions.js
+++ b/tkyo-drift/util/scalarCompare.js
@@ -1,4 +1,18 @@
-// * Function that compares the scalar distributions between rolling and training
+/**
+ * Utility functions for comparing statistical distributions between training and rolling data.
+ * These functions calculate means, standard deviations, and Population Stability Index (PSI)
+ * to detect drift in scalar metrics.
+ */
+
+/**
+ * Compares statistical distributions between training and rolling data sets.
+ * For each shared metric, calculates means, standard deviations, and PSI.
+ *
+ * @param {Object} trainingMetrics - Object containing arrays of training data for each metric
+ * @param {Object} rollingMetrics - Object containing arrays of rolling data for each metric
+ * @returns {Object} Object containing statistical comparisons for each shared metric
+ */
+
 export function compareScalarDistributions(trainingMetrics, rollingMetrics) {
   const result = {};
@@ -37,12 +51,24 @@ export function compareScalarDistributions(trainingMetrics, rollingMetrics) {
   return result;
 }
 
-// Helper: Mean
+/**
+ * Calculates the arithmetic mean of an array of numbers.
+ *
+ * @param {number[]} arr - Array of numbers
+ * @returns {number} The mean value
+ */
+
 function mean(arr) {
   return arr.reduce((sum, val) => sum + val, 0) / arr.length;
 }
 
-// Helper: Standard Deviation
+/**
+ * Calculates the standard deviation of an array of numbers.
+ *
+ * @param {number[]} arr - Array of numbers
+ * @returns {number} The standard deviation
+ */
+
 function stddev(arr) {
   const avg = mean(arr);
   const variance =
@@ -50,6 +76,16 @@
   return Math.sqrt(variance);
 }
 
+/**
+ * Calculates the Population Stability Index (PSI) between two distributions.
+ * PSI measures the difference between two probability distributions.
+ *
+ * @param {number[]} train - Array of training data values
+ * @param {number[]} roll - Array of rolling data values
+ * @param {number} bins - Number of bins to use for distribution comparison (default: 10)
+ * @returns {number|null} The PSI value, or null if input is invalid
+ */
+
 function calculatePSI(train, roll, bins = 10) {
   if (
     !Array.isArray(train) ||
diff --git a/tkyo-drift/util/loadScalarMetrics.js b/tkyo-drift/util/scalarLoadMetrics.js
similarity index 71%
rename from tkyo-drift/util/loadScalarMetrics.js
rename to tkyo-drift/util/scalarLoadMetrics.js
index 164e5ef..d308dff 100644
--- a/tkyo-drift/util/loadScalarMetrics.js
+++ b/tkyo-drift/util/scalarLoadMetrics.js
@@ -1,9 +1,27 @@
+/**
+ * Utility function for loading scalar metrics from JSONL files.
+ * This function reads metric data from files, handles both model-specific and model-agnostic metrics,
+ * and supports hybrid mode for training data.
+ */
+
 import fs from 'fs';
-import readline from 'readline';
 import path from 'path';
-import { OUTPUT_DIR } from './oneOffEmb.js';
+import readline from 'readline';
+import { config } from '../config.js';
+
+/**
+ * Loads scalar metrics from JSONL files and groups them by metric name.
+ * Supports both model-specific and model-agnostic metrics, and can operate in hybrid mode
+ * where training data is derived from rolling data.
+ *
+ * @param {string[]} metricNames - Array of metric names to load
+ * @param {string} ioType - Type of input/output (e.g., 'input', 'output')
+ * @param {string} baselineType - Type of baseline ('training' or 'rolling')
+ * @param {string|null} modelType - Optional model type for model-specific metrics
+ * @param {boolean} hybridMode - If true, uses rolling data as training data
+ * @returns {Promise} Object containing arrays of metric values keyed by metric name
+ */
 
-// * Function to read scalar metrics from the scalar jsonl files and group them by metric name
 export async function loadScalarMetrics(
   metricNames,
   ioType,
@@ -14,22 +32,23 @@
 ) {
   const metrics = {}; // this will hold the final merged metric data
 
+  // Use config.outputDir instead of OUTPUT_DIR
+  const scalarDir = path.join(config.outputDir, 'scalars');
+
   for (const metric of metricNames) {
     let filePath;
 
     // Configure file path based on model type first
     if (modelType) {
       filePath = path.join(
-        OUTPUT_DIR,
-        'scalars', // ? If the scalar metric is model specific, this will catch it (when this function gets invoked with a model value)
+        scalarDir,
         `${ioType}.${metric}.${modelType}.${baselineType}.scalar.jsonl`
       );
     } else {
       filePath = path.join(
-        OUTPUT_DIR,
-        'scalars', // ? Otherwise, the scalar metric will come from a model agnostic file
+        scalarDir,
         `${ioType}.${metric}.${baselineType}.scalar.jsonl`
      );
     }
@@ -38,16 +57,14 @@
     if (hybridMode) {
       if (modelType) {
         filePath = path.join(
-          OUTPUT_DIR,
-          'scalars', // ? If the scalar metric is model specific, this will catch it (when this function gets invoked with a model value)
+          scalarDir,
           `${ioType}.${metric}.${modelType}.rolling.scalar.jsonl`
         );
       } else {
         filePath = path.join(
-          OUTPUT_DIR,
-          'scalars', // ? Otherwise, the scalar metric will come from a model agnostic file
+          scalarDir,
           `${ioType}.${metric}.rolling.scalar.jsonl`
         );
       }
     }
diff --git a/tkyo-drift/util/pythonHNSW.py b/tkyo-drift/util/sharedHNSW.py
similarity index 73%
rename from tkyo-drift/util/pythonHNSW.py
rename to tkyo-drift/util/sharedHNSW.py
index 02b6e2e..dc4bc65 100644
--- a/tkyo-drift/util/pythonHNSW.py
+++ b/tkyo-drift/util/sharedHNSW.py
@@ -1,3 +1,9 @@
+"""
+Module for performing nearest neighbor search using HNSW (Hierarchical Navigable Small World) algorithm.
+This module provides functionality to find similar vectors in a dataset using approximate nearest neighbor search,
+with special handling for both training and rolling baselines.
+"""
+
 # Numerical operations package
 import numpy as np
 # HNSW nearest neighbor search package
@@ -16,6 +22,28 @@
 import traceback
 
 
 def HNSW(io_type, model_type, query, baseline_type, file_path):
+    """
+    Perform nearest neighbor search using HNSW algorithm.
+
+    This function loads embeddings from a binary file and finds the k nearest neighbors
+    to the query vector. It handles both training and rolling baselines differently,
+    with special considerations for small datasets and KMeans centroids.
+
+    Args:
+        io_type (str): Type of input/output (e.g., 'input', 'output')
+        model_type (str): Type of model (e.g., 'mini', 'e5')
+        query (str): JSON string containing the query vector
+        baseline_type (str): Type of baseline ('training' or 'rolling')
+        file_path (str): Path to the binary file containing embeddings
+
+    Returns:
+        dict: A dictionary containing:
+            - centroids: List of nearest neighbor vectors
+            - distances: List of distances to nearest neighbors (None for small datasets)
+
+    Raises:
+        ValueError: If query format is invalid or data size mismatch
+    """
 
     # Parse the JSON query string into a numpy array
     try:
@@ -24,6 +52,26 @@ def HNSW(io_type, model_type, query, baseline_type, file_path):
         raise ValueError("Invalid query format - must be JSON string")
 
     def load_embeddings(filename):
+        """
+        Load embeddings from a binary file with header information.
+
+        The binary file format is:
+        - First 8 bytes: Header containing num_vectors (4 bytes) and dims (4 bytes)
+        - Remaining bytes: Float32 array of embeddings
+
+        Args:
+            filename (str): Path to the binary file
+
+        Returns:
+            tuple: (reshaped_data, num_vectors, dims)
+                - reshaped_data: numpy array of shape (num_vectors, dims)
+                - num_vectors: Number of vectors in the file
+                - dims: Dimension of each vector
+
+        Raises:
+            ValueError: If data size doesn't match header information
+        """
+
         # Loads the embeddings from the binary file with the header
         with open(filename, "rb") as f:
             # Read and parse header containing num_vector and dims