ai-text-sanitizer is a tiny (<6 kB) zero-dependency ES module for cleaning and normalising raw text generated by large language models before you render, store, or diff it.

BeMoreDifferent/ai-text-sanitizer

ai-text-sanitizer

Utility for post-processing AI-generated text. It normalises output by removing invisible characters (often used as watermarks or formatting artifacts), folding exotic whitespace, converting "pretty" punctuation to ASCII, and stripping inline citation placeholders such as (oaicite:12){index=12}.
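
As a rough illustration of the citation-stripping step, the sketch below removes placeholders of the exact form shown above and counts how many were dropped. The pattern and the `stripCitations` helper are assumptions made for this example; they are not taken from the library's source, which may match other placeholder variants.

```typescript
// Hypothetical helper, not the library's implementation.
// Matches placeholders shaped like "(oaicite:12){index=12}", plus any
// spaces or tabs immediately before them so no gap is left behind.
const CITATION_PLACEHOLDER = /[ \t]*\(oaicite:\d+\)\{index=\d+\}/g;

function stripCitations(text: string): { cleaned: string; removed: number } {
  let removed = 0;
  const cleaned = text.replace(CITATION_PLACEHOLDER, () => {
    removed += 1; // count each placeholder that is dropped
    return '';
  });
  return { cleaned, removed };
}

const demo = stripCitations('See the docs (oaicite:12){index=12}.');
console.log(demo.cleaned, demo.removed);
```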

About

ai-text-sanitizer is a tiny (<6 kB) zero-dependency ES module for cleaning and normalising raw text generated by large language models before you render, store, or diff it.

Description

The library removes invisible Unicode watermark characters, exotic whitespace, and ASCII control codes, converts fancy punctuation to plain ASCII, strips inline citation placeholders, and optionally collapses redundant spaces—all while returning per-rule change statistics so you can audit the process.

Features

  • Removes Unicode format and other zero-width characters that can act as invisible watermarks.
  • Converts fancy punctuation (curly quotes, en/em dashes, ellipsis, bullets) to plain ASCII equivalents.
  • Folds a wide range of Unicode space characters to a standard space.
  • Collapses runs of multiple spaces and normalises line endings to LF.
  • Eliminates citation placeholders emitted by some language models.
  • Optionally preserves or removes emoji glue characters (ZWJ / variation selectors).
  • Returns granular change statistics so you can audit the cleaning process.
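
The first two features and the statistics can be pictured with a minimal sketch. It is not the library's code: the character class, the punctuation map, and the `sketchSanitize` name are assumptions chosen for illustration, and the real rule set is broader.

```typescript
// Illustrative sketch only; the library's actual tables may differ.
interface SketchStats {
  removedInvisible: number;
  prettified: number;
}

// Assumed subset of invisible characters: ZWSP, ZWNJ, LRM, RLM,
// word joiner, and BOM. ZWJ (U+200D) is left alone so emoji
// sequences stay intact.
const INVISIBLE = /[\u200B\u200C\u200E\u200F\u2060\uFEFF]/g;

// Assumed punctuation map: curly quotes, en/em dashes, ellipsis.
const PUNCTUATION: Record<string, string> = {
  '\u2018': "'",
  '\u2019': "'",
  '\u201C': '"',
  '\u201D': '"',
  '\u2013': '-',
  '\u2014': '-',
  '\u2026': '...',
};
const PUNCTUATION_RE = /[\u2018\u2019\u201C\u201D\u2013\u2014\u2026]/g;

function sketchSanitize(text: string): { cleaned: string; stats: SketchStats } {
  const stats: SketchStats = { removedInvisible: 0, prettified: 0 };
  const cleaned = text
    .replace(INVISIBLE, () => {
      stats.removedInvisible += 1; // one count per removed character
      return '';
    })
    .replace(PUNCTUATION_RE, (ch) => {
      stats.prettified += 1; // one count per replaced character
      return PUNCTUATION[ch] ?? ch;
    });
  return { cleaned, stats };
}
```

Counting inside the replacer callback keeps each rule to a single pass over the text while still producing a per-rule tally.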

Installation

pnpm add ai-text-sanitizer

This project is published as an ES module and requires Node ≥ 18.

Usage

import { sanitizeAiText } from 'ai-text-sanitizer';

const input = `\uFEFF"Hello\u200B world…" (oaicite:5){index=5}`;

const { cleaned, changes } = sanitizeAiText(input);

console.log(cleaned);  // "Hello world..."
console.log(changes);
// {
//   removedInvisible: 2,
//   removedCtrl: 0,
//   removedCitations: 1,
//   prettified: 3,
//   collapsedSpaces: 0,
//   total: 6
// }
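
One way to use the statistics for auditing is to turn them into a short log line. The helper below is a hypothetical addition, not part of the package; the `ChangeCounts` type simply mirrors the field names printed in the example above.

```typescript
// Shape copied from the example output; the package ships its own types.
type ChangeCounts = {
  removedInvisible: number;
  removedCtrl: number;
  removedCitations: number;
  prettified: number;
  collapsedSpaces: number;
  total: number;
};

const RULES = [
  'removedInvisible',
  'removedCtrl',
  'removedCitations',
  'prettified',
  'collapsedSpaces',
] as const;

// Hypothetical helper: lists only the rules that actually fired.
function describeChanges(changes: ChangeCounts): string {
  if (changes.total === 0) return 'no changes';
  const parts = RULES
    .filter((rule) => changes[rule] > 0)
    .map((rule) => `${rule}=${changes[rule]}`);
  return `${changes.total} changes (${parts.join(', ')})`;
}
```

Passing the `changes` object from the example would give a line naming the three rules with non-zero counts.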

TypeScript

ai-text-sanitizer ships with built-in .d.ts declarations. Nothing extra to install — just import and enjoy full IntelliSense:

import { sanitizeAiText, type SanitizeResult } from 'ai-text-sanitizer';

const result: SanitizeResult = sanitizeAiText('مرحبا\u200Fالعالم');
console.log(result.cleaned);

API

sanitizeAiText(text, options?) → { cleaned, changes }

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| text | string | | Input text to sanitise. |
| options | object | (optional) | Behaviour flags (below). |
| keepEmoji | boolean | true | Keep ZWJ / variation selectors used by emoji. |
| collapseSpaces | boolean | true | Collapse contiguous ASCII spaces. |

The returned changes object reports how many code points were altered for each rule plus a total sum.
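
For batches of text, the same options can be applied to every item while the totals are summed. The sketch below takes the sanitising function as a parameter so that it stays self-contained. The `SanitizeFn` type is an assumption based on the signature documented above, and `sanitizeAll` is a hypothetical helper, not part of the package.

```typescript
// Assumed from the documented signature; the package's real types may differ.
type SanitizeOptions = { keepEmoji?: boolean; collapseSpaces?: boolean };
type SanitizeFn = (
  text: string,
  options?: SanitizeOptions,
) => { cleaned: string; changes: { total: number } };

// Hypothetical helper: sanitises each string and sums the change totals.
function sanitizeAll(
  texts: string[],
  sanitize: SanitizeFn,
  options: SanitizeOptions = { keepEmoji: true, collapseSpaces: true },
): { cleaned: string[]; totalChanges: number } {
  let totalChanges = 0;
  const cleaned = texts.map((text) => {
    const result = sanitize(text, options);
    totalChanges += result.changes.total;
    return result.cleaned;
  });
  return { cleaned, totalChanges };
}
```

In an application this would presumably be called as `sanitizeAll(messages, sanitizeAiText, { keepEmoji: false })`, provided the library's exported type is compatible with `SanitizeFn`.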

Running the test suite

pnpm install
pnpm test

Tests live in __tests__/ and exercise typical real-world scenarios including HTML fragments, code snippets, emoji sequences, and BOM handling.

Limitations

  • The function operates on raw strings; it does not parse or sanitise HTML structure. HTML tags remain untouched but are treated as plain text.
  • The mapping of fancy punctuation is intentionally conservative. If you need broader transliteration, customise the PRETTIES table in aiTextSanitizer.js.
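
If editing the library's table is not an option, broader transliteration can also run as a separate pass over the cleaned text. The sketch below is one possible approach. The `EXTRA_FOLDS` map and `foldExtras` function are made up for this example, and the layout of the library's own PRETTIES table is not assumed.

```typescript
// Hypothetical extra mappings, applied after sanitising:
// guillemets to straight quotes, the minus sign to a hyphen.
const EXTRA_FOLDS: ReadonlyMap<string, string> = new Map([
  ['\u00AB', '"'],
  ['\u00BB', '"'],
  ['\u2212', '-'],
]);

function foldExtras(
  text: string,
  folds: ReadonlyMap<string, string> = EXTRA_FOLDS,
): string {
  // Array.from iterates by code point, so astral characters stay whole.
  return Array.from(text)
    .map((ch) => folds.get(ch) ?? ch)
    .join('');
}
```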

Contributing

Contributions, bug reports, and feature requests are very welcome — feel free to open an issue or submit a pull request. Please ensure the test suite passes (pnpm test) and follow conventional commit messages for ease of release automation.


This repository contains only the core library and test suite to keep the footprint minimal.
