
🛡️ privalyse-mask

Privacy-safe LLMs without losing intelligence.

Mask PII without breaking reasoning.
Your LLM still understands age, roles, and locations, but never sees real names, emails, or dates.

❌ Not a redaction library
❌ Not irreversible anonymization
❌ Not a SaaS API
✅ Built for developers shipping LLM apps with real user data


⭐ Star if you're curious about privacy-first LLMs: try this, break it, and tell us what's missing. Your feedback shapes where this goes.

⚠️ Early-stage project
privalyse-mask is new and evolving.
The core idea works, but real-world feedback matters more than polish.

If you're experimenting with LLMs and privacy, we'd love your input.


Quick Start

pip install privalyse-mask

from privalyse_mask import PrivalyseMasker

masker = PrivalyseMasker()
masked, mapping = masker.mask("I'm Sarah, born March 15, 1990")

# → "I'm {Name_x}, born {Date_March_1990}"

Full example:

from privalyse_mask import PrivalyseMasker

masker = PrivalyseMasker()

# 1. Original sensitive data
user_input = "Hi! I'm Sarah Miller, born March 15, 1990. Email me at sarah.miller@company.com"

# 2. Mask before sending to LLM
masked, mapping = masker.mask(user_input)
print("Masked:", masked)
# Masked: Hi! I'm {Name_a8f}, born {Date_March_1990}. Email me at {Email_at_company.com}

# 3. Send to LLM (mocked for demo - works with any LLM!)
#    The mock echoes the two placeholders back, as a real LLM response might.
llm_response = f"Hello {list(mapping.keys())[0]}! Based on {list(mapping.keys())[1]}, you're about 35 years old."

# 4. Unmask the response
final = masker.unmask(llm_response, mapping)
print("Unmasked:", final)
# Unmasked: Hello Sarah Miller! Based on March 15, 1990, you're about 35 years old.

The LLM never saw the real name, birthday, or email. But it still gave a smart, contextual response!
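The whole round trip hinges on the mapping returned by mask(). It can be pictured as plain string substitution in both directions; the sketch below is a toy mimicking that contract with a hand-written mapping, not the library's implementation (which detects entities with NER):

```python
# Toy round-trip: a placeholder -> original mapping, applied in both directions.
# The mapping and placeholder names below are illustrative only.

def toy_mask(text: str, mapping: dict[str, str]) -> str:
    """Replace each original value with its placeholder."""
    for placeholder, original in mapping.items():
        text = text.replace(original, placeholder)
    return text

def toy_unmask(text: str, mapping: dict[str, str]) -> str:
    """Replace each placeholder back with its original value."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

mapping = {"{Name_a8f}": "Sarah Miller", "{Date_March_1990}": "March 15, 1990"}
masked = toy_mask("I'm Sarah Miller, born March 15, 1990", mapping)
print(masked)                        # I'm {Name_a8f}, born {Date_March_1990}
print(toy_unmask(masked, mapping))   # I'm Sarah Miller, born March 15, 1990
```

Because the mapping stays on your side, the LLM provider only ever sees the placeholder side of the substitution.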

Privalyse Mask Demo


Why explore semantic masking?

We ran into a recurring problem when building LLM apps with real user data:

  • Redaction removes meaning
  • Full anonymization is irreversible
  • Existing tools weren't designed with LLM reasoning in mind

privalyse-mask is our attempt to explore a different tradeoff:
protect privacy without blinding the model.

It's not finished, but it's promising.

Approach         Tradeoff
[REDACTED]       Private but blind: the LLM loses reasoning ability
Anonymization    Secure but permanent: can't restore values
Presidio         Powerful but complex: requires manual setup
privalyse-mask   Privacy + context: semantic placeholders

Example:

Input:  "I was born on October 5, 2000 and live in Berlin"
                          ↓
Masked: "I was born on {Date_October_2000} and live in Berlin"
                          ↓
LLM:    "You're about 25, and German privacy laws (GDPR) apply to you!" 🎯

The LLM still knows the user is ~25 years old. But you never leaked their birthday.

Privalyse Mask Workflow


Real-World Example: OpenAI

from openai import OpenAI
from privalyse_mask import PrivalyseMasker

client = OpenAI(api_key="your-api-key")
masker = PrivalyseMasker()

prompt = "My email is alice@example.com and I was born on 15.03.1995"
masked_prompt, mapping = masker.mask(prompt)
# masked_prompt: "My email is {Email_at_example.com} and I was born on {Date_March_1995}"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": masked_prompt}]
)
# LLM response: "Based on your email domain and birthdate, you're about 30 years old..."

final = masker.unmask(response.choices[0].message.content, mapping)
# Restored: "Based on your email domain and birthdate, you're about 30 years old..."
print(final)

Works with any LLM provider:

OpenAI · Anthropic · Gemini · LangChain · Ollama (local) · See all →


Features

🧠 Semantic Masking: preserves age, country, and format, not just [REDACTED]
⚡ Zero Config: language models download automatically
🔄 Reversible: unmask() restores original values
🌍 Multi-language: English, German, French, Spanish
🚀 Async Ready: mask_async() for FastAPI, aiohttp
💬 Chat History: consistent masking across conversation turns
🔒 Local & Private: your data never leaves your infrastructure

Open questions (help wanted)

We're actively learning and would love feedback on:

  • Are the placeholders intuitive for different models?
  • Which entity types matter most in real applications?
  • Performance tradeoffs in high-throughput pipelines?
  • Edge cases where masking hurts reasoning?

If you have thoughts, benchmarks, or horror stories, please share them.


Configuration

# Default: fast, good for most cases
masker = PrivalyseMasker()

# Production: best accuracy (~500MB download once)
masker = PrivalyseMasker(model_size="lg")

# Multi-language
masker = PrivalyseMasker(languages=["en", "de", "fr"])

# Whitelist terms that should never be masked
masker = PrivalyseMasker(allow_list=["Acme Corp"])

Advanced

Mask entire conversations:

chat = [
    {"role": "user", "content": "I'm Peter"},
    {"role": "assistant", "content": "Hi Peter!"},
]
masked_chat, mapping = masker.mask_struct(chat)
# All "Peter" instances β†’ same placeholder

Async for web frameworks:

masked, mapping = await masker.mask_async(text)
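In a request handler, the usual pattern is masking several texts concurrently with asyncio.gather. A sketch with a stub coroutine standing in for mask_async (the stub and its fixed replacement are illustrative only, so the example runs without the library installed):

```python
import asyncio

async def stub_mask_async(text: str):
    """Stand-in for masker.mask_async(); returns (masked_text, mapping)."""
    await asyncio.sleep(0)  # yield control, as a real async call would
    return text.replace("Sarah", "{Name_x}"), {"{Name_x}": "Sarah"}

async def main():
    texts = ["I'm Sarah", "Sarah's report", "no PII here"]
    # Mask all inputs concurrently instead of one at a time.
    results = await asyncio.gather(*(stub_mask_async(t) for t in texts))
    for masked, _mapping in results:
        print(masked)

asyncio.run(main())
```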

📚 Full Documentation


Who's This For?

  • πŸ—οΈ RAG pipelines β€” Mask documents before vector indexing
  • πŸ’¬ Chatbots β€” Protect user PII in conversations
  • πŸ§ͺ LLM testing β€” Use realistic data without risk
  • πŸ”§ Tool calling β€” Keep function arguments private
  • 🏒 Enterprise β€” GDPR/HIPAA compliance

Roadmap

  • ✅ Multi-language (EN, DE, FR, ES)
  • ✅ Async API
  • ✅ Chat history masking
  • ✅ Auto-download models
  • 🔜 Streaming support (high priority!)
  • 🔜 Custom entity types
  • 🔜 Fine-tuned masking rules per use case

Feedback > Features

At this stage, thoughtful feedback is more valuable than code.

If you:

  • Tried this and hit a limitation
  • Disagree with the masking approach
  • Have privacy or compliance concerns
  • Think this idea won't scale

Please open an issue or discussion. Critical feedback is welcome.

⭐ Star this repo if you want this idea to exist, even if it's not perfect yet.


License

MIT License: free for commercial and personal use.
