Mask PII without breaking reasoning.
Your LLM still understands ages, roles, and locations, but never sees real names, emails, or dates.
❌ Not a redaction library
❌ Not irreversible anonymization
❌ Not a SaaS API
✅ Built for developers shipping LLM apps with real user data
⭐ Star if you're curious about privacy-first LLMs. Try this, break it, and tell us what's missing. Your feedback shapes where this goes.
⚠️ Early-stage project
privalyse-mask is new and evolving. The core idea works, but real-world feedback matters more than polish. If you're experimenting with LLMs and privacy, we'd love your input.
```bash
pip install privalyse-mask
```

```python
from privalyse_mask import PrivalyseMasker

masker = PrivalyseMasker()
masked, mapping = masker.mask("I'm Sarah, born March 15, 1990")
# → "I'm {Name_x}, born {Date_March_1990}"
```

Full example:
```python
from privalyse_mask import PrivalyseMasker

masker = PrivalyseMasker()

# 1. Original sensitive data
user_input = "Hi! I'm Sarah Miller, born March 15, 1990. Email me at sarah.miller@company.com"

# 2. Mask before sending to the LLM
masked, mapping = masker.mask(user_input)
print("Masked:", masked)
# Masked: Hi! I'm {Name_a8f}, born {Date_March_1990}. Email me at {Email_at_company.com}

# 3. Send to the LLM (mocked for this demo; works with any LLM)
llm_response = f"Hello {list(mapping.keys())[0]}! Based on {list(mapping.keys())[1]}, you're about 35 years old."

# 4. Unmask the response
final = masker.unmask(llm_response, mapping)
print("Unmasked:", final)
# Unmasked: Hello Sarah Miller! Based on March 15, 1990, you're about 35 years old.
```

The LLM never saw the real name, birthday, or email, but it still gave a smart, contextual response.
We ran into a recurring problem when building LLM apps with real user data:
- Redaction removes meaning
- Full anonymization is irreversible
- Existing tools weren't designed with LLM reasoning in mind
privalyse-mask is our attempt to explore a different tradeoff:
protect privacy without blinding the model.
It's not finished, but it's promising.
| Approach | Tradeoff |
|---|---|
| Redaction (`[REDACTED]`) | Private but blind: the LLM loses reasoning ability |
| Anonymization | Secure but permanent: values can't be restored |
| Presidio | Powerful but complex: requires manual setup |
| privalyse-mask | Privacy + context via semantic placeholders |
Example:

```
Input:  "I was born on October 5, 2000 and live in Berlin"
          ↓
Masked: "I was born on {Date_October_2000} and live in Berlin"
          ↓
LLM: "You're about 25, and German privacy laws (GDPR) apply to you!" 🎯
```

The LLM still knows the user is ~25 years old, but you never leaked their birthday.
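To make the placeholder idea concrete, here is a minimal, self-contained sketch of semantic date masking. This is an illustration only, not privalyse-mask's actual implementation; the regex, placeholder format, and helper names are hypothetical:

```python
import re

# Simplified sketch of a semantic placeholder: the month and year survive
# in the placeholder (so the LLM can reason about age), while the exact
# day is hidden. The mapping makes the operation reversible.
MONTHS = ("January February March April May June July August "
          "September October November December").split()
DATE_RE = re.compile(r"(%s) (\d{1,2}), (\d{4})" % "|".join(MONTHS))

def mask_dates(text):
    mapping = {}
    def repl(match):
        # Keep month + year, drop the day
        placeholder = "{Date_%s_%s}" % (match.group(1), match.group(3))
        mapping[placeholder] = match.group(0)
        return placeholder
    return DATE_RE.sub(repl, text), mapping

def unmask(text, mapping):
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

masked, mapping = mask_dates("I was born on October 5, 2000 and live in Berlin")
print(masked)  # I was born on {Date_October_2000} and live in Berlin
print(unmask(masked, mapping) == "I was born on October 5, 2000 and live in Berlin")  # True
```

The real library covers many more entity types and formats; the point here is only the round-trip: mask, reason over placeholders, unmask.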
```python
from openai import OpenAI
from privalyse_mask import PrivalyseMasker

client = OpenAI(api_key="your-api-key")
masker = PrivalyseMasker()

prompt = "My email is alice@example.com and I was born on 15.03.1995"
masked_prompt, mapping = masker.mask(prompt)
# masked_prompt: "My email is {Email_at_example.com} and I was born on {Date_March_1995}"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": masked_prompt}]
)
# LLM response: "Based on your email domain and birthdate, you're about 30 years old..."

final = masker.unmask(response.choices[0].message.content, mapping)
# Restored: "Based on your email domain and birthdate, you're about 30 years old..."
print(final)
```

Works with any LLM provider:
OpenAI · Anthropic · Gemini · LangChain · Ollama (local) · See all →
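Provider independence falls out of the design: masking and unmasking wrap the provider call rather than depending on it. A hedged sketch of that pattern, where `ToyMasker` is a hypothetical stand-in for `PrivalyseMasker` and `complete` is any `prompt -> str` callable:

```python
# Generic pattern: mask -> call any LLM -> unmask.
# ToyMasker is a toy stand-in for PrivalyseMasker, just to make this runnable.
class ToyMasker:
    def mask(self, text):
        mapping = {"{Name_a1}": "Alice"}
        return text.replace("Alice", "{Name_a1}"), mapping

    def unmask(self, text, mapping):
        for placeholder, original in mapping.items():
            text = text.replace(placeholder, original)
        return text

def masked_completion(masker, complete, prompt):
    """Wrap any provider's completion callable with mask/unmask."""
    masked_prompt, mapping = masker.mask(prompt)
    response = complete(masked_prompt)  # the provider never sees raw PII
    return masker.unmask(response, mapping)

# Mock "provider" that echoes the second token back:
mock_llm = lambda p: "Hello " + p.split()[1]

print(masked_completion(ToyMasker(), mock_llm, "I'm Alice and I like privacy"))
# Hello Alice
```

Swap `mock_llm` for an OpenAI, Anthropic, or Ollama call and the wrapper is unchanged.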
| Feature | Description |
|---|---|
| 🧠 Semantic Masking | Preserves age, country, and format, not just `[REDACTED]` |
| ⚡ Zero Config | Language models download automatically |
| 🔄 Reversible | `unmask()` restores original values |
| 🌍 Multi-language | English, German, French, Spanish |
| 🚀 Async Ready | `mask_async()` for FastAPI, aiohttp |
| 💬 Chat History | Consistent masking across conversation turns |
| 🔒 Local & Private | Your data never leaves your infrastructure |
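The Chat History feature depends on one property: the same entity always maps to the same placeholder across turns. A minimal sketch of how such consistency can be kept with a placeholder cache (illustrative only; the class and naming scheme are hypothetical, not the library's code):

```python
class ConsistentMasker:
    """Illustrative only: reuse one placeholder per entity across turns."""
    def __init__(self):
        self.placeholders = {}  # entity text -> placeholder

    def mask_entity(self, entity, kind="Name"):
        # First sighting allocates a placeholder; later turns reuse it,
        # so the LLM can track "the same person" through a conversation.
        if entity not in self.placeholders:
            self.placeholders[entity] = "{%s_%d}" % (kind, len(self.placeholders) + 1)
        return self.placeholders[entity]

m = ConsistentMasker()
first = m.mask_entity("Peter")   # turn 1
second = m.mask_entity("Peter")  # turn 2, same entity
print(first == second)  # True
```

Without this cache, each turn would mint a fresh placeholder and the model would lose the thread of who is who.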
We're actively learning and would love feedback on:
- Are the placeholders intuitive for different models?
- Which entity types matter most in real applications?
- What performance tradeoffs matter in high-throughput pipelines?
- Edge cases where masking hurts reasoning?
If you have thoughts, benchmarks, or horror stories, please share them.
```python
# Default: fast, good for most cases
masker = PrivalyseMasker()

# Production: best accuracy (~500 MB download, once)
masker = PrivalyseMasker(model_size="lg")

# Multi-language
masker = PrivalyseMasker(languages=["en", "de", "fr"])

# Whitelist terms that should never be masked
masker = PrivalyseMasker(allow_list=["Acme Corp"])
```

Mask entire conversations:
```python
chat = [
    {"role": "user", "content": "I'm Peter"},
    {"role": "assistant", "content": "Hi Peter!"},
]
masked_chat, mapping = masker.mask_struct(chat)
# All "Peter" instances → same placeholder
```

Async for web frameworks:

```python
masked, mapping = await masker.mask_async(text)
```

📖 Full Documentation
- 🗂️ RAG pipelines: mask documents before vector indexing
- 💬 Chatbots: protect user PII in conversations
- 🧪 LLM testing: use realistic data without risk
- 🔧 Tool calling: keep function arguments private
- 🏢 Enterprise: GDPR/HIPAA compliance
- ✅ Multi-language (EN, DE, FR, ES)
- ✅ Async API
- ✅ Chat history masking
- ✅ Auto-download models
- 🔜 Streaming support (high priority!)
- 🔜 Custom entity types
- 🔜 Fine-tuned masking rules per use case
At this stage, thoughtful feedback is more valuable than code.
If you:
- Tried this and hit a limitation
- Disagree with the masking approach
- Have privacy or compliance concerns
- Think this idea won't scale
Please open an issue or discussion. Critical feedback is welcome.
- 🐛 Open an issue
- 💬 Start a discussion
- 🔧 Submit a PR
⭐ Star this repo if you want this idea to exist, even if it's not perfect yet.
MIT License: free for commercial and personal use.


