
🛡️ privalyse-mask

Privacy-safe LLMs without losing intelligence.

Mask PII without breaking reasoning.
Your LLM still understands age, roles, and locations, but never sees real names, emails, or dates.

❌ Not a redaction library
❌ Not irreversible anonymization
❌ Not a SaaS API
✅ Built for developers shipping LLM apps with real user data


⭐ Star if you're curious about privacy-first LLMs: try this, break it, and tell us what's missing. Your feedback shapes where this goes.

⚠️ Early-stage project
privalyse-mask is new and evolving.
The core idea works, but real-world feedback matters more than polish.

If you're experimenting with LLMs and privacy, we'd love your input.


Quick Start

pip install privalyse-mask

from privalyse_mask import PrivalyseMasker

masker = PrivalyseMasker()
masked, mapping = masker.mask("I'm Sarah, born March 15, 1990")

# → "I'm {Name_x}, born {Date_March_1990}"

Full example:

from privalyse_mask import PrivalyseMasker

masker = PrivalyseMasker()

# 1. Original sensitive data
user_input = "Hi! I'm Sarah Miller, born March 15, 1990. Email me at sarah.miller@company.com"

# 2. Mask before sending to LLM
masked, mapping = masker.mask(user_input)
print("Masked:", masked)
# Masked: Hi! I'm {Name_a8f}, born {Date_March_1990}. Email me at {Email_at_company.com}

# 3. Send to LLM (mocked for demo - works with any LLM!)
#    The mock echoes the two placeholders back, as a real LLM response might.
llm_response = f"Hello {list(mapping.keys())[0]}! Based on {list(mapping.keys())[1]}, you're about 35 years old."

# 4. Unmask the response
final = masker.unmask(llm_response, mapping)
print("Unmasked:", final)
# Unmasked: Hello Sarah Miller! Based on March 15, 1990, you're about 35 years old.

The LLM never saw the real name, birthday, or email. But it still gave a smart, contextual response!
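The whole round trip hinges on the mapping returned by mask(). It can be pictured as plain string substitution in both directions; the sketch below is a toy mimicking that contract with a hand-written mapping, not the library's implementation (which detects entities with NER):

```python
# Toy round-trip: a placeholder -> original mapping, applied in both directions.
# The mapping and placeholder names below are illustrative only.

def toy_mask(text: str, mapping: dict[str, str]) -> str:
    """Replace each original value with its placeholder."""
    for placeholder, original in mapping.items():
        text = text.replace(original, placeholder)
    return text

def toy_unmask(text: str, mapping: dict[str, str]) -> str:
    """Replace each placeholder back with its original value."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

mapping = {"{Name_a8f}": "Sarah Miller", "{Date_March_1990}": "March 15, 1990"}
masked = toy_mask("I'm Sarah Miller, born March 15, 1990", mapping)
print(masked)                        # I'm {Name_a8f}, born {Date_March_1990}
print(toy_unmask(masked, mapping))   # I'm Sarah Miller, born March 15, 1990
```

Because the mapping stays on your side, the LLM provider only ever sees the placeholder side of the substitution.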

Privalyse Mask Demo


Why explore semantic masking?

We ran into a recurring problem when building LLM apps with real user data:

  • Redaction removes meaning
  • Full anonymization is irreversible
  • Existing tools weren't designed with LLM reasoning in mind

privalyse-mask is our attempt to explore a different tradeoff:
protect privacy without blinding the model.

It's not finished, but it's promising.

Approach         Tradeoff
[REDACTED]       Private but blind: the LLM loses reasoning ability
Anonymization    Secure but permanent: can't restore values
Presidio         Powerful but complex: requires manual setup
privalyse-mask   Privacy + context: semantic placeholders

Example:

Input:  "I was born on October 5, 2000 and live in Berlin"
                          ↓
Masked: "I was born on {Date_October_2000} and live in Berlin"
                          ↓
LLM:    "You're about 25, and German privacy laws (GDPR) apply to you!" 🎯

The LLM still knows the user is ~25 years old. But you never leaked their birthday.

Privalyse Mask Workflow


Real-World Example: OpenAI

from openai import OpenAI
from privalyse_mask import PrivalyseMasker

client = OpenAI(api_key="your-api-key")
masker = PrivalyseMasker()

prompt = "My email is alice@example.com and I was born on 15.03.1995"
masked_prompt, mapping = masker.mask(prompt)
# masked_prompt: "My email is {Email_at_example.com} and I was born on {Date_March_1995}"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": masked_prompt}]
)
# LLM response: "Based on your email domain and birthdate, you're about 30 years old..."

final = masker.unmask(response.choices[0].message.content, mapping)
# Restored: "Based on your email domain and birthdate, you're about 30 years old..."
print(final)

Works with any LLM provider:

OpenAI · Anthropic · Gemini · LangChain · Ollama (local) · See all →


Features

🧠 Semantic Masking: preserves age, country, and format, not just [REDACTED]
⚡ Zero Config: language models download automatically
🔄 Reversible: unmask() restores original values
🌍 Multi-language: English, German, French, Spanish
🚀 Async Ready: mask_async() for FastAPI, aiohttp
💬 Chat History: consistent masking across conversation turns
🔒 Local & Private: your data never leaves your infrastructure

Open questions (help wanted)

We're actively learning and would love feedback on:

  • Are the placeholders intuitive for different models?
  • Which entity types matter most in real applications?
  • Performance tradeoffs in high-throughput pipelines?
  • Edge cases where masking hurts reasoning?

If you have thoughts, benchmarks, or horror stories, please share them.


Configuration

# Default: fast, good for most cases
masker = PrivalyseMasker()

# Production: best accuracy (~500MB download once)
masker = PrivalyseMasker(model_size="lg")

# Multi-language
masker = PrivalyseMasker(languages=["en", "de", "fr"])

# Whitelist terms that should never be masked
masker = PrivalyseMasker(allow_list=["Acme Corp"])

Advanced

Mask entire conversations:

chat = [
    {"role": "user", "content": "I'm Peter"},
    {"role": "assistant", "content": "Hi Peter!"},
]
masked_chat, mapping = masker.mask_struct(chat)
# All "Peter" instances β†’ same placeholder

Async for web frameworks:

masked, mapping = await masker.mask_async(text)
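In a request handler, the usual pattern is masking several texts concurrently with asyncio.gather. A sketch with a stub coroutine standing in for mask_async (the stub and its fixed replacement are illustrative only, so the example runs without the library installed):

```python
import asyncio

async def stub_mask_async(text: str):
    """Stand-in for masker.mask_async(); returns (masked_text, mapping)."""
    await asyncio.sleep(0)  # yield control, as a real async call would
    return text.replace("Sarah", "{Name_x}"), {"{Name_x}": "Sarah"}

async def main():
    texts = ["I'm Sarah", "Sarah's report", "no PII here"]
    # Mask all inputs concurrently instead of one at a time.
    results = await asyncio.gather(*(stub_mask_async(t) for t in texts))
    for masked, _mapping in results:
        print(masked)

asyncio.run(main())
```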

📚 Full Documentation


Who's This For?

  • πŸ—οΈ RAG pipelines β€” Mask documents before vector indexing
  • πŸ’¬ Chatbots β€” Protect user PII in conversations
  • πŸ§ͺ LLM testing β€” Use realistic data without risk
  • πŸ”§ Tool calling β€” Keep function arguments private
  • 🏒 Enterprise β€” GDPR/HIPAA compliance

Roadmap

  • ✅ Multi-language (EN, DE, FR, ES)
  • ✅ Async API
  • ✅ Chat history masking
  • ✅ Auto-download models
  • 🔜 Streaming support (high priority!)
  • 🔜 Custom entity types
  • 🔜 Fine-tuned masking rules per use case

Feedback > Features

At this stage, thoughtful feedback is more valuable than code.

If you:

  • Tried this and hit a limitation
  • Disagree with the masking approach
  • Have privacy or compliance concerns
  • Think this idea won't scale

Please open an issue or discussion. Critical feedback is welcome.

⭐ Star this repo if you want this idea to exist, even if it's not perfect yet.


License

MIT License: free for commercial and personal use.
