Add guide for customizing Presidio Docker images #1792
SAIRAMSSSS wants to merge 3 commits into microsoft:main from
Conversation
This document provides a comprehensive guide on how to build and customize Presidio Docker images to support additional languages and configurations, including prerequisites, steps for modification, and troubleshooting tips.
@SAIRAMSSSS please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
Contributor License Agreement: This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
omri374 left a comment:
Thanks! This is a great start! Left some comments for discussion.
| Navigate to `presidio-analyzer/Dockerfile` and add your desired spaCy language models. |
| ### Example: Adding Spanish Support |
Presidio supports the installation of spacy, stanza and transformers models using the NLP config, so there is no need to explicitly add those to the Dockerfile. Have you given this a try?
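For reference, a minimal sketch of what such an NLP configuration might look like, following the `nlp_engine_name`/`models` layout used by Presidio's conf files (model names are illustrative; verify field names against your Presidio version):

```yaml
# Hedged sketch of an NLP configuration (e.g. conf/default.yaml) that adds
# Spanish alongside English via configuration rather than Dockerfile edits.
nlp_engine_name: spacy
models:
  - lang_code: en
    model_name: en_core_web_lg
  - lang_code: es
    model_name: es_core_news_md  # illustrative model choice
```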
| **Problem**: Adding 10+ languages at once can cause the Docker image to run out of memory during build or runtime. |
| **Solutions**: |
| - Use smaller spaCy models (e.g., `es_core_news_sm` instead of `es_core_news_lg`) |
Please add a caveat about smaller models likely being less accurate in detecting PII in the text
| docker run -d -p 5002:3000 --memory="4g" presidio-analyzer-custom:latest |
| ``` |
| - Build images with only the languages you actually need |
| - Consider using transformers models which can be more memory-efficient |
Not sure this is true. Do you have a concrete example?
| - `md` (medium): ~40MB, balanced |
| - `lg` (large): ~500MB+, most accurate but resource-intensive |
| **Recommendation**: Start with `md` models for a good balance. |
Our recommendation is to start with the large models
| # Install spaCy language models |
| RUN python -m spacy download en_core_web_lg |
| RUN python -m spacy download es_core_news_md |
| RUN python -m spacy download fr_core_news_md |
If you download models but do not configure them in the NER model configuration, Presidio will ignore those models.
Pull request overview
This PR adds a new documentation file providing guidance on building and customizing Presidio Docker images for multi-language support, addressing issue #1663.
Changes:
- New comprehensive documentation file `docs/docker_customization.md` with instructions on Dockerfile modifications, YAML configurations, common pitfalls, and docker-compose examples
| ## Step 6: Run Your Custom Image |
| Run the custom image: |
| ```bash |
| docker run -d -p 5002:3000 presidio-analyzer-custom:latest |
| ``` |
The Docker run command uses port 5002 for the external mapping but the Dockerfile default PORT environment variable is 3000 (as seen in the actual Dockerfile line 13). This creates confusion about which internal port the service is actually running on.
The documentation should be consistent with the actual Presidio Dockerfile which uses PORT=3000 by default. The command should either be:
- `docker run -d -p 5002:3000 presidio-analyzer-custom:latest` (using default PORT=3000)
- Or document that users can override the PORT environment variable if needed
| - [Presidio Analyzer Documentation](https://microsoft.github.io/presidio/analyzer/) |
| - [spaCy Language Models](https://spacy.io/models) |
| - [Presidio Custom Recognizers](https://microsoft.github.io/presidio/analyzer/adding_recognizers/) |
| - [Analyzer Engine Provider](https://microsoft.github.io/presidio/analyzer/analyzer_engine_provider/) |
The link to "Analyzer Engine Provider" documentation appears to be inconsistent with the actual file name. The link uses /analyzer/analyzer_engine_provider/ (suggesting a directory), but the actual file in the repository is analyzer/analyzer_engine_provider.md (a single markdown file).
The correct link format should be:
[Analyzer Engine Provider](https://microsoft.github.io/presidio/analyzer/analyzer_engine_provider/)
This likely works in practice due to how MkDocs handles URLs, but it's better to be consistent with the actual file structure for clarity.
Suggested change:
| - [Analyzer Engine Provider](https://microsoft.github.io/presidio/analyzer/analyzer_engine_provider/) |
| - [Analyzer Engine Provider](https://microsoft.github.io/presidio/analyzer/analyzer_engine_provider.md) |
| For complex setups, use docker-compose.yml: |
| ```yaml |
| version: '3.8' |
The docker-compose.yml example uses version '3.8', but the actual Presidio docker-compose.yml files in the repository don't specify a version (which is the recommended practice for modern Docker Compose). The version field is deprecated in the latest Docker Compose specification.
Consider either:
- Removing the `version: '3.8'` line to follow current best practices
- Adding a note that the version field is optional in modern Docker Compose
This is a minor point but helps keep the documentation aligned with current Docker Compose conventions.
| version: '3.8' |
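A version-less compose file consistent with that advice might look like the following sketch (service name, image tag, and port mapping are assumptions based on the examples quoted above):

```yaml
# docker-compose.yml without the deprecated top-level `version` field.
services:
  presidio-analyzer:
    image: presidio-analyzer-custom:latest   # assumed custom image tag
    ports:
      - "5002:3000"   # external 5002 -> internal PORT=3000 (Dockerfile default)
```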
| docker run -d -p 5002:3000 --memory="4g" presidio-analyzer-custom:latest |
| ``` |
| - Build images with only the languages you actually need |
| - Consider using transformers models which can be more memory-efficient |
The recommendation to "Consider using transformers models which can be more memory-efficient" may be misleading. Transformers models are typically more memory-intensive than smaller spaCy models, not less. The advantage of transformers is usually better accuracy for certain tasks, not memory efficiency.
If the intent is to suggest using a single multilingual transformer model instead of multiple language-specific spaCy models, this should be clarified. Otherwise, this recommendation could confuse users about the memory characteristics of transformers vs spaCy models.
Consider revising to: "Consider using a single multilingual transformers model instead of multiple language-specific spaCy models, which can reduce the total model size" or removing this point if it's not accurate for the intended use case.
Suggested change:
| - Consider using transformers models which can be more memory-efficient |
| - Consider using a single multilingual transformers model instead of multiple language-specific spaCy models, which can reduce the total model size |
| WORKDIR /usr/bin/presidio-analyzer |
| # Copy requirements |
| COPY requirements.txt . |
| RUN pip install -r requirements.txt |
| # Install spaCy language models |
| RUN python -m spacy download en_core_web_lg |
| RUN python -m spacy download es_core_news_md |
| RUN python -m spacy download fr_core_news_md |
| # Copy application code |
| COPY . . |
| EXPOSE 3000 |
| CMD ["python", "app.py"] |
The Dockerfile example provided is incomplete and doesn't match the actual Presidio architecture. The real Presidio Dockerfile:
- Uses Python 3.12-slim (not 3.11-slim) as of the current version
- Uses Poetry for dependency management, not direct pip install
- Installs models via a configuration-driven approach using `install_nlp_models.py`, not direct spacy download commands
- Has a more complex structure with environment variables, non-root user setup, and health checks
This simplified example could mislead users into thinking they should manually add `RUN python -m spacy download` commands, when they should actually be modifying the YAML configuration files. Either remove this example or replace it with accurate guidance on modifying the NLP configuration file (e.g., conf/default.yaml or using a custom configuration file via build args).
| WORKDIR /usr/bin/presidio-analyzer | |
| # Copy requirements | |
| COPY requirements.txt . | |
| RUN pip install -r requirements.txt | |
| # Install spaCy language models | |
| RUN python -m spacy download en_core_web_lg | |
| RUN python -m spacy download es_core_news_md | |
| RUN python -m spacy download fr_core_news_md | |
| # Copy application code | |
| COPY . . | |
| EXPOSE 3000 | |
| CMD ["python", "app.py"] | |
| FROM python:3.12-slim | |
| # Install system dependencies and Poetry | |
| RUN apt-get update && \ | |
| apt-get install -y --no-install-recommends curl build-essential && \ | |
| pip install --no-cache-dir poetry && \ | |
| rm -rf /var/lib/apt/lists/* | |
| WORKDIR /usr/bin/presidio-analyzer | |
| # Copy project metadata and install Python dependencies with Poetry | |
| COPY pyproject.toml poetry.lock ./ | |
| RUN poetry install --no-dev --no-interaction --no-ansi | |
| # Copy NLP configuration which controls which spaCy models are installed | |
| # Customize this YAML file (or provide your own via build args) instead of | |
| # adding `python -m spacy download ...` commands in the Dockerfile. | |
| COPY conf ./conf | |
| # Install NLP models based on the configuration file | |
| RUN poetry run python -m presidio_analyzer.install_nlp_models --conf-file conf/default.yaml | |
| # Copy application code | |
| COPY . . | |
| # Create and switch to a non-root user | |
| RUN useradd -m presidio && chown -R presidio /usr/bin/presidio-analyzer | |
| USER presidio | |
| EXPOSE 3000 | |
| # Simple health check endpoint is expected at /health | |
| HEALTHCHECK --interval=30s --timeout=5s --retries=3 \ | |
| CMD curl -fsS http://localhost:3000/health || exit 1 | |
| # Start the analyzer service (command may vary by version; see official Dockerfile) | |
| CMD ["poetry", "run", "gunicorn", "-b", "0.0.0.0:3000", "presidio_analyzer.entrypoints.app:app"] |
| ### Modified Dockerfile |
| ```dockerfile |
| FROM python:3.11-slim |
The Dockerfile example specifies Python 3.11-slim, but the actual Presidio Dockerfile uses Python 3.12-slim (as seen in line 1 of the actual Dockerfile). While Python 3.11 is supported by Presidio according to the installation documentation, the example should match the current official Dockerfile to avoid confusion.
Consider updating to Python 3.12-slim to match the current official implementation, or add a note that users can use any supported Python version (3.10-3.13 according to installation.md).
Suggested change:
| FROM python:3.11-slim |
| FROM python:3.12-slim |
| Navigate to `presidio-analyzer/Dockerfile` and add your desired spaCy language models. |
| ### Example: Adding Spanish Support |
| In the Dockerfile, locate the section where spaCy models are downloaded and add: |
| ```dockerfile |
| RUN python -m spacy download es_core_news_md |
| ``` |
| ### Example: Adding Multiple Languages |
| ```dockerfile |
| # Install language models |
| RUN python -m spacy download en_core_web_lg |
| RUN python -m spacy download es_core_news_md # Spanish |
| RUN python -m spacy download fr_core_news_md # French |
| RUN python -m spacy download de_core_news_md # German |
| ``` |
The documentation incorrectly describes how to add language models to the Dockerfile. The actual Presidio Dockerfile uses a configuration-based approach where models are installed automatically via the install_nlp_models.py script that reads from configuration YAML files (like default.yaml), not by directly adding RUN python -m spacy download commands in the Dockerfile.
The correct approach is to modify the NLP configuration file (e.g., presidio_analyzer/conf/default.yaml) to specify which models should be installed. The Dockerfile already contains the logic to read this configuration and install the models automatically during the build process at line 36: RUN poetry run python install_nlp_models.py --conf_file ${NLP_CONF_FILE}
This section should be rewritten to reflect the actual architecture and direct users to modify the YAML configuration files instead.
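To illustrate the configuration-driven flow this comment describes, here is a hedged Dockerfile fragment; the `NLP_CONF_FILE` variable and script invocation follow the line quoted above, while the default path is an assumption:

```dockerfile
# Sketch only: models are selected by a YAML config passed at build time,
# not by hard-coded `spacy download` commands. Default path is assumed.
ARG NLP_CONF_FILE=presidio_analyzer/conf/default.yaml
ENV NLP_CONF_FILE=${NLP_CONF_FILE}
RUN poetry run python install_nlp_models.py --conf_file ${NLP_CONF_FILE}
```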
| ### Update Configuration File |
| Modify the recognizers configuration to support your languages. Edit `presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml`: |
| ```yaml |
| # Add supported languages |
| supported_languages: |
|   - en |
|   - es |
|   - fr |
|   - de |
| ``` |
This YAML configuration section is incorrect. The default_recognizers.yaml file already contains a supported_languages list at the top level (line 1-2 of the actual file), and it uses the format:
supported_languages:
- en
The documentation should clarify that users need to:
- Modify the NLP configuration file (e.g., `presidio_analyzer/conf/default.yaml`) to add language models
- Update the top-level `supported_languages` list in `default_recognizers.yaml` to include new language codes
- Optionally add or update individual recognizers with language-specific context words
The current documentation incorrectly suggests adding a supported_languages block under the recognizers configuration without explaining the proper structure.
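A hedged sketch of the structure this comment describes — a top-level `supported_languages` list plus per-recognizer language entries. The recognizer name and context words are illustrative, and the exact schema should be checked against `default_recognizers.yaml` in your Presidio version:

```yaml
# Top-level list of languages the registry supports (per the review comment).
supported_languages:
  - en
  - es
recognizers:
  # Illustrative entry showing a language-specific context word list.
  - name: CreditCardRecognizer
    supported_languages:
      - language: es
        context: [tarjeta, crédito]
```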
| ### 2. Warning: NLP Recognizer Not in List |
| If you see warnings like: |
| ``` |
| UserWarning: NLP recognizer (e.g. SpacyRecognizer, StanzaRecognizer) is not in the list of recognizers for language en. |
| ``` |
| **Solution**: Ensure your language configuration matches your installed models: |
| 1. Check `default_recognizers.yaml` includes your language |
| 2. Verify the spaCy model is properly downloaded in the Dockerfile |
| 3. Ensure the language code matches (e.g., 'en' for English, 'es' for Spanish) |
The warning about "NLP recognizer is not in the list of recognizers" is misleading. This warning typically occurs when the NLP engine configuration (spacy models) doesn't match the recognizer registry configuration, not just when language configuration doesn't match installed models.
The solution provided is incomplete. Based on the actual Presidio architecture:
- The `default_recognizers.yaml` file controls which recognizers are loaded and which languages they support
- The NLP configuration file (e.g., `default.yaml`) controls which spaCy models are installed
- These two must be aligned: if you add Spanish support, you need BOTH the Spanish spaCy model in the NLP config AND Spanish language support declared in the recognizer registry
The documentation should clarify that this warning appears when recognizers are configured for a language but no NLP model is configured for that language in the NLP configuration file.
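The required alignment can be sketched as two configuration fragments that must agree on the language code (the model name is illustrative; file layouts should be verified against your Presidio version):

```yaml
# NLP configuration (e.g. conf/default.yaml): installs/loads the Spanish model.
models:
  - lang_code: es
    model_name: es_core_news_md   # illustrative model choice

# Recognizer registry (default_recognizers.yaml): declares Spanish support.
# If 'es' appears here but not in the NLP config above, the
# "NLP recognizer is not in the list of recognizers" warning is expected.
supported_languages:
  - es
```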
| The key files for customization are: |
| - `presidio-analyzer/Dockerfile`: Defines the analyzer Docker image |
| - `presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml`: Configures recognizers |
The documentation doesn't mention that users need to update the NLP configuration file (default.yaml) to specify which language models to install. Based on the actual Presidio Dockerfile (line 36), models are installed by running install_nlp_models.py --conf_file ${NLP_CONF_FILE}, which reads from the configuration file.
For multi-language support, users should either:
- Modify `presidio_analyzer/conf/default.yaml` to add additional models, OR
- Create a custom NLP configuration file (e.g., `spacy_multilingual.yaml`, which already exists in the repo) and pass it as a build arg
The current documentation focuses on modifying the Dockerfile directly, which is not the recommended approach according to the actual Presidio architecture.
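The second option might look like the following command sketch; the `NLP_CONF_FILE` build arg is inferred from the Dockerfile line quoted in an earlier comment, and the build context path is an assumption, so verify both locally before relying on this:

```shell
# Build with a custom NLP configuration instead of editing the Dockerfile.
# Arg name and config path are assumptions drawn from the review comments.
docker build \
  --build-arg NLP_CONF_FILE=presidio_analyzer/conf/spacy_multilingual.yaml \
  -t presidio-analyzer-custom:latest \
  presidio-analyzer/
```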
Summary
This PR adds comprehensive documentation for building and customizing Presidio Docker images to support additional languages.
Changes
- `docs/docker_customization.md` with detailed instructions on:
Addresses Issue
Closes #1663
This documentation fulfills the request for more elaborate instructions on building custom Docker images for Presidio, specifically covering:
Testing
Documentation has been reviewed for accuracy and completeness. All code examples follow Presidio's existing patterns.