Most of Presidio's services are written in Go. The presidio-analyzer module, in charge of detecting entities in text, is written in Python. This document details what is required to develop for Presidio.
- Install Go 1.11 and Python 3.7.

- Install the Go packages via `dep`:

  ```sh
  dep ensure
  ```
- Install the tesseract OCR framework (optional, only needed for image anonymization).
- Build and install re2 (optional; Presidio will use `regex` instead of `pyre2` if re2 is not installed):

  ```sh
  re2_version="2018-12-01"
  wget -O re2.tar.gz https://github.com/google/re2/archive/${re2_version}.tar.gz
  mkdir re2
  tar --extract --file "re2.tar.gz" --directory "re2" --strip-components 1
  cd re2 && make install
  ```
- Install pipenv.

  Pipenv is a Python workflow manager that handles dependencies and virtual environments for Python packages. It is used as the dependency manager in Presidio's analyzer project.

  Using pip:

  ```sh
  pip3 install --user pipenv
  ```

  Using Homebrew:

  ```sh
  brew install pipenv
  ```

  Additional installation instructions: https://pipenv.readthedocs.io/en/latest/install/#installing-pipenv
- Create a virtualenv for the project and install all requirements in the Pipfile, including dev requirements. In the `presidio-analyzer` folder, run:

  ```sh
  pipenv install --dev --sequential --skip-lock
  ```
- Download the spaCy model:

  ```sh
  pipenv run python -m spacy download en_core_web_lg
  ```
- Run all tests:

  ```sh
  pipenv run pytest
  ```
- To run arbitrary scripts within the virtual env, start the command with `pipenv run`. For example:

  ```sh
  pipenv run flake8 analyzer --exclude "*pb2*.py"
  pipenv run pylint analyzer
  pipenv run pip freeze
  ```
- Start a shell:

  ```sh
  pipenv shell
  ```
- Run commands in the shell:

  ```sh
  pytest
  pylint analyzer
  pip freeze
  ```
- To use presidio-analyzer as a python library, see Installing presidio-analyzer as a standalone Python package
- To add new recognizers in order to support new entities, see Adding new custom recognizers
- Installing and building the entire Presidio solution is currently not supported on Windows. However, installing and building the different docker images, or the Python package for detecting entities (presidio-analyzer) is possible on Windows. See here
- Build the bins with:

  ```sh
  make build
  ```

- Build the base containers with:

  ```sh
  make docker-build-deps DOCKER_REGISTRY=${DOCKER_REGISTRY} PRESIDIO_DEPS_LABEL=${PRESIDIO_DEPS_LABEL}
  ```

  (If you do not specify a valid, logged-in registry, a warning will be echoed to the standard output.)

- Build the Docker images with:

  ```sh
  make docker-build DOCKER_REGISTRY=${DOCKER_REGISTRY} PRESIDIO_DEPS_LABEL=${PRESIDIO_DEPS_LABEL} PRESIDIO_LABEL=${PRESIDIO_LABEL}
  ```

- Push the Docker images with:

  ```sh
  make docker-push DOCKER_REGISTRY=${DOCKER_REGISTRY} PRESIDIO_LABEL=${PRESIDIO_LABEL}
  ```

- Run the tests with:

  ```sh
  make test
  ```

- Adding a Go file requires running the `make go-format` command before running and building the service.

- Run functional tests with:

  ```sh
  make test-functional
  ```

- Updating Python dependencies instructions
- These steps are verified on every pull request validation to a Presidio branch. Do not alter this document without referring to the implemented steps in the pipeline.
- `GRPC_PORT`: `3001`, GRPC listen port (analyzer)
- `GRPC_PORT`: `3002`, GRPC listen port (anonymizer)
- `WEB_PORT`: `8080`, HTTP listen port
- `REDIS_URL`: `localhost:6379`, optional, Redis address
- `ANALYZER_SVC_ADDRESS`: `localhost:3001`, analyzer address
- `ANONYMIZER_SVC_ADDRESS`: `localhost:3002`, anonymizer address
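To make the variable names and defaults above concrete, here is a hypothetical sketch in Python of a service reading this configuration. This is purely illustrative (Presidio's services are written in Go); only the variable names and default values come from the list above:

```python
import os

# Hypothetical illustration of reading the environment variables listed
# above, with their documented defaults. Not Presidio's actual source.
def read_service_config(env=os.environ):
    return {
        "grpc_port": int(env.get("GRPC_PORT", "3001")),
        "web_port": int(env.get("WEB_PORT", "8080")),
        "redis_url": env.get("REDIS_URL", "localhost:6379"),  # optional
        "analyzer_svc_address": env.get("ANALYZER_SVC_ADDRESS", "localhost:3001"),
        "anonymizer_svc_address": env.get("ANONYMIZER_SVC_ADDRESS", "localhost:3002"),
    }

# Example: override only the GRPC port, keep the rest at defaults.
config = read_service_config({"GRPC_PORT": "3002"})
```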
Developing presidio as a whole on Windows is currently not supported. However, it is possible to run and test the presidio-analyzer module, in charge of detecting entities in text, on Windows using Docker:
- Run locally the core services Presidio needs to operate:

  ```sh
  # the containers below share a user-defined network; create it first
  docker network create testnetwork

  docker run --rm --name test-redis --network testnetwork -d -p 6379:6379 redis
  docker run --rm --name test-presidio-anonymizer --network testnetwork -d -p 3001:3001 -e GRPC_PORT=3001 mcr.microsoft.com/presidio-anonymizer:latest
  docker run --rm --name test-presidio-recognizers-store --network testnetwork -d -p 3004:3004 -e GRPC_PORT=3004 -e REDIS_URL=test-redis:6379 mcr.microsoft.com/presidio-recognizers-store:latest
  ```
- Navigate to `<Presidio folder>/presidio-analyzer`.
- Install the Python packages if you haven't done so yet:

  ```sh
  pipenv install --dev --sequential
  ```

- If you want to experiment with `analyze` requests, navigate into the `analyzer` folder and start serving the analyzer service:

  ```sh
  pipenv run python app.py serve --grpc-port 3000
  ```

- In a new `pipenv shell` window you can run `analyze` requests, for example:

  ```sh
  pipenv run python app.py analyze --text "John Smith drivers license is AC432223" --fields "PERSON" "US_DRIVER_LICENSE" --grpc-port 3000
  ```
- Edit `post.lua`. Change the template name.

- Run wrk:

  ```sh
  wrk -t2 -c2 -d30s -s post.lua http://<api-service-address>/api/v1/projects/<my-project>/analyze
  ```
- If deploying from a private registry, verify that Kubernetes has access to the Docker registry.

- If using a Kubernetes secret to manage the registry authentication, make sure it is registered under the `presidio` namespace.
Edit `charts/presidio/values.yaml` to:

- Set the secret name (for private registries)
- Change the Presidio services version
- Change the default scale
- The NLP engines deployed are set on startup based on the yaml configuration files in `presidio-analyzer/conf/`. The default NLP engine is the large English spaCy model (`en_core_web_lg`), set in `default.yaml`.
- The format of the yaml file is as follows:

  ```yaml
  nlp_engine_name: spacy # {spacy, stanza}
  models:
    -
      lang_code: en # code corresponds to `supported_language` in any custom recognizers
      model_name: en_core_web_lg # the name of the spaCy or Stanza model
    -
      lang_code: de # more than one model is optional, just add more items
      model_name: de
  ```
- By default, the `load_predefined_recognizers` method of the `RecognizerRegistry` class is called to load language-specific and language-agnostic recognizers.

- Downloading additional engines:

  - spaCy NLP models: models download page
  - Stanza NLP models: models download page
  ```sh
  # download models - tl;dr
  # spacy
  python -m spacy download en_core_web_lg
  # stanza
  python -c 'import stanza; stanza.download("en")'
  ```
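To make the yaml mapping above concrete, the sketch below builds a `lang_code -> model_name` lookup from the equivalent parsed structure. The dict literal stands in for the result of loading `default.yaml`; the lookup logic is an illustration, not Presidio's actual loading code:

```python
# Parsed equivalent of the yaml configuration shown above.
# Illustrative only; not Presidio's implementation.
conf = {
    "nlp_engine_name": "spacy",  # {spacy, stanza}
    "models": [
        {"lang_code": "en", "model_name": "en_core_web_lg"},
        {"lang_code": "de", "model_name": "de"},
    ],
}

# lang_code corresponds to `supported_language` in custom recognizers,
# so a per-language model lookup is the natural shape:
models_by_lang = {m["lang_code"]: m["model_name"] for m in conf["models"]}

print(models_by_lang["en"])  # en_core_web_lg
```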
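The recognizer-loading behavior described above (language-specific vs. language-agnostic recognizers) can be sketched as follows. The classes and recognizer names here are simplified, hypothetical stand-ins, not Presidio's actual `RecognizerRegistry` API:

```python
# Simplified stand-ins illustrating how language-specific and
# language-agnostic recognizers coexist in a registry.
class Recognizer:
    def __init__(self, name, supported_language=None):
        self.name = name
        # None means language-agnostic; otherwise the recognizer
        # applies only to text in that language.
        self.supported_language = supported_language

class RecognizerRegistry:
    def __init__(self):
        self._recognizers = []

    def load_predefined_recognizers(self, recognizers):
        # In Presidio, the predefined recognizers are loaded by default.
        self._recognizers.extend(recognizers)

    def get_recognizers(self, language):
        # A recognizer matches if it is agnostic or supports the language.
        return [r for r in self._recognizers
                if r.supported_language in (None, language)]

registry = RecognizerRegistry()
registry.load_predefined_recognizers([
    Recognizer("CreditCardRecognizer"),             # language-agnostic
    Recognizer("UsDriverLicenseRecognizer", "en"),  # English only
    Recognizer("DePhoneRecognizer", "de"),          # German only (hypothetical)
])
print([r.name for r in registry.get_recognizers("en")])
```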