Merged
95 changes: 95 additions & 0 deletions .github/workflows/build-images-manifests.yml
@@ -0,0 +1,95 @@
name: Presidio Docker Build

on:
  push:
    branches: [ main ]
  workflow_dispatch:

env:
  REGISTRY_NAME: ghcr.io # SDSC ADD-ON
  USERNAME: ${{ github.repository_owner }}
  TAG: gha${{ github.run_number }}

jobs:
  build-platform-images:
    name: Build ${{ matrix.image }} (${{ matrix.platform }})
    runs-on: ${{ matrix.runner }}
    strategy:
      matrix:
        include:
          - image: presidio-anonymizer
            platform: linux/amd64
            runner: ubuntu-latest
          - image: presidio-analyzer
            platform: linux/amd64
            runner: ubuntu-latest
          # Note: do we want this part of presidio ? Maybe future feature ?
          # - image: presidio-image-redactor
          #   platform: linux/amd64
          #   runner: ubuntu-latest
    steps:
      # SDSC ADD-ON
      - name: Get latest Presidio release tag
        id: presidio_release
        run: |
          tag=$(curl -s https://api.github.com/repos/microsoft/presidio/releases/latest | jq -r .tag_name)
          echo "tag=$tag" >> $GITHUB_OUTPUT

      # SDSC ADD-ON
      - name: Checkout Presidio (latest release)
        uses: actions/checkout@v5
        with:
          repository: microsoft/presidio
          ref: ${{ steps.presidio_release.outputs.tag }}

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      # SDSC ADD-ON
      # https://github.com/docker/login-action
      - name: Log in to the Container registry
        uses: docker/login-action@v3.0.0
        with:
          registry: ${{ env.REGISTRY_NAME }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and Push ${{ matrix.image }} for ${{ matrix.platform }}
        run: |
          # Create platform-specific tag
          PLATFORM_TAG=$(echo "${{ matrix.platform }}" | sed 's/\//-/g')
          docker buildx build \
            --platform ${{ matrix.platform }} \
            --push \
            --tag ${{ env.REGISTRY_NAME }}/${{ env.USERNAME }}/${{ matrix.image }}:${{ env.TAG }}-${PLATFORM_TAG} \
            --cache-from type=registry,ref=${{ env.REGISTRY_NAME }}/${{ env.USERNAME }}/${{ matrix.image }}:latest \
            --cache-to type=inline \
            ./${{ matrix.image }}

  create-manifests:
    name: Create Multi-Platform Manifests
    runs-on: ubuntu-latest
    needs: build-platform-images
    steps:
      # SDSC ADD-ON
      # https://github.com/docker/login-action
      - name: Log in to the Container registry
        uses: docker/login-action@v3.0.0
        with:
          registry: ${{ env.REGISTRY_NAME }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Create all multi-platform manifests
        run: |
          # presidio-image-redactor is left out: it is commented out of the build
          # matrix above, so no per-platform image tag exists for it in this run.
          IMAGES=("presidio-anonymizer" "presidio-analyzer")

          for image in "${IMAGES[@]}"; do
            echo "Creating manifest for $image"
            docker buildx imagetools create \
              --tag ${{ env.REGISTRY_NAME }}/${{ env.USERNAME }}/${image}:${{ env.TAG }} \
              ${{ env.REGISTRY_NAME }}/${{ env.USERNAME }}/${image}:${{ env.TAG }}-linux-amd64
          done
6 changes: 6 additions & 0 deletions .gitignore
@@ -1,4 +1,10 @@
.direnv/

# third party manifests
external/helm/*
external/ytt/*
external/.vendir*

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[codz]
37 changes: 22 additions & 15 deletions docs/presidio-poc.md
@@ -8,14 +8,17 @@
- can be deployed as an API server using a compose stack

## API usage

Two steps:

- analyze: NER from raw text using models
- anonymize: config (rule) based processing of pre-detected PII

### analyze

- Minimal requirements: text + language. By default, all recognizers for that language are enabled.
```sh
$ curl http://localhost:5002/analyze -s --header "Content-Type: application/json" --request POST --data '{"text": "John Smith drivers license is AC432223","language": "en"}' | jq
[
{
"analysis_explanation": null,
@@ -33,19 +36,22 @@
}
]
```
- analysis can be controlled by setting a detection score threshold, selecting entities, adding context words and adding a correlation id
- ad-hoc pattern (regex) recognizers can be provided as json objects
- a correlation-id (hash) can be given to append to logs for easier grouping of analyses in logs / traces.
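
The options listed above can be sketched as a request body. This is a minimal sketch, not a definitive client: the field names (`score_threshold`, `entities`, `context`, `ad_hoc_recognizers`, `correlation_id`) reflect my understanding of the Presidio analyzer REST API and should be checked against the deployed version, and the recognizer, regex, and id values are hypothetical.

```python
import json


def build_analyze_request(text, language, *, score_threshold=None, entities=None,
                          context=None, ad_hoc_recognizers=None, correlation_id=None):
    """Assemble a JSON-serialisable body for POST /analyze, omitting unset options."""
    body = {"text": text, "language": language}
    optional = {
        "score_threshold": score_threshold,        # minimum detection score to report
        "entities": entities,                      # restrict detection to these entity types
        "context": context,                        # extra context words
        "ad_hoc_recognizers": ad_hoc_recognizers,  # regex recognizers passed as JSON objects
        "correlation_id": correlation_id,          # appended to logs to group analyses
    }
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


# Hypothetical ad-hoc pattern recognizer (name, regex and score are illustrative only).
zip_recognizer = {
    "name": "Swiss zip code (ad hoc)",
    "supported_language": "en",
    "patterns": [{"name": "zip code (weak)", "regex": r"\b\d{4}\b", "score": 0.01}],
    "context": ["zip", "code"],
    "supported_entity": "ZIP",
}

payload = build_analyze_request(
    "Zip code: 1011 Lausanne",
    "en",
    score_threshold=0.3,
    entities=["ZIP", "PERSON"],
    ad_hoc_recognizers=[zip_recognizer],
    correlation_id="poc-0001",
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary could then be sent with `requests.post(url, json=payload)` against the `/analyze` endpoint shown in the curl example.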

### anonymize

- By default, anonymization replaces every detected identifier with its entity type (e.g. `<PERSON>`) in the input text.
- An anonymizer dictionary can be provided to map specific entity types to specific anonymization procedures.
- Two inputs must be given to the endpoint:
- the raw text
- the response from the analyze step (detected entities and their positions)
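
As a rough illustration of the two inputs and of the default behaviour, the sketch below builds an anonymize request body and mimics the default replacement locally. The body field names (`text`, `analyzer_results`, `anonymizers`) and the operator settings are my assumptions about the Presidio anonymizer API; `replace_with_entity_type` is a hypothetical helper written for this example and is not Presidio code. The scores in the sample results are made up.

```python
def build_anonymize_request(text, analyzer_results, anonymizers=None):
    """Body for POST /anonymize: the raw text plus the output of the analyze step."""
    body = {"text": text, "analyzer_results": analyzer_results}
    if anonymizers is not None:
        # Maps an entity type (or "DEFAULT") to an anonymization procedure.
        body["anonymizers"] = anonymizers
    return body


def replace_with_entity_type(text, analyzer_results):
    """Local stand-in for the default behaviour: each span becomes <ENTITY_TYPE>.

    Assumes the detected spans do not overlap. Spans are processed from the
    end of the text so that earlier offsets stay valid.
    """
    for result in sorted(analyzer_results, key=lambda r: r["start"], reverse=True):
        placeholder = f"<{result['entity_type']}>"
        text = text[:result["start"]] + placeholder + text[result["end"]:]
    return text


text = "John Smith drivers license is AC432223"
results = [
    {"entity_type": "PERSON", "start": 0, "end": 10, "score": 0.85},
    {"entity_type": "US_DRIVER_LICENSE", "start": 30, "end": 38, "score": 0.65},
]
anonymizers = {
    "US_DRIVER_LICENSE": {"type": "mask", "masking_char": "*", "chars_to_mask": 4, "from_end": True},
    "DEFAULT": {"type": "replace", "new_value": "<REDACTED>"},
}

body = build_anonymize_request(text, results, anonymizers)
print(replace_with_entity_type(text, results))
```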

### artificial sample

Input:

```
Prof. Gérard Waeber, Chef de service
Tél: +41 21 314 68 85 / Fax: +41 21 314 08 95
@@ -77,8 +83,10 @@ jfldéijf
Dr Médecin 00 Formateur
Chef de clinique
```

- ## initial tests
Works with the example artificial discharge letter (lettre de sortie).

```python
import json
import requests
@@ -129,7 +137,9 @@ print(
## limitations

### potential improvements

Model configuration

```yaml
# config.yaml
nlp_engine_name: spacy
@@ -157,30 +167,28 @@ ner_model_configuration:
```

Recognizer configuration

```yaml
# recognizers.yaml
recognizers:
  - name: "Swiss Zip code Recognizer"
    supported_languages:
      - language: fr
        context: [adresse, postal]
      - language: de
        context: [ort]
      - language: it
        context: [...]

    patterns:
      - name: "zip code (weak)"
        regex: "(\\b\\d{5}(?:\\-\\d{4})?\\b)"
        score: 0.01
    context:
      - zip
      - code
    supported_entity: "ZIP"
  - name: "Titles recognizer"
    supported_language: "en"
    supported_entity: "TITLE"
    deny_list:
@@ -190,5 +198,4 @@ recognizers:
      - Miss
      - Dr.
      - Prof.
```
88 changes: 88 additions & 0 deletions docs/services.md
@@ -0,0 +1,88 @@
# Services management

The deployment defines multiple services (or applications), each being a
collection of Kubernetes manifests located in `src/<service>/`.

## Structure

- `external/`: third party resources
- `src/`: deployable manifests
- secrets are encrypted with sops+age and persisted in `src/secrets/`

Each service is structured as follows (supported tools are `ytt` and `helm`):

```text
├── external
│   └── <tool>
│       └── <service>/...            # <- third party templates
└── src
    └── <service>
        ├── additional-manifest.yaml # <- custom manifests for this deployment
        ├── kustomization.yaml       # <- kustomization file to select resources
        └── <tool>
            ├── out/...              # <- rendered manifests
            └── values.yaml          # <- values used for templating
```

## Templating

[ytt](https://carvel.dev/ytt) is the preferred rendering engine, but helm is
also supported as many upstream templates are distributed with
[helm](https://helm.sh).

When running `just render`, we attempt to render each service with helm and then
with ytt and save the rendered manifests in the repository.
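
As a rough sketch of what rendering one service involves, the function below builds the command line for a `src/<service>/<tool>` directory. It assumes the recipes call `ytt -f ... --output-files ...` and `helm template ... --output-dir ...`, and that third party templates live under `external/ytt/<service>` for both tools; `render_command` is a hypothetical helper, not part of the repository.

```python
from pathlib import PurePosixPath


def render_command(tool_dir):
    """Return the argv that would render the service owning `tool_dir`.

    `tool_dir` is expected to look like src/<service>/<tool>.
    """
    tool_dir = PurePosixPath(tool_dir)
    tool, service = tool_dir.name, tool_dir.parent.name
    values = str(tool_dir / "values.yaml")   # values used for templating
    out = str(tool_dir / "out")              # rendered manifests land here
    templates = f"external/ytt/{service}"    # assumed location of upstream templates
    if tool == "ytt":
        return ["ytt", "-f", values, "-f", templates, "--output-files", out]
    if tool == "helm":
        return ["helm", "template", service, templates, "-f", values, "--output-dir", out]
    raise ValueError(f"unsupported tool: {tool}")


print(" ".join(render_command("src/presidio/ytt")))
```

The list form can be passed directly to `subprocess.run` if the commands should be executed rather than printed.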

## Deployment

When deploying with `just deploy`, deployment is done with kustomize
(kubectl's `-k` flag). This means that `src` and each of its subdirectories
contain a `kustomization.yaml` file which determines which manifests are
included in the deployment.

For example, running `just deploy src/` will recursively parse
`src/kustomization.yaml` and the `kustomization.yaml` of each resource
declared in that file. This makes it easy to exclude services or manifests by
commenting them out of `kustomization.yaml`.
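
The effect of commenting out a resource can be sketched with a small line-based reader. This is a simplification for illustration only: kustomize uses a full YAML parser and follows each resource recursively, and the `./other-service` entry is hypothetical.

```python
def active_resources(kustomization_text):
    """Return the entries under `resources:` that are not commented out."""
    resources = []
    in_resources = False
    for raw in kustomization_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and commented-out entries are ignored
        if not raw[0].isspace() and not line.startswith("-"):
            # A new top-level key starts; only `resources:` is of interest.
            in_resources = line == "resources:"
            continue
        if in_resources and line.startswith("- "):
            # Drop a trailing inline comment, if any.
            resources.append(line[2:].split(" #")[0].strip())
    return resources


example = """\
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ./presidio
  # - ./other-service
"""
print(active_resources(example))
```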

## Updating a service

Here is the typical workflow to re-deploy a service that has been updated
upstream.

1. Update the external manifest templates. This will update the `vendir` lock
file and fetch the latest templates into `external/<tool>/<service>`.

```bash
just external::refresh
```

2. Render the manifests with the new templates.

```bash
just render
```

> [!NOTE]
> This may fail if the new templates broke compatibility with existing values,
> in which case you will need to update your values in
> `src/<service>/<tool>/values.yaml`. Also watch out in case the upstream added
> new template files, as you may need to include them in the service
> `kustomization.yaml`.

3. Deploy the updated manifests.

```bash
just deploy src/<service>
```

> [!IMPORTANT]
> In some cases, you may want to manually delete resources related to the
> service. You can achieve that with `just delete src/<service>` or use
> `kubectl delete` to delete specific resources.

## Adding custom manifests

Custom manifests (e.g. additional volumes) can be added inside `src/<service>/`,
but they need to be listed as a resource in the `kustomization.yaml` file in the
same directory.
11 changes: 11 additions & 0 deletions external/vendir.lock.yml
@@ -0,0 +1,11 @@
apiVersion: vendir.k14s.io/v1alpha1
directories:
- contents:
  - git:
      commitTitle: Add label to external PRs (#1707)...
      sha: af1c524460ad62e17313520a3cbb618b062b75cb
      tags:
      - 2.2.360
    path: .
  path: ytt/presidio
kind: LockConfig
20 changes: 20 additions & 0 deletions external/vendir.yaml
@@ -0,0 +1,20 @@
apiVersion: vendir.k14s.io/v1alpha1
kind: Config
directories:
  - path: ytt/presidio
    contents:
      - path: .
        git:
          url: https://github.com/microsoft/presidio
          ref: refs/tags/2.2.360
        newRootPath: docs/samples/deployments/k8s/charts/presidio
  # - path: helm/presidio
  #   contents:
  #     - path: .
  #       helmChart:
  #         name: presidio
  #         version: 2.2.360
  #       git:
  #         url: https://github.com/microsoft/presidio
  #         ref: refs/tags/2.2.360
  #       subPath: docs/samples/deployments/k8s/charts/presidio
9 changes: 8 additions & 1 deletion justfile
@@ -33,11 +33,18 @@ render-ytt dir="src":
    fd '^ytt$' {{dir}} \
      -x sh -c 'ytt -f {}/values.yaml -f external/ytt/$(basename {//}) --output-files {}/out'

# Render when the code was pulled in via ytt but is a helm template
[private]
render-ytt-extract-helm-template dir="src":
    # render helm templates fetched under external/ytt with our values into src/<service>/helm/out
    fd '^helm$' {{dir}} \
      -x sh -c 'helm template $(basename {//}) external/ytt/$(basename {//}) -f {}/values.yaml --output-dir {}/out'

# Render manifests
render dir="src":
    just fetch && \
    just render-helm {{dir}} && \
    just render-ytt {{dir}} && \
    just render-ytt-extract-helm-template {{dir}} && \
    just format

# Apply manifests in dir to the cluster.
4 changes: 4 additions & 0 deletions src/kustomization.yaml
@@ -0,0 +1,4 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ./presidio
3 changes: 3 additions & 0 deletions src/presidio/conf/presidio-analyzer/default-analyzer.yaml
@@ -0,0 +1,3 @@
supported_languages:
- en
default_score_threshold: 0