This document tracks the migration of the legacy Python Presidio implementation to a .NET 9 codebase. ManagedCode maintains this .NET port, with the canonical Python sources vendored via the external/microsoft-presidio submodule serving as the authoritative reference while we build the C# implementation.
Record any resulting updates to agents.md under the section "## Rules to follow".

When you see a convincing argument from me on how to solve or do something, add a summary of it to agents.md so that you learn what I want over time.

If I say any of the following, add the context to agents.md and associate it with a specific type of task:
- "never do x", in some form
- "always do x", in some form
- "the process is x", in some form
- a request to remember something

Update this file when:
- I say "do/don't", define a process, or confirm success or failure: add a concise rule tied to the relevant task type.
- I say "always/never X", "prefer X over Y", "I like/dislike X", or "remember this".
- A mistake is corrected: capture the new rule and remove the obsolete guidance.
- A workflow is defined or refined: document it here.
- I use strong negative language, which indicates a critical mistake: add an emphatic rule immediately.

Update guidelines:
- Actionable rules tied to task types.
- Capture why, not just what.
- One clear instruction per bullet.
- Group related rules.
- Remove obsolete rules entirely.
- for Presidio migration tasks, ALWAYS mirror functionality, tests, and docs from `external/microsoft-presidio` to guarantee parity
- for Presidio migration tasks, NEVER rely on stubs, mocks, or placeholder implementations; deliver the full real functionality
- for Presidio migration NLP components, use `Microsoft.ML.Tokenizers` when implementing tokenization
- for Presidio migration tasks, eliminate dependencies on `ManagedCode.Presidio.PythonBridge`; port required logic to C#
- when selecting dependencies, prefer official NuGet packages under permissive licenses (MIT) that match upstream functionality
- integration tests must cover real data from the original Python project to verify parity; ensure all tests pass without hacks
- for Presidio analyzer tests, NEVER add stubbed recognizer tests; port the Python scenarios to exercise the real analyzer pipeline end-to-end
- for Presidio analyzer parity work, keep iterating without pausing for confirmation and focus solely on integration tests that validate real functionality
- for Presidio migration tasks, do not stop to ask the user for clarification mid-task; follow the migration plan and deliver completed work
- for Presidio migration tasks, when the user says "продовжити"/"continue", proceed through the target file step by step without asking for additional confirmation
- for Presidio migration tasks, when you see a way to improve something, note the idea in the working file and then implement it without waiting for user approval
- for Presidio migration tasks, default to continuing the migration workflow without waiting for "продовжити"/"continue"; halt only if the user explicitly redirects
- for Presidio migration tasks, when the user specifies an execution order for follow-up work, honor that sequence without reconfirming and keep progressing task-by-task
- for Presidio migration tasks, capture any important follow-up items directly in the working file as TODOs so they are not lost
- for Presidio test work, ALWAYS include negative/error scenarios alongside positive cases to validate failure paths
- for Presidio recognizer coverage, ensure EU social security numbers are handled alongside US SSN patterns
- use enums and constants over magic strings and numbers
- for .NET work, always run `dotnet format` before `dotnet test` and confirm the suite passes
- avoid template placeholders (e.g., `Class1.cs`, `UnitTest1.cs`); name files and types according to their real domain purpose
- keep documentation, code comments, and commit messaging in English
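
As an illustration of the rule above about preferring enums and constants over magic strings and numbers, here is a minimal sketch. Every name in it (`EntityKind`, `ScoreThresholds`, `DetectionSketch`) and the threshold value are hypothetical and chosen only for this example; none of them is taken from the actual ManagedCode.Presidio code.

```csharp
using System;

// Usage: the call site compares against named values, not literals
// such as "US_SSN" or 0.5 scattered through the code.
var detection = new DetectionSketch(EntityKind.UsSsn, 0.85);
Console.WriteLine(detection.IsConfident
    ? $"{detection.Kind} accepted"
    : $"{detection.Kind} rejected");

// Hypothetical entity kinds; the real port defines its own set.
public enum EntityKind
{
    UsSsn,
    EmailAddress,
    PhoneNumber,
}

// Hypothetical shared constant; the value is illustrative only.
public static class ScoreThresholds
{
    public const double Confident = 0.5;
}

// Hypothetical result type that uses the enum and the constant.
public readonly record struct DetectionSketch(EntityKind Kind, double Score)
{
    public bool IsConfident => Score >= ScoreThresholds.Confident;
}
```

With this shape, a misspelled entity name fails at compile time instead of silently never matching, and a threshold change happens in one place.
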
- `src/ManagedCode.Presidio.Core` – foundational domain objects such as `TextSpan`, `AnalysisExplanation`, `RecognizerResult`, and shared metadata keys.
- `src/ManagedCode.Presidio.Analyzer` – analyzer abstractions (`EntityRecognizer`, `NlpArtifacts`, `Token`) with lazy-loading recognizer lifecycle management and ONNX-backed NER.
- `src/ManagedCode.Presidio.Anonymizer` – anonymization primitives (`PiiEntity`, `OperatorConfig`, `OperatorResult`, `EngineResult`) mirroring the Python engine contracts.
- `src/ManagedCode.Presidio.ImageRedactor` / `src/ManagedCode.Presidio.Structured` – placeholders for their respective pipelines.
- `tests/*` – unit and integration suites covering the C# modules (including parity tests against Python behaviours).

The core types compile and are covered by unit tests; the integration test suite validates parity of critical behaviours (e.g., RecognizerResult semantics) against scenarios captured in the original Python tests. The analyzer/anonymizer engines are intentionally skeletal and ready to receive the full algorithmic port.
- Mirror Python domain types (`RecognizerResult`, `PiiEntity`, `OperatorConfig`, etc.) and prove equivalence via integration tests that ingest canonical scenarios from `tests/` in the Python repo.
- Expand the shared test harness to load JSON/YAML fixtures directly from `external/microsoft-presidio` for deterministic coverage.
- Port `AnalyzerEngine`, `RecognizerRegistry`, pattern recognizers, and context enhancers.
- Reproduce Python unit suites (`presidio-analyzer/tests`) as C# integration tests; compare scored outputs to golden data.
- Introduce ONNX-backed NER execution (using models referenced in the Python repo) and abstract inference to enable GPU/CPU parity.
- Implement operator pipeline (mask, redact, hash, replace, FPE, etc.) and conflict resolution consistent with Python behaviour.
- Backfill structured-data traversal and policy parsing; reuse original YAML policies as integration fixtures.
- Recreate image redaction and structured anonymization flows, focusing on API compatibility and deterministic output.
- Add scenario tests leveraging sample images/JSON from the Python repository to verify pixel/record level transforms.
- Produce NuGet packages for core/analyzer/anonymizer components.
- Expose minimal public APIs and documentation for consumption, and wire CI to publish artifacts.
- Remove remaining Python scaffolding once feature parity and documentation are complete.
This plan should be revisited at the end of each phase to incorporate learnings and refine subsequent milestones.
This file should be updated as major milestones are completed.