refactor: Add batching and 100-doc cap to PII anonymizer #128

Open
kmandryk wants to merge 1 commit into main from PII-batching

kmandryk commented Mar 2, 2026

The Language Service accepts at most 5 documents per synchronous request. The fragment now sends the documents in batches of 5 across multiple requests (up to 100 documents total) and merges the results for reassembly.

Note: there are two risks here. Payloads that need more than 100 documents will cause a failure, and there may also be a risk of APIM timeouts.
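
For reviewers, here is a rough Python sketch of the batching described above. It is not the fragment's code (the fragment is APIM policy XML), only an illustration of the intended behaviour. `make_batches`, `redact_all`, and the `send_batch` callable are hypothetical names; `send_batch` stands in for one request to the Language Service and is assumed to return a list of `{"id", "redactedText"}` entries.

```python
from typing import Callable, Dict, List

BATCH_SIZE = 5       # Language Service limit per synchronous request (from the PR description)
MAX_DOCUMENTS = 100  # total cap introduced by this PR

Document = Dict[str, str]  # assumed shape: {"id": ..., "text": ...}


def make_batches(documents: List[Document]) -> List[List[Document]]:
    """Split the first MAX_DOCUMENTS documents into batches of up to BATCH_SIZE."""
    capped = documents[:MAX_DOCUMENTS]
    return [capped[i:i + BATCH_SIZE] for i in range(0, len(capped), BATCH_SIZE)]


def redact_all(
    documents: List[Document],
    send_batch: Callable[[List[Document]], List[Dict[str, str]]],
) -> Dict[str, str]:
    """Send each batch in turn and merge the results into one id -> redactedText map."""
    merged: Dict[str, str] = {}
    for batch in make_batches(documents):
        # send_batch is a hypothetical stand-in for one Language Service request.
        for doc in send_batch(batch):
            merged[doc["id"]] = doc["redactedText"]
    return merged
```

Under these assumptions a payload at the 100-document cap results in 20 sequential requests, which is where the timeout concern comes from. Documents past the cap never appear in the merged map, so the coverage check has to catch them.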

AI Hub Infra Changes

Summary: 1 to add, 2 to change, 0 to destroy (across 2 stack(s))

Terraform will perform the following actions:

  # azurerm_api_management_policy_fragment.pii_anonymization[0] will be updated in-place
  ~ resource "azurerm_api_management_policy_fragment" "pii_anonymization" {
        id                = "/subscriptions/****/resourceGroups/ai-services-hub-test/providers/Microsoft.ApiManagement/service/ai-services-hub-test-apim/policyFragments/pii-anonymization"
        name              = "pii-anonymization"
      ~ value             = <<-EOT
          - <fragment>
          - 	<!-- ================================================================== -->
          - 	<!-- PII Anonymization via Azure Language Service                       -->
          - 	<!-- Per-message multi-document scanning for robust large-payload support-->
          - 	<!-- Enterprise-grade PII detection using Azure AI Language API         -->
          - 	<!-- Supports: Names, addresses, SSN, medical terms, financial data     -->
          - 	<!-- ================================================================== -->
          - 	<!--
          + <fragment>
          +     <!-- ================================================================== -->
          +     <!-- PII Anonymization via Azure Language Service                       -->
          +     <!-- Per-message multi-document scanning for robust large-payload support-->
          +     <!-- Enterprise-grade PII detection using Azure AI Language API         -->
          +     <!-- Supports: Names, addresses, SSN, medical terms, financial data     -->
          +     <!-- ================================================================== -->
          +     <!--
                    Architecture:
                    - Parses the JSON request body and extracts each message's content
          -         - Sends message contents as separate documents in one PII API call
          -           (up to 5 documents per synchronous request per Azure Language Service limits)
          -         - Large messages are automatically chunked at word boundaries to stay
          -           within the per-document character limit (5000 chars, safe for all tiers)
          +         - Builds a full list of documents (per message, chunked at 5000 chars)
          +         - Sends documents in BATCHES of up to 5 per request (Azure limit)
          +         - Multiple send-request calls for >5 documents; results merged for reassembly
                    - Chunked documents use compound IDs (e.g., "1_0", "1_1") and are
                      reassembled in order after redaction
          +         - Per-document character limit: 5000 chars (safe for all Language Service tiers)
                    - Replaces each message content with its redacted version
                    - JSON envelope (roles, parameters, model) is never scanned
                    - Falls back to raw-body single-document mode for non-JSON payloads
            
                    Document Chunking & Reassembly:
                    The Azure Language Service enforces a per-document character limit
                    (5,120 on F0, 125K on S tier). To safely handle payloads of any size
                    we use a conservative 5,000-char threshold.
            
                    1. Splitting — For each chat message whose content exceeds the limit,
                       the text is split into consecutive chunks at the nearest word
                       boundary (space character) before the limit. Each chunk becomes a
                       separate document in the PII API request with a compound ID:
                         message index 1, chunk 0 → id "1_0"
                         message index 1, chunk 1 → id "1_1"
                         message index 1, chunk 2 → id "1_2"  …and so on
                       Short messages that fit in a single document keep their simple ID
                       (e.g., "0", "1"), so existing behaviour is fully preserved.
            
                    2. Redaction — The Language Service returns a `redactedText` field for
                       every document it successfully processes. Detected PII entities are
                       replaced in-place with a mask character ('#' by default) using the
                       CharacterMask redaction policy.  For example:
                         Input:  "Contact Erin Sanchez at 604-555-7890"
                         Output: "Contact #### ####### at ############"
                       The mask length always matches the original text length, so
                       character offsets and chunk boundaries remain stable.
            
                    3. Reassembly — After the PII API responds, a lookup map is built
                       from document id → redactedText. Reconstruction works per-message:
                       a) If the map contains the simple key (e.g., "1"), the message was
                          not chunked — apply the redacted text directly.
                       b) Otherwise, iterate compound keys "1_0", "1_1", "1_2", … in
                          order and concatenate the redacted fragments. Because chunks
                          were split at the same boundaries, concatenation reproduces
                          the full redacted content with masks in the correct positions.
                       The redacted content replaces the original message content in the
                       JSON body, and the rest of the envelope (role, parameters, model
                       settings) is left untouched.
            
                    4. Error handling — If the Language Service rejects individual
                       documents (e.g., empty text), those are reported in
                       piiDiagnostics.docErrors but do not block successfully-redacted
                       documents. The fail-closed / fail-open mode controls whether an
                       overall failure blocks the request (503) or passes through the
                       original content.
            
                    5. Redaction Coverage Verification (P1 Safety) — After the PII API
                       responds, the fragment verifies that EVERY message with content
                       received complete redaction. This catches two real loopholes:
            
                       a) Document-limit protection: The Language Service accepts at most
          -               5 documents per synchronous request. When the payload requires
          -               more (many messages, or large messages that chunk into many
          -               documents), excess content is silently dropped — the API
          -               never sees it. The coverage check detects this by comparing
          -               the set of messages that have redacted output against those
          -               that don't.
          -               Example: 6 short messages = 6 docs needed, but only 5 get
          -               sent. The remaining message passes through unscanned.
          +               5 documents per synchronous request. This fragment sends multiple
          +               requests in batches of 5 (up to 100 documents total). If the
          +               payload requires more than 100 documents, excess is unscanned and
          +               the coverage check reports msgsUnscanned / fullCoverage=false.
            
                       b) Partial-chunking detection: When a large message is split into
          -               chunks and only some chunks fit within the 5-document limit,
          -               the reassembled text would be shorter than the original. The
          -               fragment detects this by comparing total redacted chunk length
          +               chunks and only some chunks are returned (e.g. API error for a
          +               batch), the reassembled text would be shorter than the original.
          +               The fragment detects this by comparing total redacted chunk length
                          against original message length. CharacterMask preserves text
                          length, so any discrepancy means missing chunks.
          -               Example: A 30K-char message needs 6 chunks. If the 5-doc
          -               limit only allows 3 chunks, the reassembled text is ~15K chars
          -               — the trailing 15K chars (potentially containing PII) would
          -               be silently lost.
            
                       c) Document-error tolerance: If the Language Service returns
                          errors for specific documents (e.g., unsupported language),
                          those documents have no redactedText in the response. The
                          coverage check detects these as unscanned messages.
            
                       Coverage result: piiRedactionCoverage JSON with:
                         - msgsWithContent: total messages that had non-empty content
                         - msgsRedacted: messages with complete redacted output
                         - msgsPartial: messages partially chunked (some chunks missing)
                         - msgsUnscanned: messages completely skipped (no output at all)
                         - fullCoverage: true only if msgsRedacted == msgsWithContent
            
                       Fail-closed mode blocks when fullCoverage == false, returning 503
                       with failure_reason "partial-redaction-N-msgs-unscanned" or
                       "partial-redaction-N-msgs-truncated" for diagnostics.
                       Fail-open mode passes through original content for unscanned/
                       partial messages (no silent truncation) and logs coverage metrics
                       to App Insights for monitoring.
            
                    Performance optimizations:
          -         - Single Body.As<string>() read (cached in piiResponseBodyStr)
          -         - Single JObject.Parse() pass for diagnostics (cached in piiDiagnostics)
          +         - Document list built once; batches of 5 sent sequentially; results merged
                    - Timing via piiStartTimeTicks / piiDurationMs for latency monitoring
            
                    Prerequisites - Set these variables before including this fragment:
                    - piiInputContent: The request body as a string (JSON with messages array)
                    - piiAnonymizationEnabled: "true" or "false" based on tenant config
            
                    Optional configuration variables:
                    - piiExcludedCategories: JSON array of PII categories to exclude
                    - piiDetectionLanguage: Language code for detection (default: "en")
                    - piiFailClosed: "true" to block requests when redaction fails (default: "false" = fail-open)
            
                    Failure behavior:
                    - If piiFailClosed="true" and redaction fails, the request is blocked with HTTP 503
                    - If piiFailClosed="false" (default), the original unredacted content is forwarded
                    - Detailed failure diagnostics (failure_reason, MSI status) are always logged regardless of mode
            
                    Required Named Value:
                    - piiServiceUrl: The Language Service endpoint URL (set in APIM Named Values)
            
                    Output:
                    - piiAnonymizedContent: The reconstructed JSON body with redacted message content
                    - piiRedactionSucceeded: "true" or "false" (at least one doc redacted)
                    - piiRedactionCoverage: JSON with fullCoverage, msgsWithContent, msgsRedacted,
                      msgsPartial, msgsUnscanned — comprehensive per-message coverage report
                    - piiDetectionStatusCode: HTTP status code from PII API
                    - piiDurationMs: PII API call latency in milliseconds
 
(truncated, see workflow logs for complete plan)

Updated by CI — plan against test environment (run #235) at 2026-03-02 21:53:53 UTC.
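
The chunking, reassembly, and coverage steps described in the fragment comments can be sketched in Python as follows. This is an illustration only, not the policy's implementation. The function names are hypothetical, keeping the space at the end of the preceding chunk is an assumption, and so is the hard cut used when a window contains no space. The length comparison relies on the plan's statement that CharacterMask preserves text length.

```python
from typing import Dict, List, Optional

CHAR_LIMIT = 5000  # conservative per-document threshold quoted in the plan


def chunk_text(text: str, limit: int = CHAR_LIMIT) -> List[str]:
    """Split text into consecutive chunks of at most `limit` characters."""
    chunks: List[str] = []
    start = 0
    while len(text) - start > limit:
        cut = text.rfind(" ", start, start + limit)
        # Assumption: the space stays with the preceding chunk; hard cut if no space.
        end = cut + 1 if cut != -1 else start + limit
        chunks.append(text[start:end])
        start = end
    chunks.append(text[start:])
    return chunks


def build_documents(messages: List[str]) -> List[Dict[str, str]]:
    """Simple ids ("0", "1") for short messages, compound ids ("1_0", "1_1") for chunks."""
    docs: List[Dict[str, str]] = []
    for idx, content in enumerate(messages):
        if not content:
            continue
        parts = chunk_text(content)
        if len(parts) == 1:
            docs.append({"id": str(idx), "text": content})
        else:
            for n, part in enumerate(parts):
                docs.append({"id": f"{idx}_{n}", "text": part})
    return docs


def reassemble(idx: int, redacted: Dict[str, str]) -> Optional[str]:
    """Return the redacted text for message idx, or None if nothing came back."""
    if str(idx) in redacted:
        return redacted[str(idx)]
    parts: List[str] = []
    n = 0
    # Stops at the first missing chunk, so a gap yields a shorter result.
    while f"{idx}_{n}" in redacted:
        parts.append(redacted[f"{idx}_{n}"])
        n += 1
    return "".join(parts) if parts else None


def coverage(messages: List[str], redacted: Dict[str, str]) -> Dict[str, object]:
    """Count fully redacted, partial, and unscanned messages by comparing lengths."""
    with_content = ok = partial = unscanned = 0
    for idx, content in enumerate(messages):
        if not content:
            continue
        with_content += 1
        out = reassemble(idx, redacted)
        if out is None:
            unscanned += 1
        elif len(out) == len(content):
            ok += 1
        else:
            partial += 1
    return {
        "msgsWithContent": with_content,
        "msgsRedacted": ok,
        "msgsPartial": partial,
        "msgsUnscanned": unscanned,
        "fullCoverage": ok == with_content,
    }
```

In this sketch a message whose first chunk is missing counts as unscanned rather than partial; the real fragment may classify that case differently.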
