Redacts PII inside a JSON document. String values are replaced in-place; keys, numbers, booleans, and arrays are preserved. Throws `UnsupportedOperationException` if `jackson-databind` is not on the classpath. Throws `IOException` for malformed JSON.

Redacts PII inside an XML document. Text nodes and attribute values are replaced in-place; element names, structure, and non-string content are preserved. No extra dependency required (uses JDK `javax.xml`). Throws `IOException` for malformed XML.

Registers a custom regex pattern applied by `HeuristicDetector` after all built-in patterns. Multiple calls accumulate. The 3-arg overload omits the description (defaults to the regex string).

Register organisation-specific identifiers that built-in heuristics don't cover: employee IDs, medical record numbers, internal reference codes, or any proprietary format.
"Task EMP-042731 relates to policy POL-GB-00123456.");
// → "Task [PII_1] relates to policy [PII_2]."
```
Custom patterns are applied by `HeuristicDetector` after all built-in patterns, so built-in matches always win for overlapping spans. Token counters are document-scoped: two `EMP-` matches in the same call produce `[PII_1]` and `[PII_2]`, never two `[PII_1]` tokens.
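The ordering and numbering rules above can be illustrated with a small standalone sketch. This is not the library's implementation: `PatternOrderSketch`, its `redact` method, and the regexes are hypothetical names chosen for illustration, and the sketch assumes every accepted span receives its own token number.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the library's API.
final class PatternOrderSketch {

    /** A matched span [start, end) in the input text. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        boolean overlaps(Span other) {
            return start < other.end && other.start < end;
        }
    }

    /** Collects built-in matches first, then custom ones, and numbers tokens per call. */
    static String redact(String text, List<Pattern> builtIn, List<Pattern> custom) {
        List<Span> accepted = new ArrayList<>();
        collect(text, builtIn, accepted); // built-in spans are claimed first
        collect(text, custom, accepted);  // custom spans only fill what is left
        accepted.sort(Comparator.comparingInt((Span s) -> s.start));

        StringBuilder out = new StringBuilder();
        int cursor = 0;
        int counter = 0; // scoped to this call, so numbering restarts per document
        for (Span s : accepted) {
            out.append(text, cursor, s.start);
            out.append("[PII_").append(++counter).append("]");
            cursor = s.end;
        }
        out.append(text, cursor, text.length());
        return out.toString();
    }

    /** Adds every match that does not overlap an already accepted span. */
    private static void collect(String text, List<Pattern> patterns, List<Span> accepted) {
        for (Pattern p : patterns) {
            Matcher m = p.matcher(text);
            while (m.find()) {
                Span candidate = new Span(m.start(), m.end());
                boolean clash = false;
                for (Span s : accepted) {
                    if (s.overlaps(candidate)) {
                        clash = true;
                        break;
                    }
                }
                if (!clash) {
                    accepted.add(candidate);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Assumed regexes for the EMP- and POL- formats; the real ones may differ.
        List<Pattern> custom = List.of(
                Pattern.compile("EMP-\\d{6}"),
                Pattern.compile("POL-[A-Z]{2}-\\d{8}"));

        // prints: Task [PII_1] relates to policy [PII_2].
        System.out.println(redact(
                "Task EMP-042731 relates to policy POL-GB-00123456.", List.of(), custom));

        // A stand-in "built-in" pattern claims the digits first, so the overlapping
        // custom EMP- match is dropped. prints: Badge EMP-[PII_1]
        System.out.println(redact(
                "Badge EMP-042731", List.of(Pattern.compile("\\d{6}")), custom));
    }
}
```

Because built-in spans are collected first, a custom match that overlaps one is dropped rather than merged, which is one simple way to realise the "built-in matches always win" rule.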
Multiple calls to `.addPattern()` accumulate; they do not replace each other.

---

## JSON / XML Redaction

Redact PII directly inside structured documents. Text values are replaced in-place; keys, numbers, booleans, and markup structure are preserved exactly.

### JSON

Requires `jackson-databind` on the classpath (not bundled; add it to your own `pom.xml`).
### XML

Uses the JDK built-in `javax.xml`; no extra dependency is required. The parser is hardened against XXE injection by disabling DOCTYPE declarations and external entity loading.
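As a rough illustration of this kind of hardening, the sketch below configures a JDK `DocumentBuilderFactory` to reject DOCTYPE declarations and external entities, then replaces matches in text nodes and attribute values in place. The feature URIs are the commonly used JAXP/Xerces ones; the library's exact settings are not shown here, and `XmlRedactionSketch`, `parseHardened`, `redact`, and the single `EMP-` regex are hypothetical stand-ins for its detectors.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical illustration only; not part of the library's API.
final class XmlRedactionSketch {

    // Assumed stand-in for the real detectors: one illustrative pattern.
    private static final Pattern EMPLOYEE_ID = Pattern.compile("EMP-\\d{6}");

    /** Parses XML with DOCTYPE declarations and external entities disabled. */
    static Document parseHardened(String xml) throws IOException {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
            factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
            factory.setXIncludeAware(false);
            factory.setExpandEntityReferences(false);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | ParserConfigurationException e) {
            // Surface parse failures as IOException, as the documented API does.
            throw new IOException("Malformed or disallowed XML", e);
        }
    }

    /** Walks the tree and rewrites text nodes and attribute values in place. */
    static void redact(Node node, int[] counter) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            node.setNodeValue(replaceMatches(node.getNodeValue(), counter));
        }
        NamedNodeMap attrs = node.getAttributes(); // null for non-element nodes
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                attr.setNodeValue(replaceMatches(attr.getNodeValue(), counter));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            redact(child, counter);
        }
    }

    /** Replaces each match with the next [PII_n] token; the counter is shared per document. */
    private static String replaceMatches(String value, int[] counter) {
        Matcher m = EMPLOYEE_ID.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement("[PII_" + (++counter[0]) + "]"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        Document doc = parseHardened(
                "<user id=\"EMP-042731\"><note>Ask EMP-000001 today</note></user>");
        redact(doc, new int[] {0});
        System.out.println(doc.getDocumentElement().getAttribute("id")); // [PII_1]
        System.out.println(doc.getDocumentElement().getTextContent());   // Ask [PII_2] today
    }
}
```

With `disallow-doctype-decl` enabled, any input that carries a `<!DOCTYPE ...>` is rejected at parse time, so entity-based payloads never reach the redaction step. Element names and tree structure are untouched because only node values are rewritten.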
| **SPG Full (H + ML)** | **206,000 sentences/s** | **0%** |
| SPG Full + NLP | ~45,000 sentences/s* | 0% |

\* NLP throughput depends on model size and JVM warmup. Stream processing throughput is I/O-bound rather than CPU-bound. See the [CI benchmark runs](https://github.com/Sushegaad/Semantic-Privacy-Guard/actions) for the latest numbers.