Commit a10d262

Updated README.md.
1 parent ea400a4 commit a10d262

1 file changed: +104 -1 lines changed

README.md

Lines changed: 104 additions & 1 deletion
@@ -220,6 +220,7 @@ NLP results flow through the same `CompositeDetector` de-duplication as heuristi
| `IP_ADDRESS` | `192.168.1.100` | Regex (range-validated) | 4 |
| `ORGANIZATION` | `Barclays Bank PLC` | Naive Bayes ML + OpenNLP NER | 3 |
| `COORDINATES` | `51.5074, -0.1278` | Regex (bounds-checked) | 3 |
| `GENERIC_PII` | `EMP-042731` | Custom Pattern Registry | 5 |

---

@@ -244,6 +245,18 @@ Fast pre-flight check (~30% faster than `redact()`) for yes/no answers.

Detection without redaction, for audit and reporting pipelines.

### `redactJson(String json)` → `StructuredRedactionOutput`

Redacts PII inside a JSON document. String values are replaced in place; keys, numbers, booleans, and arrays are preserved. Throws `UnsupportedOperationException` if `jackson-databind` is not on the classpath, and `IOException` for malformed JSON.

### `redactXml(String xml)` → `StructuredRedactionOutput`

Redacts PII inside an XML document. Text nodes and attribute values are replaced in place; element names, structure, and non-string content are preserved. No extra dependency is required (uses the JDK's `javax.xml`). Throws `IOException` for malformed XML.

### `SPGConfig.Builder.addPattern(PIIType, String regex, double confidence, String description)` → `Builder`

Registers a custom regex pattern that `HeuristicDetector` applies after all built-in patterns. Multiple calls accumulate. The 3-arg overload omits the description, which then defaults to the regex string.

### Stream methods

```java
@@ -275,6 +288,9 @@ SPGConfig config = SPGConfig.builder()
    .buildReverseMap(true)   // set to false for a slight performance gain
    .heuristicEnabled(true)
    .mlEnabled(true)
    // Custom organisation-specific patterns (see Custom Pattern Registry below)
    .addPattern(PIIType.GENERIC_PII, "EMP-\\d{6}", 0.99, "Employee ID")
    .addPattern(PIIType.GENERIC_PII, "MRN-[A-Z0-9]{8}", 0.98, "Medical Record Number")
    .build();
```

@@ -288,6 +304,93 @@ SPGConfig config = SPGConfig.builder()

---

## Custom Pattern Registry

Register organisation-specific identifiers that the built-in heuristics don't cover: employee IDs, medical record numbers, internal reference codes, or any proprietary format.

```java
SPGConfig config = SPGConfig.builder()
    .addPattern(PIIType.GENERIC_PII, "EMP-\\d{6}", 0.99, "Employee ID")
    .addPattern(PIIType.GENERIC_PII, "MRN-[A-Z0-9]{8}", 0.98, "Medical Record Number")
    .addPattern(PIIType.GENERIC_PII, "POL-[A-Z]{2}-\\d{8}", 0.97, "Policy Number")
    .build();

SemanticPrivacyGuard spg = SemanticPrivacyGuard.create(config);

RedactionResult r = spg.redact(
    "Task EMP-042731 relates to policy POL-GB-00123456.");
// → "Task [PII_1] relates to policy [PII_2]."
```

Custom patterns are applied by `HeuristicDetector` after all built-in patterns, so built-in matches always win for overlapping spans. Token counters are document-scoped: two `EMP-` matches in the same call produce `[PII_1]` and `[PII_2]`, never two `[PII_1]` tokens.

Multiple calls to `.addPattern()` accumulate; they do not replace each other.
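
The document-scoped numbering described above can be pictured with a minimal sketch that uses only `java.util.regex`. This is not the library's implementation: the `TokenCounterSketch` class and its `redact` method are hypothetical, it handles a single pattern, and it does not model the overlap resolution against built-in patterns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Semantic Privacy Guard.
final class TokenCounterSketch {

    // Replaces every match of `regex` with [PII_n], numbering from 1 on each call.
    static String redact(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        StringBuilder out = new StringBuilder();
        int counter = 0; // local to this call, so numbering restarts per document
        while (m.find()) {
            counter++;
            // quoteReplacement keeps the token literal (no group references)
            m.appendReplacement(out, Matcher.quoteReplacement("[PII_" + counter + "]"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(redact("EMP-042731 reports to EMP-000117.", "EMP-\\d{6}"));
        // → [PII_1] reports to [PII_2].
    }
}
```

The sketch numbers every match sequentially. Whether the library reuses one token when the same value appears twice is not stated in this README, so that behaviour is left out here.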

---

## JSON / XML Redaction

Redact PII directly inside structured documents. Text values are replaced in place; keys, numbers, booleans, and markup structure are preserved exactly.

### JSON

Requires `jackson-databind` on the classpath (not bundled; add it to your own `pom.xml`):

```xml
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.17.0</version>
</dependency>
```

```java
SemanticPrivacyGuard spg = SemanticPrivacyGuard.create();

StructuredRedactionOutput out = spg.redactJson("""
    {
      "name": "Alice Johnson",
      "email": "alice@example.com",
      "account": 12345
    }
    """);

System.out.println(out.getRedactedContent());
// → {"name":"[PERSON_NAME_1]","email":"[EMAIL_1]","account":12345}

System.out.println(out.getMatchCount());   // → 2
System.out.println(out.getReverseMap());   // → {[PERSON_NAME_1]=Alice Johnson, [EMAIL_1]=alice@example.com}
```
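
The rule that only string values change can be pictured as a recursive walk over the parsed tree. The sketch below uses plain `Map` and `List` objects instead of Jackson so that it runs without dependencies. `JsonWalkSketch`, `redactValues`, and the stand-in redactor are hypothetical names, and the sketch assumes that strings nested inside arrays are redacted too, which this README does not spell out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; the library itself parses JSON with Jackson.
final class JsonWalkSketch {

    // Returns a copy of `node` in which only String values went through `redactor`.
    static Object redactValues(Object node, UnaryOperator<String> redactor) {
        if (node instanceof String) {
            return redactor.apply((String) node);
        }
        if (node instanceof Map) {
            Map<Object, Object> copy = new LinkedHashMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                // Keys are copied untouched; only the value is visited.
                copy.put(e.getKey(), redactValues(e.getValue(), redactor));
            }
            return copy;
        }
        if (node instanceof List) {
            List<Object> copy = new ArrayList<>();
            for (Object item : (List<?>) node) {
                copy.add(redactValues(item, redactor));
            }
            return copy;
        }
        return node; // numbers, booleans and null pass through unchanged
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "Alice Johnson");
        doc.put("account", 12345);
        doc.put("contacts", List.of("alice@example.com", true));

        // Stand-in redactor: masks any string containing '@'.
        UnaryOperator<String> redactor = s -> s.contains("@") ? "[EMAIL]" : s;

        System.out.println(redactValues(doc, redactor));
        // → {name=Alice Johnson, account=12345, contacts=[[EMAIL], true]}
    }
}
```

Returning a copy rather than mutating the input keeps the original document available, which matters if the caller also wants to build a reverse map.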

### XML

Uses the JDK's built-in `javax.xml`, so no extra dependency is required. The parser is hardened against XXE injection by disabling DOCTYPE declarations and external entity loading.

```java
StructuredRedactionOutput out = spg.redactXml("""
    <?xml version="1.0"?>
    <user>
      <name>Alice Johnson</name>
      <email>alice@example.com</email>
      <id>12345</id>
    </user>
    """);

System.out.println(out.getRedactedContent());
// → <?xml version="1.0"?><user><name>[PERSON_NAME_1]</name><email>[EMAIL_1]</email><id>12345</id></user>
```
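
The XXE hardening mentioned above (no DOCTYPE declarations, no external entities) is commonly achieved with the JDK parser features shown in the sketch below. The exact settings the library applies are not shown in this commit, so treat the feature list as an assumption; `HardenedXmlSketch` is a hypothetical class name.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical illustration only; not the library's actual parser setup.
final class HardenedXmlSketch {

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            // Reject any <!DOCTYPE ...> declaration outright.
            f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            // Belt and braces: never resolve external entities.
            f.setFeature("http://xml.org/sax/features/external-general-entities", false);
            f.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
            f.setXIncludeAware(false);
            f.setExpandEntityReferences(false);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("XML rejected: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        Document ok = parse("<user><name>Alice Johnson</name></user>");
        System.out.println(ok.getDocumentElement().getTagName());
        // → user

        String attack = "<!DOCTYPE u [<!ENTITY x SYSTEM \"file:///etc/passwd\">]><u>&x;</u>";
        try {
            parse(attack);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected");
            // → rejected
        }
    }
}
```

With DOCTYPE declarations disallowed, the classic external-entity payload fails at parse time, before any entity could be resolved.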

`StructuredRedactionOutput` accessors:

| Method | Returns |
|---|---|
| `getRedactedContent()` | Redacted JSON or XML string |
| `getReverseMap()` | `Map<String, String>` mapping each token to its original value |
| `getMatchCount()` | Total PII matches found |
| `hasPII()` | `true` if any PII was detected |
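
Because `getReverseMap()` maps each token back to its original value, restoring a redacted document can be done with plain string substitution. The helper below is a hypothetical sketch, not a library method. It assumes tokens appear verbatim in the text and that original values do not themselves contain token strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; `restore` is not part of the library API.
final class ReverseMapSketch {

    // Substitutes every token in `redacted` with its original value.
    static String restore(String redacted, Map<String, String> reverseMap) {
        String result = redacted;
        for (Map.Entry<String, String> e : reverseMap.entrySet()) {
            // String.replace treats the token literally, so '[' and ']' need no escaping.
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> reverse = new LinkedHashMap<>();
        reverse.put("[PERSON_NAME_1]", "Alice Johnson");
        reverse.put("[EMAIL_1]", "alice@example.com");

        String redacted = "{\"name\":\"[PERSON_NAME_1]\",\"email\":\"[EMAIL_1]\",\"account\":12345}";
        System.out.println(restore(redacted, reverse));
        // → {"name":"Alice Johnson","email":"alice@example.com","account":12345}
    }
}
```

Since every token ends with `]`, `[PII_1]` is never a substring of `[PII_10]`, so replacement order does not matter. For JSON or XML output, original values containing quotes or angle brackets would need escaping before substitution, which this sketch omits.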

---

## Architecture

```
@@ -361,7 +464,7 @@ try (var exec = Executors.newVirtualThreadPerTaskExecutor()) {
| **SPG Full (H + ML)** | **206,000 sentences/s** | **0%** |
| SPG Full + NLP | ~45,000 sentences/s* | 0% |

-\* NLP throughput depends on model size and JVM warmup. Stream processing throughput is I/O-bound rather than CPU-bound. See [docs/benchmarks.md](docs/benchmarks.md) for full methodology.
+\* NLP throughput depends on model size and JVM warmup. Stream processing throughput is I/O-bound rather than CPU-bound. See the [CI benchmark runs](https://github.com/Sushegaad/Semantic-Privacy-Guard/actions) for the latest numbers.

---
