Commit 6a67428

Update README and tests for unicodeWordBoundaries feature
1 parent e1b717a commit 6a67428

2 files changed: 50 additions and 15 deletions
README.md

Lines changed: 17 additions & 15 deletions
@@ -88,6 +88,23 @@ profanity.exists('Arsenic is poisonous but not profane');
 // true (matched on arse)
 ```

+### unicodeWordBoundaries
+
+Controls whether word boundaries are Unicode-aware. By default this is set to `false` due to the performance impact.
+
+- When `false` (default), whole-word matching uses ASCII-style boundaries (similar to `\b`) plus underscore `_` as a separator. This is fastest and ideal for ASCII inputs.
+- When `true`, whole-word matching uses Unicode-aware boundaries so words with diacritics (e.g., `vehículo`, `horário`) and compound separators are handled correctly.
+
+```JavaScript
+const profanity = new Profanity({
+  unicodeWordBoundaries: true,
+  wholeWord: true, // must be true for boundaries to work
+});
+
+profanity.exists('vehículo horario');
+// false (does not match on "culo" inside "vehículo")
+```
+
 #### Compound Words
 Profanity detection works on parts of compound words, rather than treating hyphenated or underscore-separated words as indivisible.

@@ -135,21 +152,6 @@ profanity.censor('I like big butts and I cannot lie', CensorType.AllVowels);
 // I like big b$tts and I cannot lie
 ```

-### unicodeWordBoundaries
-
-Controls whether word boundaries are Unicode-aware. By default this is set to `false` due to the performance impact.
-
-- When `false` (default), whole-word matching uses ASCII-style boundaries (similar to `\b`) plus underscore `_` as a separator. This is fastest and ideal for ASCII inputs.
-- When `true`, whole-word matching uses Unicode-aware boundaries so words with diacritics (e.g., `vehículo`, `horário`) and compound separators are handled correctly.
-
-```JavaScript
-// Enable Unicode-aware boundaries when processing non-ASCII input
-const profanity = new Profanity({ unicodeWordBoundaries: true });
-
-profanity.exists('vehículo horario');
-// false (does not match on "culo" inside "vehículo")
-```
-
 ## Customize the word list

 Add words:
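
The boundary behaviour described in the README change above can be illustrated with plain regular expressions. This is a minimal sketch, not the library's implementation: `escapeRegExp`, `matchesAsciiWholeWord`, and `matchesUnicodeWholeWord` are hypothetical helpers, and the lookaround-based Unicode boundary is only an assumed stand-in for what `unicodeWordBoundaries: true` might do internally. The ASCII variant uses a bare `\b` and does not model the extra underscore handling the README mentions.

```typescript
// Illustrative only: hypothetical helpers built on plain RegExp,
// not the library's actual matching code.

// Escape regex metacharacters so a word can be embedded in a pattern safely.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// ASCII-style whole-word test: without the "u" flag, \b only treats
// [A-Za-z0-9_] as word characters, so "í" counts as a boundary.
function matchesAsciiWholeWord(input: string, word: string): boolean {
  return new RegExp(`\\b${escapeRegExp(word)}\\b`, "i").test(input);
}

// Unicode-aware whole-word test (assumed approach): the word must not be
// adjacent to any Unicode letter, combining mark, or number.
function matchesUnicodeWholeWord(input: string, word: string): boolean {
  const wordChar = "[\\p{L}\\p{M}\\p{N}]";
  const pattern = `(?<!${wordChar})${escapeRegExp(word)}(?!${wordChar})`;
  return new RegExp(pattern, "iu").test(input);
}

console.log(matchesAsciiWholeWord("vehículo", "culo"));   // true: "í" is not an ASCII word character
console.log(matchesUnicodeWholeWord("vehículo", "culo")); // false: "í" is a Unicode letter
console.log(matchesAsciiWholeWord("véhicule", "cul"));    // false: "cul" sits between ASCII letters
```

The last call shows why the French case stays unmatched even with ASCII boundaries: the letters on both sides of `cul` in `véhicule` are plain ASCII, so no boundary exists there in either mode.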

tests/profanity-unicode-boundaries.spec.ts

Lines changed: 33 additions & 0 deletions
@@ -47,3 +47,36 @@ describe("Unicode word boundaries (wholeWord=true)", () => {
     expect(profanity.exists("véhicule")).to.be.false;
   });
 });
+
+describe("Unicode option interaction with wholeWord=false", () => {
+  it("should detect substrings regardless of unicodeWordBoundaries (es)", () => {
+    const input = "vehículo";
+    const pAscii = new Profanity({ languages: ["es"], wholeWord: false, grawlix: "*****", unicodeWordBoundaries: false });
+    const pUnicode = new Profanity({ languages: ["es"], wholeWord: false, grawlix: "*****", unicodeWordBoundaries: true });
+    expect(pAscii.exists(input)).to.be.true;
+    expect(pAscii.censor(input)).to.equal("vehí*****");
+    expect(pUnicode.exists(input)).to.be.true;
+    expect(pUnicode.censor(input)).to.equal("vehí*****");
+  });
+
+  it("should detect substrings regardless of unicodeWordBoundaries (fr)", () => {
+    const input = "véhicule";
+    const pAscii = new Profanity({ languages: ["fr"], wholeWord: false, unicodeWordBoundaries: false });
+    const pUnicode = new Profanity({ languages: ["fr"], wholeWord: false, unicodeWordBoundaries: true });
+    expect(pAscii.exists(input)).to.be.true; // matches 'cul'
+    expect(pUnicode.exists(input)).to.be.true; // matches 'cul'
+  });
+});
+
+describe("Unicode off with wholeWord=true (legacy ASCII boundaries)", () => {
+  it("should match 'culo' inside 'vehículo' when unicodeWordBoundaries=false", () => {
+    const profanity = new Profanity({ languages: ["es"], wholeWord: true, grawlix: "*****", unicodeWordBoundaries: false });
+    expect(profanity.exists("vehículo")).to.be.true;
+    expect(profanity.censor("vehículo")).to.equal("vehí*****");
+  });
+
+  it("should not match 'cul' inside 'véhicule' even when unicodeWordBoundaries=false", () => {
+    const profanity = new Profanity({ languages: ["fr"], wholeWord: true, unicodeWordBoundaries: false });
+    expect(profanity.exists("véhicule")).to.be.false;
+  });
+});
