Commit 309ac5e

docs: Add version 0.2.5 changes to README
Updates README.md to include a new "Changes in 0.2.5" section. This section details the recently added features for custom profanity lists, including:
- Use of `custom_words_list` parameter.
- Standalone vs. combined custom list usage.
- `load_custom_profanity_from_file()` helper and file format.
- Updated language reporting in detection results.
1 parent 2d7709d commit 309ac5e

4 files changed

+447
-47
lines changed

README.md

Lines changed: 126 additions & 7 deletions
@@ -16,6 +16,21 @@ An open-source Python library for data cleaning tasks. It includes functions for
1616
> [!NOTE]
1717
> ValX will automatically install a version of `scikit-learn` that is compatible with your device if you don't have one already.
1818
19+
## Changes in 0.2.5
20+
21+
ValX v0.2.5 introduces enhanced flexibility for profanity filtering by adding support for custom profanity lists:
22+
23+
- **Custom Profanity Word Lists**: Users can now provide their own lists of profane words directly as Python lists to the `detect_profanity` and `remove_profanity` functions via the new `custom_words_list` parameter.
24+
- **Standalone Custom Lists**: Utilize your custom profanity list exclusively by setting the `language` parameter to `None`. ValX will then only use the words provided in `custom_words_list`.
25+
- **Combined Lists**: Use a custom list in conjunction with ValX's built-in language-specific wordlists. Simply provide both a `language` (e.g., "English") and your `custom_words_list`. ValX will use the combined set of words.
26+
- **Loading Custom Lists from File**: A new helper function, `load_custom_profanity_from_file(filepath)`, allows you to easily load custom profanity words from a text file.
27+
- **File Format**: The file should contain one profanity word per line.
28+
- Lines starting with a hash symbol (`#`) are treated as comments and ignored.
29+
- Empty lines or lines containing only whitespace are also ignored.
30+
- **Updated Detection Reporting**: The `detect_profanity` function's output now specifies the source of detected profanity more clearly (e.g., "Custom", "Custom + English").
31+
32+
These features give users greater control over the profanity filtering process, allowing for more tailored and specific use cases.
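
The standalone and combined modes described above amount to a choice of which word sets get merged. A minimal sketch of that idea, assuming a tiny stand-in `BUILTIN_WORDS` dictionary and a hypothetical `build_word_set` helper (neither is part of ValX, and ValX's internals may differ):

```python
# Illustrative only: BUILTIN_WORDS and build_word_set are hypothetical
# stand-ins, not ValX's real data or API.
BUILTIN_WORDS = {"English": {"darn", "heck"}}


def build_word_set(language=None, custom_words_list=None):
    """Return the set of words to filter for the given configuration."""
    words = set()
    if language is not None:
        # Built-in or combined mode: start from the language list.
        words |= BUILTIN_WORDS[language]
    if custom_words_list:
        # Add the caller's own words (the only source when language is None).
        words |= set(custom_words_list)
    return words


print(sorted(build_word_set(language=None, custom_words_list=["foo"])))
print(sorted(build_word_set(language="English", custom_words_list=["foo"])))
```

With `language=None` only the custom words survive; with a language set, the two sources are unioned, so duplicates between them collapse naturally.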
33+
1934
## Changes in 0.2.4
2035

2136
Fixed a major incompatibility with `scikit-learn`: changes introduced in `scikit-learn` v1.3.0 broke compatibility with versions later than `1.2.2`. ValX can now be used with `scikit-learn` versions both earlier and later than `1.3.0`.
@@ -113,25 +128,129 @@ Below is a complete list of all the available supported languages for ValX's pro
113128

114129
## Usage
115130

116-
### Detect Profanity
131+
### Profanity Detection and Removal
132+
133+
ValX allows for flexible profanity filtering using built-in language lists, custom word lists (provided as Python lists or loaded from files), or a combination of both.
134+
135+
**1. Basic Profanity Detection (Built-in Language)**
117136

118137
```python
119138
from valx import detect_profanity
120139

121-
# Detect profanity
140+
sample_text = ["This is some fuck and porn text."]
141+
# Detect profanity using the English list
122142
results = detect_profanity(sample_text, language='English')
123-
print("Profanity Evaluation Results", results)
143+
# results will be:
144+
# [
145+
# {'Line': 1, 'Column': 14, 'Word': 'fuck', 'Language': 'English'},
146+
# {'Line': 1, 'Column': 23, 'Word': 'porn', 'Language': 'English'}
147+
# ]
148+
print(results)
149+
```
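
To make the 1-indexed `Line` and `Column` values in the result concrete, here is a rough sketch of how such positions can be computed. The `find_words` helper is hypothetical and assumes whole-word, case-insensitive matching; it is not ValX's implementation, and the placeholder words are neutral stand-ins.

```python
import re


def find_words(lines, words):
    """Report whole-word, case-insensitive matches with 1-indexed positions."""
    # Hypothetical helper for illustration; not ValX's implementation.
    if not words:
        return []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for match in pattern.finditer(line):
            # match.start() is 0-based, so add 1 for a 1-indexed column.
            hits.append(
                {"Line": line_no, "Column": match.start() + 1, "Word": match.group(0)}
            )
    return hits


hits = find_words(["Some text with foo and bar."], ["foo", "bar"])
print(hits)
```

Here `foo` starts at 0-based offset 15, so it is reported at column 16, and `bar` at column 24.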
150+
151+
**2. Profanity Detection with a Custom Word List (Python List)**
152+
153+
You can provide your own list of words to filter.
154+
155+
```python
156+
from valx import detect_profanity
157+
158+
sample_text = ["This contains custombadword1 and also asshole from English list."]
159+
my_custom_words = ["custombadword1", "anothercustom"]
160+
161+
# Option A: Custom list ONLY (language=None)
162+
results_custom_only = detect_profanity(sample_text, language=None, custom_words_list=my_custom_words)
163+
# results_custom_only will detect "custombadword1" with Language: "Custom"
164+
# [{'Line': 1, 'Column': 15, 'Word': 'custombadword1', 'Language': 'Custom'}]
165+
print(results_custom_only)
166+
167+
# Option B: Custom list COMBINED with a built-in language
168+
results_custom_plus_english = detect_profanity(sample_text, language="English", custom_words_list=my_custom_words)
169+
# results_custom_plus_english will detect "custombadword1" and "asshole"
170+
# Language will be "Custom + English"
171+
# [
172+
# {'Line': 1, 'Column': 15, 'Word': 'custombadword1', 'Language': 'Custom + English'},
173+
# {'Line': 1, 'Column': 39, 'Word': 'asshole', 'Language': 'Custom + English'}
174+
# ]
175+
print(results_custom_plus_english)
124176
```
125177

126-
### Remove Profanity
178+
**3. Loading Custom Profanity Words from a File**
179+
180+
ValX provides a helper function to load words from a text file (one word per line, '#' for comments).
127181

128182
```python
129-
from valx import remove_profanity
183+
from valx import detect_profanity, load_custom_profanity_from_file
184+
185+
# Assume 'my_profanity_file.txt' contains:
186+
# customfileword1
187+
# # this is a comment
188+
# customfileword2
189+
190+
custom_words_from_file = load_custom_profanity_from_file("my_profanity_file.txt")
191+
# custom_words_from_file will be: ['customfileword1', 'customfileword2']
192+
193+
sample_text_for_file = ["Text with customfileword1 and built-in shit."]
194+
195+
# Use file-loaded list with English built-in list
196+
results_file_plus_english = detect_profanity(
197+
sample_text_for_file,
198+
language="English",
199+
custom_words_list=custom_words_from_file
200+
)
201+
# Detects "customfileword1" and "shit", Language: "Custom + English"
202+
print(results_file_plus_english)
203+
204+
# Use file-loaded list ONLY
205+
results_file_only = detect_profanity(
206+
sample_text_for_file,
207+
language=None, # Important: set language to None
208+
custom_words_list=custom_words_from_file
209+
)
210+
# Detects only "customfileword1", Language: "Custom"
211+
print(results_file_only)
212+
```
213+
214+
**Output Format for `detect_profanity`**
215+
216+
The `detect_profanity` function returns a list of dictionaries. Each dictionary includes:
217+
- `"Line"`: The line number (1-indexed).
218+
- `"Column"`: The column number (1-indexed) where the profanity starts.
219+
- `"Word"`: The detected profanity word.
220+
- `"Language"`: Indicates the source of the word list:
221+
- `<LanguageName>` (e.g., "English"): If only a built-in language list was used.
222+
- `"Custom"`: If `language=None` and only a `custom_words_list` was used.
223+
- `"Custom + <LanguageName>"` (e.g., "Custom + English"): If both a built-in list and `custom_words_list` were used.
224+
- `"Custom + All"`: If `language='All'` and `custom_words_list` were used.
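
The `"Language"` labels listed above follow a simple rule. As a sketch, a hypothetical `source_label` helper (not part of ValX) could derive them like this, assuming a custom list is always supplied when `language` is `None`:

```python
def source_label(language, has_custom_words):
    """Reproduce the "Language" labels listed above (hypothetical helper)."""
    if language is None:
        # Custom list only; assumes a custom list was actually supplied.
        return "Custom"
    if has_custom_words:
        # Built-in list (including 'All') combined with a custom list.
        return f"Custom + {language}"
    return language


cases = [("English", False), (None, True), ("English", True), ("All", True)]
for language, has_custom in cases:
    print(source_label(language, has_custom))
```

This prints `English`, `Custom`, `Custom + English`, and `Custom + All`, one per line, matching the four cases in the list.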
130225

131-
# Remove profanity
132-
removed = remove_profanity(sample_text, "text_cleaned.txt", language="English")
226+
227+
**4. Removing Profanity**
228+
229+
`remove_profanity` works similarly, accepting `language` and `custom_words_list` parameters.
230+
231+
```python
232+
from valx import remove_profanity, load_custom_profanity_from_file
233+
234+
sample_text = ["This is fuck, custombadword1, and text with customfileword1."]
235+
my_custom_words = ["custombadword1"]
236+
custom_words_from_file = load_custom_profanity_from_file("my_profanity_file.txt") # Assuming it contains 'customfileword1'
237+
238+
# Remove profanity using English built-in + my_custom_words + custom_words_from_file
239+
all_custom_words = list(set(my_custom_words + custom_words_from_file)) # Combine and deduplicate
240+
241+
cleaned_text = remove_profanity(
242+
sample_text,
243+
output_file="cleaned_output.txt", # Optional: saves to file
244+
language="English",
245+
custom_words_list=all_custom_words
246+
)
247+
# cleaned_text will have "fuck", "custombadword1", and "customfileword1" replaced with "bad word".
248+
# e.g., ["This is bad word, bad word, and text with bad word."]
249+
print(cleaned_text)
133250
```
134251

252+
The `load_profanity_words` function (used internally) also accepts `language` and `custom_words_list` if you need direct access to the word lists.
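
For intuition about the removal step, the following sketch replaces matched words with the string "bad word", as in the example above. The `replace_words` helper is hypothetical and assumes whole-word, case-insensitive matching; ValX's own matching rules may differ.

```python
import re


def replace_words(lines, words, replacement="bad word"):
    """Replace whole-word, case-insensitive matches in each line."""
    # Hypothetical helper; ValX's own matching rules may differ.
    if not words:
        return list(lines)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    return [pattern.sub(replacement, line) for line in lines]


cleaned = replace_words(["A foo, a bar, and a foobar."], ["foo", "bar"])
print(cleaned)  # ['A bad word, a bad word, and a foobar.']
```

The word boundaries keep `foobar` intact in this sketch, since neither `foo` nor `bar` appears there as a whole word.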
253+
135254
### Detect Sensitive Information
136255

137256
```python

custom_profanity.txt

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
1+
# This is a custom profanity list for testing
2+
custombadword1
3+
supersecretcurse
4+
anotherone
5+
6+
# Test empty lines and comments below
7+
8+
#anothercomment
9+
testwordalpha
10+
testwordbeta
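
As a sanity check on the file format rules from the README (one word per line, `#` comments and blank lines ignored), here is a sketch that applies them to the contents of this test file. `parse_profanity_lines` is a hypothetical stand-in, not ValX's `load_custom_profanity_from_file`; it also assumes surrounding whitespace is stripped before the comment check.

```python
def parse_profanity_lines(text):
    """Return words from text, skipping blank lines and '#' comments."""
    words = []
    for raw_line in text.splitlines():
        # Assumption: surrounding whitespace is not significant.
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        words.append(line)
    return words


sample = """# This is a custom profanity list for testing
custombadword1
supersecretcurse
anotherone

# Test empty lines and comments below

#anothercomment
testwordalpha
testwordbeta
"""
print(parse_profanity_lines(sample))
```

Under these rules the ten-line file yields five words: `custombadword1`, `supersecretcurse`, `anotherone`, `testwordalpha`, and `testwordbeta`.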
