Commit fbc7a69

feat: change english_words to set for performance gain (#380)
1 parent 1e39e1a commit fbc7a69

3 files changed: +6 -4 lines changed

CHANGELOG.md

Lines changed: 3 additions & 1 deletion
@@ -1,7 +1,9 @@
-## 0.5.5-dev0
+## 0.5.5-dev1
 
 ### Enhancements
 
+* `contains_english_word()`, used heavily in text processing, is 10x faster.
+
 ### Features
 
 * Add `clean_non_ascii_chars` to remove non-ascii characters from unicode string

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.5.5-dev0" # pragma: no cover
+__version__ = "0.5.5-dev1" # pragma: no cover

unstructured/nlp/english_words.py

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 import os
 import pathlib
-from typing import List
+from typing import List, Set
 
 DIRECTORY = pathlib.Path(__file__).parent.resolve()
 # NOTE(robinson) - the list of English words is based on the nlkt.corpus.words corpus
@@ -14,4 +14,4 @@
 
 # NOTE(robinson) - add new words that we want to pass for the English check in here
 ADDITIONAL_ENGLISH_WORDS: List[str] = []
-ENGLISH_WORDS: List[str] = BASE_ENGLISH_WORDS + ADDITIONAL_ENGLISH_WORDS
+ENGLISH_WORDS: Set[str] = set(BASE_ENGLISH_WORDS + ADDITIONAL_ENGLISH_WORDS)
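
Note on the change: a membership test (`x in container`) scans a list element by element, while a set uses a hash lookup, which is why converting `ENGLISH_WORDS` to a set speeds up word checks. The sketch below illustrates the difference with `timeit`. The word data and the names `WORDS_AS_LIST`, `WORDS_AS_SET`, and `time_lookup` are hypothetical stand-ins, not part of this commit, and the timings it prints will vary by machine.

```python
import timeit
from typing import List, Set

# Hypothetical stand-ins for BASE_ENGLISH_WORDS / ENGLISH_WORDS; the real
# word list is loaded by the package and is not reproduced here.
WORDS_AS_LIST: List[str] = [f"word{i}" for i in range(100_000)]
WORDS_AS_SET: Set[str] = set(WORDS_AS_LIST)


def time_lookup(container, token: str, number: int = 200) -> float:
    """Return the total seconds taken by `number` membership tests."""
    return timeit.timeit(lambda: token in container, number=number)


if __name__ == "__main__":
    token = "word99999"  # last element: worst case for the list scan
    print(f"list: {time_lookup(WORDS_AS_LIST, token):.6f} s")
    print(f"set:  {time_lookup(WORDS_AS_SET, token):.6f} s")
```

One design consequence: a set is unordered and does not support indexing, so any caller that indexed into `ENGLISH_WORDS` or relied on its order would need to be adjusted.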
