Commit d0211cc

build: downgrade nltk version (#3527)
This PR rolls `nltk` back to `3.8.1`. It was bumped to `3.8.2` in #3512, but `3.8.2` is no longer available on PyPI because of an upstream issue (nltk/nltk#3301).
1 parent 9b778e2 commit d0211cc

5 files changed: +18 -8 lines changed

CHANGELOG.md

Lines changed: 11 additions & 0 deletions
@@ -1,3 +1,14 @@
+## 0.15.5-dev0
+
+### Enhancements
+
+### Features
+
+### Fixes
+
+* **Downgrade NLTK dependency version for compatibility**. Due to the unavailability of `nltk==3.8.2` on PyPI, the NLTK dependency has been downgraded to `<3.8.2`. This change ensures continued functionality and compatibility.
+
+
 ## 0.15.4
 
 ### Enhancements

docker/rockylinux-9.2/Dockerfile

Lines changed: 1 addition & 2 deletions
@@ -26,8 +26,7 @@ RUN python3.10 -m pip install pip==${PIP_VERSION} && \
 dnf -y groupremove "Development Tools" && \
 dnf clean all
 
-RUN python3.10 -c "import nltk; nltk.download('punkt')" && \
-python3.10 -c "import nltk; nltk.download('averaged_perceptron_tagger')"
+RUN python3.10 -c "from unstructured.nlp.tokenize import download_nltk_packages; download_nltk_packages()"
 
 FROM deps as code
 

requirements/base.txt

Lines changed: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ mypy-extensions==1.0.0
 # unstructured-client
 nest-asyncio==1.6.0
 # via unstructured-client
-nltk==3.8.2
+nltk==3.8.1
 # via -r ./base.in
 numpy==1.26.4
 # via -r ./base.in

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.15.4" # pragma: no cover
+__version__ = "0.15.5-dev0" # pragma: no cover

unstructured/nlp/tokenize.py

Lines changed: 4 additions & 4 deletions
@@ -16,9 +16,9 @@
 
 CACHE_MAX_SIZE: Final[int] = 128
 
-NLTK_DATA_FILENAME = "nltk_data_3.8.2.tar.gz"
+NLTK_DATA_FILENAME = "nltk_data.tgz"
 NLTK_DATA_URL = f"https://utic-public-cf.s3.amazonaws.com/{NLTK_DATA_FILENAME}"
-NLTK_DATA_SHA256 = "ba2ca627c8fb1f1458c15d5a476377a5b664c19deeb99fd088ebf83e140c1663"
+NLTK_DATA_SHA256 = "126faf671cd255a062c436b3d0f2d311dfeefcd92ffa43f7c3ab677309404d61"
 
 
 # NOTE(robinson) - mimic default dir logic from NLTK
@@ -114,10 +114,10 @@ def _download_nltk_packages_if_not_present():
 
     tagger_available = check_for_nltk_package(
         package_category="taggers",
-        package_name="averaged_perceptron_tagger_eng",
+        package_name="averaged_perceptron_tagger",
     )
     tokenizer_available = check_for_nltk_package(
-        package_category="tokenizers", package_name="punkt_tab"
+        package_category="tokenizers", package_name="punkt"
     )
 
     if not (tokenizer_available and tagger_available):
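
The `unstructured/nlp/tokenize.py` changes rely on two ideas: a data archive pinned by a SHA-256 digest, and a presence check for the `punkt` and `averaged_perceptron_tagger` packages before anything is downloaded. The following is a minimal sketch of both, not the code in the repository. The helper names `sha256_of_file`, `archive_is_trusted`, and `package_on_disk` are hypothetical, and the directory layout (`<data dir>/<category>/<name>` or a `.zip` next to it) is an assumption about how the data is unpacked. The real `check_for_nltk_package` is not shown in this diff.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Iterable

# Digest copied from the new NLTK_DATA_SHA256 value in the diff.
NLTK_DATA_SHA256 = "126faf671cd255a062c436b3d0f2d311dfeefcd92ffa43f7c3ab677309404d61"


def sha256_of_file(path: Path, chunk_size: int = 1 << 16) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_trusted(path: Path, expected_sha256: str) -> bool:
    """Compare a file's digest with the pinned value (case-insensitive)."""
    return sha256_of_file(path) == expected_sha256.lower()


def package_on_disk(data_dirs: Iterable[str], category: str, name: str) -> bool:
    """Hypothetical filesystem-only stand-in for a package presence check.

    Assumes a package lives at <data_dir>/<category>/<name>, either as an
    unpacked directory or as <name>.zip.
    """
    for base in data_dirs:
        parent = Path(base) / category
        if (parent / name).is_dir() or (parent / f"{name}.zip").is_file():
            return True
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # A stand-in archive: its digest will not match the pinned value.
        archive = Path(tmp) / "nltk_data.tgz"
        archive.write_bytes(b"not the real archive")
        print("archive trusted:", archive_is_trusted(archive, NLTK_DATA_SHA256))

        # Simulate an unpacked tokenizer but a missing tagger.
        (Path(tmp) / "tokenizers" / "punkt").mkdir(parents=True)
        tokenizer_available = package_on_disk([tmp], "tokenizers", "punkt")
        tagger_available = package_on_disk(
            [tmp], "taggers", "averaged_perceptron_tagger"
        )
        if not (tokenizer_available and tagger_available):
            print("at least one package is missing; a download would be needed")
```

Pinning the digest next to the filename means that renaming the archive (here from `nltk_data_3.8.2.tar.gz` to `nltk_data.tgz`) has to be accompanied by a digest update, which is what the first hunk does.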
