Resource punkt_tab not found. #1280

@adnanjpg

Your Environment

  • Python version: 3.11.14
  • Operating system: Ubuntu 24.04.3 LTS
  • Lightwood version: 25.9.1.0

Describe your issue

(Originally reported in mindsdb/mindsdb#11796.)
When creating a Lightwood prediction model from a dataset that contains at least one string column, I get the following error:

Traceback (most recent call last):
  File "/home/myuser/myproj/reprod/a.py", line 17, in <module>
    json_ai = json_ai_from_problem(df, pdef)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/lightwood/api/high_level.py", line 68, in json_ai_from_problem
    type_information = infer_types(df, config={'engine': 'rule_based', 'pct_invalid': problem_definition.pct_invalid})
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/type_infer/api.py", line 38, in infer_types
    return engine.infer(data)
           ^^^^^^^^^^^^^^^^^^
  File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/type_infer/rule_based/core.py", line 57, in infer
    answer_arr.append(self.get_column_data_type(sample_df[x].dropna(), data, x, self.config['pct_invalid']))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/type_infer/rule_based/core.py", line 387, in get_column_data_type
    nr_words, word_dist, nr_words_dist = analyze_sentences(data)  # TODO: maybe pass entire corpus at once
                                         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/type_infer/rule_based/helpers.py", line 140, in analyze_sentences
    tokens = tokenize_text(text)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/type_infer/rule_based/helpers.py", line 165, in tokenize_text
    return (t.lower() for t in nltk.word_tokenize(decontracted(text)) if contains_alnum(t))
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 142, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 119, in sent_tokenize
    tokenizer = _get_punkt_tokenizer(language)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 105, in _get_punkt_tokenizer
    return PunktTokenizer(language)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/nltk/tokenize/punkt.py", line 1744, in __init__
    self.load_lang(lang)
  File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/nltk/tokenize/punkt.py", line 1749, in load_lang
    lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/nltk/data.py", line 579, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/home/myuser/nltk_data'
    - '/home/myuser/myproj/reprod/.venv/nltk_data'
    - '/home/myuser/myproj/reprod/.venv/share/nltk_data'
    - '/home/myuser/myproj/reprod/.venv/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
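A possible workaround, following the hint in the error message above, is to make sure the punkt_tab resource is present before calling into Lightwood. The sketch below is only an illustration: the helper name ensure_punkt_tab and its find/download parameters are hypothetical (they exist so the logic can be exercised without network access); by default it falls back to nltk.data.find and nltk.download, and it assumes the machine can reach the NLTK data server.

```python
from typing import Callable, Optional


def ensure_punkt_tab(
    find: Optional[Callable[[str], object]] = None,
    download: Optional[Callable[[str], bool]] = None,
    language: str = "english",
) -> bool:
    """Return True if punkt_tab is available, downloading it once if missing.

    `find` and `download` are hypothetical injection points for testing;
    when omitted, nltk.data.find and nltk.download are used.
    """
    if find is None or download is None:
        import nltk  # imported lazily so fakes can be used without NLTK installed

        find = find or nltk.data.find
        download = download or nltk.download

    # Same resource path that appears in the LookupError message.
    resource = f"tokenizers/punkt_tab/{language}/"

    try:
        find(resource)
        return True  # already installed, nothing to do
    except LookupError:
        pass

    # nltk.download returns False when the download fails (e.g. no network).
    if not download("punkt_tab"):
        return False

    # Re-check so the caller gets an accurate answer.
    try:
        find(resource)
        return True
    except LookupError:
        return False
```

In the reproduction script this would be called once, as ensure_punkt_tab(), before json_ai_from_problem(df, pdef). It only hides the symptom on the user side; whether type_infer or Lightwood should fetch the resource itself is a separate question.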

How can we replicate it?

Example dataset (file a.csv):

Col1,Col2,Col3
100,"G",9
35,"RR",10
88,"UU",1
95,"UU",6
11,"H",2
9,"SS",3
85,"VV",2
52,"F",6
64,"B",5
27,"F",8
32,"A",7
29,"TT",2
25,"D",1
24,"C",4
61,"E",6

Then run the following script, a.py:

import lightwood as lw
import pandas as pd

df = pd.read_csv('a.csv')

from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem, code_from_json_ai, predictor_from_code

pdef = ProblemDefinition.from_dict({'target': 'Col3'})
json_ai = json_ai_from_problem(df, pdef)
code = code_from_json_ai(json_ai)
predictor = predictor_from_code(code)

predictor.learn(df)

predictions = predictor.predict(df)

Labels: bug