Labels: bug (Something isn't working)
Description
Your Environment
- Python version: 3.11.14
- Operating system: Ubuntu 24.04.3 LTS
- Lightwood version: 25.9.1.0
Describe your issue
(Originally reported in mindsdb/mindsdb#11796)
When creating a Lightwood prediction model from a dataset containing at least one string column, I get the following error:
Traceback (most recent call last):
File "/home/myuser/myproj/reprod/a.py", line 17, in <module>
json_ai = json_ai_from_problem(df, pdef)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/lightwood/api/high_level.py", line 68, in json_ai_from_problem
type_information = infer_types(df, config={'engine': 'rule_based', 'pct_invalid': problem_definition.pct_invalid})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/type_infer/api.py", line 38, in infer_types
return engine.infer(data)
^^^^^^^^^^^^^^^^^^
File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/type_infer/rule_based/core.py", line 57, in infer
answer_arr.append(self.get_column_data_type(sample_df[x].dropna(), data, x, self.config['pct_invalid']))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/type_infer/rule_based/core.py", line 387, in get_column_data_type
nr_words, word_dist, nr_words_dist = analyze_sentences(data) # TODO: maybe pass entire corpus at once
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/type_infer/rule_based/helpers.py", line 140, in analyze_sentences
tokens = tokenize_text(text)
^^^^^^^^^^^^^^^^^^^
File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/type_infer/rule_based/helpers.py", line 165, in tokenize_text
return (t.lower() for t in nltk.word_tokenize(decontracted(text)) if contains_alnum(t))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 142, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 119, in sent_tokenize
tokenizer = _get_punkt_tokenizer(language)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 105, in _get_punkt_tokenizer
return PunktTokenizer(language)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/nltk/tokenize/punkt.py", line 1744, in __init__
self.load_lang(lang)
File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/nltk/tokenize/punkt.py", line 1749, in load_lang
lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/myuser/myproj/reprod/.venv/lib/python3.11/site-packages/nltk/data.py", line 579, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt_tab not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt_tab')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt_tab/english/
Searched in:
- '/home/myuser/nltk_data'
- '/home/myuser/myproj/reprod/.venv/nltk_data'
- '/home/myuser/myproj/reprod/.venv/share/nltk_data'
- '/home/myuser/myproj/reprod/.venv/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************
How can we replicate it?
Example dataset (a.csv):
Col1,Col2,Col3
100,"G",9
35,"RR",10
88,"UU",1
95,"UU",6
11,"H",2
9,"SS",3
85,"VV",2
52,"F",6
64,"B",5
27,"F",8
32,"A",7
29,"TT",2
25,"D",1
24,"C",4
61,"E",6
Then run the following script (a.py):
import pandas as pd
from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem, code_from_json_ai, predictor_from_code

df = pd.read_csv('a.csv')
pdef = ProblemDefinition.from_dict({'target': 'Col3'})
json_ai = json_ai_from_problem(df, pdef)
code = code_from_json_ai(json_ai)
predictor = predictor_from_code(code)
predictor.learn(df)
predictions = predictor.predict(df)