Bug description
I encountered an issue while running the code from the getting-started-movielens folder in my Kaggle notebook. The "Download and Convert" notebook completed successfully, but the "ETL with NVTabular" step fails with the error below.
Steps/Code to reproduce bug
- Set up the environment
  - I started by running the code from the "Download and Convert" notebook in my Kaggle notebook.
  - I then continued with the code from the "ETL with NVTabular" notebook, which is where the issue arises.
- Modify the code: in the notebook, I updated the INPUT_DATA_DIR variable to point to the correct path in my Kaggle notebook. For example:

  ```python
  INPUT_DATA_DIR = '/kaggle/working/data'
  ```

  All other code remains unchanged.
- Run the notebook: I executed the notebook cells in sequence. The error occurs when running the following line:

  ```python
  workflow.fit(train_dataset)
  ```

  The error message received is: `TypeError: function is not supported for this dtype: size`.
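For reference, this is roughly the workflow being fit. It is a simplified sketch of the ops from the ETL notebook (Categorify on the id columns plus a binary rating target); only INPUT_DATA_DIR and the file name reflect my Kaggle setup, everything else follows the notebook:

```python
import os
import nvtabular as nvt

INPUT_DATA_DIR = '/kaggle/working/data'

# Encode the id columns as contiguous integer categories and
# binarize the rating column, as in the ETL notebook (sketch).
cat_features = ["userId", "movieId"] >> nvt.ops.Categorify()
ratings = ["rating"] >> nvt.ops.LambdaOp(lambda col: (col > 3).astype("int8"))

workflow = nvt.Workflow(cat_features + ratings)

# train.parquet is produced by the "Download and Convert" notebook.
train_dataset = nvt.Dataset(os.path.join(INPUT_DATA_DIR, "train.parquet"))
workflow.fit(train_dataset)  # <- raises TypeError: function is not supported for this dtype: size
```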
Expected behavior
`workflow.fit(train_dataset)` should complete without errors and compute the workflow statistics, as it does in the reference notebook. Could you please assist in resolving this issue?
Environment Details
- Merlin version: 1.12.1
- NVTabular version: 23.08.00
- Platform: Kaggle notebook
- Python version: 3.10.14
- PyTorch version: 2.4.0, CUDA available: True
- TensorFlow version: 2.16.1, GPU available: True
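The cuDF and Dask versions are not listed above; here is a small snippet to collect them in the same environment, since the traceback ends inside cuDF's groupby (plain version printing, nothing NVTabular-specific; `getattr` guards against packages without a `__version__` attribute):

```python
import cudf
import dask
import merlin.core
import nvtabular

# Print the versions of the libraries on the failing code path
# (Workflow -> merlin DaskExecutor -> cuDF groupby).
for mod in (nvtabular, merlin.core, dask, cudf):
    print(mod.__name__, getattr(mod, "__version__", "unknown"))
```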
Additional context
Full Error

```
TypeError Traceback (most recent call last)
File :1
File /opt/conda/lib/python3.10/site-packages/nvtabular/workflow/workflow.py:213, in Workflow.fit(self, dataset)
199 def fit(self, dataset: Dataset) -> "Workflow":
200 """Calculates statistics for this workflow on the input dataset
201
202 Parameters
(...)
211 This Workflow with statistics calculated on it
212 """
--> 213 self.executor.fit(dataset, self.graph)
214 return self
File /opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py:466, in DaskExecutor.fit(self, dataset, graph, refit)
462 if not current_phase:
463 # this shouldn't happen, but lets not infinite loop just in case
464 raise RuntimeError("failed to find dependency-free StatOperator to fit")
--> 466 self.fit_phase(dataset, current_phase)
468 # Remove all the operators we processed in this phase, and remove
469 # from the dependencies of other ops too
470 for node in current_phase:
File /opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py:532, in DaskExecutor.fit_phase(self, dataset, nodes, strict)
530 stats.append(node.op.fit(node.input_columns, Dataset(ddf)))
531 else:
--> 532 stats.append(node.op.fit(node.input_columns, transformed_ddf))
533 except Exception:
534 LOG.exception("Failed to fit operator %s", node.op)
File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result
File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:400, in Categorify.fit(self, col_selector, ddf)
391 # Define a rough row-count at which we are likely to
392 # start hitting memory-pressure issues that cannot
393 # be accommodated with smaller partition sizes.
394 # By default, we estimate a "problematic" cardinality
395 # to be one that consumes >12.5% of the total memory.
396 self.cardinality_memory_limit = parse_bytes(
397 self.cardinality_memory_limit or int(device_mem_size(kind="total", cpu=_cpu) * 0.125)
398 )
--> 400 dsk, key = _category_stats(ddf, self._create_fit_options_from_columns(columns))
401 return Delayed(key, dsk)
File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1551, in _category_stats(ddf, options)
1549 if options.agg_cols == [] and options.agg_list == []:
1550 options.agg_list = ["size"]
-> 1551 return _groupby_to_disk(ddf, _write_uniques, options)
1553 # Otherwise, getting category-statistics
1554 if isinstance(options.agg_cols, str):
File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1406, in _groupby_to_disk(ddf, write_func, options)
1402 # Use map_partitions to improve task fusion
1403 grouped = ddf.to_bag(format="frame").map_partitions(
1404 _top_level_groupby, options=options, token="level_1"
1405 )
-> 1406 _grouped_meta = _top_level_groupby(ddf._meta, options=options)
1407 _grouped_meta_col = {}
1409 dsk_split = defaultdict(dict)
File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result
File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1017, in _top_level_groupby(df, options, spill)
1015 df_gb = _maybe_flatten_list_column(cat_col_selector.names[0], df_gb)
1016 # NOTE: groupby(..., dropna=False) requires pandas>=1.1.0
-> 1017 gb = df_gb.groupby(cat_col_selector.names, dropna=False).agg(agg_dict)
1018 gb.columns = [
1019 _make_name((tuple(cat_col_selector.names) + name[1:]), sep=options.name_sep)
1020 if name[0] == cat_col_selector.names[0]
1021 else _make_name((tuple(cat_col_selector.names) + name), sep=options.name_sep)
1022 for name in gb.columns.to_flat_index()
1023 ]
1024 gb.reset_index(inplace=True, drop=False)
File /opt/conda/lib/python3.10/site-packages/cudf/utils/performance_tracking.py:51, in _performance_tracking.<locals>.wrapper(*args, **kwargs)
43 if nvtx.enabled():
44 stack.enter_context(
45 nvtx.annotate(
46 message=func.__qualname__,
(...)
49 )
50 )
---> 51 return func(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:629, in GroupBy.agg(self, func)
619 orig_dtypes = tuple(c.dtype for c in columns)
621 # Note: When there are no key columns, the below produces
622 # an Index with float64 dtype, while Pandas returns
623 # an Index with int64 dtype.
624 # (GH: 6945)
625 (
626 result_columns,
627 grouped_key_cols,
628 included_aggregations,
--> 629 ) = self._groupby.aggregate(columns, normalized_aggs)
631 result_index = self.grouping.keys._from_columns_like_self(
632 grouped_key_cols,
633 )
635 multilevel = _is_multi_agg(func)
File groupby.pyx:192, in cudf._lib.groupby.GroupBy.aggregate()
TypeError: function is not supported for this dtype: size
```
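To help narrow this down, here is a minimal sketch of an isolated reproduction attempt (my own guess, not code from NVTabular): it exercises the same general cuDF call pattern that `_top_level_groupby` ends in, i.e. `groupby(..., dropna=False)` followed by `.agg()` with a "size" aggregation:

```python
import cudf

# Tiny frame standing in for the MovieLens columns (hypothetical data).
df = cudf.DataFrame(
    {"movieId": [1, 1, 2, 3], "userId": [10, 11, 10, 12]}
)

# Same call shape as the failing line in categorify.py:
# group by the categorical column and request a "size" aggregation via .agg().
gb = df.groupby(["movieId"], dropna=False).agg({"userId": ["size"]})
print(gb)
```

If this snippet raises the same TypeError on Kaggle, the problem can be reproduced without NVTabular, which may point at the installed cuDF version rather than the notebook code.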