Bug description
I encountered an issue while running the code from the getting-started-movielens folder in my Kaggle notebook. The "Download and Convert" notebook completed successfully, but the "ETL with NVTabular" step fails with the error below.
Steps/Code to reproduce bug
- Set up the environment
  - I started by running the code from the "Download and Convert" notebook in my Kaggle notebook.
  - I then continued with the code from the "ETL with NVTabular" notebook, which is where the issue arises.
- Modify the code: in the notebook, I updated the INPUT_DATA_DIR variable to point to the correct path in my Kaggle notebook. For example:

  ```python
  INPUT_DATA_DIR = '/kaggle/working/data'
  ```

  All other code remains unchanged.
- Run the notebook: I executed the notebook cells in sequence. The error occurs when running the following line:

  ```python
  workflow.fit(train_dataset)
  ```

  The error message received is: `TypeError: function is not supported for this dtype: size`.
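For reference, this is roughly the workflow being fit. It is a simplified sketch of the ops from the ETL notebook (Categorify on the id columns plus a binary rating target); only INPUT_DATA_DIR and the file name reflect my Kaggle setup, everything else follows the notebook:

```python
import os
import nvtabular as nvt

INPUT_DATA_DIR = '/kaggle/working/data'

# Encode the id columns as contiguous integer categories and
# binarize the rating column, as in the ETL notebook (sketch).
cat_features = ["userId", "movieId"] >> nvt.ops.Categorify()
ratings = ["rating"] >> nvt.ops.LambdaOp(lambda col: (col > 3).astype("int8"))

workflow = nvt.Workflow(cat_features + ratings)

# train.parquet is produced by the "Download and Convert" notebook.
train_dataset = nvt.Dataset(os.path.join(INPUT_DATA_DIR, "train.parquet"))
workflow.fit(train_dataset)  # <- raises TypeError: function is not supported for this dtype: size
```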
Expected behavior
`workflow.fit(train_dataset)` should complete without errors and compute the workflow statistics, as it does in the reference notebook. Could you please assist in resolving this issue?
Environment Details
- Merlin version: 1.12.1
- NVTabular version: 23.08.00
- Platform: Kaggle notebook
- Python version: 3.10.14
- PyTorch version: 2.4.0, CUDA available: True
- TensorFlow version: 2.16.1, GPU available: True
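The cuDF and Dask versions are not listed above; here is a small snippet to collect them in the same environment, since the traceback ends inside cuDF's groupby (plain version printing, nothing NVTabular-specific; `getattr` guards against packages without a `__version__` attribute):

```python
import cudf
import dask
import merlin.core
import nvtabular

# Print the versions of the libraries on the failing code path
# (Workflow -> merlin DaskExecutor -> cuDF groupby).
for mod in (nvtabular, merlin.core, dask, cudf):
    print(mod.__name__, getattr(mod, "__version__", "unknown"))
```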
Additional context
Full Error

```
TypeError Traceback (most recent call last)
File :1
File /opt/conda/lib/python3.10/site-packages/nvtabular/workflow/workflow.py:213, in Workflow.fit(self, dataset)
199 def fit(self, dataset: Dataset) -> "Workflow":
200 """Calculates statistics for this workflow on the input dataset
201
202 Parameters
(...)
211 This Workflow with statistics calculated on it
212 """
--> 213 self.executor.fit(dataset, self.graph)
214 return self
File /opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py:466, in DaskExecutor.fit(self, dataset, graph, refit)
462 if not current_phase:
463 # this shouldn't happen, but lets not infinite loop just in case
464 raise RuntimeError("failed to find dependency-free StatOperator to fit")
--> 466 self.fit_phase(dataset, current_phase)
468 # Remove all the operators we processed in this phase, and remove
469 # from the dependencies of other ops too
470 for node in current_phase:
File /opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py:532, in DaskExecutor.fit_phase(self, dataset, nodes, strict)
530 stats.append(node.op.fit(node.input_columns, Dataset(ddf)))
531 else:
--> 532 stats.append(node.op.fit(node.input_columns, transformed_ddf))
533 except Exception:
534 LOG.exception("Failed to fit operator %s", node.op)
File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result
File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:400, in Categorify.fit(self, col_selector, ddf)
391 # Define a rough row-count at which we are likely to
392 # start hitting memory-pressure issues that cannot
393 # be accommodated with smaller partition sizes.
394 # By default, we estimate a "problematic" cardinality
395 # to be one that consumes >12.5% of the total memory.
396 self.cardinality_memory_limit = parse_bytes(
397 self.cardinality_memory_limit or int(device_mem_size(kind="total", cpu=_cpu) * 0.125)
398 )
--> 400 dsk, key = _category_stats(ddf, self._create_fit_options_from_columns(columns))
401 return Delayed(key, dsk)
File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1551, in _category_stats(ddf, options)
1549 if options.agg_cols == [] and options.agg_list == []:
1550 options.agg_list = ["size"]
-> 1551 return _groupby_to_disk(ddf, _write_uniques, options)
1553 # Otherwise, getting category-statistics
1554 if isinstance(options.agg_cols, str):
File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1406, in _groupby_to_disk(ddf, write_func, options)
1402 # Use map_partitions to improve task fusion
1403 grouped = ddf.to_bag(format="frame").map_partitions(
1404 _top_level_groupby, options=options, token="level_1"
1405 )
-> 1406 _grouped_meta = _top_level_groupby(ddf._meta, options=options)
1407 _grouped_meta_col = {}
1409 dsk_split = defaultdict(dict)
File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result
File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1017, in _top_level_groupby(df, options, spill)
1015 df_gb = _maybe_flatten_list_column(cat_col_selector.names[0], df_gb)
1016 # NOTE: groupby(..., dropna=False) requires pandas>=1.1.0
-> 1017 gb = df_gb.groupby(cat_col_selector.names, dropna=False).agg(agg_dict)
1018 gb.columns = [
1019 _make_name((tuple(cat_col_selector.names) + name[1:]), sep=options.name_sep)
1020 if name[0] == cat_col_selector.names[0]
1021 else _make_name((tuple(cat_col_selector.names) + name), sep=options.name_sep)
1022 for name in gb.columns.to_flat_index()
1023 ]
1024 gb.reset_index(inplace=True, drop=False)
File /opt/conda/lib/python3.10/site-packages/cudf/utils/performance_tracking.py:51, in _performance_tracking.<locals>.wrapper(*args, **kwargs)
43 if nvtx.enabled():
44 stack.enter_context(
45 nvtx.annotate(
46 message=func.__qualname__,
(...)
49 )
50 )
---> 51 return func(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:629, in GroupBy.agg(self, func)
619 orig_dtypes = tuple(c.dtype for c in columns)
621 # Note: When there are no key columns, the below produces
622 # an Index with float64 dtype, while Pandas returns
623 # an Index with int64 dtype.
624 # (GH: 6945)
625 (
626 result_columns,
627 grouped_key_cols,
628 included_aggregations,
--> 629 ) = self._groupby.aggregate(columns, normalized_aggs)
631 result_index = self.grouping.keys._from_columns_like_self(
632 grouped_key_cols,
633 )
635 multilevel = _is_multi_agg(func)
File groupby.pyx:192, in cudf._lib.groupby.GroupBy.aggregate()
TypeError: function is not supported for this dtype: size
```
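To help narrow this down, here is a minimal sketch of an isolated reproduction attempt (my own guess, not code from NVTabular): it exercises the same general cuDF call pattern that `_top_level_groupby` ends in, i.e. `groupby(..., dropna=False)` followed by `.agg()` with a "size" aggregation:

```python
import cudf

# Tiny frame standing in for the MovieLens columns (hypothetical data).
df = cudf.DataFrame(
    {"movieId": [1, 1, 2, 3], "userId": [10, 11, 10, 12]}
)

# Same call shape as the failing line in categorify.py:
# group by the categorical column and request a "size" aggregation via .agg().
gb = df.groupby(["movieId"], dropna=False).agg({"userId": ["size"]})
print(gb)
```

If this snippet raises the same TypeError on Kaggle, the problem can be reproduced without NVTabular, which may point at the installed cuDF version rather than the notebook code.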