Support upcoming default pandas string dtype (pandas >= 3) #930

@jorisvandenbossche

Pandas decided to introduce a default string dtype (which will be used instead of object dtype when values are inferred to be strings); see https://pandas.pydata.org/pdeps/0014-string-dtype.html for the details (and pandas-dev/pandas#54792 for the progress of the implementation).

This is already available in the main branch of pandas (and will also be in an upcoming 2.3 release) behind a feature flag: pd.options.future.infer_string = True.

Right now, if you enable this flag (with a nightly version of pandas) and use fastparquet to write a dataframe with a string column, it errors as follows (because fastparquet is not yet aware of the new dtype):

In [1]: pd.options.future.infer_string = True

In [2]: df = pd.DataFrame({"a": ["some", "strings"]})

In [3]: df.dtypes
Out[3]: 
a    str
dtype: object

In [4]: df.to_parquet("test_new_string_dtype.parquet", engine="fastparquet")
...
File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:904, in make_metadata(data, has_nulls, ignore_columns, fixed_text, object_encoding, times, index_cols, partition_cols, cols_dtype)
    902     se.name = column
    903 else:
--> 904     se, type = find_type(data[column], fixed_text=fixed,
    905                          object_encoding=oencoding, times=times,
    906                          is_index=is_index)
    907 col_has_nulls = has_nulls
    908 if has_nulls is None:

File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:222, in find_type(data, fixed_text, object_encoding, times, is_index)
    218     type, converted_type, width = (parquet_thrift.Type.BYTE_ARRAY,
    219                                    parquet_thrift.ConvertedType.UTF8,
    220                                    None)
    221 else:
--> 222     raise ValueError("Don't know how to convert data type: %s" % dtype)
    223 se = parquet_thrift.SchemaElement(
    224     name=norm_col_name(data.name, is_index), type_length=width,
    225     converted_type=converted_type, type=type,
   (...)
    228     i32=True
    229 )
    230 return se, type

ValueError: Don't know how to convert data type: str
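
Until fastparquet recognises the new dtype, one possible stopgap on the user side is to cast the affected columns back to object dtype before writing. The following is only a minimal sketch, not a proposed fix for fastparquet itself. The helper names are hypothetical, it assumes the new dtype prints as "str" (as in the df.dtypes output above), it relies only on DataFrame.dtypes and DataFrame.astype, and it has not been verified against fastparquet here.

```python
# Assumption: the new default string dtype reports itself as "str",
# matching the `df.dtypes` output shown in the session above.
NEW_STRING_DTYPE_NAME = "str"


def new_string_columns(df):
    """Return the names of columns whose dtype prints as "str"."""
    return [
        name for name, dtype in df.dtypes.items()
        if str(dtype) == NEW_STRING_DTYPE_NAME
    ]


def cast_new_strings_to_object(df):
    """Hypothetical helper: cast new-string-dtype columns back to object."""
    columns = new_string_columns(df)
    if not columns:
        return df  # nothing to convert, hand back the input unchanged
    return df.astype({name: object for name in columns})


# Intended use with a real DataFrame (not executed here):
#   cast_new_strings_to_object(df).to_parquet(
#       "test_new_string_dtype.parquet", engine="fastparquet")
```

This only touches columns; an index that also uses the new string dtype would need the same treatment, and the cast gives up the new dtype for the written frame.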
