Closed
pandas has decided to introduce a default string dtype, which will be used instead of object dtype when values are inferred to be strings. See https://pandas.pydata.org/pdeps/0014-string-dtype.html for the details, and pandas-dev/pandas#54792 for the progress of the implementation.
This is already available in the main branch of pandas (and will also be in an upcoming 2.3 release) behind the feature flag pd.options.future.infer_string = True.
Right now, if you enable this flag (with a nightly version of pandas) and use fastparquet to write a dataframe with a string column, the write fails as follows, because fastparquet is not yet aware of the new dtype:
In [1]: pd.options.future.infer_string = True
In [2]: df = pd.DataFrame({"a": ["some", "strings"]})
In [3]: df.dtypes
Out[3]:
a str
dtype: object
In [4]: df.to_parquet("test_new_string_dtype.parquet", engine="fastparquet")
...
File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:904, in make_metadata(data, has_nulls, ignore_columns, fixed_text, object_encoding, times, index_cols, partition_cols, cols_dtype)
902 se.name = column
903 else:
--> 904 se, type = find_type(data[column], fixed_text=fixed,
905 object_encoding=oencoding, times=times,
906 is_index=is_index)
907 col_has_nulls = has_nulls
908 if has_nulls is None:
File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:222, in find_type(data, fixed_text, object_encoding, times, is_index)
218 type, converted_type, width = (parquet_thrift.Type.BYTE_ARRAY,
219 parquet_thrift.ConvertedType.UTF8,
220 None)
221 else:
--> 222 raise ValueError("Don't know how to convert data type: %s" % dtype)
223 se = parquet_thrift.SchemaElement(
224 name=norm_col_name(data.name, is_index), type_length=width,
225 converted_type=converted_type, type=type,
(...)
228 i32=True
229 )
230 return se, type
ValueError: Don't know how to convert data type: str
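Until fastparquet recognizes the new dtype, one possible stopgap on the user side is to cast string-dtype columns back to object dtype before writing. The following is only a sketch of that idea, not the fix proposed here: `strings_to_object` is a hypothetical helper name (it is not part of pandas or fastparquet), and it assumes that falling back to object dtype is acceptable for the data being written.

```python
import pandas as pd


def strings_to_object(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of ``df`` with string-dtype columns cast to object dtype.

    Hypothetical helper for illustration only. It assumes the new default
    string dtype is an instance of ``pd.StringDtype``.
    """
    string_cols = [
        col for col, dtype in df.dtypes.items()
        if isinstance(dtype, pd.StringDtype)
    ]
    # astype returns a new DataFrame; the input is left unchanged.
    return df.astype({col: object for col in string_cols})


# Build a frame with an explicit string dtype so the example does not
# depend on the infer_string feature flag being enabled.
df = pd.DataFrame({"a": pd.array(["some", "strings"], dtype="string")})
converted = strings_to_object(df)
```

With this sketch, `converted.to_parquet("test_new_string_dtype.parquet", engine="fastparquet")` should take the existing object-dtype code path in `find_type`, although the column will then be read back as object dtype rather than the new string dtype.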