fix: restore files unintentionally reverted from main
The T2E commit was built from an older snapshot of main, which
caused it to revert the parquet utility functions, the pyarrow dependency,
FAQ content, docs references, and related usages across the codebase.
This commit restores all those files to their current main state,
leaving only the actual T2E changes in the PR diff.
## When loading `.parquet` files, categorical columns seem to be returned as `int`, losing the information that they were categorical.

This is a known issue with parquet file support in Python.
Neither of the existing libraries, `pyarrow` or `fastparquet`, exactly reproduces the original input data types for categorical columns.
See e.g. [Issue 29017](https://github.com/apache/arrow/issues/29017) and [Issue 27067](https://github.com/apache/arrow/issues/27067).

To ensure a proper data type round trip, the module `octopus.utils` provides the functions `parquet_load()` and `parquet_save()`, which store precise dtype information in the parquet metadata and reconstruct it on load.
Files written with `parquet_save()` are expected to be readable by any parquet-compatible reader.
However, the original dtypes are only guaranteed to be reconstructed when the file is read with `parquet_load()`.

For details on which dtypes are tested and supported, see [tests/infrastructure/test_file_io.py](https://github.com/emdgroup/octopus/blob/main/tests/infrastructure/test_file_io.py).
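
The idea of keeping dtype information in metadata can be illustrated with a small, library-independent sketch. It is not the implementation of `parquet_save()` or `parquet_load()`. The key name `dtype_info`, the helper names, and the JSON layout are assumptions made up for this illustration. It only shows why a reader that ignores the metadata sees integers, while a reader that understands it can rebuild the categorical labels.

```python
import json

# Hypothetical metadata key, chosen for this sketch only (not taken from octopus).
DTYPE_KEY = b"dtype_info"


def encode_categorical(values, ordered=False):
    """Split a list of labels into integer codes plus a dtype description."""
    categories = sorted(set(values))
    index = {label: code for code, label in enumerate(categories)}
    codes = [index[v] for v in values]
    info = {"kind": "category", "categories": categories, "ordered": ordered}
    return codes, info


def build_metadata(column_info):
    """Serialize per-column dtype descriptions as a bytes-to-bytes mapping,
    which is the shape of key/value metadata in a parquet file."""
    return {DTYPE_KEY: json.dumps(column_info).encode("utf-8")}


def restore_column(name, stored, metadata):
    """Rebuild the original labels if a dtype description is available;
    otherwise return the stored values unchanged."""
    raw = metadata.get(DTYPE_KEY)
    if raw is None:
        return stored  # a reader without the metadata only sees integers
    info = json.loads(raw.decode("utf-8")).get(name)
    if info is None or info["kind"] != "category":
        return stored
    return [info["categories"][code] for code in stored]


colors = ["red", "green", "red", "blue"]
codes, info = encode_categorical(colors)
metadata = build_metadata({"color": info})

plain = restore_column("color", codes, {})           # metadata ignored
restored = restore_column("color", codes, metadata)  # metadata used
```

Here `plain` stays a list of integer codes, while `restored` equals the original labels. With `pyarrow`, a bytes mapping like this could be attached to a table via `Table.replace_schema_metadata()` before writing. Whether `octopus.utils` does it this way is not stated here.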