Description
Feature request
I would love to use the Hugging Face datasets library to directly load datasets composed of .tsfile files, for example:
ds = load_dataset("username/dataset-with-tsfile-files")
This feature would allow researchers working on time-series tasks to seamlessly integrate datasets stored in the Apache TsFile format into the Hugging Face ecosystem.
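As a rough illustration of what such a loader might need to do internally, the sketch below flattens series grouped by device and sensor into row-oriented examples. It is only a sketch under stated assumptions: the nested-dictionary layout, the `iter_examples` helper, and the output field names are all hypothetical, and no real TsFile reader API is used (the in-memory `sample` stands in for data that would be read from a .tsfile file).

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical in-memory stand-in for data read from a .tsfile file:
# {device_id: {sensor_id: [(timestamp_ms, value), ...]}}
SeriesByDevice = Dict[str, Dict[str, List[Tuple[int, float]]]]


def iter_examples(data: SeriesByDevice) -> Iterator[dict]:
    """Yield one row-oriented example per (device, sensor) series."""
    for device_id, sensors in data.items():
        for sensor_id, points in sensors.items():
            ordered = sorted(points)  # sort by timestamp
            yield {
                "device": device_id,
                "sensor": sensor_id,
                "timestamps": [t for t, _ in ordered],
                "values": [v for _, v in ordered],
            }


# Toy data: one device with two sensors (values are made up).
sample: SeriesByDevice = {
    "device_1": {
        "temperature": [(2000, 21.7), (1000, 21.5)],
        "humidity": [(1000, 0.40), (2000, 0.42)],
    }
}

rows = list(iter_examples(sample))
```

A generator of this shape could plausibly be handed to `datasets.Dataset.from_generator`, although the actual integration proposed here may be structured quite differently.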
Motivation
Apache TsFile is a mature Apache project and a dedicated file format designed for efficient time-series data storage and retrieval. The repository is here.
It has been widely adopted in the IoT community and serves as the underlying storage format for projects like Apache IoTDB.
Apache TsFile has the following advantages in the time-series area:
- Time-series native schema. Time-series data is organized by device and sensor IDs.
- A complete multi-language API (Python, Java, C++, C) for reading and writing TsFile files.
- Superior write throughput and query efficiency.
- High compression ratio through per-series encoding and compression schemes.
- Efficient dataset transformation. ETL-free file compaction and efficient random access to time-series chunks, enabling faster data loading and lower query latency.
These properties make TsFile highly suitable for time-series model training, especially where time-series random access and efficient I/O are critical.
More details can be found in the paper "Apache TsFile: An IoT-native Time Series File Format" (VLDB 2024).
Integrating TsFile support into datasets will benefit the broader machine learning community working on tasks such as forecasting and anomaly detection.
Your contribution
As a member of the TsFile community, I recently initiated a proposal to integrate TsFile with Hugging Face, which has received enthusiastic responses from the community.
We are willing to make the following contributions:
- Implement and contribute the PR that adds TsFile dataset support to Hugging Face datasets.
- Provide long-term maintenance for this integration.
- Any other needs for TsFile to support large-scale time-series datasets.
We are excited to contribute and to keep participating in the future evolution of TsFile and datasets to better support time-series data workloads.