Hi, interesting! As far as I know, gpfdist is a protocol for external tables, and the gpfdist tool is one implementation of that protocol. Other tools, such as GPSS (Greenplum Streaming Server), also implement the gpfdist protocol for external tables. The data format is specified by the format option when defining the external table:
CREATE EXTERNAL TABLE ext1 (d varchar(20)) location ('gpfdist://9727/d.dat') format 'csv' (DELIMITER '|');
Are you planning to support additional protocols (e.g., SFTP) for external tables:
CREATE EXTERNAL TABLE ext1 (d varchar(20)) location ('SFTP://9727/d.dat')
or to provide more format options?
CREATE EXTERNAL TABLE ext1 (d varchar(20)) location ('gpfdist://9727/d.dat') format 'SFTP';
The discussion seems to be about supporting more protocols for external tables, not multiple data sources for a single external table. To be clear, a single external table already supports multiple data sources. The topic has two targets:
Transfer Protocol
This looks like it is meant to support more clients for fetching files. BTW, gpfdist also supports transforming data on the server side.
File Format
External tables support the CUSTOM format, so you could consider implementing a new file format.
The CUSTOM format assumes a row-based data stream. I'm not sure it's suitable for column-based storage formats; that needs a further spike.
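To make the row-versus-column concern concrete, here is a minimal Python sketch of the pivot a columnar source would need before it could feed a row-based stream. The function names `columns_to_rows` and `rows_to_delimited` and the sample batch are hypothetical and exist only for illustration; this is not how gpfdist or the CUSTOM formatter is implemented.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List, Tuple


def columns_to_rows(columns: Dict[str, List[object]]) -> Iterator[Tuple[object, ...]]:
    """Pivot a column-oriented batch into row tuples (hypothetical helper)."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    # zip(*columns) walks every column in lockstep, yielding one row per step.
    return zip(*columns.values())


def rows_to_delimited(rows: Iterable[Tuple[object, ...]], delimiter: str = "|") -> str:
    """Serialize rows as delimited text, the shape a row-based stream expects."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# Illustrative in-memory batch standing in for one columnar row group.
batch = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
text = rows_to_delimited(columns_to_rows(batch))
```

The point of the sketch is only that a whole batch of columns has to be available before the first row can be emitted, which is where a columnar format differs from the line-at-a-time assumption of a row stream.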
It appears that data import and export is indeed a frequently used and frequently discussed feature of MPP databases. Based on previous discussions and this topic, Cloudberry currently has three main frameworks for parallel data import and export: (1) PXF; (2) FDW; (3) gpfdist, which is discussed here. Each framework aims to reach more data sources by adding support for more protocols and file formats. Apart from the differences in the frameworks themselves, the logic for protocol support and file format support is basically the same. Different technical teams have chosen different frameworks for their own historical reasons.
To avoid duplicated development work (even though Apache Cloudberry is an open-source project, development resources are still very valuable), I personally think each team could first submit all the code they are willing to open-source to GitHub and then create a dedicated discussion topic. Through public discussion, we can determine which framework the Cloudberry project should focus on in order to support more data sources.
Proposers
No response
Proposal Status
Under Discussion
Abstract
The gpfdist utility has been enhanced to support SFTP/HDFS protocols, enabling high-performance data ingestion from multiple distributed sources. Additionally, it now supports various file formats including Parquet, going beyond traditional CSV/text limitations.
Motivation
Cloudberry, as a distributed database, derives its strength from improving data processing performance through collaboration across multiple nodes. However, as technology has advanced, data volume has grown explosively, data sources have become distributed across diverse locations, and data formats have become increasingly varied. Cloudberry's parallel file distribution tool, gpfdist, can achieve efficient data ingestion, but it has limitations: the source data files and the tool must be on the same machine, and only CSV or text files are supported.
To address these challenges, we aim to enhance the gpfdist tool functionally by expanding support for SFTP/HDFS protocols, thereby enabling high-performance ingestion of multi-source data.
Implementation
Principle: This enhancement integrates SFTP/HDFS protocols to enable remote file access and parallel data loading; at the same time, it introduces a file format parsing layer to support structured formats such as Parquet. The gpfdist architecture schematic is shown below.

Design: A modular architecture is adopted, decoupling protocol adapters (e.g., SFTP/HDFS) and format parsers (e.g., Parquet) from the core engine of gpfdist to improve scalability and maintainability.
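The decoupling described in the design could be pictured roughly as follows. This is a minimal Python sketch under stated assumptions, not the proposed gpfdist code: the registries `PROTOCOL_ADAPTERS` and `FORMAT_PARSERS`, the decorators, the `load` function, the host name `example-host`, and the in-memory `FAKE_REMOTE` store are all hypothetical. The SFTP adapter is a stub that reads from a dict instead of opening a real connection.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List
from urllib.parse import urlparse

# Hypothetical registries: URL scheme -> reader, format name -> parser.
PROTOCOL_ADAPTERS: Dict[str, Callable[[str], bytes]] = {}
FORMAT_PARSERS: Dict[str, Callable[[bytes], Iterator[List[str]]]] = {}


def protocol_adapter(scheme: str):
    """Register a function that fetches raw bytes for one URL scheme."""
    def register(fn):
        PROTOCOL_ADAPTERS[scheme] = fn
        return fn
    return register


def format_parser(name: str):
    """Register a function that turns raw bytes into rows for one format."""
    def register(fn):
        FORMAT_PARSERS[name] = fn
        return fn
    return register


# In-memory stand-in for remote storage (assumption, for illustration only).
FAKE_REMOTE = {"/data/d.dat": b"a|1\nb|2\n"}


@protocol_adapter("sftp")
def read_sftp(url: str) -> bytes:
    # A real adapter would open an SFTP session here; this stub reads a dict.
    return FAKE_REMOTE[urlparse(url).path]


@format_parser("csv")
def parse_csv(raw: bytes) -> Iterator[List[str]]:
    reader = csv.reader(io.StringIO(raw.decode("utf-8")), delimiter="|")
    for row in reader:
        yield row


def load(url: str, fmt: str) -> List[List[str]]:
    """Core engine: pick an adapter by scheme and a parser by format name."""
    scheme = urlparse(url).scheme
    if scheme not in PROTOCOL_ADAPTERS:
        raise ValueError(f"no protocol adapter registered for {scheme!r}")
    if fmt not in FORMAT_PARSERS:
        raise ValueError(f"no format parser registered for {fmt!r}")
    raw = PROTOCOL_ADAPTERS[scheme](url)
    return list(FORMAT_PARSERS[fmt](raw))


rows = load("sftp://example-host/data/d.dat", "csv")
```

In this shape, adding HDFS or Parquet support would mean registering one more adapter or parser; `load` itself would not change, which is the maintainability benefit the design is after.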
Process flow (with Diagram):
Rollout/Adoption Plan
No response
Are you willing to submit a PR?