Extend the gpfdist tool to support SFTP/HDFS protocols for high-performance multi-source data ingestion #1205

ZTE-EBASE · 2025-07-02T01:44:52Z

ZTE-EBASE
Jul 2, 2025

Proposers

No response

Proposal Status

Under Discussion

Abstract

The gpfdist utility has been enhanced to support SFTP/HDFS protocols, enabling high-performance data ingestion from multiple distributed sources. Additionally, it now supports various file formats including Parquet, going beyond traditional CSV/text limitations.

Motivation

Cloudberry, as a distributed database, derives its strength from parallel data processing across multiple nodes. However, data volumes have grown explosively, data sources are now spread across diverse locations, and data formats have become increasingly varied. Cloudberry’s parallel file distribution tool, gpfdist, can ingest data efficiently, but it has two limitations: the source data files must reside on the same machine as the tool, and only CSV or text files are supported.

To address these challenges, we aim to enhance the gpfdist tool functionally by expanding support for SFTP/HDFS protocols, thereby enabling high-performance ingestion of multi-source data.

Implementation

Principle: This enhancement integrates the SFTP and HDFS protocols to enable remote file access and parallel data loading, and it introduces a file format parsing layer to support structured formats such as Parquet. The gpfdist architecture schematic is shown below.

Design: A modular architecture is adopted, decoupling protocol adapters (e.g., SFTP/HDFS) and format parsers (e.g., Parquet) from the core engine of gpfdist to improve scalability and maintainability.
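One way to picture this decoupling is a pair of registries that the core engine consults by name. The following is only a minimal sketch in Python for illustration; gpfdist itself is written in C, and every name here (`PROTOCOL_ADAPTERS`, `FORMAT_PARSERS`, the `mem` scheme, `load`) is hypothetical rather than part of the proposed implementation.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List

# Hypothetical registries: the core engine only knows these two tables,
# not the individual protocols or formats.
PROTOCOL_ADAPTERS: Dict[str, Callable[[str], bytes]] = {}
FORMAT_PARSERS: Dict[str, Callable[[bytes], Iterator[List[str]]]] = {}


def protocol(scheme: str):
    """Register a function that fetches raw bytes for a URL scheme."""
    def register(fn):
        PROTOCOL_ADAPTERS[scheme] = fn
        return fn
    return register


def file_format(name: str):
    """Register a function that turns raw bytes into rows."""
    def register(fn):
        FORMAT_PARSERS[name] = fn
        return fn
    return register


@protocol("mem")
def read_mem(location: str) -> bytes:
    # Stand-in adapter for the sketch: everything after "mem://" is the content.
    # A real SFTP or HDFS adapter would open a remote connection here.
    return location[len("mem://"):].encode("utf-8")


@file_format("csv")
def parse_csv(data: bytes) -> Iterator[List[str]]:
    yield from csv.reader(io.StringIO(data.decode("utf-8")))


def load(location: str, fmt: str) -> List[List[str]]:
    """Core engine: pick an adapter by scheme and a parser by format name."""
    scheme = location.split("://", 1)[0]
    raw = PROTOCOL_ADAPTERS[scheme](location)
    return list(FORMAT_PARSERS[fmt](raw))
```

With this shape, adding a protocol or a format means registering one more function; `load` itself does not change.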

Process flow (with Diagram):

1. The client tool sends a command to the Coordinator node.
2. The Coordinator node parses the SQL, generates a query plan, and distributes it to all segment nodes.
3. Each segment node starts executing the query and issues an HTTP request to gpfdist.
4. For a readable external table, gpfdist reads the file, splits it into chunks, and determines the row boundary positions. For a writable external table, gpfdist writes the received data to disk.
5. gpfdist distributes the data chunks or execution results to the segments.
6. The segments return the processed results to the Coordinator.
7. The Coordinator returns the query results to the client.
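The chunking step for readable external tables can be sketched as follows. This is an illustrative Python sketch, not gpfdist's actual C code: the function name and `chunk_size` parameter are assumptions, and it treats every newline as a row boundary, so it ignores newlines embedded in quoted CSV fields.

```python
from typing import List


def split_on_row_boundaries(data: bytes, chunk_size: int) -> List[bytes]:
    """Split data into chunks of roughly chunk_size bytes, each ending on a newline."""
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + chunk_size, len(data))
        if end < len(data):
            # Prefer the last row boundary inside the window.
            cut = data.rfind(b"\n", start, end)
            if cut == -1:
                # A single row is longer than chunk_size: extend to its end.
                cut = data.find(b"\n", end)
                if cut == -1:
                    cut = len(data) - 1
            end = cut + 1
        chunks.append(data[start:end])
        start = end
    return chunks
```

Because every chunk ends on a row boundary, each segment can parse the chunk it receives without coordinating with the others.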

Rollout/Adoption Plan

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

avamingli · 2025-07-02T04:57:31Z

avamingli
Jul 2, 2025
Collaborator

Hi, interesting!

As far as I know, gpfdist is a protocol for external tables, and the gpfdist tool is an implementation of that protocol. Other tools, like GPSS (Greenplum Streaming Server), also implement the gpfdist protocol for external tables. The data format is specified by the format option when defining the external table.

CREATE EXTERNAL TABLE ext1 (d varchar(20)) location ('gpfdist://9727/d.dat') format 'csv' (DELIMITER '|');

Are you planning to support additional protocols (e.g., SFTP) for external tables:

CREATE EXTERNAL TABLE ext1 (d varchar(20)) location ('SFTP://9727/d.dat')

or provide more format options?

CREATE EXTERNAL TABLE ext1 (d varchar(20)) location ('gpfdist://9727/d.dat') format 'SFTP' ;

3 replies

ZTE-EBASE Jul 2, 2025
Author

The goal is to minimize kernel code changes by reusing the gpfdist protocol. We add an sftp/hdfs protocol marker to the location and use it to call the corresponding functions for reading data. This also addresses the issue of data files not being on the same machine as gpfdist.
For example:

CREATE EXTERNAL TABLE ext1 (d varchar(20)) location ('gpfdist://ip:port/<sftp://sftp-user:passwd@sftp-hostip:sftp-port/file.csv>') format 'csv' (DELIMITER '|');

CREATE EXTERNAL TABLE ext2 (d varchar(20)) location ('gpfdist://ip:port/<hdfs://namenode:port/file-path.parquet>') format 'csv' (DELIMITER '|');
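To make the nested location syntax concrete, here is a minimal Python sketch of how the embedded `<scheme://...>` marker could be separated from the gpfdist host part. It only illustrates the idea: the function name and the returned keys are hypothetical, the host names in the usage are made up, and the real parsing would live in gpfdist's C code.

```python
from urllib.parse import urlsplit


def parse_location(location: str) -> dict:
    """Split 'gpfdist://host:port/<inner-url>' into the gpfdist part and the inner URL."""
    prefix = "gpfdist://"
    if not location.startswith(prefix):
        raise ValueError("not a gpfdist location: " + location)
    hostport, _, path = location[len(prefix):].partition("/")
    if path.startswith("<") and path.endswith(">"):
        # Protocol marker present: the path is itself a URL to a remote file.
        inner = urlsplit(path[1:-1])
        return {
            "gpfdist": hostport,
            "scheme": inner.scheme,
            "user": inner.username,
            "host": inner.hostname,
            "port": inner.port,
            "path": inner.path,
        }
    # No marker: a plain file served from gpfdist's own machine.
    return {"gpfdist": hostport, "scheme": "file", "path": path}
```

For example, `parse_location("gpfdist://etl1:8080/<sftp://user:secret@10.0.0.5:22/data/file.csv>")` yields the scheme `sftp`, which a dispatcher could then use to select the matching reader.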

avamingli Jul 2, 2025
Collaborator

It seems this requires implementing HDFS/SFTP client code in gpfdist. In that case, why not use an FDW directly?

ZTE-EBASE Jul 2, 2025
Author

Yes, our implementation relies on libssh along with the arrow/parquet libraries. This approach is tailored to specific business requirements, and since the business scenario involves large-scale data, we adopt a parallel strategy to achieve high-performance data ingestion and querying.
Regarding the HDFS protocol, we have also implemented an FDW (Foreign Data Wrapper) for it. However, that involved a significant amount of code modification, including changes to the kernel. If there is a need, we can provide this implementation later.

gfphoenix78 · 2025-07-03T02:30:31Z

gfphoenix78
Jul 3, 2025
Collaborator

The discussion seems to be about supporting more protocols for external tables, not multiple data sources for a single external table. To be clear, a single external table already supports multiple data sources.

The topic has two targets:

1. Support more transfer protocols
2. Support additional file formats

Let's discuss them one by one.

Transfer Protocol

This looks like supporting more clients for fetching files. BTW, gpfdist already supports running a transform on the server side, for example:
https://github.com/apache/cloudberry/blob/main/src/bin/gpfdist/regress/input/exttab1.source#L541

File Format

External tables support the CUSTOM format:

Syntax:
CREATE [READABLE] EXTERNAL [TEMPORARY | TEMP] TABLE table_name
     ( column_name data_type [, ...] | LIKE other_table )
      LOCATION ('file://seghost[:port]/path/file' [, ...])
        | ('gpfdist://filehost[:port]/file_pattern[#transform]'
        | ('gpfdists://filehost[:port]/file_pattern[#transform]'
            [, ...])
      FORMAT 'TEXT'
            [( [HEADER]
               [DELIMITER [AS] 'delimiter' | 'OFF']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
           | 'CSV'
            [( [HEADER]
               [QUOTE [AS] 'quote']
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [FORCE NOT NULL column [, ...]]
               [ESCAPE [AS] 'escape']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
           | 'CUSTOM' (Formatter=<formatter specifications>)
     [ OPTIONS ( key 'value' [, ...] ) ]
     [ ENCODING 'encoding' ]
     [ [LOG ERRORS] SEGMENT REJECT LIMIT count
       [ROWS | PERCENT] ]

You could consider implementing a new file format.

3 replies

avamingli Jul 3, 2025
Collaborator

Indeed, it's a good perspective.
I had forgotten about this approach; it seems promising and definitely worth a try.

ZTE-EBASE Jul 3, 2025
Author

As mentioned above, the enhancement focuses on two things: removing the requirement that gpfdist be co-located with the files, and supporting access to structured files such as Parquet. A new CUSTOM formatter may help with the latter, but it cannot resolve the former, which is the core issue. Still, the suggestion is worth a try.

avamingli Jul 4, 2025
Collaborator

With a transform, you can do anything to the data before it is inserted into the database. For example, the transform can work as an independent tool that integrates and parses various source data.

And have you evaluated hdfs_fdw as a potential solution for Cloudberry?
Much of the code is already compatible with PostgreSQL and works well, so adapting this FDW for Cloudberry might require significantly less effort compared to integrating multiple client-side protocols into gpfdist. This approach could also prove more cost-effective in the long run.
By making the FDW MPP-aware, it could use multiple segments for both INSERT and SELECT.

gfphoenix78 · 2025-07-03T02:55:16Z

gfphoenix78
Jul 3, 2025
Collaborator

The CUSTOM format assumes a row-based data stream. I'm not sure it is suitable for column-based storage formats; this needs more investigation (a spike).

1 reply

ZTE-EBASE Jul 3, 2025
Author

Based on the current implementation, the gpfdist side can add format conversion logic that transforms columnar storage formats into the native row format in memory. The existing execution logic can then be reused unchanged to achieve the desired functionality.
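The in-memory conversion described above amounts to a transpose from column arrays to delimited rows. The sketch below shows that step in Python using only the standard library. It assumes the columnar data has already been decoded into plain lists (the author's implementation uses the arrow/parquet libraries for that part), and the function name is hypothetical.

```python
import csv
import io
from typing import Dict, List


def columns_to_csv(columns: Dict[str, List], delimiter: str = "|") -> str:
    """Turn column-oriented data into delimited rows, one line per record."""
    names = list(columns)
    if len({len(columns[n]) for n in names}) > 1:
        raise ValueError("all columns must have the same number of values")
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    # zip(*columns) walks the columns in lockstep, yielding one row at a time.
    for row in zip(*(columns[n] for n in names)):
        writer.writerow(row)
    return buf.getvalue()
```

The output is an ordinary delimited text stream, which is why the external table can keep `format 'csv'` even when the remote file is Parquet.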

jianlirong · 2025-07-04T10:40:15Z

jianlirong
Jul 4, 2025

It appears that data import and export is indeed a frequently used and frequently discussed functionality in MPP databases. Based on previous discussions and this topic, Cloudberry currently has three main frameworks for parallel data import and export: (1) PXF; (2) FDW; (3) gpfdist, which is discussed here. Each framework aims to access more data sources by adding support for more protocols and file formats. Apart from the differences in the frameworks themselves, the logic for protocol support and file format support is basically the same. Different technical teams have chosen different frameworks due to their own historical reasons.

To avoid duplicated development work (even though Apache Cloudberry is an open-source project, development resources are still very valuable), I personally think that each team can first submit all the code they're willing to open-source to GitHub, then create a dedicated discussion topic. Through public discussion, we can determine which framework the Cloudberry project should focus on supporting to achieve support for more data sources.

2 replies

yjhjstz Jul 8, 2025
Collaborator

I’d like to propose enhancing Cloudberry by integrating DuckDB and leveraging its Data Sources capability. DuckDB supports querying a wide variety of sources (CSV, Parquet, JSON, SQLite, PostgreSQL, MySQL, and cloud storage such as S3, Azure Blob, and Iceberg) directly via SQL.

Why this matters for Cloudberry
Direct data source access: DuckDB can read and filter data from diverse formats and storage systems without preprocessing.

In-process analytics engine: Embedding DuckDB allows Cloudberry to push down complex SQL operations into a fast, vectorized OLAP engine.

leborchuk Jul 9, 2025
Collaborator

+1 for DuckDB. One question, though: are there any security concerns we should be aware of when using DuckDB data sources? (For example, executing an untrusted user binary inside the Cloudberry process.)
