[Bug]: Improve performance of the Iceberg AddFiles transform #38012

@chamikaramj

Description

What happened?

Currently, the Iceberg AddFiles transform has performance bottlenecks when writing a large number of files. For example, we fully read the Parquet files being written [1], which can significantly slow down the process. We should look into improving single-VM performance without compromising the consistency guarantees of the sink.

[1]

static org.apache.parquet.io.InputFile getParquetInputFile(String filePath) throws IOException {
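
To illustrate why a full read is not always required, here is a minimal sketch of one possible direction, not a proposed patch. It relies only on the Parquet file layout: a plain (unencrypted) Parquet file ends with the serialized file metadata, followed by a 4-byte little-endian metadata length and the 4-byte magic `PAR1`. The class name `ParquetFooterProbe` and its method are hypothetical and not part of Beam, Iceberg, or Parquet. Whether the footer alone gives AddFiles everything it needs is an assumption that would have to be checked against the transform.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper (not a Beam/Iceberg/Parquet API): fetches only the
// serialized footer of a local Parquet file using two positioned reads.
final class ParquetFooterProbe {
  private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);
  // Trailer = 4-byte little-endian footer length + 4-byte magic.
  private static final int TAIL_SIZE = 8;

  static byte[] readFooterBytes(Path file) throws IOException {
    try (SeekableByteChannel ch = Files.newByteChannel(file, StandardOpenOption.READ)) {
      long size = ch.size();
      if (size < MAGIC.length + TAIL_SIZE) {
        throw new IOException("File too small to be Parquet: " + file);
      }

      // Read 1: the last 8 bytes tell us where the footer starts.
      ByteBuffer tail = ByteBuffer.allocate(TAIL_SIZE).order(ByteOrder.LITTLE_ENDIAN);
      readFully(ch, tail, size - TAIL_SIZE);
      int footerLength = tail.getInt(0);
      for (int i = 0; i < MAGIC.length; i++) {
        if (tail.get(4 + i) != MAGIC[i]) {
          throw new IOException("Missing PAR1 trailer: " + file);
        }
      }

      long footerStart = size - TAIL_SIZE - footerLength;
      if (footerLength < 0 || footerStart < MAGIC.length) {
        throw new IOException("Corrupt footer length " + footerLength + ": " + file);
      }

      // Read 2: only the footer bytes; row-group data is never touched.
      ByteBuffer footer = ByteBuffer.allocate(footerLength);
      readFully(ch, footer, footerStart);
      return footer.array();
    }
  }

  // Seeks to the given position and fills the buffer completely.
  private static void readFully(SeekableByteChannel ch, ByteBuffer buf, long position)
      throws IOException {
    ch.position(position);
    while (buf.hasRemaining()) {
      if (ch.read(buf) < 0) {
        throw new IOException("Unexpected end of file");
      }
    }
  }
}
```

Calling `ParquetFooterProbe.readFooterBytes(path)` returns the still-serialized metadata; decoding it is left to the Parquet library. The sketch works on local paths only. Applying the same idea to remote object stores would need a seekable input for those filesystems, which is outside what this example shows.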

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
