[Bug]: Improve performance of the Iceberg AddFiles transform #38012
Closed
Description
What happened?
The Iceberg AddFiles transform currently has performance bottlenecks when writing a large number of files. For example, it fully reads each Parquet file being written [1], which can significantly slow down the process. We should look into improving single-VM performance without compromising the consistency guarantees of the sink.
[1] beam/sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/AddFiles.java, line 685 in e08e9d5:
static org.apache.parquet.io.InputFile getParquetInputFile(String filePath) throws IOException {
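For context on why a full read may be avoidable: a Parquet file keeps its metadata in a footer at the end of the file, followed by a 4-byte little-endian footer length and the 4-byte magic `PAR1`. The sketch below is illustrative only and is not the AddFiles implementation or a proposed patch; the class name `ParquetFooterProbe` and the method `readFooterLength` are hypothetical, and it assumes a local, unencrypted Parquet file. It uses only the JDK and shows that the footer can be located by reading the last 8 bytes rather than the whole file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Beam or Iceberg.
final class ParquetFooterProbe {
  private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

  /** Returns the footer length in bytes, reading only the last 8 bytes of the file. */
  static int readFooterLength(String filePath) throws IOException {
    try (RandomAccessFile file = new RandomAccessFile(filePath, "r")) {
      long fileLength = file.length();
      // Smallest valid layout: leading magic (4) + footer length (4) + trailing magic (4).
      if (fileLength < 12) {
        throw new IOException("Too small to be a Parquet file: " + filePath);
      }

      // The tail is: [4-byte little-endian footer length][4-byte magic "PAR1"].
      byte[] tail = new byte[8];
      file.seek(fileLength - 8);
      file.readFully(tail);

      for (int i = 0; i < MAGIC.length; i++) {
        if (tail[4 + i] != MAGIC[i]) {
          throw new IOException("Missing trailing PAR1 magic: " + filePath);
        }
      }

      int footerLength = ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
      // The footer must fit between the leading magic and the 8-byte tail.
      if (footerLength < 0 || footerLength > fileLength - 12) {
        throw new IOException("Corrupt footer length " + footerLength + ": " + filePath);
      }
      return footerLength;
    }
  }
}
```

A caller could then seek to `fileLength - 8 - footerLength` and read only the footer bytes. Whether this applies to AddFiles depends on what the transform needs from each file, which this issue does not specify. In practice, a Parquet library's own footer reader would likely be preferable to hand-rolled parsing.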
Issue Priority
Priority: 1 (data loss / total loss of function)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Infrastructure
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner