Closed
Description
What would you like to happen?
Is your feature request related to a problem? Please describe.
While using the ReadFromCsv YAML transform in Apache Beam, I noticed that there is no way to access the source file name for each record. For data lineage, debugging, or downstream processing, having the originating file name attached to each row is very helpful.
Describe the solution you'd like
A parameter or option in the YAML transform, such as include_filename: true, which would add a field (e.g., filename) to each output row, containing the name or path of the source CSV file.
Describe alternatives you've considered
- Reading files separately and adding the file name manually via additional steps in YAML, but this doesn't scale for many files.
- Using the Python SDK with a custom DoFn, which is more flexible, but less declarative and harder to maintain compared to YAML.
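For reference, the custom-code alternative amounts to reading each file yourself and attaching its path to every row. Below is a minimal sketch of that per-file logic in plain Python, assuming local files that `open` can read. The function name `read_csv_with_filename` and the default `filename` field name are hypothetical, chosen only for illustration. In a Beam Python pipeline, a generator like this could be applied to a PCollection of paths with `beam.FlatMap`.

```python
import csv
from typing import Dict, Iterator


def read_csv_with_filename(path: str, field: str = "filename") -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row, with the source path added under `field`.

    Hypothetical helper: assumes `path` is a local file with a header row.
    """
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            # Attach the originating file to every record for lineage.
            yield {**row, field: path}
```

This works, but the file handling and the field naming have to be maintained by hand in every pipeline, which is the overhead a declarative option would remove.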
Additional context
- This is especially useful in data lakes or pipelines with multiple files as sources.
- Similar features are supported in other data processing frameworks (like Spark's input_file_name()).
Links
Issue Priority
Priority: 2 (default / most feature requests should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Infrastructure
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner