[Feature Request]: Add option to include file name as metadata in ReadFromCsv YAML transform #35773

@jonuchauhan

What would you like to happen?

Is your feature request related to a problem? Please describe.
While using the ReadFromCsv YAML transform in Apache Beam, I noticed that there is no way to access the source file name for each record. For data lineage, debugging, or downstream processing, having the originating file name attached to each row is very helpful.

Describe the solution you'd like
An option on the ReadFromCsv YAML transform, such as include_filename: true, that adds a field (e.g., filename) to each output row containing the name or path of the source CSV file.

Describe alternatives you've considered

  • Reading files separately and adding the file name manually via additional YAML steps. This doesn't scale to many files.
  • Using the Python SDK with a custom DoFn. This is more flexible, but less declarative and harder to maintain than YAML.
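
For reference, the Python SDK alternative can be sketched roughly as below. This is only an illustration of the workaround, not a proposed implementation: it assumes the files are small enough to read fully into memory, and the helper names rows_with_filename and read_csv_with_filename, as well as the filename field name, are made up for this example. It uses fileio.MatchFiles and fileio.ReadMatches from the Python SDK to get a handle on each matched file, then tags every parsed row with that file's path.

```python
import csv
import io


def rows_with_filename(path, text, filename_field="filename"):
    """Parse CSV text and yield one dict per row, tagged with its source path.

    `filename_field` is a hypothetical field name chosen for this sketch.
    """
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield {**row, filename_field: path}


def read_csv_with_filename(pipeline, file_pattern):
    """Hypothetical helper: read CSV files matching a pattern, keeping the file path."""
    # Imported lazily so rows_with_filename can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.io import fileio

    return (
        pipeline
        | "MatchFiles" >> fileio.MatchFiles(file_pattern)
        | "ReadMatches" >> fileio.ReadMatches()
        # Each element is a ReadableFile; read_utf8() loads the whole file,
        # so this sketch is only suitable for files that fit in memory.
        | "ParseRows" >> beam.FlatMap(
            lambda f: rows_with_filename(f.metadata.path, f.read_utf8())
        )
    )
```

This works, but every pipeline that needs the file name has to carry this extra code, which is the maintenance burden a built-in YAML option would avoid.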

Additional context

  • This is especially useful in data lakes or pipelines with multiple files as sources.
  • Similar features are supported in other data processing frameworks (like Spark's input_file_name()).

Links

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
