[Bug]: Managed IO of Iceberg appends NULL / Unprintable characters for string type columns. #33963

@saathwik-tk

What happened?

Data is ingested via the following pipeline:

pipeline.apply(<Source>)
              .apply(JsonToRow.withSchema(mySchema))
              .apply(Managed.write(Managed.ICEBERG).withConfig(myConfig))

Issue:
After the data is ingested, the queries below return no rows, even though many rows have field_name equal to 'value1':

  • SELECT * FROM catalog_name.namespace.table_name WHERE field_name = 'value1';
  • SELECT * FROM catalog_name.namespace.table_name WHERE field_name like 'value1';

However, the data is returned by the queries below:

  • SELECT * FROM catalog_name.namespace.table_name WHERE TRIM(field_name) = 'value1'
  • SELECT * FROM catalog_name.namespace.table_name WHERE field_name like '%value1'
  • SELECT * FROM catalog_name.namespace.table_name WHERE field_name like 'value1%'
  • SELECT * FROM catalog_name.namespace.table_name WHERE field_name like '%value1%'
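One way to check whether a value really carries hidden characters is to print its code points after reading it back. The sketch below is a minimal, hypothetical helper (the class and method names are not part of Beam or Iceberg), and the padded sample value is an assumption made for illustration, not data taken from the affected table.

```java
import java.util.StringJoiner;

public class CodePointDump {
    // Renders each code point as U+XXXX so that invisible characters show up.
    static String dump(String s) {
        StringJoiner joiner = new StringJoiner(" ");
        s.codePoints().forEach(cp -> joiner.add(String.format("U+%04X", cp)));
        return joiner.toString();
    }

    // True if the string contains any ISO control character (NUL, tab, ...).
    static boolean hasControlChars(String s) {
        return s.codePoints().anyMatch(Character::isISOControl);
    }

    public static void main(String[] args) {
        String clean = "value1";
        // Hypothetical value with a trailing NUL, for illustration only.
        String padded = "value1\u0000";

        System.out.println(dump(clean));
        System.out.println(dump(padded));
        System.out.println("equal: " + clean.equals(padded));
        System.out.println("control chars: " + hasControlChars(padded));
    }
}
```

If the dump of a value read from the table shows only the expected code points, the stored string itself is probably intact and the cause lies elsewhere.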

I also tried the approach below, which trims every string value before the write, and saw the same issue. This confirms that the extra characters do not come from the source data.

pipeline.apply(<Source>)
              .apply(JsonToRow.withSchema(mySchema)).setCoder(RowCoder.of(mySchema))
              .apply(ParDo.of(new DoFn<Row, Row>() {
                    @ProcessElement
                    public void processFn(@Element Row row, OutputReceiver<Row> out){
                        // Trim every String field; pass other values through unchanged.
                        List<Object> cleanedValues = mySchema.getFields().stream()
                                .map(field -> {
                                    Object value = row.getValue(field.getName());
                                    if (value instanceof String) {
                                        return ((String) value).trim();
                                    }
                                    return value;
                                })
                                .collect(Collectors.toList());
                        Row trimmedRow = Row.withSchema(mySchema)
                                .addValues(cleanedValues)
                                .build();
                        out.output(trimmedRow);
                    }
                })).setCoder(RowCoder.of(mySchema))
              .apply(Managed.write(Managed.ICEBERG).withConfig(myConfig));
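For context on why the trimming step rules out the source data: Java's String.trim() removes every leading and trailing character with a code point of U+0020 or lower, which includes NUL and other control characters, not only spaces. A minimal standalone sketch, where the class name, method name, and sample value are hypothetical:

```java
public class TrimDemo {
    // Same call as in the DoFn: strips leading/trailing chars <= U+0020.
    static String stripPadding(String s) {
        return s.trim();
    }

    public static void main(String[] args) {
        // Hypothetical input padded with NUL characters and a space.
        String padded = "\u0000value1\u0000 ";
        String trimmed = stripPadding(padded);

        System.out.println("length before: " + padded.length());
        System.out.println("length after: " + trimmed.length());
        System.out.println("equals value1: " + trimmed.equals("value1"));
    }
}
```

So any NUL or whitespace padding present in the source strings would already be gone by the time the rows reach Managed.write.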

However, the issue does not affect all string values: it appears for a few and not for most. Some of the strings involved include '1234567890' and '2025-02-12' (or any date of this form stored as a string).

NOTE:
Use the same reproduction as this
Beam Version: 2.62.0
Iceberg Version: 1.4.2

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
