Closed as not planned
Description
What happened?
Data is ingested via:
pipeline.apply(<Source>)
    .apply(JsonToRow.withSchema(mySchema))
    .apply(Managed.write(Managed.ICEBERG).withConfig(myConfig));
The issue:
After the data is ingested, the following queries return no rows, even though many rows have field_name equal to 'value1':
SELECT * FROM catalog_name.namespace.table_name WHERE field_name = 'value1';
SELECT * FROM catalog_name.namespace.table_name WHERE field_name like 'value1';
However, the following queries do return the data:
SELECT * FROM catalog_name.namespace.table_name WHERE TRIM(field_name) = 'value1';
SELECT * FROM catalog_name.namespace.table_name WHERE field_name like '%value1';
SELECT * FROM catalog_name.namespace.table_name WHERE field_name like 'value1%';
SELECT * FROM catalog_name.namespace.table_name WHERE field_name like '%value1%';
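As a rough way to reason about these results, the six predicates can be mirrored in plain Java and applied to candidate stored values. This is only a sketch: the class and method names are hypothetical, Java's trim() is used as a stand-in for SQL TRIM (the two do not strip exactly the same characters), and it assumes all four working queries return the same rows. Under those assumptions, no whitespace-padded variant of 'value1' reproduces the reported pattern, which hints that the stored text alone may not explain it.

```java
import java.util.List;

public class PredicateCheck {
    // Mirrors the six reported SQL predicates for a single stored value.
    static boolean[] evaluate(String stored, String literal) {
        return new boolean[] {
            stored.equals(literal),         // field_name = 'value1'
            stored.equals(literal),         // field_name like 'value1' (no wildcards)
            stored.trim().equals(literal),  // TRIM(field_name) = 'value1' (approximation)
            stored.endsWith(literal),       // field_name like '%value1'
            stored.startsWith(literal),     // field_name like 'value1%'
            stored.contains(literal)        // field_name like '%value1%'
        };
    }

    // True only if the value fails the first two predicates and passes the last four.
    static boolean reproducesReport(String stored, String literal) {
        boolean[] r = evaluate(stored, literal);
        return !r[0] && !r[1] && r[2] && r[3] && r[4] && r[5];
    }

    public static void main(String[] args) {
        // Hypothetical candidates: exact value plus leading/trailing space variants.
        for (String candidate : List.of("value1", " value1", "value1 ", " value1 ")) {
            System.out.println("[" + candidate + "] -> "
                + reproducesReport(candidate, "value1"));
        }
    }
}
```

Each candidate prints false: the exact value passes the equality check, and every padded value fails at least one of the prefix or suffix checks.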
I also tried the approach below, which trims every string value before writing, so the source data should not be the cause. The same issue still occurs.
pipeline.apply(<Source>)
    .apply(JsonToRow.withSchema(mySchema)).setCoder(RowCoder.of(mySchema))
    .apply(ParDo.of(new DoFn<Row, Row>() {
        @ProcessElement
        public void processFn(@Element Row row, OutputReceiver<Row> out) {
            // Trim every String field; pass all other values through unchanged.
            List<Object> cleanedValues = mySchema.getFields().stream()
                .map(field -> {
                    Object value = row.getValue(field.getName());
                    if (value instanceof String) {
                        return ((String) value).trim();
                    }
                    return value;
                })
                .collect(Collectors.toList());
            // Rebuild the row with the same schema and the cleaned values.
            Row trimmedRow = Row.withSchema(mySchema)
                .addValues(cleanedValues)
                .build();
            out.output(trimmedRow);
        }
    })).setCoder(RowCoder.of(mySchema))
    .apply(Managed.write(Managed.ICEBERG).withConfig(myConfig));
The issue does not affect every string value: most are fine and only a few are affected. Affected strings include '1234567890' and date-like strings such as '2025-02-12'.
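One caveat about the trimming step: Java's String.trim() only strips characters at or below U+0020, so it would leave characters such as a no-break space (U+00A0) in place. A minimal sketch for checking whether an affected value carries such characters is to print its code points. The class name and the sample value below are hypothetical and are not taken from the actual data.

```java
public class CodePointDump {
    // Renders each code point as U+XXXX so invisible characters become visible.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: two digits followed by a no-break space.
        String sample = "12\u00A0";
        System.out.println(describe(sample));         // U+0031 U+0032 U+00A0
        System.out.println(describe(sample.trim()));  // U+0031 U+0032 U+00A0 (trim() keeps U+00A0)
    }
}
```

Calling describe on a value inside the DoFn (for example in a log statement) before the write would show whether anything other than the expected characters reaches the Iceberg sink.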
NOTE:
Use the same reproduction as this
Beam Version: 2.62.0
Iceberg Version: 1.4.2
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Infrastructure
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner