Check field_is_present always passes for CSV/Parquet files #1065

@jschoedl

For CSV and Parquet files that are read through DuckDB (S3, GCS, Azure, and the local environment), the check field_is_present always passes, regardless of whether the column is actually present in the file.

Cause: In create_view_with_schema_union(...), we create an empty table based on the data contract and later insert the actual data, e.g. from a CSV file. If a column is not present in the CSV file, it will still exist in the checked table (but filled with NULLs), so the presence check has nothing to detect.
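To illustrate the mechanism, here is a minimal sketch of the same pattern. It uses Python's built-in sqlite3 as a stand-in for DuckDB, and the table name, contract fields, and CSV content are all made up for the example; it is not the actual code of create_view_with_schema_union(...).

```python
import csv
import io
import sqlite3

# Hypothetical contract with two fields; the CSV below lacks "email".
contract_fields = ["id", "email"]
csv_text = "id\n1\n2\n"

# sqlite3 is only a stand-in for DuckDB here (assumption for illustration).
conn = sqlite3.connect(":memory:")

# Step 1: create an empty table from the contract, so every contract column exists.
column_defs = ", ".join(f"{name} TEXT" for name in contract_fields)
conn.execute(f"CREATE TABLE my_model ({column_defs})")

# Step 2: insert the CSV rows, naming only the columns the file actually has.
reader = csv.DictReader(io.StringIO(csv_text))
csv_columns = reader.fieldnames
placeholders = ", ".join("?" for _ in csv_columns)
conn.executemany(
    f"INSERT INTO my_model ({', '.join(csv_columns)}) VALUES ({placeholders})",
    [tuple(row[c] for c in csv_columns) for row in reader],
)

# A presence check against the table sees every contract column ...
table_columns = [row[1] for row in conn.execute("PRAGMA table_info(my_model)")]
# ... while the column missing from the file is simply NULL in every row.
email_values = [row[0] for row in conn.execute("SELECT email FROM my_model")]

print(csv_columns)    # columns found in the file
print(table_columns)  # columns found in the checked table
print(email_values)   # values of the column the file did not contain
```

Any check that inspects the table rather than the file header cannot tell "column missing from the file" apart from "column present but empty".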

Possible solutions:

  1. Remove the field_is_present check (for non-required fields). However, checking for field presence is useful, as discussed in "Support for historical data validation causes error in CSV files without headers" #1018.
  2. Fix the check and enforce field presence in the header even for non-required fields. This would break, e.g., test_csv_optional_field_missing_from_old_data() in test_test_schema_evolution.py, which assumes that a CSV file lacking a non-required field should pass all tests.
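As a rough sketch of what the second option could look like, the check might compare the contract's field names against the file header itself instead of the unioned table. The helper name missing_header_fields is hypothetical and not part of the project; the sketch also assumes the CSV has a header row, which does not hold for the headerless files discussed in #1018.

```python
import csv
import io


def missing_header_fields(csv_text, contract_fields):
    """Return the contract fields that do not appear in the CSV header.

    Hypothetical helper for illustration; assumes the first row is a header.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return [field for field in contract_fields if field not in present]


# The file has "id" and "name" but no "email" column.
print(missing_header_fields("id,name\n1,a\n", ["id", "email"]))
```

The caller would still have to decide whether a missing non-required field counts as a failure or only as a warning, which is exactly the trade-off between the two options above.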
