Skip to content

Replication of Null-values in nested JSON records to BigQuery #1098

@roosa590

Description

@roosa590

The current loading of nested JSON / JSONB data structures to BigQuery (v2) destination is dropping some information as it does not differentiate between a nested object containing only NULL values (explicit null) compared to a non-existing object (implicit null).

Let's suppose a source table in e.g. PostgreSQL with three records. The table has, among others, the column person. The contents of the person column for the three records are:

  1. {"name": "Person A", "job": {"employer": "Best employer", "employed_since_date": "2020-04-01"}}
  2. {"name": "Person B", "job": {"employer": null, "employed_since_date": null}} (The person is unemployed, so the job info is an explicit null)
  3. {"name": "Person C"} (The job information is unknown, i.e. implicit null)

As such the source table in the pgSQL DB, while containing null values for the job object's sub-objects in the second record, carries some information in itself: it is known thatPerson A is employed, Person B is unemployed, while Person C's job info is unknown.

However, both of these are replicated to BigQuery with the following schema:

  • person - RECORD
    • person.name - STRING: Person A / Person B / Person C
    • person.job - RECORD
      • person.job.employer - STRING: null for the two latter records
      • person.job.employed_since_date - DATE: null for the two latter records

In the BigQuery v1 connector where JSON / JSONB columns were replicated as STRING, the information was carried over (the existence or lack of a job object in the JSON string), but is indistinguishable in the v2 connector.

Sure, this could be fixed on the DB side by adding e.g. an is_employed: (true/false) object to the person.job object and populate it for every record, but a version upgrade in the BigQuery destination connector should not cause need for changes on the source side for customers. We need a possibility to differentiate between explicit and implicit nulls, as they may carry some information in and of themselves.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions