-
Notifications
You must be signed in to change notification settings - Fork 38
Description
The current loading of nested JSON / JSONB data structures to BigQuery (v2) destination is dropping some information as it does not differentiate between a nested object containing only NULL values (explicit null) compared to a non-existing object (implicit null).
Let's suppose a source table in e.g. PostgreSQL with three records. The table has, among others, the column person. The contents of the person column for the three records are:
{"name": "Person A", "job": {"employer": "Best employer", "employed_since_date": "2020-04-01"}}{"name": "Person B", "job": {"employer": null, "employed_since_date": null}}(The person is unemployed, so the job info is an explicit null){"name": "Person C"}(The job information is unknown, i.e. implicit null)
As such the source table in the pgSQL DB, while containing null values for the job object's sub-objects in the second record, carries some information in itself: it is known thatPerson A is employed, Person B is unemployed, while Person C's job info is unknown.
However, both of these are replicated to BigQuery with the following schema:
person-RECORDperson.name-STRING:Person A/Person B/Person Cperson.job-RECORDperson.job.employer-STRING:nullfor the two latter recordsperson.job.employed_since_date-DATE:nullfor the two latter records
In the BigQuery v1 connector where JSON / JSONB columns were replicated as STRING, the information was carried over (the existence or lack of a job object in the JSON string), but is indistinguishable in the v2 connector.
Sure, this could be fixed on the DB side by adding e.g. an is_employed: (true/false) object to the person.job object and populate it for every record, but a version upgrade in the BigQuery destination connector should not cause need for changes on the source side for customers. We need a possibility to differentiate between explicit and implicit nulls, as they may carry some information in and of themselves.