Skip to content

[Bug] Snapshot with hard_deletes:new_record erroring with new column (incompatible types) #12429

@clairetaylor-monzo

Description

@clairetaylor-monzo

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

I can see some other bugs around snapshots and hard_deletes:new_record but I can't find this specific issue:

I'm adding a new column to an ephemeral model that is feeding a yml defined snpashot. (where the snapshot has already been run).
On running the snapshot, it fails with a type mismatch error in the UNION ALL

Database Error in snapshot snapshot_monzo_change_log_change_id
Column 4 in UNION ALL has incompatible types: STRING, STRING, STRING, INT64 at [238:5]

It looks like this occurs du to the deletion_records CTE:

deletion_records as (
    select 'insert' as dbt_change_type,
      snapshotted_data.airtable_id, 
      snapshotted_data.change_id, 
      NULL as proposal_name,  -- New column doesn't exist in existing table
      snapshotted_data.dbt_unique_key as dbt_unique_key,
      ...
)

Expected Behavior

Expected that the new column will be added to the snapshot.

I suspect that the type of the column needs to be inferred from the newly added column and the CTE look something like

deletion_records as (
    select 'insert' as dbt_change_type,
      snapshotted_data.airtable_id, 
      snapshotted_data.change_id, 
      CAST(NULL AS STRING) as proposal_name,  -- New column doesn't exist in existing table
      snapshotted_data.dbt_unique_key as dbt_unique_key,
      ...
)

Possibly this is a bigquery specific issue, but I'm not totally sure.

Steps To Reproduce

Create ephemeral model:

{{
  config(materialized="ephemeral")
}}

SELECT
  id AS airtable_id,
  change_id
FROM {{ source('some_source', 'some_table') }}

Create yml snapshot based on model

version: 2
snapshots:
  - name: snapshot_test
    relation: ref('eph_model')
    config:
      strategy: check
      unique_key: airtable_id
      check_cols: all
      hard_deletes: new_record

Run the snapshot
dbt snapshot --select snapshot_test

Add new column to the ephemeral model:

SELECT
  id AS airtable_id,
  change_id,
  proposal_name  -- New column added
FROM {{ source('some_source', 'some_table') }}

Run the snapshot again, and the above error is seen.

Relevant log output

Database Error in snapshot snapshot_my_table (snapshots/snapshot_my_table.yml)
  Column 4 in UNION ALL has incompatible types: STRING, STRING, STRING, INT64 at [238:5]
1 of 1 ERROR snapshotting dev_db.dbt_dev.snapshot_my_table [ERROR in 5.77s]



create or replace table `prod`.`snapshot_my_table__dbt_tmp`
OPTIONS(
    expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 12 hour)
)
as (
    with snapshot_query as (
        with __dbt__cte__eph_my_table as (
            SELECT
              id AS airtable_id,
              change_id,
              proposal_name  -- NEW COLUMN
            FROM `prod`.`my_source_table`
        )
        select * from __dbt__cte__eph_my_table
    ),

    snapshotted_data as (
        select *,
        airtable_id as dbt_unique_key
        from `prod`.`snapshot_my_table`
        where valid_to_timestamp is null
        -- NOTE: This table does NOT have the 'proposal_name' column yet
    ),

    insertions_source_data as (
        select *,
        airtable_id as dbt_unique_key,
        current_timestamp() as _meta_insertion_timestamp,
        current_timestamp() as valid_from_timestamp,
        coalesce(nullif(current_timestamp(), current_timestamp()), null) as valid_to_timestamp,
        to_hex(md5(concat(coalesce(cast(airtable_id as string), ''), '|',coalesce(cast(current_timestamp() as string), '')))) as episode_id
        from snapshot_query
    ),

    updates_source_data as (
        select *,
        airtable_id as dbt_unique_key,
        current_timestamp() as _meta_insertion_timestamp,
        current_timestamp() as valid_from_timestamp,
        current_timestamp() as valid_to_timestamp
        from snapshot_query
    ),

    deletes_source_data as (
        select *,
        airtable_id as dbt_unique_key
        from snapshot_query
    ),

    insertions as (
        select
            'insert' as dbt_change_type,
            source_data.*,
            'False' as is_deleted
        from insertions_source_data as source_data
        left outer join snapshotted_data
            on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where snapshotted_data.dbt_unique_key is null
           or (snapshotted_data.dbt_unique_key is not null and (TRUE or snapshotted_data.is_deleted = 'True'))
    ),

    updates as (
        select
            'update' as dbt_change_type,
            source_data.*,
            snapshotted_data.episode_id,
            snapshotted_data.is_deleted
        from updates_source_data as source_data
        join snapshotted_data
            on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where (TRUE or snapshotted_data.is_deleted = 'True')
    ),

    deletes as (
        select
            'delete' as dbt_change_type,
            source_data.*,
            current_timestamp() as valid_from_timestamp,
            current_timestamp() as _meta_insertion_timestamp,
            current_timestamp() as valid_to_timestamp,
            snapshotted_data.episode_id,
            snapshotted_data.is_deleted
        from snapshotted_data
        left join deletes_source_data as source_data
            on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where source_data.dbt_unique_key is null
          and not (snapshotted_data.is_deleted = 'True' and snapshotted_data.valid_to_timestamp is null)
    ),

    deletion_records as (
        select
            'insert' as dbt_change_type,
            snapshotted_data.`airtable_id`,
            snapshotted_data.`change_id`,
            NULL as `proposal_name`,  -- PROBLEM: Tries to add this column but snapshotted_data doesn't have it
            snapshotted_data.dbt_unique_key as dbt_unique_key,
            current_timestamp() as valid_from_timestamp,
            current_timestamp() as _meta_insertion_timestamp,
            snapshotted_data.valid_to_timestamp as valid_to_timestamp,
            to_hex(md5(concat(coalesce(cast(episode_id as string), ''), '|',coalesce(cast(current_timestamp() as string), '')))) as episode_id,
            'True' as is_deleted
        from snapshotted_data
        left join deletes_source_data as source_data
            on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where source_data.dbt_unique_key is null
          and not (snapshotted_data.is_deleted = 'True' and snapshotted_data.valid_to_timestamp is null)
    )

    -- LINE 238: UNION ALL fails here due to column type mismatch
    select * from insertions
    union all
    select * from updates
    union all
    select * from deletes
    union all
    select * from deletion_records
);

Environment

OS: Debian GNU/Linux 12 
Python: 3.12.12 
dbt-core: 1.10.13 
Database adapter: dbt-bigquery 1.10.2

Which database adapter are you using with dbt?

bigquery

Additional Context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingtriage

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions