
[Regression] dbt docs generate no longer works with tables cataloged with AWS AppFlow #1006

@sanrodari

Description


Is this a regression?

  • I believe this is a regression in functionality
  • I have searched the existing issues, and I could not find an existing issue for this regression

Which packages are affected?

  • dbt-adapters
  • dbt-tests-adapter
  • dbt-athena
  • dbt-athena-community
  • dbt-bigquery
  • dbt-postgres
  • dbt-redshift
  • dbt-snowflake
  • dbt-spark

Current Behavior

The table type for AWS AppFlow tables is not parsed correctly, even though it used to work.

Expected/Previous Behavior

The table type for AWS AppFlow tables is parsed correctly; they should resolve to TableType.TABLE.
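
For context, the Glue metadata for AppFlow-created tables carries no TableType field at all (see the full get_table response under Additional Context below). A minimal illustration, with hypothetical values apart from the field names taken from that response:

appflow_table = {
    "Name": "sf_appflow_sf_foo_bar_1739369457_latest",
    "DatabaseName": "landing_zone",
    "Parameters": {"createdBy": "appflow.amazonaws.com", "classification": "PARQUET"},
    # note: no "TableType" key, unlike tables created via regular DDL
}

ddl_table = {
    "Name": "example_ddl_table",  # hypothetical table created with CREATE EXTERNAL TABLE
    "DatabaseName": "landing_zone",
    "TableType": "EXTERNAL_TABLE",
}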

Steps To Reproduce

  1. Create a Glue table using AppFlow (Salesforce integration).
  2. Use that table as a dbt source:

     - name: salesforce
       schema: landing_zone
       tables:
         - name: foo_bar
           identifier: sf_appflow_sf_foo_bar_1739369457_latest

  3. Run dbt docs generate.
  4. Observe the error: 21:17:54 dbt encountered 1 failure while writing the catalog

Relevant log output

21:17:08  Running with dbt=1.9.3
21:17:08  Registered adapter: athena=1.9.2
21:17:09  Found 435 models, 382 data tests, 54 seeds, 155 sources, 21 exposures, 643 macros
21:17:09
21:17:09  Concurrency: 16 threads (target='dev')
21:17:09
21:17:40  Building catalog
21:17:54  Encountered an error while generating catalog: Table type cannot be None for table redacted.landing_zone.sf_appflow_sf_foo_bar_1739369457_latest
21:17:54  dbt encountered 1 failure while writing the catalog
21:17:54  Catalog written to /Users/redacted/repos/redacted/dbt_elt/target/catalog.json

Environment

❯ dbt debug
21:21:20  Running with dbt=1.9.3
21:21:20  dbt version: 1.9.3
21:21:20  python version: 3.10.16
21:21:20  python path: /Users/redacted/repos/redacted/dbt_elt/.venv/bin/python3
21:21:20  os info: macOS-15.4-arm64-arm-64bit
21:21:20  Using profiles dir at /Users/redacted/repos/redacted/dbt_elt
21:21:20  Using profiles.yml file at /Users/redacted/repos/redacted/dbt_elt/profiles.yml
21:21:20  Using dbt_project.yml file at /Users/redacted/repos/redacted/dbt_elt/dbt_project.yml
21:21:20  adapter type: athena
21:21:20  adapter version: 1.9.2
21:21:20  Configuration:
21:21:20    profiles.yml file [OK found and valid]
21:21:20    dbt_project.yml file [OK found and valid]
21:21:20  Required dependencies:
21:21:20   - git [OK found]

21:21:20  Connection:
21:21:20    s3_staging_dir: s3://redacted/staging_curated_zones/redacted/athena_query_results
21:21:20    work_group: None
21:21:20    skip_workgroup_check: False
21:21:20    region_name: us-east-1
21:21:20    database: awsdatacatalog
21:21:20    schema: curated_zone_redacted
21:21:20    poll_interval: 1.0
21:21:20    aws_profile_name: None
21:21:20    aws_access_key_id: None
21:21:20    endpoint_url: None
21:21:20    s3_data_dir: s3://redacted/staging_curated_zones/redacted
21:21:20    s3_data_naming: schema_table_unique
21:21:20    s3_tmp_table_dir: None
21:21:20    debug_query_state: False
21:21:20    seed_s3_upload_args: None
21:21:20    lf_tags_database: None
21:21:20    spark_work_group: None
21:21:20  Registered adapter: athena=1.9.2
21:21:22    Connection test: [OK connection ok]

21:21:22  All checks passed!

Additional Context

I isolated the issue, and I think I know when the regression was introduced:

Repro code (I copy-pasted get_table_type from dbt/adapters/athena/relation.py into a script and fed it an AppFlow-generated table):

import json
from enum import Enum
import boto3

glue_client = boto3.client("glue")


class TableType(Enum):
    TABLE = "table"
    VIEW = "view"
    CTE = "cte"
    MATERIALIZED_VIEW = "materializedview"
    ICEBERG = "iceberg_table"

    def is_physical(self) -> bool:
        return self in [TableType.TABLE, TableType.ICEBERG]


RELATION_TYPE_MAP = {
    "EXTERNAL_TABLE": TableType.TABLE,
    "EXTERNAL": TableType.TABLE,  # type returned by federated query tables
    "GOVERNED": TableType.TABLE,
    "MANAGED_TABLE": TableType.TABLE,
    "VIRTUAL_VIEW": TableType.VIEW,
    "table": TableType.TABLE,
    "view": TableType.VIEW,
    "cte": TableType.CTE,
    "materializedview": TableType.MATERIALIZED_VIEW,
}


def get_table_type(table):
    table_full_name = ".".join(
        filter(None, [table.get("CatalogId"), table.get("DatabaseName"), table["Name"]])
    )

    input_table_type = table.get("TableType")
    if input_table_type and input_table_type not in RELATION_TYPE_MAP:
        raise ValueError(
            f"Table type {table['TableType']} is not supported for table {table_full_name}"
        )

    if table.get("Parameters", {}).get("table_type", "").lower() == "iceberg":
        _type = TableType.ICEBERG
    elif not input_table_type:
        raise ValueError(f"Table type cannot be None for table {table_full_name}")
    else:
        _type = RELATION_TYPE_MAP[input_table_type]

    print(f"table_name : {table_full_name}")
    print(f"table type : {_type}")

    return _type


if __name__ == "__main__":
    table = glue_client.get_table(
        DatabaseName="landing_zone",
        Name="sf_appflow_sf_foo_bar_1739369457_latest",
    )
    print(json.dumps(table, indent=2, default=str))

    table_type = get_table_type(table["Table"])
    print(f"Table type", table_type)
❯ python -m repro

{
  "Table": {
    "Name": "sf_appflow_sf_foo_bar_1739369457_latest",
    "DatabaseName": "landing_zone",
    "CreateTime": "2025-02-12 09:10:58-05:00",
    "UpdateTime": "2025-04-17 16:01:01-04:00",
    "Retention": 0,
    "StorageDescriptor": {
      "Columns": [
        {
          "Name": "id",
          "Type": "string",
          "Parameters": {
            "AppFlowLabel": "Record ID",
            "AppFlowDescription": "Record ID"
          }
        },
        {
          "Name": "ownerid",
          "Type": "string",
          "Parameters": {
            "AppFlowLabel": "Owner ID",
            "AppFlowDescription": "Owner ID"
          }
        },
        ...
      ],
      "Location": "s3://redacted/landing_zone/sf/appflow/sf_foo_bar/schemaVersion_2/acb94e86-a77a-33a8-97d4-84765bcd4739/",
      "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
      "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
      "Compressed": false,
      "NumberOfBuckets": 0,
      "SerdeInfo": {
        "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
        "Parameters": {
          "skip.header.line.count": "1"
        }
      },
      "SortColumns": [],
      "StoredAsSubDirectories": false
    },
    "PartitionKeys": [],
    "Parameters": {
      "sourceConnectorObject": "foo_bar__c",
      "sourceConnectorName": "salesforce",
      "appflowName": "sf_foo_bar",
      "appflowDescription": "",
      "createdBy": "appflow.amazonaws.com",
      "appflowARN": "",
      "classification": "PARQUET"
    },
    "CreatedBy": "arn:aws:sts::redacted:assumed-role/appflow-glue-catalog/SandstoneMRS-455df3aa-c9c5-48bd-86b4-6757fcb6d95f",
    "IsRegisteredWithLakeFormation": false,
    "CatalogId": "redacted",
    "VersionId": "1541",
    "IsMultiDialectView": false
  },
  "ResponseMetadata": {
    "RequestId": "0311a58f-bbfb-49ff-8870-ecbd3a3da715",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Thu, 17 Apr 2025 20:59:41 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "22350",
      "connection": "keep-alive",
      "x-amzn-requestid": "0311a58f-bbfb-49ff-8870-ecbd3a3da715",
      "cache-control": "no-cache"
    },
    "RetryAttempts": 0
  }
}


Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.10.16/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/homebrew/Cellar/[email protected]/3.10.16/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/redacted/repos/redacted/dbt_elt/repro.py", line 61, in <module>
    table_type = get_table_type(table["Table"])
  File "/Users/redacted/repos/redacted/dbt_elt/repro.py", line 45, in get_table_type
    raise ValueError(f"Table type cannot be None for table {table_full_name}")
ValueError: Table type cannot be None for table 854713690974.landing_zone.sf_appflow_sf_foo_bar_1739369457_latest

The code used to default to EXTERNAL_TABLE, and as a consequence the mapping yielded TableType.TABLE. Related PR: dbt-labs/dbt-athena#661.
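
For reference, a minimal sketch of that previous fallback (my reconstruction, not the actual adapter source; it reuses TableType and RELATION_TYPE_MAP from the repro script above): a missing TableType falls back to EXTERNAL_TABLE before the mapping is applied, so AppFlow tables resolve to TableType.TABLE.

def get_table_type_with_fallback(table):
    # Reconstruction of the pre-regression behavior: when Glue returns no
    # "TableType" (as for AppFlow-created tables), assume EXTERNAL_TABLE so
    # that RELATION_TYPE_MAP still yields TableType.TABLE.
    table_full_name = ".".join(
        filter(None, [table.get("CatalogId"), table.get("DatabaseName"), table["Name"]])
    )

    input_table_type = table.get("TableType") or "EXTERNAL_TABLE"
    if input_table_type not in RELATION_TYPE_MAP:
        raise ValueError(
            f"Table type {input_table_type} is not supported for table {table_full_name}"
        )

    if table.get("Parameters", {}).get("table_type", "").lower() == "iceberg":
        return TableType.ICEBERG
    return RELATION_TYPE_MAP[input_table_type]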

cc: @svdimchenko

    Labels

    pkg:dbt-athena, type:bug, type:regression
