Add redshift support #271

Open

CoderCookE wants to merge 9 commits into Datavault-UK:master from CoderCookE:feat/add-redshift-support

Conversation

@CoderCookE

Add Amazon Redshift Support

Summary

This PR adds full Amazon Redshift support to automate-dv, enabling users to build Data Vault 2.0 warehouses on Redshift using all table types (hubs, links, satellites, PITs, bridges, etc.).

Motivation

Amazon Redshift is a widely-used cloud data warehouse platform, and many organizations need to implement Data Vault patterns on Redshift. This implementation provides native Redshift support with optimizations for Redshift's SQL dialect and performance characteristics.

Implementation Details

Supported Features

All 10 Data Vault Table Types:

  • Hubs (hub.sql)
  • Links (link.sql)
  • Satellites (sat.sql)
  • Effectivity Satellites (eff_sat.sql)
  • Multi-Active Satellites (ma_sat.sql)
  • Point-in-Time Tables (pit.sql)
  • Bridge Tables (bridge.sql)
  • Extended Tracking Satellites (xts.sql)
  • Non-Historized Links (nh_link.sql)
  • Reference Tables (ref_table.sql)

Hash Algorithms:

  • MD5: Full support via native MD5() function → VARCHAR(32)
  • SHA1: Full support via native SHA1() function → VARCHAR(40)
  • SHA256: Full support via native SHA2(string, 256) function → VARCHAR(64)
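The hex output lengths follow directly from the digest sizes (128, 160, and 256 bits). As a minimal illustrative query (not part of the PR's code):

```sql
-- Illustrative only: Redshift's native hash functions return hex strings.
SELECT
    MD5('example')       AS md5_hk,     -- 32 hex chars -> VARCHAR(32)
    SHA1('example')      AS sha1_hk,    -- 40 hex chars -> VARCHAR(40)
    SHA2('example', 256) AS sha256_hk;  -- 64 hex chars -> VARCHAR(64)
```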

Key Design Decisions

1. VARCHAR Hex Strings for Hash Storage

Decision: Store hashes as VARCHAR hex strings instead of binary types.

Rationale:

  • Significantly simplifies the PIT (Point-in-Time) macro: no PostgreSQL-specific ENCODE/DECODE operations are needed
  • Makes hashes human-readable for debugging
  • Follows proven pattern from BigQuery and Databricks implementations
  • Storage overhead is negligible for most use cases

Implementation:

  • MD5: VARCHAR(32) - 32 hex characters
  • SHA1: VARCHAR(40) - 40 hex characters
  • SHA256: VARCHAR(64) - 64 hex characters

2. QUALIFY Clause for Performance

Decision: Require Redshift version with QUALIFY support (July 2023+).

Rationale:

  • QUALIFY provides significant performance improvements over subquery patterns
  • Most production Redshift clusters are kept current for security and performance
  • Cleaner, more maintainable SQL

Example:

SELECT columns
FROM table
QUALIFY ROW_NUMBER() OVER(PARTITION BY pk ORDER BY ldts) = 1
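For comparison, a sketch of the subquery pattern that QUALIFY replaces (schematic, mirroring the example above):

```sql
SELECT columns
FROM (
    SELECT columns,
           ROW_NUMBER() OVER (PARTITION BY pk ORDER BY ldts) AS rn
    FROM table
) AS ranked
WHERE rn = 1
```

QUALIFY filters on the window function directly, keeping the statement flat; the PR reports this performs significantly better on Redshift.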

3. Inheritance Strategy

Approach: Mix of custom implementations and inheritance from existing adapters:

  • Custom: hub.sql (QUALIFY pattern), pit.sql (VARCHAR hex simplification)
  • PostgreSQL-based: link.sql, sat.sql (compatible window functions)
  • Default: ref_table.sql, xts.sql, eff_sat.sql, ma_sat.sql, nh_link.sql, bridge.sql

Rationale: Maximize code reuse while optimizing for Redshift-specific features.
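dbt selects platform implementations through adapter dispatch, so the three tiers above map onto macro naming. A minimal sketch (signatures abbreviated, macro bodies elided; follows dbt's `default__`/`redshift__` prefix convention, not the PR's actual code):

```sql
{# default__hub is used on platforms with no adapter-specific override #}
{% macro default__hub(src_pk, src_nk, src_ldts, src_source, source_model) %}
    {# generic implementation #}
{% endmacro %}

{# redshift__hub is picked up automatically when the target is Redshift #}
{% macro redshift__hub(src_pk, src_nk, src_ldts, src_source, source_model) %}
    {# QUALIFY-based implementation #}
{% endmacro %}
```

Under this scheme, the "PostgreSQL-based" macros can delegate to the postgres implementation rather than duplicating it.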

Files Changed

Supporting Macros (7 files modified)

  • macros/internal/metadata_processing/get_escape_characters.sql - Double quotes like PostgreSQL
  • macros/internal/metadata_processing/concat_ws.sql - Native CONCAT_WS support
  • macros/supporting/data_types/type_binary.sql - VARCHAR(32/64) for hashes
  • macros/supporting/data_types/type_string.sql - VARCHAR type
  • macros/supporting/data_types/type_timestamp.sql - TIMESTAMP (no timezone)
  • macros/supporting/hash_components/select_hash_alg.sql - MD5, SHA256, SHA1 handling
  • macros/supporting/casting/cast_binary.sql - Dynamic VARCHAR casting

Table Macros (10 files created)

  • macros/tables/redshift/*.sql - All 10 Data Vault table types

Documentation (1 file created)

  • macros/tables/redshift/README.md - Comprehensive implementation guide

Total: 17 files changed, 560 insertions(+)

Configuration

Users can configure their hash algorithm in dbt_project.yml:

vars:
  # Use MD5 (default)
  hash: 'md5'

  # Or uncomment to use SHA256 for stronger hashing
  # hash: 'sha'

Requirements

  • Minimum Redshift Version: July 2023 or later (QUALIFY support required)
  • dbt Version: >=1.0.0, <3.0.0
  • dbt-redshift adapter: Latest version recommended

Usage Example

-- models/raw_vault/hubs/hub_customer.sql
{{- config(
    materialized='incremental',
    schema='raw_vault'
) -}}

{%- set src_pk = 'CUSTOMER_PK' -%}
{%- set src_nk = 'CUSTOMER_ID' -%}
{%- set src_ldts = 'LOAD_DATETIME' -%}
{%- set src_source = 'RECORD_SOURCE' -%}

{{ automate_dv.hub(src_pk=src_pk,
                   src_nk=src_nk,
                   src_ldts=src_ldts,
                   src_source=src_source,
                   source_model='stg_customer') }}

Migration Notes

For users migrating from PostgreSQL to Redshift:

  1. Hash Storage: Redshift uses VARCHAR hex strings vs PostgreSQL's BYTEA binary. MD5 hash values remain compatible when compared as strings.

  2. Data Type Mapping:

    • PostgreSQL BYTEA → Redshift VARCHAR(32) or VARCHAR(64)
    • PostgreSQL TIMESTAMP → Redshift TIMESTAMP
    • PostgreSQL VARCHAR → Redshift VARCHAR
  3. Performance: QUALIFY clause provides better performance than PostgreSQL's DISTINCT ON for many queries.
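As an illustration of point 3, the two dialects' deduplication idioms side by side (table and column names illustrative):

```sql
-- PostgreSQL: first row per pk via DISTINCT ON
SELECT DISTINCT ON (pk) pk, ldts, payload
FROM staging
ORDER BY pk, ldts;

-- Redshift: same result via QUALIFY
SELECT pk, ldts, payload
FROM staging
QUALIFY ROW_NUMBER() OVER (PARTITION BY pk ORDER BY ldts) = 1;
```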

Documentation Updates Needed

This PR includes in-code documentation (macros/tables/redshift/README.md). External documentation updates are still needed.

Note: This PR is ready for review and testing. I'm happy to make adjustments based on feedback and add any additional test coverage or documentation as needed.

Commits

Add Redshift-specific data type definitions:

  • type_binary: VARCHAR(32) for MD5 hash hex strings
  • type_string: VARCHAR
  • type_timestamp: TIMESTAMP (no timezone)

Add Redshift-specific metadata processing:

  • get_escape_characters: Use double quotes like PostgreSQL
  • concat_ws: Use native CONCAT_WS function

Add Redshift hash function support:

  • MD5: Returns UPPER(MD5(...)) as VARCHAR(32) hex string
  • SHA256: Returns UPPER(SHA2(..., 256)) as VARCHAR(64) hex string
  • SHA1: Warns and falls back to MD5 (not supported)
  • cast_binary: Dynamic VARCHAR casting based on hash type
  • type_binary: Dynamic VARCHAR length (32 for MD5, 64 for SHA256)

Redshift supports MD5 and SHA256 (via the SHA2 function) natively. SHA1 is not supported and falls back to MD5 with a warning.
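A sketch of the fallback behaviour this commit describes (macro name and body illustrative, not the PR's actual code; `exceptions.warn` is dbt's built-in warning hook):

```sql
{%- macro redshift__select_hash_alg(hash) -%}
    {%- if hash | lower == 'sha1' -%}
        {# The commit describes SHA1 as unsupported: warn, then fall back to MD5 #}
        {%- do exceptions.warn("SHA1 not supported on Redshift; falling back to MD5.") -%}
        {%- set hash = 'md5' -%}
    {%- endif -%}
    {# ... emit MD5(...) or SHA2(..., 256) accordingly ... #}
{%- endmacro -%}
```

Note that the PR description above lists native SHA1 support, so a later commit may have superseded this fallback.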
Create the Redshift table macros directory and add simple table types that inherit from the default implementations:

  • ref_table: Reference tables
  • xts: Extended tracking satellite
  • eff_sat: Effectivity satellite
  • ma_sat: Multi-active satellite

These macros work with Redshift standard SQL and require no platform-specific optimizations.

Implement the hub macro using Redshift's QUALIFY clause for better performance. It uses a ROW_NUMBER() window function with QUALIFY to deduplicate records efficiently. Requires a Redshift version with QUALIFY support (July 2023+).

Add link and non-historized link macros:

  • link: Inherits from PostgreSQL (uses the ROW_NUMBER pattern)
  • nh_link: Inherits from the default implementation

The PostgreSQL link pattern works well on Redshift, as both platforms support similar window function syntax.

Implement the satellite macro by inheriting from PostgreSQL. It uses the LAG window function and ROW_NUMBER for change detection and deduplication; Redshift supports all required window functions (LAG, ROW_NUMBER, PARTITION BY).
Implement the Point-in-Time (PIT) macro for Redshift using the simplified VARCHAR hex string approach instead of PostgreSQL's BYTEA encoding.

Key differences from PostgreSQL:

  • Removed ENCODE/DECODE functions (PostgreSQL-specific)
  • Uses simple MAX() aggregation on VARCHAR(32) hex strings
  • Works because hashes are stored as VARCHAR, not binary

This simplification is possible because of the choice of VARCHAR(32) for hash storage, making the code cleaner and more maintainable.
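A minimal sketch of the resulting aggregation shape (table and column names illustrative, not the PR's actual code):

```sql
-- With hashes stored as VARCHAR hex, the PIT aggregates them directly:
SELECT MAX(customer_hk) AS customer_hk
FROM sat_customer
WHERE load_datetime <= '2024-01-01'
-- PostgreSQL's BYTEA storage would instead need a round-trip such as:
--   ENCODE(MAX(DECODE(customer_hk, 'hex')), 'hex')
```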
Implement the bridge macro by inheriting from the default implementation. No ENCODE/DECODE operations are needed (unlike PIT), so the default implementation works without modification.