Add Redshift-specific data type definitions:
- type_binary: VARCHAR(32) for MD5 hash hex strings
- type_string: VARCHAR
- type_timestamp: TIMESTAMP (no timezone)

Add Redshift-specific metadata processing:
- get_escape_characters: Use double quotes like PostgreSQL
- concat_ws: Use native CONCAT_WS function

Add Redshift hash function support:
- MD5: Returns UPPER(MD5(...)) as VARCHAR(32) hex string
- SHA256: Returns UPPER(SHA2(..., 256)) as VARCHAR(64) hex string
- SHA1: Warns and falls back to MD5 (not supported)
- cast_binary: Dynamic VARCHAR casting based on hash type
- type_binary: Dynamic VARCHAR length (32 for MD5, 64 for SHA256)
Redshift supports MD5 and SHA256 (via SHA2 function) natively. SHA1 is not supported and falls back to MD5 with a warning.

Create Redshift table macros directory and add simple table types that inherit from default implementations:
- ref_table: Reference tables
- xts: Extended tracking satellite
- eff_sat: Effectivity satellite
- ma_sat: Multi-active satellite
These macros work with Redshift standard SQL without requiring platform-specific optimizations.

Implement hub macro using Redshift QUALIFY clause for better performance. Uses ROW_NUMBER() window function with QUALIFY to deduplicate records efficiently. Requires Redshift version with QUALIFY support (July 2023+).

Add link and non-historized link macros:
- link: Inherits from PostgreSQL (uses ROW_NUMBER pattern)
- nh_link: Inherits from default implementation
PostgreSQL link pattern works well with Redshift as both support similar window function syntax.

Implement satellite macro by inheriting from PostgreSQL. Uses LAG window function and ROW_NUMBER for change detection and deduplication. Redshift supports all required window functions (LAG, ROW_NUMBER, PARTITION BY).

Implement Point-in-Time (PIT) macro for Redshift using simplified VARCHAR hex string approach instead of PostgreSQL BYTEA encoding.
Key differences from PostgreSQL:
- Removed ENCODE/DECODE functions (PostgreSQL-specific)
- Uses simple MAX() aggregation on VARCHAR(32) hex strings
- Works because hashes are stored as VARCHAR not binary
This simplification is possible due to our choice of VARCHAR(32) for hash storage, making the code cleaner and more maintainable.

Implement bridge macro by inheriting from default implementation. No ENCODE/DECODE operations needed (unlike PIT), so default implementation works without modifications.
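
One way to see why a plain MAX() over VARCHAR hex strings can stand in for an aggregate over binary hashes, as the PIT commit describes: for fixed-length, uppercase hex strings, string ordering follows the ordering of the underlying bytes. The Python sketch below illustrates that property only. It is not the macro's code, the function names are hypothetical, and it assumes UTF-8 input and byte-wise string comparison.

```python
import hashlib


def md5_hex(value: str) -> str:
    """Uppercase 32-character hex digest, mirroring UPPER(MD5(...))."""
    return hashlib.md5(value.encode("utf-8")).hexdigest().upper()


def md5_raw(value: str) -> bytes:
    """Raw 16-byte digest, standing in for a binary (BYTEA-style) hash."""
    return hashlib.md5(value.encode("utf-8")).digest()


def max_matches(values) -> bool:
    """True if MAX over hex strings picks the same hash as MAX over raw bytes."""
    max_from_hex = max(md5_hex(v) for v in values)
    max_from_raw = max(md5_raw(v) for v in values).hex().upper()
    return max_from_hex == max_from_raw
```

The property depends on every value having the same length and the same letter case, which is what UPPER() and a fixed VARCHAR length provide.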
Add Amazon Redshift Support
Summary
This PR adds full Amazon Redshift support to automate-dv, enabling users to build Data Vault 2.0 warehouses on Redshift using all table types (hubs, links, satellites, PITs, bridges, etc.).
Motivation
Amazon Redshift is a widely used cloud data warehouse platform, and many organizations need to implement Data Vault patterns on Redshift. This implementation provides native Redshift support with optimizations for Redshift's SQL dialect and performance characteristics.
Implementation Details
Supported Features
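✅ All 10 Data Vault Table Types:
- Hub (hub.sql)
- Link (link.sql)
- Satellite (sat.sql)
- Effectivity satellite (eff_sat.sql)
- Multi-active satellite (ma_sat.sql)
- Point-in-Time (pit.sql)
- Bridge (bridge.sql)
- Extended tracking satellite (xts.sql)
- Non-historized link (nh_link.sql)
- Reference table (ref_table.sql)

✅ Hash Algorithms:
- MD5: MD5() function → VARCHAR(32)
- SHA1: SHA1() function → VARCHAR(40)
- SHA256: SHA2(string, 256) function → VARCHAR(64)

Key Design Decisions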
1. VARCHAR Hex Strings for Hash Storage
Decision: Store hashes as VARCHAR hex strings instead of binary types.

Rationale:
- No ENCODE/DECODE operations needed

Implementation:
- MD5: VARCHAR(32) - 32 hex characters
- SHA1: VARCHAR(40) - 40 hex characters
- SHA256: VARCHAR(64) - 64 hex characters

2. QUALIFY Clause for Performance
Decision: Require Redshift version with QUALIFY support (July 2023+).

Rationale:
- QUALIFY provides significant performance improvements over subquery patterns
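
As a minimal sketch of the two patterns this decision compares, the Python snippet below holds a QUALIFY-style query as a string for reference and runs the equivalent subquery pattern against an in-memory SQLite database. The table name `stage`, the column names, and the sample rows are hypothetical. SQLite is only a stand-in here, since it has no QUALIFY clause, and the snippet says nothing about Redshift performance.

```python
import sqlite3

# Hypothetical staging rows: (hub_pk, customer_id, load_datetime).
ROWS = [
    ("A1", "1001", "2024-01-02"),
    ("A1", "1001", "2024-01-01"),
    ("B2", "1002", "2024-01-03"),
]

# QUALIFY form, shown for comparison only; it is not executed here.
QUALIFY_SQL = """
SELECT hub_pk, customer_id, load_datetime
FROM stage
QUALIFY ROW_NUMBER() OVER (PARTITION BY hub_pk ORDER BY load_datetime) = 1
"""

# Subquery pattern with the same intent, which SQLite can run.
SUBQUERY_SQL = """
SELECT hub_pk, customer_id, load_datetime
FROM (
    SELECT hub_pk, customer_id, load_datetime,
           ROW_NUMBER() OVER (PARTITION BY hub_pk ORDER BY load_datetime) AS rn
    FROM stage
) AS ranked
WHERE rn = 1
ORDER BY hub_pk
"""


def dedupe(rows):
    """Keep the earliest-loaded row per hub_pk using the subquery pattern."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE stage (hub_pk TEXT, customer_id TEXT, load_datetime TEXT)"
        )
        conn.executemany("INSERT INTO stage VALUES (?, ?, ?)", rows)
        return conn.execute(SUBQUERY_SQL).fetchall()
    finally:
        conn.close()
```

Calling `dedupe(ROWS)` keeps one row per `hub_pk`. The QUALIFY form expresses the same filter without the wrapping subquery and the extra `rn` column.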
3. Inheritance Strategy
Approach: Mix of custom implementations and inheritance from existing adapters:
- Custom: hub.sql (QUALIFY pattern), pit.sql (VARCHAR hex simplification)
- Inherited from PostgreSQL: link.sql, sat.sql (compatible window functions)
- Inherited from default: ref_table.sql, xts.sql, eff_sat.sql, ma_sat.sql, nh_link.sql, bridge.sql

Rationale: Maximize code reuse while optimizing for Redshift-specific features.
Files Changed
Supporting Macros (7 files modified)
- macros/internal/metadata_processing/get_escape_characters.sql - Double quotes like PostgreSQL
- macros/internal/metadata_processing/concat_ws.sql - Native CONCAT_WS support
- macros/supporting/data_types/type_binary.sql - VARCHAR(32/64) for hashes
- macros/supporting/data_types/type_string.sql - VARCHAR type
- macros/supporting/data_types/type_timestamp.sql - TIMESTAMP (no timezone)
- macros/supporting/hash_components/select_hash_alg.sql - MD5, SHA256, SHA1 handling
- macros/supporting/casting/cast_binary.sql - Dynamic VARCHAR casting

Table Macros (10 files created)
- macros/tables/redshift/*.sql - All 10 Data Vault table types

Documentation (1 file created)
- macros/tables/redshift/README.md - Comprehensive implementation guide

Total: 17 files changed, 560 insertions(+)
Configuration
Users can configure their hash algorithm in dbt_project.yml:

Requirements
Usage Example
Migration Notes
For users migrating from PostgreSQL to Redshift:
Hash Storage: Redshift uses VARCHAR hex strings vs PostgreSQL's BYTEA binary. MD5 hash values remain compatible when compared as strings.
Data Type Mapping:
- PostgreSQL BYTEA → Redshift VARCHAR(32) or VARCHAR(64)
- PostgreSQL TIMESTAMP → Redshift TIMESTAMP
- PostgreSQL VARCHAR → Redshift VARCHAR

Performance: QUALIFY clause provides better performance than PostgreSQL's DISTINCT ON for many queries.
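
To illustrate the hash storage note above, the following Python sketch shows how a raw binary digest (the kind of value a BYTEA column would hold) lines up with the uppercase hex string form once it is hex-encoded and case-normalised. The function names are hypothetical, UTF-8 input is an assumption, and hashlib stands in for the database hash functions.

```python
import hashlib

# Hex lengths quoted in this PR for each algorithm's VARCHAR column.
VARCHAR_LENGTH = {"MD5": 32, "SHA256": 64}


def hex_string_hash(value: str, algorithm: str = "MD5") -> str:
    """Uppercase hex digest, in the style of UPPER(MD5(...)) or UPPER(SHA2(..., 256))."""
    digest = hashlib.new(algorithm.lower(), value.encode("utf-8"))
    return digest.hexdigest().upper()


def bytea_as_hex(raw: bytes) -> str:
    """Hex-encode a raw (BYTEA-style) digest and normalise it to uppercase."""
    return raw.hex().upper()


def varchar_type(algorithm: str) -> str:
    """Column type sized to the chosen algorithm's hex digest length."""
    return f"VARCHAR({VARCHAR_LENGTH[algorithm]})"
```

For example, `bytea_as_hex(hashlib.md5(b"abc").digest())` and `hex_string_hash("abc")` produce the same 32-character string. The comparison only holds when both sides use the same letter case, so a migration would need to normalise case before comparing.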
Documentation Updates Needed
This PR includes in-code documentation (macros/tables/redshift/README.md). External documentation updates needed:

References
Note: This PR is ready for review and testing. I'm happy to make adjustments based on feedback and add any additional test coverage or documentation as needed.