-
Notifications
You must be signed in to change notification settings - Fork 121
Add configuration to disable sample collection for PII tables #830
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Add disable_samples_on_pii_tables and pii_table_tags config vars - Create is_pii_table helper macro for PII detection - Modify test materialization to skip sampling for PII tables - Add integration tests for PII sampling behavior - Extend dbt_project.test() to support model_config parameter Fixes ELE-4833 Co-Authored-By: Yosef Arbiv <[email protected]>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
WalkthroughThe changes introduce a configuration-driven mechanism to disable sample row collection for tables tagged as PII within dbt tests. Two new configuration variables are added, a macro to detect PII tables is implemented, and the test materialization logic is updated to conditionally suppress sample collection. Integration tests verify the new behavior. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant dbt_test_macro as handle_dbt_test (macro)
participant is_pii_table_macro as is_pii_table (macro)
participant Config
participant DB
User->>dbt_test_macro: Run test on table
dbt_test_macro->>is_pii_table_macro: Check if table is PII
is_pii_table_macro->>Config: Get disable_samples_on_pii_tables & pii_table_tags
is_pii_table_macro-->>dbt_test_macro: Return true/false
alt Table is PII and disabling is enabled
dbt_test_macro->>DB: Do NOT collect sample rows (limit=0)
else
dbt_test_macro->>DB: Collect sample rows (limit from config)
end
dbt_test_macro-->>User: Return test results
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~18 minutes Assessment against linked issues
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
✨ Finishing Touches
🧪 Generate unit tests
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
|
👋 @devin-ai-integration[bot] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
macros/edr/system/system_utils/is_pii_table.sql (1)
1-14: LGTM with a minor suggestion for parameter naming.The macro logic is sound with good performance optimization through early return. The implementation correctly:
- Uses secure defaults when the feature is disabled
- Leverages existing utility functions for tag intersection
- Handles missing model_tags gracefully with default empty list
Consider renaming the parameter for better clarity:
-{% macro is_pii_table(flattened_test) %} +{% macro is_pii_table(test_node) %}This makes it clearer that the macro can work with any test node, not just flattened ones.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
integration_tests/tests/dbt_project.py(4 hunks)integration_tests/tests/test_sampling_pii.py(1 hunks)macros/edr/materializations/test/test.sql(1 hunks)macros/edr/system/system_utils/get_config_var.sql(1 hunks)macros/edr/system/system_utils/is_pii_table.sql(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
integration_tests/tests/dbt_project.py (2)
Learnt from: haritamar
PR: #825
File: models/run_results.yml:139-145
Timestamp: 2025-07-15T19:31:54.689Z
Learning: In dbt-fusion (dbt 2.0+), the meta configuration for models should be nested under the config block (i.e., config: meta:), not as a top-level meta key. This is different from traditional dbt where meta was a sibling of config.
Learnt from: haritamar
PR: #825
File: integration_tests/dbt_project/models/exposures.yml:18-24
Timestamp: 2025-07-15T19:31:02.744Z
Learning: For dbt exposures, the meta configuration can be nested under the config block (i.e., config: meta:). This is a valid and supported pattern in modern dbt versions, contrary to some older documentation that suggested only top-level meta blocks were supported.
macros/edr/materializations/test/test.sql (2)
Learnt from: haritamar
PR: #825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.
Learnt from: haritamar
PR: #827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
🔇 Additional comments (7)
macros/edr/system/system_utils/get_config_var.sql (1)
67-69: LGTM! Secure defaults for PII configuration.The new configuration variables follow established patterns and use secure defaults:
disable_samples_on_pii_tablesdefaults tofalse, ensuring the feature is opt-inpii_table_tagsdefaults to['pii'], providing a sensible default while allowing customizationintegration_tests/tests/dbt_project.py (2)
112-112: LGTM! Consistent parameter addition across overloads.The
model_configparameter is correctly added to all method signatures with proper type hints and default values.
167-168: Clean implementation of config merging.The model_config is properly merged into the table_yaml dictionary, allowing tests to pass additional configuration like tags for PII testing scenarios.
macros/edr/materializations/test/test.sql (1)
53-57: Excellent security implementation for PII protection.The conditional sample limit override effectively prevents sensitive data collection:
- Defaults to configured sample count for normal operation
- Sets sample_limit to 0 for PII tables, leveraging existing query optimization
- Clean separation of concerns using the dedicated
is_pii_table()macroThis approach ensures zero sample data exposure for PII-tagged tables when the feature is enabled.
integration_tests/tests/test_sampling_pii.py (3)
26-53: Comprehensive test coverage for PII sampling disabled scenario.The test correctly verifies that PII-tagged tables don't collect samples when the feature is enabled:
- Uses guaranteed-fail test scenario with 50 null values
- Properly configures PII tags and feature flags
- Verifies zero samples are collected despite test failure
55-82: Good verification that non-PII tables are unaffected.This test ensures the feature doesn't interfere with normal sampling behavior for non-PII tables, maintaining expected functionality for tables tagged as "normal".
84-110: Essential test for feature toggle validation.This test verifies that when the PII feature is disabled, even PII-tagged tables collect samples normally. This ensures backward compatibility and proper feature flag behavior.
| SAMPLES_QUERY = """ | ||
| with latest_elementary_test_result as ( | ||
| select id | ||
| from {{{{ ref("elementary_test_results") }}}} | ||
| where lower(table_name) = lower('{test_id}') | ||
| order by created_at desc | ||
| limit 1 | ||
| ) | ||
|
|
||
| select result_row | ||
| from {{{{ ref("test_result_rows") }}}} | ||
| where elementary_test_results_id in (select * from latest_elementary_test_result) | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Verify the query handles edge cases properly.
The SAMPLES_QUERY looks correct but should handle potential edge cases:
- What if multiple tests exist for the same table_name?
- Consider adding ORDER BY for deterministic results
🏁 Script executed:
#!/bin/bash
# Verify the query logic and check for potential issues with multiple test results
# Search for similar query patterns in the codebase to ensure consistency
rg -A 10 -B 5 "elementary_test_results.*table_name.*order by.*created_at desc"
# Look for any existing edge case handling in similar queries
rg -A 5 "limit 1" --type sqlLength of output: 2695
Ensure deterministic ordering and handle no-results in SAMPLES_QUERY
The current CTE picks the latest test by created_at DESC LIMIT 1, but if two runs share the same timestamp you may get a non-deterministic row. Also, if no test results exist for the given test_id, the final query simply returns no rows.
Please update integration_tests/tests/test_sampling_pii.py (lines 9–21) to:
- Add a secondary sort key (e.g.
id DESC) to make the choice deterministic - Optionally guard against an empty CTE (for example, by raising an error or returning a default)
Suggested diff:
with latest_elementary_test_result as (
select id
from {{ ref("elementary_test_results") }}
where lower(table_name) = lower('{test_id}')
- order by created_at desc
+ order by created_at desc, id desc
limit 1
)
select result_row
from {{ ref("test_result_rows") }}
where elementary_test_results_id in (
select * from latest_elementary_test_result
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| SAMPLES_QUERY = """ | |
| with latest_elementary_test_result as ( | |
| select id | |
| from {{{{ ref("elementary_test_results") }}}} | |
| where lower(table_name) = lower('{test_id}') | |
| order by created_at desc | |
| limit 1 | |
| ) | |
| select result_row | |
| from {{{{ ref("test_result_rows") }}}} | |
| where elementary_test_results_id in (select * from latest_elementary_test_result) | |
| """ | |
| SAMPLES_QUERY = """ | |
| with latest_elementary_test_result as ( | |
| select id | |
| from {{{{ ref("elementary_test_results") }}}} | |
| where lower(table_name) = lower('{test_id}') | |
| order by created_at desc, id desc | |
| limit 1 | |
| ) | |
| select result_row | |
| from {{{{ ref("test_result_rows") }}}} | |
| where elementary_test_results_id in (select * from latest_elementary_test_result) | |
| """ |
🤖 Prompt for AI Agents
In integration_tests/tests/test_sampling_pii.py around lines 9 to 21, the
SAMPLES_QUERY CTE orders by created_at descending but lacks a secondary sort
key, causing non-deterministic results if timestamps tie. Also, it does not
handle the case when no test results exist for the given test_id. Fix this by
adding a secondary order criterion such as id DESC to ensure deterministic
ordering, and add logic to handle an empty CTE result, for example by raising an
error or returning a default value to avoid silent failures.
- Add 'id desc' as secondary sort key to ensure deterministic results - Addresses CodeRabbit comment about non-deterministic ordering when timestamps tie Co-Authored-By: Yosef Arbiv <[email protected]>
Add configuration to disable sample collection for PII tables
Summary
This PR implements configuration to prevent sample collection for tables tagged as containing PII data, addressing Linear issue ELE-4833. When tests fail on PII tables, the system will skip collecting sample rows that might contain sensitive personal information.
Key Changes:
disable_samples_on_pii_tables(boolean, default: false) andpii_table_tags(list/string, default: ['pii'])is_pii_table()macro checks if model tags intersect with configured PII tagssample_limit = 0for PII tables when feature is enableddbt_project.test()method to supportmodel_configparameter for test setupid desc)Review & Testing Checklist for Human
Recommended Test Plan:
tags: ['pii']and run failing tests to confirm no samples are storeddisable_samples_on_pii_tables: falseto ensure PII tables still collect samples when disabledDiagram
%%{ init : { "theme" : "default" }}%% graph TD Config["macros/edr/system/system_utils/<br/>get_config_var.sql"]:::major-edit PiiHelper["macros/edr/system/system_utils/<br/>is_pii_table.sql"]:::major-edit TestMat["macros/edr/materializations/<br/>test/test.sql"]:::major-edit TestFramework["integration_tests/tests/<br/>dbt_project.py"]:::minor-edit IntegrationTest["integration_tests/tests/<br/>test_sampling_pii.py"]:::major-edit ListUtils["macros/utils/list_utils/<br/>lists_intersection.sql"]:::context DictUtils["macros/utils/dict_utils/<br/>insensitive_get_dict_value.sql"]:::context Config -->|"provides config vars"| PiiHelper PiiHelper -->|"called by"| TestMat TestMat -->|"sets sample_limit=0<br/>for PII tables"| SampleCollection["Sample Collection<br/>Process"]:::context ListUtils -->|"used for tag<br/>intersection"| PiiHelper DictUtils -->|"extracts model_tags<br/>from flattened_test"| PiiHelper TestFramework -->|"supports model_config<br/>for testing"| IntegrationTest IntegrationTest -->|"validates PII<br/>sampling behavior"| TestMat subgraph Legend L1[Major Edit]:::major-edit L2[Minor Edit]:::minor-edit L3[Context/No Edit]:::context end classDef major-edit fill:#90EE90 classDef minor-edit fill:#87CEEB classDef context fill:#FFFFFFNotes
Environment Issue: Integration tests couldn't be executed locally due to dbt profile configuration issues (postgres target not found in elementary_tests profile). Additionally, all CI checks are failing at the "Write dbt profiles" step, which appears to be related to
secrets.CI_PROFILES_YMLconfiguration rather than code issues. The implementation follows established patterns in the codebase and passes all pre-commit hooks, but end-to-end verification is needed.Implementation Approach: The solution leverages existing utility functions (
lists_intersection,insensitive_get_dict_value) and follows the same patterns used elsewhere in the codebase for configuration and tag handling.CR Comment Fix: Addressed CodeRabbit feedback by adding deterministic ordering (
id desc) to SAMPLES_QUERY to prevent non-deterministic results when timestamps tie.Link to Devin run: https://app.devin.ai/sessions/ad4bd978c51e445e84763d97943c38a8
Requested by: Yosef Arbiv (@arbiv)
Linear Issue: ELE-4833
Summary by CodeRabbit
New Features
Bug Fixes
Tests