Add configuration to disable sample collection for PII tables #830

devin-ai-integration · 2025-07-28T20:28:17Z

Add configuration to disable sample collection for PII tables

Summary

This PR implements configuration to prevent sample collection for tables tagged as containing PII data, addressing Linear issue ELE-4833. When tests fail on PII tables, the system will skip collecting sample rows that might contain sensitive personal information.

Key Changes:

Configuration: Added disable_samples_on_pii_tables (boolean, default: false) and pii_table_tags (list/string, default: ['pii'])
PII Detection: New is_pii_table() macro checks if model tags intersect with configured PII tags
Sampling Logic: Modified test materialization to set sample_limit = 0 for PII tables when feature is enabled
Testing: Added integration tests covering PII/non-PII scenarios and feature toggle behavior
Framework: Extended dbt_project.test() method to support model_config parameter for test setup
Bug Fix: Fixed SAMPLES_QUERY deterministic ordering by adding secondary sort key (id desc)
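
The detection step listed above can be sketched in plain Python. This is a minimal illustration, not the actual Jinja macro: the function name mirrors `is_pii_table`, but the `config` dictionary and the exact-match (case-sensitive) tag comparison are assumptions made for the example.

```python
from typing import Any, Iterable, Mapping


def is_pii_table(model_tags: Iterable[str], config: Mapping[str, Any]) -> bool:
    """Return True when sampling should be skipped for a model (illustrative only)."""
    # Early return: the feature is opt-in and defaults to disabled.
    if not config.get("disable_samples_on_pii_tables", False):
        return False

    # Default PII tag list, as stated in the PR description.
    pii_tags = config.get("pii_table_tags", ["pii"])

    # A table counts as PII when its tags intersect the configured PII tags.
    # Assumption: tags are compared by exact string match.
    return bool(set(model_tags) & set(pii_tags))
```

With the flag left at its default, `is_pii_table(["pii"], {})` returns `False`, so sampling is unchanged until the feature is explicitly enabled.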

Review & Testing Checklist for Human

End-to-end functionality: Test with actual models tagged as PII to verify samples are not collected when tests fail
Tag format compatibility: Test with different tag formats (string vs list, case sensitivity, multiple tags, empty tags)
Regression testing: Verify existing sampling functionality works normally for non-PII tables and when feature is disabled
Feature toggle validation: Confirm the feature can be properly enabled/disabled via dbt_project.yml configuration
Performance impact: Assess whether the new PII detection logic introduces any noticeable performance overhead

Recommended Test Plan:

Create test models with tags: ['pii'] and run failing tests to confirm no samples are stored
Test with disable_samples_on_pii_tables: false to ensure PII tables still collect samples when disabled
Verify non-PII tagged models continue collecting samples normally
Test edge cases like empty tags, mixed tag types, and custom PII tag names
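
The edge cases in the plan above (empty tags, a single string instead of a list, mixed tag types) could be exercised against a normalization step like the one below. `normalize_tags` is a hypothetical helper written for this sketch; the PR does not show how the macro actually coerces tag values.

```python
from typing import Any, List


def normalize_tags(raw: Any) -> List[str]:
    """Coerce a tag setting into a list of strings (hypothetical helper)."""
    if raw is None:
        return []
    if isinstance(raw, str):
        # A single tag given as a bare string becomes a one-element list.
        return [raw]
    # Mixed tag types (e.g. numbers) are stringified; None entries are dropped.
    return [str(tag) for tag in raw if tag is not None]
```

Normalizing both the model tags and the configured PII tags before intersecting them would keep the comparison independent of how the user wrote the setting.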

Diagram

%%{ init : { "theme" : "default" }}%%
graph TD
    Config["macros/edr/system/system_utils/<br/>get_config_var.sql"]:::major-edit
    PiiHelper["macros/edr/system/system_utils/<br/>is_pii_table.sql"]:::major-edit
    TestMat["macros/edr/materializations/<br/>test/test.sql"]:::major-edit
    TestFramework["integration_tests/tests/<br/>dbt_project.py"]:::minor-edit
    IntegrationTest["integration_tests/tests/<br/>test_sampling_pii.py"]:::major-edit
    
    ListUtils["macros/utils/list_utils/<br/>lists_intersection.sql"]:::context
    DictUtils["macros/utils/dict_utils/<br/>insensitive_get_dict_value.sql"]:::context
    
    Config -->|"provides config vars"| PiiHelper
    PiiHelper -->|"called by"| TestMat
    TestMat -->|"sets sample_limit=0<br/>for PII tables"| SampleCollection["Sample Collection<br/>Process"]:::context
    
    ListUtils -->|"used for tag<br/>intersection"| PiiHelper
    DictUtils -->|"extracts model_tags<br/>from flattened_test"| PiiHelper
    
    TestFramework -->|"supports model_config<br/>for testing"| IntegrationTest
    IntegrationTest -->|"validates PII<br/>sampling behavior"| TestMat

    subgraph Legend
        L1[Major Edit]:::major-edit
        L2[Minor Edit]:::minor-edit  
        L3[Context/No Edit]:::context
    end

    classDef major-edit fill:#90EE90
    classDef minor-edit fill:#87CEEB
    classDef context fill:#FFFFFF

Notes

Environment Issue: Integration tests couldn't be executed locally due to dbt profile configuration issues (postgres target not found in elementary_tests profile). Additionally, all CI checks are failing at the "Write dbt profiles" step, which appears to be related to secrets.CI_PROFILES_YML configuration rather than code issues. The implementation follows established patterns in the codebase and passes all pre-commit hooks, but end-to-end verification is needed.

Implementation Approach: The solution leverages existing utility functions (lists_intersection, insensitive_get_dict_value) and follows the same patterns used elsewhere in the codebase for configuration and tag handling.

CR Comment Fix: Addressed CodeRabbit feedback by adding deterministic ordering (id desc) to SAMPLES_QUERY to prevent non-deterministic results when timestamps tie.

Link to Devin run: https://app.devin.ai/sessions/ad4bd978c51e445e84763d97943c38a8
Requested by: Yosef Arbiv (@arbiv)
Linear Issue: ELE-4833

Summary by CodeRabbit

New Features
- Introduced configuration options to control sample data collection on tables tagged as containing PII (Personally Identifiable Information).
- Added a macro to identify PII tables based on configurable tags.
- Default configuration now includes options for disabling sample collection on PII tables and specifying PII-related tags.
Bug Fixes
- Sample data is no longer collected for tables tagged as PII when the relevant configuration is enabled.
Tests
- Added tests to verify sample collection behavior for PII and non-PII tables under different configuration scenarios.

- Add disable_samples_on_pii_tables and pii_table_tags config vars
- Create is_pii_table helper macro for PII detection
- Modify test materialization to skip sampling for PII tables
- Add integration tests for PII sampling behavior
- Extend dbt_project.test() to support model_config parameter

Fixes ELE-4833

Co-Authored-By: Yosef Arbiv <[email protected]>

linear · 2025-07-28T20:28:21Z

ELE-4833 Add configuration to disable sample for PII tables

coderabbitai · 2025-07-28T20:28:25Z

Walkthrough

The changes introduce a configuration-driven mechanism to disable sample row collection for tables tagged as PII within dbt tests. Two new configuration variables are added, a macro to detect PII tables is implemented, and the test materialization logic is updated to conditionally suppress sample collection. Integration tests verify the new behavior.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| PII Sampling Configuration `macros/edr/system/system_utils/get_config_var.sql` | Added `disable_samples_on_pii_tables` (default: false) and `pii_table_tags` (default: ['pii']) to the default config dictionary. |
| PII Table Detection Macro `macros/edr/system/system_utils/is_pii_table.sql` | Introduced new macro `is_pii_table` to determine if a table is PII-tagged based on config and model tags. |
| Test Materialization Enforcement `macros/edr/materializations/test/test.sql` | Updated `handle_dbt_test` macro: skips sample collection for PII tables when configured, by setting sample row limit to zero. |
| Integration Tests for PII Sampling `integration_tests/tests/test_sampling_pii.py` | Added three tests to verify sample row collection behavior with various PII tag and configuration scenarios. |
| Test Method Extension `integration_tests/tests/dbt_project.py` | Extended `DbtProject.test()` to accept and merge an optional `model_config` parameter for test setup flexibility. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant dbt_test_macro as handle_dbt_test (macro)
    participant is_pii_table_macro as is_pii_table (macro)
    participant Config
    participant DB

    User->>dbt_test_macro: Run test on table
    dbt_test_macro->>is_pii_table_macro: Check if table is PII
    is_pii_table_macro->>Config: Get disable_samples_on_pii_tables & pii_table_tags
    is_pii_table_macro-->>dbt_test_macro: Return true/false
    alt Table is PII and disabling is enabled
        dbt_test_macro->>DB: Do NOT collect sample rows (limit=0)
    else
        dbt_test_macro->>DB: Collect sample rows (limit from config)
    end
    dbt_test_macro-->>User: Return test results
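
The branch in the sequence diagram reduces to a single decision, sketched here in Python. `resolve_sample_limit` is a hypothetical name, and `table_is_pii` is assumed to already reflect the feature flag (that is, it is false whenever `disable_samples_on_pii_tables` is off).

```python
def resolve_sample_limit(configured_limit: int, table_is_pii: bool) -> int:
    """Pick the number of failed rows to store for a test (illustrative)."""
    # PII tables get a limit of 0, so no sample rows are collected.
    return 0 if table_is_pii else configured_limit
```

Overriding the limit rather than adding a separate code path means the existing sample query logic is reused unchanged for non-PII tables.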

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~18 minutes

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| Add `disable_samples_on_pii_tables` and `pii_table_tags` config variables (ELE-4833) | ✅ | |
| Implement macro `is_pii_table(node)` to detect PII tables (ELE-4833) | ✅ | |
| Update test materialization to skip sample collection for PII tables when configured (ELE-4833) | ✅ | |
| Add integration tests to verify sampling is disabled for PII tables and enabled for others (ELE-4833) | ✅ | |

Poem

In the land of data, where secrets may lie,
A rabbit hops softly, with a watchful eye.
Now PII tables are safe from the sample’s keen view,
With configs and tags, we know what to do!
🥕 No more samples where privacy’s key—
Hooray for the code, and carrots for me!

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60b3aac and c56e916.

📒 Files selected for processing (1)

integration_tests/tests/test_sampling_pii.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

integration_tests/tests/test_sampling_pii.py

github-actions · 2025-07-28T20:28:28Z

👋 @devin-ai-integration[bot]
Thank you for raising your pull request.
Please make sure to add tests and document all user-facing changes.
You can do this by editing the docs files in the elementary repository.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

macros/edr/system/system_utils/is_pii_table.sql (1)
1-14: LGTM with a minor suggestion for parameter naming.

The macro logic is sound with good performance optimization through early return. The implementation correctly:

Uses secure defaults when the feature is disabled

Leverages existing utility functions for tag intersection

Handles missing model_tags gracefully with default empty list

Consider renaming the parameter for better clarity:
-{% macro is_pii_table(flattened_test) %}
+{% macro is_pii_table(test_node) %}
This makes it clearer that the macro can work with any test node, not just flattened ones.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6d557d and 60b3aac.

📒 Files selected for processing (5)

integration_tests/tests/dbt_project.py (4 hunks)
integration_tests/tests/test_sampling_pii.py (1 hunks)
macros/edr/materializations/test/test.sql (1 hunks)
macros/edr/system/system_utils/get_config_var.sql (1 hunks)
macros/edr/system/system_utils/is_pii_table.sql (1 hunks)

🧰 Additional context used

🧠 Learnings (3)

📓 Common learnings

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

integration_tests/tests/dbt_project.py (2)

Learnt from: haritamar
PR: #825
File: models/run_results.yml:139-145
Timestamp: 2025-07-15T19:31:54.689Z
Learning: In dbt-fusion (dbt 2.0+), the meta configuration for models should be nested under the config block (i.e., config: meta:), not as a top-level meta key. This is different from traditional dbt where meta was a sibling of config.

Learnt from: haritamar
PR: #825
File: integration_tests/dbt_project/models/exposures.yml:18-24
Timestamp: 2025-07-15T19:31:02.744Z
Learning: For dbt exposures, the meta configuration can be nested under the config block (i.e., config: meta:). This is a valid and supported pattern in modern dbt versions, contrary to some older documentation that suggested only top-level meta blocks were supported.

macros/edr/materializations/test/test.sql (2)

Learnt from: haritamar
PR: #825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.

Learnt from: haritamar
PR: #827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

🔇 Additional comments (7)

macros/edr/system/system_utils/get_config_var.sql (1)

67-69: LGTM! Secure defaults for PII configuration.

The new configuration variables follow established patterns and use secure defaults:

disable_samples_on_pii_tables defaults to false, ensuring the feature is opt-in

pii_table_tags defaults to ['pii'], providing a sensible default while allowing customization
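
As a rough Python analogue of the default lookup described above: a user-supplied value wins, otherwise the package default applies. The two defaults come from the PR; the `project_vars` argument and the lookup order are assumptions, since the body of `get_config_var.sql` is not shown here.

```python
from typing import Any, Mapping

# Defaults as listed in the PR; the real default dictionary holds many more keys.
DEFAULT_CONFIG = {
    "disable_samples_on_pii_tables": False,
    "pii_table_tags": ["pii"],
}


def get_config_var(name: str, project_vars: Mapping[str, Any]) -> Any:
    """Return the user's value when set, else the package default (sketch)."""
    if name in project_vars:
        return project_vars[name]
    return DEFAULT_CONFIG[name]
```

For example, `get_config_var("pii_table_tags", {"pii_table_tags": ["sensitive"]})` yields the custom tag list, while an empty `project_vars` falls back to `["pii"]`.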

integration_tests/tests/dbt_project.py (2)

112-112: LGTM! Consistent parameter addition across overloads.

The model_config parameter is correctly added to all method signatures with proper type hints and default values.

167-168: Clean implementation of config merging.

The model_config is properly merged into the table_yaml dictionary, allowing tests to pass additional configuration like tags for PII testing scenarios.
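
A minimal sketch of that merge, assuming a shallow dictionary update. `build_table_yaml` and the shape of the `table_yaml` entry are hypothetical; only the `model_config` parameter name comes from the PR.

```python
from typing import Any, Dict, Optional


def build_table_yaml(
    test_id: str, model_config: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Build a table entry and merge optional model-level config (sketch)."""
    table_yaml: Dict[str, Any] = {"name": test_id}
    if model_config:
        # Assumption: a shallow merge, with model_config keys taking precedence.
        table_yaml.update(model_config)
    return table_yaml
```

A PII test could then pass something like `build_table_yaml("my_table", {"config": {"tags": ["pii"]}})` to tag the model under test.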

macros/edr/materializations/test/test.sql (1)

53-57: Excellent security implementation for PII protection.

The conditional sample limit override effectively prevents sensitive data collection:

Defaults to configured sample count for normal operation

Sets sample_limit to 0 for PII tables, leveraging existing query optimization

Clean separation of concerns using the dedicated is_pii_table() macro

This approach ensures zero sample data exposure for PII-tagged tables when the feature is enabled.

integration_tests/tests/test_sampling_pii.py (3)

26-53: Comprehensive test coverage for PII sampling disabled scenario.

The test correctly verifies that PII-tagged tables don't collect samples when the feature is enabled:

Uses guaranteed-fail test scenario with 50 null values

Properly configures PII tags and feature flags

Verifies zero samples are collected despite test failure

55-82: Good verification that non-PII tables are unaffected.

This test ensures the feature doesn't interfere with normal sampling behavior for non-PII tables, maintaining expected functionality for tables tagged as "normal".

84-110: Essential test for feature toggle validation.

This test verifies that when the PII feature is disabled, even PII-tagged tables collect samples normally. This ensures backward compatibility and proper feature flag behavior.

coderabbitai · 2025-07-28T20:33:33Z

integration_tests/tests/test_sampling_pii.py

+SAMPLES_QUERY = """
+    with latest_elementary_test_result as (
+        select id
+        from {{{{ ref("elementary_test_results") }}}}
+        where lower(table_name) = lower('{test_id}')
+        order by created_at desc
+        limit 1
+    )
+
+    select result_row
+    from {{{{ ref("test_result_rows") }}}}
+    where elementary_test_results_id in (select * from latest_elementary_test_result)
+"""
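
The quadrupled braces in the query are worth a note. Assuming the template is filled in with Python's `str.format` (which the `{test_id}` placeholder suggests, though the call itself is not shown in this PR), each doubled brace is an escape, so `{{{{` survives formatting as the `{{` that dbt's Jinja expects. The shortened query below is for illustration only.

```python
# Template with one str.format placeholder and one escaped Jinja expression.
template = "select id from {{{{ ref('elementary_test_results') }}}} where table_name = '{test_id}'"

# Each "{{" collapses to "{" and each "}}" to "}", so "{{{{" becomes "{{".
rendered = template.format(test_id="my_test")
print(rendered)
```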


💡 Verification agent

🧩 Analysis chain

Verify the query handles edge cases properly.

The SAMPLES_QUERY looks correct but should handle potential edge cases:

What if multiple tests exist for the same table_name?

Consider adding ORDER BY for deterministic results

🏁 Script executed:

#!/bin/bash
# Verify the query logic and check for potential issues with multiple test results

# Search for similar query patterns in the codebase to ensure consistency
rg -A 10 -B 5 "elementary_test_results.*table_name.*order by.*created_at desc"

# Look for any existing edge case handling in similar queries
rg -A 5 "limit 1" --type sql

Length of output: 2695

Ensure deterministic ordering and handle no-results in SAMPLES_QUERY

The current CTE picks the latest test by created_at DESC LIMIT 1, but if two runs share the same timestamp you may get a non-deterministic row. Also, if no test results exist for the given test_id, the final query simply returns no rows.

Please update integration_tests/tests/test_sampling_pii.py (lines 9–21) to:

Add a secondary sort key (e.g. id DESC) to make the choice deterministic

Optionally guard against an empty CTE (for example, by raising an error or returning a default)

Suggested diff:

 with latest_elementary_test_result as (
     select id
     from {{ ref("elementary_test_results") }}
     where lower(table_name) = lower('{test_id}')
-    order by created_at desc
+    order by created_at desc, id desc
     limit 1
 )

 select result_row
 from {{ ref("test_result_rows") }}
 where elementary_test_results_id in (
     select * from latest_elementary_test_result
 )
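
The effect of the secondary sort key can be reproduced with a small in-memory SQLite table. This is only an illustration: the `results` table, its columns, and the sample values are made up, and SQLite stands in for whichever warehouse the integration tests target.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table results (id text, created_at text)")
# Two rows share the same timestamp, so created_at alone cannot rank them.
conn.executemany(
    "insert into results values (?, ?)",
    [
        ("a", "2025-07-28 20:00:00"),
        ("b", "2025-07-28 20:00:00"),
        ("c", "2025-07-27 09:00:00"),
    ],
)

# The secondary key breaks the tie, so the same row is returned on every run.
latest = conn.execute(
    "select id from results order by created_at desc, id desc limit 1"
).fetchone()[0]
print(latest)
conn.close()
```

Without `id desc`, the database is free to return either of the two tied rows, which is the non-determinism the review comment describes.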

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-SAMPLES_QUERY = """
-    with latest_elementary_test_result as (
-        select id
-        from {{{{ ref("elementary_test_results") }}}}
-        where lower(table_name) = lower('{test_id}')
-        order by created_at desc
-        limit 1
-    )
-
-    select result_row
-    from {{{{ ref("test_result_rows") }}}}
-    where elementary_test_results_id in (select * from latest_elementary_test_result)
-"""
+SAMPLES_QUERY = """
+    with latest_elementary_test_result as (
+        select id
+        from {{{{ ref("elementary_test_results") }}}}
+        where lower(table_name) = lower('{test_id}')
+        order by created_at desc, id desc
+        limit 1
+    )
+
+    select result_row
+    from {{{{ ref("test_result_rows") }}}}
+    where elementary_test_results_id in (select * from latest_elementary_test_result)
+"""

🤖 Prompt for AI Agents

In integration_tests/tests/test_sampling_pii.py around lines 9 to 21, the SAMPLES_QUERY CTE orders by created_at descending but lacks a secondary sort key, causing non-deterministic results if timestamps tie. Also, it does not handle the case when no test results exist for the given test_id. Fix this by adding a secondary order criterion such as id DESC to ensure deterministic ordering, and add logic to handle an empty CTE result, for example by raising an error or returning a default value to avoid silent failures.

devin-ai-integration bot had a problem deploying to elementary_test_env July 28, 2025 20:28 Failure

coderabbitai bot reviewed Jul 28, 2025

Fix SAMPLES_QUERY deterministic ordering per CR comment

c56e916

- Add 'id desc' as secondary sort key to ensure deterministic results
- Addresses CodeRabbit comment about non-deterministic ordering when timestamps tie

Co-Authored-By: Yosef Arbiv <[email protected]>

devin-ai-integration bot had a problem deploying to elementary_test_env July 29, 2025 08:00 Failure

devin-ai-integration bot closed this Jul 29, 2025

devin-ai-integration bot deleted the devin/ELE-4833-1753733759 branch July 29, 2025 08:47

devin-ai-integration bot temporarily deployed to elementary_test_env July 29, 2025 08:49 Inactive

devin-ai-integration bot had a problem deploying to elementary_test_env July 29, 2025 08:49 Error

devin-ai-integration bot mentioned this pull request Jul 30, 2025

Fix schema name generation for branch names with forward slashes #834

Merged

4 tasks
