Skip to content

Conversation

@devin-ai-integration
Copy link
Contributor

@devin-ai-integration devin-ai-integration bot commented Jul 28, 2025

Add configuration to disable sample collection for PII tables

Summary

This PR implements configuration to prevent sample collection for tables tagged as containing PII data, addressing Linear issue ELE-4833. When tests fail on PII tables, the system will skip collecting sample rows that might contain sensitive personal information.

Key Changes:

  • Configuration: Added disable_samples_on_pii_tables (boolean, default: false) and pii_table_tags (list/string, default: ['pii'])
  • PII Detection: New is_pii_table() macro checks if model tags intersect with configured PII tags
  • Sampling Logic: Modified test materialization to set sample_limit = 0 for PII tables when feature is enabled
  • Testing: Added integration tests covering PII/non-PII scenarios and feature toggle behavior
  • Framework: Extended dbt_project.test() method to support model_config parameter for test setup
  • Bug Fix: Fixed SAMPLES_QUERY deterministic ordering by adding secondary sort key (id desc)

Review & Testing Checklist for Human

  • End-to-end functionality: Test with actual models tagged as PII to verify samples are not collected when tests fail
  • Tag format compatibility: Test with different tag formats (string vs list, case sensitivity, multiple tags, empty tags)
  • Regression testing: Verify existing sampling functionality works normally for non-PII tables and when feature is disabled
  • Feature toggle validation: Confirm the feature can be properly enabled/disabled via dbt_project.yml configuration
  • Performance impact: Assess whether the new PII detection logic introduces any noticeable performance overhead

Recommended Test Plan:

  1. Create test models with tags: ['pii'] and run failing tests to confirm no samples are stored
  2. Test with disable_samples_on_pii_tables: false to ensure PII tables still collect samples when disabled
  3. Verify non-PII tagged models continue collecting samples normally
  4. Test edge cases like empty tags, mixed tag types, and custom PII tag names

Diagram

%%{ init : { "theme" : "default" }}%%
graph TD
    Config["macros/edr/system/system_utils/<br/>get_config_var.sql"]:::major-edit
    PiiHelper["macros/edr/system/system_utils/<br/>is_pii_table.sql"]:::major-edit
    TestMat["macros/edr/materializations/<br/>test/test.sql"]:::major-edit
    TestFramework["integration_tests/tests/<br/>dbt_project.py"]:::minor-edit
    IntegrationTest["integration_tests/tests/<br/>test_sampling_pii.py"]:::major-edit
    
    ListUtils["macros/utils/list_utils/<br/>lists_intersection.sql"]:::context
    DictUtils["macros/utils/dict_utils/<br/>insensitive_get_dict_value.sql"]:::context
    
    Config -->|"provides config vars"| PiiHelper
    PiiHelper -->|"called by"| TestMat
    TestMat -->|"sets sample_limit=0<br/>for PII tables"| SampleCollection["Sample Collection<br/>Process"]:::context
    
    ListUtils -->|"used for tag<br/>intersection"| PiiHelper
    DictUtils -->|"extracts model_tags<br/>from flattened_test"| PiiHelper
    
    TestFramework -->|"supports model_config<br/>for testing"| IntegrationTest
    IntegrationTest -->|"validates PII<br/>sampling behavior"| TestMat

    subgraph Legend
        L1[Major Edit]:::major-edit
        L2[Minor Edit]:::minor-edit  
        L3[Context/No Edit]:::context
    end

    classDef major-edit fill:#90EE90
    classDef minor-edit fill:#87CEEB
    classDef context fill:#FFFFFF
Loading

Notes

Environment Issue: Integration tests couldn't be executed locally due to dbt profile configuration issues (postgres target not found in elementary_tests profile). Additionally, all CI checks are failing at the "Write dbt profiles" step, which appears to be related to secrets.CI_PROFILES_YML configuration rather than code issues. The implementation follows established patterns in the codebase and passes all pre-commit hooks, but end-to-end verification is needed.

Implementation Approach: The solution leverages existing utility functions (lists_intersection, insensitive_get_dict_value) and follows the same patterns used elsewhere in the codebase for configuration and tag handling.

CR Comment Fix: Addressed CodeRabbit feedback by adding deterministic ordering (id desc) to SAMPLES_QUERY to prevent non-deterministic results when timestamps tie.


Link to Devin run: https://app.devin.ai/sessions/ad4bd978c51e445e84763d97943c38a8
Requested by: Yosef Arbiv (@arbiv)
Linear Issue: ELE-4833

Summary by CodeRabbit

  • New Features

    • Introduced configuration options to control sample data collection on tables tagged as containing PII (Personally Identifiable Information).
    • Added a macro to identify PII tables based on configurable tags.
    • Default configuration now includes options for disabling sample collection on PII tables and specifying PII-related tags.
  • Bug Fixes

    • Sample data is no longer collected for tables tagged as PII when the relevant configuration is enabled.
  • Tests

    • Added tests to verify sample collection behavior for PII and non-PII tables under different configuration scenarios.

- Add disable_samples_on_pii_tables and pii_table_tags config vars
- Create is_pii_table helper macro for PII detection
- Modify test materialization to skip sampling for PII tables
- Add integration tests for PII sampling behavior
- Extend dbt_project.test() to support model_config parameter

Fixes ELE-4833

Co-Authored-By: Yosef Arbiv <[email protected]>
@devin-ai-integration
Copy link
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@linear
Copy link

linear bot commented Jul 28, 2025

@coderabbitai
Copy link

coderabbitai bot commented Jul 28, 2025

Walkthrough

The changes introduce a configuration-driven mechanism to disable sample row collection for tables tagged as PII within dbt tests. Two new configuration variables are added, a macro to detect PII tables is implemented, and the test materialization logic is updated to conditionally suppress sample collection. Integration tests verify the new behavior.

Changes

Cohort / File(s) Change Summary
PII Sampling Configuration
macros/edr/system/system_utils/get_config_var.sql
Added disable_samples_on_pii_tables (default: false) and pii_table_tags (default: ['pii']) to the default config dictionary.
PII Table Detection Macro
macros/edr/system/system_utils/is_pii_table.sql
Introduced new macro is_pii_table to determine if a table is PII-tagged based on config and model tags.
Test Materialization Enforcement
macros/edr/materializations/test/test.sql
Updated handle_dbt_test macro: skips sample collection for PII tables when configured, by setting sample row limit to zero.
Integration Tests for PII Sampling
integration_tests/tests/test_sampling_pii.py
Added three tests to verify sample row collection behavior with various PII tag and configuration scenarios.
Test Method Extension
integration_tests/tests/dbt_project.py
Extended DbtProject.test() to accept and merge an optional model_config parameter for test setup flexibility.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant dbt_test_macro as handle_dbt_test (macro)
    participant is_pii_table_macro as is_pii_table (macro)
    participant Config
    participant DB

    User->>dbt_test_macro: Run test on table
    dbt_test_macro->>is_pii_table_macro: Check if table is PII
    is_pii_table_macro->>Config: Get disable_samples_on_pii_tables & pii_table_tags
    is_pii_table_macro-->>dbt_test_macro: Return true/false
    alt Table is PII and disabling is enabled
        dbt_test_macro->>DB: Do NOT collect sample rows (limit=0)
    else
        dbt_test_macro->>DB: Collect sample rows (limit from config)
    end
    dbt_test_macro-->>User: Return test results
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~18 minutes

Assessment against linked issues

Objective Addressed Explanation
Add disable_samples_on_pii_tables and pii_table_tags config variables (ELE-4833)
Implement macro is_pii_table(node) to detect PII tables (ELE-4833)
Update test materialization to skip sample collection for PII tables when configured (ELE-4833)
Add integration tests to verify sampling is disabled for PII tables and enabled for others (ELE-4833)

Poem

In the land of data, where secrets may lie,
A rabbit hops softly, with a watchful eye.
Now PII tables are safe from the sample’s keen view,
With configs and tags, we know what to do!
🥕 No more samples where privacy’s key—
Hooray for the code, and carrots for me!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60b3aac and c56e916.

📒 Files selected for processing (1)
  • integration_tests/tests/test_sampling_pii.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • integration_tests/tests/test_sampling_pii.py
✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch devin/ELE-4833-1753733759

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@github-actions
Copy link
Contributor

👋 @devin-ai-integration[bot]
Thank you for raising your pull request.
Please make sure to add tests and document all user-facing changes.
You can do this by editing the docs files in the elementary repository.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
macros/edr/system/system_utils/is_pii_table.sql (1)

1-14: LGTM with a minor suggestion for parameter naming.

The macro logic is sound with good performance optimization through early return. The implementation correctly:

  • Uses secure defaults when the feature is disabled
  • Leverages existing utility functions for tag intersection
  • Handles missing model_tags gracefully with default empty list

Consider renaming the parameter for better clarity:

-{% macro is_pii_table(flattened_test) %}
+{% macro is_pii_table(test_node) %}

This makes it clearer that the macro can work with any test node, not just flattened ones.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6d557d and 60b3aac.

📒 Files selected for processing (5)
  • integration_tests/tests/dbt_project.py (4 hunks)
  • integration_tests/tests/test_sampling_pii.py (1 hunks)
  • macros/edr/materializations/test/test.sql (1 hunks)
  • macros/edr/system/system_utils/get_config_var.sql (1 hunks)
  • macros/edr/system/system_utils/is_pii_table.sql (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
integration_tests/tests/dbt_project.py (2)

Learnt from: haritamar
PR: #825
File: models/run_results.yml:139-145
Timestamp: 2025-07-15T19:31:54.689Z
Learning: In dbt-fusion (dbt 2.0+), the meta configuration for models should be nested under the config block (i.e., config: meta:), not as a top-level meta key. This is different from traditional dbt where meta was a sibling of config.

Learnt from: haritamar
PR: #825
File: integration_tests/dbt_project/models/exposures.yml:18-24
Timestamp: 2025-07-15T19:31:02.744Z
Learning: For dbt exposures, the meta configuration can be nested under the config block (i.e., config: meta:). This is a valid and supported pattern in modern dbt versions, contrary to some older documentation that suggested only top-level meta blocks were supported.

macros/edr/materializations/test/test.sql (2)

Learnt from: haritamar
PR: #825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.

Learnt from: haritamar
PR: #827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

🔇 Additional comments (7)
macros/edr/system/system_utils/get_config_var.sql (1)

67-69: LGTM! Secure defaults for PII configuration.

The new configuration variables follow established patterns and use secure defaults:

  • disable_samples_on_pii_tables defaults to false, ensuring the feature is opt-in
  • pii_table_tags defaults to ['pii'], providing a sensible default while allowing customization
integration_tests/tests/dbt_project.py (2)

112-112: LGTM! Consistent parameter addition across overloads.

The model_config parameter is correctly added to all method signatures with proper type hints and default values.


167-168: Clean implementation of config merging.

The model_config is properly merged into the table_yaml dictionary, allowing tests to pass additional configuration like tags for PII testing scenarios.

macros/edr/materializations/test/test.sql (1)

53-57: Excellent security implementation for PII protection.

The conditional sample limit override effectively prevents sensitive data collection:

  • Defaults to configured sample count for normal operation
  • Sets sample_limit to 0 for PII tables, leveraging existing query optimization
  • Clean separation of concerns using the dedicated is_pii_table() macro

This approach ensures zero sample data exposure for PII-tagged tables when the feature is enabled.

integration_tests/tests/test_sampling_pii.py (3)

26-53: Comprehensive test coverage for PII sampling disabled scenario.

The test correctly verifies that PII-tagged tables don't collect samples when the feature is enabled:

  • Uses guaranteed-fail test scenario with 50 null values
  • Properly configures PII tags and feature flags
  • Verifies zero samples are collected despite test failure

55-82: Good verification that non-PII tables are unaffected.

This test ensures the feature doesn't interfere with normal sampling behavior for non-PII tables, maintaining expected functionality for tables tagged as "normal".


84-110: Essential test for feature toggle validation.

This test verifies that when the PII feature is disabled, even PII-tagged tables collect samples normally. This ensures backward compatibility and proper feature flag behavior.

Comment on lines 9 to 21
SAMPLES_QUERY = """
with latest_elementary_test_result as (
select id
from {{{{ ref("elementary_test_results") }}}}
where lower(table_name) = lower('{test_id}')
order by created_at desc
limit 1
)

select result_row
from {{{{ ref("test_result_rows") }}}}
where elementary_test_results_id in (select * from latest_elementary_test_result)
"""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Verify the query handles edge cases properly.

The SAMPLES_QUERY looks correct but should handle potential edge cases:

  • What if multiple tests exist for the same table_name?
  • Consider adding ORDER BY for deterministic results

🏁 Script executed:

#!/bin/bash
# Verify the query logic and check for potential issues with multiple test results

# Search for similar query patterns in the codebase to ensure consistency
rg -A 10 -B 5 "elementary_test_results.*table_name.*order by.*created_at desc"

# Look for any existing edge case handling in similar queries
rg -A 5 "limit 1" --type sql

Length of output: 2695


Ensure deterministic ordering and handle no-results in SAMPLES_QUERY

The current CTE picks the latest test by created_at DESC LIMIT 1, but if two runs share the same timestamp you may get a non-deterministic row. Also, if no test results exist for the given test_id, the final query simply returns no rows.

Please update integration_tests/tests/test_sampling_pii.py (lines 9–21) to:

  • Add a secondary sort key (e.g. id DESC) to make the choice deterministic
  • Optionally guard against an empty CTE (for example, by raising an error or returning a default)

Suggested diff:

     with latest_elementary_test_result as (
         select id
         from {{ ref("elementary_test_results") }}
         where lower(table_name) = lower('{test_id}')
-        order by created_at desc
+        order by created_at desc, id desc
         limit 1
     )
 
     select result_row
     from {{ ref("test_result_rows") }}
     where elementary_test_results_id in (
         select * from latest_elementary_test_result
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
SAMPLES_QUERY = """
with latest_elementary_test_result as (
select id
from {{{{ ref("elementary_test_results") }}}}
where lower(table_name) = lower('{test_id}')
order by created_at desc
limit 1
)
select result_row
from {{{{ ref("test_result_rows") }}}}
where elementary_test_results_id in (select * from latest_elementary_test_result)
"""
SAMPLES_QUERY = """
with latest_elementary_test_result as (
select id
from {{{{ ref("elementary_test_results") }}}}
where lower(table_name) = lower('{test_id}')
order by created_at desc, id desc
limit 1
)
select result_row
from {{{{ ref("test_result_rows") }}}}
where elementary_test_results_id in (select * from latest_elementary_test_result)
"""
🤖 Prompt for AI Agents
In integration_tests/tests/test_sampling_pii.py around lines 9 to 21, the
SAMPLES_QUERY CTE orders by created_at descending but lacks a secondary sort
key, causing non-deterministic results if timestamps tie. Also, it does not
handle the case when no test results exist for the given test_id. Fix this by
adding a secondary order criterion such as id DESC to ensure deterministic
ordering, and add logic to handle an empty CTE result, for example by raising an
error or returning a default value to avoid silent failures.

- Add 'id desc' as secondary sort key to ensure deterministic results
- Addresses CodeRabbit comment about non-deterministic ordering when timestamps tie

Co-Authored-By: Yosef Arbiv <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant