SFE-4426 Cluster config secrets #268
Open
arunpamulapati wants to merge 35 commits into release/0.6.0 from SFE-4426_cluster_config_secrets
Conversation
Fix two bugs in TruffleHog secret scanning:
1. Fix the incorrect column name in get_current_run_id(): changed from 'run_time' to 'check_time' to match the run_number_table schema.
2. Add comprehensive error handling for TruffleHog download failures, with clear user guidance to allowlist the required domains.
The column name mismatch caused the function to fall back to timestamp-based run IDs. Download failures now produce actionable messages directing users to contact their IT/Security teams.
The runID column in run_number_table is GENERATED ALWAYS AS IDENTITY, which means it auto-increments and cannot accept explicit values. Changed get_current_run_id() to:
- Insert only check_time (not runID)
- Let the database auto-generate runID via the IDENTITY column
- Retrieve the generated runID by querying the max value (see the sketch below)
This matches the pattern used in insertNewBatchRun() from common.py and fixes the error "DELTA_IDENTITY_COLUMNS_EXPLICIT_INSERT_NOT_SUPPORTED".
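For illustration, a minimal sketch of this IDENTITY-safe pattern as it would look in a Databricks notebook (where spark is predefined); the table name and the check_time representation are assumptions, not the exact SAT code:

from datetime import datetime, timezone

run_number_table = "security_analysis.run_number_table"  # assumed fully qualified name
check_time = int(datetime.now(timezone.utc).timestamp() * 1000)  # assumed epoch-millis format

# Insert only check_time; runID is GENERATED ALWAYS AS IDENTITY and fills itself in.
spark.sql(f"INSERT INTO {run_number_table} (check_time) VALUES ({check_time})")

# Read back the runID the database just generated.
current_run_id = spark.sql(
    f"SELECT MAX(runID) AS runID FROM {run_number_table}"
).collect()[0]["runID"]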
…Stages 1-2)
Stage 1: Rename secret_scan_results to notebooks_secret_scan_results
- Renamed the function in common.py: create_notebooks_secret_scan_results_table()
- Updated 5 references in trufflehog_scan.py (function calls, INSERTs, SELECTs)
- Updated 8 references in SAT_Dashboard_definition.json (all 4 datasets)
- Updated documentation in usage.mdx
Stage 2: Add the clusters_secret_scan_results table (see the schema sketch below)
- Created create_clusters_secret_scan_results_table() in common.py
- Added an initialization call in initialize.py
- Schema includes: cluster_id, cluster_name, config_field, config_key
- Same partitioning strategy (scan_date) as the notebooks table
This lays the foundation for scanning spark_env_vars in cluster configurations. Next: implement the cluster scanning logic and orchestration.
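A hedged sketch of what create_clusters_secret_scan_results_table() might produce, combining the columns named in this commit with the shared result columns listed later in the PR; the exact schema, types, and table location are assumptions:

spark.sql("""
    CREATE TABLE IF NOT EXISTS security_analysis.clusters_secret_scan_results (
        workspace_id   STRING,
        cluster_id     STRING,
        cluster_name   STRING,
        config_field   STRING,    -- e.g. 'spark_env_vars'
        config_key     STRING,    -- the specific key that matched
        detector_name  STRING,
        secret_sha256  STRING,
        verified       BOOLEAN,
        secrets_found  INT,
        run_id         BIGINT,
        scan_time      TIMESTAMP,
        scan_date      DATE
    )
    USING DELTA
    PARTITIONED BY (scan_date)   -- same partitioning strategy as the notebooks table
""")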
…Stages 3-4)
Stage 3: Create cluster_secrets_scan.py (650+ lines)
- Full cluster configuration secret scanning implementation
- Scans spark_env_vars using the TruffleHog dual-scan approach
- Pattern based on trufflehog_scan.py for consistency
- Functions: get_all_clusters(), extract_spark_env_vars(), serialize_env_vars_to_file()
- TruffleHog scanning: scan_cluster_config_for_secrets(), process_trufflehog_output()
- Database: insert_cluster_secret_scan_results(), insert_no_secrets_tracking_row()
- Main workflow: main_cluster_scanning_workflow()
- Stores results in clusters_secret_scan_results with cluster_id, config_field, config_key
Stage 4: Orchestrator integration in security_analysis_secrets_scanner.py (see the sketch below)
- Added generate_shared_run_id(): creates a shared run_id for both scans
- Added processClusterScan(): orchestrates cluster scanning per workspace
- Modified processTruffleHogScan(): now accepts a run_id parameter
- Modified runTruffleHogScanForAllWorkspaces(): runs both notebook and cluster scans
  * Generates a shared run_id per workspace
  * Calls both scans sequentially
  * Graceful error handling (one scan failure doesn't block the other)
Both notebook and cluster scans now share a run_id for correlation. Next: update the dashboard with UNION queries to show both sources.
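A rough sketch of the orchestration described in Stage 4; the function names come from this commit, while the bodies, parameters, and run_id scheme are assumptions:

import time
import traceback

def generate_shared_run_id() -> int:
    # Assumed: a timestamp-based id that both scans record for correlation.
    return int(time.time())

def runTruffleHogScanForAllWorkspaces(workspaces):
    # processTruffleHogScan and processClusterScan are the per-workspace
    # entry points defined elsewhere in the orchestrator notebook.
    for ws in workspaces:
        run_id = generate_shared_run_id()
        try:
            processTruffleHogScan(ws, run_id)   # notebook scan
        except Exception:
            traceback.print_exc()               # one scan failing must not block the other
        try:
            processClusterScan(ws, run_id)      # cluster config scan
        except Exception:
            traceback.print_exc()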
Update all 4 secret scanner datasets to combine notebook and cluster findings:
- Secret Scanner Metadata: add a clusters_with_secrets metric
- Secret Scanner Workspaces: UNION both tables for the workspace listing
- Secret Scanner Details By Workspace: add source_type, config_field, config_key columns
- Secret Scanner Metadata By Workspace: add a clusters_with_secrets metric
All queries now use UNION ALL to combine:
- notebooks_secret_scan_results (notebooks)
- clusters_secret_scan_results (clusters with spark_env_vars)
The shared run_id enables correlation between notebook and cluster scans. The dashboard now provides a unified view of all secrets across both sources.
…ION query
Fixed the NUM_COLUMNS_MISMATCH error by explicitly selecting columns instead of using s.* in the combined_results CTE. Both notebooks and clusters now select the same 9 columns (source_type, object_id, workspace_id, detector_name, secret_sha256, verified, secrets_found, run_id, scan_time) so the UNION lines up; a sketch of the query shape follows.
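A hedged sketch of that query shape; the 9-column list is from this commit, while the CTE layout and the literal source_type values are assumptions:

combined_results_query = """
    WITH combined_results AS (
        -- Both branches must project the same 9 columns in the same order,
        -- otherwise Spark raises NUM_COLUMNS_MISMATCH.
        SELECT 'notebook' AS source_type, notebook_id AS object_id, workspace_id,
               detector_name, secret_sha256, verified, secrets_found, run_id, scan_time
        FROM notebooks_secret_scan_results
        UNION ALL
        SELECT 'cluster' AS source_type, cluster_id AS object_id, workspace_id,
               detector_name, secret_sha256, verified, secrets_found, run_id, scan_time
        FROM clusters_secret_scan_results
    )
    SELECT * FROM combined_results
"""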
Update "Secret Scanner Details By Workspace" table widget to support both notebooks and clusters: - Rename notebook_name → object_name, notebook_path → object_path - Add new columns: source_type, config_field, config_key - Reorder columns: source_type(0), object_name(1), object_path(2), config_field(3), config_key(4), detection_type(5), secret_hash(6), scan_time_est(7) Table now displays both notebook and cluster secret findings in unified view.
Standardize cluster_secrets_scan.py setup to match trufflehog_scan.py:
- Add Step 1: Install Dependencies and Setup TruffleHog (%sh cell)
- Remove duplicate imports and logging configuration
- Remove duplicate db_client initialization
- Use the existing loggr from the common setup instead of creating a new logger
- Update comments to reflect TruffleHog installation in Step 1
- Streamline Step 2: Configuration and Authentication
- Keep Step 3: Configuration Setup with the Config class
Both scanners now follow an identical setup pattern for consistency.
Add a check for an existing TruffleHog binary before attempting installation. Since the orchestrator runs the notebook scanner first (which installs TruffleHog), the cluster scanner can reuse the existing binary instead of reinstalling. Changes (see the sketch below):
- Check whether /tmp/trufflehog exists before installation
- If it exists, skip installation and reuse the existing binary
- If it does not exist, proceed with the normal installation
- Add a success message after setup verification
This prevents installation conflicts when the cluster scanner runs after the notebook scanner.
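A minimal sketch of the reuse check, written in Python rather than the actual %sh cell; install_trufflehog() is a placeholder for the real download/extract steps:

import os
import subprocess

TRUFFLEHOG_PATH = "/tmp/trufflehog"

if os.path.exists(TRUFFLEHOG_PATH):
    print(f"TruffleHog already installed at {TRUFFLEHOG_PATH}; reusing existing binary.")
else:
    install_trufflehog(TRUFFLEHOG_PATH)  # placeholder for the download/extract steps

# Verify the binary responds before scanning (success message printed after this).
subprocess.run([TRUFFLEHOG_PATH, "--version"], check=False)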
…v vars
Add detailed logging to understand why clusters with spark_env_vars are being skipped:
1. Log the available keys in the cluster config for debugging
2. Check alternative field names (spark_env_variables, environment_variables, env_vars)
3. Print a visible status for each cluster processed:
   - ⚠️ Failed to get config
   - ⏭️ No environment variables found (with reason)
   - ✅ Found N environment variables
4. Show the first cluster's config key structure for debugging
This will help identify:
- The correct field name for environment variables in the API response
- Which clusters are being skipped and why
- Whether specific clusters (like "Arun's Personal Compute Cluster") are processed
Critical bug fix: get_cluster_info() was returning empty list instead of full config.
Before:
- Called .get('satelements', []) on API response
- Always returned empty list []
- spark_env_vars and other config fields were never available
After:
- Returns full API response from /clusters/get
- Includes all fields: spark_env_vars, spark_conf, custom_tags, etc.
- Enables cluster secret scanning to access environment variables
This fixes the issue where cluster scanner found 62 clusters but scanned 0.
All clusters appeared to have no spark_env_vars because the API response
was being discarded.
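A before/after sketch of the change; the surrounding SDK code is not shown in this PR, so the REST helper (api_get) and its usage here are assumptions:

# Before (bug): the raw /clusters/get response has no 'satelements' key,
# so this always returned [] and spark_env_vars was never visible.
def get_cluster_info_before(api_get, cluster_id):
    response = api_get(f"/clusters/get?cluster_id={cluster_id}")
    return response.get("satelements", [])

# After (fix): return the full API response so callers can read
# spark_env_vars, spark_conf, custom_tags, and the other config fields.
def get_cluster_info_after(api_get, cluster_id):
    return api_get(f"/clusters/get?cluster_id={cluster_id}")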
Build a new wheel with the clusters_client.get_cluster_info() fix. Changes:
- Bump version from 0.0.124 to 0.0.125
- Rebuild the wheel with the fixed get_cluster_info() method
- Add the wheel to notebooks/Includes/ for deployment
This wheel contains the critical fix that enables cluster secret scanning by returning the full cluster configuration, including spark_env_vars.
Update install_sat_sdk to install the new SDK version 0.0.125, which contains the critical get_cluster_info() fix for cluster secret scanning. Changes:
- Update SDK_VERSION from 0.1.38 to 0.0.125
- Install from the local wheel file in notebooks/Includes/
- Use --find-links to locate the wheel file
This ensures the cluster scanner can access full cluster configurations, including the spark_env_vars field.
Replace ClustersClient SDK calls with direct API calls to avoid SDK bugs. Changes (see the sketch below):
- Remove the dependency on ClustersClient from clientpkgs
- Use db_client.get() for direct API calls (same pattern as the notebook scanner)
- get_all_clusters(): direct call to the /clusters/list endpoint
- get_cluster_config(): direct call to the /clusters/get endpoint
- Returns the full cluster configuration, including spark_env_vars
This bypasses the broken SDK completely and makes the cluster scanner independent of SDK wheel updates. It uses the same reliable pattern as trufflehog_scan.py for notebook retrieval.
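A minimal sketch of the direct-call pattern; db_client.get() is the existing SAT HTTP helper, but its exact signature and the shape of the /clusters/list response wrapper are assumptions here:

def get_all_clusters(db_client):
    # Direct REST call, bypassing the ClustersClient SDK wrapper entirely.
    response = db_client.get("/clusters/list")
    # Assumed: the helper wraps results under 'satelements', as later commits show for /clusters/get.
    return response.get("satelements", []) or []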
… into SFE-4426_cluster_config_secrets
…tion
Add comprehensive INFO-level logging to understand why clusters aren't being scanned. Changes:
- Show config keys for the first 3 clusters (not just the first)
- Check and log spark_env_vars field presence for EVERY cluster
- Log the spark_env_vars value type when the field exists
- Special debug output for the test cluster (name containing "Arun" or "Personal")
- Show the full spark_env_vars value and keys for the test cluster
- Change logger.debug to logger.info for env var extraction
This will help diagnose:
- Whether the spark_env_vars field exists in cluster configs
- What type spark_env_vars is (dict, None, empty dict)
- Why the test cluster with env vars might not be detected
- Whether the API response structure is different than expected
… into SFE-4426_cluster_config_secrets
The dbutils.notebook.exit() at the end was clearing all debug output. Debug information is now captured in the return dict and survives the exit. Debug info includes:
- sample_cluster_configs: config keys from the first 3 clusters
- test_cluster_found: whether the test cluster (Arun/Personal) was found
- test_cluster_has_spark_env_vars: whether the test cluster has the field
- test_cluster_spark_env_vars_value: first 200 chars of the value
- test_cluster_config_keys: all config keys for the test cluster
- clusters_with_spark_env_vars_field: count of clusters with the field
This allows us to diagnose why clusters aren't being scanned, even after notebook.exit() clears the cell output.
…in API response
Critical bug fix: db_client.get() returns response in format:
{
'satelements': <actual_cluster_config>,
'http_status_code': 200
}
Previously we were using the wrapper dict as the cluster config, which only
had keys ['satelements', 'http_status_code'] instead of the actual cluster
fields like 'cluster_id', 'spark_version', 'spark_env_vars', etc.
Now we properly extract the cluster config from response['satelements'].
This was discovered via debug output showing:
"Cluster #1 config keys: ['satelements', 'http_status_code']"
This fix will allow spark_env_vars to be detected and scanned.
The db_client.get() response format for /clusters/get is:
{
'satelements': [<cluster_config>], # List with ONE element
'http_status_code': 200
}
Previous fix assumed satelements was a dict, but it's actually a list
containing a single dict element. Now we:
1. Check if satelements is a list
2. Extract the first (and only) element: satelements[0]
3. Return that as the cluster config
This fixes the error: 'list' object has no attribute 'keys'
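Combining the two response-format fixes, the extraction could look like this sketch (the function name matches the earlier commit; the db_client call style is an assumption):

def get_cluster_config(db_client, cluster_id):
    # db_client.get() wraps /clusters/get as:
    #   {'satelements': [<cluster_config>], 'http_status_code': 200}
    response = db_client.get(f"/clusters/get?cluster_id={cluster_id}")
    satelements = response.get("satelements")
    if isinstance(satelements, list):
        # List with a single element: the actual cluster config dict.
        return satelements[0] if satelements else {}
    # Defensive fallback if the helper ever returns the dict directly.
    return satelements or {}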
Fixed: local variable 'cluster_id' referenced before assignment. The error occurred because cluster_id and other variables were defined inside the try block; if an exception occurred during table creation (before variable assignment), the except block would try to use cluster_id, which hadn't been defined yet. Solution: move the variable extraction to the very beginning of the function, before the try block, so these variables are always defined even if an exception occurs early (see the sketch below).
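Roughly, the reordering amounts to this (the function signature and field names are assumptions):

def insert_cluster_secret_scan_results(cluster_config, secrets, logger):
    # Extract identifiers before the try block so the except handler can
    # always reference them, even if table creation fails early.
    cluster_id = cluster_config.get("cluster_id", "unknown")
    cluster_name = cluster_config.get("cluster_name", "unknown")
    try:
        # ... create the results table and insert rows (omitted in this sketch) ...
        pass
    except Exception as e:
        logger.error(f"Insertion failed for cluster {cluster_id} ({cluster_name}): {e}")
        raise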
Replaced dbutils.notebook.exit() with a comprehensive summary display to preserve all scan output and debug information. Changes:
- Remove the notebook.exit() call that was clearing all output
- Add a detailed final summary with statistics
- Include debug information in the output
- Add recommendations based on scan results
- Follow the same pattern as trufflehog_scan.py (no exit at the end)
The orchestrator doesn't use the return value anyway, so this lets users see the full scan progress and results without having the output cleared by exit(). The summary displays:
- Total clusters found and scanned
- Count of secrets detected
- Debug info (clusters with spark_env_vars, test cluster status)
- Recommended actions based on findings
- Best practices for secure configurations
Add a cleanup section matching the trufflehog_scan.py pattern to show:
- Temporary cluster config files in /tmp/clusters/
- The TruffleHog binary and config files
- Cleanup information and file lifecycle
This provides visibility into which files were created during the scan and confirms they will be cleaned up when the cluster terminates.
Add detailed INFO-level logging to diagnose why secrets aren't being inserted into the clusters_secret_scan_results table. Changes:
- Add logging before/after the insert_cluster_secret_scan_results call
- Add try/except around the insertion with detailed error logging
- Add logging inside the insert function for each step: table creation, processing each secret, SQL statement execution
- Log the full traceback on insertion errors
- Show progress indicators in the output
This will help identify:
- Whether table creation succeeds
- Which secrets are being processed
- Whether SQL execution succeeds
- What specific error occurs during insertion
…d idempotent TruffleHog installation
Rename:
- trufflehog_scan.py → notebook_secret_scan.py for clarity
Changes to notebook_secret_scan.py:
- Add an idempotent check for the TruffleHog installation
- Skip installation if /tmp/trufflehog already exists
- Prevents installation conflicts when running multiple scanners
Changes to cluster_secrets_scan.py:
- Add directory creation verification with a write test
- Ensure /tmp/clusters/ is writable before scanning starts
- Add defensive directory creation in serialize_env_vars_to_file()
- Update the comment to reference notebook_secret_scan.py
Changes to the orchestrator:
- Update the notebook path from trufflehog_scan to notebook_secret_scan
This ensures both scanners can run in sequence without conflicts.
…eck=True bug
CRITICAL BUG FIX: Notebook scanner was ignoring run_id passed from orchestrator,
breaking correlation with cluster scanner.
Changes:
1. Respect run_id from orchestrator
- Check json_.get("run_id") first
- Use passed run_id if provided (shared with cluster scan)
- Fall back to generate_run_id() for standalone execution
- Now matches cluster_secrets_scan.py pattern
2. Fix subprocess.run() check=True bug
- Remove check=True parameter from TruffleHog scans
- Was incorrectly raising exceptions on non-zero exit codes
- Comment said "Don't raise exception" but check=True does the opposite
- Now properly handles non-zero exit codes
Before:
- Notebook scanner always generated its own run_id
- Notebook and cluster scans had different run_ids
- No correlation possible between findings
After:
- Both scanners use shared run_id from orchestrator
- Enables correlation of notebook + cluster findings in same run
- Maintains backward compatibility for standalone execution
This fixes the fundamental issue preventing cross-scanner correlation.
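Both fixes reduce to a few lines; a sketch, where json_ is the notebook's parsed input parameters and generate_run_id() the existing fallback referenced above, while scan_dir and the TruffleHog arguments are illustrative:

import subprocess

# 1. Respect the orchestrator's run_id; fall back for standalone execution.
run_id = json_.get("run_id") or generate_run_id()

# 2. Drop check=True so a non-zero TruffleHog exit code doesn't raise;
#    inspect the result explicitly instead.
result = subprocess.run(
    ["/tmp/trufflehog", "filesystem", scan_dir, "--json", "--no-update"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    print(f"TruffleHog exited with code {result.returncode}: {result.stderr[:500]}")
findings_json = result.stdout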
…nces
Clean up cluster_secrets_scan.py for production.
Removed:
- Test cluster detection (Arun/Personal cluster references)
- The debug_info structure and all related tracking
- Sample cluster config logging (first 3 clusters)
- Excessive database insertion logging
- Verbose print statements for each step
Simplified:
- Logging for clusters without env vars (debug level only)
- Database insertion (removed per-secret logging)
- Progress messages (cleaner output)
- Final summary (removed the debug info section)
Result:
- 90 lines removed
- Production-ready code
- Essential logging preserved
- No personal/test-specific code
Ready for final test and PR.
…otebook environment
Fix import error: No module named 'common'. In Databricks notebooks, functions from files loaded via %run are available directly in the namespace, not as importable modules. Changed:
- Remove: from common import create_clusters_secret_scan_results_table
- Call directly: create_clusters_secret_scan_results_table()
This matches the pattern used in notebook_secret_scan.py and fixes the database insertion error (see the sketch below).
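For context, a sketch of the Databricks pattern this relies on; the relative path to common is an assumption:

# A separate notebook cell runs the helper notebook, which injects its
# functions straight into this notebook's namespace:
#
#   %run ../Includes/common
#
# After that cell executes, call the function directly -- no import needed:
create_clusters_secret_scan_results_table()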
CRITICAL SECURITY FIX: Escape single quotes in all SQL string values
to prevent SQL injection attacks.
Issue:
- Cluster/notebook names with apostrophes (e.g., "Arun Pamulapati's Cluster")
were breaking SQL queries
- SQL syntax error: VALUES ('...', 'Arun Pamulapati's...', ...)
would terminate string at first apostrophe
- Could be exploited for SQL injection if malicious names used
Fix:
- Escape all string values by replacing ' with '' (SQL standard)
- Applied to both cluster_secrets_scan.py and notebook_secret_scan.py
- Escaped values:
* workspace_id, cluster_id, cluster_name, notebook_id, notebook_path, notebook_name
* config_field, config_key, detector_name, secret_sha256, source_file
Before:
VALUES ('{workspace_id}', '{cluster_name}', ...)
-> FAILS with names containing apostrophes
After:
workspace_id_escaped = workspace_id.replace("'", "''")
VALUES ('{workspace_id_escaped}', '{cluster_name_escaped}', ...)
-> SAFE: "Arun Pamulapati's Cluster" becomes "Arun Pamulapati''s Cluster"
This prevents both SQL syntax errors and SQL injection attacks.
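A compact sketch of the escaping pattern; the helper name and the column subset shown are illustrative (the commits apply the same .replace("'", "''") inline to each value):

def sql_escape(value) -> str:
    # SQL-standard escaping: double any embedded single quote.
    return str(value).replace("'", "''")

workspace_id = "1234567890123456"
cluster_name = "Arun Pamulapati's Cluster"

insert_sql = (
    "INSERT INTO clusters_secret_scan_results (workspace_id, cluster_name) "
    f"VALUES ('{sql_escape(workspace_id)}', '{sql_escape(cluster_name)}')"
    # -> ... VALUES ('1234567890123456', 'Arun Pamulapati''s Cluster')
)

Parameterized statements would avoid manual escaping altogether; the PR keeps the existing string-interpolation style and escapes each value instead.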
- Fix whitespace indentation in workspace_bootstrap.py
- Clean up comment formatting in clusters_client.py
- Update the SDK version in setup.py
- Add CLAUDE.md to .gitignore to exclude it from version control
- Update the get_cluster_info() docstring to reflect actual usage
- Clean up trailing whitespace in comments
Description
Type of Change
How Has This Been Tested?
Checklist
Screenshots (If Applicable)
See dashboard.
Additional Notes
Also fixed some indentation issues and added CLAUDE.md to .gitignore.