@arunpamulapati

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (non-code changes like README or docs)

How Has This Been Tested?

Checklist

  • [x] I have reviewed the CONTRIBUTING and VERSIONING documentation.
  • My code follows the code style of this project.
  • I have verified, added or updated the tests to cover my changes (if applicable).
  • I have verified that my changes do not introduce any breaking changes on all the supported clouds (AWS, Azure, GCP).
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation (if applicable).

Screenshots (If Applicable)

See dashboard

Additional Notes

Also fixed some indentation issues and added CLAUDE.md to .gitignore.

arunpamulapati and others added 30 commits January 7, 2026 08:09
Fix two bugs in TruffleHog secret scanning:
1. Fix incorrect column name in get_current_run_id() - changed from
   'run_time' to 'check_time' to match run_number_table schema
2. Add comprehensive error handling for TruffleHog download failures
   with clear user guidance to allowlist required domains

The column name mismatch caused the function to fall back to
timestamp-based run IDs. Download failures now provide actionable
messages directing users to contact IT/Security teams.
The runID column in run_number_table is GENERATED ALWAYS AS IDENTITY,
which means it auto-increments and cannot accept explicit values.

Changed get_current_run_id() to:
- Only insert check_time (not runID)
- Let database auto-generate runID via IDENTITY column
- Retrieve generated runID by querying max value

This matches the pattern used in insertNewBatchRun() from common.py
and fixes the error: "DELTA_IDENTITY_COLUMNS_EXPLICIT_INSERT_NOT_SUPPORTED"
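
A minimal sketch of that pattern, assuming a Databricks notebook where spark is available and using the run_number_table name from this commit; the function body is illustrative, not the exact implementation:

    def get_current_run_id(spark, table="run_number_table"):
        # runID is GENERATED ALWAYS AS IDENTITY, so only check_time is supplied
        spark.sql(f"INSERT INTO {table} (check_time) VALUES (current_timestamp())")
        # read back the value the IDENTITY column just generated
        return spark.sql(f"SELECT max(runID) AS run_id FROM {table}").collect()[0]["run_id"]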
…Stages 1-2)

Stage 1: Rename secret_scan_results to notebooks_secret_scan_results
- Renamed function in common.py: create_notebooks_secret_scan_results_table()
- Updated 5 references in trufflehog_scan.py (function calls, INSERTs, SELECTs)
- Updated 8 references in SAT_Dashboard_definition.json (all 4 datasets)
- Updated documentation in usage.mdx

Stage 2: Add clusters_secret_scan_results table
- Created create_clusters_secret_scan_results_table() in common.py
- Added initialization call in initialize.py
- Schema includes: cluster_id, cluster_name, config_field, config_key
- Same partitioning strategy (scan_date) as notebooks table

This lays the foundation for scanning spark_env_vars in cluster configurations.
Next: Implement cluster scanning logic and orchestration.
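
A hedged sketch of what the new table might look like, combining the columns named above with the result columns referenced by the later dashboard commits; the types, Delta format, and exact column set are assumptions:

    spark.sql("""
        CREATE TABLE IF NOT EXISTS clusters_secret_scan_results (
            workspace_id  STRING,
            cluster_id    STRING,
            cluster_name  STRING,
            config_field  STRING,
            config_key    STRING,
            detector_name STRING,
            secret_sha256 STRING,
            verified      BOOLEAN,
            secrets_found INT,
            run_id        BIGINT,
            scan_time     TIMESTAMP,
            scan_date     DATE
        )
        USING DELTA
        PARTITIONED BY (scan_date)
    """)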
…Stages 3-4)

Stage 3: Create cluster_secrets_scan.py (650+ lines)
- Full cluster configuration secret scanning implementation
- Scans spark_env_vars using TruffleHog dual-scan approach
- Pattern based on trufflehog_scan.py for consistency
- Functions: get_all_clusters(), extract_spark_env_vars(), serialize_env_vars_to_file()
- TruffleHog scanning: scan_cluster_config_for_secrets(), process_trufflehog_output()
- Database: insert_cluster_secret_scan_results(), insert_no_secrets_tracking_row()
- Main workflow: main_cluster_scanning_workflow()
- Stores results in clusters_secret_scan_results table with cluster_id, config_field, config_key

Stage 4: Orchestrator integration in security_analysis_secrets_scanner.py
- Added generate_shared_run_id(): Creates shared run_id for both scans
- Added processClusterScan(): Orchestrates cluster scanning per workspace
- Modified processTruffleHogScan(): Now accepts run_id parameter
- Modified runTruffleHogScanForAllWorkspaces(): Runs both notebook + cluster scans
  * Generates shared run_id per workspace
  * Calls both scans sequentially
  * Graceful error handling (one scan failure doesn't block the other)

Both notebook and cluster scans now share run_id for correlation.
Next: Update dashboard with UNION queries to show both sources.
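
An illustrative sketch of the orchestration flow; the notebook names follow the later rename, while the timeout, parameter handling, and run_id format are assumptions:

    import time

    def generate_shared_run_id():
        # one run_id per workspace, shared by both scans for correlation
        return int(time.time())

    def run_secret_scans_for_workspace(dbutils, workspace_params):
        run_id = generate_shared_run_id()
        params = {**workspace_params, "run_id": str(run_id)}
        for notebook in ("notebook_secret_scan", "cluster_secrets_scan"):
            try:
                dbutils.notebook.run(notebook, 3600, params)
            except Exception as e:
                # one scan failing should not block the other
                print(f"{notebook} failed for run_id {run_id}: {e}")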
Update all 4 secret scanner datasets to combine notebook and cluster findings:
- Secret Scanner Metadata: Add clusters_with_secrets metric
- Secret Scanner Workspaces: UNION both tables for workspace listing
- Secret Scanner Details By Workspace: Add source_type, config_field, config_key columns
- Secret Scanner Metadata By Workspace: Add clusters_with_secrets metric

All queries now use UNION ALL to combine:
- notebooks_secret_scan_results (notebooks)
- clusters_secret_scan_results (clusters with spark_env_vars)

Shared run_id enables correlation between notebook and cluster scans.
Dashboard now provides unified view of all secrets across both sources.
…ION query

Fixed the NUM_COLUMNS_MISMATCH error by explicitly selecting columns instead
of using s.* in the combined_results CTE. Both notebooks and clusters now
select the same 9 columns (source_type, object_id, workspace_id, detector_name,
secret_sha256, verified, secrets_found, run_id, scan_time) so the UNION is valid.
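
A sketch of the combined query used by the dashboard datasets, shown via spark.sql for illustration; the source_type literals and the per-table column names are assumptions, but the explicit nine-column projection on both sides of the UNION is the point:

    spark.sql("""
        WITH combined_results AS (
            SELECT 'notebook' AS source_type, notebook_id AS object_id, workspace_id,
                   detector_name, secret_sha256, verified, secrets_found, run_id, scan_time
            FROM notebooks_secret_scan_results
            UNION ALL
            SELECT 'cluster' AS source_type, cluster_id AS object_id, workspace_id,
                   detector_name, secret_sha256, verified, secrets_found, run_id, scan_time
            FROM clusters_secret_scan_results
        )
        SELECT * FROM combined_results
    """)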
Update "Secret Scanner Details By Workspace" table widget to support
both notebooks and clusters:
- Rename notebook_name → object_name, notebook_path → object_path
- Add new columns: source_type, config_field, config_key
- Reorder columns: source_type(0), object_name(1), object_path(2),
  config_field(3), config_key(4), detection_type(5), secret_hash(6),
  scan_time_est(7)

Table now displays both notebook and cluster secret findings in unified view.
Standardize cluster_secrets_scan.py setup to match trufflehog_scan.py:
- Add Step 1: Install Dependencies and Setup TruffleHog (%sh cell)
- Remove duplicate imports and logging configuration
- Remove duplicate db_client initialization
- Use existing loggr from common setup instead of creating new logger
- Update comments to reflect TruffleHog installation in Step 1
- Streamline Step 2: Configuration and Authentication
- Keep Step 3: Configuration Setup with Config class

Both scanners now follow identical setup pattern for consistency.
Add a check for an existing TruffleHog binary before attempting installation.
Since the orchestrator runs the notebook scanner first (which installs TruffleHog),
the cluster scanner can reuse the existing binary instead of reinstalling.

Changes:
- Check if /tmp/trufflehog exists before installation
- If exists, skip installation and reuse existing binary
- If not exists, proceed with normal installation
- Add success message after setup verification

This prevents installation conflicts when cluster scanner runs after notebook scanner.
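
A minimal sketch of the reuse check; the /tmp/trufflehog path comes from the commit, while install_trufflehog() is a hypothetical stand-in for the existing Step 1 install cell:

    import os

    TRUFFLEHOG_BIN = "/tmp/trufflehog"

    if os.path.exists(TRUFFLEHOG_BIN):
        print("TruffleHog binary already present, reusing it")
    else:
        # normal installation path (the download domains must be allowlisted)
        install_trufflehog()  # hypothetical helper standing in for the %sh install cell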
…v vars

Add detailed logging to understand why clusters with spark_env_vars are being skipped:

1. Log available keys in cluster config for debugging
2. Check alternative field names (spark_env_variables, environment_variables, env_vars)
3. Print visible status for each cluster processed:
   - ⚠️ Failed to get config
   - ⏭️ No environment variables found (with reason)
   - ✅ Found N environment variables
4. Show first cluster config keys structure for debugging

This will help identify:
- Correct field name for environment variables in API response
- Which clusters are being skipped and why
- If specific clusters (like "Arun's Personal Compute Cluster") are processed
Critical bug fix: get_cluster_info() was returning empty list instead of full config.

Before:
- Called .get('satelements', []) on API response
- Always returned empty list []
- spark_env_vars and other config fields were never available

After:
- Returns full API response from /clusters/get
- Includes all fields: spark_env_vars, spark_conf, custom_tags, etc.
- Enables cluster secret scanning to access environment variables

This fixes the issue where cluster scanner found 62 clusters but scanned 0.
All clusters appeared to have no spark_env_vars because the API response
was being discarded.
Build new wheel with clusters_client.get_cluster_info() fix.

Changes:
- Bump version from 0.0.124 to 0.0.125
- Rebuild wheel with fixed get_cluster_info() method
- Add wheel to notebooks/Includes/ for deployment

This wheel contains the critical fix that enables cluster secret scanning
by returning the full cluster configuration including spark_env_vars.
Update install_sat_sdk to install the new SDK version 0.0.125 which
contains the critical get_cluster_info() fix for cluster secret scanning.

Changes:
- Update SDK_VERSION from 0.1.38 to 0.0.125
- Install from local wheel file in notebooks/Includes/
- Use --find-links to locate the wheel file

This ensures cluster scanner can access full cluster configurations
including spark_env_vars field.
Replace ClustersClient SDK calls with direct API calls to avoid SDK bugs.

Changes:
- Remove dependency on ClustersClient from clientpkgs
- Use db_client.get() for direct API calls (same pattern as notebook scanner)
- get_all_clusters(): Direct call to /clusters/list endpoint
- get_cluster_config(): Direct call to /clusters/get endpoint
- Returns full cluster configuration including spark_env_vars

This bypasses the broken SDK completely and makes the cluster scanner
independent of SDK wheel updates. Uses same reliable pattern as
trufflehog_scan.py for notebook retrieval.
…tion

Add comprehensive INFO-level logging to understand why clusters aren't being scanned:

Changes:
- Show config keys for first 3 clusters (not just first)
- Check and log spark_env_vars field presence for EVERY cluster
- Log spark_env_vars value type when field exists
- Special debug output for test cluster (containing "Arun" or "Personal")
- Show full spark_env_vars value and keys for test cluster
- Change logger.debug to logger.info for env var extraction

This will help diagnose:
- Whether spark_env_vars field exists in cluster configs
- What type spark_env_vars is (dict, None, empty dict)
- Why test cluster with env vars might not be detected
- If API response structure is different than expected
The dbutils.notebook.exit() at the end was clearing all debug output.
Now debug information is captured in the return dict and survives the exit.

Debug info includes:
- sample_cluster_configs: Config keys from first 3 clusters
- test_cluster_found: Whether test cluster (Arun/Personal) was found
- test_cluster_has_spark_env_vars: Whether test cluster has the field
- test_cluster_spark_env_vars_value: First 200 chars of the value
- test_cluster_config_keys: All config keys for test cluster
- clusters_with_spark_env_vars_field: Count of clusters with the field

This allows us to diagnose why clusters aren't being scanned even after
notebook.exit() clears the cell output.
…in API response

Critical bug fix: db_client.get() returns response in format:
{
  'satelements': <actual_cluster_config>,
  'http_status_code': 200
}

Previously we were using the wrapper dict as the cluster config, which only
had keys ['satelements', 'http_status_code'] instead of the actual cluster
fields like 'cluster_id', 'spark_version', 'spark_env_vars', etc.

Now we properly extract the cluster config from response['satelements'].

This was discovered via debug output showing:
"Cluster #1 config keys: ['satelements', 'http_status_code']"

This fix will allow spark_env_vars to be detected and scanned.
The db_client.get() response format for /clusters/get is:
{
  'satelements': [<cluster_config>],  # List with ONE element
  'http_status_code': 200
}

Previous fix assumed satelements was a dict, but it's actually a list
containing a single dict element. Now we:
1. Check if satelements is a list
2. Extract the first (and only) element: satelements[0]
3. Return that as the cluster config

This fixes the error: 'list' object has no attribute 'keys'
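
A sketch of the unwrapping described in the last two commits; the response shape follows the commit messages, while the db_client.get() call shape and the function name are assumptions:

    def get_cluster_config(db_client, cluster_id):
        # db_client.get() wraps the REST payload as
        # {'satelements': [<cluster_config>], 'http_status_code': 200}
        response = db_client.get(f"/clusters/get?cluster_id={cluster_id}")
        payload = (response or {}).get("satelements")
        if isinstance(payload, list):
            # /clusters/get describes a single cluster, so take the only element
            return payload[0] if payload else None
        return payload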
Fixed: local variable 'cluster_id' referenced before assignment

The error occurred because cluster_id and other variables were defined
inside the try block. If an exception occurred during table creation
(before variable assignment), the except block would try to use cluster_id
which hadn't been defined yet.

Solution: Move variable extraction to the very beginning of the function,
before the try block. Now these variables are always defined, even if an
exception occurs early.
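
An illustrative shape of the fix; the field names match the results table, everything else is a sketch:

    def insert_cluster_secret_scan_results(secret):
        # extract identifiers before the try block so the except handler can
        # always reference them, even if table creation fails early
        cluster_id = secret.get("cluster_id", "unknown")
        cluster_name = secret.get("cluster_name", "unknown")
        try:
            create_clusters_secret_scan_results_table()
            # ... build and execute the INSERT statement here ...
        except Exception as e:
            loggr.error(f"Insert failed for cluster {cluster_id} ({cluster_name}): {e}")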
Replaced dbutils.notebook.exit() with comprehensive summary display
to preserve all scan output and debug information.

Changes:
- Remove notebook.exit() call that was clearing all output
- Add detailed final summary with statistics
- Include debug information in output
- Add recommendations based on scan results
- Follow same pattern as trufflehog_scan.py (no exit at end)

The orchestrator doesn't use the return value anyway, so this allows
users to see the full scan progress and results without having output
cleared by exit().

Summary displays:
- Total clusters found and scanned
- Secrets detected count
- Debug info (clusters with spark_env_vars, test cluster status)
- Recommended actions based on findings
- Best practices for secure configurations
Add cleanup section matching trufflehog_scan.py pattern to show:
- Temporary cluster config files in /tmp/clusters/
- TruffleHog binary and config files
- Cleanup information and file lifecycle

This provides visibility into what files were created during the scan
and confirms they will be cleaned up when cluster terminates.
Add detailed INFO-level logging to diagnose why secrets aren't being
inserted into clusters_secret_scan_results table:

Changes:
- Add logging before/after insert_cluster_secret_scan_results call
- Add try/except around insertion with detailed error logging
- Add logging inside insert function for each step:
  - Table creation
  - Processing each secret
  - SQL statement execution
- Log full traceback on insertion errors
- Show progress indicators in output

This will help identify:
- Whether table creation succeeds
- Which secrets are being processed
- Whether SQL execution succeeds
- What specific error occurs during insertion
…d idempotent TruffleHog installation

Rename:
- trufflehog_scan.py → notebook_secret_scan.py for clarity

Changes to notebook_secret_scan.py:
- Add idempotent check for TruffleHog installation
- Skip installation if /tmp/trufflehog already exists
- Prevents installation conflicts when running multiple scanners

Changes to cluster_secrets_scan.py:
- Add directory creation verification with write test
- Ensure /tmp/clusters/ is writable before scanning starts
- Add defensive directory creation in serialize_env_vars_to_file()
- Update comment to reference notebook_secret_scan.py

Changes to orchestrator:
- Update notebook path from trufflehog_scan to notebook_secret_scan

This ensures both scanners can run in sequence without conflicts.
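
A small sketch of the directory write test; the /tmp/clusters/ path appears in the cleanup commit above, while the probe filename is illustrative:

    import os

    clusters_dir = "/tmp/clusters/"
    os.makedirs(clusters_dir, exist_ok=True)
    probe = os.path.join(clusters_dir, ".write_test")
    with open(probe, "w") as f:
        f.write("ok")  # fails fast if the directory is not writable
    os.remove(probe)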
…eck=True bug

CRITICAL BUG FIX: Notebook scanner was ignoring run_id passed from orchestrator,
breaking correlation with cluster scanner.

Changes:
1. Respect run_id from orchestrator
   - Check json_.get("run_id") first
   - Use passed run_id if provided (shared with cluster scan)
   - Fall back to generate_run_id() for standalone execution
   - Now matches cluster_secrets_scan.py pattern

2. Fix subprocess.run() check=True bug
   - Remove check=True parameter from TruffleHog scans
   - Was incorrectly raising exceptions on non-zero exit codes
   - Comment said "Don't raise exception" but check=True does the opposite
   - Now properly handles non-zero exit codes

Before:
- Notebook scanner always generated its own run_id
- Notebook and cluster scans had different run_ids
- No correlation possible between findings

After:
- Both scanners use shared run_id from orchestrator
- Enables correlation of notebook + cluster findings in same run
- Maintains backward compatibility for standalone execution

This fixes the fundamental issue preventing cross-scanner correlation.
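
A sketch of the two changes; json_ and generate_run_id() come from the commit message, while the scan directory and the TruffleHog invocation flags are assumptions:

    import subprocess

    # 1. Respect the run_id passed by the orchestrator, fall back for standalone runs
    run_id = json_.get("run_id") or generate_run_id()

    # 2. No check=True: inspect the result instead of raising on non-zero exit codes
    scan_dir = "/tmp/notebook_exports"  # illustrative path
    result = subprocess.run(
        ["/tmp/trufflehog", "filesystem", scan_dir, "--json"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        loggr.warning(f"TruffleHog exited with code {result.returncode}: {result.stderr[:500]}")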
…nces

Clean up cluster_secrets_scan.py for production:

Removed:
- Test cluster detection (Arun/Personal cluster references)
- debug_info structure and all related tracking
- Sample cluster config logging (first 3 clusters)
- Excessive database insertion logging
- Verbose print statements for each step

Simplified:
- Logging for clusters without env vars (debug level only)
- Database insertion (removed per-secret logging)
- Progress messages (cleaner output)
- Final summary (removed debug info section)

Result:
- 90 lines removed
- Production-ready code
- Essential logging preserved
- No personal/test-specific code

Ready for final test and PR.
…otebook environment

Fix import error: No module named 'common'

In Databricks notebooks, functions from files loaded via %run are
available directly in the namespace, not as importable modules.

Changed:
- Remove: from common import create_clusters_secret_scan_results_table
- Direct call: create_clusters_secret_scan_results_table()

This matches the pattern used in notebook_secret_scan.py and fixes
the database insertion error.
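
A short illustration of that Databricks behaviour; the %run target path is an assumption:

    # %run ./common
    # After %run, everything common defines is in this notebook's globals,
    # so the function is called directly:
    create_clusters_secret_scan_results_table()

    # A plain import fails because common is not an importable module here:
    # from common import create_clusters_secret_scan_results_table  -> ModuleNotFoundError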
CRITICAL SECURITY FIX: Escape single quotes in all SQL string values
to prevent SQL injection attacks.

Issue:
- Cluster/notebook names with apostrophes (e.g., "Arun Pamulapati's Cluster")
  were breaking SQL queries
- SQL syntax error: VALUES ('...', 'Arun Pamulapati's...', ...)
  would terminate the string at the first apostrophe
- Could be exploited for SQL injection if malicious names are used

Fix:
- Escape all string values by replacing ' with '' (SQL standard)
- Applied to both cluster_secrets_scan.py and notebook_secret_scan.py
- Escaped values:
  * workspace_id, cluster_id, cluster_name, notebook_id, notebook_path, notebook_name
  * config_field, config_key, detector_name, secret_sha256, source_file

Before:
VALUES ('{workspace_id}', '{cluster_name}', ...)
-> FAILS with names containing apostrophes

After:
workspace_id_escaped = workspace_id.replace("'", "''")
VALUES ('{workspace_id_escaped}', '{cluster_name_escaped}', ...)
-> SAFE: "Arun Pamulapati's Cluster" becomes "Arun Pamulapati''s Cluster"

This prevents both SQL syntax errors and SQL injection attacks.
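
A minimal helper sketch for the escaping above; the helper name and example values are illustrative, and parameterized statements would be a stronger defense where the execution API supports them:

    def sql_quote(value) -> str:
        """Escape embedded single quotes per the SQL standard ('' for ')."""
        return str(value).replace("'", "''")

    workspace_id, cluster_id = "1234567890", "0101-000000-abc123"  # example values
    cluster_name_escaped = sql_quote("Arun Pamulapati's Cluster")
    insert_sql = (
        "INSERT INTO clusters_secret_scan_results (workspace_id, cluster_id, cluster_name) "
        f"VALUES ('{sql_quote(workspace_id)}', '{sql_quote(cluster_id)}', '{cluster_name_escaped}')"
    )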
- Fix whitespace indentation in workspace_bootstrap.py
- Clean up comment formatting in clusters_client.py
- Update SDK version in setup.py
- Add CLAUDE.md to .gitignore to exclude from version control
- Update get_cluster_info() docstring to reflect actual usage
- Clean up trailing whitespace in comments