@arunpamulapati

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (non-code changes like README or docs)

How Has This Been Tested?

Checklist

  • [x] I have reviewed the CONTRIBUTING and VERSIONING documentation.
  • My code follows the code style of this project.
  • I have verified, added or updated the tests to cover my changes (if applicable).
  • I have verified that my changes do not introduce any breaking changes on all the supported clouds (AWS, Azure, GCP).
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation (if applicable).

Screenshots (If Applicable)

See dashboard

Additional Notes

Also fixed some indentation issues and added CLAUDE.md to .gitignore.

arunpamulapati and others added 30 commits January 7, 2026 08:09
Fix two bugs in TruffleHog secret scanning:
1. Fix incorrect column name in get_current_run_id() - changed from
   'run_time' to 'check_time' to match run_number_table schema
2. Add comprehensive error handling for TruffleHog download failures
   with clear user guidance to allowlist required domains

The column name mismatch caused the function to fall back to
timestamp-based run IDs. Download failures now provide actionable
messages directing users to contact IT/Security teams.
The runID column in run_number_table is GENERATED ALWAYS AS IDENTITY,
which means it auto-increments and cannot accept explicit values.

Changed get_current_run_id() to:
- Only insert check_time (not runID)
- Let database auto-generate runID via IDENTITY column
- Retrieve generated runID by querying max value

This matches the pattern used in insertNewBatchRun() from common.py
and fixes the error: "DELTA_IDENTITY_COLUMNS_EXPLICIT_INSERT_NOT_SUPPORTED"
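
A minimal sketch of that pattern, assuming a Databricks notebook where spark is available and using the run_number_table name from this commit; the function body is illustrative, not the exact implementation:

    def get_current_run_id(spark, table="run_number_table"):
        # runID is GENERATED ALWAYS AS IDENTITY, so only check_time is supplied
        spark.sql(f"INSERT INTO {table} (check_time) VALUES (current_timestamp())")
        # read back the value the IDENTITY column just generated
        return spark.sql(f"SELECT max(runID) AS run_id FROM {table}").collect()[0]["run_id"]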
…Stages 1-2)

Stage 1: Rename secret_scan_results to notebooks_secret_scan_results
- Renamed function in common.py: create_notebooks_secret_scan_results_table()
- Updated 5 references in trufflehog_scan.py (function calls, INSERTs, SELECTs)
- Updated 8 references in SAT_Dashboard_definition.json (all 4 datasets)
- Updated documentation in usage.mdx

Stage 2: Add clusters_secret_scan_results table
- Created create_clusters_secret_scan_results_table() in common.py
- Added initialization call in initialize.py
- Schema includes: cluster_id, cluster_name, config_field, config_key
- Same partitioning strategy (scan_date) as notebooks table

This lays the foundation for scanning spark_env_vars in cluster configurations.
Next: Implement cluster scanning logic and orchestration.
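
A hedged sketch of what the new table might look like, combining the columns named above with the result columns referenced by the later dashboard commits; the types, Delta format, and exact column set are assumptions:

    spark.sql("""
        CREATE TABLE IF NOT EXISTS clusters_secret_scan_results (
            workspace_id  STRING,
            cluster_id    STRING,
            cluster_name  STRING,
            config_field  STRING,
            config_key    STRING,
            detector_name STRING,
            secret_sha256 STRING,
            verified      BOOLEAN,
            secrets_found INT,
            run_id        BIGINT,
            scan_time     TIMESTAMP,
            scan_date     DATE
        )
        USING DELTA
        PARTITIONED BY (scan_date)
    """)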
…Stages 3-4)

Stage 3: Create cluster_secrets_scan.py (650+ lines)
- Full cluster configuration secret scanning implementation
- Scans spark_env_vars using TruffleHog dual-scan approach
- Pattern based on trufflehog_scan.py for consistency
- Functions: get_all_clusters(), extract_spark_env_vars(), serialize_env_vars_to_file()
- TruffleHog scanning: scan_cluster_config_for_secrets(), process_trufflehog_output()
- Database: insert_cluster_secret_scan_results(), insert_no_secrets_tracking_row()
- Main workflow: main_cluster_scanning_workflow()
- Stores results in clusters_secret_scan_results table with cluster_id, config_field, config_key

Stage 4: Orchestrator integration in security_analysis_secrets_scanner.py
- Added generate_shared_run_id(): Creates shared run_id for both scans
- Added processClusterScan(): Orchestrates cluster scanning per workspace
- Modified processTruffleHogScan(): Now accepts run_id parameter
- Modified runTruffleHogScanForAllWorkspaces(): Runs both notebook + cluster scans
  * Generates shared run_id per workspace
  * Calls both scans sequentially
  * Graceful error handling (one scan failure doesn't block the other)

Both notebook and cluster scans now share run_id for correlation.
Next: Update dashboard with UNION queries to show both sources.
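
An illustrative sketch of the orchestration flow; the notebook names follow the later rename, while the timeout, parameter handling, and run_id format are assumptions:

    import time

    def generate_shared_run_id():
        # one run_id per workspace, shared by both scans for correlation
        return int(time.time())

    def run_secret_scans_for_workspace(dbutils, workspace_params):
        run_id = generate_shared_run_id()
        params = {**workspace_params, "run_id": str(run_id)}
        for notebook in ("notebook_secret_scan", "cluster_secrets_scan"):
            try:
                dbutils.notebook.run(notebook, 3600, params)
            except Exception as e:
                # one scan failing should not block the other
                print(f"{notebook} failed for run_id {run_id}: {e}")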
Update all 4 secret scanner datasets to combine notebook and cluster findings:
- Secret Scanner Metadata: Add clusters_with_secrets metric
- Secret Scanner Workspaces: UNION both tables for workspace listing
- Secret Scanner Details By Workspace: Add source_type, config_field, config_key columns
- Secret Scanner Metadata By Workspace: Add clusters_with_secrets metric

All queries now use UNION ALL to combine:
- notebooks_secret_scan_results (notebooks)
- clusters_secret_scan_results (clusters with spark_env_vars)

Shared run_id enables correlation between notebook and cluster scans.
Dashboard now provides unified view of all secrets across both sources.
…ION query

Fixed the NUM_COLUMNS_MISMATCH error by explicitly selecting columns instead
of using s.* in the combined_results CTE. Both notebooks and clusters now
select the same 9 columns (source_type, object_id, workspace_id, detector_name,
secret_sha256, verified, secrets_found, run_id, scan_time) so the UNION is valid.
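
A sketch of the combined query used by the dashboard datasets, shown via spark.sql for illustration; the source_type literals and the per-table column names are assumptions, but the explicit nine-column projection on both sides of the UNION is the point:

    spark.sql("""
        WITH combined_results AS (
            SELECT 'notebook' AS source_type, notebook_id AS object_id, workspace_id,
                   detector_name, secret_sha256, verified, secrets_found, run_id, scan_time
            FROM notebooks_secret_scan_results
            UNION ALL
            SELECT 'cluster' AS source_type, cluster_id AS object_id, workspace_id,
                   detector_name, secret_sha256, verified, secrets_found, run_id, scan_time
            FROM clusters_secret_scan_results
        )
        SELECT * FROM combined_results
    """)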
Update "Secret Scanner Details By Workspace" table widget to support
both notebooks and clusters:
- Rename notebook_name → object_name, notebook_path → object_path
- Add new columns: source_type, config_field, config_key
- Reorder columns: source_type(0), object_name(1), object_path(2),
  config_field(3), config_key(4), detection_type(5), secret_hash(6),
  scan_time_est(7)

Table now displays both notebook and cluster secret findings in unified view.
Standardize cluster_secrets_scan.py setup to match trufflehog_scan.py:
- Add Step 1: Install Dependencies and Setup TruffleHog (%sh cell)
- Remove duplicate imports and logging configuration
- Remove duplicate db_client initialization
- Use existing loggr from common setup instead of creating new logger
- Update comments to reflect TruffleHog installation in Step 1
- Streamline Step 2: Configuration and Authentication
- Keep Step 3: Configuration Setup with Config class

Both scanners now follow identical setup pattern for consistency.
Add a check for an existing TruffleHog binary before attempting installation.
Since the orchestrator runs the notebook scanner first (which installs TruffleHog),
the cluster scanner can reuse the existing binary instead of reinstalling.

Changes:
- Check if /tmp/trufflehog exists before installation
- If exists, skip installation and reuse existing binary
- If not exists, proceed with normal installation
- Add success message after setup verification

This prevents installation conflicts when cluster scanner runs after notebook scanner.
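
A minimal sketch of the reuse check; the /tmp/trufflehog path comes from the commit, while install_trufflehog() is a hypothetical stand-in for the existing Step 1 install cell:

    import os

    TRUFFLEHOG_BIN = "/tmp/trufflehog"

    if os.path.exists(TRUFFLEHOG_BIN):
        print("TruffleHog binary already present, reusing it")
    else:
        # normal installation path (the download domains must be allowlisted)
        install_trufflehog()  # hypothetical helper standing in for the %sh install cell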
…v vars

Add detailed logging to understand why clusters with spark_env_vars are being skipped:

1. Log available keys in cluster config for debugging
2. Check alternative field names (spark_env_variables, environment_variables, env_vars)
3. Print visible status for each cluster processed:
   - ⚠️ Failed to get config
   - ⏭️ No environment variables found (with reason)
   - ✅ Found N environment variables
4. Show first cluster config keys structure for debugging

This will help identify:
- Correct field name for environment variables in API response
- Which clusters are being skipped and why
- If specific clusters (like "Arun's Personal Compute Cluster") are processed
Critical bug fix: get_cluster_info() was returning empty list instead of full config.

Before:
- Called .get('satelements', []) on API response
- Always returned empty list []
- spark_env_vars and other config fields were never available

After:
- Returns full API response from /clusters/get
- Includes all fields: spark_env_vars, spark_conf, custom_tags, etc.
- Enables cluster secret scanning to access environment variables

This fixes the issue where cluster scanner found 62 clusters but scanned 0.
All clusters appeared to have no spark_env_vars because the API response
was being discarded.
Build new wheel with clusters_client.get_cluster_info() fix.

Changes:
- Bump version from 0.0.124 to 0.0.125
- Rebuild wheel with fixed get_cluster_info() method
- Add wheel to notebooks/Includes/ for deployment

This wheel contains the critical fix that enables cluster secret scanning
by returning the full cluster configuration including spark_env_vars.
Update install_sat_sdk to install the new SDK version 0.0.125 which
contains the critical get_cluster_info() fix for cluster secret scanning.

Changes:
- Update SDK_VERSION from 0.1.38 to 0.0.125
- Install from local wheel file in notebooks/Includes/
- Use --find-links to locate the wheel file

This ensures cluster scanner can access full cluster configurations
including spark_env_vars field.
Replace ClustersClient SDK calls with direct API calls to avoid SDK bugs.

Changes:
- Remove dependency on ClustersClient from clientpkgs
- Use db_client.get() for direct API calls (same pattern as notebook scanner)
- get_all_clusters(): Direct call to /clusters/list endpoint
- get_cluster_config(): Direct call to /clusters/get endpoint
- Returns full cluster configuration including spark_env_vars

This bypasses the broken SDK completely and makes the cluster scanner
independent of SDK wheel updates. Uses same reliable pattern as
trufflehog_scan.py for notebook retrieval.
…tion

Add comprehensive INFO-level logging to understand why clusters aren't being scanned:

Changes:
- Show config keys for first 3 clusters (not just first)
- Check and log spark_env_vars field presence for EVERY cluster
- Log spark_env_vars value type when field exists
- Special debug output for test cluster (containing "Arun" or "Personal")
- Show full spark_env_vars value and keys for test cluster
- Change logger.debug to logger.info for env var extraction

This will help diagnose:
- Whether spark_env_vars field exists in cluster configs
- What type spark_env_vars is (dict, None, empty dict)
- Why test cluster with env vars might not be detected
- If API response structure is different than expected
The dbutils.notebook.exit() at the end was clearing all debug output.
Now debug information is captured in the return dict and survives the exit.

Debug info includes:
- sample_cluster_configs: Config keys from first 3 clusters
- test_cluster_found: Whether test cluster (Arun/Personal) was found
- test_cluster_has_spark_env_vars: Whether test cluster has the field
- test_cluster_spark_env_vars_value: First 200 chars of the value
- test_cluster_config_keys: All config keys for test cluster
- clusters_with_spark_env_vars_field: Count of clusters with the field

This allows us to diagnose why clusters aren't being scanned even after
notebook.exit() clears the cell output.
…in API response

Critical bug fix: db_client.get() returns response in format:
{
  'satelements': <actual_cluster_config>,
  'http_status_code': 200
}

Previously we were using the wrapper dict as the cluster config, which only
had keys ['satelements', 'http_status_code'] instead of the actual cluster
fields like 'cluster_id', 'spark_version', 'spark_env_vars', etc.

Now we properly extract the cluster config from response['satelements'].

This was discovered via debug output showing:
"Cluster #1 config keys: ['satelements', 'http_status_code']"

This fix will allow spark_env_vars to be detected and scanned.
The db_client.get() response format for /clusters/get is:
{
  'satelements': [<cluster_config>],  # List with ONE element
  'http_status_code': 200
}

Previous fix assumed satelements was a dict, but it's actually a list
containing a single dict element. Now we:
1. Check if satelements is a list
2. Extract the first (and only) element: satelements[0]
3. Return that as the cluster config

This fixes the error: 'list' object has no attribute 'keys'
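
A sketch of the unwrapping described in the last two commits; the response shape follows the commit messages, while the db_client.get() call shape and the function name are assumptions:

    def get_cluster_config(db_client, cluster_id):
        # db_client.get() wraps the REST payload as
        # {'satelements': [<cluster_config>], 'http_status_code': 200}
        response = db_client.get(f"/clusters/get?cluster_id={cluster_id}")
        payload = (response or {}).get("satelements")
        if isinstance(payload, list):
            # /clusters/get describes a single cluster, so take the only element
            return payload[0] if payload else None
        return payload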
Fixed: local variable 'cluster_id' referenced before assignment

The error occurred because cluster_id and other variables were defined
inside the try block. If an exception occurred during table creation
(before variable assignment), the except block would try to use cluster_id
which hadn't been defined yet.

Solution: Move variable extraction to the very beginning of the function,
before the try block. Now these variables are always defined, even if an
exception occurs early.
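
An illustrative shape of the fix; the field names match the results table, everything else is a sketch:

    def insert_cluster_secret_scan_results(secret):
        # extract identifiers before the try block so the except handler can
        # always reference them, even if table creation fails early
        cluster_id = secret.get("cluster_id", "unknown")
        cluster_name = secret.get("cluster_name", "unknown")
        try:
            create_clusters_secret_scan_results_table()
            # ... build and execute the INSERT statement here ...
        except Exception as e:
            loggr.error(f"Insert failed for cluster {cluster_id} ({cluster_name}): {e}")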
Replaced dbutils.notebook.exit() with comprehensive summary display
to preserve all scan output and debug information.

Changes:
- Remove notebook.exit() call that was clearing all output
- Add detailed final summary with statistics
- Include debug information in output
- Add recommendations based on scan results
- Follow same pattern as trufflehog_scan.py (no exit at end)

The orchestrator doesn't use the return value anyway, so this allows
users to see the full scan progress and results without having output
cleared by exit().

Summary displays:
- Total clusters found and scanned
- Secrets detected count
- Debug info (clusters with spark_env_vars, test cluster status)
- Recommended actions based on findings
- Best practices for secure configurations
Add cleanup section matching trufflehog_scan.py pattern to show:
- Temporary cluster config files in /tmp/clusters/
- TruffleHog binary and config files
- Cleanup information and file lifecycle

This provides visibility into what files were created during the scan
and confirms they will be cleaned up when cluster terminates.
Add detailed INFO-level logging to diagnose why secrets aren't being
inserted into clusters_secret_scan_results table:

Changes:
- Add logging before/after insert_cluster_secret_scan_results call
- Add try/except around insertion with detailed error logging
- Add logging inside insert function for each step:
  - Table creation
  - Processing each secret
  - SQL statement execution
- Log full traceback on insertion errors
- Show progress indicators in output

This will help identify:
- Whether table creation succeeds
- Which secrets are being processed
- Whether SQL execution succeeds
- What specific error occurs during insertion
…d idempotent TruffleHog installation

Rename:
- trufflehog_scan.py → notebook_secret_scan.py for clarity

Changes to notebook_secret_scan.py:
- Add idempotent check for TruffleHog installation
- Skip installation if /tmp/trufflehog already exists
- Prevents installation conflicts when running multiple scanners

Changes to cluster_secrets_scan.py:
- Add directory creation verification with write test
- Ensure /tmp/clusters/ is writable before scanning starts
- Add defensive directory creation in serialize_env_vars_to_file()
- Update comment to reference notebook_secret_scan.py

Changes to orchestrator:
- Update notebook path from trufflehog_scan to notebook_secret_scan

This ensures both scanners can run in sequence without conflicts.
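
A small sketch of the directory write test; the /tmp/clusters/ path appears in the cleanup commit above, while the probe filename is illustrative:

    import os

    clusters_dir = "/tmp/clusters/"
    os.makedirs(clusters_dir, exist_ok=True)
    probe = os.path.join(clusters_dir, ".write_test")
    with open(probe, "w") as f:
        f.write("ok")  # fails fast if the directory is not writable
    os.remove(probe)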
…eck=True bug

CRITICAL BUG FIX: Notebook scanner was ignoring run_id passed from orchestrator,
breaking correlation with cluster scanner.

Changes:
1. Respect run_id from orchestrator
   - Check json_.get("run_id") first
   - Use passed run_id if provided (shared with cluster scan)
   - Fall back to generate_run_id() for standalone execution
   - Now matches cluster_secrets_scan.py pattern

2. Fix subprocess.run() check=True bug
   - Remove check=True parameter from TruffleHog scans
   - Was incorrectly raising exceptions on non-zero exit codes
   - Comment said "Don't raise exception" but check=True does the opposite
   - Now properly handles non-zero exit codes

Before:
- Notebook scanner always generated its own run_id
- Notebook and cluster scans had different run_ids
- No correlation possible between findings

After:
- Both scanners use shared run_id from orchestrator
- Enables correlation of notebook + cluster findings in same run
- Maintains backward compatibility for standalone execution

This fixes the fundamental issue preventing cross-scanner correlation.
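
A sketch of the two changes; json_ and generate_run_id() come from the commit message, while the scan directory and the TruffleHog invocation flags are assumptions:

    import subprocess

    # 1. Respect the run_id passed by the orchestrator, fall back for standalone runs
    run_id = json_.get("run_id") or generate_run_id()

    # 2. No check=True: inspect the result instead of raising on non-zero exit codes
    scan_dir = "/tmp/notebook_exports"  # illustrative path
    result = subprocess.run(
        ["/tmp/trufflehog", "filesystem", scan_dir, "--json"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        loggr.warning(f"TruffleHog exited with code {result.returncode}: {result.stderr[:500]}")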
…nces

Clean up cluster_secrets_scan.py for production:

Removed:
- Test cluster detection (Arun/Personal cluster references)
- debug_info structure and all related tracking
- Sample cluster config logging (first 3 clusters)
- Excessive database insertion logging
- Verbose print statements for each step

Simplified:
- Logging for clusters without env vars (debug level only)
- Database insertion (removed per-secret logging)
- Progress messages (cleaner output)
- Final summary (removed debug info section)

Result:
- 90 lines removed
- Production-ready code
- Essential logging preserved
- No personal/test-specific code

Ready for final test and PR.
…otebook environment

Fix import error: No module named 'common'

In Databricks notebooks, functions from files loaded via %run are
available directly in the namespace, not as importable modules.

Changed:
- Remove: from common import create_clusters_secret_scan_results_table
- Direct call: create_clusters_secret_scan_results_table()

This matches the pattern used in notebook_secret_scan.py and fixes
the database insertion error.
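
A short illustration of that Databricks behaviour; the %run target path is an assumption:

    # %run ./common
    # After %run, everything common defines is in this notebook's globals,
    # so the function is called directly:
    create_clusters_secret_scan_results_table()

    # A plain import fails because common is not an importable module here:
    # from common import create_clusters_secret_scan_results_table  -> ModuleNotFoundError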
CRITICAL SECURITY FIX: Escape single quotes in all SQL string values
to prevent SQL injection attacks.

Issue:
- Cluster/notebook names with apostrophes (e.g., "Arun Pamulapati's Cluster")
  were breaking SQL queries
- SQL syntax error: VALUES ('...', 'Arun Pamulapati's...', ...)
  would terminate the string at the first apostrophe
- Could be exploited for SQL injection if malicious names are used

Fix:
- Escape all string values by replacing ' with '' (SQL standard)
- Applied to both cluster_secrets_scan.py and notebook_secret_scan.py
- Escaped values:
  * workspace_id, cluster_id, cluster_name, notebook_id, notebook_path, notebook_name
  * config_field, config_key, detector_name, secret_sha256, source_file

Before:
VALUES ('{workspace_id}', '{cluster_name}', ...)
-> FAILS with names containing apostrophes

After:
workspace_id_escaped = workspace_id.replace("'", "''")
VALUES ('{workspace_id_escaped}', '{cluster_name_escaped}', ...)
-> SAFE: "Arun Pamulapati's Cluster" becomes "Arun Pamulapati''s Cluster"

This prevents both SQL syntax errors and SQL injection attacks.
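
A minimal helper sketch for the escaping above; the helper name and example values are illustrative, and parameterized statements would be a stronger defense where the execution API supports them:

    def sql_quote(value) -> str:
        """Escape embedded single quotes per the SQL standard ('' for ')."""
        return str(value).replace("'", "''")

    workspace_id, cluster_id = "1234567890", "0101-000000-abc123"  # example values
    cluster_name_escaped = sql_quote("Arun Pamulapati's Cluster")
    insert_sql = (
        "INSERT INTO clusters_secret_scan_results (workspace_id, cluster_id, cluster_name) "
        f"VALUES ('{sql_quote(workspace_id)}', '{sql_quote(cluster_id)}', '{cluster_name_escaped}')"
    )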
- Fix whitespace indentation in workspace_bootstrap.py
- Clean up comment formatting in clusters_client.py
- Update SDK version in setup.py
- Add CLAUDE.md to .gitignore to exclude from version control
- Update get_cluster_info() docstring to reflect actual usage
- Clean up trailing whitespace in comments