Skip to content

Latest commit

 

History

History
238 lines (184 loc) · 17.4 KB

File metadata and controls

238 lines (184 loc) · 17.4 KB

Development Log

This file documents detailed technical changes, internal refactorings, and development notes. For user-facing highlights, see CHANGELOG.md.

Version 0.9.4 — Detailed Changes

New Features — Technical Details

Pipes Monitoring Plugin

  • Implemented PipesPlugin to monitor Snowpipe status and validation
  • Uses SYSTEM$PIPE_STATUS function for real-time pipe monitoring
  • Uses VALIDATE_PIPE_LOAD function for validation checks
  • Delivers telemetry as logs, metrics, and events

Streams Monitoring Plugin

  • Implemented StreamsPlugin to monitor Snowflake Streams
  • Tracks stream staleness using SHOW STREAMS output
  • Monitors pending changes and stream health
  • Reports stale streams as warning events

Stage Monitoring Plugin

  • Implemented StagePlugin to monitor staged data
  • Tracks internal and external stages
  • Monitors COPY INTO activities from QUERY_HISTORY and COPY_HISTORY views
  • Reports on staged file sizes, counts, and load patterns

Data Lineage Plugin

  • Implemented DataLineagePlugin combining static and dynamic lineage
  • Static lineage from OBJECT_DEPENDENCIES view (DDL-based relationships)
  • Dynamic lineage from ACCESS_HISTORY view (runtime data flow)
  • Column-level lineage tracking with direct and indirect dependencies
  • Lineage graphs delivered as structured events

SNOWFLAKE.TELEMETRY.EVENTS Support (BDX-1172)

  • Issue: When a customer account had EVENT_TABLE = snowflake.telemetry.events (the Snowflake-managed shared event table), SETUP_EVENT_TABLE() listed it in a_no_custom_event_t — the "not a real custom table" array — and took the IF branch, creating DSOA's own DTAGENT_DB.STATUS.EVENT_LOG table and ignoring the Snowflake-managed table entirely.
  • Root cause: 'snowflake.telemetry.events' was excluded from the view-creation path because the original ELSE branch attempted GRANT SELECT ON TABLE snowflake.telemetry.events TO ROLE DTAGENT_VIEWER, which Snowflake rejects — privileges cannot be granted on Snowflake-managed objects.
  • Fix: Two-part change in src/dtagent/plugins/event_log.sql/init/009_event_log_init.sql:
    1. Removed 'snowflake.telemetry.events' from a_no_custom_event_t so it falls through to the ELSE branch
    2. Wrapped the GRANT SELECT in a BEGIN/EXCEPTION WHEN OTHER THEN SYSTEM$LOG_WARN() block — attempts the grant and logs warnings, ignoring failures for any read-only or Snowflake-managed table; more robust than a string comparison
  • Behaviour after fix: When EVENT_TABLE = snowflake.telemetry.events, DSOA creates DTAGENT_DB.STATUS.EVENT_LOG as a view over it, exactly as for any other pre-existing customer event table. All three event_log SQL views continue to query DTAGENT_DB.STATUS.EVENT_LOG unchanged — no Python changes needed.

Configurable Lookback Time

  • Motivation: Lookback windows were hardcoded across SQL views in every plugin that uses F_LAST_PROCESSED_TS. This could not be tuned per deployment without modifying SQL files.
  • Approach: Replace each literal with CONFIG.F_GET_CONFIG_VALUE('plugins.<plugin>.lookback_hours', <default>) and add lookback_hours to each plugin's config YAML — consistent with how retention_hours is already handled in P_CLEANUP_EVENT_LOG.
  • Pattern: timeadd(hour, -1*F_GET_CONFIG_VALUE('plugins.<plugin>.lookback_hours', <N>), current_timestamp) — the -1* multiplier converts the positive config value to a negative offset.
  • Note: The F_LAST_PROCESSED_TS guard in each view's GREATEST(...) clause ensures normal incremental runs are unaffected; lookback_hours only bounds the fallback window when no prior timestamp exists.
  • Files changed (SQL views + config YAMLs):
Plugin SQL view(s) Default
event_log 051_v_event_log.sql, 051_v_event_log_metrics_instrumented.sql, 051_v_event_log_spans_instrumented.sql 24h
login_history 061_v_login_history.sql, 061_v_sessions.sql 24h
warehouse_usage 070_v_warehouse_event_history.sql, 071_v_warehouse_load_history.sql, 072_v_warehouse_metering_history.sql 24h
tasks 061_v_serverless_tasks.sqllookback_hours (4h); 063_v_task_versions.sqllookback_hours_versions (720h = 1 month) separate keys, original defaults preserved
event_usage 051_v_event_usage.sql 6h
data_schemas 051_v_data_schemas.sql 4h

Bug Fixes — Technical Details

Dynamic Tables Grant — Schema-Level Granularity (BDX-640)

  • Issue: P_GRANT_MONITOR_DYNAMIC_TABLES() always granted MONITOR at database level, even when the include pattern specified a particular schema (e.g. PROD_DB.ANALYTICS.%). This caused the procedure to over-grant: a user expecting grants only on PROD_DB.ANALYTICS received grants on all schemas in PROD_DB.

  • Root cause: The CTE extracted only split_part(value, '.', 0) (the database part) and the schema part was never inspected.

  • Fix: Three-pass approach in 032_p_grant_monitor_dynamic_tables.sql:

    1. Database passsplit_part(value, '.', 1) = '%'GRANT … IN DATABASE.
    2. Schema passsplit_part(value, '.', 1) != '%' and split_part(value, '.', 2) = '%'GRANT … IN SCHEMA db.schema.
    3. Table passsplit_part(value, '.', 1) != '%' and split_part(value, '.', 2) != '%'GRANT … ON DYNAMIC TABLE db.schema.table (no FUTURE grant — not supported by Snowflake at individual table level).
  • Grant matrix:

    Include pattern Grant level
    %.%.% All databases
    PROD_DB.%.% Database PROD_DB
    PROD_DB.ANALYTICS.% Schema PROD_DB.ANALYTICS
    PROD_DB.ANALYTICS.ORDERS_DT Table PROD_DB.ANALYTICS.ORDERS_DT
  • Files changed: 032_p_grant_monitor_dynamic_tables.sql, bom.yml, config.md

  • Tests added: test/bash/test_grant_monitor_dynamic_tables.bats — structural content checks covering both grant paths

Log ObservedTimestamp Unit Correction

  • Issue: OTel log observed_timestamp field was sent in milliseconds
  • Root cause: OTLP spec requires nanoseconds for observed_timestamp, but code was converting to milliseconds
  • Fix: Modified process_timestamps_for_telemetry() to return observed_timestamp_ns in nanoseconds
  • Impact: Logs now comply with OTLP spec
  • Note: Dynatrace OTLP Logs API still requires milliseconds for timestamp field (deviation from spec)

Inbound Shares Reporting Flag

  • Issue: HAS_DB_DELETED flag incorrectly reported for deleted shared databases in TMP_SHARES view
  • Root cause: Logic error in SQL view predicate
  • Fix: Corrected SQL logic in shares.sql/ view definition
  • Impact: Accurate reporting of deleted shared database status

Self-Monitoring Log Filtering

  • Issue: Database name filtering logic failed to correctly identify DTAGENT_DB references
  • Root cause: String matching logic didn't account for fully qualified names
  • Fix: Updated filtering logic in self-monitoring plugin
  • Impact: Self-monitoring logs now correctly exclude internal agent operations

Improvements — Technical Details

Timestamp Handling Refactoring

  • Motivation: Eliminate wasteful ns→ms→ns conversions and clarify API requirements
  • Approach: Unified timestamp handling with smart unit detection
  • Implementation:
    • All SQL views produce nanoseconds via extract(epoch_nanosecond ...)
    • Conversion to appropriate unit occurs only at API boundary
    • validate_timestamp() works internally in nanoseconds to preserve precision
    • Added return_unit parameter ("ms" or "ns") for explicit output control
    • Added skip_range_validation parameter for observed_timestamp (no time range check)
    • Created process_timestamps_for_telemetry() utility for standard timestamp processing pattern
  • Changes to validate_timestamp():
    • Works internally in nanoseconds throughout validation logic
    • Converts to requested unit only at the end
    • Raises ValueError if return_unit not in ["ms", "ns"]
    • Added skip_range_validation for observed_timestamp (preserves original value without range checks)
  • Changes to process_timestamps_for_telemetry():
    • New utility function implementing standard pattern for logs and events
    • Extracts timestamp and observed_timestamp from data dict
    • Falls back to timestamp value when observed_timestamp not provided
    • Validates timestamp with range checking (returns milliseconds)
    • Validates observed_timestamp without range checking (returns nanoseconds)
    • Returns (timestamp_ms, observed_timestamp_ns) tuple
    • Hardcoded units: always milliseconds for timestamp, nanoseconds for observed_timestamp
  • Removed obsolete functions:
    • get_timestamp_in_ms() — replaced by validate_timestamp(value, return_unit="ms")
    • validate_timestamp_ms() — replaced by validate_timestamp(value, return_unit="ms")
  • Added new functions:
    • get_timestamp() — returns nanoseconds from SQL query results
  • API Documentation:
    • Added comprehensive documentation links in all telemetry classes
    • Documented Dynatrace OTLP Logs API deviation (milliseconds for timestamp field)
    • Documented OTLP standard requirements (nanoseconds for most timestamp fields)
  • Fallback Logic:
    • observed_timestamp now correctly falls back to timestamp value when not provided
    • Only event_log plugin provides explicit observed_timestamp values
    • All other plugins rely on fallback mechanism

Build System Virtual Environment

  • Change: All scripts/dev/ scripts now auto-activate .venv/
  • Implementation: Added source .venv/bin/activate to script preambles
  • Impact: Eliminates common "wrong Python" errors during development

Documentation — Autogenerated Files

  • Change: Updated .github/copilot-instructions.md with autogenerated file documentation
  • Coverage:
    • Documentation files: docs/PLUGINS.md, docs/SEMANTICS.md, docs/APPENDIX.md
    • Build artifacts: build/_dtagent.py, build/_send_telemetry.py, build/_semantics.py, build/_version.py, build/_metric_semantics.txt
  • Guidance: Never edit autogenerated files manually; edit source files and regenerate

Budgets Plugin Enhancement

  • Change: Enhanced budget data collection using SYSTEM$SHOW_BUDGETS_IN_ACCOUNT()
  • Previous: Manual query construction
  • New: Leverages Snowflake system function for comprehensive budget data
  • Impact: More accurate and complete budget information

Error Handling — Two-Phase Commit for Query Telemetry (BDX-694 / BDX-706)

  • Issue: STATUS.UPDATE_PROCESSED_QUERIES was called regardless of whether the OTLP trace flush succeeded, meaning queries could be silently lost on export failures without being retried on the next cycle.
  • Root cause: _process_span_rows in src/dtagent/plugins/__init__.py called UPDATE_PROCESSED_QUERIES unconditionally after flush_traces().
  • Fix: Captured the boolean return value of flush_traces() into flush_succeeded and gated the UPDATE_PROCESSED_QUERIES call behind if report_status and flush_succeeded.
  • Impact: Queries whose spans fail to export are re-queued on the next agent run, ensuring at-least-once delivery semantics for span telemetry.

Event Log Lookback — Configurable Window (BDX-706)

  • Issue: V_EVENT_LOG used a hardcoded timeadd(hour, -24, current_timestamp) lower bound, preventing operators from adjusting the lookback window without editing SQL.
  • Fix:
    • src/dtagent/plugins/event_log.sql/051_v_event_log.sql: replaced literal with CONFIG.F_GET_CONFIG_VALUE('plugins.event_log.lookback_hours', 24)::int.
    • src/dtagent/plugins/event_log.config/event_log-config.yml: added lookback_hours: 24 (default preserves prior behaviour).
  • Impact: Operators can increase the window for initial deployments or decrease it for high-volume environments without any SQL change.

Query Hierarchy Validation (BDX-620)

  • Goal: Confirm that nested stored procedure call chains are correctly represented as OTel parent-child spans.
  • Validation approach:
    • P_REFRESH_RECENT_QUERIES sets IS_ROOT=TRUE for top-level calls (no parent_query_id) and IS_PARENT=TRUE for any query that has at least one child in the same batch. Leaf queries have IS_ROOT=FALSE, IS_PARENT=FALSE.
    • _process_span_rows in src/dtagent/plugins/__init__.py iterates only IS_ROOT=TRUE rows as top-level spans; child spans are fetched recursively via Spans._get_sub_rows using PARENT_QUERY_ID.
    • ExistingIdGenerator in src/dtagent/otel/spans.py propagates the root's _TRACE_ID and _SPAN_ID down the hierarchy so every sub-span shares the correct trace context.
  • New test fixture: test/test_data/query_history_nested_sp.ndjson — 3-row synthetic SP chain: outer SP (root) → inner SP (mid) → leaf SELECT.
  • New test file: test/plugins/test_query_history_span_hierarchy.py
    • test_span_hierarchy: integration test verifying 3 entries processed, 3 spans, 3 logs, 27 metrics across all disabled_telemetry combinations.
    • test_is_root_only_processes_top_level: unit test confirming only 1 root row and 2 non-root rows in the fixture.
    • test_is_parent_flags_intermediate_nodes: unit test asserting correct IS_ROOT/IS_PARENT/PARENT_QUERY_ID values for each level of the hierarchy.
  • Impact: Span hierarchies for stored procedure chains are confirmed correct and regression-protected.

Test Infrastructure Refactoring

  • Change: Refactored tests to use synthetic JSON fixtures
  • Previous: Live Dynatrace API calls for validation
  • New: Input/output validation against golden JSON files
  • Impact: Faster, more reliable, deterministic tests

Event Tables Cost Optimization Documentation (BDX-688)

  • Change: Expanded event_log.config/config.md from a minimal 5-line note to a full configuration reference
  • Content added:
    • Configuration options table covering all 7 plugin settings with types, defaults, and descriptions
    • Cost optimization guidance section explaining the cost impact of LOOKBACK_HOURS, MAX_ENTRIES, RETENTION_HOURS, and SCHEDULE
    • Key guidance: retention_hours should be >= lookback_hours to prevent cleanup from removing events before they are processed
  • Files changed:
    • src/dtagent/plugins/event_log.config/config.md — full configuration reference + cost guidance
    • src/dtagent/plugins/event_log.config/readme.md — updated to mention configurable lookback window

Span Timestamp Handling Fix (BDX-706)

  • Issue: _process_span_rows() in src/dtagent/plugins/__init__.py called _report_execution() with current_timestamp() (a Snowflake lazy column expression) instead of the actual last-row timestamp.
  • Root cause: When STATUS.LOG_PROCESSED_MEASUREMENTS stored this value, it received the string 'Column[current_timestamp]' rather than a real timestamp. On the next run, F_LAST_PROCESSED_TS would return a malformed value, causing the GREATEST(...) guard in each SQL view to use the fallback lookback window — potentially re-processing spans already sent.
  • Fix: Added last_processed_timestamp variable tracking row_dict.get("TIMESTAMP", last_processed_timestamp) within the row iteration loop, mirroring the identical pattern used by _log_entries(). Passed str(last_processed_timestamp) to _report_execution() instead of current_timestamp().
  • Side effect removed: Dropped the now-unused from snowflake.snowpark.functions import current_timestamp import — pylint flagged this as unused after the fix.
  • Impact: Spans and traces will no longer be re-processed after an agent restart. The F_LAST_PROCESSED_TS('event_log_spans') guard now advances correctly after each run.
  • Affects: event_log plugin (_process_span_entries) and any future plugin using _process_span_rows with log_completion=True

Version 0.9.3 — Detailed Changes

Detailed technical changes for prior versions can be added here as needed.

Version 0.9.2 — Detailed Changes

Detailed technical changes for prior versions can be added here as needed.

Notes

  • This file is not auto-generated. Manual maintenance required.
  • Focus on technical implementation details, root causes, and internal changes.
  • For user-facing release notes, see CHANGELOG.md.
  • Entries should help future developers understand decisions and troubleshoot issues.