Add SNMP ping (switch health, port up/down) #172

@abishekve

This issue proposes adding an SNMP ping probe for switch health, with optional port up/down checks, aligned with the existing probe engine, outage flow, and YAML configuration patterns.[1][2]

Background

Pulse currently supports ICMP, TCP, and HTTP reachability, and extending to SNMP aligns with the plant-aware objective while staying within the availability-only scope for v1.0.[2]
The new SNMP probe must plug into the existing ProbeService → OutageDetectionService pipeline, reuse the existing timeout semantics, record RTT, and surface consistently in the API, CSV export, and live board.[3][1]

Objective

Implement a new probe type, snmp, that reports UP when an SNMP GET (sysUpTime.0 by default) succeeds within the timeout and records RTT. Optionally, the probe also fetches a configured OID for port status to annotate health, while the primary outcome stays availability-focused.[1][2]

Scope

  • Backend probe: SNMP reachability over UDP/161 using a minimal GET on a default OID (sysUpTime.0), with configurable version (v1/v2c), community, timeout, and retries. Success maps to UP and failure to DOWN, with RTT measured as the request round trip.[1]
  • Optional port check: allow an OID parameter (e.g., ifOperStatus for a specific interface index) to be fetched after reachability; availability remains based on SNMP reachability while port status is recorded in the result metadata for UI display.[1]
  • YAML schema: add type: snmp with host, port (default 161), version, community, timeout, retries, and optional oid/expectedValue for port-state hints; validate and surface in Apply/diff/versioning.[4]
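
As a sketch of the validation the YAML schema item implies, the snippet below normalizes one parsed endpoint entry. The field names come from the list above; everything else is an assumption made for illustration: `SnmpEndpoint` and `parse_snmp_endpoint` are hypothetical names, the defaults for version, timeout (taken here as seconds), and retries are placeholders, and making community mandatory is a design guess rather than a stated requirement. Only the port default of 161 is given in this issue.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

# Equivalent YAML shape this sketch expects after parsing:
#   - type: snmp
#     host: 10.0.0.2
#     community: public
#     oid: 1.3.6.1.2.1.2.2.1.8.3   # optional port-state hint
#     expectedValue: "1"


@dataclass(frozen=True)
class SnmpEndpoint:
    host: str
    community: str
    port: int = 161             # default stated in the issue
    version: str = "v2c"        # assumed default
    timeout: float = 2.0        # assumed default, in seconds
    retries: int = 1            # assumed default
    oid: Optional[str] = None
    expected_value: Optional[str] = None


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def parse_snmp_endpoint(raw: Mapping[str, Any]) -> SnmpEndpoint:
    """Validate one endpoint mapping and apply defaults; raise ValueError listing all problems."""
    errors = []
    if raw.get("type") != "snmp":
        errors.append("type must be 'snmp'")
    host = raw.get("host")
    if not isinstance(host, str) or not host.strip():
        errors.append("host is required")
    community = raw.get("community")
    if not isinstance(community, str) or not community:
        errors.append("community is required")
    port = raw.get("port", 161)
    if not _is_int(port) or not 1 <= port <= 65535:
        errors.append("port must be an integer in 1..65535")
    version = raw.get("version", "v2c")
    if version not in ("v1", "v2c"):
        errors.append("version must be v1 or v2c")
    timeout = raw.get("timeout", 2.0)
    if isinstance(timeout, bool) or not isinstance(timeout, (int, float)) or timeout <= 0:
        errors.append("timeout must be a positive number")
    retries = raw.get("retries", 1)
    if not _is_int(retries) or retries < 0:
        errors.append("retries must be a non-negative integer")
    oid = raw.get("oid")
    expected = raw.get("expectedValue")
    if expected is not None and oid is None:
        # An invalid parameter combination: nothing to compare against.
        errors.append("expectedValue requires oid")
    if errors:
        raise ValueError("; ".join(errors))
    return SnmpEndpoint(
        host=host.strip(), community=community, port=port, version=version,
        timeout=float(timeout), retries=retries, oid=oid,
        expected_value=None if expected is None else str(expected),
    )
```

Collecting every problem before raising is meant to suit an Apply/diff view that shows all warnings at once instead of one per attempt.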

Non-goals

  • No deep device inventory, traps, or bulk walks; this is a lightweight reachability/health check appropriate for v1 availability scope.[2]
  • No SNMPv3 security profiles in this phase; start with v1/v2c to minimize complexity and configuration burden.[2]

Design Notes

  • Result mapping: Success = UP with RTT measured as GET roundtrip; Failure = DOWN on timeout, no response, or auth error, with granular error categories for observability.[1]
  • Timeouts/retries: Adopt existing per-probe timeout and retry semantics; wire through cancellation tokens and error paths consistent with other probes.[1]
  • Outage flow: Feed CheckResult into OutageDetectionService unchanged to honor 2/2 flap damping and transactional outage open/close behavior.[3]
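
The sketch below shows one way a standardized result could drive outage open/close with damping. It reads "2/2" as two consecutive DOWN results to open an outage and two consecutive UP results to close it, which is an interpretation and not something this issue spells out. `FlapDamper` and the Python shape of `CheckResult` are hypothetical stand-ins, not the actual OutageDetectionService types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    UP = "UP"
    DOWN = "DOWN"


@dataclass(frozen=True)
class CheckResult:
    status: Status
    rtt_ms: Optional[float] = None
    error: Optional[str] = None  # e.g. "timeout", "noResponse", "authError"


class FlapDamper:
    """Open or close an outage only after `threshold` consecutive matching results."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.outage_open = False
        self._streak_status: Optional[Status] = None
        self._streak = 0

    def observe(self, result: CheckResult) -> Optional[str]:
        """Feed one result; return "opened", "closed", or None when nothing changes."""
        if result.status is self._streak_status:
            self._streak += 1
        else:
            self._streak_status = result.status
            self._streak = 1
        if self._streak < self.threshold:
            return None
        if result.status is Status.DOWN and not self.outage_open:
            self.outage_open = True
            return "opened"
        if result.status is Status.UP and self.outage_open:
            self.outage_open = False
            return "closed"
        return None
```

Because the damper only looks at `status`, an SNMP result passes through it exactly like an ICMP, TCP, or HTTP result, which is the point of feeding CheckResult in unchanged.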

Tasks

  • Backend
    • Implement SnmpPingProbe with reachability GET to default OID and RTT capture, plus optional fetch of a configured OID for port status annotation.[1]
    • Extend ProbeService.ProbeAsync to route type: snmp and produce standardized CheckResult with error categorization (timeout, noResponse, authError).[3]
    • Add unit tests for success, timeout, no response, bad community, and optional port OID resolution; add an integration test using a mock SNMP agent.[1]
  • Configuration & Apply
    • Update config.schema.json to include enum value snmp with properties: host, port, version (v1/v2c), community, timeout, retries, and optional oid/expectedValue.[4]
    • Extend ConfigurationParser validations and Apply diff to show additions/changes and preserve version snapshots and warnings for invalid parameter combinations.[4]
  • API/UI
    • Ensure API DTOs and CSV export include probe type snmp, RTT, and optional portStatus metadata without changing outage semantics.[5]
    • Update Configuration editor to add SNMP fields with inline validation and help text, and label SNMP endpoints distinctly in live board and detail pages.[4]
  • Docs
    • Add docs examples for snmp endpoints, defaults, version/community notes, optional port status OID, and firewall/UDP considerations.[5]
    • Note performance expectations for SNMP RTT and error categorization in probes-spec.md.[1]
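
For the backend task, a minimal sketch of RTT capture with timeout and retry handling over UDP might look like the following. It makes several simplifying assumptions: `snmp_ping` is a hypothetical name, `retries` is read as extra attempts after the first, any datagram received counts as success (a real probe would decode the reply and match the request ID), and an OS-level socket error is folded into noResponse. The request bytes are passed in so the sketch stays independent of how the GET is encoded.

```python
import socket
import time
from typing import Optional, Tuple


def snmp_ping(host: str, port: int, request: bytes,
              timeout_s: float = 1.0,
              retries: int = 1) -> Tuple[str, Optional[float], Optional[str]]:
    """Send `request` over UDP and return (status, rtt_ms, error_category)."""
    last_error = "timeout"
    for _ in range(retries + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout_s)
            started = time.perf_counter()
            try:
                sock.sendto(request, (host, port))
                sock.recvfrom(65535)
            except socket.timeout:
                # No reply within the per-attempt timeout.
                last_error = "timeout"
                continue
            except OSError:
                # For example an unreachable network or a refused port.
                last_error = "noResponse"
                continue
            rtt_ms = (time.perf_counter() - started) * 1000.0
            return "UP", rtt_ms, None
    return "DOWN", None, last_error
```

RTT here covers only the successful attempt, so a reply on the second try does not include the first attempt's timeout; whether that matches the existing probes' convention is worth confirming against probes-spec.md.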

Acceptance Criteria

  • A YAML endpoint with type: snmp applies cleanly, appears in diff/versioning, and is visible/editable in the UI with sensible defaults and validations.[4]
  • SNMP endpoints report UP when a GET completes within timeout and DOWN on timeout/no response/auth error, with RTT populated and errors categorized.[1]
  • Outage transitions for SNMP respect 2/2 flap damping and persist open/close events as with other probe types.[3]
  • API and CSV show probe type snmp and RTT, and UI clearly distinguishes SNMP endpoints and optionally displays port status if configured.[5]

Risks & Mitigations

  • Variability across vendors and MIBs for port OIDs; default to sysUpTime.0 for reachability and document port-OID as optional.[1]
  • UDP filtering or rate-limiting in OT networks; provide clear error categorization and operator guidance in docs and UI.[1]
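
Where a device implements the standard IF-MIB, the optional port check can use ifOperStatus, whose integer values are fixed by the MIB. The sketch below shows how a fetched value could become result metadata without affecting UP/DOWN. `annotate_port_status` and the metadata keys are hypothetical; vendor-specific OIDs would need their own mapping.

```python
from typing import Dict, Optional, Union

IF_OPER_STATUS_OID = "1.3.6.1.2.1.2.2.1.8"  # IF-MIB ifOperStatus column

# Value names as defined for ifOperStatus in IF-MIB.
IF_OPER_STATUS = {
    1: "up",
    2: "down",
    3: "testing",
    4: "unknown",
    5: "dormant",
    6: "notPresent",
    7: "lowerLayerDown",
}


def if_oper_status_oid(if_index: int) -> str:
    """Build the OID instance for one interface index."""
    if if_index < 1:
        raise ValueError("ifIndex starts at 1")
    return f"{IF_OPER_STATUS_OID}.{if_index}"


def annotate_port_status(raw_value: int,
                         expected_value: Optional[int] = None
                         ) -> Dict[str, Union[str, int, bool]]:
    """Turn a fetched ifOperStatus value into metadata for display only."""
    meta: Dict[str, Union[str, int, bool]] = {
        "portStatus": IF_OPER_STATUS.get(raw_value, "unrecognized"),
        "rawValue": raw_value,
    }
    if expected_value is not None:
        meta["matchesExpected"] = raw_value == expected_value
    return meta
```

For example, `if_oper_status_oid(3)` yields `1.3.6.1.2.1.2.2.1.8.3`, the value an operator would put in the optional oid field for interface index 3.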

Testing Plan

  • Unit tests for SnmpPingProbe covering success/failure modes with deterministic timings and cancellations.[1]
  • Integration test against a mock agent to validate port OID flow and RTT reporting, plus E2E from YAML apply → probe execution → outage transitions → API/CSV verification.
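
One possible shape for the mock agent is a small UDP responder on localhost. The sketch below is deliberately crude and rests on stated assumptions: it only handles short-form BER lengths (messages under 128 bytes), it answers by flipping the GetRequest tag (0xA0) to GetResponse (0xA2) and echoing everything else, so the returned value stays NULL instead of a real TimeTicks, and it stays silent on a community mismatch, which is how v1/v2c agents commonly behave. `MockSnmpAgent` and `response_for` are hypothetical names.

```python
import socket
import threading


def response_for(request: bytes) -> bytes:
    """Turn a short-form GetRequest into a GetResponse by swapping the PDU tag."""
    if len(request) < 8 or request[0] != 0x30 or request[1] >= 0x80:
        raise ValueError("unsupported message framing")
    if request[2:4] != b"\x02\x01" or request[5] != 0x04 or request[6] >= 0x80:
        raise ValueError("unsupported version or community encoding")
    pdu_at = 7 + request[6]  # the PDU tag follows the community string
    if pdu_at >= len(request) or request[pdu_at] != 0xA0:
        raise ValueError("not a GetRequest")
    return request[:pdu_at] + b"\xA2" + request[pdu_at + 1:]


class MockSnmpAgent:
    """Localhost UDP responder for integration tests (context manager)."""

    def __init__(self, community: bytes = b"public") -> None:
        self._community = community
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._sock.bind(("127.0.0.1", 0))  # let the OS pick a free port
        self._sock.settimeout(0.1)
        self.port = self._sock.getsockname()[1]
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._serve, daemon=True)

    def __enter__(self) -> "MockSnmpAgent":
        self._thread.start()
        return self

    def __exit__(self, *exc_info) -> None:
        self._stop.set()
        self._thread.join()
        self._sock.close()

    def _serve(self) -> None:
        while not self._stop.is_set():
            try:
                data, addr = self._sock.recvfrom(65535)
            except socket.timeout:
                continue
            try:
                reply = response_for(data)
            except ValueError:
                continue  # ignore anything this sketch cannot parse
            if data[7:7 + data[6]] != self._community:
                continue  # wrong community: no reply, so the probe times out
            self._sock.sendto(reply, addr)
```

Binding to port 0 keeps tests free of privileged ports and port collisions; the probe under test is pointed at `agent.port` instead of 161.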

References

  • [1] Probe semantics and budgets to mirror: probes-spec.md.
  • [2] Scope guardrails and availability-only emphasis: scope-v1.md.
  • [3] Flow integration and execution boundaries: outage-probe-flow-analysis.md.
  • [4] Configuration/Apply/versioning architecture and file layout: Configuration.md.
  • [5] API/doc surfacing and examples: README.md.
