-
Notifications
You must be signed in to change notification settings - Fork 0
Description
Description
ICMP fallback enhances outage classification accuracy by automatically performing ICMP ping tests when TCP or HTTP probes fail. This provides better distinction between service-level failures (application down) and network connectivity issues (host unreachable), enabling more precise root cause analysis and reducing false positive outage alerts.
Currently, ThingConnect Pulse treats all probe failures equally - a failed HTTP probe could indicate either a web service crash or complete network isolation. ICMP fallback adds intelligent secondary testing to classify outage types, helping manufacturing teams understand whether issues are network infrastructure problems or specific service failures.
Instead of showing generic "TCP connection timeout", the system will display:
- 🟠 Service Outage: TCP port unavailable, host reachable via ICMP
- 🔴 Network Outage: Host completely unreachable (both TCP and ICMP fail)
Scope of Work
Core Implementation:
- Add automatic ICMP fallback logic when TCP/HTTP probes fail
- Implement outage classification system (Network, Service, Mixed)
- Extend
CheckResultmodel to include fallback probe results - Update outage detection logic to consider fallback test outcomes
Probe Enhancement:
- Modify
ProbeServiceto perform secondary ICMP tests on TCP/HTTP failures - Add configurable fallback timeout (shorter than primary probe timeout)
- Ensure fallback probes respect existing concurrency limits and jitter scheduling
- Preserve original error messages while adding fallback context
Database Schema:
- Extend
CheckResultRawtable with fallback probe fields - Add
OutageClassificationenum (Network, Service, Mixed, Unknown) - Update
Outagetable to store classification results - Maintain backwards compatibility with existing data
UI Integration:
- Display outage classification in dashboard status indicators
- Show fallback probe results in endpoint detail views
- Add filtering by outage type in history views
- Include classification information in CSV exports
Acceptance Criteria
Functional Requirements:
- TCP/HTTP probe failures automatically trigger ICMP fallback within 2 seconds
- ICMP fallback timeout defaults to 500ms (configurable via settings)
- Outage classification correctly identifies Network vs Service vs Mixed failures
- Fallback probes respect existing ±10% jitter and concurrency limits
- Original probe error messages preserved alongside fallback results
Classification Logic:
- Network Outage: Primary probe fails, ICMP fallback fails → Host unreachable
- Service Outage: Primary probe fails, ICMP fallback succeeds → Service down
- Mixed Outage: Inconsistent results over multiple checks → Network instability
- Unknown: Fallback probe times out or encounters errors → Default classification
Performance Requirements:
- Fallback probes add <1 second overhead to failed probe cycles
- No impact on successful probe execution (no fallback needed)
- Memory usage increases by <20 bytes per endpoint for fallback state tracking
- Database storage increases by ~30% for failed check records (fallback data)
User Experience:
- Dashboard shows color-coded outage types (red=Network, orange=Service, yellow=Mixed)
- Endpoint detail pages display "Last seen via ICMP" timestamps for service outages
- History charts distinguish between outage classification types
- CSV exports include OutageType and FallbackResult columns
###Classification
None = -1, // Explicitly healthy, no outage detected
Unknown = 0, // Not enough information to classify
Network = 1, // Host completely unreachable
Service = 2, // Service down, host reachable
Intermittent = 3, // Flapping/unstable connection
Performance = 4, // Slow response times
PartialService = 5, // HTTP errors but TCP works
DnsResolution = 6, // DNS fails but IP works
Congestion = 7, // Network performance issues
Maintenance = 8 // Planned downtime classification