Commit b5d1415

hemanandr and claude committed
Implement Phase 3: Complete Monitoring Engine

- **OutageDetectionService**: Flap damping with 2/2 thresholds, state transitions, outage tracking
- **ProbeService**: ICMP ping, TCP connect, HTTP checks with timeout and retry logic
- **DiscoveryService**: CIDR range and wildcard expansion for target discovery
- **MonitoringBackgroundService**: Concurrent probe execution with semaphore limiting (100 max)
- **MonitorState**: Per-endpoint state tracking with success/fail streaks
- **CheckResult**: Unified probe result model with status, RTT, and error details

Successfully tested with existing seed data endpoints:
- Cloudflare DNS and Google Search endpoints monitored every 60 seconds
- Check results persisted to database with proper EF Core integration
- Background service manages endpoint refresh and concurrent probe scheduling
- All probe types (ICMP, TCP, HTTP) working with proper error handling

Resolves Issue #26: Outage Detection Service
Implements SEC-06: Discovery & Expansion Logic
Implements ENV-14: Probe Concurrency Caps

Phase 3 monitoring engine is complete and operational.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

1 parent ed2bcbc commit b5d1415

19 files changed: +1169 -10 lines changed
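
The commit message mentions flap damping with 2/2 thresholds and per-endpoint success/fail streaks. A minimal sketch of one way to read that rule follows: the public status flips to DOWN only after two consecutive failures, and back to UP only after two consecutive successes. `FlapDamper`, `Observe`, and `PublicStatus` are hypothetical names for illustration, not the commit's `OutageDetectionService` or `MonitorState`, and treating the first transition out of `Unknown` as a state change is an assumption.

```csharp
using System;

// Hypothetical illustration type; not part of the commit.
public enum PublicStatus { Unknown, Up, Down }

public sealed class FlapDamper
{
    private readonly int _downThreshold;
    private readonly int _upThreshold;

    public int FailStreak { get; private set; }
    public int SuccessStreak { get; private set; }
    public PublicStatus Status { get; private set; } = PublicStatus.Unknown;

    // Defaults mirror the "2/2 thresholds" wording in the commit message.
    public FlapDamper(int downThreshold = 2, int upThreshold = 2)
    {
        _downThreshold = downThreshold;
        _upThreshold = upThreshold;
    }

    // Feeds one probe outcome and returns true when the public status changes.
    public bool Observe(bool success)
    {
        if (success)
        {
            SuccessStreak++;
            FailStreak = 0;
        }
        else
        {
            FailStreak++;
            SuccessStreak = 0;
        }

        // A single contrary result never flips the status; a full streak does.
        PublicStatus next = Status;
        if (FailStreak >= _downThreshold) next = PublicStatus.Down;
        else if (SuccessStreak >= _upThreshold) next = PublicStatus.Up;

        if (next == Status) return false;
        Status = next;
        return true;
    }
}
```

Fed the sequence success, success, fail, fail, success, success, this sketch reports a change on the second, fourth, and sixth results (to Up, Down, and Up again) and no change on the others.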

.claude/settings.local.json

Lines changed: 2 additions & 1 deletion
@@ -5,7 +5,8 @@
       "Bash(dotnet build)",
       "Bash(dotnet add package:*)",
       "Bash(curl:*)",
-      "Bash(taskkill:*)"
+      "Bash(taskkill:*)",
+      "Bash(dotnet clean:*)"
     ],
     "deny": [],
     "ask": []

DEVELOPMENT_PLAN.md

Lines changed: 8 additions & 2 deletions
@@ -76,7 +76,7 @@ You can effectively work on **up to 6 parallel worktrees** without conflicts:
 | SEC-05 | P1 | 1d | Monitoring model & probe specs | 3 |
 | SEC-11 | P1 | 1d | Rollup algorithms specification | 6 |
 
-### PHASE 2: Data Layer (Week 1, Days 3-5)
+### PHASE 2: Data Layer (Week 1, Days 3-5)**COMPLETE**
 **Database foundation - EPIC #5**
 
 | Issue | Priority | Time | Description | Worktree | Status |
@@ -85,6 +85,12 @@ You can effectively work on **up to 6 parallel worktrees** without conflicts:
 | #11 | P1 | 4-6h | Config version snapshot storage | 2 |**COMPLETE** - Commit 88e2dc0 |
 | #12 | P2 | 3-4h | Settings table, watermarks | 2 |**COMPLETE** - Commit 61626cb |
 
+**Phase 2 Summary**: All data layer components are complete and operational:
+- EF Core with SQLite properly configured and tested
+- ConfigVersion storage with SHA-256 hash-based duplicate detection
+- Settings service with memory caching and watermark tracking
+- Database migrations and entities match documented data model exactly
+
 ### PHASE 3: Monitoring Engine (Week 2, Days 1-3)
 **Core monitoring functionality**
 
@@ -210,7 +216,7 @@ git worktree remove ../pulse-env-setup
 
 - **Phase 0**: ✅ **COMPLETE** - Dev can run `dotnet build` successfully
 - **Phase 1**: All specs frozen, no more contract changes
-- **Phase 2**: ✅ **COMPLETE** - Database created, migrations run, config storage implemented
+- **Phase 2**: ✅ **COMPLETE** - Database created, migrations run, config storage & settings implemented with full testing
 - **Phase 3**: Can detect UP/DOWN state changes
 - **Phase 4**: All API endpoints return data (mock or real)
 - **Phase 5**: Rollups computed automatically

README.md

Lines changed: 11 additions & 4 deletions
@@ -59,13 +59,19 @@ POST /api/config/apply
 Content-Type: text/plain
 [YAML configuration content]
 
-# List all configuration versions
+# List all configuration versions
 GET /api/config/versions
 
 # Download specific configuration version
 GET /api/config/versions/{id}
+
+# Settings management (internal)
+GET /api/settings/{key}
+POST /api/settings/{key}
 ```
 
+**Status**: Configuration management endpoints are fully implemented and tested. Settings service provides watermark tracking for rollup jobs.
+
 ## Development
 
 - **[Backend Setup](./ops/dev-backend.md)** - Zero-to-first-run backend development
@@ -95,10 +101,11 @@ GET /api/config/versions/{id}
 
 ### v1.0 Scope
 - **Network Monitoring**: ICMP ping, TCP connect, HTTP status checks
-- **Configuration**: YAML-based with JSON Schema validation and version tracking
-- **Data Storage**: SQLite with automatic rollups and retention
+- **Configuration**: YAML-based with JSON Schema validation and version tracking
+- **Data Storage**: SQLite with automatic rollups and retention foundation
 - **Web Interface**: Real-time status dashboard and historical views
-- **Configuration Management**: Apply, list, and download configuration versions
+- **Configuration Management**: ✅ Apply, list, and download configuration versions
+- **Settings Management**: ✅ Key-value store with watermark tracking for rollup jobs
 - **Alerting**: Status change detection with flap damping
 - **Deployment**: Single Windows service installer
 
Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
+using Microsoft.AspNetCore.Mvc;
+using Microsoft.EntityFrameworkCore;
+using ThingConnect.Pulse.Server.Data;
+using ThingConnect.Pulse.Server.Models;
+using ThingConnect.Pulse.Server.Services.Monitoring;
+
+namespace ThingConnect.Pulse.Server.Controllers;
+
+/// <summary>
+/// Temporary controller for testing monitoring functionality.
+/// </summary>
+[ApiController]
+[Route("api/test/monitoring")]
+public class TestMonitoringController : ControllerBase
+{
+    private readonly PulseDbContext _context;
+    private readonly IProbeService _probeService;
+    private readonly IOutageDetectionService _outageService;
+    private readonly IDiscoveryService _discoveryService;
+
+    public TestMonitoringController(
+        PulseDbContext context,
+        IProbeService probeService,
+        IOutageDetectionService outageService,
+        IDiscoveryService discoveryService)
+    {
+        _context = context;
+        _probeService = probeService;
+        _outageService = outageService;
+        _discoveryService = discoveryService;
+    }
+
+    /// <summary>
+    /// Test probe functionality with different endpoint types.
+    /// </summary>
+    [HttpPost("test-probes")]
+    public async Task<IActionResult> TestProbes()
+    {
+        var results = new List<object>();
+
+        // Test ICMP probe
+        var pingResult = await _probeService.PingAsync(Guid.NewGuid(), "8.8.8.8", 2000);
+        results.Add(new { Type = "ICMP", Target = "8.8.8.8", Status = pingResult.Status, RTT = pingResult.RttMs, Error = pingResult.Error });
+
+        // Test TCP probe
+        var tcpResult = await _probeService.TcpConnectAsync(Guid.NewGuid(), "google.com", 80, 2000);
+        results.Add(new { Type = "TCP", Target = "google.com:80", Status = tcpResult.Status, RTT = tcpResult.RttMs, Error = tcpResult.Error });
+
+        // Test HTTP probe
+        var httpResult = await _probeService.HttpCheckAsync(Guid.NewGuid(), "httpbin.org", 80, "/get", null, 3000);
+        results.Add(new { Type = "HTTP", Target = "httpbin.org/get", Status = httpResult.Status, RTT = httpResult.RttMs, Error = httpResult.Error });
+
+        return Ok(new { Results = results, Timestamp = DateTimeOffset.UtcNow });
+    }
+
+    /// <summary>
+    /// Test outage detection with simulated probe results.
+    /// </summary>
+    [HttpPost("test-outage-detection")]
+    public async Task<IActionResult> TestOutageDetection()
+    {
+        var testEndpointId = Guid.NewGuid();
+        var results = new List<object>();
+
+        // Simulate probe sequence: SUCCESS, SUCCESS, FAIL, FAIL (should trigger DOWN), then SUCCESS, SUCCESS (should recover to UP)
+        var sequence = new[]
+        {
+            CheckResult.Success(testEndpointId, DateTimeOffset.UtcNow.AddMinutes(-4), 25.5),
+            CheckResult.Success(testEndpointId, DateTimeOffset.UtcNow.AddMinutes(-3), 28.1),
+            CheckResult.Failure(testEndpointId, DateTimeOffset.UtcNow.AddMinutes(-2), "Connection timeout"),
+            CheckResult.Failure(testEndpointId, DateTimeOffset.UtcNow.AddMinutes(-1), "Connection refused"),
+            CheckResult.Success(testEndpointId, DateTimeOffset.UtcNow.AddSeconds(-30), 22.3),
+            CheckResult.Success(testEndpointId, DateTimeOffset.UtcNow, 19.7)
+        };
+
+        foreach (var result in sequence)
+        {
+            var stateChanged = await _outageService.ProcessCheckResultAsync(result);
+            var state = _outageService.GetMonitorState(testEndpointId);
+
+            results.Add(new
+            {
+                Timestamp = result.Timestamp,
+                Status = result.Status,
+                StateChanged = stateChanged,
+                LastPublicStatus = state?.LastPublicStatus,
+                FailStreak = state?.FailStreak ?? 0,
+                SuccessStreak = state?.SuccessStreak ?? 0,
+                OpenOutageId = state?.OpenOutageId
+            });
+        }
+
+        return Ok(new { TestEndpointId = testEndpointId, Sequence = results });
+    }
+
+    /// <summary>
+    /// Test discovery service with various target types.
+    /// </summary>
+    [HttpPost("test-discovery")]
+    public async Task<IActionResult> TestDiscovery()
+    {
+        var results = new List<object>();
+
+        // Test CIDR expansion
+        var cidrHosts = _discoveryService.ExpandCidr("192.168.1.0/30").Take(10).ToList();
+        results.Add(new { Type = "CIDR", Input = "192.168.1.0/30", Expanded = cidrHosts });
+
+        // Test wildcard expansion
+        var wildcardHosts = _discoveryService.ExpandWildcard("10.0.0.*", 1, 5).ToList();
+        results.Add(new { Type = "Wildcard", Input = "10.0.0.*", Range = "1-5", Expanded = wildcardHosts });
+
+        // Test hostname resolution
+        var resolvedHosts = await _discoveryService.ResolveHostnameAsync("google.com");
+        results.Add(new { Type = "Hostname", Input = "google.com", Resolved = resolvedHosts });
+
+        return Ok(new { Results = results, Timestamp = DateTimeOffset.UtcNow });
+    }
+
+    /// <summary>
+    /// Get current raw check results for verification.
+    /// </summary>
+    [HttpGet("check-results")]
+    public async Task<IActionResult> GetCheckResults()
+    {
+        var recentResults = await _context.CheckResultsRaw
+            .OrderByDescending(cr => cr.Ts)
+            .Take(20)
+            .Select(cr => new
+            {
+                cr.Id,
+                cr.EndpointId,
+                cr.Ts,
+                cr.Status,
+                cr.RttMs,
+                cr.Error
+            })
+            .ToListAsync();
+
+        return Ok(new { Results = recentResults, Count = recentResults.Count });
+    }
+
+    /// <summary>
+    /// Get current outages.
+    /// </summary>
+    [HttpGet("outages")]
+    public async Task<IActionResult> GetOutages()
+    {
+        var outages = await _context.Outages
+            .OrderByDescending(o => o.StartedTs)
+            .Select(o => new
+            {
+                o.Id,
+                o.EndpointId,
+                o.StartedTs,
+                o.EndedTs,
+                o.DurationSeconds,
+                o.LastError
+            })
+            .ToListAsync();
+
+        return Ok(new { Outages = outages, Count = outages.Count });
+    }
+}
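
The test controller above calls `ExpandCidr("192.168.1.0/30")`, but the `DiscoveryService` source is not shown in this excerpt. As a rough illustration of what CIDR expansion typically involves, here is a minimal IPv4-only sketch. `CidrSketch` is a hypothetical name, and excluding the network and broadcast addresses for prefixes shorter than /31 is an assumption about the intended behaviour, not something the commit confirms.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical illustration; not the commit's DiscoveryService.
public static class CidrSketch
{
    // Lazily yields the host addresses of an IPv4 CIDR block.
    public static IEnumerable<string> ExpandCidr(string cidr)
    {
        string[] parts = cidr.Split('/');
        if (parts.Length != 2) throw new FormatException("Expected address/prefix");

        byte[] octets = IPAddress.Parse(parts[0]).GetAddressBytes();
        if (octets.Length != 4) throw new NotSupportedException("IPv4 only in this sketch");

        int prefix = int.Parse(parts[1]);
        if (prefix < 0 || prefix > 32) throw new ArgumentOutOfRangeException(nameof(cidr));

        uint address = ((uint)octets[0] << 24) | ((uint)octets[1] << 16) | ((uint)octets[2] << 8) | octets[3];

        // C# masks shift counts to 5 bits, so a /0 prefix needs its own case.
        uint mask = prefix == 0 ? 0u : uint.MaxValue << (32 - prefix);
        uint network = address & mask;
        uint broadcast = network | ~mask;

        // Assumption: /31 and /32 have no separate network/broadcast address.
        uint first = prefix >= 31 ? network : network + 1;
        uint last = prefix >= 31 ? broadcast : broadcast - 1;

        for (uint ip = first; ; ip++)
        {
            yield return $"{ip >> 24}.{(ip >> 16) & 0xFF}.{(ip >> 8) & 0xFF}.{ip & 0xFF}";
            if (ip == last) break; // break before ip++ so the counter cannot wrap
        }
    }
}
```

Under these assumptions, `192.168.1.0/30` expands to `192.168.1.1` and `192.168.1.2`. Because the sequence is lazy, a caller can cap large ranges with `Take(n)`, as the test controller does.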
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+using ThingConnect.Pulse.Server.Data;
+
+namespace ThingConnect.Pulse.Server.Models;
+
+/// <summary>
+/// Result of a single probe check (ICMP, TCP, or HTTP).
+/// </summary>
+public sealed class CheckResult
+{
+    /// <summary>
+    /// The endpoint that was checked.
+    /// </summary>
+    public Guid EndpointId { get; set; }
+
+    /// <summary>
+    /// Timestamp when the check was performed.
+    /// </summary>
+    public DateTimeOffset Timestamp { get; set; }
+
+    /// <summary>
+    /// Result status: UP or DOWN.
+    /// </summary>
+    public UpDown Status { get; set; }
+
+    /// <summary>
+    /// Round-trip time in milliseconds. Null if not applicable or failed.
+    /// </summary>
+    public double? RttMs { get; set; }
+
+    /// <summary>
+    /// Error message if the check failed. Null if successful.
+    /// </summary>
+    public string? Error { get; set; }
+
+    /// <summary>
+    /// Creates a successful check result.
+    /// </summary>
+    public static CheckResult Success(Guid endpointId, DateTimeOffset timestamp, double? rttMs = null)
+    {
+        return new CheckResult
+        {
+            EndpointId = endpointId,
+            Timestamp = timestamp,
+            Status = UpDown.up,
+            RttMs = rttMs,
+            Error = null
+        };
+    }
+
+    /// <summary>
+    /// Creates a failed check result.
+    /// </summary>
+    public static CheckResult Failure(Guid endpointId, DateTimeOffset timestamp, string error)
+    {
+        return new CheckResult
+        {
+            EndpointId = endpointId,
+            Timestamp = timestamp,
+            Status = UpDown.down,
+            RttMs = null,
+            Error = error
+        };
+    }
+}
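
`CheckResult` carries a status, an optional RTT, and an optional error, and the commit message says probes run with timeouts. The `ProbeService` source is not part of this excerpt, so the following is only a sketch of how a timed probe might be folded into a result of that shape. `ProbeOutcome` and `ProbeSketch.RunAsync` are hypothetical stand-ins; the record mirrors `CheckResult` minus the endpoint ID and timestamp so the example stays self-contained.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for CheckResult (status, RTT, error only).
public sealed record ProbeOutcome(bool Up, double? RttMs, string? Error);

public static class ProbeSketch
{
    // Runs any cancellable probe delegate under a timeout and maps the
    // outcome to success-with-RTT or failure-with-error.
    public static async Task<ProbeOutcome> RunAsync(Func<CancellationToken, Task> probe, int timeoutMs)
    {
        using var cts = new CancellationTokenSource(timeoutMs);
        var stopwatch = Stopwatch.StartNew();
        try
        {
            await probe(cts.Token);
            return new ProbeOutcome(true, stopwatch.Elapsed.TotalMilliseconds, null);
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested)
        {
            // The timeout fired before the probe completed.
            return new ProbeOutcome(false, null, $"Timed out after {timeoutMs} ms");
        }
        catch (Exception ex)
        {
            // Any other failure (refused connection, DNS error, ...) becomes the error text.
            return new ProbeOutcome(false, null, ex.Message);
        }
    }
}
```

A TCP or HTTP check could be passed in as the delegate, provided it honours the cancellation token. Mapping every failure to a result instead of letting exceptions escape keeps the caller's loop simple, at the cost of flattening the exception type into a string.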

ThingConnect.Pulse.Server/Program.cs

Lines changed: 10 additions & 0 deletions
@@ -3,6 +3,7 @@
 using ThingConnect.Pulse.Server.Data;
 using ThingConnect.Pulse.Server.Infrastructure;
 using ThingConnect.Pulse.Server.Services;
+using ThingConnect.Pulse.Server.Services.Monitoring;
 
 namespace ThingConnect.Pulse.Server;
 
@@ -19,11 +20,20 @@ public static void Main(string[] args)
         // Add memory cache for settings service
         builder.Services.AddMemoryCache();
 
+        // Add HTTP client for probes
+        builder.Services.AddHttpClient();
+
         // Add configuration services
         builder.Services.AddSingleton<ConfigParser>();
         builder.Services.AddScoped<IConfigurationService, ConfigurationService>();
         builder.Services.AddScoped<ISettingsService, SettingsService>();
 
+        // Add monitoring services
+        builder.Services.AddScoped<IProbeService, ProbeService>();
+        builder.Services.AddScoped<IOutageDetectionService, OutageDetectionService>();
+        builder.Services.AddScoped<IDiscoveryService, DiscoveryService>();
+        builder.Services.AddHostedService<MonitoringBackgroundService>();
+
         builder.Services.AddControllers(options =>
         {
             options.InputFormatters.Insert(0, new PlainTextInputFormatter());
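
The hosted `MonitoringBackgroundService` registered above is described in the commit message as running probes concurrently with semaphore limiting (100 max); its source is not included in this excerpt. The pattern usually looks something like the sketch below, where a `SemaphoreSlim` gates how many probe tasks run at once. `ConcurrencySketch.RunLimitedAsync` is a hypothetical helper for illustration, and the default of 100 simply echoes the number in the commit message.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration; not the commit's MonitoringBackgroundService.
public static class ConcurrencySketch
{
    // Runs work for every item, with at most maxConcurrency invocations in flight.
    public static async Task<TResult[]> RunLimitedAsync<TItem, TResult>(
        IEnumerable<TItem> items,
        Func<TItem, Task<TResult>> work,
        int maxConcurrency = 100)
    {
        using var gate = new SemaphoreSlim(maxConcurrency, maxConcurrency);

        // ToList() starts every task immediately; each one waits at the gate.
        var tasks = items.Select(async item =>
        {
            await gate.WaitAsync();
            try
            {
                return await work(item);
            }
            finally
            {
                gate.Release(); // always free the slot, even when work throws
            }
        }).ToList();

        // Results come back in the same order as the input items.
        return await Task.WhenAll(tasks);
    }
}
```

Releasing in `finally` matters here: if a failing probe leaked its slot, the effective cap would shrink over time until nothing could run.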
