Commit c6f725d
Metrics improvements based on recent testing (#4862)

* Add a metrics wrapper
* Add more metrics
* Remove counter for forwarding
* Move failure counters
* Cleanup
* Refactor
* Use seconds as unit for duration
* Cleanup
* Remove commented out code
* Apply suggestions from code review (Co-authored-by: Ramon Smits <[email protected]>)
* Fixup
* Align naming with https://prometheus.io/docs/practices/naming/
* Use explicit System.Diagnostics.DiagnosticSource to get access to Gauge and instrumentation advice
* Fixup
* Add result dimension to docs
* Make sure histogram buckets work
* Fix batch result
* Document queries
* Fixup

Co-authored-by: Ramon Smits <[email protected]>
1 parent: d759847 · commit: c6f725d

14 files changed, +349 -228 lines
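The renames called out in the commit message ("Use seconds as unit for duration", "Align naming with https://prometheus.io/docs/practices/naming/") follow two Prometheus conventions: durations are reported in the base unit of seconds and carry a `_seconds` suffix, and counters carry a `_total` suffix. A minimal sketch of instruments declared along those lines; the meter name is an assumption, while the instrument names come from the docs diff below:

```csharp
// Sketch only: illustrates the Prometheus naming conventions the commit adopts,
// i.e. durations in seconds (suffix _seconds) and counters with a _total suffix.
using System.Diagnostics.Metrics;

var meter = new Meter("ServiceControl.Audit"); // assumed meter name, not shown in this commit

var messageDuration = meter.CreateHistogram<double>(
    "sc.audit.ingestion.message_duration_seconds", unit: "s",
    description: "Audit message processing duration in seconds");

var failures = meter.CreateCounter<long>(
    "sc.audit.ingestion.failures_total",
    description: "Failure counter");
```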

docs/telemetry.md

Lines changed: 17 additions & 26 deletions
```diff
@@ -14,32 +14,23 @@ It's recommended to use a local [OTEL Collector](https://opentelemetry.io/docs/c
 
 Example configuration: https://github.com/andreasohlund/Docker/tree/main/otel-monitoring
 
-The following metrics are available:
-
-### Ingestion
-
-#### Success or failure
-
-- `sc.audit.ingestion.success` - Successful ingested audit message count (Counter)
-- `sc.audit.ingestion.retry` - Retried audit message count (Counter)
-- `sc.audit.ingestion.failed` - Failed audit message count (Counter)
-
-The above metrics also have the following attributes attached:
-
-- `messaging.message.body.size` - The size of the message body in bytes
-- `messaging.message.type` - The logical message type of the message if present
-
-#### Details
-
-- `sc.audit.ingestion.duration` - Audit message processing duration in milliseconds (Histogram)
-- `sc.audit.ingestion.forwarded` - Count of the number of forwarded audit messages if forwarding is enabled (Counter)
-
-### Batching
-
-- `sc.audit.ingestion.batch_duration` - Batch processing duration in milliseconds (Histogram)
-  - Attributes:
-    - `ingestion.batch_size`
-- `sc.audit.ingestion.consecutive_batch_failures` - Consecutive batch failures (Counter)
+The following ingestion metrics with their corresponding dimensions are available:
+
+- `sc.audit.ingestion.batch_duration_seconds` - Message batch processing duration in seconds
+  - `result` - Indicates if the full batch size was used (batch size == max concurrency of the transport): `full`, `partial` or `failed`
+- `sc.audit.ingestion.message_duration_seconds` - Audit message processing duration in seconds
+  - `message.category` - Indicates the category of the message ingested: `audit-message`, `saga-update` or `control-message`
+  - `result` - Indicates the outcome of the operation: `success`, `failed` or `skipped` (if the message was filtered out and skipped)
+- `sc.audit.ingestion.failures_total` - Failure counter
+  - `message.category` - Indicates the category of the message ingested: `audit-message`, `saga-update` or `control-message`
+  - `result` - Indicates how the failure was resolved: `retry` or `stored-poision`
+- `sc.audit.ingestion.consecutive_batch_failure_total` - Consecutive batch failures
+
+Example queries in PromQL for use in Grafana:
+
+- Ingestion rate: `sum (rate(sc_audit_ingestion_message_duration_seconds_count[$__rate_interval])) by (exported_job)`
+- Failure rate: `sum(rate(sc_audit_ingestion_failures_total[$__rate_interval])) by (exported_job,result)`
+- Message duration: `histogram_quantile(0.9,sum(rate(sc_audit_ingestion_message_duration_seconds_bucket[$__rate_interval])) by (le,exported_job))`
 
 ## Monitoring
 
```
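Two things the queries above quietly rely on: Prometheus normalizes the dot-separated OTel instrument names to underscores (`sc.audit.ingestion.failures_total` is queried as `sc_audit_ingestion_failures_total`), and `histogram_quantile` needs the `_bucket` series that only exist when histogram buckets are exported correctly (the "Make sure histogram buckets work" item in the commit message). For eyeballing the instruments locally without a collector, a minimal sketch using the console exporter this commit adds to the package list; the meter name is an assumption:

```csharp
// Local verification sketch (not ServiceControl code): prints measurements from the
// ingestion meter to the console using OpenTelemetry.Exporter.Console 1.9.0.
using OpenTelemetry;
using OpenTelemetry.Metrics;

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("ServiceControl.Audit") // assumed meter name
    .AddConsoleExporter()
    .Build();
```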

src/Directory.Packages.props

Lines changed: 3 additions & 1 deletion
```diff
@@ -50,6 +50,7 @@
     <PackageVersion Include="NUnit" Version="4.3.2" />
     <PackageVersion Include="NUnit.Analyzers" Version="4.6.0" />
     <PackageVersion Include="NUnit3TestAdapter" Version="5.0.0" />
+    <PackageVersion Include="OpenTelemetry.Exporter.Console" Version="1.9.0" />
     <PackageVersion Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.9.0" />
     <PackageVersion Include="OpenTelemetry.Extensions.Hosting" Version="1.9.0" />
     <PackageVersion Include="Particular.Approvals" Version="2.0.1" />
@@ -63,6 +64,7 @@
     <PackageVersion Include="ReactiveUI.WPF" Version="20.1.63" />
     <PackageVersion Include="ServiceControl.Contracts" Version="5.0.0" />
     <PackageVersion Include="System.Configuration.ConfigurationManager" Version="8.0.1" />
+    <PackageVersion Include="System.Diagnostics.DiagnosticSource" Version="9.0.2" />
     <PackageVersion Include="System.Diagnostics.PerformanceCounter" Version="8.0.1" />
     <PackageVersion Include="System.DirectoryServices.AccountManagement" Version="8.0.1" />
     <PackageVersion Include="System.Linq.Async" Version="6.0.1" />
@@ -89,4 +91,4 @@
     <GlobalPackageReference Include="Microsoft.Build.CopyOnWrite" Version="1.0.334" />
     <GlobalPackageReference Include="Particular.Packaging" Version="4.2.2" />
   </ItemGroup>
-</Project>
+</Project>
```
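The explicit `System.Diagnostics.DiagnosticSource` 9.0.2 pin matches the commit-message item "Use explicit System.Diagnostics.DiagnosticSource to get access to Gauge and instrumentation advice": `Gauge<T>` and `InstrumentAdvice<T>` (which lets an instrument suggest histogram bucket boundaries to whatever exporter is attached) first shipped in version 9 of that library, so they are not available from the in-box .NET 8 runtime assembly. A sketch of both APIs; the meter name, the gauge instrument, and the bucket boundaries are illustrative, not taken from this commit:

```csharp
// Sketch of the DiagnosticSource 9 APIs the commit message refers to; instrument
// names follow the docs above, the bucket boundaries are illustrative only.
using System.Diagnostics.Metrics;

var meter = new Meter("ServiceControl.Audit"); // assumed meter name

// Gauge<T> is new in DiagnosticSource 9.
Gauge<int> queueLength = meter.CreateGauge<int>("sc.audit.ingestion.queue_length"); // hypothetical instrument

// InstrumentAdvice<T> (also new in 9) suggests default histogram buckets, which is
// one way to "make sure histogram buckets work" for second-based durations.
var batchDuration = meter.CreateHistogram<double>(
    name: "sc.audit.ingestion.batch_duration_seconds",
    unit: "s",
    description: "Message batch processing duration in seconds",
    tags: null,
    advice: new InstrumentAdvice<double>
    {
        HistogramBucketBoundaries = [0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
    });
```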

src/ServiceControl.Audit/Auditing/AuditIngestion.cs

Lines changed: 29 additions & 35 deletions
```diff
@@ -2,11 +2,11 @@
 {
     using System;
     using System.Collections.Generic;
-    using System.Diagnostics.Metrics;
     using System.Threading;
     using System.Threading.Channels;
     using System.Threading.Tasks;
     using Infrastructure.Settings;
+    using Metrics;
     using Microsoft.Extensions.Hosting;
     using NServiceBus;
     using NServiceBus.Logging;
@@ -26,7 +26,8 @@ public AuditIngestion(
             AuditIngestionCustomCheck.State ingestionState,
             AuditIngestor auditIngestor,
             IAuditIngestionUnitOfWorkFactory unitOfWorkFactory,
-            IHostApplicationLifetime applicationLifetime)
+            IHostApplicationLifetime applicationLifetime,
+            IngestionMetrics metrics)
         {
             inputEndpoint = settings.AuditQueue;
             this.transportCustomization = transportCustomization;
@@ -35,21 +36,24 @@ public AuditIngestion(
             this.unitOfWorkFactory = unitOfWorkFactory;
             this.settings = settings;
             this.applicationLifetime = applicationLifetime;
+            this.metrics = metrics;
 
             if (!transportSettings.MaxConcurrency.HasValue)
             {
                 throw new ArgumentException("MaxConcurrency is not set in TransportSettings");
             }
 
-            channel = Channel.CreateBounded<MessageContext>(new BoundedChannelOptions(transportSettings.MaxConcurrency.Value)
+            MaxBatchSize = transportSettings.MaxConcurrency.Value;
+
+            channel = Channel.CreateBounded<MessageContext>(new BoundedChannelOptions(MaxBatchSize)
             {
                 SingleReader = true,
                 SingleWriter = false,
                 AllowSynchronousContinuations = false,
                 FullMode = BoundedChannelFullMode.Wait
             });
 
-            errorHandlingPolicy = new AuditIngestionFaultPolicy(failedImportsStorage, settings.LoggingSettings, OnCriticalError);
+            errorHandlingPolicy = new AuditIngestionFaultPolicy(failedImportsStorage, settings.LoggingSettings, OnCriticalError, metrics);
 
             watchdog = new Watchdog(
                 "audit message ingestion",
@@ -190,22 +194,21 @@ async Task EnsureStopped(CancellationToken cancellationToken)
 
         async Task OnMessage(MessageContext messageContext, CancellationToken cancellationToken)
         {
-            var tags = Telemetry.GetIngestedMessageTags(messageContext.Headers, messageContext.Body);
-            using (new DurationRecorder(ingestionDuration, tags))
+            using var messageIngestionMetrics = metrics.BeginIngestion(messageContext);
+
+            if (settings.MessageFilter != null && settings.MessageFilter(messageContext))
             {
-                if (settings.MessageFilter != null && settings.MessageFilter(messageContext))
-                {
-                    return;
-                }
+                messageIngestionMetrics.Skipped();
+                return;
+            }
 
-                var taskCompletionSource = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
-                messageContext.SetTaskCompletionSource(taskCompletionSource);
+            var taskCompletionSource = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
+            messageContext.SetTaskCompletionSource(taskCompletionSource);
 
-                await channel.Writer.WriteAsync(messageContext, cancellationToken);
-                await taskCompletionSource.Task;
+            await channel.Writer.WriteAsync(messageContext, cancellationToken);
+            _ = await taskCompletionSource.Task;
 
-                successfulMessagesCounter.Add(1, tags);
-            }
+            messageIngestionMetrics.Success();
         }
 
         public override async Task StartAsync(CancellationToken cancellationToken)
@@ -218,27 +221,24 @@ protected override async Task ExecuteAsync(CancellationToken stoppingToken)
         {
             try
             {
-                var contexts = new List<MessageContext>(transportSettings.MaxConcurrency.Value);
+                var contexts = new List<MessageContext>(MaxBatchSize);
 
                 while (await channel.Reader.WaitToReadAsync(stoppingToken))
                 {
                     // will only enter here if there is something to read.
                     try
                     {
+                        using var batchMetrics = metrics.BeginBatch(MaxBatchSize);
+
                         // as long as there is something to read this will fetch up to MaximumConcurrency items
-                        using (var recorder = new DurationRecorder(batchDuration))
+                        while (channel.Reader.TryRead(out var context))
                         {
-                            while (channel.Reader.TryRead(out var context))
-                            {
-                                contexts.Add(context);
-                            }
-
-                            recorder.Tags.Add("ingestion.batch_size", contexts.Count);
-
-                            await auditIngestor.Ingest(contexts);
+                            contexts.Add(context);
                         }
 
-                        consecutiveBatchFailuresCounter.Record(0);
+                        await auditIngestor.Ingest(contexts);
+
+                        batchMetrics.Complete(contexts.Count);
                     }
                     catch (Exception e)
                     {
@@ -255,9 +255,6 @@ protected override async Task ExecuteAsync(CancellationToken stoppingToken)
                        }
 
                        logger.Info("Ingesting messages failed", e);
-
-                       // no need to do interlocked increment since this is running sequential
-                       consecutiveBatchFailuresCounter.Record(consecutiveBatchFailures++);
                     }
                     finally
                     {
@@ -298,8 +295,8 @@ public override async Task StopAsync(CancellationToken cancellationToken)
 
         TransportInfrastructure transportInfrastructure;
         IMessageReceiver messageReceiver;
-        long consecutiveBatchFailures = 0;
 
+        readonly int MaxBatchSize;
         readonly SemaphoreSlim startStopSemaphore = new(1);
         readonly string inputEndpoint;
         readonly ITransportCustomization transportCustomization;
@@ -309,12 +306,9 @@ public override async Task StopAsync(CancellationToken cancellationToken)
         readonly IAuditIngestionUnitOfWorkFactory unitOfWorkFactory;
         readonly Settings settings;
         readonly Channel<MessageContext> channel;
-        readonly Histogram<double> batchDuration = Telemetry.Meter.CreateHistogram<double>(Telemetry.CreateInstrumentName("ingestion", "batch_duration"), unit: "ms", "Average audit message batch processing duration");
-        readonly Counter<long> successfulMessagesCounter = Telemetry.Meter.CreateCounter<long>(Telemetry.CreateInstrumentName("ingestion", "success"), description: "Successful ingested audit message count");
-        readonly Histogram<long> consecutiveBatchFailuresCounter = Telemetry.Meter.CreateHistogram<long>(Telemetry.CreateInstrumentName("ingestion", "consecutive_batch_failures"), unit: "count", description: "Consecutive audit ingestion batch failure");
-        readonly Histogram<double> ingestionDuration = Telemetry.Meter.CreateHistogram<double>(Telemetry.CreateInstrumentName("ingestion", "duration"), unit: "ms", description: "Average incoming audit message processing duration");
         readonly Watchdog watchdog;
         readonly IHostApplicationLifetime applicationLifetime;
+        readonly IngestionMetrics metrics;
 
         static readonly ILog logger = LogManager.GetLogger<AuditIngestion>();
 
```