Commit f6bdcc6

health_check: add stats counters to monitor health check behavior (envoyproxy#37409)
## Description

This PR adds stats to the health check HTTP filter. These new stats provide visibility into health check behavior, including request counts, successful/failed checks, cached responses, and cluster health status. These stats help operators monitor the health checking system and diagnose issues.

Here is a list of key stats added:

- **request_total** (Counter): Total number of requests that were served from this health check filter
- **failed** (Counter): Total number of health checks that failed (including failures from cluster status)
- **ok** (Counter): Total number of health checks that passed
- **cached_response** (Counter): Total number of requests that were responded to with cached health check status
- **failed_cluster_not_found** (Counter): Total number of failed health checks due to referenced cluster not being found
- **failed_cluster_empty** (Counter): Total number of failed health checks due to empty cluster membership when checking cluster health
- **failed_cluster_unhealthy** (Counter): Total number of failed health checks due to cluster falling below minimum healthy percentage threshold
- **degraded** (Counter): Total number of health check responses that reported degraded status

---

**Commit Message:** health_check: add stats counters to monitor health check behavior

**Additional Description:** This change improves observability of the health check filter by exposing key metrics about health check processing and cluster health state. The stats are scoped under the connection manager and follow standard Envoy stats naming conventions.

**Risk Level:** Low

**Testing:** Added unit and integration tests verifying all stats counters

**Docs Changes:** Added

**Release Notes:** Added

---------

Signed-off-by: Rohit Agrawal <[email protected]>
1 parent be1d2dc commit f6bdcc6

7 files changed: +227 -21 lines changed

changelogs/current.yaml

Lines changed: 4 additions & 0 deletions
@@ -303,6 +303,10 @@ new_features:
   change: |
     Added :ref:`attribute <arch_overview_attributes>` ``upstream.cx_pool_ready_duration``
     to get the duration from when the upstream request was created to when the upstream connection pool is ready.
+- area: health_check
+  change: |
+    Added new health check filter stats including total requests, successful/failed checks, cached responses, and
+    cluster health status counters. These stats help track health check behavior and cluster health state.
 
 deprecated:
 - area: rbac

docs/root/configuration/http/http_filters/health_check_filter.rst

Lines changed: 20 additions & 0 deletions
@@ -16,3 +16,23 @@ Health check
   <operations_admin_interface_healthcheck_fail>` admin endpoint has been called. (The
   :ref:`/healthcheck/ok <operations_admin_interface_healthcheck_ok>` admin endpoint reverses this
   behavior).
+
+Statistics
+----------
+
+The health check filter outputs statistics in the ``http.<stat_prefix>.health_check.`` namespace. The
+:ref:`stat prefix <envoy_v3_api_field_extensions.filters.network.http_connection_manager.v3.HttpConnectionManager.stat_prefix>`
+comes from the owning HTTP connection manager.
+
+.. csv-table::
+  :header: Name, Type, Description
+  :widths: 1, 1, 2
+
+  request_total, Counter, Total number of requests processed by this health check filter (including responses served from the cache)
+  failed, Counter, Total number of health checks that failed (including failures due to cluster status and responses served from the cache)
+  ok, Counter, Total number of health checks that passed
+  cached_response, Counter, Total number of requests that were responded to with cached health check status
+  failed_cluster_not_found, Counter, Total number of failed health checks due to referenced cluster not being found
+  failed_cluster_empty, Counter, Total number of failed health checks due to empty cluster membership when checking cluster health
+  failed_cluster_unhealthy, Counter, Total number of failed health checks due to cluster falling below minimum healthy percentage threshold
+  degraded, Counter, Total number of health check responses that reported degraded status
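As a rough illustration of how the fully qualified counter names in this namespace are formed, the sketch below joins a connection manager stat prefix with the filter namespace and a counter name. The helper `healthCheckStatName` is hypothetical and not part of Envoy; it only mirrors the naming pattern described in the docs, on the assumption that the filter is handed a prefix of the form `http.<stat_prefix>.`.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not an Envoy API): builds the full stat name an operator
// would look up for a given HTTP connection manager stat_prefix.
std::string healthCheckStatName(const std::string& stat_prefix, const std::string& counter) {
  // Assumed shape of the prefix passed to the filter factory: "http.<stat_prefix>."
  const std::string filter_prefix = "http." + stat_prefix + ".";
  // The filter appends its own "health_check." namespace before the counter name.
  return filter_prefix + "health_check." + counter;
}
```

With a stat prefix of `config_test`, the helper yields `http.config_test.health_check.request_total`, which matches the names queried in the integration test of this commit.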

source/extensions/filters/http/health_check/config.cc

Lines changed: 6 additions & 3 deletions
@@ -17,9 +17,11 @@ namespace HealthCheck {
 
 Http::FilterFactoryCb HealthCheckFilterConfig::createFilterFactoryFromProtoTyped(
     const envoy::extensions::filters::http::health_check::v3::HealthCheck& proto_config,
-    const std::string&, Server::Configuration::FactoryContext& context) {
+    const std::string& stats_prefix, Server::Configuration::FactoryContext& context) {
   ASSERT(proto_config.has_pass_through_mode());
 
+  auto stats = std::make_shared<HealthCheckFilterStats>(
+      HealthCheckFilterStats::generateStats(stats_prefix, context.scope()));
   const bool pass_through_mode = proto_config.pass_through_mode().value();
   const int64_t cache_time_ms = PROTOBUF_GET_MS_OR_DEFAULT(proto_config, cache_time, 0);
 
@@ -48,10 +50,11 @@ Http::FilterFactoryCb HealthCheckFilterConfig::createFilterFactoryFromProtoTyped
   }
 
   return [&context, pass_through_mode, cache_manager, header_match_data,
-          cluster_min_healthy_percentages](Http::FilterChainFactoryCallbacks& callbacks) -> void {
+          cluster_min_healthy_percentages,
+          stats](Http::FilterChainFactoryCallbacks& callbacks) -> void {
     callbacks.addStreamFilter(std::make_shared<HealthCheckFilter>(
         context.serverFactoryContext(), pass_through_mode, cache_manager, header_match_data,
-        cluster_min_healthy_percentages));
+        cluster_min_healthy_percentages, stats));
   };
 }
 
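The change above creates one stats object per filter configuration and copies the `shared_ptr` into the factory lambda, so every filter instance built by that factory increments the same counters. A minimal sketch of that ownership pattern follows. All types here (`FilterStats`, `Filter`, `makeFactory`) are simplified stand-ins invented for illustration, and a plain integer replaces Envoy's thread-safe counter.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <utility>

// Illustrative stand-in for the stats struct; a plain integer is used here,
// whereas real Envoy counters are thread-safe objects.
struct FilterStats {
  uint64_t request_total{0};
};

// Illustrative stand-in for a per-stream filter that shares the stats object.
struct Filter {
  explicit Filter(std::shared_ptr<FilterStats> stats) : stats_(std::move(stats)) {}
  void onComplete() { ++stats_->request_total; }
  std::shared_ptr<FilterStats> stats_;
};

using FilterFactory = std::function<std::shared_ptr<Filter>()>;

// Capturing the shared_ptr by value keeps the stats alive for as long as the
// factory (and any filter it created) exists.
FilterFactory makeFactory(std::shared_ptr<FilterStats> stats) {
  return [stats]() { return std::make_shared<Filter>(stats); };
}
```

Two filters produced by the same factory both add to `request_total`, which is the behavior the per-configuration `stats` capture is meant to provide.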

source/extensions/filters/http/health_check/health_check.cc

Lines changed: 14 additions & 1 deletion
@@ -119,16 +119,19 @@ void HealthCheckFilter::onComplete() {
   Http::Code final_status = Http::Code::OK;
   const std::string* details = &RcDetails::get().HealthCheckOk;
   bool degraded = false;
+  stats_->request_total_.inc();
   if (context_.healthCheckFailed()) {
     callbacks_->streamInfo().setResponseFlag(StreamInfo::CoreResponseFlag::FailedLocalHealthCheck);
     final_status = Http::Code::ServiceUnavailable;
     details = &RcDetails::get().HealthCheckFailed;
+    stats_->failed_.inc();
   } else {
     if (cache_manager_) {
       const auto status_and_degraded = cache_manager_->getCachedResponse();
       final_status = status_and_degraded.first;
       details = &RcDetails::get().HealthCheckCached;
       degraded = status_and_degraded.second;
+      stats_->cached_response_.inc();
     } else if (cluster_min_healthy_percentages_ != nullptr &&
                !cluster_min_healthy_percentages_->empty()) {
       // Check the status of the specified upstream cluster(s) to determine the right response.
@@ -142,9 +145,10 @@ void HealthCheckFilter::onComplete() {
           // If the cluster does not exist at all, consider the service unhealthy.
           final_status = Http::Code::ServiceUnavailable;
           details = &RcDetails::get().HealthCheckNoCluster;
-
+          stats_->failed_cluster_not_found_.inc();
           break;
         }
+
         const auto& endpoint_stats = cluster->info()->endpointStats();
         const uint64_t membership_total = endpoint_stats.membership_total_.value();
         if (membership_total == 0) {
@@ -155,6 +159,7 @@ void HealthCheckFilter::onComplete() {
           } else {
             final_status = Http::Code::ServiceUnavailable;
             details = &RcDetails::get().HealthCheckClusterEmpty;
+            stats_->failed_cluster_empty_.inc();
             break;
           }
         }
@@ -165,6 +170,7 @@ void HealthCheckFilter::onComplete() {
             membership_total * min_healthy_percentage) {
           final_status = Http::Code::ServiceUnavailable;
           details = &RcDetails::get().HealthCheckClusterUnhealthy;
+          stats_->failed_cluster_unhealthy_.inc();
           break;
         }
       }
@@ -173,9 +179,16 @@ void HealthCheckFilter::onComplete() {
     if (!Http::CodeUtility::is2xx(enumToInt(final_status))) {
       callbacks_->streamInfo().setResponseFlag(
           StreamInfo::CoreResponseFlag::FailedLocalHealthCheck);
+      stats_->failed_.inc();
+    } else {
+      stats_->ok_.inc();
     }
   }
 
+  if (degraded) {
+    stats_->degraded_.inc();
+  }
+
   callbacks_->sendLocalReply(
       final_status, "",
       [degraded](auto& headers) {
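The counter updates in `onComplete()` follow a small decision tree: every request bumps `request_total`, exactly one of `failed` or `ok` is bumped, and `cached_response` and `degraded` are recorded in addition to those. The sketch below models that tree with plain integers so the resulting invariant (`request_total == ok + failed`) can be checked. The `Counters` and `Request` types and the `record` function are hypothetical simplifications, the per-cluster failure counters are left out, and the real filter uses Envoy counters and HTTP status enums instead.

```cpp
#include <cassert>
#include <cstdint>

// Plain integers standing in for Envoy's counters (an assumption of this sketch).
struct Counters {
  uint64_t request_total{0};
  uint64_t failed{0};
  uint64_t ok{0};
  uint64_t cached_response{0};
  uint64_t degraded{0};
};

// Hypothetical, flattened view of the inputs that onComplete() consults.
struct Request {
  bool health_check_failed; // true after /healthcheck/fail was called
  bool cache_enabled;       // true when a cache manager is configured
  int status;               // status taken from the cache or the cluster checks
  bool cached_degraded;     // degraded flag stored alongside the cached status
};

// Mirrors the branch structure shown in the diff and returns the final status.
int record(Counters& c, const Request& r) {
  ++c.request_total;
  bool degraded = false;
  int final_status = 200;
  if (r.health_check_failed) {
    final_status = 503;
    ++c.failed;
  } else {
    final_status = r.status;
    if (r.cache_enabled) {
      degraded = r.cached_degraded;
      ++c.cached_response;
    }
    if (final_status < 200 || final_status >= 300) {
      ++c.failed;
    } else {
      ++c.ok;
    }
  }
  if (degraded) {
    ++c.degraded;
  }
  return final_status;
}
```

Under this model a cached response is counted both in `cached_response` and in either `ok` or `failed`, which is consistent with the wording of the documentation table added in this commit.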

source/extensions/filters/http/health_check/health_check.h

Lines changed: 35 additions & 5 deletions
@@ -8,14 +8,44 @@
 #include "envoy/http/codes.h"
 #include "envoy/http/filter.h"
 #include "envoy/server/filter_config.h"
+#include "envoy/stats/stats.h"
+#include "envoy/stats/stats_macros.h"
 
+#include "source/common/common/assert.h"
 #include "source/common/http/header_utility.h"
 
 namespace Envoy {
 namespace Extensions {
 namespace HttpFilters {
 namespace HealthCheck {
 
+/**
+ * All health check filter stats. @see stats_macros.h
+ */
+#define ALL_HEALTH_CHECK_FILTER_STATS(COUNTER) \
+  COUNTER(request_total) \
+  COUNTER(failed) \
+  COUNTER(ok) \
+  COUNTER(cached_response) \
+  COUNTER(failed_cluster_not_found) \
+  COUNTER(failed_cluster_empty) \
+  COUNTER(failed_cluster_unhealthy) \
+  COUNTER(degraded)
+
+/**
+ * Struct definition for all health check stats. @see stats_macros.h
+ */
+struct HealthCheckFilterStats {
+  ALL_HEALTH_CHECK_FILTER_STATS(GENERATE_COUNTER_STRUCT)
+
+  static HealthCheckFilterStats generateStats(const std::string& prefix, Stats::Scope& scope) {
+    const std::string final_prefix = absl::StrCat(prefix, "health_check.");
+    return {ALL_HEALTH_CHECK_FILTER_STATS(POOL_COUNTER_PREFIX(scope, final_prefix))};
+  }
+};
+
+using HealthCheckFilterStatsSharedPtr = std::shared_ptr<HealthCheckFilterStats>;
+
 /**
  * Shared cache manager used by all instances of a health check filter configuration as well as
  * all threads. This sets up a timer that will invalidate the cached response code and allow some
@@ -48,13 +78,11 @@ class HealthCheckCacheManager {
 };
 
 using HealthCheckCacheManagerSharedPtr = std::shared_ptr<HealthCheckCacheManager>;
-
+using HeaderDataVectorSharedPtr = std::shared_ptr<std::vector<Http::HeaderUtility::HeaderDataPtr>>;
 using ClusterMinHealthyPercentages = std::map<std::string, double>;
 using ClusterMinHealthyPercentagesConstSharedPtr =
     std::shared_ptr<const ClusterMinHealthyPercentages>;
 
-using HeaderDataVectorSharedPtr = std::shared_ptr<std::vector<Http::HeaderUtility::HeaderDataPtr>>;
-
 /**
  * Health check responder filter.
  */
@@ -63,10 +91,11 @@ class HealthCheckFilter : public Http::StreamFilter {
   HealthCheckFilter(Server::Configuration::ServerFactoryContext& context, bool pass_through_mode,
                     HealthCheckCacheManagerSharedPtr cache_manager,
                     HeaderDataVectorSharedPtr header_match_data,
-                    ClusterMinHealthyPercentagesConstSharedPtr cluster_min_healthy_percentages)
+                    ClusterMinHealthyPercentagesConstSharedPtr cluster_min_healthy_percentages,
+                    HealthCheckFilterStatsSharedPtr stats)
       : context_(context), pass_through_mode_(pass_through_mode), cache_manager_(cache_manager),
         header_match_data_(std::move(header_match_data)),
-        cluster_min_healthy_percentages_(cluster_min_healthy_percentages) {}
+        cluster_min_healthy_percentages_(cluster_min_healthy_percentages), stats_(stats) {}
 
   // Http::StreamFilterBase
   void onDestroy() override {}
@@ -108,6 +137,7 @@ class HealthCheckFilter : public Http::StreamFilter {
   HealthCheckCacheManagerSharedPtr cache_manager_;
   const HeaderDataVectorSharedPtr header_match_data_;
   ClusterMinHealthyPercentagesConstSharedPtr cluster_min_healthy_percentages_;
+  const HealthCheckFilterStatsSharedPtr stats_;
 };
 
 } // namespace HealthCheck
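The header relies on the X-macro idiom: one list macro names every counter once, and different helper macros expand that list into struct members or into initializers. The sketch below reproduces the idiom outside Envoy. The `DEMO_*` macros, the `DemoStats` struct, and the `std::map` used as a counter store are all invented for illustration; Envoy's real macros in `stats_macros.h` differ in detail and create counters in a `Stats::Scope` instead.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// One list of counter names; each helper macro below is applied to every entry.
#define ALL_DEMO_STATS(COUNTER) \
  COUNTER(request_total) \
  COUNTER(failed) \
  COUNTER(ok)

// Expands each name into a reference member, e.g. `uint64_t& request_total_;`.
#define DEMO_GENERATE_COUNTER_STRUCT(NAME) uint64_t& NAME##_;

// Two-step expansion: the first macro leaves DEMO_FINISH_STAT dangling so that it
// picks up `(NAME)` from the list and turns it into `"NAME"],`.
#define DEMO_FINISH_STAT(NAME) #NAME],
#define DEMO_POOL_COUNTER_PREFIX(STORE, PREFIX) (STORE)[(PREFIX) + DEMO_FINISH_STAT

struct DemoStats {
  ALL_DEMO_STATS(DEMO_GENERATE_COUNTER_STRUCT)

  // Looks up (or creates) one map entry per counter under "<prefix>health_check.".
  static DemoStats generateStats(const std::string& prefix,
                                 std::map<std::string, uint64_t>& store) {
    const std::string final_prefix = prefix + "health_check.";
    return {ALL_DEMO_STATS(DEMO_POOL_COUNTER_PREFIX(store, final_prefix))};
  }
};
```

Adding a counter then means adding a single `COUNTER(...)` line to the list, which is why the commit only had to extend `ALL_HEALTH_CHECK_FILTER_STATS` to define all eight stats.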

test/extensions/filters/http/health_check/health_check_integration_test.cc

Lines changed: 31 additions & 0 deletions
@@ -179,6 +179,37 @@ TEST_P(HealthCheckIntegrationTest, HealthCheckWithBufferFilter) {
   EXPECT_EQ("200", request("http", "GET", "/healthcheck", response));
 }
 
+TEST_P(HealthCheckIntegrationTest, HealthCheckStats) {
+  DISABLE_IF_ADMIN_DISABLED;
+  initialize();
+
+  // Initial stats should be zero
+  EXPECT_EQ(0, test_server_->counter("http.config_test.health_check.request_total")->value());
+  EXPECT_EQ(0, test_server_->counter("http.config_test.health_check.ok")->value());
+  EXPECT_EQ(0, test_server_->counter("http.config_test.health_check.failed")->value());
+
+  // Make a health check request - should result in OK response and increment request/ok counters
+  BufferingStreamDecoderPtr response;
+  EXPECT_EQ("200", request("http", "GET", "/healthcheck", response));
+  EXPECT_EQ(1, test_server_->counter("http.config_test.health_check.request_total")->value());
+  EXPECT_EQ(1, test_server_->counter("http.config_test.health_check.ok")->value());
+  EXPECT_EQ(0, test_server_->counter("http.config_test.health_check.failed")->value());
+
+  // Fail the health check and verify failed counter increments
+  EXPECT_EQ("200", request("admin", "POST", "/healthcheck/fail", response));
+  EXPECT_EQ("503", request("http", "GET", "/healthcheck", response));
+  EXPECT_EQ(2, test_server_->counter("http.config_test.health_check.request_total")->value());
+  EXPECT_EQ(1, test_server_->counter("http.config_test.health_check.ok")->value());
+  EXPECT_EQ(1, test_server_->counter("http.config_test.health_check.failed")->value());
+
+  // Restore health check and verify ok counter increments
+  EXPECT_EQ("200", request("admin", "POST", "/healthcheck/ok", response));
+  EXPECT_EQ("200", request("http", "GET", "/healthcheck", response));
+  EXPECT_EQ(3, test_server_->counter("http.config_test.health_check.request_total")->value());
+  EXPECT_EQ(2, test_server_->counter("http.config_test.health_check.ok")->value());
+  EXPECT_EQ(1, test_server_->counter("http.config_test.health_check.failed")->value());
+}
+
 INSTANTIATE_TEST_SUITE_P(Protocols, HealthCheckIntegrationTest,
                          testing::ValuesIn(HttpProtocolIntegrationTest::getProtocolTestParams(
                              {Http::CodecType::HTTP1, Http::CodecType::HTTP2},
