2 changes: 1 addition & 1 deletion .github/workflows/scorecards.yml
@@ -64,6 +64,6 @@ jobs:
# Upload the results to GitHub's code scanning dashboard (optional).
# Commenting out will disable upload of results to your repo's Code Scanning dashboard
- name: "Upload to code-scanning"
uses: github/codeql-action/upload-sarif@1b549b9259bda1cb5ddde3b41741a82a2d15a841 # v3.28.13
uses: github/codeql-action/upload-sarif@45775bd8235c68ba998cffa5171334d58593da47 # v3.28.15
with:
sarif_file: results.sarif
6 changes: 3 additions & 3 deletions .github/workflows/test-build-deploy.yml
@@ -93,15 +93,15 @@ jobs:

# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
uses: github/codeql-action/init@1b549b9259bda1cb5ddde3b41741a82a2d15a841 # v3.28.13
uses: github/codeql-action/init@45775bd8235c68ba998cffa5171334d58593da47 # v3.28.15
with:
languages: go

- name: Autobuild
uses: github/codeql-action/autobuild@1b549b9259bda1cb5ddde3b41741a82a2d15a841 # v3.28.13
uses: github/codeql-action/autobuild@45775bd8235c68ba998cffa5171334d58593da47 # v3.28.15

- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@1b549b9259bda1cb5ddde3b41741a82a2d15a841 # v3.28.13
uses: github/codeql-action/analyze@45775bd8235c68ba998cffa5171334d58593da47 # v3.28.15


build:
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -5,7 +5,12 @@
* [FEATURE] Query Frontend: Add dynamic interval size for query splitting. This is enabled by configuring experimental flags `querier.max-shards-per-query` and/or `querier.max-fetched-data-duration-per-query`. The split interval size is dynamically increased to maintain a number of shards and total duration fetched below the configured values. #6458
* [FEATURE] Querier/Ruler: Add `query_partial_data` and `rules_partial_data` limits to allow queries/rules to be evaluated with data from a single zone, if other zones are not available. #6526
* [FEATURE] Update prometheus alertmanager version to v0.28.0 and add new integration msteamsv2, jira, and rocketchat. #6590
* [FEATURE] Ingester/StoreGateway: Add `ResourceMonitor` module in Cortex, and add `ResourceBasedLimiter` in Ingesters and StoreGateways. #6674
* [FEATURE] Ingester: Support out-of-order native histogram ingestion. It is automatically enabled when `-ingester.out-of-order-time-window > 0` and `-blocks-storage.tsdb.enable-native-histograms=true`. #6626 #6663
* [FEATURE] Ruler: Add support for percentage based sharding for rulers. #6680
* [FEATURE] Ruler: Add support for group labels. #6665
* [ENHANCEMENT] Querier: Support query parameters in the metadata API (`/api/v1/metadata`) to allow users to limit the metadata returned. #6681
* [ENHANCEMENT] Ingester: Add a `cortex_ingester_active_native_histogram_series` metric to track the number of active native histogram series. #6695
* [ENHANCEMENT] Query Frontend: Add new limit `-frontend.max-query-response-size` for total query response size after decompression in query frontend. #6607
* [ENHANCEMENT] Alertmanager: Add nflog and silences maintenance metrics. #6659
* [ENHANCEMENT] Querier: limit label APIs to query only ingesters if the `start` param is not specified. #6618
@@ -30,6 +35,7 @@
* [BUGFIX] Ingester: Add check to avoid query 5xx when closing tsdb. #6616
* [BUGFIX] Querier: Fix panic when marshaling QueryResultRequest. #6601
* [BUGFIX] Ingester: Avoid resharding for query when restart readonly ingesters. #6642
* [BUGFIX] Query Frontend: Fix per-user metrics cleanup. #6698

## 1.19.0 2025-02-27

2 changes: 2 additions & 0 deletions README.md
@@ -66,6 +66,8 @@ Join us in shaping the future of Cortex, and let's build something amazing toget

### Talks

- Apr 2025 KubeCon talk "Cortex: Insights, Updates and Roadmap" ([video](https://youtu.be/3aUg2qxfoZU), [slides](https://static.sched.com/hosted_files/kccnceu2025/6c/Cortex%20Talk%20KubeCon%20EU%202025.pdf))
- Apr 2025 KubeCon talk "Taming 50 Billion Time Series: Operating Global-Scale Prometheus Deployments on Kubernetes" ([video](https://youtu.be/OqLpKJwKZlk), [slides](https://static.sched.com/hosted_files/kccnceu2025/b2/kubecon%20-%2050b%20-%20final.pdf))
- Nov 2024 KubeCon talk "Cortex Intro: Multi-Tenant Scalable Prometheus" ([video](https://youtu.be/OGAEWCoM6Tw), [slides](https://static.sched.com/hosted_files/kccncna2024/0f/Cortex%20Talk%20KubeCon%20US%202024.pdf))
- Mar 2024 KubeCon talk "Cortex Intro: Multi-Tenant Scalable Prometheus" ([video](https://youtu.be/by538PPSPQ0), [slides](https://static.sched.com/hosted_files/kccnceu2024/a1/Cortex%20Talk%20KubeConEU24.pptx.pdf))
- Apr 2023 KubeCon talk "How to Run a Rock Solid Multi-Tenant Prometheus" ([video](https://youtu.be/Pl5hEoRPLJU), [slides](https://static.sched.com/hosted_files/kccnceu2023/49/Kubecon2023.pptx.pdf))
1 change: 1 addition & 0 deletions cmd/cortex/main.go
@@ -18,6 +18,7 @@ import (
"github.com/prometheus/client_golang/prometheus"
collectorversion "github.com/prometheus/client_golang/prometheus/collectors/version"
"github.com/prometheus/common/version"
_ "go.uber.org/automaxprocs"
"gopkg.in/yaml.v2"

"github.com/cortexproject/cortex/pkg/cortex"
15 changes: 15 additions & 0 deletions docs/blocks-storage/store-gateway.md
@@ -349,6 +349,21 @@ store_gateway:
# CLI flag: -store-gateway.disabled-tenants
[disabled_tenants: <string> | default = ""]

instance_limits:
# EXPERIMENTAL: Max CPU utilization that this store-gateway can reach before
# rejecting new query requests (across all tenants), as a fraction between 0
# and 1. monitored_resources config must include the resource type. 0 to
# disable.
# CLI flag: -store-gateway.instance-limits.cpu-utilization
[cpu_utilization: <float> | default = 0]

# EXPERIMENTAL: Max heap utilization that this store-gateway can reach before
# rejecting new query requests (across all tenants), as a fraction between 0
# and 1. monitored_resources config must include the resource type. 0 to
# disable.
# CLI flag: -store-gateway.instance-limits.heap-utilization
[heap_utilization: <float> | default = 0]

hedged_request:
# If true, hedged requests are applied to object store calls. It can help
# with reducing tail latency.
40 changes: 38 additions & 2 deletions docs/configuration/config-file-reference.md
@@ -68,6 +68,12 @@ Where default_value is the value to use if the environment variable is undefined
# CLI flag: -http.prefix
[http_prefix: <string> | default = "/api/prom"]

# Comma-separated list of resources to monitor. Supported values are cpu and
# heap, which track close-estimate metrics from github.com/prometheus/procfs
# and runtime/metrics. Empty string to disable.
# CLI flag: -monitored.resources
[monitored_resources: <string> | default = ""]

api:
# Use GZIP compression for API responses. Some endpoints serve large YAML or
# JSON blobs which can benefit from compression.
@@ -3197,6 +3203,20 @@ lifecycler:
[upload_compacted_blocks_enabled: <boolean> | default = true]

instance_limits:
# EXPERIMENTAL: Max CPU utilization that this ingester can reach before
# rejecting new query requests (across all tenants), as a fraction between 0
# and 1. monitored_resources config must include the resource type. 0 to
# disable.
# CLI flag: -ingester.instance-limits.cpu-utilization
[cpu_utilization: <float> | default = 0]

# EXPERIMENTAL: Max heap utilization that this ingester can reach before
# rejecting new query requests (across all tenants), as a fraction between 0
# and 1. monitored_resources config must include the resource type. 0 to
# disable.
# CLI flag: -ingester.instance-limits.heap-utilization
[heap_utilization: <float> | default = 0]

# Max ingestion rate (samples/sec) that ingester will accept. This limit is
# per-ingester, not per-tenant. Additional push requests will be rejected.
# Current ingestion rate is computed as exponentially weighted moving average,
@@ -3635,9 +3655,10 @@ query_rejection:

# The default tenant's shard size when the shuffle-sharding strategy is used by
# ruler. When this setting is specified in the per-tenant overrides, a value of
# 0 disables shuffle sharding for the tenant.
# 0 disables shuffle sharding for the tenant. If the value is between 0 and 1,
# the shard size is that fraction of the total number of rulers.
# CLI flag: -ruler.tenant-shard-size
[ruler_tenant_shard_size: <int> | default = 0]
[ruler_tenant_shard_size: <float> | default = 0]

# Maximum number of rules per rule group per-tenant. 0 to disable.
# CLI flag: -ruler.max-rules-per-rule-group
@@ -5856,6 +5877,21 @@ sharding_ring:
# CLI flag: -store-gateway.disabled-tenants
[disabled_tenants: <string> | default = ""]

instance_limits:
# EXPERIMENTAL: Max CPU utilization that this store-gateway can reach before
# rejecting new query requests (across all tenants), as a fraction between 0
# and 1. monitored_resources config must include the resource type. 0 to
# disable.
# CLI flag: -store-gateway.instance-limits.cpu-utilization
[cpu_utilization: <float> | default = 0]

# EXPERIMENTAL: Max heap utilization that this store-gateway can reach before
# rejecting new query requests (across all tenants), as a fraction between 0
# and 1. monitored_resources config must include the resource type. 0 to
# disable.
# CLI flag: -store-gateway.instance-limits.heap-utilization
[heap_utilization: <float> | default = 0]

hedged_request:
# If true, hedged requests are applied to object store calls. It can help with
# reducing tail latency.
5 changes: 5 additions & 0 deletions docs/configuration/v1-guarantees.md
@@ -123,3 +123,8 @@ Currently experimental features are:
- Query-frontend: dynamic query splits
- `querier.max-shards-per-query` (int) CLI flag
- `querier.max-fetched-data-duration-per-query` (duration) CLI flag
- Ingester/Store-Gateway: Resource-based throttling
- `-ingester.instance-limits.cpu-utilization`
- `-ingester.instance-limits.heap-utilization`
- `-store-gateway.instance-limits.cpu-utilization`
- `-store-gateway.instance-limits.heap-utilization`
56 changes: 56 additions & 0 deletions docs/guides/protecting-cortex-from-heavy-queries.md
@@ -0,0 +1,56 @@
---
title: "Protecting Cortex from Heavy Queries"
linkTitle: "Protecting Cortex from Heavy Queries"
weight: 11
slug: protecting-cortex-from-heavy-queries
---

PromQL is powerful, and a single query can fetch a very wide range of data and process a huge number of samples. Heavy queries can cause:

1. CPU on any query component to be partially exhausted, increasing latency and causing incoming queries to queue up, with a high chance of timing out.
2. CPU on any query component to be fully exhausted, causing garbage collection to slow down, leading to the pod running out of memory and being killed.
3. Heap memory on any query component to be exhausted, leading to the pod running out of memory and being killed.

It's important to protect Cortex components by setting appropriate limits and throttling configurations based on your infrastructure and the data ingested by your customers.

## Static limits

There are a number of static limits that you can configure to block heavy queries from running.

### Max outstanding requests per tenant

See https://cortexmetrics.io/docs/configuration/configuration-file/#query_frontend_config:~:text=max_outstanding_requests_per_tenant for details.

### Max data bytes fetched per (sharded) query

See https://cortexmetrics.io/docs/configuration/configuration-file/#query_frontend_config:~:text=max_fetched_data_bytes_per_query for details.

### Max series fetched per (sharded) query

See https://cortexmetrics.io/docs/configuration/configuration-file/#query_frontend_config:~:text=max_fetched_series_per_query for details.

### Max chunks fetched per (sharded) query

See https://cortexmetrics.io/docs/configuration/configuration-file/#query_frontend_config:~:text=max_fetched_chunk_bytes_per_query for details.

### Max samples fetched per (sharded) query

See https://cortexmetrics.io/docs/configuration/configuration-file/#querier_config:~:text=max_samples for details.
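Taken together, the static limits above amount to a small configuration fragment. The sketch below is illustrative only: the values are made up, and the exact block each field belongs to (the `limits` block vs. the `querier` block) should be confirmed against the linked configuration reference for your Cortex version.

```yaml
limits:
  # Queue depth per tenant before new requests are rejected.
  max_outstanding_requests_per_tenant: 100
  # Per (sharded) query caps on data fetched from storage.
  max_fetched_series_per_query: 100000
  max_fetched_chunk_bytes_per_query: 1073741824  # 1 GiB
  max_fetched_data_bytes_per_query: 2147483648   # 2 GiB

querier:
  # Maximum number of samples a single query can load into memory.
  max_samples: 50000000
```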

## Resource-based throttling (Experimental)

Although the static limits can protect Cortex components from specific query patterns, they are not generic enough to cover every combination of bad query patterns. For example, what if a query fetches postings, series and chunks that are each slightly below the individual limits, but heavy in combination? For a more generic solution, you can enable resource-based throttling by setting CPU and heap utilization thresholds.

Currently, it only throttles incoming query requests, rejecting them with error code 429 (Too Many Requests) when resource usage breaches the configured thresholds.

For example, the following configuration starts throttling query requests if either CPU or heap utilization goes above 80%, leaving 20% of headroom for inflight requests.

```yaml
target: ingester
monitored_resources: cpu,heap
instance_limits:
cpu_utilization: 0.8
heap_utilization: 0.8
```

See https://cortexmetrics.io/docs/configuration/configuration-file/#:~:text=instance_limits for details.