diff --git a/.github/workflows/test-build-deploy.yml b/.github/workflows/test-build-deploy.yml
index 6b3be24bc8b..9d510d58c9d 100644
--- a/.github/workflows/test-build-deploy.yml
+++ b/.github/workflows/test-build-deploy.yml
@@ -253,7 +253,7 @@ jobs:
 
   deploy_website:
     needs: [build, test]
-    if: (github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/')) && github.repository == 'cortexproject/cortex'
+    if: github.ref == 'refs/heads/master' && github.repository == 'cortexproject/cortex'
     runs-on: ubuntu-24.04
     container:
       image: quay.io/cortexproject/build-image:master-59491e9aae
diff --git a/CHANGELOG.md b/CHANGELOG.md
index c1a8e5ca4e2..a5da0c1cf29 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -99,8 +99,12 @@
 * [BUGFIX] Compactor: Delete the prefix `blocks_meta` from the metadata fetcher metrics. #6832
 * [BUGFIX] Store Gateway: Avoid race condition by deduplicating entries in bucket stores user scan. #6863
 * [BUGFIX] Runtime-config: Change to check tenant limit validation when loading runtime config only for `all`, `distributor`, `querier`, and `ruler` targets. #6880
-* [BUGFIX] Frontend: Fix remote read snappy input due to request string logging when query stats enabled. #7025
 * [BUGFIX] Distributor: Fix the `/distributor/all_user_stats` api to work during rolling updates on ingesters. #7026
+* [BUGFIX] Runtime-config: Fix panic when the runtime config is `null`. #7062
+
+## 1.19.1 2025-09-20
+
+* [BUGFIX] Frontend: Fix remote read snappy input due to request string logging when query stats enabled. #7025
 
 ## 1.19.0 2025-02-27
 
diff --git a/VERSION b/VERSION
index 815d5ca06d5..66e2ae6c25c 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-1.19.0
+1.19.1
diff --git a/docs/getting-started/.env b/docs/getting-started/.env
index 3fca602f3b2..546dae1071d 100644
--- a/docs/getting-started/.env
+++ b/docs/getting-started/.env
@@ -1,4 +1,4 @@
-CORTEX_VERSION=v1.19.0
+CORTEX_VERSION=v1.19.1
 GRAFANA_VERSION=10.4.2
 PROMETHEUS_VERSION=v3.2.1
 SEAWEEDFS_VERSION=3.67
diff --git a/docs/guides/ingesters-scaling-up-and-down.md b/docs/guides/ingesters-scaling-up-and-down.md
index ff11e7ca3ae..dddd7bcad5c 100644
--- a/docs/guides/ingesters-scaling-up-and-down.md
+++ b/docs/guides/ingesters-scaling-up-and-down.md
@@ -22,11 +22,66 @@ no special care is required to take when scaling up ingesters.
 
 ## Scaling down
 
-A running ingester holds several hours of time series data in memory before they're flushed to the long-term storage. When an ingester shuts down because of a scale down operation, the in-memory data must not be discarded in order to avoid any data loss.
+A running ingester holds several hours of time series data in memory before they’re flushed to the long-term storage. When an ingester shuts down because of a scale down operation, the in-memory data must not be discarded in order to avoid any data loss.
 
-Ingesters don't flush series to blocks at shutdown by default. However, Cortex ingesters expose an API endpoint [`/shutdown`](../api/_index.md#shutdown) that can be called to flush series to blocks and upload blocks to the long-term storage before the ingester terminates.
+Ingesters don’t flush series to blocks at shutdown by default. However, Cortex ingesters expose an API endpoint [`/shutdown`](../api/_index.md#shutdown) that can be called to flush series to blocks and upload blocks to the long-term storage before the ingester terminates.
 
-Even if ingester blocks are compacted and shipped to the storage at shutdown, it takes some time for queriers and store-gateways to discover the newly uploaded blocks. This is due to the fact that the blocks storage runs a periodic scanning of the storage bucket to discover blocks. If two or more ingesters are scaled down in a short period of time, queriers may miss some data at query time due to series that were stored in the terminated ingesters but their blocks haven't been discovered yet.
+Even if ingester blocks are compacted and shipped to the storage at shutdown, it takes some time for queriers and store-gateways to discover the newly uploaded blocks. This is due to the fact that the blocks storage runs a periodic scanning of the storage bucket to discover blocks. If two or more ingesters are scaled down in a short period of time, queriers may miss some data at query time due to series that were stored in the terminated ingesters but their blocks haven’t been discovered yet.
+
+### New Gradual Scaling Approach (Recommended)
+
+Starting with Cortex 1.19.0, a new **READONLY** state for ingesters was introduced that enables gradual, safe scaling down without data loss or performance impact. This approach eliminates the need for complex configuration changes and allows for more flexible scaling operations.
+
+#### How the READONLY State Works
+
+The READONLY state allows ingesters to:
+- **Stop accepting new writes** - Push requests will be rejected and redistributed to other ingesters
+- **Continue serving queries** - Existing data remains available for queries, maintaining performance
+- **Gradually age out data** - As time passes, data naturally ages out according to your retention settings
+- **Be safely removed** - Once data has aged out, ingesters can be terminated without any impact
+
+#### Step-by-Step Scaling Process
+
+1. **Set ingesters to READONLY mode**
+   ```bash
+   # Transition ingester to READONLY state
+   curl -X POST http://ingester-1:8080/ingester/mode -d '{"mode": "READONLY"}'
+   curl -X POST http://ingester-2:8080/ingester/mode -d '{"mode": "READONLY"}'
+   curl -X POST http://ingester-3:8080/ingester/mode -d '{"mode": "READONLY"}'
+   ```
+
+2. **Monitor data aging** (Optional but recommended)
+   ```bash
+   # Check user statistics and loaded blocks on the ingester
+   curl http://ingester-1:8080/ingester/all_user_stats
+   ```
+
+3. **Wait for safe removal window**
+   - **Immediate removal** (after step 1): Safe once queries no longer need the ingester's data
+   - **Conservative approach**: Wait for `querier.query-ingesters-within` duration (e.g., 5 hours)
+   - **Complete data aging**: Wait for full retention period to ensure all blocks are removed
+
+4. **Remove ingesters**
+   ```bash
+   # Terminate the ingester processes
+   kubectl delete pod ingester-1 ingester-2 ingester-3
+   ```
+
+#### Timeline Example
+
+For a cluster with `querier.query-ingesters-within=5h`:
+
+- **T0**: Set ingesters 5, 6, 7 to READONLY state
+- **T1**: Ingesters stop receiving new data but continue serving queries
+- **T2 (T0 + 5h)**: Ingesters no longer receive query requests (safe to remove)
+- **T3 (T0 + retention_period)**: All blocks naturally removed from ingesters
+- **T4**: Remove ingesters from cluster
+
+**Any time after T2 is safe for removal without service impact.**
+
+### Legacy Approach (For Older Versions)
+
+If you’re running an older version of Cortex that doesn’t support the READONLY state, you’ll need to follow the legacy approach.
 
 The ingesters scale down is deemed an infrequent operation and no automation is currently provided. However, if you need to scale down ingesters, please be aware of the following:
 
diff --git a/pkg/cortex/runtime_config.go b/pkg/cortex/runtime_config.go
index 5f71746c2a2..974e22f04bd 100644
--- a/pkg/cortex/runtime_config.go
+++ b/pkg/cortex/runtime_config.go
@@ -85,9 +85,11 @@ func (l runtimeConfigLoader) load(r io.Reader) (any, error) {
 		if strings.Contains(targetStr, target) {
 			// only check if target is `all`, `distributor`, "querier", and "ruler"
 			// refer to https://github.com/cortexproject/cortex/issues/6741#issuecomment-3067244929
-			for _, ul := range overrides.TenantLimits {
-				if err := ul.Validate(l.cfg.Distributor.ShardByAllLabels, l.cfg.Ingester.ActiveSeriesMetricsEnabled); err != nil {
-					return nil, err
+			if overrides != nil {
+				for _, ul := range overrides.TenantLimits {
+					if err := ul.Validate(l.cfg.Distributor.ShardByAllLabels, l.cfg.Ingester.ActiveSeriesMetricsEnabled); err != nil {
+						return nil, err
+					}
 				}
 			}
 		}
diff --git a/pkg/cortex/runtime_config_test.go b/pkg/cortex/runtime_config_test.go
index e65398426c2..bd2f6b22320 100644
--- a/pkg/cortex/runtime_config_test.go
+++ b/pkg/cortex/runtime_config_test.go
@@ -12,6 +12,16 @@ import (
 	"github.com/cortexproject/cortex/pkg/util/validation"
 )
 
+func TestLoadRuntimeConfig_ShouldNoPanicWhenNull(t *testing.T) {
+	yamlFile := strings.NewReader(`
+null
+`)
+
+	loader := runtimeConfigLoader{cfg: Config{Target: []string{All}}}
+	_, err := loader.load(yamlFile)
+	require.NoError(t, err)
+}
+
 // Given limits are usually loaded via a config file, and that
 // a configmap is limited to 1MB, we need to minimise the limits file.
 // One way to do it is via YAML anchors.
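Reviewer sketch for the `runtime_config.go` hunk above, not part of the patch: the guard matters because reading a field through a nil pointer panics in Go, and the new test suggests that a `null` YAML document can leave `overrides` nil. The `limits` and `runtimeConfigValues` types and the `validateTenantLimits` helper are hypothetical stand-ins, not the Cortex definitions; only the shape of the nil check is taken from the diff.

```go
package main

import "fmt"

// limits is a hypothetical stand-in for a per-tenant limits struct.
type limits struct {
	maxSeries int
}

// runtimeConfigValues is a hypothetical stand-in for the decoded runtime config.
type runtimeConfigValues struct {
	TenantLimits map[string]limits
}

// validateTenantLimits follows the shape of the patched loop: a nil
// overrides pointer means there is nothing to validate, so it returns
// early instead of dereferencing the pointer.
func validateTenantLimits(overrides *runtimeConfigValues) error {
	if overrides == nil {
		return nil
	}
	for tenant, l := range overrides.TenantLimits {
		// Placeholder rule, assumed for illustration only.
		if l.maxSeries < 0 {
			return fmt.Errorf("tenant %q: maxSeries must not be negative", tenant)
		}
	}
	return nil
}

func main() {
	// Without the early return, overrides.TenantLimits would panic here.
	fmt.Println("nil overrides:", validateTenantLimits(nil))

	bad := &runtimeConfigValues{TenantLimits: map[string]limits{"tenant-a": {maxSeries: -1}}}
	fmt.Println("invalid limits:", validateTenantLimits(bad))
}
```

Ranging over a nil map is already safe in Go, so an empty but non-nil config needs no extra handling; only the pointer itself has to be checked.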
diff --git a/website/content/en/blog/2025/readonly-ingester-scaling.md b/website/content/en/blog/2025/readonly-ingester-scaling.md
new file mode 100644
index 00000000000..1f3b281a90c
--- /dev/null
+++ b/website/content/en/blog/2025/readonly-ingester-scaling.md
@@ -0,0 +1,180 @@
+---
+date: 2025-10-17
+title: "Introducing READONLY State: Gradual and Safe Ingester Scaling"
+linkTitle: READONLY Ingester Scaling
+tags: [ "blog", "cortex", "ingester", "scaling" ]
+categories: [ "blog" ]
+projects: [ "cortex" ]
+description: >
+  Learn about Cortex's new READONLY state for ingesters introduced in version 1.19.0 that enables gradual, safe scaling down operations without data loss or performance impact.
+author: Daniel Blando ([@danielblando](https://github.com/danielblando))
+---
+
+## Introduction
+
+Scaling down ingesters in Cortex has traditionally been a complex and risky operation. The conventional approach required setting `querier.query-store-after=0s`, which forces all queries to hit storage directly, significantly impacting performance. With Cortex 1.19.0, we introduced a new **READONLY state** for ingesters that changes how you can safely scale down your Cortex clusters.
+
+## Why Traditional Scaling Falls Short
+
+The legacy approach to ingester scaling had several issues:
+
+**Performance Impact**: Setting `querier.query-store-after=0s` forces all queries to bypass ingesters entirely, increasing query latency and storage load.
+
+**Operational Complexity**: Traditional scaling required coordinating configuration changes across multiple components, precise timing, manual monitoring of bucket scanning intervals, and scaling ingesters one by one with waiting periods between each shutdown.
+
+**Risk of Data Loss**: Without proper coordination, scaling down could result in data loss if in-memory data wasn't properly flushed to storage before ingester termination.
+
+## What is the READONLY State?
+
+The READONLY state addresses these challenges. When an ingester transitions to READONLY state:
+
+- **Stops accepting new writes** - Push requests are rejected and redistributed to ACTIVE ingesters
+- **Continues serving queries** - Existing data remains available, maintaining query performance
+- **Gradually ages out data** - Data naturally expires according to your retention settings
+- **Enables safe removal** - Ingesters can be terminated once data has aged out
+
+## How to Use READONLY State
+
+### Step 1: Transition to READONLY
+
+```bash
+# Set multiple ingesters to READONLY simultaneously
+curl -X POST http://ingester-1:8080/ingester/mode -d '{"mode": "READONLY"}'
+curl -X POST http://ingester-2:8080/ingester/mode -d '{"mode": "READONLY"}'
+curl -X POST http://ingester-3:8080/ingester/mode -d '{"mode": "READONLY"}'
+```
+
+### Step 2: Monitor Data Status (Optional)
+
+```bash
+# Check user statistics and loaded blocks on the ingester
+curl http://ingester-1:8080/ingester/all_user_stats
+```
+
+### Step 3: Choose Removal Strategy
+
+You have three options:
+
+- **Immediate removal**: Safe for service availability but may impact query performance
+- **Conservative removal**: Wait for `querier.query-ingesters-within` duration (recommended)
+- **Complete data aging**: Wait for full retention period
+
+### Step 4: Remove Ingesters
+
+```bash
+# Terminate the ingester processes
+kubectl delete pod ingester-1 ingester-2 ingester-3
+```
+
+## Timeline Example
+
+For a cluster with `querier.query-ingesters-within=5h`:
+
+- **T0**: Set ingesters to READONLY state
+- **T1**: Ingesters stop receiving new data but continue serving queries
+- **T2 (T0 + 5h)**: Ingesters no longer receive query requests (safe to remove)
+- **T3 (T0 + retention_period)**: All blocks naturally removed from ingesters
+
+**Any time after T2 is safe for removal without service impact.**
+
+## Benefits
+
+### Performance Preservation
+Unlike the traditional approach, READONLY ingesters continue serving queries, maintaining performance during the scaling transition.
+
+### Operational Simplicity
+- No configuration changes required across multiple components
+- Batch operations supported - multiple ingesters can transition simultaneously (no more "one by one" requirement)
+- No waiting periods between ingester transitions
+- Flexible timing - remove ingesters when convenient
+- Reversible operations - ingesters can return to ACTIVE state if needed
+
+### Enhanced Safety
+- Gradual data aging without manual intervention
+- Data remains available during transition
+- Monitoring capabilities with `/ingester/all_user_stats` endpoint
+
+## Practical Examples
+
+### Basic READONLY Scaling
+
+```bash
+#!/bin/bash
+INGESTERS_TO_SCALE=("ingester-1" "ingester-2" "ingester-3")
+WAIT_DURATION="5h"
+
+# Set ingesters to READONLY
+for ingester in "${INGESTERS_TO_SCALE[@]}"; do
+    echo "Setting $ingester to READONLY..."
+    curl -X POST http://$ingester:8080/ingester/mode -d '{"mode": "READONLY"}'
+done
+
+# Wait for safe removal window
+echo "Waiting $WAIT_DURATION for safe removal..."
+sleep $WAIT_DURATION
+
+# Remove ingesters
+for ingester in "${INGESTERS_TO_SCALE[@]}"; do
+    echo "Removing $ingester..."
+    kubectl delete pod $ingester
+done
+```
+
+### Advanced: Check for Empty Users Before Removal
+
+```bash
+#!/bin/bash
+check_ingester_ready() {
+    local ingester=$1
+    local response=$(curl -s http://$ingester:8080/ingester/all_user_stats)
+
+    # Empty array "[]" indicates no users/data remaining
+    if [[ "$response" == "[]" ]]; then
+        return 0  # Ready for removal
+    else
+        return 1  # Still has user data
+    fi
+}
+
+INGESTERS_TO_SCALE=("ingester-1" "ingester-2" "ingester-3")
+
+# Set ingesters to READONLY
+for ingester in "${INGESTERS_TO_SCALE[@]}"; do
+    echo "Setting $ingester to READONLY..."
+    curl -X POST http://$ingester:8080/ingester/mode -d '{"mode": "READONLY"}'
+done
+
+# Wait and check for data removal
+for ingester in "${INGESTERS_TO_SCALE[@]}"; do
+    echo "Waiting for $ingester to be ready for removal..."
+    while ! check_ingester_ready $ingester; do
+        echo "$ingester still has user data, waiting 30s..."
+        sleep 30
+    done
+
+    echo "Removing $ingester (no user data remaining)..."
+    kubectl delete pod $ingester
+done
+```
+
+## Best Practices
+
+- **Test in non-production first** to validate the process with your configuration
+- **Scale gradually** - don't remove too many ingesters simultaneously
+- **Monitor throughout** - watch metrics during the entire process
+- **Understand your query patterns** - know your `querier.query-ingesters-within` setting
+
+## Emergency Rollback
+
+If issues arise, return ingesters to ACTIVE state:
+
+```bash
+# Revert to ACTIVE state
+curl -X POST http://ingester-1:8080/ingester/mode -d '{"mode": "ACTIVE"}'
+```
+
+## Conclusion
+
+The READONLY state improves Cortex's operational capabilities. This feature makes scaling operations safer, simpler, more flexible, and more performant than the traditional approach. Configuration changes across multiple components are no longer required - set ingesters to READONLY and remove them when convenient.
+
+For detailed information and examples, check out our [Ingesters Scaling Guide](../../docs/guides/ingesters-scaling-up-and-down/).
\ No newline at end of file