docs/architecture.md: 29 additions & 29 deletions
@@ -21,9 +21,9 @@ Incoming samples (writes from Prometheus) are handled by the [distributor](#dist
## Blocks storage
-The blocks storage is based on [Prometheus TSDB](https://prometheus.io/docs/prometheus/latest/storage/): it stores each tenant's time series into their own TSDB which write out their series to a on-disk Block (defaults to 2h block range periods). Each Block is composed by a few files storing the chunks and the block index.
+The blocks storage is based on [Prometheus TSDB](https://prometheus.io/docs/prometheus/latest/storage/): it stores each tenant's time series into their own TSDB which writes out their series to an on-disk Block (defaults to 2h block range periods). Each Block is composed of a few files storing the chunks and the block index.
-The TSDB chunk files contain the samples for multiple series. The series inside the Chunks are then indexed by a per-block index, which indexes metric names and labels to time series in the chunk files.
+The TSDB chunk files contain the samples for multiple series. The series inside the chunks are then indexed by a per-block index, which indexes metric names and labels to time series in the chunk files.
The blocks storage doesn't require a dedicated storage backend for the index. The only requirement is an object store for the Block files, which can be:
@@ -60,7 +60,7 @@ The **distributor** service is responsible for handling incoming samples from Pr
The validation done by the distributor includes:
-- The metric labels name are formally correct
+- The metric label names are formally correct
- The configured max number of labels per metric is respected
- The configured max length of a label name and value is respected
- The timestamp is not older/newer than the configured min/max time range
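To make the checks in this list concrete, here is a minimal Go sketch of that kind of per-series validation. It is illustrative only, not Cortex's implementation: the limit values and the label-name pattern are assumptions modelled on the Prometheus naming rules, and Cortex exposes the real limits as per-tenant configuration.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// Assumed limits for illustration; in Cortex these are per-tenant settings.
const (
	maxLabelsPerSeries = 30
	maxLabelNameLen    = 1024
	maxLabelValueLen   = 2048
)

// labelNameRE follows the Prometheus label-name syntax.
var labelNameRE = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

func validateSeries(labels map[string]string, ts, minTime, maxTime time.Time) error {
	if len(labels) > maxLabelsPerSeries {
		return fmt.Errorf("series has %d labels, limit is %d", len(labels), maxLabelsPerSeries)
	}
	for name, value := range labels {
		if !labelNameRE.MatchString(name) {
			return fmt.Errorf("invalid label name %q", name)
		}
		if len(name) > maxLabelNameLen || len(value) > maxLabelValueLen {
			return fmt.Errorf("label %q exceeds the configured length limits", name)
		}
	}
	if ts.Before(minTime) || ts.After(maxTime) {
		return fmt.Errorf("timestamp %v outside accepted range [%v, %v]", ts, minTime, maxTime)
	}
	return nil
}

func main() {
	now := time.Now()
	err := validateSeries(
		map[string]string{"__name__": "http_requests_total", "job": "api"},
		now, now.Add(-time.Hour), now.Add(10*time.Minute),
	)
	fmt.Println("validation error:", err)
}
```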
@@ -80,7 +80,7 @@ The supported KV stores for the HA tracker are:
* [Consul](https://www.consul.io)
* [Etcd](https://etcd.io)
-Note: Memberlist is not supported. Memberlist-based KV store propagates updates using gossip, which is very slow for HA purposes: result is that different distributors may see different Prometheus server as elected HA replica, which is definitely not desirable.
+Note: Memberlist is not supported. Memberlist-based KV store propagates updates using gossip, which is very slow for HA purposes: the result is that different distributors may see different Prometheus servers as the elected HA replica, which is definitely not desirable.
For more information, please refer to [config for sending HA pairs data to Cortex](guides/ha-pair-handling.md) in the documentation.
@@ -97,11 +97,11 @@ The trade-off associated with the latter is that writes are more balanced across
#### The hash ring
-A hash ring (stored in a key-value store) is used to achieve consistent hashing for the series sharding and replication across the ingesters. All [ingesters](#ingester) register themselves into the hash ring with a set of tokens they own; each token is a random unsigned 32-bit number. Each incoming series is [hashed](#hashing) in the distributor and then pushed to the ingester owning the tokens range for the series hash number plus N-1 subsequent ingesters in the ring, where N is the replication factor.
+A hash ring (stored in a key-value store) is used to achieve consistent hashing for the series sharding and replication across the ingesters. All [ingesters](#ingester) register themselves into the hash ring with a set of tokens they own; each token is a random unsigned 32-bit number. Each incoming series is [hashed](#hashing) in the distributor and then pushed to the ingester owning the token's range for the series hash number plus N-1 subsequent ingesters in the ring, where N is the replication factor.
To do the hash lookup, distributors find the smallest appropriate token whose value is larger than the [hash of the series](#hashing). When the replication factor is larger than 1, the next subsequent tokens (clockwise in the ring) that belong to different ingesters will also be included in the result.
-The effect of this hash set up is that each token that an ingester owns is responsible for a range of hashes. If there are three tokens with values 0, 25, and 50, then a hash of 3 would be given to the ingester that owns the token 25; the ingester owning token 25 is responsible for the hash range of 1-25.
+The effect of this hash setup is that each token that an ingester owns is responsible for a range of hashes. If there are three tokens with values 0, 25, and 50, then a hash of 3 would be given to the ingester that owns token 25; the ingester owning token 25 is responsible for the hash range of 1-25.
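The token lookup described above can be sketched in a few lines of Go. This is an illustrative simplification with assumed types, not Cortex's ring implementation (which also tracks ingester state and zone awareness).

```go
package main

import (
	"fmt"
	"sort"
)

// token is a point on the ring owned by one ingester (simplified).
type token struct {
	value uint32
	owner string
}

// lookup returns up to replicationFactor distinct ingesters for a series hash:
// the owner of the smallest token larger than the hash, then the next
// clockwise tokens that belong to different ingesters.
func lookup(ring []token, hash uint32, replicationFactor int) []string {
	// The ring must be sorted by token value.
	start := sort.Search(len(ring), func(i int) bool { return ring[i].value > hash })

	var owners []string
	seen := map[string]bool{}
	for i := 0; i < len(ring) && len(owners) < replicationFactor; i++ {
		t := ring[(start+i)%len(ring)] // wrap around the ring
		if !seen[t.owner] {
			seen[t.owner] = true
			owners = append(owners, t.owner)
		}
	}
	return owners
}

func main() {
	ring := []token{{0, "ingester-1"}, {25, "ingester-2"}, {50, "ingester-3"}}
	// A hash of 3 falls in the range owned by token 25 (ingester-2);
	// with replication factor 3, the next distinct owners clockwise are added too.
	fmt.Println(lookup(ring, 3, 3)) // [ingester-2 ingester-3 ingester-1]
}
```

With the tokens 0, 25, and 50 from the example above, a hash of 3 resolves first to the owner of token 25, then to the next distinct owners clockwise.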
The supported KV stores for the hash ring are:
@@ -111,7 +111,7 @@ The supported KV stores for the hash ring are:
#### Quorum consistency
-Since all distributors share access to the same hash ring, write requests can be sent to any distributor and you can setup a stateless load balancer in front of it.
+Since all distributors share access to the same hash ring, write requests can be sent to any distributor and you can set up a stateless load balancer in front of it.
To ensure consistent query results, Cortex uses [Dynamo-style](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) quorum consistency on reads and writes. This means that the distributor will wait for a positive response of at least one half plus one of the ingesters to send the sample to before successfully responding to the Prometheus write request.
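As a rough illustration of the quorum rule (a sketch, not the distributor's actual code), the required number of successful ingester responses is one half of the replication factor plus one:

```go
package main

import "fmt"

// quorum returns the minimum number of successful ingester responses the
// distributor waits for: one half of the replicas plus one.
func quorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

func main() {
	for _, rf := range []int{1, 3, 5} {
		fmt.Printf("replication factor %d -> quorum %d\n", rf, quorum(rf))
	}
	// With replication factor 3 the quorum is 2: a write succeeds once 2 of
	// the 3 ingesters in the replication set have acknowledged the samples.
}
```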
@@ -125,35 +125,35 @@ The **ingester** service is responsible for writing incoming series to a [long-t
Incoming series are not immediately written to the storage but kept in memory and periodically flushed to the storage (by default, 2 hours). For this reason, the [queriers](#querier) may need to fetch samples both from ingesters and long-term storage while executing a query on the read path.
-Ingesters contain a **lifecycler** which manages the lifecycle of an ingester and stores the **ingester state** in the [hash ring](#the-hash-ring). Each ingester could be in one of the following states:
+Ingesters contain a **lifecycler** which manages the lifecycle of an ingester and stores the **ingester state** in the [hash ring](#the-hash-ring). Each ingester can be in one of the following states:
- **`PENDING`**<br />
-The ingester has just started. While in this state, the ingester doesn't receive neither write and read requests.
+The ingester has just started. While in this state, the ingester doesn't receive either write or read requests.
- **`JOINING`**<br />
-The ingester is starting up and joining the ring. While in this state the ingester doesn't receive neither write and read requests. The ingester will join the ring using tokens loaded from disk (if `-ingester.tokens-file-path` is configured) or generate a set of new random ones. Finally, the ingester optionally observes the ring for tokens conflicts and then, once any conflict is resolved, will move to `ACTIVE` state.
+The ingester is starting up and joining the ring. While in this state the ingester doesn't receive either write or read requests. The ingester will join the ring using tokens loaded from disk (if `-ingester.tokens-file-path` is configured) or generate a set of new random ones. Finally, the ingester optionally observes the ring for token conflicts and then, once any conflict is resolved, will move to `ACTIVE` state.
- **`ACTIVE`**<br />
The ingester is up and running. While in this state the ingester can receive both write and read requests.
- **`LEAVING`**<br />
-The ingester is shutting down and leaving the ring. While in this state the ingester doesn't receive write requests, while it could receive read requests.
+The ingester is shutting down and leaving the ring. While in this state the ingester doesn't receive write requests, but it can still receive read requests.
- **`UNHEALTHY`**<br />
The ingester has failed to heartbeat to the ring's KV Store. While in this state, distributors skip the ingester while building the replication set for incoming series and the ingester does not receive write or read requests.
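The state list above boils down to which requests an ingester accepts. Here is a small Go sketch of that table; it is illustrative only, the real lifecycler lives in Cortex's ring package and handles many more details.

```go
package main

import "fmt"

type ingesterState int

const (
	pending ingesterState = iota
	joining
	active
	leaving
	unhealthy
)

// acceptsWrite and acceptsRead reflect the states described above:
// only ACTIVE accepts writes; ACTIVE and LEAVING accept reads.
func acceptsWrite(s ingesterState) bool { return s == active }
func acceptsRead(s ingesterState) bool  { return s == active || s == leaving }

func main() {
	names := map[ingesterState]string{
		pending: "PENDING", joining: "JOINING", active: "ACTIVE",
		leaving: "LEAVING", unhealthy: "UNHEALTHY",
	}
	for s := pending; s <= unhealthy; s++ {
		fmt.Printf("%-9s write=%v read=%v\n", names[s], acceptsWrite(s), acceptsRead(s))
	}
}
```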
Ingesters are **semi-stateful**.
-#### Ingesters failure and data loss
+#### Ingester failure and data loss
If an ingester process crashes or exits abruptly, all the in-memory series that have not yet been flushed to the long-term storage will be lost. There are two main ways to mitigate this failure mode:
1. Replication
2. Write-ahead log (WAL)
-The **replication** is used to hold multiple (typically 3) replicas of each time series in the ingesters. If the Cortex cluster loses an ingester, the in-memory series held by the lost ingester are also replicated to at least another ingester. In the event of a single ingester failure, no time series samples will be lost. However, in the event of multiple ingester failures, time series may be potentially lost if the failures affect all the ingesters holding the replicas of a specific time series.
+The **replication** is used to hold multiple (typically 3) replicas of each time series in the ingesters. If the Cortex cluster loses an ingester, the in-memory series held by the lost ingester are also replicated to at least one other ingester. In the event of a single ingester failure, no time series samples will be lost. However, in the event of multiple ingester failures, time series may be potentially lost if the failures affect all the ingesters holding the replicas of a specific time series.
The **write-ahead log** (WAL) is used to write to a persistent disk all incoming series samples until they're flushed to the long-term storage. In the event of an ingester failure, a subsequent process restart will replay the WAL and recover the in-memory series samples.
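The WAL idea can be illustrated with a minimal Go sketch, assuming a trivial line-based log format. This is not Cortex's code (the ingesters use the Prometheus TSDB WAL); it only shows the pattern: append each sample to disk before applying it in memory, and replay the file on restart.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// wal appends every incoming sample to disk before it is applied in memory,
// so a crashed process can rebuild its in-memory series on restart.
type wal struct{ f *os.File }

func (w *wal) append(series string, ts int64, value float64) error {
	_, err := fmt.Fprintf(w.f, "%s %d %g\n", series, ts, value)
	return err
}

// replay re-reads the log and hands every recorded sample back to apply().
func replay(path string, apply func(series string, ts int64, value float64)) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var series string
		var ts int64
		var value float64
		if _, err := fmt.Sscanf(sc.Text(), "%s %d %g", &series, &ts, &value); err == nil {
			apply(series, ts, value)
		}
	}
	return sc.Err()
}

func main() {
	f, _ := os.CreateTemp("", "wal")
	w := &wal{f: f}
	w.append(`http_requests_total{job="api"}`, 1700000000000, 42)
	f.Close()

	// Simulated restart: rebuild the in-memory samples from the log.
	replay(f.Name(), func(series string, ts int64, v float64) {
		fmt.Printf("recovered %s ts=%d value=%g\n", series, ts, v)
	})
	os.Remove(f.Name())
}
```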
-Contrary to the sole replication and given the persistent disk data is not lost, in the event of multiple ingesters failure each ingester will recover the in-memory series samples from WAL upon subsequent restart. The replication is still recommended in order to ensure no temporary failures on the read path in the event of a single ingester failure.
+Contrary to the sole replication and given that the persistent disk data is not lost, in the event of multiple ingester failures each ingester will recover the in-memory series samples from WAL upon subsequent restart. The replication is still recommended in order to ensure no temporary failures on the read path in the event of a single ingester failure.
-#### Ingesters write de-amplification
+#### Ingester write de-amplification
Ingesters store recently received samples in-memory in order to perform write de-amplification. If the ingesters would immediately write received samples to the long-term storage, the system would be very difficult to scale due to the very high pressure on the storage. For this reason, the ingesters batch and compress samples in-memory and periodically flush them out to the storage.
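A toy Go illustration of write de-amplification (an assumption-laden sketch, not Cortex code): samples are accumulated in memory per series and flushed as one batch, so the storage sees a single large write instead of many small ones.

```go
package main

import "fmt"

type sample struct {
	ts    int64
	value float64
}

// batcher accumulates samples in memory per series and flushes them
// periodically as one write, instead of writing every sample individually.
type batcher struct {
	series map[string][]sample
}

func (b *batcher) add(series string, s sample) {
	b.series[series] = append(b.series[series], s)
}

// flush writes out everything accumulated so far and resets the buffer.
// In Cortex this would be the periodic block cut and upload; here it is
// just a print to show the de-amplification effect.
func (b *batcher) flush() {
	for name, samples := range b.series {
		fmt.Printf("flushing %d samples for %s in a single write\n", len(samples), name)
	}
	b.series = map[string][]sample{}
}

func main() {
	b := &batcher{series: map[string][]sample{}}
	for i := int64(0); i < 120; i++ {
		b.add("up", sample{ts: i * 15_000, value: 1}) // one scrape every 15s for 30 minutes
	}
	b.flush() // 120 incoming samples become one write to the storage
}
```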
@@ -169,10 +169,10 @@ Queriers are **stateless** and can be scaled up and down as needed.
### Compactor
-The **compactor** is a service which is responsible to:
+The **compactor** is a service which is responsible for:
-- Compact multiple blocks of a given tenant into a single optimized larger block. This helps to reduce storage costs (deduplication, index size reduction), and increase query speed (querying fewer blocks is faster).
-- Keep the per-tenant bucket index updated. The [bucket index](./blocks-storage/bucket-index.md) is used by [queriers](./blocks-storage/querier.md), [store-gateways](#store-gateway) and rulers to discover new blocks in the storage.
+- Compacting multiple blocks of a given tenant into a single optimized larger block. This helps to reduce storage costs (deduplication, index size reduction), and increase query speed (querying fewer blocks is faster).
+- Keeping the per-tenant bucket index updated. The [bucket index](./blocks-storage/bucket-index.md) is used by [queriers](./blocks-storage/querier.md), [store-gateways](#store-gateway) and rulers to discover new blocks in the storage.
For more information, see the [compactor documentation](./blocks-storage/compactor.md).
@@ -190,7 +190,7 @@ The store gateway is **semi-stateful**.
### Query frontend
-The **query frontend** is an **optional service** providing the querier's API endpoints and can be used to accelerate the read path. When the query frontend is in place, incoming query requests should be directed to the query frontend instead of the queriers. The querier service will be still required within the cluster, in order to execute the actual queries.
+The **query frontend** is an **optional service** providing the querier's API endpoints and can be used to accelerate the read path. When the query frontend is in place, incoming query requests should be directed to the query frontend instead of the queriers. The querier service will still be required within the cluster, in order to execute the actual queries.
The query frontend internally performs some query adjustments and holds queries in an internal queue. In this setup, queriers act as workers which pull jobs from the queue, execute them, and return them to the query-frontend for aggregation. Queriers need to be configured with the query frontend address (via the `-querier.frontend-address` CLI flag) in order to allow them to connect to the query frontends.
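A highly simplified Go sketch of this queue/worker split follows; the names and channel-based queue are illustrative assumptions, since the real query frontend keeps per-tenant queues and talks to queriers over gRPC.

```go
package main

import (
	"fmt"
	"sync"
)

// The frontend holds queries in a queue; queriers act as workers that pull
// jobs from it, execute them, and hand results back to the frontend.
func main() {
	queue := make(chan string, 100)   // frontend's in-memory queue of (split) queries
	results := make(chan string, 100) // results returned to the frontend

	var workers sync.WaitGroup
	for q := 1; q <= 2; q++ {
		workers.Add(1)
		go func(querier int) { // each goroutine plays the role of one querier
			defer workers.Done()
			for query := range queue {
				results <- fmt.Sprintf("querier-%d executed %q", querier, query)
			}
		}(q)
	}

	// The frontend enqueues incoming (possibly split) queries...
	for _, q := range []string{"rate(http_requests_total[5m])", "up", "sum(node_cpu_seconds_total)"} {
		queue <- q
	}
	close(queue)
	workers.Wait()
	close(results)

	// ...and forwards the results back to the clients.
	for r := range results {
		fmt.Println(r)
	}
}
```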
@@ -199,15 +199,15 @@ Query frontends are **stateless**. However, due to how the internal queue works,
Flow of the query in the system when using query-frontend:
1) Query is received by query frontend, which can optionally split it or serve from the cache.
-2) Query frontend stores the query into in-memory queue, where it waits for some querier to pick it up.
+2) Query frontend stores the query into an in-memory queue, where it waits for some querier to pick it up.
3) Querier picks up the query, and executes it.
4) Querier sends result back to query-frontend, which then forwards it to the client.
-Query frontend can also be used with any Prometheus-API compatible service. In this mode Cortex can be used as an query accelerator with it's caching and splitting features on other prometheus query engines like Thanos Querier or your own Prometheus server. Query frontend needs to be configured with downstream url address(via the `-frontend.downstream-url` CLI flag), which is the endpoint of the prometheus server intended to be connected with Cortex.
+Query frontend can also be used with any Prometheus-API compatible service. In this mode Cortex can be used as a query accelerator with its caching and splitting features on other Prometheus query engines like Thanos Querier or your own Prometheus server. The query frontend needs to be configured with the downstream URL address (via the `-frontend.downstream-url` CLI flag), which is the endpoint of the Prometheus server intended to be connected with Cortex.
#### Queueing
-The query frontend queuing mechanism is used to:
+The query frontend queueing mechanism is used to:
* Ensure that large queries, that could cause an out-of-memory (OOM) error in the querier, will be retried on failure. This allows administrators to under-provision memory for queries, or optimistically run more small queries in parallel, which helps to reduce the total cost of ownership (TCO).
* Prevent multiple large requests from being convoyed on a single querier by distributing them across all queriers using a first-in/first-out queue (FIFO).
@@ -223,7 +223,7 @@ The query frontend supports caching query results and reuses them on subsequent
### Query Scheduler
-Query Scheduler is an **optional** service that moves the internal queue from query frontend into separate component.
+Query Scheduler is an **optional** service that moves the internal queue from query frontend into a separate component.
This enables independent scaling of query frontends and number of queues (query scheduler).
In order to use query scheduler, both query frontend and queriers must be configured with query scheduler address
@@ -232,10 +232,10 @@ In order to use query scheduler, both query frontend and queriers must be config
Flow of the query in the system changes when using query scheduler:
1) Query is received by query frontend, which can optionally split it or serve from the cache.
-2) Query frontend forwards the query to random query scheduler process.
-3) Query scheduler stores the query into in-memory queue, where it waits for some querier to pick it up.
-3) Querier picks up the query, and executes it.
-4) Querier sends result back to query-frontend, which then forwards it to the client.
+2) Query frontend forwards the query to a random query scheduler process.
+3) Query scheduler stores the query into an in-memory queue, where it waits for some querier to pick it up.
+4) Querier picks up the query, and executes it.
+5) Querier sends result back to query-frontend, which then forwards it to the client.
Query schedulers are **stateless**. It is recommended to run two replicas to make sure queries can still be serviced while one replica is restarting.
@@ -263,7 +263,7 @@ If all of the alertmanager nodes failed simultaneously there would be a loss of
### Configs API
The **configs API** is an **optional service** managing the configuration of Rulers and Alertmanagers.
-It provides APIs to get/set/update the ruler and alertmanager configurations and store them into backend.
-Current supported backend are PostgreSQL and in-memory.
+It provides APIs to get/set/update the ruler and alertmanager configurations and store them in the backend.
+Current supported backends are PostgreSQL and in-memory.