Commit 334c394

operator: Enhancement Proposal to configure log retention (grafana#6232)
Signed-off-by: Shweta Padubidri <[email protected]>
1 parent 1cfc30f commit 334c394

1 file changed: +224 −0 lines
---
title: lokistack log retention support
authors:
- @shwetaap
reviewers:
- @xperimental
- @periklis
creation-date: 2022-05-21
last-updated: 2022-05-21
tracking-link:
see-also:
---

# LokiStack Log Retention Support

## Summary

Retention in Grafana Loki is achieved either through the Table Manager or the Compactor.
Retention through the Compactor is supported only with the boltdb-shipper store. Compactor-based retention will become the default and has long-term support. It supports more granular retention policies for per-tenant and per-stream use cases, so the LokiStack resources defined here support retention via the Compactor only.
The following sections describe a set of APIs in the form of custom resource definitions (CRD) that enable users of `LokiStack` resources to:
- Enable retention configuration in LokiStack.
- Define the global and per-tenant retention period and stream configurations within the LokiStack custom resource.

## Motivation

The Loki Operator manages `LokiStack` resources that consist of a set of Loki components for ingestion/querying and, optionally, a gateway microservice that ensures authenticated and authorized access to logs stored by Loki.
Retention in Loki has always been global for a cluster and deferred to the underlying object store. Since v2.3.0, Loki can handle retention through the Compactor component. Retention can be configured per tenant and per stream. These different retention configurations allow storage cost control and meet security and compliance requirements in a more granular way.
A common use case for custom policies is to delete high-frequency logs earlier than other (low-frequency) logs.

### Goals

* The user can enable retention via the `LokiStack` custom resource.
* The user can declare per-tenant and global policies through the `LokiStack` custom resource. These are ordered by priority.
* The policies support time-based deletion of older logs.

### Non-Goals

* Stress-testing the compactor on each t-shirt size with an overwhelming number of retention rules.

## Proposal

The following enhancement proposal describes the required API additions and changes in the Loki Operator to add support for configuring custom log retention per tenant:

- https://grafana.com/docs/loki/latest/operations/storage/retention/
- https://grafana.com/docs/loki/latest/configuration/#compactor_config

### API Extensions

#### LokiStack Changes: Support for configuring log retention

```go
import "github.com/prometheus/prometheus/model/labels"

// LokiDuration defines the type for Prometheus durations.
//
// +kubebuilder:validation:Pattern:="((([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?|0)"
type LokiDuration string

// LokiStackSpec defines the desired state of LokiStack
type LokiStackSpec struct {
	...
	// Retention defines the spec for log retention
	//
	// +optional
	// +kubebuilder:validation:Optional
	Retention *RetentionSpec `json:"retention,omitempty"`
	...
}

// RetentionSpec defines the spec for enabling retention in the Compactor.
type RetentionSpec struct {
	// DeleteDelay defines the delay after which chunks are fully deleted during retention.
	//
	// +optional
	// +kubebuilder:validation:Optional
	DeleteDelay LokiDuration `json:"deleteDelay,omitempty"`
}

// LimitsTemplateSpec defines the limits applied at ingestion or query path.
type LimitsTemplateSpec struct {
	...
	// RetentionLimits defines the configuration of the retention period.
	//
	// +optional
	// +kubebuilder:validation:Optional
	RetentionLimits *RetentionLimitSpec `json:"retention,omitempty"`
}

// RetentionLimitSpec configures the retention period and retention streams.
type RetentionLimitSpec struct {
	// PeriodDays defines the log retention period in days.
	//
	// +optional
	// +kubebuilder:validation:Optional
	PeriodDays int `json:"period,omitempty"`

	// Stream defines the log streams to which per-stream retention applies.
	//
	// +optional
	// +kubebuilder:validation:Optional
	Stream []*StreamSpec `json:"stream,omitempty"`
}

// StreamSpec defines the retention configuration for a single log stream.
type StreamSpec struct {
	// PeriodDays defines the log retention period in days.
	//
	// +optional
	// +kubebuilder:validation:Optional
	PeriodDays int `json:"period,omitempty"`

	// Priority defines the retention priority.
	//
	// +optional
	// +kubebuilder:validation:Optional
	Priority int32 `json:"priority,omitempty"`

	// Selector is a set of labels to identify the log stream.
	//
	// +optional
	// +kubebuilder:validation:Optional
	Selector *labels.Matcher `json:"selector,omitempty"`
}
```

### Implementation Details/Notes/Constraints

```yaml
apiVersion: loki.grafana.com/v1beta1
kind: LokiStack
metadata:
  name: lokistack-dev
spec:
  size: 1x.extra-small
  storage:
    secret:
      name: test
      type: s3
  storageClassName: gp2
  retention:
    deleteDelay:
  limits:
    global:
      retentionLimits:
        periodDays: 31
        stream:
        - selector:
            name: namespace
            type: equal
            value: dev
          priority: 1
          periodDays: 1
    tenants:
      tenanta:
        retentionLimits:
          periodDays: 7
          stream:
          - selector:
              name: namespace
              type: equal
              value: prod
            priority: 2
            periodDays: 14
          - selector:
              name: container
              type: equal
              value: loki
            priority: 1
            periodDays: 3
      tenantb:
        retentionLimits:
          periodDays:
          stream:
          - selector:
              name: container
              type: equal
              value: nginx
            priority: 1
            periodDays: 1
```

#### General constraints

### Risks and Mitigations

## Design Details

Retention is enabled in the cluster when the `retention` block is added to the LokiStack custom resource. `deleteDelay` is the time after which the Compactor deletes marked chunks. boltdb-shipper indexes are refreshed from the shared store at a specific interval on the components that use them (querier and ruler). This means that deleting chunks instantly could leave components still holding references to old chunks, causing queries to fail. A delay allows components to refresh their store and gracefully remove their references to those chunks. It also provides a short window of time in which to cancel chunk deletion in the case of a configuration mistake.
`DeleteWorkerCount` specifies the maximum number of goroutine workers instantiated to delete chunks (see https://grafana.com/docs/loki/latest/operations/storage/retention/#retention-configuration). The operator instantiates Loki clusters of different t-shirt sizes. A pre-determined default value of `DeleteWorkerCount` is set per t-shirt size to avoid problems such as a large number of goroutine workers being instantiated on a small cluster. The user cannot set `DeleteWorkerCount`.

The retention period is configured within the `limits_config` configuration section.

There are two ways of setting retention policies:

- `retention_period`, which is applied globally.
- `retention_stream`, which is applied only to chunks matching the selector.

These can be configured at a global level (applied to all tenants) or on a per-tenant basis.

The API configures `RetentionLimits` in the same way as the ingestion and query limits. During LokiStack resource reconciliation, the configuration from the `global` section is added to the `limits_config` section of `loki-config.yaml`, and the configuration for the individual `tenants` is written to the `overrides` section of the runtime configuration file.
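For illustration, assuming the example resource above, the rendered Loki configuration could look roughly like this. This is a sketch of the mapping only, not the operator's exact output; the selector syntax follows Loki's `retention_stream` format:

```yaml
# loki-config.yaml: the `global` section maps to limits_config.
limits_config:
  retention_period: 744h        # periodDays: 31
  retention_stream:
  - selector: '{namespace="dev"}'
    priority: 1
    period: 24h                 # periodDays: 1

# runtime configuration file: each tenant maps to an entry under overrides.
overrides:
  tenanta:
    retention_period: 168h      # periodDays: 7
    retention_stream:
    - selector: '{namespace="prod"}'
      priority: 2
      period: 336h              # periodDays: 14
    - selector: '{container="loki"}'
      priority: 1
      period: 72h               # periodDays: 3
```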
Once the configuration is read, the following rules are applied to decide the retention period.

A rule is selected by choosing the first match in this list:

1. If a per-tenant `retention_stream` matches the current stream, the highest-priority match is picked.
2. If a global `retention_stream` matches the current stream, the highest-priority match is picked.
3. If a per-tenant `retention_period` is specified, it is applied.
4. The global `retention_period` is selected if nothing else matched.
5. If no global `retention_period` is specified, the default value of 744h (31 days) is used.

### Open Questions [optional]

## Implementation History

## Drawbacks

The user is not allowed to set the `DeleteWorkerCount` value.

## Alternatives