---
title: lokistack log retention support
authors:
  - @shwetaap
reviewers:
  - @xperimental
  - @periklis
creation-date: 2022-05-21
last-updated: 2022-05-21
tracking-link:
see-also:
---

# LokiStack Log Retention Support

## Summary

Retention in Grafana Loki is achieved either through the Table Manager or the Compactor.
Retention through the Compactor is supported only with the boltdb-shipper store. Compactor-based retention will become the default and has long-term support, and it supports more granular retention policies for per-tenant and per-stream use cases. We therefore define the LokiStack resources to support retention via the Compactor only.
The following sections describe a set of APIs in the form of custom resource definitions (CRDs) that enable users of `LokiStack` resources to:
- Enable retention configuration in LokiStack.
- Define the global and per-tenant retention period and stream configurations within the LokiStack custom resource.
| 23 | + |
## Motivation

The Loki Operator manages `LokiStack` resources that consist of a set of Loki components for ingestion/querying and optionally a gateway microservice that ensures authenticated and authorized access to logs stored by Loki.
Retention in Loki has historically been global for a cluster and deferred to the underlying object store. Since v2.3.0, Loki can handle retention through the Compactor component, configurable per tenant and per stream. These more granular retention configurations allow controlling storage cost and meeting security and compliance requirements.
A common use case for custom policies is to delete high-frequency logs earlier than other (low-frequency) logs.
| 29 | + |
### Goals

* The user can enable retention via the `LokiStack` custom resource.
* The user can declare per-tenant and global policies through the `LokiStack` custom resource. These are ordered by priority.
* The policies support time-based deletion of older logs.

### Non-Goals

* Stress-testing the compactor on each T-shirt size with an overwhelming number of retention rules
| 40 | + |
## Proposal

The following enhancement proposal describes the required API additions and changes in the Loki Operator to add support for configuring custom log retention per tenant:
- https://grafana.com/docs/loki/latest/operations/storage/retention/
- https://grafana.com/docs/loki/latest/configuration/#compactor_config

### API Extensions

#### LokiStack Changes: Support for configuring log retention

```go
import "github.com/prometheus/prometheus/model/labels"

// LokiDuration defines the type for Prometheus durations.
//
// +kubebuilder:validation:Pattern:="((([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?|0)"
type LokiDuration string

// LokiStackSpec defines the desired state of LokiStack
type LokiStackSpec struct {
...
	// Retention defines the spec for log retention
	//
	// +optional
	// +kubebuilder:validation:Optional
	Retention *RetentionSpec `json:"retention,omitempty"`
...
}

// RetentionSpec defines the spec for enabling retention in the Compactor.
type RetentionSpec struct {
	// DeleteDelay defines the delay after which chunks will be fully deleted during retention.
	//
	// +optional
	// +kubebuilder:validation:Optional
	DeleteDelay LokiDuration `json:"deleteDelay,omitempty"`
}

// LimitsTemplateSpec defines the limits applied at the ingestion or query path.
type LimitsTemplateSpec struct {
...
	// RetentionLimits defines the configuration of the retention period.
	//
	// +optional
	// +kubebuilder:validation:Optional
	RetentionLimits *RetentionLimitSpec `json:"retention,omitempty"`
}

// RetentionLimitSpec configures the retention period and retention streams.
type RetentionLimitSpec struct {
	// PeriodDays defines the log retention period.
	//
	// +optional
	// +kubebuilder:validation:Optional
	PeriodDays int `json:"period,omitempty"`

	// Stream defines the per-stream retention rules.
	//
	// +optional
	// +kubebuilder:validation:Optional
	Stream []*StreamSpec `json:"stream,omitempty"`
}

// StreamSpec defines a per-stream retention rule: the retention period
// applied to log streams matching the selector, and the rule's priority.
type StreamSpec struct {
	// PeriodDays defines the log retention period.
	//
	// +optional
	// +kubebuilder:validation:Optional
	PeriodDays int `json:"period,omitempty"`

	// Priority defines the retention priority.
	//
	// +optional
	// +kubebuilder:validation:Optional
	Priority int32 `json:"priority,omitempty"`

	// Selector is a set of labels to identify the log stream.
	//
	// +optional
	// +kubebuilder:validation:Optional
	Selector *labels.Matcher `json:"selector,omitempty"`
}
```

### Implementation Details/Notes/Constraints

```yaml
apiVersion: loki.grafana.com/v1beta1
kind: LokiStack
metadata:
  name: lokistack-dev
spec:
  size: 1x.extra-small
  storage:
    secret:
      name: test
      type: s3
  storageClassName: gp2
  retention:
    deleteDelay:
  limits:
    global:
      retentionLimits:
        periodDays: 31
        stream:
        - selector:
            name: namespace
            type: equal
            value: dev
          priority: 1
          periodDays: 1
    tenants:
      tenanta:
        retentionLimits:
          periodDays: 7
          stream:
          - selector:
              name: namespace
              type: equal
              value: prod
            priority: 2
            periodDays: 14
          - selector:
              name: container
              type: equal
              value: loki
            priority: 1
            periodDays: 3
      tenantb:
        retentionLimits:
          periodDays:
          stream:
          - selector:
              name: container
              type: equal
              value: nginx
            priority: 1
            periodDays: 1
```
| 184 | + |
#### General constraints

### Risks and Mitigations

## Design Details

Retention is enabled in the cluster when the `retention` block is added to the LokiStack custom resource. `deleteDelay` is the time after which the Compactor will fully delete marked chunks. boltdb-shipper indexes are refreshed from the shared store at a specific interval on the components that use them (querier and ruler). This means that deleting chunks instantly could leave components with references to old chunks, causing queries to fail. A delay allows components to refresh their store and gracefully drop their references to those chunks. It also provides a short window in which chunk deletion can be cancelled in case of a configuration mistake.
`DeleteWorkerCount` specifies the maximum number of goroutine workers instantiated to delete chunks (see https://grafana.com/docs/loki/latest/operations/storage/retention/#retention-configuration). The operator instantiates Loki clusters of different t-shirt sizes, and a pre-determined default value of `DeleteWorkerCount` is set per t-shirt size to avoid issues such as a large number of goroutine workers being instantiated on small clusters. The user cannot set `DeleteWorkerCount`.

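As a sketch, the operator could render the `retention` block into Loki's `compactor_config` section along these lines (the field names come from Loki's documentation; the concrete values shown are illustrative, not the operator's actual defaults):

```yaml
# Sketch of the Compactor configuration rendered when `spec.retention` is set.
compactor:
  retention_enabled: true
  retention_delete_delay: 2h          # from spec.retention.deleteDelay
  retention_delete_worker_count: 150  # fixed per t-shirt size, not user-settable
```
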
The retention period is configured within the `limits_config` section.

There are two ways of setting retention policies:

- `retention_period`, which is applied globally.
- `retention_stream`, which is only applied to chunks matching the selector.

These can be configured at a global level (applied to all tenants) or on a per-tenant basis.

The API configures `RetentionLimits` in the same way as the ingestion and query limits. During reconciliation of the LokiStack resource, the configuration from the `global` section is added to the `limits_config` section of `loki-config.yaml`, and the configuration for the individual `tenants` is written to the `overrides` section of the runtime configuration file.
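
As a sketch, the LokiStack example above could translate into Loki configuration roughly as follows (the file layout is assumed for illustration; `period` values use Loki duration strings, so `periodDays` is converted to hours):

```yaml
# loki-config.yaml: the `global` section rendered into limits_config
limits_config:
  retention_period: 744h        # periodDays: 31
  retention_stream:
  - selector: '{namespace="dev"}'
    priority: 1
    period: 24h                 # periodDays: 1

# runtime configuration file: per-tenant limits rendered as overrides
overrides:
  tenanta:
    retention_period: 168h      # periodDays: 7
    retention_stream:
    - selector: '{namespace="prod"}'
      priority: 2
      period: 336h              # periodDays: 14
    - selector: '{container="loki"}'
      priority: 1
      period: 72h               # periodDays: 3
```
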
| 204 | + |
Once the configuration is read, the following rules decide the retention period of a chunk. The rule to apply is selected by choosing the first in this list that matches:

1. If a per-tenant `retention_stream` matches the current stream, the highest-priority match is picked.
2. If a global `retention_stream` matches the current stream, the highest-priority match is picked.
3. If a per-tenant `retention_period` is specified, it is applied.
4. The global `retention_period` is selected if nothing else matched.
5. If no global `retention_period` is specified, the default retention of 744h (31 days) is used.

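The selection order above can be sketched in Go. The types and function names here are hypothetical helpers for illustration, not part of the proposed API; matching a selector against a stream is abstracted into a boolean for brevity:

```go
package main

import "fmt"

// StreamRule is a hypothetical, simplified stream retention rule: its
// priority, its retention period in days, and whether its selector
// matches the stream under consideration.
type StreamRule struct {
	Priority int
	Period   int
	Matches  bool
}

// highestPriorityMatch returns the matching rule with the highest priority,
// and whether any rule matched at all.
func highestPriorityMatch(rules []StreamRule) (StreamRule, bool) {
	var best StreamRule
	found := false
	for _, r := range rules {
		if r.Matches && (!found || r.Priority > best.Priority) {
			best, found = r, true
		}
	}
	return best, found
}

// pickRetentionDays applies the selection order described above:
// per-tenant stream rules, then global stream rules, then the per-tenant
// period, then the global period, then Loki's 744h (31-day) default.
// A period of 0 means "not specified".
func pickRetentionDays(tenantStreams, globalStreams []StreamRule, tenantPeriod, globalPeriod int) int {
	if r, ok := highestPriorityMatch(tenantStreams); ok {
		return r.Period
	}
	if r, ok := highestPriorityMatch(globalStreams); ok {
		return r.Period
	}
	if tenantPeriod > 0 {
		return tenantPeriod
	}
	if globalPeriod > 0 {
		return globalPeriod
	}
	return 31
}

func main() {
	tenant := []StreamRule{{Priority: 1, Period: 3, Matches: true}, {Priority: 2, Period: 14, Matches: false}}
	global := []StreamRule{{Priority: 1, Period: 1, Matches: true}}
	fmt.Println(pickRetentionDays(tenant, global, 7, 31)) // prints 3: per-tenant stream rule wins
	fmt.Println(pickRetentionDays(nil, global, 7, 31))    // prints 1: global stream rule wins
	fmt.Println(pickRetentionDays(nil, nil, 0, 0))        // prints 31: default retention
}
```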
### Open Questions [optional]

## Implementation History

## Drawbacks

The user is not allowed to set the `DeleteWorkerCount` value.

## Alternatives