docs(package): Add user documents for retention control; Add garbage collector to the multi-node guide. #1181
Changes from 141 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -31,6 +31,7 @@ worker components. The tables below list the components and their functions. | |
| | query_scheduler | Scheduler for search/aggregation jobs | | ||
| | results_cache | Storage for the workers to return search results to the UI | | ||
| | webui | Web server for the UI | | ||
| | garbage_collector | Background process for retention control | | ||
| ::: | ||
|
|
||
| :::{table} Worker components | ||
|
|
@@ -71,6 +72,8 @@ Running additional workers increases the parallelism of compression and search/a | |
| 4. Set `archive_output.directory` to a directory on the distributed filesystem. | ||
| * Ideally, the directory should be empty or should not yet exist (CLP will create it) since | ||
| CLP will write several files and directories directly to the given directory. | ||
| 5. (Optional) Configure retention periods for archives and search results. See | ||
| [retention control](guides-retention) for details. | ||
|
||
|
|
||
| 5. Download and extract the package on all nodes. | ||
| 6. Copy the `credentials.yml` and `clp-config.yml` files that you created above and paste them | ||
|
|
@@ -93,6 +96,7 @@ but all components in a group must be started before starting a component in the | |
|
|
||
| * `compression_scheduler` | ||
| * `query_scheduler` | ||
| * `garbage_collector` | ||
|
|
||
| **Group 3 components:** | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,167 @@ | ||
| # Retention control in CLP | ||
|
|
||
| CLP supports retention control to free up storage space by periodically deleting outdated archives | ||
| and search results. Retention applies to both the local filesystem and object storage. | ||
|
|
||
| This process is managed by background **garbage collector** jobs, which scan for and delete expired | ||
| data based on configured retention settings in `etc/clp-config.yml`. | ||
|
||
|
|
||
| :::{note} | ||
| By default, retention control is disabled, and CLP retains data indefinitely. | ||
| ::: | ||
|
|
||
|
||
| --- | ||
|
|
||
| ## Definitions | ||
| This section explains the terms and criteria CLP uses to decide when data should be deleted. | ||
|
||
|
|
||
| At a high level, CLP compares a data item's timestamp with the current time to determine whether | ||
| it has expired. The criteria used to assess expiration differ slightly between archives and | ||
| search results. | ||
|
||
|
|
||
| ### Terms | ||
| - **Current Time (`T`):** The current time (UTC) when a garbage collector job evaluates data | ||
| expiration. | ||
| - **Retention Period (`TTL`):** The configured duration for which CLP retains data before it is | ||
| considered expired. | ||
| - **Archive timestamp (`archive.T`):** The most recent timestamp among all log messages | ||
| contained in the archive. Not related to the time at which the logs were compressed. | ||
|
|
||
| Note that logs with outdated timestamps may be deleted immediately, depending on your retention | ||
| settings. | ||
| - **Search result timestamp (`search_result.T`):** The timestamp when a search result is inserted | ||
| into the results_cache. | ||
|
|
||
| :::{note} | ||
| Archives whose log messages do not contain timestamps are not subject to retention. | ||
| ::: | ||
|
|
||
| ### Expiry criteria | ||
|
|
||
| - **Archive Expiry:** | ||
| An archive is considered expired if its retention period has elapsed since the archive's timestamp, | ||
| i.e., the difference between `T` and `archive.T` has surpassed `TTL`. | ||
|
||
| ```text | ||
| if (T - archive.T > TTL) then EXPIRED | ||
| ``` | ||
|
||
|
|
||
| For example, if a compressed archive has `archive.T = 16:00` (for simplicity, | ||
| we omit dates and seconds from the timestamp) and `TTL = 1 hour`, it will be | ||
| considered expired after `T = 17:00`, since `T - 16:00 > 1:00` for all `T > 17:00`. | ||
|
|
||
| :::{caution} | ||
| Retention control assumes that archive timestamps are given in **UTC** time. Using retention | ||
| control on archives with local (i.e., non-UTC) timestamps can lead to an effective `TTL` that is | ||
| different from the intended value. | ||
|
|
||
| In the example above, if the package operates on a system in EDT (UTC-4) and `archive.T = 16:00` | ||
| is a local timestamp, then a garbage collection job operating at 16:30 local time will convert | ||
| `16:30 EDT` to `20:30 UTC`, and the expiry calculation will be `20:30 - 16:00 > 1:00`. In this | ||
| case, the archive would be considered expired, and would be deleted, even though it wouldn't have | ||
| actually reached its intended retention period. | ||
|
||
|
|
||
| To avoid this issue, either generate logs with UTC timestamps or adjust the retention period to | ||
| account for the offset: | ||
|
|
||
| `adjusted_retention_period = retention_period - signed_UTC_offset` | ||
| ::: | ||
|
||
|
|
||
| - **Search Result Expiry:** | ||
| A search result is considered expired if its retention period has elapsed since the search was | ||
|
||
| completed, i.e., the difference between `T` and `search_result.T` has surpassed `TTL`. | ||
| ```text | ||
| if (T - search_result.T > TTL) then EXPIRED | ||
| ``` | ||
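
The two expiry rules above share one comparison, so they can be sketched with a single predicate. The snippet below is a minimal illustration, not CLP's implementation: the function names `is_expired` and `adjusted_retention_period` and the example dates are assumptions made for this sketch. It also replays the EDT pitfall from the caution block and shows how the adjusted retention period compensates for it.

```python
from datetime import datetime, timedelta, timezone


def is_expired(now: datetime, item_ts: datetime, ttl: timedelta) -> bool:
    """Hypothetical helper: True when `now - item_ts > ttl` (strict comparison)."""
    return now - item_ts > ttl


def adjusted_retention_period(ttl: timedelta, utc_offset: timedelta) -> timedelta:
    """Hypothetical helper: retention_period - signed_UTC_offset."""
    return ttl - utc_offset


ttl = timedelta(hours=1)
# archive.T = 16:00; the date is arbitrary and only makes the value concrete.
archive_ts = datetime(2025, 7, 1, 16, 0, tzinfo=timezone.utc)

# At exactly T = 17:00 the archive is not yet expired; one minute later it is.
print(is_expired(datetime(2025, 7, 1, 17, 0, tzinfo=timezone.utc), archive_ts, ttl))
print(is_expired(datetime(2025, 7, 1, 17, 1, tzinfo=timezone.utc), archive_ts, ttl))

# Pitfall: 16:00 was really a local EDT (UTC-4) timestamp but is read as UTC.
now_utc = datetime(2025, 7, 1, 20, 30, tzinfo=timezone.utc)  # 16:30 EDT
print(is_expired(now_utc, archive_ts, ttl))  # 4:30 > 1:00, so treated as expired

# With the adjusted period (1:00 - (-4:00) = 5:00), 4:30 > 5:00 is false.
edt = timedelta(hours=-4)
print(is_expired(now_utc, archive_ts, adjusted_retention_period(ttl, edt)))
```

The strict `>` means an item whose age equals `TTL` exactly is kept until a later sweep.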
|
||
|
|
||
|
||
| --- | ||
|
|
||
| ## Configuration | ||
| CLP allows users to specify different **retention_periods** for different types of data. | ||
| Additionally, the frequency of garbage collection job execution for each type of data can be | ||
| configured to a customized **sweep_interval**. These settings can be configured in | ||
| `etc/clp-config.yml`. | ||
|
|
||
| ### Configure retention period | ||
| To configure a retention period, update the appropriate `.retention_period` key in | ||
| `etc/clp-config.yml` with the desired retention period in minutes. | ||
|
|
||
|
||
| For example, to configure an archive retention period of 30 days (43,200 minutes): | ||
| ```yaml | ||
| archive_output: | ||
| # Other archive_output settings | ||
|
|
||
| # Retention period for archives, in minutes. | ||
| # Set to null to disable automatic deletion. | ||
| retention_period: 43200 | ||
| ``` | ||
|
||
| Similarly, to configure a search result retention period of 1 day (1440 minutes): | ||
| ```yaml | ||
| results_cache: | ||
| # Other results_cache settings | ||
|
|
||
| # Retention period for search results, in minutes. | ||
| # Set to null to disable automatic deletion. | ||
| retention_period: 1440 | ||
| ``` | ||
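
Because `retention_period` is given in minutes, it can help to derive the value from a more readable duration rather than computing it by hand. The helper below is a small sketch for that arithmetic only; `to_minutes` is a hypothetical name and is not part of CLP.

```python
from datetime import timedelta


def to_minutes(period: timedelta) -> int:
    """Hypothetical helper: convert a duration to whole minutes."""
    return period // timedelta(minutes=1)


print(to_minutes(timedelta(days=30)))  # 43200, the archive example above
print(to_minutes(timedelta(days=1)))   # 1440, the search result example above
```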
|
||
| ### Configure sweep interval | ||
|
||
| The **`garbage_collector.sweep_interval`** parameter specifies the time interval at which garbage | ||
| collector jobs run to collect and delete expired data. | ||
|
|
||
| To configure a custom sweep frequency for different retention targets, you can set the subfields | ||
| under `garbage_collector.sweep_interval` individually in `etc/clp-config.yml`. For example, to | ||
| configure a sweep interval of 15 minutes for search results and 3 hours (180 minutes) for archives, | ||
| enter the following: | ||
|
|
||
| ```yaml | ||
| garbage_collector: | ||
| logging_level: "INFO" | ||
|
||
| # Interval (in minutes) at which garbage collector jobs run | ||
| sweep_interval: | ||
| archive: 180 | ||
| search_result: 15 | ||
| ``` | ||
|
|
||
| :::{note} | ||
| If the `.retention_period` for a data type is set to `null`, the corresponding garbage collection | ||
| task will not run even if `garbage_collector.sweep_interval.<datatype>` is configured. | ||
| ::: | ||
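
The interaction between `.retention_period` and `sweep_interval` described in the note can be sketched as follows. This is an illustration of the documented rule only: the function `enabled_sweeps` and the plain-dict representation of `clp-config.yml` are assumptions for this sketch, not CLP's actual code.

```python
def enabled_sweeps(config: dict) -> dict:
    """Hypothetical helper: map each data type to its sweep interval (minutes),
    skipping any type whose retention period is null (None)."""
    retention = {
        "archive": config.get("archive_output", {}).get("retention_period"),
        "search_result": config.get("results_cache", {}).get("retention_period"),
    }
    intervals = config.get("garbage_collector", {}).get("sweep_interval", {})
    return {
        name: intervals[name]
        for name, period in retention.items()
        if period is not None and name in intervals
    }


# Archive retention is disabled (null), so only the search result sweep runs.
config = {
    "archive_output": {"retention_period": None},
    "results_cache": {"retention_period": 1440},
    "garbage_collector": {"sweep_interval": {"archive": 180, "search_result": 15}},
}
print(enabled_sweeps(config))  # {'search_result': 15}
```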
|
|
||
| --- | ||
|
|
||
| ## Internal | ||
|
||
|
|
||
| This section documents some of CLP’s internal behavior for retention and garbage collection. | ||
|
|
||
| ### Handling data race conditions | ||
|
||
| CLP's retention system is designed to avoid data race conditions that may arise from the deletion of | ||
| archives or search results that may still be in use by active jobs. CLP employs the following | ||
| mechanisms to avoid these conditions: | ||
|
|
||
| - If any query job is running, CLP conservatively calculates a **safe expiry timestamp** based on | ||
| the earliest active search job. This ensures no archive that could be searched is deleted. | ||
|
|
||
| - CLP will **not** search an archive once it is considered expired, even if it has not yet been | ||
| deleted by the garbage collector. | ||
|
||
|
|
||
| :::{warning} | ||
| A hanging search job will prevent CLP from deleting expired archives. | ||
| Restarting the query scheduler will mark such jobs as failed and allow garbage collection to resume. | ||
| ::: | ||
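
One way to picture the safe expiry timestamp is sketched below. The text above only states that it is based on the earliest active search job. The exact formula here (earliest job start time minus `TTL`) and the name `safe_expiry_cutoff` are assumptions made for illustration and may differ from what CLP actually does.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def safe_expiry_cutoff(
    now: datetime, ttl: timedelta, active_job_start_times: Iterable[datetime]
) -> datetime:
    """Hypothetical sketch: archives with a timestamp older than the returned
    cutoff were already expired when the earliest active job started, so no
    running job could still be searching them."""
    reference = min([now, *active_job_start_times])
    return reference - ttl


now = datetime(2025, 7, 1, 12, 0, tzinfo=timezone.utc)
ttl = timedelta(hours=1)

# No active jobs: the cutoff is simply now - TTL.
print(safe_expiry_cutoff(now, ttl, []))

# A job that started at 11:00 pulls the cutoff back to 10:00.
job_start = datetime(2025, 7, 1, 11, 0, tzinfo=timezone.utc)
print(safe_expiry_cutoff(now, ttl, [job_start]))
```

Under this assumption, a job that never finishes keeps the cutoff pinned in the past, which matches the warning above about hanging search jobs blocking deletion.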
|
|
||
| ### Fault tolerance | ||
| The garbage collector can resume execution from where it left off if a previous run fails. | ||
|
||
| This design ensures that CLP does not fall into an inconsistent state due to partial deletions. | ||
|
|
||
| If the CLP package stops unexpectedly while a garbage collection task is running (for example, due | ||
| to a host machine shutdown), simply restart the package and the garbage collector will continue from | ||
| the point of failure. | ||
|
|
||
| :::{note} | ||
| During failure recovery, there may be a temporary period during which an archive no longer exists in | ||
| the database, but still exists on disk or in object storage. Once recovery is complete, the physical | ||
| archive will also be deleted. | ||
| ::: | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -63,6 +63,7 @@ guides-overview | |
| guides-using-object-storage/index | ||
| guides-multi-node | ||
| guides-using-presto | ||
| guides-retention | ||
|
||
| ::: | ||
|
|
||
| :::{toctree} | ||
|
|
||
🧹 Nitpick (assertive)
Add cross-link for discoverability to the new component row
Point readers from the components table directly to the new guide.