Merged
Changes from 148 commits
155 commits
73f0d69
Barebone
haiqi96 Jun 17, 2025
bc4c464
backup of progress
haiqi96 Jun 18, 2025
e6238a2
Backup for initial handling for streams
haiqi96 Jun 18, 2025
653f3dc
Backup for initial handling for streams
haiqi96 Jun 18, 2025
a269a97
fix
haiqi96 Jun 18, 2025
1742ff0
small update to use fancier syntax
haiqi96 Jun 18, 2025
efbb47b
fixes
haiqi96 Jun 18, 2025
5714945
linter yeah
haiqi96 Jun 18, 2025
b1fe0d4
Adding simple handler for reusing
haiqi96 Jun 19, 2025
a5f5a3b
commit to propogate change
haiqi96 Jun 19, 2025
53d7417
Add handler
haiqi96 Jun 19, 2025
ccd23dc
Fix mistakes in the handler logic
haiqi96 Jun 19, 2025
5bb3685
Update scheduler to handle logs of time.
haiqi96 Jun 19, 2025
bc688b9
Merge branch 'main' into retension_period
haiqi96 Jun 19, 2025
0c8c6f6
Refactor dataset related code
haiqi96 Jun 20, 2025
0d186e6
Refactor dataset related code
haiqi96 Jun 20, 2025
75ac0ff
further refactor
haiqi96 Jun 20, 2025
bb1e5f4
Linter
haiqi96 Jun 20, 2025
ba7cfe1
A few more fixes
haiqi96 Jun 20, 2025
68454c6
Linter fixes
haiqi96 Jun 20, 2025
5eccaaf
Merge branch 'DatasetRefactor' into retension_period
haiqi96 Jun 20, 2025
c1de746
missing fixes
haiqi96 Jun 20, 2025
f08802b
Merge remote-tracking branch 'origin/DatasetRefactor' into retension_…
haiqi96 Jun 20, 2025
d797198
Fix mistake
haiqi96 Jun 20, 2025
c5dc9b9
Merge remote-tracking branch 'origin/DatasetRefactor' into retension_…
haiqi96 Jun 20, 2025
8c39e77
actually fixing
haiqi96 Jun 20, 2025
ea4318e
Merge remote-tracking branch 'origin/DatasetRefactor' into retension_…
haiqi96 Jun 20, 2025
2c97441
Intermediate backup for archive retention
haiqi96 Jun 20, 2025
2eff448
Update
haiqi96 Jun 20, 2025
d570ab6
Linter again
haiqi96 Jun 20, 2025
06332f4
Merge remote-tracking branch 'origin/DatasetRefactor' into retension_…
haiqi96 Jun 20, 2025
d5e8e28
some renaming
haiqi96 Jun 20, 2025
8a79b9b
adding reminder for myself
haiqi96 Jun 23, 2025
8c77119
Fixing permissions
haiqi96 Jun 23, 2025
3a1afb2
Add batch deletion support
haiqi96 Jun 24, 2025
73d76ac
Linter + code clean up
haiqi96 Jun 24, 2025
5745e65
More refactor
haiqi96 Jun 24, 2025
f3ba8b0
renaming
haiqi96 Jun 24, 2025
e566e74
Prepare for rearrangement
haiqi96 Jun 24, 2025
db9a508
Optimize logger
haiqi96 Jun 24, 2025
2f0f95a
Further refactor
haiqi96 Jun 25, 2025
e310102
Use asyncio
haiqi96 Jun 25, 2025
a332799
Refactoring
haiqi96 Jun 25, 2025
cb53857
Refactoring
haiqi96 Jun 25, 2025
85b7823
Update clp-config
haiqi96 Jun 25, 2025
398ab5e
Merge branch 'main' into DatasetRefactor
haiqi96 Jun 25, 2025
3209ddd
Merge branch 'DatasetRefactor' into retension_period
haiqi96 Jun 25, 2025
3c5b0e4
New line at eof
haiqi96 Jun 25, 2025
fb41607
Refactor retention cleaner name
haiqi96 Jun 25, 2025
386453b
Clean up
haiqi96 Jun 25, 2025
945c97b
linter
haiqi96 Jun 25, 2025
1845462
Adding more docstrings
haiqi96 Jun 25, 2025
bed13df
Temporarily remove stream retention
haiqi96 Jun 25, 2025
0d8d679
Linter
haiqi96 Jun 25, 2025
d40e773
Revert change for stream
haiqi96 Jun 25, 2025
7759a7a
Merge remote-tracking branch 'origin/main' into DatasetRefactor
haiqi96 Jun 27, 2025
271b8b3
Merge branch 'DatasetRefactor' into retension_period
haiqi96 Jun 27, 2025
e6b8cc7
Linter
haiqi96 Jun 27, 2025
c0b8563
Merge branch 'DatasetRefactor' into retension_period
haiqi96 Jun 27, 2025
7a468c3
Merge branch 'main' into DatasetRefactor
Bill-hbrhbr Jun 29, 2025
1dd1cea
Move default dataset metadata table creation to start_clp
Bill-hbrhbr Jun 29, 2025
a0c3c29
Remove unused import
Bill-hbrhbr Jun 29, 2025
a9bf615
Address review comments
Bill-hbrhbr Jun 30, 2025
fe05f5f
Replace the missing SUFFIX
Bill-hbrhbr Jun 30, 2025
39a9278
Move suffix constants from clp_config to clp_metadata_db_utils local …
Bill-hbrhbr Jun 30, 2025
7124828
Refactor archive_manager.py.
kirkrodrigues Jun 30, 2025
eb80992
Refactor s3_utils.py.
kirkrodrigues Jun 30, 2025
5ed44e7
compression_task.py: Fix typing errors and minor refactoring.
kirkrodrigues Jun 30, 2025
af6b508
compression_scheduler.py: Remove exception swallow which will hide un…
kirkrodrigues Jun 30, 2025
67fb01f
Refactor query_scheduler.py.
kirkrodrigues Jun 30, 2025
d6ad4de
clp_metadata_db_utils.py: Minor refactoring.
kirkrodrigues Jun 30, 2025
ff7d700
clp_metadata_db_utils.py: Rename _generic_get_table_name -> _get_tabl…
kirkrodrigues Jun 30, 2025
7ffc77c
clp_metadata_db_utils.py: Alphabetize new public functions.
kirkrodrigues Jun 30, 2025
0255cbd
clp_metadata_db_utils.py: Reorder public and private functions for co…
kirkrodrigues Jun 30, 2025
1076a3f
initialize-clp-metadata-db.py: Remove changes unrelated to PR.
kirkrodrigues Jun 30, 2025
71c4d82
Move default dataset creation into compression_scheduler so that it r…
kirkrodrigues Jun 30, 2025
6bd9372
Apply suggestions from code review
kirkrodrigues Jul 1, 2025
84df2e2
Merge branch 'main' into DatasetRefactor
kirkrodrigues Jul 1, 2025
983bea1
Remove bug fix that's no longer necessary.
kirkrodrigues Jul 1, 2025
bdb7817
Fix bug where dataset has a default value instead of None when using …
Bill-hbrhbr Jul 1, 2025
a82a267
Correctly feed in the input config dataset names
Bill-hbrhbr Jul 1, 2025
f699496
Remove unnecessary changes
Bill-hbrhbr Jul 1, 2025
94e8ca1
Merge branch 'DatasetRefactor' into retension_period
haiqi96 Jul 2, 2025
90ce0a4
Update the webui to pass the dataset name in the clp-json code path (…
kirkrodrigues Jul 2, 2025
d6f9e5a
Move dataset into the user function
haiqi96 Jul 2, 2025
dc6a706
Merge branch 'DatasetRefactor' of https://github.com/haiqi96/clp_fork…
haiqi96 Jul 2, 2025
76bcb4a
Remove unnecessary f string specifier
haiqi96 Jul 2, 2025
a4e6f83
Apply suggestions from code review
haiqi96 Jul 2, 2025
3c53cb0
Merge branch 'DatasetRefactor' into retension_period
haiqi96 Jul 2, 2025
66eba87
Polishing
haiqi96 Jul 2, 2025
7b42568
Add import type.
kirkrodrigues Jul 2, 2025
097e47c
Polishing more
haiqi96 Jul 3, 2025
8dc8e26
try adding query job handling
haiqi96 Jul 3, 2025
afe43ce
Merge branch 'main' into DatasetRefactor
haiqi96 Jul 3, 2025
85a3164
Merge remote-tracking branch 'origin/DatasetRefactor' into retension_…
haiqi96 Jul 3, 2025
af75118
Merge remote-tracking branch 'origin/main' into retension_period
haiqi96 Jul 3, 2025
e5e90f7
Fix wrong order
haiqi96 Jul 3, 2025
bac6767
Linter
haiqi96 Jul 3, 2025
de1c334
submit not-fully-tested-code
haiqi96 Jul 3, 2025
9fdb3d5
Apply suggestions from code review
haiqi96 Jul 4, 2025
2245244
Update components/job-orchestration/job_orchestration/retention/archi…
haiqi96 Jul 4, 2025
b1e5a2c
Apply suggestions from code review
haiqi96 Jul 4, 2025
4e93a30
Fix
haiqi96 Jul 4, 2025
6719872
Merge remote-tracking branch 'origin/main' into retension_period
haiqi96 Jul 4, 2025
f9fa626
nit fixes
haiqi96 Jul 4, 2025
450e16a
Update the logic to consider all running query jobs
haiqi96 Jul 17, 2025
ade2e27
Merge remote-tracking branch 'origin/main' into retension_period
haiqi96 Jul 17, 2025
f1584ff
linter
haiqi96 Jul 30, 2025
2c57dd6
Apply suggestions from code review
haiqi96 Aug 1, 2025
5f479c5
address code review concern
haiqi96 Aug 1, 2025
8c5fb89
Batch renaming
haiqi96 Aug 1, 2025
11e695f
Linter
haiqi96 Aug 1, 2025
1291c3f
Further refactor
haiqi96 Aug 1, 2025
f8c7369
Linter
haiqi96 Aug 1, 2025
9b48c9b
Apply suggestions from code review
haiqi96 Aug 4, 2025
6cff24d
Merge remote-tracking branch 'origin/main' into retension_period
haiqi96 Aug 4, 2025
c367c15
address review concern
haiqi96 Aug 4, 2025
e282020
Update logging
haiqi96 Aug 4, 2025
b93bb4b
Update components/job-orchestration/job_orchestration/garbage_collect…
haiqi96 Aug 4, 2025
390333f
Address review comments
haiqi96 Aug 5, 2025
a4546cf
Fix timezone
haiqi96 Aug 5, 2025
2c4821a
Apply suggestions from code review
haiqi96 Aug 7, 2025
9d5d087
Address code review comments and slight improved logging.
haiqi96 Aug 7, 2025
74af600
Linter
haiqi96 Aug 7, 2025
5f4f1e3
Add docs
haiqi96 Aug 8, 2025
c8a919c
Apply suggestions from code review
haiqi96 Aug 10, 2025
c911ccc
Apply suggestions from code review
haiqi96 Aug 10, 2025
a02ed22
Apply suggestions from code review
haiqi96 Aug 10, 2025
17defe5
Update
haiqi96 Aug 10, 2025
f66b378
slight update
haiqi96 Aug 10, 2025
34a52f3
Add empty line at eof
haiqi96 Aug 11, 2025
d9a3d09
Merge remote-tracking branch 'origin/main' into retention_readme
haiqi96 Aug 12, 2025
c8b2500
Merge branch 'main' into retention_readme
haiqi96 Aug 13, 2025
e3ff836
Update multi-node doc
haiqi96 Aug 15, 2025
8b2631a
Add section for non UTC timestamp
haiqi96 Aug 15, 2025
3b4f74f
Merge remote-tracking branch 'origin/main' into retention_readme
haiqi96 Aug 15, 2025
434a5ae
Apply suggestions from code review
haiqi96 Aug 16, 2025
66bfe4b
Reordering
haiqi96 Aug 16, 2025
d86f8fd
Apply suggestions from code review
haiqi96 Aug 16, 2025
562bffe
Merge branch 'retention_readme' of https://github.com/haiqi96/clp_for…
haiqi96 Aug 16, 2025
91d468e
Address code review comments
haiqi96 Aug 16, 2025
7f0c92e
Apply markdown lint configs.
kirkrodrigues Aug 19, 2025
0931335
Merge branch 'main' into retention_readme
kirkrodrigues Aug 19, 2025
e993809
Revise docs.
kirkrodrigues Aug 20, 2025
d2800cf
Apply suggestions from code review
kirkrodrigues Aug 20, 2025
9f85fd3
Add line to card
quinntaylormitchell Aug 20, 2025
9654d87
Properly format one of the times, and remove endline blankspace
quinntaylormitchell Aug 20, 2025
edc2b35
Apply suggestions from code review
kirkrodrigues Aug 20, 2025
21e2066
Rephrase expiry criteria formua.
kirkrodrigues Aug 20, 2025
02eb9aa
Minor edits and add details about search results retention.
kirkrodrigues Aug 20, 2025
15df283
Fix the example and make it more readable.
kirkrodrigues Aug 20, 2025
fd6f121
Haiqi suggestion.
kirkrodrigues Aug 20, 2025
1596cc6
Merge branch 'main' into retention_readme
haiqi96 Aug 20, 2025
0d8a48e
Address review comments and also fix the order presto card.
haiqi96 Aug 20, 2025
15ba738
Apply suggestions from code review
haiqi96 Aug 20, 2025
1 change: 1 addition & 0 deletions docs/conf/conf.py
@@ -24,6 +24,7 @@
myst_enable_extensions = [
"attrs_block",
"colon_fence",
"dollarmath",
]

myst_heading_anchors = 4
4 changes: 4 additions & 0 deletions docs/src/user-guide/guides-multi-node.md
@@ -31,6 +31,7 @@ worker components. The tables below list the components and their functions.
| query_scheduler | Scheduler for search/aggregation jobs |
| results_cache | Storage for the workers to return search results to the UI |
| webui | Web server for the UI |
| garbage_collector | Background process for retention control |
:::
Comment on lines +34 to 35
Contributor


🧹 Nitpick (assertive)

Add cross-link for discoverability to the new component row

Point readers from the components table directly to the new guide.

-| garbage_collector     | Background process for retention control                        |
+| garbage_collector     | Background process for retention control; see [Retention control](guides-retention) |
🤖 Prompt for AI Agents
In docs/src/user-guide/guides-multi-node.md around lines 34–35, the components
table row for "garbage_collector" should include a cross-link to the new guide
for discoverability; update the table cell to make the component name or its
description a Markdown link pointing to the new guide (for example the relative
path docs/src/user-guide/guides-garbage-collector.md or the correct guide
filename), ensuring the link text remains clear (e.g., "garbage_collector —
Background process for retention control") and the link URL points to the new
guide.


:::{table} Worker components
@@ -71,6 +72,8 @@ Running additional workers increases the parallelism of compression and search/a
4. Set `archive_output.directory` to a directory on the distributed filesystem.
* Ideally, the directory should be empty or should not yet exist (CLP will create it) since
CLP will write several files and directories directly to the given directory.
5. (Optional) Configure retention periods for archives and search results. See
[retention control](guides-retention) for details.
Member


This feels a bit out of scope of this doc, right?


5. Download and extract the package on all nodes.
6. Copy the `credentials.yml` and `clp-config.yml` files that you created above and paste them
@@ -93,6 +96,7 @@ but all components in a group must be started before starting a component in the

* `compression_scheduler`
* `query_scheduler`
* `garbage_collector`

**Group 3 components:**

7 changes: 7 additions & 0 deletions docs/src/user-guide/guides-overview.md
@@ -25,4 +25,11 @@ Multi-node deployment
^^^
How to deploy CLP across multiple nodes.
:::

:::{grid-item-card}
:link: guides-retention
Retention control
^^^
How to configure retention control for CLP.
:::
::::
211 changes: 211 additions & 0 deletions docs/src/user-guide/guides-retention.md
@@ -0,0 +1,211 @@
# Setting up data retention periods

CLP can automatically delete *archives* and/or *search results* once they're older than a configured
retention period. This guide explains:

* [How retention works in CLP](#how-retention-works)
* [How to configure retention](#retention-settings)
* [Additional concerns worth noting](#additional-concerns)

## How retention works

To support retention periods, CLP's garbage collector component periodically scans for and deletes
expired data (archives or search results). To understand the high-level algorithm, first consider
the following definitions:

| Term | Description |
|---------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| $sweep\_interval$ | The interval (in minutes) at which the garbage collector wakes up to check for expired data. |
| $retention\_period$ | The duration (in minutes) for which data (an archive or search result) is retained before it is considered expired. |
| $current\_time$ | The time at which the garbage collector is performing a check. |
| $data\_timestamp$ | The end of the time range for the data being evaluated for expiration (e.g., for an archive, this is the timestamp of the most recent log event contained in the archive). |

When the garbage collector wakes up, it will scan for and delete any data that fits the expiry
criteria shown in [Figure 1](#figure-1):

(figure-1)=
:::{card}

$$data\_timestamp < current\_time - retention\_period$$
Collaborator


See high-level comment.


+++
**Figure 1**: The criteria for determining whether a piece of data has expired and should be
deleted.
:::

For example, if...

* some data has $data\_timestamp = 1440$;
* $retention\_period = 30$; and
* $current\_time = 1500$ when the garbage collector runs;

... then the garbage collector will determine that the data has expired and delete it.
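The check can be sketched in a few lines of Python. This is only an illustration of the criteria in [Figure 1](#figure-1), not CLP's actual implementation; the function name `is_expired` is an assumption, and all values are in minutes as in the example above.

```python
def is_expired(data_timestamp: int, current_time: int, retention_period: int) -> bool:
    """Illustrative check of the Figure 1 criteria (hypothetical helper, not CLP code)."""
    return data_timestamp < current_time - retention_period


# The example above: 1440 < 1500 - 30, so the data has expired.
print(is_expired(data_timestamp=1440, current_time=1500, retention_period=30))

# Data with a more recent timestamp is kept: 1480 is not less than 1470.
print(is_expired(data_timestamp=1480, current_time=1500, retention_period=30))
```

Note that the comparison is strict, so data whose timestamp equals `current_time - retention_period` is not yet considered expired.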

---

## Retention settings

There are three settings that affect how CLP's data retention operates:

* [Archive retention period](#archive-retention-period)
* [Search result retention period](#search-result-retention-period)
* [Garbage collector sweep interval](#garbage-collector-sweep-interval)

All settings can be configured in `etc/clp-config.yml`, which is located in the CLP package
directory.

### Archive retention period

This setting determines how long an archive should be retained before it is automatically deleted.
To configure it, modify the value of `archive_output.retention_period` in `etc/clp-config.yml`.

For example, to configure an archive retention period of 30 days (43,200 minutes), use:

```yaml
archive_output:
  # ... Other archive_output settings...

  # Retention period for archives, in minutes.
  # Set to null to disable automatic deletion.
  retention_period: 43200
```

By default, `archive_output.retention_period` is `null`, which means that archives will be retained
indefinitely.

:::{warning}
If your log events use timestamps that *aren't* in the UTC time zone, you will need to adjust the
configured retention period to ensure expired archives are deleted at the correct time. See
[Handling log events with non-UTC timestamps](#handling-log-events-with-non-utc-timestamps) for
details.
:::

#### Archive expiry criteria

For archives, $data\_timestamp$ (in the expiry criteria equation from [Figure 1](#figure-1)) is the
timestamp of the most recent log event contained in the archive.

:::{note}
This is not the timestamp at which your logs were compressed. Therefore, if you compress
particularly old logs that have already expired according to the expiry criteria, they will be
deleted the next time the garbage collector runs.
:::

#### Handling log events with non-UTC timestamps

If your log events use timestamps that **aren't** in the UTC time zone, you will need to adjust the
configured retention period to ensure expired archives are deleted at the correct time. This is
because CLP currently doesn't support parsing time zone information, and the garbage collector runs
based on the UTC time zone.

For example, let's say:

* your log events use timestamps in the AWST time zone (UTC+8);
* you set a retention period of 1 hour;
* you have an archive with $data\_timestamp = 08:00$ AWST; and
* the garbage collector runs at $current\_time = 09:01$ AWST.

When the garbage collector runs, it will evaluate the archive's expiry criteria, substituting
$08:00$ for $data\_timestamp$, and $01:01$ for $current\_time$, since $09:01$ AWST = $01:01$ UTC.
The equation then becomes $08:00 < 01:01 - 01:00$, which evaluates to false. Thus, the garbage
collector won't delete the archive; in fact, it won't delete it until $09:01$ UTC, which is 8 hours
later than it should've been deleted.

Similarly, archives may be deleted prematurely if your log events use timestamps in a time zone that
is behind UTC.

To avoid this issue, you can adjust the retention period to account for the offset of the log
events' time zone from UTC:

$$adjusted\_retention\_period = retention\_period - signed\_utc\_offset$$
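
As a rough sketch of the formula above, the adjustment can be computed as follows. The function name is hypothetical, and the sketch assumes that both the retention period and the signed UTC offset are expressed in minutes (positive for time zones ahead of UTC, negative for those behind).

```python
def adjusted_retention_period(retention_period: int, signed_utc_offset: int) -> int:
    """Apply the adjustment formula above (hypothetical helper; all values in minutes)."""
    return retention_period - signed_utc_offset


# A 30-day retention period (43,200 minutes) for logs timestamped in UTC+8.
print(adjusted_retention_period(43200, 8 * 60))

# The same retention period for logs timestamped in UTC-5.
print(adjusted_retention_period(43200, -5 * 60))
```

For a retention period shorter than a positive offset (such as the 1-hour example above with UTC+8), the formula yields a non-positive value; this guide doesn't state how CLP handles such a value.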

### Search result retention period

This setting determines how long search results should be retained before they are automatically
deleted. To configure it, modify the value of `results_cache.retention_period` in
`etc/clp-config.yml`.

For example, to configure a search result retention period of 1 day (1,440 minutes), use:

```yaml
results_cache:
  # ... Other results_cache settings...

  # Retention period for search results, in minutes.
  # Set to null to disable automatic deletion.
  retention_period: 1440
```

#### Search result expiry criteria

For search results, $data\_timestamp$ (in the expiry criteria equation from [Figure 1](#figure-1))
is the timestamp at which the search completed.

### Garbage collector sweep interval

This setting determines how often the garbage collector wakes up to check for and delete expired
data. To configure it, modify the value of `garbage_collector.sweep_interval` in
`etc/clp-config.yml`.

For example, to configure a sweep interval of 3 hours (180 minutes) for archives and 15 minutes for
search results, use:

```yaml
garbage_collector:
  logging_level: "INFO"

  # Interval (in minutes) at which garbage collector jobs run
  sweep_interval:
    archive: 180
    search_result: 15
```

:::{note}
Since the garbage collector wakes up every $sweep\_interval$ minutes, data may be retained up to
$sweep\_interval$ minutes longer than the configured retention period.
:::
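
As a worked illustration of the note above, assume sweeps run on schedule, deletion itself takes negligible time, and no active query job delays deletion. Writing $age\_at\_deletion$ (a name introduced here only for illustration) for $current\_time - data\_timestamp$ at the sweep that deletes the data, the criteria in [Figure 1](#figure-1) give:

$$retention\_period < age\_at\_deletion \le retention\_period + sweep\_interval$$

For example, with `results_cache.retention_period` set to 1440 and `garbage_collector.sweep_interval.search_result` set to 15, a search result would be deleted when it is more than 1,440 but at most 1,455 minutes old.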

:::{note}
If the value of `archive_output.retention_period` is `null`, the corresponding garbage collection
task will not run even if `garbage_collector.sweep_interval.archive` is configured. The same applies
to `results_cache.retention_period` and `garbage_collector.sweep_interval.search_result`.
:::

---

## Additional concerns

It's worth understanding how CLP's retention system handles data races and ensures fault tolerance,
since both affect how long archives remain queryable and when they're deleted.

### Handling data races

CLP's retention system is designed to avoid deleting expired archives or search results that may
still be in use by active jobs. To do so, CLP employs the following mechanisms:

* If any query job is running, CLP conservatively calculates a **safe expiry timestamp** based on
the earliest active query job. This ensures that no archive that may still be searched by an active
job is deleted.

* CLP will **not** search an archive once it is considered expired, even if it has not yet been
deleted by the garbage collector.

:::{warning}
A hanging search job will prevent CLP from deleting expired archives. Restarting the query scheduler
will mark such jobs as killed and allow garbage collection to resume.
:::
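
This guide doesn't spell out how the safe expiry timestamp is calculated. The following is a minimal sketch of one conservative possibility, assuming the cutoff is measured from the start of the earliest active query job rather than from the current time. The function name, its parameters, and the use of job start times are all assumptions made for illustration; they aren't taken from CLP's code.

```python
from typing import Iterable


def safe_expiry_timestamp(
    current_time: int,
    retention_period: int,
    active_job_start_times: Iterable[int],
) -> int:
    """Hypothetical sketch: data with a timestamp earlier than the returned value is safe to delete."""
    # Measure expiry from the earliest active job's start time if there is one (assumption);
    # otherwise, fall back to the current time.
    reference_time = min([current_time, *active_job_start_times])
    return reference_time - retention_period


# No active jobs: the cutoff is simply current_time - retention_period.
print(safe_expiry_timestamp(1500, 30, []))

# Two active jobs: the cutoff is based on the job that started earliest (at 1450).
print(safe_expiry_timestamp(1500, 30, [1450, 1490]))
```

Under this assumption, any archive below the cutoff was already expired when the earliest active job started, so, per the second mechanism above, that job wouldn't have searched it.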

### Fault tolerance

The garbage collector can resume execution from where it left off if a previous run fails. This
design ensures that CLP does not fall into an inconsistent state due to partial deletions.

If the CLP package stops unexpectedly while a garbage collection task is running (for example, due
to a host machine shutdown), simply restart the package and the garbage collector will resume from
the point of failure.

:::{note}
During failure recovery, there may be a temporary period during which an archive no longer exists in
the database, but still exists on disk or in object storage. Once recovery is complete, the physical
archive will also be deleted.
:::
1 change: 1 addition & 0 deletions docs/src/user-guide/index.md
@@ -63,6 +63,7 @@ guides-overview
guides-using-object-storage/index
guides-multi-node
guides-using-presto
guides-retention
Member


Can we move this above guides-multi-node? Same for the card in the overview.

:::

:::{toctree}