Conversation

@wortmanb
Proposed Changes

  • Adds deepfreeze functionality
  • Adds s3client object and subclasses

Known issues

  • Testing probably isn't up to Curator standards.

Bret Wortman and others added 30 commits October 9, 2024 09:01
First draft of schema definition for input values to deepfreeze.
Added defaults and made most schema elements optional
This will almost certainly require changes to the actual deepfreeze code
so that setup and rotate can be separately invoked.
Align the year and month options so they're options, not arguments,
since they don't have to be specified. Also, removed trailing comma and
spaces from "keep" key.
Missing parenthesis and "n"
Trying to make sure all the plumbing is in place to allow this to be run
from curator_cli and removed from the actions list in curator.
Removed the click.Group in favor of simplifying this so it works like
other actions.
At this point, the code appears to be setting itself up properly; the
first error to arise comes from not being able to reach Elasticsearch,
which is true in the environment I'm in right now.
Included a link to the documentation page for delete_searchable_snapshot,
and reminded users that this needs to be set for deepfreeze to work.
This update alters the click configuration so that deepfreeze is a group
and each action is a command. I've coded do_<action> methods in the
actions file, but these aren't functional yet. I need to get with Aaron
to see how the plumbing is done so I don't do something weird. There's
no command currently set up with sub-commands, so I don't have a pattern
to go from.
I refactored the deepfreeze class into four action classes and
externalized all the common routines they might share because DRY, of
course...
This now correctly invokes the right action classes. Some required
params aren't being enforced, but that's a job for tomorrow.
Added a save_settings method to persist global settings to the
deepfreeze status index.
This wasn't working when trying to map with filters.
I added several new options and adjusted others so that we can now
specify --rotate_by and choose bucket or path. Then the suffix gets
applied either to the bucket name or the path name, depending. The repo
name will always get the suffix.
Switched most settings to being part of a Settings object.
Completed updating Rotate up through ILM changes.
Fully implemented style.
Bret Wortman added 29 commits October 17, 2025 15:31
The issue was that cleanup wasn't distinguishing between two very different states:

  1. Active restoration (in_progress > 0): Objects are being restored from Glacier; thaw is active
  2. Expired thaw (not_restored > 0 and in_progress == 0): Objects have reverted to Glacier; thaw has expired

  Now the cleanup will:
  - Keep is_thawed=True if objects are actively being restored
  - Set is_thawed=False only if objects have truly reverted to Glacier

  This prevents the critical bug where:
  1. User creates thaw request
  2. S3 restore takes hours
  3. Cleanup runs while restore is in progress
  4. Cleanup would set is_thawed=False (wrong!)
  5. Thaw request cleanup would delete the in-progress request
  6. When S3 restore completes, no record of what to mount

  Now cleanup correctly preserves in-progress thaw requests.

  The changes are in curator/actions/deepfreeze/cleanup.py:
  - Lines 289-338: Fixed do_action() to check the in_progress count
  - Lines 405-432: Fixed do_dry_run() with the same logic
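The decision described above can be sketched as a small pure helper (the function name and the shape of the status dict are assumptions for illustration, not Curator's actual code):

```python
def should_stay_thawed(status: dict) -> bool:
    """Return the is_thawed flag cleanup should keep on the repo.

    `status` mimics the restore-status counts described above
    (key names are illustrative assumptions).
    """
    if status["in_progress"] > 0:
        # Active restoration: objects are still being restored from Glacier.
        return True
    if status["not_restored"] > 0:
        # Thaw has expired: objects have reverted to Glacier.
        return False
    # Fully restored and still available.
    return True
```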
Root Cause Summary:
  The bug was that update_repository_date_range() only expanded date ranges (setting start to an earlier date or end to a later date) but never shrank them. This meant that if indices from wildly different time periods were ever mounted in a repository (even temporarily during testing), the date range would keep growing and never get corrected.

  The fix changes the logic to replace the date range to accurately reflect the indices that are currently mounted, not the cumulative history of all indices ever mounted.
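The replace-not-expand fix can be sketched like this (the function name, signature, and input shape are assumptions; Curator's actual implementation differs):

```python
from datetime import datetime

def compute_date_range(mounted_ranges):
    """Recompute a repository's date range from the (start, end) pairs
    of the indices currently mounted, replacing any previous range
    rather than only ever expanding it. Illustrative sketch only."""
    if not mounted_ranges:
        return None, None
    starts, ends = zip(*mounted_ranges)
    return min(starts), max(ends)
```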
I probably won't use this but it's good to have
New Repository Lifecycle State Management

  New Fields

  thaw_state: str = "frozen"      # Explicit lifecycle state
  thawed_at: datetime = None      # When restore completed
  expires_at: datetime = None     # When restore will/did expire
  is_thawed: bool = False         # DEPRECATED - kept for backward compatibility

  Four Distinct States

  1. frozen: Normal state, in Glacier, not thawed
  2. thawing: S3 restore in progress, waiting for retrieval
  3. thawed: S3 restore complete, mounted and in use
  4. expired: S3 restore expired, reverted to Glacier, ready for cleanup

  State Transition Methods

  repo.start_thawing(expires_at)  # frozen → thawing
  repo.mark_thawed()              # thawing → thawed
  repo.mark_expired()             # thawed → expired
  repo.reset_to_frozen()          # expired → frozen

  How to Update Existing Code

  In thaw.py when initiating restore:
  from datetime import datetime, timedelta, timezone
  expires_at = datetime.now(timezone.utc) + timedelta(days=self.duration)
  repo.start_thawing(expires_at)
  repo.persist(self.client)

  In thaw.py when mounting after restore:
  repo.mark_thawed()
  repo.persist(self.client)

  In cleanup.py when detecting expiry:
  from .constants import THAW_STATE_EXPIRED, THAW_STATE_THAWING

  if status["in_progress"] > 0:
      # Still thawing - do nothing
      pass
  elif status["not_restored"] > 0:
      # Objects reverted to Glacier - mark as expired
      if repo.thaw_state != THAW_STATE_EXPIRED:
          repo.mark_expired()
          repo.persist(self.client)

  In cleanup.py when cleaning up:
  # Only clean up repos in 'expired' state
  if repo.thaw_state == THAW_STATE_EXPIRED:
      repo.reset_to_frozen()
      repo.persist(self.client)

  Benefits

  1. Clear state tracking: No ambiguity between "being thawed" vs "thawed and in use"
  2. Historical tracking: thawed_at and expires_at provide audit trail
  3. Easier debugging: State is explicit in status display
  4. Safer cleanup: Only clean up repos explicitly marked as expired
  5. Backward compatible: Maintains is_thawed boolean for existing code
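The fields and transition methods above might look roughly like this as an in-memory sketch (persistence, validation, and the DEPRECATED-flag sync rules are assumptions; this is not Curator's actual helpers.py):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Repository:
    # Lifecycle fields from the description above.
    thaw_state: str = "frozen"
    thawed_at: Optional[datetime] = None
    expires_at: Optional[datetime] = None
    is_thawed: bool = False  # DEPRECATED - kept in sync for old callers

    def start_thawing(self, expires_at: datetime) -> None:
        # frozen -> thawing: S3 restore initiated, retrieval pending.
        self.thaw_state = "thawing"
        self.expires_at = expires_at

    def mark_thawed(self) -> None:
        # thawing -> thawed: restore complete, repo mounted.
        self.thaw_state = "thawed"
        self.thawed_at = datetime.now(timezone.utc)
        self.is_thawed = True

    def mark_expired(self) -> None:
        # thawed -> expired: restore expired, reverted to Glacier.
        self.thaw_state = "expired"
        self.is_thawed = False

    def reset_to_frozen(self) -> None:
        # expired -> frozen: cleanup resets the repo.
        self.thaw_state = "frozen"
        self.thawed_at = None
        self.expires_at = None
        self.is_thawed = False
```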
1. New Repository Fields (helpers.py)

  Added three new fields to the Repository dataclass:
  - thaw_state: Explicit lifecycle state ("frozen", "thawing", "thawed", "expired")
  - thawed_at: Timestamp when S3 restore completed
  - expires_at: Timestamp when S3 restore will/did expire
  - Maintained is_thawed for backward compatibility

  2. State Transition Methods (helpers.py)

  repo.start_thawing(expires_at)  # frozen → thawing (when initiating S3 restore)
  repo.mark_thawed()              # thawing → thawed (when restore completes & repo mounts)
  repo.mark_expired()             # thawed → expired (when restore expires - NOT USED BY CLEANUP)
  repo.reset_to_frozen()          # expired → frozen (cleanup resets expired repos)

  3. Updated Actions

  thaw.py

  - _thaw_repository(): Calls repo.start_thawing(expires_at) after initiating S3 restore
  - mount_repo() in utilities.py: Calls repo.mark_thawed() when mounting completes

  cleanup.py

  - Simplified logic: Only cleans up repos already in "expired" state
  - Does NOT manage the transition TO expired - that happens elsewhere
  - Unmounts expired repos and calls repo.reset_to_frozen()

  status.py

  - Added State column showing current lifecycle state
  - Added Expires column in thawed repos section showing expiration timestamp

  Remaining Work: Marking Repos as "Expired"

  cleanup.py doesn't mark repos as expired - it only processes repos that are already expired. You'll need to decide where this transition happens:

  Option 1: During thaw --check-status

  When checking S3 restore status, if objects have reverted to Glacier, call repo.mark_expired():

  # In do_check_status() method
  status = check_restore_status(self.s3, repo.bucket, repo.base_path)
  if status["not_restored"] > 0 and status["in_progress"] == 0:
      repo.mark_expired()
      repo.persist(self.client)

  Option 2: Periodic Cron Job

  Create a new command that checks expires_at timestamps and marks repos as expired:

  curator_cli deepfreeze mark_expired  # Checks all repos and marks expired ones

  Option 3: Smart Check in Cleanup

  Add logic to cleanup to check S3 status for thawed repos past their expires_at time and mark them expired before cleaning.
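Option 3 could reduce to a pure check like this (the function name, parameters, and status counts are all illustrative assumptions):

```python
from datetime import datetime, timezone

def should_mark_expired(thaw_state, expires_at, now,
                        in_progress, not_restored):
    """True when a thawed repo past its expires_at has actually
    reverted to Glacier and should be marked expired. Sketch only;
    the real check would query S3 restore status for the repo."""
    return (thaw_state == "thawed"
            and expires_at is not None
            and now >= expires_at
            and in_progress == 0
            and not_restored > 0)
```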

  Benefits of This Approach

  1. Clear state tracking: No ambiguity between "thawing", "thawed", and "expired"
  2. Historical audit trail: thawed_at and expires_at provide complete history
  3. Safer operations: Cleanup only touches explicitly expired repos
  4. Better observability: Status display shows exact lifecycle state
  5. Backward compatible: Maintains is_thawed boolean for existing code
Changes made:

  1. refreeze.py:56-147 - Completely rewrote the Refreeze action to:
    - Accept a thaw_request_id (instead of repo_id)
    - Get all repositories associated with that thaw request
    - Unmount each repository (handles already-unmounted repos gracefully)
    - Reset each repo to frozen state using reset_to_frozen()
    - Mark the thaw request as "completed"
    - Be idempotent (safe to run multiple times)
  2. cli_singletons/deepfreeze.py:354-392 - Updated the CLI command:
    - Changed from --repo-id to --thaw-request-id (required parameter)
    - Updated help text to clarify what refreeze does

  Key design decisions:
  - Refreeze is user-initiated: "I'm done with this thaw"
  - Cleanup remains automatic: runs on schedule to handle expired repos
  - Refreeze doesn't delete indices (that's cleanup's job if needed)
  - No "push to Glacier" - the data never left Glacier, it's just about state management
  - Works even if S3 restore hasn't expired yet
Changes made:

  1. refreeze.py:44-55 - Made thaw_request_id optional:
  - Constructor now accepts thaw_request_id: str = None
  - Logs different messages based on whether a specific ID or all requests will be processed

  2. refreeze.py:57-98 - Added helper methods:
  - _get_open_thaw_requests(): Gets all thaw requests with status "in_progress"
  - _confirm_bulk_refreeze(): Shows a list of requests and prompts for user confirmation

  3. refreeze.py:100-185 - Extracted single-request logic:
  - _refreeze_single_request(): Handles refreezing one request, returns (unmounted, failed) tuples

  4. refreeze.py:187-233 - Rewrote do_action():
  - If thaw_request_id is provided: refreeze that specific request
  - If thaw_request_id is None: get all open requests, show confirmation
prompt, then
  process all
  - Shows appropriate summary based on single vs. bulk mode

  5. refreeze.py:235-313 - Updated do_dry_run():
  - Handles both single and bulk modes
  - Shows detailed output for single requests
  - Shows summary for bulk operations

  6. cli_singletons/deepfreeze.py:354-401 - Updated CLI:
  - Changed required=True to default=None
  - Updated help text to document both modes
  - Added examples for both usages

  Usage:

  # Refreeze a specific thaw request
  curator_cli deepfreeze refreeze -t <thaw-request-id>

  # Refreeze ALL open thaw requests (with confirmation prompt)
  curator_cli deepfreeze refreeze

  # Dry-run to see what would happen
  curator_cli --dry-run deepfreeze refreeze

  When invoked without a thaw ID, it will:
  1. Find all thaw requests with status "in_progress"
  2. Display them with metadata (created date, date range, repo count)
  3. Prompt: "Do you want to proceed with refreezing all these requests? [y/N]:"
  4. Only proceed if user confirms with 'y' or 'yes'
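The confirmation step above amounts to a check like this (the `answer` string stands in for the interactive prompt response; names are illustrative, not the actual _confirm_bulk_refreeze implementation):

```python
def confirm_bulk_refreeze(requests, answer: str) -> bool:
    """Proceed with a bulk refreeze only when there are open requests
    and the user gave an explicit yes; anything else defaults to No."""
    if not requests:
        # Nothing to refreeze, no point in proceeding.
        return False
    return answer.strip().lower() in ("y", "yes")
```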
The comment says "For each index, check if it overlaps with the date range" but the code never actually checks the date range. It just mounts every single index found in the repository without any filtering.

  Here's the problem:

  Current flow:
  1. ✅ Select repositories with date ranges overlapping the request (works correctly via find_repos_by_date_range)
  2. ❌ Mount ALL indices from those repositories (no date filtering)
  3. Mount to data streams if applicable

  What it should do:
  1. ✅ Select repositories with date ranges overlapping the request
  2. For each index in those repositories:
    - Mount the index temporarily
    - Query the index to get its actual @timestamp range
    - If the index's date range overlaps with the requested range, keep it mounted
    - If not, unmount it
  3. Add overlapping indices to data streams if applicable

  Would you like me to fix this by adding the missing date range check? The fix would involve:
  1. Mounting each index
  2. Using get_timestamp_range() to query the index's actual date range
  3. Checking if index_start <= end_date AND index_end >= start_date (overlap check)
  4. Unmounting if no overlap
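The overlap check in step 3 is a one-liner; sketched here with an illustrative function name (it works on any comparable values, e.g. datetimes or epoch millis):

```python
def ranges_overlap(index_start, index_end, start_date, end_date):
    """The missing date range check: keep an index mounted only if its
    @timestamp range intersects the requested [start_date, end_date]."""
    return index_start <= end_date and index_end >= start_date
```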
Ensures machine-readable output for all deepfreeze commands, allowing scripting.
Let's default to keeping the data around for as long as possible, to
avoid issues with it migrating back to Glacier before we're done.
We now create and assign a new "frozen-only" ilm policy to each thawed
index, based on the repository it was thawed from. This prevents all
thawed indices from showing up on Index Management as having lifecycle
errors.
Users can still list all by adding --include-completed or -c
1. Added Status Constants (constants.py)

  - Added THAW_STATUS_IN_PROGRESS, THAW_STATUS_COMPLETED, THAW_STATUS_FAILED, and THAW_STATUS_REFROZEN constants
  - Created THAW_REQUEST_STATUSES list for validation

  2. Updated Refreeze Action (refreeze.py)

  - Changed status from "completed" to THAW_STATUS_REFROZEN when refreeze completes
  - Now properly indicates that thawed data has been cleaned up and returned to frozen state

  3. Added Retention Setting (helpers.py)

  - Added thaw_request_retention_days_refrozen setting (default: 35 days)
  - This aligns with the 30-day max for data to return to Glacier, plus a 5-day buffer

  4. Updated Cleanup Logic (cleanup.py)

  - Added handling for "refrozen" status in both _cleanup_old_thaw_requests() and dry-run mode
  - Refrozen requests are automatically deleted after 35 days

  5. Updated Thaw List Filtering (thaw.py - do_list_requests())

  - Now excludes both "completed" AND "refrozen" requests by default
  - Use --include-completed or -c flag to see all requests
  - Updated help messages to reflect "completed/refrozen" filtering

  6. Updated Status Checking (thaw.py)

  - do_check_status(): Skips refrozen requests with helpful message
  - do_check_all_status(): Filters out refrozen requests before processing

  Status Lifecycle

  The complete thaw request lifecycle is now:

  1. in_progress → Thaw operation is actively running
  2. completed → Thaw succeeded, data is available and mounted
  3. refrozen → Data has been cleaned up via refreeze (new!)
  4. failed → Thaw operation failed

  Retention Periods (Cleanup)

  - Completed: 7 days (default)
  - Failed: 30 days (default)
  - Refrozen: 35 days (new!)
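The retention rule above can be sketched as a small lookup (the defaults come from the list above; the function name, signature, and table are illustrative assumptions, not Curator's actual cleanup code):

```python
# Default retention periods per thaw-request status, in days.
RETENTION_DAYS = {"completed": 7, "failed": 30, "refrozen": 35}

def request_is_stale(status: str, age_days: int) -> bool:
    """True when a thaw request has outlived its retention period and
    cleanup should delete it. Statuses without a retention entry
    (e.g. in_progress) are never deleted."""
    retention = RETENTION_DAYS.get(status)
    return retention is not None and age_days > retention
```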

  All syntax validation passed! The new status properly distinguishes between "thaw completed and data available" vs "thaw was completed but has been cleaned up."