Rc1 #1765
Open
wortmanb wants to merge 265 commits into elastic:master from wortmanb:RC1
Conversation
First draft of schema definition for input values to deepfreeze.
Added defaults and made most schema elements optional
This will almost certainly require changes to the actual deepfreeze code so that setup and rotate can be separately invoked.
Align the year and month options so they're options, not arguments, since they don't have to be specified. Also, removed trailing comma and spaces from "keep" key.
Missing parenthesis and "n"
Trying to make sure all the plumbing is in place to allow this to be run from curator_cli and removed from the actions list in curator.
Removed the click.Group in favor of simplifying this so it works like other actions.
At this point, the code appears to be setting itself up properly; the first error to arise comes from not being able to reach Elasticsearch, which is expected in the environment I'm in right now.
Included a link to the documentation page for delete_searchable_snapshot, and reminded users that this needs to be set for deepfreeze to work.
This update alters the click configuration so that deepfreeze is a group and each action is a command. I've coded do_<action> methods in the actions file, but these aren't functional yet. I need to get with Aaron to see how the plumbing is done so I don't do something weird. There's no command currently set up with sub-commands, so I don't have a pattern to go from.
I refactored the deepfreeze class into four action classes and externalized all the common routines they might share because DRY, of course...
This now correctly invokes the right action classes. Some required params aren't being enforced, but that's a job for tomorrow.
Added a save_settings method to persist global settings to the deepfreeze status index.
This wasn't working when trying to map with filters.
I added several new options and adjusted others so that we can now specify --rotate_by and choose bucket or path. Then the suffix gets applied either to the bucket name or the path name, depending. The repo name will always get the suffix.
Switched most settings to being part of a Settings object. Completed updating Rotate up through ILM changes. Fully implemented style.
The issue was that cleanup wasn't distinguishing between two very different states:
1. Active restoration (in_progress > 0): Objects are being restored from Glacier, thaw is active
2. Expired thaw (not_restored > 0 and in_progress == 0): Objects have reverted to Glacier, thaw has expired
Now the cleanup will:
- Keep is_thawed=True if objects are actively being restored
- Set is_thawed=False only if objects have truly reverted to Glacier
This prevents the critical bug where:
1. User creates thaw request
2. S3 restore takes hours
3. Cleanup runs while restore is in progress
4. Cleanup would set is_thawed=False (wrong!)
5. Thaw request cleanup would delete the in-progress request
6. When S3 restore completes, no record of what to mount
Now cleanup correctly preserves in-progress thaw requests.
The changes are in /Users/bret/git/curator/curator/actions/deepfreeze/cleanup.py:
- Lines 289-338: Fixed do_action() to check in_progress count
- Lines 405-432: Fixed do_dry_run() with same logic
Root Cause Summary: The bug was that update_repository_date_range() only expanded date ranges (setting start to an earlier date or end to a later date) but never shrunk them. This meant that if indices from wildly different time periods were ever mounted in a repository (even temporarily during testing), the date range would keep growing and never get corrected. The fix changes the logic to replace the date range to accurately reflect the indices that are currently mounted, not the cumulative history of all indices ever mounted.
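The replace-not-expand logic can be sketched as follows. This is an illustrative stand-alone version, not the actual update_repository_date_range() in the codebase; the (start, end) tuple representation is an assumption.

```python
from datetime import datetime

def rebuild_date_range(mounted_ranges):
    """Rebuild a repository's date range from currently mounted indices.

    Unlike expand-only logic, this *replaces* the range, so it shrinks
    again once out-of-range indices are unmounted.
    mounted_ranges: list of (start, end) datetime tuples, one per index.
    """
    if not mounted_ranges:
        return None  # nothing mounted -> no range to report
    start = min(r[0] for r in mounted_ranges)
    end = max(r[1] for r in mounted_ranges)
    return start, end
```

The key design point is that the function takes only the currently mounted indices as input, so the cumulative history of everything ever mounted can no longer leak into the stored range.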
I probably won't use this but it's good to have
New Repository Lifecycle State Management
New Fields
thaw_state: str = "frozen"     # Explicit lifecycle state
thawed_at: datetime = None     # When restore completed
expires_at: datetime = None    # When restore will/did expire
is_thawed: bool = False        # DEPRECATED - kept for backward compatibility
Four Distinct States
1. frozen: Normal state, in Glacier, not thawed
2. thawing: S3 restore in progress, waiting for retrieval
3. thawed: S3 restore complete, mounted and in use
4. expired: S3 restore expired, reverted to Glacier, ready for cleanup
State Transition Methods
repo.start_thawing(expires_at) # frozen → thawing
repo.mark_thawed() # thawing → thawed
repo.mark_expired() # thawed → expired
repo.reset_to_frozen() # expired → frozen
How to Update Existing Code
In thaw.py when initiating restore:
from datetime import datetime, timedelta, timezone

expires_at = datetime.now(timezone.utc) + timedelta(days=self.duration)
repo.start_thawing(expires_at)
repo.persist(self.client)
In thaw.py when mounting after restore:
repo.mark_thawed()
repo.persist(self.client)
In cleanup.py when detecting expiry:
from .constants import THAW_STATE_EXPIRED, THAW_STATE_THAWING

if status["in_progress"] > 0:
    # Still thawing - do nothing
    pass
elif status["not_restored"] > 0:
    # Objects reverted to Glacier - mark as expired
    if repo.thaw_state != THAW_STATE_EXPIRED:
        repo.mark_expired()
        repo.persist(self.client)
In cleanup.py when cleaning up:
# Only clean up repos in 'expired' state
if repo.thaw_state == THAW_STATE_EXPIRED:
    repo.reset_to_frozen()
    repo.persist(self.client)
Benefits
1. Clear state tracking: No ambiguity between "being thawed" vs "thawed and in use"
2. Historical tracking: thawed_at and expires_at provide audit trail
3. Easier debugging: State is explicit in status display
4. Safer cleanup: Only clean up repos explicitly marked as expired
5. Backward compatible: Maintains is_thawed boolean for existing code
1. New Repository Fields (helpers.py)
Added three new fields to the Repository dataclass:
- thaw_state: Explicit lifecycle state ("frozen", "thawing", "thawed", "expired")
- thawed_at: Timestamp when S3 restore completed
- expires_at: Timestamp when S3 restore will/did expire
- Maintained is_thawed for backward compatibility
2. State Transition Methods (helpers.py)
repo.start_thawing(expires_at)  # frozen → thawing (when initiating S3 restore)
repo.mark_thawed()              # thawing → thawed (when restore completes & repo mounts)
repo.mark_expired()             # thawed → expired (when restore expires - NOT USED BY CLEANUP)
repo.reset_to_frozen()          # expired → frozen (cleanup resets expired repos)
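Taken together, the fields and transitions above can be sketched as a minimal dataclass. This is illustrative only: the real Repository in helpers.py has many more fields plus a persist() method, and the method bodies here are assumptions based on the descriptions above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class RepositorySketch:
    thaw_state: str = "frozen"   # "frozen" | "thawing" | "thawed" | "expired"
    thawed_at: datetime = None   # when S3 restore completed
    expires_at: datetime = None  # when S3 restore will/did expire
    is_thawed: bool = False      # DEPRECATED, kept for backward compatibility

    def start_thawing(self, expires_at):
        # frozen -> thawing: S3 restore initiated
        self.thaw_state = "thawing"
        self.expires_at = expires_at

    def mark_thawed(self):
        # thawing -> thawed: restore complete, repo mounted
        self.thaw_state = "thawed"
        self.thawed_at = datetime.now(timezone.utc)
        self.is_thawed = True

    def mark_expired(self):
        # thawed -> expired: objects reverted to Glacier
        self.thaw_state = "expired"
        self.is_thawed = False

    def reset_to_frozen(self):
        # expired -> frozen: cleanup resets the repo
        self.thaw_state = "frozen"
        self.thawed_at = None
        self.expires_at = None
        self.is_thawed = False
```

Keeping is_thawed in sync inside each transition is what preserves backward compatibility for callers that still read the boolean.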
3. Updated Actions
thaw.py
- _thaw_repository(): Calls repo.start_thawing(expires_at) after initiating S3 restore
- mount_repo() in utilities.py: Calls repo.mark_thawed() when mounting completes
cleanup.py
- Simplified logic: Only cleans up repos already in "expired" state
- Does NOT manage the transition TO expired - that happens elsewhere
- Unmounts expired repos and calls repo.reset_to_frozen()
status.py
- Added State column showing current lifecycle state
- Added Expires column in thawed repos section showing expiration timestamp
Remaining Work: Marking Repos as "Expired"
cleanup.py doesn't mark repos as expired - it only processes repos that are already expired. You'll need to decide where this transition happens:
Option 1: During thaw --check-status
When checking S3 restore status, if objects have reverted to Glacier, call repo.mark_expired():
# In do_check_status() method
status = check_restore_status(self.s3, repo.bucket, repo.base_path)
if status["not_restored"] > 0 and status["in_progress"] == 0:
    repo.mark_expired()
    repo.persist(self.client)
Option 2: Periodic Cron Job
Create a new command that checks expires_at timestamps and marks repos as expired:
curator_cli deepfreeze mark_expired  # Checks all repos and marks expired ones
Option 3: Smart Check in Cleanup
Add logic to cleanup to check S3 status for thawed repos past their expires_at time and mark them expired before cleaning.
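Option 3 boils down to a predicate like the following. This is a sketch with an assumed name (should_mark_expired does not exist in the codebase), operating on the same in_progress / not_restored counts that check_restore_status() is described as returning.

```python
from datetime import datetime, timezone

def should_mark_expired(thaw_state, expires_at, s3_status, now=None):
    """Decide whether cleanup should transition a repo to 'expired'.

    s3_status mirrors the check_restore_status() output used above:
    a dict with 'in_progress' and 'not_restored' object counts.
    """
    now = now or datetime.now(timezone.utc)
    if thaw_state != "thawed" or expires_at is None:
        return False
    if now < expires_at:
        return False  # restore window has not lapsed yet
    # Past expires_at: only expire once S3 confirms reversion to Glacier
    return s3_status["in_progress"] == 0 and s3_status["not_restored"] > 0
```

Gating the S3 check on expires_at keeps cleanup cheap: repos inside their restore window never trigger an S3 status call.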
Benefits of This Approach
1. Clear state tracking: No ambiguity between "thawing", "thawed", and "expired"
2. Historical audit trail: thawed_at and expires_at provide complete history
3. Safer operations: Cleanup only touches explicitly expired repos
4. Better observability: Status display shows exact lifecycle state
5. Backward compatible: Maintains is_thawed boolean for existing code
Changes made:
1. refreeze.py:56-147 - Completely rewrote the Refreeze action to:
- Accept a thaw_request_id (instead of repo_id)
- Get all repositories associated with that thaw request
- Unmount each repository (handles already-unmounted repos gracefully)
- Reset each repo to frozen state using reset_to_frozen()
- Mark the thaw request as "completed"
- Be idempotent (safe to run multiple times)
2. cli_singletons/deepfreeze.py:354-392 - Updated the CLI command:
- Changed from --repo-id to --thaw-request-id (required parameter)
- Updated help text to clarify what refreeze does
Key design decisions:
- Refreeze is user-initiated: "I'm done with this thaw"
- Cleanup remains automatic: runs on schedule to handle expired repos
- Refreeze doesn't delete indices (that's cleanup's job if needed)
- No "push to Glacier" - the data never left Glacier, it's just about state management
- Works even if S3 restore hasn't expired yet
Changes made:
1. refreeze.py:44-55 - Made thaw_request_id optional:
   - Constructor now accepts thaw_request_id: str = None
   - Logs different messages based on whether a specific ID or all requests will be processed
2. refreeze.py:57-98 - Added helper methods:
   - _get_open_thaw_requests(): Gets all thaw requests with status "in_progress"
   - _confirm_bulk_refreeze(): Shows a list of requests and prompts for user confirmation
3. refreeze.py:100-185 - Extracted single-request logic:
   - _refreeze_single_request(): Handles refreezing one request, returns (unmounted, failed) tuples
4. refreeze.py:187-233 - Rewrote do_action():
   - If thaw_request_id is provided: refreeze that specific request
   - If thaw_request_id is None: get all open requests, show confirmation prompt, then process all
   - Shows appropriate summary based on single vs. bulk mode
5. refreeze.py:235-313 - Updated do_dry_run():
   - Handles both single and bulk modes
   - Shows detailed output for single requests
   - Shows summary for bulk operations
6. cli_singletons/deepfreeze.py:354-401 - Updated CLI:
   - Changed required=True to default=None
   - Updated help text to document both modes
   - Added examples for both usages
Usage:
# Refreeze a specific thaw request
curator_cli deepfreeze refreeze -t <thaw-request-id>
# Refreeze ALL open thaw requests (with confirmation prompt)
curator_cli deepfreeze refreeze
# Dry-run to see what would happen
curator_cli --dry-run deepfreeze refreeze
When invoked without a thaw ID, it will:
1. Find all thaw requests with status "in_progress"
2. Display them with metadata (created date, date range, repo count)
3. Prompt: "Do you want to proceed with refreezing all these requests? [y/N]:"
4. Only proceed if user confirms with 'y' or 'yes'
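The bulk confirmation flow can be sketched as below. The function signature and the request fields are illustrative assumptions, not the actual _confirm_bulk_refreeze() implementation.

```python
def confirm_bulk_refreeze(requests, ask=input):
    """List open thaw requests and ask the user before refreezing them all.

    requests: iterable of dicts with (assumed) 'id', 'created_at', and
    'repo_count' keys. 'ask' is injectable so the prompt can be tested.
    """
    print(f"Found {len(requests)} open thaw request(s):")
    for req in requests:
        print(f"  - {req['id']} (created {req['created_at']}, "
              f"{req['repo_count']} repo(s))")
    answer = ask(
        "Do you want to proceed with refreezing all these requests? [y/N]: "
    )
    # Default to 'no': only an explicit 'y'/'yes' proceeds
    return answer.strip().lower() in ("y", "yes")
```

Defaulting to "no" on an empty answer matches the [y/N] convention shown in the prompt above.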
The comment says "For each index, check if it overlaps with the date range" but the code never actually checks the date range. It just mounts every single index found in the repository without any filtering.
Here's the problem:
Current flow:
1. ✅ Select repositories with date ranges overlapping the request (works correctly via find_repos_by_date_range)
2. ❌ Mount ALL indices from those repositories (no date filtering)
3. Mount to data streams if applicable
What it should do:
1. ✅ Select repositories with date ranges overlapping the request
2. For each index in those repositories:
   - Mount the index temporarily
   - Query the index to get its actual @timestamp range
   - If the index's date range overlaps with the requested range, keep it mounted
   - If not, unmount it
3. Add overlapping indices to data streams if applicable
Would you like me to fix this by adding the missing date range check? The fix would involve:
1. Mounting each index
2. Using get_timestamp_range() to query the index's actual date range
3. Checking if index_start <= end_date AND index_end >= start_date (overlap check)
4. Unmounting if no overlap
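The overlap check in step 3 is a standard closed-interval intersection test, sketched here as a stand-alone helper (ranges_overlap is an assumed name, not an existing function in the codebase):

```python
def ranges_overlap(index_start, index_end, start_date, end_date):
    """True when [index_start, index_end] intersects [start_date, end_date].

    Two closed intervals overlap exactly when each one starts at or
    before the point where the other ends.
    """
    return index_start <= end_date and index_end >= start_date
```

Indices whose actual @timestamp range (as returned by get_timestamp_range()) fails this test would be unmounted in step 4.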
Ensures machine-readable output for all deepfreeze commands, allowing scripting.
Let's default to keeping the data around for as long as possible, to avoid issues with it migrating back to Glacier before we're done.
We now create and assign a new "frozen-only" ilm policy to each thawed index, based on the repository it was thawed from. This prevents all thawed indices from showing up on Index Management as having lifecycle errors.
Users can still list all by adding --include-completed or -c
1. Added Status Constants (constants.py)
   - Added THAW_STATUS_IN_PROGRESS, THAW_STATUS_COMPLETED, THAW_STATUS_FAILED, and THAW_STATUS_REFROZEN constants
   - Created THAW_REQUEST_STATUSES list for validation
2. Updated Refreeze Action (refreeze.py)
   - Changed status from "completed" to THAW_STATUS_REFROZEN when refreeze completes
   - Now properly indicates that thawed data has been cleaned up and returned to frozen state
3. Added Retention Setting (helpers.py)
   - Added thaw_request_retention_days_refrozen setting (default: 35 days)
   - This aligns with the 30-day max for data to return to Glacier, plus 5 days buffer
4. Updated Cleanup Logic (cleanup.py)
   - Added handling for "refrozen" status in both _cleanup_old_thaw_requests() and dry-run mode
   - Refrozen requests are automatically deleted after 35 days
5. Updated Thaw List Filtering (thaw.py - do_list_requests())
   - Now excludes both "completed" AND "refrozen" requests by default
   - Use --include-completed or -c flag to see all requests
   - Updated help messages to reflect "completed/refrozen" filtering
6. Updated Status Checking (thaw.py)
   - do_check_status(): Skips refrozen requests with helpful message
   - do_check_all_status(): Filters out refrozen requests before processing
Status Lifecycle
The complete thaw request lifecycle is now:
1. in_progress → Thaw operation is actively running
2. completed → Thaw succeeded, data is available and mounted
3. refrozen → Data has been cleaned up via refreeze (new!)
4. failed → Thaw operation failed
Retention Periods (Cleanup)
- Completed: 7 days (default)
- Failed: 30 days (default)
- Refrozen: 35 days (new!)
All syntax validation passed! The new status properly distinguishes between "thaw completed and data available" vs "thaw was completed but has been cleaned up."
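The retention rules above amount to a simple per-status age check. A hedged sketch follows: the mapping values come from the defaults listed above, while the dict and function names are illustrative, not the actual cleanup.py code.

```python
from datetime import datetime, timedelta, timezone

# Default retention per thaw-request status, per the description above
RETENTION_DAYS = {
    "completed": 7,
    "failed": 30,
    "refrozen": 35,
}

def past_retention(status, created_at, now=None):
    """True when a thaw request is older than its retention period.

    Statuses without a retention entry (e.g. 'in_progress') are never
    auto-deleted.
    """
    now = now or datetime.now(timezone.utc)
    days = RETENTION_DAYS.get(status)
    if days is None:
        return False
    return now - created_at > timedelta(days=days)
```

The 35-day refrozen window covers the 30-day maximum for data to migrate back to Glacier plus the 5-day buffer mentioned above.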