Conversation

@wortmanb
Proposed Changes

  • Adds deepfreeze functionality
  • Adds s3client object and subclasses

Known issues

  • Testing probably isn't up to Curator standards.

Bret Wortman and others added 30 commits October 9, 2024 09:01
First draft of schema definition for input values to deepfreeze.
Added defaults and made most schema elements optional
This will almost certainly require changes to the actual deepfreeze code
so that setup and rotate can be separately invoked.
Align the year and month options so they're options, not arguments,
since they don't have to be specified. Also, removed trailing comma and
spaces from "keep" key.
Missing parenthesis and "n"
Trying to make sure all the plumbing is in place to allow this to be run
from curator_cli and removed from the actions list in curator.
Removed the click.Group in favor of simplifying this so it works like
other actions.
At this point, the code appears to be setting itself up properly; the
first error to arise comes from not being able to reach Elasticsearch,
which is true in the environment I'm in right now.
Included a link to the documentation page for delete_searchable_snapshot,
and reminded users that this needs to be set for deepfreeze to work.
This update alters the click configuration so that deepfreeze is a group
and each action is a command. I've coded do_<action> methods in the
actions file, but these aren't functional yet. I need to get with Aaron
to see how the plumbing is done so I don't do something weird. There's
no command currently set up with sub-commands, so I don't have a pattern
to go from.
I refactored the deepfreeze class into four action classes and
externalized all the common routines they might share because DRY, of
course...
This now correctly invokes the right action classes. Some required
params aren't being enforced, but that's a job for tomorrow.
Added a save_settings method to persist global settings to the
deepfreeze status index.
This wasn't working when trying to map with filters.
I added several new options and adjusted others so that we can now
specify --rotate_by and choose bucket or path. Then the suffix gets
applied either to the bucket name or the path name, depending. The repo
name will always get the suffix.
Switched most settings to being part of a Settings object.
Completed updating Rotate up through ILM changes.
Fully implemented style.
Bret Wortman added 29 commits October 17, 2025 15:31
The issue was that cleanup wasn't distinguishing between two very different states:

  1. Active restoration (in_progress > 0): Objects are being restored from Glacier; thaw is active
  2. Expired thaw (not_restored > 0 and in_progress == 0): Objects have reverted to Glacier; thaw has expired

  Now the cleanup will:
  - Keep is_thawed=True if objects are actively being restored
  - Set is_thawed=False only if objects have truly reverted to Glacier

  This prevents the critical bug where:
  1. User creates thaw request
  2. S3 restore takes hours
  3. Cleanup runs while restore is in progress
  4. Cleanup would set is_thawed=False (wrong!)
  5. Thaw request cleanup would delete the in-progress request
  6. When S3 restore completes, no record of what to mount

  Now cleanup correctly preserves in-progress thaw requests.

  The changes are in curator/actions/deepfreeze/cleanup.py:
  - Lines 289-338: Fixed do_action() to check the in_progress count
  - Lines 405-432: Fixed do_dry_run() with the same logic
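The decision described above can be sketched as a small pure helper (the function name and the shape of the status dict are assumptions for illustration, not Curator's actual code):

```python
def should_stay_thawed(status: dict) -> bool:
    """Return the is_thawed flag cleanup should keep on the repo.

    `status` mimics the restore-status counts described above
    (key names are illustrative assumptions).
    """
    if status["in_progress"] > 0:
        # Active restoration: objects are still being restored from Glacier.
        return True
    if status["not_restored"] > 0:
        # Thaw has expired: objects have reverted to Glacier.
        return False
    # Fully restored and still available.
    return True
```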
Root Cause Summary:
  The bug was that update_repository_date_range() only expanded date ranges (setting start to an earlier date or end to a later date) but never shrank them. This meant that if indices from wildly different time periods were ever mounted in a repository (even temporarily during testing), the date range would keep growing and never get corrected.

  The fix changes the logic to replace the date range to accurately reflect the indices that are currently mounted, not the cumulative history of all indices ever mounted.
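The replace-not-expand fix can be sketched like this (the function name, signature, and input shape are assumptions; Curator's actual implementation differs):

```python
from datetime import datetime

def compute_date_range(mounted_ranges):
    """Recompute a repository's date range from the (start, end) pairs
    of the indices currently mounted, replacing any previous range
    rather than only ever expanding it. Illustrative sketch only."""
    if not mounted_ranges:
        return None, None
    starts, ends = zip(*mounted_ranges)
    return min(starts), max(ends)
```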
I probably won't use this but it's good to have
New Repository Lifecycle State Management

  New Fields

  thaw_state: str = "frozen"      # Explicit lifecycle state
  thawed_at: datetime = None      # When restore completed
  expires_at: datetime = None     # When restore will/did expire
  is_thawed: bool = False         # DEPRECATED - kept for backward compatibility

  Four Distinct States

  1. frozen: Normal state, in Glacier, not thawed
  2. thawing: S3 restore in progress, waiting for retrieval
  3. thawed: S3 restore complete, mounted and in use
  4. expired: S3 restore expired, reverted to Glacier, ready for cleanup

  State Transition Methods

  repo.start_thawing(expires_at)  # frozen → thawing
  repo.mark_thawed()              # thawing → thawed
  repo.mark_expired()             # thawed → expired
  repo.reset_to_frozen()          # expired → frozen

  How to Update Existing Code

  In thaw.py when initiating restore:
  from datetime import datetime, timedelta, timezone
  expires_at = datetime.now(timezone.utc) + timedelta(days=self.duration)
  repo.start_thawing(expires_at)
  repo.persist(self.client)

  In thaw.py when mounting after restore:
  repo.mark_thawed()
  repo.persist(self.client)

  In cleanup.py when detecting expiry:
  from .constants import THAW_STATE_EXPIRED, THAW_STATE_THAWING

  if status["in_progress"] > 0:
      # Still thawing - do nothing
      pass
  elif status["not_restored"] > 0:
      # Objects reverted to Glacier - mark as expired
      if repo.thaw_state != THAW_STATE_EXPIRED:
          repo.mark_expired()
          repo.persist(self.client)

  In cleanup.py when cleaning up:
  # Only clean up repos in 'expired' state
  if repo.thaw_state == THAW_STATE_EXPIRED:
      repo.reset_to_frozen()
      repo.persist(self.client)

  Benefits

  1. Clear state tracking: No ambiguity between "being thawed" vs "thawed and in use"
  2. Historical tracking: thawed_at and expires_at provide audit trail
  3. Easier debugging: State is explicit in status display
  4. Safer cleanup: Only clean up repos explicitly marked as expired
  5. Backward compatible: Maintains is_thawed boolean for existing code
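The fields and transition methods above might look roughly like this as an in-memory sketch (persistence, validation, and the DEPRECATED-flag sync rules are assumptions; this is not Curator's actual helpers.py):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Repository:
    # Lifecycle fields from the description above.
    thaw_state: str = "frozen"
    thawed_at: Optional[datetime] = None
    expires_at: Optional[datetime] = None
    is_thawed: bool = False  # DEPRECATED - kept in sync for old callers

    def start_thawing(self, expires_at: datetime) -> None:
        # frozen -> thawing: S3 restore initiated, retrieval pending.
        self.thaw_state = "thawing"
        self.expires_at = expires_at

    def mark_thawed(self) -> None:
        # thawing -> thawed: restore complete, repo mounted.
        self.thaw_state = "thawed"
        self.thawed_at = datetime.now(timezone.utc)
        self.is_thawed = True

    def mark_expired(self) -> None:
        # thawed -> expired: restore expired, reverted to Glacier.
        self.thaw_state = "expired"
        self.is_thawed = False

    def reset_to_frozen(self) -> None:
        # expired -> frozen: cleanup resets the repo.
        self.thaw_state = "frozen"
        self.thawed_at = None
        self.expires_at = None
        self.is_thawed = False
```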
1. New Repository Fields (helpers.py)

  Added three new fields to the Repository dataclass:
  - thaw_state: Explicit lifecycle state ("frozen", "thawing", "thawed", "expired")
  - thawed_at: Timestamp when S3 restore completed
  - expires_at: Timestamp when S3 restore will/did expire
  - Maintained is_thawed for backward compatibility

  2. State Transition Methods (helpers.py)

  repo.start_thawing(expires_at)  # frozen → thawing (when initiating S3 restore)
  repo.mark_thawed()              # thawing → thawed (when restore completes & repo mounts)
  repo.mark_expired()             # thawed → expired (when restore expires - NOT USED BY CLEANUP)
  repo.reset_to_frozen()          # expired → frozen (cleanup resets expired repos)

  3. Updated Actions

  thaw.py

  - _thaw_repository(): Calls repo.start_thawing(expires_at) after initiating S3 restore
  - mount_repo() in utilities.py: Calls repo.mark_thawed() when mounting completes

  cleanup.py

  - Simplified logic: Only cleans up repos already in "expired" state
  - Does NOT manage the transition TO expired - that happens elsewhere
  - Unmounts expired repos and calls repo.reset_to_frozen()

  status.py

  - Added State column showing current lifecycle state
  - Added Expires column in thawed repos section showing expiration timestamp

  Remaining Work: Marking Repos as "Expired"

  cleanup.py doesn't mark repos as expired - it only processes repos that are already expired. You'll need to decide where this transition happens:

  Option 1: During thaw --check-status

  When checking S3 restore status, if objects have reverted to Glacier, call repo.mark_expired():

  # In do_check_status() method
  status = check_restore_status(self.s3, repo.bucket, repo.base_path)
  if status["not_restored"] > 0 and status["in_progress"] == 0:
      repo.mark_expired()
      repo.persist(self.client)

  Option 2: Periodic Cron Job

  Create a new command that checks expires_at timestamps and marks repos as expired:

  curator_cli deepfreeze mark_expired  # Checks all repos and marks expired ones

  Option 3: Smart Check in Cleanup

  Add logic to cleanup to check S3 status for thawed repos past their expires_at time and mark them expired before cleaning.
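Option 3 could reduce to a pure check like this (the function name, parameters, and status counts are all illustrative assumptions):

```python
from datetime import datetime, timezone

def should_mark_expired(thaw_state, expires_at, now,
                        in_progress, not_restored):
    """True when a thawed repo past its expires_at has actually
    reverted to Glacier and should be marked expired. Sketch only;
    the real check would query S3 restore status for the repo."""
    return (thaw_state == "thawed"
            and expires_at is not None
            and now >= expires_at
            and in_progress == 0
            and not_restored > 0)
```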

  Benefits of This Approach

  1. Clear state tracking: No ambiguity between "thawing", "thawed", and "expired"
  2. Historical audit trail: thawed_at and expires_at provide complete history
  3. Safer operations: Cleanup only touches explicitly expired repos
  4. Better observability: Status display shows exact lifecycle state
  5. Backward compatible: Maintains is_thawed boolean for existing code
Changes made:

  1. refreeze.py:56-147 - Completely rewrote the Refreeze action to:
    - Accept a thaw_request_id (instead of repo_id)
    - Get all repositories associated with that thaw request
    - Unmount each repository (handles already-unmounted repos gracefully)
    - Reset each repo to frozen state using reset_to_frozen()
    - Mark the thaw request as "completed"
    - Be idempotent (safe to run multiple times)
  2. cli_singletons/deepfreeze.py:354-392 - Updated the CLI command:
    - Changed from --repo-id to --thaw-request-id (required parameter)
    - Updated help text to clarify what refreeze does

  Key design decisions:
  - Refreeze is user-initiated: "I'm done with this thaw"
  - Cleanup remains automatic: runs on schedule to handle expired repos
  - Refreeze doesn't delete indices (that's cleanup's job if needed)
  - No "push to Glacier" - the data never left Glacier, it's just about state management
  - Works even if S3 restore hasn't expired yet
Changes made:

  1. refreeze.py:44-55 - Made thaw_request_id optional:
  - Constructor now accepts thaw_request_id: str = None
  - Logs different messages based on whether a specific ID or all requests will be processed

  2. refreeze.py:57-98 - Added helper methods:
  - _get_open_thaw_requests(): Gets all thaw requests with status "in_progress"
  - _confirm_bulk_refreeze(): Shows a list of requests and prompts for user confirmation

  3. refreeze.py:100-185 - Extracted single-request logic:
  - _refreeze_single_request(): Handles refreezing one request, returns (unmounted, failed) tuples

  4. refreeze.py:187-233 - Rewrote do_action():
  - If thaw_request_id is provided: refreeze that specific request
  - If thaw_request_id is None: get all open requests, show confirmation
prompt, then
  process all
  - Shows appropriate summary based on single vs. bulk mode

  5. refreeze.py:235-313 - Updated do_dry_run():
  - Handles both single and bulk modes
  - Shows detailed output for single requests
  - Shows summary for bulk operations

  6. cli_singletons/deepfreeze.py:354-401 - Updated CLI:
  - Changed required=True to default=None
  - Updated help text to document both modes
  - Added examples for both usages

  Usage:

  # Refreeze a specific thaw request
  curator_cli deepfreeze refreeze -t <thaw-request-id>

  # Refreeze ALL open thaw requests (with confirmation prompt)
  curator_cli deepfreeze refreeze

  # Dry-run to see what would happen
  curator_cli --dry-run deepfreeze refreeze

  When invoked without a thaw ID, it will:
  1. Find all thaw requests with status "in_progress"
  2. Display them with metadata (created date, date range, repo count)
  3. Prompt: "Do you want to proceed with refreezing all these requests? [y/N]:"
  4. Only proceed if user confirms with 'y' or 'yes'
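The confirmation step above amounts to a check like this (the `answer` string stands in for the interactive prompt response; names are illustrative, not the actual _confirm_bulk_refreeze implementation):

```python
def confirm_bulk_refreeze(requests, answer: str) -> bool:
    """Proceed with a bulk refreeze only when there are open requests
    and the user gave an explicit yes; anything else defaults to No."""
    if not requests:
        # Nothing to refreeze, no point in proceeding.
        return False
    return answer.strip().lower() in ("y", "yes")
```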
The comment says "For each index, check if it overlaps with the date range" but the code never actually checks the date range. It just mounts every single index found in the repository without any filtering.

  Here's the problem:

  Current flow:
  1. ✅ Select repositories with date ranges overlapping the request (works correctly via find_repos_by_date_range)
  2. ❌ Mount ALL indices from those repositories (no date filtering)
  3. Mount to data streams if applicable

  What it should do:
  1. ✅ Select repositories with date ranges overlapping the request
  2. For each index in those repositories:
    - Mount the index temporarily
    - Query the index to get its actual @timestamp range
    - If the index's date range overlaps with the requested range, keep it mounted
    - If not, unmount it
  3. Add overlapping indices to data streams if applicable

  Would you like me to fix this by adding the missing date range check? The fix would involve:
  1. Mounting each index
  2. Using get_timestamp_range() to query the index's actual date range
  3. Checking if index_start <= end_date AND index_end >= start_date (overlap check)
  4. Unmounting if no overlap
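The overlap check in step 3 is a one-liner; sketched here with an illustrative function name (it works on any comparable values, e.g. datetimes or epoch millis):

```python
def ranges_overlap(index_start, index_end, start_date, end_date):
    """The missing date range check: keep an index mounted only if its
    @timestamp range intersects the requested [start_date, end_date]."""
    return index_start <= end_date and index_end >= start_date
```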
Ensures machine-readable output for all deepfreeze commands, allowing scripting.
Let's default to keeping the data around for as long as possible, to
avoid issues with it migrating back to Glacier before we're done.
We now create and assign a new "frozen-only" ilm policy to each thawed
index, based on the repository it was thawed from. This prevents all
thawed indices from showing up on Index Management as having lifecycle
errors.
Users can still list all by adding --include-completed or -c
1. Added Status Constants (constants.py)

  - Added THAW_STATUS_IN_PROGRESS, THAW_STATUS_COMPLETED, THAW_STATUS_FAILED, and THAW_STATUS_REFROZEN constants
  - Created THAW_REQUEST_STATUSES list for validation

  2. Updated Refreeze Action (refreeze.py)

  - Changed status from "completed" to THAW_STATUS_REFROZEN when refreeze completes
  - Now properly indicates that thawed data has been cleaned up and returned to frozen state

  3. Added Retention Setting (helpers.py)

  - Added thaw_request_retention_days_refrozen setting (default: 35 days)
  - This aligns with the 30-day max for data to return to Glacier, plus a 5-day buffer

  4. Updated Cleanup Logic (cleanup.py)

  - Added handling for "refrozen" status in both _cleanup_old_thaw_requests() and dry-run mode
  - Refrozen requests are automatically deleted after 35 days

  5. Updated Thaw List Filtering (thaw.py - do_list_requests())

  - Now excludes both "completed" AND "refrozen" requests by default
  - Use --include-completed or -c flag to see all requests
  - Updated help messages to reflect "completed/refrozen" filtering

  6. Updated Status Checking (thaw.py)

  - do_check_status(): Skips refrozen requests with helpful message
  - do_check_all_status(): Filters out refrozen requests before processing

  Status Lifecycle

  The complete thaw request lifecycle is now:

  1. in_progress → Thaw operation is actively running
  2. completed → Thaw succeeded, data is available and mounted
  3. refrozen → Data has been cleaned up via refreeze (new!)
  4. failed → Thaw operation failed

  Retention Periods (Cleanup)

  - Completed: 7 days (default)
  - Failed: 30 days (default)
  - Refrozen: 35 days (new!)
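The retention rule above can be sketched as a small lookup (the defaults come from the list above; the function name, signature, and table are illustrative assumptions, not Curator's actual cleanup code):

```python
# Default retention periods per thaw-request status, in days.
RETENTION_DAYS = {"completed": 7, "failed": 30, "refrozen": 35}

def request_is_stale(status: str, age_days: int) -> bool:
    """True when a thaw request has outlived its retention period and
    cleanup should delete it. Statuses without a retention entry
    (e.g. in_progress) are never deleted."""
    retention = RETENTION_DAYS.get(status)
    return retention is not None and age_days > retention
```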

  All syntax validation passed! The new status properly distinguishes between "thaw completed and data available" vs "thaw was completed but has been cleaned up."