
Commit 0076efd

Cleanup CI buckets/containers (#2646)
#### Reference Issues/PRs

#### What does this implement or fix?

Release storage on a regular basis. We leak storage because various test failures leave symbols and libraries behind, so they are never cleaned up. This PR adds a step that cleans the storages - AWS S3, GCP and Azure - once a week. By default all data older than 28 days is deleted, preserving the last month of data.

The PR also adds a small boto/azure library for obtaining info about and cleaning buckets/containers. This library can be reused and extended for any further bucket management we need.

Prior to this PR the leftovers were as follows:

```
AWS TOTAL SIZE  : 891,874,609,188
GCP TOTAL SIZE  :   3,363,272,750
AZURE TOTAL SIZE:   2,706,787,373
```

Cleanup run leaving no more than 1 month of data in the buckets (initial run): https://github.com/man-group/ArcticDB/actions/runs/17733685971/job/50390234882

Next run, after the buckets are clean: https://github.com/man-group/ArcticDB/actions/runs/17756785784/job/50460916726

```
2025-09-16 06:28:50,748 - __main__ - INFO - Cleaning before: 2025-08-19 06:28:50.748631+00:00
2025-09-16 06:28:50,749 - __main__ - INFO - Cleaning-up GCP storage
2025-09-16 06:29:05,352 - __main__ - INFO - GCP TOTAL SIZE: 339770117
2025-09-16 06:29:19,172 - utils.bucket_management - INFO - Found 0 objects to delete before 2025-08-19 06:28:50.748631+00:00
2025-09-16 06:29:30,408 - __main__ - INFO - GCP TOTAL SIZE: 339770117
2025-09-16 06:29:30,408 - __main__ - INFO - Cleaning-up Azure storage
2025-09-16 06:29:35,005 - __main__ - INFO - AZURE TOTAL SIZE: 3591350
2025-09-16 06:29:38,719 - utils.bucket_management - INFO - Found 0 blobs to delete before 2025-08-19 06:28:50.748631+00:00
2025-09-16 06:29:42,433 - __main__ - INFO - AZURE TOTAL SIZE: 3591350
2025-09-16 06:29:42,433 - __main__ - INFO - Cleaning-up S3 storage
2025-09-16 06:29:43,227 - __main__ - INFO - AWS S3 TOTAL SIZE: 0
2025-09-16 06:29:43,418 - __main__ - INFO - AWS S3 TOTAL SIZE: 0
```

#### Any other comments?

#### Checklist

<details>
  <summary>Checklist for code changes...</summary>

- [ ] Have you updated the relevant docstrings, documentation and copyright notice?
- [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)?
- [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)?
- [ ] Are API changes highlighted in the PR description?
- [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?
</details>

---------

Co-authored-by: Georgi Rusev <Georgi Rusev>
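To illustrate the kind of reuse mentioned above, here is a minimal, read-only sketch (not part of this PR's workflow) that reports the current bucket/container sizes using the helpers added in python/utils/bucket_management.py. It assumes the same ARCTICDB_REAL_* environment variables that CI sets are exported locally, and that it is run from the python/ directory with PYTHONPATH=.:

```python
# Read-only size report reusing the helpers added by this PR.
# Assumes ARCTICDB_REAL_* env vars are set and the script runs from python/ with PYTHONPATH=.
import os

from utils.bucket_management import (
    azure_client,
    gcp_client,
    get_azure_container_size,
    get_gcp_bucket_size,
    get_s3_bucket_size,
    s3_client,
)

# No cutoff_date is passed, so each helper counts everything older than "now", i.e. the whole store.
print("AWS S3 TOTAL SIZE:", get_s3_bucket_size(s3_client(), os.getenv("ARCTICDB_REAL_S3_BUCKET")))
print("GCP TOTAL SIZE   :", get_gcp_bucket_size(gcp_client(), os.getenv("ARCTICDB_REAL_GCP_BUCKET")))
print("AZURE TOTAL SIZE :", get_azure_container_size(azure_client(), os.getenv("ARCTICDB_REAL_AZURE_CONTAINER")))
```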
1 parent 4776764 commit 0076efd

File tree

6 files changed: +361 / -43 lines

.github/workflows/delete_sts_roles.yml

Lines changed: 0 additions & 34 deletions
This file was deleted.
Lines changed: 50 additions & 0 deletions
name: Scheduled Cleanup

on:
  schedule:
    - cron: "0 22 * * 6"  # every Saturday at 22:00 UTC
  push:
    branches:
      - delete_sts_roles
  workflow_dispatch:

jobs:
  run-script:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout Repository
        uses: actions/checkout@v3

      - name: Set Up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Install Dependencies
        run: pip install boto3 arcticdb azure-storage-blob azure-identity

      - name: Set persistent storage variables
        uses: ./.github/actions/set_persistent_storage_env_vars
        with:
          aws_access_key: "${{ secrets.AWS_S3_ACCESS_KEY }}"
          aws_secret_key: "${{ secrets.AWS_S3_SECRET_KEY }}"
          gcp_access_key: "${{ secrets.GCP_S3_ACCESS_KEY }}"
          gcp_secret_key: "${{ secrets.GCP_S3_SECRET_KEY }}"
          azure_container: "githubblob" # DEFAULT BUCKET FOR AZURE
          azure_connection_string: "${{ secrets.AZURE_CONNECTION_STRING }}"

      - name: Delete STS Roles
        run: |
          cd python
          # remove the empty protobuf libs so that protobufs are loaded from installed lib
          rm -rf arcticc
          PYTHONPATH=. python -m utils.s3_roles_delete

      - name: Cleanup buckets
        run: |
          cd python
          PYTHONPATH=. python -m utils.cleanup_test_buckets

python/utils/__init__.py

Whitespace-only changes.

python/utils/bucket_management.py

Lines changed: 258 additions & 0 deletions
"""
Copyright 2025 Man Group Operations Limited

Use of this software is governed by the Business Source License 1.1 included in the file LICENSE.txt.

As of the Change Date specified in that file, in accordance with the Business Source License, use of this software will be governed by the Apache License, version 2.0.
"""

from datetime import datetime, timedelta, timezone
from concurrent.futures import ThreadPoolExecutor
import boto3
import os
from typing import Callable, Optional
from botocore.client import BaseClient
from botocore.exceptions import ClientError
from azure.storage.blob import BlobServiceClient
from azure.storage.blob import BlobProperties
from arcticdb.util.logger import get_logger


logger = get_logger()


def s3_client(client_type: str = "s3") -> BaseClient:
    """Create a boto client to the Amazon AWS S3 store.

    Parameters:
        client_type: any valid boto client type, e.g. "s3", "iam".
    """
    return boto3.client(
        client_type,
        aws_access_key_id=os.getenv("ARCTICDB_REAL_S3_ACCESS_KEY"),
        aws_secret_access_key=os.getenv("ARCTICDB_REAL_S3_SECRET_KEY"),
    )


def gcp_client() -> BaseClient:
    """Returns a boto client to GCP storage."""
    session = boto3.session.Session()
    return session.client(
        service_name="s3",
        aws_access_key_id=os.getenv("ARCTICDB_REAL_GCP_ACCESS_KEY"),
        aws_secret_access_key=os.getenv("ARCTICDB_REAL_GCP_SECRET_KEY"),
        endpoint_url=os.getenv("ARCTICDB_REAL_GCP_ENDPOINT"),
    )


def azure_client() -> BlobServiceClient:
    """Creates and returns a BlobServiceClient using the provided connection string."""
    connection_string = os.getenv("ARCTICDB_REAL_AZURE_CONNECTION_STRING")
    return BlobServiceClient.from_connection_string(connection_string)


def list_bucket(
    client: BaseClient, bucket_name: str, handler: Callable[[dict], None], cutoff_date: Optional[datetime] = None
) -> None:
    """
    Lists objects in a bucket that were last modified before a given date,
    and applies a handler function to each.

    Parameters:
        client: boto3 S3-compatible client (e.g., for GCS via HMAC).
        bucket_name: Name of the bucket.
        handler: Function to apply to each qualifying object.
        cutoff_date (Optional): Only include objects older than this date.
            Defaults to current UTC time.
    """
    if cutoff_date is None:
        cutoff_date = datetime.now(timezone.utc)

    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < cutoff_date:
                handler(obj)


def delete_gcp_bucket(
    client: BaseClient, bucket_name: str, cutoff_date: Optional[datetime] = None, max_workers: int = 50
) -> None:
    """
    Deletes objects in a GCS bucket that were last modified before a given date,
    using parallel deletion via HMAC credentials.

    Parameters:
        bucket_name (str): Name of the GCS bucket.
        cutoff_date (Optional[datetime]): Only delete objects older than this date.
            Defaults to current UTC time.
        max_workers (int): Number of parallel threads for deletion.
    """
    keys_to_delete: list[str] = []

    def collect_key(obj: dict) -> None:
        keys_to_delete.append(obj["Key"])

    list_bucket(client, bucket_name, collect_key, cutoff_date)
    logger.info(f"Found {len(keys_to_delete)} objects to delete before {cutoff_date or datetime.now(timezone.utc)}")

    def delete_key(key: str) -> None:
        client.delete_object(Bucket=bucket_name, Key=key)
        logger.info(f"Deleted: {key}")

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        executor.map(delete_key, keys_to_delete)


def get_gcp_bucket_size(
    client: BaseClient,
    bucket_name: str,
    cutoff_date: Optional[datetime] = None,
) -> int:
    """Returns the size of the specified GCP bucket.

    Parameters:
        client: boto3 S3-compatible client (e.g., for GCS via HMAC).
        bucket_name: Name of the bucket.
        cutoff_date (Optional): Only include objects older than this date.
            Defaults to current UTC time.
    """
    return get_s3_bucket_size(client, bucket_name, cutoff_date)


def list_azure_container(
    client: BlobServiceClient,
    container_name: str,
    handler: Callable[[BlobProperties], None],
    cutoff_date: Optional[datetime] = None,
) -> None:
    """
    Lists blobs in a container that were last modified before a given date,
    and applies a handler function to each.

    Parameters:
        client: Authenticated BlobServiceClient.
        container_name: Name of the container.
        handler: Function to apply to each qualifying blob.
        cutoff_date (Optional[datetime]): Only include blobs older than this date.
            Defaults to current UTC time.
    """
    if cutoff_date is None:
        cutoff_date = datetime.now(timezone.utc)

    container_client = client.get_container_client(container_name)
    for blob in container_client.list_blobs():
        if blob.last_modified and blob.last_modified < cutoff_date:
            handler(blob)


def get_azure_container_size(
    blob_service_client: BlobServiceClient, container_name: str, cutoff_date: Optional[datetime] = None
) -> int:
    """Calculates the total size of all blobs in a container."""
    total_size = 0

    def size_accumulator(blob: BlobProperties) -> None:
        nonlocal total_size
        total_size += blob.size

    list_azure_container(blob_service_client, container_name, size_accumulator, cutoff_date)
    return total_size


def delete_azure_container(
    client: BlobServiceClient, container_name: str, cutoff_date: Optional[datetime] = None, max_workers: int = 20
) -> None:
    """
    Deletes blobs in an Azure container that were last modified before the cutoff date.

    Parameters:
        client: Authenticated BlobServiceClient.
        container_name: Name of the container.
        cutoff_date: Only delete blobs older than this date.
            Defaults to current UTC time.
        max_workers: Number of parallel threads for deletion.
    """
    container_client = client.get_container_client(container_name)
    blobs_to_delete: list[str] = []

    def collect_blob(blob: BlobProperties) -> None:
        blobs_to_delete.append(blob.name)

    list_azure_container(client, container_name, collect_blob, cutoff_date)

    logger.info(f"Found {len(blobs_to_delete)} blobs to delete before {cutoff_date or datetime.now(timezone.utc)}")

    def delete_blob(blob_name: str) -> None:
        try:
            # If needed we should optimize with
            # https://learn.microsoft.com/en-us/dotnet/api/azure.storage.blobs.specialized.blobbatchclient.deleteblobs?view=azure-dotnet
            container_client.delete_blob(blob_name)
            logger.info(f"Deleted: {blob_name}")
        except Exception as e:
            logger.error(f"Failed to delete {blob_name}: {e}")

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        executor.map(delete_blob, blobs_to_delete)


def get_s3_bucket_size(client: BaseClient, bucket_name: str, cutoff_date: Optional[datetime] = None) -> int:
    """
    Calculates the total size of all objects in an S3 bucket.

    Parameters:
        client: A boto3 S3 client.
        bucket_name: Name of the S3 bucket.
        cutoff_date: Only include objects older than this date.
            Defaults to current UTC time.

    Returns:
        int: Total size in bytes.
    """
    total_size = 0

    def size_accumulator(obj: dict) -> None:
        nonlocal total_size
        total_size += obj["Size"]

    list_bucket(client, bucket_name, size_accumulator, cutoff_date)
    return total_size


def delete_s3_bucket_batch(
    client: BaseClient, bucket_name: str, cutoff_date: Optional[datetime] = None, batch_size: int = 1000
) -> None:
    """
    Deletes objects in an S3-compatible bucket that were last modified before the cutoff date,
    using batch deletion (up to 1000 objects per request).

    Args:
        client: boto3 S3-compatible client.
        bucket_name: Name of the bucket.
        cutoff_date: Only delete objects older than this date.
            Defaults to current UTC time.
        batch_size: Maximum number of objects per delete request (max 1000).
    """
    batch: list[dict] = []

    def delete_batch(batch):
        client.delete_objects(Bucket=bucket_name, Delete={"Objects": batch})
        logger.info(f"Deleted batch of {len(batch)} AWS S3 objects")

    def collect_keys(obj: dict) -> None:
        batch.append({"Key": obj["Key"]})
        if len(batch) == batch_size:
            try:
                delete_batch(batch)
            except Exception as e:
                logger.error(f"Batch delete failed: {e}")
            batch.clear()

    list_bucket(client, bucket_name, collect_keys, cutoff_date)

    # Delete any remaining objects
    if batch:
        try:
            delete_batch(batch)
        except Exception as e:
            logger.error(f"Final batch delete failed: {e}")
python/utils/cleanup_test_buckets.py

Lines changed: 51 additions & 0 deletions
"""
Copyright 2025 Man Group Operations Limited

Use of this software is governed by the Business Source License 1.1 included in the file LICENSE.txt.

As of the Change Date specified in that file, in accordance with the Business Source License, use of this software will be governed by the Apache License, version 2.0.
"""

from datetime import datetime, timedelta, timezone
import os
from arcticdb.util.logger import get_logger
from .bucket_management import (
    azure_client,
    delete_azure_container,
    delete_gcp_bucket,
    delete_s3_bucket_batch,
    gcp_client,
    get_azure_container_size,
    get_gcp_bucket_size,
    get_s3_bucket_size,
    s3_client,
)


logger = get_logger()

now = datetime.now(timezone.utc)
cutoff = now - timedelta(days=28)

logger.info(f"Cleaning before: {cutoff}")

logger.info("Cleaning-up GCP storage")
gcp = gcp_client()
gcp_bucket = os.getenv("ARCTICDB_REAL_GCP_BUCKET")
logger.info(f"Before clean: GCP TOTAL SIZE: {get_gcp_bucket_size(gcp, gcp_bucket)}")
delete_gcp_bucket(gcp, gcp_bucket, cutoff)
logger.info(f"After clean: GCP TOTAL SIZE: {get_gcp_bucket_size(gcp, gcp_bucket)}")

logger.info("Cleaning-up Azure storage")
azure = azure_client()
azure_container = os.getenv("ARCTICDB_REAL_AZURE_CONTAINER")
logger.info(f"Before clean: AZURE TOTAL SIZE: {get_azure_container_size(azure, azure_container)}")
delete_azure_container(azure, azure_container, cutoff)
logger.info(f"After clean: AZURE TOTAL SIZE: {get_azure_container_size(azure, azure_container)}")

logger.info("Cleaning-up S3 storage")
s3 = s3_client()
s3_bucket = os.getenv("ARCTICDB_REAL_S3_BUCKET")
logger.info(f"Before clean: AWS S3 TOTAL SIZE: {get_s3_bucket_size(s3, s3_bucket)}")
delete_s3_bucket_batch(s3, s3_bucket)
logger.info(f"After clean: AWS S3 TOTAL SIZE: {get_s3_bucket_size(s3, s3_bucket)}")
