This repository was archived by the owner on May 5, 2025. It is now read-only.

Commit 3b22b03

feat: support zstd compression in miniostorage (#405)
* feat: support zstd compression in miniostorage

we want to use zstd compression when compressing files for storage in object storage because it performs better than gzip, which is what we were using before

these changes are only being made to the minio storage service because we want to consolidate the storage service functionality into this one, so both worker and API will be using this backend in the future (API was already using this one)

we have to manually decompress the zstd compressed files in read_file, but HTTPResponse takes care of it for us if the content encoding of the file is gzip

the is_already_gzipped argument is being deprecated in favour of compression_type and is_compressed, and the ability to pass a str to write_file is also being deprecated. we're keeping track of the use of these using sentry capture_message

* fix: address feedback

- using fget_object was unnecessary since we were streaming the response data regardless
- no need for all the warning logs and sentry stuff, we'll just do a 3 step migration in both API and worker (update shared supporting old behaviour, update {api,worker}, remove old behaviour support from shared)
- zstandard version pinning can be more flexible
- add test for content type = application/x-gzip since there was some specific handling for that in the GCP storage service

* fix: update MinioStorageService

- in write file:
  - data arg is not BinaryIO, it's actually bytes | str | IO[bytes]. bytes and str are self-explanatory: it's just how it's being used currently, so we must support it. IO[bytes] is there to support file handles opened with "rb" that are being passed, and BytesIO objects
  - start accepting None value for compression_type, which will mean no automatic compression even if is_compressed is false
  - do automatic compression using gzip if is_compressed=False and compression_type="gzip"
  - in put_object set size = -1 and use a part_size of 20MiB. the specific part size is arbitrary. different sources online suggest different numbers. it probably depends on the size of the underlying data we're trying to send, but 20MiB seems like a good flat number to pick for now.
- in read_file:
  - generally reorganize the function to spend less time under the try except blocks
  - use the CHUNK_SIZE const defined in storage/base for the amount to read from the streams
  - accept IO[bytes] for the file_obj since we don't use any of the BinaryIO specific methods
- create GZipStreamReader that takes in an IO[bytes] and implements a read() method that reads a certain amount of bytes from the IO[bytes], compresses whatever it reads using gzip, and returns the result

* fix(minio): check urllib3 version in read_file

this is because if urllib3 is >= 2.0.0 and the zstd extra is installed, then it is capable of decoding (and will decode) zstd encoded data when it's used in get_object

so when we create the MinioStorageService we check the urllib3 version and we check if it's been installed with the zstd extra

this commit also adds a test to ensure that the gzip compression and decompression used in the GzipStreamReader actually works

* feat: add feature flag for new minio storage

instead of doing a 0-100 launch of the new minio storage service, i'd like to ship it incrementally using a feature flag.

so if a repoid is passed to the get_appropriate_storage_service function and the chosen storage is minio, then it will check the use_new_minio feature to decide whether to use the new or old minio storage service

as mentioned, this will be decided via the repoid (to reduce the impact IF it is broken)

changes had to be made to avoid circular imports in the model_utils and rollout_utils files

* fix: revert changes to old minio
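
For illustration only, the GZipStreamReader described in the message above could look roughly like the sketch below. It is not the code from this commit: the class name `GzipStreamReaderSketch` is hypothetical, and it assumes a streaming `zlib` compressor with a gzip container rather than whatever the real implementation uses. The one subtlety it shows is that a compressor may buffer input and emit nothing for a given chunk, so `read()` keeps pulling from the source until it has output or hits EOF; an empty return would otherwise look like end-of-stream to the caller.

```python
import gzip
import io
import zlib
from typing import IO

CHUNK_SIZE = 1024 * 32  # same value as CHUNK_SIZE in shared/storage/base.py


class GzipStreamReaderSketch:
    """Hypothetical sketch: wraps an IO[bytes] and yields gzip-compressed bytes."""

    def __init__(self, fileobj: IO[bytes]) -> None:
        self._source = fileobj
        # wbits=31 (zlib.MAX_WBITS | 16) asks zlib for a gzip header and trailer.
        self._compressor = zlib.compressobj(wbits=31)
        self._finished = False

    def read(self, size: int = -1) -> bytes:
        if self._finished:
            return b""
        while True:
            chunk = self._source.read(size)
            if not chunk:
                # Source exhausted: emit whatever is buffered plus the trailer.
                self._finished = True
                return self._compressor.flush()
            compressed = self._compressor.compress(chunk)
            if compressed:
                return compressed
            # Nothing emitted yet (input was buffered); read more input.
```

A round trip such as `gzip.decompress(b"".join(iter(lambda: reader.read(CHUNK_SIZE), b"")))` should give back the original bytes.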
1 parent 27d6a8f commit 3b22b03

13 files changed: +837 -26 lines changed


pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ dependencies = [
     "requests>=2.32.3",
     "sentry-sdk>=2.13.0",
     "sqlalchemy<2",
+    "zstandard>=0.23.0",
 ]

 [build-system]
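
The new `zstandard` dependency ties into the urllib3 check mentioned in the commit message: with urllib3 >= 2.0.0 and `zstandard` importable, the response may already be decoded by the time read_file sees it. A minimal sketch of such a check follows. The function names are hypothetical and the actual check in this commit may be written differently; only `importlib.metadata.version` and `importlib.util.find_spec` are real standard-library calls.

```python
from __future__ import annotations

from importlib import metadata, util


def urllib3_decodes_zstd(urllib3_version: str | None, zstandard_importable: bool) -> bool:
    """Hypothetical helper: True if urllib3 is expected to decode zstd itself."""
    if urllib3_version is None:
        return False
    # Only the major version matters for this check.
    major = int(urllib3_version.split(".")[0])
    return major >= 2 and zstandard_importable


def detect_urllib3_zstd_support() -> bool:
    """Looks at the installed environment; returns False if urllib3 is absent."""
    try:
        version = metadata.version("urllib3")
    except metadata.PackageNotFoundError:
        version = None
    return urllib3_decodes_zstd(version, util.find_spec("zstandard") is not None)
```

Keeping the decision in a pure function of the version string makes it easy to test without changing what is installed.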

shared/django_apps/rollouts/models.py

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ class RolloutUniverse(models.TextChoices):

 def default_random_salt():
     # to resolve circular dependency
-    from shared.django_apps.utils.model_utils import default_random_salt
+    from shared.django_apps.utils.rollout_utils import default_random_salt

     return default_random_salt()

shared/django_apps/utils/model_utils.py

Lines changed: 0 additions & 20 deletions
@@ -1,10 +1,8 @@
 import json
 import logging
-from random import choice
 from typing import Any, Callable, Optional

 from shared.api_archive.archive import ArchiveService
-from shared.django_apps.rollouts.models import RolloutUniverse
 from shared.storage.exceptions import FileNotInStorageError
 from shared.utils.ReportEncoder import ReportEncoder

@@ -148,24 +146,6 @@ def __set__(self, obj, value):
         setattr(obj, self.cached_value_property_name, value)


-def default_random_salt():
-    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
-    return "".join([choice(ALPHABET) for _ in range(16)])
-
-
-def rollout_universe_to_override_string(rollout_universe: RolloutUniverse):
-    if rollout_universe == RolloutUniverse.OWNER_ID:
-        return "override_owner_ids"
-    elif rollout_universe == RolloutUniverse.REPO_ID:
-        return "override_repo_ids"
-    elif rollout_universe == RolloutUniverse.EMAIL:
-        return "override_emails"
-    elif rollout_universe == RolloutUniverse.ORG_ID:
-        return "override_org_ids"
-    else:
-        return ""
-
-
 # This is the place for DB trigger logic that's been moved into code
 # Owner
 def get_ownerid_if_member(

shared/django_apps/utils/rollout_utils.py

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+from random import choice
+
+from shared.django_apps.rollouts.models import RolloutUniverse
+
+
+def default_random_salt():
+    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
+    return "".join([choice(ALPHABET) for _ in range(16)])
+
+
+def rollout_universe_to_override_string(rollout_universe: RolloutUniverse):
+    if rollout_universe == RolloutUniverse.OWNER_ID:
+        return "override_owner_ids"
+    elif rollout_universe == RolloutUniverse.REPO_ID:
+        return "override_repo_ids"
+    elif rollout_universe == RolloutUniverse.EMAIL:
+        return "override_emails"
+    elif rollout_universe == RolloutUniverse.ORG_ID:
+        return "override_org_ids"
+    else:
+        return ""

shared/rollouts/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
     Platform,
     RolloutUniverse,
 )
-from shared.django_apps.utils.model_utils import rollout_universe_to_override_string
+from shared.django_apps.utils.rollout_utils import rollout_universe_to_override_string

 log = logging.getLogger("__name__")

shared/rollouts/features.py

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@

 BUNDLE_THRESHOLD_FLAG = Feature("bundle_threshold_flag")
 INCLUDE_GITHUB_COMMENT_ACTIONS_BY_OWNER = Feature("include_github_comment_actions")
+USE_NEW_MINIO = Feature("use_new_minio")

shared/storage/__init__.py

Lines changed: 10 additions & 4 deletions
@@ -1,18 +1,22 @@
 from shared.config import get_config
+from shared.rollouts.features import USE_NEW_MINIO
 from shared.storage.aws import AWSStorageService
 from shared.storage.base import BaseStorageService
 from shared.storage.fallback import StorageWithFallbackService
 from shared.storage.gcp import GCPStorageService
 from shared.storage.minio import MinioStorageService
+from shared.storage.new_minio import NewMinioStorageService


-def get_appropriate_storage_service() -> BaseStorageService:
-    chosen_storage = get_config("services", "chosen_storage", default="minio")
-    return _get_appropriate_storage_service_given_storage(chosen_storage)
+def get_appropriate_storage_service(
+    repoid: int | None = None,
+) -> BaseStorageService:
+    chosen_storage: str = get_config("services", "chosen_storage", default="minio")  # type: ignore
+    return _get_appropriate_storage_service_given_storage(chosen_storage, repoid)


 def _get_appropriate_storage_service_given_storage(
-    chosen_storage: str,
+    chosen_storage: str, repoid: int | None
 ) -> BaseStorageService:
     if chosen_storage == "gcp":
         gcp_config = get_config("services", "gcp", default={})
@@ -28,4 +32,6 @@ def _get_appropriate_storage_service_given_storage(
         return StorageWithFallbackService(gcp_service, aws_service)
     else:
         minio_config = get_config("services", "minio", default={})
+        if repoid and USE_NEW_MINIO.check_value(repoid, default=False):
+            return NewMinioStorageService(minio_config)
         return MinioStorageService(minio_config)
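
The repoid-gated selection in this diff can be sketched in isolation as below. `SketchFeature` is a hypothetical stand-in and not the real `Feature` class from shared.rollouts; only the `check_value(identifier, default=...)` call shape is taken from the diff, and the hash-bucket rollout is an assumption made for the example. Note that the `if repoid and ...` guard means a missing (or falsy) repoid always falls back to the old service.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass
class SketchFeature:
    """Hypothetical stand-in for shared.rollouts.Feature; not the real class."""

    name: str
    rollout_percent: int = 0  # share of identifiers that get the feature
    overrides: dict[int, bool] = field(default_factory=dict)

    def check_value(self, identifier: int, default: bool = False) -> bool:
        if identifier in self.overrides:
            return self.overrides[identifier]
        if self.rollout_percent <= 0:
            return default
        # Hash the identifier so the same repo always lands in the same bucket.
        digest = hashlib.sha256(f"{self.name}:{identifier}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % 100 < self.rollout_percent


def pick_minio_backend(repoid: int | None, feature: SketchFeature) -> str:
    # Mirrors the guard in the diff: without a repoid, never opt in.
    if repoid and feature.check_value(repoid, default=False):
        return "new_minio"
    return "minio"
```

Callers that do not pass a repoid keep the existing behaviour, which limits the blast radius while the flag is ramped up.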

shared/storage/base.py

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 from typing import BinaryIO, overload

 CHUNK_SIZE = 1024 * 32
+PART_SIZE = 1024 * 1024 * 20  # 20MiB


 # Interface class for interfacing with codecov's underlying storage layer
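
To make the role of PART_SIZE concrete: when put_object is given size = -1 the total length is unknown up front, so the data has to be consumed in fixed-size parts. The sketch below only illustrates that splitting with the standard library. `iter_parts` is a hypothetical helper, not part of this commit or of the minio client, which does its own part handling internally.

```python
import io
from typing import IO, Iterator

PART_SIZE = 1024 * 1024 * 20  # 20MiB, same value as in shared/storage/base.py


def iter_parts(stream: IO[bytes], part_size: int = PART_SIZE) -> Iterator[bytes]:
    """Yield successive parts of at most part_size bytes until the stream ends."""
    while True:
        part = stream.read(part_size)
        if not part:
            return
        yield part


# A 45-byte stream with a 20-byte part size splits into parts of 20, 20 and 5 bytes.
part_lengths = [len(p) for p in iter_parts(io.BytesIO(b"x" * 45), part_size=20)]
```

A larger part size means fewer parts per upload but more memory held per part, which is the trade-off behind picking a flat 20MiB.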
