Store files in cloud object storage #21
Merged
7 commits
3130898 feat: upload files to s3-compatible storage bucket (tillywoodfield)
210e9e9 feat: exception handling for file uploads (tillywoodfield)
3ff74f7 feat: refactor exception handling (tillywoodfield)
394052f fix: refactor environment loading (tillywoodfield)
03a6d4d feat: delete files for datasets removed from the registry (tillywoodfield)
bec0de6 docs: document environment variables for uploading files (tillywoodfield)
59b8da6 test: fix env loading in CI runner (tillywoodfield)

    @@ -0,0 +1,5 @@
    ENABLE_UPLOAD=1
    BUCKET_REGION="test-region"
    BUCKET_NAME="test-bucket"
    BUCKET_ACCESS_KEY_ID="test-id"
    BUCKET_ACCESS_KEY_SECRET="test-secret"


    @@ -1,4 +1,6 @@
    .venv
    __pycache__

    .env.local

    data/


    @@ -1,9 +1,18 @@
    import logging
    import os
    import time

    from dotenv import find_dotenv, load_dotenv

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s:%(levelname)s:%(name)s:%(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S",
    )
    logging.Formatter.converter = time.gmtime

    logger = logging.getLogger(__name__)

    APP_ENV = os.environ.get("APP_ENV", "local")
    logger.info(f"Loading {APP_ENV} environment variables")
    load_dotenv(find_dotenv(f".env.{APP_ENV}", usecwd=True))


    @@ -0,0 +1,111 @@
    import logging
    import os
    import zipfile
    from pathlib import Path
    from typing import Any, Optional

    import boto3
    import botocore

    logger = logging.getLogger(__name__)


    BUCKET_REGION = os.environ.get("BUCKET_REGION")
    BUCKET_NAME = os.environ.get("BUCKET_NAME")
    BUCKET_ACCESS_KEY_ID = os.environ.get("BUCKET_ACCESS_KEY_ID")
    BUCKET_ACCESS_KEY_SECRET = os.environ.get("BUCKET_ACCESS_KEY_SECRET")


    def _get_client() -> Any:
        session = boto3.session.Session()
        return session.client(
            "s3",
            endpoint_url=f"https://{BUCKET_REGION}.digitaloceanspaces.com/",
            config=botocore.config.Config(s3={"addressing_style": "virtual"}),
            region_name=BUCKET_REGION,
            aws_access_key_id=BUCKET_ACCESS_KEY_ID,
            aws_secret_access_key=BUCKET_ACCESS_KEY_SECRET,
        )


    def _upload_file(local_path: str, bucket_path: str, content_type: str) -> Optional[str]:
        try:
            logger.info(f"Uploading file {local_path}")
            client = _get_client()
            client.upload_file(
                local_path,
                BUCKET_NAME,
                bucket_path,
                ExtraArgs={"ACL": "public-read", "ContentType": content_type},
            )
            public_url = (
                f"https://{BUCKET_NAME}.{BUCKET_REGION}.digitaloceanspaces.com/"
                + bucket_path
            )
            logger.info(f"Uploaded to {public_url}")
            return public_url
        except Exception as e:
            logger.warning(f"Failed to upload {local_path} with error {e}")
            return None


    def _upload_json(dataset_id: str, json_path: str) -> Optional[str]:
        return _upload_file(
            local_path=json_path,
            bucket_path=f"{dataset_id}/{dataset_id}.json",
            content_type="application/json",
        )


    def _upload_csv(dataset_id: str, csv_path: str) -> Optional[str]:
        try:
            directory = Path(csv_path)
            zip_file_path = f"{csv_path}_csv.zip"
            with zipfile.ZipFile(zip_file_path, mode="w") as archive:
                for file_path in directory.rglob("*"):
                    archive.write(file_path, arcname=file_path.relative_to(directory))
        except Exception as e:
            logger.warning(f"Failed to zip {csv_path} with error {e}")
            return None
        return _upload_file(
            local_path=zip_file_path,
            bucket_path=f"{dataset_id}/{dataset_id}_csv.zip",
            content_type="application/zip",
        )


    def _upload_xlsx(dataset_id: str, xlsx_path: str) -> Optional[str]:
        return _upload_file(
            local_path=xlsx_path,
            bucket_path=f"{dataset_id}/{dataset_id}.xlsx",
            content_type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",  # noqa: E501
        )


    def upload_files(
        dataset_id: str,
        json_path: Optional[str] = None,
        csv_path: Optional[str] = None,
        xlsx_path: Optional[str] = None,
    ) -> tuple[Optional[str], Optional[str], Optional[str]]:
        if not bool(int(os.environ.get("ENABLE_UPLOAD", "0"))):
            logger.info("Upload is disabled, skipping")
            return None, None, None
        json_public_url = _upload_json(dataset_id, json_path) if json_path else None
        csv_public_url = _upload_csv(dataset_id, csv_path) if csv_path else None
        xlsx_public_url = _upload_xlsx(dataset_id, xlsx_path) if xlsx_path else None
        return json_public_url, csv_public_url, xlsx_public_url


    def delete_files_for_dataset(dataset_id: str) -> None:
        logger.info(f"Deleting files for dataset {dataset_id}")
        try:
            client = _get_client()
            response = client.list_objects_v2(Bucket=BUCKET_NAME, Prefix=dataset_id)
            if "Contents" in response:
                objects_to_delete = [{"Key": obj["Key"]} for obj in response["Contents"]]
                client.delete_objects(
                    Bucket=BUCKET_NAME, Delete={"Objects": objects_to_delete}
                )
        except Exception as e:
            logger.warning(f"Failed to delete files with error {e}")


    @@ -0,0 +1,3 @@
    import os

    os.environ["APP_ENV"] = "test"

The whole Endpoint URL could be an env var? Allows changing provider easily.
Ah, but then maybe the addressing_style config below should be a config? Maybe then it gets confusing.
Yeah, I also ran into an issue later down the line where the `boto3` client doesn't return the public URL of the uploaded file, so we have to construct that ourselves. The URL format can differ depending on the provider, and it's easier to have all the parts rather than a single URL. This ended up being the nicest way I could find.
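To make the trade-off in this thread concrete, here is a minimal sketch of keeping the parts separate and deriving both the client endpoint and the public URL from them. It is not code from this PR: `BucketConfig`, `load_bucket_config`, and the `BUCKET_DOMAIN` variable are hypothetical names used only for illustration, and the sketch assumes a provider whose URLs follow the same `{region}.{domain}` layout as the one in the diff.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BucketConfig:
    """Connection parts kept separate so both URLs can be derived from them."""

    region: str
    name: str
    # Hypothetical setting (not in the PR): the provider's base domain.
    domain: str = "digitaloceanspaces.com"

    @property
    def endpoint_url(self) -> str:
        # Regional endpoint, in the same shape the diff passes as endpoint_url.
        return f"https://{self.region}.{self.domain}/"

    def public_url(self, bucket_path: str) -> str:
        # Virtual-hosted style: the bucket name becomes a subdomain.
        return f"https://{self.name}.{self.region}.{self.domain}/{bucket_path}"


def load_bucket_config(env: Mapping[str, str] = os.environ) -> BucketConfig:
    # BUCKET_DOMAIN is an assumed, optional override; the other two
    # variables are the ones documented in this PR.
    return BucketConfig(
        region=env["BUCKET_REGION"],
        name=env["BUCKET_NAME"],
        domain=env.get("BUCKET_DOMAIN", "digitaloceanspaces.com"),
    )
```

With the test values from the env file in this PR, `load_bucket_config` would yield `https://test-region.digitaloceanspaces.com/` as the endpoint and `https://test-bucket.test-region.digitaloceanspaces.com/<bucket_path>` as the public URL. Providers that lay out their hostnames differently (AWS, for example, uses `s3.<region>.amazonaws.com`) would need more than a domain swap, which is part of why a single endpoint variable gets confusing.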