Merged
13 changes: 13 additions & 0 deletions docs/authorization.md
@@ -13,6 +13,7 @@ Contents:
- Users are automatically granted access to `/users/<user ID>/gen3-workflow/tasks` so they can view their own tasks.
- Admin access (the ability to see _all_ users’ tasks instead of just your own) can be granted to a user by granting them access to the parent resource `/services/workflow/gen3-workflow/tasks`.
- This supports sharing tasks with others; for example, "user1" may share "taskA" with "user2" if the system grants "user2" access to `/users/user1/gen3-workflow/tasks/taskA`.
- To delete their own S3 bucket along with all its objects, a user needs `delete` access to the resource `/services/workflow/user-bucket` on the `gen3-workflow` service -- a special privilege useful for automated testing but not intended for the average user.

#### Authorization configuration example

@@ -45,6 +46,12 @@ authz:
- gen3_workflow_reader
resource_paths:
- /services/workflow/gen3-workflow/tasks
- id: workflow_storage_deleter
description: Allows delete access to the user's own S3 bucket
role_ids:
- workflow_storage_deleter
resource_paths:
- /services/workflow/user-bucket

roles:
- id: gen3_workflow_reader
@@ -59,4 +66,10 @@ authz:
action:
service: gen3-workflow
method: create
- id: workflow_storage_deleter
permissions:
- id: workflow_storage_deleter
action:
service: gen3-workflow
method: delete
```
11 changes: 11 additions & 0 deletions docs/openapi.yaml
@@ -459,3 +459,14 @@ paths:
summary: Get Storage Info
tags:
- Storage
/storage/user-bucket:
delete:
operationId: delete_user_bucket
responses:
'204':
description: Successful Response
security:
- HTTPBearer: []
summary: Delete User Bucket
tags:
- Storage
96 changes: 96 additions & 0 deletions gen3workflow/aws_utils.py
@@ -189,3 +189,99 @@ def create_user_bucket(user_id: str) -> Tuple[str, str, str]:
)

return user_bucket_name, "ga4gh-tes", config["USER_BUCKETS_REGION"]


def get_all_bucket_objects(user_bucket_name):
"""
Get all objects from the specified S3 bucket.
"""
response = s3_client.list_objects_v2(Bucket=user_bucket_name)
object_list = response.get("Contents", [])

# list_objects_v2 returns at most 1,000 objects in a single response.
# If there are more objects, the response has the key "IsTruncated" set to True
# and a key "NextContinuationToken" that can be used to fetch the next page of objects.

# TODO:
# Currently, all objects are loaded into memory, which can be problematic for large datasets.
# To optimize, convert this function into a generator that accepts a `batch_size` parameter (capped at 1,000)
# and yields objects in batches.
while response.get("IsTruncated"):
response = s3_client.list_objects_v2(
Bucket=user_bucket_name,
ContinuationToken=response.get("NextContinuationToken"),
)
object_list += response.get("Contents", [])

return object_list
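The TODO above describes a generator-based alternative; a minimal sketch of that pagination pattern follows, using a stub in place of the real boto3 `s3_client` (the names `StubS3Client` and `iter_bucket_objects` are hypothetical, not part of this PR):

```python
# Hypothetical sketch of the TODO above: a generator that pages through
# objects instead of accumulating them all in memory. The stub client
# mimics the relevant shape of boto3's list_objects_v2 responses.
class StubS3Client:
    def __init__(self, total_objects, page_size=1000):
        self._keys = [f"file_{i}" for i in range(total_objects)]
        self._page_size = page_size

    def list_objects_v2(self, Bucket, ContinuationToken=None):
        start = int(ContinuationToken or 0)
        end = start + self._page_size
        resp = {"Contents": [{"Key": k} for k in self._keys[start:end]]}
        if end < len(self._keys):
            resp["IsTruncated"] = True
            resp["NextContinuationToken"] = str(end)
        return resp


def iter_bucket_objects(client, bucket_name):
    """Yield one page of objects at a time instead of building the full list."""
    response = client.list_objects_v2(Bucket=bucket_name)
    yield response.get("Contents", [])
    while response.get("IsTruncated"):
        response = client.list_objects_v2(
            Bucket=bucket_name,
            ContinuationToken=response["NextContinuationToken"],
        )
        yield response.get("Contents", [])


client = StubS3Client(total_objects=2500)
pages = list(iter_bucket_objects(client, "my-bucket"))
print([len(p) for p in pages])  # [1000, 1000, 500]
```

A caller could then delete each yielded page immediately, which is the behavior the review discussion below asks for.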


def delete_all_bucket_objects(user_id, user_bucket_name):
"""
Deletes all objects from the specified S3 bucket.

Args:
user_id (str): The user's unique Gen3 ID.
user_bucket_name (str): The name of the S3 bucket.
"""
object_list = get_all_bucket_objects(user_bucket_name)
[Review thread on this line]
Collaborator: This is fine because this endpoint is only meant to be used in integration tests for now. But in general, it would be better practice to fetch 1000 objects and delete them, and then fetch the next 1000, rather than store in memory the full list of objects and then loop over them in batches of 1000. If you don't want to spend time making this improvement now, could you add a TODO comment before the call to get_all_bucket_objects so we know to do it if we ever need to use this endpoint in production?
Contributor (author): Thanks! Initially, I considered making the get_all_bucket_objects function a generator that accepts a batch_size parameter (capped at 1,000) and yields results one batch at a time. This approach would avoid the batching logic we're currently applying in the delete function and could also serve as a form of pagination for end users if needed. However, given the narrow scope of this use case right now, it felt somewhat over-engineered. If you anticipate us using this functionality in the near future, I'd be happy to allocate time to implement it. Otherwise, I can leave a TODO as you suggested, and we can backlog a ticket to revisit this later.
Collaborator: Yeah, a comment is fine.
if not object_list:
return

logger.debug(
f"Deleting all contents from '{user_bucket_name}' for user '{user_id}' before deleting the bucket"
)
keys = [{"Key": obj.get("Key")} for obj in object_list]

# According to the docs, up to 1000 objects can be deleted in a single request:
# https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.delete_objects

# TODO: When `get_all_bucket_objects` is converted to a generator,
# we can remove this batching logic and retrieve objects in batches of 1,000 for deletion.
limit = 1000
for offset in range(0, len(keys), limit):
response = s3_client.delete_objects(
Bucket=user_bucket_name,
Delete={"Objects": keys[offset : offset + limit]},
)
if response.get("Errors"):
logger.error(
f"Failed to delete objects from bucket '{user_bucket_name}' for user '{user_id}': {response}"
)
raise Exception(response)
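The batching arithmetic used above can be seen in isolation; a small sketch (the key names and counts are made up for illustration):

```python
# Slicing a key list into delete_objects-sized chunks of at most 1,000
# entries, mirroring the `range(0, len(keys), limit)` loop above.
keys = [{"Key": f"file_{i}"} for i in range(2500)]
limit = 1000
batches = [keys[offset : offset + limit] for offset in range(0, len(keys), limit)]
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Each batch is small enough to pass as `Delete={"Objects": batch}` in a single `delete_objects` call.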


def delete_user_bucket(user_id: str) -> Union[str, None]:
"""
Deletes all objects from a user's S3 bucket before deleting the bucket itself.

Args:
user_id (str): The user's unique Gen3 ID

Raises:
Exception: If there is an error during the deletion process.
"""
user_bucket_name = get_safe_name_from_hostname(user_id)

try:
s3_client.head_bucket(Bucket=user_bucket_name)
except ClientError as e:
error_code = e.response["Error"]["Code"]
if error_code == "404":
logger.warning(
f"Bucket '{user_bucket_name}' not found for user '{user_id}'."
)
return None

logger.info(f"Deleting bucket '{user_bucket_name}' for user '{user_id}'")
try:
delete_all_bucket_objects(user_id, user_bucket_name)
s3_client.delete_bucket(Bucket=user_bucket_name)
return user_bucket_name

except Exception as e:
logger.error(
f"Failed to delete bucket '{user_bucket_name}' for user '{user_id}': {e}"
)
raise
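The `head_bucket`/404 pattern above relies on the shape of botocore's `ClientError.response`; here is a stand-in sketch that runs without AWS (the `FakeClientError` class and the bucket names are hypothetical, not part of this PR):

```python
# Stand-in mimicking botocore.exceptions.ClientError's `.response` attribute,
# so the 404-handling pattern above can be exercised without AWS.
class FakeClientError(Exception):
    def __init__(self, code):
        super().__init__(code)
        self.response = {"Error": {"Code": code}}


def bucket_exists(head_bucket, bucket_name):
    """True if head_bucket succeeds, False on a 404, re-raise anything else."""
    try:
        head_bucket(Bucket=bucket_name)
        return True
    except FakeClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise


def fake_head_bucket(Bucket):
    if Bucket != "gen3wf-example-user-1":
        raise FakeClientError("404")


print(bucket_exists(fake_head_bucket, "gen3wf-example-user-1"))  # True
print(bucket_exists(fake_head_bucket, "no-such-bucket"))         # False
```

Any non-404 error code falls through to the bare `raise`, matching the function's behavior of only treating a missing bucket as a soft failure.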
23 changes: 21 additions & 2 deletions gen3workflow/routes/storage.py
@@ -1,5 +1,5 @@
from fastapi import APIRouter, Depends, Request
from starlette.status import HTTP_200_OK
from fastapi import APIRouter, Depends, Request, HTTPException
from starlette.status import HTTP_200_OK, HTTP_204_NO_CONTENT, HTTP_404_NOT_FOUND

from gen3workflow import aws_utils, logger
from gen3workflow.auth import Auth
@@ -19,3 +19,22 @@ async def get_storage_info(request: Request, auth=Depends(Auth)) -> dict:
"workdir": f"s3://{bucket_name}/{bucket_prefix}",
"region": bucket_region,
}


@router.delete("/user-bucket", status_code=HTTP_204_NO_CONTENT)
async def delete_user_bucket(request: Request, auth=Depends(Auth)) -> None:
await auth.authorize("delete", ["/services/workflow/user-bucket"])

token_claims = await auth.get_token_claims()
user_id = token_claims.get("sub")
logger.info(f"User '{user_id}' deleting their storage bucket")
deleted_bucket_name = aws_utils.delete_user_bucket(user_id)

if not deleted_bucket_name:
raise HTTPException(
HTTP_404_NOT_FOUND, "Deletion failed: No user bucket found."
)

logger.info(
f"Bucket '{deleted_bucket_name}' for user '{user_id}' deleted successfully"
)
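The route's 204/404 decision can be sketched standalone, with the FastAPI pieces stubbed out (everything below is a hypothetical stand-in, not the service's real code):

```python
# Stub standing in for fastapi.HTTPException.
class HTTPException(Exception):
    def __init__(self, status_code, detail=""):
        self.status_code = status_code
        self.detail = detail


def delete_user_bucket_route(user_id, delete_fn):
    """Mirror of the handler's logic: 404 if no bucket was deleted, else 204."""
    deleted_bucket_name = delete_fn(user_id)
    if not deleted_bucket_name:
        raise HTTPException(404, "Deletion failed: No user bucket found.")
    return 204


print(delete_user_bucket_route("user-1", lambda uid: f"gen3wf-{uid}"))  # 204
try:
    delete_user_bucket_route("user-2", lambda uid: None)
except HTTPException as e:
    print(e.status_code)  # 404
```

In the real handler, the `None` return from `aws_utils.delete_user_bucket` (bucket not found) is what triggers the 404 path.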
128 changes: 128 additions & 0 deletions tests/test_misc.py
@@ -3,6 +3,7 @@
import json
from moto import mock_aws
import pytest
from unittest.mock import patch, MagicMock

from conftest import TEST_USER_ID
from gen3workflow import aws_utils
@@ -53,6 +54,14 @@ def test_get_safe_name_from_hostname(reset_config_hostname):
async def test_storage_info(client, access_token_patcher, mock_aws_services):
# check that the user's storage information is as expected
expected_bucket_name = f"gen3wf-{config['HOSTNAME']}-{TEST_USER_ID}"

# Bucket must not exist before this test
with pytest.raises(ClientError) as e:
aws_utils.s3_client.head_bucket(Bucket=expected_bucket_name)
assert (
e.value.response.get("ResponseMetadata", {}).get("HTTPStatusCode") == 404
), f"Bucket exists: {e.value}"

res = await client.get("/storage/info", headers={"Authorization": "bearer 123"})
assert res.status_code == 200, res.text
storage_info = res.json()
@@ -62,6 +71,10 @@ async def test_storage_info(client, access_token_patcher, mock_aws_services):
"region": config["USER_BUCKETS_REGION"],
}

# check that the bucket was created after the call to `/storage/info`
bucket_exists = aws_utils.s3_client.head_bucket(Bucket=expected_bucket_name)
assert bucket_exists, "Bucket does not exist"

# check that the bucket is setup with KMS encryption
kms_key = aws_utils.kms_client.describe_key(KeyId=f"alias/{expected_bucket_name}")
kms_key_arn = kms_key["KeyMetadata"]["Arn"]
@@ -169,3 +182,118 @@ async def test_bucket_enforces_encryption(
# ServerSideEncryption="aws:kms",
# SSEKMSKeyId=authorized_kms_key_arn,
# )


@pytest.mark.asyncio
async def test_delete_user_bucket(client, access_token_patcher, mock_aws_services):
"""
The user should be able to delete their own bucket.
"""

# Create the bucket if it doesn't exist
res = await client.get("/storage/info", headers={"Authorization": "bearer 123"})
bucket_name = res.json()["bucket"]

# Verify the bucket exists
bucket_exists = aws_utils.s3_client.head_bucket(Bucket=bucket_name)
assert bucket_exists, "Bucket does not exist"

# Delete the bucket
res = await client.delete(
"/storage/user-bucket", headers={"Authorization": "bearer 123"}
)
assert res.status_code == 204, res.text

# Verify the bucket is deleted
with pytest.raises(ClientError) as e:
aws_utils.s3_client.head_bucket(Bucket=bucket_name)
assert (
e.value.response.get("ResponseMetadata", {}).get("HTTPStatusCode") == 404
), f"Bucket still exists: {e.value}"

[Review thread on this line]
Collaborator: same comment here about ClientError vs assert statement
Contributor (author): I think this was indented appropriately. No?
Collaborator: Yeah, sorry, I think I got this wrong: the assert statement can/should be inside the with block here. I confused it with a try block. Could you revert that? My bad.
Contributor (author): I don't understand. I think the assert statement must be outside the with block. (reference here)
Collaborator: Whichever is fine.

# Attempt to delete the bucket again; it must return a 404 since the bucket no longer exists.
res = await client.delete(
"/storage/user-bucket", headers={"Authorization": "bearer 123"}
)
assert res.status_code == 404, res.text


@pytest.mark.asyncio
async def test_delete_user_bucket_with_files(
client, access_token_patcher, mock_aws_services
):
"""
Attempt to delete a bucket that is not empty.
Endpoint must be able to delete all the files and then delete the bucket.
"""

# Create the bucket if it doesn't exist
res = await client.get("/storage/info", headers={"Authorization": "bearer 123"})
bucket_name = res.json()["bucket"]

# Remove the bucket policy enforcing KMS encryption
# Moto has limitations that prevent adding objects to a bucket with KMS encryption enabled.
# More details: https://github.com/uc-cdis/gen3-workflow/blob/554fc3eb4c1d333f9ef81c1a5f8e75a6b208cdeb/tests/test_misc.py#L161-L171
aws_utils.s3_client.delete_bucket_policy(Bucket=bucket_name)

# Upload more than 1000 objects to ensure batching is working correctly
object_count = 1200
for i in range(object_count):
aws_utils.s3_client.put_object(
Bucket=bucket_name, Key=f"file_{i}", Body=b"Dummy file contents"
)

# Verify that all objects in the bucket are fetched even when the bucket has more than 1,000 objects
object_list = aws_utils.get_all_bucket_objects(bucket_name)
assert len(object_list) == object_count

# Delete the bucket
res = await client.delete(
"/storage/user-bucket", headers={"Authorization": "bearer 123"}
)
assert res.status_code == 204, res.text

# Verify the bucket is deleted
with pytest.raises(ClientError) as e:
aws_utils.s3_client.head_bucket(Bucket=bucket_name)
assert (
e.value.response.get("ResponseMetadata", {}).get("HTTPStatusCode") == 404
), f"Bucket still exists: {e.value}"


@pytest.mark.asyncio
async def test_delete_user_bucket_no_token(client, mock_aws_services):
"""
Attempt to delete a bucket when the user is not logged in. Must receive a 401 error.
"""
mock_delete_bucket = MagicMock()
# Delete the bucket
with patch("gen3workflow.aws_utils.delete_user_bucket", mock_delete_bucket):
res = await client.delete("/storage/user-bucket")
assert res.status_code == 401, res.text
assert res.json() == {"detail": "Must provide an access token"}
mock_delete_bucket.assert_not_called()


@pytest.mark.asyncio
@pytest.mark.parametrize(
"client",
[pytest.param({"authorized": False, "tes_resp_code": 200}, id="unauthorized")],
indirect=True,
)
async def test_delete_user_bucket_unauthorized(
client, access_token_patcher, mock_aws_services
):
"""
Attempt to delete a bucket when the user is logged in but does not have the appropriate authorization.
Must receive a 403 error.
"""
mock_delete_bucket = MagicMock()
# Delete the bucket
with patch("gen3workflow.aws_utils.delete_user_bucket", mock_delete_bucket):
res = await client.delete(
"/storage/user-bucket", headers={"Authorization": "bearer 123"}
)
assert res.status_code == 403, res.text
assert res.json() == {"detail": "Permission denied"}
mock_delete_bucket.assert_not_called()