Description
Describe the bug
Hi, I am following this example.
I found that when running the deploy function, it asks for permission to create the default S3 bucket even though the code_location parameter was passed to the TensorFlow estimator.
However, based on the source code, if code_location is passed when the model is initialized, it should avoid creating a new S3 bucket and instead reuse the bucket parsed from code_location, storing the model output there.
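In the meantime, a possible workaround is to pin the session's default bucket to the existing one (a minimal sketch, assuming "mys3bucket" already exists and the execution role can read and write it; Session's default_bucket parameter is part of the public API):

import sagemaker

# Point the session's default bucket at the existing bucket so that
# default_bucket() never attempts to create sagemaker-<region>-<account-id>.
# Assumes "mys3bucket" already exists and is accessible to the role.
sagemaker_session = sagemaker.Session(default_bucket="mys3bucket")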
To reproduce
A clear, step-by-step set of instructions to reproduce the bug.
The provided code needs to be complete and runnable; if additional data is needed, please include it in the issue.
mys3bucket is an existing S3 bucket and the prefix is writable:
import os

import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow

sagemaker_session = sagemaker.Session()
role = get_execution_role()
region = sagemaker_session.boto_session.region_name

username = os.environ['USER']
base_job_name = f"users-{username}-tf-script-mode"

bucket = "mys3bucket"  # pre-existing bucket
prefix = f"data/users/{username}/tensorflow"
training_data_uri = f"s3://sagemaker-sample-data-{region}/tensorflow/mnist"

source_dir = f"s3://{bucket}/{prefix}/source"
output_path = f"s3://{bucket}/{prefix}/output"
print(f"{source_dir=}")
print(f"{output_path=}")

hyperparams = {
    'sagemaker_requirements': 'code/requirements.txt'
}

mnist_estimator = TensorFlow(entry_point='code/mnist.py',
                             base_job_name=base_job_name,
                             output_path=output_path,
                             code_location=source_dir,
                             hyperparameters=hyperparams,
                             role=role,
                             instance_count=2,
                             instance_type='ml.m5.large',
                             framework_version='2.1.0',
                             py_version='py3',
                             distribution={'parameter_server': {'enabled': True}})

## fit
print("start fitting")
mnist_estimator.fit(training_data_uri)

## deploy
print("start deploy")
predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.m5.large')
Error message
WARNING:sagemaker.deprecations:update_endpoint is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
INFO:sagemaker.tensorflow.model:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:531, in Session._create_s3_bucket_if_it_does_not_exist(self, bucket_name, region)
529 try:
530 # trying head bucket call
--> 531 s3.meta.client.head_bucket(Bucket=bucket.name)
532 except ClientError as e:
533 # bucket does not exist or forbidden to access
File /opt/conda/lib/python3.9/site-packages/botocore/client.py:553, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
552 # The "self" in this scope is referring to the BaseClient.
--> 553 return self._make_api_call(operation_name, kwargs)
File /opt/conda/lib/python3.9/site-packages/botocore/client.py:1009, in BaseClient._make_api_call(self, operation_name, api_params)
1008 error_class = self.exceptions.from_code(error_code)
-> 1009 raise error_class(parsed_response, operation_name)
1010 else:
ClientError: An error occurred (404) when calling the HeadBucket operation: Not Found
During handling of the above exception, another exception occurred:
ClientError Traceback (most recent call last)
Cell In[7], line 1
----> 1 predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.m5.large')
File /opt/conda/lib/python3.9/site-packages/sagemaker/estimator.py:1509, in EstimatorBase.deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, use_compiled_model, wait, model_name, kms_key, data_capture_config, tags, serverless_inference_config, async_inference_config, volume_size, model_data_download_timeout, container_startup_health_check_timeout, inference_recommendation_id, explainer_config, **kwargs)
1503 model.name = model_name
1505 tags = update_inference_tags_with_jumpstart_training_tags(
1506 inference_tags=tags, training_tags=self.tags
1507 )
-> 1509 return model.deploy(
1510 instance_type=instance_type,
1511 initial_instance_count=initial_instance_count,
1512 serializer=serializer,
1513 deserializer=deserializer,
1514 accelerator_type=accelerator_type,
1515 endpoint_name=endpoint_name,
1516 tags=tags or self.tags,
1517 wait=wait,
1518 kms_key=kms_key,
1519 data_capture_config=data_capture_config,
1520 serverless_inference_config=serverless_inference_config,
1521 async_inference_config=async_inference_config,
1522 explainer_config=explainer_config,
1523 volume_size=volume_size,
1524 model_data_download_timeout=model_data_download_timeout,
1525 container_startup_health_check_timeout=container_startup_health_check_timeout,
1526 inference_recommendation_id=inference_recommendation_id,
1527 )
File /opt/conda/lib/python3.9/site-packages/sagemaker/tensorflow/model.py:335, in TensorFlowModel.deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, update_endpoint, async_inference_config, serverless_inference_config, volume_size, model_data_download_timeout, container_startup_health_check_timeout, inference_recommendation_id, explainer_config)
332 msg = "The TensorFlow version %s doesn't support EIA." % self.framework_version
333 raise AttributeError(msg)
--> 335 return super(TensorFlowModel, self).deploy(
336 initial_instance_count=initial_instance_count,
337 instance_type=instance_type,
338 serializer=serializer,
339 deserializer=deserializer,
340 accelerator_type=accelerator_type,
341 endpoint_name=endpoint_name,
342 tags=tags,
343 kms_key=kms_key,
344 wait=wait,
345 data_capture_config=data_capture_config,
346 async_inference_config=async_inference_config,
347 serverless_inference_config=serverless_inference_config,
348 volume_size=volume_size,
349 model_data_download_timeout=model_data_download_timeout,
350 container_startup_health_check_timeout=container_startup_health_check_timeout,
351 update_endpoint=update_endpoint,
352 inference_recommendation_id=inference_recommendation_id,
353 explainer_config=explainer_config,
354 )
File /opt/conda/lib/python3.9/site-packages/sagemaker/model.py:1248, in Model.deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, async_inference_config, serverless_inference_config, volume_size, model_data_download_timeout, container_startup_health_check_timeout, inference_recommendation_id, explainer_config, **kwargs)
1245 if self._base_name is not None:
1246 self._base_name = "-".join((self._base_name, compiled_model_suffix))
-> 1248 self._create_sagemaker_model(
1249 instance_type, accelerator_type, tags, serverless_inference_config
1250 )
1252 serverless_inference_config_dict = (
1253 serverless_inference_config._to_request_dict() if is_serverless else None
1254 )
1255 production_variant = sagemaker.production_variant(
1256 self.name,
1257 instance_type,
(...)
1263 container_startup_health_check_timeout=container_startup_health_check_timeout,
1264 )
File /opt/conda/lib/python3.9/site-packages/sagemaker/model.py:681, in Model._create_sagemaker_model(self, instance_type, accelerator_type, tags, serverless_inference_config)
659 def _create_sagemaker_model(
660 self, instance_type=None, accelerator_type=None, tags=None, serverless_inference_config=None
661 ):
662 """Create a SageMaker Model Entity
663
664 Args:
(...)
679 not provided in serverless inference. So this is used to find image URIs.
680 """
--> 681 container_def = self.prepare_container_def(
682 instance_type,
683 accelerator_type=accelerator_type,
684 serverless_inference_config=serverless_inference_config,
685 )
687 if not isinstance(self.sagemaker_session, PipelineSession):
688 # _base_name, model_name are not needed under PipelineSession.
689 # the model_data may be Pipeline variable
690 # which may break the _base_name generation
691 self._ensure_base_name_if_needed(
692 image_uri=container_def["Image"],
693 script_uri=self.source_dir,
694 model_uri=self.model_data,
695 )
File /opt/conda/lib/python3.9/site-packages/sagemaker/tensorflow/model.py:391, in TensorFlowModel.prepare_container_def(self, instance_type, accelerator_type, serverless_inference_config)
389 env = self._get_container_env()
390 key_prefix = sagemaker.fw_utils.model_code_key_prefix(self.key_prefix, self.name, image_uri)
--> 391 bucket = self.bucket or self.sagemaker_session.default_bucket()
393 if self.entry_point and not is_pipeline_variable(self.model_data):
394 model_data = s3.s3_path_join("s3://", bucket, key_prefix, "model.tar.gz")
File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:500, in Session.default_bucket(self)
497 if not default_bucket:
498 default_bucket = generate_default_sagemaker_bucket_name(self.boto_session)
--> 500 self._create_s3_bucket_if_it_does_not_exist(bucket_name=default_bucket, region=region)
502 self._default_bucket = default_bucket
504 return self._default_bucket
File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:545, in Session._create_s3_bucket_if_it_does_not_exist(self, bucket_name, region)
543 s3.create_bucket(Bucket=bucket_name)
544 else:
--> 545 s3.create_bucket(
546 Bucket=bucket_name,
547 CreateBucketConfiguration={"LocationConstraint": region},
548 )
550 LOGGER.info("Created S3 bucket: %s", bucket_name)
551 except ClientError as e:
File /opt/conda/lib/python3.9/site-packages/boto3/resources/factory.py:581, in ResourceFactory._create_action.<locals>.do_action(self, *args, **kwargs)
580 def do_action(self, *args, **kwargs):
--> 581 response = action(self, *args, **kwargs)
583 if hasattr(self, 'load'):
584 # Clear cached data. It will be reloaded the next
585 # time that an attribute is accessed.
586 # TODO: Make this configurable in the future?
587 self.meta.data = None
File /opt/conda/lib/python3.9/site-packages/boto3/resources/action.py:88, in ServiceAction.__call__(self, parent, *args, **kwargs)
79 params.update(kwargs)
81 logger.debug(
82 'Calling %s:%s with %r',
83 parent.meta.service_name,
84 operation_name,
85 params,
86 )
---> 88 response = getattr(parent.meta.client, operation_name)(*args, **params)
90 logger.debug('Response: %r', response)
92 return self._response_handler(parent, params, response)
File /opt/conda/lib/python3.9/site-packages/botocore/client.py:553, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
549 raise TypeError(
550 f"{py_operation_name}() only accepts keyword arguments."
551 )
552 # The "self" in this scope is referring to the BaseClient.
--> 553 return self._make_api_call(operation_name, kwargs)
File /opt/conda/lib/python3.9/site-packages/botocore/client.py:1009, in BaseClient._make_api_call(self, operation_name, api_params)
1005 error_code = error_info.get("QueryErrorCode") or error_info.get(
1006 "Code"
1007 )
1008 error_class = self.exceptions.from_code(error_code)
-> 1009 raise error_class(parsed_response, operation_name)
1010 else:
1011 return parsed_response
ClientError: An error occurred (AccessDenied) when calling the CreateBucket operation: Access Denied
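The traceback points at the root cause: in TensorFlowModel.prepare_container_def (sagemaker/tensorflow/model.py:391 above), self.bucket is apparently None at deploy time, presumably because deploy() rebuilds the model without forwarding the estimator's code_location, so the call falls back to default_bucket():

# sagemaker/tensorflow/model.py:391, quoted from the traceback above;
# self.bucket appears to be unset here, so default_bucket() is invoked and
# tries to create the default sagemaker-<region>-<account-id> bucket.
bucket = self.bucket or self.sagemaker_session.default_bucket()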
Expected behavior
A clear and concise description of what you expected to happen.
No new S3 bucket should be created; deploy should reuse the bucket parsed from code_location.
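As another possible interim workaround (a sketch only, assuming TensorFlow.create_model forwards extra keyword arguments such as code_location through to TensorFlowModel, which I have not verified across SDK versions), the model can be built explicitly before deploying:

# Hypothetical workaround: pass code_location to the model explicitly so
# that prepare_container_def uses the bucket parsed from it instead of
# falling back to default_bucket().
model = mnist_estimator.create_model(entry_point='code/mnist.py',
                                     code_location=source_dir)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m5.large')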
Screenshots or logs
If applicable, add screenshots or logs to help explain your problem.
System information
A description of your system. Please provide:
- SageMaker Python SDK version: latest
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): TensorFlow
- Framework version: 2.1.0
- Python version: 3.9
- CPU or GPU: CPU
- Custom Docker image (Y/N): N
Additional context
A proposed solution is in #4537.