Commit a2afd39

surajkota and rrkharse authored
Implement per region instance type config for canary and e2e tests (#94)
* Implement per region instance type config for canary and e2e tests (#87)

  Description of changes:

  Currently, we are using one fixed instance type across all AWS regions in our endpoint and training job tests. However, certain regions do not support the currently specified instance type or require a limit increase to use that instance type. Specifically, canary tests in the eu-west-3 and eu-north-1 regions are failing due to this issue.

  This pull request updates the testing resource config file `replacement_values.py` to pass in the correct instance type depending on region. Regions that did not experience this issue will continue to use the previous instance type via the new config to avoid breaking canaries/e2e testing in those regions.

  The changes have been tested in our eu-west-3 and eu-north-1 canary stacks and have resulted in passing canaries.

  By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* enable adopted resource test in canaries

* inc delete wait for endpoint

Co-authored-by: Rahul Kharse <[email protected]>
1 parent 817416b commit a2afd39

9 files changed: 22 additions, 9 deletions

test/canary/Dockerfile.canary

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ RUN curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.18.6/b
     && cp ./kubectl /bin
 
 # Install eksctl
-RUN curl --silent --location "https://github.com/weaveworks/eksctl/releases/download/latest_release/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp && mv /tmp/eksctl /bin
+RUN curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp && mv /tmp/eksctl /bin
 
 # Install Helm
 RUN curl -q -L "https://get.helm.sh/helm-v3.2.4-linux-amd64.tar.gz" | tar zxf - -C /usr/local/bin/ \

test/e2e/common/fixtures.py

Lines changed: 1 addition & 1 deletion
@@ -91,7 +91,7 @@ def xgboost_churn_endpoint(sagemaker_client):
     yield endpoint_spec
 
     for cr in (model_reference, endpoint_config_reference, endpoint_reference):
-        _, deleted = k8s.delete_custom_resource(cr, 3, 10)
+        _, deleted = k8s.delete_custom_resource(cr, cfg.DELETE_WAIT_PERIOD, cfg.DELETE_WAIT_LENGTH)
         assert deleted
 
 
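The two hard-coded arguments replaced above appear to control how long the fixture waits for a resource to disappear. The diff does not show the helper's signature or the values behind `cfg.DELETE_WAIT_PERIOD` and `cfg.DELETE_WAIT_LENGTH`, so the following is only a rough sketch under the assumption that the first argument is a number of polls and the second is the number of seconds between polls. The names `wait_until_gone` and `still_exists` are hypothetical and are not part of the `k8s` helper module.

```python
import time
from typing import Callable


def wait_until_gone(
    still_exists: Callable[[], bool],
    wait_periods: int,
    period_length: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Sleep `period_length` seconds, then check, up to `wait_periods` times.

    Returns True as soon as `still_exists()` reports False, and False if the
    resource is still present after the final check. (Hypothetical helper;
    the argument meanings are an assumption, not taken from the diff.)
    """
    for _ in range(wait_periods):
        sleep(period_length)
        if not still_exists():
            return True
    return False
```

Under that assumption the maximum wait is `wait_periods * period_length`, so the previous literals `3, 10` would cap the wait at 30 seconds, and moving the values into config lets the limit be raised in one place for slow endpoint deletions.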

test/e2e/replacement_values.py

Lines changed: 12 additions & 0 deletions
@@ -166,6 +166,16 @@
     "eu-south-1": "638885417683.dkr.ecr.eu-south-1.amazonaws.com",
 }
 
+ENDPOINT_INSTANCE_TYPES = {
+    "eu-west-3": "ml.m5.large",
+    "eu-north-1": "ml.m5.large",
+}
+
+TRAINING_JOB_INSTANCE_TYPES = {
+    "eu-west-3": "ml.m5.xlarge",
+    "eu-north-1": "ml.m5.xlarge",
+}
+
 REPLACEMENT_VALUES = {
     "SAGEMAKER_DATA_BUCKET": get_bootstrap_resources().DataBucketName,
     "XGBOOST_IMAGE_URI": f"{XGBOOST_IMAGE_URIS[get_region()]}/sagemaker-xgboost:1.0-1-cpu-py3",
@@ -175,4 +185,6 @@
     "SAGEMAKER_EXECUTION_ROLE_ARN": get_bootstrap_resources().ExecutionRoleARN,
     "MODEL_MONITOR_ANALYZER_IMAGE_URI": f"{MODEL_MONITOR_IMAGE_URIS[get_region()]}/sagemaker-model-monitor-analyzer",
     "CLARIFY_IMAGE_URI": f"{CLARIFY_IMAGE_URIS[get_region()]}/sagemaker-clarify-processing:1.0",
+    "ENDPOINT_INSTANCE_TYPE": ENDPOINT_INSTANCE_TYPES.get(get_region(), 'ml.c5.large'),
+    "TRAINING_JOB_INSTANCE_TYPE": TRAINING_JOB_INSTANCE_TYPES.get(get_region(), 'ml.m4.xlarge')
 }
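The lookup pattern in this file is a per-region override table with a default for every other region. As a minimal standalone sketch of that behaviour, the snippet below copies the two tables and defaults from the diff but takes the region as an explicit argument instead of calling `get_region()`; the function name `instance_types_for` is hypothetical and exists only for illustration.

```python
# Per-region overrides, copied from the diff above.
ENDPOINT_INSTANCE_TYPES = {
    "eu-west-3": "ml.m5.large",
    "eu-north-1": "ml.m5.large",
}

TRAINING_JOB_INSTANCE_TYPES = {
    "eu-west-3": "ml.m5.xlarge",
    "eu-north-1": "ml.m5.xlarge",
}


def instance_types_for(region: str) -> dict:
    """Return the instance types for `region` (hypothetical helper).

    Regions without an override fall back to the instance types the
    tests used before this change.
    """
    return {
        "ENDPOINT_INSTANCE_TYPE": ENDPOINT_INSTANCE_TYPES.get(region, "ml.c5.large"),
        "TRAINING_JOB_INSTANCE_TYPE": TRAINING_JOB_INSTANCE_TYPES.get(region, "ml.m4.xlarge"),
    }


if __name__ == "__main__":
    for region in ("us-west-2", "eu-west-3"):
        print(region, instance_types_for(region))
```

Using `dict.get` with a default means a new region needs no entry unless it actually requires a different instance type, which keeps existing canaries unchanged.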

test/e2e/resources/endpoint_config_data_capture_single_variant.yaml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ spec:
   productionVariants:
   - modelName: $MODEL_NAME
     variantName: AllTraffic
-    instanceType: ml.c5.large
+    instanceType: $ENDPOINT_INSTANCE_TYPE
     initialVariantWeight: 1
     initialInstanceCount: 1
   dataCaptureConfig:
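The resource YAML now carries a `$ENDPOINT_INSTANCE_TYPE` placeholder instead of a literal instance type. How the test harness actually performs the substitution is not shown in this diff, so the sketch below uses Python's standard `string.Template` purely as a stand-in to illustrate the idea. The `render` function and the trimmed template text are assumptions made for the example.

```python
from string import Template

# Trimmed, illustrative fragment modelled on the resource file above.
ENDPOINT_CONFIG_TEMPLATE = """\
productionVariants:
- modelName: $MODEL_NAME
  variantName: AllTraffic
  instanceType: $ENDPOINT_INSTANCE_TYPE
"""


def render(template_text: str, values: dict) -> str:
    """Fill `$NAME` placeholders from `values` (hypothetical helper).

    Template.substitute raises KeyError when a placeholder has no value,
    which surfaces a missing replacement early instead of sending a
    literal `$NAME` to the API.
    """
    return Template(template_text).substitute(values)


if __name__ == "__main__":
    print(render(
        ENDPOINT_CONFIG_TEMPLATE,
        {"MODEL_NAME": "example-model", "ENDPOINT_INSTANCE_TYPE": "ml.m5.large"},
    ))
```

Keeping the instance type as a placeholder lets the same YAML serve every region, with the region-specific value decided once in `replacement_values.py`.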

test/e2e/resources/endpoint_config_multi_variant.yaml

Lines changed: 2 additions & 2 deletions
@@ -9,12 +9,12 @@ spec:
     modelName: $MODEL_NAME
     initialInstanceCount: 1
     # This is the smallest instance type which will support scaling
-    instanceType: ml.c5.large
+    instanceType: $ENDPOINT_INSTANCE_TYPE
     initialVariantWeight: 1
   - variantName: variant-2
     modelName: $MODEL_NAME
     initialInstanceCount: 1
-    instanceType: ml.c5.large
+    instanceType: $ENDPOINT_INSTANCE_TYPE
     initialVariantWeight: 1
   tags:
   - key: confidentiality

test/e2e/resources/endpoint_config_single_variant.yaml

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ spec:
     # instanceCount is 2 to test retainAllVariantProperties
     initialInstanceCount: 2
     # This is the smallest instance type which will support scaling
-    instanceType: ml.c5.large
+    instanceType: $ENDPOINT_INSTANCE_TYPE
     initialVariantWeight: 1
   tags:
   - key: confidentiality

test/e2e/resources/xgboost_trainingjob.yaml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ spec:
     s3OutputPath: s3://$SAGEMAKER_DATA_BUCKET/sagemaker/training/output
   resourceConfig:
     instanceCount: 1
-    instanceType: ml.m4.xlarge
+    instanceType: $TRAINING_JOB_INSTANCE_TYPE
     volumeSizeInGB: 5
   stoppingCondition:
     maxRuntimeInSeconds: 86400

test/e2e/resources/xgboost_trainingjob_debugger.yaml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ spec:
     s3OutputPath: s3://$SAGEMAKER_DATA_BUCKET/sagemaker/training/debugger/output
   resourceConfig:
     instanceCount: 1
-    instanceType: ml.m4.xlarge
+    instanceType: $TRAINING_JOB_INSTANCE_TYPE
     volumeSizeInGB: 5
   stoppingCondition:
     maxRuntimeInSeconds: 86400

test/e2e/tests/test_adopt_endpoint.py

Lines changed: 2 additions & 1 deletion
@@ -67,7 +67,7 @@ def sdk_make_endpoint_config(model_name, endpoint_config_name):
                 "VariantName": "variant-1",
                 "ModelName": model_name,
                 "InitialInstanceCount": 1,
-                "InstanceType": "ml.c5.large",
+                "InstanceType": REPLACEMENT_VALUES["ENDPOINT_INSTANCE_TYPE"],
             }
         ],
     }
@@ -170,6 +170,7 @@ def adopted_endpoint(sdk_endpoint):
 
 
 @service_marker
+@pytest.mark.canary
 class TestAdoptedEndpoint:
     def test_smoke(self, sdk_endpoint, adopted_endpoint):
         (