Commit 00c8afb

Feature: Support RetryStrategy/InstanceGroups (#175)
Note: Some modifications are made to hyperparameter tuning. This is expected, because a hyperparameter tuning job consists of training jobs.
Description of changes: generator.yaml: the RetryStrategy, InstanceGroups, and InstanceGroupNames shapes are no longer ignored. Added late initialize to ResourceConfig.InstanceCount and its HPO equivalent.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
1 parent b028346 commit 00c8afb

23 files changed: 1110 additions, 184 deletions
Lines changed: 3 additions & 3 deletions
@@ -1,13 +1,13 @@
 ack_generate_info:
-  build_date: "2022-10-04T18:55:06Z"
+  build_date: "2022-11-23T21:07:39Z"
   build_hash: 6e2ffbc3b16a30ac59be6719918c601c2c864064
   go_version: go1.17.13
   version: v0.20.1-3-g6e2ffbc
-api_directory_checksum: ca9187f53c674d6424c5a4120fe2609afce3d52a
+api_directory_checksum: 2633d649a5543e7613c66d1ae5de70dcb65c68d5
 api_version: v1alpha1
 aws_sdk_go_version: v1.44.93
 generator_config_info:
-  file_checksum: 858695b7159c1a59326d91623f545bf0be1c18d2
+  file_checksum: 7c6cfe3508784bc2442ecf00f5155b9c778d00e4
   original_file_name: generator.yaml
 last_modification:
   reason: API generation

apis/v1alpha1/generator.yaml

Lines changed: 6 additions & 3 deletions
@@ -188,6 +188,9 @@ resources:
 OutputDataConfig.KMSKeyID:
   late_initialize:
     min_backoff_seconds: 5
+ResourceConfig.InstanceCount:
+  late_initialize:
+    min_backoff_seconds: 5
 Tags:
   compare:
     is_ignored: true
@@ -335,6 +338,9 @@ resources:
 TrainingJobDefinition.EnableNetworkIsolation:
   late_initialize:
     min_backoff_seconds: 5
+TrainingJobDefinition.ResourceConfig.InstanceCount:
+  late_initialize:
+    min_backoff_seconds: 5
 Tags:
   compare:
     is_ignored: true
@@ -879,11 +885,8 @@ ignore:
 shape_names:
   # RSessionAppSettings is an empty struct that causes generation errors
   - RSessionAppSettings
-  - RetryStrategy
   - DeploymentConfig
   - ProductionVariantServerlessConfig
   - ExecutionRoleIdentityConfig
   - HyperParameterTuningResourceConfig
   - InstanceMetadataServiceConfiguration
-  - InstanceGroups
-  - InstanceGroupNames

apis/v1alpha1/training_job.go

Lines changed: 2 additions & 0 deletions
Generated file; diff not rendered.

apis/v1alpha1/types.go

Lines changed: 29 additions & 8 deletions
Generated file; diff not rendered.

apis/v1alpha1/zz_generated.deepcopy.go

Lines changed: 62 additions & 0 deletions
Generated file; diff not rendered.

config/crd/bases/sagemaker.services.k8s.aws_hyperparametertuningjobs.yaml

Lines changed: 62 additions & 0 deletions
@@ -346,6 +346,10 @@ spec:
   items:
     type: string
   type: array
+instanceGroupNames:
+  items:
+    type: string
+  type: array
 s3DataDistributionType:
   type: string
 s3DataType:
@@ -429,6 +433,22 @@ spec:
 instanceCount:
   format: int64
   type: integer
+instanceGroups:
+  items:
+    description: Defines an instance group for heterogeneous
+      cluster training. When requesting a training job using
+      the CreateTrainingJob (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html)
+      API, you can configure multiple instance groups .
+    properties:
+      instanceCount:
+        format: int64
+        type: integer
+      instanceGroupName:
+        type: string
+      instanceType:
+        type: string
+    type: object
+  type: array
 instanceType:
   type: string
 volumeKMSKeyID:
@@ -437,6 +457,17 @@ spec:
       format: int64
       type: integer
   type: object
+retryStrategy:
+  description: The retry strategy to use when a training job fails
+    due to an InternalServerError. RetryStrategy is specified as
+    part of the CreateTrainingJob and CreateHyperParameterTuningJob
+    requests. You can add the StoppingCondition parameter to the
+    request to limit the training time for the complete job.
+  properties:
+    maximumRetryAttempts:
+      format: int64
+      type: integer
+  type: object
 roleARN:
   type: string
 staticHyperParameters:
@@ -673,6 +704,10 @@ spec:
   items:
     type: string
   type: array
+instanceGroupNames:
+  items:
+    type: string
+  type: array
 s3DataDistributionType:
   type: string
 s3DataType:
@@ -758,6 +793,22 @@ spec:
 instanceCount:
   format: int64
   type: integer
+instanceGroups:
+  items:
+    description: Defines an instance group for heterogeneous
+      cluster training. When requesting a training job using
+      the CreateTrainingJob (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html)
+      API, you can configure multiple instance groups .
+    properties:
+      instanceCount:
+        format: int64
+        type: integer
+      instanceGroupName:
+        type: string
+      instanceType:
+        type: string
+    type: object
+  type: array
 instanceType:
   type: string
 volumeKMSKeyID:
@@ -766,6 +817,17 @@ spec:
       format: int64
       type: integer
   type: object
+retryStrategy:
+  description: The retry strategy to use when a training job fails
+    due to an InternalServerError. RetryStrategy is specified
+    as part of the CreateTrainingJob and CreateHyperParameterTuningJob
+    requests. You can add the StoppingCondition parameter to the
+    request to limit the training time for the complete job.
+  properties:
+    maximumRetryAttempts:
+      format: int64
+      type: integer
+  type: object
 roleARN:
   type: string
 staticHyperParameters:
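
For orientation, the new HyperParameterTuningJob fields might be used roughly as in the fragment below. This is a sketch, not taken from the commit: it assumes the fields sit under `spec.trainingJobDefinition` (inferred from the `TrainingJobDefinition.ResourceConfig.InstanceCount` path in generator.yaml), and the group names, instance types, and counts are hypothetical placeholders. Required fields unrelated to this change are omitted.

```yaml
# Hypothetical fragment of a HyperParameterTuningJob spec.
# All values are placeholders; unrelated required fields are left out.
spec:
  trainingJobDefinition:
    resourceConfig:
      volumeSizeInGB: 30
      # Added by this commit: instance groups for heterogeneous cluster training.
      instanceGroups:
        - instanceGroupName: cpu-group   # placeholder name
          instanceType: ml.c5.xlarge     # placeholder type
          instanceCount: 1
        - instanceGroupName: gpu-group   # placeholder name
          instanceType: ml.p3.2xlarge    # placeholder type
          instanceCount: 1
    # Added by this commit: retries after an InternalServerError.
    retryStrategy:
      maximumRetryAttempts: 2
```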

config/crd/bases/sagemaker.services.k8s.aws_trainingjobs.yaml

Lines changed: 28 additions & 0 deletions
@@ -276,6 +276,10 @@ spec:
   items:
     type: string
   type: array
+instanceGroupNames:
+  items:
+    type: string
+  type: array
 s3DataDistributionType:
   type: string
 s3DataType:
@@ -400,6 +404,22 @@ spec:
 instanceCount:
   format: int64
   type: integer
+instanceGroups:
+  items:
+    description: Defines an instance group for heterogeneous cluster
+      training. When requesting a training job using the CreateTrainingJob
+      (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html)
+      API, you can configure multiple instance groups .
+    properties:
+      instanceCount:
+        format: int64
+        type: integer
+      instanceGroupName:
+        type: string
+      instanceType:
+        type: string
+    type: object
+  type: array
 instanceType:
   type: string
 volumeKMSKeyID:
@@ -408,6 +428,14 @@ spec:
       format: int64
       type: integer
   type: object
+retryStrategy:
+  description: The number of times to retry the job when the job fails
+    due to an InternalServerError.
+  properties:
+    maximumRetryAttempts:
+      format: int64
+      type: integer
+  type: object
 roleARN:
   description: "The Amazon Resource Name (ARN) of an IAM role that SageMaker
     can assume to perform tasks on your behalf. \n During model training,
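
As an illustration of the schema additions above, a TrainingJob manifest using `instanceGroups`, `instanceGroupNames`, and `retryStrategy` could look something like the sketch below. It is not part of the commit. The API group and version come from the CRD file name and the metadata file. The three new fields come from the diff. The remaining field names are assumed from the controller's usual camelCase naming, and every name, ARN, image URI, and S3 path is a hypothetical placeholder.

```yaml
# Hypothetical manifest: every name, ARN, image URI, and S3 path is a placeholder.
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: TrainingJob
metadata:
  name: example-heterogeneous-training
spec:
  trainingJobName: example-heterogeneous-training
  roleARN: arn:aws:iam::111122223333:role/example-sagemaker-role
  algorithmSpecification:
    trainingImage: 111122223333.dkr.ecr.us-west-2.amazonaws.com/example-image:latest
    trainingInputMode: File
  inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3URI: s3://example-bucket/train/
          s3DataDistributionType: FullyReplicated
          # Added by this commit: names of the instance groups that read this channel.
          instanceGroupNames:
            - cpu-group
  outputDataConfig:
    s3OutputPath: s3://example-bucket/output/
  resourceConfig:
    volumeSizeInGB: 30
    # Added by this commit: instance groups for heterogeneous cluster training.
    instanceGroups:
      - instanceGroupName: cpu-group
        instanceType: ml.c5.xlarge
        instanceCount: 1
      - instanceGroupName: gpu-group
        instanceType: ml.p3.2xlarge
        instanceCount: 1
  stoppingCondition:
    maxRuntimeInSeconds: 3600
  # Added by this commit: number of retries after an InternalServerError.
  retryStrategy:
    maximumRetryAttempts: 2
```

The sketch leaves `resourceConfig.instanceCount` and `resourceConfig.instanceType` unset and relies on the instance groups instead. That is presumably the case the new `late_initialize` entry for `ResourceConfig.InstanceCount` is meant to cover, though the commit description does not say so explicitly.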
