fix: Use server side defaults for Train definitions (#177)
Issue:
We [do not](https://github.com/aws-controllers-k8s/sagemaker-controller/blob/3996ba037349bb0b65088341afdfd03cf657e09c/generator.yaml#L326) late initialize some parameters in TrainingJobDefinitions like we do in the TrainingJob definition. As a result, the controller requeues indefinitely unless every parameter is explicitly specified, because the server sends back the default values it uses and those never match the fields left unset in the spec.
Description of changes:
`pkg/resource/hyper_parameter_tuning_job/custom_delta.go` - Sets some parameters to their server side default.
`pkg/resource/hyper_parameter_tuning_job/testdata/v1alpha1/readone/observed/completed_variation.yaml` - Modified unit test.
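For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the idea behind the change: fill unset optional fields with the value the server is expected to return before comparing desired and observed state. The `TrainingJobDefinition` struct, the helper names, the three fields shown, and the default value `false` are all assumptions made for illustration. They are not copied from `custom_delta.go` or from the generated ACK types.

```go
package main

import "fmt"

// TrainingJobDefinition is a hypothetical, trimmed-down stand-in for the
// generated type. Only a few optional boolean fields are modeled.
type TrainingJobDefinition struct {
	EnableNetworkIsolation                *bool
	EnableInterContainerTrafficEncryption *bool
	EnableManagedSpotTraining             *bool
}

func boolPtr(b bool) *bool { return &b }

// withServerDefaults returns a copy of def in which every unset field is
// filled with the value the server is ASSUMED to report back (false here).
func withServerDefaults(def TrainingJobDefinition) TrainingJobDefinition {
	out := def
	if out.EnableNetworkIsolation == nil {
		out.EnableNetworkIsolation = boolPtr(false)
	}
	if out.EnableInterContainerTrafficEncryption == nil {
		out.EnableInterContainerTrafficEncryption = boolPtr(false)
	}
	if out.EnableManagedSpotTraining == nil {
		out.EnableManagedSpotTraining = boolPtr(false)
	}
	return out
}

// boolPtrEqual treats two nil pointers as equal and otherwise compares values.
func boolPtrEqual(a, b *bool) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

// naiveDiffers compares the raw desired spec with the observed state.
func naiveDiffers(desired, observed TrainingJobDefinition) bool {
	return !boolPtrEqual(desired.EnableNetworkIsolation, observed.EnableNetworkIsolation) ||
		!boolPtrEqual(desired.EnableInterContainerTrafficEncryption, observed.EnableInterContainerTrafficEncryption) ||
		!boolPtrEqual(desired.EnableManagedSpotTraining, observed.EnableManagedSpotTraining)
}

// differs applies the assumed server defaults to desired before comparing,
// so fields the user omitted no longer register as a difference.
func differs(desired, observed TrainingJobDefinition) bool {
	return naiveDiffers(withServerDefaults(desired), observed)
}

func main() {
	// The user omitted every optional flag in the spec.
	desired := TrainingJobDefinition{}
	// The server echoes back the defaults it applied.
	observed := TrainingJobDefinition{
		EnableNetworkIsolation:                boolPtr(false),
		EnableInterContainerTrafficEncryption: boolPtr(false),
		EnableManagedSpotTraining:             boolPtr(false),
	}

	fmt.Println("naive comparison reports a difference:", naiveDiffers(desired, observed))
	fmt.Println("defaulted comparison reports a difference:", differs(desired, observed))
}
```

In this sketch the naive comparison always sees a difference when a flag is omitted, which is the condition that would keep a reconciler requeuing. Applying the defaults to the desired side only removes that spurious difference while still detecting a genuine mismatch, such as a flag the server reports as `true`.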
CRD I used to test:
```
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: HyperParameterTuningJob
metadata:
  name: 2022-10-31-hpo-3
spec:
  hyperParameterTuningJobName: 2022-10-31-hpo-3
  hyperParameterTuningJobConfig:
    strategy: Bayesian
    resourceLimits:
      maxNumberOfTrainingJobs: 2
      maxParallelTrainingJobs: 1
    trainingJobEarlyStoppingType: Auto
  trainingJobDefinitions:
    - staticHyperParameters:
        base_score: '0.5'
      definitionName: training-job-for-hpo
      algorithmSpecification:
        trainingImage: 433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:1
        trainingInputMode: File
      roleARN: <arn>
      tuningObjective:
        type_: Minimize
        metricName: validation:error
      hyperParameterRanges:
        integerParameterRanges:
          - name: num_round
            minValue: '10'
            maxValue: '20'
            scalingType: Linear
        continuousParameterRanges:
          - name: gamma
            minValue: '0'
            maxValue: '5'
            scalingType: Linear
      inputDataConfig:
        - channelName: train
          dataSource:
            s3DataSource:
              s3DataType: S3Prefix
              s3URI: <train>
              s3DataDistributionType: FullyReplicated
          contentType: text/libsvm
          compressionType: None
          recordWrapperType: None
          inputMode: File
        - channelName: validation
          dataSource:
            s3DataSource:
              s3DataType: S3Prefix
              s3URI: <validation>
              s3DataDistributionType: FullyReplicated
          contentType: text/libsvm
          compressionType: None
          recordWrapperType: None
          inputMode: File
      outputDataConfig:
        s3OutputPath: <output>
      resourceConfig:
        instanceType: ml.m5.large
        instanceCount: 1
        volumeSizeInGB: 25
      stoppingCondition:
        maxRuntimeInSeconds: 3600
      enableNetworkIsolation: true
      enableInterContainerTrafficEncryption: false
  tags:
    - key: algorithm
      value: xgboost
    - key: environment
      value: testing
    - key: customer
      value: test-user
```
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.