Commit 60eb0bc

doc: Add docs for debugger job support in operator (#1367)
1 parent 7bc032a commit 60eb0bc

1 file changed: +129 -0 lines changed

doc/amazon_sagemaker_operators_for_kubernetes_jobs.rst

Lines changed: 129 additions & 0 deletions
@@ -304,6 +304,135 @@ job stops or completes.
continue to show on the Amazon SageMaker console. The delete command
takes about 2 minutes to clean up the resources from Amazon SageMaker.

SageMaker Debugger Jobs
^^^^^^^^^^^^^^^^^^^^^^^

When you create a SageMaker training job, you can optionally run
asynchronous debugger jobs for your model. SageMaker Debugger gives you
visibility into a training job by using a hook to capture tensors that
describe the state of the training process at each point in its lifecycle.
You can also define rules that analyze the captured tensors.
See `SageMaker Debugger Introduction <https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html>`__ and `How Debugger Works <https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-how-it-works.html>`__ for details.

To get more details about a debugger job, use the ``describe`` kubectl verb.
The output of describing a training job now includes a new field, ``Debug Rule Evaluation Statuses``.

::

    kubectl describe trainingjobs xgboost-mnist-debugger

    Name:         xgboost-mnist-debugger
    Namespace:    default
    Labels:       <none>
    Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                    {"apiVersion":"sagemaker.aws.amazon.com/v1","kind":"TrainingJob","metadata":{"annotations":{},"name":"xgboost-mnist-debugger","namespace":...
    API Version:  sagemaker.aws.amazon.com/v1
    Kind:         TrainingJob
    Metadata:
      Creation Timestamp:  2020-03-18T05:58:59Z
      Finalizers:
        sagemaker-operator-finalizer
      Generation:        2
      Resource Version:  2939388
      Self Link:         /apis/sagemaker.aws.amazon.com/v1/namespaces/default/trainingjobs/xgboost-mnist-debugger
      UID:               8fe3799e-68dd-11ea-8423-1260529a8dc9
    Spec:
      Algorithm Specification:
        Training Image:       246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:0.90-2-cpu-py3
        Training Input Mode:  File
      Debug Hook Config:
        Collection Configurations:
          Collection Name:  feature_importance
          Collection Parameters:
            Name:   save_interval
            Value:  5
          Collection Name:  losses
          Collection Parameters:
            Name:   save_interval
            Value:  500
          Collection Name:  average_shap
          Collection Parameters:
            Name:   save_interval
            Value:  5
          Collection Name:  metrics
          Collection Parameters:
            Name:   save_interval
            Value:  5
        s3OutputPath:  s3://my-bucket/sagemaker/xgboost-mnist/xgboost-debugger/
      Debug Rule Configurations:
        Rule Configuration Name:  LossNotDecreasing
        Rule Evaluator Image:     895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest
        Rule Parameters:
          Name:   collection_names
          Value:  metrics
          Name:   num_steps
          Value:  10
          Name:   rule_to_invoke
          Value:  LossNotDecreasing
      Hyper Parameters:
        Name:   max_depth
        Value:  5
        Name:   eta
        Value:  0.2
        Name:   gamma
        Value:  4
        Name:   min_child_weight
        Value:  6
        Name:   silent
        Value:  0
        Name:   objective
        Value:  reg:squarederror
        Name:   subsample
        Value:  0.7
        Name:   num_round
        Value:  51
      Input Data Config:
        Channel Name:      train
        Compression Type:  None
        Content Type:      libsvm
        Data Source:
          s3DataSource:
            s3DataDistributionType:  FullyReplicated
            s3DataType:              S3Prefix
            s3Uri:                   s3://my-bucket/sagemaker/xgboost-mnist/xgboost-debugger/train
        Channel Name:      validation
        Compression Type:  None
        Content Type:      libsvm
        Data Source:
          s3DataSource:
            s3DataDistributionType:  FullyReplicated
            s3DataType:              S3Prefix
            s3Uri:                   s3://my-bucket/sagemaker/xgboost-mnist/xgboost-debugger/validation
      Output Data Config:
        s3OutputPath:  s3://my-bucket/sagemaker/xgboost-mnist/xgboost-debugger/
      Region:          us-west-2
      Resource Config:
        Instance Count:     1
        Instance Type:      ml.m4.xlarge
        Volume Size In GB:  5
      Role Arn:             arn:aws:iam::1234567890:role/service-role/AmazonSageMaker-ExecutionRole
      Stopping Condition:
        Max Runtime In Seconds:  86400
      Tags:
        Key:              tagKey
        Value:            tagValue
      Training Job Name:  xgboost-mnist-debugger-8fe3799e68dd11ea84231260529a8dc9
    Status:
      Cloud Watch Log URL:  https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logStream:group=/aws/sagemaker/TrainingJobs;prefix=xgboost-mnist-debugger-8fe3799e68dd11ea84231260529a8dc9;streamFilter=typeLogStreamPrefix
      Debug Rule Evaluation Statuses:
        Last Modified Time:           2020-03-18T06:03:48Z
        Rule Configuration Name:      LossNotDecreasing
        Rule Evaluation Job Arn:      arn:aws:sagemaker:us-west-2:1234567890:processing-job/xgboost-mnist-debugger-8fe-lossnotdecreasing-a7d0eaf2
        Rule Evaluation Status:       NoIssuesFound
      Model Path:                     s3://my-bucket/sagemaker/xgboost-mnist-debugger-8fe3799e68dd11ea84231260529a8dc9/output/model.tar.gz
      Sage Maker Training Job Name:   xgboost-mnist-debugger-8fe3799e68dd11ea84231260529a8dc9
      Secondary Status:               Completed
      Training Job Status:            Completed
    Events:                           <none>

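The rule results can also be read programmatically instead of by eye. The following is a minimal Python sketch, not part of the operator. It assumes that ``kubectl get trainingjob <name> -o json`` exposes the status under camelCase keys such as ``debugRuleEvaluationStatuses``, ``ruleConfigurationName``, and ``ruleEvaluationStatus``. These key names are inferred from the describe labels and are not confirmed by this document. The function name ``unresolved_rules`` and the ``SAMPLE`` data are hypothetical.

```python
import json

# Trimmed stand-in for the JSON form of the TrainingJob resource.
# ASSUMPTION: the camelCase keys are inferred from the describe labels;
# verify them against the CRD installed in your cluster.
SAMPLE = json.loads("""
{
  "status": {
    "trainingJobStatus": "Completed",
    "debugRuleEvaluationStatuses": [
      {
        "ruleConfigurationName": "LossNotDecreasing",
        "ruleEvaluationStatus": "NoIssuesFound"
      }
    ]
  }
}
""")


def unresolved_rules(job):
    """Return {rule name: status} for every rule not reporting NoIssuesFound.

    Rules that are still running or that failed to evaluate are included,
    because they have not yet confirmed that training is healthy.
    """
    statuses = job.get("status", {}).get("debugRuleEvaluationStatuses") or []
    return {
        entry.get("ruleConfigurationName", "<unnamed>"): entry.get(
            "ruleEvaluationStatus", "Unknown"
        )
        for entry in statuses
        if entry.get("ruleEvaluationStatus") != "NoIssuesFound"
    }


if __name__ == "__main__":
    print(unresolved_rules(SAMPLE))
```

In practice the input could be produced by passing the output of ``kubectl get trainingjob xgboost-mnist-debugger -o json`` to ``json.loads``.
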
See `SageMaker Debugger Examples <https://github.com/aws/amazon-sagemaker-operator-for-k8s/tree/master/samples>`__ for more examples of debugger jobs.

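For orientation, the debugger-related part of a ``TrainingJob`` manifest matching the describe output might look like the fragment below. This is a sketch under assumptions: the key names are the describe labels rewritten in camelCase, and the list layout is inferred from the repeated ``Name``/``Value`` pairs. Check the samples linked above for the authoritative spec. Only two of the four collections are shown, and the non-debugger fields are elided.

```yaml
# Sketch only. ASSUMPTION: keys are the describe labels in camelCase.
apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  name: xgboost-mnist-debugger
spec:
  # Other required fields (algorithm, input data, resources, role) omitted.
  debugHookConfig:
    s3OutputPath: s3://my-bucket/sagemaker/xgboost-mnist/xgboost-debugger/
    collectionConfigurations:
      - collectionName: losses
        collectionParameters:
          - name: save_interval
            value: "500"
      - collectionName: metrics
        collectionParameters:
          - name: save_interval
            value: "5"
  debugRuleConfigurations:
    - ruleConfigurationName: LossNotDecreasing
      ruleEvaluatorImage: 895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest
      ruleParameters:
        - name: collection_names
          value: metrics
        - name: num_steps
          value: "10"
        - name: rule_to_invoke
          value: LossNotDecreasing
```

The hook configuration decides which tensor collections are saved and how often, while each rule configuration names the collections it analyzes. In this example the ``LossNotDecreasing`` rule reads the ``metrics`` collection.
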
HyperParameterTuningJobs operator
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
