|
4 | 4 | "cell_type": "markdown",
|
5 | 5 | "metadata": {},
|
6 | 6 | "source": [
|
7 |
| - "# Detect stalled training and stop training job using debugger rule\n", |
| 7 | + "# Detect Stalled Training and Stop Training Job Using SageMaker Debugger Rule\n", |
8 | 8 | " \n",
|
| 9 | + "This notebook shows you how to use the `StalledTrainingRule` built-in rule. This rule can take an action to stop your training job, when the rule detects an inactivity in your training job for a certain time period. This functionality helps you monitor the training job status and reduces redundant resource usage.\n", |
9 | 10 | "\n",
|
10 |
| - "In this notebook, we'll show you how you can use StalledTrainingRule rule which can take action like stopping your training job when it finds that there has been no update in training job for certain threshold duration.\n", |
| 11 | + "## How the StalledTrainingRule Built-in Rule Works\n", |
11 | 12 | "\n",
|
12 |
| - "## How does StalledTrainingRule works?\n", |
| 13 | + "Amazon Sagemaker Debugger captures tensors that you want to watch from training jobs on [AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers) or your local machine. If you use one of the Debugger-integrated Deep Learning Containers, you don't need to make any changes to your training script to use the functionality of built-in rules. For information about Debugger-supported SageMaker frameworks and versions, see [Debugger-supported framework versions for zero script change](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/sagemaker.md#zero-script-change). \n", |
13 | 14 | "\n",
|
14 |
| - "Amazon Sagemaker debugger automatically captures tensors from training job which use AWS DLC(tensorflow, pytorch, mxnet, xgboost)[refer doc for supported versions](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/sagemaker.md#zero-script-change). StalledTrainingRule keeps watching on emission of tensors like loss. The execution happens outside of training containers. It is evident that if training job is running good and is not stalled it is expected to emit loss and metrics tensors at frequent intervals. If Rule doesn't find new tensors being emitted from training job for threshold period of time, it takes automatic action to issue StopTrainingJob.\n", |
| 15 | + "If you want to run a training script that uses partially supported framework by Debugger or your own custom container, you need to manually register the Debugger hook to your training script. The `smdebug` library provides tools to help the hook registration, and the sample script provided in the `src` folder includes the hook registration code as comment lines. For more information about how to manually register the Debugger hooks for this case, see the training script at `./src/simple_stalled_training.py`, and documentation at [smdebug TensorFlow hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/tensorflow.md), [smdebug PyTorch hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/pytorch.md), [smdebug MXNet hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/mxnet.md), and [smdebug XGBoost hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/xgboost.md).\n", |
15 | 16 | "\n",
|
16 |
| - "#### With no changes to your training script\n", |
17 |
| - "If you use one of the SageMaker provided [Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html). [Refer doc for supported framework versions](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/sagemaker.md#zero-script-change), then you don't need to make any changes to your training script for activating this rule. Loss tensors will automatically be captured and monitored by the rule.\n", |
18 |
| - "\n", |
19 |
| - "You can also emit tensors periodically by using [save scalar api of hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api) . \n", |
20 |
| - "\n", |
21 |
| - "Also look at example how to use save_scalar api [here](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow2/scripts/tf_keras_fit_non_eager.py#L42)" |
| 17 | + "The Debugger `StalledTrainingRule` watches tensor updates from your training job. If the rule doesn't find new tensors updated to the default S3 URI for a threshold period of time, it takes an action to trigger the `StopTrainingJob` API operation. The following code cells set up a SageMaker TensorFlow estimator with the Debugger `StalledTrainingRule` to watch the `losses` pre-built tensor collection." |
22 | 18 | ]
|
23 | 19 | },
|
24 | 20 | {
|
25 |
| - "cell_type": "code", |
26 |
| - "execution_count": null, |
| 21 | + "cell_type": "markdown", |
27 | 22 | "metadata": {},
|
28 |
| - "outputs": [], |
29 | 23 | "source": [
|
30 |
| - "! pip install -q sagemaker" |
| 24 | + "### Import SageMaker Python SDK" |
31 | 25 | ]
|
32 | 26 | },
|
33 | 27 | {
|
|
36 | 30 | "metadata": {},
|
37 | 31 | "outputs": [],
|
38 | 32 | "source": [
|
39 |
| - "import boto3\n", |
40 |
| - "import os\n", |
41 | 33 | "import sagemaker\n",
|
42 | 34 | "from sagemaker.tensorflow import TensorFlow\n",
|
43 | 35 | "print(sagemaker.__version__)"
|
44 | 36 | ]
|
45 | 37 | },
|
46 | 38 | {
|
47 |
| - "cell_type": "code", |
48 |
| - "execution_count": null, |
| 39 | + "cell_type": "markdown", |
49 | 40 | "metadata": {},
|
50 |
| - "outputs": [], |
51 | 41 | "source": [
|
52 |
| - "from sagemaker.debugger import Rule, DebuggerHookConfig, TensorBoardOutputConfig, CollectionConfig\n", |
53 |
| - "import smdebug_rulesconfig as rule_configs" |
| 42 | + "### Import SageMaker Debugger classes for rule configuration" |
54 | 43 | ]
|
55 | 44 | },
|
56 | 45 | {
|
|
59 | 48 | "metadata": {},
|
60 | 49 | "outputs": [],
|
61 | 50 | "source": [
|
62 |
| - "# define the entrypoint script\n", |
63 |
| - "# Below script has 5 minutes sleep, we will create a stalledTrainingRule with 3 minutes of threshold.\n", |
64 |
| - "entrypoint_script='src/simple_stalled_training.py'\n", |
65 |
| - "\n", |
66 |
| - "# these hyperparameters ensure that vanishing gradient will trigger for our tensorflow mnist script\n", |
67 |
| - "hyperparameters = {\n", |
68 |
| - " \"num_epochs\": \"10\",\n", |
69 |
| - " \"lr\": \"10.00\"\n", |
70 |
| - "}" |
| 51 | + "from sagemaker.debugger import Rule, CollectionConfig, rule_configs" |
71 | 52 | ]
|
72 | 53 | },
|
73 | 54 | {
|
74 | 55 | "cell_type": "markdown",
|
75 | 56 | "metadata": {},
|
76 | 57 | "source": [
|
77 |
| - "### Create unique training job prefix\n", |
78 |
| - "We will create unique training job name prefix. this prefix would be passed to StalledTrainingRule to identify which training job, rule should take action on once the stalled training rule condition is met.\n", |
79 |
| - "Note that, this prefix needs to be unique. If rule doesn't find exactly one job with provided prefix, it will fallback to safe mode and not take action of stop training job. Rule will still emit a cloudwatch event if the rule condition is met. To see details about cloud watch event, check [here](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow_action_on_rule/tf-mnist-stop-training-job.ipynb). " |
| 58 | + "### Create a unique training job prefix\n", |
| 59 | + "A unique prefix must be specified for `StalledTrainingRule` to identify the exact training job name that you want to monitor and stop when the rule triggers the stalled training job issue.\n", |
| 60 | + "If there are multiple training jobs sharing the same prefix, this rule may react to other training jobs. If the rule cannot find the exact training job name with a provided prefix, it falls back to safe mode and does not stop the training job. The rule evaluation process goes on in parallel while the training jobs are running. If you want to access the rule job logs, you will later find how to get the information at [Get a direct Amazon CloudWatch URL to find the current rule processing job log](#cw-url).\n", |
| 61 | + "\n", |
| 62 | + "The following code cell includes:\n", |
| 63 | + "* a code line to create a unique `base_job_name_prefix`\n", |
| 64 | + "* a stalled training job rule configuration object\n", |
| 65 | + "* a SageMaker TensorFlow estimator configuration with the Debugger `rules` parameter to run the built-in rule\n", |
| 66 | + "\n", |
| 67 | + "**Note**: Debugger collects `loss` tensors by default every 500 steps." |
80 | 68 | ]
|
81 | 69 | },
|
82 | 70 | {
|
|
85 | 73 | "metadata": {},
|
86 | 74 | "outputs": [],
|
87 | 75 | "source": [
|
| 76 | + "# Append current time to your training job name to generate a unique base_job_name_prefix\n", |
88 | 77 | "import time\n",
|
89 |
| - "print(int(time.time()))\n", |
90 |
| - "# Note that sagemaker appends date to your training job and truncates the provided name to 39 character. So, we will make \n", |
91 |
| - "# sure that we use less than 39 character in below prefix. Appending time is to provide a unique id\n", |
92 | 78 | "base_job_name_prefix= 'smdebug-stalled-demo-' + str(int(time.time()))\n",
|
93 |
| - "base_job_name_prefix = base_job_name_prefix[:34]\n", |
94 |
| - "print(base_job_name_prefix)" |
| 79 | + "\n", |
| 80 | + "# Configure a StalledTrainingRule rule parameter object\n", |
| 81 | + "stalled_training_job_rule = [\n", |
| 82 | + " Rule.sagemaker(\n", |
| 83 | + " base_config=rule_configs.stalled_training_rule(),\n", |
| 84 | + " rule_parameters={\n", |
| 85 | + " \"threshold\": \"120\", \n", |
| 86 | + " \"stop_training_on_fire\": \"True\",\n", |
| 87 | + " \"training_job_name_prefix\": base_job_name_prefix\n", |
| 88 | + " }\n", |
| 89 | + " )\n", |
| 90 | + "]\n", |
| 91 | + "\n", |
| 92 | + "# Configure a SageMaker TensorFlow estimator\n", |
| 93 | + "estimator = TensorFlow(\n", |
| 94 | + " role=sagemaker.get_execution_role(),\n", |
| 95 | + " base_job_name=base_job_name_prefix,\n", |
| 96 | + " train_instance_count=1,\n", |
| 97 | + " train_instance_type='ml.m5.4xlarge',\n", |
| 98 | + " entry_point='src/simple_stalled_training.py', # This sample script forces the training job to sleep for 10 minutes\n", |
| 99 | + " framework_version='1.15.0',\n", |
| 100 | + " py_version='py3',\n", |
| 101 | + " train_max_run=3600,\n", |
| 102 | + " ## Debugger-specific parameter\n", |
| 103 | + " rules = stalled_training_job_rule\n", |
| 104 | + ")" |
95 | 105 | ]
|
96 | 106 | },
|
97 | 107 | {
|
|
100 | 110 | "metadata": {},
|
101 | 111 | "outputs": [],
|
102 | 112 | "source": [
|
103 |
| - "stalled_training_job_rule = Rule.sagemaker(\n", |
104 |
| - " base_config={\n", |
105 |
| - " 'DebugRuleConfiguration': {\n", |
106 |
| - " 'RuleConfigurationName': 'StalledTrainingRule', \n", |
107 |
| - " 'RuleParameters': {'rule_to_invoke': 'StalledTrainingRule'}\n", |
108 |
| - " }\n", |
109 |
| - " },\n", |
110 |
| - " rule_parameters={\n", |
111 |
| - " 'threshold': '120',\n", |
112 |
| - " 'training_job_name_prefix': base_job_name_prefix,\n", |
113 |
| - " 'stop_training_on_fire' : 'True'\n", |
114 |
| - " }, \n", |
115 |
| - ")" |
| 113 | + "estimator.fit(wait=False)" |
116 | 114 | ]
|
117 | 115 | },
|
118 | 116 | {
|
119 |
| - "cell_type": "code", |
120 |
| - "execution_count": null, |
| 117 | + "cell_type": "markdown", |
121 | 118 | "metadata": {},
|
122 |
| - "outputs": [], |
123 | 119 | "source": [
|
124 |
| - "estimator = TensorFlow(\n", |
125 |
| - " role=sagemaker.get_execution_role(),\n", |
126 |
| - " base_job_name=base_job_name_prefix,\n", |
127 |
| - " train_instance_count=1,\n", |
128 |
| - " train_instance_type='ml.m5.4xlarge',\n", |
129 |
| - " entry_point=entrypoint_script,\n", |
130 |
| - " #source_dir = 'src',\n", |
131 |
| - " framework_version='1.15.0',\n", |
132 |
| - " py_version='py3',\n", |
133 |
| - " train_max_run=3600,\n", |
134 |
| - " script_mode=True,\n", |
135 |
| - " ## New parameter\n", |
136 |
| - " rules = [stalled_training_job_rule]\n", |
137 |
| - ")\n" |
| 120 | + "## Monitoring Training and Rule Evaluation Status\n", |
| 121 | + "\n", |
| 122 | + "Once you execute the `estimator.fit()` API, SageMaker initiates a training job in the background, and Debugger initiates a `StalledTrainingRule` rule evaluation job in parallel.\n", |
| 123 | + "Because the training scripts has a few lines of code at the end to force a sleep mode for 10 minutes, the `RuleEvaluationStatus` for `StalledTrainingRule` will change to `IssuesFound` in 2 minutes after the sleep mode is on and trigger the `StopTrainingJob` API." |
| 124 | + ] |
| 125 | + }, |
| 126 | + { |
| 127 | + "cell_type": "markdown", |
| 128 | + "metadata": {}, |
| 129 | + "source": [ |
| 130 | + "### Print the training job name\n", |
| 131 | + "\n", |
| 132 | + "The following cell outputs the training job name and its training status running in the background." |
138 | 133 | ]
|
139 | 134 | },
|
140 | 135 | {
|
|
143 | 138 | "metadata": {},
|
144 | 139 | "outputs": [],
|
145 | 140 | "source": [
|
146 |
| - "# After calling fit, SageMaker will spin off 1 training job and 1 rule job for you\n", |
147 |
| - "# The rule evaluation status(es) will be visible in the training logs\n", |
148 |
| - "# at regular intervals\n", |
149 |
| - "# wait=False makes this a fire and forget function. To stream the logs in the notebook leave this out\n", |
| 141 | + "job_name = estimator.latest_training_job.name\n", |
| 142 | + "print('Training job name: {}'.format(job_name))\n", |
150 | 143 | "\n",
|
151 |
| - "estimator.fit(wait=True)" |
| 144 | + "client = estimator.sagemaker_session.sagemaker_client\n", |
| 145 | + "\n", |
| 146 | + "description = client.describe_training_job(TrainingJobName=job_name)" |
152 | 147 | ]
|
153 | 148 | },
|
154 | 149 | {
|
155 | 150 | "cell_type": "markdown",
|
156 | 151 | "metadata": {},
|
157 | 152 | "source": [
|
158 |
| - "## Monitoring\n", |
| 153 | + "### Output the current job status and the rule evaluation status\n", |
159 | 154 | "\n",
|
160 |
| - "SageMaker kicked off rule evaluation job `StalledTrainingRule` as specified in the estimator. \n", |
161 |
| - "Given that we've stalled our training script for 10 minutes such that `StalledTrainingRule` is bound to fire and take action StopTrainingJob, we should expect to see the `TrainingJobStatus` as\n", |
162 |
| - "`Stopped` once the `RuleEvaluationStatus` for `StalledTrainingRule` changes to `IssuesFound`" |
| 155 | + "The following cell tracks the status of training job until the `SecondaryStatus` changes to `Stopped` or `Completed`. While training, Debugger collects output tensors from the training job and monitors the training job with the rules. " |
163 | 156 | ]
|
164 | 157 | },
|
165 | 158 | {
|
|
168 | 161 | "metadata": {},
|
169 | 162 | "outputs": [],
|
170 | 163 | "source": [
|
171 |
| - "# rule job summary gives you the summary of the rule evaluations. You might have to run it over \n", |
172 |
| - "# a few times before you start to see all values populated/changing\n", |
173 |
| - "estimator.latest_training_job.rule_job_summary()" |
| 164 | + "import time\n", |
| 165 | + "\n", |
| 166 | + "if description['TrainingJobStatus'] != 'Completed':\n", |
| 167 | + " while description['SecondaryStatus'] not in {'Stopped', 'Completed'}:\n", |
| 168 | + " description = client.describe_training_job(TrainingJobName=job_name)\n", |
| 169 | + " primary_status = description['TrainingJobStatus']\n", |
| 170 | + " secondary_status = description['SecondaryStatus']\n", |
| 171 | + " print('Current job status: [PrimaryStatus: {}, SecondaryStatus: {}] | {} Rule Evaluation Status: {}'\n", |
| 172 | + " .format(primary_status, secondary_status, \n", |
| 173 | + " estimator.latest_training_job.rule_job_summary()[0][\"RuleConfigurationName\"],\n", |
| 174 | + " estimator.latest_training_job.rule_job_summary()[0][\"RuleEvaluationStatus\"]\n", |
| 175 | + " )\n", |
| 176 | + " )\n", |
| 177 | + " time.sleep(15)" |
| 178 | + ] |
| 179 | + }, |
| 180 | + { |
| 181 | + "cell_type": "markdown", |
| 182 | + "metadata": {}, |
| 183 | + "source": [ |
| 184 | + "<a class=\"anchor\" id=\"cw-url\"></a>\n", |
| 185 | + "### Get a direct Amazon CloudWatch URL to find the current rule processing job log\n", |
| 186 | + "\n", |
| 187 | + "The following script returns a CloudWatch URL. Copy the URL and Paste it to a browser. This will directly lead you to the rule job log page." |
174 | 188 | ]
|
175 | 189 | },
|
176 | 190 | {
|
|
203 | 217 | " result[status[\"RuleConfigurationName\"]] = _get_cw_url_for_rule_job(rule_job_name, region)\n",
|
204 | 218 | " return result\n",
|
205 | 219 | "\n",
|
206 |
| - "get_rule_jobs_cw_urls(estimator)" |
| 220 | + "print(\n", |
| 221 | + " \"The direct CloudWatch URL to the current rule job:\", \n", |
| 222 | + " get_rule_jobs_cw_urls(estimator)[estimator.latest_training_job.rule_job_summary()[0][\"RuleConfigurationName\"]]\n", |
| 223 | + ")" |
207 | 224 | ]
|
208 | 225 | },
|
209 | 226 | {
|
210 | 227 | "cell_type": "markdown",
|
211 | 228 | "metadata": {},
|
212 | 229 | "source": [
|
213 |
| - "After running the last two cells over and until `VanishingGradient` reports `IssuesFound`, we'll attempt to describe the `TrainingJobStatus` for our training job." |
| 230 | + "## Conclusion\n", |
| 231 | + "\n", |
| 232 | + "This notebook showed how you can use the Debugger `StalledTrainingRule` built-in rule for your training job to take action on rule evaluation status changes. To find more information about Debugger, see [Amazon SageMaker Debugger Developer Guide](https://integ-docs-aws.amazon.com/sagemaker/latest/dg/train-debugger.html) and the [smdebug GitHub documentation](https://github.com/awslabs/sagemaker-debugger)." |
214 | 233 | ]
|
215 | 234 | },
|
216 | 235 | {
|
217 | 236 | "cell_type": "code",
|
218 | 237 | "execution_count": null,
|
219 | 238 | "metadata": {},
|
220 | 239 | "outputs": [],
|
221 |
| - "source": [ |
222 |
| - "estimator.latest_training_job.describe()[\"TrainingJobStatus\"]" |
223 |
| - ] |
224 |
| - }, |
225 |
| - { |
226 |
| - "cell_type": "markdown", |
227 |
| - "metadata": {}, |
228 |
| - "source": [ |
229 |
| - "## Result\n", |
230 |
| - "\n", |
231 |
| - "This notebook attempted to show a very simple setup of how you can use CloudWatch events for your training job to take action on rule evaluation status changes. Learn more about Amazon SageMaker Debugger in the [GitHub Documentation](https://github.com/awslabs/sagemaker-debugger)." |
232 |
| - ] |
| 240 | + "source": [] |
233 | 241 | }
|
234 | 242 | ],
|
235 | 243 | "metadata": {
|
236 | 244 | "kernelspec": {
|
237 |
| - "display_name": "Python 3", |
| 245 | + "display_name": "conda_tensorflow_p36", |
238 | 246 | "language": "python",
|
239 |
| - "name": "python3" |
| 247 | + "name": "conda_tensorflow_p36" |
240 | 248 | },
|
241 | 249 | "language_info": {
|
242 | 250 | "codemirror_mode": {
|
|