
Commit 3773d3b (parent f2460dc)

Authored by alexott, gaborratky-db and mgyucht

Late jobs support (aka health conditions) in `databricks_job` resource (#2496)

Added support for the `health` block that is used to detect late jobs in the `databricks_job` resource. This PR also includes the following changes:

* Added `on_duration_warning_threshold_exceeded` attribute to email & webhook notifications (needed for late jobs support)
* Added `notification_settings` on a task level & use jobs & task notification structs from the Go SDK
* Reorganized documentation for the `task` block as it's getting more & more attributes
* Addressed review comments, added a list of tasks, and applied further review changes

Co-authored-by: Gabor Ratky <[email protected]>
Co-authored-by: Miles Yucht <[email protected]>

File tree

3 files changed: +132 −39 lines changed

docs/resources/job.md

Lines changed: 55 additions & 7 deletions
@@ -10,7 +10,7 @@ The `databricks_job` resource allows you to manage [Databricks Jobs](https://doc
 
 -> **Note** In Terraform configuration, it is recommended to define tasks in alphabetical order of their `task_key` arguments, so that you get consistent and readable diff. Whenever tasks are added or removed, or `task_key` is renamed, you'll observe a change in the majority of tasks. It's related to the fact that the current version of the provider treats `task` blocks as an ordered list. Alternatively, `task` block could have been an unordered set, though end-users would see the entire block replaced upon a change in single property of the task.
 
-It is possible to create [a Databricks job](https://docs.databricks.com/data-engineering/jobs/jobs-user-guide.html) using `task` blocks. Single task is defined with the `task` block containing one of the `*_task` block, `task_key`, `libraries`, `email_notifications`, `timeout_seconds`, `max_retries`, `min_retry_interval_millis`, `retry_on_timeout` attributes and `depends_on` blocks to define cross-task dependencies.
+It is possible to create [a Databricks job](https://docs.databricks.com/data-engineering/jobs/jobs-user-guide.html) using `task` blocks. A single task is defined with the `task` block containing one of the `*_task` blocks, `task_key`, and additional arguments described below.
 
 ```hcl
 resource "databricks_job" "this" {
@@ -88,13 +88,44 @@ The resource supports the following arguments:
 ```
 * `library` - (Optional) (Set) An optional list of libraries to be installed on the cluster that will execute the job. Please consult [libraries section](cluster.md#libraries) for [databricks_cluster](cluster.md) resource.
 * `retry_on_timeout` - (Optional) (Bool) An optional policy to specify whether to retry a job when it times out. The default behavior is to not retry on timeout.
-* `max_retries` - (Optional) (Integer) An optional maximum number of times to retry an unsuccessful run. A run is considered to be unsuccessful if it completes with a FAILED or INTERNAL_ERROR lifecycle state. The value -1 means to retry indefinitely and the value 0 means to never retry. The default behavior is to never retry. A run can have the following lifecycle state: PENDING, RUNNING, TERMINATING, TERMINATED, SKIPPED or INTERNAL_ERROR
+* `max_retries` - (Optional) (Integer) An optional maximum number of times to retry an unsuccessful run. A run is considered to be unsuccessful if it completes with a `FAILED` or `INTERNAL_ERROR` lifecycle state. The value -1 means to retry indefinitely and the value 0 means to never retry. The default behavior is to never retry.
 * `timeout_seconds` - (Optional) (Integer) An optional timeout applied to each run of this job. The default behavior is to have no timeout.
 * `min_retry_interval_millis` - (Optional) (Integer) An optional minimal interval in milliseconds between the start of the failed run and the subsequent retry run. The default behavior is that unsuccessful runs are immediately retried.
 * `max_concurrent_runs` - (Optional) (Integer) An optional maximum allowed number of concurrent runs of the job. Defaults to *1*.
-* `email_notifications` - (Optional) (List) An optional set of email addresses notified when runs of this job begins, completes and fails. The default behavior is to not send any emails. This field is a block and is documented below.
+* `email_notifications` - (Optional) (List) An optional set of email addresses notified when runs of this job begin, complete or fail. The default behavior is to not send any emails. This field is a block and is [documented below](#email_notifications-configuration-block).
 * `webhook_notifications` - (Optional) (List) An optional set of system destinations (for example, webhook destinations or Slack) to be notified when runs of this job begins, completes and fails. The default behavior is to not send any notifications. This field is a block and is documented below.
+* `notification_settings` - (Optional) An optional block controlling the notification settings on the job level (described below).
 * `schedule` - (Optional) (List) An optional periodic schedule for this job. The default behavior is that the job runs when triggered by clicking Run Now in the Jobs UI or sending an API request to runNow. This field is a block and is documented below.
+* `health` - (Optional) An optional block that specifies the health conditions for the job (described below).
+
+### task Configuration Block
+
+This block describes individual tasks:
+
+* `task_key` - (Required) string specifying a unique key for a given task.
+* `*_task` - (Required) one of the specific task blocks described below:
+  * `dbt_task`
+  * `notebook_task`
+  * `pipeline_task`
+  * `python_wheel_task`
+  * `spark_jar_task`
+  * `spark_python_task`
+  * `spark_submit_task`
+  * `sql_task`
+* `library` - (Optional) (Set) An optional list of libraries to be installed on the cluster that will execute the job. Please consult [libraries section](cluster.md#libraries) for [databricks_cluster](cluster.md) resource.
+* `depends_on` - (Optional) block specifying dependency(-ies) for a given task.
+* `retry_on_timeout` - (Optional) (Bool) An optional policy to specify whether to retry a job when it times out. The default behavior is to not retry on timeout.
+* `max_retries` - (Optional) (Integer) An optional maximum number of times to retry an unsuccessful run. A run is considered to be unsuccessful if it completes with a `FAILED` or `INTERNAL_ERROR` lifecycle state. The value -1 means to retry indefinitely and the value 0 means to never retry. The default behavior is to never retry. A run can have the following lifecycle states: `PENDING`, `RUNNING`, `TERMINATING`, `TERMINATED`, `SKIPPED` or `INTERNAL_ERROR`.
+* `timeout_seconds` - (Optional) (Integer) An optional timeout applied to each run of this job. The default behavior is to have no timeout.
+* `min_retry_interval_millis` - (Optional) (Integer) An optional minimal interval in milliseconds between the start of the failed run and the subsequent retry run. The default behavior is that unsuccessful runs are immediately retried.
+* `email_notifications` - (Optional) (List) An optional set of email addresses notified when runs of this task begin, complete or fail. The default behavior is to not send any emails. This field is a block and is [documented below](#email_notifications-configuration-block).
+* `health` - (Optional) block described below that specifies health conditions for a given task.
+
+### depends_on Configuration Block
+
+This block describes dependencies of a given task:
+
+* `task_key` - (Required) The name of the task this task depends on.
 
 ### tags Configuration Map
 `tags` - (Optional) (Map) An optional map of the tags associated with the job. Specified tags will be used as cluster tags for job clusters.
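The arguments documented above can be combined as in the following sketch of a two-task job. This is an illustration only, not taken from this PR: the resource name, job name, notebook paths and the one-hour threshold are all hypothetical.

```hcl
resource "databricks_job" "example" {
  name = "nightly-etl" # hypothetical job name

  # Job-level late-run detection: warn when any run exceeds one hour.
  health {
    rules {
      metric = "RUN_DURATION_SECONDS"
      op     = "GREATER_THAN"
      value  = 3600
    }
  }

  task {
    task_key = "extract"
    notebook_task {
      notebook_path = "/Jobs/extract" # hypothetical path
    }
  }

  task {
    task_key = "transform"
    # Cross-task dependency via the depends_on block.
    depends_on {
      task_key = "extract"
    }
    notebook_task {
      notebook_path = "/Jobs/transform" # hypothetical path
    }
  }
}
```

Per the note at the top of the document, the tasks are listed in alphabetical order of their `task_key` values to keep plan diffs readable.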
@@ -130,8 +161,6 @@ resource "databricks_job" "this" {
 }
 ```
 
-
-
 ### job_cluster Configuration Block
 
 [Shared job cluster](https://docs.databricks.com/jobs.html#use-shared-job-clusters) specification. Allows multiple tasks in the same job run to reuse the cluster.
@@ -172,6 +201,7 @@ This block is used to specify Git repository information & branch/tag/commit tha
 * `on_start` - (Optional) (List) list of emails to notify when the run starts.
 * `on_success` - (Optional) (List) list of emails to notify when the run completes successfully.
 * `on_failure` - (Optional) (List) list of emails to notify when the run fails.
+* `on_duration_warning_threshold_exceeded` - (Optional) (List) list of emails to notify when the duration of a run exceeds the threshold specified by the `RUN_DURATION_SECONDS` metric in the `health` block.
 * `no_alert_for_skipped_runs` - (Optional) (Bool) don't send alert for skipped runs. (It's recommended to use the corresponding setting in the `notification_settings` configuration block).
 
 ### webhook_notifications Configuration Block
@@ -181,6 +211,7 @@ Each entry in `webhook_notification` block takes a list `webhook` blocks. The fi
 * `on_start` - (Optional) (List) list of notification IDs to call when the run starts. A maximum of 3 destinations can be specified.
 * `on_success` - (Optional) (List) list of notification IDs to call when the run completes successfully. A maximum of 3 destinations can be specified.
 * `on_failure` - (Optional) (List) list of notification IDs to call when the run fails. A maximum of 3 destinations can be specified.
+* `on_duration_warning_threshold_exceeded` - (Optional) (List) list of notification IDs to call when the duration of a run exceeds the threshold specified by the `RUN_DURATION_SECONDS` metric in the `health` block.
 
 Note that the `id` is not to be confused with the name of the alert destination. The `id` can be retrieved through the API or the URL of Databricks UI `https://<workspace host>/sql/destinations/<notification id>?o=<workspace id>`
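As a sketch of the new attribute in use (the email address and the notification ID below are placeholders, not values from this PR), both notification blocks can reference the late-run threshold:

```hcl
email_notifications {
  on_failure                             = ["[email protected]"]
  on_duration_warning_threshold_exceeded = ["[email protected]"]
}

webhook_notifications {
  on_duration_warning_threshold_exceeded {
    id = "00000000-0000-0000-0000-000000000000" # placeholder notification ID
  }
}
```

These notifications only fire if the job (or task) also defines a `health` block with a `RUN_DURATION_SECONDS` rule.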

@@ -200,13 +231,30 @@ webhook_notifications {
 
 -> **Note** The following configuration blocks can be standalone or nested inside a `task` block
 
-### notification_settings Configuration Block
+### notification_settings Configuration Block (Job Level)
 
-This block controls notification settings for both email & webhook notifications:
+This block controls notification settings for both email & webhook notifications on a job level:
 
 * `no_alert_for_skipped_runs` - (Optional) (Bool) don't send alert for skipped runs.
 * `no_alert_for_canceled_runs` - (Optional) (Bool) don't send alert for cancelled runs.
 
+### notification_settings Configuration Block (Task Level)
+
+This block controls notification settings for both email & webhook notifications on a task level:
+
+* `no_alert_for_skipped_runs` - (Optional) (Bool) don't send alert for skipped runs.
+* `no_alert_for_canceled_runs` - (Optional) (Bool) don't send alert for cancelled runs.
+* `alert_on_last_attempt` - (Optional) (Bool) do not send notifications to recipients specified in `on_start` for the retried runs and do not send notifications to recipients specified in `on_failure` until the last retry of the run.
+
+### health Configuration Block
+
+This block describes health conditions for a given job or an individual task. It consists of the following attributes:
+
+* `rules` - (List) list of rules that are represented as objects with the following attributes:
+  * `metric` - (Optional) string specifying the metric to check. The only supported metric is `RUN_DURATION_SECONDS` (check [Jobs REST API documentation](https://docs.databricks.com/api/workspace/jobs/create) for the latest information).
+  * `op` - (Optional) string specifying the operation used to evaluate the given metric. The only supported operation is `GREATER_THAN`.
+  * `value` - (Optional) integer value used to compare to the given metric.
+
 ### spark_jar_task Configuration Block
 
 * `parameters` - (Optional) (List) Parameters passed to the main method.
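Since `health` and `notification_settings` can be standalone or nested inside a `task` block, a task-level sketch might look like the following; the task key and the 30-minute threshold are illustrative, not from this PR:

```hcl
task {
  task_key = "long_running_step" # hypothetical task

  # Task-level late-run detection: warn after 30 minutes.
  health {
    rules {
      metric = "RUN_DURATION_SECONDS"
      op     = "GREATER_THAN"
      value  = 1800
    }
  }

  # Task-level notification settings, including the task-only attribute.
  notification_settings {
    no_alert_for_skipped_runs = true
    alert_on_last_attempt     = true
  }
}
```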

jobs/resource_job.go

Lines changed: 38 additions & 29 deletions
@@ -114,24 +114,20 @@ type DbtTask struct {
 
 // EmailNotifications contains the information for email notifications after job or task run start or completion
 type EmailNotifications struct {
-	OnStart               []string `json:"on_start,omitempty"`
-	OnSuccess             []string `json:"on_success,omitempty"`
-	OnFailure             []string `json:"on_failure,omitempty"`
-	NoAlertForSkippedRuns bool     `json:"no_alert_for_skipped_runs,omitempty"`
-	AlertOnLastAttempt    bool     `json:"alert_on_last_attempt,omitempty"`
+	OnStart                            []string `json:"on_start,omitempty"`
+	OnSuccess                          []string `json:"on_success,omitempty"`
+	OnFailure                          []string `json:"on_failure,omitempty"`
+	OnDurationWarningThresholdExceeded []string `json:"on_duration_warning_threshold_exceeded,omitempty"`
+	NoAlertForSkippedRuns              bool     `json:"no_alert_for_skipped_runs,omitempty"`
+	AlertOnLastAttempt                 bool     `json:"alert_on_last_attempt,omitempty"`
 }
 
 // WebhookNotifications contains the information for webhook notifications sent after job start or completion.
 type WebhookNotifications struct {
-	OnStart   []Webhook `json:"on_start,omitempty"`
-	OnSuccess []Webhook `json:"on_success,omitempty"`
-	OnFailure []Webhook `json:"on_failure,omitempty"`
-}
-
-// NotificationSettings control the notification settings for a job
-type NotificationSettings struct {
-	NoAlertForSkippedRuns  bool `json:"no_alert_for_skipped_runs,omitempty"`
-	NoAlertForCanceledRuns bool `json:"no_alert_for_canceled_runs,omitempty"`
+	OnStart                            []Webhook `json:"on_start,omitempty"`
+	OnSuccess                          []Webhook `json:"on_success,omitempty"`
+	OnFailure                          []Webhook `json:"on_failure,omitempty"`
+	OnDurationWarningThresholdExceeded []Webhook `json:"on_duration_warning_threshold_exceeded,omitempty"`
 }
 
 func (wn *WebhookNotifications) Sort() {
@@ -171,6 +167,16 @@ type GitSource struct {
 
 // End Jobs + Repo integration preview
 
+type JobHealthRule struct {
+	Metric    string `json:"metric,omitempty"`
+	Operation string `json:"op,omitempty"`
+	Value     int32  `json:"value,omitempty"`
+}
+
+type JobHealth struct {
+	Rules []JobHealthRule `json:"rules"`
+}
+
 type JobTaskSettings struct {
 	TaskKey     string `json:"task_key,omitempty"`
 	Description string `json:"description,omitempty"`
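Given the struct tags above, a populated `JobHealth` should serialize into the Jobs API payload as a fragment of this shape (assuming standard `encoding/json` behavior; the threshold value is illustrative):

```json
"health": {
  "rules": [
    {
      "metric": "RUN_DURATION_SECONDS",
      "op": "GREATER_THAN",
      "value": 3600
    }
  ]
}
```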
@@ -198,11 +204,13 @@ type JobTaskSettings struct {
 	// ConditionTask is in private preview
 	ConditionTask *jobs.ConditionTask `json:"condition_task,omitempty" tf:"group:task_type"`
 
-	EmailNotifications     *EmailNotifications `json:"email_notifications,omitempty" tf:"suppress_diff"`
-	TimeoutSeconds         int32               `json:"timeout_seconds,omitempty"`
-	MaxRetries             int32               `json:"max_retries,omitempty"`
-	MinRetryIntervalMillis int32               `json:"min_retry_interval_millis,omitempty"`
-	RetryOnTimeout         bool                `json:"retry_on_timeout,omitempty" tf:"computed"`
+	EmailNotifications     *EmailNotifications            `json:"email_notifications,omitempty" tf:"suppress_diff"`
+	NotificationSettings   *jobs.TaskNotificationSettings `json:"notification_settings,omitempty"`
+	TimeoutSeconds         int32                          `json:"timeout_seconds,omitempty"`
+	MaxRetries             int32                          `json:"max_retries,omitempty"`
+	MinRetryIntervalMillis int32                          `json:"min_retry_interval_millis,omitempty"`
+	RetryOnTimeout         bool                           `json:"retry_on_timeout,omitempty" tf:"computed"`
+	Health                 *JobHealth                     `json:"health,omitempty"`
 }
 
 type JobCluster struct {
@@ -270,16 +278,17 @@ type JobSettings struct {
 	GitSource *GitSource `json:"git_source,omitempty"`
 	// END Jobs + Repo integration preview
 
-	Schedule             *CronSchedule         `json:"schedule,omitempty"`
-	Continuous           *ContinuousConf       `json:"continuous,omitempty"`
-	Trigger              *Trigger              `json:"trigger,omitempty"`
-	MaxConcurrentRuns    int32                 `json:"max_concurrent_runs,omitempty"`
-	EmailNotifications   *EmailNotifications   `json:"email_notifications,omitempty" tf:"suppress_diff"`
-	WebhookNotifications *WebhookNotifications `json:"webhook_notifications,omitempty" tf:"suppress_diff"`
-	NotificationSettings *NotificationSettings `json:"notification_settings,omitempty"`
-	Tags                 map[string]string     `json:"tags,omitempty"`
-	Queue                *Queue                `json:"queue,omitempty"`
-	RunAs                *JobRunAs             `json:"run_as,omitempty"`
+	Schedule             *CronSchedule                 `json:"schedule,omitempty"`
+	Continuous           *ContinuousConf               `json:"continuous,omitempty"`
+	Trigger              *Trigger                      `json:"trigger,omitempty"`
+	MaxConcurrentRuns    int32                         `json:"max_concurrent_runs,omitempty"`
+	EmailNotifications   *EmailNotifications           `json:"email_notifications,omitempty" tf:"suppress_diff"`
+	WebhookNotifications *WebhookNotifications         `json:"webhook_notifications,omitempty" tf:"suppress_diff"`
+	NotificationSettings *jobs.JobNotificationSettings `json:"notification_settings,omitempty"`
+	Tags                 map[string]string             `json:"tags,omitempty"`
+	Queue                *Queue                        `json:"queue,omitempty"`
+	RunAs                *JobRunAs                     `json:"run_as,omitempty"`
+	Health               *JobHealth                    `json:"health,omitempty"`
 }
 
 func (js *JobSettings) isMultiTask() bool {

jobs/resource_job_test.go

Lines changed: 39 additions & 3 deletions
@@ -14,6 +14,7 @@ import (
 	"github.com/databricks/terraform-provider-databricks/common"
 	"github.com/databricks/terraform-provider-databricks/libraries"
 	"github.com/databricks/terraform-provider-databricks/qa"
+
 	"github.com/stretchr/testify/assert"
 	"github.com/stretchr/testify/require"
 )
@@ -141,6 +142,15 @@ func TestResourceJobCreate_MultiTask(t *testing.T) {
 					SparkJarTask: &SparkJarTask{
 						MainClassName: "com.labs.BarMain",
 					},
+					Health: &JobHealth{
+						Rules: []JobHealthRule{
+							{
+								Metric:    "RUN_DURATION_SECONDS",
+								Operation: "GREATER_THAN",
+								Value:     3600,
+							},
+						},
+					},
 				},
 				{
 					TaskKey: "b",
@@ -158,6 +168,15 @@ func TestResourceJobCreate_MultiTask(t *testing.T) {
 				},
 			},
 			MaxConcurrentRuns: 1,
+			Health: &JobHealth{
+				Rules: []JobHealthRule{
+					{
+						Metric:    "RUN_DURATION_SECONDS",
+						Operation: "GREATER_THAN",
+						Value:     3600,
+					},
+				},
+			},
 		},
 		Response: Job{
 			JobID: 789,
@@ -185,7 +204,15 @@ func TestResourceJobCreate_MultiTask(t *testing.T) {
 		Resource: ResourceJob(),
 		HCL: `
 		name = "Featurizer"
-
+
+		health {
+			rules {
+				metric = "RUN_DURATION_SECONDS"
+				op = "GREATER_THAN"
+				value = 3600
+			}
+		}
+
 		task {
 			task_key = "a"
@@ -198,6 +225,15 @@ func TestResourceJobCreate_MultiTask(t *testing.T) {
 			library {
 				jar = "dbfs://aa/bb/cc.jar"
 			}
+
+			health {
+				rules {
+					metric = "RUN_DURATION_SECONDS"
+					op = "GREATER_THAN"
+					value = 3600
+				}
+			}
+
 		}
 
 		task {
@@ -983,7 +1019,7 @@ func TestResourceJobCreateWithWebhooks(t *testing.T) {
 			OnSuccess: []Webhook{{ID: "id2"}},
 			OnFailure: []Webhook{{ID: "id3"}},
 		},
-		NotificationSettings: &NotificationSettings{
+		NotificationSettings: &jobs.JobNotificationSettings{
 			NoAlertForSkippedRuns:  true,
 			NoAlertForCanceledRuns: true,
 		},
@@ -1014,7 +1050,7 @@ func TestResourceJobCreateWithWebhooks(t *testing.T) {
 			OnSuccess: []Webhook{{ID: "id2"}},
 			OnFailure: []Webhook{{ID: "id3"}},
 		},
-		NotificationSettings: &NotificationSettings{
+		NotificationSettings: &jobs.JobNotificationSettings{
 			NoAlertForSkippedRuns:  true,
 			NoAlertForCanceledRuns: true,
 		},
