
Commit 3773d3b (parent f2460dc)

Authored by alexott, gaborratky-db and mgyucht

Late jobs support (aka health conditions) in `databricks_job` resource (#2496)

Added support for the `health` block that is used to detect late jobs in the `databricks_job` resource. This PR also includes the following changes:

* Added `on_duration_warning_threshold_exceeded` attribute to email & webhook notifications (needed for late jobs support)
* Added `notification_settings` on a task level & use jobs & task notification structs from the Go SDK
* Reorganized documentation for the `task` block as it's getting more & more attributes
* Addressed review comments, added a list of tasks, and applied further review changes

Co-authored-by: Gabor Ratky <[email protected]>
Co-authored-by: Miles Yucht <[email protected]>

File tree

3 files changed: +132 −39 lines changed

docs/resources/job.md

Lines changed: 55 additions & 7 deletions
@@ -10,7 +10,7 @@ The `databricks_job` resource allows you to manage [Databricks Jobs](https://doc
 
 -> **Note** In Terraform configuration, it is recommended to define tasks in alphabetical order of their `task_key` arguments, so that you get consistent and readable diff. Whenever tasks are added or removed, or `task_key` is renamed, you'll observe a change in the majority of tasks. It's related to the fact that the current version of the provider treats `task` blocks as an ordered list. Alternatively, `task` block could have been an unordered set, though end-users would see the entire block replaced upon a change in single property of the task.
 
-It is possible to create [a Databricks job](https://docs.databricks.com/data-engineering/jobs/jobs-user-guide.html) using `task` blocks. Single task is defined with the `task` block containing one of the `*_task` block, `task_key`, `libraries`, `email_notifications`, `timeout_seconds`, `max_retries`, `min_retry_interval_millis`, `retry_on_timeout` attributes and `depends_on` blocks to define cross-task dependencies.
+It is possible to create [a Databricks job](https://docs.databricks.com/data-engineering/jobs/jobs-user-guide.html) using `task` blocks. A single task is defined with the `task` block containing one of the `*_task` blocks, `task_key`, and additional arguments described below.
 
 ```hcl
 resource "databricks_job" "this" {
@@ -88,13 +88,44 @@ The resource supports the following arguments:
 ```
 * `library` - (Optional) (Set) An optional list of libraries to be installed on the cluster that will execute the job. Please consult [libraries section](cluster.md#libraries) for [databricks_cluster](cluster.md) resource.
 * `retry_on_timeout` - (Optional) (Bool) An optional policy to specify whether to retry a job when it times out. The default behavior is to not retry on timeout.
-* `max_retries` - (Optional) (Integer) An optional maximum number of times to retry an unsuccessful run. A run is considered to be unsuccessful if it completes with a FAILED or INTERNAL_ERROR lifecycle state. The value -1 means to retry indefinitely and the value 0 means to never retry. The default behavior is to never retry. A run can have the following lifecycle state: PENDING, RUNNING, TERMINATING, TERMINATED, SKIPPED or INTERNAL_ERROR
+* `max_retries` - (Optional) (Integer) An optional maximum number of times to retry an unsuccessful run. A run is considered to be unsuccessful if it completes with a `FAILED` or `INTERNAL_ERROR` lifecycle state. The value -1 means to retry indefinitely and the value 0 means to never retry. The default behavior is to never retry.
 * `timeout_seconds` - (Optional) (Integer) An optional timeout applied to each run of this job. The default behavior is to have no timeout.
 * `min_retry_interval_millis` - (Optional) (Integer) An optional minimal interval in milliseconds between the start of the failed run and the subsequent retry run. The default behavior is that unsuccessful runs are immediately retried.
 * `max_concurrent_runs` - (Optional) (Integer) An optional maximum allowed number of concurrent runs of the job. Defaults to *1*.
-* `email_notifications` - (Optional) (List) An optional set of email addresses notified when runs of this job begins, completes and fails. The default behavior is to not send any emails. This field is a block and is documented below.
+* `email_notifications` - (Optional) (List) An optional set of email addresses notified when runs of this job begin, complete or fail. The default behavior is to not send any emails. This field is a block and is [documented below](#email_notifications-configuration-block).
 * `webhook_notifications` - (Optional) (List) An optional set of system destinations (for example, webhook destinations or Slack) to be notified when runs of this job begins, completes and fails. The default behavior is to not send any notifications. This field is a block and is documented below.
+* `notification_settings` - (Optional) An optional block controlling the notification settings on the job level (described below).
 * `schedule` - (Optional) (List) An optional periodic schedule for this job. The default behavior is that the job runs when triggered by clicking Run Now in the Jobs UI or sending an API request to runNow. This field is a block and is documented below.
+* `health` - (Optional) An optional block that specifies the health conditions for the job (described below).
+
+### task Configuration Block
+
+This block describes individual tasks:
+
+* `task_key` - (Required) string specifying a unique key for a given task.
+* `*_task` - (Required) one of the specific task blocks described below:
+  * `dbt_task`
+  * `notebook_task`
+  * `pipeline_task`
+  * `python_wheel_task`
+  * `spark_jar_task`
+  * `spark_python_task`
+  * `spark_submit_task`
+  * `sql_task`
+* `library` - (Optional) (Set) An optional list of libraries to be installed on the cluster that will execute the job. Please consult [libraries section](cluster.md#libraries) for [databricks_cluster](cluster.md) resource.
+* `depends_on` - (Optional) block specifying dependency(-ies) for a given task.
+* `retry_on_timeout` - (Optional) (Bool) An optional policy to specify whether to retry a job when it times out. The default behavior is to not retry on timeout.
+* `max_retries` - (Optional) (Integer) An optional maximum number of times to retry an unsuccessful run. A run is considered to be unsuccessful if it completes with a `FAILED` or `INTERNAL_ERROR` lifecycle state. The value -1 means to retry indefinitely and the value 0 means to never retry. The default behavior is to never retry. A run can have the following lifecycle states: `PENDING`, `RUNNING`, `TERMINATING`, `TERMINATED`, `SKIPPED` or `INTERNAL_ERROR`.
+* `timeout_seconds` - (Optional) (Integer) An optional timeout applied to each run of this job. The default behavior is to have no timeout.
+* `min_retry_interval_millis` - (Optional) (Integer) An optional minimal interval in milliseconds between the start of the failed run and the subsequent retry run. The default behavior is that unsuccessful runs are immediately retried.
+* `email_notifications` - (Optional) (List) An optional set of email addresses notified when runs of this task begin, complete or fail. The default behavior is to not send any emails. This field is a block and is [documented below](#email_notifications-configuration-block).
+* `health` - (Optional) block described below that specifies health conditions for a given task.
+
+### depends_on Configuration Block
+
+This block describes dependencies of a given task:
+
+* `task_key` - (Required) The name of the task this task depends on.
 
 ### tags Configuration Map
 `tags` - (Optional) (Map) An optional map of the tags associated with the job. Specified tags will be used as cluster tags for job clusters.
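The arguments documented above can be combined as in the following sketch of a two-task job. This is an illustration only, not taken from this PR: the resource name, job name, notebook paths and the one-hour threshold are all hypothetical.

```hcl
resource "databricks_job" "example" {
  name = "nightly-etl" # hypothetical job name

  # Job-level late-run detection: warn when any run exceeds one hour.
  health {
    rules {
      metric = "RUN_DURATION_SECONDS"
      op     = "GREATER_THAN"
      value  = 3600
    }
  }

  task {
    task_key = "extract"
    notebook_task {
      notebook_path = "/Jobs/extract" # hypothetical path
    }
  }

  task {
    task_key = "transform"
    # Cross-task dependency via the depends_on block.
    depends_on {
      task_key = "extract"
    }
    notebook_task {
      notebook_path = "/Jobs/transform" # hypothetical path
    }
  }
}
```

Per the note at the top of the document, the tasks are listed in alphabetical order of their `task_key` values to keep plan diffs readable.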
@@ -130,8 +161,6 @@ resource "databricks_job" "this" {
 }
 ```
 
-
-
 ### job_cluster Configuration Block
 
 [Shared job cluster](https://docs.databricks.com/jobs.html#use-shared-job-clusters) specification. Allows multiple tasks in the same job run to reuse the cluster.
@@ -172,6 +201,7 @@ This block is used to specify Git repository information & branch/tag/commit tha
 * `on_start` - (Optional) (List) list of emails to notify when the run starts.
 * `on_success` - (Optional) (List) list of emails to notify when the run completes successfully.
 * `on_failure` - (Optional) (List) list of emails to notify when the run fails.
+* `on_duration_warning_threshold_exceeded` - (Optional) (List) list of emails to notify when the duration of a run exceeds the threshold specified by the `RUN_DURATION_SECONDS` metric in the `health` block.
 * `no_alert_for_skipped_runs` - (Optional) (Bool) don't send alert for skipped runs. (It's recommended to use the corresponding setting in the `notification_settings` configuration block).
 
 ### webhook_notifications Configuration Block
@@ -181,6 +211,7 @@ Each entry in `webhook_notification` block takes a list `webhook` blocks. The fi
 * `on_start` - (Optional) (List) list of notification IDs to call when the run starts. A maximum of 3 destinations can be specified.
 * `on_success` - (Optional) (List) list of notification IDs to call when the run completes successfully. A maximum of 3 destinations can be specified.
 * `on_failure` - (Optional) (List) list of notification IDs to call when the run fails. A maximum of 3 destinations can be specified.
+* `on_duration_warning_threshold_exceeded` - (Optional) (List) list of notification IDs to call when the duration of a run exceeds the threshold specified by the `RUN_DURATION_SECONDS` metric in the `health` block.
 
 Note that the `id` is not to be confused with the name of the alert destination. The `id` can be retrieved through the API or the URL of Databricks UI `https://<workspace host>/sql/destinations/<notification id>?o=<workspace id>`
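As a sketch of the new attribute in use (the email address and the notification ID below are placeholders, not values from this PR), both notification blocks can reference the late-run threshold:

```hcl
email_notifications {
  on_failure                             = ["[email protected]"]
  on_duration_warning_threshold_exceeded = ["[email protected]"]
}

webhook_notifications {
  on_duration_warning_threshold_exceeded {
    id = "00000000-0000-0000-0000-000000000000" # placeholder notification ID
  }
}
```

These notifications only fire if the job (or task) also defines a `health` block with a `RUN_DURATION_SECONDS` rule.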

@@ -200,13 +231,30 @@ webhook_notifications {
 
 -> **Note** The following configuration blocks can be standalone or nested inside a `task` block
 
-### notification_settings Configuration Block
+### notification_settings Configuration Block (Job Level)
 
-This block controls notification settings for both email & webhook notifications:
+This block controls notification settings for both email & webhook notifications on a job level:
 
 * `no_alert_for_skipped_runs` - (Optional) (Bool) don't send alert for skipped runs.
 * `no_alert_for_canceled_runs` - (Optional) (Bool) don't send alert for cancelled runs.
 
+### notification_settings Configuration Block (Task Level)
+
+This block controls notification settings for both email & webhook notifications on a task level:
+
+* `no_alert_for_skipped_runs` - (Optional) (Bool) don't send alert for skipped runs.
+* `no_alert_for_canceled_runs` - (Optional) (Bool) don't send alert for cancelled runs.
+* `alert_on_last_attempt` - (Optional) (Bool) do not send notifications to recipients specified in `on_start` for the retried runs and do not send notifications to recipients specified in `on_failure` until the last retry of the run.
+
+### health Configuration Block
+
+This block describes health conditions for a given job or an individual task. It consists of the following attributes:
+
+* `rules` - (List) list of rules that are represented as objects with the following attributes:
+  * `metric` - (Optional) string specifying the metric to check. The only supported metric is `RUN_DURATION_SECONDS` (check [Jobs REST API documentation](https://docs.databricks.com/api/workspace/jobs/create) for the latest information).
+  * `op` - (Optional) string specifying the operation used to evaluate the given metric. The only supported operation is `GREATER_THAN`.
+  * `value` - (Optional) integer value used to compare to the given metric.
+
 ### spark_jar_task Configuration Block
 
 * `parameters` - (Optional) (List) Parameters passed to the main method.
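Since `health` and `notification_settings` can be standalone or nested inside a `task` block, a task-level sketch might look like the following; the task key and the 30-minute threshold are illustrative, not from this PR:

```hcl
task {
  task_key = "long_running_step" # hypothetical task

  # Task-level late-run detection: warn after 30 minutes.
  health {
    rules {
      metric = "RUN_DURATION_SECONDS"
      op     = "GREATER_THAN"
      value  = 1800
    }
  }

  # Task-level notification settings, including the task-only attribute.
  notification_settings {
    no_alert_for_skipped_runs = true
    alert_on_last_attempt     = true
  }
}
```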

jobs/resource_job.go

Lines changed: 38 additions & 29 deletions
@@ -114,24 +114,20 @@ type DbtTask struct {
 
 // EmailNotifications contains the information for email notifications after job or task run start or completion
 type EmailNotifications struct {
-	OnStart               []string `json:"on_start,omitempty"`
-	OnSuccess             []string `json:"on_success,omitempty"`
-	OnFailure             []string `json:"on_failure,omitempty"`
-	NoAlertForSkippedRuns bool     `json:"no_alert_for_skipped_runs,omitempty"`
-	AlertOnLastAttempt    bool     `json:"alert_on_last_attempt,omitempty"`
+	OnStart                            []string `json:"on_start,omitempty"`
+	OnSuccess                          []string `json:"on_success,omitempty"`
+	OnFailure                          []string `json:"on_failure,omitempty"`
+	OnDurationWarningThresholdExceeded []string `json:"on_duration_warning_threshold_exceeded,omitempty"`
+	NoAlertForSkippedRuns              bool     `json:"no_alert_for_skipped_runs,omitempty"`
+	AlertOnLastAttempt                 bool     `json:"alert_on_last_attempt,omitempty"`
 }
 
 // WebhookNotifications contains the information for webhook notifications sent after job start or completion.
 type WebhookNotifications struct {
-	OnStart   []Webhook `json:"on_start,omitempty"`
-	OnSuccess []Webhook `json:"on_success,omitempty"`
-	OnFailure []Webhook `json:"on_failure,omitempty"`
-}
-
-// NotificationSettings control the notification settings for a job
-type NotificationSettings struct {
-	NoAlertForSkippedRuns  bool `json:"no_alert_for_skipped_runs,omitempty"`
-	NoAlertForCanceledRuns bool `json:"no_alert_for_canceled_runs,omitempty"`
+	OnStart                            []Webhook `json:"on_start,omitempty"`
+	OnSuccess                          []Webhook `json:"on_success,omitempty"`
+	OnFailure                          []Webhook `json:"on_failure,omitempty"`
+	OnDurationWarningThresholdExceeded []Webhook `json:"on_duration_warning_threshold_exceeded,omitempty"`
 }
 
 func (wn *WebhookNotifications) Sort() {
@@ -171,6 +167,16 @@ type GitSource struct {
 
 // End Jobs + Repo integration preview
 
+type JobHealthRule struct {
+	Metric    string `json:"metric,omitempty"`
+	Operation string `json:"op,omitempty"`
+	Value     int32  `json:"value,omitempty"`
+}
+
+type JobHealth struct {
+	Rules []JobHealthRule `json:"rules"`
+}
+
 type JobTaskSettings struct {
 	TaskKey     string `json:"task_key,omitempty"`
 	Description string `json:"description,omitempty"`
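Given the struct tags above, a populated `JobHealth` should serialize into the Jobs API payload as a fragment of this shape (assuming standard `encoding/json` behavior; the threshold value is illustrative):

```json
"health": {
  "rules": [
    {
      "metric": "RUN_DURATION_SECONDS",
      "op": "GREATER_THAN",
      "value": 3600
    }
  ]
}
```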
@@ -198,11 +204,13 @@ type JobTaskSettings struct {
 	// ConditionTask is in private preview
 	ConditionTask *jobs.ConditionTask `json:"condition_task,omitempty" tf:"group:task_type"`
 
-	EmailNotifications     *EmailNotifications `json:"email_notifications,omitempty" tf:"suppress_diff"`
-	TimeoutSeconds         int32               `json:"timeout_seconds,omitempty"`
-	MaxRetries             int32               `json:"max_retries,omitempty"`
-	MinRetryIntervalMillis int32               `json:"min_retry_interval_millis,omitempty"`
-	RetryOnTimeout         bool                `json:"retry_on_timeout,omitempty" tf:"computed"`
+	EmailNotifications     *EmailNotifications            `json:"email_notifications,omitempty" tf:"suppress_diff"`
+	NotificationSettings   *jobs.TaskNotificationSettings `json:"notification_settings,omitempty"`
+	TimeoutSeconds         int32                          `json:"timeout_seconds,omitempty"`
+	MaxRetries             int32                          `json:"max_retries,omitempty"`
+	MinRetryIntervalMillis int32                          `json:"min_retry_interval_millis,omitempty"`
+	RetryOnTimeout         bool                           `json:"retry_on_timeout,omitempty" tf:"computed"`
+	Health                 *JobHealth                     `json:"health,omitempty"`
 }
 
 type JobCluster struct {
@@ -270,16 +278,17 @@ type JobSettings struct {
 	GitSource *GitSource `json:"git_source,omitempty"`
 	// END Jobs + Repo integration preview
 
-	Schedule             *CronSchedule         `json:"schedule,omitempty"`
-	Continuous           *ContinuousConf       `json:"continuous,omitempty"`
-	Trigger              *Trigger              `json:"trigger,omitempty"`
-	MaxConcurrentRuns    int32                 `json:"max_concurrent_runs,omitempty"`
-	EmailNotifications   *EmailNotifications   `json:"email_notifications,omitempty" tf:"suppress_diff"`
-	WebhookNotifications *WebhookNotifications `json:"webhook_notifications,omitempty" tf:"suppress_diff"`
-	NotificationSettings *NotificationSettings `json:"notification_settings,omitempty"`
-	Tags                 map[string]string     `json:"tags,omitempty"`
-	Queue                *Queue                `json:"queue,omitempty"`
-	RunAs                *JobRunAs             `json:"run_as,omitempty"`
+	Schedule             *CronSchedule                 `json:"schedule,omitempty"`
+	Continuous           *ContinuousConf               `json:"continuous,omitempty"`
+	Trigger              *Trigger                      `json:"trigger,omitempty"`
+	MaxConcurrentRuns    int32                         `json:"max_concurrent_runs,omitempty"`
+	EmailNotifications   *EmailNotifications           `json:"email_notifications,omitempty" tf:"suppress_diff"`
+	WebhookNotifications *WebhookNotifications         `json:"webhook_notifications,omitempty" tf:"suppress_diff"`
+	NotificationSettings *jobs.JobNotificationSettings `json:"notification_settings,omitempty"`
+	Tags                 map[string]string             `json:"tags,omitempty"`
+	Queue                *Queue                        `json:"queue,omitempty"`
+	RunAs                *JobRunAs                     `json:"run_as,omitempty"`
+	Health               *JobHealth                    `json:"health,omitempty"`
 }
 
 func (js *JobSettings) isMultiTask() bool {

jobs/resource_job_test.go

Lines changed: 39 additions & 3 deletions
@@ -14,6 +14,7 @@ import (
 	"github.com/databricks/terraform-provider-databricks/common"
 	"github.com/databricks/terraform-provider-databricks/libraries"
 	"github.com/databricks/terraform-provider-databricks/qa"
+
 	"github.com/stretchr/testify/assert"
 	"github.com/stretchr/testify/require"
 )
@@ -141,6 +142,15 @@ func TestResourceJobCreate_MultiTask(t *testing.T) {
 					SparkJarTask: &SparkJarTask{
 						MainClassName: "com.labs.BarMain",
 					},
+					Health: &JobHealth{
+						Rules: []JobHealthRule{
+							{
+								Metric:    "RUN_DURATION_SECONDS",
+								Operation: "GREATER_THAN",
+								Value:     3600,
+							},
+						},
+					},
 				},
 				{
 					TaskKey: "b",
@@ -158,6 +168,15 @@ func TestResourceJobCreate_MultiTask(t *testing.T) {
 				},
 			},
 			MaxConcurrentRuns: 1,
+			Health: &JobHealth{
+				Rules: []JobHealthRule{
+					{
+						Metric:    "RUN_DURATION_SECONDS",
+						Operation: "GREATER_THAN",
+						Value:     3600,
+					},
+				},
+			},
 		},
 		Response: Job{
 			JobID: 789,
@@ -185,7 +204,15 @@ func TestResourceJobCreate_MultiTask(t *testing.T) {
 		Resource: ResourceJob(),
 		HCL: `
 		name = "Featurizer"
-
+
+		health {
+			rules {
+				metric = "RUN_DURATION_SECONDS"
+				op = "GREATER_THAN"
+				value = 3600
+			}
+		}
+
 		task {
 			task_key = "a"
@@ -198,6 +225,15 @@ func TestResourceJobCreate_MultiTask(t *testing.T) {
 			library {
 				jar = "dbfs://aa/bb/cc.jar"
 			}
+
+			health {
+				rules {
+					metric = "RUN_DURATION_SECONDS"
+					op = "GREATER_THAN"
+					value = 3600
+				}
+			}
+
 		}
 
 		task {
@@ -983,7 +1019,7 @@ func TestResourceJobCreateWithWebhooks(t *testing.T) {
 			OnSuccess: []Webhook{{ID: "id2"}},
 			OnFailure: []Webhook{{ID: "id3"}},
 		},
-		NotificationSettings: &NotificationSettings{
+		NotificationSettings: &jobs.JobNotificationSettings{
 			NoAlertForSkippedRuns:  true,
 			NoAlertForCanceledRuns: true,
 		},
@@ -1014,7 +1050,7 @@ func TestResourceJobCreateWithWebhooks(t *testing.T) {
 			OnSuccess: []Webhook{{ID: "id2"}},
 			OnFailure: []Webhook{{ID: "id3"}},
 		},
-		NotificationSettings: &NotificationSettings{
+		NotificationSettings: &jobs.JobNotificationSettings{
 			NoAlertForSkippedRuns:  true,
 			NoAlertForCanceledRuns: true,
 		},
