
Commit 0da019a

Adding job activity article
1 parent e50582f commit 0da019a

2 files changed: +152 -0 lines changed

articles/data-factory/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -699,6 +699,8 @@ items:
        href: transform-data-using-custom-activity.md
      - name: Databricks Jar activity
        href: transform-data-databricks-jar.md
+     - name: Databricks Job activity
+       href: transform-data-databricks-job.md
        displayName: data bricks
      - name: Databricks Notebook activity
        href: transform-data-databricks-notebook.md
articles/data-factory/transform-data-databricks-job.md

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
---
title: Transform data with Databricks Job
titleSuffix: Azure Data Factory & Azure Synapse
description: Learn how to process or transform data by running a Databricks job in Azure Data Factory and Synapse Analytics pipelines.
ms.custom: synapse
author: n0elleli
ms.author: noelleli
ms.reviewer: whhender
ms.topic: how-to
ms.date: 04/24/2025
ms.subservice: orchestration
---
# Transform data by running a Databricks job

[!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)]

The Azure Databricks Job activity in a [pipeline](concepts-pipelines-activities.md) runs a Databricks job in your Azure Databricks workspace. This article builds on the [data transformation activities](transform-data.md) article, which presents a general overview of data transformation and the supported transformation activities. Azure Databricks is a managed platform for running Apache Spark.

You can define the Databricks Job activity with an ARM template using JSON, or configure it directly through the Azure Data Factory Studio user interface.
## Add a Job activity for Azure Databricks to a pipeline with UI

To use a Job activity for Azure Databricks in a pipeline, complete the following steps:

1. Search for _Job_ in the pipeline Activities pane, and drag a Job activity to the pipeline canvas.
1. Select the new Job activity on the canvas if it isn't already selected.
1. Select the **Azure Databricks** tab to select or create a new Azure Databricks linked service that executes the Job activity.
1. Select the **Settings** tab and specify the job path to be executed on Azure Databricks, optional base parameters to be passed to the job, and any other libraries to be installed on the cluster that executes the job.
## Databricks Job activity definition

Here's the sample JSON definition of a Databricks Job activity:

```json
{
    "activity": {
        "name": "MyActivity",
        "description": "MyActivity description",
        "type": "DatabricksJob",
        "linkedServiceName": {
            "referenceName": "MyDatabricksLinkedservice",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "jobPath": "/Users/[email protected]/ScalaExampleJob",
            "baseParameters": {
                "inputpath": "input/folder1/",
                "outputpath": "output/"
            },
            "libraries": [
                {
                    "jar": "dbfs:/docs/library.jar"
                }
            ]
        }
    }
}
```
## Databricks Job activity properties

The following table describes the JSON properties used in the activity definition:

|Property|Description|Required|
|---|---|---|
|name|Name of the activity in the pipeline.|Yes|
|description|Text describing what the activity does.|No|
|type|For the Databricks Job activity, the activity type is DatabricksJob.|Yes|
|linkedServiceName|Name of the Databricks linked service on which the Databricks job runs. To learn about this linked service, see the [Compute linked services](compute-linked-services.md) article.|Yes|
|jobPath|The absolute path of the job to be run in the Databricks workspace. This path must begin with a slash.|Yes|
|baseParameters|An array of key-value pairs. Base parameters can be used for each activity run. If the job takes a parameter that isn't specified, the default value from the job is used. Find more on parameters in [Databricks Jobs](https://docs.databricks.com/api/latest/jobs.html#jobsparampair).|No|
|libraries|A list of libraries to be installed on the cluster that executes the job. It can be an array of \<string, object>.|No|
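For reference, here's a minimal sketch of how such an activity might sit inside a full pipeline definition, with the base parameters bound to pipeline parameters through expressions. The pipeline name, activity name, and the `sourceFolder`/`sinkFolder` parameters are hypothetical and not part of the sample above:

```json
{
    "name": "MyDatabricksJobPipeline",
    "properties": {
        "parameters": {
            "sourceFolder": { "type": "string", "defaultValue": "input/folder1/" },
            "sinkFolder": { "type": "string", "defaultValue": "output/" }
        },
        "activities": [
            {
                "name": "RunDatabricksJob",
                "type": "DatabricksJob",
                "linkedServiceName": {
                    "referenceName": "MyDatabricksLinkedservice",
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "jobPath": "/Users/[email protected]/ScalaExampleJob",
                    "baseParameters": {
                        "inputpath": "@pipeline().parameters.sourceFolder",
                        "outputpath": "@pipeline().parameters.sinkFolder"
                    }
                }
            }
        ]
    }
}
```

Because string values that begin with `@` are evaluated as expressions at run time, the job receives whatever values are passed to the pipeline parameters when the pipeline is triggered.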
## Supported libraries for Databricks activities

In the above Databricks activity definition, you specify these library types: *jar*, *egg*, *whl*, *maven*, *pypi*, *cran*.

```json
{
    "libraries": [
        {
            "jar": "dbfs:/mnt/libraries/library.jar"
        },
        {
            "egg": "dbfs:/mnt/libraries/library.egg"
        },
        {
            "whl": "dbfs:/mnt/libraries/mlflow-0.0.1.dev0-py2-none-any.whl"
        },
        {
            "whl": "dbfs:/mnt/libraries/wheel-libraries.wheelhouse.zip"
        },
        {
            "maven": {
                "coordinates": "org.jsoup:jsoup:1.7.2",
                "exclusions": [ "slf4j:slf4j" ]
            }
        },
        {
            "pypi": {
                "package": "simplejson",
                "repo": "http://my-pypi-mirror.com"
            }
        },
        {
            "cran": {
                "package": "ada",
                "repo": "https://cran.us.r-project.org"
            }
        }
    ]
}
```

For more information, see the [Databricks documentation](/azure/databricks/dev-tools/api/latest/libraries#managedlibrarieslibrary) for library types.
## Passing parameters between jobs and pipelines

You can pass parameters to jobs by using the *baseParameters* property in the Databricks activity.

In certain cases, you might need to pass values from the job back to the service, where they can be used for control flow (conditional checks) or consumed by downstream activities (the size limit is 2 MB).

1. In your job, you can call [dbutils.job.exit("returnValue")](/azure/databricks/jobs/job-workflows#python-1), and the corresponding "returnValue" is returned to the service.

1. You can consume the output in the service by using an expression such as `@{activity('databricks job activity name').output.runOutput}`; see the sketch after the note below.

> [!IMPORTANT]
> If you're passing a JSON object, you can retrieve values by appending property names. Example: `@{activity('databricks job activity name').output.runOutput.PropertyName}`
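For example, here's a minimal, hypothetical sketch of a downstream **Set Variable** activity that captures the job's return value into a String pipeline variable. The activity names (`RunDatabricksJob`, `CaptureJobOutput`) and the variable name (`jobResult`) are assumptions for illustration:

```json
{
    "name": "CaptureJobOutput",
    "type": "SetVariable",
    "dependsOn": [
        {
            "activity": "RunDatabricksJob",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "variableName": "jobResult",
        "value": "@{activity('RunDatabricksJob').output.runOutput}"
    }
}
```

The same expression can also be used directly in an If Condition activity or in any other downstream activity's dynamic content.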
## How to upload a library in Databricks

### You can use the Workspace UI:

1. [Use the Databricks workspace UI](/azure/databricks/libraries/cluster-libraries#install-a-library-on-a-cluster).

2. To obtain the dbfs path of a library added through the UI, you can use the [Databricks CLI](/azure/databricks/dev-tools/cli/fs-commands#list-the-contents-of-a-directory).

Typically, Jar libraries are stored under dbfs:/FileStore/jars when they're uploaded through the UI. You can list them all through the CLI: *databricks fs ls dbfs:/FileStore/job-jars*

### Or you can use the Databricks CLI:

1. Follow [Copy the library using Databricks CLI](/azure/databricks/dev-tools/cli/fs-commands#copy-a-directory-or-a-file).

2. Use the Databricks CLI [(installation steps)](/azure/databricks/dev-tools/cli/commands#compute-commands).

As an example, to copy a JAR to dbfs:
`dbfs cp SparkPi-assembly-0.1.jar dbfs:/docs/sparkpi.jar`
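After the JAR is copied to DBFS, it can be referenced from the activity's *libraries* property using the same path, for example:

```json
{
    "libraries": [
        {
            "jar": "dbfs:/docs/sparkpi.jar"
        }
    ]
}
```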
