---
title: Transform data with Synapse Spark job definition
titleSuffix: Azure Data Factory & Azure Synapse
description: Learn how to process or transform data by running a Synapse Spark job definition in Azure Data Factory and Synapse Analytics pipelines.
ms.service: data-factory
ms.subservice: tutorials
ms.custom: synapse
author: nabhishek
ms.author: jejiang
ms.topic: conceptual
ms.date: 07/12/2022
---

# Transform data by running a Synapse Spark job definition
[!INCLUDE[appliesto-adf-asa-md](includes/appliesto-adf-asa-md.md)]

The Azure Synapse Spark job definition activity in a [pipeline](concepts-pipelines-activities.md) runs a Synapse Spark job definition in your Azure Synapse Analytics workspace. This article builds on the [data transformation activities](transform-data.md) article, which presents a general overview of data transformation and the supported transformation activities.

## Set up the Apache Spark job definition canvas

To use a Spark job definition activity for Synapse in a pipeline, complete the steps in the following sections.

## General settings

1. Search for _Spark job definition_ in the pipeline **Activities** pane, and drag a Spark job definition activity from under **Synapse** to the pipeline canvas.

2. Select the new Spark job definition activity on the canvas if it isn't already selected.

3. In the **General** tab, enter a name for the activity, for example _sample_.

4. (Optional) You can also enter a description.

5. Timeout: The maximum amount of time an activity can run. The default is seven days, which is also the maximum allowed. The format is D.HH:MM:SS.

6. Retry: The maximum number of retry attempts.

7. Retry interval: The number of seconds between each retry attempt.

8. Secure output: When checked, output from the activity won't be captured in logging.

9. Secure input: When checked, input from the activity won't be captured in logging. (The settings in steps 5 through 9 map to the activity's `policy` object; see the sketch after this list.)
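
In the pipeline JSON, these settings appear on the activity's `policy` object, as the full activity definition later in this article shows. A minimal sketch with the default values:

```json
"policy": {
    "timeout": "7.00:00:00",
    "retry": 0,
    "retryIntervalInSeconds": 30,
    "secureOutput": false,
    "secureInput": false
}
```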

## Azure Synapse Analytics (Artifacts) settings

1. Select the new Spark job definition activity on the canvas if it isn't already selected.

2. Select the **Azure Synapse Analytics (Artifacts)** tab, and select or create a new Azure Synapse Analytics linked service that will execute the Spark job definition activity.

    :::image type="content" source="./media/transform-data-synapse-spark-job-definition/spark-job-definition-activity.png" alt-text="Screenshot that shows the UI for the linked service tab for a Spark job definition activity.":::
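
In the activity JSON, this selection becomes the `linkedServiceName` property, as in the full sample later in this article:

```json
"linkedServiceName": {
    "referenceName": "AzureSynapseArtifacts1",
    "type": "LinkedServiceReference"
}
```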

## Settings tab

1. Select the new Spark job definition activity on the canvas if it isn't already selected.

2. Select the **Settings** tab.

3. Expand the **Spark job definition** list, and select an existing Apache Spark job definition in the linked Azure Synapse Analytics workspace.

4. (Optional) You can fill in information for the Apache Spark job definition. If the following settings are left empty, the settings of the Spark job definition itself are used for the run; if they aren't empty, they replace the settings of the Spark job definition itself.

    | Property | Description |
    | ----- | ----- |
    |Main definition file| The main file used for the job. Select a PY/JAR/ZIP file from your storage. You can select **Upload file** to upload the file to a storage account. <br> Sample: `abfss://…/path/to/wordcount.jar`|
    |References from subfolders| Subfolders of the root folder of the main definition file are scanned, and the files found are added as reference files. The folders named "jars", "pyFiles", "files", or "archives" are scanned, and the folder names are case-sensitive. |
    |Main class name| The fully qualified identifier of the main class that is in the main definition file. <br> Sample: `WordCount`|
    |Command-line arguments| You can add command-line arguments by selecting the **New** button. Note that adding command-line arguments here overrides the command-line arguments defined by the Spark job definition. <br> *Sample: `abfss://…/path/to/shakespeare.txt` `abfss://…/path/to/result`* |
    |Apache Spark pool| You can select an Apache Spark pool from the list.|
    |Python code reference| Additional Python code files used for reference in the main definition file. <br> It supports passing files (.py, .py3, .zip) to the "pyFiles" property, and overrides the "pyFiles" property defined in the Spark job definition. |
    |Reference files| Additional files used for reference in the main definition file. |
    |Dynamically allocate executors| This setting maps to the dynamic allocation property in the Spark configuration for Spark application executor allocation.|
    |Min executors| The minimum number of executors to be allocated in the specified Spark pool for the job.|
    |Max executors| The maximum number of executors to be allocated in the specified Spark pool for the job.|
    |Driver size| The number of cores and amount of memory to be used for the driver in the specified Apache Spark pool for the job.|
    |Spark configuration| Specify values for the Spark configuration properties listed in the topic: Spark Configuration - Application properties. Users can use the default configuration and customized configuration. |

    :::image type="content" source="./media/transform-data-synapse-spark-job-definition/spark-job-definition-activity-settings.png" alt-text="Screenshot that shows the UI for the Spark job definition activity.":::
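
    When you override settings here, they're saved on the activity's `typeProperties` alongside the `sparkJob` reference shown in the full sample later in this article. The override property names below (`file`, `className`, `args`, `numExecutors`) are assumptions based on the UI labels rather than a documented schema; check the JSON generated for your own pipeline. A sketch reusing this article's samples:

    ```json
    "typeProperties": {
        "sparkJob": {
            "referenceName": "Spark job definition 1",
            "type": "SparkJobDefinitionReference"
        },
        "file": "abfss://…/path/to/wordcount.jar",
        "className": "WordCount",
        "args": [
            "abfss://…/path/to/shakespeare.txt",
            "abfss://…/path/to/result"
        ],
        "numExecutors": 2
    }
    ```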

5. You can add dynamic content by selecting the **Add Dynamic Content** button or by pressing the keyboard shortcut <kbd>Alt</kbd>+<kbd>Shift</kbd>+<kbd>D</kbd>. In the **Add Dynamic Content** page, you can use any combination of expressions, functions, and system variables to build dynamic content.

    :::image type="content" source="./media/transform-data-synapse-spark-job-definition/spark-job-definition-activity-add-dynamic-content.png" alt-text="Screenshot that displays the UI for adding dynamic content to Spark job definition activities.":::
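
Dynamic content is stored as an expression object in the activity JSON; the full sample later in this article uses this form for the Spark job definition name. A sketch that instead resolves the name from a hypothetical pipeline parameter named `jobName`:

```json
"sparkJob": {
    "referenceName": {
        "value": "@pipeline().parameters.jobName",
        "type": "Expression"
    },
    "type": "SparkJobDefinitionReference"
}
```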

## User properties tab

You can add properties for the Apache Spark job definition activity in this panel.

:::image type="content" source="./media/transform-data-synapse-spark-job-definition/spark-job-definition-activity-user-properties.png" alt-text="Screenshot that shows the UI for the properties for a Spark job definition activity.":::
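
User properties are name/value pairs that are surfaced when you monitor the activity run. In the activity JSON they're stored as a `userProperties` array; a minimal sketch with a hypothetical property:

```json
"userProperties": [
    {
        "name": "JobOwner",
        "value": "data-engineering"
    }
]
```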

## Azure Synapse Spark job definition activity definition

Here's a sample JSON definition of an Azure Synapse Spark job definition activity:

```json
{
    "activities": [
        {
            "name": "Spark job definition1",
            "type": "SparkJob",
            "dependsOn": [],
            "policy": {
                "timeout": "7.00:00:00",
                "retry": 0,
                "retryIntervalInSeconds": 30,
                "secureOutput": false,
                "secureInput": false
            },
            "typeProperties": {
                "sparkJob": {
                    "referenceName": {
                        "value": "Spark job definition 1",
                        "type": "Expression"
                    },
                    "type": "SparkJobDefinitionReference"
                }
            },
            "linkedServiceName": {
                "referenceName": "AzureSynapseArtifacts1",
                "type": "LinkedServiceReference"
            }
        }
    ]
}
```

## Azure Synapse Spark job definition properties

The following table describes the JSON properties used in the JSON definition:

|Property|Description|Required|
|---|---|---|
|name|Name of the activity in the pipeline.|Yes|
|description|Text describing what the activity does.|No|
|type|For the Azure Synapse Spark job definition activity, the activity type is SparkJob.|Yes|
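
Combining just these three properties gives the minimal skeleton below. This is a sketch; in practice you also supply the `typeProperties` and `linkedServiceName` objects shown in the full sample earlier in this article.

```json
{
    "name": "Spark job definition1",
    "description": "Runs a Synapse Spark job definition.",
    "type": "SparkJob"
}
```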

## See Azure Synapse Spark job definition activity run history

Go to **Pipeline runs** under the **Monitor** tab, and you'll see the pipeline you triggered. Open the pipeline that contains the Azure Synapse Spark job definition activity to see the run history.

:::image type="content" source="./media/transform-data-synapse-spark-job-definition/input-output-sjd.png" alt-text="Screenshot that shows the UI for the input and output of a Spark job definition activity run.":::

You can see the activity's **input** or **output** by selecting the **Input** or **Output** button. If your pipeline failed with a user error, select **Output** and check the **result** field to see the detailed user error traceback.

:::image type="content" source="./media/transform-data-synapse-spark-job-definition/sjd-output-user-error.png" alt-text="Screenshot that shows the UI for the output user error of a Spark job definition activity run.":::
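
The exact shape of the output payload isn't shown in this article; as a rough, hypothetical sketch of where the traceback appears (the field names here are assumptions), the output JSON for a failed run might include something like:

```json
{
    "status": "Failed",
    "result": "Traceback (most recent call last): …"
}
```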