---
title: Deploy .NET for Apache Spark worker and user-defined function binaries
description: Learn how to deploy .NET for Apache Spark worker and user-defined function binaries.
ms.date: 01/21/2019
ms.topic: conceptual
ms.custom: mvc,how-to
---

# Deploy .NET for Apache Spark worker and user-defined function binaries

This how-to provides general instructions on how to deploy .NET for Apache Spark worker and user-defined function binaries. You learn which environment variables to set, as well as some commonly used parameters for launching applications with `spark-submit`.

## Configurations
This section describes the environment variables and `spark-submit` parameter settings you need in order to deploy .NET for Apache Spark worker and user-defined function binaries.

### Environment variables
When deploying workers and writing UDFs, there are a few commonly used environment variables that you may need to set:

| Environment Variable | Description |
| :--------------------------- | :---------- |
| DOTNET_WORKER_DIR | Path where the <code>Microsoft.Spark.Worker</code> binary has been generated.<br/>It's used by the Spark driver and will be passed to Spark executors. If this variable is not set up, the Spark executors will search the path specified in the <code>PATH</code> environment variable.<br/>_e.g. "C:\bin\Microsoft.Spark.Worker"_ |
| DOTNET_ASSEMBLY_SEARCH_PATHS | Comma-separated paths where <code>Microsoft.Spark.Worker</code> will load assemblies.<br/>Note that if a path starts with ".", the working directory will be prepended. If in **yarn mode**, "." would represent the container's working directory.<br/>_e.g. "C:\Users\\<user name>\\<mysparkapp>\bin\Debug\\<dotnet version>"_ |
| DOTNET_WORKER_DEBUG | If you want to <a href="https://github.com/dotnet/spark/blob/master/docs/developer-guide.md#debugging-user-defined-function-udf">debug a UDF</a>, then set this environment variable to <code>1</code> before running <code>spark-submit</code>. |

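For example, before calling `spark-submit` on Windows, you might set these variables in PowerShell. This is only a sketch: the worker and application paths below are placeholders for wherever you extracted the worker and built your app.

```powershell
# Where the Microsoft.Spark.Worker binary was extracted (placeholder path).
$env:DOTNET_WORKER_DIR = "C:\bin\Microsoft.Spark.Worker"

# Where the worker should look for your application assemblies (placeholder path).
$env:DOTNET_ASSEMBLY_SEARCH_PATHS = "C:\Users\<user name>\mySparkApp\bin\Debug\net461"

# Optional: enable UDF debugging before running spark-submit.
$env:DOTNET_WORKER_DEBUG = "1"
```

Keep in mind that `$env:` assignments apply only to the current PowerShell session; use `setx` or the System Properties dialog if the values should persist across new windows.
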
### Parameter options
Once the Spark application is [bundled](https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies), you can launch it using `spark-submit`. The following table shows some of the commonly used options:

| Parameter Name | Description |
| :---------------------| :---------- |
| --class | The entry point for your application.<br/>_e.g. org.apache.spark.deploy.dotnet.DotnetRunner_ |
| --master | The <a href="https://spark.apache.org/docs/latest/submitting-applications.html#master-urls">master URL</a> for the cluster.<br/>_e.g. yarn_ |
| --deploy-mode | Whether to deploy your driver on the worker nodes (<code>cluster</code>) or locally as an external client (<code>client</code>).<br/>Default: <code>client</code> |
| --conf | Arbitrary Spark configuration property in <code>key=value</code> format.<br/>_e.g. spark.yarn.appMasterEnv.DOTNET_WORKER_DIR=.\worker\Microsoft.Spark.Worker_ |
| --files | Comma-separated list of files to be placed in the working directory of each executor.<br/><ul><li>This option is only applicable in yarn mode.</li><li>It supports specifying file names with # similar to Hadoop.</li></ul>_e.g. <code>myLocalSparkApp.dll#appSeen.dll</code>. Your application should use the name <code>appSeen.dll</code> to reference <code>myLocalSparkApp.dll</code> when running on YARN._ |
| --archives | Comma-separated list of archives to be extracted into the working directory of each executor.<br/><ul><li>This option is only applicable in yarn mode.</li><li>It supports specifying file names with # similar to Hadoop.</li></ul>_e.g. <code>hdfs://&lt;path to your worker file&gt;/Microsoft.Spark.Worker.zip#worker</code>. This will copy and extract the zip file to the <code>worker</code> folder._ |
| application-jar | Path to a bundled jar including your application and all dependencies.<br/>_e.g. hdfs://&lt;path to your jar&gt;/microsoft-spark-&lt;version&gt;.jar_ |
| application-arguments | Arguments passed to the main method of your main class, if any.<br/>_e.g. hdfs://&lt;path to your app&gt;/&lt;your app&gt;.zip &lt;your app name&gt; &lt;app args&gt;_ |

> [!NOTE]
> Specify all the `--options` before `application-jar` when launching applications with `spark-submit`, otherwise they will be ignored. For more information, see [`spark-submit` options](https://spark.apache.org/docs/latest/submitting-applications.html) and [running Spark on YARN](https://spark.apache.org/docs/latest/running-on-yarn.html).

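Putting some of these options together, a minimal local (client-mode) launch might look like the following sketch. The jar version and application name are placeholders; the exact jar file depends on your Spark and .NET for Apache Spark versions.

```powershell
# Local run: DotnetRunner launches the dotnet process that executes your app.
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-2.4.x-<version>.jar dotnet mySparkApp.dll
```
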
## Frequently asked questions
### When I run a Spark app with UDFs, I get a `FileNotFoundException` error. What should I do?
> **Error:** [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.IO.FileNotFoundException: Assembly 'mySparkApp, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' file not found: 'mySparkApp.dll'

**Answer:** Check that the `DOTNET_ASSEMBLY_SEARCH_PATHS` environment variable is set correctly. It should be the path that contains your `mySparkApp.dll`.

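One quick way to verify this from PowerShell is to print the configured path and confirm the assembly is actually there (assuming a single search path; adjust if you use multiple comma-separated paths):

```powershell
# Show the currently configured search path(s).
$env:DOTNET_ASSEMBLY_SEARCH_PATHS

# Confirm mySparkApp.dll exists in that directory.
Get-ChildItem -Path $env:DOTNET_ASSEMBLY_SEARCH_PATHS -Filter mySparkApp.dll
```
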
### After I upgraded my .NET for Apache Spark version and reset the `DOTNET_WORKER_DIR` environment variable, why do I still get the following `IOException` error?
> **Error:** Lost task 0.0 in stage 11.0 (TID 24, localhost, executor driver): java.io.IOException: Cannot run program "Microsoft.Spark.Worker.exe": CreateProcess error=2, The system cannot find the file specified.

**Answer:** Try restarting your PowerShell window (or other command windows) first so that it picks up the latest environment variable values. Then start your program.

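In a freshly opened window, you can sanity-check that the new value took effect before resubmitting:

```powershell
# Should print the directory of the newly installed worker version.
$env:DOTNET_WORKER_DIR
```
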
### After submitting my Spark application, I get the error `System.TypeLoadException: Could not load type 'System.Runtime.Remoting.Contexts.Context'`.
> **Error:** [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.TypeLoadException: Could not load type 'System.Runtime.Remoting.Contexts.Context' from assembly 'mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=...'.

**Answer:** Check the `Microsoft.Spark.Worker` version you are using. There are two versions: **.NET Framework 4.6.1** and **.NET Core 2.1.x**. In this case, `Microsoft.Spark.Worker.net461.win-x64-<version>` (which you can [download](https://github.com/dotnet/spark/releases)) should be used since `System.Runtime.Remoting.Contexts.Context` is only available in .NET Framework.

### How do I run my Spark application with UDFs on YARN? Which environment variables and parameters should I use?

**Answer:** To launch the Spark application on YARN, the environment variables should be specified as `spark.yarn.appMasterEnv.[EnvironmentVariableName]`. See the following example using `spark-submit`:

```powershell
spark-submit \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.DOTNET_WORKER_DIR=./worker/Microsoft.Spark.Worker-<version> \
--conf spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS=./udfs \
--archives hdfs://<path to your files>/Microsoft.Spark.Worker.net461.win-x64-<version>.zip#worker,hdfs://<path to your files>/mySparkApp.zip#udfs \
hdfs://<path to jar file>/microsoft-spark-2.4.x-<version>.jar \
hdfs://<path to your files>/mySparkApp.zip mySparkApp
```

## Next steps

* [Get started with .NET for Apache Spark](../tutorials/get-started.md)
* [Debug a .NET for Apache Spark application on Windows](../how-to-guides/debug.md)