
Commit 9b5953d

Create deploy-worker-udf-binaries.md (#16463)
* Create deploy-worker-udf-binaries.md
* edit
* add blank lines
* change indentation
* edit inline
* resolve comments
* update table from html to markdown
* update syntax
* Update TOC
* resolve comments

Co-authored-by: Mary McCready <[email protected]>
1 parent 6cbc9c3 commit 9b5953d

2 files changed: +79 −0
docs/spark/how-to-guides/deploy-worker-udf-binaries.md

Lines changed: 77 additions & 0 deletions
---
title: Deploy .NET for Apache Spark worker and user-defined function binaries
description: Learn how to deploy .NET for Apache Spark worker and user-defined function binaries.
ms.date: 01/21/2019
ms.topic: conceptual
ms.custom: mvc,how-to
---

# Deploy .NET for Apache Spark worker and user-defined function binaries

This how-to provides general instructions on how to deploy .NET for Apache Spark worker and user-defined function binaries. You learn which environment variables to set and some commonly used parameters for launching applications with `spark-submit`.

## Configurations

This section describes the environment variable and parameter settings used to deploy .NET for Apache Spark worker and user-defined function binaries.

### Environment variables

When deploying workers and writing UDFs, there are a few commonly used environment variables that you may need to set:

| Environment variable | Description |
| :--- | :--- |
| `DOTNET_WORKER_DIR` | Path where the `Microsoft.Spark.Worker` binary has been generated. It's used by the Spark driver and is passed to Spark executors. If this variable isn't set, the Spark executors search the path specified in the `PATH` environment variable.<br/>_e.g. "C:\bin\Microsoft.Spark.Worker"_ |
| `DOTNET_ASSEMBLY_SEARCH_PATHS` | Comma-separated paths where `Microsoft.Spark.Worker` loads assemblies. If a path starts with ".", the working directory is prepended. In **yarn mode**, "." represents the container's working directory.<br/>_e.g. "C:\Users\\&lt;user name&gt;\\&lt;mysparkapp&gt;\bin\Debug\\&lt;dotnet version&gt;"_ |
| `DOTNET_WORKER_DEBUG` | If you want to [debug a UDF](https://github.com/dotnet/spark/blob/master/docs/developer-guide.md#debugging-user-defined-function-udf), set this environment variable to `1` before running `spark-submit`. |

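As an illustration, you could set these variables in a PowerShell session before running `spark-submit`. The paths below are placeholders, not required locations:

```powershell
# Placeholder paths; substitute the locations on your machine.
$env:DOTNET_WORKER_DIR = "C:\bin\Microsoft.Spark.Worker"
$env:DOTNET_ASSEMBLY_SEARCH_PATHS = "C:\Users\<user name>\mySparkApp\bin\Debug\net461"

# Optional: enable UDF debugging for the next spark-submit run.
$env:DOTNET_WORKER_DEBUG = "1"
```

These `$env:` assignments apply only to the current session; use `setx` (or the System Properties dialog) if you want the values to persist across sessions.
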
### Parameter options

Once your Spark application is [bundled](https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies), you can launch it using `spark-submit`. The following table shows some of the commonly used options:

| Parameter name | Description |
| :--- | :--- |
| `--class` | The entry point for your application.<br/>_e.g. org.apache.spark.deploy.dotnet.DotnetRunner_ |
| `--master` | The [master URL](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls) for the cluster.<br/>_e.g. yarn_ |
| `--deploy-mode` | Whether to deploy your driver on the worker nodes (`cluster`) or locally as an external client (`client`).<br/>Default: `client` |
| `--conf` | Arbitrary Spark configuration property in `key=value` format.<br/>_e.g. spark.yarn.appMasterEnv.DOTNET_WORKER_DIR=.\worker\Microsoft.Spark.Worker_ |
| `--files` | Comma-separated list of files to be placed in the working directory of each executor.<br/><ul><li>This option is only applicable in yarn mode.</li><li>It supports specifying file names with `#`, similar to Hadoop. For example, with `myLocalSparkApp.dll#appSeen.dll`, your application should use the name `appSeen.dll` to reference `myLocalSparkApp.dll` when running on YARN.</li></ul> |
| `--archives` | Comma-separated list of archives to be extracted into the working directory of each executor.<br/><ul><li>This option is only applicable in yarn mode.</li><li>It supports specifying file names with `#`, similar to Hadoop. For example, `hdfs://&lt;path to your worker file&gt;/Microsoft.Spark.Worker.zip#worker` copies and extracts the zip file into a folder named `worker`.</li></ul> |
| `application-jar` | Path to a bundled jar including your application and all dependencies.<br/>_e.g. hdfs://&lt;path to your jar&gt;/microsoft-spark-&lt;version&gt;.jar_ |
| `application-arguments` | Arguments passed to the main method of your main class, if any.<br/>_e.g. hdfs://&lt;path to your app&gt;/&lt;your app&gt;.zip &lt;your app name&gt; &lt;app args&gt;_ |

> [!NOTE]
> Specify all the `--options` before `application-jar` when launching applications with `spark-submit`; otherwise, they're ignored. For more information, see [`spark-submit` options](https://spark.apache.org/docs/latest/submitting-applications.html) and [running Spark on YARN](https://spark.apache.org/docs/latest/running-on-yarn.html).

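For comparison with the YARN example later in this article, a minimal local-mode submission might look like the following sketch. The jar and application names are placeholders, and `dotnet mySparkApp.dll` stands in for however you run your compiled app:

```powershell
# Run the .NET for Apache Spark app on the local machine.
spark-submit `
  --class org.apache.spark.deploy.dotnet.DotnetRunner `
  --master local `
  microsoft-spark-2.4.x-<version>.jar `
  dotnet mySparkApp.dll
```
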
## Frequently asked questions

### When I run a Spark app with UDFs, I get a `FileNotFoundException` error. What should I do?

> **Error:** [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.IO.FileNotFoundException: Assembly 'mySparkApp, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' file not found: 'mySparkApp.dll'

**Answer:** Check that the `DOTNET_ASSEMBLY_SEARCH_PATHS` environment variable is set correctly. It should be the path that contains your `mySparkApp.dll`.

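As a quick sanity check (assuming `DOTNET_ASSEMBLY_SEARCH_PATHS` holds a single path rather than a comma-separated list), you can confirm the assembly is where the worker will look:

```powershell
# Returns True when the worker's search path contains your app assembly.
Test-Path (Join-Path $env:DOTNET_ASSEMBLY_SEARCH_PATHS "mySparkApp.dll")
```
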
### After I upgraded my .NET for Apache Spark version and reset the `DOTNET_WORKER_DIR` environment variable, why do I still get the following `IOException` error?

> **Error:** Lost task 0.0 in stage 11.0 (TID 24, localhost, executor driver): java.io.IOException: Cannot run program "Microsoft.Spark.Worker.exe": CreateProcess error=2, The system cannot find the file specified.

**Answer:** First, try restarting your PowerShell window (or other command windows) so that it picks up the latest environment variable values. Then start your program.

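To confirm that the new session sees the updated value, you can print it before running `spark-submit`:

```powershell
# Should show the folder of your upgraded Microsoft.Spark.Worker.
$env:DOTNET_WORKER_DIR
```
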
### After submitting my Spark application, I get the error `System.TypeLoadException: Could not load type 'System.Runtime.Remoting.Contexts.Context'`.

> **Error:** [Error] [TaskRunner] [0] ProcessStream() failed with exception: System.TypeLoadException: Could not load type 'System.Runtime.Remoting.Contexts.Context' from assembly 'mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=...'.

**Answer:** Check the `Microsoft.Spark.Worker` version you're using. There are two versions: **.NET Framework 4.6.1** and **.NET Core 2.1.x**. In this case, use `Microsoft.Spark.Worker.net461.win-x64-<version>` (which you can [download](https://github.com/dotnet/spark/releases)), since `System.Runtime.Remoting.Contexts.Context` exists only in .NET Framework.

### How do I run my Spark application with UDFs on YARN? Which environment variables and parameters should I use?

**Answer:** To launch a Spark application on YARN, specify the environment variables as `spark.yarn.appMasterEnv.[EnvironmentVariableName]`. The following is an example using `spark-submit`:

```powershell
spark-submit \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.DOTNET_WORKER_DIR=./worker/Microsoft.Spark.Worker-<version> \
  --conf spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS=./udfs \
  --archives hdfs://<path to your files>/Microsoft.Spark.Worker.net461.win-x64-<version>.zip#worker,hdfs://<path to your files>/mySparkApp.zip#udfs \
  hdfs://<path to jar file>/microsoft-spark-2.4.x-<version>.jar \
  hdfs://<path to your files>/mySparkApp.zip mySparkApp
```

## Next steps

* [Get started with .NET for Apache Spark](../tutorials/get-started.md)
* [Debug a .NET for Apache Spark application on Windows](../how-to-guides/debug.md)

docs/spark/toc.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -29,6 +29,8 @@
     href: how-to-guides/hdinsight-deploy-methods.md
   - name: Submit jobs to Databricks
     href: how-to-guides/databricks-deploy-methods.md
+  - name: Deploy worker and UDF binaries
+    href: how-to-guides/deploy-worker-udf-binaries.md
 - name: Reference
   items:
   - name: API Reference
```