
Commit 99dd822

Merge pull request #249852 from gewarren/net-spark
Update .NET for Apache Spark links to previous-versions
2 parents: d278553 + ad814a6

4 files changed (+36, -31 lines)

articles/data-lake-analytics/understand-spark-code-concepts.md (10 additions, 10 deletions)
@@ -15,10 +15,10 @@ This section provides high-level guidance on transforming U-SQL Scripts to Apach
 
 - It starts with a [comparison of the two languages' processing paradigms](#understand-the-u-sql-and-spark-language-and-processing-paradigms)
 - Provides tips on how to:
-  - [Transform scripts](#transform-u-sql-scripts) including U-SQL's [rowset expressions](#transform-u-sql-rowset-expressions-and-sql-based-scalar-expressions)
-  - [.NET code](#transform-net-code)
-  - [Data types](#transform-typed-values)
-  - [Catalog objects](#transform-u-sql-catalog-objects).
+  - [Transform scripts](#transform-u-sql-scripts) including U-SQL's [rowset expressions](#transform-u-sql-rowset-expressions-and-sql-based-scalar-expressions)
+  - [.NET code](#transform-net-code)
+  - [Data types](#transform-typed-values)
+  - [Catalog objects](#transform-u-sql-catalog-objects).
 
 ## Understand the U-SQL and Spark language and processing paradigms
 
@@ -48,13 +48,13 @@ Spark programs are similar in that you would use Spark connectors to read the da
 
 U-SQL's expression language is C# and it offers various ways to scale out custom .NET code with user-defined functions, user-defined operators, and user-defined aggregators.
 
-Azure Synapse and Azure HDInsight Spark both now natively support executing .NET code with .NET for Apache Spark. This means that you can potentially reuse some or all of your [.NET user-defined functions with Spark](#transform-user-defined-scalar-net-functions-and-user-defined-aggregators). Note though that U-SQL uses the .NET Framework while .NET for Apache Spark is based on .NET Core 3.1 or later.
+Azure Synapse and Azure HDInsight Spark both now natively support executing .NET code with .NET for Apache Spark. This means that you can potentially reuse some or all of your [.NET user-defined functions with Spark](#transform-user-defined-scalar-net-functions-and-user-defined-aggregators). Note though that U-SQL uses the .NET Framework while .NET for Apache Spark is based on .NET Core 3.1 or later.
 
 [U-SQL user-defined operators (UDOs)](#transform-user-defined-operators-udos) use the U-SQL UDO model to provide scaled-out execution of the operator's code. Thus, UDOs will have to be rewritten into user-defined functions to fit into the Spark execution model.
 
 .NET for Apache Spark currently doesn't support user-defined aggregators. Thus, [U-SQL user-defined aggregators](#transform-user-defined-scalar-net-functions-and-user-defined-aggregators) will have to be translated into Spark user-defined aggregators written in Scala.
 
-If you don't want to take advantage of the .NET for Apache Spark capabilities, you'll have to rewrite your expressions into an equivalent Spark, Scala, Java, or Python expression, function, aggregator, or connector.
+If you don't want to take advantage of the .NET for Apache Spark capabilities, you'll have to rewrite your expressions into an equivalent Spark, Scala, Java, or Python expression, function, aggregator, or connector.
 
 In any case, if you have a large amount of .NET logic in your U-SQL scripts, please contact us through your Microsoft Account representative for further guidance.
 
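To ground the UDF-reuse guidance in the hunk above: a minimal notebook-style sketch of wrapping existing .NET logic as a Spark UDF with the `Microsoft.Spark` API. The app name, input query, and column names are illustrative assumptions, not part of the commit.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// Assumed session setup; in Synapse notebooks `spark` is pre-defined.
SparkSession spark = SparkSession.Builder().AppName("udf-sketch").GetOrCreate();

// Existing .NET logic wrapped as a Spark UDF (hypothetical example logic).
Func<Column, Column> toUpper = Udf<string, string>(
    s => s == null ? null : s.ToUpperInvariant());

DataFrame df = spark.Sql("SELECT 'seattle' AS city");  // hypothetical input
df.Select(toUpper(Col("city")).Alias("city_upper")).Show();
```

As the hunk notes, this route covers scalar functions only; user-defined aggregators still need a Scala rewrite.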
@@ -137,9 +137,9 @@ For more information, see:
 
 In Spark, types by default allow NULL values, while in U-SQL, you explicitly mark scalar, non-object types as nullable. While Spark allows you to define a column as not nullable, it will not enforce the constraint and [may lead to wrong results](https://medium.com/@weshoffman/apache-spark-parquet-and-troublesome-nulls-28712b06f836).
 
-In Spark, NULL indicates that the value is unknown. A Spark NULL value is different from any value, including itself. Comparisons between two Spark NULL values, or between a NULL value and any other value, return unknown because the value of each NULL is unknown.
+In Spark, NULL indicates that the value is unknown. A Spark NULL value is different from any value, including itself. Comparisons between two Spark NULL values, or between a NULL value and any other value, return unknown because the value of each NULL is unknown.
 
-This behavior is different from U-SQL, which follows C# semantics where `null` is different from any value but equal to itself.
+This behavior is different from U-SQL, which follows C# semantics where `null` is different from any value but equal to itself.
 
 Thus a SparkSQL `SELECT` statement that uses `WHERE column_name = NULL` returns zero rows even if there are NULL values in `column_name`, while in U-SQL, it would return the rows where `column_name` is set to `null`. Similarly, a Spark `SELECT` statement that uses `WHERE column_name != NULL` returns zero rows even if there are non-null values in `column_name`, while in U-SQL, it would return the rows that have non-null values. Thus, if you want the U-SQL null-check semantics, you should use [isnull](https://spark.apache.org/docs/2.3.0/api/sql/index.html#isnull) and [isnotnull](https://spark.apache.org/docs/2.3.0/api/sql/index.html#isnotnull) respectively (or their DSL equivalents).
 
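The NULL semantics described in this hunk are easy to demonstrate; a small sketch against the `Microsoft.Spark` API, with a hypothetical table and column:

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

SparkSession spark = SparkSession.Builder().AppName("null-sketch").GetOrCreate();

// Hypothetical view with one NULL and one non-NULL value.
spark.Sql("SELECT * FROM VALUES (1, 'a'), (2, NULL) AS t(id, column_name)")
     .CreateOrReplaceTempView("my_table");

// Returns zero rows: NULL = NULL evaluates to unknown, never true.
spark.Sql("SELECT * FROM my_table WHERE column_name = NULL").Show();

// U-SQL-style null checks need isnull/isnotnull, or the DSL equivalents:
spark.Sql("SELECT * FROM my_table WHERE isnull(column_name)").Show();
spark.Table("my_table").Filter(Col("column_name").IsNotNull()).Show();
```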
@@ -203,7 +203,7 @@ Most of the settable system variables have no direct equivalent in Spark. Some o
 
 ### U-SQL hints
 
-U-SQL offers several syntactic ways to provide hints to the query optimizer and execution engine:
+U-SQL offers several syntactic ways to provide hints to the query optimizer and execution engine:
 
 - Setting a U-SQL system variable
 - An `OPTION` clause associated with the rowset expression to provide a data or plan hint
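For the Spark side of the comparison that the next hunk references: Spark accepts plan hints through the DataFrame API (and through `/*+ ... */` SQL comments). A hedged sketch with hypothetical data, where `Hint` stands in loosely for a U-SQL `OPTION` plan hint:

```csharp
using Microsoft.Spark.Sql;

SparkSession spark = SparkSession.Builder().AppName("hint-sketch").GetOrCreate();

DataFrame facts = spark.Range(0, 1000).WithColumnRenamed("id", "key");
DataFrame dim = spark.Range(0, 10).WithColumnRenamed("id", "key");

// Ask the optimizer to broadcast the small side of the join.
DataFrame joined = facts.Join(dim.Hint("broadcast"), "key");
joined.Explain();
```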
@@ -214,7 +214,7 @@ Spark's cost-based query optimizer has its own capabilities to provide hints and
 ## Next steps
 
 - [Understand Spark data formats for U-SQL developers](understand-spark-data-formats.md)
-- [.NET for Apache Spark](/dotnet/spark/what-is-apache-spark-dotnet)
+- [.NET for Apache Spark](/previous-versions/dotnet/spark/what-is-apache-spark-dotnet)
 - [Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-migrate-gen1-to-gen2.md)
 - [Transform data using Spark activity in Azure Data Factory](../data-factory/transform-data-using-spark.md)
 - [Transform data using Hadoop Hive activity in Azure Data Factory](../data-factory/transform-data-using-hadoop-hive.md)

articles/data-lake-analytics/understand-spark-data-formats.md (2 additions, 2 deletions)

@@ -47,7 +47,7 @@ After this transformation, you copy the data as outlined in the chapter [Move da
 
 - [Understand Spark code concepts for U-SQL developers](understand-spark-code-concepts.md)
 - [Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-migrate-gen1-to-gen2.md)
-- [.NET for Apache Spark](/dotnet/spark/what-is-apache-spark-dotnet)
+- [.NET for Apache Spark](/previous-versions/dotnet/spark/what-is-apache-spark-dotnet)
 - [Transform data using Spark activity in Azure Data Factory](../data-factory/transform-data-using-spark.md)
 - [Transform data using Hadoop Hive activity in Azure Data Factory](../data-factory/transform-data-using-hadoop-hive.md)
-- [What is Apache Spark in Azure HDInsight](../hdinsight/spark/apache-spark-overview.md)
+- [What is Apache Spark in Azure HDInsight](../hdinsight/spark/apache-spark-overview.md)

articles/data-lake-analytics/understand-spark-for-usql-developers.md (2 additions, 2 deletions)

@@ -42,7 +42,7 @@ It includes the steps you can take, and several alternatives.
 - [Understand Spark data formats for U-SQL developers](understand-spark-data-formats.md)
 - [Understand Spark code concepts for U-SQL developers](understand-spark-code-concepts.md)
 - [Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-migrate-gen1-to-gen2.md)
-- [.NET for Apache Spark](/dotnet/spark/what-is-apache-spark-dotnet)
+- [.NET for Apache Spark](/previous-versions/dotnet/spark/what-is-apache-spark-dotnet)
 - [Transform data using Hadoop Hive activity in Azure Data Factory](../data-factory/transform-data-using-hadoop-hive.md)
 - [Transform data using Spark activity in Azure Data Factory](../data-factory/transform-data-using-spark.md)
-- [What is Apache Spark in Azure HDInsight](../hdinsight/spark/apache-spark-overview.md)
+- [What is Apache Spark in Azure HDInsight](../hdinsight/spark/apache-spark-overview.md)

articles/synapse-analytics/spark/spark-dotnet.md (22 additions, 17 deletions)
@@ -3,18 +3,18 @@ title: Use .NET for Apache Spark
 description: Learn about using .NET and Apache Spark to do batch processing, real-time streaming, machine learning, and write ad-hoc queries in Azure Synapse Analytics notebooks.
 author: juluczni
 ms.author: juluczni
-services: synapse-analytics
-ms.service: synapse-analytics
+services: synapse-analytics
+ms.service: synapse-analytics
 ms.topic: conceptual
 ms.subservice: spark
 ms.custom: devx-track-dotnet
-ms.date: 05/01/2020
+ms.date: 05/01/2020
 ms.reviewer: sngun
 ---
 
 # Use .NET for Apache Spark with Azure Synapse Analytics
 
-[.NET for Apache Spark](https://dot.net/spark) provides free, [open-source](https://github.com/dotnet/spark), and cross-platform .NET support for Spark.
+[.NET for Apache Spark](https://dot.net/spark) provides free, [open-source](https://github.com/dotnet/spark), and cross-platform .NET support for Spark.
 
 It provides .NET bindings for Spark, which allows you to access Spark APIs through C# and F#. With .NET for Apache Spark, you can also write and execute user-defined functions for Spark written in .NET. The .NET APIs for Spark enable you to access all aspects of Spark DataFrames that help you analyze your data, including Spark SQL, Delta Lake, and Structured Streaming.
 
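For orientation on the batch-job steps diffed below: a minimal sketch of the console application that the project file in the later hunk would build. The app name and input path are illustrative assumptions.

```csharp
using Microsoft.Spark.Sql;

namespace MySparkApp
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create (or reuse) the Spark session for this batch job.
            SparkSession spark = SparkSession
                .Builder()
                .AppName("mySparkApp")
                .GetOrCreate();

            // Hypothetical input; any supported DataFrame source works here.
            DataFrame lines = spark.Read().Text("input.txt");
            lines.Show();

            spark.Stop();
        }
    }
}
```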
@@ -23,8 +23,8 @@ You can analyze data with .NET for Apache Spark through Spark batch job definiti
 >[!IMPORTANT]
 > [.NET for Apache Spark](/previous-versions/dotnet/spark/what-is-apache-spark-dotnet) is an open-source project under the .NET Foundation that currently requires .NET Core 3.1, which has reached out-of-support status. We would like to inform users of Azure Synapse Spark of the removal of the .NET for Apache Spark library in the Azure Synapse Runtime for Apache Spark version 3.3. Users may refer to the [.NET Support Policy](https://dotnet.microsoft.com/platform/support/policy/dotnet-core) for more details.
 >
-> As a result, users will no longer be able to use Apache Spark APIs via C# and F#, or run C# code in notebooks within Synapse or through Apache Spark job definitions in Synapse. Note that this change affects only Azure Synapse Runtime for Apache Spark 3.3 and above.
+> As a result, users will no longer be able to use Apache Spark APIs via C# and F#, or run C# code in notebooks within Synapse or through Apache Spark job definitions in Synapse. Note that this change affects only Azure Synapse Runtime for Apache Spark 3.3 and above.
 >
 > We will continue to support .NET for Apache Spark in all previous versions of the Azure Synapse Runtime according to [their lifecycle stages](runtime-for-apache-spark-lifecycle-and-supportability.md). However, we do not have plans to support .NET for Apache Spark in Azure Synapse Runtime for Apache Spark 3.3 and future versions. We recommend that users with existing workloads written in C# or F# migrate to Python or Scala.
 
 ## Submit batch jobs using the Spark job definition
@@ -37,34 +37,35 @@ The required .NET Spark version will be noted in the Synapse Studio interface un
 :::image type="content" source="./media/apache-spark-job-definitions/net-spark-workspace-compatibility.png" alt-text="Screenshot that shows properties, including the .NET Spark version.":::
 
 1. Create your project as a .NET console application that outputs an Ubuntu x86 executable.
-
+
    ```
    <Project Sdk="Microsoft.NET.Sdk">
-
+
      <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFramework>netcoreapp3.1</TargetFramework>
      </PropertyGroup>
-
+
      <ItemGroup>
        <PackageReference Include="Microsoft.Spark" Version="2.1.0" />
     </ItemGroup>
-
+
    </Project>
    ```
 
 2. Run the following commands to publish your app. Be sure to replace *mySparkApp* with the path to your app.
-
+
    ```dotnetcli
    cd mySparkApp
    dotnet publish -c Release -f netcoreapp3.1 -r ubuntu.18.04-x64
    ```
 
-3. Zip the contents of the publish folder, `publish.zip` for example, that was created as a result of Step 2. All the assemblies should be in the root of the ZIP file and there should be no intermediate folder layer. This means when you unzip `publish.zip`, all assemblies are extracted into your current working directory.
+3. Zip the contents of the publish folder, `publish.zip` for example, that was created as a result of Step 2. All the assemblies should be in the root of the ZIP file and there should be no intermediate folder layer. This means when you unzip `publish.zip`, all assemblies are extracted into your current working directory.
 
    **On Windows:**
 
   Using Windows PowerShell or PowerShell 7, create a .zip from the contents of your publish directory.
+
    ```PowerShell
    Compress-Archive publish/* publish.zip -Update
    ```
@@ -77,9 +78,9 @@ The required .NET Spark version will be noted in the Synapse Studio interface un
    zip -r publish.zip .
    ```
 
-## .NET for Apache Spark in Azure Synapse Analytics notebooks
+## .NET for Apache Spark in Azure Synapse Analytics notebooks
 
-Notebooks are a great option for prototyping your .NET for Apache Spark pipelines and scenarios. You can start working with, understanding, filtering, displaying, and visualizing your data quickly and efficiently.
+Notebooks are a great option for prototyping your .NET for Apache Spark pipelines and scenarios. You can start working with, understanding, filtering, displaying, and visualizing your data quickly and efficiently.
 
 Data engineers, data scientists, business analysts, and machine learning engineers are all able to collaborate over a shared, interactive document. You see immediate results from data exploration, and can visualize your data in the same notebook.
 
@@ -109,19 +110,23 @@ The following features are available when you use .NET for Apache Spark in the A
 * Access to the standard C# library (such as System, LINQ, Enumerables, and so on).
 * Support for C# 8.0 language features.
 * `spark` as a pre-defined variable to give you access to your Apache Spark session.
-* Support for defining [.NET user-defined functions that can run within Apache Spark](/dotnet/spark/how-to-guides/udf-guide). We recommend [Write and call UDFs in .NET for Apache Spark Interactive environments](/dotnet/spark/how-to-guides/dotnet-interactive-udf-issue) for learning how to use UDFs in .NET for Apache Spark Interactive experiences.
+* Support for defining [.NET user-defined functions that can run within Apache Spark](/previous-versions/dotnet/spark/how-to-guides/udf-guide). We recommend [Write and call UDFs in .NET for Apache Spark Interactive environments](/previous-versions/dotnet/spark/how-to-guides/dotnet-interactive-udf-issue) for learning how to use UDFs in .NET for Apache Spark Interactive experiences.
 * Support for visualizing output from your Spark jobs using different charts (such as line, bar, or histogram) and layouts (such as single, overlaid, and so on) using the `XPlot.Plotly` library.
 * Ability to include NuGet packages into your C# notebook.
+
 ## Troubleshooting
 
 ### `DotNetRunner: null` / `Futures timeout` in Synapse Spark Job Definition Run
+
 Synapse Spark Job Definitions on Spark Pools using Spark 2.4 require `Microsoft.Spark` 1.0.0. Clear your `bin` and `obj` directories, and publish the project using 1.0.0.
-### OutOfMemoryError: java heap space at org.apache.spark...
+
+### OutOfMemoryError: java heap space at org.apache.spark
+
 Dotnet Spark 1.0.0 uses a different debug architecture than 1.1.1+. You will have to use 1.0.0 for your published version and 1.1.1+ for local debugging.
 
 ## Next steps
 
 * [.NET for Apache Spark documentation](/previous-versions/dotnet/spark/what-is-apache-spark-dotnet)
-* [.NET for Apache Spark Interactive guides](/dotnet/spark/how-to-guides/dotnet-interactive-udf-issue)
+* [.NET for Apache Spark Interactive guides](/previous-versions/dotnet/spark/how-to-guides/dotnet-interactive-udf-issue)
 * [Azure Synapse Analytics](https://azure.microsoft.com/services/synapse-analytics/)
 * [.NET Interactive](https://devblogs.microsoft.com/dotnet/creating-interactive-net-documentation/)
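As a usage note on the notebook features listed in the last hunk: a minimal cell sketch that relies only on the pre-defined `spark` variable; the query itself is a hypothetical placeholder.

```csharp
// `spark` is pre-defined in Synapse .NET notebook cells (see the feature list above).
var df = spark.Sql("SELECT 1 AS id UNION ALL SELECT 2 AS id");
df.Show();
```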
