
Commit 99dd822

Merge pull request #249852 from gewarren/net-spark
Update .NET for Apache Spark links to previous-versions
2 parents: d278553 + ad814a6

4 files changed (+36, -31 lines)

articles/data-lake-analytics/understand-spark-code-concepts.md (10 additions, 10 deletions)
@@ -15,10 +15,10 @@ This section provides high-level guidance on transforming U-SQL Scripts to Apach
 
 - It starts with a [comparison of the two languages' processing paradigms](#understand-the-u-sql-and-spark-language-and-processing-paradigms)
 - Provides tips on how to:
-  - [Transform scripts](#transform-u-sql-scripts) including U-SQL's [rowset expressions](#transform-u-sql-rowset-expressions-and-sql-based-scalar-expressions)
-  - [.NET code](#transform-net-code)
-  - [Data types](#transform-typed-values)
-  - [Catalog objects](#transform-u-sql-catalog-objects).
+  - [Transform scripts](#transform-u-sql-scripts) including U-SQL's [rowset expressions](#transform-u-sql-rowset-expressions-and-sql-based-scalar-expressions)
+  - [.NET code](#transform-net-code)
+  - [Data types](#transform-typed-values)
+  - [Catalog objects](#transform-u-sql-catalog-objects).
 
 ## Understand the U-SQL and Spark language and processing paradigms
 
@@ -48,13 +48,13 @@ Spark programs are similar in that you would use Spark connectors to read the da
 
 U-SQL's expression language is C# and it offers various ways to scale out custom .NET code with user-defined functions, user-defined operators, and user-defined aggregators.
 
-Azure Synapse and Azure HDInsight Spark both now natively support executing .NET code with .NET for Apache Spark. This means that you can potentially reuse some or all of your [.NET user-defined functions with Spark](#transform-user-defined-scalar-net-functions-and-user-defined-aggregators). Note though that U-SQL uses the .NET Framework while .NET for Apache Spark is based on .NET Core 3.1 or later.
+Azure Synapse and Azure HDInsight Spark both now natively support executing .NET code with .NET for Apache Spark. This means that you can potentially reuse some or all of your [.NET user-defined functions with Spark](#transform-user-defined-scalar-net-functions-and-user-defined-aggregators). Note though that U-SQL uses the .NET Framework while .NET for Apache Spark is based on .NET Core 3.1 or later.
 
 [U-SQL user-defined operators (UDOs)](#transform-user-defined-operators-udos) use the U-SQL UDO model to provide scaled-out execution of the operator's code. Thus, UDOs will have to be rewritten into user-defined functions to fit into the Spark execution model.
 
 .NET for Apache Spark currently doesn't support user-defined aggregators. Thus, [U-SQL user-defined aggregators](#transform-user-defined-scalar-net-functions-and-user-defined-aggregators) will have to be translated into Spark user-defined aggregators written in Scala.
 
-If you don't want to take advantage of the .NET for Apache Spark capabilities, you'll have to rewrite your expressions into an equivalent Spark, Scala, Java, or Python expression, function, aggregator, or connector.
+If you don't want to take advantage of the .NET for Apache Spark capabilities, you'll have to rewrite your expressions into an equivalent Spark, Scala, Java, or Python expression, function, aggregator, or connector.
 
 In any case, if you have a large amount of .NET logic in your U-SQL scripts, please contact us through your Microsoft Account representative for further guidance.
 
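To ground the UDF-reuse guidance in the hunk above: a minimal notebook-style sketch of wrapping existing .NET logic as a Spark UDF with the `Microsoft.Spark` API. The app name, input query, and column names are illustrative assumptions, not part of the commit.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// Assumed session setup; in Synapse notebooks `spark` is pre-defined.
SparkSession spark = SparkSession.Builder().AppName("udf-sketch").GetOrCreate();

// Existing .NET logic wrapped as a Spark UDF (hypothetical example logic).
Func<Column, Column> toUpper = Udf<string, string>(
    s => s == null ? null : s.ToUpperInvariant());

DataFrame df = spark.Sql("SELECT 'seattle' AS city");  // hypothetical input
df.Select(toUpper(Col("city")).Alias("city_upper")).Show();
```

As the hunk notes, this route covers scalar functions only; user-defined aggregators still need a Scala rewrite.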
@@ -137,9 +137,9 @@ For more information, see:
 
 In Spark, types by default allow NULL values, while in U-SQL, you explicitly mark scalar, non-object types as nullable. While Spark allows you to define a column as not nullable, it will not enforce the constraint and [may lead to wrong results](https://medium.com/@weshoffman/apache-spark-parquet-and-troublesome-nulls-28712b06f836).
 
-In Spark, NULL indicates that the value is unknown. A Spark NULL value is different from any value, including itself. Comparisons between two Spark NULL values, or between a NULL value and any other value, return unknown because the value of each NULL is unknown.
+In Spark, NULL indicates that the value is unknown. A Spark NULL value is different from any value, including itself. Comparisons between two Spark NULL values, or between a NULL value and any other value, return unknown because the value of each NULL is unknown.
 
-This behavior is different from U-SQL, which follows C# semantics where `null` is different from any value but equal to itself.
+This behavior is different from U-SQL, which follows C# semantics where `null` is different from any value but equal to itself.
 
 Thus a SparkSQL `SELECT` statement that uses `WHERE column_name = NULL` returns zero rows even if there are NULL values in `column_name`, while in U-SQL, it would return the rows where `column_name` is set to `null`. Similarly, a Spark `SELECT` statement that uses `WHERE column_name != NULL` returns zero rows even if there are non-null values in `column_name`, while in U-SQL, it would return the rows that have non-null values. Thus, if you want the U-SQL null-check semantics, you should use [isnull](https://spark.apache.org/docs/2.3.0/api/sql/index.html#isnull) and [isnotnull](https://spark.apache.org/docs/2.3.0/api/sql/index.html#isnotnull) respectively (or their DSL equivalents).
 
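The NULL semantics described in this hunk are easy to demonstrate; a small sketch against the `Microsoft.Spark` API, with a hypothetical table and column:

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

SparkSession spark = SparkSession.Builder().AppName("null-sketch").GetOrCreate();

// Hypothetical view with one NULL and one non-NULL value.
spark.Sql("SELECT * FROM VALUES (1, 'a'), (2, NULL) AS t(id, column_name)")
     .CreateOrReplaceTempView("my_table");

// Returns zero rows: NULL = NULL evaluates to unknown, never true.
spark.Sql("SELECT * FROM my_table WHERE column_name = NULL").Show();

// U-SQL-style null checks need isnull/isnotnull, or the DSL equivalents:
spark.Sql("SELECT * FROM my_table WHERE isnull(column_name)").Show();
spark.Table("my_table").Filter(Col("column_name").IsNotNull()).Show();
```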
@@ -203,7 +203,7 @@ Most of the settable system variables have no direct equivalent in Spark. Some o
 
 ### U-SQL hints
 
-U-SQL offers several syntactic ways to provide hints to the query optimizer and execution engine:
+U-SQL offers several syntactic ways to provide hints to the query optimizer and execution engine:
 
 - Setting a U-SQL system variable
 - An `OPTION` clause associated with the rowset expression to provide a data or plan hint
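For the Spark side of the comparison that the next hunk references: Spark accepts plan hints through the DataFrame API (and through `/*+ ... */` SQL comments). A hedged sketch with hypothetical data, where `Hint` stands in loosely for a U-SQL `OPTION` plan hint:

```csharp
using Microsoft.Spark.Sql;

SparkSession spark = SparkSession.Builder().AppName("hint-sketch").GetOrCreate();

DataFrame facts = spark.Range(0, 1000).WithColumnRenamed("id", "key");
DataFrame dim = spark.Range(0, 10).WithColumnRenamed("id", "key");

// Ask the optimizer to broadcast the small side of the join.
DataFrame joined = facts.Join(dim.Hint("broadcast"), "key");
joined.Explain();
```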
@@ -214,7 +214,7 @@ Spark's cost-based query optimizer has its own capabilities to provide hints and
 ## Next steps
 
 - [Understand Spark data formats for U-SQL developers](understand-spark-data-formats.md)
-- [.NET for Apache Spark](/dotnet/spark/what-is-apache-spark-dotnet)
+- [.NET for Apache Spark](/previous-versions/dotnet/spark/what-is-apache-spark-dotnet)
 - [Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-migrate-gen1-to-gen2.md)
 - [Transform data using Spark activity in Azure Data Factory](../data-factory/transform-data-using-spark.md)
 - [Transform data using Hadoop Hive activity in Azure Data Factory](../data-factory/transform-data-using-hadoop-hive.md)

articles/data-lake-analytics/understand-spark-data-formats.md (2 additions, 2 deletions)

@@ -47,7 +47,7 @@ After this transformation, you copy the data as outlined in the chapter [Move da
 
 - [Understand Spark code concepts for U-SQL developers](understand-spark-code-concepts.md)
 - [Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-migrate-gen1-to-gen2.md)
-- [.NET for Apache Spark](/dotnet/spark/what-is-apache-spark-dotnet)
+- [.NET for Apache Spark](/previous-versions/dotnet/spark/what-is-apache-spark-dotnet)
 - [Transform data using Spark activity in Azure Data Factory](../data-factory/transform-data-using-spark.md)
 - [Transform data using Hadoop Hive activity in Azure Data Factory](../data-factory/transform-data-using-hadoop-hive.md)
-- [What is Apache Spark in Azure HDInsight](../hdinsight/spark/apache-spark-overview.md)
+- [What is Apache Spark in Azure HDInsight](../hdinsight/spark/apache-spark-overview.md)

articles/data-lake-analytics/understand-spark-for-usql-developers.md (2 additions, 2 deletions)

@@ -42,7 +42,7 @@ It includes the steps you can take, and several alternatives.
 - [Understand Spark data formats for U-SQL developers](understand-spark-data-formats.md)
 - [Understand Spark code concepts for U-SQL developers](understand-spark-code-concepts.md)
 - [Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-migrate-gen1-to-gen2.md)
-- [.NET for Apache Spark](/dotnet/spark/what-is-apache-spark-dotnet)
+- [.NET for Apache Spark](/previous-versions/dotnet/spark/what-is-apache-spark-dotnet)
 - [Transform data using Hadoop Hive activity in Azure Data Factory](../data-factory/transform-data-using-hadoop-hive.md)
 - [Transform data using Spark activity in Azure Data Factory](../data-factory/transform-data-using-spark.md)
-- [What is Apache Spark in Azure HDInsight](../hdinsight/spark/apache-spark-overview.md)
+- [What is Apache Spark in Azure HDInsight](../hdinsight/spark/apache-spark-overview.md)

articles/synapse-analytics/spark/spark-dotnet.md (22 additions, 17 deletions)
@@ -3,18 +3,18 @@ title: Use .NET for Apache Spark
 description: Learn about using .NET and Apache Spark to do batch processing, real-time streaming, machine learning, and write ad-hoc queries in Azure Synapse Analytics notebooks.
 author: juluczni
 ms.author: juluczni
-services: synapse-analytics
-ms.service: synapse-analytics
+services: synapse-analytics
+ms.service: synapse-analytics
 ms.topic: conceptual
 ms.subservice: spark
 ms.custom: devx-track-dotnet
-ms.date: 05/01/2020
+ms.date: 05/01/2020
 ms.reviewer: sngun
 ---
 
 # Use .NET for Apache Spark with Azure Synapse Analytics
 
-[.NET for Apache Spark](https://dot.net/spark) provides free, [open-source](https://github.com/dotnet/spark), and cross-platform .NET support for Spark.
+[.NET for Apache Spark](https://dot.net/spark) provides free, [open-source](https://github.com/dotnet/spark), and cross-platform .NET support for Spark.
 
 It provides .NET bindings for Spark, which allows you to access Spark APIs through C# and F#. With .NET for Apache Spark, you can also write and execute user-defined functions for Spark written in .NET. The .NET APIs for Spark enable you to access all aspects of Spark DataFrames that help you analyze your data, including Spark SQL, Delta Lake, and Structured Streaming.
 
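For orientation on the batch-job steps diffed below: a minimal sketch of the console application that the project file in the later hunk would build. The app name and input path are illustrative assumptions.

```csharp
using Microsoft.Spark.Sql;

namespace MySparkApp
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create (or reuse) the Spark session for this batch job.
            SparkSession spark = SparkSession
                .Builder()
                .AppName("mySparkApp")
                .GetOrCreate();

            // Hypothetical input; any supported DataFrame source works here.
            DataFrame lines = spark.Read().Text("input.txt");
            lines.Show();

            spark.Stop();
        }
    }
}
```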
@@ -23,8 +23,8 @@ You can analyze data with .NET for Apache Spark through Spark batch job definiti
 >[!IMPORTANT]
 > [.NET for Apache Spark](/previous-versions/dotnet/spark/what-is-apache-spark-dotnet) is an open-source project under the .NET Foundation that currently requires .NET Core 3.1, which has reached out-of-support status. We would like to inform users of Azure Synapse Spark of the removal of the .NET for Apache Spark library in the Azure Synapse Runtime for Apache Spark version 3.3. Users may refer to the [.NET Support Policy](https://dotnet.microsoft.com/platform/support/policy/dotnet-core) for more details.
 >
-> As a result, users will no longer be able to use Apache Spark APIs via C# and F#, or run C# code in notebooks within Synapse or through Apache Spark job definitions in Synapse. Note that this change affects only Azure Synapse Runtime for Apache Spark 3.3 and above.
+> As a result, users will no longer be able to use Apache Spark APIs via C# and F#, or run C# code in notebooks within Synapse or through Apache Spark job definitions in Synapse. Note that this change affects only Azure Synapse Runtime for Apache Spark 3.3 and above.
 >
 > We will continue to support .NET for Apache Spark in all previous versions of the Azure Synapse Runtime according to [their lifecycle stages](runtime-for-apache-spark-lifecycle-and-supportability.md). However, we do not have plans to support .NET for Apache Spark in Azure Synapse Runtime for Apache Spark 3.3 and future versions. We recommend that users with existing workloads written in C# or F# migrate to Python or Scala.
 
 ## Submit batch jobs using the Spark job definition
@@ -37,34 +37,35 @@ The required .NET Spark version will be noted in the Synapse Studio interface un
 :::image type="content" source="./media/apache-spark-job-definitions/net-spark-workspace-compatibility.png" alt-text="Screenshot that shows properties, including the .NET Spark version.":::
 
 1. Create your project as a .NET console application that outputs an Ubuntu x86 executable.
-
+
    ```
    <Project Sdk="Microsoft.NET.Sdk">
-
+
      <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFramework>netcoreapp3.1</TargetFramework>
      </PropertyGroup>
-
+
      <ItemGroup>
        <PackageReference Include="Microsoft.Spark" Version="2.1.0" />
     </ItemGroup>
-
+
    </Project>
    ```
 
 2. Run the following commands to publish your app. Be sure to replace *mySparkApp* with the path to your app.
-
+
    ```dotnetcli
    cd mySparkApp
    dotnet publish -c Release -f netcoreapp3.1 -r ubuntu.18.04-x64
    ```
 
-3. Zip the contents of the publish folder, `publish.zip` for example, that was created as a result of Step 2. All the assemblies should be in the root of the ZIP file and there should be no intermediate folder layer. This means when you unzip `publish.zip`, all assemblies are extracted into your current working directory.
+3. Zip the contents of the publish folder, `publish.zip` for example, that was created as a result of Step 2. All the assemblies should be in the root of the ZIP file and there should be no intermediate folder layer. This means when you unzip `publish.zip`, all assemblies are extracted into your current working directory.
 
    **On Windows:**
 
   Using Windows PowerShell or PowerShell 7, create a .zip from the contents of your publish directory.
+
    ```PowerShell
    Compress-Archive publish/* publish.zip -Update
    ```
@@ -77,9 +78,9 @@ The required .NET Spark version will be noted in the Synapse Studio interface un
    zip -r publish.zip .
    ```
 
-## .NET for Apache Spark in Azure Synapse Analytics notebooks
+## .NET for Apache Spark in Azure Synapse Analytics notebooks
 
-Notebooks are a great option for prototyping your .NET for Apache Spark pipelines and scenarios. You can start working with, understanding, filtering, displaying, and visualizing your data quickly and efficiently.
+Notebooks are a great option for prototyping your .NET for Apache Spark pipelines and scenarios. You can start working with, understanding, filtering, displaying, and visualizing your data quickly and efficiently.
 
 Data engineers, data scientists, business analysts, and machine learning engineers are all able to collaborate over a shared, interactive document. You see immediate results from data exploration, and can visualize your data in the same notebook.
 
@@ -109,19 +110,23 @@ The following features are available when you use .NET for Apache Spark in the A
 * Access to the standard C# library (such as System, LINQ, Enumerables, and so on).
 * Support for C# 8.0 language features.
 * `spark` as a pre-defined variable to give you access to your Apache Spark session.
-* Support for defining [.NET user-defined functions that can run within Apache Spark](/dotnet/spark/how-to-guides/udf-guide). We recommend [Write and call UDFs in .NET for Apache Spark Interactive environments](/dotnet/spark/how-to-guides/dotnet-interactive-udf-issue) for learning how to use UDFs in .NET for Apache Spark Interactive experiences.
+* Support for defining [.NET user-defined functions that can run within Apache Spark](/previous-versions/dotnet/spark/how-to-guides/udf-guide). We recommend [Write and call UDFs in .NET for Apache Spark Interactive environments](/previous-versions/dotnet/spark/how-to-guides/dotnet-interactive-udf-issue) for learning how to use UDFs in .NET for Apache Spark Interactive experiences.
 * Support for visualizing output from your Spark jobs using different charts (such as line, bar, or histogram) and layouts (such as single, overlaid, and so on) using the `XPlot.Plotly` library.
 * Ability to include NuGet packages into your C# notebook.
+
 ## Troubleshooting
 
 ### `DotNetRunner: null` / `Futures timeout` in Synapse Spark Job Definition Run
+
 Synapse Spark Job Definitions on Spark Pools using Spark 2.4 require `Microsoft.Spark` 1.0.0. Clear your `bin` and `obj` directories, and publish the project using 1.0.0.
-### OutOfMemoryError: java heap space at org.apache.spark...
+
+### OutOfMemoryError: java heap space at org.apache.spark
+
 Dotnet Spark 1.0.0 uses a different debug architecture than 1.1.1+. You will have to use 1.0.0 for your published version and 1.1.1+ for local debugging.
 
 ## Next steps
 
 * [.NET for Apache Spark documentation](/previous-versions/dotnet/spark/what-is-apache-spark-dotnet)
-* [.NET for Apache Spark Interactive guides](/dotnet/spark/how-to-guides/dotnet-interactive-udf-issue)
+* [.NET for Apache Spark Interactive guides](/previous-versions/dotnet/spark/how-to-guides/dotnet-interactive-udf-issue)
 * [Azure Synapse Analytics](https://azure.microsoft.com/services/synapse-analytics/)
 * [.NET Interactive](https://devblogs.microsoft.com/dotnet/creating-interactive-net-documentation/)
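As a usage note on the notebook features listed in the last hunk: a minimal cell sketch that relies only on the pre-defined `spark` variable; the query itself is a hypothetical placeholder.

```csharp
// `spark` is pre-defined in Synapse .NET notebook cells (see the feature list above).
var df = spark.Sql("SELECT 1 AS id UNION ALL SELECT 2 AS id");
df.Show();
```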
