
Commit 7f078ed

committed
update .net for apache spark links
1 parent 773ed52 commit 7f078ed

File tree

5 files changed

+187
-31
lines changed

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
---
title: "Cognitive Services for big data"
description: Learn how to leverage Azure Cognitive Services on large datasets using Python, Java, and Scala. With Cognitive Services for big data you can embed continuously improving, intelligent models directly into Apache Spark™ and SQL computations.
services: cognitive-services
author: mhamilton723
manager: nitinme
ms.service: cognitive-services
ms.custom: ignite-2022, devx-track-extended-java, devx-track-python
ms.topic: conceptual
ms.date: 10/28/2021
ms.author: marhamil
---

# Azure Cognitive Services for big data

![Azure Cognitive Services for big data](media/cognitive-services-big-data-overview.svg)

Azure Cognitive Services for big data lets users channel terabytes of data through Cognitive Services using [Apache Spark™](/previous-versions/dotnet/spark/what-is-spark) and open source libraries for distributed machine learning workloads. With Cognitive Services for big data, it's easy to create large-scale intelligent applications with any datastore.

Using the resources and libraries described in this article, you can embed continuously improving, intelligent models directly into Apache Spark™ and SQL computations. These tools liberate developers from low-level networking details, so that they can focus on creating smart, distributed applications.

## Features and benefits

Cognitive Services for big data can use resources from any [supported region](https://azure.microsoft.com/global-infrastructure/services/?products=cognitive-services), as well as [containerized Cognitive Services](../cognitive-services-container-support.md). Containers support low-connectivity or no-connectivity deployments with ultra-low-latency responses. Containerized Cognitive Services can run locally, directly on the worker nodes of your Spark cluster, or on an external orchestrator like Kubernetes.

## Supported services

[Cognitive Services](../index.yml), accessed through APIs and SDKs, help developers build intelligent applications without requiring AI or data science skills. With Cognitive Services, you can make your applications see, hear, speak, and understand. To use Cognitive Services, your application must send data to the service over the network. Once received, the service sends an intelligent response in return. The following Cognitive Services resources are available for big data workloads:

### Vision

|Service Name|Service Description|
|:-----------|:------------------|
|[Computer Vision](../computer-vision/index.yml "Computer Vision")|The Computer Vision service provides you with access to advanced algorithms for processing images and returning information.|
|[Face](../computer-vision/index-identity.yml "Face")|The Face service provides access to advanced face algorithms, enabling face attribute detection and recognition.|

### Speech

|Service Name|Service Description|
|:-----------|:------------------|
|[Speech service](../speech-service/index.yml "Speech service")|The Speech service provides access to features like speech recognition, speech synthesis, speech translation, and speaker verification and identification.|

### Decision

|Service Name|Service Description|
|:-----------|:------------------|
|[Anomaly Detector](../anomaly-detector/index.yml "Anomaly Detector")|The Anomaly Detector service allows you to monitor and detect abnormalities in your time series data.|

### Language

|Service Name|Service Description|
|:-----------|:------------------|
|[Language service](../language-service/index.yml "Language service")|The Language service provides natural language processing over raw text for sentiment analysis, key-phrase extraction, and language detection.|

### Search

|Service Name|Service Description|
|:-----------|:------------------|
|[Bing Image Search](/azure/cognitive-services/bing-image-search "Bing Image Search")|The Bing Image Search service returns images determined to be relevant to the user's query.|

## Supported programming languages for Cognitive Services for big data

Cognitive Services for big data is built on Apache Spark, a distributed computing framework that supports Java, Scala, Python, R, and many other languages. See [SynapseML](https://microsoft.github.io/SynapseML) for documentation, samples, and blog posts.
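SynapseML is attached to a cluster as a Spark package rather than installed as an ordinary library. As a sketch, a `spark-defaults.conf` entry along these lines pulls it in; the coordinates shown are for the 0.10.0 release linked above and are an assumption for other versions:

```
spark.jars.packages     com.microsoft.azure:synapseml_2.12:0.10.0
spark.jars.repositories https://mmlspark.azureedge.net/maven
```

The same coordinates can be passed to `spark-submit --packages` or to a Databricks/Synapse library configuration.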

The following languages are currently supported.

### Python

We provide a PySpark API for current and legacy libraries:

* [`synapseml.cognitive`](https://mmlspark.blob.core.windows.net/docs/0.10.0/pyspark/synapse.ml.cognitive.html)
* [`mmlspark.cognitive`](https://mmlspark.blob.core.windows.net/docs/0.18.1/pyspark/modules.html)

For more information, see the [Python Developer API](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc1/pyspark/mmlspark.cognitive.html). For usage examples, see the [Python Samples](samples-python.md).

### Scala and Java

We provide a Scala and Java-based Spark API for current and legacy libraries:

* [`com.microsoft.synapseml.cognitive`](https://mmlspark.blob.core.windows.net/docs/0.10.0/scala/com/microsoft/azure/synapse/ml/cognitive/index.html)
* [`com.microsoft.ml.spark.cognitive`](https://mmlspark.blob.core.windows.net/docs/0.18.1/scala/index.html#com.microsoft.ml.spark.cognitive.package)

For more information, see the [Scala Developer API](https://mmlspark.blob.core.windows.net/docs/1.0.0-rc1/scala/index.html#package). For usage examples, see the [Scala Samples](samples-scala.md).

## Supported platforms and connectors

Big data scenarios require Apache Spark. There are several Apache Spark platforms that support Cognitive Services for big data.

### Azure Databricks

[Azure Databricks](/azure/databricks/scenarios/what-is-azure-databricks) is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. It provides one-click setup, streamlined workflows, and an interactive workspace that supports collaboration between data scientists, data engineers, and business analysts.

### Azure Synapse Analytics

[Azure Synapse Analytics](/azure/databricks/data/data-sources/azure/synapse-analytics) is an enterprise data warehouse that uses massively parallel processing. With Synapse Analytics, you can quickly run complex queries across petabytes of data. Azure Synapse Analytics provides managed Spark pools to run Spark jobs with an intuitive Jupyter notebook interface.

### Azure Kubernetes Service

[Azure Kubernetes Service (AKS)](../../aks/index.yml) orchestrates Docker containers and distributed applications at massive scale. AKS is a managed Kubernetes offering that simplifies using Kubernetes in Azure. Kubernetes can enable fine-grained control of Cognitive Services scale, latency, and networking. However, we recommend using Azure Databricks or Azure Synapse Analytics if you're unfamiliar with Apache Spark.

### Data connectors

Once you have a Spark cluster, the next step is connecting to your data. Apache Spark has a broad collection of database connectors. These connectors allow applications to work with large datasets no matter where they're stored. For more information about supported databases and connectors, see the [list of supported datasources for Azure Databricks](/azure/databricks/data/data-sources/).

## Concepts

### Spark

[Apache Spark™](http://spark.apache.org/) is a unified analytics engine for large-scale data processing. Its parallel processing framework boosts the performance of big data and analytic applications. Spark can operate as both a batch and a stream processing system, without changing core application code.

The basis of Spark is the DataFrame: a tabular collection of data distributed across the Apache Spark worker nodes. A Spark DataFrame is like a table in a relational database or a data frame in R/Python, but with limitless scale. DataFrames can be constructed from many sources, such as structured data files, tables in Hive, or external databases. Once your data is in a Spark DataFrame, you can:

* Do SQL-style computations such as joining and filtering tables.
* Apply functions to large datasets using MapReduce-style parallelism.
* Apply distributed machine learning using Microsoft Machine Learning for Apache Spark.
* Use Cognitive Services for big data to enrich your data with ready-to-use intelligent services.
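To make the SQL-style operations concrete without a Spark cluster, here is a minimal local sketch using Python's standard-library `sqlite3`; the table and column names are invented for illustration, and in Spark you would express the same join and filter with `DataFrame.join` and `DataFrame.filter`:

```python
import sqlite3

# Local stand-in for two distributed DataFrames (illustrative names,
# not from this article): raw text and per-record sentiment scores.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reviews(id INTEGER, text TEXT);
    CREATE TABLE scores(id INTEGER, sentiment REAL);
    INSERT INTO reviews VALUES (1, 'great product'), (2, 'poor quality');
    INSERT INTO scores  VALUES (1, 0.95), (2, 0.10);
""")

# Join the two tables and keep only positive sentiment, just as you
# would chain join(...) and filter(...) on Spark DataFrames.
rows = conn.execute("""
    SELECT r.text, s.sentiment
    FROM reviews r JOIN scores s ON r.id = s.id
    WHERE s.sentiment > 0.5
""").fetchall()

print(rows)  # [('great product', 0.95)]
```

The point of the DataFrame abstraction is that this same query shape scales from an in-memory table to terabytes partitioned across worker nodes.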

### Microsoft Machine Learning for Apache Spark (MMLSpark)

[Microsoft Machine Learning for Apache Spark](https://mmlspark.blob.core.windows.net/website/index.html#install) (MMLSpark) is an open-source, distributed machine learning (ML) library built on Apache Spark. Cognitive Services for big data is included in this package. Additionally, MMLSpark contains several other ML tools for Apache Spark, such as LightGBM, Vowpal Wabbit, OpenCV, and LIME. With MMLSpark, you can build powerful predictive and analytical models from any Spark datasource.

### HTTP on Spark

Cognitive Services for big data is an example of how we can integrate intelligent web services with big data. Web services power many applications across the globe, and most services communicate through the Hypertext Transfer Protocol (HTTP). To work with *arbitrary* web services at large scales, we provide HTTP on Spark. With HTTP on Spark, you can pass terabytes of data through any web service. Under the hood, we use this technology to power Cognitive Services for big data.
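HTTP on Spark's actual transformer API isn't shown in this article, but the fan-out idea behind it can be sketched locally with a thread pool standing in for Spark's partition-level parallelism. Here, `call_service` is a hypothetical stand-in for issuing a real HTTP request from each worker:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a web service call; HTTP on Spark would
# send a real HTTP request from each Spark worker instead.
def call_service(record):
    return {"input": record, "length": len(record)}

records = ["terabytes", "of", "data"]

# Fan the records out across a pool of workers, mirroring how HTTP on
# Spark parallelizes requests across a cluster's partitions.
with ThreadPoolExecutor(max_workers=4) as pool:
    responses = list(pool.map(call_service, records))

print(responses[0])  # {'input': 'terabytes', 'length': 9}
```

On a real cluster, each partition issues its requests concurrently, which is what lets terabytes of rows flow through a rate-limited web service.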

## Developer samples

* [Recipe: Predictive Maintenance](recipes/anomaly-detection.md)
* [Recipe: Intelligent Art Exploration](recipes/art-explorer.md)

## Blog posts

* [Learn more about how Cognitive Services work on Apache Spark™](https://azure.microsoft.com/blog/dear-spark-developers-welcome-to-azure-cognitive-services/)
* [Saving Snow Leopards with Deep Learning and Computer Vision on Spark](/archive/blogs/machinelearning/saving-snow-leopards-with-deep-learning-and-computer-vision-on-spark)
* [Microsoft Research Podcast: MMLSpark, empowering AI for Good with Mark Hamilton](https://blubrry.com/microsoftresearch/49485070/092-mmlspark-empowering-ai-for-good-with-mark-hamilton/)
* [Academic Whitepaper: Large Scale Intelligent Microservices](https://arxiv.org/abs/2009.08044)

## Webinars and videos

* [The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Services](https://databricks.com/session/the-azure-cognitive-services-on-spark-clusters-with-embedded-intelligent-services)
* [Spark Summit Keynote: Scalable AI for Good](https://databricks.com/session_eu19/scalable-ai-for-good)
* [Cognitive Services for big data in Azure Cosmos DB](https://medius.studios.ms/Embed/Video-nc/B19-BRK3004?latestplayer=true&l=2571.208093)
* [Lightning Talk on Large Scale Intelligent Microservices](https://www.youtube.com/watch?v=BtuhmdIy9Fk&t=6s)

## Next steps

* [Getting Started with Cognitive Services for big data](getting-started.md)
* [Simple Python Examples](samples-python.md)
* [Simple Scala Examples](samples-scala.md)

articles/data-lake-analytics/understand-spark-code-concepts.md

Lines changed: 10 additions & 10 deletions
@@ -15,10 +15,10 @@ This section provides high-level guidance on transforming U-SQL Scripts to Apach

- It starts with a [comparison of the two languages' processing paradigms](#understand-the-u-sql-and-spark-language-and-processing-paradigms)
- Provides tips on how to:
  - [Transform scripts](#transform-u-sql-scripts) including U-SQL's [rowset expressions](#transform-u-sql-rowset-expressions-and-sql-based-scalar-expressions)
  - [.NET code](#transform-net-code)
  - [Data types](#transform-typed-values)
  - [Catalog objects](#transform-u-sql-catalog-objects).

## Understand the U-SQL and Spark language and processing paradigms

@@ -48,13 +48,13 @@ Spark programs are similar in that you would use Spark connectors to read the da

U-SQL's expression language is C# and it offers various ways to scale out custom .NET code with user-defined functions, user-defined operators, and user-defined aggregators.

Azure Synapse and Azure HDInsight Spark both now natively support executing .NET code with .NET for Apache Spark. This means that you can potentially reuse some or all of your [.NET user-defined functions with Spark](#transform-user-defined-scalar-net-functions-and-user-defined-aggregators). Note though that U-SQL uses the .NET Framework while .NET for Apache Spark is based on .NET Core 3.1 or later.

[U-SQL user-defined operators (UDOs)](#transform-user-defined-operators-udos) use the U-SQL UDO model to provide scaled-out execution of the operator's code. Thus, UDOs will have to be rewritten into user-defined functions to fit into the Spark execution model.

.NET for Apache Spark currently doesn't support user-defined aggregators. Thus, [U-SQL user-defined aggregators](#transform-user-defined-scalar-net-functions-and-user-defined-aggregators) will have to be translated into Spark user-defined aggregators written in Scala.

If you don't want to take advantage of the .NET for Apache Spark capabilities, you'll have to rewrite your expressions into an equivalent Spark, Scala, Java, or Python expression, function, aggregator, or connector.

In any case, if you have a large amount of .NET logic in your U-SQL scripts, please contact us through your Microsoft Account representative for further guidance.

@@ -137,9 +137,9 @@ For more information, see:

In Spark, types by default allow NULL values, while in U-SQL, you explicitly mark scalar, non-object types as nullable. While Spark allows you to define a column as not nullable, it will not enforce the constraint and [may lead to wrong results](https://medium.com/@weshoffman/apache-spark-parquet-and-troublesome-nulls-28712b06f836).

In Spark, NULL indicates that the value is unknown. A Spark NULL value is different from any value, including itself. Comparisons between two Spark NULL values, or between a NULL value and any other value, return unknown because the value of each NULL is unknown.

This behavior is different from U-SQL, which follows C# semantics where `null` is different from any value but equal to itself.

Thus a SparkSQL `SELECT` statement that uses `WHERE column_name = NULL` returns zero rows even if there are NULL values in `column_name`, while in U-SQL, it would return the rows where `column_name` is set to `null`. Similarly, a Spark `SELECT` statement that uses `WHERE column_name != NULL` returns zero rows even if there are non-null values in `column_name`, while in U-SQL, it would return the rows that have non-null values. Thus, if you want the U-SQL null-check semantics, you should use [isnull](https://spark.apache.org/docs/2.3.0/api/sql/index.html#isnull) and [isnotnull](https://spark.apache.org/docs/2.3.0/api/sql/index.html#isnotnull) respectively (or their DSL equivalent).
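Because SQLite follows the same SQL three-valued NULL logic as Spark SQL, this behavior can be demonstrated locally with Python's standard-library `sqlite3` (a stand-in for illustration, not Spark itself):

```python
import sqlite3

# SQLite uses the same three-valued NULL logic as Spark SQL, so it
# serves as a local illustration of the semantics described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t(column_name TEXT);
    INSERT INTO t VALUES ('a'), (NULL);
""")

# Comparisons with NULL evaluate to unknown, so both match zero rows.
eq_null  = conn.execute("SELECT * FROM t WHERE column_name = NULL").fetchall()
neq_null = conn.execute("SELECT * FROM t WHERE column_name != NULL").fetchall()

# The null-safe predicates (isnull/isnotnull in SparkSQL) behave as
# U-SQL developers would expect.
is_null     = conn.execute("SELECT * FROM t WHERE column_name IS NULL").fetchall()
is_not_null = conn.execute("SELECT * FROM t WHERE column_name IS NOT NULL").fetchall()

print(len(eq_null), len(neq_null), len(is_null), len(is_not_null))  # 0 0 1 1
```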

@@ -203,7 +203,7 @@ Most of the settable system variables have no direct equivalent in Spark. Some o

### U-SQL hints

U-SQL offers several syntactic ways to provide hints to the query optimizer and execution engine:

- Setting a U-SQL system variable
- An `OPTION` clause associated with the rowset expression to provide a data or plan hint
@@ -214,7 +214,7 @@ Spark's cost-based query optimizer has its own capabilities to provide hints and

## Next steps

- [Understand Spark data formats for U-SQL developers](understand-spark-data-formats.md)
-- [.NET for Apache Spark](/dotnet/spark/what-is-apache-spark-dotnet)
+- [.NET for Apache Spark](/previous-versions/dotnet/spark/what-is-apache-spark-dotnet)
- [Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-migrate-gen1-to-gen2.md)
- [Transform data using Spark activity in Azure Data Factory](../data-factory/transform-data-using-spark.md)
- [Transform data using Hadoop Hive activity in Azure Data Factory](../data-factory/transform-data-using-hadoop-hive.md)

articles/data-lake-analytics/understand-spark-data-formats.md

Lines changed: 2 additions & 2 deletions
@@ -47,7 +47,7 @@ After this transformation, you copy the data as outlined in the chapter [Move da

- [Understand Spark code concepts for U-SQL developers](understand-spark-code-concepts.md)
- [Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-migrate-gen1-to-gen2.md)
-- [.NET for Apache Spark](/dotnet/spark/what-is-apache-spark-dotnet)
+- [.NET for Apache Spark](/previous-versions/dotnet/spark/what-is-apache-spark-dotnet)
- [Transform data using Spark activity in Azure Data Factory](../data-factory/transform-data-using-spark.md)
- [Transform data using Hadoop Hive activity in Azure Data Factory](../data-factory/transform-data-using-hadoop-hive.md)
- [What is Apache Spark in Azure HDInsight](../hdinsight/spark/apache-spark-overview.md)

articles/data-lake-analytics/understand-spark-for-usql-developers.md

Lines changed: 2 additions & 2 deletions
@@ -42,7 +42,7 @@ It includes the steps you can take, and several alternatives.
- [Understand Spark data formats for U-SQL developers](understand-spark-data-formats.md)
- [Understand Spark code concepts for U-SQL developers](understand-spark-code-concepts.md)
- [Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-migrate-gen1-to-gen2.md)
-- [.NET for Apache Spark](/dotnet/spark/what-is-apache-spark-dotnet)
+- [.NET for Apache Spark](/previous-versions/dotnet/spark/what-is-apache-spark-dotnet)
- [Transform data using Hadoop Hive activity in Azure Data Factory](../data-factory/transform-data-using-hadoop-hive.md)
- [Transform data using Spark activity in Azure Data Factory](../data-factory/transform-data-using-spark.md)
- [What is Apache Spark in Azure HDInsight](../hdinsight/spark/apache-spark-overview.md)
