Commit 1ddd11b

Merge pull request #208607 from SnehaGunda/SynapseML
Adding SynapseML concept doc
2 parents b635213 + c7a8eb0 commit 1ddd11b

4 files changed

+68
-3
lines changed

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
1+
---
2+
title: SynapseML and its use in Azure Synapse Analytics
3+
description: Learn about the SynapseML library and how it simplifies the creation of massively scalable machine learning (ML) pipelines in Azure Synapse Analytics.
4+
author: SnehaGunda
5+
ms.service: synapse-analytics
6+
ms.topic: conceptual
7+
ms.subservice: machine-learning
8+
ms.date: 08/31/2022
9+
ms.author: sngun
10+
---
11+
12+
# What is SynapseML?
13+
14+
SynapseML (previously known as MMLSpark) is an open-source library that simplifies the creation of massively scalable machine learning (ML) pipelines. SynapseML provides simple, composable, and distributed APIs for a wide variety of machine learning tasks, such as text analytics, vision, anomaly detection, and many others. SynapseML is built on the [Apache Spark distributed computing framework](https://spark.apache.org/) and shares the same API as the [SparkML/MLlib library](https://spark.apache.org/mllib/), allowing you to seamlessly embed SynapseML models into existing Apache Spark workflows.
15+
16+
With SynapseML, you can build scalable and intelligent systems to solve challenges in domains such as anomaly detection, computer vision, deep learning, text analytics, and others. SynapseML can train and evaluate models on single-node, multi-node, and elastically resizable clusters of computers. This lets you scale your work without wasting resources. SynapseML is usable across Python, R, Scala, Java, and .NET. Furthermore, its API abstracts over a wide variety of databases, file systems, and cloud data stores to simplify experiments no matter where data is located.
17+
18+
SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+.
19+
20+
## Key features of SynapseML
21+
22+
### A unified API for creating, training, and scoring models
23+
24+
SynapseML offers a unified API that simplifies developing fault-tolerant distributed programs. In particular, SynapseML exposes many different machine learning frameworks under a single API that is scalable, data- and language-agnostic, and works for batch, streaming, and serving applications.
25+
26+
A unified API standardizes many tools, frameworks, and algorithms, and streamlines the distributed machine learning experience. It enables developers to quickly compose disparate machine learning frameworks, keeps code clean, and enables workflows that require more than one framework. For example, workflows such as web-supervised learning or search-engine creation require multiple services and frameworks. SynapseML shields users from this extra complexity.
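The composition idea behind a unified API can be sketched in plain Python. This is a minimal illustration of the estimator/transformer pattern that SparkML-style pipelines follow, not SynapseML's own classes: `MeanCenterer`, `CenteredColumn`, `Threshold`, and the row format are hypothetical stand-ins for stages that would normally come from different frameworks.

```python
from typing import Dict, List, Union

Row = Dict[str, float]


class Transformer:
    """A stage that maps rows to new rows."""

    def transform(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class Estimator:
    """A stage that learns from rows and returns a fitted Transformer."""

    def fit(self, rows: List[Row]) -> Transformer:
        raise NotImplementedError


class CenteredColumn(Transformer):
    """Fitted model: subtracts a learned mean from one column."""

    def __init__(self, input_col: str, output_col: str, mean: float) -> None:
        self.input_col, self.output_col, self.mean = input_col, output_col, mean

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, self.output_col: r[self.input_col] - self.mean} for r in rows]


class MeanCenterer(Estimator):
    """Hypothetical estimator standing in for one ML framework."""

    def __init__(self, input_col: str, output_col: str) -> None:
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows: List[Row]) -> Transformer:
        mean = sum(r[self.input_col] for r in rows) / len(rows)
        return CenteredColumn(self.input_col, self.output_col, mean)


class Threshold(Transformer):
    """Hypothetical stateless stage standing in for a second framework."""

    def __init__(self, input_col: str, output_col: str, cutoff: float) -> None:
        self.input_col, self.output_col, self.cutoff = input_col, output_col, cutoff

    def transform(self, rows: List[Row]) -> List[Row]:
        return [
            {**r, self.output_col: 1.0 if r[self.input_col] > self.cutoff else 0.0}
            for r in rows
        ]


class PipelineModel(Transformer):
    """A fitted pipeline: applies each fitted stage in order."""

    def __init__(self, stages: List[Transformer]) -> None:
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Fits estimators in order, feeding each stage the previous stage's output."""

    def __init__(self, stages: List[Union[Estimator, Transformer]]) -> None:
        self.stages = stages

    def fit(self, rows: List[Row]) -> Transformer:
        fitted: List[Transformer] = []
        for stage in self.stages:
            model = stage.fit(rows) if isinstance(stage, Estimator) else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


data = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
pipeline = Pipeline([MeanCenterer("x", "x_centered"), Threshold("x_centered", "label", 0.0)])
model = pipeline.fit(data)
scored = model.transform(data)
print([row["label"] for row in scored])
```

Because every stage exposes the same `fit` and `transform` methods, the pipeline does not need to know which framework a stage wraps. That is the property that lets stages be mixed freely.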
27+
28+
29+
### Use pre-built intelligent models
30+
31+
Many tools in SynapseML don't require a large labeled training dataset. Instead, SynapseML provides simple APIs for pre-built intelligent services, such as Azure Cognitive Services, to quickly solve large-scale AI challenges related to both business and research. SynapseML enables developers to embed over 50 different state-of-the-art ML services directly into their systems and databases. These ready-to-use algorithms can parse a wide variety of documents, transcribe multi-speaker conversations in real time, and translate text to over 100 different languages. For more examples of how to use pre-built AI to solve tasks quickly, see [the SynapseML cognitive service examples](https://microsoft.github.io/SynapseML/docs/features/cognitive_services/CognitiveServices%20-%20Overview/).
32+
33+
To make its integration with Azure Cognitive Services fast and efficient, SynapseML introduces many optimizations for service-oriented workflows. In particular, SynapseML automatically parses common throttling responses to ensure that jobs don’t overwhelm backend services. Additionally, it uses exponential back-offs to handle unreliable network connections and failed responses. Finally, new asynchronous parallelism primitives for Spark keep worker machines busy: a worker can send further requests while waiting on a response from the server, which can yield a tenfold increase in throughput.
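The throttling and back-off behavior described above can be illustrated with a small, generic retry loop. This is a sketch under stated assumptions, not SynapseML's internal implementation: the `Response` type, the `call_with_backoff` helper, the set of retryable status codes, and the default delays are all hypothetical. The only protocol detail relied on is the standard HTTP `Retry-After` header.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict

# Assumption: these status codes are treated as transient and worth retrying.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


@dataclass
class Response:
    """Hypothetical, minimal stand-in for an HTTP response."""

    status: int
    headers: Dict[str, str] = field(default_factory=dict)
    body: str = ""


def call_with_backoff(
    send: Callable[[], Response],
    max_retries: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send`, retrying transient failures with exponential back-off.

    If the service sends a numeric Retry-After header, that wait is honored;
    otherwise the delay doubles on every attempt, capped at `max_delay`.
    """
    response = send()
    for attempt in range(max_retries):
        if response.status not in RETRYABLE_STATUSES:
            return response
        retry_after = response.headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * (2 ** attempt)
        sleep(min(delay, max_delay))
        response = send()
    return response  # Retries exhausted: hand back the last response.


# Usage with a scripted service: throttled once, failing once, then succeeding.
scripted = iter([
    Response(429, {"Retry-After": "2"}),
    Response(503),
    Response(200, body="ok"),
])
waits = []
result = call_with_backoff(lambda: next(scripted), sleep=waits.append)
print(result.status, waits)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays. A production client would typically also add random jitter to the delay so that many workers don't retry in lockstep.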
34+
35+
### Broad ecosystem compatibility with ONNX
36+
37+
SynapseML enables developers to use models from many different ML ecosystems through the Open Neural Network Exchange (ONNX) framework. With this integration, you can execute a wide variety of classical and deep learning models at scale with only a few lines of code. SynapseML automatically handles distributing ONNX models to worker nodes, batching and buffering input data for high throughput, and scheduling work on hardware accelerators.
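To make the batching step concrete, here is a minimal sketch of grouping the rows of one partition into mini-batches before they reach a model. It does not use the SynapseML or ONNX Runtime APIs: `mini_batches`, `score_partition`, and the `predict_batch` callable are hypothetical names, and the default batch size is an arbitrary assumption.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def mini_batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Lazily group rows into lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def score_partition(
    rows: Iterable[T],
    predict_batch: Callable[[List[T]], Sequence[R]],
    batch_size: int = 64,
) -> Iterator[R]:
    """Score one partition batch by batch, yielding one result per input row."""
    for batch in mini_batches(rows, batch_size):
        yield from predict_batch(batch)


# Usage: a stand-in "model" that doubles its inputs and records batch sizes.
seen_batch_sizes = []


def double_all(batch: List[int]) -> List[int]:
    seen_batch_sizes.append(len(batch))
    return [2 * value for value in batch]


predictions = list(score_partition(range(5), double_all, batch_size=2))
print(predictions, seen_batch_sizes)
```

Each model call carries a fixed overhead, so calling it once per batch rather than once per row spreads that cost over many rows. The generator also keeps only one batch in memory at a time.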
38+
39+
Bringing ONNX to Spark not only helps developers scale deep learning models, it also enables distributed inference across a wide variety of ML ecosystems. In particular, ONNXMLTools converts models from TensorFlow, scikit-learn, Core ML, LightGBM, XGBoost, H2O, and PyTorch to ONNX for accelerated and distributed inference using SynapseML.
40+
41+
### Build responsible AI systems
42+
43+
After building a model, it’s imperative that researchers and engineers understand its limitations and behavior before deployment. SynapseML helps developers and researchers build responsible AI systems by introducing new tools that reveal why models make certain predictions and how to improve the training dataset to eliminate biases. SynapseML dramatically speeds up the process of understanding a user’s trained model by enabling developers to distribute computation across hundreds of machines. More specifically, SynapseML includes distributed implementations of Shapley Additive Explanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME) to explain the predictions of vision, text, and tabular models. It also includes tools such as Individual Conditional Expectation (ICE) and partial dependence analysis to recognize biased datasets.
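To show what a Shapley-based explanation computes, the following sketch evaluates exact Shapley values for a tiny model by enumerating every feature subset. It is not SynapseML's distributed implementation. Practical SHAP explainers approximate these values by sampling, because exact enumeration grows exponentially with the number of features. The `shapley_values` helper, the toy model, and the choice of an all-zero baseline for "missing" features are assumptions made for the example.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List, Sequence


def shapley_values(
    model: Callable[[Sequence[float]], float],
    instance: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley value of each feature for one prediction.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(instance)

    def value(coalition: FrozenSet[int]) -> float:
        mixed = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    contributions = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                marginal = value(coalition | {i}) - value(coalition)
                contributions[i] += weight * marginal
    return contributions


# Usage: a toy model with an interaction between features 0 and 2.
def toy_model(x: Sequence[float]) -> float:
    return 2 * x[0] + 3 * x[1] + x[0] * x[2]


phi = shapley_values(toy_model, instance=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
print(phi)
```

The contributions sum to the difference between the model's output for the instance and its output for the baseline, and the interaction term is split evenly between the two features that share it.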
44+
45+
## Enterprise support on Azure Synapse Analytics
46+
47+
SynapseML is generally available on Azure Synapse Analytics with enterprise support. You can build large-scale machine learning pipelines using Azure Cognitive Services, LightGBM, ONNX, and other [selected SynapseML features](https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/streamline-collaboration-and-insights-with-simplified-machine/ba-p/2924707). It even includes templates to quickly prototype distributed machine learning systems, such as visual search engines, predictive maintenance pipelines, document translation, and more.
48+
49+
## Next steps
50+
51+
* To learn more about SynapseML, see the [blog post](https://www.microsoft.com/en-us/research/blog/synapseml-a-simple-multilingual-and-massively-parallel-machine-learning-library/).
52+
53+
* [Install SynapseML and get started with examples.](https://microsoft.github.io/SynapseML/docs/getting_started/installation/)
54+
55+
* [SynapseML GitHub repository.](https://github.com/microsoft/SynapseML)

articles/synapse-analytics/machine-learning/what-is-machine-learning.md

Lines changed: 5 additions & 1 deletion
@@ -8,7 +8,7 @@ ms.subservice: machine-learning
88
ms.topic: overview
99
ms.reviewer: sngun, garye
1010

11-
ms.date: 10/01/2021
11+
ms.date: 08/31/2022
1212
author: nelgson
1313
ms.author: negust
1414
---
@@ -69,6 +69,10 @@ Models that have been trained either in Azure Synapse or outside Azure Synapse c
6969

7070
* Another option for batch scoring machine learning models in Azure Synapse is to leverage the Apache Spark Pools for Azure Synapse. Depending on the libraries used to train the models, you can use a code experience to run your batch scoring.
7171

72+
## SynapseML
73+
74+
SynapseML (previously known as MMLSpark) is an open-source library that simplifies the creation of massively scalable machine learning (ML) pipelines. It is an ecosystem of tools used to expand the Apache Spark framework in several new directions. SynapseML unifies several existing machine learning frameworks and new Microsoft algorithms into a single, scalable API that’s usable across Python, R, Scala, .NET, and Java. To learn more, see the [key features of SynapseML](synapse-machine-learning-library.md).
75+
7276
## Next steps
7377

7478
* [Get started with Azure Synapse Analytics](../get-started.md)

articles/synapse-analytics/overview-terminology.md

Lines changed: 6 additions & 2 deletions
@@ -5,7 +5,7 @@ author: saveenr
55
ms.service: synapse-analytics
66
ms.topic: overview
77
ms.subservice: overview
8-
ms.date: 01/13/2022
8+
ms.date: 08/19/2022
99
ms.author: saveenr
1010
ms.reviewer: sngun
1111
ms.custom: ignite-fall-2021
@@ -15,7 +15,7 @@ ms.custom: ignite-fall-2021
1515

1616
This document guides you through the basic concepts of Azure Synapse Analytics.
1717

18-
## Basics
18+
## Synapse workspace
1919

2020
A **Synapse workspace** is a securable collaboration boundary for doing cloud-based enterprise analytics in Azure. A workspace is deployed in a specific region and has an associated ADLS Gen2 account and file system (for storing temporary data). A workspace is under a resource group.
2121

@@ -43,6 +43,10 @@ There are two ways within Synapse to use Spark:
4343
* **Spark Notebooks** for doing data Data Science and Engineering use Scala, PySpark, C#, and SparkSQL
4444
* **Spark job definitions** for running batch Spark jobs using jar files.
4545

46+
## SynapseML
47+
48+
SynapseML (previously known as MMLSpark) is an open-source library that simplifies the creation of massively scalable machine learning (ML) pipelines. It is an ecosystem of tools used to expand the Apache Spark framework in several new directions. SynapseML unifies several existing machine learning frameworks and new Microsoft algorithms into a single, scalable API that’s usable across Python, R, Scala, .NET, and Java. To learn more, see the [key features of SynapseML](machine-learning/synapse-machine-learning-library.md).
49+
4650
## Pipelines
4751

4852
Pipelines are how Azure Synapse provides Data Integration - allowing you to move data between services and orchestrate activities.

articles/synapse-analytics/toc.yml

Lines changed: 2 additions & 0 deletions
@@ -1455,6 +1455,8 @@ items:
14551455
href: ./spark/apache-spark-machine-learning-training.md
14561456
- name: Deep learning
14571457
href: ./machine-learning/concept-deep-learning.md
1458+
- name: SynapseML library
1459+
href: ./machine-learning/synapse-machine-learning-library.md
14581460
- name: Tutorials
14591461
items:
14601462
- name: Data access and preparation
