
Commit 749fe58

Merge pull request #212410 from midesa/main
core r docs
2 parents e21ac6c + 46fdd8e commit 749fe58

20 files changed, with 1752 additions and 63 deletions

articles/synapse-analytics/spark/apache-spark-3-runtime.md

Lines changed: 470 additions & 0 deletions
Large diffs are not rendered by default.

articles/synapse-analytics/spark/apache-spark-32-runtime.md

Lines changed: 472 additions & 0 deletions
Large diffs are not rendered by default.

articles/synapse-analytics/spark/apache-spark-azure-portal-add-libraries.md

Lines changed: 7 additions & 3 deletions
@@ -38,7 +38,7 @@ When a Spark instance starts, these libraries are included automatically. More p
3838

3939
## Workspace packages
4040

41-
When your team develops custom applications or models, you might develop various code artifacts like *.whl* or *.jar* files to package your code.
41+
When your team develops custom applications or models, you might develop various code artifacts like *.whl*, *.jar*, or *.tar.gz* files to package your code.
4242

4343
In Synapse, workspace packages can be custom or private *.whl* or *.jar* files. You can upload these packages to your workspace and later assign them to a specific serverless Apache Spark pool. Once assigned, these workspace packages are installed automatically on all Spark pool sessions.
4444

@@ -70,8 +70,11 @@ Session-scoped packages allow users to define package dependencies at the start
7070

7171
To learn more about how to manage session-scoped packages, see the following articles:
7272

73-
- [Python session packages: ](./apache-spark-manage-session-packages.md#session-scoped-python-packages) At the start of a session, provide a Conda *environment.yml* to install more Python packages from popular repositories.
74-
- [Scala/Java session packages: ](./apache-spark-manage-session-packages.md#session-scoped-java-or-scala-packages) At the start of your session, provide a list of *.jar* files to install using `%%configure`.
73+
- [Python session packages:](./apache-spark-manage-session-packages.md#session-scoped-python-packages) At the start of a session, provide a Conda *environment.yml* to install more Python packages from popular repositories.
74+
75+
- [Scala/Java session packages:](./apache-spark-manage-session-packages.md#session-scoped-java-or-scala-packages) At the start of your session, provide a list of *.jar* files to install using `%%configure`.
76+
77+
- [R session packages:](./apache-spark-manage-session-packages.md#session-scoped-r-packages-preview) Within your session, you can install packages across all nodes of your Spark pool using `install.packages` or `devtools`.
7578

7679
## Manage your packages outside Synapse Analytics UI
7780

@@ -83,5 +86,6 @@ To learn more about Azure PowerShell cmdlets and package management REST APIs, s
8386
- Package management REST APIs: [Manage your Spark pool libraries through REST APIs](apache-spark-manage-packages-outside-ui.md#manage-packages-through-rest-apis)
8487

8588
## Next steps
89+
8690
- View the default libraries: [Apache Spark version support](apache-spark-version-support.md)
8791
- Troubleshoot library installation errors: [Troubleshoot library errors](apache-spark-troubleshoot-library-errors.md)

articles/synapse-analytics/spark/apache-spark-data-visualization.md

Lines changed: 125 additions & 26 deletions
@@ -1,6 +1,6 @@
11
---
2-
title: Visualizations
3-
description: Use Azure Synapse notebooks to visualize your data
2+
title: Python Visualizations
3+
description: Use Python and Azure Synapse notebooks to visualize your data
44
author: midesa
55
ms.author: midesa
66
ms.reviewer: euang
@@ -11,43 +11,44 @@ ms.subservice: spark
1111
ms.date: 09/13/2020
1212
---
1313
# Visualize data
14-
Azure Synapse is an integrated analytics service that accelerates time to insight, across data warehouses and big data analytics systems. Data visualization is a key component in being able to gain insight into your data. It helps make big and small data easier for humans to understand. It also makes it easier to detect patterns, trends, and outliers in groups of data.
14+
15+
Azure Synapse is an integrated analytics service that accelerates time to insight across data warehouses and big data analytics systems. Data visualization is a key component in gaining insight into your data. It makes big and small data easier for humans to understand, and it makes it easier to detect patterns, trends, and outliers in groups of data.
1516

1617
When using Apache Spark in Azure Synapse Analytics, there are various built-in options to help you visualize your data, including Synapse notebook chart options, access to popular open-source libraries, and integration with Synapse SQL and Power BI.
1718

1819
## Notebook chart options
19-
When using an Azure Synapse notebook, you can turn your tabular results view into a customized chart using chart options. Here, you can visualize your data without having to write any code.
20+
21+
When using an Azure Synapse notebook, you can turn your tabular results view into a customized chart using chart options. Here, you can visualize your data without having to write any code.
2022

2123
### display(df) function
22-
The ```display``` function allows you to turn SQL queries and Apache Spark dataframes and RDDs into rich data visualizations.The ```display``` function can be used on dataframes or RDDs created in PySpark, Scala, Java, and .NET.
24+
25+
The ```display``` function allows you to turn SQL queries and Apache Spark dataframes and RDDs into rich data visualizations. The ```display``` function can be used on dataframes or RDDs created in PySpark, Scala, Java, R, and .NET.
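
For example, here's a minimal sketch, assuming a Synapse PySpark notebook session where the `spark` session and the built-in `display` function are available as described above:

```python
# Create a small PySpark DataFrame and render it with the built-in display() function.
df = spark.createDataFrame(
    [("apple", 3), ("banana", 5), ("cherry", 2)],
    ["fruit", "count"]
)

# The rendered table view appears below the cell; switch to Chart view to customize the visualization.
display(df)
```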
2326

2427
To access the chart options:
25-
1. The output of ```%%sql``` magic commands appear in the rendered table view by default. You can also call ```display(df)``` on Spark DataFrames or Resilient Distributed Datasets (RDD) function to produce the rendered table view.
26-
28+
29+
1. The output of ```%%sql``` magic commands appears in the rendered table view by default. You can also call ```display(df)``` on Spark DataFrames or Resilient Distributed Datasets (RDDs) to produce the rendered table view.
2730
2. Once you have a rendered table view, switch to the Chart View.
2831
![built-in-charts](./media/apache-spark-development-using-notebooks/synapse-built-in-charts.png#lightbox)
29-
3032
3. You can now customize your visualization by specifying the following values:
3133

3234
| Configuration | Description |
33-
|--|--|
35+
|--|--|
3436
| Chart Type | The ```display``` function supports a wide range of chart types, including bar charts, scatter plots, line graphs, and more |
3537
| Key | Specify the range of values for the x-axis|
3638
| Value | Specify the range of values for the y-axis values |
37-
| Series Group | Used to determine the groups for the aggregation |
38-
| Aggregation | Method to aggregate data in your visualization|
39-
40-
39+
| Series Group | Used to determine the groups for the aggregation |
40+
| Aggregation | Method to aggregate data in your visualization|
4141
> [!NOTE]
4242
> By default, the ```display(df)``` function takes only the first 1000 rows of the data to render the charts. To generate the chart from the whole dataset, select **Aggregation over all results** and then select **Apply**. A Spark job is triggered when the chart setting changes, and it might take several minutes to complete the calculation and render the chart.
43-
4443
4. Once done, you can view and interact with your final visualization!
4544

4645
### display(df) statistic details
46+
4747
You can use <code>display(df, summary = true)</code> to check the statistics summary of a given Apache Spark DataFrame. The summary includes the column name, column type, unique values, and missing values for each column. You can also select a specific column to see its minimum value, maximum value, mean value, and standard deviation.
4848
![built-in-charts-summary](./media/apache-spark-development-using-notebooks/synapse-built-in-charts-summary.png#lightbox)
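
As a minimal sketch, assuming the same PySpark notebook as above (the flag is written here as Python's `True`):

```python
# Render the summary view instead of the plain table:
# column type, unique values, and missing values per column.
display(df, summary=True)
```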
49-
49+
5050
### displayHTML() option
51+
5152
Azure Synapse Analytics notebooks support HTML graphics using the ```displayHTML``` function.
5253

5354
The following image is an example of creating visualizations using [D3.js](https://d3js.org/).
@@ -140,10 +141,12 @@ svg
140141

141142
```
142143

143-
## Popular Libraries
144-
When it comes to data visualization, Python offers multiple graphing libraries that come packed with many different features. By default, every Apache Spark Pool in Azure Synapse Analytics contains a set of curated and popular open-source libraries. You can also add or manage additional libraries & versions by using the Azure Synapse Analytics library management capabilities.
144+
## Python Libraries
145+
146+
When it comes to data visualization, Python offers multiple graphing libraries that come packed with many different features. By default, every Apache Spark Pool in Azure Synapse Analytics contains a set of curated and popular open-source libraries. You can also add or manage additional libraries & versions by using the Azure Synapse Analytics library management capabilities.
145147

146148
### Matplotlib
149+
147150
You can render standard plotting libraries, like Matplotlib, using the built-in rendering functions for each library.
148151

149152
The following image is an example of creating a bar chart using **Matplotlib**.
@@ -173,9 +176,9 @@ plt.legend()
173176
plt.show()
174177
```
175178

176-
177179
### Bokeh
178-
You can render HTML or interactive libraries, like **bokeh**, using the ```displayHTML(df)```.
180+
181+
You can render HTML or interactive libraries, like **bokeh**, by using ```displayHTML()```.
179182

180183
The following image is an example of plotting glyphs over a map using **bokeh**.
181184

@@ -212,15 +215,14 @@ html = file_html(p, CDN, "my plot1")
212215
displayHTML(html)
213216
```
214217

215-
216218
### Plotly
219+
217220
You can render HTML or interactive libraries like **Plotly**, using the **displayHTML()**.
218221

219222
Run the following sample code to draw the image below.
220223

221224
![plotly-example](./media/apache-spark-development-using-notebooks/synapse-plotly-image.png#lightbox)
222225

223-
224226
```python
225227
from urllib.request import urlopen
226228
import json
@@ -248,9 +250,10 @@ h = plotly.offline.plot(fig, output_type='div')
248250
# display this html
249251
displayHTML(h)
250252
```
253+
251254
### Pandas
252255

253-
You can view html output of pandas dataframe as the default output, notebook will automatically show the styled html content.
256+
You can view the HTML output of a pandas DataFrame as the default output; the notebook automatically shows the styled HTML content.
254257

255258
![Panda graph example.](./media/apache-spark-data-viz/support-panda.png#lightbox)
256259

@@ -267,20 +270,116 @@ df = pd.DataFrame([[38.0, 2.0, 18.0, 22.0, 21, np.nan],[19, 439, 6, 452, 226,232
267270
df
268271
```
269272

273+
### Additional libraries
270274

271-
### Additional libraries
272275
Beyond these libraries, the Azure Synapse Analytics Runtime also includes the following set of libraries that are often used for data visualization:
273276

274-
- [Seaborn](https://seaborn.pydata.org/)
277+
- [Seaborn](https://seaborn.pydata.org/)
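
For illustration, a minimal Seaborn sketch, assuming the pool's default runtime includes seaborn and matplotlib; the data here is a small hypothetical in-memory sample:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Small in-memory sample so the example doesn't depend on any external dataset.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
})

# Seaborn draws on top of Matplotlib, so the built-in Matplotlib rendering applies here as well.
sns.scatterplot(data=df, x="total_bill", y="tip")
plt.show()
```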
275278

276279
You can visit the Azure Synapse Analytics Runtime [documentation](./spark/../apache-spark-version-support.md) for the most up to date information about the available libraries and versions.
277280

281+
## R Libraries (Preview)
282+
283+
The R ecosystem offers multiple graphing libraries that come packed with many different features. By default, every Apache Spark Pool in Azure Synapse Analytics contains a set of curated and popular open-source libraries. You can also add or manage additional libraries & versions by using the Azure Synapse Analytics library management capabilities.
284+
285+
### ggplot2
286+
287+
The [ggplot2](https://ggplot2.tidyverse.org/) library is popular for data visualization and exploratory data analysis.
288+
289+
![Screenshot of a ggplot2 graph example.](./media/apache-spark-data-viz/ggplot2.png#lightbox)
290+
291+
```r
292+
library(ggplot2)
293+
data(mpg, package="ggplot2")
294+
theme_set(theme_bw())
295+
296+
g <- ggplot(mpg, aes(cty, hwy))
297+
298+
# Scatterplot
299+
g + geom_point() +
300+
geom_smooth(method="lm", se=F) +
301+
labs(subtitle="mpg: city vs highway mileage",
302+
y="hwy",
303+
x="cty",
304+
title="Scatterplot with overlapping points",
305+
caption="Source: midwest")
306+
```
307+
308+
### rBokeh
309+
310+
[rBokeh](https://hafen.github.io/rbokeh/) is a native R plotting library for creating interactive graphics backed by the Bokeh visualization library.
311+
312+
To install rBokeh, you can use the following command:
313+
314+
```r
315+
install.packages("rbokeh")
316+
```
317+
318+
Once installed, you can use rBokeh to create interactive visualizations.
319+
320+
![Screenshot of a rBokeh graph example.](./media/apache-spark-data-viz/bokeh-plot.png#lightbox)
321+
322+
```r
323+
library(rbokeh)
324+
p <- figure() %>%
325+
ly_points(Sepal.Length, Sepal.Width, data = iris,
326+
color = Species, glyph = Species,
327+
hover = list(Sepal.Length, Sepal.Width))
p  # print the figure object so the plot renders in the notebook output
328+
```
329+
330+
### R Plotly
331+
332+
[Plotly's](https://plotly.com/r/) R graphing library makes interactive, publication-quality graphs.
333+
334+
To install Plotly, you can use the following command:
335+
336+
```r
337+
install.packages("plotly")
338+
```
339+
340+
Once installed, you can use Plotly to create interactive visualizations.
341+
342+
![Screenshot of a Plotly graph example.](./media/apache-spark-data-viz/plotly-r.png#lightbox)
343+
344+
```r
345+
library(plotly)
346+
347+
fig <- plot_ly() %>%
348+
add_lines(x = c("a","b","c"), y = c(1,3,2))%>%
349+
layout(title="sample figure", xaxis = list(title = 'x'), yaxis = list(title = 'y'), plot_bgcolor = "#c7daec")
350+
351+
fig
352+
```
353+
354+
### Highcharter
355+
356+
[Highcharter](https://jkunst.com/highcharter/) is an R wrapper for the Highcharts JavaScript library and its modules.
357+
358+
To install Highcharter, you can use the following command:
359+
360+
```r
361+
install.packages("highcharter")
362+
```
363+
364+
Once installed, you can use Highcharter to create interactive visualizations.
365+
366+
![Screenshot of a Highcharter graph example.](./media/apache-spark-data-viz/highcharter.png#lightbox)
367+
368+
```r
369+
library(magrittr)
370+
library(highcharter)
371+
hchart(mtcars, "scatter", hcaes(wt, mpg, z = drat, color = hp)) %>%
372+
hc_title(text = "Scatter chart with size and color")
373+
374+
```
375+
278376
## Connect to Power BI using Apache Spark & SQL On-Demand
377+
279378
Azure Synapse Analytics integrates deeply with Power BI allowing data engineers to build analytics solutions.
280379

281380
Azure Synapse Analytics allows the different workspace computational engines to share databases and tables between its Spark pools and serverless SQL pool. Using the [shared metadata model](../metadata/overview.md), you can query your Apache Spark tables using SQL on-demand. Once done, you can connect your SQL on-demand endpoint to Power BI to easily query your synced Spark tables.
282381

283-
284382
## Next steps
383+
285384
- For more information on how to set up the Spark SQL DW Connector: [Synapse SQL connector](./spark/../synapse-spark-sql-pool-import-export.md)
286-
- View the default libraries: [Azure Synapse Analytics runtime](../spark/apache-spark-version-support.md)
385+
- View the default libraries: [Azure Synapse Analytics runtime](../spark/apache-spark-version-support.md)

articles/synapse-analytics/spark/apache-spark-delta-lake-overview.md

Lines changed: 11 additions & 11 deletions
@@ -9,7 +9,7 @@ ms.topic: overview
99
ms.subservice: spark
1010
ms.date: 02/15/2022
1111
ms.custom: devx-track-csharp
12-
zone_pivot_groups: programming-languages-spark-all-minus-sql
12+
zone_pivot_groups: programming-languages-spark-all-minus-sql-r
1313
---
1414

1515
# Linux Foundation Delta Lake overview
@@ -160,7 +160,7 @@ The order of the results is different from above as there was no order explicitl
160160

161161
## Update table data
162162

163-
Delta Lake supports several operations to modify tables using standard DataFrame APIs, this is one of the big enhancements that delta format adds. The following example runs a batch job to overwrite the data in the table.
163+
Delta Lake supports several operations to modify tables using standard DataFrame APIs. These operations are among the enhancements that the Delta format adds. The following example runs a batch job to overwrite the data in the table.
164164

165165
:::zone pivot = "programming-language-python"
166166

@@ -344,7 +344,7 @@ Results in:
344344

345345
## Conditional update without overwrite
346346

347-
Delta Lake provides programmatic APIs to conditional update, delete, and merge (this is commonly referred to as an upsert) data into tables.
347+
Delta Lake provides programmatic APIs to conditionally update, delete, and merge (the merge operation is commonly referred to as an upsert) data into tables.
348348

349349
:::zone pivot = "programming-language-python"
350350

@@ -530,7 +530,7 @@ Here you have a combination of the existing data. The existing data has been ass
530530

531531
### History
532532

533-
Delta Lake's has the ability to allow looking into history of a table. That is, the changes that were made to the underlying Delta Table. The cell below shows how simple it is to inspect the history.
533+
Delta Lake provides the ability to look into the history of a table; that is, the changes that were made to the underlying Delta table. The cell below shows how simple it is to inspect the history.
534534

535535
:::zone pivot = "programming-language-python"
536536

@@ -572,7 +572,7 @@ Here you can see all of the modifications made over the above code snippets.
572572

573573
It's possible to query previous snapshots of your Delta Lake table by using a feature called Time Travel. If you want to access the data that you overwrote, you can query a snapshot of the table before you overwrote the first set of data using the versionAsOf option.
574574

575-
Once you run the cell below, you should see the first set of data from before you overwrote it. Time Travel is an extremely powerful feature that takes advantage of the power of the Delta Lake transaction log to access data that is no longer in the table. Removing the version 0 option (or specifying version 1) would let you see the newer data again. For more information, see [Query an older snapshot of a table](https://docs.delta.io/latest/delta-batch.html#deltatimetravel).
575+
Once you run the cell below, you should see the first set of data from before you overwrote it. Time Travel is a powerful feature that uses the Delta Lake transaction log to access data that is no longer in the table. Removing the version 0 option (or specifying version 1) would let you see the newer data again. For more information, see [Query an older snapshot of a table](https://docs.delta.io/latest/delta-batch.html#deltatimetravel).
576576

577577
:::zone pivot = "programming-language-python"
578578

@@ -611,7 +611,7 @@ Results in:
611611
| 3|
612612
| 2|
613613

614-
Here you can see you have gone back to the earliest version of the data.
614+
Here you can see you've gone back to the earliest version of the data.
615615

616616
## Write a stream of data to a table
617617

@@ -626,7 +626,7 @@ In the cells below, here's what we are doing:
626626
* Cell 32 Stop the structured streaming job
627627
* Cell 33 Inspect history <--You'll notice appends have stopped
628628

629-
First you are going to set up a simple Spark Streaming job to generate a sequence and make the job write to your Delta Table.
629+
First you're going to set up a simple Spark Streaming job to generate a sequence and make the job write to your Delta Table.
630630

631631
:::zone pivot = "programming-language-python"
632632

@@ -748,7 +748,7 @@ Results in:
748748
| 1|2020-04-25 00:35:05| WRITE| [mode -> Overwrite, partitionBy -> []]| 0|
749749
| 0|2020-04-25 00:34:34| WRITE| [mode -> ErrorIfExists, partitionBy -> []]| null|
750750

751-
Here you are dropping some of the less interesting columns to simplify the viewing experience of the history view.
751+
Here you're dropping some of the less interesting columns to simplify the viewing experience of the history view.
752752

753753
:::zone pivot = "programming-language-python"
754754

@@ -792,7 +792,7 @@ Results in:
792792

793793
You can do an in-place conversion from the Parquet format to Delta.
794794

795-
Here you are going to test if the existing table is in delta format or not.
795+
Here you're going to test if the existing table is in delta format or not.
796796
:::zone pivot = "programming-language-python"
797797

798798
```python
@@ -830,7 +830,7 @@ Results in:
830830

831831
False
832832

833-
Now you are going to convert the data to delta format and verify it worked.
833+
Now you're going to convert the data to delta format and verify it worked.
834834

835835
:::zone pivot = "programming-language-python"
836836

@@ -936,7 +936,7 @@ Results in:
936936
|--------------------|
937937
|abfss://data@arca...|
938938

939-
Now you are going to verify that a table is not a delta format table, then convert it to delta format using Spark SQL and confirm it was converted correctly.
939+
Now, you're going to verify that a table is not a delta format table. Then, you will convert the table to delta format using Spark SQL and confirm that it was converted correctly.
940940

941941
:::zone pivot = "programming-language-python"
942942
