
Commit 34c5dbf

Merge pull request #295542 from whhender/public-prs-feb-2025-2
Resolving script issues from public PRs
2 parents 40dec8c + bd8c35c

2 files changed (+34, −34 lines)

articles/synapse-analytics/spark/apache-spark-data-visualization-tutorial.md

Lines changed: 11 additions & 11 deletions
@@ -35,16 +35,16 @@ Create an Apache Spark Pool by following the [Create an Apache Spark pool tutori
 3. Because the raw data is in a Parquet format, you can use the Spark context to pull the file into memory as a DataFrame directly. Create a Spark DataFrame by retrieving the data via the Open Datasets API. Here, we use the Spark DataFrame *schema on read* properties to infer the datatypes and schema.

    ```python
-   from azureml.opendatasets import NycTlcYellow
-
-   from datetime import datetime
-   from dateutil import parser
-
-   end_date = parser.parse('2018-05-08 00:00:00')
-   start_date = parser.parse('2018-05-01 00:00:00')
-
-   nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
-   filtered_df = spark.createDataFrame(nyc_tlc.to_pandas_dataframe())
+   from azureml.opendatasets import NycTlcYellow
+
+   from datetime import datetime
+   from dateutil import parser
+
+   end_date = parser.parse('2018-05-08 00:00:00')
+   start_date = parser.parse('2018-05-01 00:00:00')
+
+   nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
+   df = spark.createDataFrame(nyc_tlc.to_pandas_dataframe())

    ```
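
If you're applying this change in your own notebook, a quick sanity check confirms that the renamed `df` loads with the schema inferred on read — a minimal sketch, assuming a Synapse notebook session where `spark` and `display` are predefined:

```python
# Confirm the datatypes Spark inferred from the Parquet source.
df.printSchema()

# Peek at a few rows to verify the one-week window loaded as expected.
display(df.limit(10))
```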

@@ -174,4 +174,4 @@ After you finish running the application, shut down the notebook to release the
 ## Next steps

 - [Azure Synapse Analytics](../index.yml)
-- [Apache Spark official documentation](https://spark.apache.org/docs/latest/)
+- [Apache Spark official documentation](https://spark.apache.org/docs/latest/)

articles/synapse-analytics/spark/apache-spark-to-power-bi.md

Lines changed: 23 additions & 23 deletions
@@ -12,7 +12,7 @@ ms.date: 11/16/2020

 # Tutorial: Create a Power BI report using Apache Spark and Azure Synapse Analytics

-Organizations often need to process large volumes of data before serving to key business stakeholders. In this tutorial, you will learn how to leverage the integrated experiences in Azure Synapse Analytics to process data using Apache Spark and later serve the data to end-users through Power BI and Serverless SQL.
+Organizations often need to process large volumes of data before serving to key business stakeholders. In this tutorial, you'll learn how to leverage the integrated experiences in Azure Synapse Analytics to process data using Apache Spark and later serve the data to end-users through Power BI and Serverless SQL.

 ## Before you begin
 - [Azure Synapse Analytics workspace](../quickstart-create-workspace.md) with an ADLS Gen2 storage account configured as the default storage.
@@ -21,29 +21,29 @@ Organizations often need to process large volumes of data before serving to key
 - Serverless Apache Spark pool in your Synapse Analytics workspace. For details, see [create a serverless Apache Spark pool](../quickstart-create-apache-spark-pool-studio.md)

 ## Download and prepare the data
-In this example, you will use Apache Spark to perform some analysis on taxi trip tip data from New York. The data is available through [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/catalog/nyc-taxi-limousine-commission-yellow-taxi-trip-records/). This subset of the dataset contains information about yellow taxi trips, including information about each trip, the start and end time and locations, the cost, and other interesting attributes.
+In this example, you'll use Apache Spark to perform some analysis on taxi trip tip data from New York. The data is available through [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/catalog/nyc-taxi-limousine-commission-yellow-taxi-trip-records/). This subset of the dataset contains information about yellow taxi trips, including information about each trip, the start, and end time and locations, the cost, and other interesting attributes.

 1. Run the following lines to create a Spark dataframe by pasting the code into a new cell. This retrieves the data via the Open Datasets API. Pulling all of this data generates about 1.5 billion rows. The following code example uses start_date and end_date to apply a filter that returns a single month of data.

    ```python
-   from azureml.opendatasets import NycTlcYellow
-   from dateutil import parser
-   from datetime import datetime
-
-   end_date = parser.parse('2018-06-06')
-   start_date = parser.parse('2018-05-01')
-   nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
-   filtered_df = nyc_tlc.to_spark_dataframe()
+   from azureml.opendatasets import NycTlcYellow
+   from dateutil import parser
+   from datetime import datetime
+
+   end_date = parser.parse('2018-06-06')
+   start_date = parser.parse('2018-05-01')
+   nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
+   filtered_df = spark.createDataFrame(nyc_tlc.to_pandas_dataframe())
    ```
-2. Using Apache Spark SQL, we will create a database called NycTlcTutorial. We will use this database to store the results of our data processing.
+2. Using Apache Spark SQL, we'll create a database called NycTlcTutorial. We'll use this database to store the results of our data processing.
    ```python
    %%pyspark
-   spark.sql("CREATE DATABASE IF NOT EXISTS NycTlcTutorial")
+   spark.sql("CREATE DATABASE IF NOT EXISTS NycTlcTutorial")
    ```
-3. Next, we will use Spark dataframe operations to process the data. In the following code, we perform the following transformations:
-   1. The removal of columns which are not needed.
+3. Next, we'll use Spark dataframe operations to process the data. In the following code, we perform the following transformations:
+   1. The removal of columns which aren't needed.
    2. The removal of outliers/incorrect values through filtering.
-   3. The creation of new features like ```tripTimeSecs``` and ```tipped``` for additional analysis.
+   3. The creation of new features like ```tripTimeSecs``` and ```tipped``` for extra analysis.
    ```python
    from pyspark.sql.functions import unix_timestamp, date_format, col, when
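
The diff elides the middle of this processing cell (original lines 50–61). For orientation, here is a hedged reconstruction of the full step 3 transformation, consistent with the visible context lines; the column names (`tpepPickupDateTime`, `tpepDropoffDateTime`, `passengerCount`, `tripDistance`, `fareAmount`, `tipAmount`) are assumed from the NycTlcYellow schema, and the exact select list and filter bounds are illustrative, not the verbatim article code:

```python
from pyspark.sql.functions import unix_timestamp, date_format, col, when

# Drop unneeded columns, derive tripTimeSecs and a binary tipped label,
# and filter out outliers/incorrect values (assumed bounds shown here).
taxi_df = (filtered_df
    .select("totalAmount", "fareAmount", "tipAmount", "paymentType",
            "rateCodeId", "passengerCount", "tripDistance",
            "tpepPickupDateTime", "tpepDropoffDateTime",
            date_format("tpepPickupDateTime", "hh").alias("pickupHour"),
            (unix_timestamp(col("tpepDropoffDateTime"))
             - unix_timestamp(col("tpepPickupDateTime"))).alias("tripTimeSecs"),
            when(col("tipAmount") > 0, 1).otherwise(0).alias("tipped"))
    .filter((filtered_df.passengerCount > 0) & (filtered_df.passengerCount < 8)
            & (filtered_df.tipAmount >= 0) & (filtered_df.fareAmount >= 1)
            & (filtered_df.tripDistance > 0)
            & (filtered_df.rateCodeId <= 5)
            & (filtered_df.paymentType.isin({"1", "2"}))))
```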

@@ -62,11 +62,11 @@ In this example, you will use Apache Spark to perform some analysis on taxi trip
                & (filtered_df.rateCodeId <= 5)
                & (filtered_df.paymentType.isin({"1", "2"})))
    ```
-4. Finally, we will save our dataframe using the Apache Spark ```saveAsTable``` method. This will allow you to later query and connect to the same table using serverless SQL pools.
-   ```python
-   taxi_df.write.mode("overwrite").saveAsTable("NycTlcTutorial.nyctaxi")
-   ```
-
+4. Finally, we'll save our dataframe using the Apache Spark ```saveAsTable``` method. This will allow you to later query and connect to the same table using serverless SQL pools.
+   ```python
+   taxi_df.write.mode("overwrite").saveAsTable("NycTlcTutorial.nyctaxi")
+   ```
+
 ## Query data using serverless SQL pools
 Azure Synapse Analytics allows the different workspace computational engines to share databases and tables between its serverless Apache Spark pools and serverless SQL pool. This is powered through the Synapse [shared metadata management](../metadata/overview.md) capability. As a result, the Spark created databases and their parquet-backed tables become visible in the workspace serverless SQL pool.
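
To make the shared-metadata point concrete: once `saveAsTable` has run, the same table answers from both engines. A hedged sketch follows, where the `-ondemand` server name is a placeholder for your own workspace endpoint, and the ODBC driver and interactive Microsoft Entra sign-in are assumptions:

```python
import pyodbc

# From the Spark pool, the saved table is an ordinary Spark SQL table:
#   spark.sql("SELECT COUNT(*) FROM NycTlcTutorial.nyctaxi").show()

# From outside the workspace, the serverless SQL pool exposes the same
# table; <workspace> and the auth mode below are placeholders/assumptions.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"
    "Database=NycTlcTutorial;"
    "Authentication=ActiveDirectoryInteractive;"
)
for row in conn.execute("SELECT TOP 5 * FROM dbo.nyctaxi"):
    print(row)
```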

@@ -80,7 +80,7 @@ To query your Apache Spark table using your serverless SQL pool:
 3. You can continue to refine your query or even visualize your results using the SQL charting options.

 ## Connect to Power BI
-Next, we will connect our serverless SQL pool to our Power BI workspace. Once you have connected your workspace, you will be able to create Power BI reports both directly from Azure Synapse Analytics as well as from Power BI desktop.
+Next, we'll connect our serverless SQL pool to our Power BI workspace. Once you have connected your workspace, you'll be able to create Power BI reports both directly from Azure Synapse Analytics and from Power BI desktop.

 >[!Note]
 > Before you begin, you will need to set up a linked service to your [Power BI workspace](../quickstart-power-bi.md) and download the [Power BI desktop](/power-bi/service-create-the-new-workspaces).
@@ -104,7 +104,7 @@ To connect our serverless SQL pool to our Power BI workspace:

 2. On the Power BI desktop Home tab, select **Publish** and **Save** changes. Enter a file name and save this report to the *NycTaxiTutorial Workspace*.

-3. In addition, you can also create Power BI visualizations from within your Azure Synapse Analytics workspace. To do this, navigate to the **Develop** tab in your Azure Synapse workspace and open the Power BI tab. From here, you can select your report and continue building additional visualizations.
+3. In addition, you can also create Power BI visualizations from within your Azure Synapse Analytics workspace. To do this, navigate to the **Develop** tab in your Azure Synapse workspace and open the Power BI tab. From here, you can select your report and continue building more visualizations.

 :::image type="content" source="../spark/media/apache-spark-power-bi/power-bi-synapse.png" alt-text="Azure Synapse Analytics Workspace." border="true":::

@@ -113,4 +113,4 @@ For more details on how to create a dataset through serverless SQL and connect t
 ## Next steps
 You can continue to learn more about data visualization capabilities in Azure Synapse Analytics by visiting the following documents and tutorials:
 - [Visualize data with serverless Apache Spark pools](../spark/apache-spark-data-visualization-tutorial.md)
-- [Overview of data visualization with Apache Spark pools](../spark/apache-spark-data-visualization.md)
+- [Overview of data visualization with Apache Spark pools](../spark/apache-spark-data-visualization.md)
