
Commit 015ddb2

Merge pull request #8 from ruixinxu/nbsample2
add more scala samples
2 parents 77f57fe + b3aface

5 files changed: +15356 −1 lines changed
Lines changed: 384 additions & 0 deletions
@@ -0,0 +1,384 @@
{
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Access data on Azure Data Lake Storage Gen2 (ADLS Gen2) with Synapse Spark\n",
        "\n",
        "Azure Data Lake Storage Gen2 (ADLS Gen2) is used as the storage account associated with a Synapse workspace. A Synapse workspace can have a default ADLS Gen2 storage account and additional linked storage accounts. \n",
        "\n",
        "You can access data on ADLS Gen2 with Synapse Spark via a URL of the following form:\n",
        " \n",
        "    abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>\n",
        "\n",
        "This notebook provides examples of how to read data from an ADLS Gen2 account into a Spark context and how to write the output of Spark jobs directly to an ADLS Gen2 location.\n",
        "\n",
        "## Pre-requisites\n",
        "Synapse leverages AAD pass-through to access any ADLS Gen2 account (or folder) on which you have the **Blob Storage Contributor** permission. No credentials or access tokens are required. "
      ],
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Load sample data\n",
        "\n",
        "Let's first load the [public holidays](https://azure.microsoft.com/en-us/services/open-datasets/catalog/public-holidays/) dataset from Azure Open Datasets as a sample."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "// set blob storage account connection for the open dataset\n",
        "\n",
        "val hol_blob_account_name = \"azureopendatastorage\"\n",
        "val hol_blob_container_name = \"holidaydatacontainer\"\n",
        "val hol_blob_relative_path = \"Processed\"\n",
        "val hol_blob_sas_token = \"\"\n",
        "\n",
        "val hol_wasbs_path = f\"wasbs://$hol_blob_container_name@$hol_blob_account_name.blob.core.windows.net/$hol_blob_relative_path\"\n",
        "spark.conf.set(f\"fs.azure.sas.$hol_blob_container_name.$hol_blob_account_name.blob.core.windows.net\", hol_blob_sas_token)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 5,
          "data": {
            "text/plain": [
              "hol_blob_account_name: String = azureopendatastorage\n",
              "hol_blob_container_name: String = holidaydatacontainer\n",
              "hol_blob_relative_path: String = Processed\n",
              "hol_blob_sas_token: String = \"\"\n",
              "hol_wasbs_path: String = wasbs://holidaydatacontainer@azureopendatastorage.blob.core.windows.net/Processed"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 5,
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "// load the sample data as a Spark DataFrame\n",
        "val hol_df = spark.read.parquet(hol_wasbs_path)\n",
        "hol_df.show(5, truncate = false)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 6,
          "data": {
            "text/plain": [
              "hol_df: org.apache.spark.sql.DataFrame = [countryOrRegion: string, holidayName: string ... 4 more fields]\n",
              "+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+\n",
              "|countryOrRegion|holidayName               |normalizeHolidayName      |isPaidTimeOff|countryRegionCode|date               |\n",
              "+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+\n",
              "|Argentina      |Año Nuevo [New Year's Day]|Año Nuevo [New Year's Day]|null         |AR               |1970-01-01 00:00:00|\n",
              "|Australia      |New Year's Day            |New Year's Day            |null         |AU               |1970-01-01 00:00:00|\n",
              "|Austria        |Neujahr                   |Neujahr                   |null         |AT               |1970-01-01 00:00:00|\n",
              "|Belgium        |Nieuwjaarsdag             |Nieuwjaarsdag             |null         |BE               |1970-01-01 00:00:00|\n",
              "|Brazil         |Ano novo                  |Ano novo                  |null         |BR               |1970-01-01 00:00:00|\n",
              "+---------------+--------------------------+--------------------------+-------------+-----------------+-------------------+\n",
              "only showing top 5 rows"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 6,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Write data to the default ADLS Gen2 storage\n",
        "\n",
        "We are going to write the Spark dataframe to your default ADLS Gen2 storage account.\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "// set your storage account connection\n",
        "\n",
        "val account_name = \"\" // replace with your storage account name\n",
        "val container_name = \"\" // replace with your container name\n",
        "val relative_path = \"\" // replace with your relative folder path\n",
        "\n",
        "val adls_path = f\"abfss://$container_name@$account_name.dfs.core.windows.net/$relative_path\""
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 7,
          "data": {
            "text/plain": [
              "account_name: String = ltianwestus2gen2\n",
              "container_name: String = mydefault\n",
              "relative_path: String = samplenb/\n",
              "adls_path: String = abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 7,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Save a dataframe as Parquet, JSON or CSV\n",
        "If you have a dataframe, you can save it to Parquet, JSON or CSV with the .write.parquet(), .write.json() and .write.csv() methods respectively.\n",
        "\n",
        "Dataframes can be saved in any format, regardless of the input format.\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "// set the paths for the output files\n",
        "\n",
        "val parquet_path = adls_path + \"holiday.parquet\"\n",
        "val json_path = adls_path + \"holiday.json\"\n",
        "val csv_path = adls_path + \"holiday.csv\""
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 9,
          "data": {
            "text/plain": [
              "parquet_path: String = abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/holiday.parquet\n",
              "json_path: String = abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/holiday.json\n",
              "csv_path: String = abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/holiday.csv"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 9,
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import org.apache.spark.sql.SaveMode\n",
        "\n",
        "hol_df.write.mode(SaveMode.Overwrite).parquet(parquet_path)\n",
        "hol_df.write.mode(SaveMode.Overwrite).json(json_path)\n",
        "hol_df.write.mode(SaveMode.Overwrite).option(\"header\", \"true\").csv(csv_path)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 10,
          "data": {
            "text/plain": [
              "import org.apache.spark.sql.SaveMode"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 10,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Save a dataframe as text files\n",
        "If you have a dataframe that you want to save as a text file, you must first convert it to an RDD and then save that RDD as a text file.\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "// Define the text file path and convert the Spark dataframe into an RDD\n",
        "val text_path = adls_path + \"holiday.txt\"\n",
        "val hol_RDD = hol_df.rdd"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 12,
          "data": {
            "text/plain": [
              "text_path: String = abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/holiday.txt\n",
              "hol_RDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[24] at rdd at <console>:30"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 12,
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "// Save the RDD as a text file\n",
        "hol_RDD.saveAsTextFile(text_path)"
      ],
      "outputs": [],
      "execution_count": 14,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Read data from the default ADLS Gen2 storage\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Create a dataframe from Parquet files\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "val df_parquet = spark.read.parquet(parquet_path)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 15,
          "data": {
            "text/plain": [
              "df_parquet: org.apache.spark.sql.DataFrame = [countryOrRegion: string, holidayName: string ... 4 more fields]"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 15,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Create a dataframe from JSON files\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "val df_json = spark.read.json(json_path)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 16,
          "data": {
            "text/plain": [
              "df_json: org.apache.spark.sql.DataFrame = [countryOrRegion: string, countryRegionCode: string ... 4 more fields]"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 16,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Create a dataframe from CSV files\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "// read the CSV back, treating the first line of each file as a header\n",
        "val df_csv = spark.read.option(\"header\", \"true\").csv(csv_path)"
      ],
      "outputs": [],
      "execution_count": 17,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Create an RDD from a text file\n"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "val text = sc.textFile(text_path)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 18,
          "data": {
            "text/plain": [
              "text: org.apache.spark.rdd.RDD[String] = abfss://mydefault@ltianwestus2gen2.dfs.core.windows.net/samplenb/holiday.txt MapPartitionsRDD[33] at textFile at <console>:32"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 18,
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "text.take(5).foreach(println)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 19,
          "data": {
            "text/plain": [
              "[Argentina,Año Nuevo [New Year's Day],Año Nuevo [New Year's Day],null,AR,1970-01-01 00:00:00.0]\n",
              "[Australia,New Year's Day,New Year's Day,null,AU,1970-01-01 00:00:00.0]\n",
              "[Austria,Neujahr,Neujahr,null,AT,1970-01-01 00:00:00.0]\n",
              "[Belgium,Nieuwjaarsdag,Nieuwjaarsdag,null,BE,1970-01-01 00:00:00.0]\n",
              "[Brazil,Ano novo,Ano novo,null,BR,1970-01-01 00:00:00.0]"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 19,
      "metadata": {}
    }
  ],
  "metadata": {
    "saveOutput": true,
    "language_info": {
      "name": "scala"
    },
    "nteract": {
      "version": "0.22.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}
