Commit df7ff6f

Merge pull request #214820 from Dawn2111/prdawn/query-pushdown
query push down
2 parents 081cb9c + 4662973 commit df7ff6f

1 file changed: articles/synapse-analytics/spark/synapse-spark-sql-pool-import-export.md (+229, -10)

@@ -18,9 +18,10 @@ The Azure Synapse Dedicated SQL Pool Connector for Apache Spark in Azure Synapse
At a high level, the connector provides the following capabilities:

* Read from Azure Synapse Dedicated SQL Pool:
    * Read large data sets from Synapse Dedicated SQL Pool tables (Internal and External) and views.
    * Comprehensive predicate push down support, where filters on the DataFrame get mapped to corresponding SQL push down predicates.
    * Support for column pruning.
    * Support for query push down.
* Write to Azure Synapse Dedicated SQL Pool:
    * Ingest large volumes of data into Internal and External table types.
    * Supports the following DataFrame save mode preferences:
@@ -198,17 +199,17 @@ This section presents reference code templates to describe how to use and invoke
##### [Scala](#tab/scala)

```Scala
synapsesql(tableName:String="") => org.apache.spark.sql.DataFrame
```

##### [Python](#tab/python)

```python
synapsesql(table_name: str="") -> org.apache.spark.sql.DataFrame
```
---
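Note that the table name argument now defaults to an empty string, so `synapsesql` can be invoked with no table when the source is instead supplied as a query via `Constants.QUERY` (described later in this article). As a minimal illustration of that either/or contract, here is a sketch using a hypothetical `resolve_read_source` helper that is not part of the connector's API:

```python
# Hypothetical sketch of the either/or contract behind the optional table name:
# a read source is a table OR a query, never both. resolve_read_source is an
# illustrative helper and is not part of the connector's API.
def resolve_read_source(table_name: str = "", query: str = ""):
    if table_name and query:
        raise ValueError("Table name and query cannot be specified at the same time.")
    if table_name:
        return ("table", table_name)
    if query:
        return ("query", query)
    raise ValueError("Provide either a table name or a query.")

# Example: a query-based read leaves table_name at its default empty string.
source_kind, source = resolve_read_source(query="select 1 as c")
```

The same rule is what the connector enforces when both a table name and `Constants.QUERY` are supplied.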

#### Read from a table using Azure AD based authentication

##### [Scala](#tab/scala1)

@@ -268,7 +269,13 @@ dfToReadFromTable.show()
```
---

#### Read from a query using Azure AD based authentication

> [!NOTE]
> Restrictions when reading from a query:
> * A table name and a query cannot be specified at the same time.
> * Only select queries are allowed; DDL and DML statements are not allowed.
> * The select and filter options on the DataFrame are not pushed down to the SQL dedicated pool when a query is specified.
> * Read from a query is only available in Spark 3.1 and 3.2; it is not available in Spark 2.4.

##### [Scala](#tab/scala2)

@@ -279,6 +286,98 @@ import org.apache.spark.sql.DataFrame
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._

// Read from a query
// The query can be provided either as an argument to synapsesql or as a Constant - Constants.QUERY
val dfToReadFromQueryAsOption:DataFrame = spark.read.
    //Name of the SQL Dedicated Pool or database where the query will run
    //The database can be specified as a Spark Config - spark.sqlanalyticsconnector.dw.database - or as a Constant - Constants.DATABASE
    option(Constants.DATABASE, "<database_name>").
    //If `Constants.SERVER` is not provided, the `<database_name>` from the three-part table name argument
    //to the `synapsesql` method is used to infer the Synapse Dedicated SQL End Point.
    option(Constants.SERVER, "<sql-server-name>.sql.azuresynapse.net").
    //Defaults to the storage path defined in the runtime configurations
    option(Constants.TEMP_FOLDER, "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<some_base_path_for_temporary_staging_folders>").
    //Query from which data will be read
    option(Constants.QUERY, "select <column_name>, count(*) as cnt from <schema_name>.<table_name> group by <column_name>").
    synapsesql()

val dfToReadFromQueryAsArgument:DataFrame = spark.read.
    //Name of the SQL Dedicated Pool or database where the query will run
    //The database can be specified as a Spark Config - spark.sqlanalyticsconnector.dw.database - or as a Constant - Constants.DATABASE
    option(Constants.DATABASE, "<database_name>").
    //If `Constants.SERVER` is not provided, the `<database_name>` from the three-part table name argument
    //to the `synapsesql` method is used to infer the Synapse Dedicated SQL End Point.
    option(Constants.SERVER, "<sql-server-name>.sql.azuresynapse.net").
    //Defaults to the storage path defined in the runtime configurations
    option(Constants.TEMP_FOLDER, "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<some_base_path_for_temporary_staging_folders>").
    //Query from which data will be read
    synapsesql("select <column_name>, count(*) as counts from <schema_name>.<table_name> group by <column_name>")

//Show contents of the dataframes
dfToReadFromQueryAsOption.show()
dfToReadFromQueryAsArgument.show()
```

##### [Python](#tab/python2)

```python
# Add required imports
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants
from pyspark.sql.functions import col

# Name of the SQL Dedicated Pool or database where the query will run
# The database can be specified as a Spark Config or as a Constant - Constants.DATABASE
spark.conf.set("spark.sqlanalyticsconnector.dw.database", "<database_name>")

# Read from a query
# The query can be provided either as an argument to synapsesql or as a Constant - Constants.QUERY
dfToReadFromQueryAsOption = (spark.read
    # Name of the SQL Dedicated Pool or database where the query will run
    # The database can be specified as a Spark Config - spark.sqlanalyticsconnector.dw.database - or as a Constant - Constants.DATABASE
    .option(Constants.DATABASE, "<database_name>")
    # If `Constants.SERVER` is not provided, the `<database_name>` from the three-part table name argument
    # to the `synapsesql` method is used to infer the Synapse Dedicated SQL End Point.
    .option(Constants.SERVER, "<sql-server-name>.sql.azuresynapse.net")
    # Defaults to the storage path defined in the runtime configurations
    .option(Constants.TEMP_FOLDER, "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<some_base_path_for_temporary_staging_folders>")
    # Query from which data will be read
    .option(Constants.QUERY, "select <column_name>, count(*) as cnt from <schema_name>.<table_name> group by <column_name>")
    .synapsesql()
)

dfToReadFromQueryAsArgument = (spark.read
    # Name of the SQL Dedicated Pool or database where the query will run
    # The database can be specified as a Spark Config - spark.sqlanalyticsconnector.dw.database - or as a Constant - Constants.DATABASE
    .option(Constants.DATABASE, "<database_name>")
    # If `Constants.SERVER` is not provided, the `<database_name>` from the three-part table name argument
    # to the `synapsesql` method is used to infer the Synapse Dedicated SQL End Point.
    .option(Constants.SERVER, "<sql-server-name>.sql.azuresynapse.net")
    # Defaults to the storage path defined in the runtime configurations
    .option(Constants.TEMP_FOLDER, "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<some_base_path_for_temporary_staging_folders>")
    # Query from which data will be read
    .synapsesql("select <column_name>, count(*) as counts from <schema_name>.<table_name> group by <column_name>")
)

# Show contents of the dataframes
dfToReadFromQueryAsOption.show()
dfToReadFromQueryAsArgument.show()
```
---
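The read-from-query restrictions in the note above (select statements only, no DDL or DML) can be approximated with a client-side pre-check before submitting the read. A rough sketch, assuming a hypothetical `validate_read_query` helper that is not part of the connector:

```python
# Hypothetical pre-check mirroring the read-from-query restrictions: only
# select queries are accepted; DDL and DML statements are rejected. This is a
# simplified illustration, not the connector's actual validation.
def validate_read_query(query: str) -> bool:
    tokens = query.strip().lower().split()
    return bool(tokens) and tokens[0] == "select"

assert validate_read_query("select col, count(*) as cnt from dbo.t group by col")
assert not validate_read_query("drop table dbo.t")        # DDL rejected
assert not validate_read_query("update dbo.t set c = 1")  # DML rejected
```

Note that this first-token check is deliberately naive (for example, it would also reject a select that begins with a CTE), so treat it strictly as an illustration of the rule.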
#### Read from a table using basic authentication

##### [Scala](#tab/scala3)

```Scala
//Use case is to read data from an internal table in Synapse Dedicated SQL Pool DB
//This example uses basic authentication; the Azure Active Directory based authentication approach is preferred.
import org.apache.spark.sql.DataFrame
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._

//Read from existing internal table
val dfToReadFromTable:DataFrame = spark.read.
    //If `Constants.SERVER` is not provided, the `<database_name>` from the three-part table name argument
@@ -289,22 +388,23 @@ val dfToReadFromTable:DataFrame = spark.read.
    //Set user's password to the database
    option(Constants.PASSWORD, "<user_password>").
    //Set name of the data source definition that is defined with database scoped credentials.
    //Data extracted from the table will be staged to the storage path defined on the data source's location setting.
    option(Constants.DATA_SOURCE, "<data_source_name>").
    //Three-part table name from which data will be read.
    synapsesql("<database_name>.<schema_name>.<table_name>").
    //Column-pruning, i.e., select only the required column values.
    select("<some_column_1>", "<some_column_5>", "<some_column_n>").
    //Push-down filter criteria that gets translated to SQL push-down predicates.
    filter(col("Title").startsWith("E")).
    //Fetch a sample of 10 records
    limit(10)

//Show contents of the dataframe
dfToReadFromTable.show()
```

##### [Python](#tab/python3)

```python
# Add required imports
@@ -323,7 +423,7 @@ dfToReadFromTable = (spark.read
    .option(Constants.PASSWORD, "<user_password>")
    # Set name of the data source definition that is defined with database scoped credentials.
    # https://learn.microsoft.com/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-ver15&tabs=dedicated#h-create-external-data-source-to-access-data-in-azure-storage-using-the-abfs-interface
    # Data extracted from the table will be staged to the storage path defined on the data source's location setting.
    .option(Constants.DATA_SOURCE, "<data_source_name>")
    # Three-part table name from which data will be read.
    .synapsesql("<database_name>.<schema_name>.<table_name>")
@@ -332,14 +432,133 @@ dfToReadFromTable = (spark.read
    # Push-down filter criteria that gets translated to SQL push-down predicates.
    .filter(col("Title").contains("E"))
    # Fetch a sample of 10 records
    .limit(10)
)

# Show contents of the dataframe
dfToReadFromTable.show()
```
---
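In the table-read path above, the `select`, `filter`, and `limit` calls are pushed down to the dedicated pool rather than evaluated in Spark. As a rough illustration of what that translation amounts to (hypothetical `build_pushdown_sql` helper; the connector's real translation is internal), the chained DataFrame operations collapse into one SQL statement:

```python
# Hypothetical illustration of predicate push down: column pruning, a filter,
# and a row limit rendered as a single T-SQL statement. Not the connector's
# actual implementation.
def build_pushdown_sql(table, columns, filter_col, prefix, limit):
    cols = ", ".join(columns)
    return (f"SELECT TOP {limit} {cols} FROM {table} "
            f"WHERE {filter_col} LIKE '{prefix}%'")

sql = build_pushdown_sql("dbo.Books", ["Title", "Id"], "Title", "E", 10)
# sql == "SELECT TOP 10 Title, Id FROM dbo.Books WHERE Title LIKE 'E%'"
```

Running the filter on the pool this way is what lets the connector stage only matching rows instead of the whole table.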

#### Read from a query using basic authentication

##### [Scala](#tab/scala4)

```Scala
//Use case is to read data from a query against the Synapse Dedicated SQL Pool DB
//This example uses basic authentication; the Azure Active Directory based authentication approach is preferred.
import org.apache.spark.sql.DataFrame
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._

// Name of the SQL Dedicated Pool or database where the query will run
// The database can be specified as a Spark Config or as a Constant - Constants.DATABASE
spark.conf.set("spark.sqlanalyticsconnector.dw.database", "<database_name>")

// Read from a query
// The query can be provided either as an argument to synapsesql or as a Constant - Constants.QUERY
val dfToReadFromQueryAsOption:DataFrame = spark.read.
    //Name of the SQL Dedicated Pool or database where the query will run
    //The database can be specified as a Spark Config - spark.sqlanalyticsconnector.dw.database - or as a Constant - Constants.DATABASE
    option(Constants.DATABASE, "<database_name>").
    //If `Constants.SERVER` is not provided, the `<database_name>` from the three-part table name argument
    //to the `synapsesql` method is used to infer the Synapse Dedicated SQL End Point.
    option(Constants.SERVER, "<sql-server-name>.sql.azuresynapse.net").
    //Set database user name
    option(Constants.USER, "<user_name>").
    //Set user's password to the database
    option(Constants.PASSWORD, "<user_password>").
    //Set name of the data source definition that is defined with database scoped credentials.
    //Data extracted from the SQL query will be staged to the storage path defined on the data source's location setting.
    option(Constants.DATA_SOURCE, "<data_source_name>").
    //Query from which data will be read.
    option(Constants.QUERY, "select <column_name>, count(*) as counts from <schema_name>.<table_name> group by <column_name>").
    synapsesql()

val dfToReadFromQueryAsArgument:DataFrame = spark.read.
    //Name of the SQL Dedicated Pool or database where the query will run
    //The database can be specified as a Spark Config - spark.sqlanalyticsconnector.dw.database - or as a Constant - Constants.DATABASE
    option(Constants.DATABASE, "<database_name>").
    //If `Constants.SERVER` is not provided, the `<database_name>` from the three-part table name argument
    //to the `synapsesql` method is used to infer the Synapse Dedicated SQL End Point.
    option(Constants.SERVER, "<sql-server-name>.sql.azuresynapse.net").
    //Set database user name
    option(Constants.USER, "<user_name>").
    //Set user's password to the database
    option(Constants.PASSWORD, "<user_password>").
    //Set name of the data source definition that is defined with database scoped credentials.
    //Data extracted from the SQL query will be staged to the storage path defined on the data source's location setting.
    option(Constants.DATA_SOURCE, "<data_source_name>").
    //Query from which data will be read.
    synapsesql("select <column_name>, count(*) as counts from <schema_name>.<table_name> group by <column_name>")

//Show contents of the dataframes
dfToReadFromQueryAsOption.show()
dfToReadFromQueryAsArgument.show()
```

##### [Python](#tab/python4)

```python
# Add required imports
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants
from pyspark.sql.functions import col

# Name of the SQL Dedicated Pool or database where the query will run
# The database can be specified as a Spark Config or as a Constant - Constants.DATABASE
spark.conf.set("spark.sqlanalyticsconnector.dw.database", "<database_name>")

# Read from a query
# The query can be provided either as an argument to synapsesql or as a Constant - Constants.QUERY
dfToReadFromQueryAsOption = (spark.read
    # Name of the SQL Dedicated Pool or database where the query will run
    # The database can be specified as a Spark Config - spark.sqlanalyticsconnector.dw.database - or as a Constant - Constants.DATABASE
    .option(Constants.DATABASE, "<database_name>")
    # If `Constants.SERVER` is not provided, the `<database_name>` from the three-part table name argument
    # to the `synapsesql` method is used to infer the Synapse Dedicated SQL End Point.
    .option(Constants.SERVER, "<sql-server-name>.sql.azuresynapse.net")
    # Set database user name
    .option(Constants.USER, "<user_name>")
    # Set user's password to the database
    .option(Constants.PASSWORD, "<user_password>")
    # Set name of the data source definition that is defined with database scoped credentials.
    # https://docs.microsoft.com/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-ver15&tabs=dedicated#h-create-external-data-source-to-access-data-in-azure-storage-using-the-abfs-interface
    # Data extracted from the SQL query will be staged to the storage path defined on the data source's location setting.
    .option(Constants.DATA_SOURCE, "<data_source_name>")
    # Query from which data will be read.
    .option(Constants.QUERY, "select <column_name>, count(*) as counts from <schema_name>.<table_name> group by <column_name>")
    .synapsesql()
)

dfToReadFromQueryAsArgument = (spark.read
    # Name of the SQL Dedicated Pool or database where the query will run
    # The database can be specified as a Spark Config - spark.sqlanalyticsconnector.dw.database - or as a Constant - Constants.DATABASE
    .option(Constants.DATABASE, "<database_name>")
    # If `Constants.SERVER` is not provided, the `<database_name>` from the three-part table name argument
    # to the `synapsesql` method is used to infer the Synapse Dedicated SQL End Point.
    .option(Constants.SERVER, "<sql-server-name>.sql.azuresynapse.net")
    # Set database user name
    .option(Constants.USER, "<user_name>")
    # Set user's password to the database
    .option(Constants.PASSWORD, "<user_password>")
    # Set name of the data source definition that is defined with database scoped credentials.
    # https://docs.microsoft.com/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-ver15&tabs=dedicated#h-create-external-data-source-to-access-data-in-azure-storage-using-the-abfs-interface
    # Data extracted from the SQL query will be staged to the storage path defined on the data source's location setting.
    .option(Constants.DATA_SOURCE, "<data_source_name>")
    .synapsesql("select <column_name>, count(*) as counts from <schema_name>.<table_name> group by <column_name>")
)

# Show contents of the dataframes
dfToReadFromQueryAsOption.show()
dfToReadFromQueryAsArgument.show()
```
---
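Several comments above note that, when `Constants.SERVER` is not provided, the connector falls back to the `<database_name>` component of the three-part table name to infer the dedicated SQL endpoint. A sketch of just the parsing step (hypothetical `split_three_part_name` helper; the actual endpoint inference is internal to the connector):

```python
# Hypothetical parser for the <database_name>.<schema_name>.<table_name>
# argument accepted by synapsesql(). Illustrative only; not part of the
# connector's API.
def split_three_part_name(name: str):
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("Expected <database_name>.<schema_name>.<table_name>")
    return tuple(parts)

database, schema, table = split_three_part_name("sqlpool1.dbo.orders")
# database == "sqlpool1"; this is the component the connector would use
# when inferring the endpoint in the absence of Constants.SERVER.
```

Passing a fully qualified three-part name is therefore the simplest way to avoid setting `Constants.SERVER` and `Constants.DATABASE` explicitly.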
### Write to Azure Synapse Dedicated SQL Pool

#### Write Request - `synapsesql` method signature
