
Commit b4a9913

Merge pull request #216663 from SnehaGunda/freshnessReview
Review and freshness updates- SQL architecture
2 parents e8613d7 + fea1974 commit b4a9913

File tree: 1 file changed (+27, −24 lines)


articles/synapse-analytics/sql/overview-architecture.md

Lines changed: 27 additions & 24 deletions
@@ -6,28 +6,29 @@ manager: rothja
 ms.service: synapse-analytics
 ms.topic: conceptual
 ms.subservice: sql
-ms.date: 04/15/2020
+ms.date: 11/01/2022
 ms.author: martinle
 ms.reviewer: wiassaf
+ms.custom: engagement-fy23
 ---

-# Azure Synapse SQL architecture
+# Azure Synapse SQL architecture

-This article describes the architecture components of Synapse SQL.
+This article describes the architecture components of Synapse SQL. It also explains how Azure Synapse SQL combines distributed query processing capabilities with Azure Storage to achieve high performance and scalability.

 ## Synapse SQL architecture components

-Synapse SQL leverages a scale out architecture to distribute computational processing of data across multiple nodes. Compute is separate from storage, which enables you to scale compute independently of the data in your system.
+Synapse SQL uses a scale-out architecture to distribute computational processing of data across multiple nodes. Compute is separate from storage, which enables you to scale compute independently of the data in your system.

-For dedicated SQL pool, the unit of scale is an abstraction of compute power that is known as a [data warehouse unit](resource-consumption-models.md).
+For dedicated SQL pool, the unit of scale is an abstraction of compute power that is known as a [data warehouse unit](resource-consumption-models.md).

-For serverless SQL pool, being serverless, scaling is done automatically to accommodate query resource requirements. As topology changes over time by adding, removing nodes or failovers, it adapts to changes and makes sure your query has enough resources and finishes successfully. For example, the image below shows serverless SQL pool utilizing 4 compute nodes to execute a query.
+For serverless SQL pool, being serverless, scaling is done automatically to accommodate query resource requirements. As topology changes over time by adding, removing nodes or failovers, it adapts to changes and makes sure your query has enough resources and finishes successfully. For example, the following image shows serverless SQL pool using four compute nodes to execute a query.

-![Synapse SQL architecture](./media//overview-architecture/sql-architecture.png)
+:::image type="content" source="./media/overview-architecture/sql-architecture.png" alt-text="Screenshot of Synapse SQL architecture." lightbox="./media/overview-architecture/sql-architecture.png" :::

-Synapse SQL uses a node-based architecture. Applications connect and issue T-SQL commands to a Control node, which is the single point of entry for Synapse SQL.
+Synapse SQL uses a node-based architecture. Applications connect and issue T-SQL commands to a Control node, which is the single point of entry for Synapse SQL.

-The Azure Synapse SQL Control node utilizes a distributed query engine to optimize queries for parallel processing, and then passes operations to Compute nodes to do their work in parallel.
+The Azure Synapse SQL Control node utilizes a distributed query engine to optimize queries for parallel processing, and then passes operations to Compute nodes to do their work in parallel.

 The serverless SQL pool Control node utilizes Distributed Query Processing (DQP) engine to optimize and orchestrate distributed execution of user query by splitting it into smaller queries that will be executed on Compute nodes. Each small query is called task and represents distributed execution unit. It reads file(s) from storage, joins results from other tasks, groups, or orders data retrieved from other tasks.

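As an aside to the hunk above: the kind of serverless SQL pool query that the DQP engine splits into tasks is typically an ad hoc read over data lake files. A minimal sketch, assuming a hypothetical storage account and file path (not part of the article or the commit):

```sql
-- Ad hoc serverless SQL pool query over Parquet files in a data lake.
-- The DQP engine on the Control node splits this into tasks; each task
-- reads a subset of the matched files on a Compute node.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://contosolake.dfs.core.windows.net/files/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;
```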
@@ -41,7 +42,7 @@ With decoupled storage and compute, when using Synapse SQL one can benefit from

 ## Azure Storage

-Synapse SQL leverages Azure Storage to keep your user data safe. Since your data is stored and managed by Azure Storage, there is a separate charge for your storage consumption.
+Synapse SQL uses Azure Storage to keep your user data safe. Since your data is stored and managed by Azure Storage, there's a separate charge for your storage consumption.

 Serverless SQL pool allows you to query your data lake files, while dedicated SQL pool allows you to query and ingest data from your data lake files. When data is ingested into dedicated SQL pool, the data is sharded into **distributions** to optimize the performance of the system. You can choose which sharding pattern to use to distribute the data when you define the table. These sharding patterns are supported:

@@ -51,15 +52,15 @@ Serverless SQL pool allows you to query your data lake files, while dedicated SQ

 ## Control node

-The Control node is the brain of the architecture. It is the front end that interacts with all applications and connections.
+The Control node is the brain of the architecture. It's the front end that interacts with all applications and connections.

 In Synapse SQL, the distributed query engine runs on the Control node to optimize and coordinate parallel queries. When you submit a T-SQL query to dedicated SQL pool, the Control node transforms it into queries that run against each distribution in parallel.

 In serverless SQL pool, the DQP engine runs on Control node to optimize and coordinate distributed execution of user query by splitting it into smaller queries that will be executed on Compute nodes. It also assigns sets of files to be processed by each node.

 ## Compute nodes

-The Compute nodes provide the computational power.
+The Compute nodes provide the computational power.

 In dedicated SQL pool, distributions map to Compute nodes for processing. As you pay for more compute resources, pool remaps the distributions to the available Compute nodes. The number of compute nodes ranges from 1 to 60, and is determined by the service level for the dedicated SQL pool. Each Compute node has a node ID that is visible in system views. You can see the Compute node ID by looking for the node_id column in system views whose names begin with sys.pdw_nodes. For a list of these system views, see [Synapse SQL system views](/sql/relational-databases/system-catalog-views/sql-data-warehouse-and-parallel-data-warehouse-catalog-views?view=azure-sqldw-latest&preserve-view=true).

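The node IDs mentioned in the hunk above can be inspected with a short query. A sketch, assuming the `sys.dm_pdw_nodes` DMV available in dedicated SQL pool (the query itself is illustrative, not part of the commit):

```sql
-- List the nodes in a dedicated SQL pool. pdw_node_id is the node ID
-- that the sys.pdw_nodes_* catalog views expose for each Compute node.
SELECT pdw_node_id, type, name
FROM sys.dm_pdw_nodes;
```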
@@ -73,21 +74,22 @@ Data Movement Service (DMS) is the data transport technology in dedicated SQL po

 ## Distributions

-A distribution is the basic unit of storage and processing for parallel queries that run on distributed data in dedicated SQL pool. When dedicated SQL pool runs a query, the work is divided into 60 smaller queries that run in parallel.
+A distribution is the basic unit of storage and processing for parallel queries that run on distributed data in dedicated SQL pool. When dedicated SQL pool runs a query, the work is divided into 60 smaller queries that run in parallel.

-Each of the 60 smaller queries runs on one of the data distributions. Each Compute node manages one or more of the 60 distributions. A dedicated SQL pool with maximum compute resources has one distribution per Compute node. A dedicated SQL pool with minimum compute resources has all the distributions on one compute node.
+Each of the 60 smaller queries runs on one of the data distributions. Each Compute node manages one or more of the 60 distributions. A dedicated SQL pool with maximum compute resources has one distribution per Compute node. A dedicated SQL pool with minimum compute resources has all the distributions on one compute node.

 ## Hash-distributed tables
-A hash distributed table can deliver the highest query performance for joins and aggregations on large tables.
+
+A hash distributed table can deliver the highest query performance for joins and aggregations on large tables.

 To shard data into a hash-distributed table, dedicated SQL pool uses a hash function to deterministically assign each row to one distribution. In the table definition, one of the columns is designated as the distribution column. The hash function uses the values in the distribution column to assign each row to a distribution.

-The following diagram illustrates how a full (non-distributed table) gets stored as a hash-distributed table.
+The following diagram illustrates how a full (non-distributed table) gets stored as a hash-distributed table.

-![Distributed table](media//overview-architecture/hash-distributed-table.png "Distributed table")
+:::image type="content" source="./media/overview-architecture/hash-distributed-table.png" alt-text="Screenshot of a table stored as a hash-distribution." lightbox="./media/overview-architecture/hash-distributed-table.png" :::

-* Each row belongs to one distribution.
-* A deterministic hash algorithm assigns each row to one distribution.
+* Each row belongs to one distribution.
+* A deterministic hash algorithm assigns each row to one distribution.
 * The number of table rows per distribution varies as shown by the different sizes of tables.

 There are performance considerations for the selection of a distribution column, such as distinctness, data skew, and the types of queries that run on the system.
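The hash-distribution pattern described in the hunk above is chosen in the table's `WITH` clause. A minimal sketch with hypothetical table and column names (not part of the commit):

```sql
-- Hash-distribute a large fact table on ProductKey: the deterministic
-- hash of ProductKey assigns each row to one of the 60 distributions.
CREATE TABLE dbo.FactSales
(
    ProductKey   INT            NOT NULL,
    OrderDateKey INT            NOT NULL,
    SalesAmount  DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX
);
```

A distribution column with many distinct, evenly spread values that is frequently joined on limits data skew and data movement.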
@@ -96,17 +98,18 @@ There are performance considerations for the selection of a distribution column,

 A round-robin table is the simplest table to create and delivers fast performance when used as a staging table for loads.

-A round-robin distributed table distributes data evenly across the table but without any further optimization. A distribution is first chosen at random and then buffers of rows are assigned to distributions sequentially. It is quick to load data into a round-robin table, but query performance can often be better with hash distributed tables. Joins on round-robin tables require reshuffling data, which takes additional time.
+A round-robin distributed table distributes data evenly across the table but without any further optimization. A distribution is first chosen at random and then buffers of rows are assigned to distributions sequentially. It's quick to load data into a round-robin table, but query performance can often be better with hash distributed tables. Joins on round-robin tables require reshuffling data, which takes extra time.

 ## Replicated tables
+
 A replicated table provides the fastest query performance for small tables.

-A table that is replicated caches a full copy of the table on each compute node. So, replicating a table removes the need to transfer data among compute nodes before a join or aggregation. Replicated tables are best utilized with small tables. Extra storage is required and there is additional overhead that is incurred when writing data, which make large tables impractical.
+A table that is replicated caches a full copy of the table on each compute node. So, replicating a table removes the need to transfer data among compute nodes before a join or aggregation. Replicated tables are best utilized with small tables. Extra storage is required and there's extra overhead that is incurred when writing data, which make large tables impractical.

-The diagram below shows a replicated table that is cached on the first distribution on each compute node.
+The diagram below shows a replicated table that is cached on the first distribution on each compute node.

-![Replicated table](media/overview-architecture/replicated-table.png "Replicated table")
+:::image type="content" source="./media/overview-architecture/replicated-table.png" alt-text="Screenshot of the replicated table cached on the first distribution on each compute node." lightbox="./media/overview-architecture/replicated-table.png" :::

 ## Next steps

-Now that you know a bit about Synapse SQL, learn how to quickly [create a dedicated SQL pool](../quickstart-create-sql-pool-portal.md) and [load sample data](../sql-data-warehouse/sql-data-warehouse-load-from-azure-blob-storage-with-polybase.md). Or start [using serverless SQL pool](../quickstart-sql-on-demand.md). If you are new to Azure, you may find the [Azure glossary](../../azure-glossary-cloud-terminology.md) helpful as you encounter new terminology.
+Now that you know a bit about Synapse SQL, learn how to quickly [create a dedicated SQL pool](../quickstart-create-sql-pool-portal.md) and [load sample data](../sql-data-warehouse/sql-data-warehouse-load-from-azure-blob-storage-with-polybase.md). Or start [using serverless SQL pool](../quickstart-sql-on-demand.md). If you're new to Azure, you may find the [Azure glossary](../../azure-glossary-cloud-terminology.md) helpful as you encounter new terminology.
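The round-robin and replicated patterns covered earlier in this diff are also selected per table in the `WITH` clause. A sketch with hypothetical table names (illustrative only, not part of the commit):

```sql
-- Round-robin: fastest to load, commonly used for staging tables.
CREATE TABLE dbo.StageSales
(
    ProductKey  INT            NOT NULL,
    SalesAmount DECIMAL(18, 2) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Replicate: cache a full copy of a small dimension table on each
-- Compute node, avoiding data movement before joins.
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT           NOT NULL,
    ProductName NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);
```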
