---
title: Fundamental concepts for scaling - Hyperscale (Citus) - Azure Database for PostgreSQL
description: Ideas you need to know to build relational apps that scale
ms.author: jonels
author: jonels-msft
ms.service: postgresql
ms.subservice: hyperscale-citus
ms.topic: how-to
ms.date: 04/28/2022
---

# Fundamental concepts for scaling

Before we investigate the steps of building a new app, it's helpful to see a
quick overview of the terms and concepts involved.

## Architectural overview

Hyperscale (Citus) gives you the power to distribute tables across multiple
machines in a server group and transparently query them the same way you query
plain PostgreSQL:

In the Hyperscale (Citus) architecture, there are multiple kinds of nodes:

* The **coordinator** node stores distributed table metadata and is responsible
  for distributed planning.
* By contrast, the **worker** nodes store the actual data and do the computation.
* Both the coordinator and workers are plain PostgreSQL databases, with the
  `citus` extension loaded.

To distribute a normal PostgreSQL table, like `campaigns` in the diagram above,
run a command called `create_distributed_table()`. Once you run this
command, Hyperscale (Citus) transparently creates shards for the table across
worker nodes. In the diagram, shards are represented as blue boxes.

> [!NOTE]
>
> On the basic tier, shards of distributed tables are on the coordinator node,
> not worker nodes.

Shards are plain (but specially named) PostgreSQL tables that hold slices of
your data. In our example, because we distributed `campaigns` by `company_id`,
each shard holds a subset of the campaigns, with the campaigns of different
companies assigned to different shards.
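
For instance, distributing the `campaigns` table from the diagram looks roughly
like the sketch below. The column list is an assumption made up for
illustration; only the `create_distributed_table()` call itself comes from this
article.

```postgresql
-- Hypothetical schema for the campaigns example in the diagram.
CREATE TABLE campaigns (
  company_id  bigint NOT NULL,
  campaign_id bigint NOT NULL,
  name        text,
  PRIMARY KEY (company_id, campaign_id)
);

-- Distribute by company_id; Hyperscale (Citus) creates shards for the table
-- across the worker nodes (or on the coordinator, on the basic tier).
SELECT create_distributed_table('campaigns', 'company_id');
```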

## Distribution column (also known as shard key)

`create_distributed_table()` is the magic function that Hyperscale (Citus)
provides to distribute tables and use resources across multiple machines.

```postgresql
SELECT create_distributed_table(
  'table_name',
  'distribution_column');
```

The second argument above picks a column from the table as a **distribution
column**. It can be any column with a native PostgreSQL type (with integer and
text being most common). The value of the distribution column determines which
rows go into which shards, which is why the distribution column is also called
the **shard key**.

Hyperscale (Citus) decides how to run queries based on their use of the shard
key:

| Query involves | Where it runs |
|----------------|---------------|
| just one shard key | on the worker node that holds its shard |
| multiple shard keys | parallelized across multiple nodes |

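For example, with the hypothetical `campaigns` table above, the two routing
behaviors look roughly like this (a sketch, not output from the service):

```postgresql
-- Filters on a single shard key value, so it's routed to the one worker
-- node that holds company 42's shard.
SELECT count(*) FROM campaigns WHERE company_id = 42;

-- Touches many shard key values, so it runs in parallel across the shards
-- on multiple worker nodes.
SELECT company_id, count(*)
FROM campaigns
GROUP BY company_id;
```
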
The choice of shard key dictates the performance and scalability of your
applications.

* Uneven data distribution across shard key values (also known as *data skew*)
  isn't optimal for performance. For example, don't choose a column for which a
  single value represents 50% of the data. (A query for spotting such skew is
  sketched after this list.)
* Shard keys with low cardinality can affect scalability. You can use only as
  many shards as there are distinct key values. Choose a key with cardinality
  in the hundreds to thousands.
* Joining two large tables with different shard keys can be slow. Choose a
  common shard key across large tables. Learn more in
  [colocation](#colocation).

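As a quick, hypothetical skew check, you can count rows per shard key value in
your own tables; the table and column names here are only illustrative:

```postgresql
-- Look for a single company that dominates the table. If one value accounts
-- for a large share of the rows, that column may be a poor shard key.
SELECT company_id, count(*) AS row_count
FROM campaigns
GROUP BY company_id
ORDER BY row_count DESC
LIMIT 10;
```
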
## Colocation

Another concept closely related to shard key is *colocation*. Tables sharded by
the same distribution column values are colocated, meaning the shards of
colocated tables are stored together on the same workers.

Below are two tables sharded by the same key, `site_id`. They're colocated.

Hyperscale (Citus) ensures that rows with a matching `site_id` value in both
tables are stored on the same worker node. You can see that, for both tables,
rows with `site_id=1` are stored on worker 1. Similarly for other site IDs.

Colocation helps optimize JOINs across these tables. If you join the two tables
on `site_id`, Hyperscale (Citus) can perform the join locally on worker nodes
without shuffling data between nodes.

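A minimal sketch of colocated tables and a colocated join, assuming two
hypothetical tables distributed by `site_id` and the default colocation
settings:

```postgresql
-- Hypothetical tables, both distributed by site_id.
CREATE TABLE sites (
  site_id bigint PRIMARY KEY,
  url     text
);

CREATE TABLE page_views (
  site_id   bigint NOT NULL,
  view_id   bigint NOT NULL,
  viewed_at timestamptz,
  PRIMARY KEY (site_id, view_id)
);

SELECT create_distributed_table('sites', 'site_id');
SELECT create_distributed_table('page_views', 'site_id');

-- Because the tables share a shard key and are colocated, each worker can
-- join its local shards without moving data between nodes.
SELECT s.url, count(*) AS views
FROM sites s
JOIN page_views v USING (site_id)
GROUP BY s.url;
```
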
## Next steps

> [!div class="nextstepaction"]
> [Classify application workload >](howto-build-scalable-apps-classify.md)