By default, tables are Round Robin distributed. This design makes it easy for users to get started creating tables without having to decide how their tables should be distributed. Round Robin tables may perform sufficiently for some workloads, but in most cases selecting a distribution column will perform much better. The most common example of when a table distributed by a column will far outperform a Round Robin table is when two large fact tables are joined. For example, if you have an orders table, which is distributed by order_id, and a transactions table, which is also distributed by order_id, when you join your orders table to your transactions table on order_id, this query becomes a pass-through query, which means we eliminate data movement operations. Fewer steps mean a faster query. Less data movement also makes for faster queries. This explanation only scratch the surface. When loading a distributed table, be sure that your incoming data is not sorted on the distribution key as this will slow down your loads. See the below links for many more details on how selecting a distribution column can improve performance as well as how to define a distributed table in the WITH clause of your CREATE TABLES statement.
0 commit comments