Commit 74801c2

vale fixes, script update, csv for progress
1 parent e6c2282 commit 74801c2

3 files changed: +150 -109 lines changed

docs/cloud/onboard/02_migrate/01_migration_guides/04_snowflake/01_overview.md

Lines changed: 101 additions & 101 deletions
@@ -17,11 +17,11 @@ import Image from '@theme/IdealImage';
1717
> This document provides an introduction to migrating data from Snowflake to ClickHouse.
1818
1919
Snowflake is a cloud data warehouse primarily focused on migrating legacy on-premise
20-
data warehousing workloads to the cloud. It is well-optimized for executing
20+
data warehousing workloads to the cloud. It's well-optimized for executing
2121
long-running reports at scale. As datasets migrate to the cloud, data owners start
22-
thinking about how else they can extract value from this data, including using
22+
thinking about how else they can extract value from this data, including using
2323
these datasets to power real-time applications for internal and external use cases.
24-
When this happens, they often realize they need a database optimized for
24+
When this happens, they often realize they need a database optimized for
2525
powering real-time analytics, like ClickHouse.
2626

2727
## Comparison {#comparison}
@@ -30,22 +30,22 @@ In this section, we'll compare the key features of ClickHouse and Snowflake.
3030

3131
### Similarities {#similarities}
3232

33-
Snowflake is a cloud-based data warehousing platform that provides a scalable
34-
and efficient solution for storing, processing, and analyzing large amounts of
35-
data.
36-
Like ClickHouse, Snowflake is not built on existing technologies but relies
33+
Snowflake is a cloud-based data warehousing platform that provides a scalable
34+
and efficient solution for storing, processing, and analyzing large amounts of
35+
data.
36+
Like ClickHouse, Snowflake isn't built on existing technologies but relies
3737
on its own SQL query engine and custom architecture.
3838

3939
Snowflake’s architecture is described as a hybrid between a shared-storage (shared-disk)
4040
architecture and a shared-nothing architecture. A shared-storage architecture is
41-
one where data is both accessible from all compute nodes using object
42-
stores such as S3. A shared-nothing architecture is one where each compute node
43-
stores a portion of the entire data set locally to respond to queries. This, in
44-
theory, delivers the best of both models: the simplicity of a shared-disk
41+
one where data is both accessible from all compute nodes using object
42+
stores such as S3. A shared-nothing architecture is one where each compute node
43+
stores a portion of the entire data set locally to respond to queries. This, in
44+
theory, delivers the best of both models: the simplicity of a shared-disk
4545
architecture and the scalability of a shared-nothing architecture.
4646

4747
This design fundamentally relies on object storage as the primary storage medium,
48-
which scales almost infinitely under concurrent access while providing high
48+
which scales almost infinitely under concurrent access while providing high
4949
resilience and scalable throughput guarantees.
5050

5151
The image below from [docs.snowflake.com](https://docs.snowflake.com/en/user-guide/intro-key-concepts)
@@ -54,20 +54,20 @@ shows this architecture:
5454
<Image img={snowflake_architecture} size="md" alt="Snowflake architecture" />
5555

5656
Conversely, as an open-source and cloud-hosted product, ClickHouse can be deployed
57-
in both shared-disk and shared-nothing architectures. The latter is typical for
58-
self-managed deployments. While allowing for CPU and memory to be easily scaled,
59-
shared-nothing configurations introduce classic data management challenges and
57+
in both shared-disk and shared-nothing architectures. The latter is typical for
58+
self-managed deployments. While allowing for CPU and memory to be easily scaled,
59+
shared-nothing configurations introduce classic data management challenges and
6060
overhead of data replication, especially during membership changes.
6161

62-
For this reason, ClickHouse Cloud utilizes a shared-storage architecture that is
63-
conceptually similar to Snowflake. Data is stored once in an object store
64-
(single copy), such as S3 or GCS, providing virtually infinite storage with
65-
strong redundancy guarantees. Each node has access to this single copy of the
66-
data as well as its own local SSDs for cache purposes. Nodes can, in turn, be
62+
For this reason, ClickHouse Cloud utilizes a shared-storage architecture that's
63+
conceptually similar to Snowflake. Data is stored once in an object store
64+
(single copy), such as S3 or GCS, providing virtually infinite storage with
65+
strong redundancy guarantees. Each node has access to this single copy of the
66+
data and its own local SSDs for cache purposes. Nodes can, in turn, be
6767
scaled to provide additional CPU and memory resources as required. Like Snowflake,
68-
S3’s scalability properties address the classic limitation of shared-disk
69-
architectures (disk I/O and network bottlenecks) by ensuring the I/O throughput
70-
available to current nodes in a cluster is not impacted as additional nodes are
68+
S3’s scalability properties address the classic limitation of shared-disk
69+
architectures (disk I/O and network bottlenecks) by ensuring the I/O throughput
70+
available to current nodes in a cluster isn't impacted as additional nodes are
7171
added.
7272

7373
<Image img={cloud_architecture} size="md" alt="ClickHouse Cloud architecture" />
@@ -79,107 +79,107 @@ differ in a few subtle ways:
7979

8080
* Compute resources in Snowflake are provided through a concept of [warehouses](https://docs.snowflake.com/en/user-guide/warehouses).
8181
These consist of a number of nodes, each of a set size. While Snowflake
82-
doesn't publish the specific architecture of their warehouses, it is
83-
[generally understood](https://select.dev/posts/snowflake-warehouse-sizing)
84-
that each node consists of 8 vCPUs, 16GiB, and 200GB of local storage (for cache).
85-
The number of nodes depends on a t-shirt size, e.g. an x-small has one node,
82+
doesn't publish the specific architecture of their warehouses, it's
83+
[generally understood](https://select.dev/posts/snowflake-warehouse-sizing)
84+
that each node consists of 8 vCPUs, 16 GiB, and 200 GB of local storage (for cache).
85+
The number of nodes depends on a t-shirt size, e.g. an x-small has one node,
8686
a small 2, medium 4, large 8, etc. These warehouses are independent of the data
87-
and can be used to query any database residing on object storage. When idle
88-
and not subjected to query load, warehouses are paused - resuming when a query
89-
is received. While storage costs are always reflected in billing, warehouses
87+
and can be used to query any database residing on object storage. When idle
88+
and not subjected to query load, warehouses are paused - resuming when a query
89+
is received. While storage costs are always reflected in billing, warehouses
9090
are only charged when active.
9191

92-
* ClickHouse Cloud utilizes a similar principle of nodes with local cache
93-
storage. Rather than t-shirt sizes, users deploy a service with a total
94-
amount of compute and available RAM. This, in turn, transparently
95-
auto-scales (within defined limits) based on the query load - either
96-
vertically by increasing (or decreasing) the resources for each node or
97-
horizontally by raising/lowering the total number of nodes. ClickHouse
98-
Cloud nodes currently have a 1 CPU-to-memory ratio, unlike Snowflake's 1.
99-
While a looser coupling is possible, services are currently coupled to the
100-
data, unlike Snowflake warehouses. Nodes will also pause if idle and
101-
resume if subjected to queries. Users can also manually resize services if
92+
* ClickHouse Cloud utilizes a similar principle of nodes with local cache
93+
storage. Rather than t-shirt sizes, users deploy a service with a total
94+
amount of compute and available RAM. This, in turn, transparently
95+
auto-scales (within defined limits) based on the query load - either
96+
vertically by increasing (or decreasing) the resources for each node or
97+
horizontally by raising/lowering the total number of nodes. ClickHouse
98+
Cloud nodes have a 1 CPU-to-memory ratio, unlike Snowflake's 1.
99+
While a looser coupling is possible, services are coupled to the
100+
data, unlike Snowflake warehouses. Nodes will also pause if idle and
101+
resume if subjected to queries. Users can also manually resize services if
102102
needed.
103103

104-
* ClickHouse Cloud's query cache is currently node specific, unlike
105-
Snowflake's, which is delivered at a service layer independent of the
106-
warehouse. Based on benchmarks, ClickHouse Cloud's node cache outperforms
104+
* ClickHouse Cloud's query cache is node specific, unlike
105+
Snowflake's, which is delivered at a service layer independent of the
106+
warehouse. Based on benchmarks, ClickHouse Cloud's node cache outperforms
107107
Snowflake's.
108108

109-
* Snowflake and ClickHouse Cloud take different approaches to scaling to
110-
increase query concurrency. Snowflake addresses this through a feature
109+
* Snowflake and ClickHouse Cloud take different approaches to scaling to
110+
increase query concurrency. Snowflake addresses this through a feature
111111
known as [multi-cluster warehouses](https://docs.snowflake.com/en/user-guide/warehouses-multicluster#benefits-of-multi-cluster-warehouses).
112-
This feature allows users to add clusters to a warehouse. While this offers no
113-
improvement to query latency, it does provide additional parallelization and
114-
allows higher query concurrency. ClickHouse achieves this by adding more memory
115-
and CPU to a service through vertical or horizontal scaling. We do not explore the
116-
capabilities of these services to scale to higher concurrency in this blog,
117-
focusing instead on latency, but acknowledge that this work should be done
118-
for a complete comparison. However, we would expect ClickHouse to perform
119-
well in any concurrency test, with Snowflake explicitly limiting the number
112+
This feature allows users to add clusters to a warehouse. While this offers no
113+
improvement to query latency, it does provide additional parallelization and
114+
allows higher query concurrency. ClickHouse achieves this by adding more memory
115+
and CPU to a service through vertical or horizontal scaling. We don't explore the
116+
capabilities of these services to scale to higher concurrency in this blog,
117+
focusing instead on latency, but acknowledge that this work should be done
118+
for a complete comparison. However, we would expect ClickHouse to perform
119+
well in any concurrency test, with Snowflake explicitly limiting the number
120120
of concurrent queries allowed for a [warehouse to 8 by default](https://docs.snowflake.com/en/sql-reference/parameters#max-concurrency-level).
121-
In comparison, ClickHouse Cloud allows up to 1000 queries to be executed per
121+
In comparison, ClickHouse Cloud allows up to 1000 queries to be executed per
122122
node.
123123

124-
* Snowflake's ability to switch compute size on a dataset, coupled with fast
125-
resume times for warehouses, makes it an excellent experience for ad hoc
126-
querying. For data warehouse and data lake use cases, this provides an
124+
* Snowflake's ability to switch compute size on a dataset, coupled with fast
125+
resume times for warehouses, makes it an excellent experience for ad hoc
126+
querying. For data warehouse and data lake use cases, this provides an
127127
advantage over other systems.
128128

129129
### Real-time analytics {#real-time-analytics}
130130

131131
Based on public [benchmark](https://benchmark.clickhouse.com/#system=+%E2%98%81w|%EF%B8%8Fr|C%20c|nfe&type=-&machine=-ca2|gl|6ax|6ale|3al&cluster_size=-&opensource=-&tuned=+n&metric=hot&queries=-) data,
132132
ClickHouse outperforms Snowflake for real-time analytics applications in the following areas:
133133

134-
* **Query latency**: Snowflake queries have a higher query latency even
135-
when clustering is applied to tables to optimize performance. In our
136-
testing, Snowflake requires over twice the compute to achieve equivalent
137-
ClickHouse performance on queries where a filter is applied that is part
138-
of the Snowflake clustering key or ClickHouse primary key. While
139-
Snowflake's [persistent query cache](https://docs.snowflake.com/en/user-guide/querying-persisted-results)
140-
offsets some of these latency challenges, this is ineffective in cases
141-
where the filter criteria are more diverse. This query cache effectiveness
142-
can be further impacted by changes to the underlying data, with cache
143-
entries invalidated when the table changes. While this is not the case in
144-
the benchmark for our application, a real deployment would require the new,
145-
more recent data to be inserted. Note that ClickHouse's query cache is
146-
node specific and not [transactionally consistent](https://clickhouse.com/blog/introduction-to-the-clickhouse-query-cache-and-design),
134+
* **Query latency**: Snowflake queries have a higher query latency even
135+
when clustering is applied to tables to optimize performance. In our
136+
testing, Snowflake requires over twice the compute to achieve equivalent
137+
ClickHouse performance on queries where a filter is applied that's part
138+
of the Snowflake clustering key or ClickHouse primary key. While
139+
Snowflake's [persistent query cache](https://docs.snowflake.com/en/user-guide/querying-persisted-results)
140+
offsets some of these latency challenges, this is ineffective in cases
141+
where the filter criteria are more diverse. This query cache effectiveness
142+
can be further impacted by changes to the underlying data, with cache
143+
entries invalidated when the table changes. While this isn't the case in
144+
the benchmark for our application, a real deployment would require the new,
145+
more recent data to be inserted. Note that ClickHouse's query cache is
146+
node specific and not [transactionally consistent](https://clickhouse.com/blog/introduction-to-the-clickhouse-query-cache-and-design),
147147
making it [better suited ](https://clickhouse.com/blog/introduction-to-the-clickhouse-query-cache-and-design)
148-
to real-time analytics. Users also have granular control over its use
149-
with the ability to control its use on a [per-query basis](/operations/settings/settings#use_query_cache),
150-
its [precise size](/operations/settings/settings#query_cache_max_size_in_bytes),
151-
whether a [query is cached](/operations/settings/settings#enable_writes_to_query_cache)
152-
(limits on duration or required number of executions), and whether it is
148+
to real-time analytics. Users also have granular control over its use
149+
with the ability to control its use on a [per-query basis](/operations/settings/settings#use_query_cache),
150+
its [precise size](/operations/settings/settings#query_cache_max_size_in_bytes),
151+
whether a [query is cached](/operations/settings/settings#enable_writes_to_query_cache)
152+
(limits on duration or required number of executions), and whether it's
153153
only [passively used](https://clickhouse.com/blog/introduction-to-the-clickhouse-query-cache-and-design#using-logs-and-settings).
154154

155-
* **Lower cost**: Snowflake warehouses can be configured to suspend after
156-
a period of query inactivity. Once suspended, charges are not incurred.
157-
Practically, this inactivity check can [only be lowered to 60s](https://docs.snowflake.com/en/sql-reference/sql/alter-warehouse).
158-
Warehouses will automatically resume, within several seconds, once a query
159-
is received. With Snowflake only charging for resources when a warehouse
160-
is under use, this behavior caters to workloads that often sit idle, like
155+
* **Lower cost**: Snowflake warehouses can be configured to suspend after
156+
a period of query inactivity. Once suspended, charges aren't incurred.
157+
Practically, this inactivity check can [only be lowered to 60 s](https://docs.snowflake.com/en/sql-reference/sql/alter-warehouse).
158+
Warehouses will automatically resume, within several seconds, once a query
159+
is received. With Snowflake only charging for resources when a warehouse
160+
is under use, this behavior caters to workloads that often sit idle, like
161161
ad-hoc querying.
162162

163-
However, many real-time analytics workloads require ongoing real-time data
164-
ingestion and frequent querying that doesn't benefit from idling (like
165-
customer-facing dashboards). This means warehouses must often be fully
166-
active and incurring charges. This negates the cost-benefit of idling as
167-
well as any performance advantage that may be associated with Snowflake's
168-
ability to resume a responsive state faster than alternatives. This active
169-
state requirement, when combined with ClickHouse Cloud's lower per-second
170-
cost for an active state, results in ClickHouse Cloud offering a
163+
However, many real-time analytics workloads require ongoing real-time data
164+
ingestion and frequent querying that doesn't benefit from idling (like
165+
customer-facing dashboards). This means warehouses must often be fully
166+
active and incurring charges. This negates the cost-benefit of idling as
167+
well as any performance advantage that may be associated with Snowflake's
168+
ability to resume a responsive state faster than alternatives. This active
169+
state requirement, when combined with ClickHouse Cloud's lower per-second
170+
cost for an active state, results in ClickHouse Cloud offering a
171171
significantly lower total cost for these kinds of workloads.
172172

173-
* **Predictable pricing of features:** Features such as materialized views
174-
and clustering (equivalent to ClickHouse's ORDER BY) are required to reach
175-
the highest levels of performance in real-time analytics use cases. These
176-
features incur additional charges in Snowflake, requiring not only a
177-
higher tier, which increases costs per credit by 1.5x, but also
178-
unpredictable background costs. For instance, materialized views incur a
179-
background maintenance cost, as does clustering, which is hard to predict
180-
prior to use. In contrast, these features incur no additional cost in
181-
ClickHouse Cloud, except additional CPU and memory usage at insert time,
182-
typically negligible outside of high insert workload use cases. We have
183-
observed in our benchmark that these differences, along with lower query
184-
latencies and higher compression, result in significantly lower costs with
173+
* **Predictable pricing of features:** Features such as materialized views
174+
and clustering (equivalent to ClickHouse's `ORDER BY`) are required to reach
175+
the highest levels of performance in real-time analytics use cases. These
176+
features incur additional charges in Snowflake, requiring not only a
177+
higher tier, which increases costs per credit by 1.5x, but also
178+
unpredictable background costs. For instance, materialized views incur a
179+
background maintenance cost, as does clustering, which is hard to predict
180+
prior to use. In contrast, these features incur no additional cost in
181+
ClickHouse Cloud, except additional CPU and memory usage at insert time,
182+
typically negligible outside of high insert workload use cases. We've
183+
observed in our benchmark that these differences, along with lower query
184+
latencies and higher compression, result in significantly lower costs with
185185
ClickHouse.
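
The warehouse sizing described in the hunk above (one node for an x-small, doubling with each t-shirt size, and roughly 8 vCPUs, 16 GiB, and 200 GB of cache per node) can be sketched as a small calculation. This is only an illustration: the per-node figures are the "generally understood" values quoted in the text, not numbers published by Snowflake, and the names `NODES_PER_SIZE` and `warehouse_resources` are hypothetical helpers, not part of any Snowflake or ClickHouse API.

```python
# Node counts per warehouse size, as listed in the text (x-small through large).
NODES_PER_SIZE = {"x-small": 1, "small": 2, "medium": 4, "large": 8}

# Per-node figures the text calls "generally understood"; treat them as assumptions.
VCPUS_PER_NODE = 8
MEMORY_GIB_PER_NODE = 16
CACHE_GB_PER_NODE = 200


def warehouse_resources(size: str) -> dict:
    """Return the approximate total resources for a warehouse of the given size."""
    nodes = NODES_PER_SIZE[size]
    return {
        "nodes": nodes,
        "vcpus": nodes * VCPUS_PER_NODE,
        "memory_gib": nodes * MEMORY_GIB_PER_NODE,
        "cache_gb": nodes * CACHE_GB_PER_NODE,
    }


if __name__ == "__main__":
    for size in NODES_PER_SIZE:
        print(size, warehouse_resources(size))
```

Under these assumed figures, each Snowflake node carries 2 GiB of memory per vCPU (16 GiB / 8 vCPUs), which is the basis for the CPU-to-memory comparison in the text.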
