Commit 81dd505

Merge pull request #3780 from Blargian/add_example_dataset

Add Foursquare dataset + general improvements

2 parents 0febf9c + d71f0f7, commit 81dd505

36 files changed: +538 -1366 lines

.gitignore

Lines changed: 6 additions & 2 deletions
```diff
@@ -50,6 +50,10 @@ docs/cloud/manage/api/prometheus-api-reference.md
 docs/cloud/manage/api/usageCost-api-reference.md
 docs/whats-new/changelog/index.md
 docs/about-us/beta-and-experimental-features.md
+static/knowledgebase_toc.json
+.floating-pages-validation-failed
+.frontmatter-validation-failed
+logs/
 
 .vscode
 .aspell.en.prepl
@@ -59,8 +63,8 @@ docs/about-us/beta-and-experimental-features.md
 **.translate
 /ClickHouse/
 
-
 # Ignore table of contents files
 docs/cloud/reference/release-notes-index.md
 docs/whats-new/changelog/index.md
-docs/cloud/manage/api/api-reference-index.md
+docs/cloud/manage/api/api-reference-index.md
+docs/getting-started/index.md
```

docs/about-us/beta-and-experimental-features.md

Lines changed: 7 additions & 0 deletions
```diff
@@ -38,3 +38,10 @@ Note: please be sure to be using a current version of the ClickHouse [compatibil
 - Cannot be enabled in the cloud
 
 Please note: no additional experimental features are allowed to be enabled in ClickHouse Cloud other than those listed above as Beta.
+
+<!-- The inner content of the tags below are replaced at build time with a table generated from source
+Please do not modify or remove the tags
+-->
+
+<!--AUTOGENERATED_START-->
+<!--AUTOGENERATED_END-->
```

docs/getting-started/example-datasets/cell-towers.md

Lines changed: 11 additions & 11 deletions
```diff
@@ -17,17 +17,17 @@ import ActionsMenu from '@site/docs/_snippets/_service_actions_menu.md';
 import SQLConsoleDetail from '@site/docs/_snippets/_launch_sql_console.md';
 import SupersetDocker from '@site/docs/_snippets/_add_superset_detail.md';
 import cloud_load_data_sample from '@site/static/images/_snippets/cloud-load-data-sample.png';
-import cell_towers_1 from '@site/docs/getting-started/example-datasets/images/superset-cell-tower-dashboard.png'
-import add_a_database from '@site/docs/getting-started/example-datasets/images/superset-add.png'
-import choose_clickhouse_connect from '@site/docs/getting-started/example-datasets/images/superset-choose-a-database.png'
-import add_clickhouse_as_superset_datasource from '@site/docs/getting-started/example-datasets/images/superset-connect-a-database.png'
-import add_cell_towers_table_as_dataset from '@site/docs/getting-started/example-datasets/images/superset-add-dataset.png'
-import create_a_map_in_superset from '@site/docs/getting-started/example-datasets/images/superset-create-map.png'
-import specify_long_and_lat from '@site/docs/getting-started/example-datasets/images/superset-lon-lat.png'
-import superset_mcc_2024 from '@site/docs/getting-started/example-datasets/images/superset-mcc-204.png'
-import superset_radio_umts from '@site/docs/getting-started/example-datasets/images/superset-radio-umts.png'
-import superset_umts_netherlands from '@site/docs/getting-started/example-datasets/images/superset-umts-netherlands.png'
-import superset_cell_tower_dashboard from '@site/docs/getting-started/example-datasets/images/superset-cell-tower-dashboard.png'
+import cell_towers_1 from '@site/static/images/getting-started/example-datasets/superset-cell-tower-dashboard.png'
+import add_a_database from '@site/static/images/getting-started/example-datasets/superset-add.png'
+import choose_clickhouse_connect from '@site/static/images/getting-started/example-datasets/superset-choose-a-database.png'
+import add_clickhouse_as_superset_datasource from '@site/static/images/getting-started/example-datasets/superset-connect-a-database.png'
+import add_cell_towers_table_as_dataset from '@site/static/images/getting-started/example-datasets/superset-add-dataset.png'
+import create_a_map_in_superset from '@site/static/images/getting-started/example-datasets/superset-create-map.png'
+import specify_long_and_lat from '@site/static/images/getting-started/example-datasets/superset-lon-lat.png'
+import superset_mcc_2024 from '@site/static/images/getting-started/example-datasets/superset-mcc-204.png'
+import superset_radio_umts from '@site/static/images/getting-started/example-datasets/superset-radio-umts.png'
+import superset_umts_netherlands from '@site/static/images/getting-started/example-datasets/superset-umts-netherlands.png'
+import superset_cell_tower_dashboard from '@site/static/images/getting-started/example-datasets/superset-cell-tower-dashboard.png'
 
 ## Goal {#goal}
 
```

docs/getting-started/example-datasets/environmental-sensors.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -7,8 +7,8 @@ title: 'Environmental Sensors Data'
 ---
 
 import Image from '@theme/IdealImage';
-import no_events_per_day from './images/sensors_01.png';
-import sensors_02 from './images/sensors_02.png';
+import no_events_per_day from '@site/static/images/getting-started/example-datasets/sensors_01.png';
+import sensors_02 from '@site/static/images/getting-started/example-datasets/sensors_02.png';
 
 [Sensor.Community](https://sensor.community/en/) is a contributors-driven global sensor network that creates Open Environmental Data. The data is collected from sensors all over the globe. Anyone can purchase a sensor and place it wherever they like. The APIs to download the data is in [GitHub](https://github.com/opendata-stuttgart/meta/wiki/APIs) and the data is freely available under the [Database Contents License (DbCL)](https://opendatacommons.org/licenses/dbcl/1-0/).
 
```

docs/getting-started/example-datasets/foursquare-places.md (new file)

Lines changed: 278 additions & 0 deletions
---
description: 'Dataset with over 100 million records containing information about places on a map, such as shops,
  restaurants, parks, playgrounds, and monuments.'
sidebar_label: 'Foursquare places'
slug: /getting-started/example-datasets/foursquare-places
title: 'Foursquare places'
keywords: ['visualizing']
---

import Image from '@theme/IdealImage';
import visualization_1 from '@site/static/images/getting-started/example-datasets/visualization_1.png';
import visualization_2 from '@site/static/images/getting-started/example-datasets/visualization_2.png';
import visualization_3 from '@site/static/images/getting-started/example-datasets/visualization_3.png';
import visualization_4 from '@site/static/images/getting-started/example-datasets/visualization_4.png';

## Dataset {#dataset}

This dataset by Foursquare is available to [download](https://docs.foursquare.com/data-products/docs/access-fsq-os-places)
and to use for free under the Apache 2.0 license.

It contains over 100 million records of commercial points-of-interest (POI),
such as shops, restaurants, parks, playgrounds, and monuments. It also includes
additional metadata about those places, such as categories and social media
information.

## Data exploration {#data-exploration}

To explore the data we'll use [`clickhouse-local`](https://clickhouse.com/blog/extracting-converting-querying-local-files-with-sql-clickhouse-local), a small command-line tool
that provides the full ClickHouse engine, although you could also use
ClickHouse Cloud, `clickhouse-client` or even `chDB`.

Run the following query to select the data from the S3 bucket where it is stored:

```sql title="Query"
SELECT * FROM s3('s3://fsq-os-places-us-east-1/release/dt=2025-04-08/places/parquet/*') LIMIT 1
```

```response title="Response"
Row 1:
──────
fsq_place_id: 4e1ef76cae60cd553dec233f
name: @VirginAmerica In-flight Via @Gogo
latitude: 37.62120111687914
longitude: -122.39003793803701
address: ᴺᵁᴸᴸ
locality: ᴺᵁᴸᴸ
region: ᴺᵁᴸᴸ
postcode: ᴺᵁᴸᴸ
admin_region: ᴺᵁᴸᴸ
post_town: ᴺᵁᴸᴸ
po_box: ᴺᵁᴸᴸ
country: US
date_created: 2011-07-14
date_refreshed: 2018-07-05
date_closed: 2018-07-05
tel: ᴺᵁᴸᴸ
website: ᴺᵁᴸᴸ
email: ᴺᵁᴸᴸ
facebook_id: ᴺᵁᴸᴸ
instagram: ᴺᵁᴸᴸ
twitter: ᴺᵁᴸᴸ
fsq_category_ids: ['4bf58dd8d48988d1f7931735']
fsq_category_labels: ['Travel and Transportation > Transport Hub > Airport > Plane']
placemaker_url: https://foursquare.com/placemakers/review-place/4e1ef76cae60cd553dec233f
geom: �^��a�^@B�
bbox: (-122.39003793803701,37.62120111687914,-122.39003793803701,37.62120111687914)
```

We see that quite a few fields have `ᴺᵁᴸᴸ`, so we can add some additional conditions
to our query to get back more usable data:

```sql title="Query"
SELECT * FROM s3('s3://fsq-os-places-us-east-1/release/dt=2025-04-08/places/parquet/*')
WHERE address IS NOT NULL AND postcode IS NOT NULL AND instagram IS NOT NULL LIMIT 1
```

```response
Row 1:
──────
fsq_place_id: 59b2c754b54618784f259654
name: Villa 722
latitude: ᴺᵁᴸᴸ
longitude: ᴺᵁᴸᴸ
address: Gijzenveldstraat 75
locality: Zutendaal
region: Limburg
postcode: 3690
admin_region: ᴺᵁᴸᴸ
post_town: ᴺᵁᴸᴸ
po_box: ᴺᵁᴸᴸ
country: ᴺᵁᴸᴸ
date_created: 2017-09-08
date_refreshed: 2020-01-25
date_closed: ᴺᵁᴸᴸ
tel: ᴺᵁᴸᴸ
website: https://www.landal.be
email: ᴺᵁᴸᴸ
facebook_id: 522698844570949 -- 522.70 trillion
instagram: landalmooizutendaal
twitter: landalzdl
fsq_category_ids: ['56aa371be4b08b9a8d5734e1']
fsq_category_labels: ['Travel and Transportation > Lodging > Vacation Rental']
placemaker_url: https://foursquare.com/placemakers/review-place/59b2c754b54618784f259654
geom: ᴺᵁᴸᴸ
bbox: (NULL,NULL,NULL,NULL)
```

Run the following query to view the automatically inferred schema of the data using
`DESCRIBE`:

```sql title="Query"
DESCRIBE s3('s3://fsq-os-places-us-east-1/release/dt=2025-04-08/places/parquet/*')
```

```response title="Response"
    ┌─name────────────────┬─type────────────────────────┐
 1. │ fsq_place_id        │ Nullable(String)            │
 2. │ name                │ Nullable(String)            │
 3. │ latitude            │ Nullable(Float64)           │
 4. │ longitude           │ Nullable(Float64)           │
 5. │ address             │ Nullable(String)            │
 6. │ locality            │ Nullable(String)            │
 7. │ region              │ Nullable(String)            │
 8. │ postcode            │ Nullable(String)            │
 9. │ admin_region        │ Nullable(String)            │
10. │ post_town           │ Nullable(String)            │
11. │ po_box              │ Nullable(String)            │
12. │ country             │ Nullable(String)            │
13. │ date_created        │ Nullable(String)            │
14. │ date_refreshed      │ Nullable(String)            │
15. │ date_closed         │ Nullable(String)            │
16. │ tel                 │ Nullable(String)            │
17. │ website             │ Nullable(String)            │
18. │ email               │ Nullable(String)            │
19. │ facebook_id         │ Nullable(Int64)             │
20. │ instagram           │ Nullable(String)            │
21. │ twitter             │ Nullable(String)            │
22. │ fsq_category_ids    │ Array(Nullable(String))     │
23. │ fsq_category_labels │ Array(Nullable(String))     │
24. │ placemaker_url      │ Nullable(String)            │
25. │ geom                │ Nullable(String)            │
26. │ bbox                │ Tuple(                     ↴│
    │                     │↳    xmin Nullable(Float64),↴│
    │                     │↳    ymin Nullable(Float64),↴│
    │                     │↳    xmax Nullable(Float64),↴│
    │                     │↳    ymax Nullable(Float64)) │
    └─────────────────────┴─────────────────────────────┘
```

## Loading the data into ClickHouse {#loading-the-data}

If you'd like to persist the data on disk, you can use `clickhouse-server`
or ClickHouse Cloud.

To create the table, run the following command:

```sql title="Query"
CREATE TABLE foursquare_mercator
(
    fsq_place_id String,
    name String,
    latitude Float64,
    longitude Float64,
    address String,
    locality String,
    region LowCardinality(String),
    postcode LowCardinality(String),
    admin_region LowCardinality(String),
    post_town LowCardinality(String),
    po_box LowCardinality(String),
    country LowCardinality(String),
    date_created Nullable(Date),
    date_refreshed Nullable(Date),
    date_closed Nullable(Date),
    tel String,
    website String,
    email String,
    facebook_id String,
    instagram String,
    twitter String,
    fsq_category_ids Array(String),
    fsq_category_labels Array(String),
    placemaker_url String,
    geom String,
    bbox Tuple(
        xmin Nullable(Float64),
        ymin Nullable(Float64),
        xmax Nullable(Float64),
        ymax Nullable(Float64)
    ),
    category LowCardinality(String) ALIAS fsq_category_labels[1],
    mercator_x UInt32 MATERIALIZED 0xFFFFFFFF * ((longitude + 180) / 360),
    mercator_y UInt32 MATERIALIZED 0xFFFFFFFF * ((1 / 2) - ((log(tan(((latitude + 90) / 360) * pi())) / 2) / pi())),
    INDEX idx_x mercator_x TYPE minmax,
    INDEX idx_y mercator_y TYPE minmax
)
ORDER BY mortonEncode(mercator_x, mercator_y)
```

Take note of the use of the [`LowCardinality`](/sql-reference/data-types/lowcardinality)
data type for several columns, which changes the internal representation of those
columns to be dictionary-encoded. Operating with dictionary-encoded data significantly
increases the performance of `SELECT` queries for many applications.
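
For example, once the data has been inserted (see further below), an aggregation
over the dictionary-encoded `country` column can work largely on the small
dictionary codes rather than on the full strings:

```sql
-- Illustrative sketch: count places per country. `country` is declared as
-- LowCardinality(String), so grouping can work on its dictionary codes
-- rather than comparing raw strings for every row.
SELECT
    country,
    count() AS places
FROM foursquare_mercator
GROUP BY country
ORDER BY places DESC
LIMIT 10
```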

Additionally, two `UInt32` `MATERIALIZED` columns, `mercator_x` and `mercator_y`, are created
that map the lat/lon coordinates to the [Web Mercator projection](https://en.wikipedia.org/wiki/Web_Mercator_projection)
for easier segmentation of the map into tiles:

```sql
mercator_x UInt32 MATERIALIZED 0xFFFFFFFF * ((longitude + 180) / 360),
mercator_y UInt32 MATERIALIZED 0xFFFFFFFF * ((1 / 2) - ((log(tan(((latitude + 90) / 360) * pi())) / 2) / pi())),
```

Let's break down what is happening above for each column.

**mercator_x**

This column converts a longitude value into an X coordinate in the Mercator projection:

- `longitude + 180` shifts the longitude range from [-180, 180] to [0, 360]
- Dividing by 360 normalizes this to a value between 0 and 1
- Multiplying by `0xFFFFFFFF` (hex for maximum 32-bit unsigned integer) scales this normalized value to the full range of a 32-bit integer
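
As a quick sanity check of the formula, you can evaluate the expression directly;
the longitude below is an arbitrary example value (roughly Amsterdam), not taken
from the dataset:

```sql
-- Worked example of the mercator_x expression for longitude 4.9041.
-- 0xFFFFFFFF is the maximum value of a 32-bit unsigned integer.
SELECT toUInt32(0xFFFFFFFF * ((4.9041 + 180) / 360)) AS mercator_x
```

Because the longitude is only slightly east of the prime meridian, the result
lands just past the midpoint of the `UInt32` range.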

**mercator_y**

This column converts a latitude value into a Y coordinate in the Mercator projection:

- `latitude + 90` shifts latitude from [-90, 90] to [0, 180]
- Dividing by 360 and multiplying by `pi()` converts to radians for the trigonometric functions
- The `log(tan(...))` part is the core of the Mercator projection formula
- Multiplying by `0xFFFFFFFF` scales to the full 32-bit integer range
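
The same spot check works for the latitude expression, again with an arbitrary
example latitude:

```sql
-- Worked example of the mercator_y expression for latitude 52.3676.
-- In ClickHouse, log() is the natural logarithm.
SELECT toUInt32(0xFFFFFFFF * ((1 / 2) - ((log(tan(((52.3676 + 90) / 360) * pi())) / 2) / pi()))) AS mercator_y
```

For a northern latitude the result falls below the midpoint of the `UInt32` range,
because this Y axis grows towards the south.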

Specifying `MATERIALIZED` makes sure that ClickHouse calculates the values for these
columns when we `INSERT` the data, without having to specify these columns (which are not
part of the original data schema) in the `INSERT` statement.

The table is ordered by `mortonEncode(mercator_x, mercator_y)`, which produces a
Z-order space-filling curve of `mercator_x`, `mercator_y` in order to significantly
improve geospatial query performance. This Z-order curve ordering ensures data is
physically organized by spatial proximity:

```sql
ORDER BY mortonEncode(mercator_x, mercator_y)
```

Two `minmax` indices are also created for faster search:

```sql
INDEX idx_x mercator_x TYPE minmax,
INDEX idx_y mercator_y TYPE minmax
```
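
For example, a map tile server could restrict a query to a rectangle of Mercator
coordinates. The bounds below are arbitrary example values, but a filter of this
shape lets ClickHouse prune granules using the minmax indices and the
Morton-ordered primary key instead of scanning the whole table:

```sql
-- Hypothetical tile query: count places inside one rectangle of the
-- Web Mercator plane. The bounds are made-up example values.
SELECT count() AS places_in_tile
FROM foursquare_mercator
WHERE mercator_x BETWEEN 2100000000 AND 2160000000
  AND mercator_y BETWEEN 1380000000 AND 1440000000
```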

As you can see, ClickHouse has absolutely everything you need for real-time
mapping applications!

Run the following query to load the data:

```sql
INSERT INTO foursquare_mercator
SELECT * FROM s3('s3://fsq-os-places-us-east-1/release/dt=2025-04-08/places/parquet/*')
```
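
After the insert finishes, a quick query confirms that the `MATERIALIZED` Mercator
columns are populated even though the `INSERT` statement never mentions them:

```sql
-- Sanity check: mercator_x and mercator_y are computed automatically at
-- insert time from longitude and latitude.
SELECT name, longitude, latitude, mercator_x, mercator_y
FROM foursquare_mercator
LIMIT 3
```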

## Visualizing the data {#data-visualization}

To see what's possible with this dataset, check out [adsb.exposed](https://adsb.exposed/?dataset=Places&zoom=5&lat=52.3488&lng=4.9219).
adsb.exposed was originally built by ClickHouse co-founder and CTO Alexey Milovidov to visualize ADS-B (Automatic Dependent Surveillance-Broadcast)
flight data, which is 1,000 times larger. During a company hackathon Alexey added the Foursquare data to the tool.

Some of our favourite visualizations are shown below for you to enjoy.

<Image img={visualization_1} size="md" alt="Density map of points of interest in Europe"/>

<Image img={visualization_2} size="md" alt="Sake bars in Japan"/>

<Image img={visualization_3} size="md" alt="ATMs"/>

<Image img={visualization_4} size="md" alt="Map of Europe with points of interest categorised by country"/>

docs/getting-started/example-datasets/github.md

Lines changed: 4 additions & 4 deletions
```diff
@@ -8,10 +8,10 @@ title: 'Writing Queries in ClickHouse using GitHub Data'
 ---
 
 import Image from '@theme/IdealImage';
-import superset_github_lines_added_deleted from './images/superset-github-lines-added-deleted.png'
-import superset_commits_authors from './images/superset-commits-authors.png'
-import superset_authors_matrix from './images/superset-authors-matrix.png'
-import superset_authors_matrix_v2 from './images/superset-authors-matrix_v2.png'
+import superset_github_lines_added_deleted from '@site/static/images/getting-started/example-datasets/superset-github-lines-added-deleted.png'
+import superset_commits_authors from '@site/static/images/getting-started/example-datasets/superset-commits-authors.png'
+import superset_authors_matrix from '@site/static/images/getting-started/example-datasets/superset-authors-matrix.png'
+import superset_authors_matrix_v2 from '@site/static/images/getting-started/example-datasets/superset-authors-matrix_v2.png'
 
 This dataset contains all of the commits and changes for the ClickHouse repository. It can be generated using the native `git-import` tool distributed with ClickHouse.
 
```

docs/getting-started/example-datasets/stackoverflow.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -7,7 +7,7 @@ title: 'Analyzing Stack Overflow data with ClickHouse'
 ---
 
 import Image from '@theme/IdealImage';
-import stackoverflow from './images/stackoverflow.png'
+import stackoverflow from '@site/static/images/getting-started/example-datasets/stackoverflow.png'
 
 This dataset contains every `Posts`, `Users`, `Votes`, `Comments`, `Badges`, `PostHistory`, and `PostLinks` that has occurred on Stack Overflow.
 
```