This repository was archived by the owner on Jul 15, 2024. It is now read-only.

Commit 87fb358

marlenezwjcrist
authored and committed
added clickhouse hackernews example
1 parent 4ece0e9 commit 87fb358

1 file changed

+266
-0
lines changed

examples/clickhouse-hackernews.ipynb

Lines changed: 266 additions & 0 deletions
@@ -0,0 +1,266 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Using Ibis with ClickHouse"
8+
]
9+
},
10+
{
11+
"cell_type": "markdown",
12+
"metadata": {},
13+
"source": [
14+
"[Ibis](https://ibis-project.org) supports reading and querying data using [ClickHouse](https://clickhouse.com/) as a backend.\n",
15+
"\n",
16+
"In this example we'll demonstrate connecting Ibis to a ClickHouse server, and using it to execute a few queries."
17+
]
18+
},
19+
{
20+
"cell_type": "code",
21+
"execution_count": null,
22+
"metadata": {},
23+
"outputs": [],
24+
"source": [
25+
"import ibis\n",
26+
"from ibis import _\n",
27+
"\n",
28+
"ibis.options.interactive = True"
29+
]
30+
},
31+
{
32+
"cell_type": "markdown",
33+
"metadata": {},
34+
"source": [
35+
"## Creating a Connection\n",
36+
"\n",
37+
"First we need to connect Ibis to a running ClickHouse server.\n",
38+
"\n",
39+
"In this example we'll run queries against the publicly available [ClickHouse playground](https://clickhouse.com/docs/en/getting-started/playground) server. To run against your own ClickHouse server you'd only need to change the connection details."
40+
]
41+
},
42+
{
43+
"cell_type": "code",
44+
"execution_count": null,
45+
"metadata": {},
46+
"outputs": [],
47+
"source": [
48+
"con = ibis.clickhouse.connect(\n",
49+
" host=\"play.clickhouse.com\",\n",
50+
" port=9440,\n",
51+
" user=\"play\",\n",
52+
" secure=True\n",
53+
")"
54+
]
55+
},
56+
{
57+
"cell_type": "markdown",
58+
"metadata": {},
59+
"source": [
60+
"## Listing available tables\n",
61+
"\n",
62+
"The ClickHouse playground server has a number of interesting datasets available. To see them, we can examine the tables via the `.tables` attribute. This shows a list of all tables available:"
63+
]
64+
},
65+
{
66+
"cell_type": "code",
67+
"execution_count": null,
68+
"metadata": {},
69+
"outputs": [],
70+
"source": [
71+
"con.tables"
72+
]
73+
},
74+
{
75+
"cell_type": "markdown",
76+
"metadata": {},
77+
"source": [
78+
"## Inspecting a Table\n",
79+
"\n",
80+
"Let's take a look at the `hackernews` table. This table contains all posts and comments on [Hacker News](https://news.ycombinator.com/).\n",
81+
"\n",
82+
"We can access the table by attribute as `con.tables.hackernews`."
83+
]
84+
},
85+
{
86+
"cell_type": "code",
87+
"execution_count": null,
88+
"metadata": {},
89+
"outputs": [],
90+
"source": [
91+
"t = con.tables.hackernews"
92+
]
93+
},
94+
{
95+
"cell_type": "markdown",
96+
"metadata": {},
97+
"source": [
98+
"We can then take a peek at the first few rows using the `.head()` method."
99+
]
100+
},
101+
{
102+
"cell_type": "code",
103+
"execution_count": null,
104+
"metadata": {},
105+
"outputs": [],
106+
"source": [
107+
"t.head()"
108+
]
109+
},
110+
{
111+
"cell_type": "markdown",
112+
"metadata": {},
113+
"source": [
114+
"## Finding the highest-scoring posts\n",
115+
"\n",
116+
"Here we find the top 5 posts by score.\n",
117+
"\n",
118+
"Posts, unlike comments, have a title, so we:\n",
119+
"\n",
120+
"- `filter` out rows that lack a title\n",
121+
"- `select` only the columns we're interested in\n",
122+
"- `order_by` score, descending\n",
123+
"- `limit` to the top 5 rows"
124+
]
125+
},
126+
{
127+
"cell_type": "code",
128+
"execution_count": null,
129+
"metadata": {},
130+
"outputs": [],
131+
"source": [
132+
"top_posts_by_score = (\n",
133+
" t.filter(_.title != \"\")\n",
134+
" .select(\"title\", \"score\")\n",
135+
" .order_by(ibis.desc(\"score\"))\n",
136+
" .limit(5)\n",
137+
")\n",
138+
"\n",
139+
"top_posts_by_score"
140+
]
141+
},
142+
{
143+
"cell_type": "markdown",
144+
"metadata": {},
145+
"source": [
146+
"## Finding the most prolific commenters\n",
147+
"\n",
148+
"Here we find the top 5 commenters by number of comments made.\n",
149+
"\n",
150+
"To do this we:\n",
151+
"\n",
152+
"- `filter` out rows with no author\n",
153+
"- `group_by` author\n",
154+
"- `count` all the rows in each group\n",
155+
"- `order_by` the counts, descending\n",
156+
"- `limit` to the top 5 rows"
157+
]
158+
},
159+
{
160+
"cell_type": "code",
161+
"execution_count": null,
162+
"metadata": {},
163+
"outputs": [],
164+
"source": [
165+
"top_commenters = (\n",
166+
" t.filter(_.by != \"\")\n",
167+
" .group_by(\"by\")\n",
168+
" .count()\n",
169+
" .order_by(ibis.desc(\"count\"))\n",
170+
" .limit(5)\n",
171+
")\n",
172+
"\n",
173+
"top_commenters"
174+
]
175+
},
176+
{
177+
"cell_type": "markdown",
178+
"metadata": {},
179+
"source": [
180+
"This query could also be expressed using the `.topk` method, which is a shorthand for the above:"
181+
]
182+
},
183+
{
184+
"cell_type": "code",
185+
"execution_count": null,
186+
"metadata": {},
187+
"outputs": [],
188+
"source": [
189+
"# This is a shorthand for the above\n",
190+
"top_commenters = t.filter(_.by != \"\").by.topk(5)\n",
191+
"\n",
192+
"top_commenters"
193+
]
194+
},
195+
{
196+
"cell_type": "markdown",
197+
"metadata": {},
198+
"source": [
199+
"## Finding top commenters by score"
200+
]
201+
},
202+
{
203+
"cell_type": "markdown",
204+
"metadata": {},
205+
"source": [
206+
"Here we find the top 5 commenters with the highest cumulative scores. In this case the `.topk` shorthand won't work and we'll need to write out the full `group_by` -> `agg` -> `order_by` -> `limit` pipeline."
207+
]
208+
},
209+
{
210+
"cell_type": "code",
211+
"execution_count": null,
212+
"metadata": {},
213+
"outputs": [],
214+
"source": [
215+
"top_commenters_by_score = (\n",
216+
" t.filter(_.by != \"\")\n",
217+
" .group_by(\"by\")\n",
218+
" .agg(total_score=_.score.sum())\n",
219+
" .order_by(ibis.desc(\"total_score\"))\n",
220+
" .limit(5)\n",
221+
")\n",
222+
"\n",
223+
"top_commenters_by_score"
224+
]
225+
},
226+
{
227+
"cell_type": "markdown",
228+
"metadata": {},
229+
"source": [
230+
"## Next Steps\n",
231+
"\n",
232+
"There are lots of other interesting queries one might ask of this dataset. A few examples:\n",
233+
"\n",
234+
"- What posts had the most comments?\n",
235+
"- How do post scores fluctuate over time?\n",
236+
"- What day of the week has the highest average post score? What day has the lowest?\n",
237+
"\n",
238+
"To learn more about how to use Ibis with ClickHouse, see [the documentation](https://ibis-project.org/backends/ClickHouse/)."
239+
]
240+
}
241+
],
242+
"metadata": {
243+
"interpreter": {
244+
"hash": "db67a4c5f346815e3207df1348e9e718605305208b0cc89f618da4cb81ede2ba"
245+
},
246+
"kernelspec": {
247+
"display_name": "Python 3 (ipykernel)",
248+
"language": "python",
249+
"name": "python3"
250+
},
251+
"language_info": {
252+
"codemirror_mode": {
253+
"name": "ipython",
254+
"version": 3
255+
},
256+
"file_extension": ".py",
257+
"mimetype": "text/x-python",
258+
"name": "python",
259+
"nbconvert_exporter": "python",
260+
"pygments_lexer": "ipython3",
261+
"version": "3.10.10"
262+
}
263+
},
264+
"nbformat": 4,
265+
"nbformat_minor": 2
266+
}
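
The notebook notes that `.topk` covers the count-based ranking but not the score-based one. To make that difference concrete, here is a minimal plain-Python sketch of what the two pipelines compute. It does not use Ibis or ClickHouse: the `rows` list and both helper functions are hypothetical stand-ins invented for illustration, and the values are made up rather than taken from the `hackernews` table.

```python
from collections import Counter, defaultdict

# Hypothetical rows standing in for the `hackernews` table.
# Only the two columns used by the notebook's queries are modeled.
rows = [
    {"by": "alice", "score": 10},
    {"by": "bob", "score": 3},
    {"by": "alice", "score": 1},
    {"by": "", "score": 50},  # no author, so it is filtered out
    {"by": "carol", "score": 7},
    {"by": "bob", "score": 2},
    {"by": "bob", "score": 1},
]


def top_by_count(rows, k):
    """filter -> group_by -> count -> order_by desc -> limit."""
    counts = Counter(r["by"] for r in rows if r["by"] != "")
    return counts.most_common(k)


def top_by_total_score(rows, k):
    """filter -> group_by -> sum(score) -> order_by desc -> limit."""
    totals = defaultdict(int)
    for r in rows:
        if r["by"] != "":
            totals[r["by"]] += r["score"]
    # Sort authors by their summed score, highest first, and keep k of them.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]


print(top_by_count(rows, 2))
print(top_by_total_score(rows, 2))
```

On this toy data the two rankings disagree: the author with the most rows is not the one with the highest summed score. That is why the score-based query has to spell out the aggregation instead of relying on a count-based shorthand.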
