
Commit 0bfc9ee

Merge pull request #3140 from ClickHouse/measuring_search

New search

2 parents 023c8de + d9196a5

File tree: 20 files changed, +1625 −69 lines

.github/workflows/build-search.yml

Lines changed: 44 additions & 0 deletions (new file)

@@ -0,0 +1,44 @@
+name: Update Algolia Search
+
+on:
+  pull_request:
+    types:
+      - closed
+
+  workflow_dispatch:
+
+  schedule:
+    - cron: '0 4 * * *'
+
+env:
+  PYTHONUNBUFFERED: 1 # Force the stdout and stderr streams to be unbuffered
+
+jobs:
+  update-search:
+    if: github.event.pull_request.merged == true && contains(github.event.pull_request.labels.*.name, 'update search') && github.event.pull_request.base.ref == 'main'
+    #if: contains(github.event.pull_request.labels.*.name, 'update search') # Updated to trigger directly on PRs with the label
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout Repository
+        uses: actions/checkout@v3
+
+      - name: Set up Node.js
+        uses: actions/setup-node@v3
+        with:
+          node-version: '20'
+
+      - name: Run Prep from Master
+        run: yarn copy-clickhouse-repo-docs
+
+      - name: Run Auto Generate Settings
+        run: yarn auto-generate-settings
+
+      - name: Run Indexer
+        run: yarn run-indexer
+        env:
+          ALGOLIA_API_KEY: ${{ secrets.ALGOLIA_API_KEY }}
+          ALGOLIA_APP_ID: 5H9UG7CX5W
+
+      - name: Verify Completion
+        run: echo "All steps completed successfully!"

docs/en/chdb/getting-started.md

Lines changed: 1 addition & 1 deletion

@@ -49,7 +49,7 @@ pip install pandas pyarrow
 ## Querying a JSON file in S3
 
 Let's now have a look at how to query a JSON file that's stored in an S3 bucket.
-The [YouTube dislikes dataset](https://clickhouse.com/docs/en/getting-started/example-datasets/youtube-dislikes) contains more than 4 billion rows of dislikes on YouTube videos up to 2021.
+The [YouTube dislikes dataset](/docs/en/getting-started/example-datasets/youtube-dislikes) contains more than 4 billion rows of dislikes on YouTube videos up to 2021.
 We're going to work with one of the JSON files from that dataset.
 
 Import chdb:

docs/en/integrations/data-ingestion/kafka/kafka-clickhouse-connect-sink.md

Lines changed: 1 addition & 1 deletion

@@ -5,7 +5,7 @@ slug: /en/integrations/kafka/clickhouse-kafka-connect-sink
 description: The official Kafka connector from ClickHouse.
 ---
 
-import ConnectionDetails from '@site/docs/en/\_snippets/\_gather_your_details_http.mdx';
+import ConnectionDetails from '@site/docs/en/_snippets/_gather_your_details_http.mdx';
 
 # ClickHouse Kafka Connect Sink
 

docs/en/integrations/data-visualization/mitzu-and-clickhouse.md

Lines changed: 1 addition & 1 deletion

@@ -5,7 +5,7 @@ keywords: [clickhouse, Mitzu, connect, integrate, ui]
 description: Mitzu is a no-code warehouse-native product analytics application.
 ---
 
-import ConnectionDetails from '@site/docs/en/\_snippets/\_gather_your_details_http.mdx';
+import ConnectionDetails from '@site/docs/en/_snippets/_gather_your_details_http.mdx';
 
 # Connecting Mitzu to ClickHouse
 

docs/en/integrations/data-visualization/omni-and-clickhouse.md

Lines changed: 1 addition & 1 deletion

@@ -5,7 +5,7 @@ keywords: [clickhouse, Omni, connect, integrate, ui]
 description: Omni is an enterprise platform for BI, data applications, and embedded analytics that helps you explore and share insights in real time.
 ---
 
-import ConnectionDetails from '@site/docs/en/\_snippets/\_gather_your_details_http.mdx';
+import ConnectionDetails from '@site/docs/en/_snippets/_gather_your_details_http.mdx';
 
 # Omni
 

docs/en/managing-data/core-concepts/partitions.md

Lines changed: 4 additions & 1 deletion

@@ -2,7 +2,7 @@
 slug: /en/partitions
 title: Table partitions
 description: What are table partitions in ClickHouse
-keywords: [partitions]
+keywords: [partitions, partition by]
 ---
 
 ## What are table partitions in ClickHouse?
@@ -12,6 +12,7 @@ keywords: [partitions]
 
 Partitions group the [data parts](/docs/en/parts) of a table in the [MergeTree engine family](/docs/en/engines/table-engines/mergetree-family) into organized, logical units, which is a way of organizing data that is conceptually meaningful and aligned with specific criteria, such as time ranges, categories, or other key attributes. These logical units make data easier to manage, query, and optimize.
 
+### Partition By
 
 Partitioning can be enabled when a table is initially defined via the [PARTITION BY clause](/docs/en/engines/table-engines/mergetree-family/custom-partitioning-key). This clause can contain a SQL expression on any columns, the results of which will define which partition a row belongs to.
 
@@ -33,6 +34,8 @@ PARTITION BY toStartOfMonth(date);
 
 You can [query this table](https://sql.clickhouse.com/?query=U0VMRUNUICogRlJPTSB1ay51a19wcmljZV9wYWlkX3NpbXBsZV9wYXJ0aXRpb25lZA&run_query=true&tab=results) in our ClickHouse SQL Playground.
 
+### Structure on disk
+
 Whenever a set of rows is inserted into the table, instead of creating (at [least](/docs/en/operations/settings/settings#max_insert_block_size)) one single data part containing all the inserted rows (as described [here](/docs/en/parts)), ClickHouse creates one new data part for each unique partition key value among the inserted rows:
 
 <img src={require('./images/partitions.png').default} alt='INSERT PROCESSING' class='image' style={{width: '100%'}} />

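The behavior described in the new "Structure on disk" section — one new data part per unique partition key value in each insert — can be sketched conceptually in Python. This is an illustration only, not ClickHouse code; the helper names are invented:

```python
from collections import defaultdict
from datetime import date

def to_start_of_month(d: date) -> date:
    """Mimics the toStartOfMonth(date) partition expression from the example table."""
    return d.replace(day=1)

def split_insert_into_parts(rows):
    """Group one insert's rows by their partition key value.
    Conceptually, ClickHouse writes one new data part per group."""
    parts = defaultdict(list)
    for row in rows:
        parts[to_start_of_month(row["date"])].append(row)
    return dict(parts)

# Hypothetical rows spanning two months -> two data parts for this insert.
rows = [
    {"date": date(2021, 12, 6), "price": 343950},
    {"date": date(2021, 12, 20), "price": 293000},
    {"date": date(2022, 1, 6), "price": 465000},
]
parts = split_insert_into_parts(rows)
```

Here `parts` ends up with two keys (2021-12-01 and 2022-01-01), matching the claim that an insert spanning two partition key values produces two data parts.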
docs/en/managing-data/deleting-data/overview.md

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 ---
 slug: /en/deletes/overview
-title: Overview
+title: Delete Overview
 description: How to delete data in ClickHouse
 keywords: [delete, truncate, drop, lightweight delete]
 ---

docusaurus.config.js

Lines changed: 2 additions & 2 deletions

@@ -174,8 +174,8 @@ const config = {
       /** @type {import('@docusaurus/preset-classic').ThemeConfig} */
       ({
         algolia: {
-          appId: '62VCH2MD74',
-          apiKey: '2363bec2ff1cf20b0fcac675040107c3',
+          appId: '5H9UG7CX5W',
+          apiKey: '4a7bf25cf3edbef29d78d5e1eecfdca5',
           indexName: 'clickhouse',
           contextualSearch: false,
           searchPagePath: 'search',

package.json

Lines changed: 6 additions & 2 deletions

@@ -13,19 +13,23 @@
     "docusaurus": "docusaurus",
     "prep-from-local": "bash -c 'array_root=($npm_package_config_prep_array_root);array_en=($npm_package_config_prep_array_en);for folder in ${array_en[@]}; do cp -r $0/$folder docs/en;echo \"Copied $folder from [$0]\";done;for folder in ${array_root[@]}; do cp -r $0/$folder docs/;echo \"Copied $folder from [$0]\";done;echo \"Prep completed\";'",
     "prep-from-master": "bash -c 'array_root=($npm_package_config_prep_array_root);array_en=($npm_package_config_prep_array_en);ch_temp=/tmp/ch_temp_$RANDOM && mkdir -p $ch_temp && git clone --depth 1 --branch master https://github.com/ClickHouse/ClickHouse $ch_temp; for folder in ${array_en[@]}; do cp -r $ch_temp/$folder docs/en;echo \"Copied $folder from ClickHouse master branch\";done;for folder in ${array_root[@]}; do cp -r $ch_temp/$folder docs/;echo \"Copied $folder from ClickHouse master branch\";done;rm -rf $ch_temp && echo \"Prep completed\";'",
+    "copy-clickhouse-repo-docs": "bash ./copyClickhouseRepoDocs.sh",
     "serve": "docusaurus serve",
     "build-api-doc": "node clickhouseapi.js",
     "build-swagger": "npx @redocly/cli build-docs https://api.clickhouse.cloud/v1 --output build/en/cloud/manage/api/swagger.html",
-    "new-build": "bash ./copyClickhouseRepoDocs.sh && bash ./scripts/settings/autogenerate-settings.sh && yarn build-api-doc && yarn build && yarn build-swagger",
+    "auto-generate-settings": "bash ./scripts/settings/autogenerate-settings.sh",
+    "new-build": "yarn copy-clickhouse-repo-docs && yarn auto-generate-settings && yarn build-api-doc && yarn build && yarn build-swagger",
     "start": "docusaurus start",
     "swizzle": "docusaurus swizzle",
-    "write-heading-ids": "docusaurus write-heading-ids"
+    "write-heading-ids": "docusaurus write-heading-ids",
+    "run-indexer": "bash ./scripts/search/run_indexer.sh"
   },
   "dependencies": {
     "@docusaurus/core": "3.7.0",
     "@docusaurus/plugin-client-redirects": "3.7.0",
     "@docusaurus/preset-classic": "3.7.0",
     "@docusaurus/theme-mermaid": "3.7.0",
+    "@docusaurus/theme-search-algolia": "^3.7.0",
     "@mdx-js/react": "^3.1.0",
     "@radix-ui/react-navigation-menu": "^1.2.3",
     "axios": "^1.7.9",

scripts/search/README.md

Lines changed: 30 additions & 6 deletions

@@ -31,9 +31,33 @@ options:
 
 ## Results
 
-
-| Date       | Average nDCG | Results                                                                                        |
-|------------|--------------|------------------------------------------------------------------------------------------------|
-| 20/01/2024 | 0.5010       | [here](https://pastila.nl/?008231f5/bc107912f8a5074d70201e27b1a66c6c#cB/yJOsZPOWi9h8xAkuTUQ==) |
-|            |              |                                                                                                |
-
+| **Date**   | **Average nDCG** | **Results**                                                                                            | **Changes**                                      |
+|------------|------------------|--------------------------------------------------------------------------------------------------------|--------------------------------------------------|
+| 20/01/2024 | 0.4700           | [View Results](https://pastila.nl/?008231f5/bc107912f8a5074d70201e27b1a66c6c#cB/yJOsZPOWi9h8xAkuTUQ==) | Baseline                                         |
+| 21/01/2024 | 0.5021           | [View Results](https://pastila.nl/?00bb2c2f/936a9a3af62a9bdda186af5f37f55782#m7Hg0i9F1YCesMW6ot25yA==) | Index `_` character and move language to English |
+| 24/01/2024 | 0.7072           | [View Results](https://pastila.nl/?065e3e67/e4ad889d0c166226118e6160b4ee53ff#x1NPd2R7hU90CZvvrE4nhg==) | Process markdown, and tune settings.             |
+| 24/01/2024 | 0.7412           | [View Results](https://pastila.nl/?0020013d/e69b33aaae82e49bc71c5ee2cea9ad46#pqq3VtRd4eP4JM5/izcBcA==) | Include manual promotions for ambiguous terms.   |
+
+Note: exact scores may vary due to constant content changes.
+
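The Average nDCG figures in the table above score how well the search ranks relevant results. As a reference, a minimal sketch of the standard metric (my own implementation of the textbook formula, not necessarily identical to the repo's benchmark script):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: the result at rank i (1-based)
    contributes its relevance discounted by log2(i + 1)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalize DCG by the DCG of the ideal (descending-relevance) ordering,
    so a perfect ranking scores 1.0."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance judgments for one query's top-4 results:
perfect = ndcg([3, 2, 1, 0])  # results already in ideal order
swapped = ndcg([2, 3, 1, 0])  # top two results swapped -> lower score
```

The reported "Average nDCG" would then be the mean of this score over the test query set.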
+## Issues
+
+1. Some pages are not optimized for retrieval e.g.
+   a. https://clickhouse.com/docs/en/sql-reference/aggregate-functions/combinators#-if will never return for `countIf`, `sumif`, `multiif`
+1. Some pages are hidden e.g. https://clickhouse.com/docs/en/install#from-docker-image - this needs to be a separate page.
+1. Some pages e.g. https://clickhouse.com/docs/en/sql-reference/statements/alter need headings e.g. `Alter table`
+1. https://clickhouse.com/docs/en/optimize/sparse-primary-indexes needs to be optimized for primary key
+1. case `when` - https://clickhouse.com/docs/en/sql-reference/functions/conditional-functions needs to be improved. Maybe keywords or a header
+1. `has` - https://clickhouse.com/docs/en/sql-reference/functions/array-functions#hasarr-elem is tricky
+1. `codec` - we need better content
+1. `shard` - we need a better page
+1. `populate` - we need a subheading on the materialized view page
+1. `contains` - https://clickhouse.com/docs/en/sql-reference/functions/string-search-functions needs words
+1. `replica` - need more terms on https://clickhouse.com/docs/en/architecture/horizontal-scaling but we need a better page
+
+Algolia configs to try:
+
+- minProximity - 1
+- minWordSizefor2Typos - 7
+- minWordSizefor1Typo - 3