Commit edce6d7

fix issues
1 parent 2ce7d80 commit edce6d7

6 files changed: 399 additions and 30 deletions

scripts/search/README.md

Lines changed: 20 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -36,22 +36,28 @@ options:
3636
| 20/01/2024 | 0.4700 | [View Results](https://pastila.nl/?008231f5/bc107912f8a5074d70201e27b1a66c6c#cB/yJOsZPOWi9h8xAkuTUQ==) | Baseline |
3737
| 21/01/2024 | 0.5021 | [View Results](https://pastila.nl/?00bb2c2f/936a9a3af62a9bdda186af5f37f55782#m7Hg0i9F1YCesMW6ot25yA==) | Index `_` character and move language to English |
3838
| 24/01/2024 | 0.7072 | [View Results](https://pastila.nl/?065e3e67/e4ad889d0c166226118e6160b4ee53ff#x1NPd2R7hU90CZvvrE4nhg==) | Process markdown, and tune settings. |
39+
| 24/01/2024 | 0.7412 | [View Results](https://pastila.nl/?0020013d/e69b33aaae82e49bc71c5ee2cea9ad46#pqq3VtRd4eP4JM5/izcBcA==) | Include manual promotions for ambiguous terms. |
40+
41+
3942

4043
## Issues
4144

4245
1. Some pages are not optimized for retrieval e.g.
4346
a. https://clickhouse.com/docs/en/sql-reference/aggregate-functions/combinators#-if will never return for `countIf`, `sumif`, `multiif`
44-
2. Some pages are hidden e.g. https://clickhouse.com/docs/en/install#from-docker-image - this needs to be separate page.
45-
3. Some pages e.g. https://clickhouse.com/docs/en/sql-reference/statements/alter need headings e.g. `Alter table`
46-
4. https://clickhouse.com/docs/en/optimize/sparse-primary-indexes needs to be optimized for primary key
47-
5. `between` we need to likely manually promote.
48-
6. `case when` - https://clickhouse.com/docs/en/sql-reference/functions/conditional-functions needs to be improved. Maybe keywords or a header
49-
7. `has` - https://clickhouse.com/docs/en/sql-reference/functions/array-functions#hasarr-elem tricky
50-
8. `clickhouse` - manual promotion
51-
9. `codec` - we need better content
52-
10. `shard` - need a better page
53-
11. `populate` - we need to have a subheading on the mv page
54-
12. `contains` - https://clickhouse.com/docs/en/sql-reference/functions/string-search-functions needs words
55-
13. `client` - maybe promote manually
56-
14. `config.xml` - manually promote
57-
15. `replica` - need more terms on https://clickhouse.com/docs/en/architecture/horizontal-scaling but we need a better page
47+
1. Some pages are hidden e.g. https://clickhouse.com/docs/en/install#from-docker-image - this needs to be a separate page.
48+
1. Some pages e.g. https://clickhouse.com/docs/en/sql-reference/statements/alter need headings e.g. `Alter table`
49+
1. https://clickhouse.com/docs/en/optimize/sparse-primary-indexes needs to be optimized for primary key
50+
1. `case when` - https://clickhouse.com/docs/en/sql-reference/functions/conditional-functions needs to be improved. Maybe keywords or a header
51+
1. `has` - https://clickhouse.com/docs/en/sql-reference/functions/array-functions#hasarr-elem tricky
52+
1. `codec` - we need better content
53+
1. `shard` - need a better page
54+
1. `populate` - we need to have a subheading on the mv page
55+
1. `contains` - https://clickhouse.com/docs/en/sql-reference/functions/string-search-functions needs words
56+
1. `replica` - need more terms on https://clickhouse.com/docs/en/architecture/horizontal-scaling but we need a better page
57+
58+
59+
Algolia configs to try:
60+
61+
- minProximity - 1
62+
- minWordSizefor2Typos - 7
63+
- minWordSizefor1Typo - 3
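
The candidate values above could be trialled as an overlay on the existing index settings. The following is a minimal sketch, assuming the settings live in a flat dictionary; `with_overrides` and the `base` dictionary are hypothetical and only illustrate the merge, they are not part of this commit.

```python
import copy

# Candidate overrides taken from the list above; the keys are Algolia index settings.
CANDIDATE_OVERRIDES = {
    "minProximity": 1,
    "minWordSizefor2Typos": 7,
    "minWordSizefor1Typo": 3,
}


def with_overrides(base_settings, overrides):
    """Return a copy of base_settings with overrides applied (hypothetical helper)."""
    merged = copy.deepcopy(base_settings)
    merged.update(overrides)
    return merged


# Hypothetical baseline; in the script this would come from settings.json.
base = {"minProximity": 3, "searchableAttributes": ["h1", "content"]}
trial = with_overrides(base, CANDIDATE_OVERRIDES)
print(trial)
```

Copying the baseline keeps the original settings untouched, so each trial can be compared against the same starting point.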

scripts/search/index_pages.py

Lines changed: 45 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -7,14 +7,24 @@
77
from slugify import slugify
88
from algoliasearch.search.client import SearchClientSync
99
import networkx as nx
10+
from urllib.parse import urlparse, urlunparse
1011

1112
DOCS_SITE = 'https://clickhouse.com/docs'
13+
with open(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'settings.json'), 'r') as f:
14+
settings = json.load(f)
1215
HEADER_PATTERN = re.compile(r"^(.*?)(?:\s*\{#(.*?)\})$")
1316
object_ids = set()
1417
files_processed = set()
1518
link_data = []
1619

1720

21+
def split_url_and_anchor(url):
22+
parsed_url = urlparse(url)
23+
url_without_anchor = urlunparse(parsed_url._replace(fragment=""))
24+
anchor = parsed_url.fragment
25+
return url_without_anchor, anchor
26+
27+
1828
def read_metadata(text):
1929
parts = text.split("\n")
2030
metadata = {}
@@ -124,6 +134,7 @@ def clean_content(content):
124134
content = re.sub(r'```.*?```', '', content, flags=re.DOTALL) # replace code blocks
125135
return content
126136

137+
127138
def inject_snippets(directory, content):
128139
snippet_pattern = re.compile(
129140
r"import\s+(\w+)\s+from\s+['\"]@site/((.*?))['\"];",
@@ -207,20 +218,20 @@ def parse_markdown_content(metadata, content):
207218
current_h1 = metadata.get('title', '')
208219
current_h2 = None
209220
current_h3 = None
210-
current_h4 = None
211221
current_subdoc = {
212222
'file_path': metadata.get('file_path', ''),
213223
'slug': heading_slug,
214224
'url': f'{DOCS_SITE}{heading_slug}',
215225
'h1': current_h1,
226+
'h1_camel': current_h1,
216227
'title': metadata.get('title', ''),
217228
'content': metadata.get('description', ''),
218229
'keywords': metadata.get('keywords', ''),
219230
'objectID': get_object_id(heading_slug),
220231
'type': 'lvl1',
221232
'hierarchy': {
222-
'lvl0': metadata.get('title', ''),
223-
'lvl1': metadata.get('title', '')
233+
'lvl0': current_h1,
234+
'lvl1': current_h1
224235
}
225236
}
226237
for line in lines:
@@ -234,6 +245,7 @@ def parse_markdown_content(metadata, content):
234245
current_subdoc['slug'] = heading_slug
235246
current_subdoc['url'] = f'{DOCS_SITE}{heading_slug}'
236247
current_subdoc['h1'] = current_h1
248+
current_subdoc['h1_camel'] = current_h1
237249
current_subdoc['title'] = current_h1
238250
current_subdoc['type'] = 'lvl1'
239251
current_subdoc['object_id'] = custom_slugify(heading_slug)
@@ -254,6 +266,7 @@ def parse_markdown_content(metadata, content):
254266
'url': f'{DOCS_SITE}{heading_slug}',
255267
'title': current_h2,
256268
'h2': current_h2,
269+
'h2_camel': current_h2,
257270
'content': '',
258271
'keywords': metadata.get('keywords', ''),
259272
'objectID': get_object_id(f'{heading_slug}-{current_h2}'),
@@ -281,6 +294,7 @@ def parse_markdown_content(metadata, content):
281294
'url': f'{DOCS_SITE}{heading_slug}',
282295
'title': current_h3,
283296
'h3': current_h3,
297+
'h3_camel': current_h3,
284298
'content': '',
285299
'keywords': metadata.get('keywords', ''),
286300
'objectID': get_object_id(f'{heading_slug}-{current_h3}'),
@@ -305,6 +319,7 @@ def parse_markdown_content(metadata, content):
305319
'url': f'{DOCS_SITE}{heading_slug}',
306320
'title': current_h4,
307321
'h4': current_h4,
322+
'h4_camel': current_h4,
308323
'content': '',
309324
'keywords': metadata.get('keywords', ''),
310325
'objectID': get_object_id(f'{heading_slug}-{current_h4}'),
@@ -336,6 +351,9 @@ def process_markdown_directory(directory, base_directory):
336351
files_processed.add(md_file_path)
337352
metadata, content = parse_metadata_and_content(directory, base_directory, md_file_path)
338353
for sub_doc in parse_markdown_content(metadata, content):
354+
url_without_anchor, anchor = split_url_and_anchor(sub_doc['url'])
355+
sub_doc['url_without_anchor'] = url_without_anchor
356+
sub_doc['anchor'] = anchor
339357
update_page_links(directory, base_directory, metadata.get('file_path', ''), sub_doc['url'],
340358
sub_doc['content'])
341359
yield sub_doc
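
For reference, the URL splitting added in this hunk can be exercised on its own. This sketch reproduces `split_url_and_anchor` with its indentation restored and applies it to a hypothetical record; the sample URL is taken from the README section of this commit, and the `sub_doc` dictionary here is a stand-in, not real indexer output.

```python
from urllib.parse import urlparse, urlunparse


def split_url_and_anchor(url):
    # Same logic as the function added in this commit, with indentation restored.
    parsed_url = urlparse(url)
    url_without_anchor = urlunparse(parsed_url._replace(fragment=""))
    return url_without_anchor, parsed_url.fragment


# Hypothetical record; only the 'url' key matters here.
sub_doc = {"url": "https://clickhouse.com/docs/en/install#from-docker-image"}
sub_doc["url_without_anchor"], sub_doc["anchor"] = split_url_and_anchor(sub_doc["url"])
print(sub_doc["url_without_anchor"])
print(sub_doc["anchor"])
```

A URL without a fragment yields an empty string for the anchor, so both keys are always present on the record.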
@@ -371,9 +389,22 @@ def compute_page_rank(link_data, damping_factor=0.85, max_iter=100, tol=1e-6):
371389
return page_rank
372390

373391

392+
def create_new_index(client, index_name):
393+
try:
394+
client.delete_index(index_name)
395+
print(f'Temporary index \'{index_name}\' deleted successfully.')
396+
except Exception:
397+
print(f'Temporary index \'{index_name}\' does not exist or could not be deleted')
398+
client.set_settings(index_name, settings['settings'])
399+
client.save_rules(index_name, settings['rules'])
400+
print(f"Settings applied to temporary index '{index_name}'.")
401+
402+
374403
def main(base_directory, sub_directories, algolia_app_id, algolia_api_key, algolia_index_name,
375404
batch_size=1000, dry_run=False):
405+
temp_index_name = f"{algolia_index_name}_temp"
376406
client = SearchClientSync(algolia_app_id, algolia_api_key)
407+
create_new_index(client, temp_index_name)
377408
docs = []
378409
for sub_directory in sub_directories:
379410
directory = os.path.join(base_directory, sub_directory)
@@ -388,13 +419,22 @@ def main(base_directory, sub_directories, algolia_app_id, algolia_api_key, algol
388419
for i in range(0, len(docs), batch_size):
389420
batch = docs[i:i + batch_size] # Get the current batch
390421
if not dry_run:
391-
send_to_algolia(client, algolia_index_name, batch)
422+
send_to_algolia(client, temp_index_name, batch)
392423
else:
393424
for d in batch:
394425
print(f"{d['url']} - {d['page_rank']}")
395426
print(f"{'processed' if dry_run else 'indexed'} {len(batch)} records")
396427
t += len(batch)
397-
print(f'total for {directory}: {'processed' if dry_run else 'indexed'} {t} records')
428+
print(f"total {'processed' if dry_run else 'indexed'} {t} records")
429+
print('switching temporary index...', end='')
430+
client.operation_index(
431+
index_name=temp_index_name,
432+
operation_index_params={
433+
"operation": "move",
434+
"destination": algolia_index_name
435+
},
436+
)
437+
print('done')
398438

399439

400440
if __name__ == '__main__':
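
The change to `main` follows a build-then-swap pattern: records go into a temporary index, which then replaces the live one in a single move. The sketch below illustrates that flow with an in-memory stand-in. `InMemoryClient`, `reindex`, and their method names are hypothetical and are not the Algolia client API.

```python
class InMemoryClient:
    """Hypothetical stand-in for a search client; not the Algolia API."""

    def __init__(self):
        self.indices = {}

    def delete_index(self, name):
        self.indices.pop(name, None)

    def save_objects(self, name, objects):
        self.indices.setdefault(name, []).extend(objects)

    def move_index(self, source, destination):
        # Replace the destination in one step and drop the source name.
        self.indices[destination] = self.indices.pop(source, [])


def reindex(client, index_name, docs, batch_size=2):
    temp_name = f"{index_name}_temp"
    client.delete_index(temp_name)  # start from a clean temporary index
    for i in range(0, len(docs), batch_size):
        client.save_objects(temp_name, docs[i:i + batch_size])
    client.move_index(temp_name, index_name)  # the live index changes only here


client = InMemoryClient()
client.save_objects("docs", [{"objectID": "old"}])
reindex(client, "docs", [{"objectID": "a"}, {"objectID": "b"}, {"objectID": "c"}])
print([d["objectID"] for d in client.indices["docs"]])
```

Because the live index is only touched by the final move, a failure partway through the batches leaves the previous contents in place.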

scripts/search/results.csv

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -60,7 +60,7 @@ mater,https://clickhouse.com/docs/en/materialized-view,,
6060
bitmap,https://clickhouse.com/docs/en/sql-reference/functions/bitmap-functions,,
6161
docker,https://clickhouse.com/docs/en/install#from-docker-image,,
6262
match,https://clickhouse.com/docs/en/sql-reference/functions/string-search-functions#match,https://clickhouse.com/docs/en/sql-reference/functions/string-search-functions,
63-
alter table,https://clickhouse.com/docs/en/sql-reference/statements/alter,https://clickhouse.com/docs/en/sql-reference/statements/alter/delete",
63+
alter table,https://clickhouse.com/docs/en/sql-reference/statements/alter,https://clickhouse.com/docs/en/sql-reference/statements/alter/delete,
6464
partition by,https://clickhouse.com/docs/en/partitions,https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/custom-partitioning-key,https://clickhouse.com/docs/en/engines/table-engines/special/url#partition-by
6565
group,https://clickhouse.com/docs/en/sql-reference/statements/select/group-by,,
6666
null,https://clickhouse.com/docs/en/sql-reference/data-types/nullable,,
@@ -133,7 +133,7 @@ timezone,https://clickhouse.com/docs/en/sql-reference/functions/date-time-functi
133133
union,https://clickhouse.com/docs/en/sql-reference/statements/select/union,,
134134
dict,https://clickhouse.com/docs/en/dictionary,https://clickhouse.com/docs/en/sql-reference/dictionaries,
135135
array join,https://clickhouse.com/docs/en/sql-reference/statements/select/array-join,https://clickhouse.com/docs/en/sql-reference/functions/array-join,
136-
clickhouse,https://clickhouse.com/,https://clickhouse.com/docs/en/intro,
136+
clickhouse,https://clickhouse.com/docs/en/intro,,
137137
nested,https://clickhouse.com/docs/en/sql-reference/data-types/nested-data-structures/nested,https://clickhouse.com/docs/en/sql-reference/data-types/nested-data-structures/nested#nestedname1-type1-name2-type2-,
138138
sample,https://clickhouse.com/docs/en/sql-reference/statements/select/sample,,
139139
distinct,https://clickhouse.com/docs/en/sql-reference/statements/select/distinct,,
@@ -151,7 +151,7 @@ materi,https://clickhouse.com/docs/en/materialized-view,,
151151
max_threads,https://clickhouse.com/docs/en/operations/settings/settings#max_threads,,
152152
limit,https://clickhouse.com/docs/en/sql-reference/statements/select/limit,,
153153
toint,https://clickhouse.com/docs/en/sql-reference/functions/type-conversion-functions,,
154-
shard,https://clickhouse.com/docs/concepts/concepts/glossary#shard,,
154+
shard,https://clickhouse.com/docs/concepts/concepts/glossary#shard,https://clickhouse.com/docs/en/concepts/glossary,
155155
timeout,https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings,https://clickhouse.com/docs/en/operations/settings/settings#timeout_overflow_mode,
156156
date_diff,https://clickhouse.com/docs/en/sql-reference/functions/date-time-functions#date_diff,,
157157
default,https://clickhouse.com/docs/en/operations/settings/settings-users,https://clickhouse.com/docs/knowledgebase/remove-default-user,
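
The rows in `results.csv` appear to hold a search term followed by up to three acceptable URLs, with empty trailing fields where fewer apply. The following is a minimal sketch of loading that layout, assuming this reading of the columns is right; `load_expected` is a hypothetical helper and the sample rows are inlined so the block runs without the file.

```python
import csv
import io

# Rows copied from the file shown above; inlined so the sketch runs standalone.
SAMPLE = """\
docker,https://clickhouse.com/docs/en/install#from-docker-image,,
dict,https://clickhouse.com/docs/en/dictionary,https://clickhouse.com/docs/en/sql-reference/dictionaries,
group,https://clickhouse.com/docs/en/sql-reference/statements/select/group-by,,
"""


def load_expected(handle):
    """Map each search term to its non-empty URLs (assumed row layout)."""
    expected = {}
    for row in csv.reader(handle):
        if not row:
            continue
        term, urls = row[0], [u for u in row[1:] if u]
        expected[term] = urls
    return expected


expected = load_expected(io.StringIO(SAMPLE))
print(expected["dict"])
```

Filtering out empty fields means rows with a different number of trailing commas still load to the same shape.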
