
Commit defc5f8

Adding custom tooltips for the concepts and getting-started docs, including a script for re-use. Starting with concepts and getting started to reduce scope.
1 parent: 360925a

12 files changed · +195 / -12 lines


docs/concepts/glossary.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -5,6 +5,8 @@ title: 'Glossary'
 slug: /concepts/glossary
 ---
 
+<!-- no-glossary -->
+
 # Glossary
 
 ## Atomicity {#atomicity}
```
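The `<!-- no-glossary -->` marker opts a page out of automatic tooltip injection: the script added below skips any file that contains it, which keeps the glossary page itself from being rewritten.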

docs/concepts/index.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -11,7 +11,7 @@ In this section of the docs we'll dive into the concepts around what makes Click
 
 | Page | Description |
 |------------------------------------------------------------------|---------------------------------------------------------------------------------------|
-| [Why is ClickHouse so Fast?](./why-clickhouse-is-so-fast.md) | Learn what makes ClickHouse so fast.
+| [Why is ClickHouse so Fast?](./why-clickhouse-is-so-fast.mdx) | Learn what makes ClickHouse so fast.
 | [What is OLAP?](./olap.md) | Learn what Online Analytical Processing is.
 | [Why is ClickHouse unique?](../about-us/distinctive-features.md) | Learn what makes ClickHouse unique.
 | [Glossary](./glossary.md) | This page contains a glossary of terms you'll commonly encounter throughout the docs.
```

docs/concepts/why-clickhouse-is-so-fast.md renamed to docs/concepts/why-clickhouse-is-so-fast.mdx

Lines changed: 3 additions & 3 deletions
```diff
@@ -19,7 +19,7 @@ From an architectural perspective, databases consist (at least) of a storage lay
 
 <iframe width="1024" height="576" src="https://www.youtube.com/embed/vsykFYns0Ws?si=hE2qnOf6cDKn-otP" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
 
-In ClickHouse, each table consists of multiple "table parts". A [part](/parts) is created whenever a user inserts data into the table (INSERT statement). A query is always executed against all table parts that exist at the time the query starts.
+In ClickHouse, each table consists of multiple "table <GlossaryTooltip term="Parts" />". A [part](/parts) is created whenever a user inserts data into the table (INSERT statement). A query is always executed against all table parts that exist at the time the query starts.
 
 To avoid that too many parts accumulate, ClickHouse runs a [merge](/merges) operation in the background which continuously combines multiple smaller parts into a single bigger part.
 
@@ -97,7 +97,7 @@ Finally, ClickHouse uses a vectorized query processing layer that parallelizes q
 
 Modern systems have dozens of CPU cores. To utilize all cores, ClickHouse unfolds the query plan into multiple lanes, typically one per core. Each lane processes a disjoint range of the table data. That way, the performance of the database scales "vertically" with the number of available cores.
 
-If a single node becomes too small to hold the table data, further nodes can be added to form a cluster. Tables can be split ("sharded") and distributed across the nodes. ClickHouse will run queries on all nodes that store table data and thereby scale "horizontally" with the number of available nodes.
+If a single node becomes too small to hold the table data, further nodes can be added to form a <GlossaryTooltip term="Cluster" />. Tables can be split ("sharded") and distributed across the nodes. ClickHouse will run queries on all nodes that store table data and thereby scale "horizontally" with the number of available nodes.
 
 🤿 Deep dive into this in the [Query Processing Layer](/academic_overview#4-query-processing-layer) section of the web version of our VLDB 2024 paper.
 
@@ -143,4 +143,4 @@ You can read a [PDF of the paper](https://www.vldb.org/pvldb/vol17/p3731-schulze
 Alexey Milovidov, our CTO and the creator of ClickHouse, presented the paper (slides [here](https://raw.githubusercontent.com/ClickHouse/clickhouse-presentations/master/2024-vldb/VLDB_2024_presentation.pdf)), followed by a Q&A (that quickly ran out of time!).
 You can catch the recorded presentation here:
 
-<iframe width="1024" height="576" src="https://www.youtube.com/embed/7QXKBKDOkJE?si=5uFerjqPSXQWqDkF" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
+<iframe width="1024" height="576" src="https://www.youtube.com/embed/7QXKBKDOkJE?si=5uFerjqPSXQWqDkF" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
```
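The `<GlossaryTooltip />` component itself is not part of the hunks shown here; presumably it renders the wrapped term with a hover tooltip whose text comes from `src/components/GlossaryTooltip/glossary.json` (the same file the injection script below reads), and its optional `capitalize` and `plural` attributes let the rendered word still match the surrounding sentence.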

docs/faq/general/index.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -10,7 +10,7 @@ description: 'Index page listing general questions about ClickHouse'
 # General questions about ClickHouse
 
 - [What is ClickHouse?](../../intro.md)
-- [Why is ClickHouse so fast?](../../concepts/why-clickhouse-is-so-fast.md)
+- [Why is ClickHouse so fast?](../../concepts/why-clickhouse-is-so-fast.mdx)
 - [Who is using ClickHouse?](../../faq/general/who-is-using-clickhouse.md)
 - [What does "ClickHouse" mean?](../../faq/general/dbms-naming.md)
 - [What does "Не тормозит" mean?](../../faq/general/ne-tormozit.md)
```

docs/faq/general/olap.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -34,7 +34,7 @@ All database management systems could be classified into two groups: OLAP (Onlin
 
 In practice OLAP and OLTP are not categories, it's more like a spectrum. Most real systems usually focus on one of them but provide some solutions or workarounds if the opposite kind of workload is also desired. This situation often forces businesses to operate multiple storage systems integrated, which might be not so big deal but having more systems make it more expensive to maintain. So the trend of recent years is HTAP (**Hybrid Transactional/Analytical Processing**) when both kinds of the workload are handled equally well by a single database management system.
 
-Even if a DBMS started as a pure OLAP or pure OLTP, they are forced to move towards that HTAP direction to keep up with their competition. And ClickHouse is no exception, initially, it has been designed as [fast-as-possible OLAP system](../../concepts/why-clickhouse-is-so-fast.md) and it still does not have full-fledged transaction support, but some features like consistent read/writes and mutations for updating/deleting data had to be added.
+Even if a DBMS started as a pure OLAP or pure OLTP, they are forced to move towards that HTAP direction to keep up with their competition. And ClickHouse is no exception, initially, it has been designed as [fast-as-possible OLAP system](../../concepts/why-clickhouse-is-so-fast.mdx) and it still does not have full-fledged transaction support, but some features like consistent read/writes and mutations for updating/deleting data had to be added.
 
 The fundamental trade-off between OLAP and OLTP systems remains:
 
```

docs/faq/use-cases/time-series.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -10,7 +10,7 @@ description: 'Page describing how to use ClickHouse as a time-series database'
 
 _Note: Please see the blog [Working with Time series data in ClickHouse](https://clickhouse.com/blog/working-with-time-series-data-and-functions-ClickHouse) for additional examples of using ClickHouse for time series analysis._
 
-ClickHouse is a generic data storage solution for [OLAP](../../faq/general/olap.md) workloads, while there are many specialized [time-series database management systems](https://clickhouse.com/engineering-resources/what-is-time-series-database). Nevertheless, ClickHouse's [focus on query execution speed](../../concepts/why-clickhouse-is-so-fast.md) allows it to outperform specialized systems in many cases. There are many independent benchmarks on this topic out there, so we're not going to conduct one here. Instead, let's focus on ClickHouse features that are important to use if that's your use case.
+ClickHouse is a generic data storage solution for [OLAP](../../faq/general/olap.md) workloads, while there are many specialized [time-series database management systems](https://clickhouse.com/engineering-resources/what-is-time-series-database). Nevertheless, ClickHouse's [focus on query execution speed](../../concepts/why-clickhouse-is-so-fast.mdx) allows it to outperform specialized systems in many cases. There are many independent benchmarks on this topic out there, so we're not going to conduct one here. Instead, let's focus on ClickHouse features that are important to use if that's your use case.
 
 First of all, there are **[specialized codecs](../../sql-reference/statements/create/table.md#specialized-codecs)** which make typical time-series. Either common algorithms like `DoubleDelta` and `Gorilla` or specific to ClickHouse like `T64`.
 
```

docs/getting-started/quick-start/cloud.mdx

Lines changed: 2 additions & 2 deletions
```diff
@@ -60,7 +60,7 @@ Select your desired region for deploying the service, and give your new service
 <Image img={createservice1} size="md" alt='New ClickHouse Service' border/>
 <br/>
 
-By default, the scale tier will create 3 replicas each with 4 VCPUs and 16 GiB RAM. [Vertical autoscaling](/manage/scaling#vertical-auto-scaling) will be enabled by default in the Scale tier.
+By default, the scale tier will create 3 <GlossaryTooltip term="Replica" plural="s" /> each with 4 VCPUs and 16 GiB RAM. [Vertical autoscaling](/manage/scaling#vertical-auto-scaling) will be enabled by default in the Scale tier.
 
 Users can customize the service resources if required, specifying a minimum and maximum size for replicas to scale between. When ready, select `Create service`.
 
@@ -329,4 +329,4 @@ Suppose we have the following text in a CSV file named `data.csv`:
 - Check out our 25-minute video on [Getting Started with ClickHouse](https://clickhouse.com/company/events/getting-started-with-clickhouse/)
 - If your data is coming from an external source, view our [collection of integration guides](/integrations/index.mdx) for connecting to message queues, databases, pipelines and more
 - If you are using a UI/BI visualization tool, view the [user guides for connecting a UI to ClickHouse](/integrations/data-visualization)
-- The user guide on [primary keys](/guides/best-practices/sparse-primary-indexes.md) is everything you need to know about primary keys and how to define them
+- The user guide on [primary keys](/guides/best-practices/sparse-primary-indexes.md) is everything you need to know about primary keys and how to define them
```
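The second hunk only re-adds the file's final line: the script joins lines with `'\n'`, which likely drops the trailing newline when it rewrites a page (the same effect removes the final blank line in `oss.mdx` below).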

docs/getting-started/quick-start/oss.mdx

Lines changed: 1 addition & 2 deletions
```diff
@@ -107,7 +107,7 @@ PRIMARY KEY (user_id, timestamp)
 
 You can use the familiar `INSERT INTO TABLE` command with ClickHouse, but it is
 important to understand that each insert into a `MergeTree` table causes what we
-call a **part** in ClickHouse to be created in storage. These parts later get
+call a **part** in ClickHouse to be created in storage. These <GlossaryTooltip term="Parts" /> later get
 merged in the background by ClickHouse.
 
 In ClickHouse, we try to bulk insert lots of rows at a time
@@ -373,4 +373,3 @@ technologies that integrate with ClickHouse.
 - The user guide on [primary keys](/guides/best-practices/sparse-primary-indexes.md) is everything you need to know about primary keys and how to define them.
 
 </VerticalStepper>
-
```
docs/intro.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -88,7 +88,7 @@ ClickHouse chooses the join algorithm adaptively, it starts with fast hash joins
 ## Superior query performance {#superior-query-performance}
 
 ClickHouse is well known for having extremely fast query performance.
-To learn why ClickHouse is so fast, see the [Why is ClickHouse fast?](/concepts/why-clickhouse-is-so-fast.md) guide.
+To learn why ClickHouse is so fast, see the [Why is ClickHouse fast?](/concepts/why-clickhouse-is-so-fast.mdx) guide.
 
 <!--
 ## What is OLAP? {#what-is-olap}
```

scripts/inject-glossary-tooltips.py

Lines changed: 131 additions & 0 deletions
```python
import os
import re
import json
import argparse
import difflib

# This script injects glossary tooltips into Markdown files based on a glossary JSON file.
# Path to the glossary JSON file and target directory for Markdown files.
# Adjust these paths as necessary.
GLOSSARY_FILE = 'src/components/GlossaryTooltip/glossary.json'
TARGET_DIR = 'docs/getting-started/quick-start'

with open(GLOSSARY_FILE, 'r', encoding='utf-8') as f:
    glossary = json.load(f)

# Longest terms first, so a longer term wins over a shorter one it contains.
terms = sorted(glossary.keys(), key=len, reverse=True)

def build_term_regex(term):
    escaped = re.escape(term)
    return re.compile(rf'\b({escaped})(s|es)?\b', re.IGNORECASE)

term_regexes = {term: build_term_regex(term) for term in terms}

def capitalize_word(word):
    # (currently unused)
    return word[0].upper() + word[1:] if word else word

def replace_terms(line, replaced_terms):
    # Wrap the first occurrence of a glossary term in a <GlossaryTooltip/>.
    # Each term is tooltipped at most once per file, one term per line.
    for term in terms:
        if term in replaced_terms:
            continue

        regex = term_regexes[term]

        def _replacer(match):
            if term in replaced_terms:
                return match.group(0)

            base, plural = match.group(1), match.group(2) or ''
            capitalize = base[0].isupper()
            capital_attr = ' capitalize' if capitalize else ''
            plural_attr = f' plural="{plural}"' if plural else ''
            replaced_terms.add(term)
            return f'<GlossaryTooltip term="{term}"{capital_attr}{plural_attr} />'

        line, count = regex.subn(_replacer, line, count=1)
        if count > 0:
            break  # one term per line max

    return line

def process_markdown(content):
    if '<!-- no-glossary -->' in content:
        return content, False

    lines = content.splitlines()
    inside_code_block = False
    replaced_terms = set()
    modified = False
    output_lines = []

    for line in lines:
        stripped = line.strip()

        # Fence detection for code blocks
        if stripped.startswith('```'):
            inside_code_block = not inside_code_block
            output_lines.append(line)
            continue

        # Skip inside code or headings
        if inside_code_block or stripped.startswith('#'):
            output_lines.append(line)
            continue

        new_line = replace_terms(line, replaced_terms)
        if new_line != line:
            modified = True
        output_lines.append(new_line)

    return '\n'.join(output_lines), modified

def rename_md_to_mdx(filepath):
    if filepath.endswith('.md'):
        new_path = filepath[:-3] + '.mdx'
        os.rename(filepath, new_path)
        print(f'Renamed: {filepath} -> {new_path}')
        return new_path
    return filepath

def walk_files(target_dir):
    for root, _, files in os.walk(target_dir):
        for filename in files:
            if filename.endswith('.md') or filename.endswith('.mdx'):
                yield os.path.join(root, filename)

def print_diff(original, modified, path):
    diff_lines = list(difflib.unified_diff(
        original.splitlines(),
        modified.splitlines(),
        fromfile=path,
        tofile=path,
        lineterm=''
    ))
    if diff_lines:
        print('\n'.join(diff_lines))

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--dry-run', action='store_true', help='Show diffs without writing')
    args = parser.parse_args()

    for filepath in walk_files(TARGET_DIR):
        with open(filepath, 'r', encoding='utf-8') as f:
            original = f.read()

        modified, changed = process_markdown(original)

        if changed:
            if args.dry_run:
                print(f'\n--- DRY RUN: {filepath} ---')
                print_diff(original, modified, filepath)
            else:
                with open(filepath, 'w', encoding='utf-8') as f:
                    f.write(modified)
                print(f'✅ Updated: {filepath}')

                # Rename to .mdx if needed so the injected JSX components are parsed
                if filepath.endswith('.md'):
                    filepath = rename_md_to_mdx(filepath)

if __name__ == '__main__':
    main()
```
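For reference, a minimal sketch of the data the script expects and the markup it emits. The glossary keys below are inferred from the tooltips injected in this commit; the definition strings are illustrative placeholders, not the contents of the real `glossary.json`:

```python
# Illustrative only: keys inferred from the diffs above ("Parts", "Replica",
# "Cluster"); the definition text is placeholder, not the shipped glossary.
glossary = {
    "Parts": "Immutable chunks of a MergeTree table's data on disk.",
    "Replica": "A copy of a shard's data stored on another server.",
    "Cluster": "A group of ClickHouse servers working together.",
}

# With that glossary loaded, replace_terms() rewrites the first matching
# term on a line, once per file, e.g.:
#   "create 3 replicas each with 4 VCPUs"
#   -> 'create 3 <GlossaryTooltip term="Replica" plural="s" /> each with 4 VCPUs'
```

To preview what would change without touching any files, run `python scripts/inject-glossary-tooltips.py --dry-run`; run it again without the flag to write the tooltips, after which any modified `.md` file is renamed to `.mdx`.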
