Commit 23d6f39
Merge pull request #376 from max-svistunov/lcore-136-html-chunking-pipeline
[LCORE-136] Add HTML chunking pipeline from "Fetch docs" to "Produce embeddings"
2 parents efd8c3d + afb2a46 commit 23d6f39

12 files changed: +2702 −89 lines

scripts/generate_embeddings.py

Lines changed: 4 additions & 4 deletions
```diff
@@ -38,7 +38,7 @@ def get_file_title(file_path: str) -> str:
     """Extract title from the plaintext doc file."""
     title = ""
     try:
-        with open(file_path, "r") as file:
+        with open(file_path, "r", encoding="utf-8") as file:
             title = file.readline().rstrip("\n").lstrip("# ")
     except Exception:  # noqa: S110
         pass
@@ -219,13 +219,13 @@ def got_whitespace(text: str) -> bool:
 metadata["overlap"] = args.overlap
 metadata["total-embedded-files"] = len(documents)

-with open(os.path.join(PERSIST_FOLDER, "metadata.json"), "w") as file:
+with open(os.path.join(PERSIST_FOLDER, "metadata.json"), "w", encoding="utf-8") as file:
     file.write(json.dumps(metadata))

 if UNREACHABLE_DOCS > 0:
     print(
         "WARNING:\n"
-        f"There were documents with {UNREACHABLE_DOCS} unreachable URLs, "
+        "There were documents with %s unreachable URLs, "
         "grep the log for UNREACHABLE.\n"
-        "Please update the plain text."
+        "Please update the plain text." % UNREACHABLE_DOCS
     )
```

scripts/generate_packages_to_prefetch.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -73,8 +73,8 @@ def remove_package(
     """Remove package or packages with specified prefix from the requirements file."""
     package_block = False

-    with open(join(directory, source)) as fin:
-        with open(join(directory, target), "w") as fout:
+    with open(join(directory, source), encoding="utf-8") as fin:
+        with open(join(directory, target), "w", encoding="utf-8") as fout:
             for line in fin:
                 if line.startswith(package_prefix):
                     print(line)
```
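Both diffs pin `encoding="utf-8"` explicitly because the default text encoding of `open()` is platform-dependent (e.g. cp1252 on Windows). A minimal sketch of the failure mode being avoided; the file name and `ensure_ascii=False` are illustrative, not taken from the patched scripts:

```python
import json
import tempfile
from pathlib import Path

# Documentation text frequently contains non-ASCII characters.
title = "Red Hat\u00ae OpenShift \u2192 monitoring"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "metadata.json"
    # An explicit encoding makes reads and writes byte-for-byte reproducible
    # regardless of the platform's locale default.
    with open(path, "w", encoding="utf-8") as file:
        file.write(json.dumps({"title": title}, ensure_ascii=False))
    with open(path, encoding="utf-8") as file:
        round_tripped = json.loads(file.read())["title"]

print(round_tripped == title)
```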

scripts/html_embeddings/README.md

Lines changed: 184 additions & 0 deletions
# HTML Embeddings Pipeline

The pipeline has the following stages:

1. **Download** - Fetch HTML documentation from Red Hat's portal
2. **Strip** - Remove navigation, headers, and other ballast from HTML
3. **Chunk** - Semantically chunk HTML content while preserving document structure
4. **Process Runbooks** - Handle runbooks using the existing Markdown logic
5. **Embed** - Generate embeddings and store them in a vector database
The stages live in separate files, which is useful for, e.g., running only a subset of them:

```
scripts/html_embeddings/
├── generate_embeddings.py   # Main orchestrator script
├── download_docs.py         # Portal fetcher wrapper
├── strip_html.py            # HTML stripper wrapper
├── chunk_html.py            # HTML chunking wrapper
├── process_runbooks.py      # Runbooks processing
├── utils.py                 # Shared utilities
└── README.md                # This file
```
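Conceptually, the stages compose as strip → chunk → embed. A toy end-to-end sketch of that flow; the regex stripper, whitespace tokenizer, and length-based "embedder" are stand-ins for illustration only, not the pipeline's real logic:

```python
import re

def strip_html(html: str) -> str:
    # Stage 2 (toy): drop nav/header/footer blocks, keep content markup.
    return re.sub(r"<(nav|header|footer)>.*?</\1>", "", html, flags=re.S)

def chunk(text: str, max_tokens: int = 380) -> list[str]:
    # Stage 3 (toy): whitespace "tokens", greedy packing under the limit.
    chunks, current = [], []
    for word in text.split():
        if len(current) == max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

def embed(chunks: list[str]) -> list[list[float]]:
    # Stage 5 (toy): a real run would call a sentence-embedding model.
    return [[float(len(c))] for c in chunks]

html = "<nav>skip</nav><h1>Monitoring</h1><p>" + "word " * 500 + "</p>"
chunks = chunk(strip_html(html), max_tokens=380)
vectors = embed(chunks)
print(len(chunks))
```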
## Usage

Standard:

```bash
# Generate embeddings for OpenShift 4.18
python scripts/html_embeddings/generate_embeddings.py \
    --version 4.18 \
    --output-dir ./vector_db \
    --model-dir ./embeddings_model
```

Specify a custom index name instead of the auto-generated one:

```bash
# Generate embeddings with a custom index name
python scripts/html_embeddings/generate_embeddings.py \
    --version 4.18 \
    --output-dir ./vector_db \
    --index ocp-4.18 \
    --model-dir ./embeddings_model
```

Process only a specific document and skip runbooks (good for quick testing):

```bash
# Process only the monitoring documentation
python scripts/html_embeddings/generate_embeddings.py \
    --version 4.18 \
    --specific-doc observability_overview \
    --output-dir ./vector_db \
    --model-dir ./embeddings_model \
    --skip-runbooks
```

Use cached downloads:

```bash
# Use previously downloaded files
python scripts/html_embeddings/generate_embeddings.py \
    --version 4.18 \
    --use-cached-downloads \
    --output-dir ./vector_db \
    --model-dir ./embeddings_model
```

Set a custom token limit (the default is 380, the same as in Markdown-based chunking):

```bash
# Set the token limit
python scripts/html_embeddings/generate_embeddings.py \
    --version 4.18 \
    --chunk 380 \
    --output-dir ./vector_db \
    --model-dir ./embeddings_model
```

## CLI options

### Main arguments

- `--version` - OpenShift version (required, e.g., "4.18")
- `--index` - Index name (optional, e.g., "ocp-4.18")
- `--output-dir` - Vector DB output directory (default: "./vector_db")
- `--model-dir` - Embedding model directory (default: "./embeddings_model")

### Pipeline control

- `--specific-doc` - Process only a specific document (e.g., "monitoring_apis")
- `--use-cached-downloads` - Use existing downloads instead of re-fetching
- `--skip-runbooks` - Skip runbooks processing
- `--cache-dir` - Directory for intermediate files (default: "./cache")
- `--continue-on-error` - Continue with cached data if a step fails

### HTML chunking parameters

- `--max-token-limit` - Maximum tokens per chunk (default: 380)
- `--count-tag-tokens` / `--no-count-tag-tokens` - Include/exclude HTML tags in the token count

### Other options

- `--runbooks-dir` - Directory containing runbooks (default: "./runbooks")
- `--exclude-metadata` - Metadata to exclude during embedding
- `--chunk` - Chunk size (maps to `--max-token-limit`)
- `--verbose` - Enable verbose logging

## Pipeline stages

### 1. Download stage

Downloads HTML documentation from Red Hat's portal using the portal fetcher.

**Standalone usage:**
```bash
python scripts/html_embeddings/download_docs.py --version 4.18 --output-dir ./downloads
```

### 2. Strip stage

Removes navigation, headers, footers, and other non-content elements from HTML.

**Standalone usage:**
```bash
python scripts/html_embeddings/strip_html.py --input-dir ./downloads --output-dir ./stripped
```

### 3. Chunk stage

Semantically chunks HTML documents while preserving structure and context.

**Standalone usage:**
```bash
python scripts/html_embeddings/chunk_html.py --input-dir ./stripped --output-dir ./chunks --max-token-limit 380
```

### 4. Runbooks stage

Processes runbooks using the existing Markdown chunking logic.

**Standalone usage:**
```bash
python scripts/html_embeddings/process_runbooks.py --runbooks-dir ./runbooks --output-dir ./chunks
```
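Since runbooks are Markdown, they reuse the Markdown chunking path rather than the HTML one. A toy header-based splitter illustrating that idea (this is a stand-in sketch, not the pipeline's actual splitting logic):

```python
def split_markdown_sections(md: str) -> list[str]:
    """Toy stand-in: start a new section at each Markdown header."""
    sections, current = [], []
    for line in md.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

runbook = (
    "# AlertmanagerDown\nMeaning of the alert.\n"
    "## Diagnosis\nCheck the pods.\n"
    "## Mitigation\nRestart Alertmanager."
)
sections = split_markdown_sections(runbook)
print(len(sections))
```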

### 5. Embedding stage

Generates embeddings from all chunks and stores them in the vector database.

## Cache Structure

The pipeline creates a structured cache to avoid re-processing:

```
cache/
├── downloads/   # Raw HTML downloads
├── stripped/    # Stripped HTML
└── chunks/      # JSON chunk files
```
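A hypothetical sketch of how such a cache layout can be created and reused; the directory names come from the tree above, but `ensure_cache` and the emptiness check are illustrative, not the pipeline's actual code:

```python
import tempfile
from pathlib import Path

def ensure_cache(root: Path) -> dict[str, Path]:
    # Create the three stage directories if they do not exist yet.
    dirs = {name: root / name for name in ("downloads", "stripped", "chunks")}
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs

with tempfile.TemporaryDirectory() as tmp:
    cache = ensure_cache(Path(tmp) / "cache")
    # A --use-cached-downloads style check: skip re-fetching if files exist.
    have_downloads = any(cache["downloads"].iterdir())
print(have_downloads)
```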

## Output Format

Chunks are saved as JSON files with the following structure:

```json
{
  "id": "monitoring_chunk_0001",
  "content": "<h2>Monitoring Overview</h2><p>...",
  "metadata": {
    "doc_name": "monitoring",
    "doc_id": "monitoring",
    "version": "4.18",
    "file_path": "monitoring/index.html",
    "doc_type": "openshift_documentation",
    "source": "https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html-single/monitoring/",
    "chunk_index": 1,
    "total_chunks": 45,
    "token_count": 375,
    "source_file": "monitoring/index.html"
  }
}
```
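A downstream consumer can validate chunk files against this structure before embedding. A minimal sketch; `load_chunk` is hypothetical, and the required-key list is inferred from the example above rather than from the pipeline's code:

```python
import json

REQUIRED_METADATA = {
    "doc_name", "doc_id", "version", "file_path", "doc_type",
    "source", "chunk_index", "total_chunks", "token_count", "source_file",
}

def load_chunk(raw: str) -> dict:
    """Parse one chunk JSON document and check the documented fields."""
    chunk = json.loads(raw)
    missing = {"id", "content", "metadata"} - chunk.keys()
    missing |= REQUIRED_METADATA - chunk.get("metadata", {}).keys()
    if missing:
        raise ValueError(f"chunk is missing fields: {sorted(missing)}")
    return chunk

raw = json.dumps({
    "id": "monitoring_chunk_0001",
    "content": "<h2>Monitoring Overview</h2><p>...",
    "metadata": {
        "doc_name": "monitoring", "doc_id": "monitoring", "version": "4.18",
        "file_path": "monitoring/index.html",
        "doc_type": "openshift_documentation",
        "source": "https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html-single/monitoring/",
        "chunk_index": 1, "total_chunks": 45, "token_count": 375,
        "source_file": "monitoring/index.html",
    },
})
chunk = load_chunk(raw)
print(chunk["metadata"]["token_count"])
```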
