Commit 2e61a00

Add README.md for chunking
1 parent 5f1af66 commit 2e61a00

1 file changed: +55 -0 lines changed

scripts/html_chunking/README.md

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
# HTML chunking library

This library provides an HTML chunker that splits single-page HTML documentation into semantically aware chunks suitable for retrieval-augmented generation (RAG).

## Core usage

The primary function is `chunk_html`. It takes an HTML string and returns a list of chunk objects, each with its own content and metadata.

```python
from html_chunking.chunker import chunk_html

# 'source_url' is the public URL of the document;
# 'sample_html_content' holds the HTML document as a string.
source_url = "https://docs.openshift.com/container-platform/4.18/html-single/monitoring/"
with open("path/to/your/document.html", "r", encoding="utf-8") as f:
    sample_html_content = f.read()

chunks = chunk_html(
    html_content=sample_html_content,
    source_url=source_url,
    max_token_limit=380,
    count_tag_tokens=True,
)

# Process the resulting chunks
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(f"Source: {chunk.metadata.get('source')}")
    print(f"Content: {chunk.text[:100]}...")
```

### Parameters

| Name               | Type   | Description                                                                                           | Default |
| ------------------ | ------ | ----------------------------------------------------------------------------------------------------- | ------- |
| `html_content`     | `str`  | The raw HTML content to be chunked.                                                                   |         |
| `source_url`       | `str`  | The public source URL of the document, used for generating `source` metadata.                         |         |
| `max_token_limit`  | `int`  | The target maximum token limit for each chunk. The chunker will _try_ to keep chunks below this size. | `380`   |
| `count_tag_tokens` | `bool` | If `True`, HTML tags are included in the token count.                                                 | `True`  |

### Return value

The function returns a list of `Chunk` objects. Each `Chunk` object has two attributes:

* **`text` (`str`)**: The HTML content of the chunk.
* **`metadata` (`dict`)**: A dictionary containing metadata about the chunk. It includes:
  * `source`: The URL of the original document with an HTML anchor (`#anchor-id`) appended, linking directly to the section the chunk came from.

## Standalone example and visual report

Use `example.py` to run chunking on an example document and inspect the resulting chunks:

```bash
python example.py --max-token-limit=600 --output=limit-600.html
```
