| 1 | +# HTML chunking library |
| 2 | + |
| 3 | +This library provides an HTML chunker that splits single-page HTML documentation into semantically aware chunks suitable for retrieval-augmented generation (RAG). |
| 4 | + |
| 5 | +## Core usage |
| 6 | + |
| 7 | +The primary function is `chunk_html`. It takes an HTML string and returns a list of chunk objects, each with its own content and metadata. |
| 8 | + |
| 9 | +```python |
| 10 | +from html_chunking.chunker import chunk_html |
| 11 | + |
| 12 | +# 'source_url' is the public URL of the document; 'sample_html_content' |
| 13 | +# holds the document's raw HTML as a string, read here from a local file. |
| 14 | +source_url = "https://docs.openshift.com/container-platform/4.18/html-single/monitoring/" |
| 15 | +with open("path/to/your/document.html", "r", encoding="utf-8") as f: |
| 16 | +    sample_html_content = f.read() |
| 17 | + |
| 18 | +chunks = chunk_html( |
| 19 | +    html_content=sample_html_content, |
| 20 | +    source_url=source_url, |
| 21 | +    max_token_limit=380, |
| 22 | +    count_tag_tokens=True, |
| 23 | +) |
| 24 | + |
| 25 | +# Process the resulting chunks |
| 26 | +for i, chunk in enumerate(chunks): |
| 27 | +    print(f"--- Chunk {i+1} ---") |
| 28 | +    print(f"Source: {chunk.metadata.get('source')}") |
| 29 | +    print(f"Content: {chunk.text[:100]}...") |
| 30 | +``` |
| 31 | + |
| 32 | +### Parameters |
| 33 | + |
| 34 | +| Name | Type | Description | Default | |
| 35 | +| ------------------- | -------- | ------------------------------------------------------------------------------------------------------- | ------- | |
| 36 | +| `html_content` | `str` | The raw HTML content to be chunked. | | |
| 37 | +| `source_url` | `str` | The public source URL of the document, used for generating `source` metadata. | | |
| 38 | +| `max_token_limit`   | `int`    | The target maximum number of tokens per chunk. The chunker makes a best effort to keep chunks below this size, so treat it as a soft limit. | `380`   |
| 39 | +| `count_tag_tokens` | `bool` | If `True`, HTML tags are included in the token count. | `True` | |
| 40 | + |
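Because `max_token_limit` is a target rather than a hard cap, it can be useful to check which chunks exceed it. The sketch below is only an illustration under stated assumptions: `approx_token_count` is a hypothetical whitespace-based counter, not the tokenizer this library uses, and `find_oversized` is a helper name invented here. It operates on plain strings such as each `chunk.text`.

```python
import re


def approx_token_count(text: str, count_tag_tokens: bool = True) -> int:
    """Rough token count based on whitespace splitting (hypothetical stand-in)."""
    if not count_tag_tokens:
        # Crude tag stripping for illustration only; this is not an HTML parser.
        text = re.sub(r"<[^>]+>", " ", text)
    return len(text.split())


def find_oversized(texts: list[str], max_token_limit: int = 380) -> list[int]:
    """Return the indexes of texts whose approximate count exceeds the limit."""
    return [
        i for i, text in enumerate(texts)
        if approx_token_count(text) > max_token_limit
    ]


# Two sample HTML fragments: one short, one far above the default limit.
samples = ["<p>short paragraph</p>", "<p>" + "word " * 500 + "</p>"]
print(find_oversized(samples))
```

To apply the same check to real output, pass `[chunk.text for chunk in chunks]`. The counts will differ from the library's own token counts, so the result is only a rough signal.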
| 41 | +### Return value |
| 42 | + |
| 43 | +The function returns a list of `Chunk` objects. Each `Chunk` object has two attributes: |
| 44 | + |
| 45 | +* **`text` (`str`)**: The HTML content of the chunk. |
| 46 | +* **`metadata` (`dict`)**: A dictionary containing metadata about the chunk. It includes: |
| 47 | +    * `source`: The URL of the original document with an HTML anchor (`#anchor-id`) appended, linking directly to the section the chunk came from. |
| 48 | + |
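Since each chunk's `source` ends with a section anchor, chunks can be grouped back into their originating sections. The following is a minimal sketch, assuming only the two attributes described above: the `Chunk` dataclass is a stand-in defined for this example (not the library's own class), `group_by_anchor` is a hypothetical helper, and the `example.com` URLs are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from urllib.parse import urldefrag


@dataclass
class Chunk:
    """Stand-in with the two documented attributes: text and metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def group_by_anchor(chunks):
    """Group chunks by the fragment (anchor) of their 'source' URL."""
    groups = defaultdict(list)
    for chunk in chunks:
        # urldefrag splits "https://host/page#anchor" into (url, fragment).
        _, fragment = urldefrag(chunk.metadata.get("source", ""))
        groups[fragment].append(chunk)
    return dict(groups)


chunks = [
    Chunk("<h2>Install</h2>", {"source": "https://example.com/docs/#install"}),
    Chunk("<p>Step 1</p>", {"source": "https://example.com/docs/#install"}),
    Chunk("<h2>Usage</h2>", {"source": "https://example.com/docs/#usage"}),
]
groups = group_by_anchor(chunks)
for anchor, members in groups.items():
    print(anchor, len(members))
```

Chunks whose `source` has no anchor end up under the empty-string key, which keeps them visible rather than silently dropping them.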
| 49 | +## Standalone example and visual report |
| 50 | + |
| 51 | +Use `example.py` to run chunking on an example document and inspect the resulting chunks: |
| 52 | + |
| 53 | +```bash |
| 54 | +python example.py --max-token-limit=600 --output=limit-600.html |
| 55 | +``` |