RAG Document Viewer is an open-source library that generates high-fidelity file previews for seamless integration into your applications. It provides desktop-level file viewing capabilities for a wide range of document formats, including:
- PDF documents
- Microsoft Office files (Word, PowerPoint, Excel)
- OpenOffice documents (ODS, ODT, ODP)
The library converts these files into interactive HTML-based previews that can be easily embedded into web applications, desktop applications, or any system that supports HTML rendering.
Developed by the Preprocess Team.
- Pass in a file and specify the destination path.
- An HTML bundle is created.
- You can now embed the viewer in your application with just an <iframe>.
Viewer capabilities:
- High-Fidelity Rendering: Preserve the exact look-and-feel of PDFs, DOCX, PPTX & XLSX documents.
- Embed in Seconds: Generate a self-contained HTML bundle and drop it into an <iframe>.
- Precise Highlights: Pass bounding-box coordinates from your RAG chunks; the viewer auto-scrolls and spotlights them.
- Lightweight & Secure: Runs 100% in-browser. Files are served directly from your backend under your auth logic, with no external servers involved.
Viewer features:
- Chunk Navigator: Navigate between highlighted chunks with next/previous controls.
- Zoom Controls: Renders the document at the optimal zoom level, and users can zoom in/out as needed.
- Scrollbar Navigator: Visual indicators on the scrollbar show highlighted chunk positions; click to jump to a specific chunk.
- Chunk Highlighting: Visually emphasizes the parts of the content you select.
Demo:
We've created a demo on Hugging Face that lets you see the results you can achieve with your documents.
The demo doesn't have chunk highlighting functionality. For that feature, you'll need to use a supported provider like preprocess.co for document chunking.
1. Install Dependencies
wget "https://raw.githubusercontent.com/preprocess-co/rag-document-viewer/refs/heads/main/install.sh"
chmod +x install.sh && ./install.sh
2. Install the Library
pip install rag-document-viewer
3. Create the bundle
from rag_document_viewer import RAG_DV
# Generate an HTML viewer
RAG_DV("document.pdf", "/static/viewers/document")
4. Serve in your application
<iframe
src="/static/viewers/document/"
width="100%"
height="800"
style="border:0"
></iframe>
TL;DR – You only need system tools when building viewers on your server. Pre-built viewers are pure HTML/JS and have no dependencies.
Before you start, make sure the required system dependencies are installed. An install.sh convenience script is included for Ubuntu; support for additional operating systems is coming soon.
For macOS, Windows, and other OSes, please refer to this guide.
Install the required libraries:
wget "https://raw.githubusercontent.com/preprocess-co/rag-document-viewer/refs/heads/main/install.sh"
chmod +x install.sh && ./install.sh
Install the package from PyPI:
pip install rag-document-viewer
# or with Poetry:
# poetry add rag-document-viewer
Confirm both system tools are properly installed:
libreoffice --version
# Expected output:
# LibreOffice 24.2.7.2 420(Build:2)
pdf2htmlEX --version
# Expected output:
# pdf2htmlEX version 0.18.8.rc1
# ...
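If you would rather run this check from Python, for example at application startup, a minimal sketch using only the standard library could look like the following. The executable names are taken from the commands above; `missing_tools` is a hypothetical helper and not part of rag-document-viewer.

```python
import shutil

# Executable names taken from the verification commands above.
REQUIRED_TOOLS = ("libreoffice", "pdf2htmlEX")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All system tools found.")
```

This only checks that the executables are on PATH; it does not verify their versions.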
from rag_document_viewer import RAG_DV
# Generate an HTML viewer
RAG_DV(file_path="document.pdf", store_path="/path/to/viewers/doc1")
Note: We suggest setting store_path to a non-public, internal path and serving the content through a dedicated view. This way, you remain in full control of the authentication logic. See Handling Authentication for more details.
You can get chunk coordinates from chunking providers like Preprocess.co (which supports paragraphs, layout items, multi-column layouts, slides, and more) or Unstructured.io (which offers PDF-only item-level support).
Note: Chunks' coordinates should be stored in a list. When storing and then accessing a chunk, you should use the list index to reference the correct chunk.
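One way to keep that index available at query time is to store it next to each chunk's text when you index your documents. The sketch below is only an illustration: `build_records` and the `chunk_index` field are hypothetical names, and the records would go into whatever vector store or database you already use.

```python
def build_records(texts, boxes):
    """Pair each chunk's text with its position in the `boxes` list."""
    if len(texts) != len(boxes):
        raise ValueError("texts and boxes must have the same length")
    return [
        {"chunk_index": i, "text": text}
        for i, text in enumerate(texts)
    ]


# `boxes` is the same list you pass to the viewer as `chunks`.
boxes = [
    [{"page": 0, "top": 0.1, "left": 0.1, "height": 0.2, "width": 0.8}],
    [{"page": 1, "top": 0.5, "left": 0.1, "height": 0.1, "width": 0.8}],
]
texts = ["First chunk text", "Second chunk text"]

# Store these records; `chunk_index` identifies the chunk in the viewer.
records = build_records(texts, boxes)
```

When a retrieval step returns one of these records, its `chunk_index` is the value to reference in the viewer.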
With the Preprocess SDK
from pypreprocess import Preprocess
from rag_document_viewer import RAG_DV
# Preprocess a file
preprocess = Preprocess(api_key=YOUR_API_KEY, filepath="path/to/file", boundary_boxes=True)
preprocess.chunk()
preprocess.wait()
result = preprocess.result()
# result is a PreprocessResponse object
# Generate an HTML viewer with highlighting capabilities
RAG_DV(
    file_path="path/to/file",
    store_path="/path/to/viewers/doc1",
    chunks=result.data['boundary_boxes']["boxes"]
)
With other providers
from rag_document_viewer import RAG_DV
# Define boxes for highlighting specific content areas.
# Each chunk is a list of one or more boxes.
# Each box has coordinates relative to the page dimensions (0.0 to 1.0).
# page: is a 0 based index for identifying the document page.
# top: position of the chunk between 0 and 1 relative to the page height
# left: position of the chunk between 0 and 1 relative to the page width
# height: vertical length of the chunk between 0 and 1 relative to the page height
# width: horizontal length of the chunk between 0 and 1 relative to the page width
boxes = [
    [  # First chunk
        {"page": 1, "top": 0.02, "left": 0.1, "height": 0.1, "width": 0.5},
        # A chunk can be composed of multiple boxes (e.g., for multi-column text)
    ],
    [  # Second chunk
        {"page": 2, "top": 0.5, "left": 0.2, "height": 0.2, "width": 0.6},
    ],
    # ... more chunks
]
# Generate an HTML viewer with highlighting capabilities
RAG_DV(
    file_path="path/to/file",
    store_path="/path/to/viewers/doc1",
    chunks=boxes
)
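Many providers return bounding boxes in absolute units (points or pixels) rather than as fractions of the page. The following is a minimal sketch of the conversion, assuming the provider gives the top-left and bottom-right corners with the origin at the top-left of the page, plus the page dimensions in the same units. `to_relative_box` is a hypothetical helper, not part of the library; if your provider uses a bottom-left origin (common for PDF coordinates), flip the vertical axis first.

```python
def to_relative_box(page, x0, y0, x1, y1, page_width, page_height):
    """Convert an absolute box (top-left origin) to the relative format.

    (x0, y0) is the top-left corner and (x1, y1) the bottom-right corner,
    in the same units as page_width and page_height.
    """
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    return {
        "page": page,
        "top": y0 / page_height,
        "left": x0 / page_width,
        "height": (y1 - y0) / page_height,
        "width": (x1 - x0) / page_width,
    }


# A box on the first page (index 0) of a 600 x 800 page.
box = to_relative_box(page=0, x0=60, y0=80, x1=360, y1=240,
                      page_width=600, page_height=800)
boxes = [[box]]  # one chunk made of one box
```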
Important: If no chunk information is provided when generating the viewer, the following features will be disabled:
- Chunk highlighting and navigation
- Scrollbar chunk indicators
- The goto_chunk URL parameter
Ensure you include chunk coordinates if you plan to use these interactive features.
Tip (Page Highlighting): If you prefer to highlight entire pages instead of precise regions, create a chunk that covers the full page:
[{"page": 3, "top": 0, "left": 0, "height": 1, "width": 1}]
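If you need this for several pages, a small helper keeps the chunk list readable. This is just a sketch; `full_page_chunk` is a hypothetical name, not a library function.

```python
def full_page_chunk(page):
    """Return a chunk whose single box covers the whole page."""
    return [{"page": page, "top": 0, "left": 0, "height": 1, "width": 1}]


# One chunk per page to highlight, in list order.
page_chunks = [full_page_chunk(p) for p in (0, 3)]
```

The resulting `page_chunks` list can be passed as the `chunks` argument in the same way as the box lists shown earlier.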
Customize the viewer's appearance and behavior with these parameters during generation:
Parameter | Type | Default | Description |
---|---|---|---|
chunks | list | [] | List of box coordinates for content chunks to highlight. |
page_number | bool | True | Display page numbers at the bottom. |
chunks_navigator | bool | True | Show chunk navigation controls (requires chunks). |
scrollbar_navigator | bool | True | Display chunk indicators on the scrollbar (requires chunks). |
show_chunks_if_single | bool | False | Show chunks navigator even with only one chunk (requires chunks). |
chunk_navigator_text | str | "Chunk %d of %d" | Text template for chunk counter (use %d placeholders, requires chunks). |
Example
from rag_document_viewer import RAG_DV
# `boxes` defined earlier in the code
RAG_DV(
    file_path="path/to/file",
    store_path="/path/to/viewer",
    chunks=boxes,
    chunk_navigator_text="Suggestion %d of %d",
    scrollbar_navigator=False
)
Customize the viewer's colors to match your branding.
If main_color and background_color are set, all other colors are automatically derived. You can still override any specific color individually.
Parameter | Type | Default | Description |
---|---|---|---|
main_color | str | #ff8000 | Primary color for interactive elements |
background_color | str | #dddddd | Viewer background color |
page_shadow | str | None | CSS box-shadow for pages (auto-calculated if not set) |
text_selection_color | str | None | Browser text selection color for the viewer (auto-calculated if not set) |
controls_text_color | str | None | Text color of viewer controls, like zoom and page number (auto-calculated if not set) |
controls_bg_color | str | None | Background color of viewer controls, like zoom and page number (auto-calculated if not set) |
scrollbar_color | str | None | Scrollbar background color (auto-calculated if not set) |
scroller_color | str | None | Scrollbar thumb color (auto-calculated if not set) |
bookmark_color | str | None | Color for relevant chunk indicators in the scrollbar (defaults to main_color) |
highlight_chunk_color | str | None | CSS background-image for chunk highlight (auto-calculated if not set) |
highlight_page_color | str | None | CSS background-image for page highlight (auto-calculated if not set) |
highlight_page_outline | str | None | Page border color for highlighted pages (auto-calculated if not set) |
Example
from rag_document_viewer import RAG_DV
RAG_DV(
    file_path="path/to/file",
    store_path="/path/to/viewer",
    main_color="#0969da",
    background_color="#f6f8fa"
)
Add an <iframe> to your application to show the document.
<iframe
src="/path/to/viewers/my_document"
width="100%"
height="800"
style="border:0"
></iframe>
Note: Please see the Handling Authentication section for best practices on securely integrating the viewer.
Control the viewer's initial state by passing parameters in the <iframe> URL:
Parameter | Type | Default | Description |
---|---|---|---|
chunks | string | [] | An ordered JSON array of chunk indices to highlight and navigate. |
goto_chunk | int | None | Automatically scroll to this chunk index on load. |
goto_page | int | None | Automatically scroll to this page number on load. |
Note: The chunks and goto_chunk parameters only work if chunk data was provided when the viewer was generated. The order of indices in the chunks URL parameter determines the "Next/Previous" navigation order. Chunk and page indexes are 0-based.
Behavior Priority: The viewer determines the initial scroll position based on the following priority:
- If goto_chunk is set, it scrolls to that chunk.
- Else, if chunks is set, it scrolls to the first chunk in the list.
- Else, if goto_page is set, it scrolls to that page.
- Otherwise, it defaults to the beginning of the document.
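The priority rules above can be restated as a small Python function. This is only an illustration of the documented behavior, not the viewer's actual code (which runs in the browser); `initial_target` and its return values are hypothetical.

```python
def initial_target(goto_chunk=None, chunks=None, goto_page=None):
    """Return (kind, index) for the initial scroll position."""
    # Compare against None so that index 0 still counts as "set".
    if goto_chunk is not None:
        return ("chunk", goto_chunk)
    if chunks:
        return ("chunk", chunks[0])
    if goto_page is not None:
        return ("page", goto_page)
    return ("start", None)
```

For instance, `initial_target(goto_chunk=2, chunks=[0, 2, 3])` gives `("chunk", 2)`, while `initial_target(chunks=[2, 0, 3])` also gives `("chunk", 2)` because chunk 2 is first in the list.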
Examples:
Highlight chunks 0, 2, and 3, and jump directly to chunk 2 on load. Navigation will follow the [0, 2, 3] order.
<iframe src="/viewer/doc1?chunks=[0,2,3]&goto_chunk=2"></iframe>
Highlight chunks 2, 0, and 3. The "Next/Previous" buttons will navigate in this specific order (2 -> 0 -> 3). The view will initially scroll to chunk 2.
<iframe src="/viewer/doc1?chunks=[2,0,3]"></iframe>
Go to a specific page on load.
<iframe src="/viewer/doc1?goto_page=4"></iframe>
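On the backend, these URLs are usually built from retrieval results rather than written by hand. Below is a minimal sketch using only the standard library. `viewer_url` is a hypothetical helper, and it percent-encodes the JSON array, which assumes the viewer decodes query values in the usual way; if you find that it does not, build the query string without encoding, as in the examples above.

```python
import json
from urllib.parse import urlencode


def viewer_url(base, chunks=None, goto_chunk=None, goto_page=None):
    """Build a viewer URL with the documented query parameters."""
    params = {}
    if chunks is not None:
        # Compact JSON array, e.g. [2,0,3]; order sets the navigation order.
        params["chunks"] = json.dumps(list(chunks), separators=(",", ":"))
    if goto_chunk is not None:
        params["goto_chunk"] = goto_chunk
    if goto_page is not None:
        params["goto_page"] = goto_page
    if not params:
        return base
    # urlencode percent-encodes the brackets and commas of the JSON array.
    return base + "?" + urlencode(params)


url = viewer_url("/viewer/doc1", chunks=[0, 2, 3], goto_chunk=2)
# url == "/viewer/doc1?chunks=%5B0%2C2%2C3%5D&goto_chunk=2"
```

The resulting string can be used as the src attribute of the <iframe>.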
We strongly recommend storing viewer bundles in a non-public path. Here is a guide on how to manage authentication to prevent unwanted access to your documents.
When generating a viewer, you should store the resulting bundle in a directory that is not publicly accessible via HTTP. You can use your web server (Apache, Nginx, etc.) to block direct access to this folder. When a user requests to see a document, your application backend should first verify their permissions and then serve the viewer bundle from the disk.
Depending on your stack, this can be implemented in many ways. Using a route handler is a common approach.
Flask Example
This example shows how to serve a viewer only after checking user permissions.
from pathlib import Path

from flask import Flask, abort, send_from_directory

app = Flask(__name__)

# Path where viewer bundles are stored securely, outside the public web root
BASE_DIR = Path("/var/secure_viewers").resolve()


def user_is_allowed(doc_id):
    # Placeholder: replace with your own authentication and authorization
    # logic, e.g. check_user_can_view(current_user, doc_id)
    return False


@app.route("/view/<doc_id>/")
@app.route("/view/<doc_id>/<path:asset>")
def serve_my_document(doc_id, asset="index.html"):
    # 1. Run your authentication and authorization check
    if not user_is_allowed(doc_id):
        abort(403)  # Forbidden

    # 2. Securely resolve the path to the viewer
    viewer_dir = (BASE_DIR / doc_id).resolve()

    # Security check: ensure the resolved path is still within the base directory
    # This prevents path traversal attacks (e.g., doc_id = "../../../etc/passwd")
    if viewer_dir.parent != BASE_DIR:
        abort(404)  # Not Found

    # 3. Serve the requested asset (index.html, CSS, JS, etc.)
    return send_from_directory(viewer_dir, asset)
Note: Remember to include a wildcard in your route (e.g. <path:asset>) to handle requests for all assets inside the bundle (CSS, JS, fonts, images); otherwise the viewer will not render correctly.
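The path-containment check from the Flask example can also be isolated in a framework-independent function, which makes it easier to unit test or reuse in other stacks. This is a sketch with a hypothetical helper name (`resolve_viewer_dir`); it accepts only direct children of the base directory, mirroring the check in the example.

```python
from pathlib import Path


def resolve_viewer_dir(base_dir, doc_id):
    """Return the bundle directory for doc_id, or None if it escapes base_dir."""
    base = Path(base_dir).resolve()
    candidate = (base / doc_id).resolve()
    # Only direct children of base are accepted, as in the Flask example.
    if candidate.parent != base:
        return None
    return candidate
```

A doc_id such as "doc1" resolves to a directory directly under the base path, while values like "../../etc/passwd", "a/b", or an absolute path return None and should be answered with a 404.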
Contact the Preprocess team at [email protected] or join our Discord channel.
This project is licensed under the MIT License.
RAG Document Viewer would not be possible without the following open-source projects:
Project | License |
---|---|
LibreOffice https://www.libreoffice.org/ | MPL 2.0 / LGPL v3 |
pdf2htmlEX https://github.com/pdf2htmlEX/pdf2htmlEX | GPL v3 |
These tools are not bundled with the rag-document-viewer package; they must be installed on the host system where viewers are generated. Please consult the upstream repositories for full license texts and source code.