Commit b574729

Adding support for zendesk source (#22)
* Adding support for zendesk source
  Signed-off-by: Denis Jannot <[email protected]>
* Updating version
  Signed-off-by: Denis Jannot <[email protected]>
---------
Signed-off-by: Denis Jannot <[email protected]>
1 parent a0ed5cb commit b574729

5 files changed: +500 -8 lines changed

README.md

Lines changed: 47 additions & 4 deletions
@@ -2,7 +2,7 @@

 [![npm version](https://img.shields.io/npm/v/doc2vec.svg)](https://www.npmjs.com/package/doc2vec)

-This project provides a configurable tool (`doc2vec`) to crawl specified websites (typically documentation sites), GitHub repositories, and local directories, extract relevant content, convert it to Markdown, chunk it intelligently, generate vector embeddings using OpenAI, and store the chunks along with their embeddings in a vector database (SQLite with `sqlite-vec` or Qdrant).
+This project provides a configurable tool (`doc2vec`) to crawl specified websites (typically documentation sites), GitHub repositories, local directories, and Zendesk support systems, extract relevant content, convert it to Markdown, chunk it intelligently, generate vector embeddings using OpenAI, and store the chunks along with their embeddings in a vector database (SQLite with `sqlite-vec` or Qdrant).

 The primary goal is to prepare documentation content for Retrieval-Augmented Generation (RAG) systems or semantic search applications.

@@ -12,6 +12,11 @@ The primary goal is to prepare documentation content for Retrieval-Augmented Gen
     * **Sitemap Support:** Extracts URLs from XML sitemaps to discover pages not linked in navigation.
     * **PDF Support:** Automatically downloads and processes PDF files linked from websites.
 * **GitHub Issues Integration:** Retrieves GitHub issues and comments, processing them into searchable chunks.
+* **Zendesk Integration:** Fetches support tickets and knowledge base articles from Zendesk, converting them to searchable chunks.
+    * **Support Tickets:** Processes tickets with metadata, descriptions, and comments.
+    * **Knowledge Base Articles:** Converts help center articles from HTML to clean Markdown.
+    * **Incremental Updates:** Only processes tickets/articles updated since the last run.
+    * **Flexible Filtering:** Filter tickets by status and priority.
 * **Local Directory Processing:** Scans local directories for files, converts content to searchable chunks.
     * **PDF Support:** Automatically extracts text from PDF files and converts them to Markdown format using Mozilla's PDF.js.
 * **Content Extraction:** Uses Puppeteer for rendering JavaScript-heavy pages and `@mozilla/readability` to extract the main article content.
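
The filtering and incremental-update features added in this hunk can be pictured with a small sketch. It is not the code from this commit: the `TicketSummary` shape and the `selectTickets` helper are hypothetical, and only the idea of filtering by status, priority, and update time (plus the default status list documented later in the README) comes from the diff.

```typescript
// Hypothetical ticket shape; field names are illustrative, not taken from the commit.
interface TicketSummary {
  id: number;
  status: string;
  priority: string | null;
  updatedAt: string; // ISO 8601 timestamp
}

interface TicketFilter {
  statuses: string[];    // corresponds to `ticket_status`
  priorities?: string[]; // corresponds to `ticket_priority`; undefined means all priorities
  updatedSince?: Date;   // assumed to be derived from `start_date` or the last run date
}

// Default documented in the README for `ticket_status`.
const DEFAULT_STATUSES = ["new", "open", "pending", "hold", "solved"];

// Keep only tickets that match the status, priority, and update-time criteria.
function selectTickets(tickets: TicketSummary[], filter: TicketFilter): TicketSummary[] {
  return tickets.filter((t) => {
    if (!filter.statuses.includes(t.status)) return false;
    if (filter.priorities && (t.priority === null || !filter.priorities.includes(t.priority))) {
      return false;
    }
    if (filter.updatedSince && new Date(t.updatedAt) < filter.updatedSince) return false;
    return true;
  });
}
```

Filtering on the update timestamp is what would make a second run cheap: tickets untouched since the cutoff never reach the chunking or embedding stages.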
@@ -22,9 +27,9 @@ The primary goal is to prepare documentation content for Retrieval-Augmented Gen
     * **SQLite:** Using `better-sqlite3` and the `sqlite-vec` extension for efficient vector search.
     * **Qdrant:** A dedicated vector database, using the `@qdrant/js-client-rest`.
 * **Change Detection:** Uses content hashing to detect changes and only re-embeds and updates chunks that have actually been modified.
-* **Incremental Updates:** For GitHub sources, tracks the last run date to only fetch new or updated issues.
+* **Incremental Updates:** For GitHub and Zendesk sources, tracks the last run date to only fetch new or updated issues/tickets.
 * **Cleanup:** Removes obsolete chunks from the database corresponding to pages or files that are no longer found during processing.
-* **Configuration:** Driven by a YAML configuration file (`config.yaml`) specifying sites, repositories, local directories, database types, metadata, and other parameters.
+* **Configuration:** Driven by a YAML configuration file (`config.yaml`) specifying sites, repositories, local directories, Zendesk instances, database types, metadata, and other parameters.
 * **Structured Logging:** Uses a custom logger (`logger.ts`) with levels, timestamps, colors, progress bars, and child loggers for clear execution monitoring.

 ## Prerequisites
@@ -34,6 +39,7 @@ The primary goal is to prepare documentation content for Retrieval-Augmented Gen
 * **TypeScript:** As the project is written in TypeScript (`ts-node` is used for execution via `npm start`).
 * **OpenAI API Key:** You need an API key from OpenAI to generate embeddings.
 * **GitHub Personal Access Token:** Required for accessing GitHub issues (set as `GITHUB_PERSONAL_ACCESS_TOKEN` in your environment).
+* **Zendesk API Token:** Required for accessing Zendesk tickets and articles (set as `ZENDESK_API_TOKEN` in your environment).
 * **(Optional) Qdrant Instance:** If using the `qdrant` database type, you need a running Qdrant instance accessible from where you run the script.
 * **(Optional) Build Tools:** Dependencies like `better-sqlite3` and `sqlite-vec` might require native compilation, which could necessitate build tools like `python`, `make`, and a C++ compiler (like `g++` or Clang) depending on your operating system.
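
For orientation, Zendesk's REST API accepts API tokens through HTTP Basic authentication, with the user name written as `{email}/token` and the token as the password. The sketch below shows how an email address and the `ZENDESK_API_TOKEN` value could be combined into a header. `zendeskAuthHeader` is a hypothetical helper and the values are placeholders; the commit's own client code is not shown in this excerpt.

```typescript
// Hypothetical helper: builds a Basic auth header in Zendesk's API-token format,
// i.e. base64("{email}/token:{apiToken}").
function zendeskAuthHeader(email: string, apiToken: string): string {
  const credentials = `${email}/token:${apiToken}`;
  return `Basic ${Buffer.from(credentials, "utf8").toString("base64")}`;
}

// Example with placeholder values; in practice the token would come from ZENDESK_API_TOKEN.
const header = zendeskAuthHeader("agent@example.com", "your-zendesk-api-token");
```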

@@ -68,6 +74,9 @@ Configuration is managed through two files:
 # Required for GitHub sources
 GITHUB_PERSONAL_ACCESS_TOKEN="ghp_..."

+# Required for Zendesk sources
+ZENDESK_API_TOKEN="your-zendesk-api-token"
+
 # Optional: Required only if using Qdrant
 QDRANT_API_KEY="your-qdrant-api-key"
 ```
@@ -78,7 +87,7 @@ Configuration is managed through two files:
 **Structure:**

 * `sources`: An array of source configurations.
-    * `type`: Either `'website'`, `'github'`, or `'local_directory'`
+    * `type`: Either `'website'`, `'github'`, `'local_directory'`, or `'zendesk'`

 For websites (`type: 'website'`):
 * `url`: The starting URL for crawling the documentation site.
@@ -96,6 +105,16 @@ Configuration is managed through two files:
 * `url_rewrite_prefix` (Optional) URL prefix to rewrite `file://` URLs (e.g., `https://mydomain.com`)
 * `encoding`: (Optional) File encoding to use (defaults to `'utf8'`). Note: PDF files are processed as binary and this setting doesn't apply to them.

+For Zendesk (`type: 'zendesk'`):
+* `zendesk_subdomain`: Your Zendesk subdomain (e.g., `'mycompany'` for mycompany.zendesk.com).
+* `email`: Your Zendesk admin email address.
+* `api_token`: Your Zendesk API token (reference environment variable as `'${ZENDESK_API_TOKEN}'`).
+* `fetch_tickets`: (Optional) Whether to fetch support tickets (defaults to `true`).
+* `fetch_articles`: (Optional) Whether to fetch knowledge base articles (defaults to `true`).
+* `start_date`: (Optional) Only process tickets/articles updated since this date (e.g., `'2025-01-01'`).
+* `ticket_status`: (Optional) Filter tickets by status (defaults to `['new', 'open', 'pending', 'hold', 'solved']`).
+* `ticket_priority`: (Optional) Filter tickets by priority (defaults to all priorities).
+
 Common configuration for all types:
 * `product_name`: A string identifying the product (used in metadata).
 * `version`: A string identifying the product version (used in metadata).
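
To make the documented fields and defaults concrete, here is a minimal sketch of how a Zendesk source entry might be typed and normalised after the YAML is parsed. The field names and default values come from the README text in this hunk; the `ZendeskSourceConfig` interface, the `expandEnv` and `withDefaults` helpers, and the choice to throw on an unset variable are assumptions for illustration, not the commit's implementation.

```typescript
// Field names mirror the README; the interface itself is an illustrative assumption.
interface ZendeskSourceConfig {
  type: "zendesk";
  zendesk_subdomain: string;
  email: string;
  api_token: string;
  fetch_tickets?: boolean;
  fetch_articles?: boolean;
  start_date?: string;
  ticket_status?: string[];
  ticket_priority?: string[];
}

type Env = Record<string, string | undefined>;

// Replace `${NAME}` placeholders with values from an environment map (hypothetical helper).
function expandEnv(value: string, env: Env): string {
  return value.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match: string, name: string) => {
    const resolved = env[name];
    if (resolved === undefined) {
      // Assumed behaviour: fail loudly rather than send an empty token.
      throw new Error(`Environment variable ${name} is not set`);
    }
    return resolved;
  });
}

// Apply the defaults documented in the README to the optional fields.
function withDefaults(config: ZendeskSourceConfig, env: Env) {
  return {
    ...config,
    api_token: expandEnv(config.api_token, env),
    fetch_tickets: config.fetch_tickets ?? true,
    fetch_articles: config.fetch_articles ?? true,
    ticket_status: config.ticket_status ?? ["new", "open", "pending", "hold", "solved"],
  };
}
```

In a Node.js process the second argument would typically be `process.env`; passing it explicitly keeps the helpers easy to test.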
@@ -150,6 +169,24 @@ Configuration is managed through two files:
       params:
         db_path: './project-docs.db'

+  # Zendesk example
+  - type: 'zendesk'
+    product_name: 'MyCompany'
+    version: 'latest'
+    zendesk_subdomain: 'mycompany'
+
+    api_token: '${ZENDESK_API_TOKEN}'
+    fetch_tickets: true
+    fetch_articles: true
+    start_date: '2025-01-01'
+    ticket_status: ['open', 'pending']
+    ticket_priority: ['high']
+    max_size: 1048576
+    database_config:
+      type: 'sqlite'
+      params:
+        db_path: './zendesk-kb.db'
+
   # Qdrant example
   - type: 'website'
     product_name: 'Istio'
@@ -188,6 +225,7 @@ The script will then:
     - For websites: Crawl the site, process any sitemaps, extract content from HTML pages and download/process PDF files, convert to Markdown
     - For GitHub repos: Fetch issues and comments, convert to Markdown
     - For local directories: Scan files, process content (converting HTML and PDF files to Markdown if needed)
+    - For Zendesk: Fetch tickets and articles, convert to Markdown
 6. For all sources: Chunk content, check for changes, generate embeddings (if needed), and store/update in the database.
 7. Cleanup obsolete chunks.
 8. Output detailed logs.
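
As a rough picture of the "convert to Markdown" step for Zendesk tickets, the sketch below renders a ticket's metadata, description, and comments as a Markdown document. The `Ticket` and `TicketComment` shapes and the exact layout are assumptions; the commit's real formatter is not visible in this excerpt.

```typescript
// Hypothetical shapes; the real Zendesk payload has many more fields.
interface TicketComment {
  author: string;
  createdAt: string;
  body: string;
}

interface Ticket {
  id: number;
  subject: string;
  status: string;
  priority: string | null;
  description: string;
  comments: TicketComment[];
}

// Render one ticket as Markdown: metadata list, description, then one heading per comment.
function ticketToMarkdown(ticket: Ticket): string {
  const lines: string[] = [
    `# Ticket #${ticket.id}: ${ticket.subject}`,
    "",
    `- **Status:** ${ticket.status}`,
    `- **Priority:** ${ticket.priority ?? "none"}`,
    "",
    "## Description",
    "",
    ticket.description.trim(),
  ];
  if (ticket.comments.length > 0) {
    lines.push("", "## Comments");
    for (const comment of ticket.comments) {
      lines.push("", `### ${comment.author} (${comment.createdAt})`, "", comment.body.trim());
    }
  }
  return lines.join("\n");
}
```

Giving each section its own heading is a deliberate choice in this sketch: a chunker that splits on headings and size then has natural boundaries between the description and individual comments.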
@@ -296,6 +334,11 @@ If you don't specify a config path, it will look for config.yaml in the current
         * Read file content, converting HTML to Markdown if needed.
         * For PDF files, extract text using Mozilla's PDF.js and convert to Markdown format with proper page structure.
         * Process each file's content.
+    - **For Zendesk:**
+        * Fetch tickets and articles using the Zendesk API.
+        * Convert tickets to formatted Markdown.
+        * Convert articles to formatted Markdown.
+        * Track last run date to support incremental updates.
 3. **Process Content:** For each processed page, issue, or file:
     * **Chunk:** Split Markdown into smaller `DocumentChunk` objects based on headings and size.
     * **Hash Check:** Generate a hash of the chunk content. Check if a chunk with the same ID exists in the DB and if its hash matches.
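
The hash check described above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: SHA-256 is assumed as the hash function (the README only says "a hash"), and an in-memory `Map` stands in for the SQLite or Qdrant lookup.

```typescript
import { createHash } from "node:crypto";

// Hypothetical minimal store: chunk ID -> hash of the content embedded last time.
type HashStore = Map<string, string>;

// SHA-256 is an assumption; any stable content hash would serve the same purpose.
function hashContent(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Returns true when the chunk is new or changed and therefore needs (re-)embedding.
function needsEmbedding(store: HashStore, chunkId: string, content: string): boolean {
  const hash = hashContent(content);
  if (store.get(chunkId) === hash) {
    return false; // same ID and same hash: skip the embedding call
  }
  // A real pipeline would likely record the hash only after a successful database write.
  store.set(chunkId, hash);
  return true;
}
```

Because the comparison happens before any embedding request, unchanged chunks cost only a hash computation and a lookup.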

content-processor.ts

Lines changed: 26 additions & 0 deletions
@@ -129,6 +129,32 @@ export class ContentProcessor {
         logger.debug('Turndown rules setup complete');
     }

+    public convertHtmlToMarkdown(html: string): string {
+        if (!html || !html.trim()) {
+            return '';
+        }
+
+        // Sanitize the HTML first
+        const cleanHtml = sanitizeHtml(html, {
+            allowedTags: [
+                'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'a', 'ul', 'ol',
+                'li', 'b', 'i', 'strong', 'em', 'code', 'pre',
+                'div', 'span', 'table', 'thead', 'tbody', 'tr', 'th', 'td',
+                'blockquote', 'br'
+            ],
+            allowedAttributes: {
+                'a': ['href'],
+                'pre': ['class', 'data-language'],
+                'code': ['class', 'data-language'],
+                'div': ['class'],
+                'span': ['class']
+            }
+        });
+
+        // Convert to markdown using TurndownService
+        return this.turndownService.turndown(cleanHtml).trim();
+    }
+
     async parseSitemap(sitemapUrl: string, logger: Logger): Promise<string[]> {
         logger.info(`Parsing sitemap from ${sitemapUrl}`);
         try {
