-This project provides a configurable tool (`doc2vec`) to crawl specified websites (typically documentation sites), GitHub repositories, and local directories, extract relevant content, convert it to Markdown, chunk it intelligently, generate vector embeddings using OpenAI, and store the chunks along with their embeddings in a vector database (SQLite with `sqlite-vec` or Qdrant).
+This project provides a configurable tool (`doc2vec`) to crawl specified websites (typically documentation sites), GitHub repositories, local directories, and Zendesk support systems, extract relevant content, convert it to Markdown, chunk it intelligently, generate vector embeddings using OpenAI, and store the chunks along with their embeddings in a vector database (SQLite with `sqlite-vec` or Qdrant).

 The primary goal is to prepare documentation content for Retrieval-Augmented Generation (RAG) systems or semantic search applications.
@@ -12,6 +12,11 @@ The primary goal is to prepare documentation content for Retrieval-Augmented Gen
 * **Sitemap Support:** Extracts URLs from XML sitemaps to discover pages not linked in navigation.
 * **PDF Support:** Automatically downloads and processes PDF files linked from websites.
 * **GitHub Issues Integration:** Retrieves GitHub issues and comments, processing them into searchable chunks.
+* **Zendesk Integration:** Fetches support tickets and knowledge base articles from Zendesk, converting them to searchable chunks.
+    * **Support Tickets:** Processes tickets with metadata, descriptions, and comments.
+    * **Knowledge Base Articles:** Converts help center articles from HTML to clean Markdown.
+    * **Incremental Updates:** Only processes tickets/articles updated since the last run.
+    * **Flexible Filtering:** Filter tickets by status and priority.
 * **Local Directory Processing:** Scans local directories for files, converts content to searchable chunks.
     * **PDF Support:** Automatically extracts text from PDF files and converts them to Markdown format using Mozilla's PDF.js.
 * **Content Extraction:** Uses Puppeteer for rendering JavaScript-heavy pages and `@mozilla/readability` to extract the main article content.
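
The Zendesk bullets added in this hunk mention filtering tickets by status and priority and skipping tickets that have not changed since the last run. A minimal TypeScript sketch of that selection logic follows. The `Ticket` and `TicketFilter` shapes and the `selectTickets` function are hypothetical names introduced for illustration and are not taken from the project; the `status`, `priority`, and `updated_at` fields mirror Zendesk's ticket JSON.

```typescript
// Hypothetical ticket shape; field names follow Zendesk's ticket JSON.
interface Ticket {
  id: number;
  status: string;
  priority: string | null;
  updated_at: string; // ISO 8601 timestamp
}

// Hypothetical filter options, not the tool's actual configuration keys.
interface TicketFilter {
  statuses?: string[];
  priorities?: string[];
  updatedSince?: Date;
}

// Keeps only tickets that pass every filter that is set.
function selectTickets(tickets: Ticket[], filter: TicketFilter): Ticket[] {
  return tickets.filter((t) => {
    if (filter.statuses && filter.statuses.indexOf(t.status) === -1) {
      return false;
    }
    if (
      filter.priorities &&
      (t.priority === null || filter.priorities.indexOf(t.priority) === -1)
    ) {
      return false;
    }
    if (
      filter.updatedSince &&
      new Date(t.updated_at).getTime() <= filter.updatedSince.getTime()
    ) {
      return false; // unchanged since the last run
    }
    return true;
  });
}

const tickets: Ticket[] = [
  { id: 1, status: "open", priority: "high", updated_at: "2024-03-02T10:00:00Z" },
  { id: 2, status: "closed", priority: "high", updated_at: "2024-03-02T10:00:00Z" },
  { id: 3, status: "open", priority: "high", updated_at: "2024-01-01T00:00:00Z" },
];

const selected = selectTickets(tickets, {
  statuses: ["open", "pending"],
  priorities: ["high", "urgent"],
  updatedSince: new Date("2024-03-01T00:00:00Z"),
});
console.log(selected.map((t) => t.id)); // only ticket 1 passes all three filters
```

In practice the date and status constraints would more likely be pushed into the Zendesk API query than applied client-side; the sketch only shows the intended semantics.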
@@ -22,9 +27,9 @@ The primary goal is to prepare documentation content for Retrieval-Augmented Gen
 * **SQLite:** Using `better-sqlite3` and the `sqlite-vec` extension for efficient vector search.
 * **Qdrant:** A dedicated vector database, using the `@qdrant/js-client-rest`.
 * **Change Detection:** Uses content hashing to detect changes and only re-embeds and updates chunks that have actually been modified.
-* **Incremental Updates:** For GitHub sources, tracks the last run date to only fetch new or updated issues.
+* **Incremental Updates:** For GitHub and Zendesk sources, tracks the last run date to only fetch new or updated issues/tickets.
 * **Cleanup:** Removes obsolete chunks from the database corresponding to pages or files that are no longer found during processing.
-* **Configuration:** Driven by a YAML configuration file (`config.yaml`) specifying sites, repositories, local directories, database types, metadata, and other parameters.
+* **Configuration:** Driven by a YAML configuration file (`config.yaml`) specifying sites, repositories, local directories, Zendesk instances, database types, metadata, and other parameters.
 * **Structured Logging:** Uses a custom logger (`logger.ts`) with levels, timestamps, colors, progress bars, and child loggers for clear execution monitoring.
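
The Change Detection bullet above says that content hashing decides which chunks get re-embedded. A minimal sketch of that idea in TypeScript follows. The `Chunk` shape, the `hashContent` and `chunksToReembed` helpers, and the choice of SHA-256 are all assumptions made for illustration; the README only states that content hashing is used.

```typescript
import { createHash } from "node:crypto";

// Hypothetical chunk shape; the project's real types may differ.
interface Chunk {
  id: string;
  content: string;
}

// SHA-256 is an assumption; the README does not name the hash function.
function hashContent(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Returns the chunks whose hash is missing from, or differs from, the stored hashes.
function chunksToReembed(
  chunks: Chunk[],
  storedHashes: Record<string, string>,
): Chunk[] {
  return chunks.filter((c) => storedHashes[c.id] !== hashContent(c.content));
}

// Hashes recorded during a previous (hypothetical) run.
const storedHashes: Record<string, string> = {
  a: hashContent("unchanged text"),
  b: hashContent("old text"),
};

const changed = chunksToReembed(
  [
    { id: "a", content: "unchanged text" }, // same hash, skipped
    { id: "b", content: "edited text" }, // hash differs, re-embedded
    { id: "c", content: "brand new text" }, // no stored hash, embedded
  ],
  storedHashes,
);
console.log(changed.map((c) => c.id));
```

Comparing hashes before calling the embedding API is what keeps repeated runs cheap: unchanged chunks cost one hash computation instead of one embedding request.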

 ## Prerequisites
@@ -34,6 +39,7 @@ The primary goal is to prepare documentation content for Retrieval-Augmented Gen
 * **TypeScript:** As the project is written in TypeScript (`ts-node` is used for execution via `npm start`).
 * **OpenAI API Key:** You need an API key from OpenAI to generate embeddings.
 * **GitHub Personal Access Token:** Required for accessing GitHub issues (set as `GITHUB_PERSONAL_ACCESS_TOKEN` in your environment).
+* **Zendesk API Token:** Required for accessing Zendesk tickets and articles (set as `ZENDESK_API_TOKEN` in your environment).
 * **(Optional) Qdrant Instance:** If using the `qdrant` database type, you need a running Qdrant instance accessible from where you run the script.
 * **(Optional) Build Tools:** Dependencies like `better-sqlite3` and `sqlite-vec` might require native compilation, which could necessitate build tools like `python`, `make`, and a C++ compiler (like `g++` or Clang) depending on your operating system.
@@ -68,6 +74,9 @@ Configuration is managed through two files:
 # Required for GitHub sources
 GITHUB_PERSONAL_ACCESS_TOKEN="ghp_..."

+# Required for Zendesk sources
+ZENDESK_API_TOKEN="your-zendesk-api-token"
+
 # Optional: Required only if using Qdrant
 QDRANT_API_KEY="your-qdrant-api-key"
 ```
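
Because each token in the `.env` excerpt above is only needed for certain source types, a startup check can report missing variables before any crawling begins. The sketch below is one way to do that in TypeScript; `missingEnvVars` is a hypothetical helper, not a function from the project, and it covers only the two source-specific variable names shown in this hunk.

```typescript
// Hypothetical helper: lists the source-specific variables that are required
// by the configured source types but absent (or empty) in the environment.
function missingEnvVars(
  sourceTypes: string[],
  env: Record<string, string | undefined>,
): string[] {
  const required: string[] = [];
  if (sourceTypes.indexOf("github") !== -1) {
    required.push("GITHUB_PERSONAL_ACCESS_TOKEN");
  }
  if (sourceTypes.indexOf("zendesk") !== -1) {
    required.push("ZENDESK_API_TOKEN");
  }
  // An empty string counts as missing.
  return required.filter((name) => !env[name]);
}

// A config that uses website and Zendesk sources, with an empty Zendesk token.
const missing = missingEnvVars(["website", "zendesk"], { ZENDESK_API_TOKEN: "" });
console.log(missing);
```

Passing the environment in as an argument (for example `process.env` in Node.js) keeps the check easy to test without touching real credentials.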
@@ -78,7 +87,7 @@ Configuration is managed through two files:
 **Structure:**

 * `sources`: An array of source configurations.
-    * `type`: Either `'website'`, `'github'`, or `'local_directory'`
+    * `type`: Either `'website'`, `'github'`, `'local_directory'`, or `'zendesk'`

     For websites (`type: 'website'`):
     * `url`: The starting URL for crawling the documentation site.
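
With `'zendesk'` added to the allowed `type` values, a loader can reject unknown source types early. The following TypeScript sketch shows one possible validation of a single parsed `sources` entry. `SOURCE_TYPES`, `isSourceType`, and `validateSource` are hypothetical names, and treating `url` as mandatory for website sources is an assumption; the excerpt describes `url` but does not say whether it is required.

```typescript
// The four type values listed in the updated README line.
const SOURCE_TYPES = ["website", "github", "local_directory", "zendesk"] as const;
type SourceType = (typeof SOURCE_TYPES)[number];

// Type guard: narrows an arbitrary parsed value to a known source type.
function isSourceType(value: unknown): value is SourceType {
  return (
    typeof value === "string" &&
    (SOURCE_TYPES as readonly string[]).indexOf(value) !== -1
  );
}

// Returns a list of problems for one parsed `sources` entry (empty when valid).
function validateSource(source: { type?: unknown; url?: unknown }): string[] {
  const problems: string[] = [];
  if (!isSourceType(source.type)) {
    problems.push(`unknown source type: ${String(source.type)}`);
  } else if (source.type === "website" && typeof source.url !== "string") {
    // Assumption: website sources must provide a starting URL.
    problems.push("website sources need a url");
  }
  return problems;
}

console.log(validateSource({ type: "zendesk" }));
console.log(validateSource({ type: "website" }));
console.log(validateSource({ type: "jira" }));
```

Deriving `SourceType` from the `SOURCE_TYPES` array keeps the runtime check and the compile-time type in sync when another source type is added.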
@@ -96,6 +105,16 @@ Configuration is managed through two files: