An automated pipeline for extracting and generating FAQ entries from Discord chat messages using AI and vector embeddings.
This project processes Discord chat exports to automatically generate high-quality FAQ entries. It uses OpenAI's GPT models to identify questions, vector embeddings to group similar questions together, and contextual analysis to create comprehensive answers.
The pipeline consists of six sequential steps:

    Discord CSV → Add UUIDs → Extract Questions → Generate Embeddings →
    Group Similar Questions → Gather Context → Generate Final FAQs

1. Add UUIDs (`add-uuids.js`)
   - Reads the Discord message export CSV
   - Assigns a unique UUID to each message for later reference
   - Output: `discord_messages_with_uuid.csv`
2. Extract Questions (`extract-questions.js`)
   - Processes messages in batches using GPT
   - Identifies questions suitable for FAQ entries
   - Filters out vague questions, staff clarifications, and general chat
   - Output: `extracted_questions.json`
3. Generate Embeddings (`generate-embeddings.js`)
   - Creates vector embeddings for each question using OpenAI's `text-embedding-3-small` model
   - Processes in batches of 100 questions
   - Output: `questions-with-embeddings.json`
4. Group Similar Questions (`deduplicate-questions.js`)
   - Uses cosine similarity to find similar questions
   - Groups questions whose similarity is 0.7 (70%) or higher
   - Output: `question-groups-2.json`
5. Gather Context (`gather-context.js`)
   - For each question in each group, retrieves surrounding messages
   - Collects the next 10 messages after each question as context
   - Output: `question-groups-with-context.json`
6. Generate FAQ Entries (`generate-faq-entries.js`)
   - Analyzes grouped questions with their context
   - Uses GPT to create polished FAQ question/answer pairs
   - Processes 5 groups in parallel for efficiency
   - Output: `final-faq-entries.json`
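
As an illustration of step 5, here is a minimal sketch of collecting the messages that follow a question. It assumes the messages are held in chronological order and that each carries the `uuid` assigned in step 1; the function name `gatherContext` and the object shape are hypothetical and may not match what `gather-context.js` actually does.

```javascript
// Return up to `contextWindowSize` messages that follow the message
// with the given uuid. Returns [] if the uuid is not found.
// (Hypothetical helper; the message shape is an assumption.)
function gatherContext(messages, questionUuid, contextWindowSize = 10) {
  const index = messages.findIndex((m) => m.uuid === questionUuid);
  if (index === -1) return [];
  return messages.slice(index + 1, index + 1 + contextWindowSize);
}

// Toy data standing in for rows of discord_messages_with_uuid.csv.
const messages = [
  { uuid: 'a', content: 'How do I export my data?' },
  { uuid: 'b', content: 'Settings > Export.' },
  { uuid: 'c', content: 'Thanks!' },
];

console.log(gatherContext(messages, 'a', 2).map((m) => m.uuid)); // [ 'b', 'c' ]
```

Near the end of the export, `slice` simply returns fewer messages than the window size, so no special handling is needed for the last few questions.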
Prerequisites:

- Node.js (v14 or higher)
- OpenAI API key
- Discord chat export in CSV format

Installation:

1. Clone this repository.
2. Install dependencies: `npm install`
3. Set up your OpenAI API key by copying the example environment file: `cp .env.example .env`. Then edit `.env` and add your OpenAI API key: `OPENAI_API_KEY=your_actual_api_key_here`

Run the scripts in order:

    # 1. Add UUIDs to your Discord export
    node add-uuids.js
    # 2. Extract questions from messages
    node extract-questions.js
    # 3. Generate embeddings for questions
    node generate-embeddings.js
    # 4. Group similar questions together
    node deduplicate-questions.js
    # 5. Gather context messages for each question
    node gather-context.js
    # 6. Generate final FAQ entries
    node generate-faq-entries.js

Place your Discord chat export CSV in the project root. Update the filename in `add-uuids.js:7` if needed:

    const inputFile = 'YOUR_DISCORD_EXPORT.csv';

Expected CSV columns:

- Date
- Username
- User tag
- Content
- Mentions
- link
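
A quick header check before running the pipeline can catch a mismatched export early. The sketch below assumes the header row has already been parsed into an array of column names (for example with `csv-parse`). `missingColumns` is a hypothetical helper that is not part of this repository, and the exact, case-sensitive comparison is an assumption about how the scripts read the columns.

```javascript
// Column names listed above, spelled exactly as in this README.
const EXPECTED_COLUMNS = ['Date', 'Username', 'User tag', 'Content', 'Mentions', 'link'];

// Return the expected columns that are absent from the header row.
// Comparison is exact (case-sensitive) after trimming whitespace.
function missingColumns(headerRow) {
  const present = new Set(headerRow.map((name) => name.trim()));
  return EXPECTED_COLUMNS.filter((name) => !present.has(name));
}

console.log(missingColumns(['Date', 'Username', 'Content']));
// [ 'User tag', 'Mentions', 'link' ]
```

An empty result means every expected column is present.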
Output files:

| File | Description |
|---|---|
| `discord_messages_with_uuid.csv` | Original messages with UUID added |
| `extracted_questions.json` | Questions identified by AI |
| `questions-with-embeddings.json` | Questions with vector embeddings (large file) |
| `question-groups-2.json` | Grouped similar questions |
| `question-groups-with-context.json` | Groups with surrounding message context |
| `final-faq-entries.json` | Final polished FAQ entries |

Configuration options, by script:

`extract-questions.js`

- `batchSize` (line 31): Messages processed per batch (default: 10)
- `windowSize` (line 32): Context window size (default: 20)

`generate-embeddings.js`

- `batchSize` (line 21): Questions per embedding batch (default: 100)

`deduplicate-questions.js`

- `similarityThreshold` (line 21): Minimum similarity for grouping (default: 0.7)

`gather-context.js`

- `contextWindowSize` (line 9): Messages to gather after each question (default: 10)

`generate-faq-entries.js`

- `parallelRequests` (line 151): Concurrent API requests (default: 5)
- `SYSTEM_PROMPT` (line 9): Customize the AI's behavior and context
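
To show what a setting like `parallelRequests` controls, here is a minimal sketch of chunked concurrency: items are processed in chunks, and each chunk is awaited before the next one starts. `processInParallel` and `fakeWorker` are hypothetical names, and `generate-faq-entries.js` may schedule its requests differently.

```javascript
// Process items in chunks of `parallelRequests`, awaiting each chunk
// before starting the next. Results keep the input order.
// (Hypothetical helper, not taken from the repository.)
async function processInParallel(items, worker, parallelRequests = 5) {
  const results = [];
  for (let i = 0; i < items.length; i += parallelRequests) {
    const chunk = items.slice(i, i + parallelRequests);
    results.push(...(await Promise.all(chunk.map((item) => worker(item)))));
  }
  return results;
}

// Stand-in worker; a real one would call the OpenAI API for each group.
const fakeWorker = async (group) => `FAQ for ${group}`;

processInParallel(['g1', 'g2', 'g3', 'g4', 'g5', 'g6', 'g7'], fakeWorker)
  .then((faqs) => console.log(faqs.length)); // 7
```

`Promise.all` rejects as soon as any request in a chunk fails. If one failed group should not discard the rest of its chunk, `Promise.allSettled` is an alternative worth considering.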
Dependencies:

- `openai` - OpenAI API client
- `csv-parse` / `csv-stringify` - CSV file handling
- `uuid` - UUID generation
- `dotenv` - Environment variable management

Notes:

- The pipeline saves progress after each batch, so an interrupted run can be resumed
- Rate limiting delays are built in to avoid API throttling
- The `questions-with-embeddings.json` file can be quite large (30MB+) due to vector data
- Total processing time depends on the number of messages and OpenAI API rate limits
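
The save-and-resume behavior can be illustrated with a small checkpointing sketch. It is synchronous for brevity and assumes a checkpoint file holding one result per completed batch; the function names and the file format are hypothetical, and the actual scripts may store their progress differently.

```javascript
const fs = require('fs');

// Load previously saved results, or start fresh if no checkpoint exists.
function loadProgress(file) {
  return fs.existsSync(file) ? JSON.parse(fs.readFileSync(file, 'utf8')) : [];
}

// Process batches in order, writing all results so far after each one.
// On restart, batches that already have a saved result are skipped.
// (Assumed format: a JSON array with one entry per completed batch.)
function runWithCheckpoints(batches, processBatch, file) {
  const results = loadProgress(file);
  for (let i = results.length; i < batches.length; i++) {
    results.push(processBatch(batches[i]));
    fs.writeFileSync(file, JSON.stringify(results));
  }
  return results;
}
```

Calling `runWithCheckpoints(batches, processBatch, 'progress.json')` again after an interruption would pick up at the first batch without a saved result, where `progress.json` is a placeholder filename.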
This pipeline makes extensive use of OpenAI's API:
- GPT calls for question extraction and FAQ generation
- Embedding API calls for similarity matching
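
Similarity matching on embeddings can be sketched as follows. The example assumes each question is an object with an `embedding` array and uses a greedy strategy that compares every question with the first member of each existing group. That strategy and the names `cosineSimilarity` and `groupQuestions` are assumptions for illustration, not necessarily what `deduplicate-questions.js` does.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Greedy grouping (an assumption, not necessarily the script's method):
// a question joins the first group whose representative (its first member)
// is at least `threshold` similar; otherwise it starts a new group.
function groupQuestions(questions, threshold = 0.7) {
  const groups = [];
  for (const q of questions) {
    const match = groups.find(
      (g) => cosineSimilarity(g[0].embedding, q.embedding) >= threshold
    );
    if (match) match.push(q);
    else groups.push([q]);
  }
  return groups;
}

// Toy 2-D vectors stand in for real embeddings.
const sample = [
  { question: 'How do I reset my password?', embedding: [1, 0] },
  { question: 'Password reset steps?', embedding: [0.9, 0.1] },
  { question: 'Where is the changelog?', embedding: [0, 1] },
];

console.log(groupQuestions(sample).map((g) => g.length)); // [ 2, 1 ]
```

A higher threshold yields more, smaller groups; a lower one merges more loosely related questions.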
In one run over 5,000 Discord messages with the default configuration:
- Cost: approximately $1.50
- Output: just under 700 FAQ entries
- Quality: good overall, though not perfect; the main limitation was incorrect answers in the original Discord messages rather than the AI processing
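
For a rough budget, the reported figures can be scaled to other export sizes. The sketch below assumes cost grows linearly with message count, which is only an approximation: actual cost depends on message length, model pricing, and how many questions are extracted. `estimateCostUsd` is a hypothetical helper.

```javascript
// Figures reported above for one run with the default configuration.
const REFERENCE_RUN = { messages: 5000, costUsd: 1.5 };

// Linear extrapolation: assumes cost scales proportionally with message count.
function estimateCostUsd(messageCount) {
  return (messageCount / REFERENCE_RUN.messages) * REFERENCE_RUN.costUsd;
}

console.log(estimateCostUsd(20000).toFixed(2)); // 6.00
```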
The default configuration uses cost-optimized models, but you can improve quality by upgrading:

GPT models (used in `extract-questions.js` and `generate-faq-entries.js`):

- Current: `gpt-5-mini` (cheap, fast)
- Upgrade to: `gpt-5` (better quality, higher cost)

Embedding models (used in `generate-embeddings.js`):

- Current: `text-embedding-3-small` (cost-effective)
- Upgrade to: `text-embedding-3-large` (better question grouping, higher cost)

To switch, change the model names in the respective files. Costs will increase, but the quality improvement may be worth it when FAQ accuracy is critical. Embeddings produced by different models are not comparable with each other, so after changing the embedding model, rerun `generate-embeddings.js` and every later step.

License: ISC