JackPriceBurns/faq-extract
FAQ Extract

An automated pipeline for extracting and generating FAQ entries from Discord chat messages using AI and vector embeddings.

Overview

This project processes Discord chat exports to automatically generate high-quality FAQ entries. It uses OpenAI's GPT models to identify questions, vector embeddings to group similar questions together, and contextual analysis to create comprehensive answers.

How It Works

The pipeline consists of 6 sequential steps:

Discord CSV → Add UUIDs → Extract Questions → Generate Embeddings →
Group Similar Questions → Gather Context → Generate Final FAQs

Pipeline Details

  1. Add UUIDs (add-uuids.js)

    • Reads the Discord message export CSV
    • Assigns a unique UUID to each message for later reference
    • Output: discord_messages_with_uuid.csv
  2. Extract Questions (extract-questions.js)

    • Processes messages in batches using GPT
    • Identifies questions suitable for FAQ entries
    • Filters out vague questions, staff clarifications, and general chat
    • Output: extracted_questions.json
  3. Generate Embeddings (generate-embeddings.js)

    • Creates vector embeddings for each question using OpenAI's text-embedding-3-small model
    • Processes in batches of 100 questions
    • Output: questions-with-embeddings.json
  4. Group Similar Questions (deduplicate-questions.js)

    • Uses cosine similarity to find similar questions
    • Groups questions with 70%+ similarity together
    • Output: question-groups-2.json
  5. Gather Context (gather-context.js)

    • For each question in each group, retrieves surrounding messages
    • Collects the next 10 messages after each question as context
    • Output: question-groups-with-context.json
  6. Generate FAQ Entries (generate-faq-entries.js)

    • Analyzes grouped questions with their context
    • Uses GPT to create polished FAQ question/answer pairs
    • Processes 5 groups in parallel for efficiency
    • Output: final-faq-entries.json
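The core of steps 3 and 4 is cosine similarity between embedding vectors. A minimal sketch of the grouping logic (the actual deduplicate-questions.js may structure this differently; the greedy first-match strategy and field names here are assumptions):

```javascript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Greedy grouping: each question joins the first existing group whose
// representative (first member) is at least `threshold` similar,
// otherwise it starts a new group.
function groupQuestions(questions, threshold = 0.7) {
  const groups = [];
  for (const q of questions) {
    const group = groups.find(
      g => cosineSimilarity(g[0].embedding, q.embedding) >= threshold
    );
    if (group) group.push(q);
    else groups.push([q]);
  }
  return groups;
}
```

With the default 0.7 threshold, two questions whose embeddings point in nearly the same direction land in one group, while orthogonal ones stay separate.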

Prerequisites

  • Node.js (v14 or higher)
  • OpenAI API key
  • Discord chat export in CSV format

Installation

  1. Clone this repository

  2. Install dependencies:

    npm install
  3. Set up your OpenAI API key:

    cp .env.example .env

    Then edit .env and add your OpenAI API key:

    OPENAI_API_KEY=your_actual_api_key_here
    

Usage

Run the scripts in order:

# 1. Add UUIDs to your Discord export
node add-uuids.js

# 2. Extract questions from messages
node extract-questions.js

# 3. Generate embeddings for questions
node generate-embeddings.js

# 4. Group similar questions together
node deduplicate-questions.js

# 5. Gather context messages for each question
node gather-context.js

# 6. Generate final FAQ entries
node generate-faq-entries.js

Input File

Place your Discord chat export CSV in the project root. Update the filename in add-uuids.js:7 if needed:

const inputFile = 'YOUR_DISCORD_EXPORT.csv';

Expected CSV columns:

  • Date
  • Username
  • User tag
  • Content
  • Mentions
  • link

Output Files

  • discord_messages_with_uuid.csv: Original messages with a UUID column added
  • extracted_questions.json: Questions identified by AI
  • questions-with-embeddings.json: Questions with vector embeddings (large file)
  • question-groups-2.json: Grouped similar questions
  • question-groups-with-context.json: Groups with surrounding message context
  • final-faq-entries.json: Final polished FAQ entries

Configuration

Adjustable Parameters

extract-questions.js

  • batchSize (line 31): Messages processed per batch (default: 10)
  • windowSize (line 32): Context window size (default: 20)

generate-embeddings.js

  • batchSize (line 21): Questions per embedding batch (default: 100)
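The batching behind this parameter is plain array chunking; each chunk is then sent to the embeddings endpoint. A sketch (the commented-out API call shape is an assumption about how generate-embeddings.js uses the openai client):

```javascript
// Split an array into consecutive batches of at most `size` items.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Sketch of how the batches would be consumed:
// for (const batch of chunk(questions, 100)) {
//   const res = await openai.embeddings.create({
//     model: 'text-embedding-3-small',
//     input: batch.map(q => q.question),
//   });
//   // attach res.data[i].embedding to batch[i], then save progress
// }
```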

deduplicate-questions.js

  • similarityThreshold (line 21): Minimum similarity for grouping (default: 0.7)

gather-context.js

  • contextWindowSize (line 9): Messages to gather after each question (default: 10)
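Context gathering amounts to an index lookup over the UUID-tagged, chronologically ordered messages. A sketch (field names are assumptions about the intermediate JSON):

```javascript
// Return the `windowSize` messages that follow the message with `uuid`.
// `messages` is the UUID-tagged export, in chronological order.
function gatherContext(messages, uuid, windowSize = 10) {
  const i = messages.findIndex(m => m.uuid === uuid);
  if (i === -1) return [];
  return messages.slice(i + 1, i + 1 + windowSize);
}
```

The window clips naturally at the end of the export, so questions near the end of the chat simply get fewer context messages.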

generate-faq-entries.js

  • parallelRequests (line 151): Concurrent API requests (default: 5)
  • SYSTEM_PROMPT (line 9): Customize the AI's behavior and context
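"Parallel" here means awaiting batches of 5 promises at a time, not a full worker pool. A sketch of that pattern, with a placeholder worker standing in for the per-group GPT call:

```javascript
// Process items `parallel` at a time, preserving input order in the output.
// `worker` is an async function, e.g. one FAQ-generation API call per group.
async function processInBatches(items, worker, parallel = 5) {
  const results = [];
  for (let i = 0; i < items.length; i += parallel) {
    const batch = items.slice(i, i + parallel);
    const batchResults = await Promise.all(batch.map(worker));
    results.push(...batchResults);
  }
  return results;
}
```

One trade-off of this batch-at-a-time approach: each batch waits for its slowest request before the next batch starts, which is simpler than a sliding-window pool and plays nicely with rate limits.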

Dependencies

  • openai - OpenAI API client
  • csv-parse / csv-stringify - CSV file handling
  • uuid - UUID generation
  • dotenv - Environment variable management

Notes

  • The pipeline is designed to save progress after each batch, allowing you to resume if interrupted
  • Rate limiting delays are built in to avoid API throttling
  • The questions-with-embeddings.json file can be quite large (30MB+) due to vector data
  • Total processing time depends on the number of messages and OpenAI API rate limits

Cost Considerations

This pipeline makes extensive use of OpenAI's API:

  • GPT calls for question extraction and FAQ generation
  • Embedding API calls for similarity matching

Real-world Cost Example

For a run of 5,000 Discord messages, the pipeline:

  • Cost approximately $1.50 using the default configuration (remarkably cheap)
  • Generated just under 700 FAQ entries
  • Quality was quite good overall, though not perfect (mainly limited by incorrect answers in the original Discord messages rather than the AI processing)

Model Alternatives

The default configuration uses cost-optimized models, but you can improve quality by upgrading:

GPT Models (used in extract-questions.js and generate-faq-entries.js):

  • Current: gpt-5-mini (cheap, fast)
  • Upgrade to: gpt-5 (better quality, higher cost)

Embedding Models (used in generate-embeddings.js):

  • Current: text-embedding-3-small (cost-effective)
  • Upgrade to: text-embedding-3-large (better question grouping, higher cost)

Simply change the model names in the respective files to use higher-quality models. Costs will increase, but the quality improvements may be worth it for critical FAQ generation.

License

ISC
