feat(sync): add GitHub Discussion crawling support #21

Open
mvanhorn wants to merge 1 commit into pwrdrvr:main from mvanhorn:osc/feat-discussion-crawling

Summary

Add opt-in GitHub Discussion crawling via --include-discussions flag on sync and refresh commands. Discussions are stored in the existing threads table with kind = 'discussion' and flow through embedding, clustering, and search automatically.

Why

Many repos use GitHub Discussions for feature requests, Q&A, and bug reports that overlap with issues. When someone files a Discussion about "download stalls" and another person opens an issue about "download timeout", ghcrawl currently can't detect the relationship because it only crawls issues and PRs.

Adding Discussion support catches cross-type duplicates. The threads table already has a text kind column, and the downstream pipeline (documents, embeddings, clustering, search) is kind-agnostic, so discussions work with zero changes to that code.

Technical approach

GitHub Discussions require the GraphQL API (no REST endpoint exists). This uses Octokit's built-in octokit.graphql() support, so no new dependencies are needed. The listRepositoryDiscussions() method in client.ts handles cursor-based pagination, a since cutoff, a limit, and a per-page delay, matching the existing REST client behavior.
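
As a rough illustration of the pagination loop described above (not the code in this PR), the sketch below pages through a repository's discussions with a cursor, stops at a since cutoff or a limit, and sleeps between pages. The names listDiscussionsSketch, GraphqlFn, ListOptions, and pageDelayMs are hypothetical, and the page size of 50 is an arbitrary choice. The GraphQL function is injected so that octokit.graphql could be passed in, on the assumption that its response matches the DiscussionPage shape.

```typescript
// Hypothetical shapes for illustration; the real types in client.ts may differ.
interface DiscussionNode {
  number: number;
  title: string;
  updatedAt: string; // ISO 8601 timestamp
}

interface DiscussionPage {
  repository: {
    discussions: {
      pageInfo: { hasNextPage: boolean; endCursor: string | null };
      nodes: DiscussionNode[];
    };
  };
}

// Minimal stand-in for the call shape of octokit.graphql(query, variables).
type GraphqlFn = (query: string, variables: Record<string, unknown>) => Promise<DiscussionPage>;

// Newest-updated first, so the loop can stop as soon as it passes the cutoff.
const DISCUSSIONS_QUERY = `
  query ($owner: String!, $repo: String!, $first: Int!, $after: String) {
    repository(owner: $owner, name: $repo) {
      discussions(first: $first, after: $after, orderBy: { field: UPDATED_AT, direction: DESC }) {
        pageInfo { hasNextPage endCursor }
        nodes { number title updatedAt }
      }
    }
  }
`;

interface ListOptions {
  since?: string; // skip discussions last updated before this ISO timestamp
  limit?: number; // maximum number of discussions to return
  pageDelayMs?: number; // pause between page requests
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function listDiscussionsSketch(
  graphql: GraphqlFn,
  owner: string,
  repo: string,
  options: ListOptions = {},
): Promise<DiscussionNode[]> {
  const { since, limit = Infinity, pageDelayMs = 0 } = options;
  const sinceMs = since ? Date.parse(since) : undefined;
  const results: DiscussionNode[] = [];
  let after: string | null = null;

  while (true) {
    const page: DiscussionPage = await graphql(DISCUSSIONS_QUERY, { owner, repo, first: 50, after });
    const { nodes, pageInfo } = page.repository.discussions;
    for (const node of nodes) {
      // Results are sorted newest first, so everything after this is older too.
      if (sinceMs !== undefined && Date.parse(node.updatedAt) < sinceMs) return results;
      results.push(node);
      if (results.length >= limit) return results;
    }
    if (!pageInfo.hasNextPage) return results;
    after = pageInfo.endCursor; // cursor for the next page
    if (pageDelayMs > 0) await sleep(pageDelayMs);
  }
}
```

Ordering by UPDATED_AT descending is what lets the since cutoff end the loop early instead of walking every page.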

If Discussions are disabled on a repo, the GraphQL error is caught, a warning is logged, and an empty array is returned. No crash.
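
The fallback can be sketched as a small wrapper. This is an assumption about the shape of the handling, not the PR's actual code: discussionsOrEmpty and its warn parameter are hypothetical names, and this version swallows every error, whereas the real client may inspect the GraphQL error before deciding to continue.

```typescript
// Hypothetical helper: run a discussion fetch, and on any failure
// log a warning and fall back to an empty list instead of throwing.
async function discussionsOrEmpty<T>(
  fetchAll: () => Promise<T[]>,
  warn: (message: string) => void = console.warn,
): Promise<T[]> {
  try {
    return await fetchAll();
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    warn(`Skipping discussions: ${message}`);
    return [];
  }
}
```

With this shape, the issue and PR sync can proceed even when the discussion query fails.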

Each discussion is mapped to the same record shape as issues/PRs via mapDiscussionToRecord() (client.ts:94). The discussion category name (e.g., "Feature Request", "Q&A") is prepended to labels so it appears in the cluster display.
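
A minimal sketch of such a mapping follows, assuming hypothetical input and record shapes (DiscussionInput, ThreadRecordSketch) rather than the real ones in client.ts. It shows the null handling covered by the tests and the category name being placed ahead of the regular labels.

```typescript
// Hypothetical input shape, loosely modeled on GitHub's GraphQL Discussion fields.
interface DiscussionInput {
  number: number;
  title: string;
  body: string | null;
  closed: boolean;
  author: { login: string } | null;
  category: { name: string } | null;
  labels: { nodes: { name: string }[] } | null;
}

// Hypothetical record shape; the real thread record has its own fields.
interface ThreadRecordSketch {
  kind: "discussion";
  number: number;
  title: string;
  body: string;
  state: "open" | "closed";
  authorLogin: string | null;
  labels: string[];
}

function mapDiscussionSketch(d: DiscussionInput): ThreadRecordSketch {
  const labelNames = (d.labels?.nodes ?? []).map((label) => label.name);
  return {
    kind: "discussion",
    number: d.number,
    title: d.title,
    body: d.body ?? "", // a null body becomes an empty string
    state: d.closed ? "closed" : "open",
    authorLogin: d.author?.login ?? null, // deleted accounts have no author
    // Category goes first so it shows up alongside the regular labels.
    labels: d.category ? [d.category.name, ...labelNames] : labelNames,
  };
}
```

Keeping the mapping a pure function is what allows it to be unit tested without a live repository.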

Changes

| File | What changed |
| --- | --- |
| packages/api-contract/src/contracts.ts | Add 'discussion' to threadKindSchema, add includeDiscussions to refresh request |
| packages/api-core/src/github/client.ts | Add listRepositoryDiscussions() with GraphQL pagination + mapDiscussionToRecord() helper |
| packages/api-core/src/github/client.test.ts | 7 tests for discussion-to-record mapping (null author/body/category, open/closed state, labels) |
| packages/api-core/src/service.ts | Add discussion sync block in syncRepository(), pass flag through refreshRepository() |
| packages/api-core/src/api/server.ts | Accept discussion in kind filter for threads endpoint |
| apps/cli/src/main.ts | Add --include-discussions flag to sync/refresh, update usage text |

Usage

# Sync with discussions
ghcrawl sync owner/repo --include-discussions

# Full refresh with discussions
ghcrawl refresh owner/repo --include-discussions

# Filter to discussions only
ghcrawl threads owner/repo --kind discussion

# Default behavior is unchanged (no discussions unless flag is set)
ghcrawl refresh owner/repo  # issues + PRs only, same as before

Testing

  • pnpm build passes
  • 7 new tests for mapDiscussionToRecord() pass (normal mapping, null author, null body, null category, closed state, open state, category-as-label)
  • 2 pre-existing config test failures on main are unrelated
  • Discussion sync integration testing requires a repo with Discussions enabled; the mapping logic is tested via the exported pure function

The threadKindSchema enum change at contracts.ts:3 is the only contract-level change. Existing consumers that don't use --include-discussions see no behavioral difference because no discussions are synced by default.

This contribution was developed with AI assistance (Claude Code + Codex).

Add opt-in Discussion sync via --include-discussions flag on sync and
refresh commands. Uses Octokit's built-in GraphQL to query the
Discussions API. Discussions are stored in the threads table with
kind='discussion' and flow through embedding and clustering
automatically.

Handles repos with Discussions disabled gracefully (warns and returns
empty). Adds 'discussion' to threadKindSchema contract enum.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@huntharo huntharo added this to ghcrawl Mar 19, 2026
@huntharo huntharo moved this to In Review in ghcrawl Mar 19, 2026