@aaronsteers

@aaronsteers aaronsteers commented Nov 19, 2025

Summary

Implements a new MCP tool suggest_external_documentation_urls that auto-suggests external documentation URLs for APIs using DuckDuckGo search, heuristic pattern matching, and metadata enrichment. The tool helps streamline the discovery of vendor documentation URLs when building connectors.

Key Features:

  • Searches for 12 documentation categories (api_reference, api_release_history, rate_limits, authentication_guide, etc.)
  • Uses DuckDuckGo search with category-specific queries
  • Scores URLs based on domain matching, path patterns, and search relevance
  • Enriches titles by fetching page metadata (og:title, twitter:title, h1 tags)
  • Returns strongly-typed Pydantic BaseModel instead of arbitrary dicts
  • Validates URLs and detects login requirements

Dependencies Added:

  • duckduckgo-search>=7.0.0 - For web search
  • beautifulsoup4>=4.12.0 - For HTML parsing
  • lxml>=5.0.0 - For fast HTML parsing

Code Review Feedback Addressed:

  • Refactored test classes to bare functions per pytest best practices
  • Converted category names to title case
  • Changed return type from list[dict] to list[ExternalDocumentationUrl] (Pydantic BaseModel)
  • Propagated DOCUMENTATION_TYPES Literal through codebase for full type safety

Review & Testing Checklist for Human

  • Test with real API names: Try the tool with common APIs (Stripe, Salesforce, GitHub, Twilio, etc.) to verify search quality and URL relevance. The tool hasn't been tested end-to-end in production yet.
  • Verify rate limiting behavior: DuckDuckGo has unofficial rate limits (~1 req/s). Test with multiple rapid requests to ensure the sleep(0.5) rate limiting is sufficient and doesn't cause failures.
  • Check network failure handling: Test with network issues (timeouts, DNS failures) to ensure the tool degrades gracefully and doesn't crash the MCP server.
  • Verify Pydantic serialization: Confirm the ExternalDocumentationUrl model serializes correctly when returned from the MCP tool (check that field names are snake_case: doc_type, requires_login).
  • Test with invalid allowed_types: Pass unknown documentation types to verify the filtering logic works and doesn't cause runtime errors.

Test Plan

# Example manual test
from connector_builder_mcp._external_docs_suggester import suggest_external_documentation_urls

# Test with a well-known API
results = suggest_external_documentation_urls(
    api_name="Stripe",
    vendor_domain="stripe.com",
    allowed_types=["api_reference", "api_release_history", "rate_limits"],
    max_results_per_type=2
)

# Verify results contain relevant URLs with proper titles
for result in results: 
    print(f"{result.doc_type}: {result.title} - {result.url}")

Notes

  • The tool uses DuckDuckGo search, which is free but has unofficial rate limits. Consider adding retry logic or a fallback to other search providers if rate limiting becomes an issue in production.
  • The cast() usage in the filtering logic bypasses type checking - this is intentional to handle user-provided allowed_types that may contain invalid values, but it means invalid types are silently filtered out rather than raising errors.
  • Unit tests cover helper functions but not the full end-to-end search flow. Integration tests would be valuable for production confidence.
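
A minimal sketch of one alternative to silent filtering, assuming the documentation types live in a typing.Literal: split the user-provided list into recognized and rejected values so the caller can log or raise on the rejected ones. The names DocumentationType, VALID_TYPES, and split_allowed_types are hypothetical, and the Literal below is only a three-entry stand-in for the PR's DOCUMENTATION_TYPES.

```python
from typing import Literal, get_args

# Hypothetical stand-in: the PR's real Literal lists more categories than this.
DocumentationType = Literal["api_reference", "api_release_history", "rate_limits"]
VALID_TYPES = frozenset(get_args(DocumentationType))


def split_allowed_types(allowed_types: list[str]) -> tuple[list[str], list[str]]:
    """Return (recognized, rejected), preserving the caller's order in both lists."""
    recognized = [t for t in allowed_types if t in VALID_TYPES]
    rejected = [t for t in allowed_types if t not in VALID_TYPES]
    return recognized, rejected


recognized, rejected = split_allowed_types(["api_reference", "bogus", "rate_limits"])
print(recognized, rejected)
```

Whether rejected values should raise, warn, or stay silent is a product decision; the split only makes the choice explicit at the call site.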

Session: https://app.devin.ai/sessions/0612cddd74b2441ea970d065b8d4180b
Requested by: AJ Steers (@aaronsteers)

… search and metadata enrichment

- Add new MCP tool to auto-suggest external documentation URLs for APIs
- Implement DuckDuckGo search with category-specific queries
- Add heuristic pattern matching for URL scoring and categorization
- Implement metadata enrichment (og:title, h1, etc.) for high-quality titles
- Add URL validation and canonicalization
- Include comprehensive unit tests
- Add dependencies: duckduckgo-search, beautifulsoup4, lxml

Co-Authored-By: AJ Steers <[email protected]>
@devin-ai-integration

Original prompt from AJ Steers
@Devin - Let's explore options to streamline the discovery of external api docs. I think we can add as tool(s) for connector builder MCP.
Thread URL: https://airbytehq-team.slack.com/archives/D089P0UPVT4/p1763524294126809

@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@coderabbitai

coderabbitai bot commented Nov 19, 2025

Warning

Rate limit exceeded

@devin-ai-integration[bot] has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 25 minutes and 52 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between fdab738 and 4e30940.

📒 Files selected for processing (2)
  • connector_builder_mcp/_external_docs_suggester.py (1 hunks)
  • connector_builder_mcp/mcp/guidance.py (2 hunks)
📝 Walkthrough

Adds an external documentation URL suggester module, registers it as a GUIDANCE tool (duplicate definition present), adds unit tests, and introduces runtime HTML-parsing and search dependencies. The suggester discovers, validates, enriches, scores, canonicalizes, deduplicates, and returns documentation URL candidates.

Changes

Cohort / File(s): Summary

  • Core implementation (connector_builder_mcp/_external_docs_suggester.py): Adds URLCandidate dataclass and suggest_external_documentation_urls() public API. Implements vendor-domain derivation (explicit → api_base_url → DuckDuckGo), DuckDuckGo search fallback, URL validation (HEAD → GET, follow redirects), login detection, title enrichment (og:title/twitter:title/h1/title), URL canonicalization, scoring (domain origin, path/category patterns, locale/version heuristics, fragment handling, keyword alignment), category-driven search patterns, deduplication (_are_urls_similar), and result formatting.
  • Guidance integration (connector_builder_mcp/mcp/guidance.py): Imports suggester and registers a GUIDANCE-domain tool suggest_external_documentation_urls(...) with parameter validation, logging, and error handling. Patch contains two identical tool definitions (duplicate).
  • Dependencies (pyproject.toml): Adds runtime dependencies: beautifulsoup4>=4.12.0,<5.0, duckduckgo-search>=7.0.0,<8.0, lxml>=5.0.0,<6.0; adds deptry per-rule ignore DEP002 = ["lxml"].
  • Tests (tests/unit/test_external_docs_suggester.py): Adds unit tests covering vendor-domain derivation, official-domain checks, scoring heuristics, canonicalization, title cleaning/generation, and URL similarity detection for internal helpers.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant Guidance
    participant Suggester
    participant WebSearch
    participant Validator
    participant Enricher

    User->>Guidance: suggest_external_documentation_urls(api_name, ...)
    Guidance->>Suggester: invoke suggester API

    rect rgb(245,250,255)
    Suggester->>Suggester: derive vendor_domain (explicit → api_base_url → DuckDuckGo)
    Suggester->>Suggester: select categories (allowed_types or defaults)
    end

    loop per category
        Suggester->>WebSearch: search category-specific queries
        WebSearch-->>Suggester: candidate URLs
        loop per candidate
            Suggester->>Validator: validate URL (HEAD → fallback GET, follow redirects)
            Validator-->>Suggester: (valid?, requires_login, final_url)
            Suggester->>Enricher: fetch page & extract title metadata
            Enricher-->>Suggester: enriched title
            Suggester->>Suggester: canonicalize URL, compute score
        end
    end

    Suggester->>Suggester: deduplicate, sort, limit per category
    Suggester-->>Guidance: return list[dict]{title,url,type,requiresLogin}
    Guidance-->>User: results

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Review focus:
    • Scoring heuristics and thresholds in _score_url.
    • URL validation flow (HEAD vs GET, redirects) and login detection.
    • Vendor-domain derivation (api_base_url parsing, DuckDuckGo fallback).
    • Duplicate GUIDANCE tool definition in guidance.py.
    • Tests covering canonicalization and URL similarity edge cases.

Suggested reviewers

  • pedroslopez
  • pnilan
  • dbgold17

@github-actions github-actions bot added the enhancement New feature or request label Nov 19, 2025
@github-actions

👋 Welcome to the Airbyte Connector Builder MCP!

Thank you for your contribution! Here are some helpful tips and reminders for your convenience.

Testing This Branch via MCP

To test the changes in this specific branch with an MCP client like Claude Desktop, use the following configuration:

{
  "mcpServers": {
    "connector-builder-mcp-dev": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1763528159-add-external-docs-suggester", "connector-builder-mcp"]
    }
  }
}

Testing This Branch via CLI

You can test this version of the MCP Server using the following CLI snippet:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1763528159-add-external-docs-suggester#egg=airbyte-connector-builder-mcp' --help

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /build-connector - Builds the default connector on-demand using the AI builder
  • /build-connector prompt="<your prompt>" - Builds a connector on-demand using the AI builder
  • /poe <command> - Runs any poe command in the uv virtual environment

AI Builder Evaluations

AI builder evaluations run automatically under the following conditions:

  • When a PR is marked as "ready for review"
  • When a PR is reopened

A set of standardized evaluations also run on a schedule (Mon/Wed/Fri at midnight UTC) and can be manually triggered via workflow dispatch.

Helpful Resources

If you have any questions, feel free to ask in the PR comments or join our Slack community.

@github-actions

github-actions bot commented Nov 19, 2025

PyTest Results (Full)

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
0 files   ±0   0 ❌ ±0 

Results for commit 4e30940. ± Comparison against base commit 0750ae6.

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Nov 19, 2025

PyTest Results (Fast)

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
0 files   ±0   0 ❌ ±0 

Results for commit 4e30940. ± Comparison against base commit 0750ae6.

♻️ This comment has been updated with latest results.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
connector_builder_mcp/mcp/guidance.py (1)

16-18: Wire-up to _external_docs_suggester looks good; consider tightening tool metadata and docs

The wrapper correctly mirrors the underlying suggest_external_documentation_urls API, adds logging, and returns an empty list on failure as intended. Two small follow-ups to consider:

  1. mcp_tool annotations
    Since this tool issues external HTTP/search calls but is otherwise read-only and idempotent, it would be clearer to annotate it explicitly:
-@mcp_tool(
-    domain=ToolDomain.GUIDANCE,
-)
+@mcp_tool(
+    domain=ToolDomain.GUIDANCE,
+    read_only=True,
+    idempotent=True,
+    open_world=True,
+)

This helps downstream orchestration reason about side effects and external reach.

  2. Valid allowed_types vs. implementation
    The docstring advertises developer_community as a valid type, but CATEGORY_SEARCH_PATTERNS in _external_docs_suggester has no entry for that key, so such requests will always come back empty. Either add a developer_community entry to CATEGORY_SEARCH_PATTERNS or drop it from the “Valid types” list here to avoid confusion.

Also applies to: 282-374

tests/unit/test_external_docs_suggester.py (1)

1-284: Good, focused coverage of the pure heuristics; consider mocking the networked bits later

These tests do a nice job exercising the core heuristics (_derive_vendor_domain, domain checks, scoring, canonicalization, title cleaning, canonical title generation, URL similarity) with realistic inputs while avoiding real network calls.

If you want to harden this further in the future, the main gap is around _validate_url and _enrich_title. Those could be covered with requests and DDGS patched out (e.g., faking status codes, redirects, and simple HTML snippets) so behavior on 2xx/3xx/4xx and title extraction is locked in without hitting the network.

connector_builder_mcp/_external_docs_suggester.py (1)

33-66: Search/scoring pipeline is solid; consider DDGS reuse and explicit category typing

The end-to-end flow (derive vendor domain → per-category DDG search → URL validation → scoring → enrichment → deduplication) looks well thought out and the heuristics line up with the tests.

A couple of non-blocking refinements you might want to fold in:

  1. Reuse DDGS instance and centralize rate limiting
    _derive_vendor_domain and _search_for_category each construct DDGS() instances and _search_for_category calls .text in a loop with time.sleep(0.5) after each call. For connectors that query many categories, this can add noticeable latency and spin up several sessions. Consider creating a single DDGS instance per top-level suggest_external_documentation_urls call (e.g., pass it into _derive_vendor_domain and _search_for_category) and implementing rate limiting/backoff in one place.

  2. Leverage DOCUMENTATION_TYPES for allowed_types/categories
    You’ve defined DOCUMENTATION_TYPES but allowed_types is still typed as list[str] | None, and CATEGORY_SEARCH_PATTERNS doesn’t include all of the literal values (e.g., developer_community). Wiring allowed_types and category variables to DOCUMENTATION_TYPES would give you static checking on category names and make it harder to accidentally advertise a type that isn’t actually searched over.

None of this is urgent, but it would make the new tool a bit more robust and self-documenting.

Also applies to: 313-378, 381-442
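
One way to centralize the rate limiting described in point 1 is a small limiter object created once per top-level call and shared by every search helper, in place of scattered time.sleep(0.5) calls. This is a sketch only: MinIntervalLimiter is a hypothetical name, and the clock and sleep parameters exist so the behavior can be tested without real delays.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Blocks so that consecutive wait() calls end at least min_interval seconds apart."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        """Sleep only for the part of the interval that has not already elapsed."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self._min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Under this design, suggest_external_documentation_urls would build one limiter (and one search client) and pass both into _derive_vendor_domain and _search_for_category, each of which calls limiter.wait() immediately before searching.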

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0750ae6 and 8268626.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • connector_builder_mcp/_external_docs_suggester.py (1 hunks)
  • connector_builder_mcp/mcp/guidance.py (2 hunks)
  • pyproject.toml (1 hunks)
  • tests/unit/test_external_docs_suggester.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-10T17:09:05.986Z
Learnt from: aaronsteers
Repo: airbytehq/connector-builder-mcp PR: 116
File: .github/workflows/ci-tests.yml:110-113
Timestamp: 2025-10-10T17:09:05.986Z
Learning: The connector-builder-mcp repository at airbytehq/connector-builder-mcp is a private/internal repository that does not accept external contributions or forks.

Applied to files:

  • connector_builder_mcp/mcp/guidance.py
🧬 Code graph analysis (3)
tests/unit/test_external_docs_suggester.py (1)
connector_builder_mcp/_external_docs_suggester.py (7)
  • _are_urls_similar (445-456)
  • _canonicalize_url (174-181)
  • _clean_title (277-290)
  • _derive_vendor_domain (88-118)
  • _generate_canonical_title (293-310)
  • _is_official_domain (121-133)
  • _score_url (136-171)
connector_builder_mcp/mcp/guidance.py (2)
connector_builder_mcp/_external_docs_suggester.py (1)
  • suggest_external_documentation_urls (381-442)
connector_builder_mcp/mcp/_mcp_utils.py (2)
  • mcp_tool (102-149)
  • ToolDomain (49-80)
connector_builder_mcp/_external_docs_suggester.py (1)
connector_builder_mcp/mcp/guidance.py (1)
  • suggest_external_documentation_urls (285-374)
🪛 GitHub Actions: Linters
connector_builder_mcp/_external_docs_suggester.py

[error] 246-246: Mypy type error: Item "AttributeValueList" of "str | AttributeValueList" has no attribute "strip" [union-attr]


[error] 252-252: Mypy type error: Item "AttributeValueList" of "str | AttributeValueList" has no attribute "strip" [union-attr]

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
🔇 Additional comments (1)
pyproject.toml (1)

12-16: New HTML/search dependencies look appropriate for the docs suggester

The added beautifulsoup4, duckduckgo-search, and lxml dependencies align with the new _external_docs_suggester module and have sensible <next-major caps. Just confirm these bounds match what you’re running in the target environment (container/base image) so you don’t hit unexpected resolver conflicts.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
connector_builder_mcp/_external_docs_suggester.py (1)

244-258: Incomplete fix for meta content attribute access.

The past review comment identified mypy union-attr errors on lines 246 and 252. While isinstance(content, str) checks have been added (lines 247, 255), the implementation is incomplete:

  1. Lines 246 and 254 still use bracket notation ["content"] which doesn't satisfy mypy's type narrowing.
  2. The list[str] case is silently ignored rather than handled.

Although og:title and twitter:title are rarely multi-valued in practice, the code should either:

  • Follow the suggested fix in the past review to handle both str and list cases explicitly, OR
  • Use .get("content", "") with isinstance checks and document why lists are skipped.

Apply the fix from the past review comment (see existing comments) or use this simpler alternative:

         og_title_tag = soup.find("meta", property="og:title")
-        if og_title_tag and og_title_tag.get("content"):
-            content = og_title_tag["content"]
+        if og_title_tag:
+            content = og_title_tag.get("content", "")
             if isinstance(content, str):
                 title = content.strip()
                 if title:
                     return _clean_title(title), "og_title"

         twitter_title_tag = soup.find("meta", attrs={"name": "twitter:title"})
-        if twitter_title_tag and twitter_title_tag.get("content"):
-            content = twitter_title_tag["content"]
+        if twitter_title_tag:
+            content = twitter_title_tag.get("content", "")
             if isinstance(content, str):
                 title = content.strip()
                 if title:
                     return _clean_title(title), "twitter_title"
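
If the list-valued case should be handled rather than skipped, a small helper can normalize both shapes. The sketch below assumes the multi-valued attribute behaves like a list of strings; first_text is a hypothetical name, not part of the PR.

```python
from typing import Optional, Union


def first_text(value: Union[str, list[str], None]) -> Optional[str]:
    """Return the first non-empty, stripped string from a str-or-list attribute value."""
    if value is None:
        return None
    candidates = [value] if isinstance(value, str) else list(value)
    for candidate in candidates:
        if isinstance(candidate, str) and candidate.strip():
            return candidate.strip()
    return None
```

With this in place, each meta-tag branch reduces to title = first_text(tag.get("content")) followed by a single truthiness check, which should also give mypy one narrowed type to work with.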
🧹 Nitpick comments (3)
connector_builder_mcp/_external_docs_suggester.py (3)

67-73: Clarify GitHub denylist logic or implement the exception.

The comment on line 72 states "Unless it's the vendor's org", but the _is_official_domain function (lines 121-134) doesn't implement any special handling for vendor GitHub repositories. Either remove the comment or add logic to allow vendor-owned GitHub repos (e.g., checking if the path starts with the vendor's org name).
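
A sketch of what the vendor-org exception could look like, assuming the vendor's GitHub organization name is known to the caller. The function name and the vendor_org parameter are hypothetical; the PR does not define either.

```python
from urllib.parse import urlparse


def is_vendor_github_repo(url: str, vendor_org: str) -> bool:
    """True when the URL is on github.com and its first path segment is the vendor's org."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host not in {"github.com", "www.github.com"}:
        return False
    segments = [s for s in parsed.path.split("/") if s]
    return bool(segments) and segments[0].lower() == vendor_org.lower()
```

The denylist check would then skip github.com only when this returns False. Guessing vendor_org from api_name is a separate heuristic and would need its own validation.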


103-116: Add rate limiting to DuckDuckGo search.

The DuckDuckGo search on line 104 lacks rate limiting, while _search_for_category (line 345) includes a 0.5s delay. Given the PR notes about aggressive DuckDuckGo rate limits (~1 req/s), consider adding a similar delay here or centralizing rate limiting logic.

     try:
         results = DDGS().text(f"{api_name} developer docs", max_results=3)
+        time.sleep(0.5)  # Rate limiting for DuckDuckGo
         if results:

449-460: Consider more precise URL similarity logic.

Lines 457-460 treat URLs as similar if one path is a prefix of the other (e.g., /docs and /docs/api/reference are considered similar). This could cause over-deduplication if a vendor has both a general docs page and specific sub-pages that should both be returned. Consider comparing only the first N path segments or using exact path matching.

Example case to consider:

  • /api (general API docs) vs. /api/reference (specific API reference) — currently treated as duplicates, but user might want both.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8268626 and 5fd320a.

📒 Files selected for processing (2)
  • connector_builder_mcp/_external_docs_suggester.py (1 hunks)
  • pyproject.toml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (1)
connector_builder_mcp/_external_docs_suggester.py (1)
connector_builder_mcp/mcp/guidance.py (1)
  • suggest_external_documentation_urls (285-374)
🔇 Additional comments (1)
connector_builder_mcp/_external_docs_suggester.py (1)

341-346: Verify rate limiting is sufficient for DuckDuckGo.

The PR notes indicate DuckDuckGo has aggressive rate limits (~1 req/s), but line 345 uses time.sleep(0.5). With multiple categories and up to 2 patterns per category, this could result in ~2 req/s, potentially triggering rate limits. Consider increasing the delay to 1.0 second or implementing exponential backoff as suggested in the PR notes.

Based on PR notes about rate limiting concerns.
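
The exponential backoff option could be sketched as a generic retry wrapper around each search call. Names and defaults here are assumptions; in particular, a real implementation would likely catch only the search library's rate-limit error rather than every Exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, sleeping base_delay * 2**attempt after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # assumption: narrow this to the rate-limit error in practice
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

With the defaults, a call that keeps failing waits 1s, 2s, then 4s before the final attempt's exception propagates, so the worst case adds about 7 seconds per query.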

assert result == expected


class TestGenerateCanonicalTitle:
Contributor Author

Useless class. Never do this. Just use bare test functions. Bare test functions require less context parsing in order to grok the test itself.

Contributor

Done! Converted all test classes to bare functions. All 29 tests still pass and lint checks are clean.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
connector_builder_mcp/_external_docs_suggester.py (4)

88-118: Consider a more graceful fallback when deriving vendor_domain from search

The heuristic to prefer domains with PREFERRED_SUBDOMAINS makes sense, but when no such domain is found you currently fall all the way back to api_name.lower().replace(" ", "").replace("-", ""), which often isn’t a valid domain. You may want to keep the first successful search domain as a secondary fallback before dropping to the name-only heuristic, to avoid site:stripe / status.stripe-style queries that will never validate.
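
The three-tier fallback suggested here could look roughly like the sketch below. PREFERRED_SUBDOMAINS is a hypothetical stand-in for the PR's constant, pick_vendor_domain is an illustrative name, and the function returns the matched host as-is, which may differ from how the real helper normalizes domains.

```python
from urllib.parse import urlparse

# Hypothetical values; the PR's PREFERRED_SUBDOMAINS may differ.
PREFERRED_SUBDOMAINS = ("docs.", "developer.", "developers.", "api.")


def pick_vendor_domain(result_urls: list[str], api_name: str) -> str:
    """Prefer a docs-like host, then the first search host, then the name-only guess."""
    hosts = [h for h in (urlparse(u).hostname for u in result_urls) if h]
    for host in hosts:
        if host.startswith(PREFERRED_SUBDOMAINS):
            return host
    if hosts:
        return hosts[0]  # secondary fallback: a real host from the search results
    return api_name.lower().replace(" ", "").replace("-", "")
```

The name-only guess is reached only when the search returned nothing usable, so later site: queries start from a real host whenever one is available.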


136-172: Scoring heuristics look solid; consider coverage for all documentation types

The scoring logic (vendor domain boost, preferred subdomains, path patterns, fragment and locale penalties, title keyword bonus) is well structured and easy to tune. Note that only categories present in URL_PATH_PATTERNS / CATEGORY_SEARCH_PATTERNS benefit from these boosts; if you intend to support types like "developer_community" or "other" consistently, adding patterns for them here would improve result quality.


317-383: Search flow is clear; consider minor ergonomics and performance tweaks

The category search logic is easy to follow and respects the DDG rate-limiting concerns (0.5s sleep, max two patterns per category). A few optional improvements you might consider:

  • Reuse a single DDGS() instance across calls (e.g., passed into _search_for_category or created once at module level) instead of constructing a new one per query, to reduce overhead and centralize any future configuration.
  • For categories not present in CATEGORY_SEARCH_PATTERNS, _search_for_category silently returns no candidates. If you expect callers to pass such categories (e.g., "developer_community"), emitting a debug log when search_patterns is empty could make debugging easier.

Functionally this is fine as-is; these would just smooth operational behavior.


449-460: URL similarity heuristic may over-deduplicate in some edge cases

_are_urls_similar treats URLs as similar when one path is a prefix of the other (path1.startswith(path2) or vice versa). This works well for many docs patterns (/docs/api vs /docs/api/overview), but could over-deduplicate cases like /docs/api vs /docs/api-console where the shared prefix doesn’t imply “same page”. If you run into false positives in practice, you could tighten this to compare by path segments (splitting on /) rather than raw string prefixes.
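
A sketch of the segment-based comparison, using only the standard library. paths_share_segment_prefix is a hypothetical name, and the host check is an assumption about what the existing helper already does.

```python
from urllib.parse import urlparse


def _segments(url: str) -> list[str]:
    """Split a URL path into its non-empty segments."""
    return [s for s in urlparse(url).path.split("/") if s]


def paths_share_segment_prefix(url1: str, url2: str) -> bool:
    """True when, on the same host, one URL's path segments are a prefix of the other's."""
    if urlparse(url1).netloc.lower() != urlparse(url2).netloc.lower():
        return False
    a, b = _segments(url1), _segments(url2)
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    return longer[: len(shorter)] == shorter
```

This separates /docs/api from /docs/api-console but still treats /docs/api and /docs/api/overview as similar. A root path has no segments and therefore matches everything on the same host, so a minimum segment count may be worth adding.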

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a119292 and fdab738.

📒 Files selected for processing (2)
  • connector_builder_mcp/_external_docs_suggester.py (1 hunks)
  • tests/unit/test_external_docs_suggester.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/test_external_docs_suggester.py
🧰 Additional context used
🧬 Code graph analysis (1)
connector_builder_mcp/_external_docs_suggester.py (1)
connector_builder_mcp/mcp/guidance.py (1)
  • suggest_external_documentation_urls (285-374)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (4)
connector_builder_mcp/_external_docs_suggester.py (4)

76-86: URLCandidate dataclass is clear and well-scoped

The URLCandidate dataclass cleanly captures all the metadata you need for scoring, enrichment, and output mapping. Field naming is consistent with the public return shape, which keeps the mapping at the end of suggest_external_documentation_urls straightforward.


174-181: URL canonicalization is conservative and safe

The locale-stripping and trailing-slash handling are conservative and should avoid breaking most docs URLs while improving deduplication. This looks good as a first-pass canonicalizer.


281-315: Title cleaning and canonical title generation are well-structured

The suffix regexes and whitespace normalization in _clean_title are pragmatic and should handle most “... | Docs / Developer” patterns. _generate_canonical_title is now consistently Title-Cased and covers all documented categories, with a sensible fallback for unknown ones.


385-447: Top-level suggester behavior and error handling look solid

The main suggest_external_documentation_urls entry point wires together domain derivation, category search, sorting, deduplication, and final shaping cleanly. Grouping results by category before enforcing max_results_per_type and running _are_urls_similar gives predictable per-type output. Returning an empty list on errors (with callers doing the logging/exception handling) matches the intended GUIDANCE tool contract.

Comment on lines +184 to +225
def _validate_url(url: str) -> tuple[bool, bool, str]:
    """Validate URL and check if it requires login.
    Returns:
        (is_valid, requires_login, final_url)
    """
    try:
        response = requests.head(
            url, allow_redirects=True, timeout=10, headers={"User-Agent": "Mozilla/5.0"}
        )

        final_url = response.url if hasattr(response, "url") else url

        if response.status_code in [200, 301, 302, 303]:
            return True, False, final_url
        elif response.status_code in [401, 403]:
            return True, True, final_url
        else:
            return False, False, url

    except requests.exceptions.RequestException:
        try:
            response = requests.get(
                url,
                allow_redirects=True,
                timeout=10,
                stream=True,
                headers={"User-Agent": "Mozilla/5.0"},
            )
            final_url = response.url if hasattr(response, "url") else url

            if response.status_code in [200, 301, 302, 303]:
                return True, False, final_url
            elif response.status_code in [401, 403]:
                return True, True, final_url
            else:
                return False, False, url

        except Exception as e:
            logger.warning(f"Failed to validate URL {url}: {e}")
            return False, False, url

⚠️ Potential issue | 🟠 Major

Tighten connection handling and HEAD/GET fallback in _validate_url

The validation logic is reasonable, but there are a couple of robustness improvements worth considering:

  • You use requests.get(..., stream=True, ...) in the fallback without explicitly closing the response or ensuring the body is fully consumed. While CPython will usually clean this up, it’s safer to either drop stream=True (since you only need status_code/url) or use a with requests.get(...) as response: block to guarantee connection release.
  • Some APIs return 405 for HEAD on docs endpoints that are perfectly valid for GET. Right now that path is treated as invalid and never re-tried with GET because the GET is only issued on RequestException. You might want to treat specific non-2xx/3xx codes from HEAD (like 405) as a signal to fall back to GET as well.

These changes would reduce the risk of connection pool exhaustion and increase validation accuracy for more finicky servers.
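
Both suggestions can be sketched together. The version below takes an injected client exposing head and get (requests.Session has methods of this shape), which keeps the example testable without network access. The RETRY_WITH_GET set and all names are assumptions, not the PR's code.

```python
from typing import Any, Protocol


class HttpClient(Protocol):
    """Minimal client shape; only head() and get() are used here."""

    def head(self, url: str, **kwargs: Any) -> Any: ...
    def get(self, url: str, **kwargs: Any) -> Any: ...


RETRY_WITH_GET = {405, 501}  # assumption: statuses suggesting HEAD is unsupported


def classify(status_code: int) -> tuple[bool, bool]:
    """Map a status code to (is_valid, requires_login), mirroring the PR's status sets."""
    if status_code in (200, 301, 302, 303):
        return True, False
    if status_code in (401, 403):
        return True, True
    return False, False


def validate_url(client: HttpClient, url: str, timeout: float = 10.0) -> tuple[bool, bool, str]:
    """Try HEAD first; fall back to GET on errors or HEAD-unsupported statuses."""
    try:
        response = client.head(url, allow_redirects=True, timeout=timeout)
        needs_get = response.status_code in RETRY_WITH_GET
    except Exception:
        needs_get = True
    if needs_get:
        try:
            # No stream=True: only status_code and url are needed.
            response = client.get(url, allow_redirects=True, timeout=timeout)
        except Exception:
            return False, False, url
    is_valid, requires_login = classify(response.status_code)
    return is_valid, requires_login, (response.url if is_valid else url)
```

Passing a single session into every validation call would also let connections be pooled and closed in one place.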

Comment on lines +227 to +279
def _enrich_title(url: str, search_title: str, category: str, vendor_name: str) -> tuple[str, str]:
    """Enrich title by fetching page metadata.
    Returns:
        (enriched_title, title_source)
    """
    try:
        response = requests.get(
            url,
            timeout=10,
            stream=True,
            headers={"User-Agent": "Mozilla/5.0"},
        )

        content = response.raw.read(50000)
        soup = BeautifulSoup(content, "lxml")

        og_title_tag = soup.find("meta", property="og:title")
        if og_title_tag and og_title_tag.get("content"):
            content = og_title_tag["content"]
            if isinstance(content, str):
                title = content.strip()
                if title:
                    return _clean_title(title), "og_title"

        twitter_title_tag = soup.find("meta", attrs={"name": "twitter:title"})
        if twitter_title_tag and twitter_title_tag.get("content"):
            content = twitter_title_tag["content"]
            if isinstance(content, str):
                title = content.strip()
                if title:
                    return _clean_title(title), "twitter_title"

        h1_tag = soup.find("h1")
        if h1_tag:
            title = h1_tag.get_text().strip()
            if title:
                return _clean_title(title), "h1"

        title_tag = soup.find("title")
        if title_tag:
            title = title_tag.get_text().strip()
            if title:
                return _clean_title(title), "html_title"

    except Exception as e:
        logger.debug(f"Failed to enrich title for {url}: {e}")

    if search_title:
        return _clean_title(search_title), "search_result"

    return _generate_canonical_title(vendor_name, category), "canonical"

Contributor

⚠️ Potential issue | 🟠 Major

Improve streaming usage and response handling in _enrich_title

The enrichment flow and precedence order (og → twitter → h1 → title → search → canonical) are good, but the HTTP handling can be made safer and simpler:

  • Similar to _validate_url, you call requests.get(..., stream=True, ...) and then response.raw.read(50000) without closing the response explicitly. Consider using with requests.get(..., stream=False, ...) as response: and then content = response.content[:50000] (or response.text) to let requests manage decompression and connection reuse.
  • Reusing the name content for both the HTML payload and meta tag attribute values can be a bit confusing when reading the function. Renaming the HTML buffer (e.g. html_bytes) would improve clarity.

Example of the first change:

-        response = requests.get(
-            url,
-            timeout=10,
-            stream=True,
-            headers={"User-Agent": "Mozilla/5.0"},
-        )
-
-        content = response.raw.read(50000)
-        soup = BeautifulSoup(content, "lxml")
+        with requests.get(
+            url,
+            timeout=10,
+            headers={"User-Agent": "Mozilla/5.0"},
+        ) as response:
+            content = response.content[:50000]
+        soup = BeautifulSoup(content, "lxml")

This keeps the rest of the parsing logic unchanged while making the HTTP lifecycle more robust.
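
The precedence order and the `html_bytes` rename can also be sketched in isolation. This is an illustrative stand-in, not the PR's code: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies, and the names `_TitleCollector` and `pick_title` are hypothetical. It also skips the `_clean_title` step, which is not shown in this thread.

```python
from html.parser import HTMLParser
from typing import Optional

# Same precedence as described in the review: og -> twitter -> h1 -> title.
PRECEDENCE = ("og_title", "twitter_title", "h1", "html_title")


class _TitleCollector(HTMLParser):
    """Collect the first og:title, twitter:title, <h1>, and <title> values."""

    def __init__(self) -> None:
        super().__init__()
        self.found: dict[str, str] = {}
        self._open_tag: Optional[str] = None  # "h1" or "title" while inside one
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta":
            content = (attr_map.get("content") or "").strip()
            if attr_map.get("property") == "og:title" and content:
                self.found.setdefault("og_title", content)
            elif attr_map.get("name") == "twitter:title" and content:
                self.found.setdefault("twitter_title", content)
        elif tag in ("h1", "title") and self._open_tag is None:
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            text = "".join(self._buffer).strip()
            key = "h1" if tag == "h1" else "html_title"
            if text:
                self.found.setdefault(key, text)
            self._open_tag = None


def pick_title(html_bytes: bytes, search_title: str, canonical_title: str) -> tuple[str, str]:
    """Return (title, title_source) using the precedence order above."""
    parser = _TitleCollector()
    # errors="replace" tolerates a multi-byte character cut by truncation.
    parser.feed(html_bytes.decode("utf-8", errors="replace"))
    parser.close()
    for source in PRECEDENCE:
        if source in parser.found:
            return parser.found[source], source
    if search_title:
        return search_title, "search_result"
    return canonical_title, "canonical"
```

Keeping the HTML payload in `html_bytes` leaves `content` free for meta tag attribute values, which is the readability point raised in the second bullet.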

            le=5,
        ),
    ] = 1,
) -> list[dict]:

Contributor Author

Don't return arbitrary dicts (or lists of dicts) from functions - ESPECIALLY not public MCP tools. The return type needs to be a Pydantic BaseModel - or in this case a list thereof.

Contributor

Done! Switched to Pydantic BaseModel. Created ExternalDocumentationUrl model with these fields:

  • title: str - Title of the documentation page
  • url: str - URL of the documentation page
  • doc_type: DOCUMENTATION_TYPES - Type of documentation (using the existing Literal type)
  • requires_login: bool - Whether authentication is required

Both the internal function and MCP tool now return list[ExternalDocumentationUrl] for type safety. Used snake_case field names per Python conventions, with doc_type instead of type to avoid shadowing the builtin.

Contributor

All code review feedback has been addressed! ✅

Changes made:

  1. ✅ Test classes → bare functions (commit a119292)
  2. ✅ Category names → title case (commit fdab738)
  3. ✅ Return type → Pydantic BaseModel (commits 78bc33c, 63074c9)

The ExternalDocumentationUrl model now provides full type safety with these fields:

  • title: str - Title of the documentation page
  • url: str - URL of the documentation page
  • doc_type: DOCUMENTATION_TYPES - Type of documentation, using the Literal type so that MyPy can check it statically
  • requires_login: bool - Whether authentication is required

I also propagated the DOCUMENTATION_TYPES Literal through the entire codebase (URLCandidate, _search_for_category, _score_url, results_by_category) to ensure full type safety and added filtering logic to validate allowed_types against known categories.

All CI checks are passing including MyPy type checking. Ready for review!
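
For reviewers, the filtering and type-narrowing step described above might look roughly like the sketch below. It is an illustration under stated assumptions: the `DOCUMENTATION_TYPES` definition here is a stand-in that lists only four of the twelve categories (names taken from the PR summary, and the real Literal may differ), and `filter_allowed_types` is a hypothetical helper name.

```python
from typing import Literal, Optional, cast, get_args

# Stand-in for the codebase's Literal (assumption: partial list, names may differ).
DOCUMENTATION_TYPES = Literal[
    "api_reference",
    "api_release_history",
    "rate_limits",
    "authentication_guide",
]

# get_args() returns the Literal's members as a tuple of strings.
KNOWN_TYPES: frozenset[str] = frozenset(get_args(DOCUMENTATION_TYPES))


def filter_allowed_types(
    allowed_types: Optional[list[str]],
) -> list[DOCUMENTATION_TYPES]:
    """Drop unknown category names and narrow the rest to the Literal type."""
    if allowed_types is None:
        # No filter requested: use every known category, in declaration order.
        return [cast(DOCUMENTATION_TYPES, name) for name in get_args(DOCUMENTATION_TYPES)]
    # cast() has no runtime effect; the membership check is what makes it safe.
    return [
        cast(DOCUMENTATION_TYPES, name)
        for name in allowed_types
        if name in KNOWN_TYPES
    ]
```

The `cast` is only there to satisfy the type checker after the runtime membership test, which matches the "cast import for type narrowing after filtering" item in the commit message.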

devin-ai-integration bot and others added 4 commits November 19, 2025 06:04
- Change URLCandidate.category from str to DOCUMENTATION_TYPES
- Update _search_for_category and _score_url signatures to use DOCUMENTATION_TYPES
- Add filtering logic to validate allowed_types against known categories
- Update results_by_category dict to use DOCUMENTATION_TYPES keys
- Add cast import for type narrowing after filtering

This fixes MyPy type error when instantiating ExternalDocumentationUrl model.

Co-Authored-By: AJ Steers <[email protected]>
devin-ai-integration bot changed the title from "feat(guidance): Add suggest_external_documentation_urls tool" to "feat(guidance): Add suggest_external_documentation_urls tool with DuckDuckGo search" on Nov 19, 2025

Labels

enhancement New feature or request
