feat(guidance): Add suggest_external_documentation_urls tool with DuckDuckGo search #186
base: main
Conversation
… search and metadata enrichment

- Add new MCP tool to auto-suggest external documentation URLs for APIs
- Implement DuckDuckGo search with category-specific queries
- Add heuristic pattern matching for URL scoring and categorization
- Implement metadata enrichment (og:title, h1, etc.) for high-quality titles
- Add URL validation and canonicalization
- Include comprehensive unit tests
- Add dependencies: duckduckgo-search, beautifulsoup4, lxml

Co-Authored-By: AJ Steers <[email protected]>
📝 Walkthrough

Adds an external documentation URL suggester module, registers it as a GUIDANCE tool (duplicate definition present), adds unit tests, and introduces runtime HTML-parsing and search dependencies. The suggester discovers, validates, enriches, scores, canonicalizes, deduplicates, and returns documentation URL candidates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Guidance
    participant Suggester
    participant WebSearch
    participant Validator
    participant Enricher
    User->>Guidance: suggest_external_documentation_urls(api_name, ...)
    Guidance->>Suggester: invoke suggester API
    rect rgb(245,250,255)
        Suggester->>Suggester: derive vendor_domain (explicit → api_base_url → DuckDuckGo)
        Suggester->>Suggester: select categories (allowed_types or defaults)
    end
    loop per category
        Suggester->>WebSearch: search category-specific queries
        WebSearch-->>Suggester: candidate URLs
        loop per candidate
            Suggester->>Validator: validate URL (HEAD → fallback GET, follow redirects)
            Validator-->>Suggester: (valid?, requires_login, final_url)
            Suggester->>Enricher: fetch page & extract title metadata
            Enricher-->>Suggester: enriched title
            Suggester->>Suggester: canonicalize URL, compute score
        end
    end
    Suggester->>Suggester: deduplicate, sort, limit per category
    Suggester-->>Guidance: return list[dict]{title,url,type,requiresLogin}
    Guidance-->>User: results
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
👋 Welcome to the Airbyte Connector Builder MCP!

Thank you for your contribution! Here are some helpful tips and reminders for your convenience.

Testing This Branch via MCP

To test the changes in this specific branch with an MCP client like Claude Desktop, use the following configuration:

```json
{
  "mcpServers": {
    "connector-builder-mcp-dev": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1763528159-add-external-docs-suggester", "connector-builder-mcp"]
    }
  }
}
```

Testing This Branch via CLI

You can test this version of the MCP Server using the following CLI snippet:

```shell
# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1763528159-add-external-docs-suggester#egg=airbyte-connector-builder-mcp' --help
```

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

AI Builder Evaluations

AI builder evaluations run automatically under the following conditions:

A set of standardized evaluations also run on a schedule (Mon/Wed/Fri at midnight UTC) and can be manually triggered via workflow dispatch.

Helpful Resources

If you have any questions, feel free to ask in the PR comments or join our Slack community.
…Py errors Co-Authored-By: AJ Steers <[email protected]>
…dependency Co-Authored-By: AJ Steers <[email protected]>
Actionable comments posted: 1
🧹 Nitpick comments (3)
connector_builder_mcp/mcp/guidance.py (1)
16-18: Wire-up to `_external_docs_suggester` looks good; consider tightening tool metadata and docs

The wrapper correctly mirrors the underlying `suggest_external_documentation_urls` API, adds logging, and returns an empty list on failure as intended. Two small follow-ups to consider:

- `mcp_tool` annotations
Since this tool issues external HTTP/search calls but is otherwise read-only and idempotent, it would be clearer to annotate it explicitly:

```diff
-@mcp_tool(
-    domain=ToolDomain.GUIDANCE,
-)
+@mcp_tool(
+    domain=ToolDomain.GUIDANCE,
+    read_only=True,
+    idempotent=True,
+    open_world=True,
+)
```

This helps downstream orchestration reason about side effects and external reach.

- Valid `allowed_types` vs. implementation
The docstring advertises `developer_community` as a valid type, but `CATEGORY_SEARCH_PATTERNS` in `_external_docs_suggester` has no entry for that key, so such requests will always come back empty. Either add a `developer_community` entry to `CATEGORY_SEARCH_PATTERNS` or drop it from the "Valid types" list here to avoid confusion.

Also applies to: 282-374
tests/unit/test_external_docs_suggester.py (1)

1-284: Good, focused coverage of the pure heuristics; consider mocking the networked bits later

These tests do a nice job exercising the core heuristics (`_derive_vendor_domain`, domain checks, scoring, canonicalization, title cleaning, canonical title generation, URL similarity) with realistic inputs while avoiding real network calls.

If you want to harden this further in the future, the main gap is around `_validate_url` and `_enrich_title`. Those could be covered with `requests` and `DDGS` patched out (e.g., faking status codes, redirects, and simple HTML snippets) so behavior on 2xx/3xx/4xx and title extraction is locked in without hitting the network.

connector_builder_mcp/_external_docs_suggester.py (1)
33-66: Search/scoring pipeline is solid; consider DDGS reuse and explicit category typing

The end-to-end flow (derive vendor domain → per-category DDG search → URL validation → scoring → enrichment → deduplication) looks well thought out, and the heuristics line up with the tests.

A couple of non-blocking refinements you might want to fold in:

- Reuse DDGS instance and centralize rate limiting
`_derive_vendor_domain` and `_search_for_category` each construct `DDGS()` instances, and `_search_for_category` calls `.text` in a loop with `time.sleep(0.5)` after each call. For connectors that query many categories, this can add noticeable latency and spin up several sessions. Consider creating a single `DDGS` instance per top-level `suggest_external_documentation_urls` call (e.g., pass it into `_derive_vendor_domain` and `_search_for_category`) and implementing rate limiting/backoff in one place.

- Leverage `DOCUMENTATION_TYPES` for `allowed_types`/categories
You've defined `DOCUMENTATION_TYPES`, but `allowed_types` is still typed as `list[str] | None`, and `CATEGORY_SEARCH_PATTERNS` doesn't include all of the literal values (e.g., `developer_community`). Wiring `allowed_types` and category variables to `DOCUMENTATION_TYPES` would give you static checking on category names and make it harder to accidentally advertise a type that isn't actually searched over.

None of this is urgent, but it would make the new tool a bit more robust and self-documenting.

Also applies to: 313-378, 381-442
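The `DOCUMENTATION_TYPES` wiring suggested above could look roughly like this. This is a minimal sketch; the actual literal values are not fully listed in this thread, so the category names (and the `filter_allowed_types` helper) are assumptions, not the PR's code:

```python
from typing import Literal, get_args

# Hypothetical category literals -- stand-in for the real DOCUMENTATION_TYPES.
DocumentationType = Literal[
    "api_reference", "getting_started", "authentication",
    "rate_limits", "changelog",
]

# Derive the runtime category set from the Literal so the two cannot drift.
KNOWN_CATEGORIES = frozenset(get_args(DocumentationType))


def filter_allowed_types(allowed_types):
    """Keep only categories the search patterns actually cover."""
    if allowed_types is None:
        return sorted(KNOWN_CATEGORIES)
    return [t for t in allowed_types if t in KNOWN_CATEGORIES]
```

Because the runtime set is derived from the `Literal` via `get_args`, a category advertised in the docstring but missing from the type (like `developer_community`) is filtered out rather than silently searched with no patterns.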
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (4)
- connector_builder_mcp/_external_docs_suggester.py (1 hunks)
- connector_builder_mcp/mcp/guidance.py (2 hunks)
- pyproject.toml (1 hunks)
- tests/unit/test_external_docs_suggester.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-10T17:09:05.986Z
Learnt from: aaronsteers
Repo: airbytehq/connector-builder-mcp PR: 116
File: .github/workflows/ci-tests.yml:110-113
Timestamp: 2025-10-10T17:09:05.986Z
Learning: The connector-builder-mcp repository at airbytehq/connector-builder-mcp is a private/internal repository that does not accept external contributions or forks.
Applied to files:
connector_builder_mcp/mcp/guidance.py
🧬 Code graph analysis (3)
tests/unit/test_external_docs_suggester.py (1)
connector_builder_mcp/_external_docs_suggester.py (7)
- _are_urls_similar (445-456)
- _canonicalize_url (174-181)
- _clean_title (277-290)
- _derive_vendor_domain (88-118)
- _generate_canonical_title (293-310)
- _is_official_domain (121-133)
- _score_url (136-171)
connector_builder_mcp/mcp/guidance.py (2)
connector_builder_mcp/_external_docs_suggester.py (1)
- suggest_external_documentation_urls (381-442)

connector_builder_mcp/mcp/_mcp_utils.py (2)

- mcp_tool (102-149)
- ToolDomain (49-80)
connector_builder_mcp/_external_docs_suggester.py (1)
connector_builder_mcp/mcp/guidance.py (1)
- suggest_external_documentation_urls (285-374)
🪛 GitHub Actions: Linters
connector_builder_mcp/_external_docs_suggester.py
[error] 246-246: Mypy type error: Item "AttributeValueList" of "str | AttributeValueList" has no attribute "strip" [union-attr]
[error] 252-252: Mypy type error: Item "AttributeValueList" of "str | AttributeValueList" has no attribute "strip" [union-attr]
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
🔇 Additional comments (1)
pyproject.toml (1)
12-16: New HTML/search dependencies look appropriate for the docs suggester

The added `beautifulsoup4`, `duckduckgo-search`, and `lxml` dependencies align with the new `_external_docs_suggester` module and have sensible `<next-major` caps. Just confirm these bounds match what you're running in the target environment (container/base image) so you don't hit unexpected resolver conflicts.
Actionable comments posted: 0
♻️ Duplicate comments (1)
connector_builder_mcp/_external_docs_suggester.py (1)
244-258: Incomplete fix for meta content attribute access.

The past review comment identified mypy `union-attr` errors on lines 246 and 252. While `isinstance(content, str)` checks have been added (lines 247, 255), the implementation is incomplete:

- Lines 246 and 254 still use bracket notation `["content"]`, which doesn't satisfy mypy's type narrowing.
- The `list[str]` case is silently ignored rather than handled.

Although `og:title` and `twitter:title` are rarely multi-valued in practice, the code should either:

- Follow the suggested fix in the past review to handle both `str` and `list` cases explicitly, OR
- Use `.get("content", "")` with `isinstance` checks and document why lists are skipped.

Apply the fix from the past review comment (see existing comments) or use this simpler alternative:

```diff
     og_title_tag = soup.find("meta", property="og:title")
-    if og_title_tag and og_title_tag.get("content"):
-        content = og_title_tag["content"]
+    if og_title_tag:
+        content = og_title_tag.get("content", "")
         if isinstance(content, str):
             title = content.strip()
             if title:
                 return _clean_title(title), "og_title"

     twitter_title_tag = soup.find("meta", attrs={"name": "twitter:title"})
-    if twitter_title_tag and twitter_title_tag.get("content"):
-        content = twitter_title_tag["content"]
+    if twitter_title_tag:
+        content = twitter_title_tag.get("content", "")
         if isinstance(content, str):
             title = content.strip()
             if title:
                 return _clean_title(title), "twitter_title"
```
🧹 Nitpick comments (3)
connector_builder_mcp/_external_docs_suggester.py (3)
67-73: Clarify GitHub denylist logic or implement the exception.

The comment on line 72 states "Unless it's the vendor's org", but the `_is_official_domain` function (lines 121-134) doesn't implement any special handling for vendor GitHub repositories. Either remove the comment or add logic to allow vendor-owned GitHub repos (e.g., checking if the path starts with the vendor's org name).

103-116: Add rate limiting to DuckDuckGo search.

The DuckDuckGo search on line 104 lacks rate limiting, while `_search_for_category` (line 345) includes a 0.5s delay. Given the PR notes about aggressive DuckDuckGo rate limits (~1 req/s), consider adding a similar delay here or centralizing the rate-limiting logic.

```diff
     try:
         results = DDGS().text(f"{api_name} developer docs", max_results=3)
+        time.sleep(0.5)  # Rate limiting for DuckDuckGo
         if results:
```
449-460: Consider more precise URL similarity logic.

Lines 457-460 treat URLs as similar if one path is a prefix of the other (e.g., `/docs` and `/docs/api/reference` are considered similar). This could cause over-deduplication if a vendor has both a general docs page and specific sub-pages that should both be returned. Consider comparing only the first N path segments or using exact path matching.

Example case to consider: `/api` (general API docs) vs. `/api/reference` (specific API reference) — currently treated as duplicates, but the user might want both.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- connector_builder_mcp/_external_docs_suggester.py (1 hunks)
- pyproject.toml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (1)
connector_builder_mcp/_external_docs_suggester.py (1)
connector_builder_mcp/mcp/guidance.py (1)
- suggest_external_documentation_urls (285-374)
🔇 Additional comments (1)
connector_builder_mcp/_external_docs_suggester.py (1)
341-346: Verify rate limiting is sufficient for DuckDuckGo.

The PR notes indicate DuckDuckGo has aggressive rate limits (~1 req/s), but line 345 uses `time.sleep(0.5)`. With multiple categories and up to 2 patterns per category, this could result in ~2 req/s, potentially triggering rate limits. Consider increasing the delay to 1.0 second or implementing exponential backoff as suggested in the PR notes.

Based on PR notes about rate-limiting concerns.
```python
    assert result == expected


class TestGenerateCanonicalTitle:
```
Useless class. Never do this. Just use bare test functions. Bare test functions require less context parsing in order to grok the test itself.
Done! Converted all test classes to bare functions. All 29 tests still pass and lint checks are clean.
…back Co-Authored-By: AJ Steers <[email protected]>
Co-Authored-By: AJ Steers <[email protected]>
Actionable comments posted: 2
🧹 Nitpick comments (4)
connector_builder_mcp/_external_docs_suggester.py (4)
88-118: Consider a more graceful fallback when deriving `vendor_domain` from search

The heuristic to prefer domains with `PREFERRED_SUBDOMAINS` makes sense, but when no such domain is found you currently fall all the way back to `api_name.lower().replace(" ", "").replace("-", "")`, which often isn't a valid domain. You may want to keep the first successful search domain as a secondary fallback before dropping to the name-only heuristic, to avoid `site:stripe`/`status.stripe`-style queries that will never validate.
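The secondary-fallback idea could be sketched as below. `derive_vendor_domain` here is a simplified, hypothetical stand-in for the real helper: it takes pre-extracted search-result domains rather than calling `DDGS`, and the preferred-subdomain list is an assumption:

```python
def derive_vendor_domain(api_name, search_domains,
                         preferred=("docs", "developer", "developers", "api")):
    """Pick a vendor domain: preferred subdomain > first search hit > name heuristic."""
    first_hit = None
    for domain in search_domains:
        if first_hit is None:
            first_hit = domain  # remember the first result as a secondary fallback
        if domain.split(".")[0] in preferred:
            return domain       # e.g. docs.stripe.com wins outright
    if first_hit is not None:
        return first_hit        # still a real domain from search results
    # Last resort: the name-only heuristic from the current implementation.
    return api_name.lower().replace(" ", "").replace("-", "")
```

The middle branch is the new part: a real domain from the search results, even without a preferred subdomain, beats a guessed string that may never resolve.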
136-172: Scoring heuristics look solid; consider coverage for all documentation types

The scoring logic (vendor domain boost, preferred subdomains, path patterns, fragment and locale penalties, title keyword bonus) is well structured and easy to tune. Note that only categories present in `URL_PATH_PATTERNS`/`CATEGORY_SEARCH_PATTERNS` benefit from these boosts; if you intend to support types like `"developer_community"` or `"other"` consistently, adding patterns for them here would improve result quality.
317-383: Search flow is clear; consider minor ergonomics and performance tweaks

The category search logic is easy to follow and respects the DDG rate-limiting concerns (0.5s sleep, max two patterns per category). A few optional improvements you might consider:

- Reuse a single `DDGS()` instance across calls (e.g., passed into `_search_for_category` or created once at module level) instead of constructing a new one per query, to reduce overhead and centralize any future configuration.
- For categories not present in `CATEGORY_SEARCH_PATTERNS`, `_search_for_category` silently returns no candidates. If you expect callers to pass such categories (e.g., `"developer_community"`), emitting a debug log when `search_patterns` is empty could make debugging easier.

Functionally this is fine as-is; these would just smooth operational behavior.
449-460: URL similarity heuristic may over-deduplicate in some edge cases

`_are_urls_similar` treats URLs as similar when one path is a prefix of the other (`path1.startswith(path2)` or vice versa). This works well for many docs patterns (`/docs/api` vs `/docs/api/overview`), but could over-deduplicate cases like `/docs/api` vs `/docs/api-console`, where the shared prefix doesn't imply "same page". If you run into false positives in practice, you could tighten this to compare by path segments (splitting on `/`) rather than raw string prefixes.
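The segment-based comparison suggested above might look like this sketch; `are_urls_similar` is a simplified stand-in for the module's `_are_urls_similar`, not its actual code:

```python
from urllib.parse import urlparse


def are_urls_similar(url1, url2):
    """True when one URL's path is a whole-segment prefix of the other's."""
    p1, p2 = urlparse(url1), urlparse(url2)
    if p1.netloc != p2.netloc:
        return False
    seg1 = [s for s in p1.path.split("/") if s]
    seg2 = [s for s in p2.path.split("/") if s]
    shorter, longer = sorted((seg1, seg2), key=len)
    # Compare whole segments: /docs/api prefixes /docs/api/reference,
    # but not /docs/api-console (a raw startswith would say it does).
    return longer[: len(shorter)] == shorter
```

This fixes the `/docs/api` vs `/docs/api-console` false positive; whether `/api` and `/api/reference` should still dedupe (they do here) is the remaining policy question from the earlier comment.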
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- connector_builder_mcp/_external_docs_suggester.py (1 hunks)
- tests/unit/test_external_docs_suggester.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unit/test_external_docs_suggester.py
🧰 Additional context used
🧬 Code graph analysis (1)
connector_builder_mcp/_external_docs_suggester.py (1)
connector_builder_mcp/mcp/guidance.py (1)
- suggest_external_documentation_urls (285-374)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
🔇 Additional comments (4)
connector_builder_mcp/_external_docs_suggester.py (4)
76-86: URLCandidate dataclass is clear and well-scoped

The `URLCandidate` dataclass cleanly captures all the metadata you need for scoring, enrichment, and output mapping. Field naming is consistent with the public return shape, which keeps the mapping at the end of `suggest_external_documentation_urls` straightforward.
174-181: URL canonicalization is conservative and safe

The locale-stripping and trailing-slash handling are conservative and should avoid breaking most docs URLs while improving deduplication. This looks good as a first-pass canonicalizer.
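For illustration, a first-pass canonicalizer of the kind described might look like the following. This is a sketch, not the PR's actual `_canonicalize_url`; the locale regex and the fragment-stripping are assumptions about the behavior discussed:

```python
import re
from urllib.parse import urlparse, urlunparse

# Matches a leading locale segment such as /en or /en-us (but not /docs or /v2).
LOCALE_RE = re.compile(r"^/[a-z]{2}(-[a-z]{2})?(?=/|$)")


def canonicalize_url(url):
    """Drop a leading locale segment, the fragment, and any trailing slash."""
    parts = urlparse(url)
    path = LOCALE_RE.sub("", parts.path) or "/"
    if path != "/" and path.endswith("/"):
        path = path.rstrip("/")
    return urlunparse(parts._replace(path=path, fragment=""))
```

The two-letter-segment regex stays conservative: `/docs/...` and `/v2/...` are untouched, so only plausible locale prefixes are stripped before deduplication.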
281-315: Title cleaning and canonical title generation are well-structured

The suffix regexes and whitespace normalization in `_clean_title` are pragmatic and should handle most "... | Docs / Developer" patterns. `_generate_canonical_title` is now consistently Title-Cased and covers all documented categories, with a sensible fallback for unknown ones.
385-447: Top-level suggester behavior and error handling look solid

The main `suggest_external_documentation_urls` entry point wires together domain derivation, category search, sorting, deduplication, and final shaping cleanly. Grouping results by category before enforcing `max_results_per_type` and running `_are_urls_similar` gives predictable per-type output. Returning an empty list on errors (with callers doing the logging/exception handling) matches the intended GUIDANCE tool contract.
```python
def _validate_url(url: str) -> tuple[bool, bool, str]:
    """Validate URL and check if it requires login.

    Returns:
        (is_valid, requires_login, final_url)
    """
    try:
        response = requests.head(
            url, allow_redirects=True, timeout=10, headers={"User-Agent": "Mozilla/5.0"}
        )

        final_url = response.url if hasattr(response, "url") else url

        if response.status_code in [200, 301, 302, 303]:
            return True, False, final_url
        elif response.status_code in [401, 403]:
            return True, True, final_url
        else:
            return False, False, url

    except requests.exceptions.RequestException:
        try:
            response = requests.get(
                url,
                allow_redirects=True,
                timeout=10,
                stream=True,
                headers={"User-Agent": "Mozilla/5.0"},
            )
            final_url = response.url if hasattr(response, "url") else url

            if response.status_code in [200, 301, 302, 303]:
                return True, False, final_url
            elif response.status_code in [401, 403]:
                return True, True, final_url
            else:
                return False, False, url

        except Exception as e:
            logger.warning(f"Failed to validate URL {url}: {e}")
            return False, False, url
```
Tighten connection handling and HEAD/GET fallback in `_validate_url`

The validation logic is reasonable, but there are a couple of robustness improvements worth considering:

- You use `requests.get(..., stream=True, ...)` in the fallback without explicitly closing the response or ensuring the body is fully consumed. While CPython will usually clean this up, it's safer to either drop `stream=True` (since you only need `status_code`/`url`) or use a `with requests.get(...) as response:` block to guarantee connection release.
- Some APIs return `405` for `HEAD` on docs endpoints that are perfectly valid for `GET`. Right now that path is treated as invalid and never retried with `GET`, because the `GET` is only issued on `RequestException`. You might want to treat specific non-2xx/3xx codes from `HEAD` (like `405`) as a signal to fall back to `GET` as well.

These changes would reduce the risk of connection-pool exhaustion and increase validation accuracy for more finicky servers.
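The HEAD-then-GET policy suggested here can be sketched with injected HTTP callables. Everything below is hypothetical illustration, not the PR's code: `do_head`/`do_get` return `(status_code, final_url)` pairs so no network or `requests` import is needed:

```python
VALID = {200, 301, 302, 303}
LOGIN = {401, 403}
RETRY_WITH_GET = {405, 501}  # servers that reject HEAD outright


def classify(status, url, final_url):
    if status in VALID:
        return True, False, final_url
    if status in LOGIN:
        return True, True, final_url
    return False, False, url


def validate(url, do_head, do_get):
    """do_head/do_get return (status_code, final_url); injected for testability."""
    final_url = url
    try:
        status, final_url = do_head(url)
    except OSError:
        status = None
    # Fall back to GET on connection failure OR on HEAD-unsupported codes.
    if status is None or status in RETRY_WITH_GET:
        try:
            status, final_url = do_get(url)
        except OSError:
            return False, False, url
    return classify(status, url, final_url)


result = validate(
    "https://x.dev/docs",
    do_head=lambda u: (405, u),           # HEAD rejected by the server
    do_get=lambda u: (200, u + "/home"),  # GET succeeds after a redirect
)
gated = validate(
    "https://x.dev/portal",
    do_head=lambda u: (403, u),  # login-gated: valid, but requires auth
    do_get=lambda u: (200, u),   # GET never attempted for a decisive HEAD code
)
```

Injecting the HTTP callables also addresses the earlier test-coverage comment: the 2xx/3xx/401/403/405 branches become trivially testable without patching `requests`.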
```python
def _enrich_title(url: str, search_title: str, category: str, vendor_name: str) -> tuple[str, str]:
    """Enrich title by fetching page metadata.

    Returns:
        (enriched_title, title_source)
    """
    try:
        response = requests.get(
            url,
            timeout=10,
            stream=True,
            headers={"User-Agent": "Mozilla/5.0"},
        )

        content = response.raw.read(50000)
        soup = BeautifulSoup(content, "lxml")

        og_title_tag = soup.find("meta", property="og:title")
        if og_title_tag and og_title_tag.get("content"):
            content = og_title_tag["content"]
            if isinstance(content, str):
                title = content.strip()
                if title:
                    return _clean_title(title), "og_title"

        twitter_title_tag = soup.find("meta", attrs={"name": "twitter:title"})
        if twitter_title_tag and twitter_title_tag.get("content"):
            content = twitter_title_tag["content"]
            if isinstance(content, str):
                title = content.strip()
                if title:
                    return _clean_title(title), "twitter_title"

        h1_tag = soup.find("h1")
        if h1_tag:
            title = h1_tag.get_text().strip()
            if title:
                return _clean_title(title), "h1"

        title_tag = soup.find("title")
        if title_tag:
            title = title_tag.get_text().strip()
            if title:
                return _clean_title(title), "html_title"

    except Exception as e:
        logger.debug(f"Failed to enrich title for {url}: {e}")

    if search_title:
        return _clean_title(search_title), "search_result"

    return _generate_canonical_title(vendor_name, category), "canonical"
```
Improve streaming usage and response handling in `_enrich_title`

The enrichment flow and precedence order (og → twitter → h1 → title → search → canonical) are good, but the HTTP handling can be made safer and simpler:

- Similar to `_validate_url`, you call `requests.get(..., stream=True, ...)` and then `response.raw.read(50000)` without closing the response explicitly. Consider using `with requests.get(..., stream=False, ...) as response:` and then `content = response.content[:50000]` (or `response.text`) to let `requests` manage decompression and connection reuse.
- Reusing the name `content` for both the HTML payload and meta tag attribute values can be a bit confusing when reading the function. Renaming the HTML buffer (e.g. `html_bytes`) would improve clarity.

Example of the first change:

```diff
-    response = requests.get(
-        url,
-        timeout=10,
-        stream=True,
-        headers={"User-Agent": "Mozilla/5.0"},
-    )
-
-    content = response.raw.read(50000)
-    soup = BeautifulSoup(content, "lxml")
+    with requests.get(
+        url,
+        timeout=10,
+        headers={"User-Agent": "Mozilla/5.0"},
+    ) as response:
+        content = response.content[:50000]
+        soup = BeautifulSoup(content, "lxml")
```

This keeps the rest of the parsing logic unchanged while making the HTTP lifecycle more robust.
```python
            le=5,
        ),
    ] = 1,
) -> list[dict]:
```
Don't return arbitrary dicts (or lists of dicts) from functions - ESPECIALLY not public MCP tools. The return needs to be a Pydantic base model - or in this case a list thereof.
Done! Switched to Pydantic BaseModel. Created an `ExternalDocumentationUrl` model with these fields:

- `title: str` - Title of the documentation page
- `url: str` - URL of the documentation page
- `doc_type: DOCUMENTATION_TYPES` - Type of documentation (using the existing Literal type)
- `requires_login: bool` - Whether authentication is required

Both the internal function and MCP tool now return `list[ExternalDocumentationUrl]` for type safety. Used snake_case field names per Python conventions, with `doc_type` instead of `type` to avoid shadowing the builtin.
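For reference, the model shape described here can be sketched as a stdlib dataclass stand-in. The real change uses `pydantic.BaseModel` with the `DOCUMENTATION_TYPES` Literal; the category values below are assumptions, since the full literal set isn't shown in this thread:

```python
from dataclasses import dataclass

# Assumed category set -- stand-in for the DOCUMENTATION_TYPES Literal.
KNOWN_DOC_TYPES = frozenset(
    {"api_reference", "getting_started", "authentication", "rate_limits", "changelog"}
)


@dataclass(frozen=True)
class ExternalDocumentationUrl:
    title: str
    url: str
    doc_type: str          # DOCUMENTATION_TYPES in the real, Pydantic-checked model
    requires_login: bool = False

    def __post_init__(self):
        # Pydantic would enforce the Literal at validation time; we mimic that here.
        if self.doc_type not in KNOWN_DOC_TYPES:
            raise ValueError(f"unknown doc_type: {self.doc_type!r}")


item = ExternalDocumentationUrl(
    title="Stripe API Reference",
    url="https://docs.stripe.com/api",
    doc_type="api_reference",
)
```

A typed model like this gives MCP clients a stable, self-describing schema instead of an ad-hoc dict shape.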
All code review feedback has been addressed! ✅

Changes made:

- ✅ Test classes → bare functions (commit a119292)
- ✅ Category names → title case (commit fdab738)
- ✅ Return type → Pydantic BaseModel (commits 78bc33c, 63074c9)

The `ExternalDocumentationUrl` model now provides full type safety with these fields:

- `title: str` - Title of the documentation page
- `url: str` - URL of the documentation page
- `doc_type: DOCUMENTATION_TYPES` - Type using the Literal type for compile-time validation
- `requires_login: bool` - Whether authentication is required

I also propagated the `DOCUMENTATION_TYPES` Literal through the entire codebase (`URLCandidate`, `_search_for_category`, `_score_url`, `results_by_category`) to ensure full type safety, and added filtering logic to validate `allowed_types` against known categories.

All CI checks are passing, including MyPy type checking. Ready for review!
…afety Co-Authored-By: AJ Steers <[email protected]>
Co-Authored-By: AJ Steers <[email protected]>
- Change URLCandidate.category from str to DOCUMENTATION_TYPES
- Update _search_for_category and _score_url signatures to use DOCUMENTATION_TYPES
- Add filtering logic to validate allowed_types against known categories
- Update results_by_category dict to use DOCUMENTATION_TYPES keys
- Add cast import for type narrowing after filtering

This fixes a MyPy type error when instantiating the ExternalDocumentationUrl model.

Co-Authored-By: AJ Steers <[email protected]>
Co-Authored-By: AJ Steers <[email protected]>
Summary
Implements a new MCP tool `suggest_external_documentation_urls` that auto-suggests external documentation URLs for APIs using DuckDuckGo search, heuristic pattern matching, and metadata enrichment. The tool helps streamline the discovery of vendor documentation URLs when building connectors.

Key Features:

Dependencies Added:

- `duckduckgo-search>=7.0.0` - For web search
- `beautifulsoup4>=4.12.0` - For HTML parsing
- `lxml>=5.0.0` - For fast HTML parsing

Code Review Feedback Addressed:

- `list[dict]` to `list[ExternalDocumentationUrl]` (Pydantic BaseModel)
- `DOCUMENTATION_TYPES` Literal through codebase for full type safety

Review & Testing Checklist for Human

- `sleep(0.5)` rate limiting is sufficient and doesn't cause failures.
- `ExternalDocumentationUrl` model serializes correctly when returned from the MCP tool (check that field names are snake_case: `doc_type`, `requires_login`).

Test Plan

Notes

- The `cast()` usage in the filtering logic bypasses type checking - this is intentional to handle user-provided `allowed_types` that may contain invalid values, but it means invalid types are silently filtered out rather than raising errors.

Session: https://app.devin.ai/sessions/0612cddd74b2441ea970d065b8d4180b
Requested by: AJ Steers (@aaronsteers)