Skip to content

feat(mcp): Add get_api_docs_urls tool with data.externalDocumentationUrls support#858

Merged
Aaron ("AJ") Steers (aaronsteers) merged 10 commits intomainfrom
devin/1762889439-get-api-docs-urls-tool
Nov 12, 2025
Merged

feat(mcp): Add get_api_docs_urls tool with data.externalDocumentationUrls support#858
Aaron ("AJ") Steers (aaronsteers) merged 10 commits intomainfrom
devin/1762889439-get-api-docs-urls-tool

Conversation

@aaronsteers
Copy link
Copy Markdown
Member

@aaronsteers Aaron ("AJ") Steers (aaronsteers) commented Nov 11, 2025

refactor(registry): Move manifest parsing logic from MCP to registry module

Summary

Refactored the manifest parsing and API docs URL extraction logic from the MCP module (airbyte/mcp/connector_registry.py) into the core registry module (airbyte/registry.py) to improve separation of concerns and code organization.

Key changes:

  • Moved ApiDocsUrl class, helper functions (_manifest_url_for, _fetch_manifest_dict, _extract_docs_from_registry), and created new public function get_connector_api_docs_urls in airbyte/registry.py
  • Simplified MCP get_api_docs_urls tool to be a thin wrapper that calls get_connector_api_docs_urls from registry
  • Updated _extract_docs_from_registry to use _get_registry_url() for consistency and include documentationUrl from registry metadata
  • Added _DEFAULT_MANIFEST_URL constant to registry.py to avoid circular import with _executors.util
  • Updated unit tests to import from registry module with proper mocking

Review & Testing Checklist for Human

  • Circular import fix: Verify that duplicating the _DEFAULT_MANIFEST_URL constant in registry.py (instead of importing from _executors.util) is acceptable. This was necessary to avoid a circular import since _executors.util imports from registry.
  • End-to-end testing: Test the MCP get_api_docs_urls tool with a real connector (e.g., source-faker) to ensure the refactored code works correctly and returns the expected documentation URLs.
  • API design: Verify that importing the private constant _DEFAULT_MANIFEST_URL from registry.py in the MCP module is acceptable, or if it should be made public.
  • Deduplication logic: Confirm that URL deduplication still works correctly after the refactoring (test with a connector that has duplicate URLs across registry and manifest sources).

Notes

  • All 8 unit tests passing locally
  • All lint checks passing (poe fix-and-check)
  • The refactoring maintains backward compatibility for the MCP tool API - it still returns the same data structure
  • CI is currently running on the latest commit

Requested by: AJ Steers (aj@airbyte.io) / Aaron ("AJ") Steers (@aaronsteers)
Session: https://app.devin.ai/sessions/02ae774e5a3a4052b5260e136bec5ea0

Summary by CodeRabbit

  • New Features

    • Retrieve aggregated API documentation URLs for connectors from registry and connector manifests (title, URL, source, type, login requirement), with deduplication and clear "Connector not found." response when absent.
    • Exposed as a read-only, idempotent registry tool for external use.
  • Tests

    • Added unit tests for manifest fetching/parsing, docs extraction, deduplication, error cases, and empty-manifest handling.

devin-ai-integration bot and others added 2 commits November 11, 2025 19:36
- Implement new MCP tool to retrieve API documentation URLs for connectors
- Support both canonical connector IDs and API names as input
- Extract docs URLs from multiple sources:
  - Registry metadata (Airbyte documentation)
  - Connector manifest.yaml files (description, metadata.assist.docsUrl, metadata.apiDocs)
- Add comprehensive unit tests with 16 test cases
- Propose metadata.apiDocs format for storing multiple API docs URLs in manifest.yaml

Co-Authored-By: AJ Steers <aj@airbyte.io>
…tionUrls format

- Add support for new data.externalDocumentationUrls format in manifest.yaml
- Add doc_type field (api_reference, api_release_history, api_deprecations, other)
- Add requires_login field for docs requiring authentication
- Maintain backward compatibility with metadata.assist.docsUrl and metadata.apiDocs
- Update ApiDocsUrl model with Field aliases for proper JSON serialization
- Add 3 new unit tests for external docs format (19 total tests passing)
- Extract docs from data.externalDocumentationUrls with type and requiresLogin support

Co-Authored-By: AJ Steers <aj@airbyte.io>
@devin-ai-integration
Copy link
Copy Markdown
Contributor

Original prompt from AJ Steers
Received message in Slack channel #ask-devin-ai:

@Devin - In the PyAirbyte MCP server we need a new tool called 'Get API Docs URLs' in the registry domain:

For a request regarding specific a connector id (source-foo or destination-foo), the tool should return a set of docs urls relevant to the connector's upstream API. In place of a canonical connector name the tool should also accept a string representing the API name (i.e. should accept "source-facebook-marketing" OR "Facebook Marketing API").

If a canonical connector ID is provided, the tool should return any links stored in the connector's manifest.yaml file, pulling from the <http://connectors.airbyte.com|connectors.airbyte.com> registry endpoint. (You may need to search a bit for the exact url patterns.)
1. I don't think this exists yet in the metadata.yaml files, so please plan for the docs list to exist in this format roughly:
2. In `data.docs` entry, we'll have map of docs title or friendly name to the url. (Search for any precedent that already exists for providing docs urls in manifest.yaml - and propose an approach you think makes sense. I care less about a canonical "type" of docs URL, so long as there is space for both a title and url. E.g. a yaml entry could be "- Shopify API Deprecations Page": <https://docs.shopify.com/api/something/something>, or else the same info split out as - {title: "...", url: "...", type: "api-deprecation-docs"})
Let's also make a plan for a parallel tool called "Find API Docs URLs" - and for this we may need to use a websearch and we _might_ need to either inject an MCP prompt or else leverage LLM Sampling: <https://gofastmcp.com/clients/sampling>. Sampling is helpful (I think) in cases where the MCP server doesn't have its own API key for LLM processing. I don't have experience with that feature in MCP servers, so we'll need to evaluate carefully if this approach makes sense.

Our approach will be in two phases: the first phase will do the first tool. You will need limi... (563 chars truncated...)

@devin-ai-integration
Copy link
Copy Markdown
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions
Copy link
Copy Markdown

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This PyAirbyte Version

You can test this version of PyAirbyte using the following:

# Run PyAirbyte CLI from this branch:
uvx --from 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1762889439-get-api-docs-urls-tool' pyairbyte --help

# Install PyAirbyte from this branch for development:
pip install 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1762889439-get-api-docs-urls-tool'

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /fix-pr - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test-pr - Runs tests with the updated PyAirbyte

Community Support

Questions? Join the #pyairbyte channel in our Slack workspace.

📝 Edit this welcome message.

- Replace 'in' list membership checks with set-based assertions
- Use issubset() to verify expected URLs are present
- Resolves CodeQL 'Incomplete URL substring sanitization' alert
- No functional changes, test still validates URL extraction correctly

Co-Authored-By: AJ Steers <aj@airbyte.io>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Nov 11, 2025

📝 Walkthrough

Walkthrough

Adds ApiDocsUrl model and manifest-driven API docs discovery: registry-derived + manifest.yaml parsing, manifest fetch helpers, deduplication, a public get_connector_api_docs_urls function, and an MCP-exposed get_api_docs_urls wrapper plus unit tests. wdyt?

Changes

Cohort / File(s) Summary
Registry: API docs model & manifest helpers
airbyte/registry.py
Adds ApiDocsUrl data model, _DEFAULT_MANIFEST_URL, _manifest_url_for, _fetch_manifest_dict, _extract_docs_from_registry, and public get_connector_api_docs_urls(connector_name: str) which aggregates registry and manifest externalDocumentationUrls, validates connector existence, and deduplicates results.
MCP tool wrapper for registry lookup
airbyte/mcp/connector_registry.py
Replaces DEFAULT_MANIFEST_URL usage with _DEFAULT_MANIFEST_URL, imports new registry symbols, and adds MCP tool get_api_docs_urls(connector_name) that delegates to get_connector_api_docs_urls and returns "Connector not found." on AirbyteConnectorNotRegisteredError.
Unit tests
tests/unit_tests/test_mcp_connector_registry.py
New tests covering manifest URL construction, 404 handling, YAML/JSON parsing, ApiDocsUrl.from_manifest_dict, externalDocumentationUrls extraction (type/defaulting/requiresLogin), connector-not-found handling, deduplication, and aggregation using mocks/fixtures.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant MCP as get_api_docs_urls (MCP)
    participant API as get_connector_api_docs_urls
    participant Registry as Connector Registry
    participant Fetcher as _fetch_manifest_dict
    participant HTTP as HTTP Client
    participant Parser as ApiDocsUrl.from_manifest_dict

    Note over MCP,API: high-level flow for docs discovery
    User->>MCP: get_api_docs_urls(connector_name)
    MCP->>API: delegate and handle not-found

    alt connector found
        API->>Registry: read registry docs entries
        Registry-->>API: registry_docs
        API->>Fetcher: compute manifest URL & GET
        Fetcher->>HTTP: GET manifest.yaml/json
        HTTP-->>Fetcher: payload / 404
        Fetcher->>Parser: parse manifest -> manifest_docs
        Parser-->>API: manifest_docs
        API->>API: combine registry_docs + manifest_docs
        API->>API: deduplicate & normalize
        API-->>MCP: list[ApiDocsUrl]
        MCP-->>User: list[ApiDocsUrl]
    else connector not found
        API-->>MCP: raises AirbyteConnectorNotRegisteredError
        MCP-->>User: "Connector not found."
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Pay extra attention to _fetch_manifest_dict (timeouts, HTTP error handling, YAML vs JSON parsing) — wdyt?
  • Verify ApiDocsUrl.from_manifest_dict for correct doc_type defaulting to "other" and requires_login parsing — wdyt?
  • Confirm deduplication/canonicalization rules across registry vs manifest URLs (scheme, trailing slash, query params) — wdyt?

Possibly related PRs

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding a new MCP tool that retrieves API documentation URLs with support for a new manifest field.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch devin/1762889439-get-api-docs-urls-tool

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 813d0d0 and b4dcfa2.

📒 Files selected for processing (2)
  • airbyte/mcp/connector_registry.py (2 hunks)
  • tests/unit_tests/test_mcp_connector_registry.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/unit_tests/test_mcp_connector_registry.py (1)
airbyte/mcp/connector_registry.py (6)
  • ApiDocsUrl (167-176)
  • ApiDocsUrlsResult (179-184)
  • _extract_urls_from_manifest_description (223-255)
  • _fetch_manifest_docs_urls (308-328)
  • _resolve_connector_name (187-220)
  • get_api_docs_urls (336-395)
airbyte/mcp/connector_registry.py (3)
airbyte/sources/registry.py (2)
  • get_available_connectors (225-281)
  • get_connector_metadata (195-222)
airbyte/sources/util.py (1)
  • get_source (47-139)
airbyte/_util/meta.py (1)
  • is_docker_installed (193-194)
🪛 GitHub Check: CodeQL
tests/unit_tests/test_mcp_connector_registry.py

[failure] 81-81: Incomplete URL substring sanitization
The string https://dashboard.example.com/ may be at an arbitrary position in the sanitized URL.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Windows)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (No Creds)

- Split long set literal across multiple lines per ruff style
- No functional changes

Co-Authored-By: AJ Steers <aj@airbyte.io>
@github-actions
Copy link
Copy Markdown

github-actions bot commented Nov 11, 2025

PyTest Results (Fast Tests Only, No Creds)

320 tests  +8   320 ✅ +8   5m 48s ⏱️ -5s
  1 suites ±0     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit b8a74db. ± Comparison against base commit 782b1f2.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Nov 11, 2025

PyTest Results (Full)

389 tests  +8   373 ✅ +8   25m 24s ⏱️ + 1m 18s
  1 suites ±0    16 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit b8a74db. ± Comparison against base commit 782b1f2.

♻️ This comment has been updated with latest results.

…registry

- Add _extract_docs_from_registry() helper function to extract externalDocumentationUrls from connector registry
- Integrate registry extraction into get_api_docs_urls tool
- Update docstring to reflect new data sources
- Tested with source-faker which now returns 3 docs URLs including Python Faker Library Documentation and Faker Changelog from registry externalDocumentationUrls
- All 19 unit tests passing
- All lint checks passing

Co-Authored-By: AJ Steers <aj@airbyte.io>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
airbyte/mcp/connector_registry.py (1)

204-218: Hyphenated API names still fail to resolve.

This is the same issue flagged in the previous review. When a user supplies "Facebook-Marketing API", the hyphen remains in search_term (line 204) while connector_name_clean (lines 214-216) replaces hyphens with spaces. The comparison then misses, returning "Connector not found." Could we normalize both values to handle hyphens/underscores consistently before comparing, wdyt?

Apply this diff to normalize both search_term and connector_name_clean:

 search_term = re.sub(r"\s+(api|rest api)$", "", connector_identifier_lower, flags=re.IGNORECASE)
+search_term = re.sub(r"[-_]+", " ", search_term)
+search_term = re.sub(r"\s+", " ", search_term).strip()

 for connector_name in available_connectors:
     metadata = None
     with contextlib.suppress(Exception):
         metadata = get_connector_metadata(connector_name)

     if metadata:
         pass

     connector_name_clean = (
         connector_name.replace("source-", "").replace("destination-", "").replace("-", " ")
     )
+    connector_name_clean = re.sub(r"\s+", " ", connector_name_clean.lower().strip())
     if search_term in connector_name_clean or connector_name_clean in search_term:
         return connector_name
🧹 Nitpick comments (3)
airbyte/mcp/connector_registry.py (3)

207-212: Remove unused metadata retrieval.

The metadata is fetched but never used—lines 211-212 contain a no-op if metadata: pass. Should we remove this dead code or was there an intended use for it, wdyt?

Apply this diff to remove the unused code:

 for connector_name in available_connectors:
-    metadata = None
-    with contextlib.suppress(Exception):
-        metadata = get_connector_metadata(connector_name)
-
-    if metadata:
-        pass
-
     connector_name_clean = (
         connector_name.replace("source-", "").replace("destination-", "").replace("-", " ")
     )

223-255: Consider the scope of standalone URL extraction.

The standalone URL pattern (line 239) will match any https?:// URL in the description. If the description contains unrelated URLs (e.g., examples, company homepage), they'll be included as API docs. Would it make sense to limit extraction to certain sections or add filtering logic, wdyt?


331-374: Consider making the registry URL configurable.

The registry URL is hardcoded at line 343. If the URL changes or if we need to support different registries (e.g., for testing), this could be inflexible. Would it make sense to extract this to a constant or configuration, wdyt?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6309b76 and 9bff539.

📒 Files selected for processing (1)
  • airbyte/mcp/connector_registry.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte/mcp/connector_registry.py (3)
airbyte/sources/registry.py (2)
  • get_available_connectors (225-281)
  • get_connector_metadata (195-222)
airbyte/sources/util.py (1)
  • get_source (47-139)
airbyte/_util/meta.py (1)
  • is_docker_installed (193-194)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (No Creds)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (5)
airbyte/mcp/connector_registry.py (5)

7-11: LGTM!

The new imports (re, requests, yaml) are appropriate for the URL extraction, HTTP fetching, and YAML parsing functionality being added.


167-184: LGTM!

The data models are well-structured with appropriate field aliases (doc_typetype, requires_loginrequiresLogin) and the populate_by_name configuration for flexible input handling.


258-305: LGTM!

Comprehensive extraction from multiple manifest sources (description, data.externalDocumentationUrls, metadata.assist.docsUrl, metadata.apiDocs) with appropriate defensive checks using isinstance.


308-328: LGTM!

Good error handling with specific 404 check, reasonable timeout (10s), and graceful degradation returning an empty list on failures. The broad exception catch is appropriate for this best-effort fetch.


434-439: LGTM!

The URL deduplication logic correctly preserves the first occurrence of each unique URL, using a set for efficient lookup. This aligns with the PR objective of "exact-URL deduplication across sources."

- Remove ApiDocsUrlsResult wrapper class, return list[ApiDocsUrl] directly
- Remove _resolve_connector_name helper, only accept canonical connector IDs
- Remove _extract_urls_from_manifest_description, only parse YAML fields
- Remove filtering conditions from externalDocumentationUrls parsing
- Fix lint error by breaking long description line

Co-Authored-By: AJ Steers <aj@airbyte.io>
- Remove metadata.assist.docsUrl and metadata.apiDocs parsing (non-existent specs)
- Remove blanket try/except blocks, let errors raise naturally
- Remove tests for hallucinated specs (test_manifest_with_assist_docs_url, test_manifest_with_api_docs, test_manifest_with_mixed_formats, test_manifest_request_error)
- Remove test_successful_retrieval test (confusing test with outdated source references)
- Update docstring to reflect only supported sources: registry and manifest data.externalDocumentationUrls
- 5 focused unit tests now passing

Co-Authored-By: AJ Steers <aj@airbyte.io>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte/mcp/connector_registry.py (2)

221-259: Consider consistent error handling across sources.

Similar to _fetch_manifest_docs_urls, this function will raise on any HTTP error (line 232), which could prevent get_api_docs_urls from returning partial results when the registry is temporarily unavailable but manifest data is accessible.

For consistency and resilience, should network errors here also be caught to allow aggregation from available sources? wdyt?


307-311: Should we handle failures per-source to enable partial aggregation?

Currently, if _extract_docs_from_registry fails (line 307), we never reach _fetch_manifest_docs_urls (line 310), and vice versa. This means a network issue with one source causes complete failure even if other sources are available.

Given the function's goal to "aggregate URLs from multiple sources," would it make sense to wrap each source's extraction in a try/except to collect available data and only fail if all sources fail? This would align with the PR description's claim of graceful network error handling. wdyt?

-    registry_urls = _extract_docs_from_registry(connector_name)
-    docs_urls.extend(registry_urls)
-
-    manifest_urls = _fetch_manifest_docs_urls(connector_name)
-    docs_urls.extend(manifest_urls)
+    with contextlib.suppress(Exception):
+        registry_urls = _extract_docs_from_registry(connector_name)
+        docs_urls.extend(registry_urls)
+
+    with contextlib.suppress(Exception):
+        manifest_urls = _fetch_manifest_docs_urls(connector_name)
+        docs_urls.extend(manifest_urls)

Alternatively, if errors should propagate (per previous maintainer feedback), we could clarify the PR description to remove the "graceful network error handling" claim.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9bff539 and bd9b023.

📒 Files selected for processing (2)
  • airbyte/mcp/connector_registry.py (2 hunks)
  • tests/unit_tests/test_mcp_connector_registry.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit_tests/test_mcp_connector_registry.py
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte/mcp/connector_registry.py (4)
airbyte/mcp/_tool_utils.py (1)
  • mcp_tool (102-148)
airbyte/sources/registry.py (1)
  • get_available_connectors (225-281)
airbyte/sources/util.py (1)
  • get_source (47-139)
airbyte/_util/meta.py (1)
  • is_docker_installed (193-194)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Windows)
  • GitHub Check: Pytest (No Creds)
🔇 Additional comments (5)
airbyte/mcp/connector_registry.py (5)

9-10: LGTM!

The requests and yaml imports are appropriate for fetching and parsing manifest files and registry data.


166-176: LGTM!

The ApiDocsUrl model is well-structured with sensible defaults and flexible field naming through aliases. The use of populate_by_name allows both type/doc_type and requiresLogin/requires_login to work correctly.


178-199: LGTM!

The manifest extraction logic correctly navigates the nested structure and applies appropriate defaults for optional fields (type and requiresLogin).


202-218: Verify error handling aligns with PR objectives.

The PR description states the tool "handles missing manifests (404s) and network errors gracefully," but currently only 404 responses return an empty list. Other network errors (timeouts, DNS failures, 5xx responses) will propagate via raise_for_status() on line 215, potentially causing get_api_docs_urls to fail completely even if registry data is available.

Is this the intended behavior, or should we also handle other request exceptions gracefully to ensure partial results can be returned? wdyt?


287-320: The aggregation logic looks solid.

The flow correctly:

  • Validates connector exists
  • Gathers docs from multiple sources
  • Deduplicates by URL (keeping first occurrence)

The deduplication strategy (lines 313-318) only checks URL equality, which means if the same URL appears from multiple sources with different titles or types, only the first is kept. This seems intentional for avoiding duplicate links, which makes sense.

…assmethod

- Add _manifest_url_for() helper to generate manifest URL
- Add _fetch_manifest_dict() helper to fetch and parse manifest from URL
- Add ApiDocsUrl.from_manifest_dict() classmethod to extract docs from manifest dict
- Remove old _extract_docs_from_manifest() and _fetch_manifest_docs_urls() functions
- Update unit tests to test new helper functions and classmethod separately
- Add typing_extensions.Self for proper return type annotation

Co-Authored-By: AJ Steers <aj@airbyte.io>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/unit_tests/test_mcp_connector_registry.py (2)

28-58: Consider adding tests for non-404 error cases.

The current tests cover the 404 case (returns empty dict) and the success case. It would be valuable to add test coverage for:

  1. Non-404 HTTP errors (should raise HTTPError after raise_for_status())
  2. Invalid YAML content (should raise YAMLError)

These would help verify the error propagation behavior mentioned in the PR summary ("errors raise naturally"), wdyt?


143-181: Consider adding tests for error scenarios in external calls.

The deduplication test provides good coverage of the happy path. To improve resilience testing, consider adding test cases for:

  1. _extract_docs_from_registry raising an exception (network failure, malformed JSON)
  2. _fetch_manifest_dict raising an exception for non-404 errors
  3. Verifying that partial failures don't prevent returning available docs

This would validate that the function gracefully handles failures from individual sources while still aggregating what it can, wdyt?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd9b023 and 731878f.

📒 Files selected for processing (2)
  • airbyte/mcp/connector_registry.py (2 hunks)
  • tests/unit_tests/test_mcp_connector_registry.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/unit_tests/test_mcp_connector_registry.py (1)
airbyte/mcp/connector_registry.py (5)
  • ApiDocsUrl (204-242)
  • _fetch_manifest_dict (182-201)
  • _manifest_url_for (167-179)
  • get_api_docs_urls (291-346)
  • from_manifest_dict (216-242)
airbyte/mcp/connector_registry.py (4)
airbyte/mcp/_tool_utils.py (1)
  • mcp_tool (102-148)
airbyte/sources/registry.py (1)
  • get_available_connectors (225-281)
airbyte/sources/util.py (1)
  • get_source (47-139)
airbyte/_util/meta.py (1)
  • is_docker_installed (193-194)
🔇 Additional comments (9)
airbyte/mcp/connector_registry.py (6)

167-179: LGTM!

The _manifest_url_for helper is straightforward and correctly formats the manifest URL using the connector name and "latest" version.


182-201: Good use of yaml.safe_load for security.

The function correctly uses yaml.safe_load() instead of yaml.load(), which prevents arbitrary code execution from malicious YAML files. The 404 handling and the fallback to empty dict on line 201 (when yaml.safe_load returns None for empty files) are both appropriate.


204-242: LGTM!

The ApiDocsUrl model is well-designed:

  • Field aliases properly support both camelCase and snake_case naming conventions via populate_by_name
  • The from_manifest_dict classmethod uses defensive programming with .get() calls and type checks
  • Correctly extracts from data.externalDocumentationUrls as specified in the PR requirements

339-346: URL-based deduplication preserves first occurrence metadata.

The deduplication logic keeps only the first occurrence of each URL and discards subsequent entries with the same URL, even if they have different metadata (title, doc_type, etc.). This means if the registry and manifest both list the same URL with different metadata, only the first one encountered will be returned. Is this the intended behavior, wdyt?


9-10: No action needed—both dependencies are already declared.

The requests and pyyaml packages are already present in pyproject.toml with explicit version constraints (pyyaml = "^6.0.2" and requests = "!=3.32.0"). The project also includes their type stubs. The imports are safe and properly backed by project dependencies.


259-264: No action needed—the search logic is sound.

The registry data confirms that each dockerRepository is unique, and they consistently follow the org/connector-name format (e.g., airbyte/source-100ms, farosai/airbyte-victorops-source). Since endswith(f"/{connector_name}") matches against the final component, the search will reliably identify the correct entry without ambiguity. The concern about potential duplicates or format inconsistency is not a real issue here.

tests/unit_tests/test_mcp_connector_registry.py (3)

17-25: LGTM!

Simple and effective smoke test for the URL generation helper. The assertions verify the essential components are present in the generated URL.


61-127: LGTM!

Excellent test coverage for the ApiDocsUrl.from_manifest_dict classmethod:

  • Verifies extraction of all fields including type aliases
  • Tests default values for optional fields (doc_type defaults to "other", requires_login defaults to False)
  • Handles edge case of empty manifest

160-172: Test fixture URL is flagged by security scanners but is safe.

The security scanner flagged the URL at line 172 as "Incomplete URL substring sanitization". However, this is a test fixture URL used in mocking, not user-controlled input or production code. This is expected and safe in the test context. If the security scanner continues to flag this, you could add a comment like # nosec or # test fixture URL to suppress the warning, wdyt?

Resolved conflicts:
- Added yaml import alongside new logging and exceptions imports from main
- Kept refactored manifest parsing helpers and classmethod
- Kept new get_connector_version_history function from main

Co-Authored-By: AJ Steers <aj@airbyte.io>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
airbyte/mcp/connector_registry.py (1)

178-190: Consider inlining this helper?

This is a thin wrapper around DEFAULT_MANIFEST_URL.format(). Since it's only called once (line 345) and doesn't add much abstraction, would it be simpler to inline it directly at the call site, wdyt?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 731878f and 9521495.

📒 Files selected for processing (1)
  • airbyte/mcp/connector_registry.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte/mcp/connector_registry.py (4)
airbyte/registry.py (1)
  • get_available_connectors (226-282)
airbyte/sources/util.py (1)
  • get_source (47-139)
airbyte/_util/meta.py (1)
  • is_docker_installed (193-194)
airbyte/_executors/python.py (1)
  • docs_url (108-112)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (No Creds)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
🔇 Additional comments (4)
airbyte/mcp/connector_registry.py (4)

193-212: LGTM!

Clean error handling that lets non-404 failures propagate naturally while gracefully handling missing manifests. The or {} fallback on line 212 is a nice touch for empty responses.


350-356: Question about deduplication strategy?

The deduplication only checks URL equality and keeps the first occurrence. If the same URL appears in both registry and manifest sources with different metadata (e.g., different title or doc_type), the first one wins and the other metadata is discarded. Is this the intended behavior, or should we consider merging/enriching metadata when the same URL appears multiple times, wdyt?

For example, if registry says title="API Reference" and manifest says title="Official API Docs" for the same URL, currently we'd keep "API Reference" and lose "Official API Docs".


342-343: Verify error handling is intentional?

_extract_docs_from_registry has no error handling and will raise if the registry fetch fails. Based on past comments about letting errors raise naturally, this seems intentional—but wanted to confirm this is the desired behavior (vs. logging and continuing with partial results), wdyt?


297-357: Overall structure looks solid!

The aggregation logic cleanly combines docs from multiple sources (Airbyte docs, registry, and manifest), and the deduplication ensures no duplicate URLs in the result. The error handling strategy of letting registry/manifest failures propagate while suppressing connector initialization errors is clear and intentional.

…module

- Move ApiDocsUrl class, helper functions, and get_connector_api_docs_urls to airbyte/registry.py
- Simplify MCP get_api_docs_urls tool to call registry function
- Update _extract_docs_from_registry to use _get_registry_url() and include documentationUrl
- Add _DEFAULT_MANIFEST_URL constant to registry.py to avoid circular import
- Update tests to import from registry module
- All 8 unit tests passing, all lint checks passing

Co-Authored-By: AJ Steers <aj@airbyte.io>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9521495 and b8a74db.

📒 Files selected for processing (3)
  • airbyte/mcp/connector_registry.py (3 hunks)
  • airbyte/registry.py (3 hunks)
  • tests/unit_tests/test_mcp_connector_registry.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
airbyte/registry.py (1)
airbyte/exceptions.py (1)
  • AirbyteConnectorNotRegisteredError (288-297)
tests/unit_tests/test_mcp_connector_registry.py (3)
airbyte/mcp/connector_registry.py (1)
  • get_api_docs_urls (183-203)
airbyte/registry.py (4)
  • ApiDocsUrl (303-341)
  • _fetch_manifest_dict (359-378)
  • _manifest_url_for (344-356)
  • from_manifest_dict (315-341)
airbyte/exceptions.py (1)
  • AirbyteConnectorNotRegisteredError (288-297)
airbyte/mcp/connector_registry.py (3)
airbyte/registry.py (4)
  • ApiDocsUrl (303-341)
  • InstallType (48-54)
  • get_available_connectors (232-288)
  • get_connector_api_docs_urls (432-474)
airbyte/mcp/_tool_utils.py (1)
  • mcp_tool (102-148)
airbyte/exceptions.py (1)
  • AirbyteConnectorNotRegisteredError (288-297)

@aaronsteers Aaron ("AJ") Steers (aaronsteers) merged commit 7eb746b into main Nov 12, 2025
24 checks passed
@aaronsteers Aaron ("AJ") Steers (aaronsteers) deleted the devin/1762889439-get-api-docs-urls-tool branch November 12, 2025 20:15
devin-ai-integration bot added a commit that referenced this pull request Nov 12, 2025
- Removed deduplication logic from get_connector_docs_urls function
- Removed test_deduplication_of_urls test
- Updated docstring to remove mention of deduplication

The deduplication feature was previously introduced in PR #858 but is being
removed per product direction. Duplicate URLs from registry and manifest
sources will now be preserved in the returned list.

Co-Authored-By: AJ Steers <aj@airbyte.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant