@aaronsteers
Contributor

@aaronsteers aaronsteers commented Oct 22, 2025

feat: Bump langchain dependencies to 1.0.x

Summary

Upgraded langchain dependencies from 0.1.x to 1.0.x as part of a multi-stage plan to modernize the CDK's vector database dependencies. This is Stage 2 of the planned work.

Changes:

  • Bumped langchain from 0.1.16 to ^1.0.2
  • Bumped langchain_core from 0.1.42 to ^1.0.0
  • Added langchain_community ^0.4 as an explicit dependency (was previously only transitive)
  • Updated import paths in embedder.py and document_processor.py to use new langchain 1.0 package structure:
    • langchain.embeddings.* → langchain_community.embeddings
    • langchain.text_splitter → langchain_text_splitters
    • langchain.utils → langchain_core.utils.strings
  • Fixed CohereEmbeddings initialization to include required user_agent parameter
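
For illustration, the import moves listed above can be written as a small rewrite table. This is a minimal sketch, not the code from this PR: `IMPORT_MOVES` and `rewrite_imports` are hypothetical names, and collapsing every submodule onto the new top-level module is an assumption that only holds for names the new module actually exports (for example `stringify_dict` in `langchain_core.utils.strings`).

```python
import re

# Old module prefix -> new module, as listed in the PR description.
IMPORT_MOVES = {
    "langchain.embeddings": "langchain_community.embeddings",
    "langchain.text_splitter": "langchain_text_splitters",
    "langchain.utils": "langchain_core.utils.strings",
}


def rewrite_imports(source: str) -> str:
    """Rewrite `from <old module> import ...` lines to the new module paths."""
    for old, new in IMPORT_MOVES.items():
        # Match the old module plus any submodule, e.g. langchain.embeddings.openai.
        pattern = rf"^(\s*from\s+){re.escape(old)}(?:\.\w+)*(\s+import\s+)"
        source = re.sub(pattern, rf"\g<1>{new}\g<2>", source, flags=re.MULTILINE)
    return source


if __name__ == "__main__":
    print(rewrite_imports("from langchain.utils import stringify_dict\n"))
```

A rewrite like this only handles `from ... import ...` statements; plain `import langchain...` lines and aliased usages would still need a manual pass.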

Test Results:

  • Fast tests: ✅ 3800 passed
  • Full integration tests: ⏳ Not yet run
  • pgvector destination testing: ⏳ Not yet done

Review & Testing Checklist for Human

⚠️ Important: This PR has NOT been tested with pgvector destination yet, which was the primary validation requirement.

  • Address deprecation warnings: Tests show that CohereEmbeddings and OpenAIEmbeddings should be imported from dedicated packages (langchain-cohere, langchain-openai) rather than langchain_community. Should we update to use those packages instead? See test output warnings.
  • Test with pgvector destination locally: Run end-to-end tests with pgvector to confirm the langchain upgrade doesn't break vector DB functionality. This was the main validation requirement.
  • Search for missed langchain imports: Only 2 files were updated. Verify there are no other langchain imports in the codebase: git grep -n "from langchain\." -- "*.py" (git grep filters files with a trailing pathspec; --include is a GNU grep option, not a git grep one)
  • Review poetry.lock changes: The lock file has extensive changes due to dependency tree updates. Scan for any unexpected version bumps or conflicts.
  • Run full test suite: Only fast tests have been run. Consider running full CDK test suite before merging.
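
As a complement to the text search in the checklist, the import check could also be done on the syntax tree, which catches `import langchain` and multi-line imports that a single-line pattern can miss. The sketch below is an assumption about how such a check might look; `find_legacy_langchain_imports` and `scan_tree` are hypothetical helpers, not part of the CDK.

```python
import ast
from pathlib import Path


def find_legacy_langchain_imports(source: str) -> list[tuple[int, str]]:
    """Return (line number, module) for imports of `langchain` or `langchain.*`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        elif isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        else:
            continue
        for module in modules:
            # Only the monolithic package counts; langchain_core etc. are fine.
            if module == "langchain" or module.startswith("langchain."):
                hits.append((node.lineno, module))
    return sorted(hits)


def scan_tree(root: str) -> dict[str, list[tuple[int, str]]]:
    """Scan every .py file under `root`; files that fail to parse will raise."""
    results = {}
    for path in Path(root).rglob("*.py"):
        hits = find_legacy_langchain_imports(path.read_text(encoding="utf-8"))
        if hits:
            results[str(path)] = hits
    return results
```

Calling `scan_tree("airbyte_cdk")` from the repository root would list any remaining legacy imports by file.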

Recommended Test Plan

  1. Install updated CDK with poetry install --all-extras
  2. Set up local pgvector instance
  3. Run pgvector destination tests with various embedding configurations (OpenAI, Cohere, Fake)
  4. Verify document processing and text splitting still works correctly
  5. Check that embeddings are generated and stored properly

Notes

  • User explicitly requested ignoring CI until local tests pass, so CI failures are expected at this stage
  • The user_agent="airbyte-cdk" parameter was added to CohereEmbeddings - consider if this should include version info
  • This is part of a 3-stage upgrade plan; Stage 3 will involve updating pgvector destination to use this new CDK version
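
On the version-info note above, one possible shape for a versioned user agent is sketched below. It is an assumption, not what the PR implements: `build_user_agent` is a hypothetical helper, `airbyte-cdk` is assumed to be the installed distribution name, and the `name/version` format is only a common convention.

```python
from importlib import metadata


def build_user_agent(package: str = "airbyte-cdk") -> str:
    """Return '<package>/<version>', or just '<package>' if it is not installed."""
    try:
        return f"{package}/{metadata.version(package)}"
    except metadata.PackageNotFoundError:
        # Fall back to the bare name, e.g. when running from a source checkout.
        return package
```

The result could then be passed as `user_agent=build_user_agent()` wherever the hardcoded string is used today.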

Session Info:

Summary by CodeRabbit

  • Chores
    • Upgraded LangChain-related tooling to the new modular ecosystem (langchain_core, langchain_community, langchain_text_splitters) for improved compatibility and stability.
    • Updated vector DB integration extras to reflect the new packages.
    • Improved embedding provider interoperability and metadata sent to providers for more reliable embedding operations.

Important

Auto-merge enabled.

This PR is set to merge automatically when all requirements are met.

Note

Auto-merge may have been disabled. Please check the PR status to confirm.

- Update langchain from 0.1.16 to ^1.0.2
- Update langchain_core from 0.1.42 to ^1.0.0
- Add langchain_community ^0.4 as explicit dependency
- Update imports to use new langchain package structure:
  - langchain_community.embeddings for embedding classes
  - langchain_text_splitters for text splitting
  - langchain_core.utils.strings for utility functions
- Fix CohereEmbeddings initialization to include required user_agent parameter
- All fast tests passing (3800 passed)

Co-Authored-By: AJ Steers <[email protected]>
Copilot AI review requested due to automatic review settings October 22, 2025 23:55
@devin-ai-integration
Contributor

Original prompt from AJ Steers
Received message in Slack channel #ask-devin-ai:

@Devin - We need to bump our langchain dependencies in the python CDK. Can you make a plan to bump the CDK dependencies, and then use the pgvector destination to test if your changes are successful? I expect you to fully ignore CI for this, until such time as you have tests locally succeeding. You should expect significant breaking changes since we are many versions behind for the langchain dependency.
Thread URL: https://airbytehq-team.slack.com/archives/C08BHPUMEPJ/p1761175248996799?thread_ts=1761175248.996799

@devin-ai-integration
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions github-actions bot added the enhancement New feature or request label Oct 22, 2025
@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1761176353-bump-langchain#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1761176353-bump-langchain

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

Contributor

Copilot AI left a comment


Pull Request Overview

This PR upgrades the langchain dependencies from 0.1.x to 1.0.x, migrating to the new package structure introduced in langchain's major version update. The changes update import paths across the codebase to align with langchain 1.0's modularized architecture.

  • Bumped core langchain packages to 1.0.x versions and added langchain_community as explicit dependency
  • Updated import statements to use new package structure (langchain_community, langchain_text_splitters, langchain_core.utils.strings)
  • Added required user_agent parameter to CohereEmbeddings initialization

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pyproject.toml | Updated langchain dependency versions to 1.0.x and added langchain_community to extras |
| airbyte_cdk/destinations/vector_db_based/embedder.py | Migrated embedding imports to langchain_community and added user_agent to CohereEmbeddings |
| airbyte_cdk/destinations/vector_db_based/document_processor.py | Updated text splitter imports to langchain_text_splitters and utils to langchain_core |


@coderabbitai
Contributor

coderabbitai bot commented Oct 22, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Updates move LangChain-related imports to new modular packages (langchain_community, langchain_core, langchain_text_splitters), add user_agent="airbyte-cdk" to CohereEmbeddings, and update pyproject.toml dependency entries and extras accordingly.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| LangChain Import Migration: airbyte_cdk/destinations/vector_db_based/document_processor.py, airbyte_cdk/destinations/vector_db_based/embedder.py | Replaced imports from legacy langchain.* modules with langchain_text_splitters, langchain_core.utils.strings, and langchain_community.embeddings. Added user_agent="airbyte-cdk" argument to CohereEmbeddings construction. |
| Dependency Configuration: pyproject.toml | Removed old langchain entries; added/updated langchain_community = "^0.4", langchain_core = "^1.0.0", and langchain_text_splitters = "^1.0.0". Updated vector-db-based extras to include the new packages plus existing providers. |

Sequence Diagram(s)

(No sequence diagram generated — changes are import/dependency updates and a single constructor parameter addition without new control flow.)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

dependencies

Suggested reviewers

  • aaronsteers
  • brianjlai
  • pedroslopez

Quick question: Have you verified that the added user_agent="airbyte-cdk" parameter for CohereEmbeddings is accepted across the langchain_community versions in the specified range, wdyt?
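
One way to answer this kind of question empirically is to inspect the constructor signature in each installed version. The sketch below is generic and hedged: `accepts_keyword` is a hypothetical helper, and it assumes the class exposes an accurate signature to `inspect.signature`, which may not hold for every model class.

```python
import inspect


def accepts_keyword(callable_obj, name: str) -> bool:
    """Return True if `callable_obj` can be called with `name=` as a keyword."""
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    for param in inspect.signature(callable_obj).parameters.values():
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            # **kwargs accepts the keyword syntactically; it may still be rejected later.
            return True
        if param.name == name and param.kind in keyword_kinds:
            return True
    return False
```

With the embedding class imported, `accepts_keyword(CohereEmbeddings, "user_agent")` could be run against each `langchain_community` version in the allowed range.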

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The PR title "feat: Bump langchain dependencies to 1.0.x" directly and accurately reflects the primary objective of this changeset. The core change is upgrading langchain-related dependencies (langchain, langchain_core, langchain_community, and langchain_text_splitters) to their 1.0.x versions, which is precisely what the title communicates. The accompanying import path refactoring in embedder.py and document_processor.py and the parameter fix for CohereEmbeddings are necessary consequences of the version bump rather than standalone changes. The title is concise, uses the appropriate "feat:" prefix, and is specific enough that someone reviewing the commit history would immediately understand the primary intent. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

- Remove unused langchain dependency
- Add langchain_text_splitters as explicit dependency
- Update vector-db-based extras to include langchain_text_splitters
- Fixes deptry dependency analysis errors

Co-Authored-By: AJ Steers <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
airbyte_cdk/destinations/vector_db_based/embedder.py (1)

145-147: User agent version info.

As mentioned in a previous review, the hardcoded user_agent='airbyte-cdk' could benefit from including version information for better tracking in Cohere's logs.

pyproject.toml (1)

71-71: Outdated comment about openai dependency.

As noted in a previous review, the comment about openai being used indirectly by langchain is outdated with langchain 1.0.x.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20ae208 and 02226a9.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • airbyte_cdk/destinations/vector_db_based/document_processor.py (1 hunks)
  • airbyte_cdk/destinations/vector_db_based/embedder.py (2 hunks)
  • pyproject.toml (2 hunks)
🧰 Additional context used
🪛 GitHub Actions: Dependency Analysis
airbyte_cdk/destinations/vector_db_based/document_processor.py

[error] 13-13: DEP003 'langchain_text_splitters' imported but it is a transitive dependency

pyproject.toml

[error] 1-1: DEP002 'langchain' defined as a dependency but not used in the codebase
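
To make the two findings above concrete, here is a toy model of what they mean. It is a sketch of the idea only, not how deptry is implemented; `classify_dependencies` is a hypothetical name and the sets in the usage note are illustrative.

```python
def classify_dependencies(declared: set[str], imported: set[str]) -> dict[str, list[str]]:
    """Toy model of the two findings reported in this PR.

    declared: packages listed as dependencies in pyproject.toml
    imported: top-level modules the code actually imports (assumed importable)
    """
    return {
        # Declared in pyproject.toml but never imported by the code.
        "DEP002": sorted(declared - imported),
        # Imported by the code but only present because another package pulls it in.
        "DEP003": sorted(imported - declared),
    }
```

For the state after the first commit, `classify_dependencies({"langchain", "langchain_core", "langchain_community"}, {"langchain_core", "langchain_community", "langchain_text_splitters"})` flags `langchain` under DEP002 and `langchain_text_splitters` under DEP003, matching the two errors reported by the Dependency Analysis job.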

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-shopify
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (2)
airbyte_cdk/destinations/vector_db_based/document_processor.py (1)

12-13: Import paths look good for langchain 1.0.x migration!

The updated import paths for stringify_dict and text splitters align well with the langchain 1.0.x package structure. Just note that langchain_text_splitters needs to be added as a declared dependency in pyproject.toml (I've flagged this in my review of that file).

airbyte_cdk/destinations/vector_db_based/embedder.py (1)

10-15: Migrate to dedicated embedding packages to resolve deprecation warnings

OpenAIEmbeddings from langchain_community is deprecated since version 0.0.9 with removal target 1.0, and CohereEmbeddings is deprecated since version 0.0.30. Since you're already doing a major version upgrade, would you consider migrating to the recommended packages now?

  • Use langchain_openai.OpenAIEmbeddings instead
  • Use langchain_cohere.CohereEmbeddings instead

This will eliminate deprecation warnings and keep the codebase ahead of LangChain 1.0. What do you think?

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
pyproject.toml (1)

71-71: Consider updating the openai dependency comment for clarity with langchain 1.0.x.

The comment states openai is "Used indirectly by langchain library", but with langchain 1.0.x's modular architecture, this relationship has changed. OpenAI embeddings now come from the dedicated langchain-openai package (per deprecation warnings noted in the PR).

Should we clarify the comment to reflect that this is an optional dependency for embedding support, and possibly note that langchain-openai is the recommended approach going forward? This would help future maintainers understand the dependency rationale, wdyt?

-openai = { version = "0.27.9", extras = ["embeddings"], optional = true }  # Used indirectly by langchain library
+openai = { version = "0.27.9", extras = ["embeddings"], optional = true }  # Optional: Used for embedding support. With langchain 1.0.x, langchain-openai package is the recommended approach for OpenAI embeddings.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02226a9 and 5e19d46.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • pyproject.toml (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
🔇 Additional comments (1)
pyproject.toml (1)

67-69: Modular langchain 1.0.x architecture properly reflected.

The migration to the modular ecosystem looks solid. All three required packages are declared with appropriate version constraints, and the vector-db-based extras have been updated correctly to include them. The removal of the langchain metapackage aligns with the 1.0.x architecture shift.

Also applies to: 125-125

@github-actions

PyTest Results (Fast)

3 802 tests  ±0   3 790 ✅ ±0   6m 14s ⏱️ -29s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 5e19d46. ± Comparison against base commit 20ae208.

@github-actions

PyTest Results (Full)

3 805 tests   3 793 ✅  11m 42s ⏱️
    1 suites     12 💤
    1 files        0 ❌

Results for commit 5e19d46.

@devin-ai-integration
Contributor

Good question! Looking at the current code and deprecation warnings:

Current state:

  • We're using OpenAIEmbeddings from langchain_community.embeddings
  • The deprecation warning suggests migrating to langchain-openai package
  • We have an old openai = "0.27.9" dependency that was used indirectly by the old langchain

Recommendation:
Yes, we should migrate to langchain-openai for a few reasons:

  1. Deprecation warnings: The tests show OpenAIEmbeddings is deprecated in langchain_community and will be removed in a future version
  2. Dedicated package: langchain-openai is the official package for OpenAI integrations in langchain 1.0+
  3. Better maintenance: Dedicated packages get better support and updates
  4. Consistency: We'd be following the same pattern as langchain-cohere (which also has a deprecation warning)

Changes needed:

  1. Add langchain-openai as an optional dependency
  2. Update imports in embedder.py: from langchain_openai import OpenAIEmbeddings, AzureOpenAIEmbeddings
  3. Potentially update the openai package version constraint (or remove it if langchain-openai brings its own)
  4. Similarly, consider adding langchain-cohere for the Cohere embeddings

Should I make these changes now, or would you prefer to handle this in a follow-up PR?
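
If the migration were done incrementally, one option is to prefer the dedicated package and fall back to the community one while both are still supported. The following is a sketch under that assumption; `import_first_available` and `OPENAI_EMBEDDINGS_CANDIDATES` are hypothetical names, and nothing here is taken from the PR's code.

```python
import importlib


def import_first_available(candidates):
    """Return the first attribute that imports cleanly.

    candidates: iterable of (module_path, attribute_name) pairs, in preference order.
    """
    errors = []
    for module_path, attr in candidates:
        try:
            module = importlib.import_module(module_path)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_path}.{attr}: {exc}")
    raise ImportError("none of the candidates could be imported: " + "; ".join(errors))


# Preference order: dedicated package first, deprecated community path second.
OPENAI_EMBEDDINGS_CANDIDATES = [
    ("langchain_openai", "OpenAIEmbeddings"),
    ("langchain_community.embeddings", "OpenAIEmbeddings"),
]
```

`OpenAIEmbeddings = import_first_available(OPENAI_EMBEDDINGS_CANDIDATES)` would then pick up `langchain-openai` when it is installed. A plain import from the dedicated package is simpler once the fallback is no longer needed.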

devin-ai-integration bot added a commit to airbytehq/airbyte that referenced this pull request Oct 23, 2025
- Update CDK dependency to use local path (langchain upgrade branch)
- Temporarily comment out PyAirbyte dependency (will restore after CDK PR merges)
- Update Python version constraint to ^3.10,<3.13
- Run poetry lock to update dependencies

This is Stage 3 of the langchain upgrade plan:
- Stage 1: ✅ Bump pgvector to latest CDK (skipped, handled by PyAirbyte)
- Stage 2: ✅ Bump langchain in CDK (PR #810)
- Stage 3: 🔄 Update pgvector to use new CDK with updated langchain

Related PR: airbytehq/airbyte-python-cdk#810

Co-Authored-By: AJ Steers <[email protected]>
@aaronsteers
Contributor Author

Devin, please reply here in a comment. What are the latest versions of all the libraries we are updating? Please link to their PyPI pages and/or their GitHub releases pages.

@devin-ai-integration
Contributor

Here are the latest versions of all libraries we're updating:

Core LangChain Packages:

Removed:

  • langchain: Removed (was 0.1.16)
    • This was the monolithic package that has been split into modular packages
    • We now use the specific packages we need instead

Version Constraints Used:

  • langchain-core = "^1.0.0" - allows 1.x.x updates
  • langchain-community = "^0.4" - allows 0.4.x updates
  • langchain-text-splitters = "^1.0.0" - allows 1.x.x updates
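
The ranges implied by these caret constraints can be worked out mechanically. The sketch below assumes purely numeric versions with one to three components and follows the caret rules as documented for Poetry; `caret_range` is a hypothetical helper, not a Poetry API.

```python
def caret_range(constraint: str) -> tuple[str, str]:
    """Return the (lower, upper) bounds allowed by a caret constraint like '^0.4'."""
    parts = [int(p) for p in constraint.lstrip("^").split(".")]
    # The leftmost non-zero component is the one allowed to stay fixed;
    # if every given component is zero, the last given one is used.
    pivot = next((i for i, p in enumerate(parts) if p != 0), len(parts) - 1)
    lower = parts + [0] * (3 - len(parts))
    upper = [0, 0, 0]
    upper[:pivot] = lower[:pivot]
    upper[pivot] = lower[pivot] + 1
    return (
        ">=" + ".".join(map(str, lower)),
        "<" + ".".join(map(str, upper)),
    )


if __name__ == "__main__":
    for spec in ("^1.0.0", "^0.4"):
        print(spec, caret_range(spec))
```

Under these rules `^1.0.0` stays below 2.0.0 while `^0.4` stays below 0.5.0, which is why the langchain-community pin is the tighter of the three.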

Breaking Changes:
The main breaking changes from 0.1.x to 1.0.x were:

  1. Package reorganization: embeddings moved from langchain.embeddings to langchain_community.embeddings
  2. Text splitters moved to dedicated langchain_text_splitters package
  3. Core utilities moved to langchain_core.utils
  4. Some embedding classes now require additional parameters (e.g., user_agent for CohereEmbeddings)

All changes have been addressed in this PR with updated imports and parameters.

@aaronsteers aaronsteers requested a review from dbgold17 October 23, 2025 02:09
@aaronsteers
Contributor Author

aaronsteers commented Oct 23, 2025

@dbgold17 - Do you mind reviewing and approving if things look okay?

Context:

  • None of the Vector store destinations are on the latest version of the CDK anyway. So...
  • Merging+releasing should have no ill effect.
  • This clears a path for testing destination-pgvector next, and then others if PGVector succeeds.

Important: This also clears our path for the pandas and numpy dependencies to be bumped.

@aaronsteers aaronsteers enabled auto-merge (squash) October 23, 2025 02:12
Contributor

@dbgold17 dbgold17 left a comment


LGTM, if we can go all the way to 1.0.x without issue let's do that instead of my PR to go to 0.3.0

@aaronsteers aaronsteers disabled auto-merge October 23, 2025 05:11
@aaronsteers aaronsteers enabled auto-merge (squash) October 23, 2025 05:13
@dbgold17
Contributor

@aaronsteers I looked into the shopify error and my best guess is that the SSL failure is related to this dummy shop name defined in an invalid config file. The test produces the same result for me locally running on master as well, however for some reason it still marks it as PASSED while it is marked as a FAILURE on CI. Log snippet below from my local run:

{"type":"LOG","log":{"level":"ERROR","message":"Unexpected exception in error handler: HTTPSConnectionPool(host='shop_name.myshopify.com', port=443): Max retries exceeded with url: /admin/api/2025-01/shop.json?limit=250&order=updated_at+asc&updated_at_min=2020-11-01 (Caused by SSLError(SSLCertVerificationError(1, \"[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'shop_name.myshopify.com'. (_ssl.c:1006)\")))"}}
{"type":"LOG","log":{"level":"ERROR","message":"Giving up _send(...) after 6 tries (airbyte_cdk.sources.streams.http.exceptions.DefaultBackoffException: Unexpected exception in error handler: HTTPSConnectionPool(host='shop_name.myshopify.com', port=443): Max retries exceeded with url: /admin/api/2025-01/shop.json?limit=250&order=updated_at+asc&updated_at_min=2020-11-01 (Caused by SSLError(SSLCertVerificationError(1, \"[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'shop_name.myshopify.com'. (_ssl.c:1006)\"))))"}}
{"type":"LOG","log":{"level":"ERROR","message":"Check failed"}}
PASSED

@aaronsteers aaronsteers merged commit c0e506f into main Oct 23, 2025
43 of 46 checks passed
@aaronsteers aaronsteers deleted the devin/1761176353-bump-langchain branch October 23, 2025 20:42

Labels

enhancement New feature or request

3 participants