Skip to content

Add more tool call integration tests and improve integration test infrastructure#142

Open
atdrendel wants to merge 6 commits intoml-explore:mainfrom
shareup:pr-133-qwen3.5-tool-calling
Open

Add more tool call integration tests and improve integration test infrastructure#142
atdrendel wants to merge 6 commits intoml-explore:mainfrom
shareup:pr-133-qwen3.5-tool-calling

Conversation

@atdrendel
Copy link
Copy Markdown
Contributor

@atdrendel atdrendel commented Mar 11, 2026

Proposed changes

I've noticed that tool calling doesn't always get tested when adding support for new models to mlx-swift-lm. Additionally, modifying the tool calling code is, at least to me, a bit stressful because we don't have robust tool calling integration tests proving code changes don't break tool calling for older models.

This pull request hopes to start improving our support for tool call testing going forward by:

  1. Adding tool call integration tests for Qwen3.5 and Nemotron
  2. Extending IntegrationTestModels to support the new models being tested
  3. Using IntegrationTestModels inside of ToolCallIntegrationTests to allow for lazy loading of the models (meaning that developers who want to add a new model can just download the model they need if they only want to run the integration tests for that single model)

Note: The Nemotron integration tests are skipped by default because the model is huge. However, enabling the tests involves just swapping false for true in XCSkipIf(). This seems like a sensible choice to me because of the size of the model, but I'd be curious what others think.

Here are a few of the recent pull requests made to address broken tool calling in some popular models:

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

@DePasqualeOrg
Copy link
Copy Markdown
Contributor

In #118, all of the integration tests were moved to a separate module that can be imported to run in the adapter packages. So either this gets merged first and the new/revised tests need to be re-implemented there, or that gets merged first and the tests need to be implemented in the new way.

@atdrendel
Copy link
Copy Markdown
Contributor Author

I'm happy to have #118 get merged first, and then I'll re-add these tests in the correct spot.

@atdrendel atdrendel force-pushed the pr-133-qwen3.5-tool-calling branch from f888b37 to 877cf57 Compare March 12, 2026 21:48
@atdrendel atdrendel marked this pull request as ready for review March 12, 2026 21:51
Copilot AI review requested due to automatic review settings March 12, 2026 21:51
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors MLXLM integration tests to use a shared async model-loading actor (instead of per-test-suite static state) and expands tool-call integration coverage to additional models.

Changes:

  • Centralized model IDs and added cached async container loaders in IntegrationTestModels.
  • Updated tool-call integration tests to lazily load model containers via IntegrationTestModels with skip-on-unavailable behavior.
  • Added new Nemotron and Qwen3.5 tool-call auto-detection and end-to-end parsing tests.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
Tests/MLXLMIntegrationTests/ToolCallIntegrationTests.swift Reworked model loading to use shared async loaders; added Nemotron + Qwen3.5 tool-call tests.
Tests/MLXLMIntegrationTests/IntegrationTestModels.swift Added additional model IDs and per-model cached loading methods (via stored Tasks) for integration tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants