
Add MiniMax as LLM evaluation provider #32

Open
octo-patch wants to merge 1 commit into snap-research:main from octo-patch:feature/add-minimax-provider

Conversation

@octo-patch

Summary

  • Add MiniMax (M2.5, M2.5-highspeed, M2.7) as a new LLM provider for the LoCoMo long-term conversational memory benchmark
  • Follow the existing provider pattern (GPT, Claude, Gemini) with run_minimax() in global_methods.py and task_eval/minimax_utils.py
  • MiniMax models offer 204K-1M-token context windows via an OpenAI-compatible API, making them well suited to long-term conversation evaluation
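The temperature clamping and think-tag stripping mentioned above could look roughly like the sketch below. This is not the PR's diff: the base URL, the clamping bounds, and the helper shapes are assumptions, and only the function names `run_minimax()`/`set_minimax_key()` come from the PR description.

```python
import os
import re

# Hypothetical endpoint; the real base URL is defined in the PR's diff.
_MINIMAX_BASE_URL = "https://api.minimax.io/v1"

_THINK_TAG_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think_tags(text: str) -> str:
    """Remove <think>...</think> reasoning blocks emitted by M2.5 models."""
    return _THINK_TAG_RE.sub("", text).strip()


def clamp_temperature(t: float, low: float = 0.01, high: float = 1.0) -> float:
    """Clamp temperature into the range the API accepts (bounds assumed here)."""
    return max(low, min(high, t))


def run_minimax(prompt: str, model: str = "MiniMax-M2.5", temperature: float = 0.0) -> str:
    """Query a MiniMax model through the OpenAI-compatible chat endpoint."""
    from openai import OpenAI  # OpenAI-compatible client, as the PR describes

    client = OpenAI(api_key=os.environ["MINIMAX_API_KEY"], base_url=_MINIMAX_BASE_URL)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=clamp_temperature(temperature),
    )
    return strip_think_tags(resp.choices[0].message.content)
```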

Changes

| File | Description |
| --- | --- |
| `global_methods.py` | Add `run_minimax()` and `set_minimax_key()` with OpenAI-compatible API, temperature clamping, and think-tag stripping |
| `task_eval/minimax_utils.py` | New MiniMax evaluation utils with `get_minimax_answers()`, following the `claude_utils.py` pattern |
| `task_eval/evaluate_qa.py` | Add MiniMax model dispatch alongside GPT/Claude/Gemini |
| `scripts/evaluate_minimax.sh` | Evaluation script for MiniMax-M2.5 and MiniMax-M2.5-highspeed |
| `scripts/env.sh` | Add `MINIMAX_API_KEY` environment variable |
| `README.MD` | Add MiniMax evaluation instructions |
| `tests/test_minimax_unit.py` | 27 unit tests covering API calls, model mapping, temperature clamping, think-tag stripping, and answer parsing |
| `tests/test_minimax_integration.py` | 3 integration tests with real MiniMax API calls |
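The model dispatch added to `task_eval/evaluate_qa.py` presumably mirrors the existing GPT/Claude/Gemini branches. A sketch under that assumption follows; the mapping table and helper names are illustrative, not taken from the diff:

```python
# Hypothetical mapping from CLI model arguments to API model identifiers,
# mirroring the provider-dispatch pattern the PR describes.
SUPPORTED_MINIMAX_MODELS = {
    "minimax-m2.5": "MiniMax-M2.5",
    "minimax-m2.5-highspeed": "MiniMax-M2.5-highspeed",
    "minimax-m2.7": "MiniMax-M2.7",
}


def is_minimax_model(name: str) -> bool:
    """Route any minimax-prefixed model name to the MiniMax branch."""
    return name.lower().startswith("minimax")


def resolve_minimax_model(name: str) -> str:
    """Normalize a CLI model argument to the API's model identifier (assumed mapping)."""
    try:
        return SUPPORTED_MINIMAX_MODELS[name.lower()]
    except KeyError:
        raise ValueError(f"Unsupported MiniMax model: {name}") from None
```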

Usage

Set your MiniMax API key in `scripts/env.sh`, then run:

```bash
bash scripts/evaluate_minimax.sh
```

Test plan

  • 27 unit tests passing (mocked API calls)
  • 3 integration tests passing (real API calls)
  • Verify that the evaluation output format matches existing providers
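The think-tag stripping behavior covered by the unit tests could be exercised roughly as below. The stand-in helper reimplements the behavior the PR describes; the real function lives in `global_methods.py`, and its exact signature is an assumption here.

```python
import re


def strip_think_tags(text: str) -> str:
    """Stand-in for the PR's helper: drop <think>...</think> reasoning blocks."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


def test_strips_single_think_block():
    assert strip_think_tags("<think>chain of thought</think>final answer") == "final answer"


def test_strips_multiple_think_blocks():
    assert strip_think_tags("<think>a</think><think>b</think>x") == "x"


def test_passthrough_without_tags():
    assert strip_think_tags("plain answer") == "plain answer"
```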

Add MiniMax (M2.5, M2.5-highspeed, M2.7) as a new LLM provider for the
LoCoMo long-term conversational memory benchmark, following the existing
provider pattern (GPT, Claude, Gemini).

- Add run_minimax() to global_methods.py with OpenAI-compatible API,
  temperature clamping, and think-tag stripping for M2.5 models
- Create task_eval/minimax_utils.py with get_minimax_answers() following
  the claude_utils.py pattern (large context, no truncation)
- Update evaluate_qa.py with MiniMax model dispatch
- Add scripts/evaluate_minimax.sh evaluation script
- Add MINIMAX_API_KEY to scripts/env.sh
- Update README.MD with MiniMax evaluation instructions
- Add 27 unit tests and 3 integration tests
