Conversation

@kcz358 (Collaborator) commented Jan 4, 2026

Before you open a pull-request, please check if a similar issue already exists or has been closed before.

When you open a pull-request, please be sure to include the following

  • A descriptive title: [xxx] XXXX
  • A detailed description

If you run into lint warnings, you can use the following commands to reformat the code.

pip install pre-commit
pre-commit install
pre-commit run --all-files

Ask for review

Once you are comfortable with your PR, feel free to @ one of the contributors to review it

@kcz358 @KemingWu

Thank you for your contributions!

@kcz358 (Collaborator, Author) commented Jan 4, 2026

Fixes the issue mentioned in #941.

@kcz358 kcz358 merged commit 69e5287 into main Jan 4, 2026
3 checks passed
@kcz358 kcz358 deleted the dev/imgedit branch January 4, 2026 03:38

@claude (bot) commented Jan 4, 2026

Code Review for PR #966

Summary

This PR fixes the OpenAI API call structure in the ImgEdit evaluation logic by reordering message content and adding a temperature parameter. The changes are valid and improve evaluation quality.


Changes Analysis

1. Message Content Reordering

Change: Images now appear before the text prompt in the message content array.

Rationale:

  • For vision-language models (like GPT-4o), placing images before instructions follows OpenAI's recommended best practices
  • The prompts in prompt.py explicitly say "Below are the images before and after editing:" at the end, implying images should be presented first
  • This ordering helps the model better understand the visual context before receiving scoring instructions

Impact: Should improve evaluation accuracy and consistency.


2. Temperature Parameter Addition

Change: Added temperature=0.1 to the API call.

Rationale:

  • For evaluation/scoring tasks, low temperature (0-0.2) ensures deterministic, consistent results
  • This aligns with other evaluation code in the codebase:
    • lmms_eval/tasks/mia_bench/utils.py:72 uses temperature: 0
    • lmms_eval/tasks/egothink/utils.py:149 uses temperature=0
  • Temperature 0.1 (vs. 0.0) introduces only minimal randomness, so scores remain largely reproducible

Impact: More consistent scoring across multiple runs of the same evaluation.
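
For reference, a minimal sketch of the request shape that both changes produce, using the official openai Python SDK. The model name and variable values below are illustrative placeholders, not values taken from the PR diff:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    original_b64 = "..."  # base64-encoded image before editing (placeholder)
    edited_b64 = "..."    # base64-encoded image after editing (placeholder)
    full_prompt = "..."   # scoring instructions built from prompt.py (placeholder)

    # Change 1: both images come before the text prompt.
    content = [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{original_b64}"}},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{edited_b64}"}},
        {"type": "text", "text": full_prompt},
    ]

    # Change 2: low temperature for near-deterministic scoring.
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; the actual judge model is set by the task config
        messages=[{"role": "user", "content": content}],
        temperature=0.1,
    )
    print(response.choices[0].message.content)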


Code Quality Assessment

Strengths:

  • Minimal, focused change (2 additions, 1 deletion)
  • Follows existing patterns in the codebase
  • No breaking changes to API or functionality
  • Appropriate for bug fix classification

Considerations:

  1. Test Coverage: No tests exist for the imgedit module. Consider adding tests for:

    • _call_openai_for_evaluation() function (with mocked OpenAI client)
    • parse_gpt_scores() function with various response formats
    • Message content structure validation
  2. Documentation: The change is self-explanatory, but a code comment explaining why images come before text could help future maintainers:

    # Place images before text for better vision model comprehension
    content = [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{original_b64}"}},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{edited_b64}"}},
        {"type": "text", "text": full_prompt},
    ]
  3. Potential Issue - Line 201: The temperature=0.1 parameter is appropriate, but ensure this value is tested. If evaluation variance is observed, consider:

    • Making the temperature configurable via an environment variable (see the sketch after this list)
    • Documenting the choice in code comments
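
For the configurability point above, a minimal sketch, assuming a hypothetical environment variable (IMGEDIT_JUDGE_TEMPERATURE is not something the codebase currently defines):

    import os

    # Hypothetical env var; falls back to the value introduced in this PR.
    DEFAULT_JUDGE_TEMPERATURE = 0.1
    judge_temperature = float(os.getenv("IMGEDIT_JUDGE_TEMPERATURE", str(DEFAULT_JUDGE_TEMPERATURE)))

    # Then pass it through to the API call:
    # client.chat.completions.create(..., temperature=judge_temperature)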

Security & Performance

  • Security: ✅ No security concerns. The change doesn't introduce new attack vectors.
  • Performance: ✅ Neutral impact. Message reordering and temperature parameter don't affect API latency.

Compliance with CLAUDE.md Guidelines

✅ Minimal Changes: Only modifies code related to the bug fix
✅ Simplicity: Straightforward, readable change
✅ Code Style: Follows existing formatting (88-char limit: ✓, indentation: ✓)
⚠️ Testing: No regression test added (though none existed before)
⚠️ Type Hints: Existing code already has proper type hints


Recommendations

  1. Merge as-is: The fix is correct and improves evaluation quality
  2. Future enhancement: Add integration tests for the ImgEdit evaluation pipeline (a unit-test sketch follows this list)
  3. Optional: Add a brief inline comment explaining the image-first ordering rationale
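
For the testing recommendation, a sketch of what a unit test could look like, mocking the OpenAI client so no network call is made. The module path and the signature of _call_openai_for_evaluation below are assumptions for illustration only; adjust them to the real imgedit utilities:

    from unittest.mock import MagicMock

    def test_images_precede_text_prompt():
        # Hypothetical import path and call signature; adapt to the actual module.
        from lmms_eval.tasks.imgedit import utils

        mock_client = MagicMock()
        mock_client.chat.completions.create.return_value = MagicMock(
            choices=[MagicMock(message=MagicMock(content="score: 4"))]
        )

        utils._call_openai_for_evaluation(
            client=mock_client,
            original_b64="aGVsbG8=",
            edited_b64="d29ybGQ=",
            full_prompt="Rate the edit from 1 to 5.",
        )

        # The judge request should list both images before the text prompt
        # and use the low temperature introduced in this PR.
        _, kwargs = mock_client.chat.completions.create.call_args
        content = kwargs["messages"][0]["content"]
        assert [part["type"] for part in content] == ["image_url", "image_url", "text"]
        assert kwargs["temperature"] == 0.1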

Final Verdict

APPROVED

This is a well-targeted bug fix that improves the correctness of the OpenAI API integration. The changes align with best practices and codebase conventions.

Suggested commit message enhancement:

[Fix] Fix imgedit eval logic for call openai client

Reorder message content to place images before text prompt, following 
OpenAI vision model best practices. Add temperature=0.1 for consistent 
evaluation scoring.

This ensures better visual context understanding by the evaluation model
and improves reproducibility of scores across multiple runs.
