Conversation

@Luodian Luodian commented Dec 30, 2025

Summary

  • Adds is_multimodal_content() helper function to detect actual multimodal data types
  • Filters out non-serializable multimodal content (images, audio, video) from --log_samples output
  • Preserves all scalar metadata fields (e.g., doc_id, image_id, image_path) for dataset traceability

Problem

The previous implementation in #943 removed all filtering, which could cause:

  • Serialization errors with PIL.Image, numpy arrays, torch tensors
  • Extremely large JSONL log files when binary data gets serialized

The original filtering ("image" not in key) was too aggressive and removed useful metadata fields.
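
For illustration, a minimal sketch of why key-name filtering is too aggressive (the doc fields are hypothetical but mirror the metadata named above):

from PIL import Image

# A hypothetical doc as a multimodal task might produce it.
doc = {
    "image": Image.new("RGB", (4, 4)),   # binary content
    "image_id": 42,                      # scalar metadata
    "image_path": "val/000042.jpg",      # scalar metadata
}
kept = {k: v for k, v in doc.items() if "image" not in k}
print(kept)  # prints {}: the metadata is dropped along with the binary image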

Solution

Type-based detection instead of key-name matching:

def is_multimodal_content(value: Any) -> bool:
    # Detects: PIL.Image, np.ndarray, torch.Tensor, bytes
    # HuggingFace formats: dict with 'array' (audio) or 'bytes' (image)

This preserves all ID/path/metadata fields while excluding actual binary content.
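
A fuller sketch of what such a detector can look like (the optional-import guards and the exact HuggingFace dict handling are assumptions, not necessarily the merged implementation):

from typing import Any

import numpy as np


def is_multimodal_content(value: Any) -> bool:
    """Return True for binary/tensor payloads that should be excluded from --log_samples output."""
    if isinstance(value, (bytes, bytearray)):
        return True
    if isinstance(value, np.ndarray):
        return True
    try:
        import torch
        if isinstance(value, torch.Tensor):
            return True
    except ImportError:
        pass
    try:
        from PIL import Image
        if isinstance(value, Image.Image):
            return True
    except ImportError:
        pass
    # HuggingFace datasets conventions: audio dicts carry "array", image dicts carry "bytes".
    if isinstance(value, dict) and ("array" in value or "bytes" in value):
        return True
    return False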

Test plan

  • Run evaluation with --log_samples on a multimodal task
  • Verify JSONL output contains metadata fields but not binary data
  • Verify no serialization errors occur

Closes #943

- Fix FileExistsError -> FileNotFoundError (correct exception type)
- Replace print() with eval_logger for consistent logging
- Add type hints to all functions
- Fix missing comma bug in final_answer_patterns list
- Remove redundant image_path = image_path assignment
- Initialize op variable to prevent potential UnboundLocalError
- Break long prompt string for readability (88 char line limit)
…data

When using --log_samples, the previous implementation either saved all
fields (causing serialization issues with images/audio) or filtered based
on key names (missing useful metadata like image_id, image_path).

This fix introduces is_multimodal_content() that detects actual multimodal
data types (PIL.Image, numpy arrays, torch tensors, HuggingFace audio/image
dicts) while preserving all scalar metadata fields for dataset traceability.

Github-Issue:#943
@Luodian Luodian merged commit c79490b into main Dec 30, 2025
1 of 3 checks passed
@Luodian Luodian deleted the fix/filter-multimodal-content-in-log-samples branch December 30, 2025 16:29

claude bot commented Dec 30, 2025

Code Review: PR #962

Summary

This PR addresses filtering of multimodal content from log samples while preserving metadata. The overall approach is solid and well-thought-out, introducing a type-based detection system instead of key-name matching.

Positive Aspects

  • Good Solution Design: Type-based detection via is_multimodal_content() is more robust than key-name filtering
  • Proper Type Hints: All functions now have proper type annotations
  • Better Logging: Replaced print() with eval_logger for consistency
  • Bug Fixes: Fixed several issues in spatialviz utils (FileExistsError → FileNotFoundError, missing comma, etc.)
  • Code Quality: Applied formatting and improved readability

Issues & Recommendations

🔴 Critical Issues

  1. Missing Test Coverage (CLAUDE.md violation)

    • The PR description mentions this is a bug fix related to #943 ([Task] add AV-SpeakerBench), but there are no tests
    • Per CLAUDE.md: "Bug fixes require regression tests"
    • Recommendation: Add tests for is_multimodal_content() covering:
      • PIL Images
      • numpy arrays
      • torch tensors
      • HuggingFace dict formats ({'array': ...}, {'bytes': ...})
      • Regular dicts that should pass through
      • Edge cases (None, empty dicts, etc.)
  2. Potential False Positives in is_multimodal_content() (lmms_eval/utils.py:120-122)

    • The function returns True for ANY dict containing 'array' or 'bytes' keys
    • This could filter out legitimate metadata dicts like `{"array": [1, 2, 3], "metadata": "..."}`
    • Recommendation: Make the detection more specific:
      if isinstance(value, dict):
          # HuggingFace audio format has 'array' with numpy array
          if 'array' in value and isinstance(value.get('array'), np.ndarray):
              return True
          # HuggingFace image format has 'bytes' with bytes data  
          if 'bytes' in value and isinstance(value.get('bytes'), (bytes, bytearray)):
              return True

🟡 Medium Priority Issues

  1. Uninitialized Variable Risk (lmms_eval/tasks/spatialviz/utils.py:73-89)

    • Variable op is declared outside the if/else but only initialized inside
    • While the current logic appears safe, if both conditions fail, op would be unbound
    • The commit message says "Initialize op variable to prevent potential UnboundLocalError" but op: List[str] = [] would be clearer
    • Recommendation: Initialize explicitly at declaration:
      op: List[str] = []
      if think_match and answer_match:
          # ...
  2. PR References Wrong Issue

    • PR description says "Closes #943", but #943 is "[Task] add AV-SpeakerBench" and is not related to log samples
    • The commit trailer says Github-Issue:#943 which is incorrect
    • Recommendation: Update PR description and commit message to reference the correct issue, or remove the reference if there isn't one
  3. Incomplete Multimodal Detection

    • The function doesn't handle other potential multimodal types like:
      • pathlib.Path objects pointing to media files
      • Video formats (could be file handles or custom objects)
      • Audio objects from other libraries
    • Recommendation: Consider whether these need filtering too, or document the explicit scope

🟢 Minor Issues

  1. Import Organization (lmms_eval/utils.py:123-124)

    • PIL.Image is imported inside a try/except within the function
    • This is executed every time is_multimodal_content() is called
    • Recommendation: Import at module level or cache the import check:
      # At module level
      try:
          from PIL import Image as PILImage
          HAS_PIL = True
      except ImportError:
          HAS_PIL = False
      
      # In function
      if HAS_PIL and isinstance(value, PILImage.Image):
          return True
  2. Documentation (lmms_eval/utils.py:105-117)

    • The docstring is good but could clarify the preservation behavior
    • Recommendation: Add note about what gets preserved:
      """
      Check if a value is multimodal content that should not be serialized.
      
      This allows metadata fields (image_id, image_path, etc.) to be preserved
      while filtering actual binary/tensor data.
      
      Returns True for:
      ...
      """
  3. Type Hint for spatialviz_aggregate_results (lmms_eval/tasks/spatialviz/utils.py:126)

    • Return type is float but the function logs detailed breakdown
    • Consider if the return signature should be more explicit about what the float represents
    • Recommendation: Add docstring explaining the return value is overall accuracy

Code Style Compliance

CLAUDE.md Compliance:

  • Type hints: ✅ Added
  • Docstrings: ⚠️ Partial (could be improved)
  • Line length: ✅ Properly wrapped
  • PEP 8 naming: ✅ Correct
  • f-strings: ✅ Used appropriately

Performance Considerations

  • The is_multimodal_content() function is called for every field in every document during logging
  • The PIL import inside the function adds overhead on every call
  • For large evaluations with many documents, this could add up
  • Recommendation: Cache the PIL import check as suggested above; one way to do that is sketched below
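
A sketch of that caching with functools.lru_cache (the helper names are hypothetical):

from functools import lru_cache
from typing import Any, Optional


@lru_cache(maxsize=1)
def _pil_image_type() -> Optional[type]:
    # Resolve PIL.Image.Image once; subsequent calls hit the cache.
    try:
        from PIL import Image
        return Image.Image
    except ImportError:
        return None


def looks_like_pil_image(value: Any) -> bool:
    pil_type = _pil_image_type()
    return pil_type is not None and isinstance(value, pil_type)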

Security Considerations

✅ No security concerns identified - the code doesn't execute user input or introduce injection risks

Testing Recommendations

Based on CLAUDE.md requirements, please add:

# tests/test_utils.py or similar
import numpy as np
import torch
from PIL import Image
from lmms_eval.utils import is_multimodal_content

def test_is_multimodal_content_with_images():
    img = Image.new('RGB', (100, 100))
    assert is_multimodal_content(img) is True

def test_is_multimodal_content_with_numpy():
    arr = np.array([1, 2, 3])
    assert is_multimodal_content(arr) is True

def test_is_multimodal_content_with_torch():
    tensor = torch.tensor([1, 2, 3])
    assert is_multimodal_content(tensor) is True

def test_is_multimodal_content_with_bytes():
    assert is_multimodal_content(b'binary') is True
    assert is_multimodal_content(bytearray(b'binary')) is True

def test_is_multimodal_content_with_hf_formats():
    # HuggingFace audio format
    assert is_multimodal_content({'array': np.array([1])}) is True
    # HuggingFace image format  
    assert is_multimodal_content({'bytes': b'data'}) is True

def test_is_multimodal_content_preserves_metadata():
    # These should NOT be filtered
    assert is_multimodal_content('image_path.jpg') is False
    assert is_multimodal_content({'image_id': 123}) is False
    assert is_multimodal_content({'doc_id': 'abc'}) is False
    assert is_multimodal_content(123) is False
    assert is_multimodal_content([1, 2, 3]) is False

Overall Assessment

This is a well-intentioned fix that addresses a real problem. The main concerns are:

  1. Missing tests (required by CLAUDE.md)
  2. Overly broad dict filtering (could cause false positives)
  3. Performance overhead from repeated PIL import

Recommendation: Request changes to address the critical issues before merging.


Review based on CLAUDE.md guidelines and best practices

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 565 to 567:

      for key, value in doc.items():
-         saved_doc[key] = value
+         if not is_multimodal_content(value):
+             saved_doc[key] = value


P2: Filter lists of multimodal content before logging

The new filter only checks the top-level value type, so a list of images/tensors still passes through unchanged. Several tasks populate doc["images"] with a list of PIL images (e.g., lmms_eval/tasks/stare/utils.py around lines 105–109 and live_bench/utils.py), meaning --log_samples will still attempt to serialize those lists and can hit the same serialization/size problems this change is meant to fix. Consider recursively filtering lists/tuples or stripping multimodal elements before saving.
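
A minimal sketch of such a recursive filter (it assumes the merged is_multimodal_content helper; the function name and placeholder string are illustrative):

from typing import Any

from lmms_eval.utils import is_multimodal_content


def strip_multimodal(value: Any) -> Any:
    # Recursively replace multimodal leaves with a short placeholder so that
    # lists such as doc["images"] never reach the JSONL writer, while scalar
    # metadata passes through untouched.
    if is_multimodal_content(value):
        return "<multimodal omitted>"
    if isinstance(value, (list, tuple)):
        return [strip_multimodal(v) for v in value]
    if isinstance(value, dict):
        return {k: strip_multimodal(v) for k, v in value.items()}
    return value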

