Conversation

@Luodian Luodian commented Dec 30, 2025

Summary

  • Adds is_multimodal_content() helper function to detect actual multimodal data types
  • Filters out non-serializable multimodal content (images, audio, video) from --log_samples output
  • Preserves all scalar metadata fields (e.g., doc_id, image_id, image_path) for dataset traceability

Problem

The previous implementation in #943 removed all filtering, which could cause:

  • Serialization errors with PIL.Image, numpy arrays, torch tensors
  • Extremely large JSONL log files when binary data gets serialized

The original filtering ("image" not in key) was too aggressive and removed useful metadata fields.
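
For illustration, a minimal sketch of why key-name filtering is too aggressive (the doc fields are hypothetical but mirror the metadata named above):

from PIL import Image

# A hypothetical doc as a multimodal task might produce it.
doc = {
    "image": Image.new("RGB", (4, 4)),   # binary content
    "image_id": 42,                      # scalar metadata
    "image_path": "val/000042.jpg",      # scalar metadata
}
kept = {k: v for k, v in doc.items() if "image" not in k}
print(kept)  # prints {}: the metadata is dropped along with the binary image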

Solution

Type-based detection instead of key-name matching:

def is_multimodal_content(value: Any) -> bool:
    # Detects: PIL.Image, np.ndarray, torch.Tensor, bytes
    # HuggingFace formats: dict with 'array' (audio) or 'bytes' (image)

This preserves all ID/path/metadata fields while excluding actual binary content.
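
A fuller sketch of what such a detector can look like (the optional-import guards and the exact HuggingFace dict handling are assumptions, not necessarily the merged implementation):

from typing import Any

import numpy as np


def is_multimodal_content(value: Any) -> bool:
    """Return True for binary/tensor payloads that should be excluded from --log_samples output."""
    if isinstance(value, (bytes, bytearray)):
        return True
    if isinstance(value, np.ndarray):
        return True
    try:
        import torch
        if isinstance(value, torch.Tensor):
            return True
    except ImportError:
        pass
    try:
        from PIL import Image
        if isinstance(value, Image.Image):
            return True
    except ImportError:
        pass
    # HuggingFace datasets conventions: audio dicts carry "array", image dicts carry "bytes".
    if isinstance(value, dict) and ("array" in value or "bytes" in value):
        return True
    return False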

Test plan

  • Run evaluation with --log_samples on a multimodal task
  • Verify JSONL output contains metadata fields but not binary data
  • Verify no serialization errors occur

Closes #943

- Fix FileExistsError -> FileNotFoundError (correct exception type)
- Replace print() with eval_logger for consistent logging
- Add type hints to all functions
- Fix missing comma bug in final_answer_patterns list
- Remove redundant image_path = image_path assignment
- Initialize op variable to prevent potential UnboundLocalError
- Break long prompt string for readability (88 char line limit)
…data

When using --log_samples, the previous implementation either saved all
fields (causing serialization issues with images/audio) or filtered based
on key names (missing useful metadata like image_id, image_path).

This fix introduces is_multimodal_content() that detects actual multimodal
data types (PIL.Image, numpy arrays, torch tensors, HuggingFace audio/image
dicts) while preserving all scalar metadata fields for dataset traceability.

Github-Issue:#943
@Luodian Luodian merged commit c79490b into main Dec 30, 2025
1 of 3 checks passed
@Luodian Luodian deleted the fix/filter-multimodal-content-in-log-samples branch December 30, 2025 16:29

claude bot commented Dec 30, 2025

Code Review: PR #962

Summary

This PR addresses filtering of multimodal content from log samples while preserving metadata. The overall approach is solid and well-thought-out, introducing a type-based detection system instead of key-name matching.

Positive Aspects

  • Good Solution Design: Type-based detection via is_multimodal_content() is more robust than key-name filtering
  • Proper Type Hints: All functions now have proper type annotations
  • Better Logging: Replaced print() with eval_logger for consistency
  • Bug Fixes: Fixed several issues in spatialviz utils (FileExistsError → FileNotFoundError, missing comma, etc.)
  • Code Quality: Applied formatting and improved readability

Issues & Recommendations

🔴 Critical Issues

  1. Missing Test Coverage (CLAUDE.md violation)

    • The PR description mentions this is a bug fix related to #943 ([Task] add AV-SpeakerBench), but there are no tests
    • Per CLAUDE.md: "Bug fixes require regression tests"
    • Recommendation: Add tests for is_multimodal_content() covering:
      • PIL Images
      • numpy arrays
      • torch tensors
      • HuggingFace dict formats ({'array': ...}, {'bytes': ...})
      • Regular dicts that should pass through
      • Edge cases (None, empty dicts, etc.)
  2. Potential False Positives in is_multimodal_content() (lmms_eval/utils.py:120-122)

    • The function returns True for ANY dict containing 'array' or 'bytes' keys
    • This could filter out legitimate metadata dicts like `{"array": [1, 2, 3], "metadata": "..."}`
    • Recommendation: Make the detection more specific:
      if isinstance(value, dict):
          # HuggingFace audio format has 'array' with numpy array
          if 'array' in value and isinstance(value.get('array'), np.ndarray):
              return True
          # HuggingFace image format has 'bytes' with bytes data  
          if 'bytes' in value and isinstance(value.get('bytes'), (bytes, bytearray)):
              return True

🟡 Medium Priority Issues

  1. Uninitialized Variable Risk (lmms_eval/tasks/spatialviz/utils.py:73-89)

    • Variable op is declared outside the if/else but only initialized inside
    • While the current logic appears safe, if both conditions fail, op would be unbound
    • The commit message says "Initialize op variable to prevent potential UnboundLocalError" but op: List[str] = [] would be clearer
    • Recommendation: Initialize explicitly at declaration:
      op: List[str] = []
      if think_match and answer_match:
          # ...
  2. PR References Wrong Issue

    • PR description says "Closes #943", but #943 is "[Task] add AV-SpeakerBench" and is not related to log samples
    • The commit trailer says Github-Issue:#943 which is incorrect
    • Recommendation: Update PR description and commit message to reference the correct issue, or remove the reference if there isn't one
  3. Incomplete Multimodal Detection

    • The function doesn't handle other potential multimodal types like:
      • pathlib.Path objects pointing to media files
      • Video formats (could be file handles or custom objects)
      • Audio objects from other libraries
    • Recommendation: Consider whether these need filtering too, or document the explicit scope

🟢 Minor Issues

  1. Import Organization (lmms_eval/utils.py:123-124)

    • PIL.Image is imported inside a try/except within the function
    • This is executed every time is_multimodal_content() is called
    • Recommendation: Import at module level or cache the import check:
      # At module level
      try:
          from PIL import Image as PILImage
          HAS_PIL = True
      except ImportError:
          HAS_PIL = False
      
      # In function
      if HAS_PIL and isinstance(value, PILImage.Image):
          return True
  2. Documentation (lmms_eval/utils.py:105-117)

    • The docstring is good but could clarify the preservation behavior
    • Recommendation: Add note about what gets preserved:
      """
      Check if a value is multimodal content that should not be serialized.
      
      This allows metadata fields (image_id, image_path, etc.) to be preserved
      while filtering actual binary/tensor data.
      
      Returns True for:
      ...
      """
  3. Type Hint for spatialviz_aggregate_results (lmms_eval/tasks/spatialviz/utils.py:126)

    • Return type is float but the function logs detailed breakdown
    • Consider if the return signature should be more explicit about what the float represents
    • Recommendation: Add docstring explaining the return value is overall accuracy

Code Style Compliance

CLAUDE.md Compliance:

  • Type hints: ✅ Added
  • Docstrings: ⚠️ Partial (could be improved)
  • Line length: ✅ Properly wrapped
  • PEP 8 naming: ✅ Correct
  • f-strings: ✅ Used appropriately

Performance Considerations

  • The is_multimodal_content() function is called for every field in every document during logging
  • The PIL import inside the function adds overhead on every call
  • For large evaluations with many documents, this could add up
  • Recommendation: Cache the PIL import check as suggested above; one way to do that is sketched below
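
A sketch of that caching with functools.lru_cache (the helper names are hypothetical):

from functools import lru_cache
from typing import Any, Optional


@lru_cache(maxsize=1)
def _pil_image_type() -> Optional[type]:
    # Resolve PIL.Image.Image once; subsequent calls hit the cache.
    try:
        from PIL import Image
        return Image.Image
    except ImportError:
        return None


def looks_like_pil_image(value: Any) -> bool:
    pil_type = _pil_image_type()
    return pil_type is not None and isinstance(value, pil_type)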

Security Considerations

✅ No security concerns identified - the code doesn't execute user input or introduce injection risks

Testing Recommendations

Based on CLAUDE.md requirements, please add:

# tests/test_utils.py or similar
import numpy as np
import torch
from PIL import Image
from lmms_eval.utils import is_multimodal_content

def test_is_multimodal_content_with_images():
    img = Image.new('RGB', (100, 100))
    assert is_multimodal_content(img) is True

def test_is_multimodal_content_with_numpy():
    arr = np.array([1, 2, 3])
    assert is_multimodal_content(arr) is True

def test_is_multimodal_content_with_torch():
    tensor = torch.tensor([1, 2, 3])
    assert is_multimodal_content(tensor) is True

def test_is_multimodal_content_with_bytes():
    assert is_multimodal_content(b'binary') is True
    assert is_multimodal_content(bytearray(b'binary')) is True

def test_is_multimodal_content_with_hf_formats():
    # HuggingFace audio format
    assert is_multimodal_content({'array': np.array([1])}) is True
    # HuggingFace image format  
    assert is_multimodal_content({'bytes': b'data'}) is True

def test_is_multimodal_content_preserves_metadata():
    # These should NOT be filtered
    assert is_multimodal_content('image_path.jpg') is False
    assert is_multimodal_content({'image_id': 123}) is False
    assert is_multimodal_content({'doc_id': 'abc'}) is False
    assert is_multimodal_content(123) is False
    assert is_multimodal_content([1, 2, 3]) is False

Overall Assessment

This is a well-intentioned fix that addresses a real problem. The main concerns are:

  1. Missing tests (required by CLAUDE.md)
  2. Overly broad dict filtering (could cause false positives)
  3. Performance overhead from repeated PIL import

Recommendation: Request changes to address the critical issues before merging.


Review based on CLAUDE.md guidelines and best practices

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 565 to 567:

      for key, value in doc.items():
-         saved_doc[key] = value
+         if not is_multimodal_content(value):
+             saved_doc[key] = value


P2: Filter lists of multimodal content before logging

The new filter only checks the top-level value type, so a list of images/tensors still passes through unchanged. Several tasks populate doc["images"] with a list of PIL images (e.g., lmms_eval/tasks/stare/utils.py around lines 105–109 and live_bench/utils.py), meaning --log_samples will still attempt to serialize those lists and can hit the same serialization/size problems this change is meant to fix. Consider recursively filtering lists/tuples or stripping multimodal elements before saving.
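
A minimal sketch of such a recursive filter (it assumes the merged is_multimodal_content helper; the function name and placeholder string are illustrative):

from typing import Any

from lmms_eval.utils import is_multimodal_content


def strip_multimodal(value: Any) -> Any:
    # Recursively replace multimodal leaves with a short placeholder so that
    # lists such as doc["images"] never reach the JSONL writer, while scalar
    # metadata passes through untouched.
    if is_multimodal_content(value):
        return "<multimodal omitted>"
    if isinstance(value, (list, tuple)):
        return [strip_multimodal(v) for v in value]
    if isinstance(value, dict):
        return {k: strip_multimodal(v) for k, v in value.items()}
    return value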

