
Conversation

@ruili33 ruili33 (Contributor) commented Jul 13, 2025

Before you open a pull-request, please check if a similar issue already exists or has been closed before.

When you open a pull-request, please be sure to include the following

  • A descriptive title: [xxx] XXXX
  • A detailed description

If you encounter lint warnings, you can use the following commands to reformat the code.

pip install pre-commit
pre-commit install
pre-commit run --all-files

Thank you for your contributions!

Summary for the Request

Hi EvolvingLMMs-Lab,

This pull request introduces support for the TimeScope benchmark, a new benchmark for evaluating long-form video comprehension in Large Multimodal Models. Further information on the design and methodology of TimeScope is available here: https://github.com/orrzohar/blog/blob/main/timescope.md.

Please let us know if any changes or additions are needed. We're happy to adjust to align with the repo's standards.

We appreciate your efforts in reviewing this contribution.

Best Regards,
The Apollo Team

Summary by CodeRabbit

  • New Features
    • Introduced configuration files for the "longtimescope" and "timescope" video evaluation tasks, enabling dataset selection, evaluation settings, and model-specific prompt templates.
    • Added utility modules for both tasks, supporting video frame conversion, answer extraction, robust video file resolution, and detailed accuracy metric aggregation with per-category reporting.

@coderabbitai coderabbitai bot (Contributor) commented Jul 13, 2025

Walkthrough

New YAML configuration files and utility modules were introduced for the "timescope" and "longtimescope" tasks in the lmms_eval framework. These additions define dataset paths, generation and evaluation parameters, model-specific prompt templates, and provide utility functions for video data handling, answer extraction, result processing, and metric aggregation.

Changes

File(s) Change Summary
lmms_eval/tasks/longtimescope/longtimescope.yaml Added YAML config for "longtimescope" task: dataset, generation, metrics, and model prompt templates.
lmms_eval/tasks/timescope/timescope.yaml Added YAML config for "timescope" task: dataset, generation, metrics, and model prompt templates.
lmms_eval/tasks/longtimescope/utils.py New utility module: video frame conversion, video path resolution, answer extraction, result processing, aggregation for "longtimescope".
lmms_eval/tasks/timescope/utils.py New utility module: video frame conversion, video path resolution, answer extraction, result processing, aggregation for "timescope".
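For orientation, a task config of this shape typically wires YAML settings to the utility functions via `!function` tags. The sketch below is illustrative only: the `dataset_path`, splits, and `doc_to_text` entry are placeholder assumptions; only the `utils.timescope_*` naming and the `timescope_perception_score` metric follow what is discussed in this PR.

```yaml
# Illustrative sketch, not the PR's actual file.
dataset_path: <huggingface-dataset-id>   # placeholder
task: timescope
test_split: test                          # placeholder
output_type: generate_until
doc_to_visual: !function utils.timescope_doc_to_visual
doc_to_text: !function utils.timescope_doc_to_text   # assumed helper name
process_results: !function utils.timescope_process_results
metric_list:
  - metric: timescope_perception_score
    aggregation: !function utils.timescope_aggregate_results
    higher_is_better: true
```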

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant TaskConfig (YAML)
    participant Utils
    participant Model

    User->>TaskConfig: Load task YAML (timescope/longtimescope)
    TaskConfig->>Utils: Assign processing, aggregation functions
    User->>Model: Run model with visual/text inputs
    Model-->>User: Generate predictions
    User->>Utils: Process results (extract answer, metadata)
    Utils->>Utils: Aggregate results (compute accuracy)
    Utils-->>User: Return evaluation metrics

Poem

In the meadow of code, two tasks now bloom,
With YAML and utils, they chase away gloom.
Frames and answers, all counted with care,
Metrics and prompts, a well-matched pair.
Timescope and longtimescope, hop into view—
The rabbit applauds this release, with a thump and a "whew!"
🐇✨


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🔭 Outside diff range comments (1)
lmms_eval/tasks/longtimescope/utils.py (1)

1-143: Significant code duplication with timescope/utils.py.

This file is nearly identical to lmms_eval/tasks/timescope/utils.py with only minor differences (line 34: loading longtimescope.yaml vs timescope.yaml). This violates the DRY principle and creates maintenance overhead.

Consider refactoring to share common code between both tasks by:

  1. Creating a shared base utility module with common functions
  2. Parameterizing the configuration file name
  3. Having task-specific modules inherit or import from the shared module

This would reduce code duplication from ~140 lines to potentially ~10 lines per task.

Would you like me to propose a refactored structure that eliminates this duplication?
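One possible shape for such a refactor is sketched below; the module path and function names are illustrative assumptions, not taken from the PR. The YAML loader skips `!function` lines (as the existing utils do), and the aggregation is parameter-free so both tasks can alias it.

```python
# Hypothetical shared module, e.g. lmms_eval/tasks/timescope/common.py
# (names are illustrative; the PR does not define this module).


def strip_function_tags(raw_lines):
    """Drop `!function` lines, which yaml.safe_load cannot parse, before loading a task YAML."""
    return [line for line in raw_lines if "!function" not in line]


def aggregate_accuracy(results):
    """Shared aggregation: overall accuracy (%) with per-category bookkeeping."""
    category2score = {}
    for res in results:
        category = res.get("category", "overall")
        correct, total = category2score.get(category, (0, 0))
        category2score[category] = (correct + int(res["correct"]), total + 1)
    total_correct = sum(correct for correct, _ in category2score.values())
    total_answered = sum(total for _, total in category2score.values())
    return 100.0 * total_correct / max(total_answered, 1)
```

Each task module would then reduce to a thin wrapper, e.g. `timescope_aggregate_results = aggregate_accuracy`, with only the YAML file name differing between the two tasks.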

♻️ Duplicate comments (1)
lmms_eval/tasks/longtimescope/utils.py (1)

1-15: Apply the same fixes as timescope/utils.py.

This file has identical issues to the timescope utilities:

  • Remove unused imports (lines 1-2, 6, 8, 10-11, 15)
  • Fix unused loop variable (line 37)
  • Fix f-string without placeholders (lines 50, 101)
  • Remove redundant assignment (line 54)
  • Replace sys.exit() with exception (line 60)
  • Fix missing commas in answer_prefixes list (lines 71-72)

Please apply the same fixes as recommended for lmms_eval/tasks/timescope/utils.py.

Also applies to: 37-37, 50-50, 53-54, 60-60, 71-72, 101-101

🧹 Nitpick comments (7)
lmms_eval/tasks/timescope/timescope.yaml (1)

37-37: Fix trailing spaces.

Remove the trailing spaces at the end of line 37 to comply with YAML formatting standards.

-  # qwen_vl:  
+  # qwen_vl:
lmms_eval/tasks/longtimescope/longtimescope.yaml (2)

37-37: Fix trailing spaces.

Remove the trailing spaces at the end of line 37 to comply with YAML formatting standards.

-  # qwen_vl:  
+  # qwen_vl:

11-11: Consider renaming shared utility functions.

The longtimescope task references utils.timescope_* functions, which may be confusing since these are shared between timescope and longtimescope. Consider renaming to more generic names like doc_to_visual, process_results, and aggregate_results to reflect their shared usage.

Also applies to: 21-21, 25-25

lmms_eval/tasks/timescope/utils.py (4)

37-37: Fix unused loop variable.

The loop variable i is not used within the loop body; since the index is never needed, drop enumerate entirely.

-    for i, line in enumerate(raw_data):
+    for line in raw_data:

50-50: Fix the malformed logging call.

The f-string has no placeholders, and the extra positional arguments are never interpolated into the message. Log a single formatted string instead.

-    eval_logger.info(f"base_cache_dir", base_cache_dir, "cache_name", cache_name)
+    eval_logger.info(f"base_cache_dir: {base_cache_dir}, cache_name: {cache_name}")

53-54: Remove redundant assignment.

Line 54 performs a redundant assignment where video_path is assigned to itself.

     if os.path.exists(video_path):
-        video_path = video_path
+        pass

101-101: Fix f-string without placeholders.

Remove the unnecessary f prefix from the return statement.

-    return {f"timescope_perception_score": data_dict}
+    return {"timescope_perception_score": data_dict}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c837cfb and 7a00a4d.

📒 Files selected for processing (4)
  • lmms_eval/tasks/longtimescope/longtimescope.yaml (1 hunks)
  • lmms_eval/tasks/longtimescope/utils.py (1 hunks)
  • lmms_eval/tasks/timescope/timescope.yaml (1 hunks)
  • lmms_eval/tasks/timescope/utils.py (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
lmms_eval/tasks/timescope/timescope.yaml

[error] 37-37: trailing spaces

(trailing-spaces)

lmms_eval/tasks/longtimescope/longtimescope.yaml

[error] 37-37: trailing spaces

(trailing-spaces)

🪛 Ruff (0.11.9)
lmms_eval/tasks/timescope/utils.py

  • F401 — unused imports: datetime (line 1), json (line 2), collections.defaultdict (line 6), typing.Dict / List / Optional / Union (line 8), cv2 (line 10), numpy (line 11), lmms_eval.tasks._task_utils.file_utils.generate_submission_file (line 15)
  • B007 — loop control variable not used within loop body: i (line 37), k (line 139)
  • F541 — f-string without any placeholders: lines 50, 101

lmms_eval/tasks/longtimescope/utils.py

  • Identical findings: F401 unused imports (lines 1, 2, 6, 8, 10, 11, 15), B007 unused loop variables i (line 37) and k (line 139), F541 f-strings without placeholders (lines 50, 101)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
lmms_eval/tasks/timescope/utils.py (2)

1-15: Address unused imports flagged by static analysis.

Multiple imports are not used in this module and should be removed to improve code clarity and reduce dependencies.


60-60: Replace sys.exit() with exception.

Using sys.exit() terminates the entire program, which may not be appropriate in a library context. Consider raising an exception instead to allow proper error handling.
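A minimal sketch of the suggested change (the function and variable names here are assumptions, not the PR's actual code): raising lets the evaluation harness, or any caller, catch and handle the failure, whereas sys.exit() would abort the whole run.

```python
import os


def resolve_video_path(cache_dir, video_file):
    """Resolve a video file under the cache dir, raising on failure instead of exiting.

    Hypothetical helper illustrating the review suggestion: replace sys.exit()
    with an exception so callers can decide how to handle a missing file.
    """
    video_path = os.path.join(cache_dir, video_file)
    if not os.path.exists(video_path):
        raise FileNotFoundError(f"Video not found: {video_path}")
    return video_path
```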

🧹 Nitpick comments (6)
lmms_eval/tasks/timescope/utils.py (6)

20-28: Remove commented-out code.

This commented-out code block appears to be leftover from development and should be removed to keep the codebase clean.

-# with open(Path(__file__).parent / "_default_template_yaml", "r") as f:
-#     raw_data = f.readlines()
-#     safe_data = []
-#     for i, line in enumerate(raw_data):
-#         # remove function definition since yaml load cannot handle it
-#         if "!function" not in line:
-#             safe_data.append(line)
-
-#     config = yaml.safe_load("".join(safe_data))

37-37: Fix unused loop variable.

The loop variable i is not used within the loop body; since the index is never needed, drop enumerate entirely.

-    for i, line in enumerate(raw_data):
+    for line in raw_data:

50-50: Fix the malformed logging call.

The f-string has no placeholders, and the extra positional arguments are never interpolated into the message. Log a single formatted string instead.

-    eval_logger.info(f"base_cache_dir", base_cache_dir, "cache_name", cache_name)
+    eval_logger.info(f"base_cache_dir: {base_cache_dir}, cache_name: {cache_name}")

103-103: Remove unnecessary f-string prefix.

This f-string doesn't contain any placeholders, so the f prefix is unnecessary.

-    return {f"timescope_perception_score": data_dict}
+    return {"timescope_perception_score": data_dict}

141-141: Fix unused loop variable.

The loop variable k is not used within the loop body; iterate over the values directly.

-    for k, v in category2score.items():
+    for v in category2score.values():

88-144: Consider adding type hints for better code documentation.

The functions would benefit from type hints to improve code clarity and enable better IDE support.

+from typing import Dict, List, Any

-def timescope_process_results(doc, results):
+def timescope_process_results(doc: Dict[str, Any], results: List[str]) -> Dict[str, Dict[str, Any]]:

-def timescope_aggregate_results(results):
+def timescope_aggregate_results(results: List[Dict[str, Any]]) -> float:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a00a4d and ea8ff21.

📒 Files selected for processing (1)
  • lmms_eval/tasks/timescope/utils.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
lmms_eval/tasks/timescope/utils.py

  • F401 — unused imports: datetime (line 1), json (line 2), collections.defaultdict (line 6), typing.Dict / List / Optional / Union (line 8), cv2 (line 10), numpy (line 11), lmms_eval.tasks._task_utils.file_utils.generate_submission_file (line 15)
  • B007 — loop control variable not used within loop body: i (line 37), k (line 141)
  • F541 — f-string without any placeholders: lines 50, 103

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Cursor BugBot

@cursor cursor bot left a comment

Bug: Incorrect F-string Usage in Logging

The eval_logger.info call incorrectly uses an f-string f"base_cache_dir" without placeholders, leading to base_cache_dir and cache_name being logged as separate, unformatted arguments. This results in "base_cache_dir" being logged literally, rather than a single formatted message. The call should be eval_logger.info(f"base_cache_dir: {base_cache_dir}, cache_name: {cache_name}").

lmms_eval/tasks/longtimescope/utils.py#L49-L50

cache_dir = os.path.join(base_cache_dir, cache_name)
eval_logger.info(f"base_cache_dir", base_cache_dir, "cache_name", cache_name)

lmms_eval/tasks/timescope/utils.py#L49-L50

cache_dir = os.path.join(base_cache_dir, cache_name)
eval_logger.info(f"base_cache_dir", base_cache_dir, "cache_name", cache_name)
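Assuming eval_logger is a loguru-style logger (which formats messages with str.format), the silent argument drop is easy to reproduce with plain str.format:

```python
# loguru formats the message with str.format(); positional args that have no
# matching {} placeholders are silently ignored, so only the literal string
# "base_cache_dir" would be emitted.
buggy = "base_cache_dir".format("/path/to/cache", "cache_name", "timescope")
assert buggy == "base_cache_dir"  # the path and cache name are dropped

# A single fully formatted message carries all the information:
fixed = "base_cache_dir: {}, cache_name: {}".format("/path/to/cache", "timescope")
assert fixed == "base_cache_dir: /path/to/cache, cache_name: timescope"
```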



Bug: Comma Omission Causes Prefix Concatenation

Missing commas in the answer_prefixes list cause unintended string concatenation (e.g., "The best option is" "The correct option is" becomes "The best option isThe correct option is"). This prevents the extract_characters_regex function from correctly identifying and removing the intended individual answer prefixes from responses.

lmms_eval/tasks/longtimescope/utils.py#L70-L73

"The answer",
"The best option is" "The correct option is",
"Best answer:" "Best option:",
]
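The root cause is Python's implicit concatenation of adjacent string literals. A standalone demonstration (the str.replace-based prefix stripping below is an assumption about how the extraction code uses the list, not the PR's exact code):

```python
# Adjacent string literals are concatenated at compile time, so a missing comma
# silently merges two intended list entries into one.
buggy_prefixes = [
    "The best option is" "The correct option is",  # missing comma
    "Best answer:" "Best option:",                 # missing comma
]
# Two merged entries instead of the four that were intended:
assert buggy_prefixes == [
    "The best option isThe correct option is",
    "Best answer:Best option:",
]

# With the commas restored, each prefix can be stripped individually:
fixed_prefixes = ["The best option is", "The correct option is", "Best answer:", "Best option:"]
response = "The correct option is (B)"
for prefix in fixed_prefixes:
    response = response.replace(prefix, "")
assert response.strip() == "(B)"
```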




@Luodian Luodian merged commit 0d8aac8 into EvolvingLMMs-Lab:main Jul 15, 2025
6 of 7 checks passed