
Conversation

kcz358 (Collaborator) commented Jan 6, 2026

Before you open a pull request, please check whether a similar issue already exists or has been closed before.

When you open a pull request, please be sure to include the following:

  • A descriptive title: [xxx] XXXX
  • A detailed description

If you encounter lint warnings, you can use the following commands to reformat the code.

pip install pre-commit
pre-commit install
pre-commit run --all-files

Ask for review

Once you are happy with your PR, feel free to @ one of the contributors for review.

Eval Server: @Luodian @kcz358 @mwxely

Summary

Introduces an HTTP server for running lmms-eval evaluations via a REST API, enabling programmatic evaluation submission and job management.

Features

  • HTTP Server (lmms_eval/entrypoints/http_server.py)
  • Endpoints: /evaluate, /jobs/{id}, /queue, /health, /tasks, /models (a quick probe of these follows this list)
  • Python Client (lmms_eval/entrypoints/client.py)
  • Server Launch (lmms_eval/launch_server.py)
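
As a quick sanity check, the read-only endpoints can be probed directly with httpx; this is a minimal sketch, and the shape of each JSON response is an assumption rather than the server's documented schema:

import httpx

BASE_URL = "http://localhost:8000"

with httpx.Client(base_url=BASE_URL, timeout=10.0) as http:
    print(http.get("/health").json())  # liveness check
    print(http.get("/tasks").json())   # tasks the server can evaluate
    print(http.get("/models").json())  # registered model types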

Future plans:

  • Use this as a base for future refactoring of lmms-eval logic (agentic evaluation, streaming QA)
  • Perform statistical testing on model scores
  • With the client interface, easily run head-to-head comparisons of two models in the Anthropic style, allowing us to maintain a leaderboard and reports in the future (see the sketch after this list)
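
A client-side sketch of the two-model comparison idea, using only the EvalClient methods shown in the use case below; the second checkpoint and the shape of the results payload are illustrative assumptions:

from lmms_eval.entrypoints import EvalClient

client = EvalClient("http://localhost:8000")
job_ids = {}
for name, pretrained in {
    "7b": "Qwen/Qwen2.5-VL-7B-Instruct",
    "3b": "Qwen/Qwen2.5-VL-3B-Instruct",  # illustrative second checkpoint
}.items():
    job = client.evaluate(
        model="qwen2_5_vl",
        tasks=["mme"],
        model_args={"pretrained": pretrained},
    )
    job_ids[name] = job["job_id"]

# Block on both jobs, then compare the result payloads side by side
results = {name: client.wait_for_job(jid) for name, jid in job_ids.items()}
print(results["7b"], results["3b"])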

Launch Server

python -m lmms_eval.launch_server --host 0.0.0.0 --port 8000

Use Case

"""
Test script to send multiple evaluation requests to the server.

Usage:
    python test_client.py
"""

from lmms_eval.entrypoints import EvalClient
import tempfile
import shutil

def main():
    # Connect to server
    client = EvalClient("http://localhost:8000")

    # Check server health
    print("Checking server health...")
    health = client.health()
    print(f"Server status: {health}")

    # Submit multiple evaluation jobs
    jobs = []

    # Job 1: qwen2.5-vl on mme
    print("\nSubmitting job 1: qwen2.5-vl on mme...")
    output_dirs = []
    output_dirs.append(tempfile.mkdtemp())
    job1 = client.evaluate(
        model="qwen2_5_vl",
        tasks=["mme"],
        model_args={
            "pretrained": "Qwen/Qwen2.5-VL-7B-Instruct",
            "device_map": "cuda",
            "attn_implementation": "flash_attention_2",
        },
        batch_size=4,
        limit=32,
        log_samples=True,
        num_gpus=8,
        output_dir=output_dirs[0],
    )
    print(f"Job 1 submitted: {job1}")
    jobs.append(job1["job_id"])

    # Job 2: another qwen2.5-vl eval (to test queue)
    output_dirs.append(tempfile.mkdtemp())
    print("\nSubmitting job 2: qwen2.5-vl on mme (second run)...")
    job2 = client.evaluate(
        model="qwen2_5_vl",
        tasks=["mme"],
        model_args={
            "pretrained": "Qwen/Qwen2.5-VL-7B-Instruct",
            "device_map": "cuda",
            "attn_implementation": "flash_attention_2",
        },
        batch_size=4,
        limit=32,
        log_samples=True,
        num_gpus=8,
        output_dir=output_dirs[1],
    )
    print(f"Job 2 submitted: {job2}")
    jobs.append(job2["job_id"])

    # Check queue status
    print("\nQueue status:")
    queue = client.get_queue_status()
    print(f"  Queue size: {queue['queue_size']}")
    print(f"  Running job: {queue['running_job']}")
    print(f"  Queued jobs: {queue['queued_jobs']}")

    # Wait for jobs to complete
    print("\n" + "=" * 50)
    print("Waiting for jobs to complete...")
    print("=" * 50)

    for job_id in jobs:
        print(f"\nWaiting for job {job_id[:8]}...")
        try:
            result = client.wait_for_job(job_id, poll_interval=10.0, verbose=True)
            print(f"Job {job_id[:8]} completed!")
            print(f"Results: {result}")
        except RuntimeError as e:
            print(f"Job {job_id[:8]} failed: {e}")
    
    for output_dir in output_dirs:
        shutil.rmtree(output_dir)

    print("\nAll jobs finished!")


if __name__ == "__main__":
    main()

Output (Client Side)

(screenshot of client-side output omitted)

- Add CANCELLED status to JobStatus enum
- Mark jobs as CANCELLED instead of deleting (fixes race condition)
- Worker now skips CANCELLED jobs
- Remove compare_results methods (no server endpoint exists)
- Remove unused launcher_args parameter

Github-Issue: #972
Luodian (Contributor) commented Jan 6, 2026

@claude

kcz358 and others added 6 commits January 7, 2026 03:34
…nto the lifespan and thus we can extend to more state in the future
Fix multiple issues identified during PR review for the HTTP server
entrypoints module.

Type Safety (client.py):
- Add proper type annotations to __exit__ and __aexit__ methods with
  explicit exc_type, exc_val, exc_tb parameters instead of *args
- Add return type annotations to __enter__ and __aenter__ methods
- Add ExcInfo type alias for context manager exception info

Code Deduplication (client.py):
- Extract _process_job_status() helper function to eliminate duplicate
  job status handling logic between EvalClient.wait_for_job() and
  AsyncEvalClient.wait_for_job()

Memory Management (job_scheduler.py):
- Add cleanup_old_jobs() method to prevent unbounded growth of _jobs
  dictionary which could cause memory leaks in long-running servers
- Cleanup runs automatically after each job completes/fails
- Add max_completed_jobs constructor parameter (default: 100) to
  configure retention limit for finished jobs

Configuration (job_scheduler.py):
- Add temp_dir_prefix constructor parameter to make temporary directory
  prefix configurable instead of hardcoded "lmms_eval_"
- Add class constants DEFAULT_MAX_COMPLETED_JOBS and
  DEFAULT_TEMP_DIR_PREFIX for better discoverability

Error Messages (server_args.py):
- Improve port validation error to show both the invalid value and its
  type for easier debugging (e.g., "got '8080' (type: str)")
- Add docstring to __post_init__ method
Luodian (Contributor) commented Jan 7, 2026

@claude

claude bot commented Jan 7, 2026

Claude finished @Luodian's task in 2m 17s


Code Review Complete ✅

I've completed a thorough review of this PR introducing the HTTP eval server for lmms-eval. Overall, this is a well-structured implementation with good architecture. Below are my findings organized by severity.


🔴 Critical Issues

1. Security: Unauthenticated Server Exposed (http_server.py:27-32)

The server has no authentication or authorization. While there's a warning comment, this is a significant security risk.

Issue: Any user with network access can:

  • Submit unlimited evaluation jobs
  • Query all job results
  • Cancel any job
  • Potentially cause resource exhaustion

Recommendation: Add at minimum:

  • API key authentication via headers (see the sketch after this list)
  • Rate limiting per client
  • Job ownership tracking (users can only view/cancel their own jobs)
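
As a minimal sketch of the first item, header-based API key auth could be added as a FastAPI dependency; the header name, environment variable, and wiring are illustrative, not part of this PR:

import os

from fastapi import Depends, FastAPI, Header, HTTPException

EXPECTED_KEY = os.environ.get("LMMS_EVAL_API_KEY")

async def require_api_key(x_api_key: str = Header(default="")) -> None:
    # Reject requests whose X-API-Key header does not match the configured key
    if not EXPECTED_KEY or x_api_key != EXPECTED_KEY:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")

app = FastAPI(dependencies=[Depends(require_api_key)])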

2. Missing Type Hints on Public APIs (CLAUDE.md violation)

Several functions lack complete type hints, violating the project's code quality standards:

  • job_scheduler.py:446: _parse_output_directory return type should be explicit
  • server_args.py:11-14: __post_init__ should have -> None return type
  • launch_server.py:7-17: prepare_server_args parameter and return types missing

Required: Per CLAUDE.md lines 27-28, "Type hints required for all code" and "Public APIs must have docstrings".


🟡 High Priority Issues

3. Race Condition in Job Cancellation (job_scheduler.py:214-241)

The cancel_job method checks job status but the job could transition to RUNNING between the check and status update.

Current flow:

async def cancel_job(self, job_id: str) -> tuple[bool, str]:
    async with self._jobs_lock:
        job = self._jobs.get(job_id)
        if job.status == JobStatus.RUNNING:  # Could change after this check
            return False, "Cannot cancel a running job"
        job.status = JobStatus.CANCELLED

Issue: The _job_worker (line 340) pulls from the queue outside the lock, creating a window in which the status can change.

Fix: Check for JobStatus.CANCELLED in _start_job (already done at line 302) but document this pattern.

4. Subprocess Output Streamed Inline in the Worker (job_scheduler.py:371-443)

The _run_evaluation method reads subprocess output one line at a time inside the worker coroutine:

while True:
    line = await proc.stdout.readline()  # worker loops here for every line of output

Issue: The worker coroutine is tied up in this loop for the entire evaluation, so it cannot react to anything else (such as cancellation) in the meantime, and per-line reads add scheduling overhead when the subprocess is chatty.

Recommendation: Use asyncio.create_task() for output streaming or buffer larger chunks.
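
A minimal sketch of the suggested pattern, handing output streaming to its own task so the worker only awaits process exit; proc stands in for the asyncio subprocess created by the scheduler:

import asyncio

async def _pump_stdout(proc: asyncio.subprocess.Process) -> None:
    # Drain stdout in the background; larger reads cut per-line overhead
    assert proc.stdout is not None
    while chunk := await proc.stdout.read(65536):
        print(chunk.decode(errors="replace"), end="")

async def run_with_streaming(proc: asyncio.subprocess.Process) -> int:
    pump = asyncio.create_task(_pump_stdout(proc))
    returncode = await proc.wait()
    await pump  # flush any trailing output before returning
    return returncode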

5. No Resource Cleanup on Server Shutdown (http_server.py:57, job_scheduler.py:146-154)

When the server shuts down with jobs in the queue:

  • Running jobs are not gracefully terminated
  • Queued jobs remain in memory
  • Subprocess evaluations may become orphaned

Fix: In JobScheduler.stop():

async def stop(self):
    # Cancel all queued jobs
    async with self._jobs_lock:
        for job_id, job in self._jobs.items():
            if job.status == JobStatus.QUEUED:
                job.status = JobStatus.CANCELLED
    
    # Then stop worker
    if self._worker_task:
        self._worker_task.cancel()

6. Temp Directory Leak (job_scheduler.py:378)

Temporary directories created by tempfile.mkdtemp() are never cleaned up:

output_path = config.get("output_dir") or tempfile.mkdtemp(prefix=self._temp_dir_prefix)

Fix: Track created temp dirs and clean them up, or use tempfile.TemporaryDirectory() context manager.
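
A sketch of the context-manager option, assuming the evaluation can run inside the with block; run_eval is a hypothetical stand-in for the scheduler's subprocess call:

import tempfile

def run_job(config: dict, run_eval) -> None:
    if config.get("output_dir"):
        # Caller-provided directory: the caller owns cleanup
        run_eval(config["output_dir"])
    else:
        # Temp directory and its contents are removed when the block exits
        with tempfile.TemporaryDirectory(prefix="lmms_eval_") as tmp_dir:
            run_eval(tmp_dir)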

7. Missing Docstrings on Public APIs (CLAUDE.md violation)

Per CLAUDE.md line 28, "Public APIs must have docstrings". Missing:

  • client.py:31-58: _process_job_status function
  • server_args.py:11-22: __post_init__ and from_dict methods need docstrings

🟢 Medium Priority Issues

8. Inconsistent Error Handling (client.py:106-111)

The _request method raises httpx exceptions but tests in utils.py:342 catch generic Exception:

except Exception as e:
    return TestResult(success=False, error=str(e))

Recommendation: Define custom exception types for clearer error handling.
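
A sketch of what such a hierarchy could look like; the names are illustrative and not part of this PR:

class EvalClientError(Exception):
    """Base class for client-side failures."""

class ServerUnavailableError(EvalClientError):
    """Raised when the server cannot be reached."""

class JobFailedError(EvalClientError):
    """Raised when the server reports a job as failed."""

Callers could then catch JobFailedError precisely instead of a bare except Exception.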

9. Line Length Violations (CLAUDE.md:31)

Project standard is 88 chars max. Found violations:

  • server_args.py:14: 138 characters
  • job_scheduler.py:183: 101 characters

Fix: Break long lines per CLAUDE.md guidelines.

10. Hardcoded Constants (job_scheduler.py:117-118)

Magic numbers should be configurable:

DEFAULT_MAX_COMPLETED_JOBS = 100
DEFAULT_TEMP_DIR_PREFIX = "lmms_eval_"

Recommendation: Make these part of ServerArgs for better configurability.
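
A sketch of folding the constants into ServerArgs; beyond host and port, the fields and defaults shown here are assumptions:

from dataclasses import dataclass

@dataclass
class ServerArgs:
    host: str = "0.0.0.0"
    port: int = 8000
    max_completed_jobs: int = 100        # retention limit for finished jobs
    temp_dir_prefix: str = "lmms_eval_"  # prefix for mkdtemp output dirs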

11. No Logging of Job Lifecycle Events (job_scheduler.py)

Job state transitions (queued → running → completed) are not logged, making debugging difficult.

Fix: Add logger calls in _start_job, _complete_job, _fail_job:

async def _start_job(self, job_id: str) -> Optional[dict]:
    async with self._jobs_lock:
        job = self._jobs.get(job_id)
        if job is None or job.status == JobStatus.CANCELLED:
            return None
        logger.info(f"Starting job {job_id}")  # ADD THIS
        self._current_job_id = job_id
        job.status = JobStatus.RUNNING
        job.started_at = datetime.now().isoformat()
        return job.request.model_dump()

12. Test Uses Deprecated preexec_fn (utils.py:180)

The preexec_fn parameter is not safe to use in the presence of threads, and its use is discouraged:

self.process = subprocess.Popen(
    cmd,
    preexec_fn=os.setsid,  # Not safe with threads
)

Fix: Use start_new_session=True instead (available since Python 3.2), which starts the child in its own session without a pre-exec callback.
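
A sketch of the replacement; cmd is an illustrative command line:

import subprocess

cmd = ["python", "-m", "lmms_eval"]  # illustrative
process = subprocess.Popen(
    cmd,
    start_new_session=True,  # replaces preexec_fn=os.setsid
)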


🔵 Low Priority / Style Issues

13. Redundant Dependency in pyproject.toml (line 65)

hf_transfer is listed twice (lines 48 and 65).

14. Missing Type Narrowing (client.py:250)

Return statement has type ignore comment that could be avoided with proper narrowing:

return result  # type: ignore[return-value]

15. Inconsistent Async/Sync Client Patterns (client.py)

AsyncEvalClient doesn't implement all methods from EvalClient (list_tasks, list_models, is_healthy missing).

Recommendation: Either implement these or document why they're excluded.
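
If the missing methods are implemented, they could mirror the sync client; a sketch assuming AsyncEvalClient wraps an httpx.AsyncClient (an assumption about its internals):

import httpx

class AsyncEvalClient:  # sketch only; the real class lives in client.py
    def __init__(self, base_url: str) -> None:
        self._client = httpx.AsyncClient(base_url=base_url)

    async def list_tasks(self) -> list:
        # Mirror the sync client's GET /tasks
        resp = await self._client.get("/tasks")
        resp.raise_for_status()
        return resp.json()

    async def is_healthy(self) -> bool:
        try:
            resp = await self._client.get("/health")
            return resp.status_code == 200
        except httpx.HTTPError:
            return False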

16. Test Utils Uses Blocking time.sleep in Async Context (utils.py:100)

Should use asyncio.sleep for consistency:

time.sleep(1.0)  # Should be async

✅ Positive Aspects

  1. Excellent Architecture: Clean separation of concerns between server, scheduler, and client
  2. Good Documentation: Comprehensive docstrings and README
  3. Thread Safety: Proper use of asyncio locks for concurrent access
  4. Test Infrastructure: Well-designed test utilities and decorators
  5. Error Handling: Generally good error messages and status tracking
  6. Lifespan Management: Correct use of FastAPI lifespan context manager

📋 Recommendations Summary

Must Fix Before Merge:

  1. Add type hints to all public APIs (CLAUDE.md requirement)
  2. Fix line length violations (CLAUDE.md requirement)
  3. Add authentication or clearly document deployment restrictions
  4. Implement temp directory cleanup
  5. Add proper shutdown handling for jobs

Should Fix:

  6. Add job lifecycle logging
  7. Improve async/sync patterns in subprocess handling
  8. Complete the AsyncEvalClient implementation
  9. Add configurable constants to ServerArgs

Nice to Have:

  10. Define custom exception types
  11. Add rate limiting
  12. Add metrics/monitoring endpoints


kcz358 (Collaborator, Author) commented Jan 8, 2026

As for the security issue, we will ignore it for now: we currently assume the server is deployed on internal machines only. An extra middleware layer may be needed if it ever becomes an actual service.
