
Conversation

kcz358 (Collaborator) commented Jan 6, 2026

Before you open a pull request, please check whether a similar issue already exists or has been closed before.

When you open a pull request, please be sure to include the following:

  • A descriptive title: [xxx] XXXX
  • A detailed description

If you encounter lint warnings, you can use the following commands to reformat the code.

pip install pre-commit
pre-commit install
pre-commit run --all-files

Ask for review

Once you are happy with your PR, feel free to @ one of the contributors for review.

Eval Server: @Luodian @kcz358 @mwxely

Summary

Introduces an HTTP server for running lmms-eval evaluations via a REST API, enabling programmatic evaluation submission and job management.

Features

  • HTTP Server (lmms_eval/entrypoints/http_server.py)
  • Endpoints: /evaluate, /jobs/{id}, /queue, /health, /tasks, /models (a quick probe of these follows this list)
  • Python Client (lmms_eval/entrypoints/client.py)
  • Server Launch (lmms_eval/launch_server.py)
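
As a quick sanity check, the read-only endpoints can be probed directly with httpx; this is a minimal sketch, and the shape of each JSON response is an assumption rather than the server's documented schema:

import httpx

BASE_URL = "http://localhost:8000"

with httpx.Client(base_url=BASE_URL, timeout=10.0) as http:
    print(http.get("/health").json())  # liveness check
    print(http.get("/tasks").json())   # tasks the server can evaluate
    print(http.get("/models").json())  # registered model types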

Future plans:

  • Use this as a base for future refactoring of lmms-eval logic (agentic evaluation, streaming QA)
  • Perform statistical testing on model scores
  • With the client interface, easily run head-to-head comparisons of two models in the Anthropic style, allowing us to maintain a leaderboard and reports in the future (see the sketch after this list)
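
A client-side sketch of the two-model comparison idea, using only the EvalClient methods shown in the use case below; the second checkpoint and the shape of the results payload are illustrative assumptions:

from lmms_eval.entrypoints import EvalClient

client = EvalClient("http://localhost:8000")
job_ids = {}
for name, pretrained in {
    "7b": "Qwen/Qwen2.5-VL-7B-Instruct",
    "3b": "Qwen/Qwen2.5-VL-3B-Instruct",  # illustrative second checkpoint
}.items():
    job = client.evaluate(
        model="qwen2_5_vl",
        tasks=["mme"],
        model_args={"pretrained": pretrained},
    )
    job_ids[name] = job["job_id"]

# Block on both jobs, then compare the result payloads side by side
results = {name: client.wait_for_job(jid) for name, jid in job_ids.items()}
print(results["7b"], results["3b"])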

Launch Server

python -m lmms_eval.launch_server --host 0.0.0.0 --port 8000

Use Case

"""
Test script to send multiple evaluation requests to the server.

Usage:
    python test_client.py
"""

from lmms_eval.entrypoints import EvalClient
import tempfile
import shutil

def main():
    # Connect to server
    client = EvalClient("http://localhost:8000")

    # Check server health
    print("Checking server health...")
    health = client.health()
    print(f"Server status: {health}")

    # Submit multiple evaluation jobs
    jobs = []

    # Job 1: qwen2.5-vl on mme
    print("\nSubmitting job 1: qwen2.5-vl on mme...")
    output_dirs = []
    output_dirs.append(tempfile.mkdtemp())
    job1 = client.evaluate(
        model="qwen2_5_vl",
        tasks=["mme"],
        model_args={
            "pretrained": "Qwen/Qwen2.5-VL-7B-Instruct",
            "device_map": "cuda",
            "attn_implementation": "flash_attention_2",
        },
        batch_size=4,
        limit=32,
        log_samples=True,
        num_gpus=8,
        output_dir=output_dirs[0],
    )
    print(f"Job 1 submitted: {job1}")
    jobs.append(job1["job_id"])

    # Job 2: another qwen2.5-vl eval (to test queue)
    output_dirs.append(tempfile.mkdtemp())
    print("\nSubmitting job 2: qwen2.5-vl on mme (second run)...")
    job2 = client.evaluate(
        model="qwen2_5_vl",
        tasks=["mme"],
        model_args={
            "pretrained": "Qwen/Qwen2.5-VL-7B-Instruct",
            "device_map": "cuda",
            "attn_implementation": "flash_attention_2",
        },
        batch_size=4,
        limit=32,
        log_samples=True,
        num_gpus=8,
        output_dir=output_dirs[1],
    )
    print(f"Job 2 submitted: {job2}")
    jobs.append(job2["job_id"])

    # Check queue status
    print("\nQueue status:")
    queue = client.get_queue_status()
    print(f"  Queue size: {queue['queue_size']}")
    print(f"  Running job: {queue['running_job']}")
    print(f"  Queued jobs: {queue['queued_jobs']}")

    # Wait for jobs to complete
    print("\n" + "=" * 50)
    print("Waiting for jobs to complete...")
    print("=" * 50)

    for job_id in jobs:
        print(f"\nWaiting for job {job_id[:8]}...")
        try:
            result = client.wait_for_job(job_id, poll_interval=10.0, verbose=True)
            print(f"Job {job_id[:8]} completed!")
            print(f"Results: {result}")
        except RuntimeError as e:
            print(f"Job {job_id[:8]} failed: {e}")
    
    for output_dir in output_dirs:
        shutil.rmtree(output_dir)

    print("\nAll jobs finished!")


if __name__ == "__main__":
    main()

Output (Client Side)

(screenshot of client-side output omitted)

- Add CANCELLED status to JobStatus enum
- Mark jobs as CANCELLED instead of deleting (fixes race condition)
- Worker now skips CANCELLED jobs
- Remove compare_results methods (no server endpoint exists)
- Remove unused launcher_args parameter

Github-Issue: #972
Luodian (Contributor) commented Jan 6, 2026

@claude

kcz358 and others added 6 commits January 7, 2026 03:34
…nto the lifespan and thus we can extend to more state in the future
Fix multiple issues identified during PR review for the HTTP server
entrypoints module.

Type Safety (client.py):
- Add proper type annotations to __exit__ and __aexit__ methods with
  explicit exc_type, exc_val, exc_tb parameters instead of *args
- Add return type annotations to __enter__ and __aenter__ methods
- Add ExcInfo type alias for context manager exception info

Code Deduplication (client.py):
- Extract _process_job_status() helper function to eliminate duplicate
  job status handling logic between EvalClient.wait_for_job() and
  AsyncEvalClient.wait_for_job()

Memory Management (job_scheduler.py):
- Add cleanup_old_jobs() method to prevent unbounded growth of _jobs
  dictionary which could cause memory leaks in long-running servers
- Cleanup runs automatically after each job completes/fails
- Add max_completed_jobs constructor parameter (default: 100) to
  configure retention limit for finished jobs

Configuration (job_scheduler.py):
- Add temp_dir_prefix constructor parameter to make temporary directory
  prefix configurable instead of hardcoded "lmms_eval_"
- Add class constants DEFAULT_MAX_COMPLETED_JOBS and
  DEFAULT_TEMP_DIR_PREFIX for better discoverability

Error Messages (server_args.py):
- Improve port validation error to show both the invalid value and its
  type for easier debugging (e.g., "got '8080' (type: str)")
- Add docstring to __post_init__ method
Luodian (Contributor) commented Jan 7, 2026

@claude

claude bot commented Jan 7, 2026

Claude finished @Luodian's task in 2m 17s


Code Review Complete ✅

I've completed a thorough review of this PR introducing the HTTP eval server for lmms-eval. Overall, this is a well-structured implementation with good architecture. Below are my findings organized by severity.


🔴 Critical Issues

1. Security: Unauthenticated Server Exposed (http_server.py:27-32)

The server has no authentication or authorization. While there's a warning comment, this is a significant security risk.

Issue: Any user with network access can:

  • Submit unlimited evaluation jobs
  • Query all job results
  • Cancel any job
  • Potentially cause resource exhaustion

Recommendation: Add at minimum:

  • API key authentication via headers (see the sketch after this list)
  • Rate limiting per client
  • Job ownership tracking (users can only view/cancel their own jobs)
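
As a minimal sketch of the first item, header-based API key auth could be added as a FastAPI dependency; the header name, environment variable, and wiring are illustrative, not part of this PR:

import os

from fastapi import Depends, FastAPI, Header, HTTPException

EXPECTED_KEY = os.environ.get("LMMS_EVAL_API_KEY")

async def require_api_key(x_api_key: str = Header(default="")) -> None:
    # Reject requests whose X-API-Key header does not match the configured key
    if not EXPECTED_KEY or x_api_key != EXPECTED_KEY:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")

app = FastAPI(dependencies=[Depends(require_api_key)])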

2. Missing Type Hints on Public APIs (CLAUDE.md violation)

Several functions lack complete type hints, violating the project's code quality standards:

  • job_scheduler.py:446: _parse_output_directory return type should be explicit
  • server_args.py:11-14: __post_init__ should have -> None return type
  • launch_server.py:7-17: prepare_server_args parameter and return types missing

Required: Per CLAUDE.md lines 27-28, "Type hints required for all code" and "Public APIs must have docstrings".


🟡 High Priority Issues

3. Race Condition in Job Cancellation (job_scheduler.py:214-241)

The cancel_job method checks job status but the job could transition to RUNNING between the check and status update.

Current flow:

async def cancel_job(self, job_id: str) -> tuple[bool, str]:
    async with self._jobs_lock:
        job = self._jobs.get(job_id)
        if job.status == JobStatus.RUNNING:  # Could change after this check
            return False, "Cannot cancel a running job"
        job.status = JobStatus.CANCELLED

Issue: The _job_worker (line 340) pulls from the queue outside the lock, creating a window in which the status can change.

Fix: Check for JobStatus.CANCELLED in _start_job (already done at line 302) but document this pattern.

4. Subprocess Output Streamed Inline in the Worker (job_scheduler.py:371-443)

The _run_evaluation method reads subprocess output one line at a time inside the worker coroutine:

while True:
    line = await proc.stdout.readline()  # worker loops here for every line of output

Issue: The worker coroutine is tied up in this loop for the entire evaluation, so it cannot react to anything else (such as cancellation) in the meantime, and per-line reads add scheduling overhead when the subprocess is chatty.

Recommendation: Use asyncio.create_task() for output streaming or buffer larger chunks.
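
A minimal sketch of the suggested pattern, handing output streaming to its own task so the worker only awaits process exit; proc stands in for the asyncio subprocess created by the scheduler:

import asyncio

async def _pump_stdout(proc: asyncio.subprocess.Process) -> None:
    # Drain stdout in the background; larger reads cut per-line overhead
    assert proc.stdout is not None
    while chunk := await proc.stdout.read(65536):
        print(chunk.decode(errors="replace"), end="")

async def run_with_streaming(proc: asyncio.subprocess.Process) -> int:
    pump = asyncio.create_task(_pump_stdout(proc))
    returncode = await proc.wait()
    await pump  # flush any trailing output before returning
    return returncode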

5. No Resource Cleanup on Server Shutdown (http_server.py:57, job_scheduler.py:146-154)

When the server shuts down with jobs in the queue:

  • Running jobs are not gracefully terminated
  • Queued jobs remain in memory
  • Subprocess evaluations may become orphaned

Fix: In JobScheduler.stop():

async def stop(self):
    # Cancel all queued jobs
    async with self._jobs_lock:
        for job_id, job in self._jobs.items():
            if job.status == JobStatus.QUEUED:
                job.status = JobStatus.CANCELLED
    
    # Then stop worker
    if self._worker_task:
        self._worker_task.cancel()

6. Temp Directory Leak (job_scheduler.py:378)

Temporary directories created by tempfile.mkdtemp() are never cleaned up:

output_path = config.get("output_dir") or tempfile.mkdtemp(prefix=self._temp_dir_prefix)

Fix: Track created temp dirs and clean them up, or use tempfile.TemporaryDirectory() context manager.
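
A sketch of the context-manager option, assuming the evaluation can run inside the with block; run_eval is a hypothetical stand-in for the scheduler's subprocess call:

import tempfile

def run_job(config: dict, run_eval) -> None:
    if config.get("output_dir"):
        # Caller-provided directory: the caller owns cleanup
        run_eval(config["output_dir"])
    else:
        # Temp directory and its contents are removed when the block exits
        with tempfile.TemporaryDirectory(prefix="lmms_eval_") as tmp_dir:
            run_eval(tmp_dir)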

7. Missing Docstrings on Public APIs (CLAUDE.md violation)

Per CLAUDE.md line 28, "Public APIs must have docstrings". Missing:

  • client.py:31-58: _process_job_status function
  • server_args.py:11-22: __post_init__ and from_dict methods need docstrings

🟢 Medium Priority Issues

8. Inconsistent Error Handling (client.py:106-111)

The _request method raises httpx exceptions but tests in utils.py:342 catch generic Exception:

except Exception as e:
    return TestResult(success=False, error=str(e))

Recommendation: Define custom exception types for clearer error handling.
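
A sketch of what such a hierarchy could look like; the names are illustrative and not part of this PR:

class EvalClientError(Exception):
    """Base class for client-side failures."""

class ServerUnavailableError(EvalClientError):
    """Raised when the server cannot be reached."""

class JobFailedError(EvalClientError):
    """Raised when the server reports a job as failed."""

Callers could then catch JobFailedError precisely instead of a bare except Exception.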

9. Line Length Violations (CLAUDE.md:31)

Project standard is 88 chars max. Found violations:

  • server_args.py:14: 138 characters
  • job_scheduler.py:183: 101 characters

Fix: Break long lines per CLAUDE.md guidelines.

10. Hardcoded Constants (job_scheduler.py:117-118)

Magic numbers should be configurable:

DEFAULT_MAX_COMPLETED_JOBS = 100
DEFAULT_TEMP_DIR_PREFIX = "lmms_eval_"

Recommendation: Make these part of ServerArgs for better configurability.
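
A sketch of folding the constants into ServerArgs; beyond host and port, the fields and defaults shown here are assumptions:

from dataclasses import dataclass

@dataclass
class ServerArgs:
    host: str = "0.0.0.0"
    port: int = 8000
    max_completed_jobs: int = 100        # retention limit for finished jobs
    temp_dir_prefix: str = "lmms_eval_"  # prefix for mkdtemp output dirs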

11. No Logging of Job Lifecycle Events (job_scheduler.py)

Job state transitions (queued → running → completed) are not logged, making debugging difficult.

Fix: Add logger calls in _start_job, _complete_job, _fail_job:

async def _start_job(self, job_id: str) -> Optional[dict]:
    async with self._jobs_lock:
        job = self._jobs.get(job_id)
        if job is None or job.status == JobStatus.CANCELLED:
            return None
        logger.info(f"Starting job {job_id}")  # ADD THIS
        self._current_job_id = job_id
        job.status = JobStatus.RUNNING
        job.started_at = datetime.now().isoformat()
        return job.request.model_dump()

12. Test Uses Deprecated preexec_fn (utils.py:180)

The preexec_fn parameter is not safe to use in the presence of threads, and its use is discouraged:

self.process = subprocess.Popen(
    cmd,
    preexec_fn=os.setsid,  # Not safe with threads
)

Fix: Use start_new_session=True instead (available since Python 3.2), which starts the child in its own session without a pre-exec callback.
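
A sketch of the replacement; cmd is an illustrative command line:

import subprocess

cmd = ["python", "-m", "lmms_eval"]  # illustrative
process = subprocess.Popen(
    cmd,
    start_new_session=True,  # replaces preexec_fn=os.setsid
)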


🔵 Low Priority / Style Issues

13. Redundant Dependency in pyproject.toml (line 65)

hf_transfer is listed twice (lines 48 and 65).

14. Missing Type Narrowing (client.py:250)

Return statement has type ignore comment that could be avoided with proper narrowing:

return result  # type: ignore[return-value]

15. Inconsistent Async/Sync Client Patterns (client.py)

AsyncEvalClient doesn't implement all methods from EvalClient (list_tasks, list_models, is_healthy missing).

Recommendation: Either implement these or document why they're excluded.
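
If the missing methods are implemented, they could mirror the sync client; a sketch assuming AsyncEvalClient wraps an httpx.AsyncClient (an assumption about its internals):

import httpx

class AsyncEvalClient:  # sketch only; the real class lives in client.py
    def __init__(self, base_url: str) -> None:
        self._client = httpx.AsyncClient(base_url=base_url)

    async def list_tasks(self) -> list:
        # Mirror the sync client's GET /tasks
        resp = await self._client.get("/tasks")
        resp.raise_for_status()
        return resp.json()

    async def is_healthy(self) -> bool:
        try:
            resp = await self._client.get("/health")
            return resp.status_code == 200
        except httpx.HTTPError:
            return False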

16. Test Utils Uses Blocking time.sleep in Async Context (utils.py:100)

Should use asyncio.sleep for consistency:

time.sleep(1.0)  # Should be async

✅ Positive Aspects

  1. Excellent Architecture: Clean separation of concerns between server, scheduler, and client
  2. Good Documentation: Comprehensive docstrings and README
  3. Thread Safety: Proper use of asyncio locks for concurrent access
  4. Test Infrastructure: Well-designed test utilities and decorators
  5. Error Handling: Generally good error messages and status tracking
  6. Lifespan Management: Correct use of FastAPI lifespan context manager

📋 Recommendations Summary

Must Fix Before Merge:

  1. Add type hints to all public APIs (CLAUDE.md requirement)
  2. Fix line length violations (CLAUDE.md requirement)
  3. Add authentication or clearly document deployment restrictions
  4. Implement temp directory cleanup
  5. Add proper shutdown handling for jobs

Should Fix:

  6. Add job lifecycle logging
  7. Improve async/sync patterns in subprocess handling
  8. Complete the AsyncEvalClient implementation
  9. Add configurable constants to ServerArgs

Nice to Have:

  10. Define custom exception types
  11. Add rate limiting
  12. Add metrics/monitoring endpoints


kcz358 (Collaborator, Author) commented Jan 8, 2026

As for the security issue, we will ignore it for now: we currently assume the server is deployed on internal machines only. An extra middleware layer may be needed if it ever becomes an actual service.
