156 changes: 156 additions & 0 deletions tests/llm_proxy/test_text_content.py
@@ -0,0 +1,156 @@
# Copyright (c) Microsoft. All rights reserved.

import asyncio
Copilot AI Dec 6, 2025

Import of 'asyncio' is not used.

Suggested change
-import asyncio
import json
import sys
from ast import literal_eval
from typing import Any, Dict, List, Sequence, Type, Union, cast

Comment on lines +7 to +8

Copilot AI Dec 6, 2025

Import of 'Type' is not used.
Import of 'Any' is not used.
Import of 'Dict' is not used.
Import of 'List' is not used.
Import of 'Sequence' is not used.
Import of 'Union' is not used.
Import of 'cast' is not used.

Suggested change
-from typing import Any, Dict, List, Sequence, Type, Union, cast
sys.path.append("examples/claude_code")

import anthropic
Copilot AI Dec 6, 2025

Import of 'anthropic' is not used.

Suggested change
-import anthropic
import openai
Copilot AI Dec 6, 2025

Import of 'openai' is not used.

Suggested change
-import openai
import pytest
from litellm.integrations.custom_logger import CustomLogger
Copilot AI Dec 6, 2025

Import of 'CustomLogger' is not used.

Suggested change
-from litellm.integrations.custom_logger import CustomLogger
from portpicker import pick_unused_port
from swebench.harness.constants import SWEbenchInstance
Copilot AI Dec 6, 2025

Import of 'SWEbenchInstance' is not used.

Suggested change
-from swebench.harness.constants import SWEbenchInstance
from swebench.harness.utils import load_swebench_dataset # pyright: ignore[reportUnknownVariableType]
from transformers import AutoTokenizer

from agentlightning import LitAgentRunner, OtelTracer
Copilot AI Dec 6, 2025

Import of 'LitAgentRunner' is not used.
Import of 'OtelTracer' is not used.

Suggested change
-from agentlightning import LitAgentRunner, OtelTracer
+# from agentlightning import LitAgentRunner, OtelTracer
from agentlightning.llm_proxy import LLMProxy, _reset_litellm_logging_worker # pyright: ignore[reportPrivateUsage]
from agentlightning.store import LightningStore, LightningStoreServer, LightningStoreThreaded
Copilot AI Dec 6, 2025

Import of 'LightningStore' is not used.

Suggested change
-from agentlightning.store import LightningStore, LightningStoreServer, LightningStoreThreaded
+from agentlightning.store import LightningStoreServer, LightningStoreThreaded
from agentlightning.store.memory import InMemoryLightningStore
from agentlightning.types import LLM, Span
Copilot AI Dec 6, 2025

Import of 'LLM' is not used.
Import of 'Span' is not used.

Suggested change
-from agentlightning.types import LLM, Span
from examples.claude_code.claude_code_agent import ClaudeCodeAgent, _load_dataset

from ..common.tracer import clear_tracer_provider

pytest.skip(reason="Debug only", allow_module_level=True)


@pytest.mark.asyncio
@pytest.mark.parametrize(
"otlp_enabled",
[
True,
],
)
async def test_claude_code(otlp_enabled: bool):
Copilot AI Dec 6, 2025

The test function name test_claude_code is vague and doesn't clearly describe what is being tested. Based on the PR title "add unit test for text loss in claude code", consider a more descriptive name like test_claude_code_text_preservation_across_turns or test_claude_code_response_text_in_next_prompt.

Suggested change
-async def test_claude_code(otlp_enabled: bool):
+async def test_claude_code_text_preservation_across_turns(otlp_enabled: bool):
Copilot AI Dec 6, 2025

The test function lacks a docstring explaining its purpose, what it tests, and any setup requirements (like the remote endpoint). Add a docstring that describes: 1) what aspect of text preservation is being tested, 2) the test methodology, and 3) the expected behavior being validated.

Suggested change
 async def test_claude_code(otlp_enabled: bool):
+    """
+    Test the Claude Code agent's ability to preserve text content when interacting with a remote LLM endpoint.
+
+    This test loads a sample from the SWE-bench dataset, initializes the Claude Code agent, and runs it using a
+    specified remote model endpoint. It checks that the agent processes the input and preserves the relevant text
+    content throughout its operation. The test requires the remote endpoint to be available and the specified model
+    to be deployed.
+
+    Args:
+        otlp_enabled (bool): Whether to enable OTLP tracing for the test.
+
+    Expected behavior:
+        - The agent should process the input sample and preserve the text content as expected.
+        - The test will interact with a remote endpoint, so network connectivity and endpoint availability are required.
+    """
Comment on lines +33 to +39

Copilot AI Dec 6, 2025

The test is parameterized with only a single value, True, for otlp_enabled. This is unusual for parameterization: typically you would either remove the parameterization and set the value to a constant, or test both the True and False cases. If only True is needed, remove the parameterization decorator.

Suggested change
-@pytest.mark.parametrize(
-    "otlp_enabled",
-    [
-        True,
-    ],
-)
-async def test_claude_code(otlp_enabled: bool):
+async def test_claude_code():
+    otlp_enabled = True
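Alternatively, if both launch paths (the threaded store and the store server with the multiprocess proxy) are worth exercising, the parametrization could cover both values. A rough sketch:

@pytest.mark.asyncio
@pytest.mark.parametrize("otlp_enabled", [True, False])
async def test_claude_code(otlp_enabled: bool):
    ...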
# For unknown reasons, I don't have a local machine for debugging;
# this model is deployed remotely, as is the whole unit test.
Comment on lines +40 to +41

Copilot AI Dec 6, 2025

The hardcoded model name and endpoint suggest this test requires external infrastructure to run. Consider using a test fixture or mock, or document this as a manual test that requires specific setup. This will make the test suite more portable and easier to run in CI/CD environments.
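One way to make the external requirement explicit is sketched below; the AGL_TEST_LLM_ENDPOINT variable name is an assumption for illustration, not an existing convention in this repository:

import os

import pytest

# Hypothetical environment variable pointing at the remote vLLM endpoint;
# the whole module is skipped when it is not configured.
REMOTE_LLM_ENDPOINT = os.environ.get("AGL_TEST_LLM_ENDPOINT")

pytestmark = pytest.mark.skipif(
    REMOTE_LLM_ENDPOINT is None, reason="Requires a deployed remote LLM endpoint"
)

# Inside the test, the hardcoded endpoint would then become:
# endpoint = REMOTE_LLM_ENDPOINT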
model_name = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
Copilot AI Dec 6, 2025

The magic number 5 for max_turns is not explained. Consider adding a comment explaining why this specific value is chosen, or define it as a named constant at the module level (e.g., TEST_MAX_TURNS = 5).
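A minimal sketch of the named-constant option; the rationale in the comment is an assumption about why 5 was chosen:

# Maximum number of agent turns per rollout; kept small so the remote test finishes quickly (assumed rationale).
TEST_MAX_TURNS = 5

# Inside the test body:
# max_turns = TEST_MAX_TURNS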
endpoint = "http://localhost:8000/v1"
max_turns = 5

clear_tracer_provider()
_reset_litellm_logging_worker() # type: ignore

# Prepare utilities for testing
dataset_path = "examples/claude_code/swebench_samples.jsonl"
instance = _load_dataset(dataset_path, limit=1)[0]
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load full swebench dataset. Mainly for evaluation purposes.
swebench_full_dataset = load_swebench_dataset("princeton-nlp/SWE-bench", split="test")
# Initialize Claude Code Agent
claude_code_agent = ClaudeCodeAgent(swebench_full_dataset=swebench_full_dataset, max_turns=max_turns)

system_prompt_piece = "Please do not commit your edits. We will do it later."

# Initialize agl Infrastructure
inmemory_store = InMemoryLightningStore()
if otlp_enabled:
store = LightningStoreServer(store=inmemory_store, host="127.0.0.1", port=pick_unused_port())
await store.start()
else:
store = LightningStoreThreaded(inmemory_store)

Copilot AI Dec 6, 2025

The model name "claude-sonnet-4-5-20250929" is not actually served here; it is only an alias that LiteLLM routes to the hosted_vllm backend configured below. Consider adding a comment clarifying that these Claude model names are routing aliases for the test rather than Anthropic-hosted deployments.

Suggested change
+            # NOTE: The Claude model names below ("claude-sonnet-4-5-20250929", "claude-haiku-4-5-20251001") are routing
+            # aliases for this test; both entries route to the same hosted_vllm deployment.
proxy = LLMProxy(
model_list=[
{
"model_name": "claude-sonnet-4-5-20250929",
"litellm_params": {
"model": "hosted_vllm/" + model_name,
"api_base": endpoint,
},
},
{
"model_name": "claude-haiku-4-5-20251001",
"litellm_params": {
"model": "hosted_vllm/" + model_name,
"api_base": endpoint,
},
},
Comment on lines +78 to +83

Copilot AI Dec 6, 2025

Likewise, "claude-haiku-4-5-20251001" is only a routing alias here; it resolves to the same hosted_vllm deployment as the entry above. A brief comment noting this would make the model_list easier to scan.

Suggested change
+            # NOTE: This Claude model name is a routing alias for the test; it resolves to the same hosted_vllm deployment.
             "model_name": "claude-haiku-4-5-20251001",
             "litellm_params": {
                 "model": "hosted_vllm/" + model_name,
                 "api_base": endpoint,
             },
],
launch_mode="thread" if not otlp_enabled else "mp",
port=pick_unused_port(),
store=store,
)
proxy.server_launcher._access_host = "localhost"
Copilot AI Dec 6, 2025

Accessing the private attribute _access_host of server_launcher is not recommended, as it couples the test to internal implementation details. If this property needs to be overridden for testing, consider adding a public API or test hook in the LLMProxy class.

Suggested change
-proxy.server_launcher._access_host = "localhost"
+# Avoid direct access to the private attribute _access_host.
+# If LLMProxy or server_launcher exposes a public setter, use it here,
+# for example: proxy.server_launcher.set_access_host("localhost").
+# If not, consider adding a public API to LLMProxy/server_launcher for testing purposes.
+# (Direct access to _access_host is discouraged and flagged by CodeQL.)
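If no such hook exists yet, one hypothetical shape for it is sketched below; ServerLauncher and set_access_host are illustrative names, not the current agentlightning API:

class ServerLauncher:
    _access_host: str

    def set_access_host(self, host: str) -> None:
        # Hypothetical public test hook; wraps the private attribute so tests
        # do not have to reach into implementation details.
        self._access_host = host


# The private write in the test would then become:
# proxy.server_launcher.set_access_host("localhost")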
await proxy.start()

rollout = await store.start_rollout(None)

resource = proxy.as_resource(rollout.rollout_id, rollout.attempt.attempt_id, model="local")
Comment on lines +93 to +94

Copilot AI Dec 6, 2025

The debug print statements should be removed or replaced with proper logging. These statements can clutter test output and are typically used during development but should not remain in production test code.
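A sketch of the logging alternative, using the standard library logging module (the logger name and log level are choices for illustration):

import logging

logger = logging.getLogger(__name__)

# Replace the print statements below with debug-level log records.
logger.debug("access_endpoint=%s", proxy.server_launcher.access_endpoint)
logger.debug("endpoint=%s model=%s", resource.endpoint, resource.model)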
print(f">>> DEBUG: {proxy.server_launcher.access_endpoint=}")
print(f">>> DEBUG: {resource.endpoint=}, {resource.model=}")

# Dry run to generate spans
Copilot AI Dec 6, 2025

The comment "# Dry run to generate spans" could be more descriptive. Consider clarifying that this executes the actual test scenario: "# Execute ClaudeCodeAgent rollout to generate spans for validation".

Suggested change
-# Dry run to generate spans
+# Execute ClaudeCodeAgent rollout to generate spans for validation
await claude_code_agent.rollout_async(
task=instance,
resources={"llm": resource},
rollout=rollout,
)

spans = await store.query_spans(rollout.rollout_id)

# Preprocess raw spans
valid_spans = []
for span in spans:
if span.name != "raw_gen_ai_request":
continue

prompt_ids = span.attributes["llm.hosted_vllm.prompt_token_ids"]
prompt_text = tokenizer.decode(literal_eval(prompt_ids))
if system_prompt_piece not in prompt_text:
continue

choice = literal_eval(span.attributes["llm.hosted_vllm.choices"])[0]
response_ids = choice["token_ids"]
response_text = tokenizer.decode(response_ids)

prompt_messages = literal_eval(span.attributes["llm.hosted_vllm.messages"])
response_message = choice["message"]

valid_spans.append(
{
"prompt_text": prompt_text,
"response_text": response_text,
"prompt_messages": prompt_messages,
"response_message": response_message,
}
)

with open("logs/test_spans.jsonl", "w") as f:
for span in spans:
f.write(json.dumps(span.model_dump(), indent=2) + "\n")

with open("logs/test_valid_spans.jsonl", "w") as f:
Comment on lines +132 to +138

Copilot AI Dec 6, 2025

Tests should not write to hardcoded directories like "logs/". This creates several issues: 1) The directory may not exist, causing the test to fail. 2) It leaves artifacts on the filesystem after test completion. 3) It can cause issues in CI/CD environments. Use pytest's tmp_path fixture instead to write to a temporary directory that is automatically cleaned up.
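A sketch of the tmp_path variant; tmp_path is pytest's built-in temporary-directory fixture and can be requested alongside the existing test parameters:

async def test_claude_code(otlp_enabled: bool, tmp_path):
    ...
    # Write span dumps into the per-test temporary directory instead of a hardcoded "logs/" folder.
    with open(tmp_path / "test_spans.jsonl", "w") as f:
        for span in spans:
            f.write(json.dumps(span.model_dump(), indent=2) + "\n")

    with open(tmp_path / "test_valid_spans.jsonl", "w") as f:
        for span in valid_spans:
            f.write(json.dumps(span, indent=2) + "\n")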
for span in valid_spans:
f.write(json.dumps(span, indent=2) + "\n")

# Test case 1: At least two valid spans
assert len(valid_spans) > 1
print(f"Generated {len(spans)} spans with {len(valid_spans)} LLM requests.")
Copilot AI Dec 6, 2025

Consider removing this print statement or converting it to use a proper logger. Print statements in tests can clutter CI/CD output and should generally be avoided unless debugging.

Suggested change
-print(f"Generated {len(spans)} spans with {len(valid_spans)} LLM requests.")

# Test case 2:
Copilot AI Dec 6, 2025

The comment "# Test case 2:" is incomplete. It should describe what Test case 2 is verifying. Consider adding a descriptive comment like "# Test case 2: Verify that previous response text appears in the next prompt".

Suggested change
-# Test case 2:
+# Test case 2: Verify that previous response text appears in the next prompt
for i in range(1, len(valid_spans)):
prev = valid_spans[i - 1]
curr = valid_spans[i]

# The current prompt should contain the previous response
assert prev["response_text"] in curr["prompt_text"]

await proxy.stop()
if isinstance(store, LightningStoreServer):
await store.stop()