42 changes: 33 additions & 9 deletions README.md
@@ -9,7 +9,7 @@
Scale Efficiently: Evaluate and Optimize Your LLM Deployments for Real-World Inference Needs
</h3>

[![GitHub Release](https://img.shields.io/github/release/neuralmagic/guidellm.svg?label=Version)](https://github.com/neuralmagic/guidellm/releases) [![Documentation](https://img.shields.io/badge/Documentation-8A2BE2?logo=read-the-docs&logoColor=%23ffffff&color=%231BC070)](https://github.com/neuralmagic/guidellm/tree/main/docs) [![License](https://img.shields.io/github/license/neuralmagic/guidellm.svg)](https://github.com/neuralmagic/guidellm/blob/main/LICENSE) [![PyPi Release](https://img.shields.io/pypi/v/guidellm.svg?label=PyPi%20Release)](https://pypi.python.org/pypi/guidellm) [![Pypi Release](https://img.shields.io/pypi/v/guidellm-nightly.svg?label=PyPi%20Nightly)](https://pypi.python.org/pypi/guidellm-nightly) [![Python Versions](https://img.shields.io/pypi/pyversions/guidellm.svg?label=Python)](https://pypi.python.org/pypi/guidellm) [![Nightly Build](https://img.shields.io/github/actions/workflow/status/neuralmagic/guidellm/nightly.yml?branch=main&label=Nightly%20Build)](https://github.com/neuralmagic/guidellm/actions/workflows/nightly.yml)
[![GitHub Release](https://img.shields.io/github/release/neuralmagic/guidellm.svg?label=Version)](https://github.com/neuralmagic/guidellm/releases) [![Documentation](https://img.shields.io/badge/Documentation-8A2BE2?logo=read-the-docs&logoColor=%23ffffff&color=%231BC070)](https://github.com/neuralmagic/guidellm/tree/main/docs) [![License](https://img.shields.io/github/license/neuralmagic/guidellm.svg)](https://github.com/neuralmagic/guidellm/blob/main/LICENSE) [![PyPI Release](https://img.shields.io/pypi/v/guidellm.svg?label=PyPI%20Release)](https://pypi.python.org/pypi/guidellm) [![Pypi Release](https://img.shields.io/pypi/v/guidellm-nightly.svg?label=PyPI%20Nightly)](https://pypi.python.org/pypi/guidellm-nightly) [![Python Versions](https://img.shields.io/pypi/pyversions/guidellm.svg?label=Python)](https://pypi.python.org/pypi/guidellm) [![Nightly Build](https://img.shields.io/github/actions/workflow/status/neuralmagic/guidellm/nightly.yml?branch=main&label=Nightly%20Build)](https://github.com/neuralmagic/guidellm/actions/workflows/nightly.yml)

## Overview

@@ -65,10 +65,12 @@ To run a GuideLLM evaluation, use the `guidellm` command with the appropriate mo
```bash
guidellm \
--target "http://localhost:8000/v1" \
--model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
--model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" \
--data-type emulated \
--data "prompt_tokens=512,generated_tokens=128"
```

The above command will begin the evaluation and output progress updates similar to the following: <img src="https://github.com/neuralmagic/guidellm/blob/main/docs/assets/sample-benchmark.gif" />
The above command will begin the evaluation and output progress updates similar to the following (if running on a different server, be sure to update the target!): <img src="https://github.com/neuralmagic/guidellm/blob/main/docs/assets/sample-benchmark.gif" />

Notes:

@@ -88,17 +90,39 @@ The end of the output will include important performance summary metrics such as

<img alt="Sample GuideLLM benchmark end output" src="https://github.com/neuralmagic/guidellm/blob/main/docs/assets/sample-output-end.png" />

### Advanced Settings
### Configurations

GuideLLM provides various options to customize evaluations, including setting the duration of each benchmark run, the number of concurrent requests, and the request rate. For a complete list of options and advanced settings, see the [GuideLLM CLI Documentation](https://github.com/neuralmagic/guidellm/blob/main/docs/guides/cli.md).
GuideLLM provides various CLI and environment options to customize evaluations, including setting the duration of each benchmark run, the number of concurrent requests, and the request rate.

Some common advanced settings include:
Some common configurations for the CLI include the following (a combined example follows this list):

- `--rate-type`: The rate to use for benchmarking. Options include `sweep` (shown above), `synchronous` (one request at a time), `throughput` (all requests at once), `constant` (a constant rate defined by `--rate`), and `poisson` (a poisson distribution rate defined by `--rate`).
- `--data-type`: The data to use for the benchmark. Options include `emulated` (default shown above, emulated to match a given prompt and output length), `transformers` (a transformers dataset), and `file` (a {text, json, jsonl, csv} file with a list of prompts).
- `--rate-type`: The rate to use for benchmarking. Options include `sweep`, `synchronous`, `throughput`, `constant`, and `poisson`.
- `--rate-type sweep`: (default) Sweep runs through the full range of the server's performance, starting with a `synchronous` rate, then `throughput`, and finally 10 `constant` rates spaced between the minimum and maximum request rates found.
- `--rate-type synchronous`: Synchronous runs requests one at a time, waiting for each to complete before sending the next.
- `--rate-type throughput`: Throughput sends all requests as fast as possible, measuring the server's maximum throughput.
- `--rate-type constant`: Constant runs requests at a fixed rate. Specify the rate in requests per second with the `--rate` argument, for example `--rate 10`, or multiple rates with `--rate 10 --rate 20 --rate 30`.
- `--rate-type poisson`: Poisson draws request intervals from a Poisson distribution with its mean at the specified rate, adding real-world variance to the runs. Specify the rate in requests per second with the `--rate` argument, for example `--rate 10`, or multiple rates with `--rate 10 --rate 20 --rate 30`.
- `--data-type`: The data to use for the benchmark. Options include `emulated`, `transformers`, and `file`.
- `--data-type emulated`: Emulated accepts an EmulationConfig, as a string or a file, for the `--data` argument to generate synthetic data. Specify at least the number of prompt tokens, and optionally the number of output tokens and other parameters that control variance in length. For example, `--data "prompt_tokens=128"`, `--data "prompt_tokens=128,generated_tokens=128"`, or `--data "prompt_tokens=128,prompt_tokens_variance=10"`.
- `--data-type file`: File accepts a file path or URL for the `--data` argument. The file should be a CSV, JSONL, or TXT file with one prompt per line, or a JSON/YAML file containing a list of prompts. For example, `--data "data.txt"` where the contents of data.txt are `"prompt1\nprompt2\nprompt3"`.
- `--data-type transformers`: Transformers supports a dataset name or dataset file path for the `--data` argument. For example, `--data "neuralmagic/LLM_compression_calibration"`.
- `--max-seconds`: The maximum number of seconds to run each benchmark. The default is 120 seconds.
- `--max-requests`: The maximum number of requests to run in each benchmark.
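
As referenced above, these flags compose into a single run. The command below is an illustrative sketch only; the target, model, rates, and token counts are assumptions to adapt to your own deployment:

```bash
guidellm \
  --target "http://localhost:8000/v1" \
  --model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" \
  --rate-type constant \
  --rate 10 --rate 20 \
  --data-type emulated \
  --data "prompt_tokens=512,generated_tokens=128" \
  --max-seconds 60
```

This runs two constant-rate benchmarks (10 and 20 requests per second) for up to 60 seconds each, using emulated requests of roughly 512 prompt tokens and 128 generated tokens.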

For a full list of supported CLI arguments, run the following command:

```bash
guidellm --help
```

For a full list of configuration options, run the following command:

```bash
guidellm-config
```

For further information, see the [GuideLLM Documentation](#documentation).

## Resources

### Documentation
Expand All @@ -109,7 +133,7 @@ Our comprehensive documentation provides detailed guides and resources to help y

- [**Installation Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/install.md) - Step-by-step instructions to install GuideLLM, including prerequisites and setup tips.
- [**Architecture Overview**](https://github.com/neuralmagic/guidellm/tree/main/docs/architecture.md) - A detailed look at GuideLLM's design, components, and how they interact.
- [**CLI Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/guides/cli_usage.md) - Comprehensive usage information for running GuideLLM via the command line, including available commands and options.
- [**CLI Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/guides/cli.md) - Comprehensive usage information for running GuideLLM via the command line, including available commands and options.
- [**Configuration Guide**](https://github.com/neuralmagic/guidellm/tree/main/docs/guides/configuration.md) - Instructions on configuring GuideLLM to suit various deployment needs and performance goals.

### Supporting External Documentation
Binary file removed docs/assets/sample-benchmark.gif
Binary file added docs/assets/sample-benchmarks.gif
1 change: 1 addition & 0 deletions pyproject.toml
@@ -75,6 +75,7 @@ dev = [

[project.entry-points.console_scripts]
guidellm = "guidellm.main:generate_benchmark_report_cli"
guidellm-config = "guidellm.config:print_config"


# ************************************************
58 changes: 58 additions & 0 deletions src/guidellm/backend/base.py
@@ -1,9 +1,14 @@
import asyncio
import functools
from abc import ABC, abstractmethod
from typing import AsyncGenerator, Dict, List, Literal, Optional, Type, Union

from loguru import logger
from pydantic import BaseModel
from transformers import ( # type: ignore # noqa: PGH003
AutoTokenizer,
PreTrainedTokenizer,
)

from guidellm.core import TextGenerationRequest, TextGenerationResult

@@ -103,10 +108,21 @@ def create(cls, backend_type: BackendEngine, **kwargs) -> "Backend":
return Backend._registry[backend_type](**kwargs)

def __init__(self, type_: BackendEngine, target: str, model: str):
"""
Base constructor for the Backend class.
Calls into test_connection to ensure the backend is reachable.
Ensure all setup is done in the subclass constructor before calling super.

:param type_: The type of the backend.
:param target: The target URL for the backend.
:param model: The model used by the backend.
"""
self._type = type_
self._target = target
self._model = model

self.test_connection()

@property
def default_model(self) -> str:
"""
@@ -148,6 +164,48 @@ def model(self) -> str:
"""
return self._model

def model_tokenizer(self) -> PreTrainedTokenizer:
"""
Get the tokenizer for the backend model.

:return: The tokenizer instance.
"""
return AutoTokenizer.from_pretrained(self.model)

def test_connection(self) -> bool:
"""
Test the connection to the backend by running a short text generation request.
If successful, returns True, otherwise raises an exception.

:return: True if the connection is successful.
:rtype: bool
:raises ValueError: If the connection test fails.
"""
try:
asyncio.get_running_loop()
is_async = True
except RuntimeError:
is_async = False

if is_async:
logger.warning("Running in async mode, cannot test connection")
return True

try:
request = TextGenerationRequest(
prompt="Test connection", output_token_count=5
)

asyncio.run(self.submit(request))
return True
except Exception as err:
raise_err = RuntimeError(
f"Backend connection test failed for backend type={self.type_} "
f"with target={self.target} and model={self.model} with error: {err}"
)
logger.error(raise_err)
raise raise_err from err

async def submit(self, request: TextGenerationRequest) -> TextGenerationResult:
"""
Submit a text generation request and return the result.
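
The constructor change above establishes an ordering contract: `Backend.__init__` calls `test_connection()`, which submits a real generation request, so a subclass must finish its own setup before delegating to `super().__init__`. The following minimal sketch illustrates that contract; the `EchoBackend` name, the `"echo"` engine value, and the placeholder client are hypothetical and not part of this PR:

```python
from guidellm.backend.base import Backend


class EchoBackend(Backend):
    """Hypothetical subclass illustrating setup-before-super ordering."""

    def __init__(self, target: str, model: str):
        # Initialize everything test_connection() will rely on *before*
        # calling the base constructor, which immediately submits a short
        # "Test connection" generation request through this instance.
        self._client = object()  # placeholder for a real inference client

        # Safe only now that the instance is fully set up.
        super().__init__("echo", target, model)
```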
63 changes: 61 additions & 2 deletions src/guidellm/config.py
@@ -1,5 +1,6 @@
import json
from enum import Enum
from typing import Dict, List, Optional
from typing import Dict, List, Optional, Sequence

from pydantic import BaseModel, Field, model_validator
from pydantic_settings import BaseSettings, SettingsConfigDict
@@ -10,6 +11,7 @@
"Environment",
"LoggingSettings",
"OpenAISettings",
"print_config",
"ReportGenerationSettings",
"Settings",
"reload_settings",
@@ -70,7 +72,6 @@ class DatasetSettings(BaseModel):
preferred_data_splits: List[str] = Field(
default_factory=lambda: ["test", "tst", "validation", "val", "train"]
)
default_tokenizer: str = "neuralmagic/Meta-Llama-3.1-8B-FP8"


class EmulatedDataSettings(BaseModel):
@@ -163,6 +164,53 @@ def set_default_source(cls, values):

return values

def generate_env_file(self) -> str:
"""
Generate the .env file from the current settings
"""
return Settings._recursive_generate_env(
self,
self.model_config["env_prefix"], # type: ignore # noqa: PGH003
self.model_config["env_nested_delimiter"], # type: ignore # noqa: PGH003
)

@staticmethod
def _recursive_generate_env(model: BaseModel, prefix: str, delimiter: str) -> str:
env_file = ""
add_models = []
for key, value in model.model_dump().items():
if isinstance(value, BaseModel):
# add nested properties to be processed after the current level
add_models.append((key, value))
continue

dict_values = (
{
f"{prefix}{key.upper()}{delimiter}{sub_key.upper()}": sub_value
for sub_key, sub_value in value.items()
}
if isinstance(value, dict)
else {f"{prefix}{key.upper()}": value}
)

for tag, sub_value in dict_values.items():
if isinstance(sub_value, Sequence) and not isinstance(sub_value, str):
value_str = ",".join(f'"{item}"' for item in sub_value)
env_file += f"{tag}=[{value_str}]\n"
elif isinstance(sub_value, Dict):
value_str = json.dumps(sub_value)
env_file += f"{tag}={value_str}\n"
elif not sub_value:
env_file += f"{tag}=\n"
else:
env_file += f'{tag}="{sub_value}"\n'

for key, value in add_models:
env_file += Settings._recursive_generate_env(
value, f"{prefix}{key.upper()}{delimiter}", delimiter
)
return env_file


settings = Settings()

@@ -173,3 +221,14 @@ def reload_settings():
"""
new_settings = Settings()
settings.__dict__.update(new_settings.__dict__)


def print_config():
"""
Print the current configuration settings
"""
print(f"Settings: \n{settings.generate_env_file()}") # noqa: T201


if __name__ == "__main__":
print_config()
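
To make the flattening behavior above concrete, here is a small, hedged sketch. The `DEMO__` prefix and `__` delimiter are assumptions for illustration; guidellm's actual values live in `Settings.model_config`, which this diff does not show. It also calls the private `_recursive_generate_env` helper directly, purely to demonstrate the output shape:

```python
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

from guidellm.config import Settings


class LoggingDemo(BaseModel):
    disabled: bool = False
    console_log_level: str = "WARNING"


class DemoSettings(BaseSettings):
    # Assumed prefix/delimiter, for illustration only.
    model_config = SettingsConfigDict(env_prefix="DEMO__", env_nested_delimiter="__")

    request_timeout: int = 30
    logging: LoggingDemo = LoggingDemo()


print(Settings._recursive_generate_env(DemoSettings(), "DEMO__", "__"))
# Expected shape, given the branches above:
#   DEMO__REQUEST_TIMEOUT="30"
#   DEMO__LOGGING__DISABLED=          <- falsy values serialize as empty
#   DEMO__LOGGING__CONSOLE_LOG_LEVEL="WARNING"
```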
14 changes: 14 additions & 0 deletions src/guidellm/core/request.py
@@ -28,3 +28,17 @@ class TextGenerationRequest(Serializable):
default_factory=dict,
description="The parameters for the text generation request.",
)

def __str__(self) -> str:
prompt_short = (
self.prompt[:32] + "..."
if self.prompt and len(self.prompt) > 32 # noqa: PLR2004
else self.prompt
)

return (
f"TextGenerationRequest(id={self.id}, "
f"prompt={prompt_short}, prompt_token_count={self.prompt_token_count}, "
f"output_token_count={self.output_token_count}, "
f"params={self.params})"
)
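
As a quick, hedged usage sketch of the new `__str__` (the id and field defaults shown are illustrative; the id is generated per request):

```python
from guidellm.core import TextGenerationRequest

request = TextGenerationRequest(
    prompt="Summarize the plot of Hamlet in three sentences."
)
print(str(request))
# Prompts longer than 32 characters are truncated with "...", e.g.:
# TextGenerationRequest(id=..., prompt=Summarize the plot of Hamlet in ...,
#   prompt_token_count=None, output_token_count=None, params={})
```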