514 changes: 215 additions & 299 deletions documentation/backend_documentation/input_structure_for_the_simulation.md

Large diffs are not rendered by default.

49 changes: 49 additions & 0 deletions documentation/backend_documentation/metrics_to_measure.md
@@ -0,0 +1,49 @@
### **FastSim: Simulation Metrics**

Metrics are the lifeblood of any simulation, transforming a series of abstract events into concrete, actionable insights about system performance, resource utilization, and potential bottlenecks. FastSim provides a flexible and robust metrics collection system designed to give you a multi-faceted view of your system's behavior under load.

To achieve this, FastSim categorizes metrics into three distinct types based on their collection methodology:

1. **Sampled Metrics (`SampledMetricName`):** These metrics provide a **time-series view** of the system's state. They are captured at fixed, regular intervals throughout the simulation's duration (e.g., every second). This methodology is ideal for understanding trends, observing oscillations, and measuring the continuous utilization of finite resources like CPU and RAM. Think of them as periodic snapshots of your system's health.

2. **Event-based Metrics (`EventMetricName`):** These metrics are recorded **only when a specific event occurs**. Their collection is asynchronous and irregular, triggered by discrete happenings within the simulation, such as the completion of a request. This methodology is perfect for measuring the properties of individual transactions, such as end-to-end latency, where an average value is less important than understanding the full distribution of outcomes.

3. **Aggregated Metrics (`AggregatedMetricName`):** These are not collected directly during the simulation but are **calculated after the simulation ends**. They provide high-level statistical summaries (like mean, median, and percentiles) derived from the raw data collected by Event-based metrics. They distill thousands of individual data points into a handful of key performance indicators (KPIs) that are easy to interpret.

The following sections provide a detailed breakdown of each metric within these categories, explaining what they measure and the rationale for their importance.
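
To make the selection concrete, here is a minimal sketch of how the first two categories are enabled through the `SimulationSettings` model introduced in this PR (aggregated metrics require no configuration). The specific values are illustrative, not recommended defaults:

```python
from app.config.constants import EventMetricName, SampledMetricName
from app.schemas.simulation_settings_input import SimulationSettings

# Sampled and event-based metrics are opted into explicitly;
# aggregated metrics are derived automatically after the run.
settings = SimulationSettings(
    total_simulation_time=300,  # simulation horizon in seconds
    enabled_sample_metrics={
        SampledMetricName.READY_QUEUE_LEN,
        SampledMetricName.CORE_BUSY,
        SampledMetricName.RAM_IN_USE,
    },
    enabled_event_metrics={EventMetricName.RQS_LATENCY},
)
```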

---

### **1. Sampled Metrics: A Time-Series Perspective**

Sampled metrics are configured in the `SimulationSettings` payload. Enabling them allows you to plot the evolution of system resources over time, which is crucial for identifying saturation points and transient performance issues.

| Metric Name (`SampledMetricName`) | Description & Rationale |
| :--- | :--- |
| **`READY_QUEUE_LEN`** | **What it is:** The number of tasks in the `asyncio` event loop's "ready" queue waiting for their turn to run on the CPU. <br><br> **Rationale:** This is arguably the most critical indicator of **CPU saturation**. In a single-threaded Python process, only one coroutine runs at a time (the event loop lives on one thread, and the GIL prevents Python bytecode from executing in parallel). If this queue length is consistently greater than zero, tasks are ready to do work but are forced to wait because the CPU is busy. A long or growing queue is a definitive sign that your application is CPU-bound and that the CPU is a primary bottleneck. |
| **`CORE_BUSY`** | **What it is:** The number of server CPU cores that are currently executing a task. <br><br> **Rationale:** This provides a direct measure of **CPU utilization**. When plotted over time, it shows how effectively you are using your provisioned processing power. If `CORE_BUSY` is consistently at its maximum value (equal to `server_resources.cpu_cores`), the system is CPU-saturated. Conversely, if it's consistently low while latency is high, the bottleneck is likely elsewhere (e.g., I/O). It perfectly complements `READY_QUEUE_LEN` to form a complete picture of CPU health. |
| **`EVENT_LOOP_IO_SLEEP`** | **What it is:** A measure of whether the event loop is idle, polling for pending I/O operations to complete. <br><br> **Rationale:** This metric helps you determine if your system is **I/O-bound**. If the event loop spends a significant amount of time in this state, the CPU is underutilized because it has no ready tasks to run and is instead waiting for external systems (like databases, caches, or downstream APIs) to respond. High values for this metric coupled with low CPU utilization are a clear signal to investigate and optimize the performance of your I/O operations. |
| **`RAM_IN_USE`** | **What it is:** The total amount of memory (in MB) currently allocated by all active requests within a server. <br><br> **Rationale:** Essential for **capacity planning and stability analysis**. This metric allows you to visualize your system's memory footprint under load. You can identify which endpoints cause memory spikes and ensure your provisioned RAM is sufficient. A steadily increasing `RAM_IN_USE` value that never returns to a baseline is the classic signature of a **memory leak**, a critical bug this metric helps you detect. |
| **`THROUGHPUT_RPS`** | **What it is:** The number of requests successfully completed per second, calculated over the last sampling window. <br><br> **Rationale:** This is a fundamental measure of **system performance and capacity**. It answers the question: "How much work is my system actually doing?" Plotting throughput against user load or other resource metrics is key to understanding your system's scaling characteristics. A drop in throughput often correlates with a spike in latency or resource saturation, helping you pinpoint the exact moment a bottleneck began to affect performance. |
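
The raw storage behind these series is deliberately simple: the `alloc_sample_metric` helper added in this PR pre-allocates one list per enabled metric plus a shared `"t"` list of sampling timestamps. A small sketch of that layout (the 10 ms interval and the values are only illustrative):

```python
from app.config.constants import SampledMetricName
from app.core.helpers import alloc_sample_metric

series = alloc_sample_metric({SampledMetricName.READY_QUEUE_LEN, SampledMetricName.CORE_BUSY})

# After three snapshots taken every 10 ms the collector might hold:
# series["t"]               -> [10, 20, 30]
# series["ready_queue_len"] -> [0, 2, 5]
# series["core_busy"]       -> [1, 4, 4]
```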

---

### **2. Event-based Metrics: A Per-Transaction Perspective**

Event-based metrics are also enabled in the `SimulationSettings` payload. They generate a collection of raw data points, one for each relevant event, which is ideal for statistical analysis of transactional performance.

| Metric Name (`EventMetricName`) | Description & Rationale |
| :--- | :--- |
| **`RQS_LATENCY`** | **What it is:** The total end-to-end duration, in seconds, for a single request to be fully processed. <br><br> **Rationale:** This is the **primary user-facing performance metric**. Users directly experience latency. While a simple average can be useful, it often hides critical problems. By collecting the latency for *every single request*, FastSim allows for the calculation of statistical distributions and, most importantly, **tail-latency percentiles (p95, p99)**. These percentiles represent the worst-case experience for your users and are crucial for evaluating Service Level Objectives (SLOs) and ensuring a consistent user experience. |
| **`LLM_COST`** | **What it is:** The estimated monetary cost (e.g., in USD) incurred by a single call to an external Large Language Model (LLM) API during a request. <br><br> **Rationale:** In modern AI-powered applications, API calls to third-party services like LLMs can be a major operational expense. This metric moves beyond technical performance to measure **financial performance**. By tracking cost on a per-event basis, you can attribute expenses to specific endpoints or user behaviors, identify unnecessarily costly operations, and make informed decisions to optimize your application's cost-effectiveness. |
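
Per-event storage mirrors the sampled case: the `alloc_event_metric` helper added in this PR creates one list per enabled metric, and the simulation appends one value whenever the corresponding event fires. A short sketch (the appended numbers are illustrative):

```python
from app.config.constants import EventMetricName
from app.core.helpers import alloc_event_metric

events = alloc_event_metric({EventMetricName.RQS_LATENCY, EventMetricName.LLM_COST})

# On completion of a request that took 120 ms and made a $0.002 LLM call,
# the engine would record one data point per enabled metric:
events[EventMetricName.RQS_LATENCY].append(0.120)
events[EventMetricName.LLM_COST].append(0.002)
```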

---

### **3. Aggregated Metrics: High-Level Summaries**

**Important:** Aggregated metrics are **not configured in the input payload**. They are automatically calculated by the FastSim engine at the end of a simulation run, based on the raw data collected from the enabled Event-based metrics.

| Metric Name (`AggregatedMetricName`) | Description & Rationale |
| :--- | :--- |
| **`LATENCY_STATS`** | **What it is:** A statistical summary of the entire collection of `RQS_LATENCY` data points. This typically includes the mean, median (p50), standard deviation, and high-end percentiles (p95, p99, p99.9). <br><br> **Rationale:** This provides a comprehensive and easily digestible summary of your system's latency profile. While the raw data is essential, these summary statistics answer high-level questions quickly. The mean tells you the average experience, the median is robust to outliers, and the p95/p99 values tell you the latency that 95% or 99% of requests stay below, a critical KPI for reliability and user satisfaction. |
| **`LLM_STATS`** | **What it is:** A statistical summary of the `LLM_COST` data points. This can include total cost over the simulation, average cost per request, and cost distribution. <br><br> **Rationale:** This gives you a bird's-eye view of the financial implications of your system's design. Instead of looking at individual transaction costs, `LLM_STATS` provides the bottom line: the total operational cost during the simulation period. This is invaluable for budgeting, forecasting, and validating the financial viability of new features. |
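
The aggregation itself is not part of this diff, but conceptually it reduces each raw event-metric list to a handful of summary statistics. A plausible sketch for `LATENCY_STATS` (the field names are assumptions, not the engine's actual output schema):

```python
import numpy as np

def summarize_latency(samples: list[float]) -> dict[str, float]:
    """Reduce raw RQS_LATENCY samples to the kind of summary LATENCY_STATS describes."""
    arr = np.asarray(samples, dtype=float)
    return {
        "mean": float(arr.mean()),
        "p50": float(np.percentile(arr, 50)),
        "p95": float(np.percentile(arr, 95)),
        "p99": float(np.percentile(arr, 99)),
        "std": float(arr.std()),
    }

print(summarize_latency([0.12, 0.35, 0.08, 1.45, 0.22]))  # illustrative values
```
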
4 changes: 2 additions & 2 deletions src/app/api/simulation.py
@@ -4,13 +4,13 @@
from fastapi import APIRouter

from app.core.simulation.simulation_run import run_simulation
from app.schemas.requests_generator_input import RqsGeneratorInput
from app.schemas.full_simulation_input import SimulationPayload
from app.schemas.simulation_output import SimulationOutput

router = APIRouter()

@router.post("/simulation")
async def event_loop_simulation(input_data: RqsGeneratorInput) -> SimulationOutput:
async def event_loop_simulation(input_data: SimulationPayload) -> SimulationOutput:
"""Run the simulation and return aggregate KPIs."""
rng = np.random.default_rng()
return run_simulation(input_data, rng=rng)
40 changes: 40 additions & 0 deletions src/app/config/constants.py
@@ -182,3 +182,43 @@ class SystemEdges(StrEnum):
"""

NETWORK_CONNECTION = "network_connection"

# ======================================================================
# CONSTANTS FOR SAMPLED METRICS
# ======================================================================

class SampledMetricName(StrEnum):
"""
define the metrics sampled every fixed amount of
time to create a time series
"""

    READY_QUEUE_LEN = "ready_queue_len"  # length of the event loop ready queue
CORE_BUSY = "core_busy"
EVENT_LOOP_IO_SLEEP = "event_loop_io_sleep"
RAM_IN_USE = "ram_in_use"
THROUGHPUT_RPS = "throughput_rps"

# ======================================================================
# CONSTANTS FOR EVENT METRICS
# ======================================================================

class EventMetricName(StrEnum):
"""
define the metrics triggered by event with no
time series
"""

RQS_LATENCY = "rqs_latency"
LLM_COST = "llm_cost"


# ======================================================================
# CONSTANTS FOR AGGREGATED METRICS
# ======================================================================

class AggregatedMetricName(StrEnum):
"""aggregated metrics to calculate at the end of simulation"""

LATENCY_STATS = "latency_stats"
LLM_STATS = "llm_stats"
6 changes: 4 additions & 2 deletions src/app/core/event_samplers/gaussian_poisson.py
@@ -17,10 +17,12 @@
uniform_variable_generator,
)
from app.schemas.requests_generator_input import RqsGeneratorInput
from app.schemas.simulation_settings_input import SimulationSettings


def gaussian_poisson_sampling(
input_data: RqsGeneratorInput,
sim_settings: SimulationSettings,
*,
rng: np.random.Generator | None = None,
) -> Generator[float, None, None]:
@@ -35,11 +35,11 @@ def gaussian_poisson_sampling(
Λ = U * (mean_req_per_minute_per_user / 60) [req/s].
3. While inside the current window, draw gaps
Δt ~ Exponential(Λ) using inverse-CDF.
4. Stop once the virtual clock exceeds *simulation_time*.
4. Stop once the virtual clock exceeds *total_simulation_time*.
"""
rng = rng or np.random.default_rng()

simulation_time = input_data.total_simulation_time
simulation_time = sim_settings.total_simulation_time
user_sampling_window = input_data.user_sampling_window

# λ_u : mean concurrent users per window
6 changes: 4 additions & 2 deletions src/app/core/event_samplers/poisson_poisson.py
@@ -14,10 +14,12 @@
uniform_variable_generator,
)
from app.schemas.requests_generator_input import RqsGeneratorInput
from app.schemas.simulation_settings_input import SimulationSettings


def poisson_poisson_sampling(
input_data: RqsGeneratorInput,
sim_settings: SimulationSettings,
*,
rng: np.random.Generator | None = None,
) -> Generator[float, None, None]:
@@ -32,11 +32,11 @@ def poisson_poisson_sampling(
Λ = U * (mean_req_per_minute_per_user / 60) [req/s].
3. While inside the current window, draw gaps
Δt ~ Exponential(Λ) using inverse-CDF.
4. Stop once the virtual clock exceeds *simulation_time*.
4. Stop once the virtual clock exceeds *total_simulation_time*.
"""
rng = rng or np.random.default_rng()

simulation_time = input_data.total_simulation_time
simulation_time = sim_settings.total_simulation_time
user_sampling_window = input_data.user_sampling_window

# λ_u : mean concurrent users per window
38 changes: 38 additions & 0 deletions src/app/core/helpers.py
@@ -0,0 +1,38 @@
"""helpers for the simulation"""

from collections.abc import Iterable

from app.config.constants import EventMetricName, SampledMetricName


def alloc_sample_metric(
enabled_sample_metrics: Iterable[SampledMetricName],
) -> dict[str, list[float | int]]:
"""
After the pydantic validation of the whole input we
instantiate a dictionary to collect the sampled metrics the
user want to measure
"""
    # "t" holds the sampling timestamps. For example, if snapshots are
    # taken every 10 ms, t = [10, 20, 30, 40, ...] and each enabled
    # metric stores one measurement per timestamp.

dict_sampled_metrics: dict[str, list[float | int]] = {"t": []}
for key in enabled_sample_metrics:
dict_sampled_metrics[key] = []
return dict_sampled_metrics


def alloc_event_metric(
enabled_event_metrics: Iterable[EventMetricName],
) -> dict[str, list[float | int]]:
"""
After the pydantic validation of the whole input we
instantiate a dictionary to collect the event metrics the
user want to measure
"""
dict_event_metrics: dict[str, list[float | int]] = {}
for key in enabled_event_metrics:
dict_event_metrics[key] = []
return dict_event_metrics
4 changes: 4 additions & 0 deletions src/app/core/simulation/requests_generator.py
@@ -17,10 +17,12 @@
import numpy as np

from app.schemas.requests_generator_input import RqsGeneratorInput
from app.schemas.simulation_settings_input import SimulationSettings


def requests_generator(
input_data: RqsGeneratorInput,
sim_settings: SimulationSettings,
*,
rng: np.random.Generator | None = None,
) -> Generator[float, None, None]:
@@ -41,12 +41,14 @@ def requests_generator(
#Gaussian-Poisson model
return gaussian_poisson_sampling(
input_data=input_data,
sim_settings=sim_settings,
rng=rng,

)

# Poisson + Poisson
return poisson_poisson_sampling(
input_data=input_data,
sim_settings=sim_settings,
rng=rng,
)
28 changes: 16 additions & 12 deletions src/app/core/simulation/simulation_run.py
@@ -14,28 +14,32 @@

import numpy as np

from app.schemas.requests_generator_input import RqsGeneratorInput
from app.schemas.full_simulation_input import SimulationPayload






def run_simulation(
input_data: RqsGeneratorInput,
input_data: SimulationPayload,
*,
rng: np.random.Generator,
) -> SimulationOutput:
"""Simulation executor in Simpy"""
gaps: Generator[float, None, None] = requests_generator(input_data, rng=rng)
sim_settings = input_data.sim_settings

requests_generator_input = input_data.rqs_input

gaps: Generator[float, None, None] = requests_generator(
requests_generator_input,
sim_settings,
rng=rng)
env = simpy.Environment()

simulation_time = input_data.total_simulation_time
# pydantic in the validation assign a value and mypy is not
# complaining because a None cannot be compared in the loop
# to a float
assert simulation_time is not None

total_request_per_time_period = {
"simulation_time": simulation_time,
"simulation_time": sim_settings.total_simulation_time,
"total_requests": 0,
}

@@ -47,10 +47,10 @@ def arrival_process(
total_request_per_time_period["total_requests"] += 1

env.process(arrival_process(env))
env.run(until=simulation_time)
env.run(until=sim_settings.total_simulation_time)

return SimulationOutput(
total_requests=total_request_per_time_period,
metric_2=str(input_data.avg_request_per_minute_per_user.mean),
metric_n=str(input_data.avg_active_users.mean),
metric_2=str(requests_generator_input.avg_request_per_minute_per_user.mean),
metric_n=str(requests_generator_input.avg_active_users.mean),
)
2 changes: 2 additions & 0 deletions src/app/schemas/full_simulation_input.py
@@ -3,6 +3,7 @@
from pydantic import BaseModel

from app.schemas.requests_generator_input import RqsGeneratorInput
from app.schemas.simulation_settings_input import SimulationSettings
from app.schemas.system_topology_schema.full_system_topology_schema import TopologyGraph


@@ -11,3 +12,4 @@ class SimulationPayload(BaseModel):

rqs_input: RqsGeneratorInput
topology_graph: TopologyGraph
sim_settings: SimulationSettings
7 changes: 0 additions & 7 deletions src/app/schemas/requests_generator_input.py
@@ -12,13 +12,6 @@ class RqsGeneratorInput(BaseModel):

avg_active_users: RVConfig
avg_request_per_minute_per_user: RVConfig
total_simulation_time: int = Field(
default=TimeDefaults.SIMULATION_TIME,
ge=TimeDefaults.MIN_SIMULATION_TIME,
description=(
f"Simulation time in seconds (>= {TimeDefaults.MIN_SIMULATION_TIME})."
),
)

user_sampling_window: int = Field(
default=TimeDefaults.USER_SAMPLING_WINDOW,
31 changes: 31 additions & 0 deletions src/app/schemas/simulation_settings_input.py
@@ -0,0 +1,31 @@
"""define a class with the global settings for the simulation"""

from pydantic import BaseModel, Field

from app.config.constants import EventMetricName, SampledMetricName, TimeDefaults


class SimulationSettings(BaseModel):
"""Global parameters that apply to the whole run."""

total_simulation_time: int = Field(
default=TimeDefaults.SIMULATION_TIME,
ge=TimeDefaults.MIN_SIMULATION_TIME,
description="Simulation horizon in seconds.",
)

enabled_sample_metrics: set[SampledMetricName] = Field(
default_factory=lambda: {
SampledMetricName.READY_QUEUE_LEN,
SampledMetricName.CORE_BUSY,
SampledMetricName.RAM_IN_USE,
},
description="Which time-series KPIs to collect by default.",
)
enabled_event_metrics: set[EventMetricName] = Field(
default_factory=lambda: {
EventMetricName.RQS_LATENCY,
},
description="Which per-event KPIs to collect by default.",
)
