Features/definition full payload simulation #4
**Merged:** GioeleB00 merged 6 commits into `develop` from `features/definition-full-payload-simulation` on Jul 15, 2025.
Changes from 5 of the 6 commits:

- `fa72b3b` Improved input structure and pytest (GioeleB00)
- `e104dc6` Improved pytest structure according to the new schema (GioeleB00)
- `b7370d5` Definition of the metrics to be measured and update of the simulation… (GioeleB00)
- `bc45dc0` Improved documentation; added rationale behind metrics (GioeleB00)
- `b15dbff` Improved pytest logic and code coherence (GioeleB00)
- `8a43eab` Update src/app/core/helpers.py (GioeleB00)
**`documentation/backend_documentation/input_structure_for_the_simulation.md`** (514 changes: 215 additions & 299 deletions)
### **FastSim — simulation metrics**
Metrics are the lifeblood of any simulation, transforming a series of abstract events into concrete, actionable insights about system performance, resource utilization, and potential bottlenecks. FastSim provides a flexible and robust metrics collection system designed to give you a multi-faceted view of your system's behavior under load.

To achieve this, FastSim categorizes metrics into three distinct types based on their collection methodology (a sketch of the corresponding enums follows the list):

1. **Sampled Metrics (`SampledMetricName`):** These metrics provide a **time-series view** of the system's state. They are captured at fixed, regular intervals throughout the simulation's duration (e.g., every second). This methodology is ideal for understanding trends, observing oscillations, and measuring the continuous utilization of finite resources like CPU and RAM. Think of them as periodic snapshots of your system's health.

2. **Event-based Metrics (`EventMetricName`):** These metrics are recorded **only when a specific event occurs**. Their collection is asynchronous and irregular, triggered by discrete happenings within the simulation, such as the completion of a request. This methodology is perfect for measuring the properties of individual transactions, such as end-to-end latency, where an average value is less important than understanding the full distribution of outcomes.

3. **Aggregated Metrics (`AggregatedMetricName`):** These are not collected directly during the simulation but are **calculated after the simulation ends**. They provide high-level statistical summaries (like mean, median, and percentiles) derived from the raw data collected by Event-based metrics. They distill thousands of individual data points into a handful of key performance indicators (KPIs) that are easy to interpret.
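The exact definitions of these three enums live in `app.config.constants` and are not part of this diff. A minimal sketch of what they might look like, assuming string-valued enums whose member names are taken from the tables below (the string values are illustrative only):

```python
# Hypothetical sketch of the metric-name enums referenced above.
# Member names come from this document; the actual definitions in
# app.config.constants may use different string values.
from enum import Enum


class SampledMetricName(str, Enum):
    READY_QUEUE_LEN = "ready_queue_len"
    CORE_BUSY = "core_busy"
    EVENT_LOOP_IO_SLEEP = "event_loop_io_sleep"
    RAM_IN_USE = "ram_in_use"
    THROUGHPUT_RPS = "throughput_rps"


class EventMetricName(str, Enum):
    RQS_LATENCY = "rqs_latency"
    LLM_COST = "llm_cost"


class AggregatedMetricName(str, Enum):
    LATENCY_STATS = "latency_stats"
    LLM_STATS = "llm_stats"
```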
The following sections provide a detailed breakdown of each metric within these categories, explaining what each one measures and why it matters.

---

### **1. Sampled Metrics: A Time-Series Perspective**

Sampled metrics are configured in the `SimulationSettings` payload. Enabling them allows you to plot the evolution of system resources over time, which is crucial for identifying saturation points and transient performance issues. (A configuration sketch follows the table below.)
| Metric Name (`SampledMetricName`) | Description & Rationale |
| :--- | :--- |
| **`READY_QUEUE_LEN`** | **What it is:** The number of tasks in the `asyncio` event loop's "ready" queue waiting for their turn to run on the CPU. <br><br> **Rationale:** This is arguably the most critical indicator of **CPU saturation**. In a single-threaded Python process, only one coroutine can run at a time (a constraint enforced by the GIL). If this queue length is consistently greater than zero, it means tasks are ready to do work but are forced to wait because the CPU is busy. A long or growing queue is a definitive sign that your application is CPU-bound and that the CPU is a primary bottleneck. |
| **`CORE_BUSY`** | **What it is:** The number of server CPU cores that are currently executing a task. <br><br> **Rationale:** This provides a direct measure of **CPU utilization**. When plotted over time, it shows how effectively you are using your provisioned processing power. If `CORE_BUSY` is consistently at its maximum value (equal to `server_resources.cpu_cores`), the system is CPU-saturated. Conversely, if it is consistently low while latency is high, the bottleneck is likely elsewhere (e.g., I/O). It perfectly complements `READY_QUEUE_LEN` to form a complete picture of CPU health. |
| **`EVENT_LOOP_IO_SLEEP`** | **What it is:** A measure indicating whether the event loop is idle, polling for I/O operations to complete. <br><br> **Rationale:** This metric helps you determine if your system is **I/O-bound**. If the event loop spends a significant amount of time in this state, the CPU is underutilized because it has no ready tasks to run and is instead waiting for external systems (like databases, caches, or downstream APIs) to respond. High values for this metric coupled with low CPU utilization are a clear signal to investigate and optimize the performance of your I/O operations. |
| **`RAM_IN_USE`** | **What it is:** The total amount of memory (in MB) currently allocated by all active requests within a server. <br><br> **Rationale:** Essential for **capacity planning and stability analysis**. This metric allows you to visualize your system's memory footprint under load. You can identify which endpoints cause memory spikes and ensure your provisioned RAM is sufficient. A steadily increasing `RAM_IN_USE` value that never returns to a baseline is the classic signature of a **memory leak**, a critical bug this metric helps you detect. |
| **`THROUGHPUT_RPS`** | **What it is:** The number of requests successfully completed per second, calculated over the last sampling window. <br><br> **Rationale:** This is a fundamental measure of **system performance and capacity**. It answers the question: "How much work is my system actually doing?" Plotting throughput against user load or other resource metrics is key to understanding your system's scaling characteristics. A drop in throughput often correlates with a spike in latency or resource saturation, helping you pinpoint the exact moment a bottleneck began to affect performance. |
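As a usage sketch, these names are passed to the `enabled_sample_metrics` field of the `SimulationSettings` model shown later in this PR. Enabling all five time-series KPIs might look like this (the import path for `SimulationSettings` is assumed, since the diff does not show where the schema lives):

```python
# Sketch: enable every sampled metric for a run.
# NOTE: the import path for SimulationSettings is an assumption;
# only the field names are taken from the schema in this diff.
from app.config.constants import SampledMetricName
from app.config.settings import SimulationSettings  # path assumed

settings = SimulationSettings(
    total_simulation_time=300,  # simulation horizon in seconds
    enabled_sample_metrics={
        SampledMetricName.READY_QUEUE_LEN,
        SampledMetricName.CORE_BUSY,
        SampledMetricName.EVENT_LOOP_IO_SLEEP,
        SampledMetricName.RAM_IN_USE,
        SampledMetricName.THROUGHPUT_RPS,
    },
)
```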
---

### **2. Event-based Metrics: A Per-Transaction Perspective**

Event-based metrics are also enabled in the `SimulationSettings` payload. They generate a collection of raw data points, one for each relevant event, which is ideal for statistical analysis of transactional performance. (A sketch of how these data points accumulate follows the table below.)
| Metric Name (`EventMetricName`) | Description & Rationale |
| :--- | :--- |
| **`RQS_LATENCY`** | **What it is:** The total end-to-end duration, in seconds, for a single request to be fully processed. <br><br> **Rationale:** This is the **primary user-facing performance metric**. Users directly experience latency. While a simple average can be useful, it often hides critical problems. By collecting the latency for *every single request*, FastSim allows for the calculation of statistical distributions and, most importantly, **tail-latency percentiles (p95, p99)**. These percentiles represent the worst-case experience for your users and are crucial for evaluating Service Level Objectives (SLOs) and ensuring a consistent user experience. |
| **`LLM_COST`** | **What it is:** The estimated monetary cost (e.g., in USD) incurred by a single call to an external Large Language Model (LLM) API during a request. <br><br> **Rationale:** In modern AI-powered applications, API calls to third-party services like LLMs can be a major operational expense. This metric moves beyond technical performance to measure **financial performance**. By tracking cost on a per-event basis, you can attribute expenses to specific endpoints or user behaviors, identify unnecessarily costly operations, and make informed decisions to optimize your application's cost-effectiveness. |
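The raw per-event series are stored in plain dictionaries of lists, allocated by the `alloc_event_metric` helper added in this PR (see `src/app/core/helpers.py` below; the `app.core.helpers` import path follows from that file location). A minimal sketch of how latency samples might accumulate during a run:

```python
# Sketch: collect one RQS_LATENCY data point per completed request.
# alloc_event_metric is the helper from src/app/core/helpers.py;
# the 0.042 value is an illustrative latency in seconds.
from app.config.constants import EventMetricName
from app.core.helpers import alloc_event_metric

events = alloc_event_metric({EventMetricName.RQS_LATENCY})

# ...inside the simulation, each time a request finishes:
events[EventMetricName.RQS_LATENCY].append(0.042)
```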
---

### **3. Aggregated Metrics: High-Level Summaries**

**Important:** Aggregated metrics are **not configured in the input payload**. They are automatically calculated by the FastSim engine at the end of a simulation run, based on the raw data collected from the enabled Event-based metrics. (A sketch of this kind of post-run aggregation follows the table below.)
| Metric Name (`AggregatedMetricName`) | Description & Rationale |
| :--- | :--- |
| **`LATENCY_STATS`** | **What it is:** A statistical summary of the entire collection of `RQS_LATENCY` data points. This typically includes the mean, median (p50), standard deviation, and high-end percentiles (p95, p99, p99.9). <br><br> **Rationale:** This provides a comprehensive and easily digestible summary of your system's latency profile. While the raw data is essential, these summary statistics answer high-level questions quickly. The mean tells you the average experience, the median protects against outliers, and the p95/p99 values tell you the latency that 95% or 99% of your users will beat — a critical KPI for reliability and user satisfaction. |
| **`LLM_STATS`** | **What it is:** A statistical summary of the `LLM_COST` data points. This can include total cost over the simulation, average cost per request, and cost distribution. <br><br> **Rationale:** This gives you a bird's-eye view of the financial implications of your system's design. Instead of looking at individual transaction costs, `LLM_STATS` provides the bottom line: the total operational cost during the simulation period. This is invaluable for budgeting, forecasting, and validating the financial viability of new features. |
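FastSim's actual aggregation code is not part of this diff. A standard-library sketch of the kind of summary `LATENCY_STATS` describes, computed from a raw `RQS_LATENCY` list:

```python
# Sketch: post-run summary of raw per-request latencies (seconds),
# of the kind LATENCY_STATS describes. Illustrative only.
from statistics import mean, median, stdev


def latency_stats(samples: list[float]) -> dict[str, float]:
    """Summarize a non-empty list of raw latency samples."""
    ordered = sorted(samples)

    def pct(p: float) -> float:
        # Nearest-rank percentile over the sorted samples.
        idx = min(len(ordered) - 1, max(0, round(p * (len(ordered) - 1))))
        return ordered[idx]

    return {
        "mean": mean(ordered),
        "p50": median(ordered),
        "std": stdev(ordered) if len(ordered) > 1 else 0.0,
        "p95": pct(0.95),
        "p99": pct(0.99),
    }
```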
**`src/app/core/helpers.py`** (new file, 38 additions)

```python
"""Helpers for the simulation."""

from collections.abc import Iterable

from app.config.constants import EventMetricName, SampledMetricName


def alloc_sample_metric(
    enabled_sample_metrics: Iterable[SampledMetricName],
) -> dict[str, list[float | int]]:
    """
    After Pydantic validation of the whole input, instantiate a
    dictionary to collect the sampled metrics the user wants to
    measure.
    """
    # "t" is the alignment axis: if, for example, snapshots of the
    # sampled metrics are taken every 10 ms, then t = [10, 20, 30, ...]
    # and each measured metric value corresponds to the entry of t
    # at the same index.
    dict_sampled_metrics: dict[str, list[float | int]] = {"t": []}
    for key in enabled_sample_metrics:
        dict_sampled_metrics[key] = []
    return dict_sampled_metrics


def alloc_event_metric(
    enabled_event_metrics: Iterable[EventMetricName],
) -> dict[str, list[float | int]]:
    """
    After Pydantic validation of the whole input, instantiate a
    dictionary to collect the event metrics the user wants to
    measure.
    """
    dict_event_metrics: dict[str, list[float | int]] = {}
    for key in enabled_event_metrics:
        dict_event_metrics[key] = []
    return dict_event_metrics
```
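For reference, a quick usage sketch of these helpers (the `app.core.helpers` import path follows from the file location above; the appended values are illustrative):

```python
# Sketch: allocate a collection dict for two sampled metrics, then
# record one sampling tick at t = 10 ms.
from app.config.constants import SampledMetricName
from app.core.helpers import alloc_sample_metric

samples = alloc_sample_metric(
    {SampledMetricName.CORE_BUSY, SampledMetricName.RAM_IN_USE},
)

samples["t"].append(10)                              # snapshot time (ms)
samples[SampledMetricName.CORE_BUSY].append(3)       # cores busy now
samples[SampledMetricName.RAM_IN_USE].append(512.0)  # MB allocated
```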
**Simulation settings schema** (new file, 31 additions)

```python
"""Define a class with the global settings for the simulation."""

from pydantic import BaseModel, Field

from app.config.constants import EventMetricName, SampledMetricName, TimeDefaults


class SimulationSettings(BaseModel):
    """Global parameters that apply to the whole run."""

    total_simulation_time: int = Field(
        default=TimeDefaults.SIMULATION_TIME,
        ge=TimeDefaults.MIN_SIMULATION_TIME,
        description="Simulation horizon in seconds.",
    )

    enabled_sample_metrics: set[SampledMetricName] = Field(
        default_factory=lambda: {
            SampledMetricName.READY_QUEUE_LEN,
            SampledMetricName.CORE_BUSY,
            SampledMetricName.RAM_IN_USE,
        },
        description="Which time-series KPIs to collect by default.",
    )
    enabled_event_metrics: set[EventMetricName] = Field(
        default_factory=lambda: {
            EventMetricName.RQS_LATENCY,
        },
        description="Which per-event KPIs to collect by default.",
    )
```
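Continuing from the class above, a brief sketch of how this schema might validate the settings portion of an input payload (assuming Pydantic v2's `model_validate` and the illustrative string enum values sketched earlier):

```python
# Sketch: validating a settings payload. Unknown metric names are
# rejected by the enum-typed set fields. The string values below
# assume the hypothetical enum values from the earlier sketch.
payload = {
    "total_simulation_time": 120,
    "enabled_sample_metrics": ["ready_queue_len", "throughput_rps"],
    "enabled_event_metrics": ["rqs_latency", "llm_cost"],
}

settings = SimulationSettings.model_validate(payload)
print(settings.enabled_sample_metrics)
```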