
Commit 2686845

Features/metric sampler and collection (#7)
* Defined the architecture for the central collector, plus documentation
* Modified the server to collect metrics and updated the docs
* Completed the metric collector for the server, plus tests
* Removed web-app dependencies and added metrics for request latency
* Improved the docs and the metric collection
* Made the code compatible with the new changes
* Updated CI for the TOML changes
* Updated the lock file
* Minor change
1 parent 13d31d6 commit 2686845
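The metric collector introduced by this commit lives in the backend's `metrics` and `samplers` packages. Purely as an illustration of the underlying idea, a request-latency sampler boils down to recording one number per finished request and summarizing it as percentiles; the class and method names below are hypothetical and are not the API added by this commit:

```python
# Hypothetical sketch of a request-latency sampler (not the project's actual
# collector in src/app/metrics); it records one latency per finished request
# and summarizes the p50/p95/p99 percentiles mentioned in the README.
from dataclasses import dataclass, field
from statistics import quantiles


@dataclass
class LatencySampler:
    samples_ms: list[float] = field(default_factory=list)

    def record(self, latency_ms: float) -> None:
        # Call once per completed request with its end-to-end latency.
        self.samples_ms.append(latency_ms)

    def summary(self) -> dict[str, float]:
        # quantiles(..., n=100) returns the 1st..99th percentile cut points.
        if len(self.samples_ms) < 2:
            return {}
        pct = quantiles(self.samples_ms, n=100)
        return {"p50": pct[49], "p95": pct[94], "p99": pct[98]}


sampler = LatencySampler()
for latency in (12.0, 15.5, 9.8, 30.2, 11.1):
    sampler.record(latency)
print(sampler.summary())
```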


51 files changed: +1130 −2517 lines

.gitattributes

Lines changed: 0 additions & 4 deletions
This file was deleted.

.github/workflows/ci-cd-main.yml

Whitespace-only changes.

.github/workflows/ci-develop.yml

Lines changed: 2 additions & 52 deletions
@@ -62,33 +62,12 @@ jobs:


   # Job 2 ─ Full validation (executed only on push events)
-  # --------------------------------------------------------------------------- #
-  # Includes everything from the quick job plus:
-  #   • PostgreSQL service container
-  #   • Alembic migrations
-  #   • Integration tests
-  #   • Multi-stage Docker build and health-check
-
   full:
     if: |
       github.event_name == 'push' &&
       github.ref == 'refs/heads/develop'
     runs-on: ubuntu-latest

-    services:
-      postgres:
-        image: postgres:17
-        env:
-          POSTGRES_USER: ${{ secrets.DB_USER }}
-          POSTGRES_PASSWORD: ${{ secrets.DB_PASSWORD }}
-          POSTGRES_DB: ${{ secrets.DB_NAME }}
-        ports: ["5432:5432"]
-        options: >-
-          --health-cmd "pg_isready -U $POSTGRES_USER -d $POSTGRES_DB"
-          --health-interval 10s
-          --health-timeout 5s
-          --health-retries 5
-
     steps:
       - uses: actions/checkout@v3
       - uses: actions/setup-python@v4
@@ -110,41 +89,12 @@ jobs:
       - name: Run mypy
         run: poetry run mypy src

-      - name: Apply Alembic migrations
-        env:
-          ENVIRONMENT: test
-          DB_URL: postgresql+psycopg://${{ secrets.DB_USER }}:${{ secrets.DB_PASSWORD }}@localhost:5432/${{ secrets.DB_NAME }}
-        run: poetry run alembic upgrade head
-
       - name: Run all tests
-        env:
-          ENVIRONMENT: test
-          DB_URL: postgresql+asyncpg://${{ secrets.DB_USER }}:${{ secrets.DB_PASSWORD }}@localhost:5432/${{ secrets.DB_NAME }}
         run: |
           poetry run pytest \
             --cov=src --cov-report=term \
             --disable-warnings

-      - name: Build Docker image
-        run: docker build --progress=plain -t backend:ci .
+

-      - name: Smoke test container
-        run: |
-          # we start with --network host so the container shares the runner's network
-          docker run -d \
-            --name backend_ci \
-            --network host \
-            -e ENVIRONMENT=test \
-            -e DB_URL=postgresql+asyncpg://${{ secrets.DB_USER }}:${{ secrets.DB_PASSWORD }}@localhost:5432/${{ secrets.DB_NAME }} \
-            backend:ci \
-            uvicorn app.main:app --host 0.0.0.0 --port 8000
-
-          for i in {1..10}; do
-            if curl --silent --fail http://localhost:8000/health; then
-              echo "✔ Health OK"; break
-            else
-              echo "Waiting…"; sleep 3
-            fi
-          done
-
-          docker stop backend_ci
+

Dockerfile

Lines changed: 0 additions & 50 deletions
This file was deleted.

README.md

Lines changed: 79 additions & 128 deletions
@@ -1,176 +1,127 @@
-Certainly. Here is the content of the `README.md`, displayed directly here.
-
------
-
 # **FastSim Project Overview**

 ## **1. Why FastSim?**

-FastAPI + Uvicorn gives Python teams a lightning-fast async stack, yet sizing it for production still means guesswork, costly cloud load-tests, or late surprises. **FastSim** fills that gap by becoming a **digital twin** of your actual service:
+Modern async Python stacks like FastAPI + Uvicorn are incredibly fast, yet sizing them for production often involves guesswork, costly cloud load-tests, or late-stage surprises. **FastSim** fills that gap by acting as a **digital twin** of your service:

-* It **replicates** your FastAPI + Uvicorn event-loop behavior in SimPy, generating the same kinds of asynchronous steps (parsing, CPU work, I/O, LLM calls) that happen in real code.
-* It **models** your infrastructure primitives—CPU cores (via a SimPy `Resource`), database pools, rate-limiters, and even GPU inference quotas—so you can see queue lengths, scheduling delays, resource utilization, and end-to-end latency.
-* It **outputs** the very metrics you would scrape in production (p50/p95/p99 latency, ready-queue lag, concurrency, throughput, cost per LLM call), but entirely offline, in seconds.
+* It **replicates** the behavior of an async event-loop in SimPy, generating the same kinds of steps (parsing, CPU work, I/O waits) that happen in real code.
+* It **models** your infrastructure primitives—CPU cores, connection pools, and rate-limiters—so you can see queue lengths, scheduling delays, resource utilization, and end-to-end latency.
+* It **outputs** the very metrics you would scrape in production (p50/p95/p99 latency, ready-queue lag, concurrency, throughput), but entirely offline, in seconds.

-With FastSim you can ask, *“What happens if traffic doubles on Black Friday?”*, *“How many cores are needed to keep p95 latency below 100 ms?”*, or *“Is our LLM-driven endpoint ready for prime time?”*—and get quantitative answers **before** you deploy.
+With FastSim, you can ask, *“What happens if traffic doubles on Black Friday?”*, *“How many cores are needed to keep p95 latency below 100 ms?”*, or *“Is our new endpoint ready for prime time?”*—and get quantitative answers **before** you deploy.

 **Outcome:** Data-driven capacity planning, early performance tuning, and far fewer surprises in production.

-## **2. Project Goals**
-
-| \# | Goal | Practical Outcome |
-| :--- | :--- | :--- |
-| 1 | **Pre-production sizing** | Know the required core count, pool size, and replica count to meet your SLA. |
-| 2 | **Scenario analysis** | Explore various traffic models, endpoint mixes, latency distributions, and RTT. |
-| 3 | **Twin metrics** | Produce the same metrics you’ll scrape in production (latency, queue length, CPU utilization). |
-| 4 | **Rapid iteration** | A single YAML/JSON configuration or REST call generates a full performance report. |
-| 5 | **Educational value** | Visualize how GIL contention, queue length, and concurrency react to load. |
-
-## **3. Who Benefits & Why**
-
-| Audience | Pain-Point Solved | FastSim Value |
-| :--- | :--- | :--- |
-| **Backend Engineers** | Unsure if a 4-vCPU container can survive a marketing traffic spike. | Run *what-if* scenarios, tweak CPU cores or pool sizes, and get p95 latency and max-concurrency metrics before merging code. |
-| **DevOps / SRE** | Guesswork in capacity planning; high cost of over-provisioning. | Simulate 1 to N replicas, autoscaler thresholds, and database pool sizes to find the most cost-effective configuration that meets the SLA. |
-| **ML / LLM Product Teams** | LLM inference cost and latency are difficult to predict. | Model the LLM step with a price and latency distribution to estimate cost-per-request and the benefits of GPU batching without needing real GPUs. |
-| **Educators / Trainers** | Students struggle to visualize event-loop internals. | Visualize GIL ready-queue lag, CPU vs. I/O steps, and the effect of blocking code—perfect for live demos and labs. |
-| **Consultants / Architects** | Need a quick proof-of-concept for new client designs. | Define endpoints in YAML and demonstrate throughput and latency under projected load in minutes. |
-| **Open-Source Community** | Lacks a lightweight Python simulator for ASGI workloads. | An extensible codebase makes it easy to plug in new resources (e.g., rate-limiters, caches) or traffic models (e.g., spike, uniform ramp). |
-| **System-Design Interviewees** | Hard to quantify trade-offs in whiteboard interviews. | Prototype real-time metrics—queue lengths, concurrency, latency distributions—to demonstrate how your design scales and where bottlenecks lie. |
-
-## **4. About This Documentation**
+## **2. Installation & Quick Start**

-This project contains extensive documentation covering its vision, architecture, and technical implementation. The documents are designed to be read in sequence to build a comprehensive understanding of the project.
+FastSim is designed to be used as a Python library.

-### **How to Read This Documentation**
+```bash
+# Installation (coming soon to PyPI)
+pip install fastsim
+```

-For the best understanding of FastSim, we recommend reading the documentation in the following order:
+**Example Usage:**
+
+1. Define your system topology in a `config.yml` file:
+
+```yaml
+topology:
+  servers:
+    - id: "app-server-1"
+      # ... server configuration ...
+  load_balancers:
+    - id: "main-lb"
+      backends: ["app-server-1"]
+      # ... lb configuration ...
+settings:
+  duration_s: 60
+```

-1. **README.md (This Document)**: Start here for a high-level overview of the project's purpose, goals, target audience, and development workflow. It provides the essential context for all other documents.
-2. **dev_worflow_guide**: This document details the github workflow for the development
-3. **simulation_input**: This document details the technical contract for configuring a simulation. It explains the `SimulationPayload` and its components (`rqs_input`, `topology_graph`, `sim_settings`). This is essential reading for anyone who will be creating or modifying simulation configurations.
-4. **runtime_and_resources**: A deep dive into the simulation's internal engine. It explains how the validated input is transformed into live SimPy processes (Actors, Resources, State). This is intended for advanced users or contributors who want to understand *how* the simulation works under the hood.
-5. **requests_generator**: This document covers the mathematical and algorithmic details behind the traffic generation model. It is for those interested in the statistical foundations of the simulator.
-6. **Simulation Metrics**: A comprehensive guide to all output metrics. It explains what each metric measures, how it's collected, and why it's important for performance analysis.
+2. Run the simulation from a Python script:

-Optional **fastsim_vision**: a more detailed document about the project vision
+```python
+from fastsim import run_simulation
+from fastsim.schemas import SimulationPayload

-you can find the documentation at the root of the project in the folder `documentation/`
+# Load and validate configuration using Pydantic
+payload = SimulationPayload.from_yaml("config.yml")

-## **5. Development Workflow & Architecture Guide**
+# Run the simulation
+results = run_simulation(payload)

-This section outlines the standardized development workflow, repository architecture, and branching strategy for the FastSim backend.
+# Analyze and plot results
+results.plot_latency_distribution()
+print(results.summary())
+```

-### **Technology Stack**
+## **3. Who Benefits & Why**

-* **Backend**: FastAPI
-* **Backend Package Manager**: Poetry
-* **Frontend**: React + JavaScript
-* **Database**: PostgreSQL
-* **Caching**: Redis
-* **Containerization**: Docker
+| Audience | Pain-Point Solved | FastSim Value |
+| :--- | :--- | :--- |
+| **Backend Engineers** | Unsure if a 4-vCPU container can survive a traffic spike. | Run *what-if* scenarios, tweak CPU cores, and get p95 latency metrics before merging code. |
+| **DevOps / SRE** | Guesswork in capacity planning; high cost of over-provisioning. | Simulate 1 to N replicas to find the most cost-effective configuration that meets the SLA. |
+| **ML / LLM Teams** | LLM inference cost and latency are difficult to predict. | Model the LLM step with a price and latency distribution to estimate cost-per-request. |
+| **Educators / Trainers** | Students struggle to visualize event-loop internals. | Visualize GIL ready-queue lag, CPU vs. I/O steps, and the effect of blocking code. |
+| **System-Design Interviewees** | Hard to quantify trade-offs in whiteboard interviews. | Prototype real-time metrics to demonstrate how your design scales and where bottlenecks lie. |

-### **Backend Service (`FastSim-backend`)**
+## **4. Project Structure**

-The repository hosts the entire FastAPI backend, which exposes the REST API, runs the discrete-event simulation, communicates with the database, and provides metrics.
+The project is a standard Python library managed with Poetry.

 ```
-fastsim-backend/
-├── Dockerfile
-├── docker_fs/
-│   ├── docker-compose.dev.yml
-│   └── docker-compose.prod.yml
-├── scripts/
-│   ├── init-docker-dev.sh
-│   └── quality-check.sh
-├── alembic/
-│   ├── env.py
-│   └── versions/
+fastsim/
 ├── documentation/
-│   └── backend_documentation/
-├── tests/
-│   ├── unit/
-│   └── integration/
+│   └── ...
 ├── src/
 │   └── app/
-│       ├── api/
 │       ├── config/
-│       ├── db/
 │       ├── metrics/
 │       ├── resources/
 │       ├── runtime/
-│       │   ├── rqs_state.py
-│       │   └── actors/
+│       │   ├── actors/
+│       │   └── rqs_state.py
 │       ├── samplers/
-│       ├── schemas/
-│       ├── main.py
-│       └── simulation_run.py
-├── poetry.lock
+│       └── schemas/
+├── tests/
+│   ├── unit/
+│   └── integration/
+├── .github/
+│   └── workflows/
+│       └── ci-develop.yml
 ├── pyproject.toml
+├── poetry.lock
 └── README.md
 ```

-### **How to Start the Backend with Docker (Development)**
+## **5. Development & Contribution**

-To spin up the backend and its supporting services in development mode:
-
-1. **Install & run Docker** on your machine.
-2. **Clone** the repository and `cd` into its root.
-3. Execute:
-   ```bash
-   bash ./scripts/init-docker-dev.sh
-   ```
-   This will launch a **PostgreSQL** container and a **Backend** container that mounts your local `src/` folder with live-reload enabled.
-
-### **Development Architecture & Philosophy**
-
-We split responsibilities between Docker-managed services and local workflows.
-
-* **Docker-Compose for Development**: Containers host external services (PostgreSQL) and run the FastAPI app. Your local `src/` directory is mounted into the backend container for hot-reloading. No tests, migrations, or linting run inside these containers during development.
-* **Local Quality & Testing Workflow**: All code quality tools, migrations, and tests are executed on your host machine for faster feedback and full IDE support.
+We welcome contributions\! The development workflow is managed by Poetry and quality is enforced by Ruff and MyPy.

 | Task | Command | Notes |
 | :--- | :--- | :--- |
-| **Lint & format** | `poetry run ruff check src tests` | Style and best-practice validations |
-| **Type checking** | `poetry run mypy src tests` | Static type enforcement |
-| **Unit tests** | `poetry run pytest -m "not integration"` | Fast, isolated tests—no DB required |
-| **Integration tests** | `poetry run pytest -m integration` | Real-DB tests against Docker’s PostgreSQL |
-| **DB migrations** | `poetry run alembic upgrade head` | Applies migrations to your local Docker-hosted DB |
-
-**Rationale**: Running tests or Alembic migrations inside Docker images would slow down your feedback loop and limit IDE features by requiring you to mount the full source tree and install dev dependencies in each build.
-
-## **6. CI/CD with GitHub Actions**
+| **Install dependencies** | `poetry install --with dev` | Installs main and development packages. |
+| **Lint & format** | `poetry run ruff check src tests` | Style and best-practice validations. |
+| **Type checking** | `poetry run mypy src tests` | Static type enforcement. |
+| **Run all tests** | `poetry run pytest` | Executes the full test suite. |

-We maintain two jobs on the `develop` branch to ensure code quality and stability.
+### **CI with GitHub Actions**

-### **Quick (on Pull Requests)**
+We maintain two jobs on the `develop` branch to ensure code quality:

-* Ruff & MyPy checks
-* Unit tests only
-* **No database required**
+* **Quick (on Pull Requests):** Runs Ruff, MyPy, and unit tests for immediate feedback.
+* **Full (on pushes to `develop`):** Runs the full suite, including integration tests and code coverage reports.

-### **Full (on pushes to `develop`)**
+This guarantees that every commit in `develop` is style-checked, type-safe, and fully tested.

-* All checks from the "Quick" suite
-* Starts a **PostgreSQL** service container
-* Runs **Alembic** migrations
-* Executes the **full test suite** (unit + integration)
-* Builds the **Docker** image
-* **Smoke-tests** the `/health` endpoint of the built container
+## **6. Limitations – v0.1 (First Public Release)**

-**Guarantee**: Every commit in `develop` is style-checked, type-safe, database-tested, and Docker-ready.
+1. **Network Delay Model:** Only pure transport latency is simulated. Bandwidth-related effects (e.g., payload size, link speed) are not yet accounted for.
+2. **Concurrency Model:** The simulation models a single-threaded, cooperative event-loop (like `asyncio`). Multi-process or multi-threaded parallelism is not yet supported.
+3. **CPU Core Allocation:** Every server instance is pinned to one physical CPU core. Horizontal scaling is achieved by adding more server instances, not by using multiple cores within a single process.

-## **7. Limitations – v0.1 (First Public Release)**
+These constraints will be revisited in future milestones.

-1. **Network Delay Model**
-    * Only pure transport latency is simulated.
-    * Bandwidth-related effects (e.g., payload size, link speed, congestion) are NOT accounted for.
-2. **Concurrency Model**
-    * The service exposes **async-only endpoints**.
-    * Execution runs on a single `asyncio` event-loop thread.
-    * No thread-pool workers or multi-process setups are supported yet; therefore, concurrency is limited to coroutine scheduling (cooperative, single-thread).
-3. **CPU Core Allocation**
-    * Every server instance is pinned to **one physical CPU core**.
-    * Horizontal scaling must be achieved via multiple containers/VMs, not via multi-core utilization inside a single process.
+## **7. Documentation**

-These constraints will be revisited in future milestones once kernel-level context-switching costs, I/O bandwidth modeling, and multi-process orchestration are integrated.
+For a deeper understanding of FastSim, we recommend reading the detailed documentation located in the `/documentation` folder at the root of the project. A guided reading path is suggested within to build a comprehensive understanding of the project's vision, architecture, and technical implementation.
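The quick-start in the new README loads and validates `config.yml` through `SimulationPayload.from_yaml` before handing it to `run_simulation`. The project's real schema lives in `src/app/schemas` (the old docs mention components such as `rqs_input`, `topology_graph`, and `sim_settings`); purely as a sketch of the Pydantic-plus-YAML pattern, with simplified and partly hypothetical field names, such a loader could look like this:

```python
# Simplified, partly hypothetical sketch of a Pydantic-validated YAML loader;
# the real SimulationPayload in src/app/schemas defines richer nested models.
from pathlib import Path

import yaml  # PyYAML
from pydantic import BaseModel


class SimSettings(BaseModel):
    duration_s: int


class SimulationPayload(BaseModel):
    topology: dict  # simplified: the real schema uses nested models, not a raw dict
    settings: SimSettings

    @classmethod
    def from_yaml(cls, path: str | Path) -> "SimulationPayload":
        # Parse the YAML file, then let Pydantic validate types and structure.
        raw = yaml.safe_load(Path(path).read_text())
        return cls.model_validate(raw)


payload = SimulationPayload.from_yaml("config.yml")
print(payload.settings.duration_s)
```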
