Merged
49 commits
2b26ea2
setting up the branch
GioeleB00 Jul 3, 2025
9e48446
minor correction
GioeleB00 Jul 3, 2025
7a0f860
minor changes
GioeleB00 Jul 3, 2025
1ab21d5
improved script for linux
GioeleB00 Jul 3, 2025
2a4675f
minor change
GioeleB00 Jul 3, 2025
91c0a7e
Delete docker_fs/.env.dev
GioeleB00 Jul 4, 2025
ab6c48a
Delete docker_fs/.env.test
GioeleB00 Jul 4, 2025
c06166b
Features/event generator (#1)
GioeleB00 Jul 6, 2025
86501f9
minor changes
GioeleB00 Jul 6, 2025
4a7081e
Merge branch 'develop' of github.com:GioeleB00/FastSim-backend into d…
GioeleB00 Jul 6, 2025
5c6f5db
Features/event generator documentation test improvements (#2)
GioeleB00 Jul 10, 2025
d43c915
README update
GioeleB00 Jul 10, 2025
02b4756
Improved constants management
GioeleB00 Jul 10, 2025
6f06bb7
Clean and refactor
GioeleB00 Jul 11, 2025
ee5d202
Features/request handler endpoint input (#3)
GioeleB00 Jul 13, 2025
953e318
Features/definition full payload simulation (#4)
GioeleB00 Jul 15, 2025
35d507e
Features/rqs generator runtime (#5)
GioeleB00 Jul 18, 2025
13d31d6
Features/client server runtime (#6)
GioeleB00 Jul 24, 2025
2686845
Features/metric sampler and collection (#7)
GioeleB00 Jul 29, 2025
7f1f7de
Features/metrics elaboration (#8)
GioeleB00 Aug 1, 2025
84f9897
Features/load balancer node (#9)
GioeleB00 Aug 3, 2025
7649866
Features/simulation runner (#10)
GioeleB00 Aug 6, 2025
d6ff9f6
Features/integration tests unit tests (#11)
GioeleB00 Aug 8, 2025
90df33d
new readme and guide to build yaml
GioeleB00 Aug 8, 2025
43eb8c5
Update README.md
GioeleB00 Aug 8, 2025
2a306ac
Added pybuilder and unit tests (#12)
GioeleB00 Aug 13, 2025
0098453
Refactor/change project name plus docs improvement (#13)
GioeleB00 Aug 13, 2025
c900708
minor changes
GioeleB00 Aug 13, 2025
adcb9eb
Refactor/pypi preparation (#14)
GioeleB00 Aug 14, 2025
111cf7b
minor changes
GioeleB00 Aug 14, 2025
59b54b6
minor changes
GioeleB00 Aug 14, 2025
02951e6
Features/lb example and docs tutorial (#15)
GioeleB00 Aug 15, 2025
d77415e
sanity ci check
GioeleB00 Aug 15, 2025
de52d04
Ci for main (#16)
GioeleB00 Aug 17, 2025
fb08027
fixing a bug
GioeleB00 Aug 17, 2025
09d99dc
bug fixed
GioeleB00 Aug 17, 2025
bc5e329
Merge branch 'main' into develop
GioeleB00 Aug 17, 2025
1c47441
version bump
GioeleB00 Aug 17, 2025
dfea747
Merge branch 'main' into develop
GioeleB00 Aug 17, 2025
c0684f0
Merge branch 'main' into develop
GioeleB00 Aug 17, 2025
7d641e9
Merge branch 'main' into develop
GioeleB00 Aug 17, 2025
623cd6a
Merge branch 'main' into develop
GioeleB00 Aug 17, 2025
f57059b
version bump
GioeleB00 Aug 17, 2025
e9afe18
Feature/event injection input (#19)
GioeleB00 Aug 19, 2025
e869e67
Feature/event injection runtime (#20)
GioeleB00 Aug 28, 2025
75a025f
Refactor/fixing readme small fixes (#21)
GioeleB00 Aug 29, 2025
bf274eb
small fix ci
GioeleB00 Aug 29, 2025
8aace46
fixing ci
GioeleB00 Aug 29, 2025
5ec9e80
small test fix
GioeleB00 Aug 29, 2025
12 changes: 9 additions & 3 deletions .github/workflows/ci-main.yml
@@ -49,6 +49,12 @@ jobs:
        run: poetry run mypy src tests

      - name: All tests (unit + integration + system)
-       run: |
-         poetry run pytest \
-           --disable-warnings
+       run: poetry run pytest --disable-warnings --cov=asyncflow --cov-report=xml
+
+     - name: Upload coverage to Codecov
+       uses: codecov/codecov-action@v4
+       with:
+         files: coverage.xml
+         flags: tests
+         fail_ci_if_error: true
+         token: ${{ secrets.CODECOV_TOKEN }}
115 changes: 115 additions & 0 deletions CHANGELOG.MD
@@ -0,0 +1,115 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## \[Unreleased]

### Planned

* **Network baseline upgrade** (sockets, RAM per connection, keep-alive).
* **New metrics and visualization improvements** (queue wait times, service histograms).
* **Monte Carlo analysis** with confidence intervals.

---

## \[0.1.1] – 2025-08-29

### Added

* **Event Injection (runtime-ready):**

* Declarative events with `start` / `end` markers (server down/up, network spike start/end).
* Runtime scheduler integrated with SimPy, applying events at the right simulation time.
* Deterministic latency **offset handling** for network spikes (phase 1).

* **Improved Server Model:**

* Refined CPU + I/O handling with clearer queue accounting.
* Ready queue length now explicitly updated on contention.
* I/O queue metrics improved with better protection against mis-counting edge cases.
* Enhanced readability and maintainability in endpoint step execution flow.

### Documentation

* Expanded examples on event injection in YAML.
* Inline comments clarifying queue management logic.
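
Sketched in YAML, such a declarative event might look like the following; the field names (`kind`, `target_id`, `start`, `end`, `latency_offset_ms`) are illustrative assumptions, not the exact schema, so check the `examples/` directory in the repository for the real format.

```yaml
events:
  - id: spike-1
    kind: network_spike          # adds a deterministic latency offset to an edge
    target_id: client-server-edge
    start: 60.0                  # simulation seconds
    end: 120.0
    latency_offset_ms: 20
  - id: outage-1
    kind: server_down            # paired down/up markers for a server node
    target_id: srv-1
    start: 200.0
    end: 260.0
```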

### Notes

* This is still an **alpha-series** release, but now supports scenario-driven **event injection** and a more faithful **server runtime model**, paving the way for the upcoming network baseline upgrade.

---

## \[0.1.0a2] – 2025-08-17

### Fixed

* **Quickstart YAML in README**: corrected field to ensure a smooth first run for new users.

### Notes

* Minor docs polish only; no runtime changes.

---

## \[0.1.0a1] – 2025-08-17

### Changed

* Repository aligned with the **PyPI 0.1.0a1** build.
* Packaging metadata tidy-up in `pyproject.toml`.

### CI

* Main workflow now also triggers on **push** to `main`.

### Notes

* No functional/runtime changes.

---

## \[v0.1.0-alpha] – 2025-08-17

**First public alpha** of AsyncFlow: a SimPy-based, **event-loop-aware** simulator for async distributed systems.

### Highlights

* **Event-loop model** per server: explicit **CPU** (blocking), **I/O waits** (non-blocking), **RAM** residency.
* **Topology graph**: generator → client → (LB, optional) → servers; multi-server via **round-robin**; **stochastic network latency** and optional dropouts.
* **Workload**: stochastic traffic via simple RV configs (Poisson defaults).
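
The two-stage idea behind the workload generator can be sketched in plain Python (a generic illustration of the sampling scheme, not AsyncFlow's internal code): first draw the number of active users, then draw their combined request count.

```python
import math
import random

def poisson(lam: float, rng: random.Random) -> int:
    """Knuth's method for a Poisson variate (fine for moderate lam)."""
    if lam == 0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_rps(mean_users: float, mean_user_rpm: float, rng: random.Random) -> float:
    """Two-stage workload sampling: active users, then requests per minute."""
    users = poisson(mean_users, rng)                  # stage 1: how many users
    total_rpm = poisson(users * mean_user_rpm, rng)   # stage 2: their combined rate
    return total_rpm / 60.0                           # aggregate RPS

rng = random.Random(42)
rps_trace = [sample_rps(10, 6.0, rng) for _ in range(5)]  # fluctuates around 1 RPS
```

Because the user count itself is random, the resulting RPS is noticeably burstier than a single Poisson stream with the same mean.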

### Metrics & Analyzer

* **Event metrics**: `RqsClock` (end-to-end latency).
* **Sampled metrics**: `ready_queue_len`, `event_loop_io_sleep`, `ram_in_use`, `edge_concurrent_connection`.
* **Analyzer API** (`ResultsAnalyzer`):

* `get_latency_stats()`, `get_throughput_series()`
* Plots: `plot_latency_distribution()`, `plot_throughput()`
* Per-server: `plot_single_server_ready_queue()`, `plot_single_server_io_queue()`, `plot_single_server_ram()`
* Compact dashboards.
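
The KPI computation itself is ordinary percentile math; a dependency-free sketch (generic code, not the `ResultsAnalyzer` implementation) of how p95 falls out of raw latency samples:

```python
def percentile(values: list[float], p: float) -> float:
    """Linear-interpolation percentile, with p in [0, 1]."""
    s = sorted(values)
    k = (len(s) - 1) * p        # fractional index of the requested quantile
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

# End-to-end latencies in ms; the tail is dominated by the two slow requests.
latencies_ms = [12.0, 15.0, 14.0, 80.0, 13.0, 16.0, 14.5, 13.5, 15.5, 90.0]
p95 = percentile(latencies_ms, 0.95)
```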

### Examples

* YAML quickstart (single server).
* Pythonic builder:

* Single server.
* **Load balancer + two servers** example with saved figures.

### Tooling & CI

* One-shot setup scripts (`dev_setup`, `quality_check`, `run_tests`, `run_sys_tests`) for Linux/macOS/Windows.
* GitHub Actions: Ruff + MyPy + Pytest; **system tests gate merges** into `main`.

### Compatibility

* **Python 3.12+** (Linux/macOS/Windows).
* Install from PyPI: `pip install asyncflow-sim`.




161 changes: 65 additions & 96 deletions README.md
@@ -1,40 +1,78 @@

# AsyncFlow: Event-Loop Aware Simulator for Async Distributed Systems
# AsyncFlow: Scenario-Driven Simulator for Async Systems

Created and maintained by @GioeleB00.

[![PyPI](https://img.shields.io/pypi/v/asyncflow-sim)](https://pypi.org/project/asyncflow-sim/)
[![Python](https://img.shields.io/pypi/pyversions/asyncflow-sim)](https://pypi.org/project/asyncflow-sim/)
[![License](https://img.shields.io/github/license/AsyncFlow-Sim/AsyncFlow)](LICENSE)
[![Status](https://img.shields.io/badge/status-v0.1.0alpha-orange)](#)
[![codecov](https://codecov.io/gh/AsyncFlow-Sim/AsyncFlow/branch/main/graph/badge.svg)](https://codecov.io/gh/AsyncFlow-Sim/AsyncFlow)
[![Ruff](https://img.shields.io/badge/lint-ruff-informational)](https://github.com/astral-sh/ruff)
[![Typing](https://img.shields.io/badge/typing-mypy-blueviolet)](https://mypy-lang.org/)
[![Tests](https://img.shields.io/badge/tests-pytest-6DA55F)](https://docs.pytest.org/)
[![SimPy](https://img.shields.io/badge/built%20with-SimPy-1f425f)](https://simpy.readthedocs.io/)

-----

AsyncFlow is a discrete-event simulator for modeling and analyzing the performance of asynchronous, distributed backend systems built with SimPy. You describe your system's topologyβ€”its servers, network links, and load balancersβ€”and AsyncFlow simulates the entire lifecycle of requests as they move through it.
**AsyncFlow** is a scenario-driven simulator for **asynchronous distributed backends**.
You don't "predict the Internet"; you **declare scenarios** (network RTT + jitter, resource caps, failure events) and AsyncFlow shows the operational impact: concurrency, queue growth, socket/RAM pressure, latency distributions. This means you can evaluate architectures before implementation: test scaling strategies, network assumptions, or failure modes without writing production code.

It provides a **digital twin** of your service, modeling not just the high-level architecture but also the low-level behavior of each server's **event loop**, including explicit **CPU work**, **RAM residency**, and **I/O waits**. This allows you to run realistic "what-if" scenarios that behave like production systems rather than toy benchmarks.
At its core, AsyncFlow is **event-loop aware**:

* **CPU work** blocks the loop,
* **RAM residency** ties up memory until release,
* **I/O waits** free the loop just like in real async frameworks.

With the new **event injection engine**, you can explore *what-if* dynamics: network spikes, server outages, degraded links, all under your control.

---

### What Problem Does It Solve?

Modern async stacks like FastAPI are incredibly performant, but predicting their behavior under real-world load is difficult. Capacity planning often relies on guesswork, expensive cloud-based load tests, or discovering bottlenecks only after a production failure. AsyncFlow is designed to replace that uncertainty with **data-driven forecasting**, allowing you to understand how your system will perform before you deploy a single line of code.
Predicting how an async system will behave under real-world load is notoriously hard. Teams often rely on rough guesses, over-provisioning, or painful production incidents. **AsyncFlow replaces guesswork with scenario-driven simulations**: you declare the conditions (network RTT, jitter, resource limits, injected failures) and observe the consequences on latency, throughput, and resource pressure.

---

### Why Scenario-Driven? *Design Before You Code*

AsyncFlow doesn't need your backend to exist.
You can model your architecture with YAML or Python, run simulations, and explore bottlenecks **before writing production code**.
This scenario-driven approach lets you stress-test scaling strategies, network assumptions, and failure modes safely and repeatably.

---

### How Does It Work?

### How Does It Work? An Example Topology
AsyncFlow represents your system as a **directed graph of components** (clients, load balancers, servers) connected by network edges with configurable latency models. Each server is **event-loop aware**: CPU work blocks the loop, RAM stays allocated, and I/O yields the loop, just like real async frameworks. You can define topologies via **YAML** or a **Pythonic builder**.

AsyncFlow models your system as a directed graph of interconnected components. A typical setup might look like this:
![Topology](https://raw.githubusercontent.com/AsyncFlow-Sim/AsyncFlow/main/readme_img/topology.png)

![Topology at a glance](readme_img/topology.png)
Run the simulation and inspect the outputs:

<p>
<a href="https://raw.githubusercontent.com/AsyncFlow-Sim/AsyncFlow/main/readme_img/lb_dashboard.png">
<img src="https://raw.githubusercontent.com/AsyncFlow-Sim/AsyncFlow/main/readme_img/lb_dashboard.png" alt="Latency + Throughput Dashboard" width="300">
</a>
<a href="https://raw.githubusercontent.com/AsyncFlow-Sim/AsyncFlow/main/readme_img/lb_server_srv-1_metrics.png">
<img src="https://raw.githubusercontent.com/AsyncFlow-Sim/AsyncFlow/main/readme_img/lb_server_srv-1_metrics.png" alt="Server 1 Metrics" width="300">
</a>
<a href="https://raw.githubusercontent.com/AsyncFlow-Sim/AsyncFlow/main/readme_img/lb_server_srv-2_metrics.png">
<img src="https://raw.githubusercontent.com/AsyncFlow-Sim/AsyncFlow/main/readme_img/lb_server_srv-2_metrics.png" alt="Server 2 Metrics" width="300">
</a>
</p>


---

### What Questions Can It Answer?

By running simulations on your defined topology, you can get quantitative answers to critical engineering questions, such as:
With scenario simulations, AsyncFlow helps answer questions such as:

* How does **p95 latency** shift if active users double?
* What happens when a **client–server edge** suffers a 20 ms spike for 60 seconds?
* Will a given endpoint pipeline (CPU parse → RAM allocation → DB I/O) still meet its **SLA at 40 RPS**?
* How many sockets and how much RAM will a load balancer need under peak conditions?

* How does **p95 latency** change if active users increase from 100 to 200?
* What is the impact on the system if the **client-to-server network latency** increases by 3ms?
* Will a specific API endpoint, with a pipeline of parsing, RAM allocation, and database I/O, hold its **SLA at a load of 40 requests per second**?
---

## Installation
@@ -167,7 +205,7 @@ You'll get latency stats in the terminal and a PNG with four charts (latency d

**Want more?**

For ready-to-run scenarios including examples using the Pythonic builder and multi-server topologies, check out the `examples/` directory in the repository.

## Development

@@ -279,97 +317,28 @@ bash scripts/run_sys_tests.sh

Executes **pytest** with a terminal coverage summary (no XML, no slowest list).

## Current Limitations (v0.1.1)

AsyncFlow is still in alpha. The current release has some known limitations that are already on the project roadmap:

## What AsyncFlow Models (v0.1)

AsyncFlow provides a detailed simulation of your backend system. Here is a high-level overview of the core components it models. For a deeper technical dive into the implementation and design rationale, follow the links to the internal documentation.

* **Async Event Loop:** Simulates a single-threaded, non-blocking event loop per server. **CPU steps** block the loop, while **I/O steps** are non-blocking, accurately modeling `asyncio` behavior.
* *(Deep Dive: `docs/internals/runtime-and-resources.md`)*

* **System Resources:** Models finite server resources, including **CPU cores** and **RAM (MB)**. Requests must acquire these resources, creating natural back-pressure and contention when the system is under load.
* *(Deep Dive: `docs/internals/runtime-and-resources.md`)*

* **Endpoints & Request Lifecycles:** Models server endpoints as a linear sequence of **steps**. Each step is a distinct operation, such as `cpu_bound_operation`, `io_wait`, or `ram` allocation.
* *(Schema Definition: `docs/internals/simulation-input.md`)*

* **Network Edges:** Simulates the connections between system components. Each edge has a configurable **latency** (drawn from a probability distribution) and an optional **dropout rate** to model packet loss.
* *(Schema Definition: `docs/internals/simulation-input.md` | Runtime Behavior: `docs/internals/runtime-and-resources.md`)*

* **Stochastic Workload:** Generates user traffic based on a two-stage sampling model, combining the number of active users and their request rate per minute to produce a realistic, fluctuating load (RPS) on the system.
* *(Modeling Details with mathematical explanation and clear assumptions: `docs/internals/requests-generator.md`)*

* **Metrics & Outputs:** Collects two types of data: **time-series metrics** (e.g., `ready_queue_len`, `ram_in_use`) and **event-based data** (`RqsClock`). This raw data is used to calculate final KPIs like **p95/p99 latency** and **throughput**.
* *(Metric Reference: `docs/internals/metrics`)*
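
The event-loop semantics in the first bullet (CPU blocks the loop, I/O yields it) mirror real `asyncio` behavior, which a few lines of standard-library Python can demonstrate; AsyncFlow models the same effect in SimPy rather than running `asyncio` itself.

```python
import asyncio

async def handler(name: str, log: list[str]) -> None:
    log.append(f"{name}:cpu")   # CPU-style work: runs to completion, blocking the loop
    await asyncio.sleep(0)      # I/O wait: yields control back to the event loop
    log.append(f"{name}:done")

async def main() -> list[str]:
    log: list[str] = []
    await asyncio.gather(handler("a", log), handler("b", log))
    return log

log = asyncio.run(main())
# Both CPU sections run before either handler finishes:
# ['a:cpu', 'b:cpu', 'a:done', 'b:done']
```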

## Current Limitations (v0.1)

* Network realism: base latency + optional drops (no bandwidth/payload/TCP yet).
* Single event loop per server: no multi-process/multi-node servers yet.
* Linear endpoint flows: no branching/fan-out within an endpoint.
* No thread-level concurrency; modeling OS threads and scheduler/context switching is out of scope.
* Stationary workload: no diurnal patterns or feedback/backpressure.
* Sampling cadence: very short spikes can be missed if `sample_period_s` is large.
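
The last bullet is easy to demonstrate with generic code (unrelated to AsyncFlow's samplers): a 50 ms spike in a signal sampled every 100 ms can fall entirely between two sample points.

```python
def sample(signal, period_s: float, duration_s: float) -> list[float]:
    """Sample a continuous-time signal at a fixed cadence."""
    n = round(duration_s / period_s)
    return [signal(i * period_s) for i in range(n + 1)]

def queue_len(t: float) -> float:
    # Baseline of 1, with a short spike to 50 between t=0.42s and t=0.47s.
    return 50.0 if 0.42 <= t <= 0.47 else 1.0

coarse = sample(queue_len, period_s=0.1, duration_s=1.0)   # misses the spike entirely
fine = sample(queue_len, period_s=0.01, duration_s=1.0)    # catches it
```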


## Roadmap (Order is not indicative of priority)

This roadmap outlines the key development areas to transform AsyncFlow into a comprehensive framework for statistical analysis and resilience modeling of distributed systems.

### 1. Monte Carlo Simulation Engine

**Why:** To overcome the limitations of a single simulation run and obtain statistically robust results. This transforms the simulator from an "intuition" tool into an engineering tool for data-driven decisions with confidence intervals.

* **Independent Replications:** Run the same simulation N times with different random seeds to sample the space of possible outcomes.
* **Warm-up Period Management:** Introduce a "warm-up" period to be discarded from the analysis, ensuring that metrics are calculated only on the steady-state portion of the simulation.
* **Ensemble Aggregation:** Calculate means, standard deviations, and confidence intervals for aggregated metrics (latency, throughput) across all replications.
* **Confidence Bands:** Visualize time-series data (e.g., queue lengths) with confidence bands to show variability over time.
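
The ensemble-aggregation step reduces to standard statistics; a minimal sketch, assuming a normal approximation across replication means (generic code, not a committed API):

```python
import math

def confidence_interval(replication_means: list[float]) -> tuple[float, float]:
    """95% CI for the grand mean, normal approximation across replications."""
    n = len(replication_means)
    mean = sum(replication_means) / n
    var = sum((x - mean) ** 2 for x in replication_means) / (n - 1)  # sample variance
    half = 1.96 * math.sqrt(var / n)
    return (mean - half, mean + half)

# e.g. p95 latency (ms) observed in five independent replications:
lo, hi = confidence_interval([118.0, 124.0, 121.0, 119.0, 123.0])
```

For small replication counts a Student-t multiplier would be more appropriate than 1.96; the structure stays the same.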

### 2. Realistic Service Times (Stochastic Service Times)

**Why:** Constant service times underestimate tail latencies (p95/p99), which are almost always driven by "slow" requests. Modeling this variability is crucial for a realistic analysis of bottlenecks.

* **Distributions for Steps:** Allow parameters like `cpu_time` and `io_waiting_time` in an `EndpointStep` to be sampled from statistical distributions (e.g., Lognormal, Gamma, Weibull) instead of being fixed values.
* **Per-Request Sampling:** Each request will sample its own service times independently, simulating the natural variability of a real-world system.
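
Per-request sampling of this kind needs nothing beyond the standard library; a sketch with lognormal CPU times, where the parameterization is an illustrative assumption rather than the planned schema:

```python
import math
import random

def sample_cpu_time(median_s: float, sigma: float, rng: random.Random) -> float:
    """Lognormal service time: exp(mu) is the median, sigma widens the tail."""
    mu = math.log(median_s)
    return rng.lognormvariate(mu, sigma)

rng = random.Random(7)
times = [sample_cpu_time(0.005, 0.8, rng) for _ in range(1000)]
# Every sample is positive, and the right tail stretches far past the median,
# which is exactly what drives p95/p99 in real systems.
```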

### 3. Component Library Expansion

**Why:** To increase the variety and realism of the architectures that can be modeled.

* **New System Nodes:**
* `CacheRuntime`: To model caching layers (e.g., Redis) with hit/miss logic, TTL, and warm-up behavior.
* `APIGatewayRuntime`: To simulate API Gateways with features like rate-limiting and authentication caching.
* `DBRuntime`: A more advanced model for databases featuring connection pool contention and row-level locking.
* **New Load Balancer Algorithms:** Add more advanced routing strategies (e.g., Weighted Round Robin, Least Response Time).
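
A weighted round-robin of the kind envisioned can be sketched generically (not AsyncFlow's implementation): each server appears in the rotation in proportion to its weight.

```python
from itertools import cycle

def weighted_round_robin(weights: dict[str, int]):
    """Yield server ids forever, each in proportion to its integer weight."""
    expanded = [srv for srv, w in weights.items() for _ in range(w)]
    return cycle(expanded)

rr = weighted_round_robin({"srv-1": 2, "srv-2": 1})
order = [next(rr) for _ in range(6)]
# ['srv-1', 'srv-1', 'srv-2', 'srv-1', 'srv-1', 'srv-2']
```

Production balancers usually use the "smooth" variant to interleave heavy servers more evenly, but the proportion of traffic per server is identical.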

### 4. Fault and Event Injection

**Why:** To test the resilience and behavior of the system under non-ideal conditions, a fundamental use case for Site Reliability Engineering (SRE).

* **API for Scheduled Events:** Introduce a system to schedule events at specific simulation times, such as:
* **Node Down/Up:** Turn a server off and on to test the load balancer's failover logic.
* **Degraded Edge:** Drastically increase the latency or drop rate of a network link.
* **Error Bursts:** Simulate a temporary increase in the rate of application errors.

### 5. Advanced Network Modeling
* **Network model**: only base latency plus jitter/spikes. Bandwidth, queuing, and protocol-level details (HTTP/2 streams, QUIC, TLS handshakes) are not yet modeled.

**Why:** To more faithfully model network-related bottlenecks that are not solely dependent on latency.
* **Server model**: single event loop per server. Multi-process or multi-threaded execution is not yet supported.

* **Bandwidth and Payload Size:** Introduce the concepts of link bandwidth and request/response size to simulate delays caused by data transfer.
* **Retries and Timeouts:** Model retry and timeout logic at the client or internal service level.
* **Endpoint flows**: endpoints are linear pipelines. Branching/fan-out (e.g. service calls to DB + cache) will be added in future versions.

### 6. Complex Endpoint Flows
* **Workload generation**: stationary workloads only. No support yet for diurnal patterns, feedback loops, or adaptive backpressure.

**Why:** To model more realistic business logic that does not follow a linear path.
* **Overload policies**: no explicit handling of overload conditions. Queue caps, deadlines, timeouts, rate limiting, and circuit breakers are not yet implemented.

* **Conditional Branching:** Introduce the ability to have conditional steps within an endpoint (e.g., a different path for a cache hit vs. a cache miss).
* **Fan-out / Fan-in:** Model scenarios where a service calls multiple downstream services in parallel and waits for their responses.
* **Sampling cadence**: very short events may be missed if the `sample_period_s` is too large.

### 7. Backpressure and Autoscaling

**Why:** To simulate the behavior of modern, adaptive systems that react to load.

* **Dynamic Rate Limiting:** Introduce backpressure mechanisms where services slow down the acceptance of new requests if their internal queues exceed a certain threshold.
* **Autoscaling Policies:** Model simple Horizontal Pod Autoscaler (HPA) policies where the number of server replicas increases or decreases based on metrics like CPU utilization or queue length.
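
An HPA-style policy like the one described reduces to a small pure function; a sketch under the usual utilization-ratio rule (illustrative, not a committed design):

```python
import math

def desired_replicas(current: int, cpu_util: float, target_util: float,
                     min_r: int = 1, max_r: int = 10) -> int:
    """HPA-style rule: replicas scale with the ratio of observed to target load."""
    raw = math.ceil(current * cpu_util / target_util)
    return max(min_r, min(max_r, raw))  # clamp to configured bounds

desired_replicas(3, 0.9, 0.6)   # observed load 1.5x target -> scale out to 5
desired_replicas(4, 0.3, 0.6)   # observed load 0.5x target -> scale in to 2
```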
📌 See the [ROADMAP](./ROADMAP.md) for planned features and upcoming milestones.
