This repository presents a Kubernetes-based monitoring and logging stack for large-scale experimentation with NimP2P nodes.
The solution is multi-tenant: multiple experiments can run concurrently without interfering with each other. Each experiment’s metrics and logs are isolated in Grafana dashboards, thanks to a unique experiment label applied to every pod created by the experiment’s StatefulSet.
- Deploying the Kubernetes Cluster
- Architecture and Design Choices
- Experiment Unit
- Metrics Collectors and Exporters
- Visualization
- Extra Thinking
- Future Improvements
- To deploy the cluster and all the required monitoring/logging components, please follow the instructions in Cluster Setup.
- To use the nimp2p-lab CLI tool, you can place the binary provided in this repo directly under /usr/local/bin/
- To build nimp2p-lab from source, run:
./build.sh
Note: you need to copy your /etc/rancher/k3s/k3s.yaml into ~/.kube/config so that nimp2p-lab can access Kubernetes.
At a high level, the proposed solution consists of a lightweight Golang-based CLI tool (nimp2p-lab) that interacts with a K3s cluster (preconfigured with monitoring/logging components), allowing users to:
- Create new experiments
- List active experiments and their status
- Scale experiments (up or down)
- Delete experiments
The picture below illustrates the overall stack, highlighting the main components and their corresponding Kubernetes objects.
An experiment unit is a self-contained deployment that represents a NimP2P network.
Each unit consists of a StatefulSet (to manage the pods) and a dedicated headless service (to handle peer discovery).
I used a StatefulSet instead of a Deployment to ensure deterministic identities for the NimP2P pods in each experiment (e.g., nimp2p-0, nimp2p-1, …, nimp2p-N for the first, second, and N-th pods) and to maintain these identities even after rescheduling.
This predictable naming makes it easier to interpret experiment metrics and track the behavior of specific pods (even after restarts). More importantly, it allows logs from a given pod to be reliably found, even if the pod is rescheduled on a different node (e.g., after a crash or when resource limits are exceeded).
Key aspects:
- Unique identity: Every experiment has its own StatefulSet and headless ClusterIP service. Their names are unique, and the experiment name is propagated as a custom label to all pods.
- Metrics isolation: The custom labels are later used in Prometheus, Grafana and Promtail configuration to separate metrics per experiment, enabling a true multi-tenant setup.
- Peer isolation: A dedicated headless service ensures that nodes only discover peers within the same experiment, preventing cross-experiment traffic.
- Resource control: CPU, memory, and bandwidth limits are applied to each pod. This allows experiments to emulate constrained environments (e.g., Raspberry Pi, smartphones, or low-bandwidth links) while keeping cluster resources balanced.
The command below uses the nimp2p-lab tool to create an experiment named purple with 5 peers.
Each peer is restricted to 5% of a CPU core, 16 MiB of RAM, and 16 Mbps uplink/downlink bandwidth.
Each peer connects randomly to 2 discovered peers and sends messages of 1440 B every 2000 ms.
The Docker image katakuri100/nimp2p:v2.0.14 is used for each peer (the --img parameter lets users evaluate different versions or variants of the nimp2p application).
$ nimp2p-lab create --name purple --peers 5 --msize 1440 --mrate 2000 --conn 2 --cpu 0.05 --ram 16 --dl 16 --ul 16 --img katakuri100/nimp2p:v2.0.14
[INFO] headless service 'nimp2p-exp-purple' created
[INFO] experiment 'purple' created

$ nimp2p-lab list
EXPERIMENT   FULLNAME            SCALE   RUNNING   STARTED AT             AGE
blue         nimp2p-exp-blue     4       4         2025-09-20T11:19:09Z   38m24s
purple       nimp2p-exp-purple   5       5         2025-09-20T11:57:09Z   24s

The monitoring stack collects both host-level and pod-level metrics to provide visibility into the health of the cluster and the performance of NimP2P experiments.
Host-level metrics are essential for debugging node issues and tracking aggregate resource consumption.
These include:
- CPU and memory usage
- Bandwidth utilization
- Packet loss and byte errors
To collect these, I deployed Node Exporter as a DaemonSet, ensuring that the metrics of every node in the cluster can be scraped by Prometheus.
Pod-level metrics focus on evaluating NimP2P nodes directly. For these, I've selected:
- Uplink and downlink data rate between peers
- Packet drops and byte errors
- CPU, memory, and bandwidth consumption per pod
To export the above metrics, I deployed cAdvisor as a DaemonSet.
Round-trip time (RTT) between peers is a critical metric for evaluating network health and latency, especially in a GossipSub network:
- Messages may be forwarded across multiple peers, each hop adding extra latency.
- Measuring RTT between peers helps estimate the expected message delivery time, whether messages are relayed or delivered directly.
- RTT measurement also serves as a network health check: a successful ping between two nodes indicates that the network is functioning properly.
Currently, RTT metrics are not yet collected due to time constraints.
The plan is to implement a custom Golang exporter that runs as a sidecar container in each StatefulSet pod.
This exporter will:
- Periodically discover peers via the experiment’s headless service
- Periodically probe them (e.g., via ICMP or lightweight protocols)
- Expose a /metrics endpoint for Prometheus scraping
Note: The Blackbox Exporter is insufficient for this scenario, since it cannot dynamically measure pod-to-pod RTT for every joining peer.
Metrics are scraped by Prometheus, deployed on the master node as a standard Kubernetes Deployment.
- The configuration is stored in a ConfigMap (Prometheus config).
- Three jobs are currently defined:
- Node Exporter (host-level metrics)
- cAdvisor (pod-level metrics)
- RTT Exporter (as a placeholder)
- Filters are applied in the Prometheus config to scrape only the `dst-lab` namespace, avoiding unnecessary metrics collection and reducing bandwidth overhead.
Logs are pushed to Loki, which is deployed on the master node as a standard Kubernetes Deployment. The Loki configuration is stored in a ConfigMap (Loki config).
Logs are pushed by Promtail, which is deployed as a DaemonSet, with the host’s pod log directory (/var/log/pods) mounted to expose container-level logs. The Promtail configuration is stored in a ConfigMap (Promtail config). The config includes multiple relabelling rules (so that logs can be shown per experiment in Grafana) and uses a regex to select only pods matching the prefix `nimp2p-exp.*`
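The effect of that relabelling can be illustrated with a small Go snippet; the real rules live in the Promtail ConfigMap, and the exact pod-name pattern here is an assumption based on the `nimp2p-exp` prefix and StatefulSet ordinal suffixes:

```go
package main

import (
	"fmt"
	"regexp"
)

// expPattern mirrors what the Promtail relabel regex does: it captures the
// experiment name out of a pod name like "nimp2p-exp-purple-0" so Grafana
// can filter logs by a per-experiment label.
var expPattern = regexp.MustCompile(`^nimp2p-exp-(.+)-\d+$`)

// experimentFromPod returns the experiment name and whether the pod belongs
// to an experiment at all; non-matching pods would simply be dropped.
func experimentFromPod(pod string) (string, bool) {
	m := expPattern.FindStringSubmatch(pod)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	for _, pod := range []string{"nimp2p-exp-purple-0", "prometheus-6d4b9f"} {
		name, ok := experimentFromPod(pod)
		fmt.Println(pod, "->", name, ok)
	}
}
```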
For visualization, I rely on Grafana and the Kubernetes Dashboard, both deployed as standard Kubernetes deployments.
- Grafana: Configured through a ConfigMap, with Prometheus and Loki set as data sources.
- Kubernetes Dashboard: Provides a general-purpose web UI for inspecting workloads, nodes, and cluster health.
I created four main Grafana dashboards to support experiment analysis:
This dashboard tracks experiment-level metrics such as data rate, CPU, and memory usage — both per experiment and per individual NimP2P node. It also shows the contribution of each experiment (or node) to the overall cluster resource usage.
This makes it possible to detect when additional resources are required or when the cluster is nearing capacity.
Example (purple experiment):
This dashboard focuses on logs and message statistics:
- Raw logs per experiment and per individual NimP2P node
- Counts of `INFO` vs. `ERROR` messages
- Number of transmitted messages (e.g., by counting `sending at` in logs)
Future improvements could include:
- Tracking successful sends vs. failures
- Number of topics per peer
- Mesh degree per topic per peer
- End-to-end delivery delay
Example (purple experiment):
This dashboard monitors aggregated physical resources at the cluster and node levels:
- CPU
- Memory
- Bandwidth
- (future) RTT
It provides visibility into overall capacity and helps determine when additional nodes or resources are required to scale experiments.
This dashboard provides Kubernetes cluster health metrics, relying on cAdvisor exported metrics.
It gives insight into the resource usage of non-experiment workloads, so you can see the impact of supporting services (Prometheus, Grafana, Loki, etc.) alongside NimP2P experiments.
To guarantee a global message injection rate across an experiment (e.g., one message per second across all peers), I would introduce a dedicated Kubernetes Operator.
- The Operator would manage a CustomResourceDefinition (CRD) called `NimP2PExperiment`.
- Each CRD instance defines the experiment and contains a list of `experimentUnits` (representing individual NimP2P pods and their environment variables).
- Unlike a standard StatefulSet, the Operator can assign different environment variables to different pods, enabling fine-grained rate control.
For example, if the user requests a global message rate of 2 messages per second (MSGRATE=500ms), the Operator could distribute this load across peers:
- Pod 1: `MSGRATE=1000`
- Pod 2: `MSGRATE=1000`
- Remaining pods: `MSGRATE=0`
Three resources bound scalability: CPU, memory, and bandwidth (BW).
In the current design:
- Per-pod limits are set (CPU, RAM, UL/DL bandwidth).
- Monitoring workloads (Prometheus, Grafana, Loki, etc.) are pinned to the master node, leaving worker nodes fully dedicated to experiments.
The maximum number of nimp2p pods that can be created is estimated as:
MAX_PODS = min(TOTAL_RAM / PER_PEER_RAM, TOTAL_CORES / PER_PEER_CPU, TOTAL_DL / PER_PEER_DL, TOTAL_UL / PER_PEER_UL)
Example:
- 2 worker nodes, each with 16 cores and 32 GB RAM (total 32 cores, 64 GB RAM).
- 1 Gbps interconnect.
- Default per-pod config: CPU=0.05, RAM=16 MiB, UL/DL=16 Mbps.
MAX_PODS = min(64 GB / 16 MB, 32 / 0.05, 1 Gbps / 16 Mbps, 1 Gbps / 16 Mbps) = min(4000, 640, 62, 62) = 62
This shows that bandwidth is the real limiting factor here, not CPU or RAM.
- If pods run on the same node, or if inter-node links are upgraded to 10–100 Gbps, bandwidth constraints are lifted and scalability improves.
- Alternatively, a single high-performance server can be used as the sole worker node.
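The capacity estimate above can be reproduced with a short Go helper; the function name is illustrative, and units only need to be consistent within each pair (RAM here in MB, bandwidth in Mbps):

```go
package main

import (
	"fmt"
	"math"
)

// maxPods applies the MAX_PODS formula: the scarcest resource wins.
func maxPods(totalRAM, perRAM, totalCores, perCPU, totalDL, perDL, totalUL, perUL float64) int {
	m := math.Min(totalRAM/perRAM, totalCores/perCPU)
	m = math.Min(m, totalDL/perDL)
	m = math.Min(m, totalUL/perUL)
	return int(m) // floor: a fractional pod cannot be scheduled
}

func main() {
	// 2 workers: 64 GB RAM (64000 MB), 32 cores, 1 Gbps (1000 Mbps) links;
	// default per-pod config: 16 MB RAM, 0.05 CPU, 16 Mbps UL/DL.
	n := maxPods(64000, 16, 32, 0.05, 1000, 16, 1000, 16)
	fmt.Println(n) // 62 — bandwidth-bound, matching the estimate above
}
```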
GossipSub parameters such as the number of topics per peer, mesh degree, and message size strongly influence network load and performance:
- Mesh degree:
  - Higher degree → more active connections.
  - With TCP, this means more file descriptors and more memory to maintain connection state.
  - Linux systems impose limits on file descriptors and memory, so high degrees can lead to connection rejections.
- Message size:
  - Large messages combined with a high mesh degree consume significant bandwidth.
  - Publish bursts can cause queue buildup and transient delay spikes, even if total bandwidth is sufficient. With small queues, this may also cause packet drops.
Mitigation:
- Apply pacing techniques to smooth message bursts.
- Size queues at least 2× the Bandwidth-Delay Product (BDP) to absorb spikes.
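The queue-sizing rule is straightforward to compute. A minimal sketch, where the 50 ms RTT is an assumed value for illustration:

```go
package main

import "fmt"

// queueBytes returns the recommended queue size: 2x the Bandwidth-Delay
// Product, so a full burst can be absorbed while the previous one drains.
func queueBytes(bandwidthMbps, rttMs float64) int {
	bdpBits := bandwidthMbps * 1e6 * (rttMs / 1000.0) // bits in flight
	return int(2 * bdpBits / 8)                       // 2x BDP, in bytes
}

func main() {
	// A 16 Mbps pod link with an assumed 50 ms RTT:
	// BDP = 16e6 * 0.05 = 800,000 bits = 100 KB, so queue >= 200 KB.
	fmt.Println(queueBytes(16, 50)) // 200000
}
```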
When a NimP2P node appears to behave incorrectly (e.g., dropping messages or failing to connect), the root cause might not be the application itself.
It is important to first rule out network-related problems, which can occur at two levels: the host network or the Kubernetes pod network.
Host-level problems typically arise from the underlying infrastructure — for example, middleboxes, deep packet inspection (DPI), or incorrect routing tables.
To detect such problems, we can:
- Deploy the Blackbox Exporter as a DaemonSet (with `hostNetwork` set to true) to run ICMP/HTTP/HTTPS probes between hosts.
- Monitor the results:
  - Failed ICMP probes → hosts cannot reach each other at the network layer.
  - High TCP/UDP error counts (already visible in the Physical Cluster Metrics Dashboard) → may signal packet drops caused by firewalls, middleboxes, or tunneling issues.
Even if the hosts can communicate, the Kubernetes overlay (CNI) may introduce failures — e.g., encapsulation problems preventing pods from reaching each other.
To detect such issues, we can:
- Run a custom RTT exporter as a sidecar in each pod.
- Have pods periodically ping each other through the experiment’s headless service.
- If these pod-to-pod probes fail, while host-to-host are all OK, the problem is likely within the CNI or overlay configuration.
By monitoring both host-level and pod-level probes, we can determine whether an issue truly reflects misbehavior of a NimP2P node or is simply a networking issue.

Note: A small MTU on the hosts or inside the containers can also degrade performance. This can be detected by monitoring packet sizes. With large messages (e.g., 1440 B), IP packet sizes should be around 1500 B if the configured MTU is 1500 B. If packet sizes are significantly lower than that during such an experiment, a low MTU might be the root cause.
- Use Kube-OVN as the CNI in order to:
  - Provide better experiment isolation and multi-tenancy through VPCs and subnets.
  - Enable richer experiment scenarios (e.g., NAT, middleboxes, packet reordering) by inserting custom OpenFlow rules.
  - Better enforce bandwidth limits (more advanced than the default annotations).
  - Isolate monitoring traffic from experiment traffic (by using two Ethernet interfaces and two virtual subnets per pod: a monitoring subnet and an experiment subnet).
  - Enable a per-peer delay command-line argument, allowing users to specify the RTT between peers (I'll achieve this either by inserting OpenFlow rules or by using an InitContainer for each pod in an experiment; this InitContainer will use tc and netem to add extra delay before the main container starts).
- Add RTT probes:
  - Between hosts (via DaemonSets).
  - Between pods within the same experiment (via sidecars).
- Introduce an Experiment Operator:
  - Manage experiments declaratively via CRDs.






