13 changes: 13 additions & 0 deletions README.md
@@ -248,7 +248,20 @@ Therefore, you should do these actions in the following scenarios

- **Signed up with `email address`, generated `swarm.pem`, BUT lost `swarm.pem`** OR **You want to run multiple nodes at once**: run from scratch with the same email address and generate a new `swarm.pem`.
- **Signed up with `email address`, generated `swarm.pem`, kept `swarm.pem`** -> you can re-run a single node using this pair if you've still got them both.

## Monitoring metrics

RL Swarm exposes basic metrics (CPU, GPU and job status) that can be scraped and visualized by external tools.

Typical ways to monitor your node include:

* Exporting metrics to Prometheus and building dashboards in Grafana.
* Shipping logs (stdout/stderr) to a centralized log store (e.g. Loki, Elasticsearch, or another log aggregation system).
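
As a rough illustration of the Prometheus route, the sketch below renders a few gauge values in the Prometheus text exposition format. It does not describe how RL Swarm itself exposes metrics: the helper `render_prometheus` and the metric name `rl_swarm_gpu_utilization_ratio` are hypothetical and chosen only for this example.

```python
def render_prometheus(metrics):
    """Render {name: (help_text, value)} as gauges in Prometheus text format.

    Metric names passed in are up to the caller; the ones used in this
    README example are hypothetical, not names emitted by RL Swarm.
    """
    lines = []
    for name, (help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    # The exposition format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = {
        # Hypothetical metric name and value, for illustration only.
        "rl_swarm_gpu_utilization_ratio": ("GPU utilization (0-1).", 0.5),
    }
    print(render_prometheus(sample), end="")
```

Text in this format can be served over HTTP or written to a file for a collector to pick up; which option fits depends on your own setup.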

We recommend keeping an eye on:

* Training job failures per hour.
* Average GPU utilization.
* Disk usage in the directory storing checkpoints.
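
Two of the checks above can be approximated with the Python standard library alone. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the failure timestamps are assumed to come from your own log parsing, and the checkpoint path is whatever directory your node writes to. GPU utilization is left out because it depends on vendor tooling.

```python
import shutil
from datetime import datetime, timedelta


def disk_used_fraction(path):
    """Return the used fraction (0-1) of the filesystem that contains `path`.

    Note: this measures the whole filesystem, not the size of the
    directory itself.
    """
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def failures_in_last_hour(failure_times, now):
    """Count failure timestamps that fall within the hour ending at `now`."""
    cutoff = now - timedelta(hours=1)
    return sum(1 for t in failure_times if cutoff < t <= now)


if __name__ == "__main__":
    # "." stands in for the checkpoint directory (an assumption for this example).
    print(f"disk used: {disk_used_fraction('.'):.1%}")
    now = datetime.now()
    # Hypothetical failure timestamps; in practice, parse these from your logs.
    failures = [now - timedelta(minutes=5), now - timedelta(hours=2)]
    print(f"failures in the last hour: {failures_in_last_hour(failures, now)}")
```

Running a check like this on a schedule and alerting on a threshold you choose is usually enough to catch a disk filling up with checkpoints.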

## Troubleshooting

- **How do I find my logs?** You can find them inside the `/logs` directory: