
Commit 69bca18

Update README.md
1 parent 8008975 commit 69bca18


README.md

Lines changed: 28 additions & 3 deletions
@@ -19,7 +19,7 @@ We hope that our work guides and inspires future real-to-sim evaluation efforts.
- [Installation](#installation)
- [Examples](#examples)
- [Current Environments](#current-environments)
-- [Metrics for Assessing the Effectiveness of Simulated Evaluation Pipelines](#metrics-for-assessing-the-effectiveness-of-simulated-evaluation-pipelines)
+- [Compare Your Approach to SIMPLER](#compare-your-approach-to-simpler)
- [Code Structure](#code-structure)
- [Adding New Policies](#adding-new-policies)
- [Adding New Real-to-Sim Evaluation Environments and Robots](#adding-new-real-to-sim-evaluation-environments-and-robots)
@@ -127,9 +127,34 @@ We also support creating sub-tasks variations such as `google_robot_pick_{horizo

By default, Google Robot environments use a control frequency of 3 Hz, and Bridge environments use a control frequency of 5 Hz. Simulation frequency is ~500 Hz.

-## Metrics for Assessing the Effectiveness of Simulated Evaluation Pipelines

-In our paper, we use the Mean Maximum Rank Violation (MMRV) metric and the Pearson Correlation Coefficient metric to assess the correlation between real and simulated evaluation results. You can reproduce the metrics in `tools/calc_metrics.py` and assess your own real-to-sim evaluation pipeline.
+## Compare Your Approach to SIMPLER
+
+We make it easy to compare your approach for offline robot policy evaluation to SIMPLER. In [our paper](https://simpler-env.github.io/) we use two metrics to measure the quality of simulated evaluation pipelines: the Mean Maximum Rank Violation (MMRV) and the Pearson Correlation Coefficient.
+Both capture how well the offline evaluations reflect the policy's real-world performance during robot rollouts.
+
+To make comparisons easy, we provide all our raw policy evaluation data: performance values for all policies on all real-world tasks. We also provide the corresponding SIMPLER policy performance estimates, along with functions for computing the metrics we report in the paper.
+
+To compute the corresponding metrics for *your* offline policy evaluation approach `your_sim_eval(task, policy)`, you can use the following snippet:
+```python
+from simpler_env.utils.metrics import mean_maximum_rank_violation, pearson_correlation
+from tools.calc_metrics import REAL_PERF
+
+sim_eval_perf = [
+    your_sim_eval(task="google_robot_move_near", policy=p)
+    for p in ["rt-1-x", "octo", ...]
+]
+real_eval_perf = [
+    REAL_PERF["google_robot_move_near"][p] for p in ["rt-1-x", "octo", ...]
+]
+mmrv = mean_maximum_rank_violation(real_eval_perf, sim_eval_perf)
+pearson = pearson_correlation(real_eval_perf, sim_eval_perf)
+```
+
+To reproduce the key numbers from our paper for SIMPLER, you can run the [`tools/calc_metrics.py`](tools/calc_metrics.py) script:
+```bash
+python3 tools/calc_metrics.py
+```

## Code Structure

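The new README section names MMRV and the Pearson correlation but does not define MMRV on the page. Below is a minimal sketch, assuming the paper's definition: a pair of policies counts as a rank violation when the simulated ranking disagrees with the real one, the violation's size is the real-world performance gap, and MMRV averages each policy's worst violation. The function names are illustrative only, not the API exported by `simpler_env.utils.metrics`.

```python
# Illustrative sketch of the two metrics; not the repository's implementation.
import numpy as np


def mmrv_sketch(real_perf, sim_perf):
    """Mean Maximum Rank Violation: the average, over policies, of the worst
    rank violation that policy is involved in. A pair (i, j) counts as a
    violation when sim ranks the two policies differently than real; its
    magnitude is the real-world performance gap |real_i - real_j|."""
    real = np.asarray(real_perf, dtype=float)
    sim = np.asarray(sim_perf, dtype=float)
    n = len(real)
    worst = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            disagree = (real[i] < real[j]) != (sim[i] < sim[j])
            if disagree:
                worst[i] = max(worst[i], abs(real[i] - real[j]))
    return worst.mean()


def pearson_sketch(real_perf, sim_perf):
    """Pearson correlation between real and simulated success rates."""
    return np.corrcoef(real_perf, sim_perf)[0, 1]


# Toy example: the sim preserves the real ranking, so MMRV is 0 and the
# correlation is high.
real = [0.75, 0.55, 0.30]
sim = [0.60, 0.45, 0.20]
print(mmrv_sketch(real, sim), pearson_sketch(real, sim))
```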
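In the snippet from the diff above, `your_sim_eval(task, policy)` is a placeholder: it only needs to return a scalar performance estimate per (task, policy) pair. A hypothetical stub, assuming you have already logged your own offline results (the numbers below are made up for illustration):

```python
# Hypothetical results from *your* offline evaluation pipeline (illustrative numbers only).
MY_OFFLINE_PERF = {
    "google_robot_move_near": {"rt-1-x": 0.52, "octo": 0.31},
}


def your_sim_eval(task: str, policy: str) -> float:
    """Return the success-rate estimate your pipeline produced for this task/policy."""
    return MY_OFFLINE_PERF[task][policy]
```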