README.md (+28 −3)
@@ -19,7 +19,7 @@ We hope that our work guides and inspires future real-to-sim evaluation efforts.
 - [Installation](#installation)
 - [Examples](#examples)
 - [Current Environments](#current-environments)
-- [Metrics for Assessing the Effectiveness of Simulated Evaluation Pipelines](#metrics-for-assessing-the-effectiveness-of-simulated-evaluation-pipelines)
+- [Compare Your Approach to SIMPLER](#compare-your-approach-to-simpler)
 - [Code Structure](#code-structure)
 - [Adding New Policies](#adding-new-policies)
 - [Adding New Real-to-Sim Evaluation Environments and Robots](#adding-new-real-to-sim-evaluation-environments-and-robots)
@@ -127,9 +127,34 @@ We also support creating sub-tasks variations such as `google_robot_pick_{horizo
 
 By default, Google Robot environments use a control frequency of 3 Hz, and Bridge environments use a control frequency of 5 Hz. Simulation frequency is ~500 Hz.
 
-## Metrics for Assessing the Effectiveness of Simulated Evaluation Pipelines
-
-In our paper, we use the Mean Maximum Rank Violation (MMRV) metric and the Pearson Correlation Coefficient metric to assess the correlation between real and simulated evaluation results. You can reproduce the metrics in `tools/calc_metrics.py` and assess your own real-to-sim evaluation pipeline.
+## Compare Your Approach to SIMPLER
+
+We make it easy to compare your approach for offline robot policy evaluation to SIMPLER. In [our paper](https://simpler-env.github.io/) we use two metrics to measure the quality of simulated evaluation pipelines: Mean Maximum Rank Violation (MMRV) and the Pearson Correlation Coefficient. Both capture how well the offline evaluations reflect the policy's real-world performance during robot rollouts.
+
+To make comparisons easy, we provide all of our raw policy evaluation data: performance values for all policies on all real-world tasks. We also provide SIMPLER's corresponding policy performance estimates, along with functions for computing the metrics we report in the paper.
+
+To compute these metrics for *your* offline policy evaluation approach `your_sim_eval(task, policy)`, you can use the following snippet:
+
+```python
+from simpler_env.utils.metrics import mean_maximum_rank_violation, pearson_correlation
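
A minimal end-to-end sketch of such a comparison, assuming `mean_maximum_rank_violation` and `pearson_correlation` each take paired arrays of real and simulated success rates (argument order assumed: real first); the task name, policy list, and real-world numbers below are placeholders, and `your_sim_eval` is the evaluation function you supply:

```python
import numpy as np

from simpler_env.utils.metrics import mean_maximum_rank_violation, pearson_correlation

# Placeholder task and policy names -- substitute the ones you evaluate.
task = "google_robot_move_near"
policies = ["policy_a", "policy_b", "policy_c"]

# Success rates estimated by *your* offline evaluation pipeline.
sim_perf = np.array([your_sim_eval(task=task, policy=p) for p in policies])

# Real-world success rates for the same (task, policy) pairs, e.g. looked up
# in the raw evaluation data released with the repo (placeholder values here).
real_perf = np.array([0.75, 0.60, 0.30])

# Assumed argument order: real results first, then simulated estimates.
print("MMRV:", mean_maximum_rank_violation(real_perf, sim_perf))  # lower is better
print("Pearson r:", pearson_correlation(real_perf, sim_perf))     # higher is better
```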
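For intuition about what MMRV measures, here is an illustrative from-scratch version based on our reading of the paper's definition, not the repo's `mean_maximum_rank_violation`: for every pair of policies, a rank violation counts the real-world performance gap whenever simulation orders the two policies differently than reality does, and MMRV averages each policy's worst violation.

```python
import numpy as np

def mmrv_sketch(real: np.ndarray, sim: np.ndarray) -> float:
    """Mean Maximum Rank Violation (illustrative reimplementation, assumed definition).

    real, sim: success rates of the same policies, measured in the real world
    and estimated by the simulated evaluation pipeline, respectively.
    """
    n = len(real)
    worst = np.zeros(n)
    for i in range(n):
        for j in range(n):
            # Violation: sim and real disagree on which of the two policies is better.
            if (sim[i] < sim[j]) != (real[i] < real[j]):
                worst[i] = max(worst[i], abs(real[i] - real[j]))
    # Average each policy's worst-case violation.
    return float(worst.mean())
```

Under this reading, an MMRV near 0 means the simulated evaluation preserves real-world policy rankings; the Pearson correlation complements it by measuring agreement in the performance values themselves.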