The original OpenPI dataset trains and evaluates models to predict entity states throughout a procedure.
The goal of OpenPI2.0 is to augment the original dataset in the following aspects:
- Canonicalization: cluster entities and attributes with the same meaning.
- Salience: add automatic and manual labels of entity salience.

With these features, OpenPI2.0 facilitates work on entity state tracking. It leads to fairer evaluation (by reducing false negatives during prediction) and better downstream performance (by allowing filtering by entity salience).
In this repo, we provide two resources:
- An API that takes in procedure texts and outputs entities, attributes, states, clusters, cluster expansions, local salience, and global salience.
- A dataset for development, evaluation, and tuning of models to predict the above.
To use the API:

- `cd api/`
- Format your input as in `trial.json`.
- Set your OpenAI API key to the path in the `openai.api_key =` line in `predict_all.py`, or change that line to your desired path.
- Run `openpi_api.py --input INPUT_PATH --output OUTPUT_PATH`
Input: a list of procedures, each consisting of a goal and a list of steps.

Output:
- The predicted schema (entities and corresponding attributes) of each step
- The predicted states based on the schema above of each step
- The predicted global entity salience of the entire procedure
- The predicted local entity salience of each step
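The authoritative input schema is whatever `api/trial.json` contains; as a rough sketch, an input file with one procedure (field names here are illustrative, not guaranteed to match the real schema) could be built like this:

```python
import json

# Hypothetical input: a list of procedures, each with a goal and a list of
# steps. Check api/trial.json for the actual field names and structure.
procedures = [
    {
        "goal": "make a cup of tea",
        "steps": [
            "boil water in a kettle",
            "pour the water over a tea bag in a cup",
            "let the tea steep for three minutes",
        ],
    }
]

with open("my_input.json", "w") as f:
    json.dump(procedures, f, indent=2)
```

You would then pass this file as `--input my_input.json` to `openpi_api.py`.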
The all-in-one OpenPI2.0 dev data file, with entity and attribute clusters, is `data/dev-data-reformatted-v4.json`. Files for the train and test sets are coming soon.
To create this data, we start with `data/dev-ranked.json`, the original OpenPI data, and perform canonicalization. See the README in `source/cluster` for more details.
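As a toy illustration of what canonicalization accomplishes (the actual pipeline in `source/cluster` is far more involved than this string normalization), mentions with the same meaning are grouped into one cluster:

```python
from collections import defaultdict

def normalize(entity: str) -> str:
    """Toy canonicalization key: lowercase and strip a plural 's'.
    The real pipeline in source/cluster uses much richer signals."""
    e = entity.lower().strip()
    return e[:-1] if e.endswith("s") and len(e) > 3 else e

def cluster(entities):
    """Group surface forms that share a canonical key."""
    clusters = defaultdict(list)
    for e in entities:
        clusters[normalize(e)].append(e)
    return list(clusters.values())

print(cluster(["Kettle", "kettles", "tea bag", "Tea Bag"]))
# → [['Kettle', 'kettles'], ['tea bag', 'Tea Bag']]
```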
The canonicalized OpenPI2.0 can evaluate entity state tracking more fairly. To get model predictions:
- Running `source/predict_schema.py --model MODEL --prompt 1|2` and `predict_states.py --model MODEL` produces predictions for the schemata subtask and the states subtask. The output is, for example, `data/dev_schema_chatgpt_1.json`. Prompt type 1 corresponds to predicting entities and attributes individually, while prompt type 2 corresponds to the combined prediction of an entire sentence ("attribute of entity was pre-state before and post-state after"), just like the original OpenPI evaluation.
- Running `source/evaluate_schema.py --model MODEL [--og]`, or similarly `evaluate_states.py` and `evaluate_combined.py`, performs evaluation of the above settings. `--og` specifies using the over-generated and expanded clusters for a fairer exact-match evaluation.
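The idea behind `--og` is that exact match becomes fairer when a prediction counts as correct if it hits *any* member of a gold entity's expanded cluster, not just one canonical string. A minimal sketch of that relaxation (function and data shapes are illustrative, not the repo's actual evaluation code):

```python
def cluster_f1(predicted, gold_clusters):
    """predicted: a set of predicted surface strings.
    gold_clusters: a list of sets, each the over-generated
    expansion of one canonical gold entity."""
    matched = set()  # gold clusters already credited
    tp = 0
    for p in predicted:
        for i, cluster in enumerate(gold_clusters):
            if i not in matched and p.lower() in cluster:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold_clusters) if gold_clusters else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [{"kettle", "tea kettle", "pot"}, {"water", "hot water"}]
print(cluster_f1({"tea kettle", "cup"}, gold))  # → (0.5, 0.5, 0.5)
```

Here "tea kettle" would be a false negative under strict exact match against the canonical form "kettle", but is credited once the cluster expansion is used.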
We provide both human-annotated and LLM-predicted entity salience labels.
For human-annotated labels:
- `data/dev-data-reformatted-v4_votes_salience_1-20.json` contains human annotations by human A
- `data/dev-data-reformatted-v4_votes_salience_1-20_human2.json` contains human annotations by human B
For LLM-predicted labels:
- `data/dev-data-reformatted-v4_pred-salience.json` contains LLM-predicted salience scores
- This file is produced by running `source/predict_salience.py --model MODEL`
To evaluate the salience labels by correlation:
- Running `source/evaluate_salience.py` calculates correlation among the above scores.
- Running `source/plot_correlation.py` plots a bar chart of correlations for the first 20 procedures in the development set.
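Correlating two salience rankings is typically done with a rank correlation such as Spearman's rho (whether `source/evaluate_salience.py` uses Spearman or another coefficient, check the script itself). A self-contained sketch, with made-up scores for one procedure's entities:

```python
def rank(xs):
    """Assign average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied rank positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    ra, rb = rank(a), rank(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

human = [5, 3, 4, 1, 2]  # hypothetical human salience scores
llm = [4, 3, 5, 2, 1]    # hypothetical LLM-predicted scores
print(spearman(human, llm))  # → 0.8
```

In practice `scipy.stats.spearmanr` does the same computation (plus a p-value); the pure-Python version above just makes the definition explicit.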
If you find our work helpful, please cite:
@inproceedings{zhang-etal-2024-openpi2,
title = "{O}pen{PI}2.0: An Improved Dataset for Entity Tracking in Texts",
author = "Zhang, Li and
Xu, Hainiu and
Kommula, Abhinav and
Callison-Burch, Chris and
Tandon, Niket",
editor = "Graham, Yvette and
Purver, Matthew",
booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = mar,
year = "2024",
address = "St. Julian{'}s, Malta",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.eacl-long.10",
pages = "166--178",
abstract = "Much texts describe a changing world (e.g., procedures, stories, newswires), and understanding them requires tracking how entities change. An earlier dataset, OpenPI, provided crowdsourced annotations of entity state changes in text. However, a major limitation was that those annotations were free-form and did not identify salient changes, hampering model evaluation. To overcome these limitations, we present an improved dataset, OpenPI2.0, where entities and attributes are fully canonicalized and additional entity salience annotations are added. On our fairer evaluation setting, we find that current state-of-the-art language models are far from competent. We also show that using state changes of salient entities as a chain-of-thought prompt, downstream performance is improved on tasks such as question answering and classical planning, outperforming the setting involving all related entities indiscriminately. We offer OpenPI2.0 for the continued development of models that can understand the dynamics of entities in text.",
}