This repository contains the implementation of the approach discussed in Event causality identification with synthetic control.
The paper was presented at EMNLP 2024.
- Create a `.env` file in the root of the directory with the `OPENAI_API_KEY` environment variable.
OPENAI_API_KEY="<your openai key>"
- Optionally, add Langfuse API keys to `.env` to enable tracing for OpenAI calls.
LANGFUSE_SECRET_KEY="<langfuse secret key>"
LANGFUSE_PUBLIC_KEY="<langfuse public key>"
LANGFUSE_HOST="<langfuse host>"
- Download the COPES dataset to `data/COPES.json`.
curl -o data/COPES.json -LJ https://github.com/HKUST-KnowComp/COLA/raw/refs/heads/master/COPES_data/COPES.json
- Download the TinyStories dataset to `data/TinyStoriesV2-GPT4-train.txt`.
curl -o data/TinyStoriesV2-GPT4-train.txt -L https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
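Before moving on, the configuration and download steps above can be sanity-checked with a short script. The following is a minimal sketch and not part of the repository: the helper names `parse_env` and `missing_prerequisites` are hypothetical, and the `.env` parser only handles plain `KEY="value"` lines like the ones shown above.

```python
from pathlib import Path

# File locations and key name taken from the steps above.
REQUIRED_FILES = ["data/COPES.json", "data/TinyStoriesV2-GPT4-train.txt"]
REQUIRED_KEYS = ["OPENAI_API_KEY"]


def parse_env(text):
    """Parse simple KEY="value" lines; skips blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_prerequisites(root="."):
    """Return a list of human-readable problems; an empty list means ready."""
    root = Path(root)
    problems = []
    env_path = root / ".env"
    env = parse_env(env_path.read_text()) if env_path.is_file() else {}
    for key in REQUIRED_KEYS:
        if not env.get(key):
            problems.append(f"{key} is not set in .env")
    for rel in REQUIRED_FILES:
        if not (root / rel).is_file():
            problems.append(f"{rel} is missing")
    return problems


if __name__ == "__main__":
    # Run from the repository root; prints nothing when everything is in place.
    for problem in missing_prerequisites():
        print(problem)
```

The Langfuse keys are left out of `REQUIRED_KEYS` because they are optional.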
`conda` needs to be installed.
- Create the virtual environment using `conda env create -f environment.yml`.
- Convert `TinyStoriesV2-GPT4-train.txt` to parquet by running `python main.py setup-tiny-stories-parquet`.
- Create the BM25 index by running `python main.py setup-tiny-stories-corpus`.
Note that all indices are 0-based.
Strategy | Description |
---|---|
gpt4 | (Baseline) GPT4 Zeroshot Inference |
sc | (Synthetic Control) GPT3.5 Synthetic Control |
sc4 | (Synthetic Control) GPT4 Synthetic Control |
Run outputs are logged in `output/<strategy>/`.
All `<test_case_id>` values are IDs from COPES.
python main.py run-testcase-event <test_case_id> <event_id> <strategy>
e.g. python main.py run-testcase-event 0 0 sc
python main.py run-one <test_case_id> <strategy>
python main.py run_from_list <path_to_json> <strategy>
`path_to_json` must be a file containing a single JSON array of indexes (e.g. `[1,2,3,4]`).
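The index file can be written by hand, or generated and checked with a few lines of Python. This is a minimal sketch: the helper names and the `testcases.json` filename in the usage note are hypothetical, and the validation only mirrors the format described above (one JSON array of 0-based integer indexes).

```python
import json
from pathlib import Path


def write_index_list(path, indexes):
    """Write test case indexes as a single JSON array, e.g. [1, 2, 3, 4]."""
    indexes = list(indexes)
    for i in indexes:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(i, bool) or not isinstance(i, int) or i < 0:
            raise ValueError("indexes must be non-negative integers")
    Path(path).write_text(json.dumps(indexes))


def read_index_list(path):
    """Load the file and confirm it holds one JSON array of integers."""
    data = json.loads(Path(path).read_text())
    if not isinstance(data, list) or not all(
        isinstance(i, int) and not isinstance(i, bool) for i in data
    ):
        raise ValueError(f"{path} must contain a single JSON array of integers")
    return data
```

For example, `write_index_list("testcases.json", [1, 2, 3, 4])` produces a file that can then be passed as `<path_to_json>`.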
python main.py print-testcases <path_to_json>
- Deadlocks have been observed to occasionally occur within DuckDB (or the Python DuckDB driver), causing corpus retrieval to fail.
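One possible workaround, which is not part of the repository, is to launch each run as a subprocess with a timeout and retry it if it hangs. The sketch below assumes a hung run can simply be killed and restarted. The `run_with_retry` helper and the default timeout and attempt counts are hypothetical and should be tuned to how long a normal run takes.

```python
import subprocess
import sys


def run_with_retry(cmd, timeout_s=600, attempts=3):
    """Run cmd, killing and retrying it if it exceeds timeout_s seconds.

    Returns the CompletedProcess of the first attempt that finishes in time,
    or raises RuntimeError if every attempt times out.
    """
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child process when the timeout expires.
            return subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt} timed out after {timeout_s}s", file=sys.stderr)
    raise RuntimeError(f"command did not finish in {attempts} attempts: {cmd}")
```

A call such as `run_with_retry([sys.executable, "main.py", "run-one", "0", "sc"])` would wrap a single test case. Note that each retry repeats any OpenAI calls the run had already made.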