First of all, thank you for your great work!
While reproducing the results from the paper, I obtained noticeably lower performance than reported.
I would like to check whether I missed anything in my setup.
I conducted experiments on AdvHotPotQA and FEVER using GPT-3.5-turbo, and the results are as follows:
| Model | Dataset | Paper (%) | Reproduced (%) |
|---|---|---|---|
| GPT-3.5-turbo | AdvHotPotQA | 42.9 | 36.36 |
| GPT-3.5-turbo | FEVER | 63.1 | 60.5 |

(The paper reports percentages; my raw scores of 0.3636 and 0.605 are converted to percentages for comparison.)
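In case the gap comes from scoring rather than the pipeline itself, this is roughly how I compute accuracy from the prediction file (a minimal sketch; the JSONL layout and the `prediction`/`ground_truth` field names come from my own post-processing, not from the repo's output format):

```python
import json
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (standard EM normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match_accuracy(path: str) -> float:
    """Fraction of examples whose normalized prediction equals the normalized gold answer."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            total += 1
            correct += normalize(record["prediction"]) == normalize(record["ground_truth"])
    return correct / total

print(exact_match_accuracy("tog2_hotpot_e_results.jsonl"))  # hypothetical output file name
```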
Below are the scripts I used for my experiments:
- Wikidata DB Script
```bash
$ wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
$ python preprocess_dump.py --input_file ./latest-all.json.gz --out_dir ./processed_wiki
$ python build_index.py --input_dir ./processed_wiki --output_dir ./processed_wiki/indices --num_chunks 16
$ python server.py --data_dir ./processed_wiki --chunk_number 0 --host_ip <server IP> --port 23546
$ python server.py --data_dir ./processed_wiki --chunk_number 1 --host_ip <server IP> --port 23547
# … (chunks 2 through 14 on ports 23548–23560)
$ python server.py --data_dir ./processed_wiki --chunk_number 15 --host_ip <server IP> --port 23561
```
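Since the 16 server commands differ only in the chunk number and port, I start them with a small wrapper like the one below (my own convenience script, not from the repo; it assumes the consecutive port layout above, and `<server IP>` must be replaced just as in the commands):

```python
import subprocess

HOST = "<server IP>"  # placeholder, same as in the commands above
BASE_PORT = 23546     # chunk 0 -> 23546, ..., chunk 15 -> 23561

# Launch one server.py process per index chunk.
procs = [
    subprocess.Popen([
        "python", "server.py",
        "--data_dir", "./processed_wiki",
        "--chunk_number", str(chunk),
        "--host_ip", HOST,
        "--port", str(BASE_PORT + chunk),
    ])
    for chunk in range(16)
]

# Keep the wrapper alive while the servers run.
for p in procs:
    p.wait()
```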
- ToG 2.0 Execution Script
```bash
$ python main_tog2.py \
    --dataset hotpot_e \
    --max_length 256 \
    --temperature_exploration 0 \
    --temperature_reasoning 0 \
    --width 3 \
    --depth 3 \
    --remove_unnecessary_rel True \
    --LLM_type_rp gpt-3.5-turbo-16k \
    --LLM_type gpt-3.5-turbo \
    --opeani_api_keys <openai_api_key> \
    --embedding_model_name bge-bi \
    --relation_prune_combination True \
    --num_sents_for_reasoning 10 \
    --topic_prune True \
    --self_consistency_threshold 0.8 \
    --clue_query True
```
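Before running main_tog2.py, I also check that all 16 chunk servers are reachable with a quick TCP probe (my own sanity check, again assuming the consecutive port layout above; it only verifies that the ports accept connections, not that each index loaded correctly):

```python
import socket

HOST = "<server IP>"  # replace with the actual server IP, as above
BASE_PORT = 23546     # chunk 0 -> 23546, ..., chunk 15 -> 23561

for chunk in range(16):
    port = BASE_PORT + chunk
    try:
        # Open and immediately close a TCP connection to the chunk server.
        with socket.create_connection((HOST, port), timeout=5):
            print(f"chunk {chunk} (port {port}): reachable")
    except OSError as e:
        print(f"chunk {chunk} (port {port}): UNREACHABLE ({e})")
```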
I would appreciate any insights on what might be causing this performance discrepancy.
Please let me know if there are any additional steps or configurations I should check.