Performance Discrepancy in Reproducing Results for AdvHotPotQA and FEVER #1

@hipros

First of all, thank you for your great work!

While reproducing the results from the paper, I noticed that the performance I obtained was lower than the reported results.
I would like to check if I might have missed anything in my setup.

I conducted experiments on AdvHotPotQA and FEVER using GPT-3.5-turbo, and the results are as follows:

Model          Dataset      Paper Performance  Reproduced Performance
GPT-3.5-turbo  AdvHotPotQA  42.9               0.3636
GPT-3.5-turbo  FEVER        63.1               0.605
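
The two columns appear to be on different scales, so the size of the gap is easier to read once both are expressed in percentage points. The following is a minimal sketch, assuming the paper figures are percentages and the reproduced figures are fractions in [0, 1] (the scales are not labeled above, so this is an assumption); the names `results` and `gap_in_points` are illustrative only.

```python
# Scores copied from the table above.
# Assumption: "paper" values are percentages, "reproduced" values are fractions.
results = {
    "AdvHotPotQA": {"paper_pct": 42.9, "reproduced_frac": 0.3636},
    "FEVER": {"paper_pct": 63.1, "reproduced_frac": 0.605},
}


def gap_in_points(paper_pct, reproduced_frac):
    """Return (paper - reproduced) in percentage points, rounded to 2 decimals."""
    return round(paper_pct - reproduced_frac * 100, 2)


# Gap per dataset, in percentage points.
gaps = {name: gap_in_points(**row) for name, row in results.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} points below the paper")
```

Under that assumption the shortfall is about 6.5 points on AdvHotPotQA and about 2.6 points on FEVER.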

Below are the scripts I used for my experiment:

  • Wikidata DB Script
$ wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
$ python preprocess_dump.py --input_file ./latest-all.json.gz --out_dir ./processed_wiki
$ python build_index.py --input_dir ./processed_wiki --output_dir ./processed_wiki/indices --num_chunks 16

$ python server.py --data_dir ./processed_wiki --chunk_number 0 --host_ip <server IP> --port 23546
$ python server.py --data_dir ./processed_wiki --chunk_number 1 --host_ip <server IP> --port 23547
…
$ python server.py --data_dir ./processed_wiki --chunk_number 15 --host_ip <server IP> --port 23562
  • ToG 2.0 Execution Script
$ python main_tog2.py \
--dataset hotpot_e \
--max_length 256 \
--temperature_exploration 0 \
--temperature_reasoning 0 \
--width 3 \
--depth 3 \
--remove_unnecessary_rel True \
--LLM_type_rp gpt-3.5-turbo-16k \
--LLM_type gpt-3.5-turbo \
--opeani_api_keys <openai_api_key> \
--embedding_model_name bge-bi \
--relation_prune_combination True \
--num_sents_for_reasoning 10 \
--topic_prune True \
--self_consistency_threshold 0.8 \
--clue_query True
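
For reference, the sixteen `server.py` launches in the Wikidata DB script above could be generated rather than typed by hand. This is only a sketch: it assumes the ports increase by one per chunk starting from 23546, and `server_command`, `BASE_PORT`, and `NUM_CHUNKS` are hypothetical names, not part of the repository. Under that assumption chunk 15 lands on port 23561, while the command listed above uses 23562, so the port mapping may be worth double-checking against whatever the client expects.

```python
BASE_PORT = 23546  # port used for chunk 0 in the commands above
NUM_CHUNKS = 16    # matches --num_chunks 16 passed to build_index.py


def server_command(chunk, host_ip="<server IP>", data_dir="./processed_wiki"):
    """Build the server.py command line for one chunk.

    Assumption: ports are consecutive, i.e. port = BASE_PORT + chunk.
    """
    port = BASE_PORT + chunk
    return (
        f"python server.py --data_dir {data_dir} "
        f"--chunk_number {chunk} --host_ip {host_ip} --port {port}"
    )


# One command per index chunk, in chunk order.
commands = [server_command(chunk) for chunk in range(NUM_CHUNKS)]

for cmd in commands:
    print(cmd)
```

The script only prints the commands; the placeholder `<server IP>` is left as-is and still has to be replaced before running them.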

I would appreciate any insights on what might be causing this performance discrepancy.
Please let me know if there are any additional steps or configurations I should check.
