Code and Data for CodeTaxo
export OPENAI_API_KEY=your_openai_api_key
export REPLICATE_API_TOKEN=your_replicate_api_token
pip install -r requirements.txt

All datasets used in the CodeTaxo paper can be found in the ./data folder. Under ./data/<datasetName> (where <datasetName> is one of ['wordnet', 'graphine', 'semeval-sci', 'semeval-env', 'semeval-food']), you will typically find the following files:
- test_taxonomy_expansion.json: The original taxonomy expansion dataset.
- test_definition.json: A dictionary containing entity_name: definition pairs.
- SimCSE_sampled_test_taxonomy_expansion_50p.json: The taxonomy expansion dataset after filtering out 50 percent of entities using the Semantic Similarity Filter.
- SimCSE_sampled_test_taxonomy_expansion_topN.json: The taxonomy expansion dataset retaining the top-N entities using the Semantic Similarity Filter.
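As a quick way to inspect the definitions, the sketch below loads test_definition.json for one dataset. It relies only on the layout described above (a JSON dictionary of entity_name: definition pairs under ./data/<datasetName>); the helper name load_definitions and its data_root parameter are hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def load_definitions(dataset, data_root="./data"):
    """Load the entity_name -> definition dictionary for one dataset.

    Hypothetical helper: assumes the file lives at
    <data_root>/<dataset>/test_definition.json, as described in this README.
    """
    path = Path(data_root) / dataset / "test_definition.json"
    with path.open(encoding="utf-8") as f:
        definitions = json.load(f)
    # The README describes the file as a dictionary; fail early otherwise.
    if not isinstance(definitions, dict):
        raise ValueError(f"Expected a JSON object in {path}")
    return definitions


if __name__ == "__main__":
    try:
        defs = load_definitions("wordnet")
        print(f"Loaded {len(defs)} definitions")
    except FileNotFoundError as err:
        print(f"Dataset file not found: {err}")
```

The other JSON files are not loaded here because their internal structure is not documented in this README.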
The model outputs under different settings can be found in the ./outputs folder. Each subfolder corresponds to a specific experiment configuration.
To get the evaluation results based on the existing model outputs:
python src/eval.py --dataset $dataset --model $model --prompt_template_name $prompt_template_name --num_demos $num_demos --sampling --select_demo

- --dataset: Dataset name, [wordnet, graphine, semeval-sci, semeval-env, semeval-food] (default: wordnet)
- --model: Model name, e.g., [gpt-4o, gpt-4o-mini, gpt-3.5-turbo, meta-llama-3-70b-instruct, codellama-34b-instruct, codellama-70b-instruct] (default: gpt-4o)
- --prompt_template_name: Use CodeTaxo prompt or Natural Language prompt, [codetaxo, NL] (default: codetaxo)
- --num_demos: Number of demos (default: 1)
- --no_definition: Exclude the entity definition in the prompt (default: False)
- --sampling: Use SimCSE sampling (default: False)
- --scale_factor: Scale factor for SimCSE sampling, keep the top scale_factor * 100% entities, used when --percent is set (default: 0.5)
- --percent: Keep scale_factor of the entities in the taxonomy (default: False)
- --topk: Number of similar entities to sample, used when --percent is not set (default: 100)
- --gen_explaination: Generate explanation for the taxonomy expansion (default: False)
- --select_demo: Use demo selection (default: False)
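The interaction between --percent, --scale_factor, and --topk can be read as follows: with --percent set, the top scale_factor * 100% of entities are kept; otherwise the top --topk entities are kept. The sketch below only illustrates that reading. The function name num_entities_kept is hypothetical, and both the rounding (truncation) and the cap at the taxonomy size are assumptions, not behavior confirmed by the repository.

```python
def num_entities_kept(num_entities, percent=False, scale_factor=0.5, topk=100):
    """Illustrative count of entities retained by SimCSE sampling.

    Hypothetical helper mirroring the option descriptions in this README.
    """
    if percent:
        # --percent set: keep the top scale_factor * 100% of entities.
        # Assumption: the fractional count is truncated toward zero.
        kept = int(num_entities * scale_factor)
    else:
        # --percent not set: keep the top-k most similar entities.
        kept = topk
    # Assumption: never keep more entities than the taxonomy contains.
    return min(kept, num_entities)


if __name__ == "__main__":
    print(num_entities_kept(1000, percent=True))   # defaults: half of 1000
    print(num_entities_kept(1000, percent=False))  # defaults: top 100
```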
python src/main.py --dataset $dataset --model $model --prompt_template_name $prompt_template_name --num_demos $num_demos --sampling --select_demo

src/main.py accepts the same options as src/eval.py, listed above.
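Because both scripts take the same flags, a small driver can assemble the command line for each dataset. The sketch below is one possible way to do that. The names build_command and run_all are hypothetical, only flags documented in this README are used, and by default the commands are printed rather than executed.

```python
import subprocess
import sys

# Dataset names as listed in this README.
DATASETS = ["wordnet", "graphine", "semeval-sci", "semeval-env", "semeval-food"]


def build_command(script, dataset="wordnet", model="gpt-4o",
                  prompt_template_name="codetaxo", num_demos=1,
                  sampling=False, select_demo=False):
    """Build the argument list for src/eval.py or src/main.py (hypothetical helper)."""
    cmd = [
        sys.executable, script,
        "--dataset", dataset,
        "--model", model,
        "--prompt_template_name", prompt_template_name,
        "--num_demos", str(num_demos),
    ]
    # Boolean options are passed as bare flags only when enabled.
    if sampling:
        cmd.append("--sampling")
    if select_demo:
        cmd.append("--select_demo")
    return cmd


def run_all(script, dry_run=True, **options):
    """Build (and optionally run) the command for every dataset."""
    commands = [build_command(script, dataset=d, **options) for d in DATASETS]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # Stops at the first failing run.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_all("src/eval.py", sampling=True, select_demo=True)
```

Passing dry_run=False executes each command in turn; running src/main.py this way presumably requires the API keys exported earlier.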