Code and data set for the IJCAI 2022 paper: Community Question Answering Entity Linking via Leveraging Auxiliary Data.
- torch == 1.8.0+
- transformers == 4.5.1
We construct a new dataset, QuoraEL, which contains 504 CQA texts in total. The Wikipedia dump (July 2019 version) is used as the reference KB. Our data are in the data set folder. CQAEL_dataset.json contains the QuoraEL data mentioned above. Details of other files can be found in the code for format conversion. Since our data set folder is too large, we release it here.
- For each question, the following items are covered:
  question title, question url, ID of question, answers, mentions in question title, topics. topics includes: topic name, topic url, questions under this topic.
- For each answer, the following items are covered:
  answer url, answer id, upvote count, answer content, mentions in answer content, user name, user url, user history answers, user history questions.
- For each mention, the following items are covered:
  mention text, corresponding entity, candidates, gold entity index. candidates is a string and each candidate in candidates is like: <ENTITY>\t<WIKIPEDIA_ID>\t<PRIOR_PROB>

The index of gold entity is '-1' if the mention cannot be linked to any candidates. There are 8030 mentions that can be linked to some candidate.
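The candidate format above can be unpacked with a few lines of Python. This is a minimal sketch, not code from this repository: the names `Candidate`, `parse_candidate`, and `gold_candidate` are hypothetical, the sample values are made up, and it assumes the gold entity index is zero-based and that you have already separated the individual candidate entries (how they are delimited inside the candidates string is not specified here, so check the data file).

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    """One candidate entity (hypothetical helper type, not part of the repo)."""
    entity: str
    wikipedia_id: str
    prior_prob: float


def parse_candidate(entry: str) -> Candidate:
    """Split one tab-separated entry: entity, Wikipedia ID, prior probability."""
    entity, wikipedia_id, prior_prob = entry.rstrip("\n").split("\t")
    return Candidate(entity, wikipedia_id, float(prior_prob))


def gold_candidate(candidates: Sequence[Candidate], gold_index) -> Optional[Candidate]:
    """Return the gold candidate, or None when the index is -1 (unlinkable)."""
    index = int(gold_index)  # accepts the integer -1 as well as the string '-1'
    if index == -1:
        return None
    return candidates[index]  # assumption: the index is zero-based


# Made-up entry, only to illustrate the layout.
example = parse_candidate("Apple Inc.\t856\t0.62")
print(example.entity, example.wikipedia_id, example.prior_prob)
```

Keeping the Wikipedia ID as a string avoids assumptions about its format; convert it with `int(...)` only after confirming the IDs in the data are numeric.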
The data set is constructed in json format. You can load it easily.
    import json

    # PATH_OF_DATASET_FILE is the path to CQAEL_dataset.json
    with open(PATH_OF_DATASET_FILE, 'r') as fp:
        data = json.load(fp)

- models folder: Code of our model and the baseline models. The baselines include Deep-ED, Ment-Norm, FGS2EE, Zeshel, REL, BLINK, and GENRE. Some data files can be downloaded via links in their original repositories.
- dataset folder: our data are in the subfolder cqa-el. CQAEL_dataset.json contains the QuoraEL data mentioned above. Details of other files can be found in the code for format conversion.
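The exact key names used in CQAEL_dataset.json are easiest to confirm from the file itself. The sketch below is one way to print a compact outline of any loaded JSON value; `summarize` is a hypothetical helper, and the `sample` object with its key names is a made-up stand-in, not the real layout of the data set.

```python
import json


def summarize(node, max_depth=3):
    """Return a nested outline of types and keys for a JSON value."""
    if isinstance(node, dict):
        if max_depth == 0:
            return "dict"
        return {key: summarize(value, max_depth - 1) for key, value in node.items()}
    if isinstance(node, list):
        if max_depth == 0 or not node:
            return f"list of {len(node)}"
        # Only the first element is inspected, assuming items share one layout.
        return {f"list of {len(node)}, first item": summarize(node[0], max_depth - 1)}
    return type(node).__name__


# Made-up stand-in; replace with the object returned by json.load.
sample = json.loads('{"title": "t", "answers": [{"upvotes": 3}]}')
print(json.dumps(summarize(sample), indent=2))
```

Raising `max_depth` shows deeper nesting, such as the mentions inside each answer.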
For more details about the data set and the experiment settings, please refer to our paper.