Rationale behind path scoring #11

Hi, your work is impressive!

I have a few questions about the preprocessing code that came up while reading the source.

1. Why do you assign a score of 1 (the maximum) to paths that retrieve no leaves?
    In the [`cal_path_val`](https://github.com/RUCKBReasoning/SubgraphRetrievalKBQA/blob/e332693322ac9dedddef44422a126e9b0afceb20/src/preprocessing/score_path.py#L36) function, you always return `1` when `preds` (the leaves reached by following a path) is an empty set. This means that when you filter paths for pretraining, paths that lead to no leaves will always be selected. Isn't it counterintuitive to give an invalid path the highest score?

2. Using the _HIT_ score as the metric is also a debatable choice. It makes sense for questions like _What are the books written by Ogai Mori?_, but it does not help for questions like _What is the most famous book by Ogai Mori?_. Even if the path Ogai Mori --write--> A, I, U, E, O, etc. is retrieved, it will likely be eliminated because its HIT score will be very low (1 / n_books_from_ogai).

3. Finally, could you briefly explain the rationale behind L35-L49 in [`negative_sampling.py`](https://github.com/RUCKBReasoning/SubgraphRetrievalKBQA/blob/e332693322ac9dedddef44422a126e9b0afceb20/src/preprocessing/negative_sampling.py#L35)?
    My interpretation is that if the number of candidate entities is too large, you simply discard the path. Is that correct?
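
To make points 1 and 2 concrete, here is a minimal sketch of the scoring as I understand it. It is not the repository's code: `hit_score` and its `empty_value` parameter are hypothetical names of my own, and I am assuming the score is the fraction of retrieved leaves that are gold answers.

```python
def hit_score(preds, answers, empty_value=1.0):
    """Fraction of retrieved leaves that are gold answers (my assumption).

    `empty_value` is a hypothetical knob: 1.0 mimics the behaviour described
    in point 1, while 0.0 would treat a path with no leaves as worthless.
    """
    preds, answers = set(preds), set(answers)
    if not preds:
        # Point 1: a path that retrieves nothing gets `empty_value`.
        return empty_value
    return len(preds & answers) / len(preds)


# Point 1: a dead-end path gets the maximum score.
dead_end = hit_score(set(), {"A"})

# Point 2: the correct relation is penalised when only one leaf is the answer.
books = {"A", "I", "U", "E", "O"}
single_answer = hit_score(books, {"A"})

# For a "list all books" question the same path scores perfectly.
all_answers = hit_score(books, books)

print(dead_end, single_answer, all_answers)
```

Under these assumptions, the dead-end path (1.0) outranks the correct `write` path for the "most famous book" question (0.2), which is the behaviour I find surprising.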
