Hi, thank you for releasing the code.
While trying to run the official KGC code in kg/, I ran into a few points where I am not sure whether I am using the code as intended. I would really appreciate some clarification.
1. bf16 PNA aggregation in kg/src/model/pna.py
In kg/script/lp_pretrain.py, training enables mixed precision:
- kg/script/lp_pretrain.py:65-70
accelerator = Accelerator(
...
mixed_precision="bf16",
...
)
At the same time, the PNA text aggregation in kg/src/model/pna.py:22-29 computes:
mean = token_embs.sum(axis=1) / token_lengths
sq_mean = (token_embs**2).sum(axis=1) / token_lengths
std = (sq_mean - mean**2).clamp(min=1e-6).sqrt()
I am asking because, in our reproduction attempts, this computation produced NaN / Inf values in the aggregated text features on some FB datasets, which then led to clearly abnormal evaluation results. Since the code squares the text embeddings directly in the current dtype, the sum of squares can overflow or lose precision under bf16. I wanted to check whether this is the intended behavior under bf16 training, or whether these sums are expected to be accumulated in float32 and then cast back.
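For reference, this is the float32-accumulation variant we tried on our side as a workaround. It is only a sketch of what we assume the intended numerics might be (the function name and the `(batch, 1)` shape of `token_lengths` are our assumptions, not the released API):

```python
import torch

def pna_text_stats_fp32(token_embs: torch.Tensor, token_lengths: torch.Tensor):
    """Per-sequence mean/std of token embeddings, accumulated in float32.

    token_embs:    (batch, seq_len, dim), possibly bf16 under mixed precision
    token_lengths: (batch,) or (batch, 1), count of valid tokens per sequence
    """
    orig_dtype = token_embs.dtype
    x = token_embs.float()  # upcast BEFORE squaring to avoid bf16 overflow
    lengths = token_lengths.float().clamp(min=1).view(-1, 1)
    mean = x.sum(dim=1) / lengths
    sq_mean = (x ** 2).sum(dim=1) / lengths
    std = (sq_mean - mean ** 2).clamp(min=1e-6).sqrt()
    # cast back so downstream layers still see the training dtype
    return mean.to(orig_dtype), std.to(orig_dtype)
```

With this change the NaN / Inf values disappeared in our runs, but we are not sure it matches the statistics the pretrained checkpoints were trained with.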
2. Expected PyTorch version for the public release
I could not find an explicit PyTorch version in the README. The environment script seems to suggest something close to torch 2.2.0 + cu121, based on the PyG wheel source it points at.
I am asking because several parts of the released code load processed datasets / cached blocks / checkpoints with plain torch.load(...), for example:
- kg/script/lp_pretrain.py:170
- kg/script/lp_pretrain.py:269
- kg/src/ultra/datasets.py:23
- kg/src/ultra/datasets.py:1072
- kg/src/data/duckdb.py:22
- kg/src/data/duckdb.py:140
- kg/src/data/duckdb.py:302
In our environment, these loading paths were difficult to run directly under newer PyTorch releases, especially when the processed files contained cached Python objects rather than only plain weight tensors (newer releases default torch.load to weights_only=True, which rejects arbitrary pickled objects). So I wanted to confirm which PyTorch version the public KGC release is actually expected to work with.
3. MTDEA text-data handling
In kg/src/data/datasets.py, I noticed several places where MTDEA-style dataset builders use the pattern:
- kg/src/data/datasets.py:1179-1191
For example:
train_text_data = None
test_text_data = self.text_store.desc_from_mapping(...)
train_data = Data(..., text_data=train_text_data)
Later, evaluation in kg/script/lp_eval.py:46-52 assumes data.text_data is ready and calls:
data.text_data.load_emb_db(ent_path, rel_path, stage="test")
I am asking because, when extending evaluation to MTDEA-style datasets, this path looked unclear on our side: the dataset builder seems to initialize some graph objects with text_data=None, while the evaluation path later assumes text_data is available and ready to load embedding DBs. I wanted to check whether an additional preprocessing / initialization step is required for MTDEA datasets before evaluation, or whether this path is already expected to work directly from the released code.
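To make the gap concrete, this is the kind of guard we ended up inserting before the lp_eval.py call. Everything here is hypothetical glue based on the snippets above (the helper name, the desc_from_mapping arguments, and the attachment point are all our assumptions), shown only to illustrate where the None surfaces:

```python
def ensure_text_data(data, text_store, ent_path, rel_path, stage="test"):
    """Hypothetical guard before evaluation touches data.text_data.

    Some MTDEA builders leave text_data=None, so we attach it from the
    text store first; the `...` stands for whatever mapping arguments
    the builder would normally pass (left elided on purpose).
    """
    if getattr(data, "text_data", None) is None:
        data.text_data = text_store.desc_from_mapping(...)  # placeholder args
    data.text_data.load_emb_db(ent_path, rel_path, stage=stage)
    return data
```

We are not sure this matches the intended initialization order, which is why we are asking whether a preprocessing step is supposed to populate text_data earlier.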
If there is a recommended environment version or a minimal public KGC reproduction path, that would be very helpful.
Thanks a lot.