
Improvement of the txt2kg example.#10623

Merged
akihironitta merged 67 commits into pyg-team:master from drivanov:txt2kg_sigterm
Mar 13, 2026
Conversation

@drivanov
Contributor

Summary

This PR refactors the multiprocessing logic in torch_geometric/llm/models/txt2kg.py to improve reproducibility and stability.

Key Changes

  • Remove tmp-file IPC

    • Workers now return List[(s, p, o)] directly
    • Eliminates /tmp/outs_for_proc_* files
    • No filesystem side effects
  • Fix spawn-related instability

    • Worker moved to module scope (picklable)
    • Explicit error propagation
    • Prevents silent SIGTERM cascades
  • Ensure deterministic output

    • Global lexicographic sort of triples after aggregation
    • Output stable across runs and worker counts

Result

  • Reproducible KG extraction
  • Cleaner multiprocessing design
  • No temp files
  • Clearer failure modes
  • No API changes.
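The design described above can be sketched in a few lines. This is a minimal illustration, not the actual txt2kg code: `_extract_triples`, `run`, and the ceil-division split are hypothetical stand-ins.

```python
import multiprocessing as mp

def _extract_triples(chunks):
    # Stand-in for the real LLM-backed extraction over a slice of text
    # chunks; the real worker would call the model and parse (s, p, o)
    # triples. Defined at module scope so it is picklable under "spawn".
    return [(c, "mentions", c.upper()) for c in chunks]

def run(all_chunks, num_procs=2):
    # Ceil division so no trailing chunks are silently dropped.
    size = -(-len(all_chunks) // num_procs)
    slices = [all_chunks[i:i + size] for i in range(0, len(all_chunks), size)]
    # "fork" keeps this sketch self-contained on Linux; a module-scope
    # worker works equally well under the "spawn" start method.
    with mp.get_context("fork").Pool(len(slices)) as pool:
        results = pool.map(_extract_triples, slices)
    # Workers return their triples directly: no /tmp files, and an
    # exception raised in a worker propagates naturally through pool.map.
    triples = [t for part in results for t in part]
    triples.sort()  # global lexicographic sort -> stable across runs
    return triples
```

The global sort at the end is what makes the output independent of worker count and scheduling order.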

The change in llm.py eliminates the following deprecation warning:

`torch_dtype` is deprecated! Use `dtype` instead!
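The warning comes from the keyword rename in recent `transformers` releases; the fix is to pass `dtype` instead of `torch_dtype`. The stub below is a hypothetical stand-in (not the `transformers` API) that mimics the behavior, just to make the compatibility pattern concrete.

```python
import warnings

def from_pretrained_stub(name, dtype=None, torch_dtype=None):
    # Hypothetical stand-in mimicking recent `transformers` behavior:
    # `torch_dtype` still works, but emits the deprecation warning above,
    # so callers should switch to `dtype`.
    if torch_dtype is not None:
        warnings.warn("`torch_dtype` is deprecated! Use `dtype` instead!",
                      FutureWarning)
        dtype = dtype if dtype is not None else torch_dtype
    return {"model": name, "dtype": dtype}
```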

@codecov

codecov bot commented Feb 26, 2026

Codecov Report

❌ Patch coverage is 73.01587% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.92%. Comparing base (c211214) to head (72404fb).
⚠️ Report is 175 commits behind head on master.

Files with missing lines Patch % Lines
torch_geometric/llm/models/llm.py 16.66% 10 Missing ⚠️
torch_geometric/llm/models/txt2kg.py 86.66% 6 Missing ⚠️
torch_geometric/llm/models/g_retriever.py 83.33% 1 Missing ⚠️

❌ Your patch status has failed because the patch coverage (73.01%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10623      +/-   ##
==========================================
- Coverage   86.11%   85.92%   -0.20%     
==========================================
  Files         496      510      +14     
  Lines       33655    36016    +2361     
==========================================
+ Hits        28981    30945    +1964     
- Misses       4674     5071     +397     

☔ View full report in Codecov by Sentry.

@drivanov drivanov requested a review from wsad1 as a code owner February 27, 2026 19:23
Contributor

@puririshi98 puririshi98 left a comment


LGTM at a high level, but please address these concerns:

  1. The retry logic was silently deleted without replacement. The original code had up to 200 inner retries and 5 outer retries. This was there for a reason — NIM API calls over the network fail transiently. The new code has zero retry logic. A single transient HTTP error in any worker will now raise a RuntimeError and abort the entire job. For expensive, long-running KG extraction jobs, this is a significant regression in robustness.

  2. _safe_worker swallows exceptions and returns a dict instead of raising. This is an anti-pattern with Pool.map. If a worker raises an exception, Pool.map will propagate it naturally to the main process — that's the correct behavior. The _safe_worker wrapper catches the exception, converts it to a dict, returns it as a "successful" result, and then the main process has to manually check every result for this sentinel value. This is more complex and fragile than just letting Pool.map propagate the exception. It also means if a worker silently returns a partial result before raising, it gets swallowed entirely.

  3. The chunk distribution has an off-by-one / data loss bug that was present before and is not fixed. meta_chunk_size = int(len(chunks) / num_procs) with integer division means if len(chunks) is not divisible by num_procs, the last few chunks are silently dropped. For example, 10 chunks across 3 workers: sizes are 3, 3, 3 — the 10th chunk is never processed. This was a pre-existing bug but this PR was the right opportunity to fix it (e.g. using np.array_split or adjusting the slicing).

  4. flat_triples.sort() sorts tuples lexicographically. This works, but the triples coming out of _parse_n_check_triples are lists, not tuples. Sorting lists also works, but the claim of "deterministic output" is only true within a single Python version and locale — list/tuple sort order is not guaranteed to be stable across Python versions for string content with mixed Unicode. Minor point, but worth noting for a library claiming reproducibility.
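The chunk-loss bug from point 3 can be reproduced and fixed in a few lines. `even_split` is a hypothetical helper mirroring `np.array_split`'s remainder handling, shown in plain Python to keep the sketch dependency-free.

```python
chunks = list(range(10))
num_procs = 3

# Original scheme: integer division drops the remainder.
meta = len(chunks) // num_procs  # 3
parts_old = [chunks[i * meta:(i + 1) * meta] for i in range(num_procs)]
# sizes 3, 3, 3 -> chunk 9 is never assigned to any worker

def even_split(seq, n):
    # Distribute len(seq) items over n parts, spreading the remainder
    # across the first parts (same shapes as np.array_split).
    q, r = divmod(len(seq), n)
    out, start = [], 0
    for i in range(n):
        size = q + (1 if i < r else 0)
        out.append(seq[start:start + size])
        start += size
    return out

parts_new = even_split(chunks, num_procs)  # sizes 4, 3, 3 -> nothing lost
```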

Aside from that, please confirm you get an end-to-end run of txt2kg_rag on a small subsample using --use_x_percent_corpus.
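A minimal sketch of restoring the retry budget from point 1. `with_retries` is hypothetical; real code would catch only transient HTTP errors and back off between attempts rather than retrying immediately.

```python
import time

def with_retries(fn, *args, inner_retries=200, outer_retries=5,
                 delay=0.0, **kwargs):
    # Hypothetical helper restoring the original retry budget described
    # above: up to `inner_retries` attempts per pass, across
    # `outer_retries` passes, so one transient failure no longer aborts
    # the whole extraction job.
    last_exc = None
    for _ in range(outer_retries):
        for _ in range(inner_retries):
            try:
                return fn(*args, **kwargs)
            except Exception as exc:  # real code: transient errors only
                last_exc = exc
                time.sleep(delay)
    raise RuntimeError("all retries exhausted") from last_exc

# Usage: a flaky call that succeeds on its third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"
```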

@drivanov
Contributor Author

Done.
Yes, the L1_TXT2KG_RAG_GPU test with the --use_x_percent_corpus .003 parameter passes without problems.

Contributor

@puririshi98 puririshi98 left a comment


LGTM. Just make the CI green and we can merge.

@puririshi98 puririshi98 enabled auto-merge (squash) March 10, 2026 19:37
auto-merge was automatically disabled March 10, 2026 21:37

Head branch was pushed to by a user without write access

@drivanov
Contributor Author

@akihironitta :
I'm not entirely sure, but it seems there are some issues with the code coverage analysis script.

I added at least 10 new tests today, but I still see the same missing coverage for 17 lines:

Patch coverage is 73.01587% with 17 lines in your changes missing coverage

When I try to dig deeper and explore the links on that page, I see that:

  • Commits have different numbers of coverage report uploads.
  • The BASE commit is 174 commits behind HEAD on master.
  • My latest commit is being compared to code that is 11 months old.

I also analyzed the missed lines of

  • torch_geometric/llm/models/llm.py
  • torch_geometric/llm/models/txt2kg.py
  • torch_geometric/llm/models/g_retriever.py

and, with ChatGPT's help, I created a set of tests for them. When I add pdb breakpoints at the appropriate places, I see that the program actually hits them (for instance, that one). However, the code coverage report still shows them as skipped, even though these new tests were actually executed:

  test/llm/models/test_llm.py::test_llm_prepare_inputs PASSED
  test/llm/models/test_llm.py::test_llm_single_prompt PASSED
  test/llm/models/test_llm.py::test_llm_variable_lengths PASSED

@akihironitta
Copy link
Member

Thank you @drivanov for digging into what the issue is with Codecov. I have seen this issue in the past, and it's likely due to their bug or our misconfiguration, so I'll merge this as is. Thanks again for sending this patch!

@akihironitta akihironitta merged commit 2a120a6 into pyg-team:master Mar 13, 2026
18 of 19 checks passed
@drivanov drivanov deleted the txt2kg_sigterm branch March 18, 2026 20:06