Improvement of the txt2kg example.#10623
Conversation
for more information, see https://pre-commit.ci
Codecov Report
❌ Your patch status has failed because the patch coverage (73.01%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files:
@@ Coverage Diff @@
## master #10623 +/- ##
==========================================
- Coverage 86.11% 85.92% -0.20%
==========================================
Files 496 510 +14
Lines 33655 36016 +2361
==========================================
+ Hits 28981 30945 +1964
- Misses 4674 5071 +397
☔ View full report in Codecov by Sentry.
…geometric into txt2kg_sigterm
puririshi98 left a comment
lgtm at a high level but please address these concerns:
- The retry logic was silently deleted without replacement. The original code had up to 200 inner retries and 5 outer retries, and it was there for a reason: NIM API calls over the network fail transiently. The new code has zero retry logic, so a single transient HTTP error in any worker will now raise a RuntimeError and abort the entire job. For expensive, long-running KG extraction jobs, this is a significant regression in robustness.
- _safe_worker swallows exceptions and returns a dict instead of raising. This is an anti-pattern with Pool.map: if a worker raises, Pool.map propagates the exception naturally to the main process, which is the correct behavior. The _safe_worker wrapper instead catches the exception, converts it to a dict, returns it as a "successful" result, and forces the main process to manually check every result for this sentinel value. That is more complex and fragile than letting Pool.map propagate the exception, and any partial result a worker produced before raising is swallowed entirely.
- The chunk distribution has an off-by-one / data-loss bug that predates this PR and is not fixed. meta_chunk_size = int(len(chunks) / num_procs) uses integer division, so if len(chunks) is not divisible by num_procs, the last few chunks are silently dropped. For example, with 10 chunks across 3 workers the slice sizes are 3, 3, 3, and the 10th chunk is never processed. This PR was the right opportunity to fix it (e.g. using np.array_split or adjusting the slicing).
- flat_triples.sort() sorts lexicographically. This works, but the triples coming out of _parse_n_check_triples are lists, not tuples. Sorting lists of strings also works, and Python compares strings by Unicode code point, so the order is locale-independent and stable across versions; still, converting the triples to tuples before sorting would make the determinism intent explicit. Minor point, but worth noting for a library claiming reproducibility.
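A minimal sketch of how the first two points could be addressed together: retry transient failures inside the worker, then re-raise so Pool.map still propagates hard failures to the parent. The helper name, retry counts, and delays below are illustrative, not the PR's actual code:

```python
import random
import time


def with_retries(fn, *args, max_retries=5, base_delay=1.0):
    """Retry transient failures with jittered exponential backoff; re-raise
    on exhaustion so Pool.map surfaces the error in the parent process."""
    for attempt in range(max_retries):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_retries - 1:
                raise  # final failure propagates naturally through Pool.map
            # wait base_delay * 2**attempt seconds, plus up to 100% jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

This keeps the worker's return type unchanged on success (no sentinel dicts), so the parent never has to inspect results for error markers.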
Aside from that, please confirm you get an end-to-end run of txt2kg_rag on a small subsample using --use_x_percent_corpus.
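The dropped-chunk concern can be illustrated with np.array_split, which distributes any remainder across workers instead of truncating it (the chunk values and worker count here are just an example):

```python
import numpy as np

chunks = list(range(10))  # e.g. 10 text chunks to distribute
num_procs = 3

# Old scheme: meta_chunk_size = int(10 / 3) == 3, so slices of size 3, 3, 3
# cover only 9 chunks; the 10th is silently dropped.
# np.array_split instead yields sizes 4, 3, 3: every chunk is assigned once.
per_proc = [part.tolist() for part in np.array_split(chunks, num_procs)]
```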
Done.
puririshi98 left a comment
LGTM just make the CI green and we can merge
…geometric into txt2kg_sigterm
Head branch was pushed to by a user without write access
@akihironitta: I added at least 10 new tests today, but I still see the same missing coverage for 17 lines. When I try to dig deeper and explore the links on that page, I see that
3. I also analyzed the Missed Lines of
and, with ChatGPT's help, I created a set of tests for them. When I add
Thank you @drivanov for digging into what the issue is with codecov. I have seen this issue in the past, and it's likely due to their bug or our misconfiguration, so I'll merge this as is. Thanks again for sending this patch!
Summary
This PR refactors the multiprocessing logic in torch_geometric.llm.models.txt2kg.py to improve reproducibility and stability.
Key Changes
Remove tmp-file IPC
- Workers return List[(s, p, o)] directly instead of writing /tmp/outs_for_proc_* files.
Fix spawn-related instability
- Avoids SIGTERM cascades.
Ensure deterministic output
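A sketch of the IPC shape described above: each worker returns its triples directly and the parent sorts tuples for a stable order. The function names are illustrative, and a ThreadPool stands in for the real process Pool only to keep the demo self-contained; both share the same map API:

```python
from multiprocessing.pool import ThreadPool  # real code uses a process Pool


def extract_triples(chunk):
    # Stand-in for per-worker extraction: return List[(s, p, o)] directly
    # instead of writing /tmp/outs_for_proc_* files.
    return [(chunk, "mentions", chunk.upper())]


def run(chunks, num_procs=2):
    with ThreadPool(num_procs) as pool:  # worker exceptions propagate via map
        per_worker = pool.map(extract_triples, chunks)
    flat = [tuple(t) for triples in per_worker for t in triples]
    flat.sort()  # deterministic ordering regardless of worker timing
    return flat
```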
Result
The change in llm.py eliminates the following warning: