Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 9 additions & 2 deletions src/instructlab/training/main_ds.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
import logging
import os
import subprocess
import sys

Check warning on line 9 in src/instructlab/training/main_ds.py

View workflow job for this annotation

GitHub Actions / pylint

W0611: Unused import sys (unused-import)
import time
import warnings

Expand Down Expand Up @@ -645,11 +646,17 @@
if "process" not in locals() or process is None:
return

failure = process.poll() != 0
# wait for the process to exit so we can properly read the exit code
process.wait(timeout=60)
process_code = process.poll()
failure = process_code != 0
Comment on lines +649 to +652
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Missing exception handling for subprocess.TimeoutExpired.

process.wait(timeout=60) raises subprocess.TimeoutExpired if the process doesn't exit within the timeout. When this exception is raised, lines 651-652 are skipped, leaving failure = False. This could mask a real failure since the function might complete without raising RuntimeError at line 674.

🔎 Proposed fix
         # wait for the process to exit so we can properly read the exit code
-        process.wait(timeout=60)
-        process_code = process.poll()
-        failure = process_code != 0
+        try:
+            process.wait(timeout=60)
+            process_code = process.poll()
+            failure = process_code != 0
+        except subprocess.TimeoutExpired:
+            logger.error("Training subprocess did not exit within 60 seconds.")
+            process_code = None
+            failure = True
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# wait for the process to exit so we can properly read the exit code
process.wait(timeout=60)
process_code = process.poll()
failure = process_code != 0
# wait for the process to exit so we can properly read the exit code
try:
process.wait(timeout=60)
process_code = process.poll()
failure = process_code != 0
except subprocess.TimeoutExpired:
logger.error("Training subprocess did not exit within 60 seconds.")
process_code = None
failure = True
🤖 Prompt for AI Agents
In src/instructlab/training/main_ds.py around lines 649 to 652, the call to
process.wait(timeout=60) can raise subprocess.TimeoutExpired which is currently
unhandled and will skip setting failure; catch subprocess.TimeoutExpired, handle
it by terminating/killing the process (process.terminate() then process.kill()
if needed), call process.wait() again to reap it, set failure = True (and
capture the final exit code via process.poll()), and ensure any necessary
stdout/stderr are read before proceeding so the function does not silently treat
a timed-out process as a success.


if not failure:
logger.info("Operation completed successfully! 🎉")
else:
logger.error("Training subprocess has not exited yet. Sending SIGTERM.")
logger.error(
f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {process_code}"
)
Comment on lines +657 to +659
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Misleading error message.

This branch is taken when process_code != 0, meaning the process has already exited with a non-zero exit code. The message "Training subprocess has not exited yet" is incorrect and confusing. It should indicate the process exited with an error.

🔎 Proposed fix
         else:
             logger.error(
-                f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {process_code}"
+                f"Training subprocess exited with non-zero code: {process_code}"
             )

Note: After fixing this, you may also want to reconsider whether calling process.terminate() on line 661 is appropriate when the process has already exited (it's harmless but unnecessary).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
logger.error(
f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {process_code}"
)
logger.error(
f"Training subprocess exited with non-zero code: {process_code}"
)
🤖 Prompt for AI Agents
In src/instructlab/training/main_ds.py around lines 657 to 659, the log message
incorrectly says "Training subprocess has not exited yet" even though this
branch is executed when process_code != 0 (the subprocess already exited with a
non-zero/error code); change the log to state that the subprocess exited with an
error and include the actual exit code and any available stderr/exception info
(e.g., "Training subprocess exited with non-zero code: {process_code}"); remove
or guard the subsequent process.terminate() call since the process is already
exited (or make it conditional only when the process is still alive) to avoid
unnecessary termination attempts.


process.terminate()
try:
Expand Down
Loading