Skip to content

Conversation

@sbryngelson
Copy link
Member

@sbryngelson sbryngelson commented Jul 8, 2025

User description

Some GH self-hosted runners (slurm cases) will not abort to the slurm job after running the tests (they also don't print the failed cases). As a result, the job keeps going until the wall time max is reached and failed tests are never printed. This should fix that.


PR Type

Bug fix


Description

  • Fix thread hanging issues in SLURM test execution

  • Add proper thread joining with timeout mechanism

  • Improve exception handling and error reporting

  • Track thread completion status for better debugging


Changes diagram

flowchart LR
  A["WorkerThread"] --> B["Enhanced Exception Tracking"]
  B --> C["Thread Join with Timeout"]
  C --> D["Completion Status Check"]
  D --> E["Prevent SLURM Hangs"]
Loading

Changes walkthrough 📝

Relevant files
Bug fix
sched.py
Enhanced thread management and exception handling               

toolchain/mfc/sched.py

  • Enhanced WorkerThread class with completion tracking and full
    exception info
  • Added timeout-based thread joining to prevent infinite hangs
  • Improved exception handling to distinguish test failures from system
    errors
  • Added proper error reporting with traceback information
  • +40/-9   

    Need help?
  • Type /help how to ... in the comments thread for any questions about Qodo Merge usage.
  • Check out the documentation for more information.
  • Copilot AI review requested due to automatic review settings July 8, 2025 13:13
    @sbryngelson sbryngelson requested a review from a team as a code owner July 8, 2025 13:13
    @qodo-merge-pro
    Copy link
    Contributor

    qodo-merge-pro bot commented Jul 8, 2025

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Logic Error

    The exception handling logic at lines 74-84 has a potential issue where completed_successfully is only set to True after the target function completes without exception, but the code checks this flag when an exception occurred. This creates a logical contradiction that may prevent proper error propagation.

    if threadHolder.thread.completed_successfully:
        # Test framework handled the exception gracefully (e.g., test failure)
        # Don't re-raise - this is expected behavior
        pass
    else:
        # Unhandled exception - this indicates a real problem
        if hasattr(threadHolder.thread, 'exc_info') and threadHolder.thread.exc_info:
            error_msg = f"Worker thread {threadID} failed with unhandled exception:\n{threadHolder.thread.exc_info}"
            raise Exception(error_msg) from threadHolder.thread.exc
        else:
            raise threadHolder.thread.exc
    Timeout Handling

    The 30-second timeout for thread joining may be too short for legitimate long-running tests, potentially causing false positives. The timeout value should be configurable or based on test characteristics rather than hardcoded.

        threadHolder.thread.join(timeout=30.0)  # 30 second timeout
    
        # Double-check that thread actually finished joining
        if threadHolder.thread.is_alive():
            # Thread didn't finish within timeout - this is a serious issue
            raise Exception(f"Thread {threadID} failed to join within 30 seconds timeout. "
                          f"Thread may be hung or in an inconsistent state.")
    
    except Exception as join_exc:
        # Handle join-specific exceptions with more context
        raise Exception(f"Failed to join thread {threadID}: {join_exc}. "
                      f"This may indicate a system threading issue or hung test case.")
    Exception Chaining

    The exception handling creates nested exceptions with potentially confusing error messages. The raise Exception(error_msg) from threadHolder.thread.exc pattern may obscure the original exception details that are important for debugging test failures.

    if hasattr(threadHolder.thread, 'exc_info') and threadHolder.thread.exc_info:
        error_msg = f"Worker thread {threadID} failed with unhandled exception:\n{threadHolder.thread.exc_info}"
        raise Exception(error_msg) from threadHolder.thread.exc
    else:
        raise threadHolder.thread.exc

    @qodo-merge-pro
    Copy link
    Contributor

    qodo-merge-pro bot commented Jul 8, 2025

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    General
    Fix exception handling scope issue

    The exception handling catches all exceptions including the timeout exception
    raised in the if block, creating a confusing error message chain. The timeout
    check should be outside the try-except block to avoid catching its own
    exception.

    toolchain/mfc/sched.py [56-68]

     try:
         threadHolder.thread.join(timeout=30.0)  # 30 second timeout
    -    
    -    # Double-check that thread actually finished joining
    -    if threadHolder.thread.is_alive():
    -        # Thread didn't finish within timeout - this is a serious issue
    -        raise Exception(f"Thread {threadID} failed to join within 30 seconds timeout. "
    -                      f"Thread may be hung or in an inconsistent state.")
    -        
     except Exception as join_exc:
         # Handle join-specific exceptions with more context
         raise Exception(f"Failed to join thread {threadID}: {join_exc}. "
                       f"This may indicate a system threading issue or hung test case.")
     
    +# Double-check that thread actually finished joining
    +if threadHolder.thread.is_alive():
    +    # Thread didn't finish within timeout - this is a serious issue
    +    raise Exception(f"Thread {threadID} failed to join within 30 seconds timeout. "
    +                  f"Thread may be hung or in an inconsistent state.")
    +
    Suggestion importance[1-10]: 8

    __

    Why: The suggestion correctly identifies that the try...except block improperly catches the timeout exception it raises, which would lead to a confusing and incorrect error message.

    Medium
    Use else clause for completion flag

    The completed_successfully flag is set to True only when no exception occurs,
    but it's placed inside the try block after the target execution. This creates a
    race condition where the flag might not be set if an exception occurs during the
    flag assignment itself.

    toolchain/mfc/sched.py [16-23]

     try:
         if self._target:
             self._target(*self._args, **self._kwargs)
    -        self.completed_successfully = True  # Mark as completed successfully
     except Exception as exc:
         self.exc = exc
         # Store the full traceback for better error reporting
         self.exc_info = traceback.format_exc()
    +else:
    +    # Only set completion flag if no exception occurred
    +    self.completed_successfully = True
    Suggestion importance[1-10]: 6

    __

    Why: The suggestion correctly proposes using a try...except...else block, which is a more robust and idiomatic way to set the completed_successfully flag only upon successful execution.

    Low
    Possible issue
    Fix contradictory exception handling logic

    The logic for handling exceptions when completed_successfully is True is flawed.
    If an exception occurred but the thread is marked as completed successfully,
    this indicates a bug in the completion tracking logic. Both conditions cannot be
    true simultaneously.

    toolchain/mfc/sched.py [73-84]

     if threadHolder.thread.exc is not None:
    -    if threadHolder.thread.completed_successfully:
    -        # Test framework handled the exception gracefully (e.g., test failure)
    -        # Don't re-raise - this is expected behavior
    -        pass
    +    # Always propagate exceptions - if there's an exception, the thread didn't complete successfully
    +    if hasattr(threadHolder.thread, 'exc_info') and threadHolder.thread.exc_info:
    +        error_msg = f"Worker thread {threadID} failed with unhandled exception:\n{threadHolder.thread.exc_info}"
    +        raise Exception(error_msg) from threadHolder.thread.exc
         else:
    -        # Unhandled exception - this indicates a real problem
    -        if hasattr(threadHolder.thread, 'exc_info') and threadHolder.thread.exc_info:
    -            error_msg = f"Worker thread {threadID} failed with unhandled exception:\n{threadHolder.thread.exc_info}"
    -            raise Exception(error_msg) from threadHolder.thread.exc
    -        else:
    -            raise threadHolder.thread.exc
    +        raise threadHolder.thread.exc
    Suggestion importance[1-10]: 7

    __

    Why: The suggestion correctly identifies that threadHolder.thread.exc and threadHolder.thread.completed_successfully cannot be simultaneously true, making the conditional logic flawed and containing unreachable code.

    Medium
    • Update

    Copy link
    Contributor

    Copilot AI left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Pull Request Overview

    This PR ensures that Slurm-based test runs abort promptly on failures and surface error details. It captures full stack traces in worker threads, enforces a join timeout to avoid hangs, and propagates unhandled thread exceptions.

    • Add traceback import and store exc_info in WorkerThread.
    • Implement a 30-second join timeout and improved exception propagation in join_first_dead_thread.
    • Move the final thread-wait loop inside the progress context for consistent cleanup.
    Comments suppressed due to low confidence (1)

    toolchain/mfc/sched.py:57

    • There’s no test simulating a hung thread or validating the join-timeout path; consider adding a unit test that forces a thread to exceed the timeout to confirm this behavior and error propagation.
                        threadHolder.thread.join(timeout=30.0)  # 30 second timeout
    

    @codecov
    Copy link

    codecov bot commented Jul 8, 2025

    Codecov Report

    All modified and coverable lines are covered by tests ✅

    Project coverage is 43.71%. Comparing base (adcc0dd) to head (253ac51).
    Report is 2 commits behind head on master.

    Additional details and impacted files
    @@           Coverage Diff           @@
    ##           master     #925   +/-   ##
    =======================================
      Coverage   43.71%   43.71%           
    =======================================
      Files          68       68           
      Lines       18360    18360           
      Branches     2292     2292           
    =======================================
      Hits         8026     8026           
      Misses       8945     8945           
      Partials     1389     1389           

    ☔ View full report in Codecov by Sentry.
    📢 Have feedback on the report? Share it here.

    🚀 New features to boost your workflow:
    • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

    @sbryngelson sbryngelson enabled auto-merge (squash) July 14, 2025 14:24
    @sbryngelson sbryngelson disabled auto-merge July 14, 2025 14:37
    @sbryngelson sbryngelson merged commit 568a837 into MFlowCode:master Jul 14, 2025
    30 of 31 checks passed
    @sbryngelson sbryngelson deleted the fix-slurm-hangs branch July 15, 2025 17:59
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Development

    Successfully merging this pull request may close these issues.

    1 participant