Conversation

mohammedahmed18
Contributor

mohammedahmed18 commented Oct 17, 2025

User description

failed_tests_results

PR Type

Enhancement


Description

  • Log unique pytest error lines to LSP

  • Add error extraction utility function

  • Improve baseline failure messaging

  • Adjust performance tests return handling


Diagram Walkthrough

flowchart LR
  A["behavioral tests run"] -- "nonzero return" --> B["extract_unique_errors()"]
  B -- "unique E-lines" --> C["lsp_log(errors)"]
  A -- "coverage check" --> D["coverage_critic()"]
  D -- "fails" --> E["Failure: coverage below threshold"]
  A -- "tests fail" --> F["Failure: tests did not pass"]

File Walkthrough

Relevant files

Enhancement

code_utils.py (codeflash/code_utils/code_utils.py): Utility to extract unique pytest error lines

  • Add extract_unique_errors(pytest_output)
  • Use regex to capture lines starting with E
  • Return a set of unique trimmed error messages

+16/-0

function_optimizer.py (codeflash/optimization/function_optimizer.py): Log behavior test errors and refine test flow

  • Import and use extract_unique_errors for LSP logging
  • Log errors to lsp_log when behavior tests fail
  • Clarify baseline warnings and failure messages
  • Simplify performance run return handling signature

+12/-2

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Codeflash Bot does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
Have you already signed the CLA but the status is still pending? Let us recheck it.

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The new function extract_unique_errors uses re.finditer but this file segment doesn't show an import for re. Ensure import re exists in this module or adjust imports to avoid NameError at runtime.

def extract_unique_errors(pytest_output: str) -> set[str]:
    unique_errors = set()

    # Regex pattern to match error lines:
    # - Start with 'E' followed by optional whitespace
    # - Capture the actual error message
    pattern = r"^E\s+(.*)$"

    for match in re.finditer(pattern, pytest_output, re.MULTILINE):
        error_message = match.group(1).strip()
        if error_message:
            unique_errors.add(error_message)

    return unique_errors
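
For reference, a minimal self-contained sketch of the helper in use (the sample pytest output below is made up), assuming import re is present at module level:

import re

def extract_unique_errors(pytest_output: str) -> set[str]:
    unique_errors = set()
    # Capture the message portion of lines that start with 'E' plus whitespace.
    pattern = r"^E\s+(.*)$"
    for match in re.finditer(pattern, pytest_output, re.MULTILINE):
        error_message = match.group(1).strip()
        if error_message:
            unique_errors.add(error_message)
    return unique_errors

# Hypothetical pytest output: two identical assertion lines and one TypeError line.
sample_output = (
    "tests/test_math.py F\n"
    "E       AssertionError: assert 4 == 5\n"
    "E       AssertionError: assert 4 == 5\n"
    "E       TypeError: unsupported operand type(s)\n"
)
print(extract_unique_errors(sample_output))
# Two unique messages (set order may vary):
# {'AssertionError: assert 4 == 5', 'TypeError: unsupported operand type(s)'}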
Behavior Change

In baseline coverage failure handling, an early Failure is now returned if any behavioral tests failed (did_pass_all_tests). Confirm this ordering doesn't mask low coverage messages or alter intended baseline logic.

did_pass_all_tests = all(result.did_pass for result in behavioral_results)
if not did_pass_all_tests:
    return Failure("Tests failed to pass for the original code.")
return Failure(
    f"Test coverage is {coverage_results.coverage}%, which is below the required threshold of {COVERAGE_THRESHOLD}%."
)
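
A rough sketch of the ordering concern with made-up stand-in values: when tests fail and coverage is also low, only the test-failure message surfaces:

from dataclasses import dataclass

@dataclass
class Result:  # hypothetical stand-in for a behavioral test result
    did_pass: bool

behavioral_results = [Result(did_pass=False), Result(did_pass=True)]
coverage_pct = 10.0  # hypothetical coverage, well below any reasonable threshold

did_pass_all_tests = all(result.did_pass for result in behavioral_results)
if not did_pass_all_tests:
    message = "Tests failed to pass for the original code."
else:
    message = f"Test coverage is {coverage_pct}%, which is below the required threshold."

print(message)  # Only the test-failure message is reported; the coverage figure is not.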
LSP Logging Robustness

When logging unique errors to LSP, extract_unique_errors(run_result.stdout) only inspects stdout. Pytest often writes errors to stderr; consider including stderr to avoid missing errors.

if is_LSP_enabled():
    unique_errors = extract_unique_errors(run_result.stdout)
    if unique_errors:
        lsp_log(LspCodeMessage(code="\n".join(unique_errors), file_name="errors"))

@github-actions
Copy link

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Restore expected return unpacking

Preserve the original two-value unpacking so failures still return the sentinel.
Dropping the second value can break callers expecting both and hide parsing errors.
Capture and ignore the second value explicitly if unused.

codeflash/optimization/function_optimizer.py [1605-1609]

-benchmarking_results = self.run_and_parse_tests(
+benchmarking_results, _ = self.run_and_parse_tests(
     testing_type=TestingMode.PERFORMANCE,
     test_env=test_env,
     test_files=self.test_files,
     optimization_iteration=0,
 )
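
A tiny illustration of the failure mode, using a hypothetical stub in place of run_and_parse_tests: assigning a returned pair to a single name binds the whole tuple rather than just the results:

def run_and_parse_tests_stub():
    results = ["parsed benchmarking run"]  # stand-in for the parsed results
    extra = None                           # stand-in for the second returned value
    return results, extra

benchmarking_results = run_and_parse_tests_stub()
print(type(benchmarking_results).__name__)  # tuple -- the whole pair, not the results

benchmarking_results, _ = run_and_parse_tests_stub()
print(type(benchmarking_results).__name__)  # list -- just the results, as callers expect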
Suggestion importance[1-10]: 7

Why: The PR changed a previously two-value unpack into a single variable, while run_and_parse_tests still returns two values in some paths. Restoring , _ avoids a tuple-to-single-value mismatch and potential runtime errors.

Impact: Medium
Guard against None coverage

Use short-circuit logic to avoid evaluating coverage_results.coverage when
coverage_results is None. Otherwise, a failed/None coverage result will raise an
AttributeError. Check the object before accessing its attributes.

codeflash/optimization/function_optimizer.py [1574-1580]

 if not coverage_critic(coverage_results, self.args.test_framework):
     did_pass_all_tests = all(result.did_pass for result in behavioral_results)
     if not did_pass_all_tests:
         return Failure("Tests failed to pass for the original code.")
+    if coverage_results is None or coverage_results.coverage is None:
+        return Failure("Test coverage could not be determined.")
     return Failure(
         f"Test coverage is {coverage_results.coverage}%, which is below the required threshold of {COVERAGE_THRESHOLD}%."
     )
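
For illustration, a standalone sketch (with a made-up None value) of the failure mode this guard is meant to prevent:

coverage_results = None  # hypothetical: coverage collection failed upstream

try:
    print(f"Test coverage is {coverage_results.coverage}%")
except AttributeError as exc:
    # 'NoneType' object has no attribute 'coverage'
    print(f"Guard needed: {exc}")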
Suggestion importance[1-10]: 5

Why: Adding a None-check for coverage_results could prevent an AttributeError if upstream ever returns None, but the PR context does not indicate such a case occurs. The change is safe and minor, improving robustness without addressing a confirmed bug.

Impact: Low
General
Parse both stdout and stderr

Include stderr in error extraction to avoid missing error lines some runners emit
there. Merge stdout and stderr prior to parsing to capture all relevant E ... lines.

codeflash/optimization/function_optimizer.py [1942-1946]

 if is_LSP_enabled():
-    unique_errors = extract_unique_errors(run_result.stdout)
+    combined_output = "\n".join(filter(None, [run_result.stdout, run_result.stderr]))
+    unique_errors = extract_unique_errors(combined_output)
     if unique_errors:
         lsp_log(LspCodeMessage(code="\n".join(unique_errors), file_name="errors"))
Suggestion importance[1-10]: 6

Why: Including stderr in parsing can surface additional pytest error lines that may be emitted there, improving error visibility for LSP. It's a reasonable enhancement with moderate impact and low risk.

Impact: Low

)

try:
    benchmarking_results, _ = self.run_and_parse_tests(
Contributor

_ is needed

mohammedahmed18 changed the title from "[LSP] log the errors from the failed behavior test results" to "[LSP] log the errors from the failed behavior test results (CF-757)" on Oct 20, 2025
mohammedahmed18 requested a review from KRRT7 on October 21, 2025 10:46
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
@codeflash-ai
Contributor

codeflash-ai bot commented Oct 21, 2025

This PR is now faster! 🚀 mohammed ahmed accepted my code suggestion above.

Comment on lines +1584 to +1586
did_pass_all_tests = all(result.did_pass for result in behavioral_results)
if not did_pass_all_tests:
    return Failure("Tests failed to pass for the original code.")
Contributor

Our tests aren't yet reliable enough for this to work smoothly; we currently have tests that fail frequently.

Contributor Author

mohammedahmed18 commented Oct 22, 2025

So you're saying that if the coverage wasn't enough and some tests failed, the optimization didn't necessarily fail because of the failed tests? And what do you suggest instead? @KRRT7

