Conversation

@KRRT7 (Contributor) commented Sep 17, 2025

PR Type

Enhancement, Tests, Bug fix


Description

  • Derive test context from pytest hooks

  • Add async throughput calculation support

  • Extend baseline model with throughput

  • Update tests for new interfaces


Diagram Walkthrough

```mermaid
flowchart LR
  pytestHook["pytest hooks set test context env"] -- "env vars" --> extractor["extract_test_context_from_frame uses env"]
  runParse["run_and_parse_tests returns results thrice"] -- "behavior/perf results" --> calcTP["calculate_async_throughput aggregates stdout"]
  calcTP -- "function throughput" --> baseline["OriginalCodeBaseline.async_throughput"]
```

File Walkthrough

Relevant files

Enhancement (6 files)

| File | Summary | Changes |
| --- | --- | --- |
| codeflash_wrap_decorator.py | Replace stack introspection with env-based test context | +12/-174 |
| models.py | Add async throughput to baseline model | +1/-0 |
| function_optimizer.py | Plumb results for throughput; compute and store async throughput | +44/-10 |
| parse_test_output.py | Add helpers to compute throughput from stdout | +63/-0 |
| pytest_plugin.py | Set/clear test context env vars per test | +22/-0 |
| test_runner.py | Log executed pytest command for visibility | +1/-1 |

Tests (8 files)

| File | Summary | Changes |
| --- | --- | --- |
| test_async_run_and_parse_tests.py | Adapt to new API and class-name expectations | +10/-10 |
| test_async_wrapper_sqlite_validation.py | Provide env context; fix sync context extraction test | +5/-2 |
| test_codeflash_capture.py | Update for triple-return API and expectations | +12/-12 |
| test_extract_test_context_from_frame.py | Remove obsolete stack-based context tests | +0/-217 |
| test_instrument_all_and_run.py | Adjust tests to new run_and_parse_tests signature | +5/-5 |
| test_instrument_tests.py | Update behavioral/perf/line profile calls to new API | +22/-22 |
| test_instrumentation_run_results_aiservice.py | Align to new return tuple from test runner | +5/-5 |
| test_pickle_patcher.py | Adapt pickle patch tests to new API | +4/-4 |

Configuration changes (1 file)

| File | Summary | Changes |
| --- | --- | --- |
| codeflash.code-workspace | Update launch args to async concurrency example | +5/-1 |

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Regex Robustness

The new throughput parsers rely on strict colon-delimited patterns without escaping, which may fail if module/function names contain colons or unexpected characters, or if stdout interleaves tags. Consider anchoring, stricter grouping, and normalization to avoid false positives/negatives.

```python
start_pattern = re.compile(r"!\$######([^:]*):([^:]*):([^:]*):([^:]*):([^:]+)######\$!")
end_pattern = re.compile(r"!######([^:]*):([^:]*):([^:]*):([^:]*):([^:]+)######!")


def calculate_function_throughput_from_stdout(stdout: str, function_name: str) -> int:
    """A completed execution is defined as having both a start tag and matching end tag:
    Start: !$######test_module:test_function:function_name:loop_index:iteration_id######$!
    End:   !######test_module:test_function:function_name:loop_index:iteration_id######!
    """
    start_matches = start_pattern.findall(stdout)
    end_matches = end_pattern.findall(stdout)
    end_matches_set = set(end_matches)

    # Count completed executions for the specific function only
    function_throughput = 0

    for start_match in start_matches:
        # Check if this execution is for the function we're interested in and has a matching end tag
        # function_name is at index 2 in the match tuple
        if start_match in end_matches_set and len(start_match) > 2 and start_match[2] == function_name:
            function_throughput += 1

    return function_throughput
```
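
For a sense of how these patterns pair start and end tags, here is a minimal usage sketch; the sample stdout is fabricated for illustration, and the import path follows the file walkthrough above:

```python
from codeflash.verification.parse_test_output import calculate_function_throughput_from_stdout

stdout = (
    "!$######tests.test_mod:test_fn:target_fn:0:1######$!\n"
    "interleaved program output\n"
    "!######tests.test_mod:test_fn:target_fn:0:1######!\n"
    # A start tag with no matching end tag is not a completed execution:
    "!$######tests.test_mod:test_fn:target_fn:0:2######$!\n"
)

# Only the first execution completed, so the computed throughput is 1.
assert calculate_function_throughput_from_stdout(stdout, "target_fn") == 1
```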
Excessive Logging

Logging the full pytest command at info level on every run could be noisy and leak paths; consider using debug level or redacting sensitive parts.

```python
blocklist_args = [f"-p no:{plugin}" for plugin in BEHAVIORAL_BLOCKLISTED_PLUGINS if plugin != "cov"]
logger.info(f"{' '.join(coverage_cmd + common_pytest_args + blocklist_args + result_args + test_files)}")
results = execute_test_subprocess(
```
Env Var Dependence

extract_test_context_from_frame now hard-requires environment variables and raises if missing; ensure all runners (including unittest and non-pytest paths) set these vars or add a safer fallback to prevent crashes.

```python
def extract_test_context_from_frame() -> tuple[str, str | None, str]:
    # test_module = os.environ.get("CODEFLASH_TEST_MODULE")
    test_module = os.environ["CODEFLASH_TEST_MODULE"]
    test_class = os.environ.get("CODEFLASH_TEST_CLASS", None)
    # test_function = os.environ.get("CODEFLASH_TEST_FUNCTION")
    test_function = os.environ["CODEFLASH_TEST_FUNCTION"]

    if test_module and test_function:
        return (test_module, test_class if test_class else None, test_function)

    raise RuntimeError(
        "Test context environment variables not set - ensure tests are run through codeflash test runner"
    )
```
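
For context, a minimal sketch of what the extractor sees once the plugin's hooks have populated the environment; the variable values here are illustrative, and note the call site is later switched to extract_test_context_from_env in a diff further down:

```python
import os

from codeflash.code_utils.codeflash_wrap_decorator import extract_test_context_from_frame

# Values the pytest hooks would set for a test like tests/test_mod.py::TestCase::test_fn
os.environ["CODEFLASH_TEST_MODULE"] = "tests.test_mod"
os.environ["CODEFLASH_TEST_CLASS"] = "TestCase"
os.environ["CODEFLASH_TEST_FUNCTION"] = "test_fn"

assert extract_test_context_from_frame() == ("tests.test_mod", "TestCase", "test_fn")
```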

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue
Normalize end tags before matching

The start/end tag formats elsewhere include optional extra timing appended to the
end tag, making tuple equality fail and undercounting throughput. Normalize end
matches (strip trailing timing after the last colon) before comparing to start
tuples. This aligns with the normalization used above and ensures completed
executions are counted correctly.

codeflash/verification/parse_test_output.py [81-103]

```diff
 start_pattern = re.compile(r"!\$######([^:]*):([^:]*):([^:]*):([^:]*):([^:]+)######\$!")
 end_pattern = re.compile(r"!######([^:]*):([^:]*):([^:]*):([^:]*):([^:]+)######!")


 def calculate_function_throughput_from_stdout(stdout: str, function_name: str) -> int:
     """A completed execution is defined as having both a start tag and matching end tag:
     Start: !$######test_module:test_function:function_name:loop_index:iteration_id######$!
-    End:   !######test_module:test_function:function_name:loop_index:iteration_id######!
+    End:   !######test_module:test_function:function_name:loop_index:iteration_id[:timing]######!
     """
     start_matches = start_pattern.findall(stdout)
-    end_matches = end_pattern.findall(stdout)
-    end_matches_set = set(end_matches)
+    raw_end_matches = end_pattern.findall(stdout)

-    # Count completed executions for the specific function only
+    # Normalize end matches by stripping optional trailing ':timing' from the last group
+    normalized_end_matches = []
+    for m in raw_end_matches:
+        if m and len(m) == 5:
+            last = m[4]
+            parts = last.split(":")
+            if len(parts) > 1:
+                # keep everything except the last timing segment
+                last = ":".join(parts[:-1])
+            normalized_end_matches.append((m[0], m[1], m[2], m[3], last))
+        else:
+            normalized_end_matches.append(m)
+
+    end_matches_set = set(normalized_end_matches)
+
     function_throughput = 0
-
     for start_match in start_matches:
-        # Check if this execution is for the function we're interested in and has a matching end tag
-        # function_name is at index 2 in the match tuple
-        if start_match in end_matches_set and len(start_match) > 2 and start_match[2] == function_name:
+        if len(start_match) > 2 and start_match[2] == function_name and start_match in end_matches_set:
             function_throughput += 1

     return function_throughput
```
Suggestion importance[1-10]: 8

Why: This addresses a subtle correctness bug: end tags may include timing info, causing tuple mismatches; the normalization aligns with earlier logic and prevents undercounting. High impact for accurate throughput metrics, and consistent with the PR context.

Impact: Medium
Prevent KeyError on env access

Using direct indexing on os.environ will raise KeyError in environments outside
pytest or if hooks fail, causing hard crashes. Safely read the variables with .get()
and raise the existing RuntimeError only when required fields are missing. This
keeps previous behavior while avoiding unexpected KeyError.

codeflash/code_utils/codeflash_wrap_decorator.py [33-45]

```diff
 def extract_test_context_from_frame() -> tuple[str, str | None, str]:
-    # test_module = os.environ.get("CODEFLASH_TEST_MODULE")
-    test_module = os.environ["CODEFLASH_TEST_MODULE"]
+    test_module = os.environ.get("CODEFLASH_TEST_MODULE")
     test_class = os.environ.get("CODEFLASH_TEST_CLASS", None)
-    # test_function = os.environ.get("CODEFLASH_TEST_FUNCTION")
-    test_function = os.environ["CODEFLASH_TEST_FUNCTION"]
+    test_function = os.environ.get("CODEFLASH_TEST_FUNCTION")

     if test_module and test_function:
         return (test_module, test_class if test_class else None, test_function)

     raise RuntimeError(
         "Test context environment variables not set - ensure tests are run through codeflash test runner"
     )
```
Suggestion importance[1-10]: 7

Why: The suggestion correctly identifies that direct os.environ indexing can raise KeyError and proposes a safe .get() fallback without changing semantics. It's contextually accurate and reduces unexpected crashes, though not critical.

Impact: Medium
General
Avoid blank class env leakage

Writing an empty string to CODEFLASH_TEST_CLASS forces downstream consumers to distinguish "no class" from "unset". Use pop(..., None) before setup to avoid stale values, and only set CODEFLASH_TEST_CLASS when a class exists. This prevents confusing blank class names and ensures clean state per test.

codeflash/verification/pytest_plugin.py [454-474]

```diff
 @pytest.hookimpl(tryfirst=True)
 def pytest_runtest_setup(self, item: pytest.Item) -> None:
+    # Clear any previous test context to avoid leakage between tests
+    for var in ("CODEFLASH_TEST_MODULE", "CODEFLASH_TEST_CLASS", "CODEFLASH_TEST_FUNCTION"):
+        os.environ.pop(var, None)
+
     test_module_name = item.module.__name__ if item.module else "unknown_module"

-    test_class_name = None
-    if item.cls:
-        test_class_name = item.cls.__name__
+    test_class_name = item.cls.__name__ if item.cls else None

     test_function_name = item.name
     if "[" in test_function_name:
         test_function_name = test_function_name.split("[", 1)[0]

     os.environ["CODEFLASH_TEST_MODULE"] = test_module_name
-    os.environ["CODEFLASH_TEST_CLASS"] = test_class_name or ""
+    if test_class_name:
+        os.environ["CODEFLASH_TEST_CLASS"] = test_class_name
     os.environ["CODEFLASH_TEST_FUNCTION"] = test_function_name

 @pytest.hookimpl(trylast=True)
 def pytest_runtest_teardown(self, item: pytest.Item) -> None:
     """Clean up test context environment variables after each test."""
     for var in ["CODEFLASH_TEST_MODULE", "CODEFLASH_TEST_CLASS", "CODEFLASH_TEST_FUNCTION"]:
         os.environ.pop(var, None)
```
Suggestion importance[1-10]: 6

Why: Clearing env vars before setup and omitting empty class values improves robustness and state isolation. It's a reasonable enhancement with moderate impact; existing code already cleans up in teardown, so the issue is not severe.

Impact: Low

KRRT7 and others added 4 commits September 22, 2025 12:21
The optimization achieves a 25% speedup by **eliminating redundant AST node creation** inside the loop. 

**Key change:** The `timeout_decorator` AST node is now created once before the loop instead of being recreated for every test method that needs it. In the original code, this AST structure was built 3,411 times during profiling, consuming significant time in object allocation and initialization.

**Why this works:** AST nodes are immutable once created, so the same `timeout_decorator` instance can be safely appended to multiple method decorator lists. This eliminates:
- Repeated `ast.Call()` constructor calls
- Redundant `ast.Name()` and `ast.Constant()` object creation
- Multiple attribute assignments for the same decorator structure

**Performance characteristics:** The optimization is most effective for large test classes with many test methods (showing 24-33% improvements in tests with 500+ methods), while having minimal impact on classes with few or no test methods. This makes it particularly valuable for comprehensive test suites where classes commonly contain dozens of test methods.

The line profiler shows the AST node creation operations dropped from ~3,400 hits to just ~25 hits, directly correlating with the observed speedup.
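
A minimal sketch of the hoisting pattern described above (illustrative only: the real AsyncCallInstrumenter.visit_ClassDef differs, and the `timeout` decorator name and its value are assumptions):

```python
import ast

def add_timeout_to_test_methods(tree: ast.Module) -> ast.Module:
    # Build the decorator node once, outside the loop; since it is never
    # mutated afterwards, the same instance can safely be appended to every
    # matching method's decorator_list.
    timeout_decorator = ast.Call(
        func=ast.Name(id="timeout", ctx=ast.Load()),
        args=[ast.Constant(value=15)],
        keywords=[],
    )
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)) and item.name.startswith("test_"):
                    item.decorator_list.append(timeout_decorator)
    ast.fix_missing_locations(tree)
    return tree
```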
@codeflash-ai (Contributor) commented Sep 22, 2025

⚡️ Codeflash found optimizations for this PR

📄 26% (0.26x) speedup for AsyncCallInstrumenter.visit_ClassDef in codeflash/code_utils/instrument_existing_tests.py

⏱️ Runtime: 17.4 milliseconds → 13.9 milliseconds (best of 141 runs)

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch get-throughput-from-output).

@codeflash-ai (Contributor) commented Sep 22, 2025

⚡️ Codeflash found optimizations for this PR

📄 69% (0.69x) speedup for AsyncCallInstrumenter._call_in_positions in codeflash/code_utils/instrument_existing_tests.py

⏱️ Runtime: 313 microseconds → 185 microseconds (best of 137 runs)

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch get-throughput-from-output).

…25-09-22T19.41.32

⚡️ Speed up method `AsyncCallInstrumenter.visit_ClassDef` by 26% in PR #739 (`get-throughput-from-output`)

@github-actions github-actions bot added the workflow-modified label Sep 22, 2025
@codeflash-ai (Contributor) commented Sep 23, 2025

⚡️ Codeflash found optimizations for this PR

📄 50% (0.50x) speedup for AsyncCallInstrumenter.visit_AsyncFunctionDef in codeflash/code_utils/instrument_existing_tests.py

⏱️ Runtime: 6.85 milliseconds → 4.56 milliseconds (best of 103 runs)

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch get-throughput-from-output).

@KRRT7 requested a review from misrasaurabh1, September 23, 2025 07:58
```diff
@@ -1,19 +0,0 @@
-name: Lint
```

Contributor: why are we removing this?

Contributor Author (@KRRT7): I'll restore it later. It's behaving differently locally vs CI.

```diff
 line_id = os.environ["CODEFLASH_CURRENT_LINE_ID"]
 loop_index = int(os.environ["CODEFLASH_LOOP_INDEX"])
-test_module_name, test_class_name, test_name = extract_test_context_from_frame()
+test_module_name, test_class_name, test_name = extract_test_context_from_env()
```

Contributor: in a line below, why do we have a default of 0 - `iteration = os.environ.get("CODEFLASH_TEST_ITERATION", "0")`?


```python
blocklist_args = [f"-p no:{plugin}" for plugin in BEHAVIORAL_BLOCKLISTED_PLUGINS if plugin != "cov"]

logger.info(f"{' '.join(coverage_cmd + common_pytest_args + blocklist_args + result_args + test_files)}")
```

Contributor: info level looks too high to me for this log

```python
# When throughput data is available, accept if EITHER throughput OR runtime improves significantly
throughput_acceptance = throughput_improved and throughput_is_best
runtime_acceptance = runtime_improved and runtime_is_best
return throughput_acceptance or runtime_acceptance
```

Contributor: I think this might have cases where one improves and the other degrades. This might require more thought in the future.

Contributor Author (@KRRT7): yes, currently runtime "degrades" (codeflash-ai/optimize-me#121) but throughput improves; this still needs work.
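
To make the concern concrete, a toy illustration with hypothetical flag values (the names mirror the snippet above):

```python
# Throughput improved and is the best seen so far, but runtime regressed.
throughput_improved, throughput_is_best = True, True
runtime_improved, runtime_is_best = False, False

throughput_acceptance = throughput_improved and throughput_is_best
runtime_acceptance = runtime_improved and runtime_is_best

# The OR-gate still accepts the candidate despite the runtime regression.
assert (throughput_acceptance or runtime_acceptance) is True
```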

@misrasaurabh1 (Contributor) left a comment

approving but leaving a few comments

@KRRT7 merged commit b221be4 into granular-async-instrumentation on Sep 24, 2025
19 of 21 checks passed

Labels

Review effort 3/5, workflow-modified (This PR modifies GitHub Actions workflows)

2 participants