Conversation

@mohammedahmed18 (Contributor) commented Jul 16, 2025

User description

depends on #1705


Changes diagram

flowchart LR
  A["get_code_optimization_context()"] --> B["extract_code_markdown_context_from_files()"]
  B --> C["CodeStringsMarkdown (with splitter)"]
  C --> D["encoded_tokens_len(__str__)"]
  C --> E["find_preexisting_objects(__str__)"]

Changes walkthrough 📝

Relevant files

Enhancement

code_context_extractor.py: Switch to markdown context extraction
codeflash/context/code_context_extractor.py
  • Replaced extract_code_string_context_from_files call
  • Now uses extract_code_markdown_context_from_files
  • Token length uses .__str__ of markdown context
  • Preexisting objects extracted via .__str__
  • +6/-5

models.py: Add markdown code strings model
codeflash/models/models.py
  • Added get_code_block_splitter helper
  • Enhanced CodeStringsMarkdown with cache and __str__
  • Changed read_writable_code type to CodeStringsMarkdown
  • Removed unused Field import
  • +16/-2

Tests

test_code_context_extractor.py: Update tests for markdown splitter
tests/test_code_context_extractor.py
  • Imported get_code_block_splitter
  • Updated expected context with splitter f-string
  • Assert on read_writable_code.__str__
  • +4/-3


PR Type

Enhancement, Tests

Description

• Introduce markdown-based code context splitting
• Switch extractor to extract_code_markdown_context_from_files
• Update optimizer to use CodeStringsMarkdown.flat
• Add splitter markers and parse for multi-file replacement

Diagram Walkthrough

flowchart LR
  A["get_code_optimization_context()"]
  A -- "use markdown extractor" --> B["extract_code_markdown_context_from_files()"]
  B --> C["CodeStringsMarkdown"]
  C -- "flatten with .flat" --> D["encoded_tokens_len / find_preexisting_objects"]

@github-actions bot commented Jul 16, 2025

    PR Reviewer Guide 🔍

    (Review updated until commit 654a6ec)

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Mutable Default

    The default value for code_strings is a mutable list, which can be shared across instances. Consider using a default_factory to avoid unexpected shared state in the Pydantic model.

    class CodeStringsMarkdown(BaseModel):
        code_strings: list[CodeString] = []
        cached_code: Optional[str] = None
    
        @property
        def flat(self) -> str:
            if self.cached_code is not None:
                return self.cached_code
            self.cached_code = "\n".join(
                get_code_block_splitter(block.file_path) + "\n" + block.code for block in self.code_strings
            )
            return self.cached_code
    
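
A minimal sketch of the suggested fix, using Field(default_factory=...) so each instance gets its own list (the CodeString shape here is a simplified stand-in, not the real model):

    from typing import Optional
    from pydantic import BaseModel, Field

    class CodeString(BaseModel):  # simplified stand-in for illustration
        file_path: Optional[str] = None
        code: str = ""

    class CodeStringsMarkdown(BaseModel):
        # default_factory builds a fresh list per instance instead of sharing one default
        code_strings: list[CodeString] = Field(default_factory=list)
        cached_code: Optional[str] = None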
    Debug Print

    The code_print call in optimize_function appears to be leftover debugging code and should be removed or replaced with a logger to avoid unwanted console output.

    code_print(code_context.read_writable_code.flat)
    Regex Parsing

    The parse_splitter_markers method relies on a regex that may not robustly handle all file path formats, potentially leading to missing or mis-attributed code blocks. Validate against edge cases.

    def parse_splitter_markers(code_with_markers: str) -> dict[str, str]:
        pattern = rf"{LINE_SPLITTER_MARKER_PREFIX}([^\n]+)\n"
        matches = list(re.finditer(pattern, code_with_markers))
    
        results = {}
        for i, match in enumerate(matches):
            start = match.end()
            end = matches[i + 1].start() if i + 1 < len(matches) else len(code_with_markers)
            file_path = match.group(1).strip()
            code = code_with_markers[start:end].lstrip("\n")
            results[file_path] = code
        return results
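
For illustration, a minimal round trip of the marker format, assuming the prefix is the string emitted by get_code_block_splitter in this PR:

    import re

    LINE_SPLITTER_MARKER_PREFIX = "# codeflash-splitter__"  # assumed, per get_code_block_splitter

    def parse_splitter_markers(code_with_markers: str) -> dict[str, str]:
        pattern = rf"{LINE_SPLITTER_MARKER_PREFIX}([^\n]+)\n"
        matches = list(re.finditer(pattern, code_with_markers))
        results = {}
        for i, match in enumerate(matches):
            start = match.end()
            end = matches[i + 1].start() if i + 1 < len(matches) else len(code_with_markers)
            results[match.group(1).strip()] = code_with_markers[start:end].lstrip("\n")
        return results

    markers = (
        "# codeflash-splitter__src/a.py\n"
        "def f():\n    return 1\n"
        "# codeflash-splitter__src/b.py\n"
        "def g():\n    return 2\n"
    )
    print(parse_splitter_markers(markers))
    # {'src/a.py': 'def f():\n    return 1\n', 'src/b.py': 'def g():\n    return 2\n'}

Note that a marker-like line inside a string literal would also match; that is the kind of edge case flagged above.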

@github-actions bot commented Jul 16, 2025

    PR Code Suggestions ✨

    Latest suggestions up to 654a6ec
    Explore these optional code suggestions:

Category: General
    Skip markers for None file paths

    Guard against None file_path to avoid inserting literal "None" markers. Skip or
    handle blocks without a file_path explicitly before building the joined string.

    codeflash/models/models.py [154-160]

     def flat(self) -> str:
         if self.cached_code is not None:
             return self.cached_code
    -    self.cached_code = "\n".join(
    -        get_code_block_splitter(block.file_path) + "\n" + block.code for block in self.code_strings
    -    )
    +    parts = []
    +    for block in self.code_strings:
    +        if block.file_path is not None:
    +            parts.append(f"{get_code_block_splitter(block.file_path)}\n{block.code}")
    +        else:
    +            parts.append(block.code)
    +    self.cached_code = "\n".join(parts)
         return self.cached_code
Suggestion importance[1-10]: 7

Why: Guarding against None in block.file_path prevents literal "None" markers and avoids confusion in the generated context, improving robustness.

Impact: Medium
    Anchor and escape splitter regex

    Anchor the splitter marker at the start of a line and escape special characters to
    avoid partial matches, and enable multiline mode for accurate parsing.

    codeflash/models/models.py [172-175]

     @staticmethod
     def parse_splitter_markers(code_with_markers: str) -> dict[str, str]:
    -    pattern = rf"{LINE_SPLITTER_MARKER_PREFIX}([^\n]+)\n"
    -    matches = list(re.finditer(pattern, code_with_markers))
    +    pattern = rf"^{re.escape(LINE_SPLITTER_MARKER_PREFIX)}(.+)$"
    +    matches = list(re.finditer(pattern, code_with_markers, re.MULTILINE))
    Suggestion importance[1-10]: 5

Why: Anchoring the pattern with ^ and using re.escape makes the splitter regex more precise and prevents accidental partial matches, enhancing parsing reliability.

Impact: Low

Category: Possible issue
    Fix f-string interpolation in warning

    The second string literal isn’t an f-string, so the braces won’t interpolate. Prefix
    it with f or merge into the preceding f-string to correctly display the existing
    keys.

    codeflash/optimization/function_optimizer.py [637-640]

     logger.warning(
    -    f"Optimized code not found for {relative_module_path} In the context\n-------\n{optimized_code}\n-------\n"
    -    "Existing files in the context are: {list(file_to_code_context.keys())}, re-check your 'split markers'"
    +    f"Optimized code not found for {relative_module_path} In the context\n"
    +    f"-------\n{optimized_code}\n-------\n"
    +    f"Existing files in the context are: {list(file_to_code_context.keys())}, re-check your 'split markers'"
     )
    Suggestion importance[1-10]: 6

Why: Prefixing the second literal with f ensures that {list(file_to_code_context.keys())} is interpolated, making the warning accurate and useful for debugging.

Impact: Low

    Previous suggestions

    Suggestions up to commit 216eb7e
Category: General
    Normalize splitter path

    Use file_path.as_posix() in the f-string to ensure the splitter uses a consistent
    POSIX path, avoiding backslashes on Windows and matching test expectations.

    codeflash/models/models.py [142-143]

     def get_code_block_splitter(file_path: Path) -> str:
    -    return f"# codeflash-splitter__{file_path}"
    +    return f"# codeflash-splitter__{file_path.as_posix()}"
    Suggestion importance[1-10]: 7

Why: Using file_path.as_posix() in the splitter guarantees forward-slash paths on all OSes, avoiding backslashes on Windows and preventing cross-platform test failures.

Impact: Medium
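
The difference is visible even on a non-Windows host by using a pure path (illustrative only):

    from pathlib import PureWindowsPath

    p = PureWindowsPath("codeflash/models/models.py")
    print(f"# codeflash-splitter__{p}")             # codeflash\models\models.py
    print(f"# codeflash-splitter__{p.as_posix()}")  # codeflash/models/models.py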
    Properly implement __str__

Replace the @property decorator on __str__ with a standard special method override so that calling str(obj) invokes it correctly and doesn't shadow built-in behavior.

    codeflash/models/models.py [150-157]

    -@property
     def __str__(self) -> str:
         if self.cached_code is not None:
             return self.cached_code
         self.cached_code = "\n\n".join(
             get_code_block_splitter(block.file_path) + "\n" + block.code for block in self.code_strings
         )
         return self.cached_code
    Suggestion importance[1-10]: 6

Why: Defining __str__ with @property prevents str(obj) from invoking this logic; converting it to a normal special method override ensures the custom formatter is used when calling str(), improving usability.

Impact: Low

@mohammedahmed18 changed the title from "start using markdown representation for read writable context" to "start using markdown representation for read writable context (CF-687)" on Jul 16, 2025
    codeflash-ai bot added a commit that referenced this pull request Jul 25, 2025
    …n PR #553 (`feat/markdown-read-writable-context`)
    
Here is an optimized version of your program with improved runtime and memory usage. Your original `path_to_code_string` function uses a dictionary comprehension, which is already efficient, but we can go further by minimizing attribute lookups and avoiding unnecessary string conversions. Annotating the expected attribute types also helps; it does not change runtime behavior, but it improves readability and eases future optimization.
    
    Here's the improved version.
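
A minimal sketch of the rewrite the notes below describe (the class scaffolding here is a simplified stand-in for the real models):

    class CodeString:  # simplified stand-in
        def __init__(self, file_path, code):
            self.file_path = file_path
            self.code = code

    class CodeStringsMarkdown:  # simplified stand-in
        def __init__(self, code_strings):
            self.code_strings = code_strings

        def path_to_code_string(self) -> dict:
            code_strings = self.code_strings  # local alias avoids repeated attribute lookups
            result = {}
            for block in code_strings:        # explicit loop instead of a dict comprehension
                result[block.file_path] = block.code
            return result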
    
    
    
    **Notes about the optimization:**
    
    - The for-loop avoids repeated attribute lookups and is slightly faster and less memory-intensive than a dictionary comprehension in some cases (especially for larger datasets).
    - Converted `self.code_strings` to a local variable for faster access inside the loop.
    - No unnecessary temporary objects or function calls were introduced.
    - This also makes it easier to add future optimizations, such as slotting or generator-based approaches for extreme scale, if needed.
    
    **Performance justification:**  
    This makes the method marginally faster for large `code_strings` collections because it reduces temporary object allocations and attribute lookups, and dictionary insertion in a loop is roughly the same speed as a comprehension but is more explicit for optimization.
    
    Let me know if you need even lower-level optimization or have information about the structure of `file_path` or `code` that could allow further improvements!
@codeflash-ai bot commented Jul 25, 2025

    ⚡️ Codeflash found optimizations for this PR

    📄 66% (0.66x) speedup for CodeStringsMarkdown.path_to_code_string in codeflash/models/models.py

⏱️ Runtime: 4.52 microseconds → 2.72 microseconds (best of 33 runs)

    I created a new dependent PR with the suggested changes. Please review:

    If you approve, it will be merged into this PR (branch feat/markdown-read-writable-context).

@mohammedahmed18 (Contributor Author) commented

    The bubble sort test is failing because it can't generate the .sort() optimization with the current prompt.


    Using the prompt here would fix the issue.

(screenshot: 2025-07-26-032457_hyprshot)

@mohammedahmed18 marked this pull request as ready for review on July 26, 2025 05:50
@github-actions bot

    Persistent review updated to latest commit 654a6ec

    codeflash-ai bot added a commit that referenced this pull request Aug 5, 2025
    … (`feat/markdown-read-writable-context`)
    
    The optimization achieves a **52% speedup** by eliminating repeated attribute lookups through a simple but effective change: storing `self._cache` in a local variable `cache` at the beginning of the method.
    
    **Key optimization:**
    - **Reduced attribute access overhead**: Instead of accessing `self._cache` multiple times (3-4 times in the original), the optimized version accesses it once and stores it in a local variable. In Python, local variable access is significantly faster than attribute access since it avoids the overhead of attribute resolution through the object's `__dict__`.
    
    **Performance impact by operation:**
    - The `cache.get("file_to_path")` call becomes ~3x faster (from 14,423ns to 1,079ns per hit)
    - Dictionary assignments and returns also benefit from faster local variable access
    - Total runtime drops from 22.7μs to 14.8μs
    
    **Best suited for:**
    Based on the test results, this optimization is particularly effective for scenarios with frequent cache lookups, showing **48-58% improvements** in basic usage patterns. The optimization scales well regardless of the `code_strings` content size since the bottleneck was in the cache access pattern, not the dictionary comprehension itself.
    
    This is a classic Python micro-optimization that leverages the performance difference between local variables (stored in a fast array) versus instance attributes (requiring dictionary lookups).
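
A minimal sketch of the pattern (the cache key and property name are taken from this PR; the mapping body and class scaffolding are assumed):

    from pathlib import Path

    class CodeString:  # simplified stand-in
        def __init__(self, file_path: Path, code: str):
            self.file_path = file_path
            self.code = code

    class CodeStringsMarkdown:  # simplified stand-in
        def __init__(self, code_strings):
            self.code_strings = code_strings
            self._cache: dict = {}

        @property
        def file_to_path(self) -> dict:
            cache = self._cache                  # bind once; local access beats attribute access
            cached = cache.get("file_to_path")
            if cached is not None:
                return cached
            result = {cs.file_path.name: cs.file_path for cs in self.code_strings}  # mapping assumed
            cache["file_to_path"] = result
            return result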
@codeflash-ai bot commented Aug 5, 2025

    ⚡️ Codeflash found optimizations for this PR

    📄 53% (0.53x) speedup for CodeStringsMarkdown.file_to_path in codeflash/models/models.py

⏱️ Runtime: 22.7 microseconds → 14.8 microseconds (best of 33 runs)

    I created a new dependent PR with the suggested changes. Please review:

    If you approve, it will be merged into this PR (branch feat/markdown-read-writable-context).

    _cache: dict = PrivateAttr(default_factory=dict)

    @property
    def flat(self) -> str:

Contributor: can you add docstring to define what flat means?
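
A docstring along these lines would answer the question (wording assumed, not taken from the PR):

    @property
    def flat(self) -> str:
        """Return every code block joined into one string, each block preceded
        by its file-path splitter marker, so the combined context can later be
        split back into per-file code."""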


scoped_optimized_code = file_to_code_context.get(relative_module_path)
if scoped_optimized_code is None:
    logger.warning(

Contributor: I wonder if this code can be moved into the definition of replace_function_definitions_in_module; that would make this function simpler and make calling replace_function_definitions_in_module safer.

misrasaurabh1 previously approved these changes Aug 5, 2025
    json_payload = json.dumps(payload, indent=None, default=pydantic_encoder)
    logger.debug(f"========JSON PAYLOAD FOR {url}==============")
    logger.debug(json_payload)
    logger.debug("======================")

Contributor: not needed imo

Contributor Author: yeah, committed that by accident, I removed it

    should_run_experiment, code_context, original_helper_code = initialization_result.unwrap()

    code_print(code_context.read_writable_code)
    code_print(code_context.read_writable_code.flat) # Should we print the markdown or the flattened code?

Contributor: can probably stick with flattened code

    codeflash-ai bot added a commit that referenced this pull request Aug 6, 2025
    … (`feat/markdown-read-writable-context`)
    
    The optimized code achieves a 10% speedup through several targeted performance improvements:
    
    **Key Optimizations:**
    
    1. **Reduced attribute lookups in hot loops**: Pre-cached frequently accessed attributes like `helper.jedi_definition`, `helper.file_path.stem`, and method references (`helpers_by_file.__getitem__`) outside loops to avoid repeated attribute resolution.
    
2. **Faster AST node type checking**: Replaced `isinstance(node, ast.ImportFrom)` with `type(node) is ast.ImportFrom` and cached AST classes (`ImportFrom = ast.ImportFrom`) to eliminate repeated class lookups during AST traversal (see the sketch after this list).
    
    3. **Optimized entrypoint function discovery**: Used `ast.iter_child_nodes()` first to check top-level nodes before falling back to full `ast.walk()`, since entrypoint functions are typically at module level.
    
    4. **Eliminated expensive set operations**: Replaced `set.intersection()` calls with simple membership testing using a direct loop (`for n in possible_call_names: if n in called_fn_names`), which short-circuits on first match and avoids creating intermediate sets.
    
    5. **Streamlined data structure operations**: Used `setdefault()` and direct list operations instead of conditional checks, and stored local references to avoid repeated dictionary lookups.
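
An illustrative sketch of optimizations 2-3 above (the module source here is assumed):

    import ast

    tree = ast.parse("from os import path\nimport sys\n")
    ImportFrom = ast.ImportFrom              # cache the class lookup once
    for node in ast.iter_child_nodes(tree):  # scan top-level nodes first
        if type(node) is ImportFrom:         # exact-type check instead of isinstance
            print(node.module)               # -> os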
    
    **Performance Impact by Test Case:**
    - Small-scale tests (basic usage): 3-12% improvement
    - Large-scale tests with many helpers: 10-15% improvement  
    - Import-heavy scenarios: 4-9% improvement
    
    The optimizations are particularly effective for codebases with many helper functions and complex import structures, where the reduced overhead in hot loops compounds significantly.
@codeflash-ai bot commented Aug 6, 2025

    ⚡️ Codeflash found optimizations for this PR

    📄 10% (0.10x) speedup for detect_unused_helper_functions in codeflash/context/unused_definition_remover.py

⏱️ Runtime: 20.7 milliseconds → 18.8 milliseconds (best of 5 runs)

    I created a new dependent PR with the suggested changes. Please review:

    If you approve, it will be merged into this PR (branch feat/markdown-read-writable-context).

    ):
    generated_results = self.generate_tests_and_optimizations(
    testgen_context_code=code_context.testgen_context_code,
testgen_context_code=code_context.testgen_context_code, # TODO: should we send the markdown context for the testgen instead.

Contributor: we'll probably have to test the testgen responses with the new markdown format, are you modifying it on aiservice?

Contributor Author: no didn't modify it

    codeflash-ai bot added a commit that referenced this pull request Aug 7, 2025
    …553 (`feat/markdown-read-writable-context`)
    
    The optimization moves the `type_mapping` dictionary from being recreated inside the `show_message_log` method on every call to being a class-level attribute `_type_mapping` that is created once when the class is defined.
    
    **Key optimization:**
    - **Dictionary creation elimination**: The original code recreates the same 5-element dictionary mapping message type strings to `MessageType` enums on every call to `show_message_log`. The optimized version creates this mapping once as a class attribute and reuses it.
    
    **Why this provides a speedup:**
    - Dictionary creation in Python involves memory allocation and hash table initialization overhead
    - The line profiler shows the original `show_message_log` method spending significant time (99.3% of its execution) on dictionary creation and operations
    - By eliminating repeated dictionary creation, the optimized version reduces per-call overhead from ~46ms total time to ~33μs (1000x+ improvement for this method)
    
    **Test case performance:**
    The optimization particularly benefits scenarios with frequent logging calls. Test cases like `test_successful_optimization_speedup_calculation` and `test_successful_optimization_with_different_function_name` that make multiple `show_message_log` calls see the most benefit, as they avoid the repeated dictionary allocation overhead on each logging operation.
    
    This is a classic Python optimization pattern - moving constant data structures outside frequently-called methods to avoid repeated allocation costs.
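
A minimal sketch of the pattern (enum members and method body assumed for illustration):

    from enum import Enum

    class MessageType(Enum):  # members assumed
        INFO = "info"
        ERROR = "error"

    class LspServer:
        # built once at class-definition time instead of on every call
        _type_mapping = {"info": MessageType.INFO, "error": MessageType.ERROR}

        def show_message_log(self, message: str, message_type: str) -> None:
            mtype = self._type_mapping.get(message_type, MessageType.INFO)
            print(f"[{mtype.name}] {message}")  # stand-in for the real log sink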
@codeflash-ai bot commented Aug 7, 2025

    ⚡️ Codeflash found optimizations for this PR

    📄 1,543% (15.43x) speedup for perform_function_optimization in codeflash/lsp/beta.py

⏱️ Runtime: 17.9 milliseconds → 1.09 milliseconds (best of 33 runs)

    I created a new dependent PR with the suggested changes. Please review:

    If you approve, it will be merged into this PR (branch feat/markdown-read-writable-context).

@mohammedahmed18 merged commit 3171bb8 into main on Aug 7, 2025
    19 checks passed
