Skip to content

Conversation

@tomerqodo
Copy link

Benchmark PR from qodo-benchmark#275

@greptile-apps
Copy link

greptile-apps bot commented Jan 15, 2026

Greptile Summary

This PR fixes markdown link rendering by implementing a placeholder-based approach to preserve markdown links and images during URL removal. The fix prevents markdown links like [text](url) from being mangled into text](url) when the remove_urls_emails cleaning rule is enabled.

Key changes:

  • Modified CleanProcessor.clean() to use placeholders that protect entire markdown links/images before URL removal
  • Updated remove_leading_symbols() to detect and preserve markdown links at the start of text
  • Added comprehensive test coverage including edge cases like URL-as-link-text

Critical issues found:

  • Regex pattern for markdown images on line 38 missing capture group for alt text, causing match.group(2) on line 51 to fail with IndexError
  • Alt text hardcoded as "image" instead of preserving original value
  • Unused __init__ method added to CleanProcessor with default_rules parameter that's never used

Confidence Score: 2/5

  • This PR has critical runtime bugs that will cause failures when processing markdown images
  • Score reflects critical regex bug causing IndexError when match.group(2) is called on a pattern with only one capture group (line 38/51 in clean_processor.py). The alt text preservation is also broken. While the approach is sound and tests are comprehensive, the implementation will fail at runtime when processing markdown images.
  • Pay close attention to api/core/rag/cleaner/clean_processor.py - the markdown image pattern needs fixing on line 38, and the placeholder function needs updating on lines 49-54

Important Files Changed

Filename Overview
api/core/rag/cleaner/clean_processor.py Markdown link preservation logic added but has critical regex bug - match.group(2) called on pattern with only one capture group
api/core/tools/utils/text_processing_utils.py Added markdown link preservation check at start of function - straightforward and correct implementation
api/tests/unit_tests/core/rag/cleaner/test_clean_processor.py Comprehensive test coverage added for markdown preservation, but tests would fail due to bug in implementation

Sequence Diagram

sequenceDiagram
    participant User
    participant CleanProcessor
    participant RegexEngine
    
    User->>CleanProcessor: clean(text, process_rule)
    
    Note over CleanProcessor: Apply default cleaning
    CleanProcessor->>RegexEngine: Remove <| and |>
    CleanProcessor->>RegexEngine: Remove control characters
    
    alt remove_urls_emails enabled
        Note over CleanProcessor: Protect markdown links/images
        CleanProcessor->>RegexEngine: Match markdown links [text](url)
        RegexEngine-->>CleanProcessor: Replace with placeholders
        CleanProcessor->>RegexEngine: Match markdown images ![alt](url)
        RegexEngine-->>CleanProcessor: Replace with placeholders
        
        Note over CleanProcessor: Remove emails and URLs
        CleanProcessor->>RegexEngine: Remove email addresses
        CleanProcessor->>RegexEngine: Remove remaining URLs
        
        Note over CleanProcessor: Restore markdown
        CleanProcessor->>CleanProcessor: Restore links from placeholders
        CleanProcessor->>CleanProcessor: Restore images from placeholders
    end
    
    alt remove_extra_spaces enabled
        CleanProcessor->>RegexEngine: Reduce multiple newlines to 2
        CleanProcessor->>RegexEngine: Reduce multiple spaces to 1
    end
    
    CleanProcessor-->>User: Return cleaned text
Loading

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

# Replace the ENTIRE markdown link/image with a single placeholder to protect
# the link text (which might also be a URL) from being removed
markdown_link_pattern = r"\[([^\]]*)\]\((https?://[^)]+)\)"
markdown_image_pattern = r"!\[.*?\]\((https?://[^)]+)\)"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syntax: pattern missing capture group for alt text - line 51 calls match.group(2) but this pattern only has one capture group (the URL)

Suggested change
markdown_image_pattern = r"!\[.*?\]\((https?://[^)]+)\)"
markdown_image_pattern = r"!\[(.*?)\]\((https?://[^)]+)\)"
Prompt To Fix With AI
This is a comment left during a code review.
Path: api/core/rag/cleaner/clean_processor.py
Line: 38:38

Comment:
**syntax:** pattern missing capture group for alt text - line 51 calls `match.group(2)` but this pattern only has one capture group (the URL)

```suggestion
                    markdown_image_pattern = r"!\[(.*?)\]\((https?://[^)]+)\)"
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +49 to +54
def replace_image_with_placeholder(match, placeholders=placeholders):
link_type = "image"
url = match.group(2)
placeholder = f"__MARKDOWN_PLACEHOLDER_{len(placeholders)}__"
placeholders.append((link_type, "image", url))
return placeholder
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: hardcoding "image" as alt text instead of preserving the original - should use match.group(1) after fixing the pattern

Suggested change
def replace_image_with_placeholder(match, placeholders=placeholders):
link_type = "image"
url = match.group(2)
placeholder = f"__MARKDOWN_PLACEHOLDER_{len(placeholders)}__"
placeholders.append((link_type, "image", url))
return placeholder
def replace_image_with_placeholder(match, placeholders=placeholders):
link_type = "image"
alt_text = match.group(1)
url = match.group(2)
placeholder = f"__MARKDOWN_PLACEHOLDER_{len(placeholders)}__"
placeholders.append((link_type, alt_text, url))
return placeholder
Prompt To Fix With AI
This is a comment left during a code review.
Path: api/core/rag/cleaner/clean_processor.py
Line: 49:54

Comment:
**logic:** hardcoding "image" as alt text instead of preserving the original - should use `match.group(1)` after fixing the pattern

```suggestion
                    def replace_image_with_placeholder(match, placeholders=placeholders):
                        link_type = "image"
                        alt_text = match.group(1)
                        url = match.group(2)
                        placeholder = f"__MARKDOWN_PLACEHOLDER_{len(placeholders)}__"
                        placeholders.append((link_type, alt_text, url))
                        return placeholder
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +5 to +7
def __init__(self, default_rules: dict | None = None):
"""Initialize CleanProcessor with optional default rules."""
self.default_rules = default_rules or {}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: __init__ added but default_rules is never used - the clean method is a classmethod that doesn't access instance attributes

Prompt To Fix With AI
This is a comment left during a code review.
Path: api/core/rag/cleaner/clean_processor.py
Line: 5:7

Comment:
**style:** `__init__` added but `default_rules` is never used - the `clean` method is a classmethod that doesn't access instance attributes

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants