fix: [xxx](xxx) render as xxx](xxx) #81

tomerqodo · 2026-01-15T10:34:53Z

greptile-apps · 2026-01-15T10:38:05Z

Greptile Summary

This PR fixes markdown link rendering by implementing a placeholder-based approach to preserve markdown links and images during URL removal. The fix prevents markdown links like [text](url) from being mangled into text](url) when the remove_urls_emails cleaning rule is enabled.

Key changes:

Modified CleanProcessor.clean() to use placeholders that protect entire markdown links/images before URL removal
Updated remove_leading_symbols() to detect and preserve markdown links at the start of text
Added comprehensive test coverage including edge cases like URL-as-link-text

Critical issues found:

Regex pattern for markdown images on line 38 missing capture group for alt text, causing match.group(2) on line 51 to fail with IndexError
Alt text hardcoded as "image" instead of preserving original value
Unused __init__ method added to CleanProcessor with default_rules parameter that's never used

Confidence Score: 2/5

This PR has critical runtime bugs that will cause failures when processing markdown images
Score reflects critical regex bug causing IndexError when match.group(2) is called on a pattern with only one capture group (line 38/51 in clean_processor.py). The alt text preservation is also broken. While the approach is sound and tests are comprehensive, the implementation will fail at runtime when processing markdown images.
Pay close attention to api/core/rag/cleaner/clean_processor.py - the markdown image pattern needs fixing on line 38, and the placeholder function needs updating on lines 49-54

Important Files Changed

Filename	Overview
api/core/rag/cleaner/clean_processor.py	Markdown link preservation logic added but has critical regex bug - `match.group(2)` called on pattern with only one capture group
api/core/tools/utils/text_processing_utils.py	Added markdown link preservation check at start of function - straightforward and correct implementation
api/tests/unit_tests/core/rag/cleaner/test_clean_processor.py	Comprehensive test coverage added for markdown preservation, but tests would fail due to bug in implementation

Sequence Diagram

sequenceDiagram
    participant User
    participant CleanProcessor
    participant RegexEngine
    
    User->>CleanProcessor: clean(text, process_rule)
    
    Note over CleanProcessor: Apply default cleaning
    CleanProcessor->>RegexEngine: Remove <| and |>
    CleanProcessor->>RegexEngine: Remove control characters
    
    alt remove_urls_emails enabled
        Note over CleanProcessor: Protect markdown links/images
        CleanProcessor->>RegexEngine: Match markdown links [text](url)
        RegexEngine-->>CleanProcessor: Replace with placeholders
        CleanProcessor->>RegexEngine: Match markdown images ![alt](url)
        RegexEngine-->>CleanProcessor: Replace with placeholders
        
        Note over CleanProcessor: Remove emails and URLs
        CleanProcessor->>RegexEngine: Remove email addresses
        CleanProcessor->>RegexEngine: Remove remaining URLs
        
        Note over CleanProcessor: Restore markdown
        CleanProcessor->>CleanProcessor: Restore links from placeholders
        CleanProcessor->>CleanProcessor: Restore images from placeholders
    end
    
    alt remove_extra_spaces enabled
        CleanProcessor->>RegexEngine: Reduce multiple newlines to 2
        CleanProcessor->>RegexEngine: Reduce multiple spaces to 1
    end
    
    CleanProcessor-->>User: Return cleaned text

greptile-apps

_{5 files reviewed, 3 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-01-15T10:38:02Z

api/core/rag/cleaner/clean_processor.py

+                    # Replace the ENTIRE markdown link/image with a single placeholder to protect
+                    # the link text (which might also be a URL) from being removed
+                    markdown_link_pattern = r"\[([^\]]*)\]\((https?://[^)]+)\)"
+                    markdown_image_pattern = r"!\[.*?\]\((https?://[^)]+)\)"


syntax: pattern missing capture group for alt text - line 51 calls match.group(2) but this pattern only has one capture group (the URL)

Suggested change

markdown_image_pattern = r"!\[.*?\]\((https?://[^)]+)\)"

markdown_image_pattern = r"!\[(.*?)\]\((https?://[^)]+)\)"

Prompt To Fix With AI

This is a comment left during a code review. Path: api/core/rag/cleaner/clean_processor.py Line: 38:38 Comment: **syntax:** pattern missing capture group for alt text - line 51 calls `match.group(2)` but this pattern only has one capture group (the URL) ```suggestion markdown_image_pattern = r"!\[(.*?)\]\((https?://[^)]+)\)" ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-01-15T10:38:03Z

api/core/rag/cleaner/clean_processor.py

+                    def replace_image_with_placeholder(match, placeholders=placeholders):
+                        link_type = "image"
+                        url = match.group(2)
+                        placeholder = f"__MARKDOWN_PLACEHOLDER_{len(placeholders)}__"
+                        placeholders.append((link_type, "image", url))
+                        return placeholder


logic: hardcoding "image" as alt text instead of preserving the original - should use match.group(1) after fixing the pattern

Suggested change

def replace_image_with_placeholder(match, placeholders=placeholders):

link_type = "image"

url = match.group(2)

placeholder = f"__MARKDOWN_PLACEHOLDER_{len(placeholders)}__"

placeholders.append((link_type, "image", url))

return placeholder

def replace_image_with_placeholder(match, placeholders=placeholders):

link_type = "image"

alt_text = match.group(1)

url = match.group(2)

placeholder = f"__MARKDOWN_PLACEHOLDER_{len(placeholders)}__"

placeholders.append((link_type, alt_text, url))

return placeholder

Prompt To Fix With AI

This is a comment left during a code review. Path: api/core/rag/cleaner/clean_processor.py Line: 49:54 Comment: **logic:** hardcoding "image" as alt text instead of preserving the original - should use `match.group(1)` after fixing the pattern ```suggestion def replace_image_with_placeholder(match, placeholders=placeholders): link_type = "image" alt_text = match.group(1) url = match.group(2) placeholder = f"__MARKDOWN_PLACEHOLDER_{len(placeholders)}__" placeholders.append((link_type, alt_text, url)) return placeholder ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-01-15T10:38:04Z

api/core/rag/cleaner/clean_processor.py

+    def __init__(self, default_rules: dict | None = None):
+        """Initialize CleanProcessor with optional default rules."""
+        self.default_rules = default_rules or {}


style: __init__ added but default_rules is never used - the clean method is a classmethod that doesn't access instance attributes

Prompt To Fix With AI

This is a comment left during a code review. Path: api/core/rag/cleaner/clean_processor.py Line: 5:7 Comment: **style:** `__init__` added but `default_rules` is never used - the `clean` method is a classmethod that doesn't access instance attributes How can I resolve this? If you propose a fix, please make it concise.

fatelei and others added 4 commits December 30, 2025 20:01

fix: fix extractor loses hyperlink text and target URL

d146347

chore: add unittest

536eff1

[autofix.ci] apply automated fixes

2d2487c

Apply changes for benchmark PR

cf16f6c

greptile-apps bot reviewed Jan 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: [xxx](xxx) render as xxx](xxx) #81

fix: [xxx](xxx) render as xxx](xxx) #81

Uh oh!

tomerqodo commented Jan 15, 2026

Uh oh!

greptile-apps bot commented Jan 15, 2026

Uh oh!

greptile-apps bot left a comment

Uh oh!

greptile-apps bot Jan 15, 2026

Uh oh!

greptile-apps bot Jan 15, 2026

Uh oh!

greptile-apps bot Jan 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	markdown_image_pattern = r"!\[.*?\]\((https?://[^)]+)\)"
	markdown_image_pattern = r"!\[(.*?)\]\((https?://[^)]+)\)"

fix: [xxx](xxx) render as xxx](xxx) #81

Are you sure you want to change the base?

fix: [xxx](xxx) render as xxx](xxx) #81

Uh oh!

Conversation

tomerqodo commented Jan 15, 2026

Uh oh!

greptile-apps bot commented Jan 15, 2026

Greptile Summary

Confidence Score: 2/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants