@tomerqodo

Benchmark PR langgenius#30392

Type: Clean (correct implementation)

Original PR Title: fix: xxx render as xxx](xxx)
Original PR Description: > [!IMPORTANT]

  1. Make sure you have read our contribution guidelines
  2. Ensure there is an associated issue and you have been assigned to it
  3. Use the correct syntax to link this PR: Fixes #<issue number>.

Summary

fix langgenius#30402

  1. clean_processor.py - Fixed URL removal to preserve markdown links and images

The previous implementation only protected markdown image URLs (![alt](url)) during URL removal, which caused:

  • Regular markdown links ([text](url)) to have their URL part stripped
  • Link text (which might itself be a URL) to be removed

Fix:

  • Now protects both markdown links and images with placeholders before URL removal
  • Changed placeholder format from MARKDOWN_IMAGE_URL_N to generic MARKDOWN_PLACEHOLDER_N
  • Added tuple structure for placeholders: (link_type, text/url, url) to distinguish links from images
  • Changed URL regex from https?://[^\s)]+ to https?://\S+ so that a match no longer stops at a closing parenthesis and a URL containing ")" is removed whole
  • Restores links as [text](url) and images as ![alt](url) after cleanup
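
The protect, strip, restore sequence listed above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the function name remove_urls_keep_markdown, the exact placeholder string, and the markdown pattern are assumptions made for the example.

```python
import re

# Hypothetical patterns for illustration; the PR's actual patterns may differ.
# Group 1 is "!" for images, group 2 the text/alt, group 3 the URL.
_MARKDOWN_PATTERN = re.compile(r"(!?)\[([^\]]*)\]\((https?://[^)\s]+)\)")
_URL_PATTERN = re.compile(r"https?://\S+")


def remove_urls_keep_markdown(text: str) -> str:
    """Strip bare URLs while leaving markdown links and images intact."""
    placeholders: list[tuple[str, str, str]] = []

    def protect(match: re.Match) -> str:
        # Record (link_type, text, url) and leave a token behind.
        link_type = "image" if match.group(1) else "link"
        placeholders.append((link_type, match.group(2), match.group(3)))
        return f"__MARKDOWN_PLACEHOLDER_{len(placeholders) - 1}__"

    protected = _MARKDOWN_PATTERN.sub(protect, text)

    # Only bare URLs are left for the URL pattern to match at this point.
    cleaned = _URL_PATTERN.sub("", protected)

    # Put every link and image back in its original markdown form.
    for index, (link_type, label, url) in enumerate(placeholders):
        prefix = "!" if link_type == "image" else ""
        cleaned = cleaned.replace(
            f"__MARKDOWN_PLACEHOLDER_{index}__", f"{prefix}[{label}]({url})"
        )
    return cleaned
```

One caveat of this sketch: because https?://\S+ consumes every non-whitespace character, a bare URL written directly against a link with no space in between would swallow the placeholder as well.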
  2. text_processing_utils.py - Preserve leading markdown links

Added check to prevent remove_leading_symbols() from stripping markdown links at the start of text:

markdown_link_pattern = r'^\[([^\]]+)\]\((https?://[^)]+)\)'
if re.match(markdown_link_pattern, text):
    return text

This ensures that a markdown link at the beginning of the text, such as one labeled OpenAI, is preserved.
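
As a rough sketch of how that guard might sit inside the function: the leading-symbol pattern below is a simplified stand-in, since the source does not show the real one, and the example URL is illustrative only.

```python
import re

# Pattern quoted in the PR description: a markdown link at the very start.
_MARKDOWN_LINK_AT_START = re.compile(r"^\[([^\]]+)\]\((https?://[^)]+)\)")

# Assumption: a simplified stand-in for the real leading-symbol pattern.
_LEADING_SYMBOLS = re.compile(r"^[\s\[\](){}<>!#*\-_.,:;]+")


def remove_leading_symbols(text: str) -> str:
    """Strip leading punctuation unless the text opens with a markdown link."""
    if _MARKDOWN_LINK_AT_START.match(text):
        # Stripping "[" here would break the link, so return the text as-is.
        return text
    return _LEADING_SYMBOLS.sub("", text)
```

With this guard, "[Example](https://example.com) docs" is returned unchanged, while "...hello" still becomes "hello".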

Screenshots

before: Screen.Recording.2025-12-30.at.22.31.16.mov

after: Screen.Recording.2025-12-30.at.20.18.49.mov

Checklist

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

Original PR URL: langgenius#30392
