Commit 0316afa

refactor: enhance HTML tag removal in text processing to exclude audio, video, and image tags
1 parent d0722dc commit 0316afa

1 file changed: +3 -0 lines changed

apps/common/utils/common.py

Lines changed: 3 additions & 0 deletions
@@ -116,6 +116,9 @@ def markdown_to_plain_text(md: str) -> str:
     text = re.sub(r'\n{2,}', '\n', text)
     # Strip all HTML tags with a regular expression
     text = re.sub(r'<[^>]+>', '', text)
+    # Remove specific media tags first (takes priority over the generic HTML tag removal)
+    text = re.sub(r'<(audio|video)[^>]*>.*?</\1>', '', text, flags=re.DOTALL)  # match audio/video tags
+    text = re.sub(r'<img[^>]*>', '', text)  # match image tags
     # Collapse redundant whitespace (including newlines, tabs, etc.)
     text = re.sub(r'\s+', ' ', text)
     # Remove form rendering
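
Review note: the added comment says the media rules should take priority over the generic tag removal, but in the hunk as shown they sit below the generic `<[^>]+>` substitution. If that is the actual order in the file, the generic pass will already have stripped the `<audio>`, `<video>`, and `<img>` tags, so the new patterns would have nothing left to match and any fallback text inside an audio or video element would remain. The sketch below compares the two orderings on a made-up sample. The helper names `strip_generic_first` and `strip_media_first`, the sample string, and the trailing `.strip()` are illustrative assumptions, not code from `apps/common/utils/common.py`.

```python
import re

# Patterns copied from the hunk above.
GENERIC_TAG = r'<[^>]+>'
MEDIA_BLOCK = r'<(audio|video)[^>]*>.*?</\1>'
IMG_TAG = r'<img[^>]*>'


def strip_generic_first(text: str) -> str:
    """Hypothetical helper: same order as the hunk (generic removal first)."""
    text = re.sub(GENERIC_TAG, '', text)
    # By now no tags remain, so the next two substitutions match nothing.
    text = re.sub(MEDIA_BLOCK, '', text, flags=re.DOTALL)
    text = re.sub(IMG_TAG, '', text)
    return re.sub(r'\s+', ' ', text).strip()


def strip_media_first(text: str) -> str:
    """Hypothetical helper: media elements removed before the generic pass."""
    # Drops the whole audio/video element, including its fallback text.
    text = re.sub(MEDIA_BLOCK, '', text, flags=re.DOTALL)
    text = re.sub(IMG_TAG, '', text)
    text = re.sub(GENERIC_TAG, '', text)
    return re.sub(r'\s+', ' ', text).strip()


# Made-up input for illustration only.
sample = (
    'Intro <audio controls>Your browser does not support audio.</audio> '
    'outro <img src="a.png">'
)

print(strip_generic_first(sample))
print(strip_media_first(sample))
```

With this sample, the first ordering keeps the audio fallback sentence and the second drops it. If dropping that inner text is the intent of the commit, moving the three added lines above the generic substitution would be one way to get there.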
