-
Notifications
You must be signed in to change notification settings - Fork 5k
Sanitize image markdown in sources #2765
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes an issue where DocFX-style ::: image
markdown directives were being incorrectly processed by Prompty's image extraction functionality. The fix sanitizes these directives by escaping triple colons to HTML entities.
- Sanitizes DocFX image directives by converting triple colons to HTML entities (
:::
) - Refactors text cleaning logic to handle both newline normalization and directive escaping
- Adds comprehensive test coverage for the sanitization functionality
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
app/backend/approaches/approach.py | Implements sanitization logic to escape triple colons in source content |
tests/test_chatapproach.py | Adds test to verify DocFX image directive sanitization works correctly |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
s = s.replace("\n", " ").replace("\r", " ") # normalize newlines to spaces | ||
s = s.replace(":::", ":::") # escape DocFX/markdown triple colons | ||
return s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The function modifies and reassigns the same variable 's' multiple times. Consider using a more functional approach by chaining the operations or using intermediate variables for better readability.
s = s.replace("\n", " ").replace("\r", " ") # normalize newlines to spaces | |
s = s.replace(":::", ":::") # escape DocFX/markdown triple colons | |
return s | |
normalized = s.replace("\n", " ").replace("\r", " ") # normalize newlines to spaces | |
escaped = normalized.replace(":::", ":::") # escape DocFX/markdown triple colons | |
return escaped |
Copilot uses AI. Check for mistakes.
image_directive = ( | ||
"activator-introduction.md#page=1: Intro text before image. " | ||
':::image type="content" source="./media/activator-introduction/activator.png" ' | ||
'alt-text="Diagram that shows the architecture of Fabric Activator."::: More text after image.' | ||
) | ||
|
||
async def build_sources(): | ||
return await chat_approach.get_sources_content( | ||
[ | ||
Document( | ||
id="doc1", | ||
content=image_directive.split(": ", 1)[1], | ||
sourcepage="activator-introduction.md#page=1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The test string contains a magic value that mixes citation format with content. Consider extracting the citation prefix and content into separate variables to make the test structure clearer and more maintainable.
image_directive = ( | |
"activator-introduction.md#page=1: Intro text before image. " | |
':::image type="content" source="./media/activator-introduction/activator.png" ' | |
'alt-text="Diagram that shows the architecture of Fabric Activator."::: More text after image.' | |
) | |
async def build_sources(): | |
return await chat_approach.get_sources_content( | |
[ | |
Document( | |
id="doc1", | |
content=image_directive.split(": ", 1)[1], | |
sourcepage="activator-introduction.md#page=1", | |
citation_prefix = "activator-introduction.md#page=1" | |
image_content = ( | |
"Intro text before image. " | |
':::image type="content" source="./media/activator-introduction/activator.png" ' | |
'alt-text="Diagram that shows the architecture of Fabric Activator."::: More text after image.' | |
) | |
image_directive = f"{citation_prefix}: {image_content}" | |
async def build_sources(): | |
return await chat_approach.get_sources_content( | |
[ | |
Document( | |
id="doc1", | |
content=image_content, | |
sourcepage=citation_prefix, |
Copilot uses AI. Check for mistakes.
Purpose
A developer ingested markdown files that used :::image inline images, and Prompty attempted to extract them and treat them as actual images. This PR escapes them, by turning the colons into HTML entities, which seems to avoid the Prompty auto extraction.
Also added a test based off their file.
Does this introduce a breaking change?
When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.
Does this require changes to learn.microsoft.com docs?
This repository is referenced by this tutorial
which includes deployment, settings and usage instructions. If text or screenshot need to change in the tutorial,
check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.
Type of change
Code quality checklist
See CONTRIBUTING.md for more details.
python -m pytest
).python -m pytest --cov
to verify 100% coverage of added linespython -m mypy
to check for type errorsruff
andblack
manually on my code.