-
Notifications
You must be signed in to change notification settings - Fork 515
[FIX] escape special characters ( < > & ) to HTML context #1703
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIX] escape special characters ( < > & ) to HTML context #1703
Conversation
CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit a34ba0f...:
Your PR breaks these cases:
NOTE: The following tests have been failing on the master branch as well as the PR:
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you). Check the result page for more info. |
CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit a34ba0f...:
Your PR breaks these cases:
NOTE: The following tests have been failing on the master branch as well as the PR:
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you). Check the result page for more info. |
cfsmp3
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution and for thinking about edge cases in subtitle output!
After researching the SRT format specification and testing player behavior, I have concerns about this change:
The SRT Format Does Not Require HTML Entity Escaping
SRT supports a limited set of HTML-like formatting tags (<b>, <i>, <u>, <font>), but this does not mean SRT content should be treated as HTML requiring entity escaping.
From Wikipedia's SubRip article:
"Unofficially the format has very basic text formatting... Formatting is derived from HTML tags"
The key word is "unofficially" - SRT has no formal specification, and the HTML-like tags are just for formatting, not full HTML parsing.
Most Players Display Entities Literally
The bigger problem is that most SRT players (VLC, older Kodi, etc.) do NOT decode HTML entities. If we escape characters:
| Original | Escaped | What Players Show |
|---|---|---|
Tom & Jerry |
Tom & Jerry |
"Tom & Jerry" |
x < y |
x < y |
"x < y" |
This would break subtitle display for most users.
Kodi only added entity decoding in v18.7, and that was specifically for WebVTT subtitles from Netflix, not SRT.
WebVTT vs SRT
You may be thinking of WebVTT, which DOES require escaping per its specification. But CCExtractor has separate handling for WebVTT output.
Potential Buffer Issues
The ascii_to_html() function can expand 1 character to 5 (&), but the calling code may not have allocated sufficient buffer space for the expansion.
If there's a specific use case where you need HTML-escaped output, we could consider:
- Adding escaping only for WebVTT output format
- Adding an optional
--escape-html-entitiesflag
Could you share more about the specific problem you encountered that led to this PR?
|
Hi @cfsmp3, Thanks for putting effort to review my changes! My use case was feeding the SRT files to Anyway, I will close this pull request. Thanks for your input here! |
In raising this pull request, I confirm the following (please check boxes):
My familiarity with the project is as follows (check one):
The
<,>,&is not suppose to be appeared in SRT, because SRT supports HTML tags.So, the content of SRT should be treated as HTML content with those characters being escaped.
As a result, this pull request intends to do the following:
&to&<to<>to>