@nattsw nattsw commented Nov 27, 2024

In a text with many emojis, the detected language is unfortunately always 'en'. Example text:

Ein sehr schöner Beitrag über diesen Uhrturm und einem schönen Bild dazu 😍👍🙏🏻

This PR strips img tags from the text before detecting its language. The img tags are still present in the text that is sent for translation.

Additional context

The Google Translate API does provide a way to skip certain tags, but only on the /translate endpoint, not on the detect endpoint.

The method is to add the translate="no" attribute to the respective HTML tags. The same applies to Microsoft Azure and Amazon. Unfortunately, this only works for /translate.

Because in a text with many emojis, the detected language
will unfortunately always be 'en'
Comment on lines +66 to +67
html_doc.css("img").remove
html_doc.to_html

This needs to be separate lines 😅

@nattsw nattsw merged commit e142757 into main Nov 27, 2024
3 checks passed
@nattsw nattsw deleted the strip-img-for-detect branch November 27, 2024 14:12