This repository was archived by the owner on Jul 22, 2025. It is now read-only.

Contributor

@nattsw nattsw commented Jul 2, 2025

Rather than relying on the prompt to tell the LLM to ignore quoted or captioned content, a more deterministic way to make sure it detects the correct language is to take the cooked HTML and remove the unwanted elements before detection.

In this PR

  • we remove quotes, image captions, etc. and use only the remaining text, falling back to the unmodified cooked HTML
  • we update the prompts related to detection and translation

/152465/12

@nattsw nattsw changed the title from "FIX: Ignore captions and quotes when detecting locale" to "FIX: Ignore captions and quotes when detecting locale and update prompts" Jul 2, 2025
Comment on lines 14 to 30
# quotes and blockquotes
doc.css("blockquote, aside.quote").remove
# image captions
doc.css(".lightbox-wrapper").remove

necessary = doc.text.strip

# oneboxes (external content)
doc.css("aside.onebox").remove
# code blocks
doc.css("code, pre").remove
# hashtags
doc.css("a.hashtag-cooked").remove
# emoji
doc.css("img.emoji").remove
# mentions
doc.css("a.mention").remove
Contributor

@jjaffeux jjaffeux Jul 2, 2025

I would group stuff in constants, to have only two remove calls.

I don't think the comments are adding much: # emoji => img.emoji. But if you still think they are important you can format like this:

NECESSARY_SELECTORS = [
  "aside.onebox", # oneboxes (external content)
  "code, pre", # code blocks
  ...
]

and then call: doc.css(*NECESSARY_SELECTORS).remove

Contributor

I would also add that necessary/preferred is not very clear. I don't really know why they are necessary or preferred; this is where we would need a comment, IMO.

@nattsw nattsw merged commit 2b9a4f9 into main Jul 3, 2025
6 checks passed
@nattsw nattsw deleted the ignore-cap-quo branch July 3, 2025 14:57