@clokep commented Dec 11, 2025

Use beautifulsoup4 instead of lxml for URL previews. This offers some nicer APIs when parsing HTML and avoids using libxml, which is unmaintained.

I haven’t done a full regression against commonly previewed sites, but I expect this will give similar (or better) results.

beautifulsoup4 also handles decoding the charset for us, which means less custom code.
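
As a rough illustration of the charset point, here is a minimal sketch of Beautiful Soup decoding raw bytes by itself. The response body is made up for the example, and the stdlib `html.parser` backend is an assumption; the PR may use a different parser.

```python
from bs4 import BeautifulSoup

# Hypothetical response body: Latin-1 bytes with the charset declared in a <meta> tag.
body = (
    b'<html><head><meta charset="iso-8859-1">'
    b"<title>caf\xe9</title></head><body></body></html>"
)

# Passing raw bytes (not str) lets Beautiful Soup choose the encoding itself.
soup = BeautifulSoup(body, "html.parser")
title = soup.title.string
```

If the encoding is already known from the HTTP headers, the `from_encoding` argument of `BeautifulSoup` can be used to pass it in as a hint.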

@clokep requested a review from a team as a code owner December 11, 2025 13:42
@MadLittleMods changed the title from "Use beautiulsoup4 instead of lxml for URL previews" to "Use beautifulsoup4 instead of lxml for URL previews" Dec 29, 2025
@@ -0,0 +1 @@
Switch to beautifulsoup4 from lxml for URL previews. Contributed by @clokep.
No newline at end of file
Contributor

Conflicts to resolve

Contributor Author

Yes, I can't seem to get the same setup locally. Did something change with how to install the pinned packages?

Contributor

`poetry install --extras all` is what I use.

Things have changed behind the scenes but shouldn't affect how you install as a developer:

Contributor Author

That's what I did. Maybe I'll try creating a new virtualenv. 🤔

    return None
tag = soup.find(
    "link",
    rel=("alternate", "alternative"),
Contributor

Where can I find this syntax?

I've looked through https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree
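
For reference, the Beautiful Soup documentation covers this in two places: "The keyword arguments" (a keyword such as `rel=` filters on the attribute of that name) and "Kinds of filters" (a list matches if any of its items matches). The sketch below uses a list, which is the documented form; that the tuple in the diff behaves the same way is an assumption. The oEmbed-style markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one unrelated <link> and one discovery <link>.
html = (
    '<head><link rel="stylesheet" href="/style.css">'
    '<link rel="alternate" type="application/json+oembed" '
    'href="https://example.com/oembed?format=json"></head>'
)
soup = BeautifulSoup(html, "html.parser")

# A list filter matches if any listed value matches. Because `rel` is a
# multi-valued attribute, each of its whitespace-separated values is checked.
tag = soup.find("link", rel=["alternate", "alternative"])
href = tag["href"] if tag is not None else None
```

Note that `tag["rel"]` comes back as a list of values, not as a single string, since Beautiful Soup treats `rel` as multi-valued.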

),
# Check microdata for an image.
meta_image = soup.find(
    "meta", itemprop=re.compile("image", re.I), content=NON_BLANK
Contributor

Why itemprop?

Seems like we could do the normal image=re.I

Contributor Author

itemprop is the key, image is the value.

Contributor

So it's more obvious, can you share some example HTML that we're parsing?
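
To make the key/value point concrete, here is a minimal sketch with made-up microdata markup. `NON_BLANK` is defined here as a simple regex purely as an assumption; its real definition in the PR is not shown in this thread.

```python
import re

from bs4 import BeautifulSoup

# Assumption: NON_BLANK stands in for "attribute present and non-empty".
NON_BLANK = re.compile(r".+")

# Hypothetical microdata: the second tag has an empty `content` and is skipped.
html = (
    '<meta itemprop="name" content="Example page">'
    '<meta itemprop="image" content="">'
    '<meta itemprop="Image" content="https://example.com/logo.png">'
)
soup = BeautifulSoup(html, "html.parser")

# Keyword arguments name attributes: `itemprop` is the attribute (the key) and
# the case-insensitive regex is matched against its value ("image").
meta_image = soup.find("meta", itemprop=re.compile("image", re.I), content=NON_BLANK)
image_url = meta_image["content"] if meta_image is not None else None
```

By the same rule, writing `image=...` would look for an attribute literally named `image`, which this markup does not have.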

Comment on lines +204 to +206
title = soup.find(("title", "h1", "h2", "h3"), string=True)  # type: ignore[call-overload]
if title and title.string:
    og["og:title"] = title.string.strip()
Contributor

I assume we have tests to ensure this does the correct thing? `string=True` -> `title.string`, and we end up with the title/heading content.
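
A quick way to check that behaviour is a small standalone sketch like the one below. `first_title` is a hypothetical helper, not code from the PR, and it passes the tag names as a list (the documented form) instead of the tuple used in the diff.

```python
from typing import Optional

from bs4 import BeautifulSoup


def first_title(html: str) -> Optional[str]:
    """Hypothetical helper mirroring the lookup quoted above."""
    soup = BeautifulSoup(html, "html.parser")
    # string=True keeps only tags whose `.string` is not None, i.e. tags with
    # exactly one text child (directly or through a single nested tag).
    tag = soup.find(["title", "h1", "h2", "h3"], string=True)
    if tag and tag.string:
        return tag.string.strip()
    return None


plain = first_title("<title> My Page </title><h1>Heading</h1>")
empty_title = first_title("<title></title><h1>Heading</h1>")
mixed_h1 = first_title("<h1>Hello <b>world</b></h1><h2>Sub</h2>")
```

Two things worth noting: `find` returns the first match in document order, not in the order the names are listed, and a heading with mixed children (text plus a nested tag) has `.string` set to `None`, so it is skipped.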


if tree is None:
    return
from bs4.element import NavigableString, Tag
Contributor

Special reason for organizing the imports here?

Can we do it at the top like normal?

Contributor Author

No, it is an optional dependency. This was the same for lxml.
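
For readers unfamiliar with the pattern, this is roughly what a function-local import for an optional dependency looks like. `parse_preview_body` is a hypothetical name, and returning `None` when the package is missing is an assumption made for the sketch; the PR may handle that case differently.

```python
def parse_preview_body(body: bytes):
    """Hypothetical helper: parse HTML only when bs4 is installed."""
    try:
        # Deferred import: importing this module must not fail when the
        # optional dependency is absent.
        from bs4 import BeautifulSoup
    except ImportError:
        return None
    return BeautifulSoup(body, "html.parser")


result = parse_preview_body(b"<p>hi</p>")
```

A top-level import would instead raise `ImportError` as soon as the module is loaded, even for deployments that never use URL previews.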

Comment on lines -482 to -484
if len(elements) > stack_limit:
    # We've hit our limit for working memory
    break
Contributor

Why don't we care about this in the new implementation?
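
For context on what such a guard does in general, here is a sketch of an iterative traversal with a capped explicit stack. It is not the removed Synapse code: the dict-based tree, `iter_text`, and the default limit are all hypothetical.

```python
from typing import Iterator


def iter_text(root: dict, stack_limit: int = 1000) -> Iterator[str]:
    """Hypothetical: yield text from a nested {"text", "children"} tree."""
    elements = [root]  # explicit stack of nodes still to visit
    while elements:
        node = elements.pop()
        if node.get("text"):
            yield node["text"]
        # Push children reversed so they are popped in document order.
        elements.extend(reversed(node.get("children", [])))
        if len(elements) > stack_limit:
            # Working-memory cap reached: stop early with a partial result.
            break


tree = {"text": "a", "children": [{"text": "b"}, {"text": "c"}]}
full = list(iter_text(tree))
capped = list(iter_text(tree, stack_limit=1))
```

The cap bounds memory on pathological documents at the cost of truncated output, which is the trade-off the question above is asking about.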
