Use `beautifulsoup4` instead of `lxml` for URL previews #19301

clokep · 2025-12-11T13:42:43Z

Use beautifulsoup4 instead of lxml for URL previews. This offers some nicer APIs when parsing HTML and avoids using libxml, which is unmaintained.

I haven’t done a full regression against commonly previewed sites, but I expect this will give similar (or better) results.

beautifulsoup also handles decoding the charset for us, which means less custom code.
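To illustrate the charset point, here is a minimal sketch rather than the PR's code. It assumes the `html.parser` backend and a made-up page that declares its encoding in a `<meta charset>` tag. When given raw bytes, Beautiful Soup works out the encoding itself.

```python
from bs4 import BeautifulSoup

# Hypothetical page body, encoded as ISO-8859-1 and declaring that in <meta>.
raw_body = (
    '<html><head><meta charset="iso-8859-1">'
    "<title>caf\u00e9</title></head><body></body></html>"
).encode("iso-8859-1")

# Passing bytes (not str) lets Beautiful Soup detect and decode the charset.
soup = BeautifulSoup(raw_body, "html.parser")
decoded_title = soup.title.string

# The encoding Beautiful Soup settled on is exposed for debugging.
detected_encoding = soup.original_encoding
```

`decoded_title` is a normal `str`, so callers no longer need to decode the body before parsing.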

MadLittleMods · 2025-12-29T16:37:32Z

changelog.d/19301.misc

@@ -0,0 +1 @@
+Switch to beautifulsoup4 from lxml for URL previews. Contributed by @clokep.


Conflicts to resolve

MadLittleMods · 2025-12-29T16:38:02Z

changelog.d/19301.misc

@@ -0,0 +1 @@
+Switch to beautifulsoup4 from lxml for URL previews. Contributed by @clokep.


Linting/CI is not passing for typechecking ❌: https://github.com/element-hq/synapse/actions/runs/20170566919/job/57904924809?pr=19301

Yes, I can't seem to get the same setup locally. Did something change with how to install the pinned packages?

`poetry install --extras all` is what I use.

Things have changed behind the scenes but shouldn't affect how you install as a developer:

Switch the build backend from poetry-core to maturin #19234

Update pyproject.toml to be compatible with other standard Python packaging tools #19137

That's what I did. Maybe I'll try creating a new virtualenv. 🤔

docs/setup/installation.md

MadLittleMods · 2025-12-29T17:24:58Z

synapse/media/oembed.py

-        return None
+        tag = soup.find(
+            "link",
+            rel=("alternate", "alternative"),


Where can I find this syntax?

I've looked through https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree
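For reference, the call above combines two documented Beautiful Soup features: keyword arguments to `find()` filter on attributes, and a filter can be a list of acceptable values. Below is a minimal sketch with made-up HTML and the `html.parser` backend. It uses a list because that is the form the documentation describes; the tuple in the diff is left as written.

```python
from bs4 import BeautifulSoup

# Hypothetical <head> with an oEmbed discovery link.
html = """
<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/json+oembed"
      href="https://example.com/oembed?format=json">
</head></html>
"""

soup = BeautifulSoup(html, "html.parser")

# `rel=[...]` is an attribute filter: match a <link> whose rel is any of these.
tag = soup.find("link", rel=["alternate", "alternative"])
oembed_href = tag["href"] if tag is not None else None
```

The stylesheet link is skipped because its `rel` matches neither value.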

MadLittleMods · 2025-12-29T17:29:16Z

synapse/media/preview_html.py

-            ),
+        # Check microdata for an image.
+        meta_image = soup.find(
+            "meta", itemprop=re.compile("image", re.I), content=NON_BLANK


Why itemprop?

Seems like we could do the normal image=re.I

itemprop is the key, image is the value.

So it's more obvious, can you share some example HTML that we're parsing?
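A sketch of the kind of microdata markup this targets, using made-up HTML. `NON_BLANK` is not shown in the diff, so the definition here is an assumption (a regex that requires at least one character).

```python
import re

from bs4 import BeautifulSoup

# Assumption: a stand-in for the PR's NON_BLANK filter, which is not shown.
NON_BLANK = re.compile(r".+")

# Hypothetical page using microdata: `itemprop` is the attribute name (key),
# "image" is its value, and the URL lives in `content`.
html = """
<html><head>
<meta itemprop="name" content="Example article">
<meta itemprop="image" content="">
<meta itemprop="image" content="https://example.com/cover.png">
</head></html>
"""

soup = BeautifulSoup(html, "html.parser")
meta_image = soup.find(
    "meta", itemprop=re.compile("image", re.I), content=NON_BLANK
)
image_url = meta_image["content"] if meta_image is not None else None
```

The second `<meta>` has a matching `itemprop` but an empty `content`, so the `content` filter skips it.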

MadLittleMods · 2025-12-29T17:30:30Z

synapse/media/preview_html.py

+        title = soup.find(("title", "h1", "h2", "h3"), string=True)  # type: ignore[call-overload]
+        if title and title.string:
+            og["og:title"] = title.string.strip()


I assume we have tests to ensure this does the correct thing? string=True -> title.string and we end up with the title/heading content
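A sketch of what such a check could look like, not the PR's actual tests. `fallback_title` is a hypothetical helper wrapping the lines from the diff, and the HTML snippets are made up. `string=True` restricts the match to tags whose `.string` is set.

```python
from typing import Optional

from bs4 import BeautifulSoup


def fallback_title(html: str) -> Optional[str]:
    """Hypothetical helper mirroring the title/heading lookup in the diff."""
    soup = BeautifulSoup(html, "html.parser")
    # First <title>, <h1>, <h2>, or <h3> (in document order) with a string.
    title = soup.find(["title", "h1", "h2", "h3"], string=True)
    if title and title.string:
        return title.string.strip()
    return None


with_title = fallback_title(
    "<html><head><title>  Example page  </title></head>"
    "<body><h1>Heading</h1></body></html>"
)
without_title = fallback_title(
    "<html><body><p>intro</p><h1>Fallback heading</h1></body></html>"
)
no_candidates = fallback_title("<html><body><p>only a paragraph</p></body></html>")
```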

MadLittleMods · 2025-12-29T17:31:51Z

synapse/media/preview_html.py


-    if tree is None:
-        return
+    from bs4.element import NavigableString, Tag


Special reason for organizing the imports here?

Can we do it at the top like normal?

No, it is an optional dependency. This was the same for lxml.
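The function-local import pattern for an optional dependency looks roughly like the sketch below. `extract_title` is a hypothetical function, and returning `None` when the package is missing is an assumption for illustration, not necessarily what Synapse does.

```python
from typing import Optional


def extract_title(body: str) -> Optional[str]:
    """Hypothetical example of importing an optional dependency lazily."""
    try:
        # Imported here so the module still loads when bs4 is not installed.
        from bs4 import BeautifulSoup
    except ImportError:
        # Assumption: degrade to "no preview data" rather than crash.
        return None

    soup = BeautifulSoup(body, "html.parser")
    if soup.title and soup.title.string:
        return soup.title.string.strip()
    return None


result = extract_title("<html><head><title> Hi </title></head></html>")
```

A top-level import would make the whole module fail to import on installs without the URL preview extra.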

MadLittleMods · 2025-12-29T17:35:15Z

synapse/media/preview_html.py

-                if len(elements) > stack_limit:
-                    # We've hit our limit for working memory
-                    break


Why don't we care about this in the new implementation?
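For context, the removed lines guarded an explicit traversal stack. The sketch below shows that kind of guard on a made-up tree shape of `(text, children)` tuples. It is not the old Synapse code, and `iter_text` and its default limit are hypothetical.

```python
from typing import Iterator, List, Tuple

# Hypothetical node shape: (text, children).
Node = Tuple[str, List["Node"]]


def iter_text(root: Node, stack_limit: int = 10) -> Iterator[str]:
    """Depth-first walk with an explicit stack that gives up when it grows too large."""
    stack = [root]
    while stack:
        text, children = stack.pop()
        if text:
            yield text
        # Push children in reverse so they are visited in document order.
        stack.extend(reversed(children))
        if len(stack) > stack_limit:
            # Working-memory limit hit: stop instead of holding a huge stack.
            break


small_tree: Node = ("", [("a", []), ("b", [("c", [])])])
small_result = list(iter_text(small_tree))

# A very wide node pushes 50 children at once and trips the limit.
wide_tree: Node = ("", [(str(i), []) for i in range(50)])
wide_result = list(iter_text(wide_tree, stack_limit=10))
```

Whether the Beautiful Soup version needs an equivalent bound depends on how it walks the tree, which the diff excerpt does not show.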

clokep added 3 commits December 10, 2025 11:48

Use BeautifulSoup instead of LXML directly.

6dec726

Dont use lxml

8e9e333

Update docs

a24d251

clokep requested a review from a team as a code owner December 11, 2025 13:42

clokep and others added 4 commits December 11, 2025 08:44

Create 19301.misc

d5332b0

Merge remote-tracking branch 'upstream/develop' into bs4

18d2746

Lint fixes

3813537

Fix-up references

5940217

MadLittleMods added the A-URL-Preview label Dec 29, 2025

MadLittleMods changed the title ~~Use beautiulsoup4 instead of lxml for URL previews~~ Use beautifulsoup4 instead of lxml for URL previews Dec 29, 2025

MadLittleMods reviewed Dec 29, 2025


clokep added 3 commits January 16, 2026 09:22

Fix-up skipping tests

33f9a91

Merge remote-tracking branch 'upstream/develop' into bs4

9cebb21

Misc review comments

46d545e

