-
Notifications
You must be signed in to change notification settings - Fork 479
Use beautifulsoup4 instead of lxml for URL previews
#19301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Changes from 7 commits
6dec726
8e9e333
a24d251
d5332b0
18d2746
3813537
5940217
33f9a91
9cebb21
46d545e
703c39a
facf571
fe98013
bde51a6
9756733
1a13064
f82decb
f820b75
8010eb1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| Switch to beautofulsoup4 from lxml for URL previews. Controbuted by @clokep. | ||
clokep marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
clokep marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -25,12 +25,12 @@ | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| import attr | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| from synapse.media.preview_html import parse_html_description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| from synapse.media.preview_html import NON_BLANK, decode_body, parse_html_description | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| from synapse.types import JsonDict | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| from synapse.util.json import json_decoder | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| if TYPE_CHECKING: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| from lxml import etree | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| from bs4 import BeautifulSoup | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| from synapse.server import HomeServer | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
@@ -105,35 +105,25 @@ def get_oembed_url(self, url: str) -> str | None: | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| # No match. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| return None | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| def autodiscover_from_html(self, tree: "etree._Element") -> str | None: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| def autodiscover_from_html(self, soup: "BeautifulSoup") -> str | None: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| """ | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Search an HTML document for oEmbed autodiscovery information. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Args: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| tree: The parsed HTML body. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| soup: The parsed HTML body. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Returns: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| The URL to use for oEmbed information, or None if no URL was found. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| """ | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| # Search for link elements with the proper rel and type attributes. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| # Cast: the type returned by xpath depends on the xpath expression: mypy can't deduce this. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| for tag in cast( | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| list["etree._Element"], | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| tree.xpath("//link[@rel='alternate'][@type='application/json+oembed']"), | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ): | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| if "href" in tag.attrib: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| return cast(str, tag.attrib["href"]) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| # Some providers (e.g. Flickr) use alternative instead of alternate. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
clokep marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| # Cast: the type returned by xpath depends on the xpath expression: mypy can't deduce this. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| for tag in cast( | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| list["etree._Element"], | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| tree.xpath("//link[@rel='alternative'][@type='application/json+oembed']"), | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ): | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| if "href" in tag.attrib: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| return cast(str, tag.attrib["href"]) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| return None | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| tag = soup.find( | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| "link", | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| rel=("alternate", "alternative"), | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
MadLittleMods marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| type="application/json+oembed", | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| href=NON_BLANK, | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| return cast(str, tag["href"]) if tag else None | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| return cast(str, tag["href"]) if tag else None | |
| tag = soup.find( | |
| "link", | |
| rel=("alternate", "alternative"), | |
| type="application/json+oembed", | |
| ) | |
| href = tag.get("href") | |
| return cast(str, href) if href else None |
Prior art but ideally, we'd do even better:
| return cast(str, tag["href"]) if tag else None | |
| tags = soup.find_all( | |
| "link", | |
| rel=("alternate", "alternative"), | |
| type="application/json+oembed", | |
| ) | |
| if len(tags) == 0: | |
| return None | |
| elif len(tags) == 1: | |
| tag = tags[0] | |
| href = tag.get("href") | |
| return cast(str, href) if href else None | |
| else: | |
| # If there are multiple tags, return an error. We don't want to even | |
| # try to pick the right one if there are multiple as we could run into | |
| # problems similar to request smuggling vulnerabilities which rely on the | |
| # mismatch of how different systems interpret information. | |
| raise ValueError( | |
| 'Expected one `<link type="application/json+oembed">` but found multiple.' | |
| ) |
Also applies elsewhere below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is correct -- we choose the first one, not give up if there are more than one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how this is safer or what assumptions the original code makes, can you clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous code uses tag["href"] which we assume works because of href=NON_BLANK, (tenuous separate constraint). Without that, href might not exist and would result in KeyError or None and then we're casing to str.
This is a Parse, don't validate type of situation. Ideally, the returned type could guarantee us what we just validated. But instead we get back a generic type with all of that information lost.
(there are other NON_BLANK usages like this as well)
I don't think this is correct -- we choose the first one, not give up if there are more than one.
Is it normal to have multiple tags on the page? If so, searching for the first one may be fine. We should comment about our reasoning.
If a page should have one, we should be more strict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous code uses
tag["href"]which we assume works because ofhref=NON_BLANK,(tenuous separate constraint). Without that,hrefmight not exist and would result inKeyErrororNoneand then we're casing tostr.This is a Parse, don't validate type of situation. Ideally, the returned type could guarantee us what we just validated. But instead we get back a generic type with all of that information lost.
(there are other
NON_BLANKusages like this as well)
I'm honestly not following this at all, sorry. What part do you dislike? Is it the cast to string? Or that we're assuming href is a property (which we already checked above with the filter)?
I don't think this is correct -- we choose the first one, not give up if there are more than one.
Is it normal to have multiple tags on the page? If so, searching for the first one may be fine. We should comment about our reasoning.
It is perfectly acceptable to have more than one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the best that we can do?
We can use safer lookups (tag.get("href")) and check things as my suggestion does 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a helper method that will hopefully help?
I understand what you're saying with unwrap, but it isn't helpful to discuss Rust semantics when writing Python code.
And even so, I don't think you'd want something as specific as a TagWithHref class, HTML is pretty generic so that doesn't make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the best that we can do?
We can use safer lookups (
tag.get("href")) and check things as my suggestion does 🤷
But we already know that it exists? This would be akin to writing:
my_dict: dict[str, Foo]
if "my_key" in my_dict:
my_value = my_dict.get("my_key")It makes the value ambiguous when it isn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a helper method that will hopefully help?
The get_attribute(...) helper looks like it's addressing something else (single vs multi-valued attributes).
But we already know that it exists?
In this case, the functions are small so the cognitive load to keep things straight (even into the future) isn't that bad (smoke).
In Python, we see this kind of thing (making reasonable assumptions based on prior validation) because it easily passes by (regardless of how well those assumptions are formulated). My preference is on being defensive and asserting our assumptions especially to better protect things as they evolve into the future.
⏩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a helper method that will hopefully help?
The
get_attribute(...)helper looks like it's addressing something else (single vs multi-valued attributes).
It is addressing why there were so many cast(...) calls which I thought is what you found confusing, but I guess I'm still not understanding.
Uh oh!
There was an error while loading. Please reload this page.