Problem
When retrieving emails, links in HTML are lost because _format_body_content() prefers plain text over HTML. Many emails (newsletters, notifications, automated summaries) have readable plain text but critical links only in HTML.
Example: the HTML part contains <a href="https://docs.google.com/...">Open document</a>, but the plain text just says "Open document", so the URL is lost.
Proposed Solution
Extract links from HTML and append them to the output, regardless of which body format is returned:
def _extract_links_from_html(html: str) -> list[str]:
    # Extract href URLs from anchor tags
    ...

def _format_body_content(text_body: str, html_body: str) -> str:
    links = _extract_links_from_html(html_body) if html_body else []
    # ... existing logic ...
    if links:
        result += "\n\n[Links]\n" + "\n".join(links)
    return result
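
For reference, the elided helper could be filled in without any new dependency using the standard library's html.parser. This is only a sketch of one possible implementation; the _LinkCollector class and the de-duplication step are illustrative, not existing project code:

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects href values from <a> tags as they are parsed."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs) -> None:
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def _extract_links_from_html(html: str) -> list[str]:
    """Return href URLs from anchor tags, de-duplicated, in document order."""
    collector = _LinkCollector()
    collector.feed(html)
    return list(dict.fromkeys(collector.links))
```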
Related
- PR #247 (fix: handle HTML-only emails with useless text/plain fallback) addresses the "useless" plain-text fallback but uses get_text(), which still loses links
- PR #168 (feat(gmail): add HTML body extraction with text fallback) added an HTML fallback, but only when the plain text is empty
- Issue #118 (Gmail tool fails on HTML parsing and lacks option for raw HTML) was closed, but the core "links lost" problem persists
On BeautifulSoup
PR #247 proposes adding BeautifulSoup. Given the project already has substantial dependencies (fastapi, google-api-python-client, etc.), BS4 seems reasonable for robust HTML handling. It would benefit this feature and potentially others. Alternatively, link extraction can be done with regex if keeping deps minimal is a priority.
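
If BS4 is adopted, the helper becomes very small. A minimal sketch, assuming bs4 is available and that only anchors actually carrying an href should be collected:

```python
from bs4 import BeautifulSoup


def _extract_links_from_html(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    # find_all with href=True skips anchors that lack an href attribute
    return [a["href"] for a in soup.find_all("a", href=True)]
```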
Happy to submit a PR either way.