feat(gmail): Preserve links when returning plain text body #312

@durandom

Problem

When retrieving emails, links in HTML are lost because _format_body_content() prefers plain text over HTML. Many emails (newsletters, notifications, automated summaries) have readable plain text but critical links only in HTML.

Example: an email contains <a href="https://docs.google.com/...">Open document</a> in its HTML part, but the plain text part just says "Open document", so the URL is lost.

Proposed Solution

Extract links from HTML and append them to the output, regardless of which body format is returned:

def _extract_links_from_html(html: str) -> list[str]:
    # Extract href URLs from anchor tags
    ...

def _format_body_content(text_body: str, html_body: str) -> str:
    links = _extract_links_from_html(html_body) if html_body else []
    # ... existing logic ...
    if links:
        result += "\n\n[Links]\n" + "\n".join(links)
    return result

Related

On BeautifulSoup

PR #247 proposes adding BeautifulSoup. Given the project already has substantial dependencies (fastapi, google-api-python-client, etc.), BS4 seems reasonable for robust HTML handling. It would benefit this feature and potentially others. Alternatively, link extraction can be done with regex if keeping deps minimal is a priority.
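
To make the regex alternative concrete, a minimal sketch might look like the following. The function name extract_links_regex and the pattern are hypothetical, and the pattern is deliberately narrow: it only handles quoted href values and can be fooled by unusual markup (for example, attributes such as data-href), which is the usual argument for a real HTML parser.

```python
import html
import re

# Matches href="..." or href='...' inside an opening <a ...> tag.
# Unquoted href values are intentionally not handled in this sketch.
_HREF_RE = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)


def extract_links_regex(html_body: str) -> list[str]:
    """Return unique href values from anchor tags, in document order."""
    seen: dict[str, None] = {}  # dict preserves insertion order
    for match in _HREF_RE.finditer(html_body):
        # Decode entities such as &amp; so the URL is usable as-is.
        url = html.unescape(match.group(2)).strip()
        if url:
            seen.setdefault(url, None)
    return list(seen)
```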

Happy to submit a PR either way.
