Strip TOC elements from article summaries#3512

Open
russellballestrini wants to merge 1 commit into getpelican:main from russellballestrini:strip-toc-anchors-from-summaries
@russellballestrini russellballestrini commented Oct 13, 2025

https://russell.ballestrini.net/pelican-theme-upgrade-right-sidebar-toc/

Automatically remove table-of-contents (TOC) divs and toc-backref anchor links from article summaries when they are displayed outside the full article context (e.g., on the homepage or in RSS feeds).

reStructuredText automatically generates anchor links in section headings when a table of contents directive is present. These anchors work on full article pages but become broken links when article summaries appear on the homepage or in feeds, because the anchor targets don't exist in that context.

This change adds a strip_toc_elements_from_html() function in pelican/utils.py that uses regex to remove:

  • TOC div blocks (<div class="contents">...</div>)
  • toc-backref anchor links while preserving heading text
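
For readers without the diff open, the two removals listed above could be sketched roughly as follows. This is only an illustration: the PR's actual regexes are not shown in this thread, so both patterns (and the names `TOC_DIV_RE` and `TOC_BACKREF_RE`) are assumptions, and the sketch relies on docutils emitting the TOC as a single, non-nested div.

```python
import re

# Hypothetical patterns; the real ones in pelican/utils.py may differ.
# Assumes the TOC is one non-nested <div> whose class attribute starts
# with "contents" (e.g. class="contents topic").
TOC_DIV_RE = re.compile(
    r'<div\s+class="contents[^"]*"[^>]*>.*?</div>',
    re.IGNORECASE | re.DOTALL,
)
# Captures the link text so the heading content can be kept.
TOC_BACKREF_RE = re.compile(
    r'<a\s+class="toc-backref"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def strip_toc_elements_from_html(html: str) -> str:
    """Remove TOC divs and unwrap toc-backref anchors, keeping heading text."""
    html = TOC_DIV_RE.sub("", html)          # drop the whole TOC block
    return TOC_BACKREF_RE.sub(r"\1", html)   # keep only the anchor's inner text
```

Because the div pattern is non-greedy and stops at the first closing div tag, it would cut a TOC short if that TOC ever contained a nested div.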

The function is called automatically in Content.get_summary() so all summaries are cleaned without requiring configuration or template changes.

Includes comprehensive unit tests covering various TOC formats, edge cases, and case-insensitive matching.

Pull Request Checklist

  • Ensured tests pass and (if applicable) updated functional test output
  • Conformed to code style guidelines by running appropriate linting tools
  • Added tests for changed code
  • Updated documentation for changed code

Automatically remove table of contents divs and toc-backref anchor links
from article summaries when displayed outside full article context
(e.g., on homepage, in RSS feeds).

ReStructuredText automatically generates anchor links in section headings
when a table of contents directive is present. These anchors work perfectly
on full article pages, but become broken links when article summaries appear
on homepage or in feeds - the anchor targets don't exist in that context.

This change adds a strip_toc_elements_from_html() function in pelican/utils.py
that uses regex to remove:
- TOC div blocks (<div class="contents">...</div>) containing broken navigation
- toc-backref anchor links from headings while preserving heading text

Both removals are necessary since TOC anchor targets don't exist in summary context.

The function is called automatically in Content.get_summary() so all
summaries are cleaned without requiring configuration or template changes.

Includes comprehensive unit tests covering various TOC formats, edge cases,
and case-insensitive matching.
@russellballestrini russellballestrini force-pushed the strip-toc-anchors-from-summaries branch from a609cec to 3ccc6b2 Compare October 13, 2025 14:16
@justinmayer
Member

Any @getpelican/reviewers have a moment to review this PR? Would be greatly appreciated 😊

(Apologies for the delay in reviewing, Russell.)

@justinmayer justinmayer requested a review from a team March 27, 2026 07:32
@boxydog
Contributor

boxydog commented Mar 27, 2026

Seems like it's fixing a real problem (broken links), that's good.

It smells a little funny to parse HTML with regexes instead of a parser. Still, I don't see a bug, because the output of docutils is fairly restrictive (no nested elements, classes in a deterministic order, etc.).
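
For comparison, a parser-based variant could look something like the sketch below, using only the standard library's `html.parser`. The class and function names (`TocStripper`, `strip_toc_with_parser`) are hypothetical and not part of the PR. It tracks div nesting depth, so it would still work if the TOC block ever contained nested divs, at the cost of more code than the regex approach.

```python
from html.parser import HTMLParser


class TocStripper(HTMLParser):
    """Re-emit HTML, dropping TOC divs and unwrapping toc-backref links.

    Hypothetical helper. Comments and doctype declarations are not
    re-emitted, which is assumed to be acceptable for summary fragments.
    """

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0      # > 0 while inside a TOC <div>
        self.anchor_stack = []   # one bool per open <a>: True if it was dropped

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth:
            if tag == "div":
                self.skip_depth += 1  # nested div inside the TOC
            return
        if tag == "div" and "contents" in classes:
            self.skip_depth = 1
            return
        if tag == "a":
            dropped = "toc-backref" in classes
            self.anchor_stack.append(dropped)
            if dropped:
                return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through untouched.
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth -= 1
            return
        if tag == "a" and self.anchor_stack and self.anchor_stack.pop():
            return  # closing tag of a dropped toc-backref anchor
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_toc_with_parser(html: str) -> str:
    parser = TocStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```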

I think there's no unit test for the get_summary change. Claude proposes the one below.

It fits the pattern of the existing summary tests: it uses _copy_page_kwargs(), deletes metadata["summary"] so the auto-generation path is exercised, and sets SUMMARY_MAX_LENGTH = None to test the full-content path (the branch this PR restructured).

    def test_summary_strips_toc_elements(self):
        # TOC divs and toc-backref anchors should be removed from generated
        # summaries since the anchor targets don't exist outside full article context.
        page_kwargs = self._copy_page_kwargs()
        settings = get_settings()
        page_kwargs["settings"] = settings
        del page_kwargs["metadata"]["summary"]
        settings["SUMMARY_MAX_LENGTH"] = None

        toc_content = (
            '<div class="contents topic" id="contents">'
            "<p>Contents</p>"
            '<ul><li><a href="#intro">Intro</a></li></ul>'
            "</div>"
            '<h2><a class="toc-backref" href="#id1">Intro</a></h2>'
            "<p>Article body.</p>"
        )
        page_kwargs["content"] = toc_content
        page = Page(**page_kwargs)

        self.assertNotIn('<div class="contents', page.summary)
        self.assertNotIn("toc-backref", page.summary)
        self.assertIn("<h2>Intro</h2>", page.summary)
        self.assertIn("<p>Article body.</p>", page.summary)
