Strip TOC elements from article summaries by russellballestrini · Pull Request #3512 · getpelican/pelican

russellballestrini · 2025-10-13T14:10:12Z

https://russell.ballestrini.net/pelican-theme-upgrade-right-sidebar-toc/

Automatically remove table of contents divs and toc-backref anchor links from article summaries when they are displayed outside the full article context (e.g., on the homepage or in RSS feeds).

reStructuredText automatically generates anchor links in section headings when a table of contents directive is present. These anchors work on full article pages, but become broken links when article summaries appear on the homepage or in feeds, because the anchor targets don't exist in that context.

This change adds a strip_toc_elements_from_html() function in pelican/utils.py that uses regex to remove:

- TOC div blocks (<div class="contents">...</div>)
- toc-backref anchor links while preserving heading text

The function is called automatically in Content.get_summary() so all summaries are cleaned without requiring configuration or template changes.
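
As a rough illustration of the two removals described above, here is a minimal sketch. The patterns and the name strip_toc_elements_sketch are assumptions made for this example and are not the PR's actual code, which is not shown in this thread. It also assumes the TOC div contains no nested div elements.

```python
import re

# Hypothetical patterns, written for illustration only.
# Matches a div whose class attribute contains the word "contents".
# The non-greedy body stops at the first </div>, so nested divs are not handled.
TOC_DIV_RE = re.compile(
    r'<div\b[^>]*\bclass="[^"]*\bcontents\b[^"]*"[^>]*>.*?</div>',
    re.IGNORECASE | re.DOTALL,
)

# Matches an anchor whose class attribute contains "toc-backref" and
# captures the link text so it can be kept.
TOC_BACKREF_RE = re.compile(
    r'<a\b[^>]*\bclass="[^"]*\btoc-backref\b[^"]*"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def strip_toc_elements_sketch(html: str) -> str:
    """Drop TOC divs, then unwrap toc-backref anchors, keeping their text."""
    html = TOC_DIV_RE.sub("", html)
    return TOC_BACKREF_RE.sub(r"\1", html)
```

Applied to a heading such as `<h2><a class="toc-backref" href="#id1">Intro</a></h2>`, the sketch leaves `<h2>Intro</h2>`.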

Includes comprehensive unit tests covering various TOC formats, edge cases, and case-insensitive matching.

Pull Request Checklist

Ensured tests pass and (if applicable) updated functional test output
Conformed to code style guidelines by running appropriate linting tools
Added tests for changed code
Updated documentation for changed code

Automatically remove table of contents divs and toc-backref anchor links from article summaries when displayed outside full article context (e.g., on homepage, in RSS feeds).

ReStructuredText automatically generates anchor links in section headings when a table of contents directive is present. These anchors work perfectly on full article pages, but become broken links when article summaries appear on homepage or in feeds - the anchor targets don't exist in that context.

This change adds a strip_toc_elements_from_html() function in pelican/utils.py that uses regex to remove:

- TOC div blocks (<div class="contents">...</div>) containing broken navigation
- toc-backref anchor links from headings while preserving heading text

Both removals are necessary since TOC anchor targets don't exist in summary context.

The function is called automatically in Content.get_summary() so all summaries are cleaned without requiring configuration or template changes.

Includes comprehensive unit tests covering various TOC formats, edge cases, and case-insensitive matching.

justinmayer · 2026-03-27T07:32:31Z

Any @getpelican/reviewers have a moment to review this PR? Would be greatly appreciated 😊

(Apologies for the delay in reviewing, Russell.)

boxydog · 2026-03-27T12:50:21Z

Seems like it's fixing a real problem (broken links), that's good.

It smells a little funny to parse HTML with regexes instead of a parser. Still, I don't see a bug, because the output of docutils is fairly restrictive (no nested elements, classes in a deterministic order, etc.).
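
For comparison, a parser-based version could look roughly like the sketch below. It is only an illustration of the alternative mentioned above and is not part of the PR; TocStripper and strip_toc_with_parser are hypothetical names, and the sketch drops comments and doctype declarations for brevity.

```python
from html.parser import HTMLParser


class TocStripper(HTMLParser):
    """Re-emit HTML, dropping TOC divs and unwrapping toc-backref anchors."""

    def __init__(self):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0     # > 0 while inside a TOC div
        self.anchor_stack = []  # True for each open toc-backref anchor

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth:
            if tag == "div":
                self.skip_depth += 1  # track nested divs inside the TOC
            return
        if tag == "div" and "contents" in classes:
            self.skip_depth = 1
            return
        if tag == "a":
            is_backref = "toc-backref" in classes
            self.anchor_stack.append(is_backref)
            if is_backref:
                return  # drop the opening tag, keep the text
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth -= 1
            return
        if tag == "a" and self.anchor_stack and self.anchor_stack.pop():
            return  # closing tag of an unwrapped toc-backref anchor
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_toc_with_parser(html: str) -> str:
    parser = TocStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Unlike a non-greedy regex, this version counts nested divs, at the cost of noticeably more code for the same docutils output.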

I think there's no unit test for the get_summary change. Claude proposes the one below.

It fits the pattern of the existing summary tests — uses _copy_page_kwargs(), deletes the metadata["summary"] so the auto-generation path is exercised, and sets SUMMARY_MAX_LENGTH = None to test the full-content path (which was the restructured branch in the PR).

    def test_summary_strips_toc_elements(self):
        # TOC divs and toc-backref anchors should be removed from generated
        # summaries since the anchor targets don't exist outside full article context.
        page_kwargs = self._copy_page_kwargs()
        settings = get_settings()
        page_kwargs["settings"] = settings
        del page_kwargs["metadata"]["summary"]
        settings["SUMMARY_MAX_LENGTH"] = None

        toc_content = (
            '<div class="contents topic" id="contents">'
            "<p>Contents</p>"
            '<ul><li><a href="#intro">Intro</a></li></ul>'
            "</div>"
            '<h2><a class="toc-backref" href="#id1">Intro</a></h2>'
            "<p>Article body.</p>"
        )
        page_kwargs["content"] = toc_content
        page = Page(**page_kwargs)

        self.assertNotIn('<div class="contents', page.summary)
        self.assertNotIn("toc-backref", page.summary)
        self.assertIn("<h2>Intro</h2>", page.summary)
        self.assertIn("<p>Article body.</p>", page.summary)

russellballestrini force-pushed the strip-toc-anchors-from-summaries branch from a609cec to 3ccc6b2 Compare October 13, 2025 14:16

justinmayer requested a review from a team March 27, 2026 07:32

russellballestrini wants to merge 1 commit into getpelican:main from russellballestrini:strip-toc-anchors-from-summaries
