gastmaier

Purpose

Follow-up to #13717. This inverts the logic: instead of patching the toctree to yield "#id1" instead of "#document-path/to#id1", the section ids are prefixed with the docname, which solves the non-unique ids in singlehtml.
It also allows removing post-Sphinx transforms like in here

Top level overview of current behavior

  • ID collision is resolved per doc (#already-used -> #id1, #already-used -> #id2).
  • There is no ID collision resolution in the singlehtml step.
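
To make the two points above concrete, here is a minimal sketch of how ids that are unique per document can still collide once the documents are flattened into one page. The `assign_ids` helper and the sample documents are hypothetical; it only approximates how docutils falls back to `id1`, `id2`, and so on, and is not Sphinx or docutils code.

```python
from collections import Counter


def assign_ids(titles):
    """Assign ids within ONE document (simplified stand-in for docutils).

    A title that would repeat an existing id falls back to id1, id2, ...
    """
    used, out, counter = set(), [], 0
    for title in titles:
        candidate = title.lower().replace(' ', '-')
        while candidate in used:
            counter += 1
            candidate = f'id{counter}'
        used.add(candidate)
        out.append(candidate)
    return out


# Two hypothetical documents with the same section titles.
docs = {'foo': ['Intro', 'Intro'], 'bar': ['Intro', 'Intro']}
per_doc = {name: assign_ids(titles) for name, titles in docs.items()}

# Flattening into a single page drops the docname, so ids collide.
flat = [i for ids in per_doc.values() for i in ids]
duplicates = sorted(i for i, n in Counter(flat).items() if n > 1)

# Prefixing each id with its docname restores uniqueness.
prefixed = [f'document-{name}#{i}' for name, ids in per_doc.items() for i in ids]
```

Each document on its own has no duplicates, but `flat` does; `prefixed` is unique again because the docname is unique.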

Approach taken

Based on the LaTeX builder solution.
The sphinx/writers/latex.py#hypertarget[withdoc=True] method prefixes the docutils id with the docname.
In my implementation I edit ids['0'] directly so that I do not have to override the whole visit_section method, but I understand if I am asked to leave the tree unmodified and override the method instead.

On the format #document-test/extra#id1

It is compatible with HTML anchoring and with CSS and JavaScript selectors, but requires escaping:

#document-test\/extra\#test {color: #f00;}
document.querySelector('#document-test\\/extra\\#test')
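
The escaping can also be produced programmatically. The sketch below is a simplified Python helper (`css_escape_id` is a hypothetical name, not part of Sphinx) that backslash-escapes every character other than ASCII letters, digits, `-` and `_`. It does not cover every rule of the browser's `CSS.escape`, for example ids that start with a digit.

```python
import re


def css_escape_id(element_id: str) -> str:
    """Backslash-escape characters that are special in a CSS id selector.

    Simplified: anything that is not an ASCII letter, digit, '-' or '_'
    gets a leading backslash.
    """
    return re.sub(r'([^A-Za-z0-9_-])', r'\\\1', element_id)


# Build a selector for the id used in the examples above.
selector = '#' + css_escape_id('document-test/extra#test')
print(selector)
```

Inside a JavaScript string literal each backslash has to be doubled again, which is why the querySelector call above uses `\\`.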

Tests

The following tests are relevant:

  • tests/test_builders/test_build_html_tocdepth.py
  • test_build_html_numfig.py

References

@gastmaier gastmaier force-pushed the toctree-singlehtml2 branch 3 times, most recently from e6b65fb to 5117057 Compare July 20, 2025 14:23
@gastmaier gastmaier marked this pull request as ready for review July 21, 2025 07:39
@AA-Turner AA-Turner added the sprint For work completed at a conference or similar event. label Jul 21, 2025
@jayaddison
Contributor

Hi @gastmaier - I'm a former semi-regular volunteer contributor here, although I have been less active recently. Thanks for the pull request; and sorry that I did not notice the toctree constructor problem, as you mention in #13717.

I am reading both #13717 and this PR #13739 to try to understand the different approaches and reasons for them.

Also: do you have a test case that we could add under tests/roots that demonstrates the problem? I suppose it would need to include a table of contents of some kind and have a corresponding singlehtml test case.

@gastmaier
Author

Hi @jayaddison, maybe by extending tests/test_builders/test_build_html_tocdepth.py to check for duplicated ids?
As is, it already checks that the ids are as expected (e.g., the PR changes things in 3 locations to keep the test passing), but not for duplicated ids, so I guess I can add that assertion.

@jayaddison
Contributor

@gastmaier that sounds perfect, yep! (I'd forgotten about those tests)

@akhilsmokie7-cloud akhilsmokie7-cloud left a comment

Align with purpose

@gastmaier gastmaier force-pushed the toctree-singlehtml2 branch 2 times, most recently from c50ba56 to 910de47 Compare August 30, 2025 17:32
@gastmaier
Author

gastmaier commented Aug 30, 2025

Hi @jayaddison and @akhilsmokie7-cloud, I rebased and added the test to check for duplicated ids.
This PR relies on changing the ids during the write step.
Initially I didn't really like the approach, but I recently stumbled on the fact that the html build also changes the images' src path during the write step (original/path/to/image.png -> _images/image_<counter>.png), so I am now more comfortable with this approach.

I added the test at the bottom; checking out fe728f4 will fail at

FAILED tests/test_builders/test_build_html_tocdepth.py::test_unique_ids_singlehtml - AssertionError: assert 16 == 15

as expected, since at f5457f1
I purposely added a section called FooBar to both foo and bar, forcing the same id in both pages, which is a problem only for the single-page output.
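
The kind of check described here can be sketched with only the standard library. This is an illustration, not the test added in the PR: `IdCollector`, `collect_ids` and the sample markup are hypothetical, and the real test runs against the built singlehtml output.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect the value of every id attribute in an HTML page."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'id' and value:
                self.ids.append(value)


def collect_ids(html: str) -> list[str]:
    parser = IdCollector()
    parser.feed(html)
    return parser.ids


# Hypothetical single-page output with the docname-prefixed ids.
page = (
    '<section id="document-foo#foobar"><h1>FooBar</h1></section>'
    '<section id="document-bar#foobar"><h1>FooBar</h1></section>'
)
ids = collect_ids(page)

# Uniqueness check: the list and its set must have the same length.
assert len(ids) == len(set(ids))
```

Without the prefix both sections would carry the id foobar, and the final assertion would fail in the same way as the 16 == 15 failure quoted above.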

On "the html build also changes the images src path during the write step", this is what I am talking about:
https://github.com/sphinx-doc/sphinx/blob/master/sphinx/writers/html5.py#L754-L755

CI note:
Failing test

FAILED tests/test_directives/test_directive_only.py::test_sectioning - AssertionError: Section out of place: '1.6.2. Subsection'
assert '1.6.1.1.' == '1.6.2.'

is due to 2e51b787680cefdfe56b3438d809e6476600a47e

Thanks,

@@ -110,7 +110,7 @@ def assemble_toc_secnumbers(self) -> dict[str, dict[str, tuple[int, ...]]]:
         new_secnumbers: dict[str, tuple[int, ...]] = {}
         for docname, secnums in self.env.toc_secnumbers.items():
             for id, secnum in secnums.items():
-                alias = f'{docname}/{id}'
+                alias = f'{docname}{id}'
Contributor

What kind of values are possible for the docname and id?

Contributor

(also: I guess people shouldn't have written hyperlinks or saved bookmarks with the assumption that these aliases are stable? but, even so - if we change the format, I guess we would break those?)

Contributor

@gastmaier in fact: I'm not sure where these / separator characters appear. What does this code relate to?

Author

For singlehtml, at the assemble toctree step, the href is built from a tuple of docname and refid:
#document-path/to/#id1
This was an attempt to avoid refid conflicts in singlehtml mode, which didn't work because it only patched the toctree, while the content body still had the non-unique ids.

My PR changes the toctree href format from
#document-path/to/#id1 to #document-path/to#id1 (removes the trailing slash)
and the content ids from
#id1 to #document-path/to#id1 (adds the doc prefix to make them unique).
The new template is therefore:
#document-{doc}#{id}
a direct tuple of docname and refid, without the slash.

These are valid HTML anchors, but they do require escaping when manipulated with
CSS:

#document-test\/extra\#test {color: #f00;}

and JavaScript:

document.querySelector('#document-test\\/extra\\#test')

Author

singlehtml.zip
here is a singlehtml build with the patch

@@ -497,6 +498,15 @@ def depart_term(self, node: Element) -> None:

         self.body.append('</dt>')

+    def visit_section(self, node: section) -> None:
+        if self.builder.name == 'singlehtml' and node['ids']:
Contributor

We don't seem to use many @property methods in the Sphinx writers, but maybe this singlehtml condition is getting to the point where it makes sense (this is the third potential callsite, I think?).

@jayaddison
Contributor

Maybe pedantic of me to mention, but: running the test code without the fix in place does confirm that the test case fails (duplication of foobar-b1 alias).

(I attempted that to reassure myself and to learn slightly more about how the fix works)

@gastmaier gastmaier marked this pull request as draft September 2, 2025 10:32
@gastmaier
Author

gastmaier commented Sep 2, 2025

Drafting again: I spotted more links in the body that use the non-doc-prefixed anchor.
Specifically, explicit refs such as

.. _explicit-ref:

are not being prefixed, but the links to them are correct (document-path/to#explicit-ref).

I will give it yet another try, but this time traversing the pickled doctree to patch all ids early on, instead of patching at the node visits.

Sample of the new approach:
doc.tar.gz

To assert unique ids in singlehtml builder.

Signed-off-by: Jorge Marques <[email protected]>
Since the singlehtml aggregates all doc files into a single html page
during the write step, and the ids must be unique for proper link
anchoring, add test that collects all ids in the page and checks if all
ids are unique, by asserting the length of the list against it as a set.
@gastmaier
Author

Applied the ruthless traverse to patch all (ref)?ids early on, instead of patching at the node visits.

This approach avoids mass overriding of every docutils method under the sun, e.g. the starttag method for the sneaky explicit ref <span id="<id>">.

The procedure is to patch the doctree (prefix_ids_with_docname) after assemble_toctree and before the other singlehtml patches (assemble_toc_secnumbers and assemble_toc_fignumbers), which have also been adjusted to match the existing document-<doc>#<id> format instead of the previous loose <doc>/<id> format.

Since the call stack is a little hidden, here is a summary:

@builders/singlehtml
write_documents
  - assemble_doctree:
    - inline_all_toctrees
    - resolve_references
      -  apply_post_transforms
    - prefix_ids_with_docname (new)
  - assemble_toc_secnumbers
  - assemble_toc_fignumbers

Use doc path to make ids unique. Compensates for the loss of the
pathname in the href.
Format as document-<docname>#<id> to match other parts.
@gastmaier gastmaier marked this pull request as ready for review September 3, 2025 09:16
Comment on lines +100 to +105
if 'refid' in node or 'ids' in node:
    docname = env.path2doc(doc['source'])
    if 'refid' in node:
        node['refid'] = 'document-' + docname + '#' + node['refid']
    if 'ids' in node:
        node['ids'] = ['document-' + docname + '#' + id for id in node['ids']]
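
For readers without a Sphinx checkout at hand, the effect of these lines can be reproduced on a toy tree. The sketch below is self-contained: the `Node` class and `prefix_ids` function are hypothetical stand-ins, not the docutils node classes or the PR's prefix_ids_with_docname, and the docname is passed in directly instead of being derived from env.path2doc.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a docutils node; not the real class."""
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def walk(self):
        # Depth-first traversal over this node and all of its descendants.
        yield self
        for child in self.children:
            yield from child.walk()


def prefix_ids(doc: Node, docname: str) -> None:
    """Prefix every id and refid in the tree with document-<docname>#."""
    prefix = f'document-{docname}#'
    for node in doc.walk():
        if 'refid' in node.attrs:
            node.attrs['refid'] = prefix + node.attrs['refid']
        if 'ids' in node.attrs:
            node.attrs['ids'] = [prefix + i for i in node.attrs['ids']]


# A section containing an explicit target and a reference pointing to it.
target = Node({'ids': ['explicit-ref']})
link = Node({'refid': 'explicit-ref'})
section = Node({'ids': ['id1']}, [target, link])
doc = Node(children=[section])

prefix_ids(doc, 'path/to')
```

Because ids and refids receive the same prefix in one pass, a reference and its target stay in sync, which is the property the per-visit patching missed for explicit refs.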
Contributor

I'll plan to do this within the next 24h or so, but I'll ask in case it is something you could do quickly: could you print out two columns of text with the before and after values for these node attributes when building a non-trivial project (easiest/safest choice: Sphinx itself)?

e.g.

refids
before               after
[sample/#foo]        [document-sample#foo]

node_id
sample/#foo          document-sample#foo

The reason I ask: I'd like to inspect the places where the results differ, and in particular how the code changes achieve uniqueness of the results.

Contributor

(I'm also wondering whether docutils -- which produces the node objects, if I understand correctly - could help us and allow us to fix this in a more central location; and I hope that viewing the comparison columns may also help to understand whether that is realistic or whether this is some Sphinx-specific quirk)

Nope, scratch that - I think that docutils is unaware of the notion of docnames, so whatever is going on here must, I think, be part of Sphinx itself.

Author

@gastmaier gastmaier Sep 3, 2025

So docutils provides the resolved tree to the builder, with each doc being a document.
Sphinx guarantees the ids are unique per doc, and the filesystem guarantees the docname is unique (you cannot have two identical paths).
But the singlehtml builder flattens everything into the root doc index, losing the docname information and causing non-unique ids after flattening.
This fix recovers the docname and patches it into the id itself.

The Sphinx documentation itself, attached below, has conflicts: there are many duplicated id1.

singlehtml.zip

The table requested (attached because it is too long):

ids.md

Contributor

Thanks very much @gastmaier - that makes the problem and fix nice and clear.

I'm reading the comparison file at the moment - in particular I'm interested to find whether any of the before elements included a / delimiter -- I haven't found any so far. If there are none, then that would completely resolve my concern about breaking any existing hyperlinks containing that character.

Do you have any thoughts about whether we should always include the complete document path prefix? Or whether, for example, it could be omitted for unambiguous/unique IDs?

Author

I would patch all of them at the moment: it makes sense to me to store the lost docname information in the id itself, and it is clearer to debug.

For the toctree, before the PR, it would already generate hrefs in the format document-<doc>#<id>, so this would need to be assessed as well. That's what #13717 tried to fix, only to uncover the collision issue.

And there are so many visit_* methods that need to be patched to handle every corner case that unifying everything into a single format early on (after SphinxPostTransform, before the other singlehtml patches) seems to be the only reliable approach.

The LaTeX builder does patch at the visit_* methods with the sphinx/writers/latex.py#hypertarget[withdoc=True] method, but I don't see that working for HTML: it is more convoluted, since each visit would require some if builder.name == 'singlehtml' check.
