Fix angle-bracket tags hidden by GitHub notebook renderer #204

Open
danielhanchen wants to merge 2 commits into main from fix/html-tags-github-rendering
Conversation

@danielhanchen commented Mar 8, 2026

Summary

HTML-like tags (<think>, <SOLUTION>, <start_working_out>, etc.) in notebook cell outputs are silently hidden by GitHub's notebook renderer. This makes GRPO/reasoning notebook outputs unreadable on GitHub.

Output fixes (680 fixes across 203 files):

  • For execute_result/display_data outputs: adds text/html with HTML-escaped content so GitHub renders it properly
  • For stream outputs containing angle-bracket tags: clears the stream text (these are model inference traces that cannot be preserved without escaping)
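
The two output rules above can be sketched roughly as follows. This is an illustrative sketch, not the code in scripts/fix_html_tags.py: the names `TAG_RE`, `_as_text`, and `fix_output`, the tag pattern, and the `<pre>` wrapper are all assumptions.

```python
import html
import re

# Assumed pattern for an "HTML-like tag" such as <think> or </SOLUTION>;
# the real script may detect tags differently.
TAG_RE = re.compile(r"</?[A-Za-z_][\w\-]*>")


def _as_text(value):
    """Notebook text fields may be a single string or a list of strings."""
    return "".join(value) if isinstance(value, list) else value


def fix_output(output):
    """Return a copy of one output dict with angle-bracket tags made safe."""
    output = dict(output)
    kind = output.get("output_type")
    if kind in ("execute_result", "display_data"):
        data = dict(output.get("data", {}))
        plain = _as_text(data.get("text/plain", ""))
        # Add an escaped text/html representation next to text/plain.
        if TAG_RE.search(plain) and "text/html" not in data:
            data["text/html"] = "<pre>" + html.escape(plain) + "</pre>"
            output["data"] = data
    elif kind == "stream":
        # Stream outputs carry only plain text, so the text is cleared.
        if TAG_RE.search(_as_text(output.get("text", ""))):
            output["text"] = []
    return output
```

Outputs without any angle-bracket tag pass through unchanged, which keeps the diff limited to affected cells.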

Comment fixes (46 fixes):

  • Replaces angle-bracket tags in code comments (e.g. # Acts as <think>) with safe text
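
One way to rewrite tags in comments while leaving string literals alone is to locate comments with Python's tokenizer. The sketch below is an assumption about approach, not the script's actual logic; the mapping shown in `COMMENT_REPLACEMENTS` is hypothetical, and cells that do not tokenize as Python (for example cells with shell magics) are returned untouched.

```python
import io
import tokenize

# Hypothetical mapping; the script's real COMMENT_REPLACEMENTS is not shown in this PR text.
COMMENT_REPLACEMENTS = {
    "<think>": "think tag",
    "<SOLUTION>": "SOLUTION tag",
}


def fix_comment_tags(source: str) -> str:
    """Replace angle-bracket tags inside comments only, never inside strings."""
    lines = source.splitlines(keepends=True)
    try:
        tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    except (tokenize.TokenError, SyntaxError):
        return source  # not plain Python; leave the cell as-is
    for tok in tokens:
        if tok.type != tokenize.COMMENT:
            continue
        row, col = tok.start  # row is 1-based, col is 0-based
        line = lines[row - 1]
        comment = line[col:]
        for tag, safe in COMMENT_REPLACEMENTS.items():
            comment = comment.replace(tag, safe)
        lines[row - 1] = line[:col] + comment
    return "".join(lines)
```

For example, `fix_comment_tags('x = "<think>"  # Acts as <think>\n')` keeps the string literal and only rewrites the trailing comment.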

Also includes update_all_notebooks.py improvements:

  • QAT notebooks: dynamic torchao version detection instead of hard-pinned torchao==0.14.0
  • No cell ID injection into .ipynb files (eliminates spurious diffs)
  • Conditional widget state rewrite (avoids unnecessary notebook rewrites)
  • Script file permission normalization
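
The "avoid unnecessary rewrites" idea can be illustrated with a write-only-if-changed helper. This is a generic sketch; `write_notebook_if_changed` and its JSON settings (`indent=1`, trailing newline) are assumptions and are not taken from update_all_notebooks.py.

```python
import json
from pathlib import Path


def write_notebook_if_changed(path, notebook) -> bool:
    """Serialize the notebook and write it only when the bytes on disk would differ.

    Returns True if the file was written, False if it was already up to date.
    """
    path = Path(path)
    # Assumed serialization settings; a real script must match its own format.
    new_text = json.dumps(notebook, indent=1, ensure_ascii=False) + "\n"
    if path.exists() and path.read_text(encoding="utf-8") == new_text:
        return False
    path.write_text(new_text, encoding="utf-8")
    return True
```

Skipping the write when nothing changed keeps file timestamps and diffs stable across repeated runs.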

Bug fix in fix_html_tags.py:

  • Only assign cell["outputs"] for cells that already have the key, preventing "outputs": [] from being added to markdown cells
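
A minimal sketch of the guard described above, assuming a per-output helper is passed in as `fix_output` (a hypothetical parameter; the real function signature may differ):

```python
def fix_outputs(cell, fix_output):
    """Apply fix_output to each output of a cell, touching only cells that have outputs.

    Returns True if any output changed.
    """
    # Markdown and raw cells have no "outputs" key; do not create one.
    if "outputs" not in cell:
        return False
    new_outputs = [fix_output(output) for output in cell["outputs"]]
    changed = new_outputs != cell["outputs"]
    cell["outputs"] = new_outputs
    return changed
```

With this guard, a markdown cell passes through without gaining an empty `"outputs": []` entry.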

The large deletion count (~160k lines) is expected: stream outputs containing <think> tags in GRPO/reasoning notebooks had to be cleared.

Test plan

  • Ran fix_html_tags.py with the fix -- 680 output fixes, 46 comment fixes across 203 files
  • Ran update_all_notebooks.py -- regenerated all notebooks and python scripts
  • Ran update_all_notebooks.py a second time -- byte-for-byte identical output (idempotent)
  • Verified no cell ID noise, no "outputs": [] on markdown cells
  • Verified non-affected notebooks (e.g. TinyLlama) only show the legitimate text/html addition
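
The idempotency check in the test plan was done by rerunning the script and comparing files. The same property can be sketched in-process as below; `serialize` and `is_idempotent` are hypothetical helpers, and `transform` stands for any function that takes a notebook dict and returns the fixed notebook dict.

```python
import copy
import json


def serialize(notebook) -> bytes:
    """Deterministic byte representation used for comparison."""
    return json.dumps(notebook, indent=1, sort_keys=True, ensure_ascii=False).encode("utf-8")


def is_idempotent(transform, notebook) -> bool:
    """True if applying transform a second time changes nothing."""
    once = transform(copy.deepcopy(notebook))
    twice = transform(copy.deepcopy(once))
    return serialize(once) == serialize(twice)
```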

Tags like <start_working_out>, <SOLUTION>, <think> are interpreted as
HTML by GitHub and silently hidden, making notebook outputs appear broken.

- Add text/html with HTML-escaped content to execute_result/display_data
  outputs containing raw angle-bracket tags (GitHub prefers text/html)
- Clear stream outputs that contain raw angle-bracket tags
- Replace angle-bracket tags in code comments with safe text
- String literals left unchanged (functional code not affected)
- Add scripts/fix_html_tags.py for reproducibility

@gemini-code-assist

Note

The number of changes in this pull request is too large for Gemini Code Assist to generate a summary.

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f97b0948b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread on scripts/fix_html_tags.py (outdated)
else:
    new_outputs.append(output)

cell["outputs"] = new_outputs
P1: Restrict outputs field updates to code cells

fix_outputs writes cell["outputs"] = new_outputs for every cell, including markdown/raw cells that do not have an outputs field in notebook format. In any notebook where at least one real output is fixed, this injects "outputs": [] into all non-code cells (visible throughout this commit), which produces invalid notebook structure for tools that validate/round-trip .ipynb files and can cause downstream rewrites or failures in notebook tooling.
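
A quick way to detect the problem described here is to scan a notebook for non-code cells that carry an `outputs` key. The helper name `cells_with_stray_outputs` is hypothetical; the check relies only on the notebook format rule that `outputs` belongs to code cells.

```python
def cells_with_stray_outputs(notebook):
    """Return indices of non-code cells that wrongly contain an "outputs" key."""
    return [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if cell.get("cell_type") != "code" and "outputs" in cell
    ]
```

An empty result means no markdown or raw cell was given an `outputs` field.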

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9711f5c58f

Comment thread on scripts/fix_html_tags.py
source = cell.get("source", [])
new_source = []
cell_changed = False
for line in source:
P2: Handle string-form cell sources when fixing comments

fix_comments iterates source as if it were a list of lines, but notebook source is also valid as a single string; in that case for line in source iterates characters, so none of the multi-character replacements in COMMENT_REPLACEMENTS can ever match. This means comment tags are silently left unfixed for string-serialized cells, which is a real case in this repo’s notebooks and makes reruns of the script incomplete.
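
One possible way to handle both forms is to normalize `source` to a list of lines before iterating and to write it back in its original form. The helpers `source_lines` and `set_source` below are hypothetical and are not part of the script under review.

```python
def source_lines(cell):
    """Return the cell source as a list of lines, whether stored as str or list."""
    source = cell.get("source", [])
    if isinstance(source, str):
        # keepends=True so that joining the lines reproduces the original string.
        return source.splitlines(keepends=True)
    return list(source)


def set_source(cell, lines):
    """Store lines back, preserving the original str-or-list form of the cell."""
    if isinstance(cell.get("source"), str):
        cell["source"] = "".join(lines)
    else:
        cell["source"] = list(lines)
```

Iterating over `source_lines(cell)` yields whole lines in both cases, so multi-character replacements can match.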

HTML-like tags (<think>, <SOLUTION>, <start_working_out>, etc.) in
notebook outputs are silently hidden by GitHub's renderer. This fix:

1. Adds text/html with HTML-escaped content for execute_result and
   display_data outputs containing angle-bracket tags.
2. Clears stream outputs that contain angle-bracket tags (these are
   model inference traces that cannot be fixed without escaping).
3. Replaces angle-bracket tags in code comments with safe text.

Also includes update_all_notebooks.py fixes from fix/qat-deterministic-cell-ids:
- QAT notebooks: dynamic torchao version detection instead of hard-pinned 0.14.0
- No cell ID injection into .ipynb files (eliminates spurious diffs)
- Conditional widget state rewrite
- Script file permission normalization

fix_html_tags.py: only assign cell["outputs"] for cells that already
have the key, preventing "outputs": [] from being added to markdown cells.

Regenerated all notebooks and python scripts.

@danielhanchen force-pushed the fix/html-tags-github-rendering branch from 9711f5c to 4bcc554 on March 8, 2026 13:34