scholar-html

A two-step CLI tool that converts a Google Scholar profile into static HTML for embedding in a personal website.

Fetch: scrape or load a Google Scholar profile page and extract structured publication data into JSON
Fetch PDFs (optional): enrich the JSON with PDF/paper links by visiting each publication's citation page
Render: turn that JSON into HTML using a Jinja2 template

The JSON intermediate format decouples fetching from rendering. You can re-render with different templates without re-fetching, hand-edit the data to fix author names or add missing venues, or check the JSON into version control to track how your citation counts change over time.
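Because the JSON is plain data, comparing two snapshots takes only a few lines. The sketch below is not part of the tool: `citation_changes` is a hypothetical helper, and it assumes publications can be matched by exact title and that both snapshots follow the layout shown under JSON Format.

```python
def citation_changes(old: dict, new: dict) -> dict[str, int]:
    """Return {title: change in citations} for publications whose count changed.

    Hypothetical helper: publications are matched by exact title, so a
    retitled entry is treated as a new publication.
    """
    old_counts = {p["title"]: p["citations"] for p in old["publications"]}
    changes = {}
    for pub in new["publications"]:
        delta = pub["citations"] - old_counts.get(pub["title"], 0)
        if delta != 0:
            changes[pub["title"]] = delta
    return changes


# Inline stand-ins for two snapshots; in practice, load each file with json.load().
old = {"publications": [{"title": "How software developers use GitHub", "citations": 300}]}
new = {"publications": [{"title": "How software developers use GitHub", "citations": 312}]}

print(citation_changes(old, new))
# prints {'How software developers use GitHub': 12}
```

With the JSON checked into version control, the two inputs could be the same file at two different commits.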

I had Claude Code build this as an easy way to sync my Google Scholar profile's publications to my website from time to time. I render with the leifme.html.j2 template in fragment mode and paste the resulting HTML into the Ghost editor.

Installation

Requires Python 3.10+.

pip install -r requirements.txt

For development (running tests):

pip install -r requirements-dev.txt

Dependencies: requests, beautifulsoup4, lxml, jinja2.

Quick Start

# Parse a saved Scholar HTML file into JSON
python -m scholar_html fetch --html saved_page.html -o data.json

# (Optional) Enrich with PDF links from citation pages
python -m scholar_html fetch-pdfs data.json

# Render the JSON to a full HTML document
python -m scholar_html render data.json -o publications.html

Usage

Step 1: `fetch` — Extract publication data

The fetch command reads a Google Scholar profile and outputs structured JSON. There are three mutually exclusive ways to specify the input:

# From a saved HTML file (recommended — avoids rate-limiting)
python -m scholar_html fetch --html saved_page.html -o data.json

# From a Scholar user ID (fetches over the network)
python -m scholar_html fetch --user 22Scgp0AAAAJ -o data.json

# From a full URL (fetches over the network)
python -m scholar_html fetch --url "https://scholar.google.com/citations?user=22Scgp0AAAAJ" -o data.json

Omit -o to write to stdout instead of a file.
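The --user and --url forms point at the same profile page. As a rough illustration of how the two relate, here is a minimal sketch using only the standard library. The helper names are hypothetical and this is not necessarily how the tool builds or parses URLs internally.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SCHOLAR_CITATIONS = "https://scholar.google.com/citations"


def profile_url(user_id: str) -> str:
    """Build a profile URL from a Scholar user ID (hypothetical helper)."""
    query = urlencode({"user": user_id})
    return f"{SCHOLAR_CITATIONS}?{query}"


def user_id_from_url(url: str) -> Optional[str]:
    """Return the `user` query parameter of a profile URL, or None if absent."""
    values = parse_qs(urlparse(url).query).get("user")
    return values[0] if values else None


print(profile_url("22Scgp0AAAAJ"))
# prints https://scholar.google.com/citations?user=22Scgp0AAAAJ
print(user_id_from_url("https://scholar.google.com/citations?user=22Scgp0AAAAJ&hl=en"))
# prints 22Scgp0AAAAJ
```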

Recommended workflow: Google Scholar aggressively blocks automated requests. The most reliable approach is to open your Scholar profile in a browser, save the page as HTML ("Save As" → "Webpage, Complete" or "Webpage, HTML Only"), and then use --html to parse the saved file.

Step 1.5 (optional): `fetch-pdfs` — Discover PDF links

The fetch-pdfs command enriches an existing JSON file by visiting each publication's Google Scholar citation detail page to find PDF/paper links. It's designed to be interruptible and resumable — Google Scholar aggressively rate-limits, so you may need to run it several times.

# Enrich data.json in place (default 5s delay between requests)
python -m scholar_html fetch-pdfs data.json

# Custom delay and separate output file
python -m scholar_html fetch-pdfs data.json -o enriched.json --delay 10

Each publication's pdf field tracks its state:

"" — not yet attempted
"UNAVAILABLE" — attempted, but no PDF link was found on the citation page
A URL — the paper/PDF link

Progress is saved after each publication, so if the process is interrupted (rate-limited, Ctrl-C, etc.), re-running picks up where it left off. On rate-limiting, the command prints a message to stderr and exits with code 1.
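To see how far a partially completed run has got, you can count the three states directly from the JSON. This is a minimal sketch rather than a feature of the tool: `pdf_state` and `summarize` are hypothetical names, and treating a missing pdf key as "not yet attempted" is an assumption.

```python
from collections import Counter


def pdf_state(pub: dict) -> str:
    """Classify a publication by its `pdf` field."""
    pdf = pub.get("pdf", "")  # assumption: a missing key means "not yet attempted"
    if pdf == "":
        return "pending"
    if pdf == "UNAVAILABLE":
        return "unavailable"
    return "found"


def summarize(publications: list[dict]) -> Counter:
    """Count publications in each state."""
    return Counter(pdf_state(p) for p in publications)


# Placeholder entries; in practice use json.load(...)["publications"].
publications = [
    {"title": "A", "pdf": ""},
    {"title": "B", "pdf": "UNAVAILABLE"},
    {"title": "C", "pdf": "https://ieeexplore.ieee.org/abstract/document/6773718"},
]

summary = summarize(publications)
print(f"{summary['pending']} pending, {summary['found']} found, "
      f"{summary['unavailable']} unavailable")
# prints 1 pending, 1 found, 1 unavailable
```

A non-zero pending count after a run suggests the command was interrupted and is worth re-running.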

Step 2: `render` — Generate HTML

The render command reads the JSON produced by fetch (optionally enriched by fetch-pdfs) and generates HTML:

# Full HTML document with the default template
python -m scholar_html render data.json -o publications.html

# HTML fragment (just the <section>, no <!DOCTYPE>/html/body wrapper)
python -m scholar_html render data.json --fragment -o snippet.html

# Limit to the first 10 publications
python -m scholar_html render data.json --limit 10 -o publications.html

# Use a custom Jinja2 template
python -m scholar_html render data.json --template my_template.html.j2 -o publications.html

Omit -o to write to stdout.

Fragment mode

--fragment outputs only the <section class="scholar-profile">...</section> block, without the surrounding <!DOCTYPE html>, <html>, <head>, or <body> tags. This is useful when you want to paste or include the output into an existing page.
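If the target is a static page rather than a CMS editor, the paste step can be scripted. The sketch below assumes you have added two marker comments to your page yourself. The marker strings and `embed_fragment` are hypothetical and not something the tool provides.

```python
import re

# Hypothetical markers you would place in your own page by hand.
START = "<!-- scholar:start -->"
END = "<!-- scholar:end -->"


def embed_fragment(page: str, fragment: str) -> str:
    """Replace whatever sits between the two markers with `fragment`."""
    pattern = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)
    if not pattern.search(page):
        raise ValueError("markers not found in page")
    replacement = f"{START}\n{fragment}\n{END}"
    # A function replacement keeps backslashes in `fragment` from being
    # interpreted as group references.
    return pattern.sub(lambda _match: replacement, page, count=1)


page = f"<body>\n{START}\nold list\n{END}\n</body>"
fragment = '<section class="scholar-profile">...</section>'

print(embed_fragment(page, fragment))
```

Because the markers are kept in the output, running it again with a newer fragment replaces the previous one.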

JSON Format

The intermediate JSON looks like this:

{
  "meta": {
    "scholar_id": "22Scgp0AAAAJ",
    "fetched_at": "2025-02-01T21:33:00+00:00",
    "source": "file:saved_page.html"
  },
  "profile": {
    "name": "Leif Singer",
    "affiliation": "University of Victoria",
    "interests": ["Software Engineering", "Developer Tools"],
    "stats": {
      "citations_all": 1234,
      "citations_recent": 456,
      "h_index_all": 12,
      "h_index_recent": 8,
      "i10_index_all": 15,
      "i10_index_recent": 10
    }
  },
  "publications": [
    {
      "title": "How software developers use GitHub",
      "authors": "L Singer, F Figueira Filho, N Bettenburg, M Storey",
      "venue": "IEEE Software 31 (2), 58-65",
      "year": "2014",
      "citations": 312,
      "url": "/citations?view_op=view_citation&citation_for_view=...",
      "pdf": "https://ieeexplore.ieee.org/abstract/document/6773718"
    }
  ]
}

You can edit this file freely — fix author names, remove publications, reorder entries — and then re-render.
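Edits like these can also be scripted so they are repeatable after each re-fetch. A minimal sketch, assuming the layout shown above: `tidy` is a hypothetical helper, and the entries are placeholders rather than real publications.

```python
import json


def tidy(data: dict, drop_titles: set[str], author_fixes: dict[str, str]) -> dict:
    """Return a copy of `data` with some entries dropped, author names fixed,
    and publications sorted by citation count (highest first)."""
    pubs = []
    for pub in data["publications"]:
        if pub["title"] in drop_titles:
            continue
        pub = dict(pub)  # copy so the input is left untouched
        for wrong, right in author_fixes.items():
            pub["authors"] = pub["authors"].replace(wrong, right)
        pubs.append(pub)
    pubs.sort(key=lambda p: p["citations"], reverse=True)
    return {**data, "publications": pubs}


# Placeholder data; in practice, load data.json with json.load().
data = {
    "meta": {"scholar_id": "EXAMPLE"},
    "publications": [
        {"title": "Paper A", "authors": "A Author, B Writer", "citations": 3},
        {"title": "Paper B", "authors": "B Writer", "citations": 10},
        {"title": "Duplicate entry", "authors": "A Author", "citations": 0},
    ],
}

cleaned = tidy(data, drop_titles={"Duplicate entry"}, author_fixes={"B Writer": "B. Writer"})
print(json.dumps(cleaned, indent=2))
```

Writing `cleaned` back with json.dump() gives a file that render can consume as before.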

Templates

Three templates are included in the templates/ directory:

`default.html.j2`

Semantic HTML5 output with:

All CSS classes prefixed with scholar- (e.g. scholar-profile, scholar-pub, scholar-pub-title) so they won't collide with your site's styles
data-year and data-citations attributes on each <li> for optional client-side filtering or sorting with JavaScript
<ol reversed> for the publication list
Citation stats displayed as a <dl>
No CSS framework dependency — bring your own styles

`minimal.html.j2`

A plain <ul> with one <li> per publication in "Authors. Title. Venue, Year (N citations)." format. No classes, no data attributes.
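For reference, the line format can be reproduced in plain Python. This is only an illustration of the output shape, not the template's actual logic: `format_minimal` is a hypothetical helper, and the real template may treat missing fields or a count of one differently.

```python
def format_minimal(pub: dict) -> str:
    """Format one publication as 'Authors. Title. Venue, Year (N citations).'"""
    return (
        f"{pub['authors']}. {pub['title']}. "
        f"{pub['venue']}, {pub['year']} ({pub['citations']} citations)."
    )


# The sample entry from the JSON Format section.
pub = {
    "title": "How software developers use GitHub",
    "authors": "L Singer, F Figueira Filho, N Bettenburg, M Storey",
    "venue": "IEEE Software 31 (2), 58-65",
    "year": "2014",
    "citations": 312,
}

print(format_minimal(pub))
# prints L Singer, F Figueira Filho, N Bettenburg, M Storey. How software developers use GitHub. IEEE Software 31 (2), 58-65, 2014 (312 citations).
```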

`leifme.html.j2`

Styled template designed to match leif.me. Includes self-contained CSS within a <style> block so it works as a --fragment without external stylesheets. Uses the Inter font stack, the site's #fcb615 accent color for PDF links, and a clean publication layout with bold titles, grey authors, italic venues, and a light year/citation meta line.

python -m scholar_html render data.json --template templates/leifme.html.j2 --fragment -o publications.html

Custom templates

Pass any Jinja2 template with --template. The template receives these variables:

| Variable | Type | Description |
| --- | --- | --- |
| `profile` | object | `.name`, `.affiliation`, `.interests` (list of strings), `.stats` |
| `profile.stats` | object | `.citations_all`, `.citations_recent`, `.h_index_all`, `.h_index_recent`, `.i10_index_all`, `.i10_index_recent` |
| `publications` | list | Each has `.title`, `.authors`, `.venue`, `.year`, `.citations` (int), `.url`, `.pdf` |
| `meta` | object | `.scholar_id`, `.fetched_at`, `.source` |
| `fragment` | bool | Whether `--fragment` was passed |

Your template should check {% if not fragment %} to conditionally wrap output in a full HTML document.

CSS Classes (default template)

| Class | Element | Content |
| --- | --- | --- |
| `scholar-profile` | `<section>` | Wrapper for everything |
| `scholar-name` | `<h2>` | Author name |
| `scholar-affiliation` | `<p>` | Affiliation |
| `scholar-interests` | `<ul>` | Research interest tags |
| `scholar-interest` | `<li>` | Individual interest |
| `scholar-stats` | `<dl>` | Citation statistics |
| `scholar-stat` | `<div>` | Individual stat (wraps `<dt>` + `<dd>`) |
| `scholar-publications` | `<ol>` | Publication list |
| `scholar-pub` | `<li>` | Single publication |
| `scholar-pub-title` | `<span>` | Title (contains `<a>` if URL present) |
| `scholar-pub-authors` | `<span>` | Author list |
| `scholar-pub-venue` | `<span>` | Journal/conference name |
| `scholar-pub-pdf` | `<a>` | `[PDF]` link (only rendered when a URL is available) |
| `scholar-pub-meta` | `<span>` | Year and citation count |
| `scholar-pub-year` | `<span>` | Publication year |
| `scholar-pub-citations` | `<span>` | Citation count |

Project Structure

scholar_html/
  __init__.py
  __main__.py          # python -m scholar_html entry point
  cli.py               # argparse CLI with fetch/render/fetch-pdfs subcommands
  fetch.py             # HTML parsing and network fetching
  fetch_pdfs.py        # PDF link discovery from citation pages
  render.py            # Jinja2 template rendering
  schema.py            # Dataclasses and JSON serialization
  selectors.py         # CSS selectors for Scholar's DOM (isolated for maintainability)

templates/
  default.html.j2      # Semantic HTML5 with scholar- prefixed classes
  minimal.html.j2      # Bare <ul> list
  leifme.html.j2       # Styled for leif.me, self-contained CSS

tests/
  conftest.py          # Shared fixtures
  fixtures/
    sample_profile.html
    sample_citation.html
    sample_citation_no_link.html
  test_cli.py          # End-to-end CLI tests
  test_fetch.py        # Parser tests against saved HTML
  test_fetch_pdfs.py   # PDF discovery orchestration tests
  test_render.py       # Template rendering tests
  test_schema.py       # JSON round-trip tests
  test_selectors.py    # Validates selectors find elements in fixture

Testing

pytest tests/ -v

All tests run against saved HTML fixtures in tests/fixtures/ (sample_profile.html plus two sample citation pages). No tests hit the network.

When Google Changes Their HTML

The CSS selectors used to parse Scholar profiles are isolated in scholar_html/selectors.py. If Google changes their page structure, the selector tests (test_selectors.py) will fail first, pointing you to exactly what broke. Update the selectors and fixture, and the rest of the code stays the same.
