scholar-html

A CLI tool that converts a Google Scholar profile into static HTML for embedding in a personal website. It works in two main steps, with an optional PDF-link step in between:

  1. Fetch: scrape or load a Google Scholar profile page and extract structured publication data into JSON
  2. Fetch PDFs (optional): enrich the JSON with PDF/paper links by visiting each publication's citation page
  3. Render: turn that JSON into HTML using a Jinja2 template

The JSON intermediate format decouples fetching from rendering. You can re-render with different templates without re-fetching, hand-edit the data to fix author names or add missing venues, or check the JSON into version control to track how your citation counts change over time.
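As one illustration of the version-control use case, the sketch below compares two snapshots of the JSON and reports which publications gained or lost citations. It relies only on the `publications`, `title`, and `citations` keys shown in the JSON Format section; the function name `citation_changes` is made up for this example and is not part of the tool. Matching by title is an assumption, so a hand-edited title will show up as a new entry.

```python
def citation_changes(old: dict, new: dict) -> dict:
    """Return {title: delta} for publications whose citation count changed.

    Publications are matched by title (an assumption of this sketch).
    A title that is missing from the old snapshot counts as starting at 0.
    """
    old_counts = {p["title"]: p["citations"] for p in old["publications"]}
    changes = {}
    for pub in new["publications"]:
        delta = pub["citations"] - old_counts.get(pub["title"], 0)
        if delta != 0:
            changes[pub["title"]] = delta
    return changes
```

To use it, load two versions of data.json with `json.load` (for example, the current file and one checked out from an earlier commit) and pass both dictionaries to the function.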

I had Claude Code build this as an easy way to sync my Google Scholar publications to my website from time to time. I render with the leifme.html.j2 template in fragment mode and paste the resulting HTML into the Ghost editor.

Installation

Requires Python 3.10+.

pip install -r requirements.txt

For development (running tests):

pip install -r requirements-dev.txt

Dependencies: requests, beautifulsoup4, lxml, jinja2.

Quick Start

# Parse a saved Scholar HTML file into JSON
python -m scholar_html fetch --html saved_page.html -o data.json

# (Optional) Enrich with PDF links from citation pages
python -m scholar_html fetch-pdfs data.json

# Render the JSON to a full HTML document
python -m scholar_html render data.json -o publications.html

Usage

Step 1: fetch — Extract publication data

The fetch command reads a Google Scholar profile and outputs structured JSON. There are three mutually exclusive ways to specify the input:

# From a saved HTML file (recommended — avoids rate-limiting)
python -m scholar_html fetch --html saved_page.html -o data.json

# From a Scholar user ID (fetches over the network)
python -m scholar_html fetch --user 22Scgp0AAAAJ -o data.json

# From a full URL (fetches over the network)
python -m scholar_html fetch --url "https://scholar.google.com/citations?user=22Scgp0AAAAJ" -o data.json

Omit -o to write to stdout instead of a file.
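The --user and --url forms above point at the same page: in the example, the user ID is the value of the `user` query parameter. The sketch below shows that relationship with the standard library. It is an illustration only, not the tool's own code; the helper names are hypothetical, and whether the tool adds further query parameters when fetching is not covered here.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Base URL taken from the --url example above.
SCHOLAR_CITATIONS = "https://scholar.google.com/citations"


def profile_url(user_id: str) -> str:
    """Build a profile URL from a Scholar user ID."""
    return f"{SCHOLAR_CITATIONS}?{urlencode({'user': user_id})}"


def user_id_from_url(url: str) -> Optional[str]:
    """Extract the user ID from a profile URL, or None if it has no `user` parameter."""
    values = parse_qs(urlparse(url).query).get("user")
    return values[0] if values else None
```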

Recommended workflow: Google Scholar aggressively blocks automated requests. The most reliable approach is to open your Scholar profile in a browser, save the page as HTML ("Save As" → "Webpage, Complete" or "Webpage, HTML Only"), and then use --html to parse the saved file.

Step 1.5 (optional): fetch-pdfs — Discover PDF links

The fetch-pdfs command enriches an existing JSON file by visiting each publication's Google Scholar citation detail page to find PDF/paper links. It's designed to be interruptible and resumable — Google Scholar aggressively rate-limits, so you may need to run it several times.

# Enrich data.json in place (default 5s delay between requests)
python -m scholar_html fetch-pdfs data.json

# Custom delay and separate output file
python -m scholar_html fetch-pdfs data.json -o enriched.json --delay 10

Each publication's pdf field tracks its state:

  • "" — not yet attempted
  • "UNAVAILABLE" — attempted, but no PDF link was found on the citation page
  • A URL — the paper/PDF link
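Because the three states are encoded in a single string field, it is easy to check how much work remains between runs. The sketch below classifies each publication by the values listed above; the labels `pending`, `unavailable`, and `found` and both function names are invented for this example. Treating a missing `pdf` key as "not yet attempted" is an assumption.

```python
from collections import Counter


def pdf_state(pub: dict) -> str:
    """Classify one publication by the documented values of its pdf field."""
    value = pub.get("pdf", "")  # assumption: a missing key means "not yet attempted"
    if value == "":
        return "pending"
    if value == "UNAVAILABLE":
        return "unavailable"
    return "found"


def summarize(data: dict) -> Counter:
    """Count publications in each pdf state for a loaded data.json dictionary."""
    return Counter(pdf_state(p) for p in data["publications"])
```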

Progress is saved after each publication, so if the process is interrupted (rate-limited, Ctrl-C, etc.), re-running picks up where it left off. On rate-limiting, the command prints a message to stderr and exits with code 1.
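Since re-running resumes where the last run stopped, the command can be wrapped in a simple retry loop. The following is a minimal sketch, not part of the tool. It assumes the command exits with code 0 once it finishes normally (only the exit code 1 on rate-limiting is documented above), and it treats every non-zero exit as a reason to wait and retry. The wait time and attempt count are arbitrary placeholders.

```python
import subprocess
import sys
import time


def run_until_done(json_path, max_attempts=5, wait_seconds=900,
                   run=subprocess.run, sleep=time.sleep):
    """Re-run fetch-pdfs until it exits with 0 or the attempts run out.

    `run` and `sleep` are injectable so the loop can be exercised without
    touching the network. Returns True on success, False otherwise.
    """
    cmd = [sys.executable, "-m", "scholar_html", "fetch-pdfs", json_path]
    for attempt in range(1, max_attempts + 1):
        result = run(cmd)
        if result.returncode == 0:  # assumption: 0 means the run completed
            return True
        if attempt < max_attempts:
            sleep(wait_seconds)  # give the rate limit time to clear
    return False
```

Calling `run_until_done("data.json")` would enrich the file in place, pausing between attempts.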

Step 2: render — Generate HTML

The render command reads the JSON produced by fetch (optionally enriched by fetch-pdfs) and generates HTML:

# Full HTML document with the default template
python -m scholar_html render data.json -o publications.html

# HTML fragment (just the <section>, no <!DOCTYPE>/html/body wrapper)
python -m scholar_html render data.json --fragment -o snippet.html

# Limit to the first 10 publications
python -m scholar_html render data.json --limit 10 -o publications.html

# Use a custom Jinja2 template
python -m scholar_html render data.json --template my_template.html.j2 -o publications.html

Omit -o to write to stdout.

Fragment mode

--fragment outputs only the <section class="scholar-profile">...</section> block, without the surrounding <!DOCTYPE html>, <html>, <head>, or <body> tags. This is useful when you want to paste or include the output into an existing page.
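If the target page is a file you control, the paste step can be scripted. The sketch below swaps a placeholder comment in an existing page for the rendered fragment. The placeholder string and the function name are hypothetical conventions chosen for this example; nothing in the tool defines them.

```python
# Hypothetical marker that you would put in your own page where the list should go.
PLACEHOLDER = "<!-- scholar-publications -->"


def embed_fragment(page_html: str, fragment_html: str) -> str:
    """Replace the first placeholder in page_html with the rendered fragment."""
    if PLACEHOLDER not in page_html:
        raise ValueError(f"placeholder {PLACEHOLDER!r} not found in page")
    return page_html.replace(PLACEHOLDER, fragment_html.strip(), 1)
```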

JSON Format

The intermediate JSON looks like this:

{
  "meta": {
    "scholar_id": "22Scgp0AAAAJ",
    "fetched_at": "2025-02-01T21:33:00+00:00",
    "source": "file:saved_page.html"
  },
  "profile": {
    "name": "Leif Singer",
    "affiliation": "University of Victoria",
    "interests": ["Software Engineering", "Developer Tools"],
    "stats": {
      "citations_all": 1234,
      "citations_recent": 456,
      "h_index_all": 12,
      "h_index_recent": 8,
      "i10_index_all": 15,
      "i10_index_recent": 10
    }
  },
  "publications": [
    {
      "title": "How software developers use GitHub",
      "authors": "L Singer, F Figueira Filho, N Bettenburg, M Storey",
      "venue": "IEEE Software 31 (2), 58-65",
      "year": "2014",
      "citations": 312,
      "url": "/citations?view_op=view_citation&citation_for_view=...",
      "pdf": "https://ieeexplore.ieee.org/abstract/document/6773718"
    }
  ]
}

You can edit this file freely — fix author names, remove publications, reorder entries — and then re-render.
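Edits that you repeat after every fetch can also be scripted. The sketch below applies the kinds of changes mentioned above to a loaded JSON dictionary: dropping entries, correcting author strings, and reordering. The function and parameter names are illustrative only. Sorting by the `year` string assumes four-digit years, as in the sample; entries with an empty year end up last.

```python
def tidy(data: dict, drop_titles: set, renames: dict) -> dict:
    """Return a copy of data with some publications removed, author
    strings corrected, and entries sorted newest first.

    drop_titles: exact titles to remove.
    renames: {wrong_author_text: corrected_text} applied to each authors string.
    """
    pubs = []
    for pub in data["publications"]:
        if pub["title"] in drop_titles:
            continue
        pub = dict(pub)  # copy so the input is left untouched
        for wrong, right in renames.items():
            pub["authors"] = pub["authors"].replace(wrong, right)
        pubs.append(pub)
    # Newest first, then most cited; year is stored as a string in the JSON.
    pubs.sort(key=lambda p: (p["year"], p["citations"]), reverse=True)
    return {**data, "publications": pubs}
```

Write the result back with `json.dump` and run the render command on it as usual.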

Templates

Three templates are included in the templates/ directory:

default.html.j2

Semantic HTML5 output with:

  • All CSS classes prefixed with scholar- (e.g. scholar-profile, scholar-pub, scholar-pub-title) so they won't collide with your site's styles
  • data-year and data-citations attributes on each <li> for optional client-side filtering or sorting with JavaScript
  • <ol reversed> for the publication list
  • Citation stats displayed as a <dl>
  • No CSS framework dependency — bring your own styles

minimal.html.j2

A plain <ul> with one <li> per publication in "Authors. Title. Venue, Year (N citations)." format. No classes, no data attributes.

leifme.html.j2

Styled template designed to match leif.me. Includes self-contained CSS within a <style> block so it works as a --fragment without external stylesheets. Uses the Inter font stack, the site's #fcb615 accent color for PDF links, and a clean publication layout with bold titles, grey authors, italic venues, and a light year/citation meta line.

python -m scholar_html render data.json --template templates/leifme.html.j2 --fragment -o publications.html

Custom templates

Pass any Jinja2 template with --template. The template receives these variables:

| Variable | Type | Description |
|---|---|---|
| profile | object | .name, .affiliation, .interests (list of strings), .stats |
| profile.stats | object | .citations_all, .citations_recent, .h_index_all, .h_index_recent, .i10_index_all, .i10_index_recent |
| publications | list | Each has .title, .authors, .venue, .year, .citations (int), .url, .pdf |
| meta | object | .scholar_id, .fetched_at, .source |
| fragment | bool | Whether --fragment was passed |

Your template should check {% if not fragment %} to conditionally wrap output in a full HTML document.
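To make the variables and the fragment check concrete, here is a small sketch that renders an inline template string with Jinja2 directly. The template body is what you might save as a file and pass to --template; the `render` helper and the sample markup are assumptions for illustration, not the tool's render module. The sketch passes plain dictionaries, which works because Jinja2 resolves dotted access on both attributes and dictionary keys.

```python
from jinja2 import Environment

# Example template body; in practice this would live in a .html.j2 file.
TEMPLATE = """\
{% if not fragment %}<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>{{ profile.name }}</title></head><body>
{% endif %}<section class="scholar-profile">
<h2>{{ profile.name }}</h2>
<ul>
{% for pub in publications %}<li>{{ pub.authors }}. {{ pub.title }}. {{ pub.venue }}, {{ pub.year }} ({{ pub.citations }} citations).{% if pub.pdf and pub.pdf != "UNAVAILABLE" %} <a href="{{ pub.pdf }}">[PDF]</a>{% endif %}</li>
{% endfor %}</ul>
</section>
{% if not fragment %}</body></html>
{% endif %}"""


def render(data: dict, fragment: bool) -> str:
    """Render TEMPLATE with the variables described in the table above."""
    env = Environment(autoescape=True)  # escape titles and names for HTML
    return env.from_string(TEMPLATE).render(
        profile=data["profile"],
        publications=data["publications"],
        meta=data["meta"],
        fragment=fragment,
    )
```

The PDF link is skipped for both the empty string and the "UNAVAILABLE" marker, so only real URLs produce a link.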

CSS Classes (default template)

| Class | Element | Content |
|---|---|---|
| scholar-profile | <section> | Wrapper for everything |
| scholar-name | <h2> | Author name |
| scholar-affiliation | <p> | Affiliation |
| scholar-interests | <ul> | Research interest tags |
| scholar-interest | <li> | Individual interest |
| scholar-stats | <dl> | Citation statistics |
| scholar-stat | <div> | Individual stat (wraps <dt> + <dd>) |
| scholar-publications | <ol> | Publication list |
| scholar-pub | <li> | Single publication |
| scholar-pub-title | <span> | Title (contains <a> if URL present) |
| scholar-pub-authors | <span> | Author list |
| scholar-pub-venue | <span> | Journal/conference name |
| scholar-pub-pdf | <a> | [PDF] link (only rendered when a URL is available) |
| scholar-pub-meta | <span> | Year and citation count |
| scholar-pub-year | <span> | Publication year |
| scholar-pub-citations | <span> | Citation count |

Project Structure

scholar_html/
  __init__.py
  __main__.py          # python -m scholar_html entry point
  cli.py               # argparse CLI with fetch/render/fetch-pdfs subcommands
  fetch.py             # HTML parsing and network fetching
  fetch_pdfs.py        # PDF link discovery from citation pages
  render.py            # Jinja2 template rendering
  schema.py            # Dataclasses and JSON serialization
  selectors.py         # CSS selectors for Scholar's DOM (isolated for maintainability)

templates/
  default.html.j2      # Semantic HTML5 with scholar- prefixed classes
  minimal.html.j2      # Bare <ul> list
  leifme.html.j2       # Styled for leif.me, self-contained CSS

tests/
  conftest.py          # Shared fixtures
  fixtures/
    sample_profile.html
    sample_citation.html
    sample_citation_no_link.html
  test_cli.py          # End-to-end CLI tests
  test_fetch.py        # Parser tests against saved HTML
  test_fetch_pdfs.py   # PDF discovery orchestration tests
  test_render.py       # Template rendering tests
  test_schema.py       # JSON round-trip tests
  test_selectors.py    # Validates selectors find elements in fixture

Testing

pytest tests/ -v

All tests run against the saved HTML fixtures in tests/fixtures/ (for example, sample_profile.html). No tests hit the network.
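For a sense of what an offline test can look like, here is a sketch of a JSON round-trip test in the spirit of test_schema.py. The `Publication` dataclass and the two helpers are hypothetical stand-ins written for this example; they mirror the fields in the JSON Format section but are not the project's schema module.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Publication:
    """Hypothetical stand-in mirroring one entry of the publications list."""
    title: str
    authors: str
    venue: str
    year: str
    citations: int
    url: str = ""
    pdf: str = ""


def to_json(pubs: list) -> str:
    """Serialize a list of Publication objects to a JSON string."""
    return json.dumps([asdict(p) for p in pubs], ensure_ascii=False, indent=2)


def from_json(text: str) -> list:
    """Parse a JSON string back into Publication objects."""
    return [Publication(**item) for item in json.loads(text)]


def test_round_trip():
    """Serializing and parsing again should give back equal objects."""
    pubs = [Publication("How software developers use GitHub",
                        "L Singer, F Figueira Filho, N Bettenburg, M Storey",
                        "IEEE Software 31 (2), 58-65", "2014", 312)]
    assert from_json(to_json(pubs)) == pubs
```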

When Google Changes Their HTML

The CSS selectors used to parse Scholar profiles are isolated in scholar_html/selectors.py. If Google changes their page structure, the selector tests (test_selectors.py) will fail first, pointing you to exactly what broke. Update the selectors and fixture, and the rest of the code stays the same.
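The pattern can be sketched as follows: keep every markup-dependent name in one mapping, and have a test confirm that each one still matches something in the saved fixture. This is a simplified illustration using only the standard library and plain class names rather than full CSS selectors. The names in `SELECTORS` are hypothetical and are not Google Scholar's real markup or the contents of selectors.py.

```python
from html.parser import HTMLParser

# Hypothetical class names for illustration; NOT Scholar's actual markup.
SELECTORS = {
    "publication_row": "pub-row",
    "publication_title": "pub-title",
    "citation_count": "pub-cites",
}


class _ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def missing_selectors(html: str) -> list:
    """Return the names in SELECTORS whose class no longer appears in html."""
    collector = _ClassCollector()
    collector.feed(html)
    return sorted(name for name, cls in SELECTORS.items()
                  if cls not in collector.classes)
```

A test that asserts `missing_selectors(fixture_html) == []` fails with the names of the broken entries as soon as a refreshed fixture no longer contains them.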