Commit 1cce78f

1 parent 01de2ff commit 1cce78f


4 files changed: +738 -1 lines changed

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
---
description: OCR a general exam PDF to accessible HTML with MathML using Mathpix API
---

You are going to OCR a general exam PDF file to accessible HTML with MathML using the Mathpix API. Follow these steps precisely:

## Requirements
- The PDF path will be provided as an argument to this command
- Use Mathpix API credentials from environment variables (MATHPIX_APP_ID and MATHPIX_API_KEY)
- Generate accessible HTML with:
  - Actual MathML markup (NOT SVG, NOT Unicode characters)
  - Proper ARIA attributes (role="math", aria-label with LaTeX)
  - Proper semantic heading hierarchy (H1 for title, H2 for sections/problems)

## Workflow

### Step 1: Upload PDF to Mathpix
Use the Mathpix v3/pdf API endpoint to upload the PDF:
```bash
curl -X POST "https://api.mathpix.com/v3/pdf" \
  -H "app_id: $MATHPIX_APP_ID" \
  -H "app_key: $MATHPIX_API_KEY" \
  -F "file=@<PDF_PATH>" \
  -F 'options_json={"conversion_formats": {"html.zip": true, "mmd": true}}'
```

Extract the `pdf_id` from the response.

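One way to pull the `pdf_id` out of the upload response is sketched below in Python. It assumes the response body is a JSON object with a `pdf_id` key and that a failed upload carries an `error` field instead; the helper name `extract_pdf_id` is made up for illustration.

```python
import json


def extract_pdf_id(response_text: str) -> str:
    """Return the pdf_id from the JSON body of the upload response."""
    payload = json.loads(response_text)
    # Assumption: a failed upload has no "pdf_id" and reports an "error" field.
    if "pdf_id" not in payload:
        raise ValueError(f"no pdf_id in response: {payload.get('error', payload)}")
    return payload["pdf_id"]


print(extract_pdf_id('{"pdf_id": "abc123"}'))
```

Failing loudly here avoids polling a status URL built from an empty ID in the next step.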
### Step 2: Check conversion status
Poll the status endpoint until conversion is complete:
```bash
curl -X GET "https://api.mathpix.com/v3/pdf/<PDF_ID>" \
  -H "app_id: $MATHPIX_APP_ID" \
  -H "app_key: $MATHPIX_API_KEY"
```

Wait until the response contains `"status":"completed"` before moving on.

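The polling loop could be sketched in Python as follows. The function names, the two-second interval, and the attempt limit are illustrative choices, and treating `"status": "error"` as a terminal failure is an assumption; only `"completed"` is taken from the step above.

```python
import json
import os
import time
import urllib.request
from typing import Callable


def mathpix_status_fetcher(pdf_id: str) -> Callable[[], dict]:
    """Build a function that performs one GET on the status endpoint."""
    def fetch() -> dict:
        request = urllib.request.Request(
            f"https://api.mathpix.com/v3/pdf/{pdf_id}",
            headers={
                "app_id": os.environ["MATHPIX_APP_ID"],
                "app_key": os.environ["MATHPIX_API_KEY"],
            },
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch


def wait_for_conversion(
    fetch_status: Callable[[], dict],
    interval_seconds: float = 2.0,
    max_attempts: int = 60,
) -> dict:
    """Call fetch_status until it reports "completed", then return that payload."""
    for _ in range(max_attempts):
        payload = fetch_status()
        status = payload.get("status")
        if status == "completed":
            return payload
        if status == "error":  # assumed failure value
            raise RuntimeError(f"conversion failed: {payload}")
        time.sleep(interval_seconds)
    raise TimeoutError(f"not completed after {max_attempts} polls")
```

A call would look like `wait_for_conversion(mathpix_status_fetcher(pdf_id))`. Passing the fetcher in as an argument keeps the loop testable without network access.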
### Step 3: Download Mathpix Markdown
Download the .mmd output (NOT the .html output, which renders math as SVG):
```bash
curl -X GET "https://api.mathpix.com/v3/pdf/<PDF_ID>.mmd" \
  -H "app_id: $MATHPIX_APP_ID" \
  -H "app_key: $MATHPIX_API_KEY" \
  -o /tmp/output.mmd
```

### Step 4: Convert to HTML with MathML using Pandoc
```bash
pandoc /tmp/output.mmd -f latex -t html --mathml --standalone -o /tmp/output_mathml.html
```

### Step 5: Post-process to fix Unicode and improve accessibility
Use the Python script `scripts/fix_mathml.py` to:
1. Replace all Unicode mathematical characters with HTML/MathML entities (&rarr;, &infin;, &Copf;, &epsilon;, etc.)
2. Fix heading hierarchy (first H1 stays as title, subsequent H1s become H2)
3. Add ARIA attributes to all math elements (role="math", aria-label with LaTeX source)
4. Add lang="en" attribute to the `<html>` element
5. Extract the title from the H1 and set a descriptive page title (e.g., "REAL ANALYSIS GENERAL EXAM FALL 2022 - UVA Mathematics")
6. Add breadcrumb navigation at the top of the page
7. Add back-button navigation below the breadcrumb
8. Wrap the main content in a `<main>` landmark element

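Items 2 and 3 of the list above could be handled along the following lines. This is a sketch under stated assumptions, not the contents of `scripts/fix_mathml.py`: the function names are hypothetical, H1 elements are assumed not to be nested, and each `<math>` element is assumed to carry its LaTeX source in an `<annotation encoding="application/x-tex">` child, which is how pandoc's `--mathml` output is normally shaped.

```python
import html
import re

H1_TAG = re.compile(r"<(/?)h1\b", re.IGNORECASE)
MATH_ELEMENT = re.compile(r"<math\b([^>]*)>(.*?)</math>", re.DOTALL)
TEX_ANNOTATION = re.compile(
    r'<annotation encoding="application/x-tex">(.*?)</annotation>', re.DOTALL
)


def demote_extra_h1(document: str) -> str:
    """Keep the first <h1>...</h1> pair and turn every later one into <h2>."""
    seen_first_close = False

    def replace(match: re.Match) -> str:
        nonlocal seen_first_close
        if not seen_first_close:
            if match.group(1) == "/":
                seen_first_close = True  # the title's closing tag
            return match.group(0)
        return f"<{match.group(1)}h2"

    return H1_TAG.sub(replace, document)


def add_math_aria(document: str) -> str:
    """Add role="math" and an aria-label holding the LaTeX source to each <math>."""
    def annotate(match: re.Match) -> str:
        attributes, body = match.group(1), match.group(2)
        if "role=" in attributes:
            return match.group(0)  # already processed
        tex = TEX_ANNOTATION.search(body)
        if tex:
            # The annotation text is HTML-escaped; normalise it for attribute use.
            label = html.escape(html.unescape(tex.group(1)).strip(), quote=True)
        else:
            label = "math expression"  # fallback when no LaTeX source is present
        return f'<math{attributes} role="math" aria-label="{label}">{body}</math>'

    return MATH_ELEMENT.sub(annotate, document)
```

Regular expressions are enough for pandoc's predictable output; a real HTML parser would be the safer choice for arbitrary input.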
**Breadcrumb navigation** should:
- Have aria-label="Breadcrumb"
- Use an ordered list with list markers hidden (no bullets or numbers)
- Include: Home / Graduate / General Exams / [Current Page Title]
- Mark current page with aria-current="page"
- Mark separators with aria-hidden="true" so screen readers skip them

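A breadcrumb meeting these points might be generated as below. The function name and inline styles are illustrative, and the `/` and `/graduate/` URLs are guesses; only `/graduate/generals/` appears elsewhere in this file.

```python
import html


def build_breadcrumb(page_title: str) -> str:
    """Return breadcrumb markup ending in the current page title."""
    # Assumed URLs for Home and Graduate; adjust to the real site layout.
    trail = [
        ("Home", "/"),
        ("Graduate", "/graduate/"),
        ("General Exams", "/graduate/generals/"),
    ]
    separator = '<span aria-hidden="true"> / </span>'
    items = [
        f'<li><a href="{url}">{html.escape(label)}</a>{separator}</li>'
        for label, url in trail
    ]
    # The current page is plain text, flagged for assistive technology.
    items.append(f'<li aria-current="page">{html.escape(page_title)}</li>')
    return (
        '<nav aria-label="Breadcrumb">\n'
        '<ol style="list-style: none; padding: 0; display: flex; flex-wrap: wrap;">\n'
        + "\n".join(items)
        + "\n</ol>\n</nav>"
    )


print(build_breadcrumb("REAL ANALYSIS GENERAL EXAM FALL 2022"))
```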
**Back button** should:
- Link to `/graduate/generals/`
- Have aria-label="Page navigation"
- Be styled as a button
- Include a left arrow (&larr;) and text "Back to General Exams"

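The back button could be emitted as in this sketch. Placing the aria-label on a wrapping `<nav>` rather than on the link itself is an assumption, as are the function name and the inline style values.

```python
# Illustrative styling only; the real script may use a CSS class instead.
BACK_BUTTON_STYLE = (
    "display: inline-block; padding: 0.5em 1em; border: 1px solid #333; "
    "border-radius: 4px; text-decoration: none;"
)


def build_back_button(href: str = "/graduate/generals/") -> str:
    """Return a nav landmark holding a button-styled link back to the exam list."""
    return (
        '<nav aria-label="Page navigation">\n'
        f'<a href="{href}" style="{BACK_BUTTON_STYLE}">&larr; Back to General Exams</a>\n'
        "</nav>"
    )


print(build_back_button())
```

Keeping the label on the `<nav>` leaves the visible link text as the link's accessible name.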
**Main landmark** should:
- Wrap all content from H1 to the end
- Start with `<main>` before the H1
- Close with `</main>` before `</body>`

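A minimal way to add the landmark is sketched below. The function name is hypothetical. One assumption goes beyond the bullets: if the first H1 sits inside a `<header>` block (pandoc's standalone template wraps the title that way), `<main>` is opened before that `<header>` so the tags stay properly nested.

```python
import re


def wrap_in_main(document: str) -> str:
    """Insert <main> before the first H1 and </main> before </body>."""
    if re.search(r"<main\b", document, re.IGNORECASE):
        return document  # already wrapped
    first_h1 = re.search(r"<h1\b", document, re.IGNORECASE)
    body_close = re.search(r"</body\s*>", document, re.IGNORECASE)
    if first_h1 is None or body_close is None or body_close.start() < first_h1.start():
        raise ValueError("expected an <h1> followed by </body>")
    start = first_h1.start()
    # Assumption: an enclosing <header> must move inside <main> with its H1.
    header_open = document.rfind("<header", 0, start)
    if header_open != -1 and document.find("</header>", header_open) > start:
        start = header_open
    end = body_close.start()
    return document[:start] + "<main>\n" + document[start:end] + "</main>\n" + document[end:]
```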
Use this entity mapping:
- Arrows: → (&rarr;), ← (&larr;), ↔ (&harr;), ⇒ (&rArr;), ⇐ (&lArr;), ⇔ (&hArr;), ↦ (&map;)
- Set theory: ∈ (&isin;), ∉ (&notin;), ∋ (&ni;), ⊂ (&sub;), ⊃ (&sup;), ⊆ (&sube;), ⊇ (&supe;), ∪ (&cup;), ∩ (&cap;), ∖ (&setminus;), ∅ (&empty;)
- Relations: ≤ (&le;), ≥ (&ge;), ≠ (&ne;), ≈ (&approx;), ≡ (&equiv;), ⟂ (&perp;), ∥ (&parallel;)
- Calculus: ∂ (&part;), ∫ (&int;), ∮ (&conint;), ∑ (&sum;), ∏ (&prod;), ∇ (&nabla;), √ (&radic;)
- Logic: ∀ (&forall;), ∃ (&exist;), ¬ (&not;), ∧ (&and;), ∨ (&or;)
- Number sets: ℝ (&Ropf;), ℂ (&Copf;), ℕ (&Nopf;), ℤ (&Zopf;), ℚ (&Qopf;)
- Greek letters: α (&alpha;), β (&beta;), γ (&gamma;), δ (&delta;), ε (&epsilon;), η (&eta;), θ (&theta;), λ (&lambda;), μ (&mu;), ν (&nu;), π (&pi;), σ (&sigma;), τ (&tau;), φ (&phi;), ω (&omega;), Γ (&Gamma;), Δ (&Delta;), Θ (&Theta;), Λ (&Lambda;), Σ (&Sigma;), Φ (&Phi;), Ω (&Omega;)
- Other: ∞ (&infin;), × (&times;), ⋅ (&sdot;), ± (&plusmn;), ∠ (&ang;), ⊕ (&oplus;), ⊗ (&otimes;)

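Applying the mapping can be done in a single pass with `str.translate`, as sketched below. Only a handful of entries are filled in here; the full table would follow the list above. The names `ENTITY_FOR_CHAR` and `to_entities` are illustrative.

```python
# A small excerpt of the mapping; extend it with every pair listed above.
ENTITY_FOR_CHAR = {
    "→": "&rarr;",
    "∞": "&infin;",
    "∈": "&isin;",
    "≤": "&le;",
    "ℝ": "&Ropf;",
    "ε": "&epsilon;",
}

# str.maketrans accepts single-character keys mapped to replacement strings.
TRANSLATION = str.maketrans(ENTITY_FOR_CHAR)


def to_entities(text: str) -> str:
    """Replace each mapped Unicode character with its named entity."""
    return text.translate(TRANSLATION)


print(to_entities("f: ℝ → ℝ, ε ≤ ∞"))
```

A translation table avoids chained `replace` calls and cannot accidentally rewrite the output of an earlier replacement.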
### Step 6: Save to final location
Save the processed HTML file next to the original PDF, using the same base name with an .html extension.

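Deriving the destination path is a one-liner with `pathlib`; the sketch below uses hypothetical helper names and assumes the processed file was written somewhere temporary first.

```python
import shutil
from pathlib import Path


def html_path_for(pdf_path: str) -> Path:
    """Return the sibling path of the PDF with an .html extension."""
    return Path(pdf_path).with_suffix(".html")


def save_next_to_pdf(processed_html: str, pdf_path: str) -> Path:
    """Copy the processed HTML file next to the PDF and return the new path."""
    destination = html_path_for(pdf_path)
    shutil.copyfile(processed_html, destination)
    return destination


print(html_path_for("graduate/exams/analysis/2022Aug_real.pdf").as_posix())
```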
### Step 7: Add HTML link to the generals page
After saving the HTML file, you MUST add a link to it in `graduate/general_exams.md`:

1. Read the file `graduate/general_exams.md`
2. Find the line that references the PDF you just processed
3. Edit that line to add `&bull; [HTML]({{site.url}}/path/to/file.html)` after the PDF link

**Example:**
```markdown
# Before:
- [08/2022, real]({{site.url}}/graduate/exams/analysis/2022Aug_real.pdf)

# After:
- [08/2022, real]({{site.url}}/graduate/exams/analysis/2022Aug_real.pdf) &bull; [HTML]({{site.url}}/graduate/exams/analysis/2022Aug_real.html)
```

**Important**: The `&bull;` (bullet point) must be used as the separator between PDF and HTML links.

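The line edit can be expressed as a small function, sketched here with a hypothetical name. It assumes the PDF appears as an ordinary Markdown link whose URL ends in `.pdf`, and it leaves the line alone if the HTML link is already present.

```python
import re


def add_html_link(markdown_line: str, pdf_url: str) -> str:
    """Append ' &bull; [HTML](...)' after the Markdown link that points at pdf_url."""
    if not pdf_url.endswith(".pdf"):
        raise ValueError("pdf_url must end in .pdf")
    html_url = pdf_url[: -len(".pdf")] + ".html"
    if html_url in markdown_line:
        return markdown_line  # already linked
    pdf_link = re.compile(r"\[[^\]]*\]\(" + re.escape(pdf_url) + r"\)")
    # A function replacement avoids backslash handling in the replacement text.
    return pdf_link.sub(
        lambda match: f"{match.group(0)} &bull; [HTML]({html_url})",
        markdown_line,
        count=1,
    )


before = "- [08/2022, real]({{site.url}}/graduate/exams/analysis/2022Aug_real.pdf)"
print(add_html_link(before, "{{site.url}}/graduate/exams/analysis/2022Aug_real.pdf"))
```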
## Important Notes
- Do NOT use the direct .html download from Mathpix; it renders math as MathJax SVG
- Always use the .mmd → pandoc → post-processing pipeline
- Verify that NO Unicode mathematical characters remain in the final output
- Verify proper heading hierarchy (use grep or check a sample)
- The final HTML should be fully accessible to screen readers

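The two verification notes could be automated roughly as follows. The heuristic for what counts as a "mathematical" character (Unicode category `Sm`, or a character name containing GREEK, DOUBLE-STRUCK, or ARROW) is an assumption chosen to cover the entity mapping in Step 5, and both function names are hypothetical.

```python
import re
import unicodedata
from typing import Dict, List


def find_math_unicode(text: str) -> List[str]:
    """Return the distinct non-ASCII characters that look mathematical."""
    suspicious = set()
    for char in text:
        if ord(char) < 128:
            continue  # entities such as &rarr; are plain ASCII and pass
        name = unicodedata.name(char, "")
        is_math_symbol = unicodedata.category(char) == "Sm"
        is_math_letter = name.startswith(("GREEK", "DOUBLE-STRUCK"))
        if is_math_symbol or is_math_letter or "ARROW" in name:
            suspicious.add(char)
    return sorted(suspicious)


def heading_counts(document: str) -> Dict[int, int]:
    """Count opening <h1> and <h2> tags; a good file has exactly one <h1>."""
    return {
        level: len(re.findall(rf"<h{level}\b", document, re.IGNORECASE))
        for level in (1, 2)
    }


print(find_math_unicode("x &rarr; y"), heading_counts("<h1>T</h1><h2>P1</h2>"))
```

An empty list from `find_math_unicode` and a count of one for level 1 would indicate that both checks pass.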
## Success Criteria
The output HTML must have:
✓ Actual MathML elements (`<math>`, `<mrow>`, `<mi>`, `<mo>`, etc.)
✓ NO Unicode mathematical characters; all are replaced with HTML/MathML entities
✓ Proper heading hierarchy (H1 for title, H2 for problems/sections)
✓ ARIA attributes on all math elements (role="math", aria-label)
✓ lang="en" attribute on the `<html>` element
✓ Descriptive page title derived from H1 content (e.g., "TITLE - UVA Mathematics")
✓ Breadcrumb navigation with aria-label="Breadcrumb" at top
✓ Back button navigation with aria-label="Page navigation" below breadcrumb
✓ Main content wrapped in a `<main>` landmark element
✓ Valid, well-formed HTML5 document with proper semantic structure
✓ HTML file saved next to the PDF with .html extension
✓ Link added to `graduate/general_exams.md` with `&bull;` separator
