@@ -15,35 +15,29 @@ You are going to OCR a general exam PDF file to accessible HTML with MathML usin
1515## Workflow
1616
1717### Step 1: Upload PDF to Mathpix
18- Use the Mathpix v3/pdf API endpoint to upload the PDF:
18+ Use the Mathpix v3/pdf API endpoint to upload the PDF (SINGLE LINE - no backslashes):
1919``` bash
20- curl -X POST "https://api.mathpix.com/v3/pdf" \
21- -H "app_id: $MATHPIX_APP_ID" \
22- -H "app_key: $MATHPIX_API_KEY" \
23- -F "file=@<PDF_PATH>" \
24- -F 'options_json={"conversion_formats": {"html.zip": true, "tex.zip": true}}'
20+ curl -X POST "https://api.mathpix.com/v3/pdf" -H "app_id: $MATHPIX_APP_ID" -H "app_key: $MATHPIX_API_KEY" -F "file=@<PDF_PATH>" -F 'options_json={"conversion_formats": {"html.zip": true, "tex.zip": true}}'
2521```
2622
2723Extract the `pdf_id` from the response.
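
A minimal sketch of pulling `pdf_id` out of the upload response without extra tooling. The helper name `extract_pdf_id` and the sample response are hypothetical, and the sketch assumes `pdf_id` appears as a plain string field in the JSON. A real JSON parser such as `jq` would be more robust if it is available.

```bash
# Hypothetical helper: read a JSON response on stdin and print the
# value of its "pdf_id" string field (prints nothing if it is absent).
extract_pdf_id() {
  sed -n 's/.*"pdf_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Illustrative response only; the real one comes from the upload call in Step 1.
sample_response='{"pdf_id":"abc123"}'
printf '%s\n' "$sample_response" | extract_pdf_id
```

In practice the output of the Step 1 `curl` command would be piped into the helper instead of the sample string.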
2824
2925### Step 2: Check conversion status
30- Poll the status endpoint until conversion is complete:
26+ Poll the status endpoint until conversion is complete (SINGLE LINE - no backslashes):
3127``` bash
32- curl -X GET "https://api.mathpix.com/v3/pdf/<PDF_ID>" \
33- -H "app_id: $MATHPIX_APP_ID" \
34- -H "app_key: $MATHPIX_API_KEY"
28+ curl -X GET "https://api.mathpix.com/v3/pdf/<PDF_ID>" -H "app_id: $MATHPIX_APP_ID" -H "app_key: $MATHPIX_API_KEY"
3529```
3630
3731Wait until `"status":"completed"`.
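
The polling described above could be sketched as a small loop. The function names `status_of` and `poll_until_completed`, the 5-second interval, and the handling of an `"error"` status are assumptions for illustration, not part of the original workflow. The `curl` call is kept on a single line, as the step requires.

```bash
# Hypothetical helper: print the value of the "status" field from a JSON
# status response read on stdin.
status_of() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Hypothetical helper: poll the status endpoint for the given PDF ID until
# the status is "completed". Returns non-zero if the status becomes "error"
# (assumed failure value). The 5-second sleep is an arbitrary choice.
poll_until_completed() {
  pdf_id=$1
  while :; do
    status=$(curl -s -X GET "https://api.mathpix.com/v3/pdf/$pdf_id" -H "app_id: $MATHPIX_APP_ID" -H "app_key: $MATHPIX_API_KEY" | status_of)
    [ "$status" = "completed" ] && return 0
    [ "$status" = "error" ] && return 1
    sleep 5
  done
}

# Usage (not run here): poll_until_completed "<PDF_ID>"
```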
3832
3933### Step 3: Download and Extract TeX Format
40- Download the tex.zip format (NOT .html, as it uses SVG):
34+ Download the tex.zip format (NOT .html, as it uses SVG) - SINGLE LINE for curl, then unzip:
4135``` bash
42- curl -X GET "https://api.mathpix.com/v3/pdf/<PDF_ID>.tex.zip" \
43- -H "app_id: $MATHPIX_APP_ID" \
44- -H "app_key: $MATHPIX_API_KEY" \
45- -o /tmp/output.tex.zip
36+ curl -X GET "https://api.mathpix.com/v3/pdf/<PDF_ID>.tex.zip" -H "app_id: $MATHPIX_APP_ID" -H "app_key: $MATHPIX_API_KEY" -o /tmp/output.tex.zip
37+ ```
4638
39+ Then extract:
40+ ``` bash
4741cd /tmp && unzip -o output.tex.zip
4842```
4943
@@ -102,8 +96,60 @@ Use this entity mapping:
10296- Greek letters: α (&alpha;), β (&beta;), γ (&gamma;), δ (&delta;), ε (&epsilon;), η (&eta;), θ (&theta;), λ (&lambda;), μ (&mu;), ν (&nu;), π (&pi;), σ (&sigma;), τ (&tau;), φ (&phi;), ω (&omega;), Γ (&Gamma;), Δ (&Delta;), Θ (&Theta;), Λ (&Lambda;), Σ (&Sigma;), Φ (&Phi;), Ω (&Omega;)
10397- Other: ∞ (&infin;), × (&times;), ⋅ (&sdot;), ± (&plusmn;), ∠ (&ang;), ⊕ (&oplus;), ⊗ (&otimes;)
10498
105- ### Step 6: Add H2 Problem Headings
106- **CRITICAL**: After post-processing, you MUST manually add H2 headings for each problem.
99+ ### Step 6: Handle Images (Diagrams, Figures)
100+ **IMPORTANT**: Many exams contain diagrams (commutative diagrams, geometric figures, knot diagrams, etc.) that are extracted by Mathpix.
101+
102+ 1. **Check for extracted images**:
103+ ``` bash
104+ ls -la /tmp/<PDF_ID>/images/
105+ ```
106+
107+ 2. **If images exist**:
108+ - Create the images directory if it doesn't exist:
109+ ``` bash
110+ mkdir -p <EXAM_DIR>/images
111+ ```
112+
113+ - Copy ALL image files to the exam images directory:
114+ ``` bash
115+ cp /tmp/<PDF_ID>/images/*.jpg <EXAM_DIR>/images/
116+ ```
117+
118+ - **Update ALL image paths in the HTML**:
119+ - Find all `<img src="...">` tags in the HTML
120+ - Change from `<img src="FILENAME"` to `<img src="images/FILENAME.jpg"`
121+ - Add proper alt text describing what the diagram shows
122+ - Add styling for responsive images:
123+ ``` html
124+ <img src="images/FILENAME.jpg" alt="Descriptive alt text here" style="max-width: 100%; height: auto; display: block; margin: 1em auto;" />
125+ ```
126+
127+ 3. **Common exam diagrams to look for**:
128+ - Commutative diagrams (arrows between mathematical objects)
129+ - Pushout/pullback squares
130+ - Geometric figures (Möbius bands, knots, surfaces)
131+ - Graphs and plots
132+ - Function diagrams
133+
134+ 4. **Alt text guidelines**:
135+ - Be descriptive but concise
136+ - Examples:
137+ - "Commutative diagram showing maps between groups A1, A2, B1, B2, and C"
138+ - "Trefoil knot diagram"
139+ - "Möbius band diagram showing the curve γ as its boundary"
140+ - "Pushout diagram showing the construction of Xf"
141+
142+ **Example transformation:**
143+ ``` html
144+ <!-- Before: Broken image path -->
145+ <img src="2025_11_13_abc123-1" alt="image" />
146+
147+ <!-- After: Fixed path with descriptive alt text -->
148+ <img src="images/2025_11_13_abc123-1.jpg" alt="Commutative diagram showing the exact sequence" style="max-width: 100%; height: auto; display: block; margin: 1em auto;" />
149+ ```
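
The path rewrite in the example above can be partly mechanized. The following is a minimal sketch, assuming the broken `src` values contain no slash and no dot (as in the example) and that every image is a `.jpg`. The function name `fix_img_paths` is hypothetical. Alt text and styling still have to be written by hand for each diagram.

```bash
# Hypothetical helper: read HTML on stdin and rewrite bare image names
# (no "/" and no "." in the src value) to images/NAME.jpg.
# Already-fixed paths contain "/" and ".", so they are left untouched,
# which makes the rewrite safe to run more than once.
fix_img_paths() {
  sed -E 's#<img src="([^"/.]+)"#<img src="images/\1.jpg"#g'
}

# Illustrative input taken from the "Before" example.
printf '%s\n' '<img src="2025_11_13_abc123-1" alt="image" />' | fix_img_paths
```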
150+
151+ ### Step 7: Add H2 Problem Headings
152+ **CRITICAL**: After handling images, you MUST manually add H2 headings for each problem.
107153
1081541. Read the processed HTML file
1091552. Identify each problem in the exam (usually numbered 1, 2, 3, etc.)
@@ -133,10 +179,10 @@ Use this entity mapping:
133179- Use the pattern: `<h2 class="unnumbered" id="problem-N">Problem N</h2>`
134180- The ID should match the problem number for anchor linking
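
One quick sanity check after adding the headings is to count them and compare the result with the number of problems in the exam. This is only a sketch: the function name `count_problem_headings` is hypothetical, and it assumes each heading sits on its own line, since `grep -c` counts matching lines rather than individual matches.

```bash
# Hypothetical helper: count lines in the given HTML file that contain an
# H2 problem heading following the pattern above.
count_problem_headings() {
  grep -c '<h2 class="unnumbered" id="problem-[0-9][0-9]*">' "$1"
}

# Usage (not run here): count_problem_headings <EXAM_DIR>/exam.html
```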
135181
136- ### Step 7: Save to final location
182+ ### Step 8: Save to final location
137183Save the processed HTML file next to the original PDF with the same name but .html extension.
138184
139- ### Step 8: Add accessible HTML link to the generals page
185+ ### Step 9: Add accessible HTML link to the generals page
140186After saving the HTML file, you MUST update the link in `graduate/general_exams.md` to follow accessibility best practices:
141187
1421881. Read the file `graduate/general_exams.md`
@@ -165,7 +211,7 @@ After saving the HTML file, you MUST update the link in `graduate/general_exams.
165211- Clearly labeling the PDF as "for printing" to indicate its purpose
166212- Using ARIA labels to communicate that PDFs may have accessibility limitations
167213
168- ### Step 9: Final Review - Read Both Files
214+ ### Step 10: Final Review - Read Both Files
169215After completing all processing steps, you MUST read both the original PDF and the generated HTML file to provide a final quality assessment:
170216
171217``` bash