Commit a2e2211

Merge pull request #3 from ironprogrammer/update-and-validate-data
2 parents 2d9f05b + aabc2ce commit a2e2211

15 files changed, with 2303 additions and 15 deletions

.github/SETUP.md

Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
# PDF Extraction Workflow Setup

This document describes how to set up the automated PDF extraction workflow that uses pdfplumber to extract swim time standards from PDF files.

## Overview

The workflow automatically:
1. Checks daily for new OSI Time Standards PDFs
2. When detected, extracts data using pdfplumber
3. Validates the extracted data
4. Creates a pull request for review

## Prerequisites

- GitHub repository with Actions enabled
- GitHub CLI (`gh`) available to the workflow (it is preinstalled on GitHub-hosted runners)
- Python 3.x with the pdfplumber library (installed automatically by the workflow)

## Required Repository Settings

For the workflow to create pull requests, you must enable the following setting:

1. Go to: **Settings → Actions → General → Workflow permissions**
2. Enable: **"Allow GitHub Actions to create and approve pull requests"**

This allows the workflow's `GH_TOKEN` to create PRs when new PDFs are detected.

## Workflow Configuration

The workflow is configured in `.github/workflows/check-pdf.yml` and runs:
- **Daily at 5 PM UTC** (9 AM Pacific Standard Time; 10 AM Pacific during daylight saving time, because scheduled workflows run on UTC)
- **Manually via `workflow_dispatch`**

### Manual Trigger

To manually trigger the workflow:

1. Go to the **Actions** tab in your GitHub repository
2. Select the "Check for New OSI Time Standards PDF" workflow
3. Click **Run workflow**
4. Select the branch (usually `main`)
5. Click **Run workflow**

## How It Works

### 1. PDF Detection (`.github/scripts/check-for-new-pdf.js`)

Checks the OSI website for:
- Newer year ranges (e.g., 2025-2026)
- URL changes for the same year (data corrections)

If changes are detected, the script outputs `PDF_URL` and `LINK_TEXT` for the next steps.
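
The detection script prints its results as `KEY=VALUE` lines after a `GITHUB_OUTPUT:` marker (see the `check-for-new-pdf.js` hunk in this commit). How the workflow consumes those lines is not shown here, so the following is only a minimal sketch of one way to read them back. The function name `parse_step_outputs` and the sample URL are hypothetical.

```python
import re

# Only accept keys that look like environment-style names (e.g. PDF_URL).
KEY_RE = re.compile(r"^[A-Z_][A-Z0-9_]*$")


def parse_step_outputs(stdout: str) -> dict[str, str]:
    """Collect KEY=VALUE lines printed after the 'GITHUB_OUTPUT:' marker."""
    outputs: dict[str, str] = {}
    seen_marker = False
    for line in stdout.splitlines():
        if line.strip() == "GITHUB_OUTPUT:":
            seen_marker = True
            continue
        if not seen_marker or "=" not in line:
            continue
        # Split on the first '=' only: URLs may contain '=' in query strings.
        key, value = line.split("=", 1)
        if KEY_RE.match(key.strip()):
            outputs[key.strip()] = value.strip()
    return outputs


# Hypothetical script output, shaped like the console.log calls in the diff.
sample = (
    "=" * 60 + "\n\n"
    "GITHUB_OUTPUT:\n"
    "PDF_URL=https://example.com/standards.pdf?v=2\n"
    "LINK_TEXT=2025-2026 OSI Time Standards\n"
)
print(parse_step_outputs(sample))
```

Splitting on the first `=` keeps query strings intact, and the key pattern skips decorative lines such as the `=` separator the script prints.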

### 2. PDF Extraction (`.github/scripts/extract-pdf-with-pdfplumber.py`)

Uses the pdfplumber library to:
- Download the PDF
- Extract table data from each page
- Parse time standards data
- Convert to JSON matching the `swim_time_standards.json` structure
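
As a rough illustration of the download and table-extraction steps, here is a minimal sketch using `pdfplumber.open` and `page.extract_tables()`. It is not the repository's script: the helper names (`download_pdf`, `clean_row`, `extract_rows`) are hypothetical, and the mapping from rows to the `swim_time_standards.json` structure is omitted because that schema is not described in this document.

```python
import urllib.request
from pathlib import Path


def download_pdf(url: str, dest: Path) -> Path:
    """Fetch the PDF to a local file so pdfplumber can open it."""
    with urllib.request.urlopen(url) as response:
        dest.write_bytes(response.read())
    return dest


def clean_row(row):
    """Strip whitespace from each cell; pdfplumber yields None for empty cells."""
    return [(cell or "").strip() for cell in row]


def extract_rows(pdf_path: Path):
    """Return every non-empty table row from every page of the PDF."""
    import pdfplumber  # third-party dependency: pip install pdfplumber

    rows = []
    with pdfplumber.open(str(pdf_path)) as pdf:
        for page in pdf.pages:
            # extract_tables() returns a list of tables, each a list of rows.
            for table in page.extract_tables():
                for row in table:
                    cleaned = clean_row(row)
                    if any(cleaned):
                        rows.append(cleaned)
    return rows
```

A caller would typically run `extract_rows(download_pdf(url, Path("standards.pdf")))` and then parse the resulting rows into the JSON structure.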

### 3. Validation (`.github/scripts/validate-all.sh`)

Runs three validators:
- **Structure**: Verifies required fields and hierarchy
- **Time Format**: Checks MM:SS.MS or SS.MS format, flags invalid seconds (e.g., 96)
- **Time Progression**: Ensures A < B+ < B (faster to slower)

Any issues are flagged with a ⚠️ emoji in the JSON.
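
The repository's validators are JavaScript files whose code is not shown here. To make the format and progression rules concrete, the following Python sketch shows one way to express them. It assumes that the fractional part is two digits (hundredths) and that a seconds field of 60 or more is invalid in both forms. The function names and the sample times are made up for illustration.

```python
import re

# Optional "MM:" prefix, then seconds, then a two-digit fraction (assumed hundredths).
TIME_RE = re.compile(r"^(?:(\d{1,2}):)?(\d{1,2})\.(\d{2})$")


def parse_time(text: str) -> int:
    """Convert 'MM:SS.hh' or 'SS.hh' to hundredths of a second.

    Raises ValueError for malformed input or a seconds field of 60 or more.
    """
    match = TIME_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes, seconds, hundredths = match.groups()
    if int(seconds) >= 60:
        raise ValueError(f"invalid seconds field in {text!r}")
    return (int(minutes or 0) * 60 + int(seconds)) * 100 + int(hundredths)


def progression_ok(a: str, b_plus: str, b: str) -> bool:
    """True when the A time is faster than B+, and B+ is faster than B."""
    return parse_time(a) < parse_time(b_plus) < parse_time(b)


# Made-up sample values, not taken from any real standards PDF.
print(parse_time("1:05.29"))
print(progression_ok("29.59", "31.19", "32.79"))
```

Working in integer hundredths avoids floating-point comparisons when two standards differ by 0.01 seconds.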

### 4. README Update (`.github/scripts/update-readme-inconsistencies.js`)

Updates the "Data Inconsistencies" section in README.md with any flagged issues.
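
The update script itself is JavaScript and is not reproduced in this document. As a sketch of the general technique (replacing the body under one markdown heading), the Python function below swaps everything between a heading and the next heading of the same level, or appends the section if it is missing. The function name and the exact heading text are assumptions.

```python
def replace_section(markdown: str, heading: str, new_body: str) -> str:
    """Replace the text under `heading` up to the next same-level heading."""
    lines = markdown.splitlines()
    try:
        start = lines.index(heading)
    except ValueError:
        # Section not present yet: append it at the end of the document.
        return markdown.rstrip("\n") + f"\n\n{heading}\n\n{new_body.strip()}\n"

    level = heading.split(" ", 1)[0]  # e.g. "##"
    end = len(lines)
    for i in range(start + 1, len(lines)):
        # "### Sub" does not start with "## ", so subsections stay inside.
        if lines[i].startswith(level + " "):
            end = i
            break

    replaced = lines[: start + 1] + ["", new_body.strip(), ""] + lines[end:]
    return "\n".join(replaced).rstrip("\n") + "\n"


readme = "# Title\n\n## Data Inconsistencies\n\nold notes\n\n## License\n\nMIT\n"
print(replace_section(readme, "## Data Inconsistencies", "- flagged item"))
```

This simple version does not recognize headings inside fenced code blocks, and a higher-level heading (such as `# Title`) does not end the section.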

### 5. PR Creation (`.github/scripts/process-new-pdf.sh`)

Creates a feature branch and pull request with:
- Updated `swim_time_standards.json`
- Updated `README.md` (if inconsistencies found)
- Detailed description of changes

## Testing

### Test Validation Scripts

Run the test suite to verify validation logic:

```bash
./tests/test-validation.sh
```

### Test PDF Extraction Locally

To test extraction with a PDF URL:

```bash
# Install pdfplumber if not already installed
pip install pdfplumber

# Run extraction
python3 .github/scripts/extract-pdf-with-pdfplumber.py \
  "https://example.com/path/to/standards.pdf" \
  "2024-2025 OSI Time Standards" \
  "output.json"

# Then validate
bash .github/scripts/validate-all.sh output.json README.md
```

## Cost Considerations

**pdfplumber Extraction:**
- No API costs; completely free
- Runs locally in GitHub Actions
- Fast execution (typically under 10 seconds)

**Recommendations:**
- Keep daily checks enabled (processing only happens when changes are detected)
- Review PRs promptly to avoid duplicate processing

## Troubleshooting

### Workflow fails with "pdfplumber not found" or import error

**Solution:** This should not happen, because pdfplumber is installed in the workflow. Check that the `pip install pdfplumber` step completed successfully in the workflow logs.

### Extraction produces invalid JSON

**Solution:**
1. Check the extracted JSON structure
2. Review pdfplumber's output in workflow logs
3. The PDF structure may have changed; the parsing logic in `extract-pdf-with-pdfplumber.py` may need adjusting
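
For the first step, a quick local check is to load the file with Python's standard `json` module and report where parsing fails. This is a small sketch with a hypothetical helper name (`check_json`); it only confirms that the text is well-formed JSON, not that it matches the expected structure.

```python
import json


def check_json(text: str):
    """Return None if `text` parses as JSON, else a message locating the error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# A trailing comma is a common cause of invalid JSON.
print(check_json('{"a": 1}'))
print(check_json('{"a": 1,}'))
```

To check an extraction result, pass in the file contents, for example `check_json(open("output.json", encoding="utf-8").read())`.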

### Validation flags too many issues

**Solution:**
1. Review the PDF source data - it may have legitimate errors
2. Check validation logic in `.github/scripts/validate-*.js`
3. Issues are flagged but data is still committed for your review

### PR creation fails

**Solution:**
1. Ensure `gh` CLI is working in GitHub Actions
2. Verify repository permissions for GitHub Actions
3. Check if a PR already exists for that branch

## File Structure

```
.github/
├── scripts/
│   ├── check-for-new-pdf.js              # Detects new PDFs
│   ├── extract-pdf-with-pdfplumber.py    # pdfplumber extraction
│   ├── validate-json-structure.js        # Structure validator
│   ├── validate-time-format.js           # Time format validator
│   ├── validate-time-progression.js      # Progression validator
│   ├── validate-all.sh                   # Runs all validators
│   ├── update-readme-inconsistencies.js  # Updates README
│   └── process-new-pdf.sh                # Main orchestration
└── workflows/
    └── check-pdf.yml                     # GitHub Actions workflow

tests/
├── test-validation.sh                    # Validation test suite
└── test-pdf-checker.sh                   # PDF checker tests
```

## Support

For issues with the workflow:
1. Check workflow logs in GitHub Actions
2. Review error messages from validation scripts
3. Open an issue in the repository

For pdfplumber library issues:
- Check pdfplumber documentation: https://github.com/jsvine/pdfplumber
- Review Python error messages in workflow logs

.github/scripts/check-for-new-pdf.js

Lines changed: 7 additions & 0 deletions
@@ -177,6 +177,13 @@ function checkForNewPDF(html, currentData) {
     console.log('Status: SUCCESS - Change detected');
     console.log('Action: Continuing to next steps...');
     console.log('='.repeat(60));
+
+    // Output for GitHub Actions
+    console.log();
+    console.log('GITHUB_OUTPUT:');
+    console.log(`PDF_URL=${changeDetails.href}`);
+    console.log(`LINK_TEXT=${changeDetails.text}`);
+
     return {
       changed: true,
       type: changeType,
