Repository Size Optimization: gh-pages Cleanup and Backup Strategy
Executive Summary
This document outlines a three-phase approach to reduce the lecture-python.myst repository size from 4.2 GB to ~100 MB (a reduction of roughly 98%) while adding a backup strategy based on GitHub Release assets. The project addresses slow clone times (5-10 minutes, expected to drop to about 30 seconds) and high bandwidth usage, both caused by 194 deployment commits in the gh-pages branch history.
Project Phases
Phase 1: Enable HTML Archive Backup on Releases
Objective: Establish backup infrastructure before cleaning up gh-pages history.
Why First: Creates a safety net and historical restore points before any data is removed.
Implementation Steps:
- Modify Publish Workflow - Update .github/workflows/publish.yml to create archives:
    # Add after your existing build steps, before gh-pages deployment
    - name: Create HTML archive
      run: |
        cd _build/html
        # Create compressed archives
        tar -czf ../../lecture-python-html-${{ github.ref_name }}.tar.gz .
        zip -r ../../lecture-python-html-${{ github.ref_name }}.zip .
        cd ../..
        # Generate checksums for verification
        sha256sum lecture-python-html-${{ github.ref_name }}.tar.gz > checksums.txt
        sha256sum lecture-python-html-${{ github.ref_name }}.zip >> checksums.txt
        # Create metadata manifest
        cat > manifest.json << EOF
        {
          "tag": "${{ github.ref_name }}",
          "commit": "${{ github.sha }}",
          "timestamp": "$(date -Iseconds)",
          "size_mb": $(du -sm _build/html | cut -f1),
          "file_count": $(find _build/html -type f | wc -l)
        }
        EOF
    - name: Upload archives to release
      uses: softprops/action-gh-release@v1
      with:
        files: |
          lecture-python-html-${{ github.ref_name }}.tar.gz
          lecture-python-html-${{ github.ref_name }}.zip
          checksums.txt
          manifest.json
        body: |
          ## Deployment: ${{ github.ref_name }}
          **Commit:** ${{ github.sha }}
          **Live URL:** https://python.quantecon.org
          ### Archives
          - `lecture-python-html-${{ github.ref_name }}.tar.gz` - Full site (Linux/Mac)
          - `lecture-python-html-${{ github.ref_name }}.zip` - Full site (Windows)
          - `checksums.txt` - SHA256 verification
          - `manifest.json` - Build metadata
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    # Continue with existing gh-pages deployment
    - name: Deploy to GitHub Pages
      uses: peaceiris/actions-gh-pages@v3
      with:
        github_token: ${{ secrets.GITHUB_TOKEN }}
        publish_dir: ./_build/html
        # Note: force_orphan will be added in Phase 3
- Test the Archive Creation - Create a test publish-* tag to verify archives are created correctly:
    git tag publish-test-$(date +%Y%m%d)
    git push origin publish-test-$(date +%Y%m%d)
- Verify Release Assets - Check that the release includes all four files:
  - lecture-python-html-*.tar.gz
  - lecture-python-html-*.zip
  - checksums.txt
  - manifest.json
Success Criteria:
- Workflow successfully creates archives on publish-* tags
- All four assets are attached to releases
- Archives can be downloaded and extracted successfully
- Website content is complete in archives
Timeline: 1-2 hours for implementation and testing
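The "archives can be downloaded and extracted" criterion can be checked against the published checksums. The following is a minimal sketch, assuming the archives and checksums.txt from one release have been downloaded into a single directory (for example with gh release download); the function name verify_release_assets is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify downloaded release assets against checksums.txt.
# Assumes the archives and checksums.txt sit together in the given directory.
verify_release_assets() {
  local dir="$1"
  # sha256sum -c re-hashes every file listed in checksums.txt and returns
  # non-zero if any hash differs or a listed file is missing.
  (cd "$dir" && sha256sum -c checksums.txt)
}
```

Calling verify_release_assets ./downloaded-release prints one OK or FAILED line per archive and returns a non-zero status on any mismatch, so it can gate a later extraction step.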
Phase 2: One-Time gh-pages History Cleanup
Objective: Remove 194 historical deployment commits from gh-pages branch.
Why Second: Now that we have backup infrastructure, we can safely remove history.
Prerequisites:
- Phase 1 complete (archives working)
- Notify team members about upcoming gh-pages force push
- Verify current gh-pages deployment is working correctly
- Have repository maintainer permissions
Cleanup Steps:
1. Create Backup
    cd /path/to/lecture-python.myst
    # Fetch first so origin/gh-pages is current before it is backed up
    git fetch origin gh-pages
    # Create local backup branch (CRITICAL)
    git branch gh-pages-backup origin/gh-pages
    # Optional: Full repository backup
    cd ..
    cp -r lecture-python.myst lecture-python.myst-backup-$(date +%Y%m%d)
    cd lecture-python.myst
2. Fetch and Checkout gh-pages
    git fetch origin gh-pages
    git checkout gh-pages
3. Create Fresh Orphan Branch
    # Create new orphan branch (no parent commits)
    git checkout --orphan gh-pages-new
    # List untracked leftovers (for example build output ignored on main);
    # git add -A would commit them too, so remove any that do not belong
    git status --short | grep '^??'
    # Stage all current files
    git add -A
    # Create single commit with current state
    git commit -m "Fresh gh-pages deployment (history removed to reduce repo size from 4.2GB to ~100MB)"
4. Replace Old gh-pages
    # Delete old gh-pages branch
    git branch -D gh-pages
    # Rename new branch to gh-pages
    git branch -m gh-pages
5. Force Push to Remote
    # Push the orphaned branch (rewrites history)
    git push origin gh-pages --force
6. Return to Main and Clean Up
    # Switch back to main
    git checkout main
    # Remove all references to old commits
    git reflog expire --expire=now --all
    # Aggressive garbage collection
    git gc --aggressive --prune=now
7. Verify Results
    # Check gh-pages commit count (should be 1)
    git rev-list --count origin/gh-pages
    # Check repository size. The local .git stays large while gh-pages-backup
    # still references the old history, so a fresh clone gives the truest
    # figure (should be ~100 MB)
    du -sh .git
    # Verify website still works
    # Visit: https://python.quantecon.org
Expected Results:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Repository size | 4.2 GB | ~100 MB | ~98% reduction |
| Fresh clone time | 5-10 min | ~30 sec | 90-95% faster |
| gh-pages commits | 194 | 1 | 99% fewer |
| Bandwidth per clone | 4.2 GB | ~100 MB | ~98% less |
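The improvement column can be recomputed from the before and after values once real measurements are available. This is a small sketch, assuming the sizes are expressed in the same unit (4.2 GB is taken as 4200 MB here); pct_reduction is a hypothetical helper, and since the inputs are approximate the percentages are too.

```bash
#!/usr/bin/env bash
# Hypothetical helper: percentage reduction from a "before" to an "after" value.
pct_reduction() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

pct_reduction 4200 100   # repository size in MB; prints 97.6
pct_reduction 194 1      # gh-pages commit count; prints 99.5
```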
Rollback Procedure (if needed):
    # Restore from backup branch
    git push origin gh-pages-backup:gh-pages --force
Success Criteria:
- gh-pages branch has only 1 commit
- Repository size reduced to ~100 MB
- Website still accessible and functioning
- All links and content working correctly
Timeline: 15-30 minutes for execution and verification
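To see how much of the repository a given branch accounts for, before or after the cleanup, the objects reachable from that branch can be summed. The following is a rough sketch, assuming it is run inside the clone; ref_disk_bytes is a hypothetical helper, and objects shared between branches are counted for each branch that reaches them.

```bash
#!/usr/bin/env bash
# Hypothetical helper: approximate on-disk bytes of all objects reachable from a ref.
ref_disk_bytes() {
  local ref="$1"
  # rev-list --objects prints "<sha> <path>"; keep only the sha,
  # then ask cat-file for each object's on-disk (compressed) size.
  git rev-list --objects "$ref" \
    | cut -d' ' -f1 \
    | git cat-file --batch-check='%(objectsize:disk)' \
    | awk '{ total += $1 } END { print total + 0 }'
}
```

Comparing ref_disk_bytes origin/gh-pages with ref_disk_bytes origin/main shows which branch dominates. Comparing gh-pages-backup with the new gh-pages shows how much the rewrite removed.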
Phase 3: Enable Orphan Commits for Future Deployments
Objective: Prevent history accumulation on gh-pages going forward.
Why Third: After cleanup, ensure the problem doesn't recur.
Implementation Steps:
- Update Publish Workflow - Add force_orphan: true to the gh-pages deployment:
    - name: Deploy to GitHub Pages
      uses: peaceiris/actions-gh-pages@v3
      with:
        github_token: ${{ secrets.GITHUB_TOKEN }}
        publish_dir: ./_build/html
        force_orphan: true # Creates orphan commits (no parent history)
        commit_message: "Deploy ${{ github.ref_name }}"
- Deploy Test - Create a new publish-* tag to verify orphan behavior:
    git tag publish-test-orphan-$(date +%Y%m%d)
    git push origin publish-test-orphan-$(date +%Y%m%d)
- Verify Orphan Commits - Check that new commits have no parent:
    # Update the local view of the remote branch once the workflow has finished
    git fetch origin gh-pages
    # Should still show only 1 commit after deployment
    git rev-list --count origin/gh-pages
    # Check that new commit is orphan (only one entry should be listed)
    git log origin/gh-pages --oneline -n 5
Success Criteria:
- New deployments create orphan commits
- gh-pages remains at 1 commit after deployments
- Repository size stays ~100 MB
- Deployments still work correctly
Timeline: 30 minutes for implementation and testing
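The orphan check can also be scripted so it can run after each deployment. This is a minimal sketch, assuming origin/gh-pages has already been fetched; is_root_commit is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the ref points at a commit with no parents.
# Returns 0 for a root (orphan) commit, 1 if it has parents, 2 if the ref is invalid.
is_root_commit() {
  local ref="$1" line
  # --parents prints "<commit> <parent>...", so a root commit is a single hash.
  line=$(git rev-list --parents -n 1 "$ref" 2>/dev/null) || return 2
  case "$line" in
    "")    return 2 ;;
    *" "*) return 1 ;;
    *)     return 0 ;;
  esac
}
```

For example, is_root_commit origin/gh-pages && echo "gh-pages is a single orphan commit" confirms that force_orphan is taking effect.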
Post-Implementation: Contributor Instructions
After all phases are complete, contributors need to update their local repositories to benefit from the size reduction.
Option A: Fresh Clone (Recommended)
    cd ~/projects
    # Make sure the old clone holds no unpushed work before deleting it
    rm -rf lecture-python.myst
    git clone https://github.com/QuantEcon/lecture-python.myst.git
Option B: Update Existing Clone
    cd lecture-python.myst
    # Save any uncommitted work
    git stash
    # Update remote references
    git fetch origin
    # Reset gh-pages to match remote
    git checkout gh-pages
    git reset --hard origin/gh-pages
    # Return to main
    git checkout main
    # Restore the stashed work (only if git stash saved something above).
    # Do this before expiring reflogs: stash entries live in a reflog
    git stash pop
    # Clean up local objects
    git reflog expire --expire=now --all
    git gc --aggressive --prune=now
    # Verify size reduction
    du -sh .git
Optional Enhancements
Create Restore Workflow (Optional)
Create .github/workflows/restore-from-release.yml to easily restore from any archived release:
    name: Restore HTML from Release
    on:
      workflow_dispatch:
        inputs:
          release_tag:
            description: 'Release tag to restore (e.g., publish-2025oct22)'
            required: true
            type: string
          confirm:
            description: 'Type "RESTORE" to confirm'
            required: true
            type: string
    # Lets the job push to gh-pages when the default token is read-only
    permissions:
      contents: write
    jobs:
      restore:
        runs-on: ubuntu-latest
        if: github.event.inputs.confirm == 'RESTORE'
        steps:
          - name: Download archive from release
            run: |
              gh release download "$RELEASE_TAG" \
                --repo ${{ github.repository }} \
                --pattern "lecture-python-html-*.tar.gz"
            env:
              GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
              # Passed via env so the input is not interpolated into the script
              RELEASE_TAG: ${{ github.event.inputs.release_tag }}
          - name: Extract archive
            run: |
              mkdir -p restored-html
              tar -xzf lecture-python-html-*.tar.gz -C restored-html
          - name: Deploy to gh-pages
            uses: peaceiris/actions-gh-pages@v3
            with:
              github_token: ${{ secrets.GITHUB_TOKEN }}
              publish_dir: ./restored-html
              force_orphan: true
              commit_message: "Restore from ${{ github.event.inputs.release_tag }}"
Cleanup Old Archives (Optional)
Create a workflow to remove archives from releases older than a threshold:
    name: Cleanup Old Release Archives
    on:
      workflow_dispatch:
      schedule:
        - cron: '0 3 1 * *' # Monthly
    # Deleting release assets needs write access when the default token is read-only
    permissions:
      contents: write
    jobs:
      cleanup:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/github-script@v7
            with:
              script: |
                // Only the first 100 releases are fetched (no pagination)
                const releases = await github.rest.repos.listReleases({
                  owner: context.repo.owner,
                  repo: context.repo.repo,
                  per_page: 100
                });
                const publishReleases = releases.data
                  .filter(r => r.tag_name.startsWith('publish-'))
                  .sort((a, b) => new Date(b.published_at) - new Date(a.published_at));
                // Keep archives for last 20 releases
                const toCleanup = publishReleases.slice(20);
                for (const release of toCleanup) {
                  for (const asset of release.assets) {
                    if (asset.name.includes('lecture-python-html-')) {
                      await github.rest.repos.deleteReleaseAsset({
                        owner: context.repo.owner,
                        repo: context.repo.repo,
                        asset_id: asset.id
                      });
                    }
                  }
                }
Benefits Summary
- Zero clone impact: Release assets don't affect git clone operations
- Comprehensive backups: Historical restore points for all publish-* releases
- Easy disaster recovery: Simple workflow or manual restoration from any release
- Future-proof: force_orphan prevents history accumulation automatically
- Cost: $0 (GitHub does not limit the total size of release assets or the bandwidth used to serve them, though each individual file must be under 2 GiB)
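Manual restoration from a release can be sketched as an extract-and-check step. The example below assumes the .tar.gz archive and the manifest.json from the same release have already been downloaded, and that the destination directory starts out empty; restore_archive is a hypothetical helper, and it reads only the file_count field that the Phase 1 manifest writes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract an archive and compare its file count to manifest.json.
# Usage: restore_archive <archive.tar.gz> <manifest.json> <empty-destination-dir>
restore_archive() {
  local archive="$1" manifest="$2" dest="$3"
  local expected actual
  mkdir -p "$dest"
  tar -xzf "$archive" -C "$dest" || return 1
  # Pull the integer after "file_count": out of the manifest.
  expected=$(grep -o '"file_count": *[0-9][0-9]*' "$manifest" | grep -o '[0-9][0-9]*$')
  # Count regular files in the extracted tree (tr strips padding from wc).
  actual=$(find "$dest" -type f | wc -l | tr -d ' ')
  echo "expected=$expected actual=$actual"
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```

If the counts match, the extracted directory can be served locally or handed to the deployment step. A mismatch suggests a truncated download or a destination that was not empty.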
Related Issues
- #647 - Critical: gh-pages History Cleanup - 96% Repository Size Reduction (4.2GB → 100MB)
- #658 - Automated gh-pages History Cleanup - Rolling 1-Week Window (or 1 Backup)
Total Implementation Timeline
- Phase 1 (Archive Setup): 1-2 hours
- Phase 2 (History Cleanup): 15-30 minutes
- Phase 3 (Enable Orphan): 30 minutes
- Total: 2-3 hours
Final Notes
- Backup branch gh-pages-backup can be deleted after 1-2 weeks of verification
- GitHub Pages may take 1-2 minutes to rebuild after force pushes
- All current URLs and links will continue to work
- Only the gh-pages branch is affected; main branch history remains untouched
- This is a one-time cleanup; force_orphan maintains cleanliness going forward