MAINT: HTML archive as Publish asset + set gh-pages to orphan (no history) #661

@mmcky

Repository Size Optimization: gh-pages Cleanup and Backup Strategy

Executive Summary

This document outlines a three-phase approach to reduce the lecture-python.myst repository size from 4.2 GB to ~100 MB (a reduction of roughly 98%) while adding a backup strategy based on GitHub Release assets. The project addresses slow clone times (5-10 minutes, expected to drop to about 30 seconds) and high bandwidth usage, both caused by the 194 deployment commits in the gh-pages branch history.

Project Phases

Phase 1: Enable HTML Archive Backup on Releases

Objective: Establish backup infrastructure before cleaning up gh-pages history.

Why First: Creates a safety net and historical restore points before any data is removed.

Implementation Steps:

  1. Modify Publish Workflow - Update .github/workflows/publish.yml to create archives:
# Add after your existing build steps, before gh-pages deployment

- name: Create HTML archive
  run: |
    cd _build/html
    
    # Create compressed archives
    tar -czf ../../lecture-python-html-${{ github.ref_name }}.tar.gz .
    zip -r ../../lecture-python-html-${{ github.ref_name }}.zip .
    
    cd ../..
    
    # Generate checksums for verification
    sha256sum lecture-python-html-${{ github.ref_name }}.tar.gz > checksums.txt
    sha256sum lecture-python-html-${{ github.ref_name }}.zip >> checksums.txt
    
    # Create metadata manifest
    cat > manifest.json << EOF
    {
      "tag": "${{ github.ref_name }}",
      "commit": "${{ github.sha }}",
      "timestamp": "$(date -Iseconds)",
      "size_mb": $(du -sm _build/html | cut -f1),
      "file_count": $(find _build/html -type f | wc -l)
    }
    EOF

# Requires `permissions: contents: write` on the job if the default GITHUB_TOKEN is read-only
- name: Upload archives to release
  uses: softprops/action-gh-release@v1
  with:
    files: |
      lecture-python-html-${{ github.ref_name }}.tar.gz
      lecture-python-html-${{ github.ref_name }}.zip
      checksums.txt
      manifest.json
    body: |
      ## Deployment: ${{ github.ref_name }}
      
      **Commit:** ${{ github.sha }}
      **Live URL:** https://python.quantecon.org
      
      ### Archives
      - `lecture-python-html-${{ github.ref_name }}.tar.gz` - Full site (Linux/Mac)
      - `lecture-python-html-${{ github.ref_name }}.zip` - Full site (Windows)
      - `checksums.txt` - SHA256 verification
      - `manifest.json` - Build metadata
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

# Continue with existing gh-pages deployment
- name: Deploy to GitHub Pages
  uses: peaceiris/actions-gh-pages@v3
  with:
    github_token: ${{ secrets.GITHUB_TOKEN }}
    publish_dir: ./_build/html
    # Note: force_orphan will be added in Phase 3
  2. Test the Archive Creation - Create a test publish-* tag to verify that archives are created correctly:
TAG="publish-test-$(date +%Y%m%d)"
git tag "$TAG"
git push origin "$TAG"
  3. Verify Release Assets - Check that the release includes all four files:
    • lecture-python-html-*.tar.gz
    • lecture-python-html-*.zip
    • checksums.txt
    • manifest.json

Success Criteria:

  • Workflow successfully creates archives on publish-* tags
  • All four assets are attached to releases
  • Archives can be downloaded and extracted successfully
  • Website content is complete in archives
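
The download-and-extract check can be rehearsed locally before any real release exists. The following is a minimal sketch, assuming GNU coreutils (`sha256sum`) and `tar` are available; the sample files and the `demo` suffix are hypothetical stand-ins for the real build output and tag name.

```shell
#!/usr/bin/env bash
set -euo pipefail

work=$(mktemp -d)
cd "$work"

# Stand-in for _build/html; the two files are hypothetical sample content.
mkdir -p _build/html/_static
echo '<h1>Lecture</h1>' > _build/html/index.html
echo 'body {}' > _build/html/_static/site.css

# Archive and checksum in the same spirit as the workflow step
# ("demo" replaces the tag name; -C is equivalent to cd-ing into the directory).
archive=lecture-python-html-demo.tar.gz
tar -czf "$archive" -C _build/html .
sha256sum "$archive" > checksums.txt

# Verification side: check the checksum, extract, and compare file counts.
sha256sum -c checksums.txt
mkdir restored
tar -xzf "$archive" -C restored
built=$(find _build/html -type f | wc -l)
restored=$(find restored -type f | wc -l)
echo "files built: $built, files restored: $restored"
```

For a real release, the same `sha256sum -c checksums.txt` call would be run in the directory holding the downloaded assets, and the restored file count could be compared against `file_count` in manifest.json.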

Timeline: 1-2 hours for implementation and testing


Phase 2: One-Time gh-pages History Cleanup

Objective: Remove 194 historical deployment commits from gh-pages branch.

Why Second: Now that we have backup infrastructure, we can safely remove history.

Prerequisites:

  • Phase 1 complete (archives working)
  • Notify team members about upcoming gh-pages force push
  • Verify current gh-pages deployment is working correctly
  • Have repository maintainer permissions

Cleanup Steps:

1. Create Backup

cd /path/to/lecture-python.myst

# Fetch the latest gh-pages so the backup reflects the deployed state
git fetch origin gh-pages

# Create local backup branch (CRITICAL)
git branch gh-pages-backup origin/gh-pages

# Optional: Full repository backup
cd ..
cp -r lecture-python.myst lecture-python.myst-backup-$(date +%Y%m%d)
cd lecture-python.myst

2. Fetch and Checkout gh-pages

git fetch origin gh-pages
git checkout gh-pages

3. Create Fresh Orphan Branch

# Create new orphan branch (no parent commits)
git checkout --orphan gh-pages-new

# Stage all current files (run `git status` first: untracked files left over from main would be committed too)
git add -A

# Create single commit with current state
git commit -m "Fresh gh-pages deployment (history removed to reduce repo size from 4.2GB to ~100MB)"

4. Replace Old gh-pages

# Delete old gh-pages branch
git branch -D gh-pages

# Rename new branch to gh-pages
git branch -m gh-pages

5. Force Push to Remote

# Push the orphaned branch (rewrites history)
git push origin gh-pages --force

6. Return to Main and Clean Up

# Switch back to main
git checkout main

# Remove all references to old commits
git reflog expire --expire=now --all

# Aggressive garbage collection
# (objects reachable from gh-pages-backup are kept until that branch is deleted)
git gc --aggressive --prune=now

7. Verify Results

# Check gh-pages commit count (should be 1)
git rev-list --count origin/gh-pages

# Check repository size (~100 MB expected in a fresh clone; this clone stays
# large until gh-pages-backup is deleted and gc is run again)
du -sh .git

# Verify website still works
# Visit: https://python.quantecon.org
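
Before touching the real branch, the orphan-branch swap from steps 3 and 4 can be rehearsed in a throwaway repository. This is only a sketch: the temporary directory, the demo identity, and the three fake deployment commits are all hypothetical, and nothing is pushed anywhere.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Rehearsal in a throwaway repository; nothing here touches the real remote.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/gh-pages   # start on a branch named gh-pages
git config user.name "Demo User"            # hypothetical identity for the demo
git config user.email "demo@example.com"

# Simulate three deployment commits.
for i in 1 2 3; do
  echo "build $i" > index.html
  git add -A
  git commit -q -m "deploy $i"
done
before=$(git rev-list --count gh-pages)

# Same sequence as steps 3 and 4: orphan branch, single commit, swap names.
git checkout -q --orphan gh-pages-new
git add -A
git commit -q -m "Fresh gh-pages deployment"
git branch -D gh-pages >/dev/null
git branch -m gh-pages

after=$(git rev-list --count gh-pages)
# A root commit prints only its own hash, so the line has exactly one word.
words=$(git rev-list --parents -n 1 gh-pages | wc -w)
echo "commits before: $before, after: $after, hashes on parent line: $words"
```

The working tree is unchanged by the swap; only the commit history behind it is replaced.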

Expected Results:

Metric               Before     After     Improvement
Repository size      4.2 GB     ~100 MB   ~98% reduction
Fresh clone time     5-10 min   ~30 sec   90-95% faster
gh-pages commits     194        1         99% fewer
Bandwidth per clone  4.2 GB     ~100 MB   ~98% less
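
The size reduction implied by the 4.2 GB and ~100 MB figures can be checked with a one-line calculation. The sketch below assumes 1 GB = 1000 MB; the variable names are illustrative only.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C   # keep "." as the decimal separator

before_mb=4200   # 4.2 GB, taking 1 GB = 1000 MB
after_mb=100     # target size from the plan

# Reduction as a percentage of the original size.
reduction=$(awk -v b="$before_mb" -v a="$after_mb" 'BEGIN { printf "%.1f", (1 - a / b) * 100 }')
echo "size reduction: ${reduction}%"
```

Since the ~100 MB target is itself an estimate, the result is best read as "roughly 98%".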

Rollback Procedure (if needed):

# Restore from backup branch
git push origin gh-pages-backup:gh-pages --force

Success Criteria:

  • gh-pages branch has only 1 commit
  • Repository size reduced to ~100 MB
  • Website still accessible and functioning
  • All links and content working correctly

Timeline: 15-30 minutes for execution and verification


Phase 3: Enable Orphan Commits for Future Deployments

Objective: Prevent history accumulation on gh-pages going forward.

Why Third: After cleanup, ensure the problem doesn't recur.

Implementation Steps:

  1. Update Publish Workflow - Add force_orphan: true to gh-pages deployment:
- name: Deploy to GitHub Pages
  uses: peaceiris/actions-gh-pages@v3
  with:
    github_token: ${{ secrets.GITHUB_TOKEN }}
    publish_dir: ./_build/html
    force_orphan: true  # Creates orphan commits (no parent history)
    commit_message: "Deploy ${{ github.ref_name }}"
  2. Deploy Test - Create a new publish-* tag to verify orphan behavior:
git tag publish-test-orphan-$(date +%Y%m%d)
git push origin publish-test-orphan-$(date +%Y%m%d)
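  3. Verify Orphan Commits - Check that new commits have no parent:
# Refresh the remote-tracking branch after the deployment
git fetch origin gh-pages

# Should still show only 1 commit after deployment
git rev-list --count origin/gh-pages

# Check that the new commit is an orphan: the output should be a single hash with no parent hash after it
git rev-list --parents -n 1 origin/gh-pages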

Success Criteria:

  • New deployments create orphan commits
  • gh-pages remains at 1 commit after deployments
  • Repository size stays ~100 MB
  • Deployments still work correctly

Timeline: 30 minutes for implementation and testing


Post-Implementation: Contributor Instructions

After all phases are complete, contributors need to update their local repositories to benefit from size reduction.

Option A: Fresh Clone (Recommended)

cd ~/projects
rm -rf lecture-python.myst
git clone https://github.com/QuantEcon/lecture-python.myst.git

Option B: Update Existing Clone

cd lecture-python.myst

# Save any uncommitted work
git stash

# Update remote references
git fetch origin

# Reset gh-pages to match remote
git checkout gh-pages
git reset --hard origin/gh-pages

# Return to main and restore the stashed work (skip the pop if nothing was stashed)
git checkout main
git stash pop

# Clean up local objects
git reflog expire --expire=now --all
git gc --aggressive --prune=now

# Verify size reduction
du -sh .git

Optional Enhancements

Create Restore Workflow (Optional)

Create .github/workflows/restore-from-release.yml to easily restore from any archived release:

name: Restore HTML from Release

on:
  workflow_dispatch:
    inputs:
      release_tag:
        description: 'Release tag to restore (e.g., publish-2025oct22)'
        required: true
        type: string
      confirm:
        description: 'Type "RESTORE" to confirm'
        required: true
        type: string

jobs:
  restore:
    runs-on: ubuntu-latest
    if: github.event.inputs.confirm == 'RESTORE'
    steps:
      - name: Download archive from release
        run: |
          gh release download "$RELEASE_TAG" \
            --repo "$GITHUB_REPOSITORY" \
            --pattern "lecture-python-html-*.tar.gz"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # Pass the input through the environment rather than interpolating it into the script
          RELEASE_TAG: ${{ github.event.inputs.release_tag }}
      
      - name: Extract archive
        run: |
          mkdir -p restored-html
          tar -xzf lecture-python-html-*.tar.gz -C restored-html
      
      - name: Deploy to gh-pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./restored-html
          force_orphan: true
          commit_message: "Restore from ${{ github.event.inputs.release_tag }}"

Cleanup Old Archives (Optional)

Create a workflow to remove archives from releases older than a threshold:

name: Cleanup Old Release Archives

on:
  workflow_dispatch:
  schedule:
    - cron: '0 3 1 * *'  # Monthly

jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/github-script@v7
        with:
          script: |
            // Fetch every page of releases (a single listReleases call returns at most 100)
            const releases = await github.paginate(github.rest.repos.listReleases, {
              owner: context.repo.owner,
              repo: context.repo.repo,
              per_page: 100
            });
            
            const publishReleases = releases
              .filter(r => r.tag_name.startsWith('publish-'))
              .sort((a, b) => new Date(b.published_at) - new Date(a.published_at));
            
            // Keep archives for last 20 releases
            const toCleanup = publishReleases.slice(20);
            
            for (const release of toCleanup) {
              for (const asset of release.assets) {
                if (asset.name.includes('lecture-python-html-')) {
                  await github.rest.repos.deleteReleaseAsset({
                    owner: context.repo.owner,
                    repo: context.repo.repo,
                    asset_id: asset.id
                  });
                }
              }
            }
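
The retention rule in the workflow above (filter to publish-* tags, sort newest first, skip the most recent N) can be dry-run on sample data before deleting anything. This is a sketch only: the release list is hypothetical, and `keep` is set to 2 instead of 20 to keep the example short.

```shell
#!/usr/bin/env bash
set -euo pipefail

keep=2   # the workflow above keeps 20; 2 keeps the demo short

# Hypothetical release list: "<tag> <published date>", in arbitrary order.
releases='publish-2025jan10 2025-01-10
v1.0 2025-02-01
publish-2025mar05 2025-03-05
publish-2024dec01 2024-12-01
publish-2025feb14 2025-02-14'

# Keep only publish-* tags, sort newest first by date, then drop the first $keep.
to_cleanup=$(printf '%s\n' "$releases" \
  | grep '^publish-' \
  | sort -k2,2r \
  | tail -n +"$((keep + 1))" \
  | cut -d' ' -f1)

echo "$to_cleanup"
```

Only the tags printed at the end would have their archives removed; the two newest publish-* releases and the non-publish tag are left alone.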

Benefits Summary

  • Zero clone impact: Release assets don't affect git clone operations
  • Comprehensive backups: Historical restore points for all publish-* releases
  • Easy disaster recovery: Simple workflow or manual restoration from any release
  • Future-proof: force_orphan prevents history accumulation automatically
  • Cost: $0 (GitHub does not charge for Release assets or limit their total size, although each individual file must be under 2 GiB)

Total Implementation Timeline

  • Phase 1 (Archive Setup): 1-2 hours
  • Phase 2 (History Cleanup): 15-30 minutes
  • Phase 3 (Enable Orphan): 30 minutes
  • Total: 2-3 hours

Final Notes

  • Backup branch gh-pages-backup can be deleted after 1-2 weeks of verification
  • GitHub Pages may take 1-2 minutes to rebuild after force pushes
  • All current URLs and links will continue to work
  • Only gh-pages branch is affected; main branch history remains untouched
  • This is a one-time cleanup; force_orphan maintains cleanliness going forward
