# PySpark Wheel Builder

Automates the PySpark wheel build process (clone → apply Affirm commit → compile → package → optional upload) that was previously done manually on Thor. The Docker image is built and run **on a Thor** to avoid Mac M1/ARM issues and to use the Affirm network for Artifactory uploads. Thor also injects GitHub credentials into the environment, so the container can access private GitHub repositories (e.g. Affirm/spark) without any additional auth setup.

## Flow

All Affirm modifications to Spark and PySpark must be captured in a single, unmerged commit in the [Affirm/spark](https://github.com/Affirm/spark) repo. The build script fetches that commit and applies its diff on top of the upstream Spark release tag.

The end-to-end workflow is:

1. **Author changes** — Create a commit in [Affirm/spark](https://github.com/Affirm/spark) containing all Affirm-specific changes to Spark/PySpark. This commit does **not** need to be merged to master.
2. **Register the release** — Add a new section to `build-tools/pyspark/VERSIONS.md` in this repo (ATT), including the `Commit SHA` of the Affirm/spark commit. Open a PR and merge to master.
3. **Deploy** — In Buildkite (or on a Thor), trigger the build pipeline and provide:
   - `spark-version` (e.g. `4.0.0`)
   - `affirm-version` (e.g. `1`)
   - `commit-sha` (the full SHA of the Affirm/spark commit, the same one recorded in VERSIONS.md)
4. **Publish** — The script builds the three PySpark wheels and pushes them to Artifactory.

> For productive releases, the build script validates two things: the `affirm-version` section must exist in `VERSIONS.md` **and** the `Commit SHA` recorded there must match the `--apply-commit` argument. Any mismatch causes an early exit before compilation.

## Files

| File | Purpose |
|------|---------|
| `Dockerfile` | Build environment (affirm-base, Java 17 by default via `JAVA_VERSION` build arg, git, wheel, twine, setuptools); copies `VERSIONS.md` into the image |
| `build-pyspark-wheels.sh` | Orchestrates clone, commit apply, Maven build, wheel packaging, and optional Artifactory upload |
| `VERSIONS.md` | Changelog for all productive Affirm releases; each entry must include the Affirm/spark commit SHA |

## VERSIONS.md — documenting productive releases

`VERSIONS.md` tracks every **productive** Affirm release and is validated at build time.

| AFFIRM_VERSION type | Example | Must be in VERSIONS.md? |
|---------------------|---------|-------------------------|
| Productive | `1`, `2`, `3` (integer only) | Yes — required |
| Development | `dev1`, `dev2` | No — exempt |

For each productive release, add a `## <N>` section containing:

- `**Commit SHA:** <full-sha>` — the Affirm/spark commit applied during the build (validated by the script).
- A short description of what changed.

> The build script exits with an error if a productive `AFFIRM_VERSION` is used and either the section is missing or the recorded SHA does not match `--apply-commit`.

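A minimal `VERSIONS.md` entry might look like this (the SHA and description below are placeholders, not a real release):

```markdown
## 1

**Commit SHA:** a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2

- Affirm-specific Spark/PySpark patches for the 4.0.0 release.
```
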
## ✅ Prerequisites

- Thor instance (see [Thor Quickstart Guide](https://www.notion.so/Thor-Quickstart-Guide-16d40e54ae3881ea9abaecfcc11177ab?pvs=21))
- Docker on the Thor
- Environment: `GIT_USER_EMAIL` and `GIT_USER_NAME` (used for git `user.email` / `user.name` inside the container)
- Optional: `GITHUB_TOKEN` — a GitHub PAT for authenticated repo access (see [Environment variables](#environment-variables) below)
- For upload: Artifactory credentials (`ARTIFACTORY_USER`, `ARTIFACTORY_TOKEN`)

The script supports **Python 3.9.x or 3.12.x** (as provided by the affirm-base image). It checks the Python version at startup and exits with an error if neither is detected.

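The startup check can be pictured as the following sketch (the helper name `check_python` is illustrative; the real script may test differently):

```bash
# Accept only Python 3.9.x or 3.12.x version strings
check_python() {
  case "$1" in
    3.9.*|3.12.*) return 0 ;;
    *) return 1 ;;
  esac
}

if check_python "3.9.17"; then
  echo "supported"
else
  echo "ERROR: unsupported Python version" >&2
fi
```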
## Build the image (on Thor)

```bash
affirm.dev connect  # or SSH into your Thor

cd ~/all-the-things/build-tools/pyspark

# Default build (Python 3.9, Java 17 — matches batch-infra base image)
docker build -t pyspark-builder:latest .

# Python 3.12 build — swap the base image via --build-arg
docker build \
  --build-arg BASE_IMAGE=998571911837.dkr.ecr.us-east-1.amazonaws.com/affirm-base:0.92.7-py3.12.8 \
  -t pyspark-builder:latest-py3.12 .

# Custom Java version (e.g. Java 21)
docker build \
  --build-arg JAVA_VERSION=21 \
  -t pyspark-builder:latest-java21 .
```

## Workflow: upstream tag + commit

Clone Affirm/spark, check out the upstream tag (e.g. `v4.0.0`), apply the Affirm commit, then build and package the wheels.

```bash
docker run --rm \
  -v /tmp/pyspark-dist:/build/dist \
  -e GIT_USER_EMAIL="your.email@affirm.com" \
  -e GIT_USER_NAME="yourname" \
  -e GITHUB_TOKEN="$GITHUB_TOKEN" \
  pyspark-builder:latest \
  --spark-version 4.0.0 \
  --affirm-version dev1 \
  --apply-commit a1b2c3d4e5f6...
```

What happens:

- Clones **Affirm/spark** and fetches upstream tag `v${SPARK_VERSION}` (e.g. `v4.0.0`).
- Creates branch `affirm-${SPARK_VERSION}-${AFFIRM_VERSION}` from that tag.
- Fetches the commit SHA from Affirm/spark and applies its diff with `git apply`.
- For productive `AFFIRM_VERSION`s: validates that a section for the version exists in `VERSIONS.md` **and** that the recorded `Commit SHA` matches `--apply-commit` before proceeding.
- Sets the PySpark version string in `version.py`, then runs Maven (`-DskipTests -Pkubernetes clean package -T 4`) and builds the three wheels. Does **not** upload unless `--upload` is passed.

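The steps above correspond roughly to the git sequence below (a sketch, not the script's exact invocation; `APPLY_COMMIT` stands in for the `--apply-commit` value):

```bash
SPARK_VERSION=4.0.0
AFFIRM_VERSION=dev1
TAG="v${SPARK_VERSION}"
BRANCH="affirm-${SPARK_VERSION}-${AFFIRM_VERSION}"
echo "tag=${TAG} branch=${BRANCH}"   # tag=v4.0.0 branch=affirm-4.0.0-dev1

# Sketch of the clone/apply sequence (needs network access, so shown as comments):
#   git clone https://github.com/Affirm/spark.git && cd spark
#   git fetch origin --tags "$TAG"
#   git checkout -b "$BRANCH" "$TAG"
#   git fetch origin "$APPLY_COMMIT"
#   git diff "${APPLY_COMMIT}~1" "$APPLY_COMMIT" | git apply
```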
## Dry-run build (no upload)

```bash
docker run --rm \
  -v /tmp/pyspark-dist:/build/dist \
  -e GIT_USER_EMAIL="your.email@affirm.com" \
  -e GIT_USER_NAME="yourname" \
  -e GITHUB_TOKEN="$GITHUB_TOKEN" \
  pyspark-builder:latest \
  --spark-version 4.0.0 \
  --affirm-version dev1 \
  --apply-commit a1b2c3d4e5f6...
```

This clones Affirm/spark, applies the commit, compiles with Maven (~20–40 min), and writes three wheels to `/tmp/pyspark-dist/` on the host.

## Upload to Artifactory (opt-in)

Only when you are ready to publish:

```bash
docker run --rm \
  -v /tmp/pyspark-dist:/build/dist \
  -e GIT_USER_EMAIL="your.email@affirm.com" \
  -e GIT_USER_NAME="yourname" \
  -e GITHUB_TOKEN="$GITHUB_TOKEN" \
  -e ARTIFACTORY_USER="your.email@affirm.com" \
  -e ARTIFACTORY_TOKEN="your_jfrog_token" \
  pyspark-builder:latest \
  --spark-version 4.0.0 \
  --affirm-version 1 \
  --apply-commit a1b2c3d4e5f6... \
  --upload
```

Without `--upload`, the script never uploads, so repeated test runs do not push artifacts.

## Buildkite pipeline

The production deploy pipeline should collect three inputs from the operator and pass them into the script:

| Pipeline input | Script argument |
|----------------|-----------------|
| Spark version | `--spark-version` |
| Affirm version | `--affirm-version` |
| Commit SHA | `--apply-commit` |

When publishing, the pipeline should also set `--upload` and inject `ARTIFACTORY_USER` / `ARTIFACTORY_TOKEN` from secrets.

## Docker build args

| Build arg | Default | Description |
|-----------|---------|-------------|
| `BASE_IMAGE` | `…/affirm-base:0.92.7-py3.9.17` | Base Docker image; swap to a `py3.12` tag for Python 3.12 builds |
| `JAVA_VERSION` | `17` | JDK major version installed in the image (e.g. `17`, `21`); must be `>=17` for Spark 4.x |

## Script options

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `--spark-version` | Yes | — | Upstream Spark version (e.g. `4.0.0`); used for version string and upstream tag |
| `--affirm-version` | Yes | — | Affirm suffix (e.g. `dev1`, `1`) → `2815!{spark-version}+affirm.{affirm-version}` |
| `--apply-commit` | Yes | — | Full SHA of the Affirm/spark commit to apply; must match `Commit SHA` in VERSIONS.md for productive releases |
| `--output-dir` | No | `/build/dist` | Directory for built `.whl` files |
| `--upload` | No | `false` | Upload wheels to Artifactory (requires `ARTIFACTORY_USER` and `ARTIFACTORY_TOKEN`) |

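The version string follows PEP 440: `2815!` is a version epoch and `+affirm.{affirm-version}` is a local version label. A sketch of how it can be assembled and sanity-checked (variable names are illustrative, not the script's actual code):

```bash
SPARK_VERSION=4.0.0
AFFIRM_VERSION=1
PYSPARK_VERSION="2815!${SPARK_VERSION}+affirm.${AFFIRM_VERSION}"

# Productive versions are plain integers; development versions look like dev<N>
if [[ "$AFFIRM_VERSION" =~ ^[0-9]+$ ]]; then
  echo "productive release: $PYSPARK_VERSION"
else
  echo "development build: $PYSPARK_VERSION"
fi
```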
### Environment variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `GIT_USER_EMAIL` | Yes | — | Used for `git config user.email` inside the container |
| `GIT_USER_NAME` | Yes | — | Used for `git config user.name` inside the container |
| `GITHUB_TOKEN` | No | — | GitHub PAT for authenticated repo access. When set, the script configures `GIT_ASKPASS` so the token never appears in URLs, git config, or process listings |
| `REPO_URL` | No | `https://affirmprod.jfrog.io/…/pypi-local` | Artifactory PyPI repository URL used by `twine upload`. Override to target a different repository |
| `ARTIFACTORY_USER` | Only with `--upload` | — | Affirm email for Artifactory authentication |
| `ARTIFACTORY_TOKEN` | Only with `--upload` | — | JFrog API token for Artifactory authentication |

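The `GIT_ASKPASS` mechanism can be sketched as follows (a hypothetical illustration of the technique, not the script's exact code). Git runs the askpass program whenever it needs a credential, so the token is read from the environment instead of being embedded in URLs:

```bash
# Write a small helper that prints the token when git asks for credentials
askpass="$(mktemp)"
cat > "$askpass" <<'EOF'
#!/bin/sh
echo "$GITHUB_TOKEN"
EOF
chmod +x "$askpass"
export GIT_ASKPASS="$askpass"

# git now authenticates without the token appearing in URLs or git config
GITHUB_TOKEN=dummy-token "$askpass"   # prints: dummy-token
```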
## 🚦 Exit codes

| Code | Meaning |
|------|---------|
| ✅ 0 | Success |
| ❌ 1 | General error (clone, fetch, checkout, apply-commit/overlay, missing VERSIONS.md section, or unsupported Python version); scroll up to the last `ERROR:` line |
| ❌ 2 | Maven compilation failed |
| ❌ 3 | Wheel packaging failed |
| ❌ 4 | Upload failed (missing credentials or Artifactory error) |

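A wrapper (e.g. a Buildkite step) can map these codes to human-readable messages; a small hypothetical helper:

```bash
# Map the build script's exit code to a short description
describe_build_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "general error: check the last ERROR: line in the log" ;;
    2) echo "maven compilation failed" ;;
    3) echo "wheel packaging failed" ;;
    4) echo "artifactory upload failed" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

describe_build_exit 2   # maven compilation failed
```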
## 🎡 Expected wheels

After a successful run you get three wheels under `--output-dir`:

- `pyspark-2815!4.0.0+affirm.1-py2.py3-none-any.whl`
- `pyspark_client-2815!4.0.0+affirm.1-py2.py3-none-any.whl`
- `pyspark_connect-2815!4.0.0+affirm.1-py2.py3-none-any.whl`

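A quick post-build sanity check (a hypothetical helper; assumes wheels land in the mounted output directory):

```bash
# Count the .whl files in a directory; a successful build should yield exactly 3
count_wheels() {
  find "$1" -maxdepth 1 -name '*.whl' | wc -l
}

# Example against a scratch directory with three placeholder files
dir="$(mktemp -d)"
touch "$dir/pyspark-2815!4.0.0+affirm.1-py2.py3-none-any.whl" \
      "$dir/pyspark_client-2815!4.0.0+affirm.1-py2.py3-none-any.whl" \
      "$dir/pyspark_connect-2815!4.0.0+affirm.1-py2.py3-none-any.whl"
count_wheels "$dir"
```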
## 🔗 References

- [Spark 4.0.0 guide](https://www.notion.so/Spark-4-0-0-guide-30440e54ae38806790f3ee84f2a9f853?pvs=21) — the manual process this automates
- [Runbook: how to upgrade PySpark at Affirm](https://www.notion.so/Runbook-how-to-upgrade-PySpark-at-Affirm-19340e54ae388015bb68e94a8389d526?pvs=21) — version naming
- [Thor Quickstart Guide](https://www.notion.so/Thor-Quickstart-Guide-16d40e54ae3881ea9abaecfcc11177ab?pvs=21)
- [jvm-build-tools/bin/upload-prebuilt-spark.sh](https://github.com/Affirm/all-the-things/blob/master/jvm-build-tools/bin/upload-prebuilt-spark.sh) — sibling script for uploading prebuilt Spark JARs