
Commit 3c46d7e: write pyspark wheels builder
Parent: 6bb1615

File tree: 4 files changed (+540, −0 lines)

tools/pyspark_build/Dockerfile (34 additions, 0 deletions)
```dockerfile
# PySpark wheel builder — compilation environment for building PySpark wheels on Thor.
# Base image matches batch-infra DTs (Ubuntu 22.04, Python 3.9).
ARG BASE_IMAGE=998571911837.dkr.ecr.us-east-1.amazonaws.com/affirm-base:0.92.7-py3.9.17

FROM ${BASE_IMAGE}

ARG JAVA_VERSION=17

# Java >=17 required for Spark 4.x compilation
RUN apt-get update && apt-get install -y --no-install-recommends \
    openjdk-${JAVA_VERSION}-jdk-headless \
    git=1:2.34.1-1ubuntu1.17 \
    curl=7.81.0-1ubuntu1.23 \
    wget=1.21.2-2ubuntu1.1 \
    && rm -rf /var/lib/apt/lists/*

# Python tooling for building and uploading wheels
RUN python3 -m pip install --no-cache-dir wheel==0.46.3 twine==6.2.0 setuptools==81.0.0

ENV JAVA_HOME="/usr/lib/jvm/java-${JAVA_VERSION}-openjdk-amd64"
ENV PATH="$JAVA_HOME/bin:$PATH"
ENV MAVEN_OPTS="-Xss128m -Xmx4g -XX:ReservedCodeCacheSize=2g"

WORKDIR /build

COPY build-pyspark-wheels.sh /build/build-pyspark-wheels.sh
RUN chmod +x /build/build-pyspark-wheels.sh

# VERSIONS.md is validated at runtime: productive releases must have a section with
# a matching Commit SHA. No version-specific file folders are needed; all Affirm changes
# must live in the commit SHA recorded in VERSIONS.md.
COPY VERSIONS.md /build/VERSIONS.md

ENTRYPOINT ["/build/build-pyspark-wheels.sh"]
```

tools/pyspark_build/README.md (200 additions, 0 deletions)
# PySpark Wheel Builder

Automates the PySpark wheel build process (clone → apply Affirm commit → compile → package → optional upload) that was previously done manually on Thor. The Docker image is built and run **on a Thor** to avoid Mac M1/ARM issues and to use the Affirm network for Artifactory uploads. Thor also injects GitHub credentials into the environment, so the container has access to private GitHub repositories (e.g. Affirm/spark) without any additional auth setup.

## Flow

All Affirm modifications to Spark and PySpark must be captured in a single, unmerged commit in the [Affirm/spark](https://github.com/Affirm/spark) repo. The build script fetches that commit and applies its diff on top of the upstream Spark release tag.

The end-to-end workflow is:

1. **Author changes** — Create a commit in [Affirm/spark](https://github.com/Affirm/spark) containing all Affirm-specific changes to Spark/PySpark. This commit does **not** need to be merged to master.
2. **Register the release** — Add a new section to `build-tools/pyspark/VERSIONS.md` in this repo (ATT), including the `Commit SHA` of the Affirm/spark commit. Open a PR and merge to master.
3. **Deploy** — In Buildkite (or on a Thor), trigger the build pipeline and provide:
   - `spark-version` (e.g. `4.0.0`)
   - `affirm-version` (e.g. `1`)
   - `commit-sha` (the full SHA of the Affirm/spark commit, same one recorded in VERSIONS.md)
4. **Publish** — The script builds the three PySpark wheels and pushes them to Artifactory.

> The build script validates both fields for productive releases: the `affirm-version` section must exist in `VERSIONS.md` **and** the `Commit SHA` recorded there must match the `--apply-commit` argument. Mismatches cause an early exit before compilation.

## Files

| File | Purpose |
|------|---------|
| `Dockerfile` | Build environment (affirm-base, Java 17 by default via `JAVA_VERSION` build arg, git, wheel, twine, setuptools); copies `VERSIONS.md` into the image |
| `build-pyspark-wheels.sh` | Orchestrates clone, commit apply, Maven build, wheel packaging, and optional Artifactory upload |
| `VERSIONS.md` | Changelog for all productive Affirm releases; each entry must include the Affirm/spark commit SHA |

## VERSIONS.md — documenting productive releases

`VERSIONS.md` tracks every **productive** Affirm release and is validated at build time.

| AFFIRM_VERSION type | Example | Must be in VERSIONS.md? |
|---------------------|---------|-------------------------|
| Productive | `1`, `2`, `3` (integer only) | Yes — required |
| Development | `dev1`, `dev2` | No — exempt |

For each productive release, add a `## <N>` section containing:

- `**Commit SHA:** <full-sha>` — the Affirm/spark commit applied during the build (validated by the script).
- A short description of what changed.

> The build script exits with an error if a productive `AFFIRM_VERSION` is used and either the section is missing or the recorded SHA does not match `--apply-commit`.
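The validation can be sketched roughly as follows (the helper name `validate_release` and the exact parsing are assumptions for illustration; the real script may differ):

```shell
# Sketch of the VERSIONS.md check: productive versions need a "## <N>" section
# whose recorded Commit SHA equals --apply-commit; dev versions are exempt.
versions_md=$(mktemp)
cat > "$versions_md" <<'EOF'
## 1
**Commit SHA:** a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2
EOF

validate_release() {  # usage: validate_release <affirm-version> <apply-commit>
  case "$1" in dev*) return 0 ;; esac   # development versions are exempt
  # SHA recorded on the "Commit SHA" line under the "## <version>" heading.
  recorded=$(awk -v v="## $1" '$0 == v {f=1} f && /Commit SHA/ {print $NF; exit}' "$versions_md")
  [ -n "$recorded" ] && [ "$recorded" = "$2" ]
}

validate_release dev1 anything && echo "dev1: exempt, no entry needed"
validate_release 1 a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2 && echo "1: SHA matches"
validate_release 1 deadbeef || echo "1 with wrong SHA: build would abort"
```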
## ✅ Prerequisites

- Thor instance (see [Thor Quickstart Guide](https://www.notion.so/Thor-Quickstart-Guide-16d40e54ae3881ea9abaecfcc11177ab?pvs=21))
- Docker on the Thor
- Environment: `GIT_USER_EMAIL` and `GIT_USER_NAME` (used for git `user.email` / `user.name` inside the container)
- Optional: `GITHUB_TOKEN` — a GitHub PAT for authenticated repo access (see [Environment variables](#environment-variables) below)
- For upload: Artifactory credentials (`ARTIFACTORY_USER`, `ARTIFACTORY_TOKEN`)

The script supports **Python 3.9.x or 3.12.x** (as provided by the affirm-base image). It checks the version at startup and exits with an error if neither is detected.
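The startup gate behaves like this small sketch (the helper name `supported_python` is hypothetical):

```shell
# Only Python 3.9.x and 3.12.x are accepted; anything else fails the check.
supported_python() {
  case "$1" in
    3.9.*|3.12.*) return 0 ;;
    *) return 1 ;;
  esac
}

supported_python 3.9.17 && echo "3.9.17: ok"
supported_python 3.12.8 && echo "3.12.8: ok"
supported_python 3.11.4 || echo "3.11.4: rejected (script would exit 1)"
```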
## Build the image (on Thor)

```bash
affirm.dev connect   # or SSH into your Thor
cd ~/all-the-things/build-tools/pyspark

# Default build (Python 3.9, Java 17 — matches batch-infra base image)
docker build -t pyspark-builder:latest .

# Python 3.12 build — swap the base image via --build-arg
docker build \
  --build-arg BASE_IMAGE=998571911837.dkr.ecr.us-east-1.amazonaws.com/affirm-base:0.92.7-py3.12.8 \
  -t pyspark-builder:latest-py3.12 .

# Custom Java version (e.g. Java 21)
docker build \
  --build-arg JAVA_VERSION=21 \
  -t pyspark-builder:latest-java21 .
```
## Workflow: upstream tag + commit

Clone Affirm/spark, checkout the upstream tag (e.g. `v4.0.0`), apply the Affirm commit, then build and package wheels.

```bash
docker run --rm \
  -v /tmp/pyspark-dist:/build/dist \
  -e GIT_USER_EMAIL="your.email@affirm.com" \
  -e GIT_USER_NAME="yourname" \
  -e GITHUB_TOKEN="$GITHUB_TOKEN" \
  pyspark-builder:latest \
  --spark-version 4.0.0 \
  --affirm-version dev1 \
  --apply-commit a1b2c3d4e5f6...
```

What happens:

- Clones **Affirm/spark** and fetches upstream tag `v${SPARK_VERSION}` (e.g. `v4.0.0`).
- Creates branch `affirm-${SPARK_VERSION}-${AFFIRM_VERSION}` from that tag.
- Fetches the commit (by SHA) from Affirm/spark and applies its diff with `git apply`.
- For productive `AFFIRM_VERSION`s: validates that a section for the version exists in `VERSIONS.md` **and** that the recorded `Commit SHA` matches `--apply-commit` before proceeding.
- Sets the PySpark version string in `version.py`, then runs Maven (`-DskipTests -Pkubernetes clean package -T 4`) and builds the three wheels. Does **not** upload unless `--upload` is passed.
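The branch-and-apply steps can be demonstrated end-to-end against a throwaway local repository standing in for Affirm/spark (the repo layout and file name here are invented for illustration):

```shell
# Demo of "branch from upstream tag, apply the Affirm commit's diff":
# a scratch repo plays the role of Affirm/spark.
work=$(mktemp -d)
repo="$work/spark"
git init -q "$repo"
g() { git -C "$repo" -c user.email=demo@affirm.com -c user.name=demo "$@"; }

g commit -q --allow-empty -m "upstream release"     # pretend upstream history
g tag v4.0.0                                        # the upstream release tag
echo "affirm change" > "$repo/setup.py"             # the Affirm-specific change
g add setup.py
g commit -q -m "all Affirm changes"                 # single unmerged commit
APPLY_COMMIT=$(g rev-parse HEAD)

g checkout -q v4.0.0                                # back to the clean tag
g checkout -q -b affirm-4.0.0-dev1                  # affirm-${SPARK}-${AFFIRM}
g diff v4.0.0 "$APPLY_COMMIT" | g apply             # apply the diff, not a merge
cat "$repo/setup.py"                                # the change is now on the branch
```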
## Dry-run build (no upload)

The command is the same as in the workflow above; `--upload` is simply omitted.

```bash
docker run --rm \
  -v /tmp/pyspark-dist:/build/dist \
  -e GIT_USER_EMAIL="your.email@affirm.com" \
  -e GIT_USER_NAME="yourname" \
  -e GITHUB_TOKEN="$GITHUB_TOKEN" \
  pyspark-builder:latest \
  --spark-version 4.0.0 \
  --affirm-version dev1 \
  --apply-commit a1b2c3d4e5f6...
```

This will clone Affirm/spark, apply the commit, compile with Maven (~20–40 min), and write three wheels to `/tmp/pyspark-dist/` on the host.
## Upload to Artifactory (opt-in)

Only when you are ready to publish:

```bash
docker run --rm \
  -v /tmp/pyspark-dist:/build/dist \
  -e GIT_USER_EMAIL="your.email@affirm.com" \
  -e GIT_USER_NAME="yourname" \
  -e GITHUB_TOKEN="$GITHUB_TOKEN" \
  -e ARTIFACTORY_USER="your.email@affirm.com" \
  -e ARTIFACTORY_TOKEN="your_jfrog_token" \
  pyspark-builder:latest \
  --spark-version 4.0.0 \
  --affirm-version 1 \
  --apply-commit a1b2c3d4e5f6... \
  --upload
```

Without `--upload`, the script never uploads, so repeated test runs do not push artifacts.
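Internally the upload amounts to a `twine upload` against the Artifactory PyPI repo. A sketch of the invocation (the exact flags used by the script are an assumption; printed here rather than executed so nothing is pushed):

```shell
# Assumed shape of the upload step; the real script's invocation may differ.
upload_cmd=(twine upload
  --repository-url "${REPO_URL:-<artifactory-pypi-repo-url>}"
  --username "${ARTIFACTORY_USER:-your.email@affirm.com}"
  --password "${ARTIFACTORY_TOKEN:-your_jfrog_token}"
  /build/dist/*.whl)
echo "${upload_cmd[@]}"
```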
## Buildkite pipeline

The production deploy pipeline should collect three inputs from the operator and pass them into the script:

| Pipeline input | Script argument |
|----------------|-----------------|
| Spark version | `--spark-version` |
| Affirm version | `--affirm-version` |
| Commit SHA | `--apply-commit` |

When publishing, the pipeline should also set `--upload` and inject `ARTIFACTORY_USER` / `ARTIFACTORY_TOKEN` from secrets.
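One possible shape for such a pipeline (step keys, labels, and the meta-data names are illustrative assumptions, not the actual pipeline definition):

```yaml
steps:
  - block: "Release parameters"
    fields:
      - text: "Spark version"
        key: "spark-version"
      - text: "Affirm version"
        key: "affirm-version"
      - text: "Commit SHA"
        key: "commit-sha"

  - label: "Build and publish PySpark wheels"
    command: |
      docker run --rm \
        -e ARTIFACTORY_USER -e ARTIFACTORY_TOKEN \
        -e GIT_USER_EMAIL -e GIT_USER_NAME -e GITHUB_TOKEN \
        pyspark-builder:latest \
        --spark-version "$(buildkite-agent meta-data get spark-version)" \
        --affirm-version "$(buildkite-agent meta-data get affirm-version)" \
        --apply-commit "$(buildkite-agent meta-data get commit-sha)" \
        --upload
```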
## Docker build args

| Build arg | Default | Description |
|-----------|---------|-------------|
| `BASE_IMAGE` | `…/affirm-base:0.92.7-py3.9.17` | Base Docker image; swap to a `py3.12` tag for Python 3.12 builds |
| `JAVA_VERSION` | `17` | JDK major version installed in the image (e.g. `17`, `21`); must be `>=17` for Spark 4.x |

## Script options

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `--spark-version` | Yes | | Upstream Spark version (e.g. `4.0.0`); used for version string and upstream tag |
| `--affirm-version` | Yes | | Affirm suffix (e.g. `dev1`, `1`) → `2815!{spark-version}+affirm.{affirm-version}` |
| `--apply-commit` | Yes | | Full SHA of the Affirm/spark commit to apply; must match `Commit SHA` in VERSIONS.md for productive releases |
| `--output-dir` | No | `/build/dist` | Directory for built `.whl` files |
| `--upload` | No | `false` | Upload wheels to Artifactory (requires `ARTIFACTORY_USER` and `ARTIFACTORY_TOKEN`) |
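The version-string arrow in the table expands as a trivial substitution:

```shell
# How --spark-version and --affirm-version combine into the wheel version.
SPARK_VERSION=4.0.0
AFFIRM_VERSION=1
PYSPARK_VERSION="2815!${SPARK_VERSION}+affirm.${AFFIRM_VERSION}"
echo "$PYSPARK_VERSION"   # prints: 2815!4.0.0+affirm.1
```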
### Environment variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `GIT_USER_EMAIL` | Yes | | Used for `git config user.email` inside the container |
| `GIT_USER_NAME` | Yes | | Used for `git config user.name` inside the container |
| `GITHUB_TOKEN` | No | | GitHub PAT for authenticated repo access. When set, the script configures `GIT_ASKPASS` so the token never appears in URLs, git config, or process listings |
| `REPO_URL` | No | `https://affirmprod.jfrog.io/…/pypi-local` | Artifactory PyPI repository URL used by `twine upload`. Override to target a different repository |
| `ARTIFACTORY_USER` | Only with `--upload` | | Affirm email for Artifactory authentication |
| `ARTIFACTORY_TOKEN` | Only with `--upload` | | JFrog API token for Artifactory authentication |
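The `GIT_ASKPASS` mechanism can be sketched like this (the helper file below illustrates how git obtains the token; it is not the script's exact code):

```shell
# git calls the GIT_ASKPASS program whenever it needs credentials; the helper
# prints the token, so it never appears in remote URLs or git config.
GITHUB_TOKEN=example-token          # stand-in; Thor injects the real one
askpass=$(mktemp)
cat > "$askpass" <<'EOF'
#!/bin/sh
echo "$GITHUB_TOKEN"
EOF
chmod +x "$askpass"
export GITHUB_TOKEN GIT_ASKPASS=$askpass

# git would invoke the helper itself; calling it directly shows the behavior:
"$GIT_ASKPASS" "Password for 'https://github.com': "   # prints: example-token
```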
## 🚦 Exit codes

| Code | Meaning |
|------|---------|
| ✅ 0 | Success |
| ❌ 1 | General error (clone, fetch, checkout, apply-commit/overlay, missing VERSIONS.md section, or unsupported Python version); scroll up to the last `ERROR:` line |
| ❌ 2 | Maven compilation failed |
| ❌ 3 | Wheel packaging failed |
| ❌ 4 | Upload failed (missing credentials or Artifactory error) |

## 🎡 Expected wheels

After a successful run you get three wheels under `--output-dir`:

- `pyspark-2815!4.0.0+affirm.1-py2.py3-none-any.whl`
- `pyspark_client-2815!4.0.0+affirm.1-py2.py3-none-any.whl`
- `pyspark_connect-2815!4.0.0+affirm.1-py2.py3-none-any.whl`

## 🔗 References

- [Spark 4.0.0 guide](https://www.notion.so/Spark-4-0-0-guide-30440e54ae38806790f3ee84f2a9f853?pvs=21) — manual process this automates
- [Runbook: how to upgrade PySpark at Affirm](https://www.notion.so/Runbook-how-to-upgrade-PySpark-at-Affirm-19340e54ae388015bb68e94a8389d526?pvs=21) — version naming
- [Thor Quickstart Guide](https://www.notion.so/Thor-Quickstart-Guide-16d40e54ae3881ea9abaecfcc11177ab?pvs=21)
- [jvm-build-tools/bin/upload-prebuilt-spark.sh](https://github.com/Affirm/all-the-things/blob/master/jvm-build-tools/bin/upload-prebuilt-spark.sh) — JAR upload sibling

tools/pyspark_build/VERSIONS.md (51 additions, 0 deletions)
# PySpark — Affirm Version History

This file documents every **productive** Affirm release for PySpark wheels built at Affirm.

## What is an AFFIRM_VERSION?

`AFFIRM_VERSION` is the Affirm-specific suffix appended to the wheel version string
(`2815!<SPARK_VERSION>+affirm.<AFFIRM_VERSION>`). Each productive release corresponds to
a single commit in the [Affirm/spark](https://github.com/Affirm/spark) repository that
contains all the Affirm-specific changes to apply on top of the upstream Spark tag.

## Which versions must be documented here?

| Type | Format | Document here? |
|------|--------|----------------|
| Productive | `1`, `2`, `3`, … (integer only) | **Yes — required** |
| Development | `dev1`, `dev2`, … | No — exempt |

The build script (`build-pyspark-wheels.sh`) enforces this for productive versions:

1. A section `## <AFFIRM_VERSION>` must exist in this file.
2. That section must include a `Commit SHA: <sha>` line. The SHA must exactly match
   the `--apply-commit` argument passed to the build script.

Both checks run before any compilation begins; the script exits with an error if either fails.

## Version sections

Add a new `## <N>` section for each productive release. Each section **must** include:

- `**Commit SHA:** <full-git-sha>` — the SHA from Affirm/spark containing all Affirm changes
  for this release. This is validated by the build script against `--apply-commit`.
- A short description of what changed.
- Optionally: the Spark version(s) this release targets.

Example section:

```
## 0

**Spark version:** 4.0.0
**Commit SHA:** a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2

**Changes:**
- Updated setup.py packaging metadata.
- Adjusted entrypoint.sh for Kubernetes executor compatibility.
```

---

<!-- Add new productive releases below, newest first. -->
