feature: add mps kernel for `compute_puff_advantage` #422

hysmio · 2025-11-22T10:57:32Z

features

adds an mps kernel for compute_puff_advantage
adds objective-c code for launching the kernel, with setup caching
adds the extension source files to the build in setup if mps is available
fixes a bug where nn.init.orthogonal_ is not supported on mps
- since this only happens at init, the approach is to copy to cpu, initialise there, and copy back afterwards; alternatives would be changing the init algorithm or using a different one on mps, but that felt even hackier
adds tests to validate that the output matches the cpu implementation, plus some small benchmarks to check that it's correct and faster than cpu
- the preconfigured sizes come from real envs: 8192x64 is from breakout, 16384x64 is from g2048
- also has profiling set up for xcode
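
For readers unfamiliar with the init workaround, here is a minimal sketch of the copy-to-cpu-and-back idea. The helper name `orthogonal_init_via_cpu` is hypothetical and this is not the PR's actual code; it only assumes the standard `torch.nn.init.orthogonal_` call and in-place `copy_` between devices.

```python
import torch
from torch import nn


def orthogonal_init_via_cpu(weight: torch.Tensor, gain: float = 1.0) -> torch.Tensor:
    """Orthogonally initialise `weight` in place, doing the init itself on cpu.

    Hypothetical helper: the orthogonal init runs on a cpu scratch tensor,
    and the result is copied back to whatever device `weight` lives on.
    """
    with torch.no_grad():
        # Scratch tensor with the same shape and dtype, but always on cpu.
        cpu_weight = torch.empty_like(weight, device="cpu")
        nn.init.orthogonal_(cpu_weight, gain=gain)
        # copy_ handles the cpu -> device transfer.
        weight.copy_(cpu_weight)
    return weight


# Usage: works the same whether the layer is on cpu or another device.
layer = nn.Linear(8, 4)
orthogonal_init_via_cpu(layer.weight)
```

Because this only runs once at model construction, the extra host round trip should not matter for training throughput.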

benchmarks

Macbook Pro M4 Pro Max (14 core)

Benchmarks:
Benchmark (8192 steps, 64 horizon): CPU=0.5696ms MPS=0.5387ms Speedup=1.06x
Benchmark (16384 steps, 64 horizon): CPU=1.0141ms MPS=0.4194ms Speedup=2.42x
Benchmark (100000 steps, 128 horizon): CPU=14.5072ms MPS=2.3930ms Speedup=6.06x
Benchmark (1000000 steps, 128 horizon): CPU=149.9548ms MPS=18.4245ms Speedup=8.14x
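
The speedup column is CPU time divided by MPS time. As a quick sanity check, the snippet below recomputes it from the four timings above; the `timings` table is just those numbers transcribed by hand, not output from the PR's benchmark script.

```python
# (steps, horizon, cpu_ms, mps_ms), copied from the benchmark output above.
timings = [
    (8192, 64, 0.5696, 0.5387),
    (16384, 64, 1.0141, 0.4194),
    (100000, 128, 14.5072, 2.3930),
    (1000000, 128, 149.9548, 18.4245),
]


def speedup(cpu_ms: float, mps_ms: float) -> float:
    """Speedup of MPS over CPU; values above 1.0 mean MPS is faster."""
    return cpu_ms / mps_ms


speedups = [round(speedup(cpu, mps), 2) for _, _, cpu, mps in timings]

for (steps, horizon, _, _), s in zip(timings, speedups):
    print(f"{steps} steps x {horizon} horizon: {s:.2f}x")
```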

Worth noting that the Apple CPU is quite fast at small sizes, so the overhead of launching kernels isn't really justified there (only 1.06x at 8192x64).

This could be extended to check the input size and pick the faster backend. I haven't benchmarked whether the copy to mps or back to cpu outweighs the kernel launch overhead; I imagine it might be close given the unified memory on Apple silicon.
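
One way to sketch the size check: pick the backend from the number of elements in the batch. Everything here is hypothetical. `pick_advantage_device` and `MIN_MPS_ELEMENTS` are not part of the PR, and the threshold is only a placeholder guessed from the benchmark table (roughly break-even at 8192x64), so it would need tuning per machine.

```python
# Hypothetical threshold: 8192 steps x 64 horizon was roughly break-even
# in the benchmarks above, so only larger batches are sent to mps.
MIN_MPS_ELEMENTS = 8192 * 64


def pick_advantage_device(steps: int, horizon: int, mps_available: bool,
                          min_elements: int = MIN_MPS_ELEMENTS) -> str:
    """Return "mps" for batches large enough to amortise kernel launch
    overhead, otherwise "cpu". Purely illustrative, not the PR's logic."""
    if mps_available and steps * horizon > min_elements:
        return "mps"
    return "cpu"


print(pick_advantage_device(8192, 64, mps_available=True))
print(pick_advantage_device(16384, 64, mps_available=True))
```

This ignores the cost of moving tensors between devices, which is the part that has not been benchmarked yet.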

build and run

Building needs the ARCHFLAGS="-arch arm64" env variable to be set
Running needs --train.device=mps

layterz · 2025-11-24T00:43:44Z

I tested this on a macbook air m1 (8gb vram) on the squared env. It works and speeds things up by ~2x for me over 20m steps, but accuracy is quite a bit worse: the final score is ~0.4 on mps vs 0.9 on cpu. Eyeballing some evaluations, the mps-trained policy does seem worse and struggles when the target is further away, compared to the cpu-trained policy.

Less important, but it would be nice to include mps utilization under GPU percentage on the experiment panel.

hysmio · 2025-11-24T03:02:15Z

pufferlib/config/default.ini

 torch_deterministic = True
 cpu_offload = False
-device = cuda
+device = default


Not sure if this is controversial or not, but it means people don't need to pass --train.device mps manually every time; the best available device is selected automatically based on the get_accelerator function.
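
A sketch of what resolving `device = default` could look like. The helper name `resolve_device` and the cuda, then mps, then cpu ordering are assumptions for illustration; the PR relies on its get_accelerator function, whose exact behaviour is not shown here. Availability is passed in as plain booleans to keep the logic easy to test; in practice the values would come from something like `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def resolve_device(configured: str, cuda_available: bool,
                   mps_available: bool) -> str:
    """Map the configured device string to a concrete device name.

    Hypothetical helper: an explicit setting (e.g. "cpu") always wins;
    only the special value "default" triggers auto-selection.
    """
    if configured != "default":
        return configured
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# On an Apple silicon machine without cuda, "default" would resolve to mps.
print(resolve_device("default", cuda_available=False, mps_available=True))
```

Keeping an explicit override means --train.device cpu still works for comparisons like the ones in this thread.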

eshau · 2025-11-24T14:54:50Z

To get working (at least for me) on my Macbook Pro with an M2 chip:

git clone https://github.com/pufferai/pufferlib
cd pufferlib
git fetch origin pull/422/head:pr-422
git checkout pr-422
uv venv --python 3.11
source .venv/bin/activate
ARCHFLAGS="-arch arm64" uv pip install -e .

I ran into an xcode issue, fatal error: 'cstddef' file not found, so I had to reinstall xcode.
To do so:

1. Find where xcode exists: xcode-select -p (This should be /Library/Developer/CommandLineTools but make sure!)
2. Run sudo rm -rf {xcode path} (substitute {xcode path} with /Library/Developer/CommandLineTools or what you found in Step 1)
3. Run xcode-select --install. A pop-up should appear asking if you want to install.
4. Run sudo xcode-select -s /Library/Developer/CommandLineTools to make sure it installed in the right place.
5. In a new shell, run echo '#include <cstddef>' | cc -x c++ -E -; to verify xcode installed correctly.
6. Rerun the install commands above, starting from source .venv/bin/activate.

I tested:

puffer_squared
- MPS: puffer train puffer_squared --train.device mps --vec.backend Serial
- CPU: puffer train puffer_squared --train.device cpu --vec.backend Serial
puffer_breakout
- MPS: puffer train puffer_breakout --train.device mps --vec.backend Serial
- CPU: puffer train puffer_breakout --train.device cpu --vec.backend Serial

hysmio added 5 commits November 22, 2025 21:42

feature: add mps kernel

1f02571

fix: initialisation weirdness

12ccbd1

feature: add test & benchmarks at varying steps/horizon

31a9543

fix: test copy & paste

cb781d3

fix: use in pufferrl, missed in merge conflict

b3445cd

hysmio force-pushed the feature/mps branch from 46d598d to b3445cd on November 22, 2025 14:12

fix: use accelerator api instead of cuda directly

a51c22c

hysmio added 2 commits November 24, 2025 13:47

fix: test_mps_advantage check diff in benchmarks

e112b3f

feature: use accelerator api more generally

c6f4a98
