hysmio commented Nov 22, 2025

features

  • adds mps kernel for compute_puff_advantage
  • adds objective c for launching kernel with setup caching
  • adds build extension source files into setup if mps is available
  • fixes bug with nn.init.orthogonal_ not being supported on mps
    • since this only runs at init, the approach is to copy the weights to cpu, initialize them there, and copy them back afterwards; alternatives would be to change the init algorithm or use a different one on mps, but that felt even hackier
  • adds tests to validate that the output matches the cpu implementation, plus some small benchmarks to check that the mps kernel is faster than cpu
    • the preconfigured sizes come from real envs: 8192x64 is from breakout, 16384x64 is from g2048
    • also has profiling set up for xcode
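
The cpu round trip for nn.init.orthogonal_ mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `init_on_cpu` is a hypothetical helper name, and it only assumes the parameter exposes the usual PyTorch tensor methods (`.data`, `.cpu()`, `.clone()`, `.copy_()`).

```python
def init_on_cpu(param, init_fn):
    """Run an in-place initializer on a CPU copy, then write it back.

    Hypothetical helper: `param` is expected to behave like a torch
    Parameter/Tensor, and `init_fn` like torch.nn.init.orthogonal_.
    """
    # .data bypasses autograd tracking, .cpu() moves the values off the
    # accelerator, and .clone() guarantees a separate buffer even when
    # the parameter already lives on the CPU.
    cpu_copy = param.data.cpu().clone()
    init_fn(cpu_copy)           # the initializer runs where it is supported
    param.data.copy_(cpu_copy)  # copy_ writes back onto param's own device
    return param
```

With PyTorch this would be called as `init_on_cpu(layer.weight, torch.nn.init.orthogonal_)`. The extra copy happens once per layer at model construction, so its cost should be negligible compared to training.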

benchmarks

Macbook Pro M4 Pro Max (14 core)

Benchmarks:
Benchmark (8192 steps, 64 horizon): CPU=0.5696ms MPS=0.5387ms Speedup=1.06x
Benchmark (16384 steps, 64 horizon): CPU=1.0141ms MPS=0.4194ms Speedup=2.42x
Benchmark (100000 steps, 128 horizon): CPU=14.5072ms MPS=2.3930ms Speedup=6.06x
Benchmark (1000000 steps, 128 horizon): CPU=149.9548ms MPS=18.4245ms Speedup=8.14x

Worth noting that the Apple CPU is quite fast on small sizes, so the kernel launch overhead isn't really justified there (only 1.06x at 8192x64).

This could potentially be expanded to check the input size and choose whichever device is faster. I haven't benchmarked whether the copy to mps and back to cpu outweighs the kernel launch overhead; I imagine it might be close given the unified memory on Apple silicon.
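
A size-based dispatch like the one suggested above might look something like this. It is only a sketch: `pick_advantage_device` and the `min_elements_for_mps` threshold are hypothetical, and the default of 1,000,000 elements is a placeholder chosen to fall between the 8192x64 and 16384x64 benchmark sizes above, not a measured crossover point.

```python
def pick_advantage_device(num_steps, horizon,
                          min_elements_for_mps=1_000_000,
                          mps_available=True):
    """Pick the device for the advantage computation from the input size.

    Hypothetical heuristic: small inputs stay on the CPU, where kernel
    launch overhead dominates; larger inputs go to the MPS kernel.
    """
    if not mps_available:
        return "cpu"
    # Total number of (step, horizon) entries the kernel would process.
    num_elements = num_steps * horizon
    return "mps" if num_elements >= min_elements_for_mps else "cpu"


# The two preconfigured sizes from the benchmarks above.
print(pick_advantage_device(8192, 64))    # 524,288 elements
print(pick_advantage_device(16384, 64))   # 1,048,576 elements
```

The threshold would need to be tuned per machine, and the heuristic ignores the cost of moving tensors between devices, which is the part that has not been benchmarked yet.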

build and run

  • Building must be done with the ARCHFLAGS="-arch arm64" environment variable set
  • Running needs --train.device=mps
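
For the conditional build mentioned in the features list (extension sources are only added to setup when mps is available), the selection logic could be sketched as below. The function name `extension_sources` and the file names in the usage line are hypothetical placeholders, not the files in this PR.

```python
import platform


def extension_sources(base_sources, mps_sources, mps_available, system=None):
    """Return the list of source files to compile for the extension.

    Hypothetical sketch: the MPS sources are only appended on macOS
    and only when the caller reports that MPS is available.
    """
    system = system or platform.system()
    sources = list(base_sources)  # copy so the caller's list is untouched
    if mps_available and system == "Darwin":
        sources.extend(mps_sources)
    return sources


# Placeholder file names, for illustration only.
print(extension_sources(["ext.cpp"], ["ext_mps.mm"], mps_available=True, system="Darwin"))
```

In a real setup script the `mps_available` flag would typically come from `torch.backends.mps.is_available()`.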

layterz commented Nov 24, 2025

I tested this on a MacBook Air M1 (8 GB unified memory) on the squared env. It works and speeds things up by ~2x for me over 20M steps, but the accuracy is quite a bit worse: the final score is ~0.4 on mps vs 0.9 on the cpu. Eyeballing some evaluations, the mps-trained policy does seem worse and struggles when the target is further away compared to the cpu-trained policy.

image

Less important, but it would be nice to include the mps utilization under GPU percentage on the experiment panel.

Review comment on the default config change:

  torch_deterministic = True
  cpu_offload = False
- device = cuda
+ device = default

hysmio (Author) commented:

Not sure if this is controversial or not, but it just means people don't need to pass --train.device mps manually every time; the best available device is selected automatically based on the get_accelerator function.
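
As a rough illustration of what resolving `device = default` could involve, here is a minimal sketch. It does not reproduce the get_accelerator function referenced above: `resolve_device` is a hypothetical name, and the cuda, then mps, then cpu priority order is an assumption.

```python
def resolve_device(requested, cuda_available, mps_available):
    """Map a configured device name to a concrete device string.

    Hypothetical sketch: an explicit choice is always respected, and
    "default" falls back through cuda, then mps, then cpu.
    """
    if requested != "default":
        return requested  # e.g. --train.device cpu still wins
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# On an Apple silicon machine without CUDA:
print(resolve_device("default", cuda_available=False, mps_available=True))
```

The availability flags can be obtained from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`; they are passed in as plain booleans here so the selection logic can be tested without a GPU.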

eshau commented Nov 24, 2025

To get this working (at least for me) on my MacBook Pro with an M2 chip:

git clone https://github.com/pufferai/pufferlib
cd pufferlib
git fetch origin pull/422/head:pr-422
git checkout pr-422
uv venv --python 3.11
source .venv/bin/activate
ARCHFLAGS="-arch arm64" uv pip install -e .

I ran into an Xcode issue (fatal error: 'cstddef' file not found), so I had to reinstall the Xcode Command Line Tools.
To do so:

  1. Find where xcode exists: xcode-select -p (This should be /Library/Developer/CommandLineTools but make sure!)
  2. Run sudo rm -rf {xcode path} (substitute {xcode path} with /Library/Developer/CommandLineTools or what you found in Step 1)
  3. Run xcode-select --install. A pop-up should appear asking if you want to install.
  4. Run sudo xcode-select -s /Library/Developer/CommandLineTools to make sure it installed in the right place.
  5. In a new shell, run echo '#include <cstddef>' | cc -x c++ -E -; to verify xcode installed correctly.
  6. Start running the above commands again from source .venv/bin/activate.

I tested:

  • puffer_squared

    • MPS: puffer train puffer_squared --train.device mps --vec.backend Serial
    • image
    • CPU: puffer train puffer_squared --train.device cpu --vec.backend Serial
    • image
  • puffer_breakout

    • MPS: puffer train puffer_breakout --train.device mps --vec.backend Serial
    • image
    • CPU: puffer train puffer_breakout --train.device cpu --vec.backend Serial
    • image
