@lgritz lgritz commented Oct 10, 2025

Building the python wheels takes a long time! I really hate waiting for that when testing PRs, there must be a way to speed it up.

  • For the wheel workflow, add cache actions to save and restore the CCACHE_DIR. Hey, it's no easy feat to figure out what the path should be to that directory, especially on Linux where the wheels are built in a container, so the paths inside the container (where the wheel is built and the C++ compilation happens) don't match the paths outside the container (where the cache restore and save actions execute).

  • I had a heck of a time on Linux trying to get a pre-built ccache installed and had to resort to writing a bash script to build ccache itself from scratch, which is much more expensive than I'd like, but we'll have to come back to fix that separately.

  • Changed our "auto-build" utility build_dependency_with_cmake to print the amount of time it takes to build each dependency.

  • When auto-building, pass along CMAKE_CXX_COMPILER_LAUNCHER so that the dependencies are also sure to use ccache.

  • Use CMAKE_BUILD_PARALLEL_LEVEL on the wheel run to use all the cores and compile in parallel. (We did that on the regular CI but I think not for the wheel building.)

  • Fixes to the logic in our compiler.cmake where it tries to use ccache even if the magic CMAKE_CXX_COMPILER_LAUNCHER isn't set. I have come to believe we were doing it wrong before: it was having no effect, and all along we only got ccache working on CI because we *also* set the env variable.

  • For CI, set CCACHE_COMPRESSION=1 to make the caches take less space against the precious limit of how much cache we can use total on GHA.
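Taken together, the ccache-related settings above amount to roughly the following environment. This is an illustrative sketch, not the literal workflow contents; the cache directory path and core-count command are assumptions (the real workflow has to juggle container-vs-host paths, as noted above):

```shell
# Illustrative sketch of the ccache-related environment described above.
# Paths and values are examples, not the exact ones in the workflow.

# Point ccache at a directory that the cache save/restore actions manage.
export CCACHE_DIR="$PWD/.ccache"

# Ask CMake to prefix C++ compiler invocations with ccache, for both
# OIIO itself and any dependencies we auto-build from source.
export CMAKE_CXX_COMPILER_LAUNCHER=ccache

# Use all available cores when CMake drives the build
# (nproc is Linux; macOS would use `sysctl -n hw.ncpu`).
export CMAKE_BUILD_PARALLEL_LEVEL="$(nproc)"

# Compress cache entries so they count less against the GHA cache quota.
export CCACHE_COMPRESSION=1
```

The key detail is that CMAKE_CXX_COMPILER_LAUNCHER is consumed by CMake itself, so it applies uniformly to every CMake-driven build in the job, including the auto-built dependencies.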

So, the result of all this:

**Previous times (typical), and first wheel run for any git branch**

| platform    | total | compile OIIO + deps |
| ----------- | ----- | ------------------- |
| Linux Intel | 10:35 |  500s               |
| Linux ARM   |  7:20 |  294s               |
| Mac Intel   | 20:18 | 1146s               |
| Mac ARM     |  7:19 |  388s               |
| Windows     | 14:00 |  759s               |

**With ccache active, 2nd or more wheel run for a git branch**

| platform    | total | compile OIIO + deps |
| ----------- | ----- | ------------------- |
| Linux Intel |  3:33 |  98s                |
| Linux ARM   |  3:34 |  83s                |
| Mac Intel   |  5:01 | 212s                |
| Mac ARM     |  2:30 |  95s                |
| Windows     |   N/A | not using ccache    |

The "compile OIIO + deps" column is the isolated time to build OIIO plus any auto-building of dependencies from source that we do. But it does not include any other setup, such as 40-60s of container setup on Linux, 20-40s setting up Python on Mac, or -- ick! -- 45-60 seconds to build ccache itself from scratch.
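As a sanity check on the headline claim, the per-platform speedups implied by the two timing tables can be computed directly from the total times above (a small illustrative calculation, not part of the workflow):

```shell
# Compute per-platform speedup factors from the two timing tables
# (total wall-clock mm:ss converted to seconds, before / after).
speedup() {  # usage: speedup <name> <before_seconds> <after_seconds>
    awk -v n="$1" -v b="$2" -v a="$3" 'BEGIN { printf "%-12s %.1fx\n", n, b / a }'
}
speedup "Linux Intel" $((10*60+35)) $((3*60+33))
speedup "Linux ARM"   $((7*60+20))  $((3*60+34))
speedup "Mac Intel"   $((20*60+18)) $((5*60+1))
speedup "Mac ARM"     $((7*60+19))  $((2*60+30))
# Output:
#   Linux Intel  3.0x
#   Linux ARM    2.1x
#   Mac Intel    4.0x
#   Mac ARM      2.9x
```

which is where the "2-4x" figure below comes from.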

So this is considerably better, speeding up the full wheel workflow by 2-4x on all platforms but Windows. Remaining room for improvement (possibly in subsequent PRs, hopefully not necessarily by me):

  • Is there a way to use ccache on Windows? I'm not sure if there is, when using MSVS.
  • Find a way to install pre-built binaries on Linux rather than building ccache from scratch, which would save almost a whole minute per job.
  • Organize each platform to build all of its wheels (i.e. all of the python versions we are building for on that platform) in a single job rather than as completely independent jobs, allowing (a) the fixed per-job overhead -- container initialization, installing Python, and building certain dependencies -- to happen ONCE per platform instead of separately for each wheel, and (b) the magic of ccache to speed up the builds *across* those wheels, since only a tiny amount of OIIO source code depends on Python at all.
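On the second item, ccache does publish prebuilt binary tarballs on its GitHub releases page, which could replace the build-from-scratch script. A hypothetical sketch follows; the version number and the asset-name pattern are assumptions to verify against the actual release listing before relying on them:

```shell
# Hypothetical sketch: install a prebuilt ccache on Linux instead of
# building it from source. Version and asset-name pattern are assumed,
# not verified against a specific release.
CCACHE_VERSION=4.10.2                 # illustrative, not necessarily current
ARCH=$(uname -m)                      # x86_64 on Intel runners, aarch64 on ARM
TARBALL="ccache-${CCACHE_VERSION}-linux-${ARCH}.tar.xz"
URL="https://github.com/ccache/ccache/releases/download/v${CCACHE_VERSION}/${TARBALL}"
echo "$URL"
# Then, in the job (network access required):
#   curl -fsSL -o "$TARBALL" "$URL"
#   tar -xf "$TARBALL"
#   install -m 755 "ccache-${CCACHE_VERSION}-linux-${ARCH}/ccache" /usr/local/bin/
```

Whether prebuilt ARM tarballs exist for every release would need checking; if not, the fallback is still building from source on that platform.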

@lgritz lgritz marked this pull request as ready for review October 10, 2025 19:47
@zachlewis
Collaborator

This is remarkable -- nice work, I know this has been kind of an uphill battle! What you're doing makes a lot of sense. LGTM!

@lgritz
Collaborator Author

lgritz commented Oct 10, 2025

I would also like to switch the approach to coalesce into one job per platform (doing all the wheels in sequence and amortizing the fixed costs), instead of a separate job from scratch for every individual wheel. I think that's possible, but I haven't looked into it and it's definitely beyond my current knowledge frontier about the wheel builder.

@zachlewis
Collaborator

Yep, noted! In theory, it should be super easy to do, now that ccache is working its magic...

@lgritz
Collaborator Author

lgritz commented Oct 10, 2025

> This is remarkable -- nice work, I know this has been kind of an uphill battle! What you're doing makes a lot of sense. LGTM!

Thanks!

If that "LGTM" is an actual review, can you please click the approval button?

@lgritz lgritz merged commit 6bc8627 into AcademySoftwareFoundation:main Oct 11, 2025
90 of 93 checks passed
@lgritz lgritz deleted the lg-wheelcache3 branch October 11, 2025 01:52
lgritz added a commit to lgritz/OpenImageIO that referenced this pull request Oct 11, 2025
…n#4924)

lgritz added a commit to lgritz/OpenImageIO that referenced this pull request Oct 11, 2025
…n#4924)

@lgritz lgritz added the build / testing / port / CI Affecting the build system, tests, platform support, porting, or continuous integration. label Oct 11, 2025