ci: For python wheel generation, use ccache #4924
Conversation
This is remarkable -- nice work, I know this has been kind of an uphill battle! What you're doing makes a lot of sense. LGTM! |
I would also like to switch the approach to coalesce into one job per platform (doing all the wheels in sequence and amortizing the fixed costs), instead of a separate job from scratch for every individual wheel. I think that's possible, but I haven't looked into it and it's definitely beyond my current knowledge frontier about the wheel builder. |
Building the python wheels takes a long time! I really hate waiting for it in CI -- there must be a way to speed it up.

* For the wheel workflow, add cache actions to save and restore the CCACHE_DIR. Hey, it's no easy feat to figure out what the path to that directory should be, especially on Linux where the wheels are built in a container, so the paths inside the container (where the wheel is built and the C++ compilation happens) don't match the paths outside the container (where the cache restore and save actions execute). (A sketch of the workflow wiring appears after this description.)
* I had a heck of a time on Linux trying to get a pre-built ccache installed and had to resort to writing a bash script to build ccache itself from scratch.
* Changed our "auto-build" utility build_dependency_with_cmake to print the amount of time it takes to build each dependency.
* When auto-building, pass along CMAKE_CXX_COMPILER_LAUNCHER so that the dependencies are sure to use ccache as well.
* Use CMAKE_BUILD_PARALLEL_LEVEL on the wheel run to use all the cores and compile in parallel. (We did that on the regular CI, but I think not for the wheel building.)
* Fixes to the logic in our compiler.cmake where it tries to use ccache even if the magic CMAKE_CXX_COMPILER_LAUNCHER isn't set -- I now believe we were doing it wrong, it was having no effect, and all along we only got ccache working on CI because we *also* set the env variable.
* For CI, set CCACHE_COMPRESSION=1 to make the caches take less space against the precious limit of how much total cache we can use on GHA.

So, the result of all this:

**Previous times (typical)**

| platform    | total (min:sec) | compile OIIO + deps |
| ----------- | --------------- | ------------------- |
| Linux Intel | 10:35           | 500s                |
| Linux ARM   | 7:20            | 294s                |
| Mac Intel   | 20:18           | 1146s               |
| Mac ARM     | 7:19            | 388s                |
| Windows     | 14:00           | 759s                |

**With ccache active**

| platform    | total (min:sec) | compile OIIO + deps |
| ----------- | --------------- | ------------------- |
| Linux Intel | 3:33            | 98s                 |
| Linux ARM   | 3:34            | 83s                 |
| Mac Intel   | 5:01            | 212s                |
| Mac ARM     | 2:30            | 95s                 |
| Windows     | N/A             | not using ccache    |

The "compile OIIO + deps" column is the isolated time to build OIIO plus any auto-building of dependencies from source that we do. It does not include any other setup, such as 40-60s of container setup on Linux, 20-40s setting up Python on Mac, or -- ick! -- 45-60 seconds to build cmake itself from scratch.

So this is considerably better, speeding up the full wheel workflow by 2-4x on all platforms but Windows.

Remaining room for improvement:

* Is there a way to use ccache on Windows? I'm not sure if there is when using MSVS.
* Find a way to install pre-built binaries on Linux rather than building ccache from scratch, which would save almost a whole minute per job.
* Organize each platform to build all of its wheels (i.e., all of the python versions we are building for on that platform) in a single job rather than as completely independent jobs, allowing (a) the fixed per-job overhead -- container initialization, installing python, and building certain dependencies -- to happen ONCE per platform instead of separately for each wheel, and (b) the magic of ccache to speed up the builds *across* those wheels, since only a tiny amount of OIIO source code depends on python at all.

Signed-off-by: Larry Gritz <[email protected]>
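For reference, the caching and environment wiring looks roughly like this -- a sketch only, with illustrative action versions, paths, and cache key names rather than the exact ones in the workflow:

```yaml
# Sketch of the relevant pieces of one wheel-building job (illustrative only).
env:
  CCACHE_DIR: /home/runner/.ccache     # must correspond to the directory ccache actually uses;
                                       # on Linux the compile runs inside a container, so this
                                       # host path has to be mapped to the path the container sees
  CCACHE_COMPRESSION: "1"              # smaller caches against the GHA total cache quota
  CMAKE_BUILD_PARALLEL_LEVEL: "4"      # compile in parallel with all available cores
  CMAKE_CXX_COMPILER_LAUNCHER: ccache  # route OIIO and auto-built dependencies through ccache

steps:
  - uses: actions/checkout@v4

  - name: Restore ccache
    uses: actions/cache@v4             # restores here, saves automatically in its post step
    with:
      path: /home/runner/.ccache
      key: wheel-ccache-${{ runner.os }}-${{ runner.arch }}-${{ github.sha }}
      restore-keys: wheel-ccache-${{ runner.os }}-${{ runner.arch }}-

  - name: Build wheel
    run: |
      python -m pip install cibuildwheel
      python -m cibuildwheel --output-dir wheelhouse
```

On Linux, the compilation happens inside the build container, so the CCACHE_DIR that ccache sees there must point at a location that corresponds to the host path the cache action saves and restores; getting that mapping right is the fiddly part described above.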
b469aac to e3821ea
Yep, noted! In theory, it should be super easy to do, now that ccache is working its magic... |
Thanks! If that "LGTM" is an actual review, can you please click the approval button? |
6bc8627 into AcademySoftwareFoundation:main
…n#4924)

Building the python wheels takes a long time! I really hate waiting for that when testing PRs; there must be a way to speed it up.

* For the wheel workflow, add cache actions to save and restore the CCACHE_DIR. Hey, it's no easy feat to figure out what the path to that directory should be, especially on Linux where the wheels are built in a container, so the paths inside the container (where the wheel is built and the C++ compilation happens) don't match the paths outside the container (where the cache restore and save actions execute).
* I had a heck of a time on Linux trying to get a pre-built ccache installed and had to resort to writing a bash script to build ccache itself from scratch, which is much more expensive than I'd like, but we'll have to come back to fix that separately.
* Changed our "auto-build" utility build_dependency_with_cmake to print the amount of time it takes to build each dependency.
* When auto-building, pass along CMAKE_CXX_COMPILER_LAUNCHER so that the dependencies are sure to use ccache as well.
* Use CMAKE_BUILD_PARALLEL_LEVEL on the wheel run to use all the cores and compile in parallel. (We did that on the regular CI, but I think not for the wheel building.)
* Fixes to the logic in our compiler.cmake where it tries to use ccache even if the magic CMAKE_CXX_COMPILER_LAUNCHER isn't set -- I have come to believe we were doing it wrong before, it was having no effect, and all along we only got ccache working on CI because we *also* set the env variable. (A sketch of the corrected logic follows below.)
* For CI, set CCACHE_COMPRESSION=1 to make the caches take less space against the precious limit of how much total cache we can use on GHA.

So, the result of all this:

**Previous times (typical), and first wheel run for any git branch**

| platform    | total (min:sec) | compile OIIO + deps |
| ----------- | --------------- | ------------------- |
| Linux Intel | 10:35           | 500s                |
| Linux ARM   | 7:20            | 294s                |
| Mac Intel   | 20:18           | 1146s               |
| Mac ARM     | 7:19            | 388s                |
| Windows     | 14:00           | 759s                |

**With ccache active, 2nd or later wheel run for a git branch**

| platform    | total (min:sec) | compile OIIO + deps |
| ----------- | --------------- | ------------------- |
| Linux Intel | 3:33            | 98s                 |
| Linux ARM   | 3:34            | 83s                 |
| Mac Intel   | 5:01            | 212s                |
| Mac ARM     | 2:30            | 95s                 |
| Windows     | N/A             | not using ccache    |

The "compile OIIO + deps" column is the isolated time to build OIIO plus any auto-building of dependencies from source that we do. It does not include any other setup, such as 40-60s of container setup on Linux, 20-40s setting up Python on Mac, or -- ick! -- 45-60 seconds to build cmake itself from scratch.

So this is considerably better, speeding up the full wheel workflow by 2-4x on all platforms but Windows.

Remaining room for improvement (possibly in subsequent PRs, hopefully not necessarily by me):

* Is there a way to use ccache on Windows? I'm not sure if there is when using MSVS.
* Find a way to install pre-built binaries on Linux rather than building ccache from scratch, which would save almost a whole minute per job.
* Organize each platform to build all of its wheels (i.e., all of the python versions we are building for on that platform) in a single job rather than as completely independent jobs, allowing (a) the fixed per-job overhead -- container initialization, installing python, and building certain dependencies -- to happen ONCE per platform instead of separately for each wheel, and (b) the magic of ccache to speed up the builds *across* those wheels, since only a tiny amount of OIIO source code depends on python at all.

Signed-off-by: Larry Gritz <[email protected]>
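For the compiler.cmake point above, the pattern that actually takes effect is roughly the following -- a minimal sketch, assuming an option named USE_CCACHE and placement before any targets are defined; the real file may differ in the details:

```cmake
# Sketch: enable ccache if the user hasn't already chosen a compiler launcher.
# (Option and variable names here are illustrative, not necessarily the real ones.)
option (USE_CCACHE "Use ccache to speed up recompilation, if found" ON)

if (USE_CCACHE AND NOT CMAKE_CXX_COMPILER_LAUNCHER)
    find_program (CCACHE_FOUND ccache)
    if (CCACHE_FOUND)
        # Setting CMAKE_<LANG>_COMPILER_LAUNCHER is what actually prefixes each
        # compile command with ccache (for the Makefile and Ninja generators).
        set (CMAKE_C_COMPILER_LAUNCHER   "${CCACHE_FOUND}")
        set (CMAKE_CXX_COMPILER_LAUNCHER "${CCACHE_FOUND}")
        message (STATUS "Compiling with ccache: ${CCACHE_FOUND}")
    endif ()
endif ()
```

The key constraint is that these launcher variables only initialize each target's *_COMPILER_LAUNCHER property at the moment the target is created, so they must be set before any add_library/add_executable calls; CMake (3.17+) will also pick the value up from a CMAKE_CXX_COMPILER_LAUNCHER environment variable at configure time, which is presumably why setting the env var on CI worked all along.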