Skip to content

Conversation

workinright
Copy link

Fixes #30706

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing to openpilot! In order for us to review your PR as quickly as possible, check the following:

  • Convert your PR to a draft unless it's ready to review
  • Read the contributing docs
  • Before marking as "ready for review", ensure:
    • the goal is clearly stated in the description
    • all the tests are passing
    • the change is something we merge
    • include a route or your device' dongle ID if relevant

@maxime-desroches
Copy link
Collaborator

CI is failing + this seems extremely jank, fragile, hard to maintain and understand. Converting to draft.

@maxime-desroches maxime-desroches marked this pull request as draft August 1, 2025 20:02
@workinright workinright force-pushed the master branch 3 times, most recently from 5841902 to a691288 Compare August 1, 2025 21:16
@workinright
Copy link
Author

I will get this fixed. Additionally, the new docker pull-or-build routine needs some added handling when it comes to the script running a retry in case the build fails, which I'm working on now.

Can I get a lock on the bounty?

@workinright workinright force-pushed the master branch 18 times, most recently from 8956e9a to 378c7d4 Compare August 3, 2025 17:58
@workinright
Copy link
Author

It's fixed, CI now runs successfully.

However, there are two non-critical changes made to the image (the Dockerfile is being put inside for even faster caching, entirely without running docker buildx build, and Zstandard used instead of gzip), which makes the docker_build.sh want to rebuild and upload the image. However, the registry path is forcefully specified in docker_common.sh to be commaai/openpilot (even when running in a fork), and currently it has no permissions to upload it there (the error reported, UNSUPPORTED, is very misleading, and it works if I change the registry path to somewhere I have permissions to upload to, like workinright/openpilot), which makes the rebuild process pointless.

That's the last remaining reason why the build time is <40s for now, but not yet <20s. I can show it on a separate branch with the path changed in the docker_common.sh file, would this be of any use for the review?

@workinright
Copy link
Author

Furthermore, referencing to the "jank" argument. basher is just a drop-in replacement for some of the Docker daemon routines, which as we've all seen, range from sub-optimal (pulling an image) to outright broken (uploading a zstd-compressed image to a registry). The docker_build.sh script is already handling the basher's exit code, thus if that's the will, it can easily be modified to fall-back to the old way of doing things if say, with an update something breaks (completely disregarding all the state modified by the initial attempt of using basher).

At the same time, basher is benchmarked to setup the container (running on a standard, free GitHub CI runtime at ~13s with a zstd-compresed container and around 29s with a gzip-compressed one).

Some comments may also be added to the script, despite I've felt like the code is rather self-explainatory, you may want to correct me on this.

@maxime-desroches
Copy link
Collaborator

setup-with-retry is almost twice as slow now.

@workinright
Copy link
Author

@maxime-desroches That's because of the abovementioned permission/hardcoded path problem - it's trying to build an updated image because of the updated Dockerfile (and changed compression algorithm), which would cause next runs to be fully cached. Take a look at the last commit, it's exactly at the benchmarked 31s now (and drops down to 13s if the image is Zstd-compressed and successfully uploaded).

@workinright
Copy link
Author

workinright commented Aug 4, 2025

From now (31s run time), there are two ways to the 13s time, either:

  1. merging the PR to the master branch (or perhaps, any other branch within the commaai/... GitHub account), could always be reverted later, so the runner will have the right permissions
    or
  2. creating a separate branch with the registry path modified in docker_common.sh

Before a merge, the last commit 34ba79b has to be reversed, 5e6f233 is what you want to merge in such case.

@workinright workinright marked this pull request as ready for review August 7, 2025 01:15
@workinright
Copy link
Author

As I said above, here is the branch https://github.com/workinright/openpilot/tree/docker-image-path-change-showcase showing the speed of setup-with-retry when the script has the permissions to upload a Zstd-compressed image. You can fork this branch and also check it in your own GitHub account, or see the runs I've ran:

https://github.com/workinright/openpilot/actions/runs/16833488479/job/47686918506
https://github.com/workinright/openpilot/actions/runs/16833569465/job/47687192982

The fallback to docker buildx build, completely bypassing basher and its state in case should it fail, will be added by me tomorrow at the latest.

@maxime-desroches @adeebshihadeh Do you need anything more from me to be able to review this PR?

@maxime-desroches
Copy link
Collaborator

Please add comments explaining the major sections and logic. There's one line that is literally 450 characters wide. See this script for an example.

You code might seem "self-explanatory" to you because you wrote it, but when merging code (especially infrastructure), you have to think about the current maintainers and the future ones that will have to fix and maintain this.

@workinright
Copy link
Author

@maxime-desroches Done. The fall-back is also implemented.

@workinright
Copy link
Author

workinright commented Aug 11, 2025

Furthermore, the last commit contains a suggested way of parallelizing the CI tasks (in order to strip few more seconds), to actually reach <20s run time (benchmarked at 15s-19s on a standard runner, depending on its load). Additional ~2-3s can be stripped with parallelizing the git lfs pull, although this is not included in this commit, as I've seen it to be problematic with the rest of the current config.

@adeebshihadeh
Copy link
Contributor

Converting to draft. Discussed a plan in Discord.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speedup CI setup to <20s
3 participants