
@lumag (Contributor) commented Sep 29, 2025

CI is having stability issues, most likely caused by the huge number of jobs trying to build and upload artifacts in parallel. Reduce the number of jobs by merging the two qcom-distro jobs into a single one and by using a special KAS file for the qcom-armv7a / qcom-distro builds.

@ndechesne (Contributor) left a comment

I like this change. Let's see what the others have to say. It does break users, though, so we need to be cautious, and we also need to update the README.

@quaresmajose (Contributor)

Removing the target from the yml is not good in my opinion, because it will force the user to choose which image they want, and this may not be easy for everyone.
We have the environment variable KAS_TARGET, which overrides the respective setting in the configuration file. Using it, we wouldn't even need new arguments in kas, and we could keep this only in CI.
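
As a sketch of the KAS_TARGET approach, a CI job could keep the YAML files unchanged (with their default `target`) and pass a different image only through the environment. This relies on kas's documented behaviour that KAS_TARGET overrides the `target` from the configuration, and that several config files can be combined with `:` on the `kas build` command line; the helper name `kas_build_command`, the file names and the image name are hypothetical placeholders.

```python
import os
import shlex


def kas_build_command(config_files, target=None, base_env=None):
    """Return (argv, env) for a `kas build` run (hypothetical helper).

    The config files are joined with ':' so kas merges them in order.
    If `target` is given, it is exported as KAS_TARGET, which overrides
    the `target` entry from the YAML files; otherwise the YAML default
    is used unchanged.
    """
    env = dict(os.environ if base_env is None else base_env)  # copy, never mutate the caller's env
    if target is not None:
        env["KAS_TARGET"] = target
    argv = ["kas", "build", ":".join(config_files)]
    return argv, env


if __name__ == "__main__":
    # Illustrative CI invocation; file and image names are placeholders.
    argv, env = kas_build_command(
        ["ci/qcom-armv7a.yml", "ci/qcom-distro.yml"],
        target="core-image-minimal",
        base_env={},
    )
    print("KAS_TARGET=" + env["KAS_TARGET"], shlex.join(argv))
    # subprocess.run(argv, env=env, check=True) would launch the actual build.
```

Run as a script, this prints `KAS_TARGET=core-image-minimal kas build ci/qcom-armv7a.yml:ci/qcom-distro.yml`. Users invoking kas by hand would still get the default image from the YAML, while CI could select the image per job.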

@vkraleti (Contributor)

We could consider splitting the machines more evenly between the compile_warm_up & compile steps to reduce parallel load. Currently, the split is 2:7. Eventually, we can group 4–5 machines into a batch and run builds in multiple batches.

@lumag (Contributor, Author) commented Sep 29, 2025

> Removing the target from the yml is not good in my opinion, because it will force the user to choose which image they want, and this may not be easy for everyone. We have the environment variable KAS_TARGET, which overrides the respective setting in the configuration file. Using it, we wouldn't even need new arguments in kas, and we could keep this only in CI.

I'm not sure why choosing an image is not easy. I think that specifying the image in the distro is counter-intuitive, as it steers the user towards a particular image.

@lumag (Contributor, Author) commented Sep 29, 2025

> We could consider splitting the machines more evenly between the compile_warm_up & compile steps to reduce parallel load. Currently, the split is 2:7. Eventually, we can group 4–5 machines into a batch and run builds in multiple batches.

That would increase the number of cache misses in the warm_up stage, so it doesn't make sense.

@github-actions bot commented Sep 29, 2025

Test Results

  7 files  ±0   14 suites  ±0   14m 16s ⏱️ ±0s
 50 tests ±0   50 ✅ ±0  0 💤 ±0  0 ❌ ±0 
152 runs  ±0  152 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit e9cc6f3. ± Comparison against base commit c384815.

♻️ This comment has been updated with latest results.

@quaresmajose (Contributor)

> > Removing the target from the yml is not good in my opinion, because it will force the user to choose which image they want, and this may not be easy for everyone. We have the environment variable KAS_TARGET, which overrides the respective setting in the configuration file. Using it, we wouldn't even need new arguments in kas, and we could keep this only in CI.
>
> I'm not sure why choosing an image is not easy. I think that specifying the image in the distro is counter-intuitive, as it steers the user towards a particular image.

Because this forces the user to know the images from a predefined list, which makes the layer difficult for beginners to use. I think we should be the ones to make this choice, and thus make the layer simpler to use for new users. For someone experienced, changing the image is trivial, but I don't believe that is the case for everyone.

@vkraleti (Contributor)

> > We could consider splitting the machines more evenly between the compile_warm_up & compile steps to reduce parallel load. Currently, the split is 2:7. Eventually, we can group 4–5 machines into a batch and run builds in multiple batches.
>
> That would increase the number of cache misses in the warm_up stage, so it doesn't make sense.

In that case, the warm_up stage can remain as is. However, I believe the compile step should be split into multiple batches. Since the number of machines is expected to grow over time, reducing the job count from 35 to 28 (as proposed in this PR) may only provide temporary relief. If the additional 7 jobs are already becoming a bottleneck, we might face the same issue again soon.
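
The batching proposal can be sketched in a few lines of Python. This assumes that the 35 and 28 figures come from the 7 machines of the compile step running 5 and 4 jobs each (7 × 5 = 35, 7 × 4 = 28); the function names and machine names are placeholders, and sequencing the batches as separate CI stages is not shown.

```python
def batch_machines(machines, batch_size):
    """Split `machines` into consecutive batches of at most `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [machines[i:i + batch_size] for i in range(0, len(machines), batch_size)]


def jobs_per_batch(batches, jobs_per_machine):
    """Parallel jobs each batch would start, assuming a fixed count per machine."""
    return [len(batch) * jobs_per_machine for batch in batches]


if __name__ == "__main__":
    # Placeholder names for the 7 machines of the compile step.
    machines = [f"machine-{n}" for n in range(1, 8)]
    print(len(machines) * 5, len(machines) * 4)  # 35 28: before / after this PR
    batches = batch_machines(machines, 4)
    print([len(b) for b in batches])             # [4, 3]
    print(jobs_per_batch(batches, 4))            # [16, 12]
```

Under these assumptions, batches of 4 machines would cap the compile step at 16 concurrent jobs instead of 28, likely at the cost of a longer wall-clock time, and the cap would not grow as more machines are added.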

@ndechesne (Contributor)

> > We could consider splitting the machines more evenly between the compile_warm_up & compile steps to reduce parallel load. Currently, the split is 2:7. Eventually, we can group 4–5 machines into a batch and run builds in multiple batches.
>
> That would increase the number of cache misses in the warm_up stage, so it doesn't make sense.

If we have a problem with the CI runners and the number of jobs we create, we should fix our infrastructure instead of limiting ourselves. I have yet to see a problem related to the number of jobs we create.

@ricardosalveti (Contributor)

@doanac is working on identifying the correct fix; anything we do to reduce the number of jobs will only be temporary.

@lumag (Contributor, Author) commented Sep 30, 2025

> > > Removing the target from the yml is not good in my opinion, because it will force the user to choose which image they want, and this may not be easy for everyone. We have the environment variable KAS_TARGET, which overrides the respective setting in the configuration file. Using it, we wouldn't even need new arguments in kas, and we could keep this only in CI.
> >
> > I'm not sure why choosing an image is not easy. I think that specifying the image in the distro is counter-intuitive, as it steers the user towards a particular image.
>
> Because this forces the user to know the images from a predefined list, which makes the layer difficult for beginners to use. I think we should be the ones to make this choice, and thus make the layer simpler to use for new users. For someone experienced, changing the image is trivial, but I don't believe that is the case for everyone.

Beginners still have to pass a correct set of YAML files, which is not that trivial. I think we can document both: the set of YAML files and the default image.

@lumag requested a review from @ndechesne on October 13, 2025, 20:10
@github-actions bot

This pull request has been marked as stale due to 30 days of inactivity. To prevent automatic closure in 5 days, remove the stale label or add a comment. You can reopen a closed pull request at any time.

@github-actions bot

Test run workflow

Test jobs for commit 52d6f1d

@test-reporting-app bot commented Nov 19, 2025

Test Results

 14 files  ±0   28 suites  ±0   38m 2s ⏱️ - 2m 14s
 50 tests ±0   50 ✅ ±0  0 💤 ±0  0 ❌ ±0 
304 runs  ±0  304 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 5e3a4ce. ± Comparison against base commit 3ac7db4.

@github-actions bot

Test run workflow

Test jobs for commit 808f949

CI is having stability issues, most likely caused by the huge number of
jobs trying to build and upload artifacts in parallel. Reduce the number
of jobs by merging the two qcom-distro jobs into a single one and by
using special jobs for qcom-armv7a builds (since we don't support
proprietary builds on qcom-armv7a).

Signed-off-by: Dmitry Baryshkov <[email protected]>
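
The per-machine exclusion described in the commit message can be sketched as a job-matrix generator. Only the rule that proprietary builds are skipped on qcom-armv7a comes from the commit message; the `Variant` type, the variant labels and `machine-b` are hypothetical, and the actual workflow may express the same rule differently (for example with separate jobs, as this commit does).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str          # hypothetical build-variant label
    proprietary: bool  # True if the variant needs proprietary components


def build_matrix(machines, variants, no_proprietary=frozenset()):
    """List (machine, variant) jobs, skipping proprietary variants
    on machines listed in `no_proprietary`."""
    jobs = []
    for machine in machines:
        for variant in variants:
            if variant.proprietary and machine in no_proprietary:
                continue  # e.g. qcom-armv7a has no proprietary builds
            jobs.append((machine, variant.name))
    return jobs


if __name__ == "__main__":
    variants = [Variant("qcom-distro", False), Variant("qcom-distro-prop", True)]
    for job in build_matrix(["qcom-armv7a", "machine-b"], variants, {"qcom-armv7a"}):
        print(job)
```

With these placeholders, qcom-armv7a gets one job and machine-b gets two, so the exclusion itself removes one job per affected machine.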
@github-actions bot

Test run workflow

Test jobs for commit 5e3a4ce
