## Overview
Somewhere along the way, actor performance regressed (probably during my large refactor). Cold-start times are very long because the worker reports 0 capacity in its initial heartbeat, so no tasks can be assigned to it until the next heartbeat delivers an updated capacity report.
The cause is a race between executor startup and the heartbeat mechanism: if the Rust worker sent a heartbeat before the executors were initialized, it reported a capacity of 0. This PR adds a future over a boolean that resolves once the runtime has initialized, and the heartbeat runtime waits on it before starting.
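For reviewers who want the idea in isolation, here is a minimal sketch of the gating pattern. It is not the code in this PR: it swaps the async future for a blocking `Mutex<bool>` plus `Condvar` from the standard library so it runs without an async runtime, and `ReadyGate`, `mark_ready`, `wait_until_ready`, and `simulate_cold_start` are hypothetical names chosen for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical readiness gate: a boolean guarded by a mutex, plus a
/// condition variable to wake waiters once the boolean flips to true.
struct ReadyGate {
    ready: Mutex<bool>,
    cond: Condvar,
}

impl ReadyGate {
    fn new() -> Self {
        ReadyGate { ready: Mutex::new(false), cond: Condvar::new() }
    }

    /// Called by the startup path once the executors are initialized.
    fn mark_ready(&self) {
        *self.ready.lock().unwrap() = true;
        self.cond.notify_all();
    }

    /// Blocks the caller until `mark_ready` has been called.
    fn wait_until_ready(&self) {
        let mut ready = self.ready.lock().unwrap();
        while !*ready {
            ready = self.cond.wait(ready).unwrap();
        }
    }
}

/// Simulates a cold start and returns the capacity seen by each heartbeat.
/// The sleep stands in for slow executor initialization (an assumption).
fn simulate_cold_start(executor_count: usize, heartbeats: usize) -> Vec<usize> {
    let gate = Arc::new(ReadyGate::new());
    let capacity = Arc::new(AtomicUsize::new(0));

    let heartbeat = {
        let gate = Arc::clone(&gate);
        let capacity = Arc::clone(&capacity);
        thread::spawn(move || {
            // Without this wait, the first report could race startup and be 0.
            gate.wait_until_ready();
            (0..heartbeats)
                .map(|_| capacity.load(Ordering::SeqCst))
                .collect::<Vec<usize>>()
        })
    };

    // Executor startup: publish capacity first, then open the gate.
    thread::sleep(Duration::from_millis(50));
    capacity.store(executor_count, Ordering::SeqCst);
    gate.mark_ready();

    heartbeat.join().unwrap()
}
```

Calling `simulate_cold_start(4, 3)` yields three heartbeat reports that all carry the real capacity, because the heartbeat thread cannot read the capacity until startup has stored it and opened the gate. Removing the `wait_until_ready` call reintroduces the race described above.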
## Test Plan
Ran this locally; cold-start for "hello world" dropped from 16s to 2s. Will also run tests on cloud to confirm.
## Rollout Plan (if applicable)
This may be rolled out immediately. We will need to cut a new `union` package.
## Upstream Changes
Should this change be upstreamed to OSS (flyteorg/flyte)? If not, please uncheck this box, which is used for auditing. Note, it is the responsibility of each developer to actually upstream their changes. See [this guide](https://unionai.atlassian.net/wiki/spaces/ENG/pages/447610883/Flyte+-+Union+Cloud+Development+Runbook/#When-are-versions-updated%3F).
- [ ] To be upstreamed to OSS
## Issue
fixes https://linear.app/unionai/issue/COR-2673/fix-startup-way-slower-than-regular-tasks
## Checklist
* [ ] Added tests
* [ ] Ran a deploy dry run and shared the terraform plan
* [ ] Added logging and metrics
* [ ] Updated [dashboards](https://unionai.grafana.net/dashboards) and [alerts](https://unionai.grafana.net/alerting/list)
* [ ] Updated documentation