---
title: "Beam External Worker Pool"
date: 2025-01-13T22:18:46-08:00
tags:
- beam
- golang
- dev
- projects
categories:
- Dev
---

Today I merged a PR to add [the ability to run a Worker Pool](https://github.com/apache/beam/pull/33572)
for the Beam Go SDK. I'm writing this quite late, so it's more a sketch of a draft of
documentation than real documentation, and will be light on links, examples, and diagrams for now.

It's not something one always needs to use, but it adds flexibility to the SDK.

<!--more-->

Apache Beam's Portability framework, as encoded in its Protocol Buffers, has something
called the ExternalWorkerPoolService.

Predominantly used to support "loopback mode" when executing pipelines, it's
a general approach that allows a Beam runner to spin up workers on demand.

But I think I'm getting ahead of myself.

When authoring any code, it's very useful to be able to iterate on the program
locally. The faster you can try things out, and find and eliminate bugs, the
better the code will be, and the more fun the experience tends to be.
The same is true for your data processing pipelines.

### Local Execution with Loopback Mode

Beam enables this with what it calls "loopback mode". By setting your job's
environment type to `LOOPBACK`, the framework allows your job to be executed
locally. This lets you use the debugger, set breakpoints, and avoid most of the
networking and distributed bits that make things hard.
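
To make this concrete, here's a minimal Go pipeline of the kind you'd iterate on
this way. Nothing in the code is loopback specific; the mode is picked at launch
time with flags.

```go
// A minimal pipeline; the interesting part is where it runs, not what it does.
package main

import (
	"context"
	"flag"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/log"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
)

func main() {
	flag.Parse()
	beam.Init()

	p := beam.NewPipeline()
	s := p.Root()

	words := beam.Create(s, "loopback", "keeps", "workers", "local")
	beam.ParDo0(s, func(ctx context.Context, word string) {
		// With LOOPBACK, this DoFn runs in this very process, so
		// breakpoints just work.
		log.Infof(ctx, "processed: %v", word)
	}, words)

	if err := beamx.Run(context.Background(), p); err != nil {
		log.Exitf(context.Background(), "pipeline failed: %v", err)
	}
}
```

Against a portable runner, you'd launch it with something like
`--runner=universal --endpoint=<job service address> --environment_type=LOOPBACK`,
where the endpoint depends on which runner you're talking to.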

The Loopback environment type tells the framework to host an in-process instance
of the ExternalWorkerPoolService, and uses its network endpoint in place of the
normal execution environment.

The Runner then connects to that address and asks it to "start" a worker,
handing it all the usual Beam Portability information. That worker then
goes on to get work and execute it.
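
In Go terms, the exchange looks roughly like this sketch, which plays the
runner's role against the pool using the generated FnAPI stubs. Treat the exact
field names and endpoint values as approximations from memory, not gospel:

```go
// Sketch: a runner-side StartWorker call against the worker pool service.
package main

import (
	"context"
	"log"

	fnpb "github.com/apache/beam/sdks/v2/go/pkg/beam/model/fnexecution_v1"
	pipepb "github.com/apache/beam/sdks/v2/go/pkg/beam/model/pipeline_v1"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Connect to the pool service. Loopback hosts this in-process; the
	// standalone container version listens on port 50000.
	conn, err := grpc.Dial("localhost:50000",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	pool := fnpb.NewBeamFnExternalWorkerPoolClient(conn)

	// Ask the pool to start a worker, handing over the usual portability
	// endpoints (the addresses here are placeholders).
	resp, err := pool.StartWorker(context.Background(), &fnpb.StartWorkerRequest{
		WorkerId:        "worker-1",
		ControlEndpoint: &pipepb.ApiServiceDescriptor{Url: "runner-host:8100"},
		LoggingEndpoint: &pipepb.ApiServiceDescriptor{Url: "runner-host:8101"},
	})
	if err != nil {
		log.Fatalf("StartWorker RPC failed: %v", err)
	}
	if resp.GetError() != "" {
		log.Fatalf("worker failed to start: %v", resp.GetError())
	}
}
```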

You gain all the speed of avoiding additional layers like Docker, at the cost
of not being able to distribute the work. As a result, loopback mode isn't really
the fastest way to execute all jobs. You're limited to what your local machine
can hold in memory and process in parallel, after all.

### External Worker Pool, as a Service

But what if you didn't want to run your workers on *your* machine? There's no
reason the ExternalWorkerPoolService has to be in the same process as the
launching program.

The service can be hosted anywhere, as long as the runner can connect to it.
The implementation just needs to produce a worker that will connect to the
provided Beam FnAPI endpoints and execute the given pipeline. This includes
fetching the various artifacts and such needed to run it.

The feature I added is the ability to spin up a standalone instance of the service
that does exactly that, all built into the standard Beam Go SDK container.

The root of it is that you can start the container with the `--worker_pool` flag
and tunnel out port 50000 from the container. This makes the SDK container's
boot loader program start the worker pool version of the service, instead of
starting as an SDK worker itself.
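
With Docker locally, that looks something like the line below; the image tag is
a placeholder for whichever Go SDK container version matches your pipeline:

```sh
# Start the standalone worker pool service, publishing its port to the host.
docker run -p 50000:50000 apache/beam_go_sdk:latest --worker_pool
```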

The trick in the implementation is that it starts a separate process when
StartWorker is called, executing the boot loader itself when it does so.
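
A heavily simplified sketch of that trick follows. It's not the actual boot
loader code: the flags handed to the child and the bookkeeping are illustrative.

```go
// Sketch: each StartWorker call re-execs this same binary as a child process,
// which then follows the normal container contract on its own.
package pool

import (
	"context"
	"os"
	"os/exec"
	"sync"

	fnpb "github.com/apache/beam/sdks/v2/go/pkg/beam/model/fnexecution_v1"
)

type workerPool struct {
	mu      sync.Mutex
	workers map[string]*os.Process
}

func (s *workerPool) StartWorker(ctx context.Context, req *fnpb.StartWorkerRequest) (*fnpb.StartWorkerResponse, error) {
	self, err := os.Executable()
	if err != nil {
		return &fnpb.StartWorkerResponse{Error: err.Error()}, nil
	}
	// Re-exec ourselves as a plain SDK worker, forwarding the endpoints the
	// runner gave us. (Illustrative flag names.)
	cmd := exec.Command(self,
		"--id="+req.GetWorkerId(),
		"--provision_endpoint="+req.GetProvisionEndpoint().GetUrl(),
	)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		return &fnpb.StartWorkerResponse{Error: err.Error()}, nil
	}
	// Remember the process so StopWorker (omitted here) can signal it later.
	s.mu.Lock()
	s.workers[req.GetWorkerId()] = cmd.Process
	s.mu.Unlock()
	return &fnpb.StartWorkerResponse{}, nil
}
```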

That new boot loader process then performs the normal Beam container contract:
it connects to the provisioning service to get the different endpoints, gets
the artifacts from the artifact service, and starts its own worker binary
instance, which ultimately connects back to the runner's control service to
get work.
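
Condensed into code, the child's life looks something like this. The
provisioning call is real FnAPI surface as best I recall; the artifact staging
is elided, and the worker flags are placeholders:

```go
// Sketch of the container contract the spawned boot loader process follows.
package boot

import (
	"context"
	"os"
	"os/exec"

	fnpb "github.com/apache/beam/sdks/v2/go/pkg/beam/model/fnexecution_v1"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func runContract(ctx context.Context, provisionAddr, workerID string) error {
	// 1. Ask the provisioning service for the other endpoints and options.
	conn, err := grpc.Dial(provisionAddr,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close()
	info, err := fnpb.NewProvisionServiceClient(conn).
		GetProvisionInfo(ctx, &fnpb.GetProvisionInfoRequest{})
	if err != nil {
		return err
	}

	// 2. Fetch the staged artifacts (the worker binary among them) from the
	//    artifact service. Elided; assume they land on local disk.
	workerBinary := "/tmp/staged/worker" // placeholder path

	// 3. Start the worker binary pointed back at the runner's services; it
	//    connects to the control endpoint and asks for work.
	cmd := exec.CommandContext(ctx, workerBinary,
		"--id="+workerID, // illustrative flag names
		"--control_endpoint="+info.GetInfo().GetControlEndpoint().GetUrl(),
		"--logging_endpoint="+info.GetInfo().GetLoggingEndpoint().GetUrl(),
	)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}
```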

### OK, but what is it for?

The best use is when a runner is unable to start up its own workers for a given
SDK. This lets you preconfigure a sidecar service that can expand out and do
the work.

For example, this is needed when executing a pipeline on a runner on Kubernetes.
You'd sometimes like to be able to bound the number of machines or pods that can
serve as workers within your cluster. This lets you arrange a dedicated set of
pods that can behave as workers for arbitrary pipelines.

The main downside is that it needs to be manually pre-configured. Sorry about that.

But with this as a building block, and a few more changes coming down the line,
we may be able to build something more self-directing in the future.