---
title: "Beam External Worker Pool"
date: 2025-01-13T22:18:46-08:00
tags:
- beam
- golang
- dev
- projects
categories:
- Dev
---

Today I merged a PR to add [the ability to run a Worker Pool](https://github.com/apache/beam/pull/33572)
for the Beam Go SDK. I'm writing this quite late, so it's more of a sketch of a draft of
documentation than real documentation. So this will be light on links, examples, and diagrams for now.

It's not something one always needs to use, but it adds flexibility to the SDK.

<!--more-->

Apache Beam's Portability framework, as encoded in the Protocol Buffers, has something
called the ExternalWorkerPoolService.

Predominantly used to support "loopback mode" when executing pipelines, it's
a general approach that allows a Beam runner to spin up workers on demand.
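
For reference, the service itself is small. From memory, its shape in the Fn API protos is roughly the following; check `beam_fn_api.proto` in the Beam repository for the authoritative definition, since the request and response fields are omitted here:

```proto
// Approximate sketch of the worker pool service from Beam's Fn API protos.
service BeamFnExternalWorkerPool {
  // Ask the pool to launch a new SDK worker.
  rpc StartWorker(StartWorkerRequest) returns (StartWorkerResponse);
  // Ask the pool to shut a previously started worker down.
  rpc StopWorker(StopWorkerRequest) returns (StopWorkerResponse);
}
```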

But I think I'm getting ahead of myself.

When authoring any code, it's very useful to be able to iterate on the program
locally. The faster you can try things out, and find and eliminate bugs, the
better the code will be, and the more fun the experience tends to be.
The same is true for your data processing pipelines.

### Local Execution with Loopback Mode

Beam enables this through what it calls "loopback mode". By setting your job's
environment type to `LOOPBACK`, the framework allows your job to be executed
locally. This lets you use the debugger, set breakpoints, and avoid most of the
networking and distributed bits that make things hard.
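
As a concrete sketch, assuming a Go pipeline in `main.go` and a portable runner's job service listening on `localhost:8099` (both illustrative), a loopback invocation looks something like:

```sh
# Run the pipeline against a portable runner, but host the workers
# inside this same local process instead of in containers.
go run main.go \
  --runner=universal \
  --endpoint=localhost:8099 \
  --environment_type=LOOPBACK
```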

The Loopback environment type tells the framework to host an in-process instance
of the ExternalWorkerPoolService, and uses its network endpoint to replace the normal
execution environment.

The runner then connects to that address, and asks it to "start" a worker,
giving it all the usual Beam Portability information. That worker then
goes on to get work and execute it.

You gain all the speed of avoiding additional layers like Docker, at the cost
of not being able to distribute the work. As a result, loopback mode isn't really
the fastest way to execute all jobs. You're limited to what your local machine
can hold in memory and process in parallel, after all.

### External Worker Pool, as a Service

But what if you didn't want to run your workers on *your* machine? There's no
reason the ExternalWorkerPoolService has to be in the same process as the
launching program.

The service can be hosted anywhere, as long as the runner can connect to it.
The implementation just needs to produce a worker that will connect to the
provided Beam FnAPI endpoints and execute the given pipeline. This includes
the various artifacts and such needed to run it.

The feature I added is the ability to spin up a standalone instance of the service
to do exactly that, all built into the standard Beam Go SDK container.

The root of it is that you can start the container with a `--worker_pool` flag
and tunnel out port 50000 from the container. This makes the SDK container
boot loader program start the worker pool version of the service, instead of
starting as an SDK worker itself.

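Concretely, that looks something like the following. I'm assuming here that the published Go SDK container image is what you want; substitute your own image and tag as needed:

```sh
# Start the SDK container as an external worker pool service,
# publishing its port 50000 to the host.
docker run -p 50000:50000 apache/beam_go_sdk:latest --worker_pool
```
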
The trick in the implementation is that it starts a separate process when
StartWorker is called, executing the boot loader itself when it does so.

That new boot loader process then performs the normal Beam container contract:
it connects to the provisioning service to get the different endpoints,
gets the artifacts from the artifact service, and starts its own worker binary
instance, which ultimately connects back to the runner's control service to
get work.

### OK, but what is it for?

The best use is when a runner is unable to start up its own workers for a given
SDK. This lets you preconfigure a sidecar service to expand out and do
the work.

For example, this is needed when executing a pipeline on a runner on Kubernetes.
You'd sometimes like to bound the number of machines or pods that can
serve as workers within your cluster. This lets you arrange a dedicated set of
pods that can behave as workers for arbitrary pipelines.

The main downside is that it needs to be manually pre-configured. Sorry about that.

But with this as a building block, and a few more changes coming down the line,
we may be able to build something more self-directing in the future.
