# GKE Load Test Tutorial

In this tutorial, we are going to set up a Doorman deployment similar to what you may expect to run in a production environment. The resource that Doorman will be protecting isn't all that useful (it's one of the Go examples for gRPC, the [Greeter](https://github.com/grpc/grpc-go/blob/master/examples/helloworld/helloworld/helloworld.proto) service), but that doesn't change the fact it's a real RPC service that may have limited capacity. Finally, we'll add some monitoring for good measure.

To make this slightly more manageable, we'll do all of this in a Kubernetes cluster running on GKE (Google Container Engine). It should be simple to replicate the experiment with a Kubernetes cluster running on your own machines, and relatively easy to replicate it using some other cloud setup.

## Dramatis Personae

Our deployment will consist of the following elements:

 - Doorman Server - the standard Doorman server.
 - `target` - an RPC server.
 - `client` - a client for `target` which uses Doorman to avoid overloading it.
 - [Prometheus](http://prometheus.io/) - a monitoring system. We'll use it to get insight into the running system.

`target` and `client` are custom written for this tutorial. Let's take a closer look at them.

### `target`

[Target](docker/target/target.go) is an extremely simple gRPC server. Here is its `main` function:
```go
func main() {
	flag.Parse()
	lis, err := net.Listen("tcp", fmt.Sprintf(":%v", *port))
	if err != nil {
		log.Exitf("failed to listen: %v", err)
	}

	http.Handle("/metrics", prometheus.Handler())
	go http.ListenAndServe(fmt.Sprintf(":%v", *debugPort), nil)
	s := grpc.NewServer()
	pb.RegisterGreeterServer(s, &server{})
	s.Serve(lis)
}
```

We listen on two ports: one for gRPC, the other for HTTP, which we will use for monitoring.

`server` is similarly unexciting:

```go
// server is used to implement helloworld.GreeterServer.
type server struct{}

// SayHello implements helloworld.GreeterServer.
func (s *server) SayHello(ctx context.Context, in *pb.HelloRequest) (*pb.HelloReply, error) {
	requests.WithLabelValues(in.Name).Inc()
	return &pb.HelloReply{Message: "Hello " + in.Name}, nil
}
```

One last thing worth noting is `requests`, a Prometheus counter. We will use it to monitor the number of requests that `target` actually receives.
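
For reference, here is a minimal sketch of how such a counter might be declared and registered; the real `target.go` is authoritative, and the help string and label name here are assumptions:

```go
// requests counts the requests handled by target, labeled by the
// client-supplied name. The metric name matches the rate(requests[5m])
// query we will run in Prometheus later.
var requests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "requests",
		Help: "Number of SayHello requests received.",
	},
	[]string{"name"},
)

func init() {
	prometheus.MustRegister(requests)
}
```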

### `client`

`client` has one task: send RPCs to `target`. Each process simulates a number of Doorman clients, which makes scheduling on a small Kubernetes cluster easier. In a real-world setting, you would usually have one Doorman client per process.

```go
func main() {
	flag.Parse()
	log.Infof("Simulating %v clients.", *count)
	for i := 0; i < *count; i++ {
		id := uuid.New()
		log.Infof("client %v with id %v", i, id)

		client, err := doorman.NewWithID(*addr, id, doorman.DialOpts(grpc.WithInsecure()))
		if err != nil {
			log.Exit(err)
		}
		defer client.Close()

		res, err := client.Resource(*resource, *initialCapacity)
		if err != nil {
			log.Exit(err)
		}

		go manipulateCapacity(res, *initialCapacity, id)

		conn, err := grpc.Dial(*target, grpc.WithInsecure())
		if err != nil {
			log.Exitf("did not connect: %v", err)
		}
		defer conn.Close()

		c := pb.NewGreeterClient(conn)
		rl := ratelimiter.NewQPS(res)

		for i := 0; i < *workers; i++ {
			go func() {
				ctx := context.Background()
				for {
					if err := rl.Wait(ctx); err != nil {
						log.Exitf("rl.Wait: %v", err)
					}

					ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
					if _, err := c.SayHello(ctx, &pb.HelloRequest{Name: *resource}); err != nil {
						log.Error(err)
					}
					cancel()
				}
			}()
		}
	}
	http.Handle("/metrics", prometheus.Handler())
	http.ListenAndServe(fmt.Sprintf(":%v", *port), nil)
}
```

The client uses a Doorman rate limiter; whenever its `Wait` method returns, it performs an RPC.

The function `manipulateCapacity` changes the capacity requested by a client at random:

```go
func manipulateCapacity(res doorman.Resource, current float64, id string) {
	clientRequested := new(expvar.Float)
	for range time.Tick(*interval) {
		r := rand.Float64()
		log.V(2).Infof("r=%v decreaseChance=%v increaseChance=%v", r, *decreaseChance, *increaseChance)
		switch {
		case r < *decreaseChance:
			current -= *step
			log.Infof("client %v will request less: %v.", id, current)
		case r < *decreaseChance+*increaseChance:
			log.Infof("client %v will request more: %v.", id, current)
			current += *step
		default:
			log.V(2).Infof("client %v not changing requested capacity", id)
			continue
		}
		if current > *maxCapacity {
			current = *maxCapacity
		}
		if current < *minCapacity {
			current = *minCapacity
		}
		log.Infof("client %v will request %v", id, current)
		if err := res.Ask(current); err != nil {
			log.Errorf("res.Ask(%v): %v", current, err)
			continue
		}
		clientRequested.Set(current)
		requested.Set(id, clientRequested)
	}
}
```

Again, we are exposing an HTTP port and exporting metrics (we will use them to find the client latencies).
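
As a rough sketch (the real client source is authoritative), `requested` above could be an `expvar.Map` keyed by client ID, which the same HTTP server exposes alongside the Prometheus metrics; the exported name below is an assumption:

```go
// requested holds the capacity most recently requested by each simulated
// client, keyed by the client's ID.
var requested = expvar.NewMap("doorman_client_requested_capacity")
```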

### Doorman Server

This is the regular [doorman server](https://github.com/youtube/doorman/tree/master/go/cmd/doorman), whose address we will give to the client. The way we'll run the server differs significantly from how we would run it in a real world setting. We are running just one process. If it dies, the client won't be able to get new resource leases. For production, we would run 3 processes, and they would use [etcd](https://github.com/coreos/etcd/) to elect a leader among themselves. We are skipping this step for the sake of simplicity.

## Kubernetes

[Kubernetes](http://kubernetes.io/) allows you to run Docker instances in a cluster. The great part about it is that it allows you to view all your containers as a single system. If you want to learn more about Kubernetes, please take a look at its [documentation](http://kubernetes.io/v1.1/examples/guestbook/README.html).

### Kubernetes in less than a minute

Assuming that you have some idea what Kubernetes is about, here's a quick refresher (a few `kubectl` commands for inspecting these objects follow the list).

- All processes run in Linux containers (Docker being the most popular container solution).
- A *pod* is a group of containers that get scheduled together.
- A *replication controller* makes sure that a specified number of replicas of some pod are running at any given point. It is important to remember that pods running under a replication controller are [cattle, not pets](http://www.theregister.co.uk/2013/03/18/servers_pets_or_cattle_cern/). You are not supposed to concern yourself with a single pod. It may get killed, rescheduled, etc. Pods have names, but they are randomly generated, and you refer to them mostly when debugging an issue.
- A *service* abstracts away the problem of referring to pods. You specify some constraints that the pods have to meet, and the service gives you a port which you can use to connect to a pod of the specified type. It also acts as a load balancer.
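
If you want to poke at these objects as you follow along, the standard `kubectl` listing commands will show you what exists in the cluster (output omitted here):

```console
$ kubectl get pods
$ kubectl get replicationcontrollers
$ kubectl get services
```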

## Creating a cluster

This part of the tutorial is specific to GKE. You should be able to adapt it reasonably well for [AWS](http://kubernetes.io/v1.1/docs/getting-started-guides/aws.html) or [Azure](http://kubernetes.io/v1.1/docs/getting-started-guides/coreos/azure/README.html).

I am assuming that you've installed the [gcloud CLI](https://cloud.google.com/container-engine/docs/before-you-begin#install_the_gcloud_command_line_interface), and that you have a Cloud project set up. You should also do

```console
$ gcloud config set project PROJECT_ID
$ gcloud config set compute/zone us-central1-b
```

to save yourself some typing.

Let's create a cluster. Run something like

```console
$ gcloud container clusters create doorman-loadtest --machine-type=n1-standard-1 --num-nodes=6
```

adjusting the machine type and node count depending on how big you want your toy cluster to be.
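
Depending on your setup, you may also need to fetch the cluster credentials so that `kubectl` talks to the new cluster:

```console
$ gcloud container clusters get-credentials doorman-loadtest
```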

## Docker images

Now, let's create the Docker images that we will use to run our services. I am assuming that you are in Doorman's main directory.

```console
$ docker build -t gcr.io/google.com/doorman/doorman-server:v0.1.7 doc/loadtest/docker/server/
$ docker build -t gcr.io/google.com/doorman/doorman-client:v0.1.7 doc/loadtest/docker/client/
$ docker build -t gcr.io/google.com/doorman/target:v0.1 doc/loadtest/docker/target/
$ docker build -t gcr.io/google.com/doorman/prometheus:v0.2 doc/loadtest/docker/prometheus
```

Now we can push them to the Docker registry:

```console
$ gcloud docker push gcr.io/google.com/doorman/doorman-server:v0.1.7
$ gcloud docker push gcr.io/google.com/doorman/doorman-client:v0.1.7
$ gcloud docker push gcr.io/google.com/doorman/target:v0.1
$ gcloud docker push gcr.io/google.com/doorman/prometheus:v0.2
```

You will have to replace `google.com/doorman` with your own project name, of course, and adjust the registry prefix and image tags if you wish to use a different container registry.
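
If you have already built the images with the tags above, one way to re-tag them for your own project is `docker tag` (here `my-project` is a placeholder for your project ID):

```console
$ docker tag gcr.io/google.com/doorman/doorman-server:v0.1.7 gcr.io/my-project/doorman-server:v0.1.7
```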

## Populating the cluster

### Doorman

Now, time for the fun part: putting our containers into the cloud! First, we'll create a replication controller for the Doorman server. We want only one replica, but we need it to be restarted in case something happens. Please take a look at its definition in [doorman-server.yaml](doorman-server.yaml).

```console
$ kubectl create -f doc/loadtest/doorman-server.yaml
replicationcontroller "doorman-server" created
```

After a moment, you will see that it's been created and that it's running:

```console
$ kubectl get pods
NAME                   READY     STATUS    RESTARTS   AGE
doorman-server-le54r   1/1       Running   0          15s
```

This runs the Doorman server with a command line like this:

```console
doorman -logtostderr -port=3667 -debug_port=3668 -config=./config.prototext
```

Let's take a look at its logs to verify that everything is fine:

```console
$ kubectl logs doorman-server-le54r
I0226 15:48:33.352541 1 doorman_server.go:234] Waiting for the server to be configured...
I0226 15:48:33.352618 1 doorman_server.go:238] Server is configured, ready to go!
I0226 15:48:33.352801 1 server.go:437] this Doorman server is now the master
I0226 15:48:33.352818 1 server.go:457] setting current master to 'doorman-server-le54r:3667'
```

(Your pod identifier will of course be different.)

We can also take a look at the server's status page. First, we need to run

```console
kubectl port-forward doorman-server-le54r 3668 &
```

to forward local port 3668 to the same port on the container. Now we can go to http://localhost:3668/debug/status and look at the server's status page.

Before we forget about it, let's also create a service that will make our server discoverable to the clients:

```console
$ kubectl create -f doc/loadtest/doorman-server-service.yaml
```

### Prometheus

Let's not forget about Prometheus:

```console
$ kubectl create -f doc/loadtest/prometheus.yaml
```

Then forward its port (again, your pod name will differ):

```console
kubectl port-forward prometheus-mtka5 9090 &
```

and go to http://localhost:9090/graph to verify that it's running.

### Target

Now, it's time for the target.

```console
$ kubectl create -f doc/loadtest/target.yaml
replicationcontroller "target" created
$ kubectl create -f doc/loadtest/target-service.yaml
service "target" created
```

Let's verify it's running:

```console
$ kubectl get pods -l app=target
NAME           READY     STATUS    RESTARTS   AGE
target-4ivl7   1/1       Running   0          1m
```

### Clients

Now for the final, key piece of the puzzle: the client. Let's bring it up:

```console
$ kubectl create -f doc/loadtest/doorman-client.yaml
$ kubectl create -f doc/loadtest/doorman-client-service.yaml
```

This creates 10 replicas of `doorman-client`. Each replica runs the following command line:

```console
client -port=80 -logtostderr \
  -count=100 \
  -resource=proportional -initial_capacity=15 -min_capacity=5 \
  -max_capacity=2000 -increase_chance=0.1 -decrease_chance=0.05 -step=5 \
  -addr=$(DOORMAN_SERVICE_HOST):$(DOORMAN_SERVICE_PORT_GRPC) \
  -target=$(TARGET_SERVICE_HOST):$(TARGET_SERVICE_PORT_GRPC)
```

This means that every process creates 100 Doorman clients, all of which access the resource `proportional`. Each client's initial capacity will be `15`, and it will fluctuate between `5` and `2000`, with a 10% chance of increasing and a 5% chance of decreasing at each interval. Note that we get both Doorman's and the target's addresses (`-addr` and `-target`) from the environment. This is one of the ways that Kubernetes enables [discovery](https://github.com/kubernetes/kubernetes/blob/master/docs/user-guide/services.md#discovering-services).
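
If you are curious what those discovery variables look like from inside a pod, you can inspect its environment; the pod name below is a placeholder (pick one from `kubectl get pods`), and this assumes the container image ships a standard `env` binary:

```console
$ kubectl exec doorman-client-proportional-xxxxx -- env | grep -E 'DOORMAN|TARGET'
```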

## Looking around

Now that everything is running, let's take a small tour of the neighborhood. First, let's look at the Doorman server.

Make sure that forwarding port `3668` is still working, and go to http://localhost:3668/debug/status.

The status page is a lot more interesting now. You can see that there are 1000 clients, and that all the capacity has been assigned.

Another place where you may want to take a look is http://localhost:3668/debug/requests. This allows you to get a sample of the requests handled by this gRPC server, with information about the source, timing, and received arguments. This is an invaluable tool for debugging.

After that, let's take a look at `target`. To do that, we'll use Prometheus' expression browser. In production you'd probably want to be slightly fancier and have consoles and dashboards, but for our purposes the expression browser is enough.

Input the following expression:

```
rate(requests[5m])
```

If you remember from `target.go`, `requests` is a metric counting the requests handled by `target`. The expression above calculates its per-second rate, averaged over the last 5 minutes. We can see in the graph that we are doing 5000 requests per second:

![rate(requests[5m])](requests-rate.png)

Other interesting queries to run:

How many requests is the Doorman server receiving?
```
rate(doorman_server_requests[5m])
```

What's the average latency of a Doorman client request?
```
sum(rate(doorman_client_request_durations_sum[5m])) by (job) / sum(rate(doorman_client_request_durations_count[5m])) by (job)
```

## What to do next

### Scale!

Stir things up a bit. Add more clients! A lot more clients: say, instruct the client replication controller to maintain 100 replicas.

```console
$ kubectl scale --replicas=100 replicationcontrollers doorman-client-proportional
```

What happens to the number of requests the Doorman server is handling? How about the QPS that `target` is receiving? Is it behaving the way you expected? (Hint: if your cluster is small and there are many clients, they will eventually become starved for resources and won't be able to use all the capacity they got. Take a look at the [adaptive rate limiter](https://godoc.org/github.com/youtube/doorman/go/ratelimiter#AdaptiveQPS) for a workaround.) How about the client latencies? Can you make them better by giving Doorman more CPU?

### Different Algorithms

Experiment with different capacity distribution algorithms. Edit [`config.prototext`](docker/server/config.prototext) to use the [FAIR_SHARE](../algorithms.md#fair_share) algorithm. Does it have any effect on the metrics?
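
Assuming the configuration is baked into the server image (it lives next to the Dockerfile in `docker/server/`), after editing it you will need to rebuild and push the image and roll it out; something along these lines should work (the `v0.1.8` tag is arbitrary):

```console
$ docker build -t gcr.io/google.com/doorman/doorman-server:v0.1.8 doc/loadtest/docker/server/
$ gcloud docker push gcr.io/google.com/doorman/doorman-server:v0.1.8
$ kubectl rolling-update doorman-server --image=gcr.io/google.com/doorman/doorman-server:v0.1.8
```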

### High Availability

Make the Doorman server highly available. Add an etcd instance (or cluster) to the Kubernetes cluster, increase the number of replicas in [doorman-server.yaml](doorman-server.yaml), and configure them to perform leader election. (Hint: use the `-etcd_endpoints` and `-master_election_lock` flags.)
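
A sketch of what the highly-available invocation might look like, using the flags mentioned above (the etcd endpoint and lock path are placeholders):

```console
doorman -logtostderr -port=3667 -debug_port=3668 -config=./config.prototext \
  -etcd_endpoints=http://etcd:2379 -master_election_lock=/doorman/master
```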