|
| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: A short introduction to Channelz |
| 4 | +published: true |
| 5 | +permalink: blog/a_short_introduction_to_channelz |
| 6 | +author: Yuxuan Li |
| 7 | +author-link: https://github.com/lyuxuan |
| 8 | +company: Google |
| 9 | +company-link: https://www.google.com |
| 10 | +--- |
| 11 | + |
| 12 | +Channelz is a tool that provides comprehensive runtime info about connections at |
| 13 | +different levels in gRPC. It is designed to help debug live programs, which may |
| 14 | +be suffering from network, performance, configuration issues, etc. The |
| 15 | +[gRFC](https://github.com/grpc/proposal/blob/master/A14-channelz.md) provides a |
| 16 | +detailed explanation of channelz design and is the canonical reference for all |
| 17 | +channelz implementations across languages. The purpose of this blog is to |
| 18 | +familiarize readers with channelz service and how to use it for debugging |
| 19 | +issues. The context of this post is set in |
| 20 | +[gRPC-Go](https://github.com/grpc/grpc-go), but the overall idea should be |
| 21 | +applicable across languages. At the time of writing, channelz is available for |
| 22 | +[gRPC-Go](https://github.com/grpc/grpc-go) and |
| 23 | +[gRPC-Java](https://github.com/grpc/grpc-java). Support for |
| 24 | +[C++](https://github.com/grpc/grpc) and wrapped languages is coming soon. |
| 25 | + |
| 26 | +<!--more--> |
| 27 | + |
| 28 | +Let's learn channelz through a simple example which uses channelz to help debug |
| 29 | +an issue. The |
| 30 | +[helloworld](https://github.com/grpc/grpc-go/tree/master/examples/helloworld) |
| 31 | +example from our repo is slightly modified to set up a buggy scenario. You can |
| 32 | +find the full source code here: |
| 33 | +[client](https://gist.github.com/lyuxuan/515fa6da7e0924b030e29b8be56fd90a), |
| 34 | +[server](https://gist.github.com/lyuxuan/81dd08ca649a6c78a61acc7ab05e0fef). |
| 35 | + |
| 36 | +******************************************************************************** |
| 37 | + |
| 38 | +> **Client setup:** |
| 39 | +
|
| 40 | +> The client will make 100 SayHello RPCs to a specified target and load balance |
| 41 | +> the workload with the round robin policy. Each call has a 150ms timeout. RPC |
| 42 | +> responses and errors are logged for debugging purposes. |
| 43 | +
|
| 44 | +******************************************************************************** |
| 45 | + |
| 46 | +Running the program, we notice in the log that there are intermittent errors |
| 47 | +with error code **DeadlineExceeded** (as shown in Figure 1). |
| 48 | + |
| 49 | +However, there's no clue about what is causing the deadline exceeded error and |
| 50 | +there are many possibilities: |
| 51 | + |
| 52 | +* network issue, e.g. connection lost |
| 53 | +* proxy issue, e.g. dropped requests/responses in the middle |
| 54 | +* server issue, e.g. lost requests or just slow to respond |
| 55 | + |
| 56 | +<img src="/img/log.png" style="max-width: 947px"> |
| 57 | + |
| 58 | +<p style="text-align: center"> Figure 1. Program log |
| 59 | +screenshort</p> |
| 60 | + |
| 61 | +Let's turn on grpc INFO logging for more debug info and see if we can find |
| 62 | +something helpful. |
| 63 | + |
| 64 | +<img src="/img/logWithInfo.png" style="max-width: 997px"> |
| 65 | + |
| 66 | +<p style="text-align: center"> Figure 2. |
| 67 | +gRPC INFO log</p> |
| 68 | + |
| 69 | +As shown in Figure 2, the info log indicates that all three connections to the |
| 70 | +server are connected and ready for transmitting RPCs. No suspicious event shows |
| 71 | +up in the log. One thing that can be inferred from the info log is that all |
| 72 | +connections are up all the time, therefore the lost connection hypothesis can be |
| 73 | +ruled out. |
| 74 | + |
| 75 | +To further narrow down the root cause of the issue, we will ask channelz for |
| 76 | +help. |
| 77 | + |
| 78 | +Channelz provides gRPC internal networking machinery stats through a gRPC |
| 79 | +service. To enable channelz, users just need to register the channelz service to |
| 80 | +a gRPC server in their program and start the server. The code snippet below |
| 81 | +shows the API for registering channelz service to a |
| 82 | +[grpc.Server](https://godoc.org/google.golang.org/grpc#Server). Note that this |
| 83 | +has already been done for our example client. |
| 84 | + |
| 85 | +```go |
| 86 | +import "google.golang.org/grpc/channelz/service" |
| 87 | + |
| 88 | +// s is a *grpc.Server |
| 89 | +service.RegisterChannelzServiceToServer(s) |
| 90 | + |
| 91 | +// call s.Serve() to serve channelz service |
| 92 | +``` |
| 93 | + |
| 94 | +A web tool called |
| 95 | +[grpc-zpages](https://github.com/grpc/grpc-experiments/tree/master/grpc-zpages) |
| 96 | +has been developed to conveniently serve channelz data through a web page. |
| 97 | +First, configure the web app to connect to the gRPC port that's serving the |
| 98 | +channelz service (see instructions from the previous link). Then, open the |
| 99 | +channelz web page in the browser. You should see a web page like Figure 3. Now |
| 100 | +we can start querying channelz! |
| 101 | + |
| 102 | +<p align="center"> |
| 103 | + <img src="/img/mainpage.png" style="max-width: 935px"> |
| 104 | +</p> |
| 105 | + |
| 106 | +<p style="text-align: center"> Figure 3. |
| 107 | +Channelz main page</p> |
| 108 | + |
| 109 | +As the error is on the client side, let's first click on |
| 110 | +[TopChannels](https://github.com/grpc/proposal/blob/master/A14-channelz.md#gettopchannels). |
| 111 | +TopChannels is a collection of root channels which don't have parents. In |
| 112 | +gRPC-Go, a top channel is a |
| 113 | +[ClientConn](https://godoc.org/google.golang.org/grpc#ClientConn) created by the |
| 114 | +user through [Dial](https://godoc.org/google.golang.org/grpc#Dial) or |
| 115 | +[DialContext](https://godoc.org/google.golang.org/grpc#DialContext), and used |
| 116 | +for making RPC calls. Top channels are of |
| 117 | +[Channel](https://github.com/grpc/grpc-proto/blob/9b13d199cc0d4703c7ea26c9c330ba695866eb23/grpc/channelz/v1/channelz.proto#L37) |
| 118 | +type in channelz, which is an abstraction of a connection that an RPC can be |
| 119 | +issued to. |
| 120 | + |
| 121 | +<p align="center"> |
| 122 | + <img src="/img/topChan1.png" style="max-width: 815px"> |
| 123 | +</p> |
| 124 | + |
| 125 | +<p style="text-align: center"> Figure 4. |
| 126 | +TopChannels result</p> |
| 127 | + |
| 128 | +So we click on the TopChannels, and a page like Figure 4 appears, which lists |
| 129 | +all the live top channel(s) with related info. |
| 130 | + |
| 131 | +As shown in Figure 5, there is only one top channel with id = 2 (Note that text |
| 132 | +in square brackets is the reference name of the in memory channel object, which |
| 133 | +may vary across languages). |
| 134 | + |
| 135 | +Looking at the **Data** section, we can see there are 15 calls failed out of 100 |
| 136 | +on this channel. |
| 137 | + |
| 138 | +<p align="center"> |
| 139 | + <img src="/img/topChan2.png" style="max-width: 815px"> |
| 140 | +</p> |
| 141 | + |
| 142 | +<p style="text-align: center"> Figure 5. |
| 143 | +Top Channel (id = 2)</p> |
| 144 | + |
| 145 | +On the right hand side, it shows the channel has no child **Channels**, 3 |
| 146 | +**Subchannels** (as highlighted in Figure 6), and 0 **Sockets**. |
| 147 | + |
| 148 | +<p align="center"> |
| 149 | + <img src="/img/topChan3.png" style="max-width: 815px"> |
| 150 | +</p> |
| 151 | + |
| 152 | +<p style="text-align: center"> Figure 6. |
| 153 | +Subchannels owned by the Channel (id = 2)</p> |
| 154 | + |
| 155 | +A |
| 156 | +[Subchannel](https://github.com/grpc/grpc-proto/blob/9b13d199cc0d4703c7ea26c9c330ba695866eb23/grpc/channelz/v1/channelz.proto#L61) |
| 157 | +is an abstraction over a connection and used for load balancing. For example, |
| 158 | +you want to send requests to "google.com". The resolver resolves "google.com" to |
| 159 | +multiple backend addresses that serve "google.com". In this example, the client |
| 160 | +is set up with the round robin load balancer, so all live backends are sent |
| 161 | +equal traffic. Then the (logical) connection to each backend is represented as a |
| 162 | +Subchannel. In gRPC-Go, a |
| 163 | +[SubConn](https://godoc.org/google.golang.org/grpc/balancer#SubConn) can be seen |
| 164 | +as a Subchannel. |
| 165 | + |
| 166 | +The three Subchannels owned by the parent Channel means there are three |
| 167 | +connections to three different backends for sending the RPCs to. Let's look |
| 168 | +inside each of them for more info. |
| 169 | + |
| 170 | +So we click on the first Subchannel ID (i.e. "4\[\]") listed, and a page like |
| 171 | +Figure 7 renders. We can see that all calls on this Subchannel have succeeded. |
| 172 | +Thus it's unlikely this Subchannel is related to the issue we are having. |
| 173 | + |
| 174 | +<p align="center"> |
| 175 | + <img src="/img/subChan4.png" style="max-width: 815px"> |
| 176 | +</p> |
| 177 | + |
| 178 | +<p style="text-align: center"> Figure 7. |
| 179 | +Subchannel (id = 4)</p> |
| 180 | + |
| 181 | +So we go back, and click on Subchannel 5 (i.e. "5\[\]"). Again, the web page |
| 182 | +indicates that Subchannel 5 also never had any failed calls. |
| 183 | + |
| 184 | +<p align="center"> |
| 185 | + <img src="/img/subChan6_1.png" style="max-width: 815px"> |
| 186 | +</p> |
| 187 | + |
| 188 | +<p style="text-align: center"> Figure 8. |
| 189 | +Subchannel (id = 6)</p> |
| 190 | + |
| 191 | +And finally, we click on Subchannel 6. This time, there's something different. |
| 192 | +As we can see in Figure 8, there are 15 out of 34 RPC calls failed on this |
| 193 | +Subchannel. And remember that the parent Channel also has exactly 15 failed |
| 194 | +calls. Therefore, Subchannel 6 is where the issue comes from. The state of the |
| 195 | +Subchannel is **READY**, which means it is connected and is ready to transmit |
| 196 | +RPCs. That rules out network connection problems. To dig up more info, let's |
| 197 | +look at the Socket owned by this Subchannel. |
| 198 | + |
| 199 | +A |
| 200 | +[Socket](https://github.com/grpc/grpc-proto/blob/9b13d199cc0d4703c7ea26c9c330ba695866eb23/grpc/channelz/v1/channelz.proto#L227) |
| 201 | +is roughly equivalent to a file descriptor, and can be generally regarded as the |
| 202 | +TCP connection between two endpoints. In grpc-go, |
| 203 | +[http2Client](https://github.com/grpc/grpc-go/blob/ce4f3c8a89229d9db3e0c30d28a9f905435ad365/internal/transport/http2_client.go#L46) |
| 204 | +and |
| 205 | +[http2Server](https://github.com/grpc/grpc-go/blob/ce4f3c8a89229d9db3e0c30d28a9f905435ad365/internal/transport/http2_server.go#L61) |
| 206 | +correspond to Socket. Note that a network listener is also considered a Socket, |
| 207 | +and will show up in the channelz Server info. |
| 208 | + |
| 209 | +<p align="center"> |
| 210 | + <img src="/img/subChan6_2.png" style="max-width: 815px"> |
| 211 | +</p> |
| 212 | + |
| 213 | +<p style="text-align: center"> Figure 9. |
| 214 | +Subchannel (id = 6) owns Socket (id = 8)</p> |
| 215 | + |
| 216 | +We click on Socket 8, which is at the bottom of the page (see Figure 9). And we |
| 217 | +now see a page like Figure 10. |
| 218 | + |
| 219 | +The page provides comprehensive info about the socket like the security |
| 220 | +mechanism in use, stream count, message count, keepalives, flow control numbers, |
| 221 | +etc. The socket options info is not shown in the screenshot, as there are a lot |
| 222 | +of them and not related to the issue we are investigating. |
| 223 | + |
| 224 | +The **Remote Address** field suggests that the backend we are having a problem |
| 225 | +with is **"127.0.0.1:10003"**. The stream counts here correspond perfectly to |
| 226 | +the call counts of the parent Subchannel. From this, we can know that the server |
| 227 | +is not actively sending DeadlineExceeded errors. This is because if the |
| 228 | +DeadlineExceeded error is returned by the server, then the streams would all be |
| 229 | +successful. A client side stream's success is independent of whether the call |
| 230 | +succeeds. It is determined by whether a HTTP2 frame with EOS bit set has been |
| 231 | +received (refer to the |
| 232 | +[gRFC](https://github.com/grpc/proposal/blob/master/A14-channelz.md#socket-data) |
| 233 | +for more info). Also, we can see that the number of messages sent is 34, which |
| 234 | +equals the number of calls, and it rules out the possibility of the client being |
| 235 | +stuck somehow and results in deadline exceeded. In summary, we can narrow down |
| 236 | +the issue to the server which serves on 127.0.0.1:10003. It may be that the |
| 237 | +server is slow to respond, or some proxy in front of it is dropping requests. |
| 238 | + |
| 239 | +<p align="center"> |
| 240 | + <img src="/img/socket8.png" style="max-width: 815px"> |
| 241 | +</p> |
| 242 | + |
| 243 | +<p style="text-align: center"> Figure 10. Socket |
| 244 | +(id = 8)</p> |
| 245 | + |
| 246 | +As you see, channelz has helped us pinpoint the potential root cause of the |
| 247 | +issue with just a few clicks. You can now concentrate on what's happening with |
| 248 | +the pinpointed server. And again, channelz may help expedite the debugging at |
| 249 | +the server side too. |
| 250 | + |
| 251 | +We will stop here and let readers explore the server side channelz, which is |
| 252 | +simpler than the client side. In channelz, a |
| 253 | +[Server](https://github.com/grpc/grpc-proto/blob/9b13d199cc0d4703c7ea26c9c330ba695866eb23/grpc/channelz/v1/channelz.proto#L199) |
| 254 | +is also an RPC entry point like a Channel, where incoming RPCs arrive and get |
| 255 | +processed. In grpc-go, a |
| 256 | +[grpc.Server](https://godoc.org/google.golang.org/grpc#Server) corresponds to a |
| 257 | +channelz Server. Unlike Channel, Server only has Sockets (both listen socket(s) |
| 258 | +and normal connected socket(s)) as its children. |
| 259 | + |
| 260 | +Here are some hints for the readers: |
| 261 | + |
| 262 | +* Look for the server with the address (127.0.0.1:10003). |
| 263 | +* Look at the call counts. |
| 264 | +* Go to the Socket(s) owned by the server. |
| 265 | +* Look at the Socket stream counts and message counts. |
| 266 | + |
| 267 | +You should notice that the number of messages received by the server socket is |
| 268 | +the same as sent by the client socket (Socket 8), which rules out the case of |
| 269 | +having a misbehaving proxy (dropping request) in the middle. And the number of |
| 270 | +messages sent by the server socket is equal to the messages received at client |
| 271 | +side, which means the server was not able to send back the response before |
| 272 | +deadline. You may now look at the |
| 273 | +[server](https://gist.github.com/lyuxuan/81dd08ca649a6c78a61acc7ab05e0fef) code |
| 274 | +to verify whether it is indeed the cause. |
| 275 | + |
| 276 | +******************************************************************************** |
| 277 | + |
| 278 | +> **Server setup:** |
| 279 | +
|
| 280 | +> The server side program starts up three GreeterServers, with two of them using |
| 281 | +> an implementation |
| 282 | +> ([server{}](https://gist.github.com/lyuxuan/81dd08ca649a6c78a61acc7ab05e0fef#file-main-go-L42)) |
| 283 | +> that imposes no delay when responding to the client, and one using an |
| 284 | +> implementation |
| 285 | +> ([slowServer{}](https://gist.github.com/lyuxuan/81dd08ca649a6c78a61acc7ab05e0fef#file-main-go-L50)) |
| 286 | +> which injects a variable delay of 100ms - 200ms before sending the response. |
| 287 | +
|
| 288 | +******************************************************************************** |
| 289 | + |
| 290 | +As you can see through this demo, channelz helped us quickly narrow down the |
| 291 | +possible causes of an issue and is easy to use. For more resources, see the |
| 292 | +detailed channelz |
| 293 | +[gRFC](https://github.com/grpc/proposal/blob/master/A14-channelz.md). Find us on |
| 294 | +github at [https://github.com/grpc/grpc-go](https://github.com/grpc/grpc-go). |
| 295 | + |
0 commit comments