---
layout: post
title: A short introduction to Channelz
published: true
permalink: blog/a_short_introduction_to_channelz
author: Yuxuan Li
author-link: https://github.com/lyuxuan
company: Google
company-link: https://www.google.com
---
Channelz is a tool that provides comprehensive runtime information about
connections at different levels in gRPC. It is designed to help debug live
programs that may be suffering from network, performance, or configuration
issues. The
[gRFC](https://github.com/grpc/proposal/blob/master/A14-channelz.md) provides a
detailed explanation of the channelz design and is the canonical reference for
all channelz implementations across languages. The purpose of this post is to
familiarize readers with the channelz service and how to use it to debug
issues. The context of this post is set in
[gRPC-Go](https://github.com/grpc/grpc-go), but the overall idea should be
applicable across languages. At the time of writing, channelz is available for
[gRPC-Go](https://github.com/grpc/grpc-go) and
[gRPC-Java](https://github.com/grpc/grpc-java). Support for
[C++](https://github.com/grpc/grpc) and wrapped languages is coming soon.

<!--more-->

Let's learn about channelz through a simple example that uses it to help debug
an issue. The
[helloworld](https://github.com/grpc/grpc-go/tree/master/examples/helloworld)
example from our repo is slightly modified to set up a buggy scenario. You can
find the full source code here:
[client](https://gist.github.com/lyuxuan/515fa6da7e0924b030e29b8be56fd90a),
[server](https://gist.github.com/lyuxuan/81dd08ca649a6c78a61acc7ab05e0fef).
********************************************************************************

> **Client setup:**
>
> The client will make 100 SayHello RPCs to a specified target and load balance
> the workload with the round robin policy. Each call has a 150ms timeout. RPC
> responses and errors are logged for debugging purposes.

********************************************************************************
Running the program, we notice in the log that there are intermittent errors
with error code **DeadlineExceeded** (as shown in Figure 1).

However, there's no clue about what is causing the deadline exceeded errors,
and there are many possibilities:

* network issue, e.g. a lost connection
* proxy issue, e.g. requests/responses dropped in the middle
* server issue, e.g. lost requests or slow responses

<img src="/img/log.png" style="max-width: 947px">

<p style="text-align: center"> Figure 1. Program log
screenshot</p>
Let's turn on gRPC INFO logging for more debug information and see if we can
find something helpful.

<img src="/img/logWithInfo.png" style="max-width: 997px">

<p style="text-align: center"> Figure 2.
gRPC INFO log</p>

As shown in Figure 2, the INFO log indicates that all three connections to the
server are connected and ready for transmitting RPCs. No suspicious event shows
up in the log. One thing that can be inferred from the log is that all
connections are up the whole time, so the lost-connection hypothesis can be
ruled out.
To further narrow down the root cause of the issue, we will ask channelz for
help.

Channelz provides gRPC's internal networking stats through a gRPC service. To
enable channelz, users just need to register the channelz service with a gRPC
server in their program and start that server. The code snippet below shows the
API for registering the channelz service with a
[grpc.Server](https://godoc.org/google.golang.org/grpc#Server). Note that this
has already been done for our example client.
```go
import "google.golang.org/grpc/channelz/service"

// s is a *grpc.Server
service.RegisterChannelzServiceToServer(s)

// call s.Serve() to serve the channelz service
```
A web tool called
[grpc-zpages](https://github.com/grpc/grpc-experiments/tree/master/grpc-zpages)
has been developed to conveniently serve channelz data through a web page.
First, configure the web app to connect to the gRPC port that's serving the
channelz service (see the instructions at the previous link). Then, open the
channelz web page in the browser. You should see a web page like Figure 3. Now
we can start querying channelz!

<p align="center">
<img src="/img/mainpage.png" style="max-width: 935px">
</p>

<p style="text-align: center"> Figure 3.
Channelz main page</p>
As the error is on the client side, let's first click on
[TopChannels](https://github.com/grpc/proposal/blob/master/A14-channelz.md#gettopchannels).
TopChannels is the collection of root channels, i.e. channels that don't have
parents. In gRPC-Go, a top channel is a
[ClientConn](https://godoc.org/google.golang.org/grpc#ClientConn) created by the
user through [Dial](https://godoc.org/google.golang.org/grpc#Dial) or
[DialContext](https://godoc.org/google.golang.org/grpc#DialContext) and used for
making RPC calls. Top channels are of the
[Channel](https://github.com/grpc/grpc-proto/blob/9b13d199cc0d4703c7ea26c9c330ba695866eb23/grpc/channelz/v1/channelz.proto#L37)
type in channelz, which is an abstraction of a connection that an RPC can be
issued to.
<p align="center">
<img src="/img/topChan1.png" style="max-width: 815px">
</p>

<p style="text-align: center"> Figure 4.
TopChannels result</p>

So we click on TopChannels, and a page like Figure 4 appears, listing all the
live top channel(s) with related info.

As shown in Figure 5, there is only one top channel, with id = 2. (Note that
the text in square brackets is the reference name of the in-memory channel
object, which may vary across languages.)

Looking at the **Data** section, we can see that 15 out of the 100 calls on
this channel failed.

<p align="center">
<img src="/img/topChan2.png" style="max-width: 815px">
</p>

<p style="text-align: center"> Figure 5.
Top Channel (id = 2)</p>
On the right-hand side, the page shows that the channel has no child
**Channels**, 3 **Subchannels** (as highlighted in Figure 6), and 0 **Sockets**.

<p align="center">
<img src="/img/topChan3.png" style="max-width: 815px">
</p>

<p style="text-align: center"> Figure 6.
Subchannels owned by the Channel (id = 2)</p>
A
[Subchannel](https://github.com/grpc/grpc-proto/blob/9b13d199cc0d4703c7ea26c9c330ba695866eb23/grpc/channelz/v1/channelz.proto#L61)
is an abstraction over a connection and is used for load balancing. For
example, say you want to send requests to "google.com". The resolver resolves
"google.com" to multiple backend addresses that serve "google.com". In our
example, the client is set up with the round robin load balancer, so all live
backends receive equal traffic. The (logical) connection to each backend is
then represented as a Subchannel. In gRPC-Go, a
[SubConn](https://godoc.org/google.golang.org/grpc/balancer#SubConn) can be
seen as a Subchannel.
The three Subchannels owned by the parent Channel mean there are three
connections to three different backends for sending RPCs to. Let's look inside
each of them for more info.

So we click on the first Subchannel ID listed (i.e. "4\[\]"), and a page like
Figure 7 renders. We can see that all calls on this Subchannel have succeeded,
so it's unlikely that this Subchannel is related to the issue we are having.

<p align="center">
<img src="/img/subChan4.png" style="max-width: 815px">
</p>

<p style="text-align: center"> Figure 7.
Subchannel (id = 4)</p>
So we go back and click on Subchannel 5 (i.e. "5\[\]"). Again, the web page
indicates that Subchannel 5 has also never had any failed calls.

<p align="center">
<img src="/img/subChan6_1.png" style="max-width: 815px">
</p>

<p style="text-align: center"> Figure 8.
Subchannel (id = 6)</p>
And finally, we click on Subchannel 6. This time, there's something different.
As we can see in Figure 8, 15 out of the 34 RPC calls on this Subchannel
failed. And remember that the parent Channel also has exactly 15 failed calls.
Therefore, Subchannel 6 is where the issue comes from. The state of the
Subchannel is **READY**, which means it is connected and ready to transmit
RPCs. That rules out network connection problems. To dig up more info, let's
look at the Socket owned by this Subchannel.
A
[Socket](https://github.com/grpc/grpc-proto/blob/9b13d199cc0d4703c7ea26c9c330ba695866eb23/grpc/channelz/v1/channelz.proto#L227)
is roughly equivalent to a file descriptor, and can generally be regarded as
the TCP connection between two endpoints. In gRPC-Go,
[http2Client](https://github.com/grpc/grpc-go/blob/ce4f3c8a89229d9db3e0c30d28a9f905435ad365/internal/transport/http2_client.go#L46)
and
[http2Server](https://github.com/grpc/grpc-go/blob/ce4f3c8a89229d9db3e0c30d28a9f905435ad365/internal/transport/http2_server.go#L61)
correspond to Sockets. Note that a network listener is also considered a
Socket, and will show up in the channelz Server info.

<p align="center">
<img src="/img/subChan6_2.png" style="max-width: 815px">
</p>

<p style="text-align: center"> Figure 9.
Subchannel (id = 6) owns Socket (id = 8)</p>
We click on Socket 8, which is at the bottom of the page (see Figure 9), and we
now see a page like Figure 10.

The page provides comprehensive info about the socket, such as the security
mechanism in use, stream counts, message counts, keepalives, flow control
numbers, etc. The socket options are not shown in the screenshot, as there are
a lot of them and they are not related to the issue we are investigating.
224+
The **Remote Address** field suggests that the backend we are having a problem
225+
with is **"127.0.0.1:10003"**. The stream counts here correspond perfectly to
226+
the call counts of the parent Subchannel. From this, we can know that the server
227+
is not actively sending DeadlineExceeded errors. This is because if the
228+
DeadlineExceeded error is returned by the server, then the streams would all be
229+
successful. A client side stream's success is independent of whether the call
230+
succeeds. It is determined by whether a HTTP2 frame with EOS bit set has been
231+
received (refer to the
232+
[gRFC](https://github.com/grpc/proposal/blob/master/A14-channelz.md#socket-data)
233+
for more info). Also, we can see that the number of messages sent is 34, which
234+
equals the number of calls, and it rules out the possibility of the client being
235+
stuck somehow and results in deadline exceeded. In summary, we can narrow down
236+
the issue to the server which serves on 127.0.0.1:10003. It may be that the
237+
server is slow to respond, or some proxy in front of it is dropping requests.
238+
239+
<p align="center">
240+
<img src="/img/socket8.png" style="max-width: 815px">
241+
</p>
242+
243+
<p style="text-align: center"> Figure 10. Socket
244+
(id = 8)</p>
As you can see, channelz has helped us pinpoint the potential root cause of the
issue with just a few clicks. You can now concentrate on what's happening with
the pinpointed server. And again, channelz may help expedite the debugging on
the server side too.

We will stop here and let readers explore the server side of channelz, which is
simpler than the client side. In channelz, a
[Server](https://github.com/grpc/grpc-proto/blob/9b13d199cc0d4703c7ea26c9c330ba695866eb23/grpc/channelz/v1/channelz.proto#L199)
is also an RPC entry point like a Channel, where incoming RPCs arrive and get
processed. In gRPC-Go, a
[grpc.Server](https://godoc.org/google.golang.org/grpc#Server) corresponds to a
channelz Server. Unlike a Channel, a Server only has Sockets (both listen
socket(s) and normal connected socket(s)) as its children.
Here are some hints for the readers:

* Look for the server with the address (127.0.0.1:10003).
* Look at the call counts.
* Go to the Socket(s) owned by the server.
* Look at the Socket stream counts and message counts.
You should notice that the number of messages received by the server socket is
the same as the number sent by the client socket (Socket 8), which rules out a
misbehaving proxy (dropping requests) in the middle. And the number of messages
sent by the server socket equals the number received at the client side, which
means the server was not able to send back the responses before the deadline.
You may now look at the
[server](https://gist.github.com/lyuxuan/81dd08ca649a6c78a61acc7ab05e0fef) code
to verify whether it is indeed the cause.
********************************************************************************

> **Server setup:**
>
> The server side program starts up three GreeterServers, with two of them using
> an implementation
> ([server{}](https://gist.github.com/lyuxuan/81dd08ca649a6c78a61acc7ab05e0fef#file-main-go-L42))
> that imposes no delay when responding to the client, and one using an
> implementation
> ([slowServer{}](https://gist.github.com/lyuxuan/81dd08ca649a6c78a61acc7ab05e0fef#file-main-go-L50))
> which injects a variable delay of 100ms - 200ms before sending the response.

********************************************************************************
As you can see through this demo, channelz helped us quickly narrow down the
possible causes of an issue and is easy to use. For more resources, see the
detailed channelz
[gRFC](https://github.com/grpc/proposal/blob/master/A14-channelz.md). Find us on
GitHub at [https://github.com/grpc/grpc-go](https://github.com/grpc/grpc-go).
