Issues with Cortex and Linkerd #8759

agalue · 2022-06-28T10:53:46Z

I'm having some issues with Linkerd and Cortex (https://cortexmetrics.io).

For those unfamiliar with Cortex, it has multiple components that can scale independently. Each component kind uses internal routing to decide who to talk to; it doesn't use Kubernetes Services, meaning a Pod from Component A will talk to a specific Pod from Component B, reaching it by its IP address.

In the case of the Frontends and Queriers, the Queriers register themselves with the Frontend instances. Then, depending on the queries it receives, the Frontend decides which Queriers to talk to (sometimes several, when it has to split a query).
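As a rough illustration of that routing model, here is a toy sketch in Python. It is not Cortex code: the Frontend class, the Pod IPs, and the round-robin assignment are all assumptions made for the example. It only shows the shape of the interaction, where Queriers announce themselves and the Frontend then addresses them directly by IP, with no Kubernetes Service in between.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Frontend:
    """Toy query frontend that only knows the queriers that registered with it."""

    # Pod IPs of registered queriers (hypothetical values, not Service names).
    queriers: list[str] = field(default_factory=list)

    def register(self, querier_ip: str) -> None:
        # The querier initiates contact; the frontend just records its address.
        if querier_ip not in self.queriers:
            self.queriers.append(querier_ip)

    def dispatch(self, query: str, splits: int = 1) -> list[tuple[str, str]]:
        # Split the query into `splits` parts and assign each part to a
        # registered querier. Round-robin is an assumption of this sketch.
        if not self.queriers:
            raise RuntimeError("no queriers registered")
        targets = cycle(self.queriers)
        return [
            (next(targets), f"{query} [part {i + 1}/{splits}]")
            for i in range(splits)
        ]


frontend = Frontend()
for ip in ("10.244.0.11", "10.244.0.12"):  # hypothetical Pod IPs
    frontend.register(ip)

for target, part in frontend.dispatch("sum(rate(up[5m]))", splits=3):
    print(f"{target} <- {part}")
```

The point of the sketch is that every hop targets a specific Pod IP, which is the traffic pattern Linkerd has to observe here.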

Judging by the diagram on the dashboards, Linkerd detects traffic between the Query Frontends and the Queriers. However, when I tap them, I cannot see them talking to each other, and there are no statistics. The component that sends requests to the Query Frontend does see responses from it (which is how I know the whole solution is working). This happens with the latest Linkerd stable version; I also tried the latest edge.

For instance, note how the Inbound stats against Queriers are empty.

Similarly, note how the Outbound stats against Query Frontends are empty.

The fact that they are listed might indicate that somehow Linkerd is aware of the traffic, but I don't understand why I cannot see the packets or statistics about them. They must have been talking to each other, as I can make queries and see results.

Using the Debug Container on the Frontend Pods, I can see that the gRPC communication with the Queriers is present and encrypted (using tshark -i any tcp port 9095, or filtering by Pod IP). The same apparently holds for HTTP traffic with our gateway application (using tshark -i any tcp port 8080). That is why it is unexpected that there are no stats and that the linkerd viz tap command cannot see the traffic.

On top of that, I do not see traffic to the Memcached instances on the dashboards or via linkerd viz tap (except requests from Prometheus). In this case, the Frontend stores results in a cache, and the Queriers should have access to the Metadata cache (and the Store Gateways to the Index, Metadata, and Chunks caches). I would expect to see TCP traffic when I tap any of them, but that is not happening, even though I can see stats and edges for all of them (unlike the problem described above). I know the caches are working because the Cortex Dashboards for the Read Path show cache hits and misses.

From the Frontend, using the Debug container, Memcached traffic looks encrypted in tshark, so the proxy seems to be applying mTLS. I checked this with tshark -i any -d tcp.port=11211,ssl host 10.244.0.36, where 10.244.0.36 is the IP of one of the Memcached Pods, running the capture from a Cortex instance that talks to it.

Other communication scenarios that also use Pod IPs instead of Kubernetes Services seem to work, for instance communication between Distributors and Ingesters, between Queriers and Ingesters, and between Queriers and Store Gateways.

The problem, for some reason, appears between Queriers and Frontends.

Any help would be appreciated.
