
nats: no responders available for request #2741

Open
begmaroman opened this issue Dec 3, 2024 · 15 comments
@begmaroman

Description of the bug

I'm building an example microservices project using go-micro with two simple services: https://github.com/begmaroman/go-micro-boilerplate/blob/feature/k8s/docker-compose.yaml

The first service, rest-api-svc, is built using the go-micro web framework. The second, account-svc, is built using go-micro.

They use a NATS server for discovery, transport, and as a broker. I tried all other transport options, but the following bug still appears:

{"id":"go.micro.client.transport","code":500,"detail":"nats: no responders available for request","status":"Internal Server Error"}

The interesting point is that the bug is flaky and appears roughly once per 10 requests.

How to reproduce the bug

  1. Clone the repo
  2. Run go mod download
  3. Run make build-base-image
  4. Run docker compose up --build
  5. Send a request to GET http://localhost:3004/user a few times (a small sketch for counting status codes follows below)
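
To make the flakiness easier to observe than clicking a few times, here is a tiny hypothetical Go snippet (not part of the boilerplate repo) that hammers the endpoint from step 5 and tallies the status codes:

package main

import (
    "fmt"
    "net/http"
)

// Fires 100 GET requests at the rest-api-svc endpoint from the repro steps
// and counts the status codes, making the intermittent 500s visible.
// Hypothetical snippet, not part of the boilerplate repo.
func main() {
    counts := map[int]int{}
    for i := 0; i < 100; i++ {
        resp, err := http.Get("http://localhost:3004/user")
        if err != nil {
            counts[-1]++ // network-level errors
            continue
        }
        resp.Body.Close()
        counts[resp.StatusCode]++
    }
    fmt.Println(counts)
}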

Environment

Go Version: 1.23.0

@asim
Member

asim commented Dec 3, 2024

I can't remember the code for nats specifically, but it could be that across restarts the registry is being polluted by dead nodes. Ensure you're using heartbeat and TTLs for expiry with the registry, and set your client retries to 3. That should mitigate some of the problem.
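
For concreteness, a minimal sketch of that setup, assuming the go-micro.dev/v4 import path (adjust to whatever version the boilerplate pins) and using the service name that appears in the error log later in this thread:

package main

import (
    "log"
    "time"

    "go-micro.dev/v4"
    "go-micro.dev/v4/client"
)

func main() {
    svc := micro.NewService(
        micro.Name("go-micro-boilerplate.account-svc"),
        // Heartbeat: re-register every 10s so the registry entry stays fresh...
        micro.RegisterInterval(10*time.Second),
        // ...and expire the entry after 30s if the node dies without deregistering.
        micro.RegisterTTL(30*time.Second),
    )
    svc.Init()

    // Retry up to 3 times so a dead registry entry is skipped in favour of
    // another node on the next attempt.
    _ = svc.Client().Init(client.Retries(3))

    if err := svc.Run(); err != nil {
        log.Fatal(err)
    }
}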

@asim
Member

asim commented Dec 3, 2024

If you look at the nats registry code in the plugins repo, you'll notice that deregistration has no form of broadcasting, so it can effectively leave dead nodes behind. It's been a long time since I've done any development here, so you'll need to investigate yourself, I'm afraid.

@begmaroman
Author

@asim this problem occurs with all registries available in the plugins repo, so I think it's something related to dead nodes or similar. Would you mind giving more advice about the fix? I'm really stuck here. Thank you very much.

@asim
Member

asim commented Dec 4, 2024

If it's related to all registries, then it's an issue with shutdown: services aren't getting the time to deregister. It can happen if they are killed without a termination signal, usually in a k8s-like environment or with kill -9 locally. If you have TTL and expiry set in the service options, these entries should expire from the registry, but you can also increase the client retries so the client immediately tries a different service entry from the registry.

service.Client().Init(client.Retries(3))

@begmaroman
Author

@asim the weird thing is that when I'm using a custom selector strategy, I sometimes receive an incomplete list of nodes, exactly when a request fails.

@asim
Member

asim commented Dec 4, 2024

Then the issue is likely with your custom selector, I guess. What are you using?

@begmaroman
Author

@asim I use a self-pinger client https://github.com/begmaroman/go-micro-boilerplate/blob/master/proto/health/pinger.go and a custom selector https://github.com/begmaroman/go-micro-boilerplate/blob/master/proto/health/health.go

The list of nodes that comes in as an argument sometimes has 2 nodes and sometimes only 1. When it has only 1 node, the request usually fails.

@begmaroman
Author

begmaroman commented Dec 4, 2024

Some more context:

error while serving connection: go-micro-boilerplate.account-svc-3a6bc790-1763-409f-b96d-e3e3bea7b6c4 | _INBOX.4A34WmqmnjQZ87mXVO6djQ: deadline exceeded

@asim
Member

asim commented Dec 4, 2024

OK, so assuming this is a self-healthcheck, my guess is that the error could occur before the service actually registers: if the ping request is initiated before the entry is in the registry, no nodes matching the instance are returned. That's the only time I could see that error, based on my limited understanding of the code.
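
If that race were the cause, one mitigation (purely a sketch under the same go-micro.dev/v4 assumption, not code from the boilerplate; the helper name is hypothetical) would be to block the self-pinger until the service's own node is visible in the registry:

package main

import (
    "fmt"
    "strings"
    "time"

    "go-micro.dev/v4/registry"
)

// waitForRegistration polls the registry until a node whose ID carries the
// service-name prefix shows up, or the timeout elapses. Hypothetical helper;
// call it before starting the self-pinger.
func waitForRegistration(reg registry.Registry, name string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        services, err := reg.GetService(name)
        if err == nil {
            for _, svc := range services {
                for _, node := range svc.Nodes {
                    if strings.HasPrefix(node.Id, name) {
                        return nil
                    }
                }
            }
        }
        time.Sleep(200 * time.Millisecond)
    }
    return fmt.Errorf("service %q not visible in the registry within %s", name, timeout)
}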

@begmaroman
Author

@asim given that I define the microservice and communicate with it in the regular way, should we consider the issue to be in the framework?

@begmaroman
Author

OK, so assuming this is a self-healthcheck, my guess is that the error could occur before the service actually registers: if the ping request is initiated before the entry is in the registry

It fails on every second request. I only send pings and healthchecks after the service has fully started.

@asim
Member

asim commented Dec 4, 2024

How often is the healthcheck fired?

@asim
Member

asim commented Dec 4, 2024

@asim given that I define the microservice and communicate with it in the regular way, should we consider the issue to be in the framework?

  1. Given no one else has reported this, I wouldn't call it an issue with the framework. You're doing something very bespoke with your own selector and nats. If it's happening with every registry, then it points to some issue with your k8s setup as opposed to the service itself. I don't know the specifics of it, so it's hard to comment. Theoretically it should be fine, but if it's failing, then either the service is dead or something else is wrong.

@begmaroman
Author

How often is the healthcheck fired?

Every second. I tried every 5 seconds as well. The point is that the healthcheck of the service itself and the self-pinger client with the custom selector work well.

Results of executing 2k requests with a concurrency of 50:
Without a custom selector strategy:

Status code distribution:
  [200] 1006 responses
  [500] 994 responses

With a custom selector strategy:

Status code distribution:
  [200] 2000 responses

As far as I can guess, it fails when trying to retrieve a live node, but if we specify exactly which node to use, it works well.

@begmaroman
Author

I found the problem. For some reason, the list of nodes in the service object contains a weird item which does not work:

{
        "metadata": null,
        "id": "1cab0b44-71dc-43d6-a145-c58faa936b63",
        "address": "172.19.0.4:5678"
}

Since the default random selector is used, in 50% of cases it tries to use the unreachable node. A valid node ID starts with the service name followed by the node ID, while this bad node has only a bare ID.

I created a custom selector which filters out bad nodes.
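
For reference, a minimal sketch of such a filter (again assuming the go-micro.dev/v4 import path; the function name is hypothetical and this is not the boilerplate's actual selector code) that keeps only nodes whose ID carries the service-name prefix:

package main

import (
    "strings"

    "go-micro.dev/v4/registry" // import path is an assumption; match your go-micro version
)

// filterBadNodes returns copies of the services containing only nodes whose
// ID starts with the service name, dropping the bare-UUID entries that turn
// out to be unreachable. Hypothetical helper, not the boilerplate's code.
func filterBadNodes(services []*registry.Service) []*registry.Service {
    out := make([]*registry.Service, 0, len(services))
    for _, svc := range services {
        filtered := *svc
        filtered.Nodes = nil
        for _, node := range svc.Nodes {
            if strings.HasPrefix(node.Id, svc.Name) {
                filtered.Nodes = append(filtered.Nodes, node)
            }
        }
        if len(filtered.Nodes) > 0 {
            out = append(out, &filtered)
        }
    }
    return out
}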
