
nats: no responders available for request #2741

Open
begmaroman opened this issue Dec 3, 2024 · 15 comments
@begmaroman

Description of the bug

I'm building an example microservices project using go-micro with two simple services: https://github.com/begmaroman/go-micro-boilerplate/blob/feature/k8s/docker-compose.yaml

The first service, rest-api-svc, is built using the go-micro web framework. The second, account-svc, is built using go-micro.

They use a NATS server for discovery, transport, and as a broker. I tried all other transport options, but the following bug still appears:

{"id":"go.micro.client.transport","code":500,"detail":"nats: no responders available for request","status":"Internal Server Error"}

The interesting point is that the bug is flaky and appears roughly once per 10 requests.

How to reproduce the bug

  1. Clone the repo
  2. Run go mod download
  3. Run make build-base-image
  4. Run docker compose up --build
  5. Send a request to GET http://localhost:3004/user a few times (a small sketch for counting status codes follows below)
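
To make the flakiness easier to observe than clicking a few times, here is a tiny hypothetical Go snippet (not part of the boilerplate repo) that hammers the endpoint from step 5 and tallies the status codes:

package main

import (
    "fmt"
    "net/http"
)

// Fires 100 GET requests at the rest-api-svc endpoint from the repro steps
// and counts the status codes, making the intermittent 500s visible.
// Hypothetical snippet, not part of the boilerplate repo.
func main() {
    counts := map[int]int{}
    for i := 0; i < 100; i++ {
        resp, err := http.Get("http://localhost:3004/user")
        if err != nil {
            counts[-1]++ // network-level errors
            continue
        }
        resp.Body.Close()
        counts[resp.StatusCode]++
    }
    fmt.Println(counts)
}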

Environment

Go Version: 1.23.0

@asim
Member

asim commented Dec 3, 2024

I can't remember the code for nats specifically, but it could be that across restarts the registry is being polluted by dead nodes. Ensure you're using heartbeat and TTLs for expiry with the registry, and set your client retries to 3. That should mitigate some of the problem.
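
For concreteness, a minimal sketch of that setup, assuming the go-micro.dev/v4 import path (adjust to whatever version the boilerplate pins) and using the service name that appears in the error log later in this thread:

package main

import (
    "log"
    "time"

    "go-micro.dev/v4"
    "go-micro.dev/v4/client"
)

func main() {
    svc := micro.NewService(
        micro.Name("go-micro-boilerplate.account-svc"),
        // Heartbeat: re-register every 10s so the registry entry stays fresh...
        micro.RegisterInterval(10*time.Second),
        // ...and expire the entry after 30s if the node dies without deregistering.
        micro.RegisterTTL(30*time.Second),
    )
    svc.Init()

    // Retry up to 3 times so a dead registry entry is skipped in favour of
    // another node on the next attempt.
    _ = svc.Client().Init(client.Retries(3))

    if err := svc.Run(); err != nil {
        log.Fatal(err)
    }
}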

@asim
Member

asim commented Dec 3, 2024

If you look at the nats registry code in the plugins repo, you'll notice that deregistration has no form of broadcasting, so it can effectively leave dead nodes behind. It's been a long time since I've done any development here, so you'll need to investigate yourself, I'm afraid.

@begmaroman
Author

@asim this problem occurs with all registries available in the plugins repo, so I think it's something related to dead nodes or similar. Would you mind giving more advice about the fix? I'm really stuck here. Thank you very much.

@asim
Member

asim commented Dec 4, 2024

If it's related to all registries, then it's an issue with shutdown: services aren't getting the time to deregister. It can happen if they are killed without a termination signal, usually in a k8s-like environment or with kill -9 locally. If you have TTL and expiry set in the service options, these entries should expire from the registry, but you can also increase the client retries so the client immediately tries a different service entry from the registry.

service.Client().Init(client.Retries(3))

@begmaroman
Author

@asim the weird thing is that when I'm using a custom selector strategy, I sometimes receive an incomplete list of nodes, exactly when a request fails.

@asim
Member

asim commented Dec 4, 2024

Then the issue is likely with your custom selector, I guess. What are you using?

@begmaroman
Author

@asim I use a self-pinger client https://github.com/begmaroman/go-micro-boilerplate/blob/master/proto/health/pinger.go and a custom selector https://github.com/begmaroman/go-micro-boilerplate/blob/master/proto/health/health.go

The list of nodes that comes in as an argument sometimes has 2 nodes and sometimes only 1. When it has only 1 node, the request usually fails.

@begmaroman
Author

begmaroman commented Dec 4, 2024

Some more context:

error while serving connection: go-micro-boilerplate.account-svc-3a6bc790-1763-409f-b96d-e3e3bea7b6c4 | _INBOX.4A34WmqmnjQZ87mXVO6djQ: deadline exceeded

@asim
Member

asim commented Dec 4, 2024

OK, so assuming this is a self-healthcheck, my guess is that the error could occur before the service actually registers: if the ping request is initiated before the entry is in the registry, no nodes matching the instance are returned. That's the only time I could see that error, based on my limited understanding of the code.
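
If that race were the cause, one mitigation (purely a sketch under the same go-micro.dev/v4 assumption, not code from the boilerplate; the helper name is hypothetical) would be to block the self-pinger until the service's own node is visible in the registry:

package main

import (
    "fmt"
    "strings"
    "time"

    "go-micro.dev/v4/registry"
)

// waitForRegistration polls the registry until a node whose ID carries the
// service-name prefix shows up, or the timeout elapses. Hypothetical helper;
// call it before starting the self-pinger.
func waitForRegistration(reg registry.Registry, name string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        services, err := reg.GetService(name)
        if err == nil {
            for _, svc := range services {
                for _, node := range svc.Nodes {
                    if strings.HasPrefix(node.Id, name) {
                        return nil
                    }
                }
            }
        }
        time.Sleep(200 * time.Millisecond)
    }
    return fmt.Errorf("service %q not visible in the registry within %s", name, timeout)
}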

@begmaroman
Author

@asim given that I define the microservice and communicate with it in the regular way, should we consider the issue to be in the framework?

@begmaroman
Author

OK, so assuming this is a self-healthcheck, my guess is that the error could occur before the service actually registers: if the ping request is initiated before the entry is in the registry

It fails on every second request. I only send pings and healthchecks after the service has fully started.

@asim
Member

asim commented Dec 4, 2024

How often is the healthcheck fired?

@asim
Member

asim commented Dec 4, 2024

@asim given that I define the microservice and communicate with it in the regular way, should we consider the issue to be in the framework?

  1. Given no one else has reported this, I wouldn't call it an issue with the framework. You're doing something very bespoke with your own selector and nats. If it's happening with every registry, then it points to some issue with your k8s setup as opposed to the service itself. I don't know the specifics of it, so it's hard to comment. Theoretically it should be fine, but if it's failing, then either the service is dead or something else is wrong.

@begmaroman
Author

How often is the healthcheck fired?

Every second. I tried every 5 seconds as well. The point is that the healthcheck of the service itself and the self-pinger client with the custom selector work well.

Results of executing 2k requests with a concurrency of 50:
Without a custom selector strategy:

Status code distribution:
  [200] 1006 responses
  [500] 994 responses

With a custom selector strategy:

Status code distribution:
  [200] 2000 responses

As far as I can guess, it fails when trying to retrieve a live node, but if we specify exactly which node to use, it works well.

@begmaroman
Author

I found the problem. For some reason, the list of nodes in the service object contains a weird item which does not work:

{
        "metadata": null,
        "id": "1cab0b44-71dc-43d6-a145-c58faa936b63",
        "address": "172.19.0.4:5678"
}

Since the default random selector is used, in 50% of cases it tries to use the unreachable node. A valid node ID starts with the service name followed by the node ID, while this bad node has only a bare ID.

I created a custom selector which filters out bad nodes.
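
For reference, a minimal sketch of such a filter (again assuming the go-micro.dev/v4 import path; the function name is hypothetical and this is not the boilerplate's actual selector code) that keeps only nodes whose ID carries the service-name prefix:

package main

import (
    "strings"

    "go-micro.dev/v4/registry" // import path is an assumption; match your go-micro version
)

// filterBadNodes returns copies of the services containing only nodes whose
// ID starts with the service name, dropping the bare-UUID entries that turn
// out to be unreachable. Hypothetical helper, not the boilerplate's code.
func filterBadNodes(services []*registry.Service) []*registry.Service {
    out := make([]*registry.Service, 0, len(services))
    for _, svc := range services {
        filtered := *svc
        filtered.Nodes = nil
        for _, node := range svc.Nodes {
            if strings.HasPrefix(node.Id, svc.Name) {
                filtered.Nodes = append(filtered.Nodes, node)
            }
        }
        if len(filtered.Nodes) > 0 {
            out = append(out, &filtered)
        }
    }
    return out
}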
