Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NewClient functions behaviour is incompatible with secure forward-proxies #7556

Open
puneet-traceable opened this issue Aug 23, 2024 · 27 comments · May be fixed by #7881
Open

NewClient functions behaviour is incompatible with secure forward-proxies #7556

puneet-traceable opened this issue Aug 23, 2024 · 27 comments · May be fixed by #7881
Assignees
Labels
Area: Client Includes Channel/Subchannel/Streams, Connectivity States, RPC Retries, Dial/Call Options and more. Type: Bug

Comments

@puneet-traceable
Copy link

puneet-traceable commented Aug 23, 2024

What version of gRPC are you using?

1.64.0 and v1.67.0-dev

What version of Go are you using (go version)?

1.22

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

If possible, provide a recipe for reproducing the error.

  1. setup a squid proxy
  2. setup grpc client(examples/helloworld/greeter_client) and grpc server(examples/helloworld/greeter_server) to use tls
  3. run grpc server
  4. on client shell set env var for using proxy(export https_proxy="http://<proxy_host:port")
  5. start tcpdump

What did you expect to see?

the target should be hostname while it's sent to proxy and dns resolution for target should happen on proxy

What did you see instead?

dns is resolved on the client and only ip is sent.
Attaching tcpdump screenshot with difference
Screenshot 2024-08-23 at 7 50 34 PM

tcpdump for curl
Screenshot 2024-08-23 at 9 48 22 PM

@dfawley
Copy link
Member

dfawley commented Aug 23, 2024

If you need a short-term workaround to keep things working as they were before, you should be able to use the passthrough resolver:

client, err := grpc.NewClient("passthrough:///<hostname>:<port>", ...)

That probably should not be required, though.

@puneet-traceable
Copy link
Author

@dfawley I get that this would work in most of the cases.
But there are more issues with this change.
For example: if dns lookups are disabled on app instances and it's only the proxy that can resolve the dns, current grpc client built with NewClient does not work. curl for http2 does not face this problem.

@dfawley
Copy link
Member

dfawley commented Aug 27, 2024

But there are more issues with this change.

Are you saying the workaround of using passthrough:///<hostname>:<port> as the target string isn't working for you in these situations? If so, can you provide some more information and debugging logs of that?

@puneet-traceable
Copy link
Author

target

No, I didn't mean passthrough won't work. But I think NewClient api should have greater flexibility to be able to configure dns versus passthrough. Since we use opentelemetry, the real grpc-go integration is way down the stack and it's hard to configure the url this way as the same url gets used at multiple places.

@dfawley
Copy link
Member

dfawley commented Aug 28, 2024

Thanks for confirming. Yes, this should be treated as a bug and the workaround was not suggested to avoid fixing it.

@dfawley
Copy link
Member

dfawley commented Aug 30, 2024

For reference, this gRFC comes into play here: https://github.com/grpc/proposal/blob/master/A1-http-connect-proxy-support.md

But note that Java did not implement this gRFC, and we may or may not want to do things this way.

@ejona86
Copy link
Member

ejona86 commented Aug 31, 2024

Java added support for "Use Case 1". But it did not use the gRFC's design. grpc/grpc-java#10022 tracks implementing Use Case 2 in Java.

puneet-traceable added a commit to hypertrace/opentelemetry-collector that referenced this issue Sep 17, 2024
With NewClient API usage, we are facing issues at few customers
who have intermediate proxies between collector and platform.
With NewClient API instead DialContext, DNS resolution happens
on the client side while it should happen on proxy.
Also, with SGProxy client does not get the correct certificate.
can be changed once grpc fixes grpc/grpc-go#7556
and otel collector picks the fix
puneet-traceable added a commit to hypertrace/opentelemetry-collector that referenced this issue Sep 17, 2024
With NewClient API usage, we are facing issues at few customers
who have intermediate proxies between collector and platform.
With NewClient API instead DialContext, DNS resolution happens
on the client side while it should happen on proxy.
Also, with SGProxy client does not get the correct certificate.
Passthrough scheme was the prior default and prevents resolution to
happen beforehand. This change can be removed once grpc fixes grpc/grpc-go#7556
and otel collector picks the fix
puneet-traceable added a commit to hypertrace/opentelemetry-collector that referenced this issue Sep 18, 2024
With NewClient API usage, we are facing issues at few customers
who have intermediate proxies between collector and platform.
With NewClient API instead DialContext, DNS resolution happens
on the client side while it should happen on proxy.
Also, with SGProxy client does not get the correct certificate.
can be changed once grpc fixes grpc/grpc-go#7556
and otel collector picks the fix
puneet-traceable added a commit to hypertrace/opentelemetry-collector that referenced this issue Sep 18, 2024
With NewClient API usage, we are facing issues at
few customers who have intermediate proxies between collector
and platform. With NewClient API instead Dial,
DNS resolution happens on the client side while it should happen on proxy.
Also, with SGProxy client does not get the correct certificate.
This can be changed once grpc fixes grpc/grpc-go#7556 and otel collector picks the fix
puneet-traceable added a commit to hypertrace/opentelemetry-collector that referenced this issue Sep 18, 2024
With NewClient API usage, we are facing issues at
few customers who have intermediate proxies between collector
and platform. With NewClient API instead Dial,
DNS resolution happens on the client side while it should happen on proxy.
Also, with SGProxy client does not get the correct certificate.
This can be changed once grpc fixes grpc/grpc-go#7556 and otel collector picks the fix
puneet-traceable added a commit to hypertrace/opentelemetry-collector that referenced this issue Sep 18, 2024
With NewClient API usage, we are facing issues at
few customers who have intermediate proxies between collector
and platform. With NewClient API instead Dial,
DNS resolution happens on the client side while it should happen on proxy.
Also, with SGProxy client does not get the correct certificate.
This can be changed once grpc fixes grpc/grpc-go#7556 and otel collector picks the fix
@arjan-bal
Copy link
Contributor

Keeping this issue open to track the fix.

@arjan-bal arjan-bal reopened this Sep 19, 2024
@purnesh42H purnesh42H added the Area: Client Includes Channel/Subchannel/Streams, Connectivity States, RPC Retries, Dial/Call Options and more. label Sep 23, 2024
tim-mwangi pushed a commit to hypertrace/opentelemetry-collector that referenced this issue Oct 3, 2024
With NewClient API usage, we are facing issues at
few customers who have intermediate proxies between collector
and platform. With NewClient API instead Dial,
DNS resolution happens on the client side while it should happen on proxy.
Also, with SGProxy client does not get the correct certificate.
This can be changed once grpc fixes grpc/grpc-go#7556 and otel collector picks the fix
@guyni
Copy link

guyni commented Oct 17, 2024

We also ran into this issue. It's very easy to reproduce. Just use https://github.com/grpc/grpc-go/tree/master/examples/helloworld/greeter_client and run nc to mock the proxy:

  1. in one terminal, run "nc -l 8888";
  2. in another terminal, run "HTTPS_PROXY=http://localhost:8888 go run main.go -addr=example.com:50051"

If using 1.58.x, you will see the nc outputs:
% nc -l 8888
CONNECT example.com:50051 HTTP/1.1
Host: example.com:50051
User-Agent: grpc-go/1.58.4-dev

If using 1.66.x, you will see:
% nc -l 8888
CONNECT 93.184.215.14:50051 HTTP/1.1
Host: 93.184.215.14:50051
User-Agent: grpc-go/1.66.4-dev

@guyni
Copy link

guyni commented Oct 17, 2024

https://github.com/grpc/grpc-go/blob/master/internal/transport/http2_client.go#L181
Shouldn't it call proxyDial(ctx, addr.ServerName, grpcUA) instead of proxyDial(ctx, address, grpcUA) here?

@eshitachandwani
Copy link
Member

master/internal/transport/http2_client.go#L181 Shouldn't it call proxyDial(ctx, addr.ServerName, grpcUA) instead of proxyDial(ctx, address, grpcUA) here?

Hey @guyni , I think the using address is correct because as metioned in gRFC A1 case 2 , we need to send the server address (specifically the resolved server address) in the HTTP_CONNECT request. The property ServerName is used for TLS certificate validation and might be empty most of the time unless specifically set while using custom CA.

@puneet-traceable
Copy link
Author

@eshitachandwani If you look at the usecase no. 1 here, it mentions that the external address should be resolved at the proxy. That means the hostname should be supplied in CONNECT call.

@eshitachandwani
Copy link
Member

eshitachandwani commented Oct 18, 2024

Hi @puneet-traceable, currently, gRPC-Go supports only use case 2, and I’m actively working on adding support for use case 1. Apologies for not mentioning this in my previous comment.

@eshitachandwani
Copy link
Member

We also ran into this issue. It's very easy to reproduce. Just use master/examples/helloworld/greeter_client and run nc to mock the proxy:

  1. in one terminal, run "nc -l 8888";
  2. in another terminal, run "HTTPS_PROXY=http://localhost:8888 go run main.go -addr=example.com:50051"

If using 1.58.x, you will see the nc outputs: % nc -l 8888 CONNECT example.com:50051 HTTP/1.1 Host: example.com:50051 User-Agent: grpc-go/1.58.4-dev

If using 1.66.x, you will see: % nc -l 8888 CONNECT 93.184.215.14:50051 HTTP/1.1 Host: 93.184.215.14:50051 User-Agent: grpc-go/1.66.4-dev

This happens because in v1.58.x we were using grpc.Dial and we used the passthrough scheme as default name resolving scheme, and so if there is no scheme, it did not get resolved on client and proxy received the unresolved name. But now, we have changed to use grpc.NewClient() with default name resolving scheme as dns and so the proxy gets the resolved name by default if no scheme is specified. As mentioned earlier, we are actively trying to resolve this.

tim-mwangi pushed a commit to hypertrace/opentelemetry-collector that referenced this issue Oct 21, 2024
With NewClient API usage, we are facing issues at
few customers who have intermediate proxies between collector
and platform. With NewClient API instead Dial,
DNS resolution happens on the client side while it should happen on proxy.
Also, with SGProxy client does not get the correct certificate.
This can be changed once grpc fixes grpc/grpc-go#7556 and otel collector picks the fix
@erezrokah
Copy link

As mentioned earlier, we are actively trying to resolve this.

Thanks @eshitachandwani for handing this issue. Didn't find if it was already mentioned but this issue impacts consumers of the GCP Go SDK (https://github.com/googleapis/google-cloud-go), can we make sure any solution applies for those consumers as well (e.g. if the solution is by adding a new option, that consumers of the GCP Go SDK can enable the option)?

@easwars
Copy link
Contributor

easwars commented Nov 5, 2024

@erezrokah : Could you please elaborate on your last comment?

How does the GCP Go SDK currently use the proxy feature?

Are they currently affected by using grpc.NewClient? Or do you think they will be affected if they make the switch?

Thanks.

@erezrokah
Copy link

erezrokah commented Nov 6, 2024

Are they currently affected by using grpc.NewClient?

They are currently affected due to this change googleapis/google-cloud-go@be2d56d#diff-215847f913454f2311866edbdbf41d7a3cd3879e0ed0e7aa26f4ad771dd6a1dcR296.

We're experiencing the same issue as this one only when using the Google Cloud Go SDK.

I think https://github.com/grpc/grpc-go/blob/master/Documentation/anti-patterns.md#the-wrong-way-grpcdial should say both Dial and DialContext use passthrough and switching to NewClient as is should be considered a breaking change, WDYT?

Edit

See googleapis/google-cloud-go#11089

@arjan-bal
Copy link
Contributor

I can think of a workaround which doesn't involve a code change in the googleapis client library:

import (
	"google.golang.org/grpc/resolver"
	"google.golang.org/grpc"
        "google.golang.org/api/option"
)

// A resolver that registers the passthrough resolver under the "dns" scheme
// to indirectly set the default resolver used by NewClient as the passthrough
// resolver.
// Note: This will shadow the dns resolver.
type customPassthroughResolver struct{}

func (r *customPassthroughResolver) Build(target resolver.Target, cc resolver.ClientConn, opts resolver.BuildOptions) (resolver.Resolver, error) {
	passthrough := resolver.Get("passthrough")
	if passthrough == nil {
		panic("Passthrough resolver not registered!")
	}
	return passthrough.Build(target, cc, opts)
}

func (r *customPassthroughResolver) Scheme() string {
	return "dns"
}

// Use a grpc client option to avoid modifying the global grpc resolver registry.
clientOption := option.WithGRPCDialOption(grpc.WithResolvers(&customPassthroughResolver{}))
// Use the client option while creating the googleapis client.

@quartzmo
Copy link
Contributor

quartzmo commented Nov 8, 2024

@arjan-bal

Regarding your workaround above, @erezrokah asked (on googleapis/google-cloud-go#11089):

Won't this workaround disable DNS resolving altogether even when explicitly set by the scheme?

@arjan-bal
Copy link
Contributor

@quartzmo, I'm not suggesting that the Go client library uses this workaround. This workaround is hacky. The Go client library should probably revert back to using grpc.Dial instead of grpc.NewClient to fix the regression. grpc-go is working on fixing the handling of forward proxy use case in NewClient, but that could take a couple of months to land.

Won't this workaround disable DNS resolving altogether even when explicitly set by the scheme?

Yes, it will replace the DNS resolver with the passthrough resolver, even if the target URI uses dns as the scheme. Since the workaround uses a call option instead of registering a balancer globally, a user should add the call option only when they don't want to use the dns scheme. A less hacky fix would be to use the passthrough scheme in the target endpoint explicitly so that the default dns scheme is not used.

DNS resolution happens when using both passthrough and dns resolvers:

  • When using the DNS resolver, the LB policy sees multiple endpoints, one for each IP addresses returned by DNS.
  • When using passthrough, the LB policy sees a single endpoint with the unresolved hostname. The Dialer resolves the hostname into an IP address using DNS just before creating the transport.

The passthrough resolver exists for backward compatibility. grpc-go wants to move all users to the dns scheme, so the GCP client library should revert back to using NewClient once NewClient handles forward proxies correctly.

@erezrokah
Copy link

@arjan-bal considering the recommendation to revert back to grpc.Dial is there anything to be done to prevent other consumers of grpc-go from running into the same issue?

I think the following pattern could be repeated for other consumers:

  1. Someone updates to a new version of grpc-go
  2. Linter checks fail due to the deprecation notice
  3. Someone updates to use NewClient per the linter warning without fully understanding the impact (some of it described here)
  4. After some time the breaking changes are observed

Related PRs from consumers to have created issues:

Other PRs I found from a quick search that might experience similar issues

Going over the PRs above I don't see any acknowledgement of the potential breakage.

Would it make sense to update the deprecation notice to make the possible issues more clear? There's some context in the DialContext comment

// One subtle difference between NewClient and Dial and DialContext is that the
but that could be easily missed I think

@arjan-bal
Copy link
Contributor

arjan-bal commented Nov 12, 2024

@erezrokah the fact that the dns resolver resolves the backend hostname on the client is a bug, it wasn't intentional. grpc-go doesn't use godoc comments to inform users about bugs presently. We expect users to find this GitHub issue if they're effected. Once the fix is merged in master, we will decide if a patch release is required for older grpc-go releases.

@arjan-bal
Copy link
Contributor

We plan to update the release notes for recent releases.

codyoss added a commit to codyoss/google-cloud-go that referenced this issue Nov 12, 2024
There have been a couple of reports lately around some subtle
changes in behaviour of using the new recommend method. There is
also a known bug, see grpc/grpc-go#7556.
Until the bug is fixed and we have a way forward we will revert to
calling the deprecated Dial function.

Fixes: googleapis#7556
@erezrokah
Copy link

@erezrokah the fact that the dns resolver resolves the backend hostname on the client is a bug, it wasn't intentional

Thanks for the additional context I thought this was intended behavior. Updating the release notes and having a fixed merged would be great as currently the deprecation of grpc.Dial pushes users to grpc.NewClient with the bug.

@eshitachandwani
Copy link
Member

Our understanding of the problem :
Before grpc.NewClient was introduced, grpc.Dial was the primary way to establish gRPC connections. With grpc.Dial, the default "passthrough" resolver simply passed the target address unchanged to the transport layer. When grpc.NewClient was introduced, it switched to the "dns" resolver by default, which resolves the target address to IP addresses before passing them to the transport. This change inadvertently altered how proxies are handled when configured via environment variables. This behavior is inconsistent with what we expect i.e resolution to happen on proxy.

The proposal to resolve the issue:

  • Remove environment variable detection from the transport layer. Instead, introduce a per-address attribute to specify the proxy CONNECT string. This change aligns with the xDS design and provides more control over proxy behavior.
  • Create a new resolver that handles proxy configuration and delegates to child resolvers for target and proxy address resolution. This resolver does the following:
    • Use the "dns" resolver for proxy address resolution if environment variables are set.
    • Use the target's "path" portion directly if the scheme is "dns" and a proxy is configured, mimicking the old "passthrough" behavior.
    • Combine the resolved target and proxy addresses, setting the proxy CONNECT string attribute.
  • Introduce a new DialOption to preserve the current behavior of grpc.Dial, where the resolved target address is used in the proxy CONNECT request. This option ensures backward compatibility for users relying on the existing behavior.

It is based on the gRFC A1

@bushnell-deshaw
Copy link

@eshitachandwani I'm concerned about this: "Remove environment variable detection from the transport layer. Instead, introduce a per-address attribute to specify the proxy CONNECT string. This change aligns with the xDS design and provides more control over proxy behavior."

Some users have extensive use of the https_proxy and expect it to work. Forcing them to do code changes will be difficult, especially if it's just a thing that happens during some library upgrade. It's a pretty substantial change to how it works. In particular, users of GRPC often are unaware of the addresses they are connecting to. (For example, it is common to use a high-level package, which may be many layers away from the actual call to NewClient or Dial; adding options is not possible; knowing which addresses are being used is not possible.)

Adding a new, more flexible, feature is surely harmless, but if the rug is pulled out under https_proxy (which is part of the Go 1.0 API promise) that's pretty severe.

@eshitachandwani
Copy link
Member

Hey @bushnell-deshaw , sorry for the confusion. We are not pulling the plug on use of https_proxy environment for specifying proxy. I meant to say that detecting the use of proxy from the environment will be moved from the transport layer (it will be moved to the new delegating resolver logic, which will add the attribute to the address if proxy is detected). The transport layer will connect to proxy if the attribute is set with address. The users can still specify proxy using https_proxy environment, its the internal logic of detection and connecting to proxy that will change. Let me know if there is still any concern.

@eshitachandwani eshitachandwani linked a pull request Nov 29, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Client Includes Channel/Subchannel/Streams, Connectivity States, RPC Retries, Dial/Call Options and more. Type: Bug
Projects
None yet