DNS lookups fail for api.nuget.org using Alpine based dotnet Docker images in AWS us-east-1 #9396
Comments
Thanks for the heads up, @ggatus-atlassian! The part of the DNS stack that differs between regions is managed by a dependency of ours (our CDN provider), not the NuGet.org team directly. We'll work with this partner to understand how to mitigate the problem.
@ggatus-atlassian I was able to reproduce the larger response size with a VPN in Virginia but was unable to reproduce the truncated response in an Alpine Linux VM. Have you been able to reproduce the problem outside of an AWS VM (i.e. removing Route53 from the picture)? As you can see in the screenshot, the MSG SIZE is 814, which is higher than the limit you described. Do you know if there is a way to verify what the DNS size limit is on a given Linux VM?
We've been unable to reproduce the problem outside of AWS us-east-1. We've been speaking with AWS support; they have run tests across us-east-1, us-east-2, and us-west-1 and only found the large CNAME chain being returned in us-east-1. AWS is also performing additional DNS response manipulation and truncation, which complicates this further beyond just the issues with Alpine.

This patch details the limitation of musl libc in Alpine, namely its 512-byte limit on responses and its lack of a fallback mechanism: https://git.musl-libc.org/cgit/musl/commit/src/network/res_msend.c?id=51d4669fb97782f6a66606da852b5afd49a08001

Can you provide some more details of the test environment you are using? What version of Alpine are you running, and what cloud platform are you running on? If you try running a

The 814 message size above is odd; potentially dig isn't using musl, or the version of Alpine you are running has been patched.
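There isn't a single per-VM "limit" to read off, since it depends on the resolver library, but one way to probe it is with dig (assuming dig is installed, e.g. via the bind-tools package on Alpine); this is a sketch, not an official diagnostic:

```sh
# Alpine's musl resolver sends plain (non-EDNS) queries, so answers are
# capped at the classic 512-byte UDP limit; +noedns makes dig do the same.
dig +noedns api.nuget.org

# Compare with an EDNS query advertising a larger UDP buffer:
dig +bufsize=4096 api.nuget.org

# In both cases the ";; MSG SIZE  rcvd:" line reports the response size,
# and a "tc" entry in the flags line indicates truncation.
```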
@ggatus-atlassian To help us in our investigation, could you provide packet captures of your DNS resolution requests (e.g., using Wireshark)?
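For anyone who can't run Wireshark inside a container, a capture along these lines should produce an equivalent pcap (the interface name eth0 is an assumption; adjust for your environment):

```sh
# On Alpine, tcpdump is available via apk.
apk add tcpdump

# Capture DNS traffic to a pcap file in the background...
tcpdump -i eth0 -w nuget-dns.pcap port 53 &
TCPDUMP_PID=$!

# ...trigger the failing lookup, then stop the capture.
wget api.nuget.org
kill "$TCPDUMP_PID"
```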
We are also seeing this issue today. Our containers are running in CircleCI - I'm not sure what AWS region they are running in. We are using the latest .NET SDK Alpine image (mcr.microsoft.com/dotnet/sdk:6.0-alpine)
It looks like this is an Alpine DNS-related issue.
We're seeing this issue specifically from us-east-1 and the mcr.microsoft.com/dotnet/sdk:7.0-alpine3.17 image
This issue first started happening intermittently on Sunday, Feb. 19th. Edit: Some more detail. We're seeing this on our GitHub Actions instances. Re-running the failed job with the
Hey folks, we're continuing the conversation with our CDN provider, but they need some additional information. We're also checking whether they can root-cause the problem without this additional info (but that's still unclear). @ggatus-atlassian, @nslusher-sf, @RichardD012 - would it be possible for you to provide a packet capture (pcap file) of the DNS queries that reproduce the problem? The CDN provider mentioned Wireshark, but perhaps there are other ways in Alpine Linux. Also, is it a feasible workaround to override the DNS resolver at a system level to CloudFlare/Google/OpenDNS's public DNS instead of the Route53 resolver while we're investigating the problem?
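For reference, a sketch of that system-level override; note that overwriting /etc/resolv.conf may be undone by DHCP clients or container runtimes, so treat it as temporary:

```sh
# Point the system resolver at public DNS instead of the Route53 resolver.
cat > /etc/resolv.conf <<'EOF'
nameserver 1.1.1.1
nameserver 8.8.8.8
EOF

# Or override per container without touching the host:
# docker run --dns 1.1.1.1 mcr.microsoft.com/dotnet/sdk:6.0-alpine ...
```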
We are using the builder image mcr.microsoft.com/dotnet/sdk:6.0-alpine3.13 on Jenkins instances in the us-east-1 region.
I don't believe this is an issue with AWS, but something odd with the Alpine image. On my local machine, I can run

If you change the image to mcr.microsoft.com/dotnet/sdk:6.0, you can get a 200 status code every time. I checked .NET 5.0 and got the same results (Alpine fails, but non-Alpine works).
@Boojapho - using another non-Alpine image is indeed a workaround. I think the reason Alpine is different is that it has a lower DNS response size limit. Per @ggatus-atlassian in the original post:
@joelverhagen Excellent point. I dug into it a little more and found that my DNS responses were part of the issue. I modified my DNS source and was able to get a reliable resolution with Alpine.
We have been seeing this issue as well for the past two days on Alpine Linux. Sometimes nuget.org gets resolved after retry attempts, but most of the time it fails. @joelverhagen: Is there an update from the CDN provider (MS)? They should be interested in solving this, as they provide the .NET Alpine Linux image mcr.microsoft.com/dotnet/sdk:6.0-alpine...
@mhoeper, could you provide the following information?
As a workaround while we investigate, you could consider using a non-Alpine Docker image (such as mcr.microsoft.com/dotnet/sdk:6.0).
Folks: we've mitigated the impact by failing over to our secondary CDN provider in Virginia (which is the state where AWS us-east-1 resides). We'll continue to investigate the situation with our primary CDN provider. If you are still facing issues, please let us know in this thread.
We are running the Docker image mcr.microsoft.com/dotnet/sdk:6.0-alpine in the us-east-1 region. I think I successfully mitigated this for now by creating a NuGet.Config in the solution folder that clears nuget.org as a package source and adds your CDN apiprod-mscdn.azureedge.net instead. Maybe this works for those who do not want to switch from Alpine because of this.
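For illustration, the NuGet.Config described might look like the following sketch; the source key name and the /v3/index.json path on the CDN host are assumptions based on the standard service-index layout:

```sh
# Write a NuGet.Config next to the solution that drops nuget.org and points
# at the CDN host instead (path assumed to mirror api.nuget.org's layout).
cat > NuGet.Config <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <clear />
    <add key="nuget-cdn" value="https://apiprod-mscdn.azureedge.net/v3/index.json" />
  </packageSources>
</configuration>
EOF
```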
Using an undocumented DNS name can lead to other problems in the future. Additionally, this won't fully work because of how our protocol uses linked data (i.e., there are URLs followed in some scenarios that still reference api.nuget.org because it is baked into the response body). If this works for you, feel free to do it, but we can't make any guarantees that the DNS name you've used there will keep working forever.
Just ran several CircleCI-based pipelines that were failing constantly before, and they all passed.
This works for us as well. @joelverhagen: It looks like Alpine is planning to retry over TCP when the truncated bit is received. However, until this is implemented, will NuGet.org try not to exceed the 512-byte limit? Otherwise, we would plan to migrate off Alpine...
@mhoeper, it's currently not possible for us to guarantee that the 512-byte limit will not be exceeded. After further conversations with our primary CDN provider, this case occurs when there is "shedding", which appears to be a relatively rare case in which the CDN determines it needs to provide an alternate DNS chain, likely due to high load in the area. This would align with the impacted customers being in a highly popular AWS region. However, given the relatively narrow scope of the impact (Alpine Linux plus AWS regions which encounter CDN shedding), we may need to revert to the previous state if no better solution is found. We've mitigated the current situation by using our secondary CDN provider, which happens to have a smaller DNS response size, but we can't use this solution forever for scalability reasons.

After doing some research online, this seems to be a common problem for Alpine users (not just NuGet.org, not just .NET, not just Docker). I believe the retry over TCP is the proper solution for Alpine, but I can't speak authoritatively since I'm not an expert in musl libc (Alpine's libc implementation, which yields this problem) or Alpine's desired use cases. I also don't know the timeline for Alpine/musl addressing this problem; it is likely a much longer timeline than we want to be using our secondary CDN provider.

I'll gently suggest moving to a non-Alpine Docker image in the short term to avoid any of these DNS problems. Alpine should probably be fine for runtime cases where NuGet.org is not needed, but for SDK cases it's probably best to avoid Alpine until this issue is resolved one way or another. We're speaking internally with our partners about alternatives both in the CDN space and in the Docker image configuration. I can't guarantee any solution since these settings are outside of the NuGet.org team's control.
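Until that lands, you can at least confirm from an affected VM that the full answer is retrievable over TCP (a diagnostic, not a fix, since musl's stub resolver won't fall back on its own):

```sh
# UDP query without EDNS: on affected resolvers this comes back with the
# tc (truncated) flag and no A records, which musl cannot recover from.
dig +noedns api.nuget.org

# The same query forced over TCP returns the full answer, which is what
# the proposed musl change would do automatically after seeing tc.
dig +tcp api.nuget.org
```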
Hi! @ggatus-atlassian, @nslusher-sf, @RichardD012, @KondorosiAttila, @Boojapho, @mhoeper, @ilia-cy. Our apologies for the inconvenience again! Please take a look at this issue #9736 for the root cause and next steps. Feel free to reach out to us at [email protected] or by commenting on the discussion issue: NuGet/Home#12985. Thanks!
Alpine Linux v3.17 has reached the end of support. NuGet.org suggests Alpine Linux customers migrate to Alpine Linux v3.18 or a newer version. If you have any questions or suggestions, feel free to reach out to NuGet.org support.
Impact
I'm unable to use api.nuget.org from inside an Alpine-based Docker image running in AWS us-east-1.
Describe the bug
We've found an issue where running an Alpine-based dotnet image inside of AWS us-east-1 (e.g. running an image on an EC2 instance with Docker) causes DNS lookups to api.nuget.org to fail, breaking many tools that integrate with NuGet. I've noticed this behaviour affecting builds running in Bitbucket Pipelines (our CI/CD service), and have reproduced similar issues directly on EC2. This happens when using Route53 as the DNS resolver (the default when starting up a new EC2 instance).
It appears that the problem is due to Alpine's inability to handle truncated DNS responses. If we run `dig` to perform a DNS lookup for api.nuget.org, we notice the `tc` flag set in the response headers, indicating a truncated DNS response. The following was executed from an EC2 instance in us-east-1; we've found truncation does not occur in us-west-2. In the response below, we don't receive any A records for api.nuget.org due to the truncation. Running the same query from us-west-2 gives back a correct response with an A record:
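A query along the lines described (the resolver address shown is the standard Amazon-provided DNS endpoint; substitute whatever /etc/resolv.conf lists):

```sh
# Query via the Amazon-provided VPC resolver (reachable at this well-known
# address from EC2 instances; the VPC base +2 address also works).
dig api.nuget.org @169.254.169.253

# In us-east-1 the header showed "tc" among the flags and the answer section
# carried no A records; the identical query from us-west-2 returned the full
# CNAME chain ending in A records.
```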
This prevents Alpine-based Docker images running in us-east-1 and using Route53 for DNS from communicating with NuGet. Swapping to an alternative DNS provider such as Cloudflare at 1.1.1.1, or hardcoding api.nuget.org in /etc/hosts, resolves the problem. It's unclear if this is a problem with AWS, NuGet, or a combination of the two.
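A sketch of the /etc/hosts variant (fragile by nature, since the pinned address must be refreshed whenever the CDN rotates it):

```sh
# Look up the address via a resolver that returns a complete answer;
# +short prints the CNAME chain then the A record, so take the last line.
IP=$(dig +short api.nuget.org @1.1.1.1 | tail -n 1)

# Pin it locally. Fragile: the name fronts a CDN, so it can change any time.
echo "$IP api.nuget.org" >> /etc/hosts
```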
Maybe something has changed causing the NuGet DNS query responses to increase in size, breaking Alpine? Comparing the above responses from a DNS lookup in us-east-1 vs. us-west-2, we see that in us-east-1 there are several additional CNAME entries. Alpine truncates DNS responses that exceed 512 bytes in size (see https://christoph.luppri.ch/fixing-dns-resolution-for-ruby-on-alpine-linux). In this case, we are unable to use any dotnet Alpine image to talk to NuGet from AWS in us-east-1.
Repro Steps
Steps to reproduce:
wget api.nuget.org
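Putting the surviving steps together, a minimal reproduction presumably looks like this (the image tag is taken from the report; the docker invocation itself is a sketch):

```sh
# On an EC2 instance in us-east-1 using the default Route53/VPC resolver:
docker run --rm mcr.microsoft.com/dotnet/sdk:6.0-alpine wget api.nuget.org

# On affected hosts this fails with a resolution error such as
# "wget: bad address 'api.nuget.org'"; swapping in the non-Alpine
# mcr.microsoft.com/dotnet/sdk:6.0 image succeeds.
```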
Expected Behavior
We can successfully call api.nuget.org (however, it will fail with an HTTP 4xx response without the appropriate credentials and path).
Screenshots
No response
Additional Context and logs
We've detected this issue inside of Bitbucket Pipelines, and can reproduce this directly on EC2 instances across unrelated AWS accounts where Route53 is used as a DNS resolver.