tcp_do_segment abort (NFS related?) #726
Hi Justin, I just saw this. I will examine it tomorrow. BR Benoît
mount_context is destroyed because the thread exits. In turn, each thread's mount_context is supposed to destroy its NFS context, which will close the socket toward the server. Maybe this leads to a double-close situation causing the assert. I will check for this in the code.
My last comment does not make sense: fdrop(), which is in the backtrace path, is supposed to protect us against double close. @nyh Do you agree with this reasoning?
At first glance, it seems to me that there's nothing the NFS code could do to cause this bug - even calling close() twice in parallel should have worked (protected by a mutex). So it appears more likely to be a bug in our TCP/IP implementation. We should probably investigate (I don't remember...) what this "tp->get_state() > 1" assertion is about. It seems to me (but it's been ages since I looked at this code) that we are calling code which assumes the socket is not closed, on a closed socket. It would be nice if we could somehow reproduce this bug without NFS or OpenMPI - perhaps with a test program which opens many TCP connections in parallel and then closes all of them in parallel (after, optionally, writing/reading on each connection). I don't really know if it would reproduce this bug, but it might be worth trying if reproducing Justin's original scenario is too difficult.
tcp_state is the state of the TCP connection (LISTEN, SYN_SENT, etc.). OK, I will write a test.
Hi Justin, I made a quick test launching multiple netperf instances toward the same netperf server. It crashes, which could mean you may have spotted a general issue in OSv's TCP implementation. Thanks
Yes, although I'm not convinced it's the same bug - it appears it might be a completely different bug. What is really surprising me is that we already ran numerous workloads on OSv with many threads doing networking, and we didn't run into these bugs before. I wonder what is new in these workloads that made the bugs start crawling out of the woodwork.
They may have in common opening multiple sockets to exactly the same … If there is somewhere in the code a map using the tuple (ip, port, proto) it …
If you can reproduce this with an even simpler single test program which …
Can you reproduce the original bug with "-c1"? If not, the bug may depend …
I wonder if we need to write or read something from this socket to see this …
In any case, if you can easily reproduce one of these bugs, we can start …
> If there is somewhere in the code a map using the tuple (ip,port,proto) it …
Usually you have a 5-tuple (with also the source IP and port), but who …
About #728 and this issue, I made the following tests, which work OK. OSv client:
Host server:
So we have no more hypotheses.
@benoit-canet, this issue appears to have found a crash while **close()**ing a socket, presumably in some sort of parallel scenario (closing many sockets in parallel, or closing in parallel with data arriving, or whatever). The other issue, #728, saw the crash in select(). Since in your new test you neither close()ed many sockets in parallel nor used select(), I'm not surprised it didn't reproduce these two issues.
No luck doing that either. |
@justinc1: Ah, I just saw it was the same read test, so I just have to use it.
@justinc1 I just realized that execve was also involved. Could you share your tst-execve with us by attaching the source to this issue?
I just did:
And OSv exited in a perfectly clean way. So it could be an interaction with thread namespaces, the execve patch, or the execve test tool. @justinc1 A good debugging technique is to try to isolate the problem by bisecting it. Could you check on your machine whether simply launching test-file-read.c in parallel with basic OSv threads leads to this crash? If it's confirmed on your side, I will start auditing all the namespace and execve code that was combined with your test, while thinking about the interaction with the NFS client. Best regards, Benoît
I left the test (using execve) running for 800 runs, and got 2 occurrences of #726. So it's rare.
Repeating without execve and ELF namespaces.
Next try should be to trigger the tcp_do_segment failure without using NFS, I guess. And maybe even with fewer than 1000 trials.
Thanks. As I said above, I am guessing this is an actual bug in our TCP/IP code, and not caused by NFS, although if you try it without NFS you need to try some other type of network traffic which is somehow similar to the NFS workload (have a bunch of TCP connections in multiple threads, and then try to close all of them). This could be related to an SMP race (closing many connections in parallel, etc.) but it also might not be related to SMP at all, but rather to some sort of TCP protocol race (e.g., a retransmitted packet arrives after we closed the socket, or something). I'll need to do some more serious debugging to try to fix this issue.
I was trying to reproduce the bug without using NFS. I opened multiple connections to a TCP server, the server then pushed data to the clients, and then the clients terminated. But the issue didn't show up.
@DerangedMonkeyNinja submitted a patch to the mailing list, titled "tcp_input: net channel state fix", which may fix this bug. He also made an interesting observation: the KASSERT is compiled out in release mode. So the fact that @justinc1 used debug mode in the example above was actually important for reproducing this bug. At least in some of the runs by @benoit-canet above he used release mode, which might be why he didn't see this bug.
I repeated the test with the patch applied (i.e. current master), and the issue is still there. The backtrace is nearly identical; I'm adding it just FYI:
GDB:
With the test below I was able to get the crash 6 times in 31 tries. I'm attaching the file, as I tend to forget what I did last time to trigger the problem. It still uses NFS, but multiple workers are started with pthreads, not in ELF namespaces.
This seems to be a duplicate of #454.
Crash:
Running on a Fedora 23 host; code is master@c93ebf9d140f plus:
gdb:
30 test clients are started, and the abort happens when they start to exit. NFS ~mount_context is part of the trace.
So far, it happened once in 80 runs.