0.8.7:test marginal_path_double_failed_time parameter generated a coredump #73
Comments
malloc corruption is always difficult to debug. The easiest way is to run multipathd under valgrind, but that may be too slow to reproduce the error (just try). If that doesn't work, you can try using the compiler flag |
But first you may want to update multipath-tools and see if the error is still present. 0.8.7 is 2 years old. |
This has occurred only once, on 0.8.7. We are now testing with ASan, but have made no progress yet. |
It's possible if the memset would clear an area outside the allocated memory for |
Here is some new info. The core dump occurred while multipathd was stopping, after free_io_err_stat_path. |
Sorry, I can't parse your comment. Are you saying the core dump happened after the call to I've just reviewed the locking in that code once more, and I found no obvious race condition, all accesses to Question: Does this crash occur always? Does it only occur if there are actual marginal path events? |
multipath.conf: test.sh while true This problem can be reproduced by this test.sh. |
Yes, I don't see any race condition either.
I added code to print p->dio_ctx_array->buf in free_io_err_stat_path before and after free(). |
AFAIR, IMO the bottom line is that you can't free aio memory before the kernel has completed the requests. Wrt |
Please check the patch from this draft PR. It's against my latest code base, but you should be able to apply it on 0.8.7, too. |
I reviewed the code and I think it can solve the problem. However, the National Day holiday is coming, so I will test it after the holiday. |
We have verified the patch from this draft PR in the environment and confirmed that the patch can solve the problem. |
Thanks, I'll make an official submission asap. |
It is wrong to assume that aio data structures can be reused or freed after io_cancel(). io_cancel() will almost always return -EINPROGRESS, anyway. Use the io_starttime field to indicate whether an io event has been completed by the kernel. Make sure no in-flight buffers are freed. Fixes opensvc#73. Signed-off-by: Martin Wilck <[email protected]> Cc: Li Xiao Keng <[email protected]> Cc: Miao Guanqin <[email protected]>
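The rule stated in this commit message can be sketched in C as follows. This is a minimal illustration, not the actual multipath-tools code: `struct demo_req`, `demo_submit()`, `demo_reap()`, `demo_release()` and the `in_flight` flag are hypothetical names, and the flag only stands in for the role the commit message gives to `io_starttime`. The sketch calls the raw Linux AIO system calls through `syscall(2)` so that it does not depend on libaio.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical per-request bookkeeping; not the multipath-tools struct. */
struct demo_req {
	struct iocb iocb;  /* control block handed to the kernel */
	void *buf;         /* data buffer the kernel writes into */
	int in_flight;     /* nonzero from io_submit until the event is reaped */
};

/* Submit one read. The request is marked in flight only if the
 * kernel accepted it. */
int demo_submit(aio_context_t ctx, struct demo_req *r, int fd, size_t len)
{
	struct iocb *list[1];

	assert(r != NULL && r->buf != NULL && !r->in_flight);
	memset(&r->iocb, 0, sizeof(r->iocb));
	r->iocb.aio_fildes = fd;
	r->iocb.aio_lio_opcode = IOCB_CMD_PREAD;
	r->iocb.aio_buf = (__u64)(uintptr_t)r->buf;
	r->iocb.aio_nbytes = len;
	r->iocb.aio_data = (__u64)(uintptr_t)r; /* lets demo_reap find r again */
	list[0] = &r->iocb;
	if (syscall(SYS_io_submit, ctx, 1L, list) != 1)
		return -1;
	r->in_flight = 1;
	return 0;
}

/* Collect completed events without blocking. Only a reaped event
 * clears in_flight; calling io_cancel() would not. */
int demo_reap(aio_context_t ctx)
{
	struct io_event ev[8];
	long n = syscall(SYS_io_getevents, ctx, 0L, 8L, ev, NULL);

	for (long i = 0; i < n; i++) {
		struct demo_req *r = (struct demo_req *)(uintptr_t)ev[i].data;
		r->in_flight = 0;
	}
	return (int)n;
}

/* Free the buffer only if the kernel is done with it. Returns -1 and
 * leaves the buffer alone while the request is still in flight. */
int demo_release(struct demo_req *r)
{
	assert(r != NULL);
	if (r->in_flight)
		return -1;
	free(r->buf);
	r->buf = NULL;
	return 0;
}
```

A caller would call `demo_reap()` periodically and treat a `-1` from `demo_release()` as "try again later". Refusing to free an in-flight buffer trades a possible temporary leak for the guarantee that the kernel never writes into memory already returned to the allocator.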
The directio checker has the same problem (even though we've seen no crashes there). I made a patch for it, too, but now the unit tests are failing :-/ I need to have another look. |
@Guanqinm, @lixiaokeng: I just realized that there might be a simpler solution to this. Please revert the previous patch and just try this instead:
The io_destroy(2) man page says: "The io_destroy() system call will attempt to cancel all outstanding asynchronous I/O operations against ctx_id, will block on the completion of all operations that could not be canceled, and will destroy the ctx_id." Thus after calling |
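The teardown order implied by the quoted man page text can be sketched as below: destroy the AIO context first, and free the I/O buffers only after `io_destroy()` has returned successfully. This is a hedged illustration, not the patch discussed in this thread; `struct demo_path` and `demo_teardown()` are hypothetical names, and the choice to leave buffers allocated when `io_destroy()` fails is an assumption made for the sketch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/aio_abi.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical stand-in for a path that owns one aio buffer. */
struct demo_path {
	void *buf;
};

/* Destroy the context before freeing anything. Per io_destroy(2), the
 * call blocks until operations that could not be canceled have
 * completed, so no buffer is in flight once it returns 0. */
int demo_teardown(aio_context_t ctx, struct demo_path *paths, size_t n)
{
	assert(paths != NULL || n == 0);

	/* Assumption: if the context cannot be destroyed, keep the
	 * buffers allocated rather than risk freeing in-flight memory. */
	if (syscall(SYS_io_destroy, ctx) != 0)
		return -1;

	for (size_t i = 0; i < n; i++) {
		free(paths[i].buf);
		paths[i].buf = NULL;
	}
	return 0;
}
```

This ordering only covers shutdown. It does not address reuse of an iocb while an earlier request on it is still pending, which needs separate per-request tracking.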
Pulling in also @bmarzins, as he has done most of the aio work for the directio checker. I have experimented a bit.
We have verified the patch and confirmed that the patch can solve this issue. @mwilck
Thanks! The trivial patch alone is not correct by itself because the code would still re-use iocbs before they have completed. But it clarifies things. |
Hello, may I ask whether the final fix is the same as the previous patch (this draft PR)? |
I haven't finished this yet. For now you can use the "draft PR", or a combination of both patches. |
I can't finish my bigger project right now. I have sent a small patch set based on the patches from this issue to dm-devel. |
libmultipath: reduce log level of directio messages Are these patches intended to fix this issue? |
Yes. It's the same patches that you verified here. |
You're of course very welcome to test that patch set in your environment. |
@lixiaokeng @Guanqinm: as you saw I posted an updated patch series to the dm-devel ML yesterday. Test / feedback in your test bed would be highly appreciated. |
I have provided a test.sh. Have you tested with it? We have tested "libmultipath: io_err_stat: don't free aio memory before completion" and "libmultipath: io_err_stat: call io_destroy() before free_io_err_pathvec()". There was no core dump. I don't think our test can exercise the other patches. |
It is wrong to assume that aio data structures can be reused or freed after io_cancel(). io_cancel() will almost always return -EINPROGRESS, anyway. Use the io_starttime field to indicate whether an io event has been completed by the kernel. Make sure no in-flight buffers are freed. Fixes opensvc#73. Signed-off-by: Martin Wilck <[email protected]> Reviewed-by: Benjamin Marzinski <[email protected]> Cc: Li Xiao Keng <[email protected]> Cc: Miao Guanqin <[email protected]> Cc: Guan Junxiong <[email protected]>
Hi,
We ran into a problem. While we were testing the marginal path parameters, multipathd generated a core dump, shown below:
(gdb) bt
#0 __pthread_kill_implementation (threadid=281473695391360, signo=signo@entry=6, no_tid=no_tid@entry=0)
at pthread_kill.c:44
#1 0x0000ffffb5073304 in __pthread_kill_internal (signo=<optimized out>, threadid=<optimized out>)
at pthread_kill.c:78
#2 0x0000ffffb502ed7c in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x0000ffffb501cd2c in __GI_abort () at abort.c:79
#4 0x0000ffffb50673ec in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0xffffb5141e20 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#5 0x0000ffffb507d26c in malloc_printerr (str=str@entry=0xffffb513f0c8 "corrupted double-linked list")
at malloc.c:5671
#6 0x0000ffffb507db44 in unlink_chunk (p=p@entry=0xffffa00012d0, av=0xffffa0000030) at malloc.c:1637
#7 0x0000ffffb507ec58 in _int_free (av=0xffffa0000030, p=0xffffa00012d0, have_lock=<optimized out>)
at malloc.c:4609
#8 0x0000ffffb5081370 in __GI___libc_free (mem=<optimized out>) at malloc.c:3393
#9 0x0000ffffb5081424 in tcache_thread_shutdown () at malloc.c:3229
#10 __malloc_arena_thread_freeres () at arena.c:1010
#11 0x0000ffffb50833e8 in __libc_thread_freeres () at thread-freeres.c:44
#12 0x0000ffffb50716a8 in start_thread (arg=0x0) at pthread_create.c:457
#13 0x0000ffffb50d7d5c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:79
(gdb) f 6
#6 0x0000ffffb507db44 in unlink_chunk (p=p@entry=0xffffa00012d0, av=0xffffa0000030) at malloc.c:1637
1637 malloc_printerr ("corrupted double-linked list");
(gdb) p p->fd
$8 = (struct malloc_chunk *) 0xffffa0001ff0
(gdb) p p->fd->fd
$9 = (struct malloc_chunk *) 0x0
(gdb) x/36xg 0xffffa0001ff0
0xffffa0001ff0: 0x0000000000000000 0x0000000000002c91
0xffffa0002000: 0x0000000000000000 0x0000000000000000
0xffffa0002010: 0x0000000000000000 0x0000000000000000
0xffffa0002020: 0x0000000000000000 0x0000000000000000
0xffffa0002030: 0x0000000000000000 0x0000000000000000
0xffffa0002040: 0x0000000000000000 0x0000000000000000
0xffffa0002050: 0x0000000000000000 0x0000000000000000
0xffffa0002060: 0x0000000000000000 0x0000000000000000
0xffffa0002070: 0x0000000000000000 0x0000000000000000
0xffffa0002080: 0x0000000000000000 0x0000000000000000
0xffffa0002090: 0x0000000000000000 0x0000000000000000
0xffffa00020a0: 0x0000000000000000 0x0000000000000000
0xffffa00020b0: 0x0000000000000000 0x0000000000000000
0xffffa00020c0: 0x0000000000000000 0x0000000000000000
0xffffa00020d0: 0x0000000000000000 0x0000000000000000
0xffffa00020e0: 0x0000000000000000 0x0000000000000000
0xffffa00020f0: 0x0000000000000000 0x0000000000000000
0xffffa0002100: 0x0000000000000000 0x0000000000000000
(gdb)
The test case checks whether the path has recovered after it is set to marginal and marginal_path_double_failed_time has elapsed, and then stops the multipath daemon (multipathd).
At thread exit, malloc_printerr detects corruption in a free list containing memory that was freed during the process's execution. The freed chunk has been overwritten with zeros, which triggers the error. We don't know how to analyze this problem or where the abnormal write comes from. Can you tell us how to analyze and solve it?