Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

better handle exceptions thrown by dynamic linker #1116

Closed
wkozaczuk opened this issue Jan 3, 2021 · 10 comments
Closed

better handle exceptions thrown by dynamic linker #1116

wkozaczuk opened this issue Jan 3, 2021 · 10 comments
Assignees

Comments

@wkozaczuk
Copy link
Collaborator

Sometimes, if for whatever the dynamic linker throws an exception like in this code:

void file::load_elf_header()
{
    try {
        read(0, &_ehdr, sizeof(_ehdr));
    } catch(error &e) {
        throw osv::invalid_elf_error(
                std::string("can't read elf header: ") + strerror(e.get()));
    }
    if (!(_ehdr.e_ident[EI_MAG0] == '\x7f'
          && _ehdr.e_ident[EI_MAG1] == 'E'
          && _ehdr.e_ident[EI_MAG2] == 'L'
          && _ehdr.e_ident[EI_MAG3] == 'F')) {
        throw osv::invalid_elf_error("bad elf header");
    }
    if (!(_ehdr.e_ident[EI_CLASS] == ELFCLASS64)) {
        throw osv::invalid_elf_error("bad elf class");
    }
    if (!(_ehdr.e_ident[EI_DATA] == ELFDATA2LSB)) {
        throw osv::invalid_elf_error("bad elf endianness");
    }
    if (!(_ehdr.e_ident[EI_VERSION] == EV_CURRENT)) {
        throw osv::invalid_elf_error("bad elf version");
    }
    if (!(_ehdr.e_ident[EI_OSABI] == ELFOSABI_LINUX
          || _ehdr.e_ident[EI_OSABI] == 0)) {
        throw osv::invalid_elf_error("bad os abi");
    }
    if (!(_ehdr.e_type == ET_DYN || _ehdr.e_type == ET_EXEC)) {
        throw osv::invalid_elf_error(
                "bad executable type (only shared-object, PIE or non-PIE executable supported)");
    }
    if(_ehdr.e_machine != ELF_KERNEL_MACHINE_TYPE) {
        throw osv::invalid_elf_error("machine type for " + _pathname + " differs from the kernel");
    }
}

it gets masked with a page fault abort like this:

Booted up in 260.48 ms
Cmdline: /tests/tst-vfs.so
ELF [tid:28, mod:1, /libvdso.so]: Instantiated
ELF [tid:28, mod:1, /libvdso.so]: The base set to: 0x0000100000000000 and end: 0x0000100000004030
ELF [tid:28, mod:1, /libvdso.so]: Loading segments
ELF [tid:28, mod:1, /libvdso.so]: Loaded and mapped PT_LOAD segment at: 0x0000100000000000 of size: 0x1000
ELF [tid:28, mod:1, /libvdso.so]: Loaded and mapped PT_LOAD segment at: 0x0000100000001000 of size: 0x1000
ELF [tid:28, mod:1, /libvdso.so]: Loaded and mapped PT_LOAD segment at: 0x0000100000002000 of size: 0x1000
ELF [tid:28, mod:1, /libvdso.so]: Loaded and mapped PT_LOAD segment at: 0x0000100000003000 of size: 0x2000
ELF [tid:28, mod:1, /libvdso.so]: Processing headers
ELF [tid:28, mod:1, /libvdso.so]: Relocated 3 PLT symbols
ELF [tid:28, mod:2, /tests/tst-vfs.so]: Instantiated
ELF [tid:28, mod:2, /tests/tst-vfs.so]: The base set to: 0x0000100000005000 and end: 0x0000100000030590
ELF [tid:28, mod:2, /tests/tst-vfs.so]: Loading segments
ELF [tid:28, mod:2, /tests/tst-vfs.so]: Loaded and mapped PT_LOAD segment at: 0x0000100000005000 of size: 0x10000
ELF [tid:28, mod:2, /tests/tst-vfs.so]: Loaded and mapped PT_LOAD segment at: 0x0000100000015000 of size: 0x13000
ELF [tid:28, mod:2, /tests/tst-vfs.so]: Loaded and mapped PT_LOAD segment at: 0x0000100000028000 of size: 0x5000
ELF [tid:28, mod:2, /tests/tst-vfs.so]: Loaded and mapped PT_LOAD segment at: 0x000010000002e000 of size: 0x3000
ELF [tid:28, mod:2, /tests/tst-vfs.so]: Processing headers
ELF [tid:28, mod:2, /tests/tst-vfs.so]: Loading DT_NEEDED object: libboost_unit_test_framework.so.1.71.0 
ELF [tid:28, mod:3, /usr/lib/libboost_unit_test_framework.so.1.71.0]: Instantiated
ELF [tid:28, mod:3, /usr/lib/libboost_unit_test_framework.so.1.71.0]: The base set to: 0x0000100000031000 and end: 0x00001000000e8748
ELF [tid:28, mod:3, /usr/lib/libboost_unit_test_framework.so.1.71.0]: Loading segments
ELF [tid:28, mod:3, /usr/lib/libboost_unit_test_framework.so.1.71.0]: Loaded and mapped PT_LOAD segment at: 0x0000100000031000 of size: 0x1b000
ELF [tid:28, mod:3, /usr/lib/libboost_unit_test_framework.so.1.71.0]: Loaded and mapped PT_LOAD segment at: 0x000010000004c000 of size: 0x76000
ELF [tid:28, mod:3, /usr/lib/libboost_unit_test_framework.so.1.71.0]: Loaded and mapped PT_LOAD segment at: 0x00001000000c2000 of size: 0x1f000
ELF [tid:28, mod:3, /usr/lib/libboost_unit_test_framework.so.1.71.0]: Loaded and mapped PT_LOAD segment at: 0x00001000000e1000 of size: 0x6000
ELF [tid:28, mod:3, /usr/lib/libboost_unit_test_framework.so.1.71.0]: Processing headers
ELF [tid:28, mod:3, /usr/lib/libboost_unit_test_framework.so.1.71.0]: Loading DT_NEEDED object: libstdc++.so.6 
ELF [tid:28, mod:3, /usr/lib/libboost_unit_test_framework.so.1.71.0]: Loading DT_NEEDED object: libm.so.6 
ELF [tid:28, mod:3, /usr/lib/libboost_unit_test_framework.so.1.71.0]: Loading DT_NEEDED object: libgcc_s.so.1 
ELF [tid:28, mod:4, /usr/lib/libgcc_s.so.1]: Instantiated
ELF [tid:28, mod:4, /usr/lib/libgcc_s.so.1]: Removed
ELF [tid:28, mod:2, /tests/tst-vfs.so]: Executing 1 DT_FINI_ARRAYSZ functions
Aborted

[backtrace]
0x0000000040217f5f <???+1075937119>
0x000000004046eb12 <osv::handle_mmap_fault(unsigned long, int, exception_frame*)+34>
0x00000000403472fa <mmu::vm_fault(unsigned long, exception_frame*)+298>
0x0000000040399063 <page_fault+147>
0x0000000040397f16 <???+1077509910>
0x000000004035d54a <elf::program::remove_object(elf::object*)+122>
0x000000004035db74 <???+1077271412>
0x0000000040362051 <std::vector<std::shared_ptr<elf::object>, std::allocator<std::shared_ptr<elf::object> > >::~vector()+97>
0x000000004020714e <???+1075867982>
0x0000000040437404 <osv::application::application(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<0x00000000404377a6 <osv::application::run(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>0x00000000404379e0 <osv::application::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)+80>
0x000000004022c6ca <do_main_thread(void*)+3098>
0x000000004046a449 <???+1078371401>
0x0000000040405d5d <thread_main_c+45>
0x0000000040398e92 <???+1077513874>

It could be that in reality, it is a bug in the dynamic linker and possibly we are trying to remove objects that have not been loaded yet or something like this.

In this particular case, the x64 image had aarch64 version of libgcc_s.so and really should have aborted with the message about the wrong machine type of the library.

@wkozaczuk
Copy link
Collaborator Author

Or the bug is as simple as trying to execute FINI functions when we did not really execute INIT functions because we failed earlier than that like in the case above.

@nyh
Copy link
Contributor

nyh commented Jan 6, 2021

In core/elf.cc, in program::load_object() we have in program::load_object():

        auto ef = std::shared_ptr<object>(new file(*this, f, name),
                [=](object *obj) { remove_object(obj); });
        // ... some lines snipped
        ef->load_needed(loaded_objects);
        ef->relocate();
        ef->fix_permissions();
        _files[name] = ef;
        _files[ef->soname()] = ef;
        return ef;

The intention of the deleter calling remove_object(ef) is that it be run much later when object is eventually destroyed.
However, if the program::load_object() calls fails in the middle, it will also be called... In our case, I think load_needed() fails, because one of the libraries it tried to load failed, so at that point the ef pointer becomes unused, and remove_object(ef) is called to delete it - before we completed the initialization (there are still a few more lines in this function). I think we need to fix remove_object(ef) to accept this possibility, recognize what hasn't been initialized and not rely on it when destroying the object.

@wkozaczuk
Copy link
Collaborator Author

I this particular crash scenario, it was trying to invoke FINI functions before even any segments got mapped. I wonder if the right solution would be to add flags to the ELF object class to track if segments were loaded and INIT called and then during remove_object()/destruction only call unload_segments() and process fini function if segments were loaded and init functions invoked respectively.

@AubreyTurner
Copy link
Contributor

Hello! I can work on this issue, the desired solution to fix this is adding and using flags in the elf object as described in the above comment correct? If that is implemented the desired exception should be thrown?

@nyh
Copy link
Contributor

nyh commented Dec 2, 2021

Hi, you are very welcome to work on this issue!

I don't know if we need an additional flags or maybe we already have enough fields in class object but missing zero initialization, or perhaps we do have zero-initialization and missing a check for it in remove_object() or something called from it.
I would begin by trying to reproduce this bug - e.g., perhaps as @wkozaczuk suggested, replace one of the shared objects loaded by the application by a broken file (he used an aarch64 one, but probably just an empty file would have a similar problem) and try to reproduce a crash like he saw instead of a clear error message.

@AubreyTurner
Copy link
Contributor

Thanks, I was able to reproduce this bug and it looks like the page fault occurs when the FINI function for a module that got successfully loaded is called when remove_object() is called for it (after trying and failing to load the broken file/module). Should the FINI functions for modules still get called if the program has failed to finish loading?

@wkozaczuk
Copy link
Collaborator Author

wkozaczuk commented Dec 6, 2021

No, the FINI functions should not get called if INIT were not called because the dynamic linker failed to load the object before that point. And we somehow need to be able to track it by examining the state of some member variables of the elf::object (see https://github.com/cloudius-systems/osv/blob/master/include/osv/elf.hh#L340-L473). Please analyze, but I suspect the existing variables in that class will not be useful enough to determine for example if INIT was called or in general, what point the initialization flow an elf object was at when an exception was thrown. So maybe we will need to add a new member variable to elf::object, some kind of enum that will be set to a new value every time a new step of the initialization step is reached, one of them - calling INIT functions. Then when remove_object() or run_fini_funcs() is called you would know what to do.

@nyh
Copy link
Contributor

nyh commented Dec 6, 2021

@wkozaczuk while it is true that in general, if exceptions can happen from anywhere, we need to remember exactly what was run and what was not. But perhaps the present problem is simpler... We have the exceptions all coming from the same place, before the object is even loaded, so it should be (? I didn't actually try...) fairly easy to know if we loaded the object or not, and if not, we certainly shouldn't call run_fini_funcs().

@wkozaczuk
Copy link
Collaborator Author

@nyh you are right that for now we should not overcomplicate this. But I have looked at the object::run_init_funcs() and it does not seem to change any state of elf::object that would allow us to know if it was called. To be safe I think we should simply add new variable initialzed or init_called and set it to true at the and of run_init_funcs for given object and only if false in run_fini_funcs return.

@AubreyTurner
Copy link
Contributor

Yes in this case the fault is occurring when the FINI functions for an already loaded module (mod 2 in the abort output) are called, not the FINI functions for the module that failed to load (mod 4). I'll try implementing tracking when INIT functions are called with a new variable in the elf object like you describe, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants