Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread-local storage doesn't work in PIE #352

Closed
nyh opened this issue Jun 22, 2014 · 12 comments
Closed

Thread-local storage doesn't work in PIE #352

nyh opened this issue Jun 22, 2014 · 12 comments
Labels

Comments

@nyh
Copy link
Contributor

nyh commented Jun 22, 2014

As Pawel Dziepak pointed in issue #213, when an application uses thread-local variables, it will work correctly when linked as a shared object ("-shared") but not work correctly when linked as a PIE ("-pie"). For example, this program:

#include <iostream>
__thread int foo = 123;
int main() {
        std::cout << foo << "\n";
}

When compiling it as a PIE, i.e., the compile line is

g++ -pie -fpie test.cc

The resulting a.out runs correctly on Linux (prints 123), but on OSv, it prints 0. Things don't change if "-fpic" is used instead of "-fpie", by the way.

When linking the same objects as a shared object ("-shared") instead of pi executable ("-pie"), the resulting object runs correctly on OSv, and prints 123.

@nyh
Copy link
Contributor Author

nyh commented Jun 22, 2014

https://bugs.freedesktop.org/show_bug.cgi?id=35268 contains an interesting discussion of how the "mesa" shared library uses the initial-exec TLS model, and it's still possible to dlopen() it because glibc deliberately allocates for the main program (for us - the kernel) a surplus size of TLS, exactly for this purpose. The Musl author points out in this thread that he hates this solution because it wastes additional space per-thread. Also, I think this solution will only work for the "initial exec" model but not for the even more optimized "local exec" model? (see http://dev.gentoo.org/~dberkholz/articles/toolchain/tls.pdf for the long description of all these models).

@pdziepak
Copy link
Contributor

The problem with "local exec" model is that the executable assumes that it is the first loaded object and therefore offset to its TLS block (relative to fs:0) is known at compile time. The kernel assumes the same and as a result TLS of the executable and the kernel overlap.

"Initial exec" is easier as OSv can decide what is the offset (again, relative to fs:0) of the TLS of the object. For threads created after loading the executable (i.e. when the kernel already is aware about static TLS defined by the executable) it is quite straightforward to implement without the need for any hacks. Reclaimer thread and other kernel threads that may end up executing user code (are there any other?) may be a problem though.

avikivity pushed a commit that referenced this issue Jun 26, 2014
As explained in issue #352, a PIE (position-independent executable) using
TLS (thread-local storage) does not currently work correctly. Fixing this
case is not easy, but until we do, it is dangerous to allow this case to
be ignored, as it causes reading and writing to wrong positions in memory
whenever TLS is used in the PIE.

So this patch adds a warning printout (to stdout) whenever one attempts to
load a PIE which has any TLS variables.

This is, of course, just a temporary measure. The real solution is to fix
object load and not just warn.

Signed-off-by: Nadav Har'El <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
@slivne slivne added the bug label Jul 31, 2014
@nyh
Copy link
Contributor Author

nyh commented Feb 10, 2015

In addition to breaking PIEs, as described above, another problem we have by not supporting initial-exec is that shared-libraries that were deliberately (for performance reasons) compiled with the initial-exec TLS model cannot run on OSv.

One such library that cannot run on OSv today is gcc's OpenMP library (libgomp), which is deliberately compiled with initial-exec TLS (see https://gcc.gnu.org/ml/gcc-patches/2012-04/msg01347.html).

@nyh
Copy link
Contributor Author

nyh commented Sep 1, 2016

In commit 8e99d15, @pdziepak already added support for the initial-exec model. But for reasons explained above by @pdziepak local-exec model is still not supported even today.

I wonder how much performance it would cost us to switch the kernel to the initial-exec model, i.e., it will need to use an offset to find its part of TLS area (and can't assume the kernel's offset is zero), so we can allow the application (just one of them...) to use local-exec. Or, I wonder if there's another way for us to support local-exec.

@pdziepak
Copy link
Contributor

pdziepak commented Sep 1, 2016

Well, there might be another solution, but it is extremely messy. The idea is to switch %fs at entry and exit to the kernel (meaning: OSv and the libraries it uses).

It is not easy since OSv isn't separated form the applications as much as traditional kernels do and dynamic linker needs to be hijacked to insert some trampoline code. There are also exceptions to be dealt with...

Early, incomplete and possibly broken attempt at doing something like this is available here.

@wkozaczuk
Copy link
Collaborator

I have just realized that since version 1.6 Golang supports '--buildmode pie" but unfortunately generated pie uses local-exec (as you can see the app executes but then blows up in the end):

OSv v0.52.0-5-g65f0e38
eth0: 192.168.122.15
WARNING: /hello2 is a PIE using TLS. This is currently unsupported (see issue #352). Link with '-shared' instead of '-pie'.
syscall(): unimplemented system call 267
Hello, 世界
Go version: go1.9.2
syscall(): unimplemented system call 231
unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1400000 addr=0x0 pc=0x100001488ca8]

goroutine 1 [running]:
runtime.throw(0x1000014e9654, 0x5)
	/home/wkozaczuk/tools/go1.9.2/go/src/runtime/panic.go:605 +0x97 fp=0xc42004df30 sp=0xc42004df10 pc=0x100001487537
runtime.sigpanic()
	/home/wkozaczuk/tools/go1.9.2/go/src/runtime/signal_unix.go:374 +0x22b fp=0xc42004df80 sp=0xc42004df30 pc=0x10000149a79b
runtime.main()
	/home/wkozaczuk/tools/go1.9.2/go/src/runtime/proc.go:220 +0x278 fp=0xc42004dfe0 sp=0xc42004df80 pc=0x100001488ca8
runtime.goexit()
	/home/wkozaczuk/tools/go1.9.2/go/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc42004dfe8 sp=0xc42004dfe0 pc=0x1000014b1901
syscall(): unimplemented system call 231
fatal error: unexpected signal during runtime execution
panic during panic

@wkozaczuk
Copy link
Collaborator

I have just added two new Golang examples built with --buildmode=pie and they work just fine (please see cloudius-systems/osv-apps@4485cc1 for details). It turns out that the crash from my older comment below was really caused by missing sys_exit_group call since added (see a52e73c).

One will see 'WARNING: /httpserver.so is a PIE using TLS. This is currently unsupported (see issue #352). Link with '-shared' instead of '-pie'.' when one running golang-pie-example or golang-pie-httpserver. However both examples function just fine with httpserver behaving normally under stress tests using load. It might be that Golang runtime does not use thread local variables. In either case it is a good news as it seems we have another simpler way to run golang apps without wrapper.

The only caveat I saw was this stacktrace when trying to Ctrl-C running httpserver example:

OSv v0.52.0-33-g84a4cbcd
eth0: 192.168.122.15
WARNING: /httpserver.so is a PIE using TLS. This is currently unsupported (see issue #352). Link with '-shared' instead of '-pie'.
syscall(): unimplemented system call 267
Go version: go1.11
page fault outside application, addr: 0x0000000000000000
[registers]
RIP: 0x00000000003a3112 <syscall_entry+18>
RFL: 0x0000000000010286  CS:  0x0000000000000008  SS:  0x0000000000000010
RAX: 0x000000000000000e  RBX: 0x000000c00002e000  RCX: 0x0000100000193c9a  RDX: 0x000000c00002e080
RSI: 0x0000000000000000  RDI: 0x0000000000000002  RBP: 0xffff800002d67e70  R8:  0xffff80000096c501
R9:  0xffff80000096c500  R10: 0x0000000000000008  R11: 0x0000000000000286  R12: 0xffff800001c8b040
R13: 0xffff80000096c500  R14: 0x0000000000000008  R15: 0x0000000000000286  RSP: 0x0000000000000000
Aborted

[backtrace]
0x0000000000346ef2 <???+3436274>
0x0000000000347bf2 <mmu::vm_fault(unsigned long, exception_frame*)+466>
0x00000000003a32dd <page_fault+125>
0x00000000003a2156 <???+3809622>
0x000010000017ad44 <???+1551684>
0x000010000016af41 <???+1486657>
0x000010000017ab06 <???+1551110>
0x000010000017a3f5 <???+1549301>
0x0000100000193d82 <???+1654146>
0x00000000003fd546 <thread_main_c+38>
0x00000000003a30d2 <???+3813586>

@wkozaczuk
Copy link
Collaborator

wkozaczuk commented Apr 19, 2019

As I was trying to understand how local-exec works I came across yet another article that describes all TLS scenarios with good examples - https://chao-tic.github.io/blog/2018/12/25/tls.

Just to confirm how local-exec works, there are two forms compiler will generate code for local-exec like in those examples:

mov    $0xfffffffffffffff8,%rcx
mov    %fs:(%rcx),%rcx

or

movq   $0xa,%fs:0xffffffffffffffb0

Is it correct?

So here is an idea on how we might support local-exec for PIEs given these assumptions:

  • using thread local variables is rare in general and discouraged unless absolutely there is no other mechanism to use instead
  • higher language runtimes like golang, node, etc typically use TLS to store single variable that holds address to external structure (hashmap, container) to keep real thread-local variables

My point is that TLS used by PIEs would in most cases be pretty small as far as memory goes. So maybe potential solution would be to some reserve just enough space for PIE local-exec executable TLS and have "real" thread local variables used OSv kernel in local-exec mode outside of this reserved part of the TLS area.

In other words if we declared a TLS array buffer like so (assuming 256 bytes would be enough in most cases):

__thread char __pie_tls_reservation[256];

and somehow made __pie_tls_reservation be at the very beginning (end?) of the TLS area and everything else that OSv kernel uses after (+256) (-256, before?) we would avoid the collision between user application and OSv kernel. Obviously OSv would not really use __pie_tls_reservation.

@nyh @pdziepak Is it possible to pull it off with some magical gcc settings, somehow force it to position specific TLS local-exec variables at specific offsets when we build loader.elf?

I have also come across this gcc flags:

-mtls-direct-seg-refs
-mno-tls-direct-seg-refs
Controls whether TLS variables may be accessed with offsets from the TLS
segment register (%gs for 32-bit, %fs for 64-bit), or whether the thread base
pointer must be added. Whether or not this is valid depends on the operating
system, and whether it maps the segment to cover the entire TLS area.
For systems that use the GNU C Library, the default is on.

and some discussion around it here - https://bugs.llvm.org/show_bug.cgi?id=16145.

Could this help us somehow?

Finally how is it possible golang PIEs that use TLS work on OSv? Are we lucky because it has single TLS variable (see size of 8) -

TLS            0x00000000001075f0 0x00000000001085f0 0x00000000001085f0
                 0x0000000000000000 0x0000000000000008  R      0x8

and somehow falls into this 8 bytes of TLS not used by OSv kernel.

Or is it because TLS for new app threads are located on the stack (read this paragraph - https://chao-tic.github.io/blog/2018/12/25/tls#the-initialisation-of-tcb-or-tls-in-non-main-threads) [well maybe not in OSv) and when we switch stacks during SYSCALL calls the TLS is in different areas in memory?

@wkozaczuk
Copy link
Collaborator

I think this might be a solution (minus any gaps I have missed):

diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
index 88bef949..9fe15939 100644
--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh
@@ -156,11 +156,17 @@ void thread::setup_tcb()

     void* user_tls_data;
     size_t user_tls_size = 0;
+    void* user_local_tls_data;
+    size_t user_local_tls_size = 0;
     if (_app_runtime) {
         auto obj = _app_runtime->app.lib();
         assert(obj);
         user_tls_size = obj->initial_tls_size();
         user_tls_data = obj->initial_tls();
+        if (obj->is_executable()) {
+           user_local_tls_size = obj->get_tls_size();
+           user_local_tls_data = obj->get_tls_segment();
+        }
     }

     // In arch/x64/loader.ld, the TLS template segment is aligned to 64
@@ -177,6 +183,9 @@ void thread::setup_tcb()
     memcpy(p + user_tls_size, sched::tls.start, sched::tls.filesize);
     memset(p + user_tls_size + sched::tls.filesize, 0,
            sched::tls.size - sched::tls.filesize);
+    if (user_local_tls_size) {
+        memcpy(p + (tls.size + user_tls_size - user_local_tls_size), user_local_tls_data, user_local_tls_size);
+    }
     _tcb = static_cast<thread_control_block*>(p + tls.size + user_tls_size);
     _tcb->self = _tcb;
     _tcb->tls_base = p + user_tls_size;
diff --git a/arch/x64/loader.ld b/arch/x64/loader.ld
index caae1f68..72470f6c 100644
--- a/arch/x64/loader.ld
+++ b/arch/x64/loader.ld
@@ -78,7 +78,7 @@ SECTIONS
         _percpu_workers_end = .;
     }
     . = ALIGN(64);
-    .tdata : { *(.tdata .tdata.* .gnu.linkonce.td.*) } :tls :text
+    .tdata : { *(.pie_tls_reservation .tdata .tdata.* .gnu.linkonce.td.*) } :tls :text
     .tbss : {
         *(.tbss .tbss.* .gnu.linkonce.tb.*)
         . = ALIGN(64);
diff --git a/core/elf.cc b/core/elf.cc
index ca9226f6..2120afaa 100644
--- a/core/elf.cc
+++ b/core/elf.cc
@@ -437,9 +437,12 @@ void object::load_segments()
     // As explained in issue #352, we currently don't correctly support TLS
     // used in PIEs.
     if (_is_executable && _tls_segment) {
-        std::cout << "WARNING: " << pathname() << " is a PIE using TLS. This "
-                  << "is currently unsupported (see issue #352). Link with "
+        auto tls_size = _tls_init_size + _tls_uninit_size;
+        if (tls_size > 256) {
+            std::cout << "WARNING: " << pathname() << " is a PIE using TLS of size " << tls_size
+                  << " which is greater than 256 bytes limit. Either increase the limit of link with "
                   << "'-shared' instead of '-pie'.\n";
+        }
     }
 }

diff --git a/include/osv/elf.hh b/include/osv/elf.hh
index 0f1792c7..d9b6b715 100644
--- a/include/osv/elf.hh
+++ b/include/osv/elf.hh
@@ -362,7 +362,10 @@ public:
     void init_static_tls();
     size_t initial_tls_size() { return _initial_tls_size; }
     void* initial_tls() { return _initial_tls.get(); }
+    void* get_tls_segment() { return _tls_segment; }
     std::vector<ptrdiff_t>& initial_tls_offsets() { return _initial_tls_offsets; }
+    bool is_executable() { return _is_executable; }
+    ulong get_tls_size();
 protected:
     virtual void load_segment(const Elf64_Phdr& segment) = 0;
     virtual void unload_segment(const Elf64_Phdr& segment) = 0;
@@ -387,7 +390,6 @@ private:
     void relocate_rela();
     void relocate_pltgot();
     unsigned symtab_len();
-    ulong get_tls_size();
     void collect_dependencies(std::unordered_set<elf::object*>& ds);
     void prepare_initial_tls(void* buffer, size_t size, std::vector<ptrdiff_t>& offsets);
     void alloc_static_tls();
diff --git a/loader.cc b/loader.cc
index 61f68e09..96d2aff6 100644
--- a/loader.cc
+++ b/loader.cc
@@ -65,6 +65,8 @@ asm(".pushsection \".debug_gdb_scripts\", \"MS\",@progbits,1 \n"
     ".asciz \"scripts/loader.py\" \n"
     ".popsection \n");

+__attribute__((section (".pie_tls_reservation"))) __thread char __pie_local_exec_tls_reservation[256] = {0};
+
 elf::Elf64_Ehdr* elf_header __attribute__ ((aligned(8)));

 size_t elf_size;

The symbols in TLS section of loader.elf look like this:

readelf -s build/release/loader.elf | grep TLS | grep -o 000000*.* | sort | uniq
0000000000000000   256 TLS     GLOBAL DEFAULT   58 __pie_local_exec_tls_rese
0000000000000100     4 TLS     GLOBAL DEFAULT   58 _ZN5sched15preempt_counte
0000000000000110     8 TLS     GLOBAL DEFAULT   59 rrw_tsd_key
0000000000000118     8 TLS     GLOBAL DEFAULT   59 zfs_fsyncer_key
0000000000000120     8 TLS     GLOBAL DEFAULT   59 _ZN12attached_env7tls_env
0000000000000128     4 TLS     GLOBAL DEFAULT   59 _ZN5sched15exception_dept
0000000000000130     8 TLS     GLOBAL DEFAULT   59 current_interrupt_frame
0000000000000138     1 TLS     LOCAL  DEFAULT   59 _ZN3dbgL8recursedE
000000000000013c     4 TLS     GLOBAL DEFAULT   59 _ZN6memory21emergency_all
0000000000000140     1 TLS     GLOBAL DEFAULT   59 _ZN6memory13alloc_tracker
0000000000000148     8 TLS     GLOBAL DEFAULT   59 percpu_base
0000000000000150     8 TLS     GLOBAL DEFAULT   59 _ZN5sched11current_cpuE
0000000000000158     1 TLS     GLOBAL DEFAULT   59 _ZN5sched15need_reschedul
0000000000000160     8 TLS     GLOBAL DEFAULT   59 _ZN5sched9s_currentE
0000000000000168     4 TLS     LOCAL  DEFAULT   59 _ZL18func_trace_nesting
0000000000000170     8 TLS     GLOBAL DEFAULT   59 _ZN3osv20override_current
0000000000000178     8 TLS     LOCAL  DEFAULT   59 _ZZ7readdirE6result
0000000000000180   280 TLS     LOCAL  DEFAULT   59 _ZZ7readdirE5entry
0000000000000298     8 TLS     LOCAL  DEFAULT   59 __locale
00000000000002a0     4 TLS     GLOBAL DEFAULT   59 _ZN15pthread_private12can
00000000000002a8     8 TLS     GLOBAL DEFAULT   59 _ZN15pthread_private15cur
00000000000002b0  1024 TLS     GLOBAL DEFAULT   59 _ZN15pthread_private3tsdE
00000000000006b0     4 TLS     GLOBAL DEFAULT   59 errno
00000000000006b8     8 TLS     LOCAL  DEFAULT   59 _ZL11dlerror_ptr
00000000000006c0   128 TLS     LOCAL  DEFAULT   59 _ZL11dlerror_msg
0000000000000740     8 TLS     LOCAL  DEFAULT   59 _ZL17signal_stack_size
0000000000000748     8 TLS     LOCAL  DEFAULT   59 _ZL18signal_stack_begin
0000000000000750     4 TLS     GLOBAL DEFAULT   59 _ZN3osv21thread_pending_s
0000000000000760    16 TLS     GLOBAL DEFAULT   59 _ZN3osv18thread_signal_ma
0000000000000770     8 TLS     LOCAL  DEFAULT   59 _ZL24thread_local_destruc
0000000000000778    16 TLS     LOCAL  DEFAULT   59 _ZZN12_GLOBAL__N_110get_g
0000000000000788     8 TLS     GLOBAL DEFAULT   59 _ZSt11__once_call
0000000000000790     8 TLS     GLOBAL DEFAULT   59 _ZSt15__once_callable

As you can see thanks to a trick in loader.ld (will it always put this in the beginning?) __pie_local_exec_tls_reservation is the very first 256 bytes where local exec of PIE can live without any collision with "used" part of kernel TLS.

I have tested the example from the issue (see above) and seems to work:

OSv v0.53.0-6-gc8395118
eth0: 192.168.122.15
Booted up in 432.09 ms
123

@nyh
Copy link
Contributor Author

nyh commented Apr 24, 2019

Nice trick with that __pie_local_exec_tls_reservation thing. Looks to me like a reasonable approach but @pdziepak is the real expert here and might have other thoughts (?).

I will be easier to review the patch on the mailing list, but I'll start here:

+    if (user_local_tls_size) {
+        memcpy(p + (tls.size + user_tls_size - user_local_tls_size), user_local_tls_data, user_local_tls_size);
+    }

Wow, took me a while to understand why this is correct, but I think it is. Maybe this deserves a comment, e.g., saying that the first 256 bytes of OSv's TLS (which is the 256 bytes at the end of the allocation, since the TLS is addressed with negative offsets) is reserved for lor local-exec TLS in executables which don't know about OSv - so if we have such an executable, we need to copy its TLS initialization. Or something like that.

Can you please create a unit test for testing this, which also includes several local-exec variables, some initialized some not (so should be zero-initialized), and see we also get the right initialization (e.g., if we initialize a to 1 and b to 2, we don't want to see this order reversed ;-)).

The obvious downside of this technique is that it adds yet another 256 bytes to every thread (as well as the work to copy these bytes). I wonder if we shouldn't start with a much lower number and grow it later if we want? Can you please change the "256" number repeated three times in your code to a constant set only once? (sadly, it must be a constant, it can't be a boot-time configuration, because we need it during compilation).

@wkozaczuk
Copy link
Collaborator

I hope to send a patch later once I do more testing.

As far as reservation size I agree we could go with 64 bytes instead. As I noted thread-local variable I think are used sparingly so most apps should be OK with that limit.

As far as wasting space I think we would not waste any given that existing TLS kernel size is around 1.5K which means that our malloc uses whole 4K page in this case so adding 64 bytes or even more will not change anything for now. See #1000.

@wkozaczuk
Copy link
Collaborator

As you may have seen in the patch I just have sent the reservation area actually needs to be at the end of the kernel TLS. This actually also explains why Golang pies work even before the patch. Golang pies only have 8-bytes long TLS and the kernel TLS had already unintended unused area of 40 bytes because of 64-bytes round-up (see loader.ld).

wkozaczuk pushed a commit to wkozaczuk/osv that referenced this issue Sep 18, 2019
This patch enhances OSv dynamic loader to support
pies and position dependant executables that use TLS
(Thread Local Storage) in local-exec mode.

It does so by reserving an extra slot in kernel static
TLS block at its end and designating it as user static TLS
for the executable ELF. Any dependant ELF objects are still
placed in the area before the kernel TLS. For the specifics please
read comments added to arch-elf.cc and arch-switch.hh.

Please note that this solution limits the size of the application
TLS block to 64 bytes plus extra gap due to 64-bytes alignment
of the kernel TLS. This should be sufficient for most applications
which either use tiny TLS (Golang uses 8-bytes long) if at all.
Rust ELFs tend to rely on quite large TLS in which case the limit
in loader.ld needs to be increased accordingly and loader.elf
relinked.

Fixes cloudius-systems#352

Signed-off-by: Waldemar Kozaczuk <[email protected]>
Message-Id: <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants