Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added exception translator specific mutex used with try_translate_exceptions #5362

Merged
merged 5 commits into from
Sep 17, 2024

Conversation

vfdev-5
Copy link
Contributor

@vfdev-5 vfdev-5 commented Sep 9, 2024

Description

  • Added exception translator specific mutex in the internals to lock it when with try_translate_exceptions and register_exception_translator, register_local_exception_translator.

Fixes #5346

cc @colesbury

Suggested changelog entry:

Added exception translator specific mutex used with try_translate_exceptions in the free-threaded build for internal locking.

Copy link
Contributor

@colesbury colesbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like the right approach to me.

  • I'd consider passing internals.registered_exception_translators to the callback and maybe rename it to with_exception_translators (instead of passing the internals struct).
  • I think this is an ABI change for the free-threaded build due to the new mutex in internals, so I think PYBIND11_INTERNALS_VERSION needs to be incremented. It may make sense to only change it for the free-threaded build.
  • A regression test derived from your bug report might be a good idea

@vfdev-5 vfdev-5 force-pushed the fix-5346-ft-internals-deadlock branch from f20ebd9 to 591cb73 Compare September 9, 2024 23:14
@vfdev-5 vfdev-5 requested a review from colesbury September 9, 2024 23:38
@vfdev-5 vfdev-5 marked this pull request as ready for review September 10, 2024 08:55
@colesbury
Copy link
Contributor

This LGTM, but needs review from the pybind11 maintainers

@vfdev-5
Copy link
Contributor Author

vfdev-5 commented Sep 11, 2024

Thanks for the feedback @colesbury !

@henryiii @rwgk could you please review this PR, thanks!

@rwgk
Copy link
Collaborator

rwgk commented Sep 11, 2024

(I'll look asap. I just changed companies and I'm still in the middle of getting set up. At the moment I don't even see when I'm getting tagged on GitHub, not sure why.)

@@ -39,7 +39,11 @@
# if PY_VERSION_HEX >= 0x030C0000 || defined(_MSC_VER)
// Version bump for Python 3.12+, before first 3.12 beta release.
// Version bump for MSVC piggy-backed on PR #4779. See comments there.
# define PYBIND11_INTERNALS_VERSION 5
# ifdef Py_GIL_DISABLED
# define PYBIND11_INTERNALS_VERSION 6
Copy link
Collaborator

@rwgk rwgk Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless #5296 is merged first, this will lead to incompatibility between extension modules compiled with or without Py_GIL_DISABLED defined. — EDIT next day: THIS IS NOT AN ISSUE. See comments posted after this one.

I just only now realize ... Huston we have a problem?

We already added pymutex mutex to struct internals but didn't change PYBIND11_INTERNALS_VERSION. Ouch? Sorry I missed that before.

What do we want here? I'm very uncertain, @colesbury could you please advise?

struct internals {
#if PY_VERSION_HEX >= 0x030D0000
    pymutex mutex;
    pymutex exception_translator_mutex;
#endif
...

Could that work? The idea here is that struct internals is the same with and without Py_GIL_DISABLED defined.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Free-threading is experimental so changing the ABI before was fine. Now I think packages might be beginning to ship, so probably not such a great idea to change without a bump. But not fond of these confusing bumps. The ABI is currently a mess, by the way - the Windows only bump broke conda-forge and the only solution was to pin <2.12, which now is breaking packages (see the feedstock for more info).

Will the internals version matter after #5296? Could we just basically do away with the concept, and only do compiler/stdlib, etc?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just basically do away with the concept

Not entirely, I think. The internals versions still need to match for the inheritance functionality to work correctly. (I'd probably need a day or two to figure out conclusively if that's fixable or just fundamentally impossible.)

But the more I think about it, the more I'm getting worried about having a different struct internals with and without Py_GIL_DISABLED defined:

What happens if there is one extension built with the GIL enabled, another built with free-threading, and both are imported into the same process?

Currently we only have this:

  • #define PYBIND11_CHECK_PYTHON_VERSION \
    { \
    const char *compiled_ver \
    = PYBIND11_TOSTRING(PY_MAJOR_VERSION) "." PYBIND11_TOSTRING(PY_MINOR_VERSION); \
    const char *runtime_ver = Py_GetVersion(); \
    size_t len = std::strlen(compiled_ver); \
    if (std::strncmp(runtime_ver, compiled_ver, len) != 0 \
    || (runtime_ver[len] >= '0' && runtime_ver[len] <= '9')) { \
    PyErr_Format(PyExc_ImportError, \
    "Python version mismatch: module was compiled for Python %s, " \
    "but the interpreter version is incompatible: %s.", \
    compiled_ver, \
    runtime_ver); \
    return nullptr; \
    } \
    }

Do we need to also check for Py_GIL_DISABLED compatibility there?

Or do we have to do something here?

  • #define PYBIND11_INTERNALS_ID \
    "__pybind11_internals_v" PYBIND11_TOSTRING(PYBIND11_INTERNALS_VERSION) \
    PYBIND11_INTERNALS_KIND PYBIND11_COMPILER_TYPE PYBIND11_STDLIB \
    PYBIND11_BUILD_ABI PYBIND11_BUILD_TYPE "__"
    #define PYBIND11_MODULE_LOCAL_ID \
    "__pybind11_module_local_v" PYBIND11_TOSTRING(PYBIND11_INTERNALS_VERSION) \
    PYBIND11_INTERNALS_KIND PYBIND11_COMPILER_TYPE PYBIND11_STDLIB \
    PYBIND11_BUILD_ABI PYBIND11_BUILD_TYPE "__"

@colesbury

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if there is one extension built with the GIL enabled, another built with free-threading, and both are imported into the same process?

You can't have extensions compiled with and without Py_GIL_DISABLED in the same process. An extension might require the GIL to be dynamically enabled, but that's different: that's a matter of calling (or not calling) PyUnstable_Module_SetGIL(). The extension will still need to be compiled with Py_GIL_DISABLED to be loaded in 3.13t.

In other words:

  • Py_GIL_DISABLED is part of the extension ABI, similar to Py_VERSION_HEX
  • It's part of the wheel tag (i.e., the "t" in cp313-cp313t), just like the version number

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or as @henryiii wrote in #5148 (comment): "A free-threaded build can't talk to a non-free-threaded build".

Copy link
Contributor Author

@vfdev-5 vfdev-5 Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I should have built using same python version, e.g. 3.13t and 3.13.
Doing so, above example is giving either a segfault:

$ gdb --args python3.13t main.py

Starting program: /usr/bin/python3.13t main.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7be1f0b in pybind11::dict::dict(pybind11::object&&) () from /pybind11_playground/ft_or_not_ft/test_gil.so

(gdb) bt
#0  0x00007ffff796bf0b in pybind11::dict::dict(pybind11::object&&) () from /pybind11_playground/ft_or_not_ft/test_gil.so
#1  0x00007ffff79683c5 in pybind11::detail::get_internals() () from /pybind11_playground/ft_or_not_ft/test_gil.so
#2  0x00007ffff7967bdd in PyInit_test_gil () from /pybind11_playground/ft_or_not_ft/test_gil.so
#3  0x00000000005e7800 in ?? ()

or

$ python3.13 main.py
Traceback (most recent call last):
  File "/pybind11_playground/ft_or_not_ft/main.py", line 4, in <module>
    import test_ft
ImportError: /pybind11_playground/ft_or_not_ft/test_ft.so: undefined symbol: _Py_DecRefShared

If this is again unrelated sorry for the noise.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot! That really helps me understand the situation better. So I think what we need to do is work on this code:

#define PYBIND11_CHECK_PYTHON_VERSION \
{ \
const char *compiled_ver \
= PYBIND11_TOSTRING(PY_MAJOR_VERSION) "." PYBIND11_TOSTRING(PY_MINOR_VERSION); \
const char *runtime_ver = Py_GetVersion(); \
size_t len = std::strlen(compiled_ver); \
if (std::strncmp(runtime_ver, compiled_ver, len) != 0 \
|| (runtime_ver[len] >= '0' && runtime_ver[len] <= '9')) { \
PyErr_Format(PyExc_ImportError, \
"Python version mismatch: module was compiled for Python %s, " \
"but the interpreter version is incompatible: %s.", \
compiled_ver, \
runtime_ver); \
return nullptr; \
} \
}

We want to catch if there is a mismatch between the Python interpreter kind (gil, free-threading) and the way the extension was built.

In a separate PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I think packages might be beginning to ship, so probably not such a great idea to change without a bump. But not fond of these confusing bumps.

Yeah ... "version bump chaos"

Trying to summarize how I understand the situation:

  • The internals version wasn't changed for Python 3.13t — but that's ok because there were no released packages anyway that could have been built with a non-matching struct internals.

  • Assuming there are released packages for 3.13t now, yes, it would indeed be safest to bump the internals version for 3.13t. — But pegging that on Py_GIL_DISABLED will make it very difficult for the general user community to reason about the internals version.

If this PR gets merged as is, we'll have internals version

  • 4 for Python < 3.12 if _MSC_VER is not defined.
  • 5 for Python 3.12, or 3.13, or if _MSC_VER is defined.
  • 6 for Python 3.13t only.

Who can (or wants to) understand that? :-)

And then we also still have #5339 (smart_holder) and #5280 (native enum) that I'd really like to see merged.

I believe this is a situation where the maintainers need to step up to balance "making progress" with "minimizing confusion".

@henryiii wrote:

Free-threading is experimental

That makes me think, let's not mess with the internals version in this PR:

  • Maybe a small number of users experimenting will stumble,
  • but a much larger number of production users will not have to understand the weirdness summarized above.

WDYT?

Looking beyond that question, I feel it'd be best to approach this strategically:

That's a discussion far beyond this PR, but I'm mentioning it here because it would be another reason to not mess with the internals version here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes me think, let's not mess with the internals version in this PR

I'm concerned that landing this PR without changing the version number will cause problems not just for users but also package maintainers (e.g., scipy, PyTorch) that use pybind11 and support free-threading. The failure modes for different internals can be complicated -- in this case, I think you run the risk of treating uninitialized memory like a mutex.

I think the approach in this PR is a good long term fix, but if the version bump is the issue we can first pursue a different approach that avoids changing the internals struct (and consider landing this change later when #5280 is ready).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First priority, I want to be helpful. Second minimize potential for confusion.

Unfortunately I'm technically somewhat handicapped at the moment (no Linux workstation), but let me try to get #5296 merged asap, with @henryiii's formal approval, to be sure we're on the same.

After that version bumps will have a lot less impact, and it won't matter as much that people fully understand the numbering.

Copy link
Collaborator

@rwgk rwgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, based on @colesbury's review; I don't fully understand the mutex issue (#5346) TBH; I won't have enough free cycles for a while to really get my head around this.

PR #5296 was merged, which means we don't have to worry about the internals version bump so much.

tests/test_exceptions.cpp Outdated Show resolved Hide resolved
tests/test_exceptions.cpp Outdated Show resolved Hide resolved
@vfdev-5 vfdev-5 force-pushed the fix-5346-ft-internals-deadlock branch from 591cb73 to 8275a6a Compare September 16, 2024 08:46
@vfdev-5
Copy link
Contributor Author

vfdev-5 commented Sep 17, 2024

@rwgk I applied the suggestions to this PR. Anything else I need to do here or it is good to be merged?

@rwgk rwgk merged commit 1d9483f into pybind:master Sep 17, 2024
78 checks passed
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Sep 17, 2024
@rwgk
Copy link
Collaborator

rwgk commented Sep 17, 2024

Could you please add a Changlog entry to the PR description?

Please see other open PRs for the template.

@vfdev-5 vfdev-5 deleted the fix-5346-ft-internals-deadlock branch September 17, 2024 20:38
@vfdev-5
Copy link
Contributor Author

vfdev-5 commented Sep 17, 2024

@rwgk I updated the description of the PR with a changelog entry, feel free to update it if needed or let me know how I can update it. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs changelog Possibly needs a changelog entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG]: retranslating local exception with custom data class hangs in free-threading cpython
4 participants