Noticeable performance downgrade from Python 3.10 to onwards versions #4716

Open
gi0baro opened this issue Nov 18, 2024 · 4 comments

@gi0baro
Contributor

gi0baro commented Nov 18, 2024

Hi 👋 I'm not sure whether this makes sense as an issue or should be a discussion instead. Project maintainers, feel free to move this to a discussion if you conclude that's better.

Context: in the Granian project (an HTTP server) we recently introduced some e2e benchmarks across different Python versions, which show a ~30% performance degradation in some tests when comparing Python 3.10 to all later versions (PyO3 0.22, cfg pyo3_disable_reference_pool).

Now, the specific tests showing this degradation involve some relatively simple code:

While I understand an e2e benchmark might suffer from a lot of additional noise compared to a smaller unit benchmark, and there's a lot more to consider (the network stack in the CPython stdlib, for example), other protocols involving asyncio and a bunch more stuff suffer from a much smaller degradation than the one I referenced, so I believe there might be something going on in the PyO3 <-> CPython interop. Thus I have two main questions:

  • is there any well-known difference from Python 3.11 onwards in how PyO3 interacts with the Python interpreter that might explain this?
  • do you have any suggestions on how to investigate this in a more fine-grained way, to help highlight any other differences between Python versions that might play a role in this?

Thanks in advance 🙏

@gi0baro gi0baro changed the title from "Noticeable performance downgrade from Python 3.10 to onwards version" to "Noticeable performance downgrade from Python 3.10 to onwards versions" on Nov 18, 2024
@davidhewitt
Member

Thanks for the questions. I don't have an immediate answer for you; to check I understand, Python 3.11 and up got ~30% slower?

Have you tried generating flame graphs (e.g. with samply) to see if that gives a hint where the differences come from?

@gi0baro
Contributor Author

gi0baro commented Nov 20, 2024

> to check I understand, Python 3.11 and up got ~30% slower?

@davidhewitt correct, here is a more direct comparison extracted from that bench:

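| Python version | Total requests | RPS   | Avg latency | Max latency |
|----------------|----------------|-------|-------------|-------------|
| 3.10           | 559148         | 55978 | 2.28ms      | 24.538ms    |
| 3.11           | 381549         | 38197 | 3.339ms     | 24.674ms    |
| 3.12           | 356792         | 35798 | 7.121ms     | 64.292ms    |
| 3.13           | 371324         | 37194 | 3.429ms     | 18.313ms    |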
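
For reference, the relative drop against 3.10 can be computed directly from the RPS column above; on these numbers it comes out at roughly a third for every later version. A minimal sketch (the helper name `relative_change` is only illustrative, not part of the benchmark suite):

```python
# RPS figures copied from the table above.
rps = {"3.10": 55978, "3.11": 38197, "3.12": 35798, "3.13": 37194}


def relative_change(baseline: float, value: float) -> float:
    """Percentage change of `value` relative to `baseline` (negative = slower)."""
    return (value - baseline) / baseline * 100.0


baseline = rps["3.10"]
changes = {
    version: relative_change(baseline, value)
    for version, value in rps.items()
    if version != "3.10"
}

for version, pct in changes.items():
    print(f"{version}: {pct:+.1f}% RPS vs 3.10")
```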

> Have you tried generating flame graphs (e.g. with samply) to see if that gives a hint where the differences come from?

That makes sense. Let me plan some tests using sampling; I'll post the findings here.

@gi0baro
Contributor Author

gi0baro commented Dec 6, 2024

@davidhewitt I tried using samply as you suggested to build flame graphs, but even with the following profile used for the build

[profile.profiling]
inherits = "release"
debug = true

the stacks in the report just show items as 0x12989b _granian.cpython-310-x86_64-linux-gnu.so, so it's quite hard to spot differences between the 3.10 and 3.11 builds. Do you have any further suggestions on how to get full stacks for a library built as a PyO3 cdylib?

@gi0baro
Contributor Author

gi0baro commented Dec 6, 2024

Btw, I re-ran the tests with PyO3 0.23 and they show the same issue (ignore the absolute numbers compared to the last table, as this is different hardware):

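| Python version | RPS    | Avg latency | Max latency |
|----------------|--------|-------------|-------------|
| 3.9            | 132489 | 0.483ms     | 1.653ms     |
| 3.10           | 132521 | 0.482ms     | 1.565ms     |
| 3.11           | 64298  | 0.994ms     | 1.816ms     |
| 3.12           | 62252  | 2.054ms     | 7.601ms     |
| 3.13           | 63075  | 1.014ms     | 1.692ms     |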

Might this be related to threads? The involved code has the main Python thread waiting on a threading.Event object, with 1 tokio thread dealing with I/O and sending/receiving stuff to/from a 2nd thread, which interacts with Python code through a loop of Python::with_gil(|py| { ... }) calls. I'm starting to wonder whether there might be some difference in GIL acquisition from different threads after 3.10.
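
One way to narrow this down might be a pure-Python baseline of the same thread layout, run under each interpreter version. The sketch below is only an approximation and rests on assumptions: the worker is a regular `threading.Thread`, so it does not go through `Python::with_gil` or any native GIL acquisition path, and `run_handoff_benchmark` and `callback` are hypothetical names standing in for the real application code. If this baseline shows no comparable drop between 3.10 and 3.11+, that would point more towards the native-thread GIL acquisition side than towards the interpreter executing the callback.

```python
import sys
import threading
import time


def run_handoff_benchmark(iterations: int = 100_000) -> float:
    """Return callback invocations per second performed by a worker thread
    while the main thread is blocked on a threading.Event."""
    done = threading.Event()
    counter = {"calls": 0}

    def callback() -> None:
        # Hypothetical stand-in for the Python code invoked from the 2nd thread.
        counter["calls"] += 1

    def worker() -> None:
        # Stand-in for the loop of Python::with_gil(...) calls; here it is a
        # plain Python thread, so no native GIL acquisition is exercised.
        for _ in range(iterations):
            callback()
        done.set()

    thread = threading.Thread(target=worker)
    start = time.perf_counter()
    thread.start()
    done.wait()  # main thread parked on the Event, as in the described setup
    elapsed = time.perf_counter() - start
    thread.join()

    assert counter["calls"] == iterations
    return iterations / elapsed


if __name__ == "__main__":
    rate = run_handoff_benchmark()
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {version}: {rate:,.0f} calls/s")
```

Running the same file with each interpreter (3.10, 3.11, ...) on the same machine gives one number per version to compare against the e2e results.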
