PERF: release the GIL in cylindrical pixelizer #5018
Conversation
Releasing the GIL is likely a good idea here, but I also suspect that the abysmal performance in this routine may be fixed independently (possibly by reverting an earlier patch of mine that I'll dig up in my future review). Anyway, I'll try to review in detail in the coming days. Thanks!
if you figure out a separate fix I'm happy to go with it and close this one!
oh, especially since image tests seem to be failing and it was not as simple as I hoped :)
ok, I think I see the issue but won't have time to fix it for a while... so will hold off in case you devise a separate performance fix.
oh wow, these diffs are catastrophic. I'll try to tackle the performance issue from another angle and I'll move this to draft if you don't mind.
These lines are using Python math operations inside a tight loop. What if we used a constant for PI instead of np.pi?
With the latest version, they are actually expanded to `(__pyx_v_ptbounds[0]) = fmod((__pyx_v_ptbounds[0]), (2.0 * __pyx_v_npPI));`, so there might be a small speedup to be found by precomputing 2π ahead of time, but I'm not sure this is super important.
Huh. I guess I wasn't on the latest version! For me it did a ton of unpacking and Python type coercion. I guess that's one of the advantages of upgrading, eh? Are our performance tests being run on the latest version? (i.e., do you see the slowness, or is what I am referring to showing up in our benchmarks?)
Co-authored-by: Corentin Cadiou <[email protected]>
still expect the image diffs to fail -- I'll look later today and if it's a quick fix (which I think it is) I'll put it in and do some performance tests.
```
@@ -566,7 +566,8 @@ def pixelize_cylinder(np.float64_t[:,:] buff,
     cdef np.float64_t r_inc, theta_inc
     cdef np.float64_t costheta, sintheta
     cdef int i, i1, pi, pj
     cdef np.float64_t twoPI = 2 * np.pi
     cdef np.float64_t onePI = np.pi
```
here's the direct way to "import" the constant from numpy at the C level:

```cython
cdef extern from "numpy/npy_math.h":
    double NPY_PI
```

(taken from https://github.com/yt-project/yt/pull/4948/files#r1692841029)
(NPY_2_PI is also available)
hah, will import NPY_2_PI as well in a minute after tests run this round... I think this one will pass...
oh wait, NPY_2_PI is 2/pi, not 2*pi?
aaaah, missed that, sorry
(force-pushed from 9c76d68 to ad92f03)
```cython
if ptbounds[0] < 0:
    ptbounds[0] += NPY_PI
ptbounds[1] = ptbounds[1] % twoPI
if ptbounds[1] < 0:
    ptbounds[1] += NPY_PI
```
so I'm not sure why I needed to add the check for negative numbers here, but after switching from `ptbounds[1] % (2*np.pi)` to `ptbounds[1] % twoPI`, the result returns numbers in the range (-pi, pi).
I think we need to understand this bit, and possibly come up with a way to perform the computation that doesn't involve branching; hard-wiring a somewhat expensive calculation generally hurts performance less than ruining branch prediction.
I'm writing this comment early in my review but I'll try to come up with more insight.
I'm taking a bet that the reason why you're seeing a change in behavior is that the function is decorated with `@cython.cdivision(True)`: on `main` the Python `%` is used, while with `twoPI` you're getting a C operator instead.
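For reference (not part of the PR), here's a minimal pure-Python sketch of the semantic difference being discussed: Python's `%` takes the sign of the divisor, while C's `fmod` (which is effectively what you get under `cdivision`) keeps the sign of the dividend.

```python
import math

two_pi = 2 * math.pi
for theta in (-1.0, -4.0, 1.0, 7.0):
    py_mod = theta % two_pi            # Python semantics: sign of the divisor, so the result is in [0, 2*pi)
    c_fmod = math.fmod(theta, two_pi)  # C semantics: sign of the dividend, so negative inputs stay negative
    print(f"theta={theta:+.1f}  %={py_mod:+.4f}  fmod={c_fmod:+.4f}")
```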
can you try this patch?

```diff
diff --git a/yt/utilities/lib/pixelization_routines.pyx b/yt/utilities/lib/pixelization_routines.pyx
index 194ce7af6..6728375d0 100644
--- a/yt/utilities/lib/pixelization_routines.pyx
+++ b/yt/utilities/lib/pixelization_routines.pyx
@@ -664,12 +664,8 @@ def pixelize_cylinder(np.float64_t[:,:] buff,
             ptbounds[1] = fmax(ptbounds[1], corners[i1])
 
         # shift to a [0, PI] interval
-        ptbounds[0] = ptbounds[0] % twoPI
-        if ptbounds[0] < 0:
-            ptbounds[0] += NPY_PI
-        ptbounds[1] = ptbounds[1] % twoPI
-        if ptbounds[1] < 0:
-            ptbounds[1] += NPY_PI
+        ptbounds[0] = (ptbounds[0]+twoPI) % twoPI
+        ptbounds[1] = (ptbounds[1]+twoPI) % twoPI
 
         if prbounds[0] >= rmin and prbounds[1] <= rmax and \
            ptbounds[0] >= tmin and ptbounds[1] <= tmax:
```
the result should be the same in all cases, but it avoids branching, so I expect it to be faster.
If this works, we'll want to add a comment for why it's done this way.
I am wondering if we could use `fmod` here.
looks like in cython, `%` is an alias to `fmod` already when using C functions, which returns a negative number when the first argument is negative (which is why the tests started failing here I think). But I do think it's clearer to use `fmod` to avoid the confusion in the first place (so `fmod(ptbounds[0]+twoPI, twoPI)` to handle the negative numbers). In case it's helpful, I wrote up some scratch code to double check; here's a plot:
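The original scratch code and plot aren't reproduced here, but the double-check it describes can be sketched in a few lines of plain Python (an illustration, not the author's code): `fmod(x + 2*pi, 2*pi)` wraps any angle above -2*pi into [0, 2*pi), whereas a bare `fmod` leaves negative inputs negative.

```python
import math

two_pi = 2 * math.pi
# Assumes inputs never go below -2*pi, which holds for the theta bounds discussed above.
for theta in (-math.pi, -0.5, 0.0, 0.5, math.pi, two_pi + 0.5):
    bare = math.fmod(theta, two_pi)              # can be negative
    wrapped = math.fmod(theta + two_pi, two_pi)  # always lands in [0, 2*pi)
    assert 0.0 <= wrapped < two_pi
    print(f"theta={theta:+.4f}  fmod={bare:+.4f}  wrapped={wrapped:+.4f}")
```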
Here's a performance comparison between this branch and main that compares the scaling with increasing number of cells in r and theta. Full code for the test case is over in this repo. The test dataset is a 2d cylindrical dataset built with:

```python
import numpy as np
import yt

rs = np.random.default_rng()  # RNG; not defined in the original snippet


def get_2d_ds(n_r, n_theta):
    shp = (n_r, n_theta, 1)
    bbox = np.array([[0, 1], [0, np.pi * 2], [-0.5, 0.5]])
    data = {
        'density': rs.random(shp)
    }
    ds = yt.load_uniform_grid(data,
                              shp,
                              bbox=bbox,
                              geometry='cylindrical',
                              axis_order=('r', 'theta', 'z'))
    return ds
```

And I'm timing a call to:

```python
def plot_ds(ds):
    slc = yt.SlicePlot(ds, 'z', ('stream', 'density'))
    slc.render()
```

The first and second columns in the following graph are the elapsed time in seconds averaged over 10 iterations for the two branches. Average speedup over all resolutions is about 33%.
I like the comparison. I immediately thought, "what about in AMR?", then realized this is likely a better general-purpose diagnostic.
I looked for the most recent patches to this routine from ~3 years ago, when I was focused on correctness and possibly neglecting performance. Here's what I found, in case it's relevant here: #3782 and #3818 may have impacted performance, so it might be worth trying to revert them and see if the performance changes measurably.
given that I've postponed the release by ~10 days, for now there should be enough time to squeeze this in, though it's not a blocker.
ya! I think so too -- I should have time today to try out your suggestion.
pre-commit.ci autofix
for more information, see https://pre-commit.ci
Alright, I finally took an hour to try out the suggestions I made in #5018 (comment). Here's the test script I used to measure perf:

```python
import argparse
import json
from pathlib import Path
from time import monotonic_ns

import numpy as np

import yt


def get_2d_ds(n_r, n_theta):
    shp = (n_r, n_theta, 1)
    bbox = np.array([[0, 1], [0, np.pi * 2], [-0.5, 0.5]])
    data = {"density": np.random.random(shp)}
    ds = yt.load_uniform_grid(
        data, shp, bbox=bbox, geometry="cylindrical", axis_order=("r", "theta", "z")
    )
    return ds


def plot_ds(ds):
    slc = yt.SlicePlot(ds, "z", ("stream", "density"))
    slc.render()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--tag", type=str, help="prefix of a savefile")
    parser.add_argument(
        "--nit",
        dest="n_iter",
        nargs="?",
        type=int,
        default=1,
        help="number of iterations to run",
    )
    args = parser.parse_args()

    tstart = monotonic_ns()
    for it in range(args.n_iter):
        plot_ds(get_2d_ds(256, 256))
        print(f"{it=}/{args.n_iter}", end="\r")
    tstop = monotonic_ns()

    dt_ns = tstop - tstart
    dt_s = dt_ns / 1e9
    dt_s_per_iter = dt_s / args.n_iter
    print(f"Did {args.n_iter} iterations in {dt_s:.02f} s ({dt_s_per_iter:.02f} s/it)")

    results = {
        "tag": args.tag,
        "n_iter": args.n_iter,
        "dt_s": dt_s,
        "dt_s_per_iter": dt_s_per_iter,
    }
    SAVE_DIR = Path(__file__).parent / "results"
    SAVE_DIR.mkdir(exist_ok=True)
    with open(SAVE_DIR / f"{args.tag}.{args.n_iter}.json", "w") as fh:
        json.dump(results, fh, indent=2)
```

and my plotting script:

```python
import json
from pathlib import Path

import matplotlib.pyplot as plt

DATA_DIR = Path(__file__).parent / "results"

fig, ax = plt.subplots()
files = sorted(DATA_DIR.glob("*.json"))
xlabels: list[str] = []
for ifile, file in enumerate(files):
    with file.open() as fh:
        data = json.load(fh)
    label = f"{data['tag']} (n_iter={data['n_iter']})"
    ax.scatter(ifile, data["dt_s_per_iter"], label=label)

ax.set(xticks=list(range(len(files))), ylabel="runtime per it (s)")
ax.legend()

sfile = DATA_DIR.parent / "viz.png"
print(f"saving to {sfile.resolve()}")
fig.savefig(sfile)
```

and the resulting viz. So it seems that this PR almost compensates for a huge perf hit we took in #3818, but not quite.
So I think the actual priority is to revert #3818, and then we should explore whether this PR still improves performance on top of it or not.
Here's the revert PR: #5034
Alright, to summarize the scattered discussion: I still think there should be a smarter way to restore optimal performance in this routine (which is what I'm trying to do in #5035), but so far I haven't been able to find a completely backward compatible solution. This patch is 100% backward compatible, so I'm choosing to include it in yt 4.4.0, and I'm hoping we can revisit soon after.
release the gil!
This surprisingly seemed to work with minimal effort... took the pixelization time from 53s to 27s for one example (see this slack thread).
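For context (not part of the PR), one rough way to see the effect of releasing the GIL from pure Python is to render several slices sequentially and then from a thread pool: if the pixelizer holds the GIL the threads serialize, and if it releases it the pixelization work can overlap. This is a minimal sketch under the assumption that the yt/matplotlib calls involved tolerate being driven from threads (they are not guaranteed to be thread-safe), so treat it purely as an illustration:

```python
# Hypothetical GIL-release check (not from the PR): compare serial vs. threaded rendering.
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter

import numpy as np
import yt


def make_ds(n=256):
    # Same kind of 2D cylindrical test dataset as in the benchmarks above.
    shp = (n, n, 1)
    bbox = np.array([[0, 1], [0, 2 * np.pi], [-0.5, 0.5]])
    data = {"density": np.random.random(shp)}
    return yt.load_uniform_grid(
        data, shp, bbox=bbox, geometry="cylindrical", axis_order=("r", "theta", "z")
    )


def render(ds):
    yt.SlicePlot(ds, "z", ("stream", "density")).render()


datasets = [make_ds() for _ in range(4)]

t0 = perf_counter()
for ds in datasets:
    render(ds)
print(f"serial:   {perf_counter() - t0:.2f} s")

t0 = perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(render, datasets))
print(f"threaded: {perf_counter() - t0:.2f} s")
```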