
PERF: release the GIL in cylindrical pixelizer #5018

Merged

Conversation

chrishavlin
Contributor

release the gil!

This surprisingly seemed to work with minimal effort... took the pixelization time from 53s to 27s for one example (see this slack thread).
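For reference, here's a minimal sketch of the pattern this relies on (illustrative only, not the actual yt routine; the function and names are made up): once everything touched inside the hot loop is a C scalar or a typed memoryview, the loop body can run inside a nogil block.

# cython: boundscheck=False, wraparound=False
# Illustrative sketch only -- shows the general shape of a Cython loop
# that can release the GIL: only C scalars and typed memoryviews are
# touched inside the nogil block.
cimport numpy as np

def scale_buffer(np.float64_t[:, :] buff, np.float64_t scale):
    cdef Py_ssize_t i, j
    with nogil:
        for i in range(buff.shape[0]):
            for j in range(buff.shape[1]):
                buff[i, j] = scale * buff[i, j]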

@neutrinoceros
Member

Releasing the GIL is likely a good idea here, but I also suspect that the abysmal performance in this routine may be fixed independently (possibly by reverting an earlier patch of mine that I'll dig up for my future review). Anyway, I'll try to review in detail in the coming days. Thanks!

@chrishavlin
Contributor Author

if you figure out a separate fix I'm happy to go with it and close this one!

@chrishavlin
Contributor Author

oh, especially since image tests seem to be failing and it was not as simple as i hoped :)

@chrishavlin
Contributor Author

ok, i think i see the issue but won't have time to fix it for a while... so will hold off in case you devise a separate performance fix.

@neutrinoceros
Member

oh wow these diffs are catastrophic. I'll try to tackle the performance issue from another angle and I'll move this to draft if you don't mind.

@neutrinoceros neutrinoceros marked this pull request as draft October 5, 2024 08:07
@matthewturk
Member

These lines:

                     ptbounds[0] = ptbounds[0] % (2*np.pi)
                     ptbounds[1] = ptbounds[1] % (2*np.pi)

are using Python math operations inside a tight loop. What if we used a constant for PI instead of np.pi, which I believe will do a python-level lookup?

@cphyc
Member

cphyc commented Oct 8, 2024

are using Python math operations inside a tight loop. What if we used a constant for PI instead of np.pi, which I believe will do a python-level lookup?

With the latest version, they are actually expanded to

(__pyx_v_ptbounds[0]) = fmod((__pyx_v_ptbounds[0]), (2.0 * __pyx_v_npPI));

so there might be a small speedup to be found by precomputing 2π ahead of time, but I'm not sure this is super important.

cphyc
cphyc previously approved these changes Oct 8, 2024
@matthewturk
Member

are using Python math operations inside a tight loop. What if we used a constant for PI instead of np.pi, which I believe will do a python-level lookup?

With the latest version, they are actually expanded to

(__pyx_v_ptbounds[0]) = fmod((__pyx_v_ptbounds[0]), (2.0 * __pyx_v_npPI));

so there might be a small speedup to be found by precomputing 2π ahead of time, but I'm not sure this is super important.

Huh. I guess I wasn't on the latest version! For me it did a ton of unpacking and python type coercion. I guess that's one of the advantages of upgrading, eh?

Are our performance tests being run on the latest version? (i.e., do you see the slowness, or is what I am referring to showing up in our benchmarks?)

Co-authored-by: Corentin Cadiou <[email protected]>
Co-authored-by: Corentin Cadiou <[email protected]>
@chrishavlin
Contributor Author

still expect the image diffs to fail -- i'll look later today and if it's a quick fix (which i think it is) I'll put it in and do some performance tests.

@@ -566,7 +566,8 @@ def pixelize_cylinder(np.float64_t[:,:] buff,
cdef np.float64_t r_inc, theta_inc
cdef np.float64_t costheta, sintheta
cdef int i, i1, pi, pj
cdef np.float64_t twoPI = 2 * np.pi
cdef np.float64_t onePI = np.pi
Member

@neutrinoceros neutrinoceros Oct 10, 2024

here's the direct way to "import" the constant from numpy at the C level

cdef extern from "numpy/npy_math.h":
    double NPY_PI

(taken from https://github.com/yt-project/yt/pull/4948/files#r1692841029)
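As a small sketch (hypothetical module-level code, not the exact patch here), the extern constant can then feed the precomputed 2*pi, so neither a Python-level np.pi lookup nor the multiplication happens inside the loop:

# Sketch: pull pi from numpy's C header (no Python-level np.pi lookup),
# then precompute 2*pi once, as a C double, outside the hot loop.
cdef extern from "numpy/npy_math.h":
    double NPY_PI

cdef double twoPI = 2.0 * NPY_PI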

Member

(NPY_2_PI is also available)

Contributor Author

hah, will import NPY_2_PI as well in a minute after tests run this round.... i think this one will pass...

Contributor Author

oh wait, NPY_2_PI is 2/pi, not 2*pi ?

Member

aaaah, missed that, sorry

@chrishavlin chrishavlin force-pushed the release_gil_in_pixelize_cylinder branch from 9c76d68 to ad92f03 on October 10, 2024 19:25
Comment on lines 668 to 672
if ptbounds[0] < 0:
ptbounds[0] += NPY_PI
ptbounds[1] = ptbounds[1] % twoPI
if ptbounds[1] < 0:
ptbounds[1] += NPY_PI
Contributor Author

so I'm not sure why I needed to add the check for negative numbers here, but after switching from ptbounds[1] % (2*np.pi) to ptbounds[1] % twoPI, the result returns numbers in the range (-pi,pi).

Member

I think we need to understand this bit, and possibly come up with a way to perform the computation that doesn't involve branching; hard-wiring a somewhat expensive calculation generally hurts performance less than ruining branch prediction.
I'm writing this comment early in my review but I'll try to come up with more insight.

Member

I'm taking a bet that the reason why you're seeing a change in behavior is that the function is decorated with @cython.cdivision(True): on main the Python % is used, while with twoPI you're getting a C operator instead.

http://docs.cython.org/en/latest/src/tutorial/caveats.html
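To make the difference concrete, here's a minimal, hypothetical standalone .pyx sketch (not yt code) contrasting the two behaviors for a negative angle:

# Sketch (illustrative only) of the two modulo semantics in play here.
cimport cython

@cython.cdivision(True)
def wrap_c(double theta, double period):
    # With cdivision(True) and C doubles, % compiles to C fmod, which keeps
    # the sign of the first operand: wrap_c(-0.5, 2*pi) == -0.5
    return theta % period

def wrap_py(theta, period):
    # Untyped (Python object) operands use Python's floored modulo, so the
    # result is always in [0, period) for period > 0:
    # wrap_py(-0.5, 2*pi) == 2*pi - 0.5
    return theta % period

That sign-preserving fmod behavior is consistent with the negative values that started showing up after switching to the C-level twoPI.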

Member

@neutrinoceros neutrinoceros Oct 17, 2024

Can you try this patch?

diff --git a/yt/utilities/lib/pixelization_routines.pyx b/yt/utilities/lib/pixelization_routines.pyx
index 194ce7af6..6728375d0 100644
--- a/yt/utilities/lib/pixelization_routines.pyx
+++ b/yt/utilities/lib/pixelization_routines.pyx
@@ -664,12 +664,8 @@ def pixelize_cylinder(np.float64_t[:,:] buff,
                             ptbounds[1] = fmax(ptbounds[1], corners[i1])
 
                         # shift to a [0, PI] interval
-                        ptbounds[0] = ptbounds[0] % twoPI
-                        if ptbounds[0] < 0:
-                            ptbounds[0] += NPY_PI
-                        ptbounds[1] = ptbounds[1] % twoPI
-                        if ptbounds[1] < 0:
-                            ptbounds[1] += NPY_PI
+                        ptbounds[0] = (ptbounds[0]+twoPI) % twoPI
+                        ptbounds[1] = (ptbounds[1]+twoPI) % twoPI
 
                         if prbounds[0] >= rmin and prbounds[1] <= rmax and \
                            ptbounds[0] >= tmin and ptbounds[1] <= tmax:

the result should be the same in all cases, but it avoids branching, so I expect it to be faster.
If this works, we'll want to add a comment for why it's done this way.

Member

I am wondering if we could use fmod here.

Contributor Author

Looks like in Cython, % is already an alias for fmod when operating on C doubles, and fmod returns a negative number when the first argument is negative (which is why the tests started failing here, I think). But I do think it's clearer to use fmod explicitly to avoid the confusion in the first place (so fmod(ptbounds[0] + twoPI, twoPI) to handle the negative numbers). In case it's helpful, I wrote up some scratch code to double-check; here's a plot:

[plot: comparison-cython-modulo]

@chrishavlin
Contributor Author

Here's a performance comparison between this branch and main, showing how the timing scales with increasing numbers of cells in r and theta.

Full code for the test case is over in this repo. The test dataset is a 2d cylindrical dataset built with:

import numpy as np
import yt

# `rs` was not defined in the original snippet; assuming a numpy random generator
rs = np.random.default_rng()


def get_2d_ds(n_r, n_theta):
    shp = (n_r, n_theta, 1)
    bbox = np.array([[0, 1], [0, np.pi * 2], [-0.5, 0.5]])
    data = {'density': rs.random(shp)}
    ds = yt.load_uniform_grid(data,
                              shp,
                              bbox=bbox,
                              geometry='cylindrical',
                              axis_order=('r', 'theta', 'z'))
    return ds

And I'm timing a SlicePlot here:

def plot_ds(ds):
    slc = yt.SlicePlot(ds, 'z', ('stream', 'density'))
    slc.render()

The first and second columns in the following table are the elapsed time in seconds, averaged over 10 iterations, for the main branch and this PR's branch; the final column is the percent speedup.

[table: results]

Average speedup over all resolutions is about 33%.

@chrishavlin chrishavlin marked this pull request as ready for review October 11, 2024 18:05
@matthewturk
Member

I like the comparison. I immediately thought, "what about in AMR?", then realized this is likely a better general-purpose diagnostic.

@neutrinoceros
Member

I looked through the most recent patches to this routine, from ~3 years ago, when I was focused on correctness and possibly neglecting performance. Here's what I found, in case it's relevant here:

#3782 and #3818 may have impacted performance, so it might be worth trying to revert them and see if the performance changes measurably.
If #3818 is costly, it may be argued that it should be reverted, as it really comes down to an arbitrary choice between two equally valid approaches, with converging results in the limit of infinite resolution.
#3782 actually seems more likely to have a negative impact, but it's a lot less trivial to decide what to do if so. In any case, I think it's worth investigating now; if performance reaches an acceptable level with this PR, maybe we don't need to do anything more. I'd just like it to be an informed decision.

@neutrinoceros neutrinoceros added this to the 4.4.0 milestone Oct 20, 2024
@neutrinoceros
Member

Given that I've postponed the release by ~10 days, for now there should be enough time to squeeze this in, though it's not a blocker.

@chrishavlin
Contributor Author

ya! I think so too -- I should have time today to try out your suggestion.

@chrishavlin
Contributor Author

pre-commit.ci autofix

@neutrinoceros
Member

Alright, I finally took an hour to try out the suggestions I made in #5018 (comment).

Here's the test script I used to measure performance:

import argparse
import json
from pathlib import Path
from time import monotonic_ns

import numpy as np

import yt


def get_2d_ds(n_r, n_theta):
    shp = (n_r, n_theta, 1)
    bbox = np.array([[0, 1], [0, np.pi * 2], [-0.5, 0.5]])
    data = {"density": np.random.random(shp)}
    ds = yt.load_uniform_grid(
        data, shp, bbox=bbox, geometry="cylindrical", axis_order=("r", "theta", "z")
    )
    return ds


def plot_ds(ds):
    slc = yt.SlicePlot(ds, "z", ("stream", "density"))
    slc.render()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--tag", type=str, help="prefix of a savefile")
    parser.add_argument(
        "--nit",
        dest="n_iter",
        nargs="?",
        type=int,
        default=1,
        help="number of iterations to run",
    )
    args = parser.parse_args()

    tstart = monotonic_ns()
    for it in range(args.n_iter):
        plot_ds(get_2d_ds(256, 256))
        print(f"{it=}/{args.n_iter}", end="\r")
    tstop = monotonic_ns()
    dt_ns = tstop - tstart
    dt_s = dt_ns / 1e9
    dt_s_per_iter = dt_s / args.n_iter
    print(f"Did {args.n_iter} iterations in {dt_s:.02f} s ({dt_s_per_iter:.02f} s/it)")

    results = {
        "tag": args.tag,
        "n_iter": args.n_iter,
        "dt_s": dt_s,
        "dt_s_per_iter": dt_s_per_iter,
    }
    SAVE_DIR = Path(__file__).parent / "results"
    SAVE_DIR.mkdir(exist_ok=True)
    with open(SAVE_DIR / f"{args.tag}.{args.n_iter}.json", "w") as fh:
        json.dump(results, fh, indent=2)

and my plotting script:

import json
from pathlib import Path

import matplotlib.pyplot as plt

DATA_DIR = Path(__file__).parent / "results"

fig, ax = plt.subplots()

files = sorted(DATA_DIR.glob("*.json"))
xlabels: list[str] = []
for ifile, file in enumerate(files):
    with file.open() as fh:
        data = json.load(fh)
    label = f"{data['tag']} (n_iter={data['n_iter']})"
    ax.scatter(ifile, data["dt_s_per_iter"], label=label)
ax.set(xticks=list(range(len(files))), ylabel="runtime per it (s)")
ax.legend()
sfile = DATA_DIR.parent / "viz.png"
print(f"saving to {sfile.resolve()}")
fig.savefig(sfile)

and the resulting viz:
[figure: runtime per iteration for each tagged run]
(ref is main as of 2e2032c)

So it seems that this PR almost compensates for a huge perf hit we took in #3818, but not quite.
Quoting myself:

If #3818 is costly, it may be argued that it should be reverted, as it really comes down to an arbitrary choice between two equally valid approaches, with converging results in the limit of infinite resolution.

so I think the actual priority is to revert #3818, and then we should explore if this PR still improves performance on top of it or not.

@neutrinoceros
Member

Here's the revert PR: #5034

@neutrinoceros
Member

Alright, to summarize the scattered discussion: I still think there should be a smarter way to restore optimal performance in this routine (which is what I'm trying to do in #5035), but so far I haven't been able to find a completely backward compatible solution. This patch is 100% backward compatible, so I'm choosing to include it in yt 4.4.0, and I'm hoping we can revisit soon after.
Thanks @chrishavlin, and sorry it's taken me so long to settle on your solution!

@neutrinoceros neutrinoceros changed the title release gil in pixelize_cylinder PERF: release the GIL in cylindrical pixelizer Nov 7, 2024
@neutrinoceros neutrinoceros merged commit 0ef58ff into yt-project:main Nov 7, 2024
12 checks passed