PERF: release the GIL in cylindrical pixelizer #5018
Conversation
Releasing the GIL is likely a good idea here, but I also suspect that the abysmal performance in this routine may be fixed independently (possibly by reverting an earlier patch of mine that I'll dig up in my future review). Anyway, I'll try to review in detail in the coming days. Thanks!
if you figure out a separate fix I'm happy to go with it and close this one!
oh, especially since image tests seem to be failing and it was not as simple as I hoped :)
ok, I think I see the issue but won't have time to fix it for a while... so will hold off in case you devise a separate performance fix.
oh wow, these diffs are catastrophic. I'll try to tackle the performance issue from another angle and I'll move this to draft if you don't mind.
These lines are using Python math operations inside a tight loop. What if we used a constant for PI instead of np.pi?
With the latest version, they are actually expanded to `(__pyx_v_ptbounds[0]) = fmod((__pyx_v_ptbounds[0]), (2.0 * __pyx_v_npPI));`, so there might be a small speedup to be found by precomputing 2π ahead of time, but I'm not sure this is super important.
Huh. I guess I wasn't on the latest version! For me it did a ton of unpacking and Python type coercion. I guess that's one of the advantages of upgrading, eh? Are our performance tests being run on the latest version? (i.e., do you see the slowness, or is what I am referring to showing up in our benchmarks?)
Co-authored-by: Corentin Cadiou <[email protected]>
still expect the image diffs to fail -- I'll look later today and if it's a quick fix (which I think it is) I'll put it in and do some performance tests.
```
@@ -566,7 +566,8 @@ def pixelize_cylinder(np.float64_t[:,:] buff,
     cdef np.float64_t r_inc, theta_inc
     cdef np.float64_t costheta, sintheta
     cdef int i, i1, pi, pj
     cdef np.float64_t twoPI = 2 * np.pi
     cdef np.float64_t onePI = np.pi
```
here's the direct way to "import" the constant from numpy at the C level:

```cython
cdef extern from "numpy/npy_math.h":
    double NPY_PI
```

(taken from https://github.com/yt-project/yt/pull/4948/files#r1692841029)
(NPY_2_PI is also available)
hah, will import NPY_2_PI as well in a minute after tests run this round... I think this one will pass...
oh wait, NPY_2_PI is 2/pi, not 2*pi?
aaaah, missed that, sorry
(force-pushed from 9c76d68 to ad92f03)
```cython
if ptbounds[0] < 0:
    ptbounds[0] += NPY_PI
ptbounds[1] = ptbounds[1] % twoPI
if ptbounds[1] < 0:
    ptbounds[1] += NPY_PI
```
so I'm not sure why I needed to add the check for negative numbers here, but after switching from `ptbounds[1] % (2*np.pi)` to `ptbounds[1] % twoPI`, the result returns numbers in the range (-pi, pi).
I think we need to understand this bit, and possibly come up with a way to perform the computation that doesn't involve branching; hard-wiring a somewhat expensive calculation generally hurts performance less than ruining branch prediction.
I'm writing this comment early in my review but I'll try to come up with more insight.
I'm taking a bet that the reason why you're seeing a change in behavior is that the function is decorated with `@cython.cdivision(True)`: on `main` the Python `%` is used, while with `twoPI` you're getting a C operator instead.
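For reference (not part of the PR), here's a minimal pure-Python sketch of the semantic difference being discussed: Python's `%` takes the sign of the divisor, while C's `fmod` (which is effectively what you get under `cdivision`) keeps the sign of the dividend.

```python
import math

two_pi = 2 * math.pi
for theta in (-1.0, -4.0, 1.0, 7.0):
    py_mod = theta % two_pi            # Python semantics: sign of the divisor, so the result is in [0, 2*pi)
    c_fmod = math.fmod(theta, two_pi)  # C semantics: sign of the dividend, so negative inputs stay negative
    print(f"theta={theta:+.1f}  %={py_mod:+.4f}  fmod={c_fmod:+.4f}")
```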
can you try this patch?

```diff
diff --git a/yt/utilities/lib/pixelization_routines.pyx b/yt/utilities/lib/pixelization_routines.pyx
index 194ce7af6..6728375d0 100644
--- a/yt/utilities/lib/pixelization_routines.pyx
+++ b/yt/utilities/lib/pixelization_routines.pyx
@@ -664,12 +664,8 @@ def pixelize_cylinder(np.float64_t[:,:] buff,
             ptbounds[1] = fmax(ptbounds[1], corners[i1])
 
         # shift to a [0, PI] interval
-        ptbounds[0] = ptbounds[0] % twoPI
-        if ptbounds[0] < 0:
-            ptbounds[0] += NPY_PI
-        ptbounds[1] = ptbounds[1] % twoPI
-        if ptbounds[1] < 0:
-            ptbounds[1] += NPY_PI
+        ptbounds[0] = (ptbounds[0]+twoPI) % twoPI
+        ptbounds[1] = (ptbounds[1]+twoPI) % twoPI
 
         if prbounds[0] >= rmin and prbounds[1] <= rmax and \
            ptbounds[0] >= tmin and ptbounds[1] <= tmax:
```
the result should be the same in all cases, but it avoids branching, so I expect it to be faster.
If this works, we'll want to add a comment for why it's done this way.
I am wondering if we could use `fmod` here.
looks like in cython, `%` is an alias to `fmod` already when using C functions, which returns a negative number when the first argument is negative (which is why the tests started failing here I think). But I do think it's clearer to use `fmod` to avoid the confusion in the first place (so `fmod(ptbounds[0]+twoPI, twoPI)` to handle the negative numbers). In case it's helpful, I wrote up some scratch code to double check; here's a plot:
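The original scratch code and plot aren't reproduced here, but the double-check it describes can be sketched in a few lines of plain Python (an illustration, not the author's code): `fmod(x + 2*pi, 2*pi)` wraps any angle above -2*pi into [0, 2*pi), whereas a bare `fmod` leaves negative inputs negative.

```python
import math

two_pi = 2 * math.pi
# Assumes inputs never go below -2*pi, which holds for the theta bounds discussed above.
for theta in (-math.pi, -0.5, 0.0, 0.5, math.pi, two_pi + 0.5):
    bare = math.fmod(theta, two_pi)              # can be negative
    wrapped = math.fmod(theta + two_pi, two_pi)  # always lands in [0, 2*pi)
    assert 0.0 <= wrapped < two_pi
    print(f"theta={theta:+.4f}  fmod={bare:+.4f}  wrapped={wrapped:+.4f}")
```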
Here's a performance comparison between this branch and main that compares the scaling with increasing number of cells in r and theta. Full code for the test case is over in this repo. The test dataset is a 2d cylindrical dataset built with:

```python
import numpy as np
import yt

rs = np.random.default_rng()  # RNG; not defined in the original snippet


def get_2d_ds(n_r, n_theta):
    shp = (n_r, n_theta, 1)
    bbox = np.array([[0, 1], [0, np.pi * 2], [-0.5, 0.5]])
    data = {
        'density': rs.random(shp)
    }
    ds = yt.load_uniform_grid(data,
                              shp,
                              bbox=bbox,
                              geometry='cylindrical',
                              axis_order=('r', 'theta', 'z'))
    return ds
```

And I'm timing a call to:

```python
def plot_ds(ds):
    slc = yt.SlicePlot(ds, 'z', ('stream', 'density'))
    slc.render()
```

The first and second columns in the following graph are the elapsed time in seconds averaged over 10 iterations for the two branches. Average speedup over all resolutions is about 33%.
I like the comparison. I immediately thought, "what about in AMR?", then realized this is likely a better general-purpose diagnostic.
I looked for the most recent patches to this routine from ~3 years ago, when I was focused on correctness and possibly neglecting performance. Here's what I found, in case it's relevant here: #3782 and #3818 may have impacted performance, so it might be worth trying to revert them and see if the performance changes measurably.
given that I've postponed the release by ~10 days, for now there should be enough time to squeeze this in, though it's not a blocker.
ya! I think so too -- I should have time today to try out your suggestion.
pre-commit.ci autofix
for more information, see https://pre-commit.ci
Alright, I finally took an hour to try out the suggestions I made in #5018 (comment). Here's the test script I used to measure perf:

```python
import argparse
import json
from pathlib import Path
from time import monotonic_ns

import numpy as np

import yt


def get_2d_ds(n_r, n_theta):
    shp = (n_r, n_theta, 1)
    bbox = np.array([[0, 1], [0, np.pi * 2], [-0.5, 0.5]])
    data = {"density": np.random.random(shp)}
    ds = yt.load_uniform_grid(
        data, shp, bbox=bbox, geometry="cylindrical", axis_order=("r", "theta", "z")
    )
    return ds


def plot_ds(ds):
    slc = yt.SlicePlot(ds, "z", ("stream", "density"))
    slc.render()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--tag", type=str, help="prefix of a savefile")
    parser.add_argument(
        "--nit",
        dest="n_iter",
        nargs="?",
        type=int,
        default=1,
        help="number of iterations to run",
    )
    args = parser.parse_args()

    tstart = monotonic_ns()
    for it in range(args.n_iter):
        plot_ds(get_2d_ds(256, 256))
        print(f"{it=}/{args.n_iter}", end="\r")
    tstop = monotonic_ns()

    dt_ns = tstop - tstart
    dt_s = dt_ns / 1e9
    dt_s_per_iter = dt_s / args.n_iter
    print(f"Did {args.n_iter} iterations in {dt_s:.02f} s ({dt_s_per_iter:.02f} s/it)")

    results = {
        "tag": args.tag,
        "n_iter": args.n_iter,
        "dt_s": dt_s,
        "dt_s_per_iter": dt_s_per_iter,
    }
    SAVE_DIR = Path(__file__).parent / "results"
    SAVE_DIR.mkdir(exist_ok=True)
    with open(SAVE_DIR / f"{args.tag}.{args.n_iter}.json", "w") as fh:
        json.dump(results, fh, indent=2)
```

and my plotting script:

```python
import json
from pathlib import Path

import matplotlib.pyplot as plt

DATA_DIR = Path(__file__).parent / "results"

fig, ax = plt.subplots()
files = sorted(DATA_DIR.glob("*.json"))
xlabels: list[str] = []
for ifile, file in enumerate(files):
    with file.open() as fh:
        data = json.load(fh)
    label = f"{data['tag']} (n_iter={data['n_iter']})"
    ax.scatter(ifile, data["dt_s_per_iter"], label=label)

ax.set(xticks=list(range(len(files))), ylabel="runtime per it (s)")
ax.legend()

sfile = DATA_DIR.parent / "viz.png"
print(f"saving to {sfile.resolve()}")
fig.savefig(sfile)
```

and the resulting viz. So it seems that this PR almost compensates for a huge perf hit we took in #3818, but not quite.
So I think the actual priority is to revert #3818, and then we should explore whether this PR still improves performance on top of it or not.
Here's the revert PR: #5034
Alright, to summarize the scattered discussion: I still think there should be a smarter way to restore optimal performance in this routine (which is what I'm trying to do in #5035), but so far I haven't been able to find a completely backward compatible solution. This patch is 100% backward compatible, so I'm choosing to include it in yt 4.4.0, and I'm hoping we can revisit soon after.
release the gil!
This surprisingly seemed to work with minimal effort... took the pixelization time from 53s to 27s for one example (see this slack thread).
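For context (not part of the PR), one rough way to see the effect of releasing the GIL from pure Python is to render several slices sequentially and then from a thread pool: if the pixelizer holds the GIL the threads serialize, and if it releases it the pixelization work can overlap. This is a minimal sketch under the assumption that the yt/matplotlib calls involved tolerate being driven from threads (they are not guaranteed to be thread-safe), so treat it purely as an illustration:

```python
# Hypothetical GIL-release check (not from the PR): compare serial vs. threaded rendering.
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter

import numpy as np
import yt


def make_ds(n=256):
    # Same kind of 2D cylindrical test dataset as in the benchmarks above.
    shp = (n, n, 1)
    bbox = np.array([[0, 1], [0, 2 * np.pi], [-0.5, 0.5]])
    data = {"density": np.random.random(shp)}
    return yt.load_uniform_grid(
        data, shp, bbox=bbox, geometry="cylindrical", axis_order=("r", "theta", "z")
    )


def render(ds):
    yt.SlicePlot(ds, "z", ("stream", "density")).render()


datasets = [make_ds() for _ in range(4)]

t0 = perf_counter()
for ds in datasets:
    render(ds)
print(f"serial:   {perf_counter() - t0:.2f} s")

t0 = perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(render, datasets))
print(f"threaded: {perf_counter() - t0:.2f} s")
```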