Performance optimisation of point-to-point distance #1066

martinfleis · 2024-04-15T08:36:10Z

I've been wondering, if there is a space to optimise the distance measurement when all geometries are simple points. When using shapely and extracting coordinates to compute distance using numpy, I can get about 50x speedup.

See the example of measuring distance between 4k of points as posted in shapely/shapely#2038.

import shapely
import numpy as np

points = shapely.points(np.random.rand(4000, 2))

%%timeit
geos = shapely.distance(points, points.reshape(-1, 1))
# 4 s ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# linalg.norm is convenient and commonly used for this
assert (shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
linalg = np.linalg.norm(coords[:, None] - coords, axis=2)
# 328 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# but we can also be explicit, and faster
(shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
sqrt = np.sqrt((coords[:, None][:, :, 0] - coords[:,0])**2 + (coords[:, None][:, :,1] - coords[:,1])**2)
# 71.2 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

While we could do this in shapely, maybe there's a space to do it in GEOS?

kadyb · 2024-04-15T09:01:43Z

In R-spatial, we also noted that some alternatives can be faster than GEOS.

library("geos") # 3.11.1
library("Rfast")

n = 4000
df = data.frame(x = rnorm(n), y = rnorm(n))
pts = as_geos_geometry(sf::st_as_sf(df, coords = c("x", "y")))

geos_dist = function(x) {
  mat = matrix(0.0, nrow = length(x), ncol = length(x))
  for (i in seq_along(x)) mat[i, ] = geos::geos_distance(x[i], x)
  return(mat)
}

bench::mark(
  check = FALSE,
  geos = geos_dist(pts),
  R = as.matrix(dist(df, method = "euclidean")),
  Rfast = Rfast::Dist(df, method = "euclidean")
)
#>   expression      min   median `itr/sec` mem_alloc
#> 1 geos          6.87s    6.87s     0.146    1.01GB
#> 2 R          465.05ms 477.14ms     2.10   671.42MB
#> 3 Rfast       56.47ms  60.69ms    15.2     122.2MB

dr-jts · 2024-04-15T14:05:35Z

Is this requirement for the specific case of computing a matrix of all distances between points in a set? If so, that's not a use case for which GEOS is designed, and it would require a different kind of API and algorithm.

If I understand Shapely correctly, the loops to compute the distances between all points in the set is done on the Shapely side, not in GEOS.

dr-jts · 2024-04-15T14:05:41Z

Do you know why the numpy algorithm is so much faster? Is it using vectorized arithmetic? Or is it just avoiding overhead of dealing with geometries rather than raw numbers?

martinfleis · 2024-04-15T14:33:01Z

Is this requirement for the specific case of computing a matrix of all distances between points in a set?

No, the same applies to many-to-one measurement (distance from each point in array to a single point), with slightly different ratios:

import shapely
import numpy as np

points = shapely.points(np.random.rand(4000, 2))

%%timeit
geos = shapely.distance(points, points[0])
# 992 µs ± 4.92 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
# linalg.norm is convenient and commonly used for this
assert (shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
linalg = np.linalg.norm(coords - coords[0], axis=1)
# 260 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
# but we can also be explicit, and faster
assert (shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
sqrt = np.sqrt((coords[:, 0] - coords[0][0])**2 + (coords[:, 1] - coords[0][1])**2)
# 221 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Do you know why the numpy algorithm is so much faster? Is it using vectorized arithmetic? Or is it just avoiding overhead of dealing with geometries rather than raw numbers?

AFAIK Numpy is using vectorized operations here.

It may be the case that this sort of improvement does not belong to GEOS and shall be implemented in upstream where numpy (or some other stuff) can be be efficiently used when dealing with arrays.

dr-jts · 2024-04-15T14:45:24Z

No, the same applies to many-to-one measurement (distance from each point in array to a single point), with slightly different ratios:

Er, actually my comment was slightly too specific. The key distinction is that these tests are computing many distances, whereas the current GEOS API targets the case of computing a single distance. There's quite a bit of overhead involved with getting from Python to GEOS (I believe), so perhaps that's the source of the performance difference.

AFAIK Numpy is using vectorized operations here.

It may be the case that this sort of improvement does not belong to GEOS and shall be implemented in upstream where numpy (or some other stuff) can be be efficiently used when dealing with arrays.

Yes, that sounds like the best approach.

martinfleis · 2024-04-15T15:11:07Z

Cool thanks! Will close it here and follow up on this in shapely.

dbaston · 2024-04-15T15:35:55Z

AFAIU Shapely already has the points stored as GEOS geometries, so I wouldn't expect much overhead in doing an operation on them. I think #1067 should make a noticeable difference here.

martinfleis mentioned this issue Apr 15, 2024

More complex benchmarks kadyb/vector-benchmark#13

Open

dr-jts added the Feedback Requested label Apr 15, 2024

martinfleis closed this as completed Apr 15, 2024

martinfleis mentioned this issue Apr 15, 2024

fast-path implementation of point-to-point distance shapely/shapely#2038

Open

dbaston mentioned this issue Apr 15, 2024

DistanceOp: Add short-circuit for point-point distance #1067

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance optimisation of point-to-point distance #1066

Performance optimisation of point-to-point distance #1066

martinfleis commented Apr 15, 2024

kadyb commented Apr 15, 2024

dr-jts commented Apr 15, 2024

dr-jts commented Apr 15, 2024

martinfleis commented Apr 15, 2024

dr-jts commented Apr 15, 2024

martinfleis commented Apr 15, 2024

dbaston commented Apr 15, 2024

Performance optimisation of point-to-point distance #1066

Performance optimisation of point-to-point distance #1066

Comments

martinfleis commented Apr 15, 2024

kadyb commented Apr 15, 2024

dr-jts commented Apr 15, 2024

dr-jts commented Apr 15, 2024

martinfleis commented Apr 15, 2024

dr-jts commented Apr 15, 2024

martinfleis commented Apr 15, 2024

dbaston commented Apr 15, 2024