Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance optimisation of point-to-point distance #1066

Closed
martinfleis opened this issue Apr 15, 2024 · 7 comments
Closed

Performance optimisation of point-to-point distance #1066

martinfleis opened this issue Apr 15, 2024 · 7 comments

Comments

@martinfleis
Copy link

I've been wondering, if there is a space to optimise the distance measurement when all geometries are simple points. When using shapely and extracting coordinates to compute distance using numpy, I can get about 50x speedup.

See the example of measuring distance between 4k of points as posted in shapely/shapely#2038.

import shapely
import numpy as np

points = shapely.points(np.random.rand(4000, 2))

%%timeit
geos = shapely.distance(points, points.reshape(-1, 1))
# 4 s ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# linalg.norm is convenient and commonly used for this
assert (shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
linalg = np.linalg.norm(coords[:, None] - coords, axis=2)
# 328 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# but we can also be explicit, and faster
(shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
sqrt = np.sqrt((coords[:, None][:, :, 0] - coords[:,0])**2 + (coords[:, None][:, :,1] - coords[:,1])**2)
# 71.2 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

While we could do this in shapely, maybe there's a space to do it in GEOS?

@kadyb
Copy link
Contributor

kadyb commented Apr 15, 2024

In R-spatial, we also noted that some alternatives can be faster than GEOS.

library("geos") # 3.11.1
library("Rfast")

n = 4000
df = data.frame(x = rnorm(n), y = rnorm(n))
pts = as_geos_geometry(sf::st_as_sf(df, coords = c("x", "y")))

geos_dist = function(x) {
  mat = matrix(0.0, nrow = length(x), ncol = length(x))
  for (i in seq_along(x)) mat[i, ] = geos::geos_distance(x[i], x)
  return(mat)
}

bench::mark(
  check = FALSE,
  geos = geos_dist(pts),
  R = as.matrix(dist(df, method = "euclidean")),
  Rfast = Rfast::Dist(df, method = "euclidean")
)
#>   expression      min   median `itr/sec` mem_alloc
#> 1 geos          6.87s    6.87s     0.146    1.01GB
#> 2 R          465.05ms 477.14ms     2.10   671.42MB
#> 3 Rfast       56.47ms  60.69ms    15.2     122.2MB

@dr-jts
Copy link
Contributor

dr-jts commented Apr 15, 2024

Is this requirement for the specific case of computing a matrix of all distances between points in a set? If so, that's not a use case for which GEOS is designed, and it would require a different kind of API and algorithm.

If I understand Shapely correctly, the loops to compute the distances between all points in the set is done on the Shapely side, not in GEOS.

@dr-jts
Copy link
Contributor

dr-jts commented Apr 15, 2024

Do you know why the numpy algorithm is so much faster? Is it using vectorized arithmetic? Or is it just avoiding overhead of dealing with geometries rather than raw numbers?

@martinfleis
Copy link
Author

Is this requirement for the specific case of computing a matrix of all distances between points in a set?

No, the same applies to many-to-one measurement (distance from each point in array to a single point), with slightly different ratios:

import shapely
import numpy as np

points = shapely.points(np.random.rand(4000, 2))

%%timeit
geos = shapely.distance(points, points[0])
# 992 µs ± 4.92 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
# linalg.norm is convenient and commonly used for this
assert (shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
linalg = np.linalg.norm(coords - coords[0], axis=1)
# 260 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
# but we can also be explicit, and faster
assert (shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
sqrt = np.sqrt((coords[:, 0] - coords[0][0])**2 + (coords[:, 1] - coords[0][1])**2)
# 221 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Do you know why the numpy algorithm is so much faster? Is it using vectorized arithmetic? Or is it just avoiding overhead of dealing with geometries rather than raw numbers?

AFAIK Numpy is using vectorized operations here.

It may be the case that this sort of improvement does not belong to GEOS and shall be implemented in upstream where numpy (or some other stuff) can be be efficiently used when dealing with arrays.

@dr-jts
Copy link
Contributor

dr-jts commented Apr 15, 2024

No, the same applies to many-to-one measurement (distance from each point in array to a single point), with slightly different ratios:

Er, actually my comment was slightly too specific. The key distinction is that these tests are computing many distances, whereas the current GEOS API targets the case of computing a single distance. There's quite a bit of overhead involved with getting from Python to GEOS (I believe), so perhaps that's the source of the performance difference.

AFAIK Numpy is using vectorized operations here.

It may be the case that this sort of improvement does not belong to GEOS and shall be implemented in upstream where numpy (or some other stuff) can be be efficiently used when dealing with arrays.

Yes, that sounds like the best approach.

@martinfleis
Copy link
Author

Cool thanks! Will close it here and follow up on this in shapely.

@dbaston
Copy link
Member

dbaston commented Apr 15, 2024

AFAIU Shapely already has the points stored as GEOS geometries, so I wouldn't expect much overhead in doing an operation on them. I think #1067 should make a noticeable difference here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants