-
Notifications
You must be signed in to change notification settings - Fork 365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance optimisation of point-to-point distance #1066
Comments
In R-spatial, we also noted that some alternatives can be faster than GEOS. library("geos") # 3.11.1
library("Rfast")
n = 4000
df = data.frame(x = rnorm(n), y = rnorm(n))
pts = as_geos_geometry(sf::st_as_sf(df, coords = c("x", "y")))
geos_dist = function(x) {
mat = matrix(0.0, nrow = length(x), ncol = length(x))
for (i in seq_along(x)) mat[i, ] = geos::geos_distance(x[i], x)
return(mat)
}
bench::mark(
check = FALSE,
geos = geos_dist(pts),
R = as.matrix(dist(df, method = "euclidean")),
Rfast = Rfast::Dist(df, method = "euclidean")
)
#> expression min median `itr/sec` mem_alloc
#> 1 geos 6.87s 6.87s 0.146 1.01GB
#> 2 R 465.05ms 477.14ms 2.10 671.42MB
#> 3 Rfast 56.47ms 60.69ms 15.2 122.2MB |
Is this requirement for the specific case of computing a matrix of all distances between points in a set? If so, that's not a use case for which GEOS is designed, and it would require a different kind of API and algorithm. If I understand Shapely correctly, the loops to compute the distances between all points in the set is done on the Shapely side, not in GEOS. |
Do you know why the |
No, the same applies to many-to-one measurement (distance from each point in array to a single point), with slightly different ratios: import shapely
import numpy as np
points = shapely.points(np.random.rand(4000, 2))
%%timeit
geos = shapely.distance(points, points[0])
# 992 µs ± 4.92 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
# linalg.norm is convenient and commonly used for this
assert (shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
linalg = np.linalg.norm(coords - coords[0], axis=1)
# 260 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
# but we can also be explicit, and faster
assert (shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
sqrt = np.sqrt((coords[:, 0] - coords[0][0])**2 + (coords[:, 1] - coords[0][1])**2)
# 221 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
AFAIK Numpy is using vectorized operations here. It may be the case that this sort of improvement does not belong to GEOS and shall be implemented in upstream where numpy (or some other stuff) can be be efficiently used when dealing with arrays. |
Er, actually my comment was slightly too specific. The key distinction is that these tests are computing many distances, whereas the current GEOS API targets the case of computing a single distance. There's quite a bit of overhead involved with getting from Python to GEOS (I believe), so perhaps that's the source of the performance difference.
Yes, that sounds like the best approach. |
Cool thanks! Will close it here and follow up on this in shapely. |
AFAIU Shapely already has the points stored as GEOS geometries, so I wouldn't expect much overhead in doing an operation on them. I think #1067 should make a noticeable difference here. |
I've been wondering, if there is a space to optimise the distance measurement when all geometries are simple points. When using shapely and extracting coordinates to compute distance using numpy, I can get about 50x speedup.
See the example of measuring distance between 4k of points as posted in shapely/shapely#2038.
While we could do this in shapely, maybe there's a space to do it in GEOS?
The text was updated successfully, but these errors were encountered: