Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use stats::dist() to calculate Euclidean distances? #1874

Closed
kadyb opened this issue Jan 4, 2022 · 3 comments
Closed

Use stats::dist() to calculate Euclidean distances? #1874

kadyb opened this issue Jan 4, 2022 · 3 comments

Comments

@kadyb
Copy link
Contributor

kadyb commented Jan 4, 2022

I have ~50,000 points and would like to calculate the Euclidean distance matrix. Currently st_distance() seems to be the slowest way to do it (below is the benchmark). Could you consider using stats::dist() to speed up this operation? Here is also Rfast::Dist, written in C++, which is super fast. fields::rdist is pretty fast too.

library(sp)
library(sf)
library(terra)
library(Rfast)
library(rgeos)
library(fields)

n = 4000
df = data.frame(x = runif(n, 171000, 861000), y = runif(n, 133000, 775000))

pts_sf = st_as_sf(df, coords = c("x", "y"), crs = "epsg:2180")
pts_terra = vect(pts_sf)
pts_sp = as(pts_sf, "Spatial")

results = bench::mark(
  iterations = 30, check = FALSE, time_unit = "s",
  sf = sf::st_distance(pts_sf, which = "Euclidean"),
  terra = as.matrix(terra::distance(pts_terra)),
  stats = as.matrix(stats::dist(df, method = "euclidean")),
  Rfast = Rfast::Dist(df, method = "euclidean"),
  rgeos = rgeos::gDistance(pts_sp, byid = TRUE),
  sp = sp::spDists(pts_sp),
  fields = fields::rdist(df)
)
results
#> expression   min  median `itr/sec` mem_alloc `gc/sec`
#> 1 sf         9.83   9.88      0.100    245MB    0.07
#> 2 terra      1.01   1.06      0.917    580MB    2.05
#> 3 stats      1.01   1.04      0.930    580MB    2.11
#> 4 Rfast      0.16   0.17      5.120    122MB    2.39
#> 5 rgeos      4.65   4.70      0.213    122MB    0.10
#> 6 sp         1.24   1.27      0.781    1.31GB   5.47
#> 7 fields     0.18   0.19      4.830    122MB    1.45
@rsbivand
Copy link
Member

rsbivand commented Jan 4, 2022

For completeness:

pts_sp <- as(pts_sf, "Spatial")
results = bench::mark(
  iterations = 30, check = FALSE, time_unit = "s",
  sf = sf::st_distance(pts_sf, which = "Euclidean"),
  terra = as.matrix(terra::distance(pts_terra)),
  stats = as.matrix(stats::dist(df, method = "euclidean")),
  Rfast = Rfast::Dist(df, method = "euclidean"),
  rgeos = rgeos::gDistance(pts_sp, byid=TRUE),
  sp = sp::spDists(pts_sp),
)
results[, 1:6]
# A tibble: 6 × 6
#  expression   min median `itr/sec` mem_alloc `gc/sec`
#  <bch:expr> <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#1 sf         4.73   5.17      0.194        NA    0.200
#2 terra      0.706  0.834     1.20         NA    2.53 
#3 stats      0.692  0.786     1.25         NA    3.47 
#4 Rfast      0.107  0.120     7.69         NA    3.59 
#5 rgeos      2.10   2.21      0.427        NA    0.213
#6 sp         0.717  0.777     1.25         NA    7.08

edzer added a commit that referenced this issue Jan 5, 2022
@edzer
Copy link
Member

edzer commented Jan 5, 2022

OK, this branch gives us

library(sf)
# Linking to GEOS 3.10.1, GDAL 3.4.0, PROJ 8.2.0; sf_use_s2() is TRUE

n = 4000
df = data.frame(x = runif(n, 171000, 861000), y = runif(n, 133000, 775000))

pts_sf = st_as_sf(df, coords = c("x", "y"), crs = "epsg:2180")

results = bench::mark(
  iterations = 30, check = FALSE, time_unit = "s",
  sf = sf::st_distance(pts_sf, which = "Euclidean"),
  stats = as.matrix(stats::dist(st_coordinates(pts_sf), method = "euclidean")),
)
# Warning message:
# Some expressions had a GC in every iteration; so filtering is disabled. 
results[, 1:6]
# # A tibble: 2 × 6
#   expression   min median `itr/sec` mem_alloc `gc/sec`
#   <bch:expr> <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
# 1 sf         0.589  0.595      1.67     581MB     3.45
# 2 stats      0.511  0.534      1.84     580MB     3.75

@kadyb
Copy link
Contributor Author

kadyb commented Apr 16, 2024

Dan Baston has just implemented patch in GEOS that improve the performance of point-to-point calculations. In the future, it would be interesting to see what is faster:
a) dist() with the as.matrix() optimization patch in R 4.4
b) new GEOS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants