Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative approach to build a transfer table #114

Closed
FlxPo opened this issue Jan 31, 2024 · 8 comments
Closed

Alternative approach to build a transfer table #114

FlxPo opened this issue Jan 31, 2024 · 8 comments

Comments

@FlxPo
Copy link

FlxPo commented Jan 31, 2024

I tried to use gtfs_transfer_table on a big GTFS dataset, with 46 000 stops (from IDFM, for the Ile-de-France region : https://eu.ftp.opendatasoft.com/stif/GTFS/IDFM-gtfs.zip). A transfer table is already provided, but i needed to recompute them to be able to merge this feed with other feeds.

For now gtfs_transfer_table seems to compute all pairwise distances between stops with geodist, which results in a out of memory error given the high number of stops.

Here is an alternative approach using sf which is quite fast, adapted to my needs but that could return exactly the same thing as gtfs_transfer_table :

transfer_table <- function(gtfs, d_limit = 200, crs = 2154) {
  
  # Convert the GTFS stops data.table to sf to be able perform spatial operations efficiently
  stops_xy <- sfheaders::sf_point(gtfs$stops, x = "stop_lon", y = "stop_lat", keep = TRUE)
  stops_xy$stop_index <- 1:nrow(stops_xy)
  
  # Project the data according to the local CRS
  st_crs(stops_xy) <- 4326
  stops_xy <- st_transform(stops_xy, crs)
  
  # Find which stops are within d_limit meters of each stop
  stops_xy_buffer <- st_buffer(stops_xy, 400)
  
  intersects <- st_intersects(stops_xy, stops_xy_buffer)
  
  intersects <- lapply(seq_along(intersects), function(i) {
    data.table(stop_index = i, neighbor_stop_index = intersects[[i]])
  })
  
  intersects <- rbindlist(intersects)
  intersects <- intersects[stop_index != neighbor_stop_index]
  
  # Compute the crow fly distance and travel times between neighboring stops
  # (travel times formula is based on a linear model fitted on IDFM data)
  stops_xy_coords <- as.data.table(st_coordinates(stops_xy))
  stops_xy_coords[, stop_index := stops_xy$stop_index]
  
  transfers <- merge(intersects, stops_xy_coords, by = "stop_index")
  transfers <- merge(transfers, stops_xy_coords, by.x = "neighbor_stop_index", by.y = "stop_index", suffixes = c("_from", "_to"))
  
  transfers[, distance := sqrt((Y_to - Y_from)^2 + (X_to - X_from)^2)]
  transfers[, min_transfer_time := 31 + 1.125*distance]
  transfers[, transfer_type := 2]
  
  # Format the data as expected for the transfers table
  stops_index_id <- as.data.table(st_drop_geometry(stops_xy[, c("stop_id", "stop_index")]))
  
  transfers <- merge(transfers, stops_index_id, by = "stop_index")
  transfers <- merge(transfers, stops_index_id, by.x = "neighbor_stop_index", by.y = "stop_index", suffixes = c("_from", "_to"))
  
  setnames(transfers, c("stop_id_from", "stop_id_to"), c("from_stop_id", "to_stop_id"))
  
  transfers <- transfers[, list(from_stop_id, to_stop_id, min_transfer_time)]
  
  return(transfers)
  
}

It would require taking a dependency on sf.

@mpadge
Copy link
Member

mpadge commented Jan 31, 2024

Thanks @FlxPo. Strange that you're getting out-of-memory errors. That shouldn't happen, and definitely not from distance calculations, which are all internalised here in C++ code, and don't use geodist at all. (It's only used in other functions, not transfer table calculation.) I routinely calculate transfer tables between over 30,000 stops and have no issues. I'll try on the feed you've linked to and see what i get ...

@FlxPo
Copy link
Author

FlxPo commented Jan 31, 2024

I just realized I was using an old version of gtfsrouter, I will do some tests to see if I still get an error.

@mpadge
Copy link
Member

mpadge commented Jan 31, 2024

No, the error is real. It comes from data.table. I've already fixed it, and will push in a moment

@mpadge mpadge closed this as completed in d6d0abf Jan 31, 2024
@mpadge
Copy link
Member

mpadge commented Jan 31, 2024

Let me know if that's okay now. See Rdatatable/data.table issue 5676 for background - data.table has a pile of issues, of which this is clearly one, and will slowly start moving towards being better maintained soon.

@mpadge
Copy link
Member

mpadge commented Jan 31, 2024

I'm always interested in opportunities for sf comparisons, and so here it is:

library (gtfsrouter)
library(data.table)

transfer_table <- function(gtfs, d_limit = 200, crs = 2154) {
  # ... your function from above ...
}

path <- "./feeds/helsinki.zip"
gtfs <- extract_gtfs (path)
#> ▶ Unzipping GTFS archive✔ Unzipped GTFS archive  
#> ▶ Extracting GTFS feed✔ Extracted GTFS feed 
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds 
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds

system.time (
    t1 <- gtfs_transfer_table (gtfs, d_limit = 200)
)
#>    user  system elapsed 
#>   2.131   0.004   2.055

system.time (
    t2 <- transfer_table(gtfs, d_limit = 200)
)
#>    user  system elapsed 
#>  95.689   0.449  95.685

Created on 2024-01-31 with reprex v2.1.0

So basically 50 times faster than sf.

@FlxPo
Copy link
Author

FlxPo commented Jan 31, 2024

I just tested the latest development version on the [Helsinki GTFS](https://www.hsl.fi/en/hsl/open-data, is this the one you use ?). This is weird, my function seems a little bit faster (and I cannot reproduce the 50x difference).

library(gtfsrouter)
library(sf)
#> Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
library(data.table)

transfer_table <- function(gtfs, d_limit = 200, crs = 2154) {
  
  # Convert the GTFS stops data.table to sf to be able perform spatial operations efficiently
  stops_xy <- sfheaders::sf_point(gtfs$stops, x = "stop_lon", y = "stop_lat", keep = TRUE)
  stops_xy$stop_index <- 1:nrow(stops_xy)
  
  # Project the data according to the local CRS
  st_crs(stops_xy) <- 4326
  stops_xy <- st_transform(stops_xy, crs)
  
  # Find which stops are within d_limit meters of each stop
  stops_xy_buffer <- st_buffer(stops_xy, 400)
  
  intersects <- st_intersects(stops_xy, stops_xy_buffer)
  
  intersects <- lapply(seq_along(intersects), function(i) {
    data.table(stop_index = i, neighbor_stop_index = intersects[[i]])
  })
  
  intersects <- rbindlist(intersects)
  intersects <- intersects[stop_index != neighbor_stop_index]
  
  # Compute the crow fly distance and travel times between neighboring stops
  # (travel times formula is based on a linear model fitted on IDFM data)
  stops_xy_coords <- as.data.table(st_coordinates(stops_xy))
  stops_xy_coords[, stop_index := stops_xy$stop_index]
  
  transfers <- merge(intersects, stops_xy_coords, by = "stop_index")
  transfers <- merge(transfers, stops_xy_coords, by.x = "neighbor_stop_index", by.y = "stop_index", suffixes = c("_from", "_to"))
  
  transfers[, distance := sqrt((Y_to - Y_from)^2 + (X_to - X_from)^2)]
  transfers[, min_transfer_time := 31 + 1.125*distance]
  transfers[, transfer_type := 2]
  
  # Format the data as expected for the transfers table
  stops_index_id <- as.data.table(st_drop_geometry(stops_xy[, c("stop_id", "stop_index")]))
  
  transfers <- merge(transfers, stops_index_id, by = "stop_index")
  transfers <- merge(transfers, stops_index_id, by.x = "neighbor_stop_index", by.y = "stop_index", suffixes = c("_from", "_to"))
  
  setnames(transfers, c("stop_id_from", "stop_id_to"), c("from_stop_id", "to_stop_id"))
  
  transfers <- transfers[, list(from_stop_id, to_stop_id, min_transfer_time)]
  
  return(transfers)
  
}


gtfs_file_path <- "D:/data/mobility/data/gtfs/hsl.zip"

gtfs <- extract_gtfs(gtfs_file_path)
#> ▶ Unzipping GTFS archive
#> ✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed✔ Extracted GTFS feed 
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds 
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds

system.time (
  t1 <- gtfs_transfer_table (gtfs, d_limit = 200)
)
#> utilisateur     système      écoulé 
#>        4.62        0.03        4.66

system.time (
  t2 <- transfer_table(gtfs, d_limit = 200)
)
#> utilisateur     système      écoulé 
#>        1.76        0.12        1.90

Created on 2024-01-31 with reprex v2.1.0

@mpadge
Copy link
Member

mpadge commented Jan 31, 2024

Oh, the difference is that i commented out the st_transform(stops_xy, 2154) line, because distances can and should be calculated directly on the ellipse in 4326 coordinates, which is what gtfsrouter does. Re-projecting on to a plane is obviously much faster, but there is no universal way to do that that won't decrease accuracy of the routine. If you remove that line (and add a d_limit <- unit::set_units(d_limit, m)) then you should see similar times to my example.

@FlxPo
Copy link
Author

FlxPo commented Jan 31, 2024

Thanks for the precision. For the record I'm not sure I see why distances should be calculated from lat/lon coordinates (at least at a regional scale), and I wouldn't mind specifying the CRS for each of my projects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants