-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use local locations DB #104
Conversation
@@ -17,6 +17,8 @@ | |||
class HtmlParserCogent1(Html): | |||
"""Notifications Parser for Cogent notifications.""" | |||
|
|||
_geolocator = Geolocator() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be a module-wide variable rather than specific to the Cogent parser class? We might have other parsers that need to use this as well in the future and it doesn't make sense to duplicate it unnecessarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totally, good point. Even for now it's good, it's better to be prepared for reusability.
circuit_maintenance_parser/utils.py
Outdated
|
||
if self.db_location is not None: | ||
# 1. We try to match city name and country | ||
location = self.db_location.query(f"city_ascii == '{city_name}' & country == '{country}'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is pandas overkill for this problem? I wonder if we could instead do something relatively simple with the built-in csv
library, something like:
self.db_location = {}
with open('worldcities.csv') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
# Index by city and country
self.db_location[(row["city_ascii"], row["country"])] = (row["lat"], row["lng"])
# Index by city (first entry wins if duplicated names)
if row["city_ascii"] not in db_location:
self.db_location[row["city_ascii"]] = (row["lat"], row["lng"])
...
lat, lng = self.db_location.get((city_name, country), self.db_location.get(city_name, None))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it may be. I will try this implementation and if the time penalty is not significant, I also prefer to get rid of this dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'd second that. Ideally we could use DictReader
in the csv
module. The pandas
lib is quite heavy with more dependencies.
circuit_maintenance_parser/utils.py
Outdated
city_name = city.split(", ")[0] | ||
country = city.split(", ")[-1] | ||
|
||
if self.db_location is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this check? It doesn't look like load_db_location()
has any case where it'd fail to populate self.db_location
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typing check 😓
Args: | ||
city (str): Geographic location name | ||
""" | ||
if self.timezone is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this check needed? Same reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as before, maybe it assumes that the method is available for class, no instance and fails the validation
circuit_maintenance_parser/utils.py
Outdated
# If even with the location we can't find the timezone, we try getting the location | ||
# from API. | ||
latitude, longitude = self.get_location_from_api(city) | ||
timezone = self.timezone.tzNameAt(latitude, longitude) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't be needed since get_location
already falls through on its own to calling get_location_from_api
if get_location_from_local_file
fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in theory yes, but I found a case, with the original DB data, that was resolving to Sydney coordinates and the timezone resolution was returning None
, maybe due some bug on that library, because with the coordinates from the API was working well (some decimals of difference)
("Barcelona, Spain", "Europe/Madrid"), | ||
("Sydney, Australia", "Australia/Sydney"), | ||
("Guadalajara, Spain", "Europe/Madrid"), | ||
("Guadalajara, Mexico", "America/Mexico_City"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just confirming - are these a mix of "city + country matches CSV" and "city alone matches CSV" so that we are testing both cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly, this test was to validate that case, we try to get the city + country first to avoid ambiguities
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good once we swap out pandas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one!
Use a 41K locations CSV file to resolve location coordinates locally if possible before trying external API queries if not present.
Other enhancements:
backoff
library to handle retries to external DBpandas
(latestpython3.6
compatible version) to handle the local CSV DBGeolocator
that initializes the local DB and the timezone resolver once for all.