Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use local locations DB #104

Merged
merged 8 commits into from
Nov 4, 2021
Merged

Use local locations DB #104

merged 8 commits into from
Nov 4, 2021

Conversation

chadell
Copy link
Collaborator

@chadell chadell commented Oct 30, 2021

Use a 41K locations CSV file to resolve location coordinates locally if possible before trying external API queries if not present.

Other enhancements:

  • Using backoff library to handle retries to external DB
  • Using pandas(latest python3.6 compatible version) to handle the local CSV DB
  • Refactor the geolocation into a class Geolocator that initializes the local DB and the timezone resolver once for all.

@chadell chadell marked this pull request as ready for review November 2, 2021 06:45
@@ -17,6 +17,8 @@
class HtmlParserCogent1(Html):
"""Notifications Parser for Cogent notifications."""

_geolocator = Geolocator()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be a module-wide variable rather than specific to the Cogent parser class? We might have other parsers that need to use this as well in the future and it doesn't make sense to duplicate it unnecessarily.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally, good point. Even for now it's good, it's better to be prepared for reusability.


if self.db_location is not None:
# 1. We try to match city name and country
location = self.db_location.query(f"city_ascii == '{city_name}' & country == '{country}'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is pandas overkill for this problem? I wonder if we could instead do something relatively simple with the built-in csv library, something like:

self.db_location = {}
with open('worldcities.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # Index by city and country
        self.db_location[(row["city_ascii"], row["country"])] = (row["lat"], row["lng"])
        # Index by city (first entry wins if duplicated names)
        if row["city_ascii"] not in db_location:
            self.db_location[row["city_ascii"]] = (row["lat"], row["lng"])

...

lat, lng = self.db_location.get((city_name, country), self.db_location.get(city_name, None))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it may be. I will try this implementation and if the time penalty is not significant, I also prefer to get rid of this dependency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'd second that. Ideally we could use DictReader in the csv module. The pandas lib is quite heavy with more dependencies.

city_name = city.split(", ")[0]
country = city.split(", ")[-1]

if self.db_location is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this check? It doesn't look like load_db_location() has any case where it'd fail to populate self.db_location.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typing check 😓

Args:
city (str): Geographic location name
"""
if self.timezone is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check needed? Same reason.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before, maybe it assumes that the method is available for class, no instance and fails the validation

Comment on lines 100 to 103
# If even with the location we can't find the timezone, we try getting the location
# from API.
latitude, longitude = self.get_location_from_api(city)
timezone = self.timezone.tzNameAt(latitude, longitude)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be needed since get_location already falls through on its own to calling get_location_from_api if get_location_from_local_file fails?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in theory yes, but I found a case, with the original DB data, that was resolving to Sydney coordinates and the timezone resolution was returning None, maybe due some bug on that library, because with the coordinates from the API was working well (some decimals of difference)

("Barcelona, Spain", "Europe/Madrid"),
("Sydney, Australia", "Australia/Sydney"),
("Guadalajara, Spain", "Europe/Madrid"),
("Guadalajara, Mexico", "America/Mexico_City"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming - are these a mix of "city + country matches CSV" and "city alone matches CSV" so that we are testing both cases?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly, this test was to validate that case, we try to get the city + country first to avoid ambiguities

Copy link
Contributor

@pke11y pke11y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good once we swap out pandas

Copy link
Contributor

@pke11y pke11y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants