
Add script for counting total seen slots #3

Merged: 20 commits into main, Sep 28, 2022

Conversation

@Mr0grog (Collaborator) commented Jan 28, 2022

We’ve had a few requests over time for a total count of slots that UNIVAF has monitored, so I thought I’d clean up the hacky script I originally wrote for this and add it here for later re-use. This could still use some improvement, so I’ve posted it as a draft.

This also led to some messy discoveries:

  • Our data files are pretty big, and it turns out other storage formats can improve literally every aspect of this with no downsides (smaller files, faster to download, cheaper to store, and even faster to read and scan line by line (!)). In this case, I went with batch-gzipping the current JSON files, which you can see read support for in the read_json_lines() function (a rough sketch of reading these files follows this list). (See a deeper analysis of all this at: Write smaller log files univaf#542.)

  • I’ve never used PyPy before, but it gave us a 30-35% speed boost here! It doesn’t play nice with Pandas yet, though, so I wound up creating lib_cli.py so I could share code without tripping an import pandas statement.

  • I later switched to a more map-reduce-y pattern (summarize each log file, then combine the summaries), which allowed for another big speedup via parallel processing (also sketched after this list).

  • The existing code uses urllib to download the data files from S3 with unauthenticated HTTP requests. It’s really slow (10-20 MB/s from S3 to an EC2 machine in the same region). At some point I tried using the AWS CLI client, and it downloaded data a full order of magnitude faster! Not sure if this is something special about how it forms the requests (it’s written in Python, so it’s not that) or if you just get more bandwidth when authenticated. In any case, we should take advantage of this (I haven’t included any code to do so here, though).

  • Rite Aid reported bad data for slot counts from 2021-09-09 through 2021-11-17 (when their API completely broke and went offline). I added some code to “correct” for this by instead substituting the median slot count from the first month after we started scraping their booking site (which we did because the API was offline), since that data looked more realistic and accurate. (A sketch of this correction also follows the list.)

    How do I know the data was bad? Prior to the dates in question, Rite Aid had been reporting hundreds of slots/day at most for any given location. During the dates in question, it reported thousands. Further, it only ever reported one of 3 values (1,728, 1,584, or 1,440) rather than a broad spread across locations. Those three values were also the same for all locations on particular days (e.g. every location had 1,584 slots on 2021-10-06).

    Why use 13 for the “corrected” value? Starting a few days after the Rite Aid API went down, we started scraping their booking site, getting us a richer data set with what seem like more realistic numbers. Taking the median slots/day/location from the first month of that data seemed like a good, conservative replacement. However, this gives us 13, whereas in the week before the bad data we were seeing a median of 288. It’s hard to know which is more correct: the previous data could have been bad in different ways, could be more correct because it’s a complete count of all slots from inside their system (scraping, we only see available slots), could be a miscount that treats each product+slot combination as a separate slot (we almost made this mistake in our scraper), or Rite Aid could simply have changed staffing and allocation during this period and offered fewer appointment slots in late November vs. early September (after all, demand had decreased dramatically leading into this time period).
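To make the gzip-reading and map-reduce-style summarizing above more concrete, here’s a minimal sketch of the approach. It assumes one JSON object per line; the record fields (“provider”, “slots”) and the helper names other than read_json_lines() are illustrative rather than the exact code in count_slots.py.

```python
import gzip
import json
from collections import Counter
from multiprocessing import Pool


def read_json_lines(path):
    """Yield parsed records from a plain or gzip-compressed JSON-lines file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize_file(path):
    """Map step: boil one day's log file down to a small per-provider summary."""
    summary = Counter()
    for record in read_json_lines(path):
        # "provider" and "slots" are placeholder field names, not the real schema.
        summary[record.get("provider", "unknown")] += record.get("slots") or 0
    return summary


def summarize_all(paths):
    """Reduce step: combine per-file summaries, processing files in parallel."""
    total = Counter()
    with Pool() as pool:
        for summary in pool.imap_unordered(summarize_file, paths):
            total.update(summary)
    return total
```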

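And a rough sketch of the Rite Aid correction described above; the provider ID and argument layout are placeholders, not the exact code in this PR.

```python
from datetime import date

# Dates when Rite Aid's API was reporting implausible slot counts.
BAD_PERIOD_START = date(2021, 9, 9)
BAD_PERIOD_END = date(2021, 11, 17)
# The only three totals it ever reported during that window.
SUSPECT_VALUES = {1728, 1584, 1440}
# Median slots/day/location from the first month of scraped booking-site data.
CORRECTED_SLOTS = 13


def corrected_slot_count(provider, day, slots):
    """Substitute a realistic value for Rite Aid's broken-API slot counts."""
    if (
        provider == "rite_aid"  # placeholder provider ID
        and BAD_PERIOD_START <= day <= BAD_PERIOD_END
        and slots in SUSPECT_VALUES
    ):
        return CORRECTED_SLOTS
    return slots
```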
To Do

I made this PR to get all this committed, visible, and re-usable, but it does need some further work and may not be worth merging yet.

  • Use nicer CLI parsing tools in other scripts.
  • Download with authenticated requests or the AWS CLI client instead of urllib when credentials are available.
  • Actually download the necessary files instead of requiring you to have done that manually (or, as I did, by hacking up process_univaf.py).
  • Summarize data by more dimensions (maybe provider and source). Digging into the issues with Rite Aid above really highlighted how having tools for these other breakdowns would be helpful. We could build on that to automatically publish nightly reports and proactively highlight issues like the Rite Aid one.

Statistics from Rite Aid, for posterity:

Values are numbers of slots seen for a location + day combination. These only count reports of more than 1 total slot, in order to filter out sources that don’t provide slot- or capacity-level detail (e.g. CDC).

Misbehaving Locations: 2425 of 2471 total
Before (the week before the bad period):
  Min: 144 / Max: 432
  Mean:   254.54788260383873
  Median: 288.0
  Deciles: [144.0, 144.0, 288.0, 288.0, 288.0, 288.0, 288.0, 288.0, 288.0]
During:
  Min: 1440 / Max: 3312
  Mean:   1636.5812954028193
  Median: 1728.0
  Deciles: [1440.0, 1440.0, 1440.0, 1584.0, 1728.0, 1728.0, 1728.0, 1728.0, 1728.0]
After (the month after the bad period):
  Min: 2 / Max: 125
  Mean:   14.458970639250685
  Median: 13.0
  Deciles: [3.0, 6.0, 9.0, 11.0, 13.0, 16.0, 18.0, 22.0, 29.0]
After for locations that misbehaved:
  Min: 2 / Max: 125
  Mean:   14.444431297378738
  Median: 13
  Deciles: [3.0, 6.0, 9.0, 11.0, 13.0, 16.0, 18.0, 22.0, 29.0]
After for locations that were good:
  Min: 2 / Max: 111
  Mean:   42.815384615384616
  Median: 16
  Deciles: [2.0, 4.0, 7.0, 8.4, 16.0, 67.8, 93.0, 93.0, 96.6]
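For reference, a hedged sketch of how numbers like these could be computed with pandas; the column names (“provider”, “location”, “date”, “slots”) are assumptions about the summarized data, not the actual schema used here.

```python
import pandas as pd


def slot_stats(df, provider, start, end):
    """Min/max/mean/median/deciles of slots per location+day, ignoring counts <= 1."""
    daily = (
        df[(df["provider"] == provider) & df["date"].between(start, end)]
        .groupby(["location", "date"])["slots"]
        .sum()
    )
    daily = daily[daily > 1]  # drop sources without slot-/capacity-level detail
    return {
        "min": daily.min(),
        "max": daily.max(),
        "mean": daily.mean(),
        "median": daily.median(),
        "deciles": daily.quantile([i / 10 for i in range(1, 10)]).tolist(),
    }
```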

@Mr0grog (Collaborator, Author) commented Jan 28, 2022

While doing this, I also uploaded gzipped versions of the availability log files to S3 for every day through 2022-01-26. We don’t have an ongoing process for saving new data as gzip yet, though (see usdigitalresponse/univaf#542).

@janovergoor (Contributor) commented:
great work! let me know if there's anything you'd like me to review

@Mr0grog (Collaborator, Author) commented Jan 28, 2022

I don’t think anything here is a huge deal or priority, but if you want to look over how the counting works, whether I should have leveraged some other existing code, or whether some of this could be leveraged elsewhere, 👍. No need to spend a lot of effort on the other scripts in this repo, though; they aren’t in active use.

@astonm left a comment:
🙌🏽

Review threads on src/count_slots.py and src/requirements.txt (outdated, resolved).
It turns out the speed I was confused about the AWS CLI getting earlier comes from it breaking requests for large files into parallel requests using the HTTP Range header. Boto3 also has this functionality built in, which gives us a straightforward way to get the same speed without complicated branching logic depending on whether the AWS CLI is installed.
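A minimal sketch of what that could look like with boto3; the bucket and key names below are placeholders, not the real UNIVAF ones.

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Split large objects into concurrent ranged GETs, similar to what the AWS CLI does.
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # use ranged requests for objects over 8 MB
    multipart_chunksize=8 * 1024 * 1024,  # size of each ranged request
    max_concurrency=10,                   # number of parallel download threads
)

s3 = boto3.client("s3")
s3.download_file(
    "example-univaf-data-bucket",             # placeholder bucket name
    "availability_log-2022-01-26.ndjson.gz",  # placeholder key
    "availability_log-2022-01-26.ndjson.gz",
    Config=config,
)
```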
@Mr0grog marked this pull request as ready for review September 28, 2022 03:47
@Mr0grog (Collaborator, Author) commented Sep 28, 2022

I did not do “Summarize data by more dimensions (maybe provider and source),” but have addressed everything else here, and I think it’s time to land this.

@Mr0grog merged commit 1ed735a into main Sep 28, 2022
@Mr0grog deleted the people-keep-asking-how-many-slots-we-have-seen branch September 28, 2022 03:50