Add script for counting total seen slots #3
Conversation
Determines which database dump to use for deduplication, loading provider info, etc.
While doing this, I also uploaded gzipped versions of the availability log files to S3 for every day through 2022-01-26. We don’t have an ongoing process for saving new data as gzip yet, though (see usdigitalresponse/univaf#542).
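That ongoing step isn’t written yet, but a minimal sketch of what it might look like is below; the bucket and key names are made up for illustration, not the real locations.

```python
# Sketch only: compress a day's log file and upload the .gz copy to S3.
# The bucket and key below are placeholders, not the real locations.
import gzip
import shutil

import boto3


def gzip_and_upload(log_path: str, bucket: str, key: str) -> None:
    """Write a gzipped copy of log_path and upload it to s3://bucket/key."""
    gz_path = log_path + ".gz"
    with open(log_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    boto3.client("s3").upload_file(gz_path, bucket, key)


# Example (hypothetical names):
# gzip_and_upload("availability_log-2022-01-26.ndjson",
#                 "example-univaf-data", "availability_log/2022-01-26.ndjson.gz")
```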
Great work! Let me know if there's anything you'd like me to review.
Nothing here is a huge deal or priority, but if you want to look over how the counting works (and whether I should have leveraged some other existing code, or whether some of this could be leveraged elsewhere), 👍. No worries if you don’t want to spend a lot of effort on the other scripts in this repo, though; they aren’t in active use.
🙌🏽
It turns out the reason the AWS CLI was getting the speed that confused me before: it breaks requests for large files into parallel requests using the HTTP Range header. Boto3 has this functionality built in as well, which gives us a straightforward way to get the same speed without complicated branching logic based on whether the AWS CLI is installed.
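For reference, here’s a rough sketch of how that might look with boto3’s built-in transfer manager; the bucket, key, and local file names are placeholders, not values from this repo:

```python
# Sketch: parallel, ranged downloads via boto3's transfer manager.
# Bucket/key/paths below are placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Files larger than multipart_threshold are fetched as multiple ranged
# GET requests in parallel (the same trick the AWS CLI uses).
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # split anything bigger than 8 MB
    max_concurrency=10,                   # up to 10 parallel range requests
)

s3.download_file(
    "example-bucket",
    "availability_log/2022-01-26.ndjson.gz",
    "2022-01-26.ndjson.gz",
    Config=config,
)
```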
I did not do “Summarize data by more dimensions (maybe provider and source),” but have addressed everything else here, and I think it’s time to land this.
We’ve had a few requests over time for a total count of slots that UNIVAF has monitored, so I thought I’d clean up the hacky script I originally wrote for this and add it here for later re-use. This could still use some improvement, so I’ve posted it as a draft.
This also led to some messy discoveries:
- Our data files are pretty big, and it turns out other storage formats can improve literally every aspect of this with no downsides (smaller files, faster to download, cheaper to store, and even faster to read and scan every line (!)). In this case, I went with batch-gzipping the current JSON files, which you can see read support for in the `read_json_lines()` function. (See a deeper analysis of all this in Write smaller log files usdigitalresponse/univaf#542. A rough sketch of gzip-aware reading appears after the Rite Aid notes below.)
- I’ve never used PyPy before, but it gave us a 30-35% speed boost here! It doesn’t play nice with Pandas yet, though, so I wound up creating `lib_cli.py` so I could share code without tripping an `import pandas` statement.
- I later switched to a more map-reduce-y pattern (summarize each log file, then combine the summaries), which allowed for another big speedup via parallel processing (also sketched after the Rite Aid notes below).
- The existing code uses `urllib` to download the data files from S3 with an unauthenticated HTTP request. It’s really slow (10-20 MB/s from S3 to an EC2 machine in the same region). At some point I tried the AWS CLI client instead, and it downloaded data a full order of magnitude faster! I’m not sure if this is something special about how it forms the requests (it’s written in Python, so it’s not that) or if you just get more bandwidth when authenticated. In any case, we should take advantage of this (I haven’t included any code to do so here, though).
- Rite Aid reported bad data for slot counts from 2021-09-09 through 2021-11-17 (when their API completely broke and went offline). I added some code to “correct” for this by substituting the median slot count from the month after we started scraping them (because the API was offline), which had more realistic and accurate data.
How do I know the data was bad? Prior to the dates in question, Rite Aid had been reporting hundreds of slots/day at most for any given location. During the dates in question, it reported thousands. Further, it only ever reported one of 3 values (1,728, 1,584, or 1,440) rather than a broad spread across locations. Those three values were also the same for all locations on particular days (e.g. every location had 1,584 slots on 2021-10-06).
Why use 13 for the “corrected” value? Starting a few days after the Rite Aid API went down, we started scraping their booking site, which gives us a richer data set with what seem like more realistic numbers. Taking the median slots/day/location from the first month of that data seemed like a good, conservative replacement. However, that gives us 13, whereas in the week before the bad data we were seeing a median of 288. It’s hard to know which is more correct: the previous data could have been bad in different ways; it could be more correct because it’s a complete count of all slots from inside their system (when scraping, we only see available slots); it could be a miscount that treats each product+slot combination as a separate slot (we almost made this mistake in our scraper); or Rite Aid could simply have changed staffing and allocation during this period and offered fewer appointment slots in late November vs. early September (after all, demand had decreased dramatically leading into this time period).
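To make that correction concrete, here’s a minimal sketch of the substitution logic; the field names and record layout are assumptions for illustration, not the actual structures this script uses:

```python
# Illustrative sketch of the Rite Aid correction described above.
# Field names and the record layout are assumptions, not the script's real ones.
from datetime import date

BAD_START = date(2021, 9, 9)
BAD_END = date(2021, 11, 17)
SUSPECT_VALUES = {1728, 1584, 1440}  # the only counts seen during the bad window
CORRECTED_SLOTS = 13                 # median slots/day/location from the first
                                     # month of scraped (post-outage) data


def corrected_slot_count(provider: str, day: date, slots: int) -> int:
    """Replace Rite Aid's bogus slot counts during the known-bad window."""
    if (
        provider == "rite_aid"
        and BAD_START <= day <= BAD_END
        and slots in SUSPECT_VALUES
    ):
        return CORRECTED_SLOTS
    return slots
```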
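And, as referenced in the bullets above, a rough sketch of the gzip-aware line reader and the summarize-then-combine pattern; the function and field names here are illustrative and may differ from the actual `read_json_lines()` in this PR:

```python
# Sketch of a gzip-aware JSON-lines reader plus the map-reduce-style
# summarize-then-combine pattern. Names are illustrative, not the real ones.
import gzip
import json
from collections import Counter
from multiprocessing import Pool
from typing import Iterator


def read_json_lines(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per line, handling .gz files transparently."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def summarize_file(path: str) -> Counter:
    """Map step: total slots per provider for one day's log file."""
    totals = Counter()
    for record in read_json_lines(path):
        totals[record.get("provider", "unknown")] += record.get("slots", 0)
    return totals


def summarize_all(paths: list[str]) -> Counter:
    """Reduce step: combine per-file summaries, computed in parallel."""
    combined = Counter()
    with Pool() as pool:
        for summary in pool.map(summarize_file, paths):
            combined.update(summary)
    return combined
```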
To Do
I made this PR to get all this committed, visible, and re-usable, but it does need some further work and may not be worth merging yet.
- Download data from S3 with something faster than `urllib` when credentials are available.
- Figure out whether any of this can be shared with the other scripts here (e.g. `process_univaf.py`).

Statistics from Rite Aid, for posterity: