Add script for counting total seen slots #3
Conversation
Determines which database dump to use for deduplication, loading provider info, etc.
While doing this, I also uploaded gzipped versions of the availability log files to S3 for every day through 2022-01-26. We don’t have an ongoing process for saving new data as gzip yet, though (see usdigitalresponse/univaf#542).
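That ongoing step isn’t written yet, but a minimal sketch of what it might look like is below; the bucket and key names are made up for illustration, not the real locations.

```python
# Sketch only: compress a day's log file and upload the .gz copy to S3.
# The bucket and key below are placeholders, not the real locations.
import gzip
import shutil

import boto3


def gzip_and_upload(log_path: str, bucket: str, key: str) -> None:
    """Write a gzipped copy of log_path and upload it to s3://bucket/key."""
    gz_path = log_path + ".gz"
    with open(log_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    boto3.client("s3").upload_file(gz_path, bucket, key)


# Example (hypothetical names):
# gzip_and_upload("availability_log-2022-01-26.ndjson",
#                 "example-univaf-data", "availability_log/2022-01-26.ndjson.gz")
```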
Great work! Let me know if there's anything you'd like me to review.
Nothing here is a huge deal or priority, but if you want to look over how the counting works (and whether I should have leveraged some other existing code, or whether some of this could be leveraged elsewhere), 👍. No worries if you don’t want to spend a lot of effort on the other scripts in this repo, though; they aren’t in active use.
🙌🏽
It turns out the reason the AWS CLI was getting the speed that confused me before: it breaks requests for large files into parallel requests using the HTTP Range header. Boto3 has this functionality built in as well, which gives us a straightforward way to get the same speed without complicated branching logic based on whether the AWS CLI is installed.
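For reference, here’s a rough sketch of how that might look with boto3’s built-in transfer manager; the bucket, key, and local file names are placeholders, not values from this repo:

```python
# Sketch: parallel, ranged downloads via boto3's transfer manager.
# Bucket/key/paths below are placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Files larger than multipart_threshold are fetched as multiple ranged
# GET requests in parallel (the same trick the AWS CLI uses).
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # split anything bigger than 8 MB
    max_concurrency=10,                   # up to 10 parallel range requests
)

s3.download_file(
    "example-bucket",
    "availability_log/2022-01-26.ndjson.gz",
    "2022-01-26.ndjson.gz",
    Config=config,
)
```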
I did not do “Summarize data by more dimensions (maybe provider and source),” but have addressed everything else here, and I think it’s time to land this.
We’ve had a few requests over time for a total count of slots that UNIVAF has monitored, so I thought I’d clean up the hacky script I originally wrote for this and add it here for later re-use. This could still use some improvement, so I’ve posted it as a draft.
This also led to some messy discoveries:
- Our data files are pretty big, and it turns out other storage formats can improve literally every aspect of this with no downsides (smaller files, faster to download, cheaper to store, and even faster to read and scan every line (!)). In this case, I went with batch-gzipping the current JSON files, which you can see read support for in the `read_json_lines()` function. (See a deeper analysis of all this in Write smaller log files usdigitalresponse/univaf#542. A rough sketch of gzip-aware reading appears after the Rite Aid notes below.)
- I’ve never used PyPy before, but it gave us a 30-35% speed boost here! It doesn’t play nice with Pandas yet, though, so I wound up creating `lib_cli.py` so I could share code without tripping an `import pandas` statement.
- I later switched to a more map-reduce-y pattern (summarize each log file, then combine the summaries), which allowed for another big speedup via parallel processing (also sketched after the Rite Aid notes below).
- The existing code uses `urllib` to download the data files from S3 with an unauthenticated HTTP request. It’s really slow (10-20 MB/s from S3 to an EC2 machine in the same region). At some point I tried the AWS CLI client instead, and it downloaded data a full order of magnitude faster! I’m not sure if this is something special about how it forms the requests (it’s written in Python, so it’s not that) or if you just get more bandwidth when authenticated. In any case, we should take advantage of this (I haven’t included any code to do so here, though).
- Rite Aid reported bad data for slot counts from 2021-09-09 through 2021-11-17 (when their API completely broke and went offline). I added some code to “correct” for this by substituting the median slot count from the month after we started scraping them (because the API was offline), which had more realistic and accurate data.
How do I know the data was bad? Prior to the dates in question, Rite Aid had been reporting hundreds of slots/day at most for any given location. During the dates in question, it reported thousands. Further, it only ever reported one of 3 values (1,728, 1,584, or 1,440) rather than a broad spread across locations. Those three values were also the same for all locations on particular days (e.g. every location had 1,584 slots on 2021-10-06).
Why use 13 for the “corrected” value? Starting a few days after the Rite Aid API went down, we started scraping their booking site, which gives us a richer data set with what seem like more realistic numbers. Taking the median slots/day/location from the first month of that data seemed like a good, conservative replacement. However, that gives us 13, whereas in the week before the bad data we were seeing a median of 288. It’s hard to know which is more correct: the previous data could have been bad in different ways; it could be more correct because it’s a complete count of all slots from inside their system (when scraping, we only see available slots); it could be a miscount that treats each product+slot combination as a separate slot (we almost made this mistake in our scraper); or Rite Aid could simply have changed staffing and allocation during this period and offered fewer appointment slots in late November vs. early September (after all, demand had decreased dramatically leading into this time period).
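To make that correction concrete, here’s a minimal sketch of the substitution logic; the field names and record layout are assumptions for illustration, not the actual structures this script uses:

```python
# Illustrative sketch of the Rite Aid correction described above.
# Field names and the record layout are assumptions, not the script's real ones.
from datetime import date

BAD_START = date(2021, 9, 9)
BAD_END = date(2021, 11, 17)
SUSPECT_VALUES = {1728, 1584, 1440}  # the only counts seen during the bad window
CORRECTED_SLOTS = 13                 # median slots/day/location from the first
                                     # month of scraped (post-outage) data


def corrected_slot_count(provider: str, day: date, slots: int) -> int:
    """Replace Rite Aid's bogus slot counts during the known-bad window."""
    if (
        provider == "rite_aid"
        and BAD_START <= day <= BAD_END
        and slots in SUSPECT_VALUES
    ):
        return CORRECTED_SLOTS
    return slots
```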
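And, as referenced in the bullets above, a rough sketch of the gzip-aware line reader and the summarize-then-combine pattern; the function and field names here are illustrative and may differ from the actual `read_json_lines()` in this PR:

```python
# Sketch of a gzip-aware JSON-lines reader plus the map-reduce-style
# summarize-then-combine pattern. Names are illustrative, not the real ones.
import gzip
import json
from collections import Counter
from multiprocessing import Pool
from typing import Iterator


def read_json_lines(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per line, handling .gz files transparently."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def summarize_file(path: str) -> Counter:
    """Map step: total slots per provider for one day's log file."""
    totals = Counter()
    for record in read_json_lines(path):
        totals[record.get("provider", "unknown")] += record.get("slots", 0)
    return totals


def summarize_all(paths: list[str]) -> Counter:
    """Reduce step: combine per-file summaries, computed in parallel."""
    combined = Counter()
    with Pool() as pool:
        for summary in pool.map(summarize_file, paths):
            combined.update(summary)
    return combined
```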
To Do
I made this PR to get all this committed, visible, and re-usable, but it does need some further work and may not be worth merging yet.
- Download data from S3 with something faster than `urllib` when credentials are available.
- Figure out whether any of this can be shared with the other scripts here (e.g. `process_univaf.py`).

Statistics from Rite Aid, for posterity: