😷 GATech COVID-19 Data Scraper

Number of cases per day. As a CSV. Data as it should be. EZ to read.

🎬 Demo

⬇️ Download the Current Data (updated hourly)

📈 View the data in an interactive Chart and Table

Below is the link to a public S3 object that is updated hourly by a Lambda running this project's code. Feel free to use it to power a dashboard or to investigate the data yourself.

https://gatech-covid-19-data.s3.amazonaws.com/gatech_covid_data.csv
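If you want to pull the hosted CSV into your own script, here is a minimal sketch using only the Python standard library. The helper names `parse_csv_text` and `fetch_rows` are made up for this example and are not part of the project, and the sketch makes no assumption about the column names: it just reports whatever header row the file has.

```python
import csv
import io
import urllib.request

# Public object URL from this README.
DATA_URL = "https://gatech-covid-19-data.s3.amazonaws.com/gatech_covid_data.csv"


def parse_csv_text(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_rows(url=DATA_URL, timeout=10):
    """Download the CSV and return its rows (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        text = resp.read().decode("utf-8")
    return parse_csv_text(text)


if __name__ == "__main__":
    rows = fetch_rows()
    columns = list(rows[0].keys()) if rows else []
    print(f"{len(rows)} rows, columns: {columns}")
```

Every value comes back as a string, so convert counts with `int(...)` before plotting or summing.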

🏁 Getting Started

For those who want to run the data scraper locally:

git clone git@github.com:davidgamero/gatech-covid-data-scraper.git
cd gatech-covid-data-scraper
pip install -r requirements.txt

python scrape_covid_data.py

Data will be written to gatech_covid_data.csv

ℹ️ Project Info

Q: Why did I make this?

A: I searched "gatech covid" on GitHub and only got one result, which was written in R by cjwichman.

I believe that pandemic health data should be freely and easily accessible and wanted to make my own plots, so I decided to make a Python scraper implementation to better understand the data.

My main improvement is automated extraction of case numbers aggregated by day, even for rows that group several cases together. This was trickier than I expected, because rows differ in formatting due to the GATech Health Alert Site's wildly inconsistent conventions 🤢. I used a series of regular expressions to parse for keywords and then extract integers using observed rules. All fuzzy extractions are printed to the command line for manual verification.

The patterns currently recognized are:

Rows with a 'Position' value of 'Students (N)' or 'Various (N)', where N is the number of cases. I extract N with a regex capture group for the numeric contents of the parentheses.
Rows with a 'Position' value of 'Students' or 'Various' with no count. For these rows I use a regex search to take the first integer present in the 'Campus Impact' column as the number of cases. It would be nice to eventually check that there is only a single match and raise an error for manual review if there are multiple integers.
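The two patterns above can be sketched roughly as follows. This is an illustration, not the code in scrape_covid_data.py: the function names, the `strict` flag (which implements the "nice to have" single-match check), the date strings, and the rule that any other row counts as one case are all assumptions made for this example.

```python
import re
from collections import defaultdict

# 'Students (3)' or 'Various (12)': capture the number inside the parentheses.
GROUPED_WITH_COUNT = re.compile(r"^\s*(?:Students|Various)\s*\((\d+)\)\s*$")
# Bare 'Students' or 'Various' with no count in the Position column.
GROUPED_NO_COUNT = re.compile(r"^\s*(?:Students|Various)\s*$")
INTEGER = re.compile(r"\d+")


def extract_case_count(position, campus_impact, strict=False):
    """Return the number of cases for one row, or None if it cannot be found."""
    match = GROUPED_WITH_COUNT.match(position)
    if match:
        return int(match.group(1))
    if GROUPED_NO_COUNT.match(position):
        integers = INTEGER.findall(campus_impact)
        if not integers:
            return None
        if strict and len(integers) > 1:
            # Hypothetical check: refuse to guess when several integers appear.
            raise ValueError(f"multiple integers in {campus_impact!r}")
        # Fuzzy extraction: print it so it can be verified by hand.
        print(f"fuzzy: took {integers[0]} from {campus_impact!r}")
        return int(integers[0])
    # Assumption for this sketch: any other row describes a single case.
    return 1


def aggregate_by_day(rows):
    """Sum case counts per date for (date, position, campus_impact) tuples."""
    totals = defaultdict(int)
    for date, position, campus_impact in rows:
        count = extract_case_count(position, campus_impact)
        if count is not None:
            totals[date] += count
    return dict(totals)
```

For example, `extract_case_count("Students (3)", "")` takes the first branch and returns 3, while `extract_case_count("Various", "5 students in Greek housing")` falls back to the 'Campus Impact' text and returns 5.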

💾 AWS Lambda -> S3

To deploy as an AWS Lambda function, build gatech-covid-data-lambda.zip with build_lambda_zip.sh and upload it to a Python Lambda that has s3:PutObject and s3:PutObjectAcl permissions on the target bucket.

chmod +x build_lambda_zip.sh
./build_lambda_zip.sh

Upload gatech-covid-data-lambda.zip to AWS Lambda

I recommend increasing the Lambda timeout to more than 5 seconds, since the data grows over time as more rows are added.
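For a rough idea of what the Lambda side can look like, here is a minimal sketch. It is not the contents of lambda_function.py. The bucket and key are taken from the public URL above; `build_csv_text`, its placeholder columns, and the `public-read` ACL are assumptions for illustration (the ACL is a guess based on the object being public and the s3:PutObjectAcl permission). boto3 is imported inside the handler because it ships with the AWS Lambda Python runtime but may not be installed locally.

```python
# Bucket and key inferred from the public object URL in this README.
BUCKET = "gatech-covid-19-data"
KEY = "gatech_covid_data.csv"


def build_csv_text():
    """Hypothetical stand-in for the scraper; returns placeholder CSV text."""
    # Placeholder columns only, not the real output of scrape_covid_data.py.
    return "date,cases\n"


def upload_csv(s3_client, csv_text, bucket=BUCKET, key=KEY):
    """Write the CSV to S3 using any client that exposes put_object."""
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=csv_text.encode("utf-8"),
        ContentType="text/csv",
        ACL="public-read",  # assumption: needs s3:PutObjectAcl on the bucket
    )


def lambda_handler(event, context):
    """Entry point invoked by AWS Lambda on each scheduled run."""
    import boto3  # provided by the Lambda Python runtime

    upload_csv(boto3.client("s3"), build_csv_text())
    return {"statusCode": 200}
```

Passing the client into `upload_csv` keeps the upload step easy to test locally with a stub object instead of real AWS credentials.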

Acknowledgements

Shout out to cjwichman for paving the way with their gatech_covid repo
