Hello there! My name is Enrique Perez and this is my solution to the HF Data Engineering take-home test.
The original requirements are here.
To set up the development environment, follow these steps:

- Install Python 3.10+.
- In the project directory, run:

```
make install-deps-dev
```

- Done! You are ready to start developing the app.
To run the unit tests, in the project directory run:

```
make unit-tests
```
We will run the app from the virtual environment. While in the project directory, run:

```
make run-spark-app
```

Done!
The basic structure is the following:

```
├── app.py
├── data
├── input
├── output
├── src
│   ├── __init__.py
│   ├── custom_exceptions.py
│   ├── data_processing.py
│   ├── load_events.py
│   └── utils.py
```
- The `src` folder contains all the functions required to read and process the data, as well as other utilities for error handling and logging. `load_events.py` contains the code required to fulfill Task #1.
- The `data` folder contains the persisted output from Task #1 in Parquet file format.
- `data_processing.py` contains the code required to fulfill Task #2.
- `app.py` executes both tasks in order to produce the desired dataset; a sketch of how the pieces could fit together is shown below.
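For illustration, here is a minimal sketch of how `app.py` could chain the two tasks. The function names `load_events()` and `process_data()` and their parameters are assumptions made for this example, not necessarily the actual API exposed by the modules in `src`:

```python
from pyspark.sql import SparkSession

# Assumed entry points; the real modules may expose different names.
from src.load_events import load_events        # Task #1: read, clean, persist
from src.data_processing import process_data   # Task #2: transform, aggregate


def main() -> None:
    spark = SparkSession.builder.appName("recipes-etl").getOrCreate()
    load_events(spark, input_path="input", output_path="data")    # writes Parquet
    process_data(spark, input_path="data", output_path="output")  # final dataset
    spark.stop()


if __name__ == "__main__":
    main()
```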
- As source events come in JSON format, and JSON does not enforce a strict schema by default, no schema is enforced when reading these events.
- The `cookTime` and `prepTime` fields are strings representing durations in ISO 8601 format, so during Task #1 these fields are transformed into integer durations in minutes, persisting the data in a form that is easier to process later. As there is no built-in way to do this, and I preferred not to rely on additional packages or UDFs, regular expressions were used to accomplish this transformation (see the sketch after this list).
- A few duplicated recipes were found in the data; however, one of them shared the same `datePublished` value with its duplicate, so no clear criterion for eliminating duplicates was found. As none of these recipes contain beef, the final output would not be affected in any case.
- Three (3) recipes without a name were spotted; these were filtered out during Task #1.
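As a rough idea of the regex-based approach, here is a minimal sketch that converts ISO 8601 durations such as `PT1H30M` to integer minutes with `regexp_extract`. The derived column names are assumptions, and the actual implementation in `load_events.py` may differ:

```python
from pyspark.sql import Column
from pyspark.sql import functions as F

# Matches durations like "PT1H", "PT30M", or "PT1H30M".
ISO_8601_DURATION = r"PT(?:(\d+)H)?(?:(\d+)M)?"


def iso_duration_to_minutes(col: Column) -> Column:
    # regexp_extract returns "" for a group that did not match; casting ""
    # to int yields null, so coalesce falls back to 0.
    hours = F.coalesce(F.regexp_extract(col, ISO_8601_DURATION, 1).cast("int"), F.lit(0))
    minutes = F.coalesce(F.regexp_extract(col, ISO_8601_DURATION, 2).cast("int"), F.lit(0))
    return hours * 60 + minutes


df = (
    df.withColumn("cookTimeMinutes", iso_duration_to_minutes(F.col("cookTime")))
      .withColumn("prepTimeMinutes", iso_duration_to_minutes(F.col("prepTime")))
)
```

With this, `"PT1H30M"` becomes `90` and `"PT30M"` becomes `30`.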
- `make code-style` will keep the code tidy, using `black` for formatting, `isort` to organize import statements, and `flake8` for style checks.
- `make check-types` will help you prevent type-related bugs, using `mypy` and the type annotations.
- For deployment, a cloud-based cluster could be used, e.g., AWS EMR, and a configuration management tool like Terraform can be used to define and manage the deployment infrastructure. As the latest version of AWS EMR is 6.13 at the time of this writing, `pyspark` 3.4.1 was used, which is the latest version of Spark compatible with EMR 6.13.
- In case of performance problems, I'd consider using the `repartition()` or `coalesce()` operations to control the partitioning and minimize data shuffling, especially before aggregating data (see the first sketch below).
- To run the app periodically, I would use AWS Lambda, creating a Lambda function that triggers the execution of our app on a schedule defined in AWS EventBridge (see the second sketch below).
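To make the partitioning point concrete, here is a minimal sketch, assuming hypothetical column names `difficulty` and `total_cook_time`:

```python
from pyspark.sql import functions as F

# Repartition by the grouping key so the aggregation can reuse the resulting
# partitioning instead of triggering an additional shuffle.
per_difficulty = (
    df.repartition("difficulty")
      .groupBy("difficulty")
      .agg(F.avg("total_cook_time").alias("avg_total_cooking_time"))
)

# The aggregated result is small, so coalesce to a single partition before
# writing to avoid producing many tiny output files.
per_difficulty.coalesce(1).write.mode("overwrite").csv("output", header=True)
```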
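And a minimal sketch of the scheduling idea, assuming the app runs as a step on an existing EMR cluster. The cluster id and S3 path are placeholders, and the EventBridge rule that invokes this handler on a schedule would be defined separately (e.g., in Terraform):

```python
import boto3

emr = boto3.client("emr")


def handler(event, context):
    # Submit the Spark app as a step on an already-running EMR cluster.
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
        Steps=[
            {
                "Name": "recipes-etl",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "s3://my-bucket/app.py"],  # placeholder path
                },
            }
        ],
    )
```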