A model, coded in R, that predicts whether a bank account application is fraudulent.
I used a large, realistic dataset from Kaggle for this project. The data is available here: https://www.kaggle.com/datasets/sgpjesus/bank-account-fraud-dataset-neurips-2022
Base Analysis uses the Base dataset, which represents the real-world, anonymized fraud data. Variant V Analysis uses the Variant V dataset, which has better separability in the training data. Both datasets contain 1,000,000 rows and 32 variables.
I created this project for the course Econ 695: Econometrics for Big Data in Spring 2023. As part of the class, I gave a five-minute presentation on the project and wrote a findings report. Both files can also be found in this folder.