This is a PyTorch re-implementation of the paper "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets".
I chose this paper to reproduce because it gave me the chance to code and train a GPT-style model from scratch.
References used for the code: