Pandas and Dask from the Inside

Tutorial session from PyData Amsterdam, Friday 7 April 2017.
Presented by Stephen Simmons ([email protected]).
GitHub repo: https://github.com/stevesimmons/pydata-ams2017-pandas-and-dask-from-the-inside

Tutorial contents

This is meant for all levels of Python/Pandas users: accessible to beginners plus some new insights for experienced users.

Part I - Pandas from the Inside

Learn how to make your Python data analysis easier, faster and more efficient with Pandas, while avoiding common pitfalls.

We follow a typical Python data analysis task from start to finish, looking inside Pandas Series, DataFrames and other objects to discover what really happens.

We will see that many Pandas operations are essentially function calls on numpy arrays. Our Pandas code, if we do it right, can benefit from the full speed of numpy's highly optimized C routines. Equally, if we do it wrong, our Pandas code can be 1000x slower.
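As a minimal sketch of that difference, the snippet below doubles a small Series twice: once with a single vectorized operation and once element by element in Python. The `scores` values are made up for illustration and are not taken from the tutorial's data files. Both routes give the same answer; the speed gap only becomes visible on large inputs.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, made up for illustration.
scores = pd.Series([85, 102, 77, 120], name="points")

# A Series stores its data in a numpy array; to_numpy() exposes it.
values = scores.to_numpy()
print(type(values))

# Vectorized: one operation applied to the whole array in compiled code.
doubled_fast = scores * 2

# Element by element: the Python interpreter handles every value in turn.
doubled_slow = pd.Series([x * 2 for x in scores], name="points")

print(doubled_fast.tolist() == doubled_slow.tolist())
```

Timing the two versions on a Series with a few million rows (for example with IPython's `%timeit`) is a quick way to see the difference for yourself.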

The examples here use Australian Rules football results. If you aren't familiar with Aussie Rules, watch this 9 minute introduction.

Part II - Dask from the Inside - "Big Pandas"

This second part looks at Pandas analysis when our data sets can't fit in local memory. The examples here use the On-Time domestic flight arrival/departure data from the US Bureau of Transportation Statistics. The monthly BTS On-Time Performance dataset started in December 1987 and currently has details on 173 million individual flights. Each monthly extract is a 220MB csv file, zipped to 23MB, covering the 450,000 or so flights by major US carriers.

We try two approaches to process this csv data:

Plain Pandas, which quickly runs out of memory.
Dask, whose distributed/deferred DataFrames are a near drop-in replacement for Pandas.

Through seeing how Dask works "from the inside", we can make better architectural decisions on local versus distributed data processing.

Before the tutorial

Python packages

If you want to follow along with the examples on your own laptop, please have the latest versions of Python 3, Pandas, Jupyter and IPython installed. Part II will need Dask and optionally graphviz to visualize Dask dependency graphs.
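One quick way to check what is already installed is sketched below. It uses only the standard library (`importlib.metadata`, Python 3.8+). The `installed_versions` helper is a hypothetical name introduced here, and the distribution names in the lists are assumptions about how the packages are published. Note that the `graphviz` Python package is only a wrapper, so rendering graphs also needs the Graphviz system binaries.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed distribution names for the packages mentioned above.
required = ["pandas", "jupyter", "ipython", "dask"]
optional = ["graphviz"]

for name, version in installed_versions(required + optional).items():
    print(f"{name}: {version or 'not installed'}")
```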

If you don't have Python already, the easiest route is via the full Anaconda Python distribution (300MB). Details are at http://conda.pydata.org/docs/installation.html. Alternatively, download the "miniconda" version and install just the packages you need.

Tutorial files

Clone this repo for a local copy of the presentation materials, source code, sample Jupyter notebooks and test data.

If you just want to follow along with the tutorial slides, download just the PDFs.

Please update your local copies on the morning of the tutorial to make sure your versions match the final ones I am presenting.

The main files for Part I are:

slides-1-pandas-from-the-inside.pdf or slides-1-pandas-from-the-inside.pptx - Presentation slides. As some slides are quite detailed, you may want to download these to follow along on your own laptop/tablet.
pandas-from-the-inside.ipynb - Jupyter notebook with code from the slides plus some further explanation. Download this if you have Jupyter installed on your laptop. Otherwise you can view the rendered notebook here on GitHub.
pfi.py - Code in the Jupyter notebook.
bg3.txt - Sample data file from http://afltables.com/afl/stats/biglists/bg3.txt. A copy is included in this repo for your convenience.

The main files for Part II are:

slides-2-dask-big-pandas.pdf or slides-2-dask-big-pandas.pptx
nb1-setup.ipynb
nb2-preparing-sample-data.ipynb
nb3-pandas-with-large-csvs.ipynb
nb4-parquet.ipynb
nb5-dask-graphs.ipynb

Prepared sample data is available here:

About the presenter

Stephen Simmons has been programming in Python since 2000. He works at JPMorgan in London, where he leads the Precious Metals technology team, building trading and risk applications in JPMorgan's Python environment, Athena.
