This repository has been archived by the owner on Jul 3, 2024. It is now read-only.
Margriet Groenendijk edited this page Aug 15, 2018 · 4 revisions

Short Name

Analyse historical shopping data with Spark and PixieDust in a Jupyter notebook

Short Description

Use Jupyter Notebooks with IBM Watson Studio to analyse historical shopping data with the open-source Python packages Apache Spark and PixieDust. Create bar charts, line charts, scatter plots, pie charts, histograms and maps without any coding.

Offering Type

Cognitive

Introduction

This code pattern shows how to analyse historical shopping data with Jupyter Notebooks in IBM Watson Studio and the open-source Python packages Apache Spark and PixieDust. Users can quickly analyse data and produce charts and maps.

Author

By Patrick Titzler and Margriet Groenendijk

Code

Demo

  • link to demo video

Video

  • link to youtube video

Overview

In this code pattern historical shopping data is analysed in a Jupyter notebook with the open-source Python packages Apache Spark and PixieDust.

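When the reader has completed this code pattern, they will understand how to:

  • Use Jupyter notebooks in IBM Watson Studio
  • Load and transform data with Apache Spark
  • Create charts and maps with PixieDust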

Flow

  1. Log in to IBM Watson Studio
  2. Load the provided notebook into Watson Studio
  3. Load the customer data in the notebook
  4. Transform the data with Apache Spark
  5. Create charts and maps with PixieDust

Included components

  • IBM Watson Studio: a suite of tools and a collaborative environment for data scientists, developers and domain experts
  • Apache Spark: an open-source cluster computing framework optimized for extremely fast and large-scale data processing

Featured Technologies

  • Jupyter notebooks: an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text
  • PixieDust: an open-source Python helper library for Jupyter notebooks that lets you explore and visualise Spark and pandas DataFrames through tables, charts and maps

Blog

Blog Title: Speeding up data exploration with PixieDust and Jupyter notebooks

Blog Author: Margriet Groenendijk

Blog Content - see below

With PixieDust you can use the power of Python and Jupyter notebooks when you:

  • have never coded before
  • are an experienced data analyst or data scientist
  • are a developer with little Python experience wanting to quickly explore some data

Jupyter notebooks are a tool used by many data scientists to wrangle and clean data, visualise data, build and test machine learning models and even write talks. The reason is that text, code, figures and tables can all be combined in one document, which makes it easy to keep the code structured by adding comments and explanations of your thought processes and the decisions you made.

There are many Python packages available for visualising data. When you are just getting started this can be overwhelming. When you are experienced it still takes a bit of time to create charts, because the syntax of each package is slightly different, and it is easy to spend a lot of time tweaking your code to create the perfect chart. I have to admit I tend to do this because it is so much fun, but it is definitely not always necessary.

With PixieDust you can explore data in a simpler way and spend your time on the data itself, instead of going down the rabbit hole of tweaking code to change the colours, fonts, line styles, axes and anything else you can adjust by hand.

The main command to create charts from Spark or pandas DataFrames is display(df). When you run this command in a cell in a notebook the data will be displayed in a table. Now you have the option to scroll through the data, filter the data or create a chart from a menu. All of this is simply done by clicking a few buttons.
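As a rough illustration of that single call, here is a minimal sketch. The sample records, the column names and the explore helper are made up for this example and are not part of the code pattern or its dataset. The sketch assumes pandas and PixieDust are installed, and falls back to a plain text preview when they are not.

```python
# Hypothetical sample records standing in for the shopping data; the column
# names are assumptions made for this sketch, not taken from the real dataset.
SAMPLE_ROWS = [
    {"customer_id": 1, "age": 34, "product_line": "Camping Equipment", "sale_price": 120.50},
    {"customer_id": 2, "age": 27, "product_line": "Golf Equipment", "sale_price": 89.99},
    {"customer_id": 3, "age": 45, "product_line": "Camping Equipment", "sale_price": 42.00},
]


def explore(rows):
    """Show the rows with PixieDust if possible and report which path was used."""
    try:
        import pandas as pd
    except ImportError:
        # pandas is not available: print the raw records instead.
        for row in rows:
            print(row)
        return "plain"

    df = pd.DataFrame(rows)
    try:
        # In a notebook, a bare `import pixiedust` also makes display() available.
        from pixiedust.display import display
    except ImportError:
        # PixieDust is not installed: fall back to a plain text preview.
        print(df.head())
        return "pandas"

    # Inside a Jupyter notebook this renders the interactive table and chart menu.
    display(df)
    return "pixiedust"


mode = explore(SAMPLE_ROWS)
```

When this runs in a Jupyter notebook with PixieDust installed, the table and its chart menu appear below the cell, and everything after that point is done with the menu rather than with more code.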

PixieDust uses other visualisation packages to create the charts, such as matplotlib, bokeh, seaborn and Brunel. You can see it as a clever wrapper around these libraries that will save you time while exploring data.

To explore PixieDust you can go through this code pattern, in which historical shopping data is analysed with Spark and PixieDust. The data is loaded, cleaned and then analysed by creating various charts and maps. The notebook runs in IBM Watson Studio, and the code pattern guides you through all the steps: setting up your IBM Cloud account, creating the notebook and running it.

If you want to jump straight to the code, the GitHub repository contains the notebook, which you can run either in the cloud or locally.

To learn more about PixieDust and Jupyter notebooks these are a few resources to get you started:


Learn more

  • Watson Studio: Master the art of data science with IBM's Watson Studio
  • Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns
  • With Watson: Want to take your Watson app to the next level? Looking to utilize Watson Brand assets? Join the With Watson program to leverage exclusive brand, marketing, and tech resources to amplify and accelerate your Watson embedded commercial solution.