Jupyter Notebook Tutorial

Introduction

According to https://jupyter.org/

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Jupyter is an environment where you can write code in your language of choice and evaluate its output as you go.

Also, Jupyter is the best IDE(ish) there is, in my humble and final opinion.

Installing

Even though Jupyter supports many programming languages, it requires Python to be installed. The easiest way to get it is the Anaconda package, the most popular Python distribution, which comes with Jupyter Notebook and some of the most widely used Python modules.

You can download the Anaconda package for your OS at the following link: https://www.continuum.io/downloads and then proceed with the installation.

To launch the Jupyter Notebook server, either open a command line and type jupyter notebook or run the Jupyter application.

Creating a notebook

Once you’ve started the notebook server, it will open the Notebook Dashboard in your system’s default web browser. The dashboard is the Jupyter user interface, and it shows a list of the files, notebooks and subfolders in the current directory.

Dashboard

To create your first notebook, click the “New” button in the top right corner and then select a kernel (in our case, Python 3).

New notebook creation

The notebook is organized into cells, where you can write code or text.

Cell

After you write some code in the input cell In [ ], you can execute it by pressing Shift + Enter or clicking the “Run cell” button in the toolbar; the output will be shown in the output cell Out [ ].

Cell code output

There are other buttons on the toolbar. Here’s a brief list; given their names, their functions are pretty straightforward (icons are missing):

  • Save and Checkpoint
  • Insert cell below
  • Cut selected cells
  • Copy selected cells
  • Paste cells below
  • Move selected cells up
  • Move selected cells down
  • Run cell, select below
  • Interrupt kernel
  • Restart the kernel
  • Cell type selector
  • Open command palette

You can find another pretty good guide in the Jupyter docs.


One of the Jupyter Notebook’s greatest advantages is that you can easily write some code and check its output right away. This makes it well suited to data exploration, where you need to query a dataset, often applying statistical functions and plotting charts to gain insights or verify impressions.

Also, the Notebook can render text formatted in Markdown syntax through the Markdown cell type, which lets you build a narrative within your notebook and guide the reader through your train of thought. This topic is discussed in more depth in this blog post.

Markdown

The notebook and data analysis

To exemplify the advantages mentioned above, we will perform a pretty basic analysis on a dataset.

The first step is to import the required modules:

  • pandas - for managing data in the form of dataframes (tabulated)
  • numpy - the numerical module
  • matplotlib - for plotting charts

The %matplotlib inline command will be discussed in a few moments; for now, let us just accept it.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

The dataset we'll be dealing with has some data on videogame sales across the globe and it was put together by the user GregorySmith in the Kaggle community, available at this link: https://www.kaggle.com/gregorut/videogamesales.

First we load the dataset into a dataframe with the pandas function read_csv, by passing the file location as its argument, and then take a look at the data. The code below assumes the dataset is in the current directory and uses the os.getcwd() method to build the path.

The dropna() method is called to drop any rows with missing data (NaN, NA, NaT, etc.) from the dataset, and the head() method displays the first 5 rows.

In [2]:
import os

# Build the path to the CSV file in the current working directory
df = pd.read_csv(os.path.join(os.getcwd(), 'vgsales.csv'))
df.dropna(inplace=True)
df.head()
Out[2]:
   Rank  Name                      Platform  Year    Genre         Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
0  1     Wii Sports                Wii       2006.0  Sports        Nintendo   41.49     29.02     3.77      8.46         82.74
1  2     Super Mario Bros.         NES       1985.0  Platform      Nintendo   29.08     3.58      6.81      0.77         40.24
2  3     Mario Kart Wii            Wii       2008.0  Racing        Nintendo   15.85     12.88     3.79      3.31         35.82
3  4     Wii Sports Resort         Wii       2009.0  Sports        Nintendo   15.75     11.01     3.28      2.96         33.00
4  5     Pokemon Red/Pokemon Blue  GB        1996.0  Role-Playing  Nintendo   11.27     8.89      10.22     1.00         31.37
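
To make the effect of dropna() and head() more concrete, here is a minimal sketch on a tiny made-up dataframe. The name toy and all of its values are hypothetical and are not taken from the vgsales dataset. It also illustrates that, without inplace=True, dropna() returns a new dataframe and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only (not from vgsales.csv)
toy = pd.DataFrame({
    'Name': ['Game A', 'Game B', 'Game C'],
    'Year': [2006.0, np.nan, 2008.0],
    'Publisher': ['Pub X', 'Pub Y', None],
})

# Without inplace=True, dropna() returns a new dataframe and keeps
# only the rows that have no missing value in any column
clean = toy.dropna()

print(len(toy), len(clean))  # row counts before and after
print(clean.head(1))         # head(n) returns the first n rows
```

Only the "Game A" row survives here, because the other two rows each have one missing value.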

There are 11 columns:

  • Rank - the title's overall position in the sales ranking
  • Name - the videogame title
  • Platform - the main platform the game was released for
  • Year - the year of release
  • Genre - the genre of the title
  • Publisher - the company responsible for the game
  • NA_Sales - total of sales, in millions, in North America
  • EU_Sales - total of sales, in millions, in Europe
  • JP_Sales - total of sales, in millions, in Japan
  • Other_Sales - total of sales, in millions, in the rest of the world
  • Global_Sales - total of sales, in millions, worldwide

Now for the questions:

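What are the 20 best-selling games worldwide of all time?

The dataset is already sorted by Rank, so we can find that out by plotting a horizontal bar chart of the first 20 entries in the Global_Sales column. The rows are reversed before plotting so that the best seller ends up at the top of the chart.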

In [3]:
df.iloc[0:20][::-1].plot.barh(x='Name',y='Global_Sales',use_index=True)
plt.xlabel('Global Sales in millions')
Out[3]:

As expected, we can see classics like Mario and Pokémon, as well as the more recent success of the Grand Theft Auto franchise, dominating the first positions.

But this result is for worldwide sales; one might wonder whether the cultural differences between Japan and North America would lead to different results.

Let's make two more horizontal bar charts of the top 20 sellers, one for Japan and the other for North America, and place them side by side.

In [4]:
fig = plt.figure(figsize=(12,4))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
df.sort_values('JP_Sales',ascending=False).iloc[0:20,:][::-1].plot.barh(
    x='Name',y='JP_Sales',use_index=True,ax=ax1,legend=False)
ax1.set_xlabel('Japan Sales in millions')
df.sort_values('NA_Sales',ascending=False).iloc[0:20,:][::-1].plot.barh(
    x='Name',y='NA_Sales',use_index=True,ax=ax2,legend=False)
ax2.set_xlabel('North America Sales in millions')
plt.tight_layout()

Tastes are indeed quite different: only 5 titles appear in both lists (Super Mario Bros., New Super Mario Bros., Tetris, Pokémon Red/Blue and Mario Kart for Nintendo DS). Also, it appears the Pokémon franchise was even more successful in Japan (6 of the top 20, with 5 of those in the top 7)!
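
Instead of comparing the two charts by eye, the overlap between two top-N lists can also be computed with a set intersection. The sketch below is only an illustration: the helper top_titles and the toy dataframe are hypothetical, and the toy values are made up. It relies on the pandas nlargest method to pick the best sellers in a given column.

```python
import pandas as pd

def top_titles(frame, column, n):
    """Return the set of names of the n best sellers in `column`."""
    return set(frame.nlargest(n, column)['Name'])

# Hypothetical toy data, for illustration only (not from vgsales.csv)
toy = pd.DataFrame({
    'Name': ['Game A', 'Game B', 'Game C', 'Game D'],
    'JP_Sales': [5.0, 1.0, 4.0, 0.5],
    'NA_Sales': [9.0, 8.0, 1.0, 2.0],
})

# Titles that are in the top 2 of both regions
both = top_titles(toy, 'JP_Sales', 2) & top_titles(toy, 'NA_Sales', 2)
print(sorted(both))
```

With the tutorial's dataframe, the equivalent call would be top_titles(df, 'JP_Sales', 20) & top_titles(df, 'NA_Sales', 20). Note that a set merges entries that share the same name, so a title released on several platforms is counted once.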


At first sight it appears that Nintendo games fare really well in all corners (?) of the globe, but is the company as a whole among the top sellers?

Let's find out by creating a new dataframe with the total sales per publisher.

First we get the unique values in the Publisher column by creating a set and then turn it into a list.

In [5]:
unique_publisher = list(set(df['Publisher']))

Next we create a list with the total global sales for each publisher. We do that with the aid of a list comprehension and boolean indexing:

  • we first iterate over unique_publisher with the iteration variable i
  • then we find the rows matching it in the Publisher column of our original dataset
  • and sum the corresponding values in the Global_Sales column
  • finally, we wrap it all in [] to create a list

In [6]:
sales_publisher = [df[df.loc[:,'Publisher'] == i]['Global_Sales'].sum() for i in unique_publisher]
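
As a side note, pandas can produce the same per-publisher totals in a single step with groupby. The sketch below is an alternative to the list comprehension, not the approach this tutorial follows; it runs both versions on a hypothetical toy dataframe with made-up values and shows that they agree.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not from vgsales.csv)
toy = pd.DataFrame({
    'Publisher': ['Pub X', 'Pub Y', 'Pub X', 'Pub Z'],
    'Global_Sales': [1.5, 2.0, 0.5, 0.25],
})

# Same idea as the tutorial: unique publishers, then one boolean-indexed
# sum per publisher inside a list comprehension
unique_publisher = list(set(toy['Publisher']))
sales_publisher = [toy[toy['Publisher'] == p]['Global_Sales'].sum()
                   for p in unique_publisher]

# Alternative: group the rows by publisher and sum each group's sales
totals = toy.groupby('Publisher')['Global_Sales'].sum()
print(totals)
```

The groupby result is a Series indexed by publisher; calling totals.reset_index() would turn it into a dataframe with Publisher and Global_Sales columns.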

Now we build the new dataframe with the columns Publisher and Global_Sales, filled with the data from unique_publisher and sales_publisher. Let's take a look.

In [7]:
publisher_sales = pd.DataFrame({'Publisher':unique_publisher,'Global_Sales':sales_publisher})
publisher_sales.head()
Out[7]:
   Global_Sales  Publisher
0  0.74          Pioneer LDC
1  0.05          Monte Christo Multimedia
2  2.59          SNK
3  0.88          Destination Software, Inc
4  0.03          Paon Corporation

Is Nintendo at the top? Also, what does the top 20 look like? Well, on to another horizontal bar plot.

In [8]:
publisher_sales.sort_values('Global_Sales',ascending=False).iloc[0:20,:][::-1].plot.barh(
    x='Publisher',y='Global_Sales',legend=False)
plt.xlabel('Global Sales In millions')
Out[8]:

Nintendo is still the uncontested champion, by an even greater margin than expected. We can also see household names like Sega (9th), Nintendo's great competitor in the '90s, Capcom (12th), responsible for one of the most iconic franchises of the fighting genre, Street Fighter, and Disney Interactive Studios (16th), Disney's videogame division.


Bonus

Is my childhood favorite game, "Rock and roll racing", on this list? Was it a success?

Lemme see

In [13]:
df[df.loc[:,'Name']=='Rock and roll racing']
Out[13]:
Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales

Nothing. Maybe I mistyped? Let's check it manually, how hard can it be? Let's see how many games have a title beginning with "R".

In [10]:
len([i for i in df.loc[:,'Name'].values if i[0]=='R'])
Out[10]:
701

701 is quite a lot.

Time to work smart, not hard.

In one word >> regular[\s]expressions (recursive joke)

In [11]:
import re
# Raw string, so that \S and \s reach the regex engine as written
regex = re.compile(r'rock[\S\s]+ing', re.IGNORECASE)
[m.group(0) for l in [i for i in df.loc[:,'Name'].values if i[0]=='R'] for m in [regex.search(l)] if m]
Out[11]:
["Rock 'N Roll Racing", "Rock N' Roll Racing"]
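
The same kind of search can also be written with the pandas string accessor, which applies a regular expression to a whole column at once. This is a sketch under assumptions: the toy dataframe is hypothetical, and the pattern used here (rock, then any characters, then racing) is a slightly stricter variation chosen for illustration, not the exact pattern from the cell above.

```python
import re
import pandas as pd

# Hypothetical toy data, for illustration only (not from vgsales.csv)
toy = pd.DataFrame({
    'Name': ["Rock 'N Roll Racing", 'Rock Band',
             "Rock N' Roll Racing", 'Tetris'],
})

# Case-insensitive match: 'rock', one or more characters, then 'racing'
mask = toy['Name'].str.contains(r'rock.+racing', flags=re.IGNORECASE, regex=True)

# Boolean indexing keeps only the rows whose name matched
matches = toy.loc[mask, 'Name'].tolist()
print(matches)
```

Because the mask is an ordinary boolean Series, it can be reused to select full rows, for instance toy[mask], in the same way as the boolean indexing used earlier.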

Two entries? Oh, I see, titlecase and a little uncertainty regarding the apostrophe position.

All the merrier, let's combine'em and see how many copies of this amazing game were sold:

In [12]:
rnr_sales = 0
for i in ["Rock 'N Roll Racing", "Rock N' Roll Racing"]:
    rnr_sales += df[df.loc[:,'Name'] == i]['Global_Sales'].values[0]

print("It sold an astonishing amount of {} million!".format(rnr_sales))
It sold an astonishing amount of 0.060000000000000005 million!

...I swear it was awesome

Rock'n roll racing

Other functionalities

Magic commands

Remember the %matplotlib inline command? Me too.

Commands beginning with % are called magic commands. %matplotlib inline is a line magic command that makes matplotlib charts display in the output cell instead of opening a new window, like the charts above.

There are also cell magic commands, which begin with %% and apply to the whole cell, and plenty of other line magics. Some of the most useful ones are very well explained in this reference; for a complete list, visit the IPython docs page (or type %lsmagic and start feeling it).

Shell Commands

You can run shell commands by putting an exclamation mark at the beginning of the command:

! pip freeze | grep pandas

Github integration

Since May 7th, 2015, GitHub has been able to render Jupyter notebooks.

What does that mean?

Well, it means you can share your notebooks (and view other people's) even with those who don't have Jupyter installed, and you don't even need a GitHub account (although I strongly suggest you create one if that is the case: https://github.com/join).

Other languages

Jupyter doesn't only work with Python. There are plenty of other kernels, and you can also use cell magics to run code in different languages:

%%bash
#!/bin/bash
# declare STRING variable
STRING='Hello World'
# print variable on a screen
echo $STRING

That was it

If you have any questions, feel free to contact me. I also strongly suggest you visit the Jupyter and Python documentation, as well as forums like Quora and StackOverflow, blogs like Dataquest and this awesome collection of notebooks.