Jupyter Notebook Tutorial
Wed 05 April 2017
Introduction¶
According to https://jupyter.org/:
“The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.”
Jupyter is an environment for writing code in your language of choice and evaluating its output as you go.
Also, Jupyter is the best IDE(ish) there is, in my humble and final opinion.
Installing¶
Even though Jupyter supports many programming languages, it requires Python to be installed. The easiest way to get both is Anaconda, the most popular Python distribution, which comes with Jupyter Notebook and some of the most used Python modules.
You can download Anaconda for your OS from the following link: https://www.continuum.io/downloads and then proceed with the installation.
To launch the Jupyter Notebook server, either open a command line and type jupyter notebook, or run the Jupyter application.
Creating a notebook¶
Once you’ve started the notebook server, it will open the Notebook Dashboard in your system’s default browser. The dashboard is the Jupyter user interface and it shows a list of files, notebooks and subfolders in the current directory.
In order to create your first notebook, click on the “New” button in the top right corner and then select a kernel (in our case Python 3).
The notebook is organized in cells where you can write code or text.
After you write some code in the input cell In [ ], you can execute it by pressing Shift + Enter or clicking the “Run cell” button in the toolbar; the output will be shown in the output cell Out [ ].
There are other buttons on the toolbar. Here’s a brief list; given their names, their functions are pretty straightforward (icons are missing):
- Save and Checkpoint
- Insert cell below
- Cut selected cells
- Copy selected cells
- Paste cells below
- Move selected cells up
- Move selected cells down
- Run cell, select below
- Interrupt kernel
- Restart the kernel
- Cell type selector
- Open command palette
You can find another pretty good guide in the Jupyter docs.
One of the Jupyter Notebook’s greatest advantages is that you can write some code and check its output right away, which makes it well suited to data exploration, where you need to query a dataset, often applying statistical functions and plotting charts to gain insights or verify impressions.
Also, the Notebook is capable of rendering text formatted in Markdown syntax (the Markdown cell type), enabling you to construct a narrative within your notebook and guide the reader through your flow of thought. This topic is discussed in more depth in this blog post.
The notebook and data analysis¶
To exemplify the advantages mentioned above, we will perform a pretty basic analysis on a dataset.
The first step is to import the required modules:
- pandas - for managing data in the form of dataframes (tabulated)
- numpy - the numerical module
- matplotlib - for plotting charts
The %matplotlib inline command will be discussed in a few moments; for now let us just accept it.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
The dataset we'll be dealing with has some data on videogame sales across the globe and it was put together by the user GregorySmith in the Kaggle community, available at this link: https://www.kaggle.com/gregorut/videogamesales.
First we load the dataset into a dataframe with the pandas function read_csv, passing the file location as its argument, and then take a look at the data. The code below assumes the dataset is in the current directory and uses the os.getcwd() function to build the path.
The dropna() method is called to rid the dataset of any missing data (NaN, NA, NaT, etc.) and the head() method displays the first 5 rows.
import os
df = pd.read_csv(os.path.join(os.getcwd(), 'vgsales.csv'))
df.dropna(inplace=True)
df.head()
There are 11 columns:
- Rank - the title's overall position
- Name - the videogame title
- Platform - the main platform the game was released for
- Year - the year of release
- Genre - the genre of the title
- Publisher - the company responsible for the game
- NA_Sales - total of sales, in millions, in North America
- EU_Sales - total of sales, in millions, in Europe
- JP_Sales - total of sales, in millions, in Japan
- Other_Sales - total of sales, in millions, in the rest of the world
- Global_Sales - total of sales, in millions, worldwide
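Before moving on, it may help to see what the dropna() step above actually does. The sketch below uses a tiny stand-in frame, not the real dataset; the names toy, missing_per_column and clean, and all the values, are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; names and values are made up for illustration.
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Year": [2006.0, np.nan, 1996.0],
    "Publisher": ["Pub X", "Pub Y", None],
})

# Count the missing values in each column before dropping anything.
missing_per_column = toy.isna().sum()

# dropna() removes every row that has at least one missing value.
clean = toy.dropna()

print(missing_per_column)
print(clean)
```

Note that dropna() discards a whole row even when a single field is missing, so checking the per-column counts first shows how much data the cleanup costs.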
Now for the questions:¶
What are the 20 worldwide most sold games of all time?
We can find that out by plotting a horizontal bar chart of the first 20 entries in the Global_Sales column, in descending order
df.iloc[0:20][::-1].plot.barh(x='Name',y='Global_Sales',use_index=True)
plt.xlabel('Global Sales in millions')
As expected, we can see classics like Mario and Pokémon, as well as the more recent success of the Grand Theft Auto franchise, dominating the first positions.
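A note on the slicing used for that chart: iloc[0:20] takes the first 20 rows and [::-1] reverses them, because a horizontal bar chart draws rows from the bottom of the axis upwards. The sketch below shows this on a made-up frame (toy and its values are assumptions) and assumes, as the text implies, that the rows are already ordered by sales. It also shows nlargest, which does not depend on the existing row order.

```python
import pandas as pd

# Made-up frame, already sorted by sales in descending order.
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Global_Sales": [40.0, 30.0, 20.0, 10.0],
})

top3 = toy.iloc[0:3]        # first three rows: A, B, C
reversed_top3 = top3[::-1]  # same rows, bottom-up: C, B, A

# Plotting reversed_top3 with barh would place the best seller (Game A)
# at the top of the chart, since the last row is drawn highest.
print(reversed_top3["Name"].tolist())

# nlargest picks the top rows by value, whatever the current row order is.
print(toy.nlargest(3, "Global_Sales")["Name"].tolist())
```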
But this result is for worldwide sales. One can wonder if the cultural differences between Japan and North America would lead to different results.
Let's make two more top 20 horizontal bar charts, one for Japan and the other for North America, and place them side by side
fig = plt.figure(figsize=(12,4))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
df.sort_values('JP_Sales',ascending=False).iloc[0:20,:][::-1].plot.barh(
x='Name',y='JP_Sales',use_index=True,ax=ax1,legend=False)
ax1.set_xlabel('Japan Sales in millions')
df.sort_values('NA_Sales',ascending=False).iloc[0:20,:][::-1].plot.barh(
x='Name',y='NA_Sales',use_index=True,ax=ax2,legend=False)
ax2.set_xlabel('North America Sales in millions')
plt.tight_layout()
The tastes are indeed quite different: only 5 titles appear in both lists (Super Mario Bros., New Super Mario Bros., Tetris, Pokémon Red/Blue and Mario Kart for Nintendo DS). Also, it appears the Pokémon franchise was even more successful in the East (6 of the top 20, with 5 of those in the top 7)!
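Counting the titles shared by the two charts does not have to be done by eye. One possible way, sketched on invented data (toy, n and every figure below are assumptions, not values from the dataset), is to take the top entries for each region and intersect them as sets.

```python
import pandas as pd

# Invented regional sales figures for four hypothetical games.
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "JP_Sales": [5.0, 1.0, 4.0, 0.5],
    "NA_Sales": [9.0, 8.0, 0.2, 7.0],
})

n = 2  # the charts above use 20; 2 keeps the toy example readable

# Names of the n best sellers in each region.
top_jp = set(toy.nlargest(n, "JP_Sales")["Name"])
top_na = set(toy.nlargest(n, "NA_Sales")["Name"])

# Set intersection keeps only the titles present in both rankings.
in_both = sorted(top_jp & top_na)
print(in_both)
```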
At first sight it appears that Nintendo games fare really well in all corners (?) of the globe, but is the company as a whole among top sellers?
Let's find out by creating a new dataframe with the total of sales by publishers.
First we get the unique values in the Publisher column by creating a set and then turn it into a list.
unique_publisher = list(set(df['Publisher']))
Next we create a list with the total global sales for each publisher. We do that with the aid of a list comprehension and boolean indexing.
- we first iterate over unique_publisher with the iteration variable i
- then we find matches for it in the Publisher column of our original dataset
- adding the corresponding values in the Global_Sales column
- we finish it by wrapping it with [], creating a list
sales_publisher = [df[df.loc[:,'Publisher'] == i]['Global_Sales'].sum() for i in unique_publisher]
Now we build the new dataframe with columns Publisher and Global_Sales and data unique_publisher and sales_publisher. Let's take a look.
publisher_sales = pd.DataFrame({'Publisher':unique_publisher,'Global_Sales':sales_publisher})
publisher_sales.head()
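For comparison, pandas can produce the same per-publisher totals with groupby. The sketch below runs both approaches on a small invented frame (toy, by_loop, by_groupby and the sales values are assumptions for illustration) and shows that they agree.

```python
import pandas as pd

# Invented data: three hypothetical publishers, five games.
toy = pd.DataFrame({
    "Publisher": ["Pub X", "Pub Y", "Pub X", "Pub Z", "Pub Y"],
    "Global_Sales": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Same approach as in the text: one boolean mask and one sum per publisher.
unique_publisher = list(set(toy["Publisher"]))
sales_publisher = [toy[toy["Publisher"] == p]["Global_Sales"].sum()
                   for p in unique_publisher]
by_loop = dict(zip(unique_publisher, sales_publisher))

# groupby splits the rows by publisher and sums each group in one call.
publisher_sales = toy.groupby("Publisher", as_index=False)["Global_Sales"].sum()
by_groupby = dict(zip(publisher_sales["Publisher"],
                      publisher_sales["Global_Sales"]))

print(publisher_sales)
print(by_loop == by_groupby)
```

The list comprehension scans the whole frame once per publisher, while groupby does the grouping in a single pass, which tends to matter as the number of publishers grows.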
Is Nintendo at the top? Also, what does the top 20 look like? Well, on to another horizontal bar plot
publisher_sales.sort_values('Global_Sales',ascending=False).iloc[0:20,:][::-1].plot.barh(
x='Publisher',y='Global_Sales',legend=False)
plt.xlabel('Global Sales In millions')
Nintendo is still the uncontested champion, by an even greater margin than expected. We can still see household names like Sega (9th), Nintendo's great competitor in the 90's, Capcom (12th), responsible for one of the most iconic franchises of the fighting genre in Street Fighter, and Disney Interactive Studios (16th), Disney's videogame division.
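Positions such as "9th" can be read off the chart, but they can also be computed. A minimal sketch, assuming a publisher_sales frame shaped like the one built above (the publishers and totals here are invented, and the Position column is a name introduced only for this example):

```python
import pandas as pd

# Invented totals for three hypothetical publishers.
publisher_sales = pd.DataFrame({
    "Publisher": ["Pub X", "Pub Y", "Pub Z"],
    "Global_Sales": [4.0, 7.0, 1.5],
})

# Sort from best to worst seller and renumber the rows from zero.
ranked = (publisher_sales
          .sort_values("Global_Sales", ascending=False)
          .reset_index(drop=True))

# The new index is the zero-based rank, so position = index + 1.
ranked["Position"] = ranked.index + 1

# Look up the position of a single publisher.
position_of_x = int(ranked.loc[ranked["Publisher"] == "Pub X", "Position"].iloc[0])
print(ranked)
print(position_of_x)
```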
Bonus¶
Is my childhood favorite game, "Rock and roll racing", on this list? Was it a success?
Lemme see
df[df.loc[:,'Name']=='Rock and roll racing']
Nothing. Maybe I mistyped? Let's check it manually, how hard can it be? Let's see how many games there are whose titles begin with "R"
len([i for i in df.loc[:,'Name'].values if i[0]=='R'])
701 is quite a lot.
Time to work smart, not hard.
In one word >> regular[\s]expressions
(recursive joke)
import re
regex=re.compile(r'rock[\S\s]+ing',re.IGNORECASE)
[m.group(0) for l in [i for i in df.loc[:,'Name'].values if i[0]=='R'] for m in [regex.search(l)] if m]
Two entries? Oh, I see, title case and a little uncertainty regarding the apostrophe position.
All the merrier, let's combine 'em and see how many copies of this amazing game were sold:
rnr_sales = 0
for i in ["Rock 'N Roll Racing", "Rock N' Roll Racing"]:
    # add the global sales of each spelling of the title
    rnr_sales += df[df.loc[:,'Name'] == i]['Global_Sales'].values[0]
print("It sold an astonishing amount of {} million!".format(rnr_sales))
...I swear it was awesome
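As a side note, the search and the sum above can also be written with pandas string methods. This is only a sketch on a toy frame: the two Rock n' Roll titles come from the search result above, while the other titles and all the sales figures are invented. The loop above indexes .values[0], which raises an IndexError when a title is absent; summing over a boolean mask simply returns 0 in that case.

```python
import pandas as pd

# Toy stand-in for the dataset; the sales numbers are invented.
toy = pd.DataFrame({
    "Name": ["Rock 'N Roll Racing", "Rock N' Roll Racing", "Rayman", "Rock Band"],
    "Global_Sales": [0.5, 0.25, 1.0, 2.0],
})

# Case-insensitive regex test applied to every title at once.
mask = toy["Name"].str.contains(r"rock[\S\s]+ing", case=False, regex=True)
matches = toy.loc[mask, "Name"].tolist()

# Sum the sales of all matching rows, whatever their exact spelling.
rnr_sales = toy.loc[mask, "Global_Sales"].sum()

print(matches)
print(rnr_sales)
```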
Other functionalities¶
Magic commands¶
Remember the %matplotlib inline command? Me too.
Commands beginning with % are called magic commands. %matplotlib inline is a line magic command that allows matplotlib charts to be displayed in the output cell (it doesn't open a new window with the plot), like the charts above.
There are also cell magic commands and plenty of other line ones. Some of the most useful ones are very well explained in this reference; for a complete list visit the IPython docs page (or type %lsmagic and start feeling it).
Shell Commands¶
You can use shell commands by putting an exclamation mark at the beginning of the command
! pip freeze | grep pandas
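The line above relies on grep, which is not available in every shell (a default Windows command prompt, for example). A rough pure-Python alternative, assuming Python 3.8 or later, is sketched below; installed_version is a hypothetical helper name, not part of Jupyter or pip.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the pandas version if it is installed, otherwise None.
print(installed_version("pandas"))
```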
Github integration¶
Since May 7th, 2015, GitHub has been able to render Jupyter notebooks.
What does that mean?
Well, it means you can share your notebooks (and view those of others) even with people who don't have Jupyter installed, and you don't even need a GitHub account (although I firmly suggest you create one if that is the case: https://github.com/join).
Other languages¶
Jupyter doesn't only work with Python. There are plenty of other kernels and you can also use cell magics to run code in different languages
%%bash
#!/bin/bash
# declare STRING variable
STRING='Hello World'
# print variable on a screen
echo $STRING
That was it¶
If you have any questions, feel free to contact me. I also strongly suggest you visit the Jupyter and Python documentation, as well as forums like Quora and StackOverflow, blogs like Dataquest, and this awesome collection of notebooks.