1. Getting Started with Python

The online version of the open-access textbook Minimalist Data Wrangling with Python by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). Any bug/typo reports/fixes are appreciated. Although available online, this is a whole course; it should be read from the beginning to the end. In particular, refer to the Preface for general introductory remarks.

1.1. Installing Python

Python was designed and implemented by Dutch programmer Guido van Rossum in the late 1980s. It is an immensely popular object-oriented programming language particularly suitable for rapid prototyping. Its name is a tribute to the funniest British comedy troupe ever. Therefore, we will surely be having a jolly good laugh along our journey.

We will be using the reference implementation of the Python language (called CPython), version 3.9 (or any later one).

Users of Unix-like operating systems (GNU/Linux [1], FreeBSD, etc.) may download Python via their native package manager (e.g., sudo apt install python3 in Debian and Ubuntu). Then, additional Python packages (see Section 1.4) can be installed by the said manager or directly from the Python Package Index (PyPI) via the pip tool.

Users of other operating systems can download Python from the project’s website or some other distribution available on the market.
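
Once Python is installed, we can confirm from within the interpreter itself that it meets the version requirement stated above. Below is a minimal sketch; the helper name meets_requirement is ours and serves illustration purposes only.

```python
import sys

def meets_requirement(version=sys.version_info, minimum=(3, 9)):
    """Return True if `version` is at least `minimum` (major, minor)."""
    return tuple(version[:2]) >= tuple(minimum)

print(sys.version_info[:2])  # (major, minor) of the running interpreter
print(meets_requirement())   # True if we run Python 3.9 or any later one
```

Comparing tuples rather than version strings avoids the pitfall of "3.10" being sorted before "3.9" lexicographically.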

1.2. Working with Jupyter Notebooks

JupyterLab is a web-based development environment supporting numerous programming languages; see Figure 1.1. It is definitely not the most convenient environment for practising data science in Python (writing standalone scripts in a more advanced editor is the preferred option). Still, we have chosen it here because of its educational advantages (interactive, easy to start with, etc.).

More advanced students can consider, for example, jupytext as a means to create .ipynb files directly from Markdown files.

Figure 1.1 JupyterLab at a glance

In JupyterLab, we can work with:

  • Jupyter notebooks, i.e., .ipynb documents combining code, text, plots, tables, and other rich outputs; importantly, code chunks can be created, modified, and run interactively, which makes it a good reporting tool for our basic data science needs;

  • code consoles — terminals for running code chunks interactively (read-eval-print loop);

  • source files in many different languages — with syntax highlighting and the ability to send code to the associated consoles;

and many more.

Exercise 1.1

Head to the official documentation of the JupyterLab project and watch the introductory video linked in the Overview section.

1.2.1. Launching JupyterLab

How we launch JupyterLab will vary from system to system. Everyone needs to determine the best way to do it by themselves.

Some users will be able to start JupyterLab via their start menu/application launcher. Alternatively, we can open the system terminal (bash, zsh, etc.) and type:

cd our/favourite/directory  # change directory
jupyter lab  # or jupyter-lab, depending on the system

This should launch the JupyterLab server and open the corresponding web app in the default web browser.
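
If neither command is recognised, it may be that the launcher is not on the system search path. The following sketch, which relies only on the standard library, checks for that; the helper name find_launcher is hypothetical. Note that finding the generic jupyter command does not by itself guarantee that the lab subcommand is installed.

```python
import shutil

def find_launcher(candidates=("jupyter-lab", "jupyter")):
    """Return the full path of the first candidate command found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)  # None if the command cannot be found
        if path is not None:
            return path
    return None

print(find_launcher())  # a path if a launcher is on PATH, otherwise None
```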

Note

Some commercial cloud-hosted instances or forks of the open-source JupyterLab project are available on the market, but we endorse none of them (even though they might be provided gratis, there are always strings attached). It is best to run our applications locally, where we are free to be in control over the software environment.

1.2.2. First Notebook

Here is how we can create our first notebook.

  1. From JupyterLab, create a new notebook running a Python 3 kernel (for example, by selecting File → New → Notebook from the menu).

  2. Select File → Rename Notebook and change the filename to HelloWorld.ipynb.

    Important

    The file is stored relative to the current working directory of the running JupyterLab server instance. Make sure you can locate HelloWorld.ipynb on your disk using your favourite file explorer (by the way, an .ipynb file is just a JSON file that can also be edited using an ordinary text editor).

  3. Input the following in the code cell:

    print("G'day!")
    
  4. Press Ctrl+Enter (or Cmd+Return on macOS) to execute the code cell and display the result; see Figure 1.2.

Figure 1.2 “Hello World” in a Jupyter Notebook
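
As remarked above, a notebook is stored as plain JSON, so it can be inspected with the standard library alone. The sketch below writes a hand-made, minimal notebook to a temporary directory (a stand-in for the HelloWorld.ipynb saved by JupyterLab, which will carry more metadata) and reads it back. The helper name list_cells is ours, and we assume the usual top-level layout with the cells, metadata, nbformat, and nbformat_minor keys.

```python
import json
import os
import tempfile

# A hand-written, minimal notebook with a single code cell (illustration only).
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print(\"G'day!\")"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

def list_cells(path):
    """Return a list of (cell type, source text) pairs read from a notebook file."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    # "source" is typically a list of lines; joining gives the full cell text
    return [(c["cell_type"], "".join(c["source"])) for c in nb["cells"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "HelloWorld.ipynb")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f)
    cells = list_cells(path)

print(cells)
```

To try this on a real notebook, call list_cells with the path to the file located in the step marked Important above.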

1.2.3. More Cells

Time for some more cells.

  1. By pressing Enter, we can enter the Edit mode. Modify the cell’s contents so that it now reads:

    # My first code cell (this is a comment)
    print("G'day!")  # prints a message (this is a comment too)
    print(2+5)  # prints a number
    
  2. Press Ctrl+Enter to execute the code and replace the previous outputs with the new ones.

  3. Enter a command to print some other message that is to your liking. Note that character strings in Python must be enclosed in either double quotes or apostrophes.

  4. Press Shift+Enter to execute the code cell, create a new one below, and then enter the edit mode.

  5. In the new cell, enter and then execute the following:

    import matplotlib.pyplot as plt  # basic plotting library
    plt.bar(
        ["Python", "JavaScript", "HTML", "CSS"],  # a list of strings
        [80, 30, 10, 15]  # a list of integers (the corresponding bar heights)
    )
    plt.title("What makes you happy?")
    plt.show()
    
  6. Add three more code cells, displaying some text or creating other bar plots.
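
Regarding the remark in step 3 on the two ways of delimiting character strings, here is a short illustration. Both kinds of quotes yield ordinary strings; the choice only affects which characters need escaping inside the literal.

```python
a = "G'day!"          # double quotes: the apostrophe inside needs no escaping
b = 'G\'day!'         # apostrophes: the inner apostrophe must be escaped
c = 'She said "hi".'  # apostrophes: double quotes inside need no escaping

print(a == b)  # both literals denote the same string
print(c)
```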

Exercise 1.2

Change print(2+5) to PRINT(2+5), execute the code chunk and see what happens.

Note

In the Edit mode, JupyterLab behaves like an ordinary text editor. Most keyboard shortcuts known from elsewhere are available, for example:

  • Shift+LeftArrow, DownArrow, UpArrow, or RightArrow – select text,

  • Ctrl+c – copy,

  • Ctrl+x – cut,

  • Ctrl+v – paste,

  • Ctrl+z – undo,

  • Ctrl+] – indent,

  • Ctrl+[ – dedent,

  • Ctrl+/ – toggle comment.

1.2.4. Edit vs Command Mode

Moreover:

  1. By pressing ESC, we can enter the Command mode.

  2. In the Command mode, we can use the DownArrow and UpArrow keys to move between the code cells.

  3. In the Command mode, pressing d,d (d followed by another d) deletes the currently selected cell.

  4. Press z to undo the last operation.

  5. Press a and b to insert a new blank cell, respectively, above and below the current one.

  6. Note that a simple drag and drop can relocate cells.

Important

ESC and Enter switch between the Command and Edit modes, respectively.

1.2.5. Markdown Cells

So far, we have been playing with Code cells. We can convert the current cell to a Markdown block by pressing m in the Command mode (note that by pressing y we can turn it back into a Code cell).

Markdown is a lightweight, human-readable markup language widely used for formatting text documents.

  1. Enter the following into a new Markdown cell:

    # Section
    
    ## Subsection
    
    This ~~was~~ *is* **really** nice.
    
    * one
    * two
        1. aaa
        2. bbbb
    * three
    
    
    ```python
    # some code to display (but not execute)
    2+2
    ```
    
    ![Python](https://www.python.org/static/img/python-logo.png)
    
  2. Press Ctrl+Enter to display the formatted text.

  3. Notice that Markdown cells can be modified by entering the Edit mode as usual (Enter key).

Exercise 1.3

Read the official introduction to the Markdown syntax.

Exercise 1.4

Follow this interactive Markdown tutorial.

Exercise 1.5

Apply what you have learned by making the current Jupyter notebook more readable. Add a header at the beginning of the report featuring your name and email address. Before and after each code cell, explain (in your own words) its purpose and how to interpret the obtained results.

1.3. The Best Note-Taking App

Learning, and this is what we are here for, will not be effective without making notes of the concepts that we come across during this course – many of them will be new to us. We will need to write down some definitions and noteworthy properties of the methods we discuss, draw simple diagrams and mind maps to build connections between different topics, check intermediate results, or derive simple mathematical formulae ourselves.

Let us not waste our time finding the best app for our computers, phones, or tablets. The best and most versatile note-taking solution is an ordinary piece of A4 paper and a pen or a pencil. Loose sheets of paper, 5 mm grid-ruled for graphs and diagrams, work nicely. They can be held together using a cheap landscape clip folder (the one with a clip on the long side). An advantage of this solution is that it can be browsed through like an ordinary notebook. Also, new pages can be added anywhere, and their ordering altered arbitrarily.

1.4. Initialising Each Session and Getting Example Data (!)

From now on, we assume that the following commands have been issued at the beginning of each session:

# import key packages – required:
import numpy as np
import scipy.stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# further settings – optional:
pd.set_option("display.notebook_repr_html", False)  # disable "rich" output

import os
os.environ["COLUMNS"] = "74"  # output width, in characters
np.set_printoptions(linewidth=74)
pd.set_option("display.width", 74)


plt.style.use("seaborn")  # overall plot style

_colours = [  # the "R4" palette
    "#000000", "#DF536B", "#61D04F", "#2297E6",
    "#28E2E5", "#CD0BBC", "#F5C710", "#999999"
]

_linestyles = [
    "solid", "dashed", "dashdot", "dotted"
]

plt.rcParams["axes.prop_cycle"] = plt.cycler(
    # each plotted line will have a different plotting style
    color=_colours, linestyle=_linestyles*2
)
plt.rcParams["patch.facecolor"] = _colours[0]


plt.rcParams.update({  # further graphical parameters
    "font.size":         11,
    "font.family":       "sans-serif",
    "font.sans-serif":   ["Alegreya Sans", "Alegreya"],
    "figure.autolayout": True,
    "figure.dpi":        300,
    "figure.figsize":    (6, 3.5),  # default is [8.0, 5.5],
})

np.random.seed(123)  # initialise the pseudorandom number generator

The above imports the most frequently used packages (together with their usual aliases; we will get to that later). Then, it sets up some further options that yours truly is particularly fond of. On a side note, for a discussion of reproducible pseudorandom number generation, please see Section 6.4.2.
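
To see what the call to np.random.seed buys us, consider the following small sketch (it assumes that numpy is installed, as required above). Resetting the seed to the same value makes the generator output the very same sequence again, which is what makes the results reproducible.

```python
import numpy as np

np.random.seed(123)
first = np.random.rand(3)   # three pseudorandom numbers from [0, 1)

np.random.seed(123)         # reset the generator to the same state
second = np.random.rand(3)  # the same three numbers once more

print(np.array_equal(first, second))
```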

The software we use regularly receives feature upgrades, API changes, and bug fixes. Therefore, it is good to know which version of the Python environment was used to evaluate all the code included in this book:

import sys
print(sys.version)
## 3.9.7 (default, Sep 10 2021, 14:59:43) 
## [GCC 11.2.0]

The versions of the packages that we use in this course are given below. They can usually be fetched by calling, for example, print(np.__version__), etc.

Package                       Version
numpy                         1.22.4
scipy                         1.8.1
pandas                        1.4.2
matplotlib                    3.5.2
seaborn                       0.11.2
sklearn (scikit-learn) (*)    0.24.1
icu (PyICU) (*)               2.8.1
IPython (*)                   8.2.0
mplfinance (*)                0.12.8b9
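
As an alternative to printing each __version__ attribute separately, the versions of the installed packages can be queried without importing them. The sketch below uses importlib.metadata from the standard library; the helper name installed_version is ours, and the distribution names in the list are assumed to be those under which the packages were installed (they may differ from the import names, e.g., scikit-learn vs sklearn).

```python
from importlib import metadata  # part of the standard library

def installed_version(distribution):
    """Return the installed version of a distribution as a string, or None."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None  # the distribution is not installed in this environment

for name in ["numpy", "scipy", "pandas", "matplotlib", "seaborn"]:
    print(name, installed_version(name))
```

A None in the output indicates that the corresponding package is missing from the current environment and should be installed before proceeding.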

We expect 99% of the code listed in this book to work in future versions of our environment. If the reader discovers that this is not the case, filing a bug report at https://github.com/gagolews/datawranglingpy will be much appreciated (for the benefit of other readers).

Important

All example datasets that we use throughout this course are available for download at https://github.com/gagolews/teaching_data.

Exercise 1.6

Ensure you are comfortable accessing raw data files from the above repository. Choose any file, e.g., nhanes_adult_female_height_2020.txt in the marek folder, and then click Raw. It is the URL that you have now been redirected to, not the previous one, that includes the link to be referred to from within your Python session.

Note that each dataset starts with several comment lines explaining its structure, the meaning of the variables, etc.
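
To get a feeling for how such a file might be processed, here is a minimal sketch that separates the leading comment lines from the data. The file contents below are made up for illustration (they are not taken from the repository), and we assume that comment lines start with the # character.

```python
import io

# Made-up contents standing in for a downloaded file (NOT real data).
sample = io.StringIO(
    "# hypothetical dataset\n"
    "# one value per line\n"
    "1.5\n"
    "2.5\n"
    "4.0\n"
)

comments, values = [], []
for line in sample:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    if line.startswith("#"):
        comments.append(line[1:].strip())  # keep the comment text only
    else:
        values.append(float(line))

print(comments)
print(values)
```

In practice, there is rarely a need to do this by hand: for instance, numpy.genfromtxt treats # as the comment character by default.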

1.5. Exercises

Exercise 1.7

What is the difference between the Edit and the Command mode in Jupyter?

Exercise 1.8

What is Markdown?

Exercise 1.9

How do we format a table in Markdown?


[1] GNU/Linux is the operating system of choice for machine learning engineers and data scientists both on the desktop and in the cloud. Switching to a free system at some point cannot be recommended highly enough.