1. Getting started with Python
The open-access textbook Minimalist Data Wrangling with Python by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF; a printed version can be ordered from Amazon). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out the author’s other book, Deep R Programming [34].
1.1. Installing Python
Python was designed and implemented by the Dutch programmer Guido van Rossum in the late 1980s. It is an immensely popular object-orientated programming language, particularly suitable for rapid prototyping. Its name is a tribute to the funniest British comedy troupe ever. We will surely be having a jolly good laugh[1] along our journey.
We will be using the reference implementation of the Python language (called CPython[2]), version 3.10 (or any later one).
Users of UNIX-like operating systems (GNU/Linux[3], FreeBSD, etc.) may download Python via their native package manager (e.g., sudo apt install python3 in Debian and Ubuntu). Then, additional Python packages (see Section 1.4) can be installed by the said manager or directly from the Python Package Index (PyPI) via the pip tool.
Users of other operating systems can download Python from the project’s website or some other distribution available on the market, e.g., Anaconda or Miniconda.
Install Python on your computer.
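Once Python is installed, we might want to confirm that the interpreter is recent enough. Below is a minimal sketch of such a check; the 3.10 threshold comes from the text above, whereas the variable names and the printed messages are ours, for illustration only.

```python
import sys

# The course assumes version 3.10 of Python or any later one.
required = (3, 10)
current = sys.version_info[:2]  # (major, minor) of the running interpreter

if current >= required:
    print(f"Python {current[0]}.{current[1]} is fine.")
else:
    print(f"Python {current[0]}.{current[1]} is too old; please upgrade.")
```

Tuples are compared element by element, so (3, 9) is correctly recognised as older than (3, 10).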
1.2. Working with Jupyter notebooks
Jupyter brings a web browser-based development environment supporting numerous programming languages. Even though, in the long run, it is not the most convenient space for exercising data science in Python (writing standalone scripts in some more advanced editors is the preferred option), we chose it here because of its educative advantages (interactive, easy to start with, etc.).

Figure 1.1 An example Jupyter notebook.
In Jupyter, we can work with:
Jupyter notebooks – .ipynb documents combining code, text, plots, tables, and other rich outputs; importantly, code chunks can be created, modified, and run interactively, which makes it a fine reporting tool for our basic data science needs; see Figure 1.1;
code consoles – terminals for running code chunks interactively (read-eval-print loop);
source files in many different languages – with syntax highlighting and the ability to send code to the associated consoles;
and many more.
Head to the official documentation of the Jupyter project. Watch the introductory video linked in the Overview section.
Note
(*) More advanced students might consider, for example, jupytext as a means to create .ipynb files directly from Markdown documents.
1.2.1. Launching JupyterLab
How we launch JupyterLab (or its lightweight version, Jupyter Notebook) will vary from system to system. We all need to determine the best way to do it by ourselves.
Some users will be able to start JupyterLab via their start menu/application launcher. Alternatively, we can open the system terminal (bash, zsh, etc.) and type:
cd our/favourite/directory # change directory
jupyter lab # or jupyter-lab, depending on the system
This should launch the JupyterLab server and open the corresponding web app in the default web browser.
Note
Some commercial cloud-hosted instances or forks of the open-source JupyterLab project are available on the market, but we endorse none of them (even though they might be provided gratis, there are always strings attached). It is best to run our applications locally, where we are free to be in full control over the software environment (and can also work with it when we have no internet access).
1.2.2. First notebook
Here is how we can create our first notebook.
From JupyterLab, create a new notebook running a Python 3 kernel (for example, by selecting File \(\to\) New \(\to\) Notebook from the menu).
Select File \(\to\) Rename Notebook and change the filename to HelloWorld.ipynb.
Important
The file is stored relative to the current working directory of the running JupyterLab server instance. Make sure you can locate HelloWorld.ipynb on your disk using your favourite file explorer (by the way, .ipynb is just a JSON file that can also be edited using an ordinary text editor).
Input the following in the code cell:
print("G'day!")
Press Ctrl+Enter (or Cmd+Return on macOS) to execute the code cell and display the result; see Figure 1.2.

Figure 1.2 “Hello, World” in a Jupyter notebook.
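The remark that an .ipynb file is just JSON can be illustrated with the standard json module. The sketch below writes a simplified, hand-made stand-in for a one-cell notebook to a temporary directory and reads it back. The field names follow our understanding of version 4 of the notebook format; files saved by JupyterLab carry additional metadata, so this should be treated as an approximation rather than a complete specification.

```python
import json
import os
import tempfile

# A simplified, hand-written stand-in for a notebook file (illustrative only).
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print(\"G'day!\")"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

# Save it as JSON in a temporary directory, then read it back.
path = os.path.join(tempfile.mkdtemp(), "HelloWorld.ipynb")
with open(path, "w", encoding="utf-8") as f:
    json.dump(notebook, f, indent=1)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

# Extract the source code of every code cell.
sources = [
    "".join(cell["source"])
    for cell in loaded["cells"]
    if cell["cell_type"] == "code"
]
print(sources)
```

The same loading step should work on a notebook created in JupyterLab, provided that path points to the actual file on disk.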
1.2.3. More cells
Time for some more cells.
By pressing Enter, we can enter the Edit mode. Modify the cell’s contents so that it now reads:
# My first code cell (this is a comment)
print("G'day!")  # prints a message (this is a comment too)
print(2+5)  # prints a number
Press Ctrl+Enter to execute the code and replace the previous outputs with the new ones.
Enter a command to print some other message that is to your liking. Note that character strings in Python must be enclosed in either double quotes or apostrophes.
Press Shift+Enter to execute the code cell, create a new one below, and then enter the edit mode.
In the new cell, enter and then execute the following:
import matplotlib.pyplot as plt  # basic plotting library
plt.bar(
    ["Python", "JavaScript", "HTML", "CSS"],  # a list of strings
    [80, 30, 10, 15]  # a list of integers (the corresponding bar heights)
)
plt.title("What makes you happy?")
plt.show()
Add three more code cells, displaying some text or creating other bar plots.
Change print(2+5) to PRINT(2+5). Execute the code chunk and see what happens.
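As mentioned in one of the steps above, character strings can be delimited by either double quotes or apostrophes. Here is a short sketch of how the two styles interact; the example messages are ours.

```python
# Both delimiters create ordinary character strings.
a = "G'day!"         # double quotes: an apostrophe can appear inside
b = 'She said "hi"'  # apostrophes: double quotes can appear inside
c = 'G\'day!'        # alternatively, escape the delimiter with a backslash

print(a)
print(b)
print(a == c)  # compares the two strings written in different styles
```

The choice of the delimiter does not affect the resulting string; it only determines which characters need escaping.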
Note
In the Edit mode, JupyterLab behaves like an ordinary text editor. Most keyboard shortcuts known from elsewhere are available, for example:
Shift+LeftArrow, DownArrow, UpArrow, or RightArrow – select text,
Ctrl+c – copy,
Ctrl+x – cut,
Ctrl+v – paste,
Ctrl+z – undo,
Ctrl+] – indent,
Ctrl+[ – dedent,
Ctrl+/ – toggle comment.
1.2.4. Edit vs command mode
Moreover:
By pressing ESC, we can enter the Command mode.
In the Command mode, we can use the DownArrow and UpArrow keys to move between the code cells.
In the Command mode, pressing d,d (d followed by another d) deletes the currently selected cell.
Press z to undo the last operation.
Press a and b to insert a new blank cell, respectively, above and below the current one.
Note that a simple drag and drop can relocate cells.
Important
ESC and Enter switch between the Command and Edit modes, respectively.
1.2.5. Markdown cells
So far we have only been playing with code cells. We can convert the current cell to a Markdown block by pressing m in the Command mode (note that by pressing y we can turn it back to a code cell).
Markdown is a lightweight, human-readable markup language widely used for formatting text documents.
Enter the following into a new Markdown cell:
# Section

## Subsection

This ~~was~~ *is* **really** nice.

* one
* two
    1. aaa
    2. bbbb
* three

```python
# some code to display (but not execute)
2+2
```
Press Ctrl+Enter to display the formatted text.
Notice that Markdown cells can be modified by entering the Edit mode as usual (Enter key).
Read the official introduction to the Markdown syntax.
Follow this interactive Markdown tutorial.
Apply what you learnt by making the current Jupyter notebook more readable. Add a header at the beginning of the report featuring your name and email address. Before and after each code cell, explain (in your own words) its purpose and how we can interpret the obtained results.
1.3. The best note-taking app
Learning, and this is what we are here for, will not be effective without making notes of the concepts that we come across during this course: many of them will be new to us. We will need to write down some definitions and noteworthy properties of the methods we discuss, draw simple diagrams and mind maps to build connections between different topics, check intermediate results, or derive simple mathematical formulae ourselves.
Let us not waste our time finding the best app for our computers, phones, or tablets. The best and most versatile note-taking solution is an ordinary piece of A4 paper and a pen or a pencil. Loose sheets of paper, 5 mm grid-ruled for graphs and diagrams, work nicely. They can be held together using a cheap landscape clip folder (the one with a clip on the long side). An advantage of this solution is that it can be browsed through like an ordinary notebook. Also, new pages can be added anywhere, and their ordering altered arbitrarily.
1.4. Initialising each session and getting example data
From now on, we assume that the following commands are issued at the beginning of each session:
# import key packages – required:
import numpy as np
import scipy.stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# further settings – optional:
pd.set_option("display.notebook_repr_html", False) # disable "rich" output
import os
os.environ["COLUMNS"] = "74" # output width, in characters
np.set_printoptions(linewidth=74)
pd.set_option("display.width", 74)
import sklearn
sklearn.set_config(display="text")
plt.style.use("seaborn-v0_8") # overall plot style
_colours = [  # the "R4" palette
    "#000000f0", "#DF536Bf0", "#61D04Ff0", "#2297E6f0",
    "#28E2E5f0", "#CD0BBCf0", "#F5C710f0", "#999999f0"
]
_linestyles = [
    "solid", "dashed", "dashdot", "dotted"
]
plt.rcParams["axes.prop_cycle"] = plt.cycler(
    # each plotted line will have a different plotting style
    color=_colours, linestyle=_linestyles*2
)
plt.rcParams["patch.facecolor"] = _colours[0]
np.random.seed(123) # initialise the pseudorandom number generator
The above imports the most frequently used packages (together with their usual aliases; we will get to that later). Then, it sets up some further options that yours truly is particularly fond of. On a side note, for a discussion of reproducible pseudorandom number generation, please see Section 6.4.2.
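To see what initialising the pseudorandom number generator gives us in practice, consider the following minimal sketch. It only relies on np.random.seed from the listing above and on np.random.rand; the variable names are ours.

```python
import numpy as np

np.random.seed(123)         # initialise the generator
first = np.random.rand(3)   # three pseudorandom numbers from [0, 1)

np.random.seed(123)         # reset the generator to the same state
second = np.random.rand(3)

print(np.array_equal(first, second))  # are the two sequences identical?
```

Setting the same seed before each run is what makes outputs that involve randomness repeatable across sessions.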
The software we use regularly receives feature extensions, API changes, and bug fixes. It is worthwhile to know which version of the Python environment was used to execute all the code included in this book:
import sys
print(sys.version)
## 3.11.4 (main, Jun 9 2023, 07:59:55) [GCC 12.3.0]
The versions of the packages that we use in this course are given below. They can usually be fetched by calling, for example, print(np.__version__), etc.
| Package | Version |
|---|---|
| numpy | 1.25.2 |
| scipy | 1.11.2 |
| matplotlib | 3.7.2 |
| pandas | 2.1.0 |
| seaborn | 0.12.2 |
| sklearn (scikit-learn) (*) | 1.3.0 |
| icu (PyICU) (*) | 2.11 |
| IPython (*) | 8.15.0 |
| mplfinance (*) | 0.12.10b0 |
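Instead of printing each __version__ attribute separately, we might query the versions of several installed packages in one go. The sketch below is one possible approach, not the way the above table was generated: it uses importlib.metadata from the standard library, which looks packages up by their distribution names (e.g., scikit-learn and not sklearn). Packages that are missing from the current environment are reported as such.

```python
from importlib import metadata

# Distribution names as published on PyPI (e.g., scikit-learn, not sklearn).
packages = [
    "numpy", "scipy", "matplotlib", "pandas", "seaborn", "scikit-learn"
]

versions = {}
for name in packages:
    try:
        versions[name] = metadata.version(name)
    except metadata.PackageNotFoundError:
        versions[name] = None  # not installed in this environment

for name, version in versions.items():
    print(f"{name:<14}{version or 'not installed'}")
```

Comparing this output with the table helps explain any discrepancies between our results and those printed in the book.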
We expect 99% of the code listed in this book to work in future versions of our environment. If the diligent reader discovers that this is not the case, filing a bug report at https://github.com/gagolews/datawranglingpy/issues will be much appreciated (for the benefit of other students).
Important
All example datasets that we use throughout this course are available for download at https://github.com/gagolews/teaching-data.
Ensure you are comfortable accessing raw data files from the above repository. Choose any file, e.g., nhanes_adult_female_height_2020.txt in the marek folder, and then click Raw. It is the URL that you are redirected to, not the previous one, that should be referred to from within your Python session.
Note that each dataset starts with several comment lines explaining its structure, the meaning of the variables, etc.
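The initial comment lines do not have to be removed by hand before reading a file. Here is a minimal sketch that assumes the comments are marked with a leading "#" (our assumption about the files' layout). To keep it self-contained, it parses a made-up in-memory text instead of downloading a real dataset; the numbers below are not actual data.

```python
import io
import numpy as np

# Made-up file contents mimicking the described layout (not real data).
text = """\
# Hypothetical header: heights in centimetres
# (the leading "#" is an assumption about the comment style)
160.2
152.7
161.4
"""

# By default, np.genfromtxt treats "#" as the start of a comment.
heights = np.genfromtxt(io.StringIO(text))
print(heights)
```

In a real session, the in-memory text would be replaced with the URL of the raw file, which np.genfromtxt also accepts.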
1.5. Exercises
What is the difference between the Edit and the Command mode in Jupyter?
How can we format a table in Markdown? How can we insert an image?