1. Getting started with Python
The open-access textbook Minimalist Data Wrangling with Python by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF; a printed version can be ordered from Amazon: AU CA DE ES FR IT JP NL PL SE UK US). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out the author’s other book, Deep R Programming.
1.1. Installing Python
Python was designed and implemented by the Dutch programmer Guido van Rossum in the late 1980s. It is an immensely popular object-orientated programming language, particularly suitable for rapid prototyping. Its name is a tribute to the funniest British comedy troupe ever. We will surely be having a jolly good laugh along our journey.
We will be using the reference implementation of the Python language (called CPython), version 3.10 (or any later one).
Users of UNIX-like operating systems (GNU/Linux, FreeBSD, etc.) may download Python via their native package manager (e.g., sudo apt install python3 in Debian and Ubuntu). Then, additional Python packages (see Section 1.4) can be installed by the said manager or directly from the Python Package Index (PyPI) via the pip tool.
Users of other operating systems can download Python from the project’s website or some other distribution available on the market, e.g., Anaconda or Miniconda.
Install Python on your computer.
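Once Python is installed, a quick way to confirm that the interpreter meets the version assumption stated above is sketched below. It relies only on the standard library; the variable names are illustrative.

```python
import sys

# The text assumes the reference implementation (CPython), version 3.10 or later.
required = (3, 10)
current = sys.version_info[:2]  # (major, minor) of the running interpreter
is_recent_enough = current >= required

print("Implementation:", sys.implementation.name)  # "cpython" for the reference one
print("Version:", ".".join(map(str, current)))
print("Meets the 3.10+ assumption:", is_recent_enough)
```

The snippet can be run from any Python console or, later on, from a notebook cell.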
1.2. Working with Jupyter notebooks
Jupyter brings a web browser-based development environment supporting numerous programming languages. Even though, in the long run, it is not the most convenient space for exercising data science in Python (writing standalone scripts in some more advanced editors is the preferred option), we chose it here because of its educative advantages (interactive, easy to start with, etc.).
In Jupyter, we can work with:
Jupyter notebooks: .ipynb documents combining code, text, plots, tables, and other rich outputs; importantly, code chunks can be created, modified, and run interactively, which makes it a fine reporting tool for our basic data science needs; see Figure 1.1;
code consoles: terminals for running code chunks interactively (read-eval-print loop);
source files in many different languages: with syntax highlighting and the ability to send code to the associated consoles;
and many more.
Head to the official documentation of the Jupyter project. Watch the introductory video linked in the Overview section.
More advanced students might consider tools that create .ipynb files directly from Markdown documents.
1.2.1. Launching JupyterLab
How we launch JupyterLab (or its lightweight version, Jupyter Notebook) will vary from system to system. We all need to determine the best way to do it by ourselves.
Some users will be able to start JupyterLab via their start menu/application launcher. Alternatively, we can open the system terminal (bash, zsh, etc.) and type:
    cd our/favourite/directory  # change directory
    jupyter lab  # or jupyter-lab, depending on the system
This should launch the JupyterLab server and open the corresponding web app in the default web browser.
Some commercial cloud-hosted instances or forks of the open-source JupyterLab project are available on the market, but we endorse none of them (even though they might be provided gratis, there are always strings attached). It is best to run our applications locally, where we are free to be in full control over the software environment (and can also work with it when we have no internet access).
1.2.2. First notebook
Here is how we can create our first notebook.
From JupyterLab, create a new notebook running a Python 3 kernel (for example, by selecting File \(\to\) New \(\to\) Notebook from the menu).
Select File \(\to\) Rename Notebook and change the filename to HelloWorld.ipynb.
The file is stored relative to the current working directory of the running JupyterLab server instance. Make sure you can locate HelloWorld.ipynb on your disk using your favourite file explorer (by the way, .ipynb is just a JSON file that can also be edited using an ordinary text editor).
Input the following in the code cell:
    print("G'day!")
Press Ctrl+Enter (or Cmd+Return on macOS) to execute the code cell and display the result; see Figure 1.2.
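Since an .ipynb file is just JSON, its contents can be inspected with the standard json module. The sketch below builds a simplified, hand-written notebook structure (an assumption made for illustration; files saved by JupyterLab carry more metadata), serialises it, and reads it back the way a script would read a file from disk.

```python
import json

# A simplified notebook document written by hand for illustration;
# real .ipynb files saved by JupyterLab contain additional metadata.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print(\"G'day!\")"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

text = json.dumps(notebook, indent=1)  # the JSON text as it could be stored on disk
loaded = json.loads(text)              # what a script sees after parsing that text

# List each cell's type together with its source code or text.
for cell in loaded["cells"]:
    print(cell["cell_type"], "->", "".join(cell["source"]))
```

To inspect an actual notebook, the text could instead be read from the file with open(...) and passed to json.load.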
1.2.3. More cells
Time for some more cells.
By pressing Enter, we can enter the Edit mode. Modify the cell’s contents so that it now reads:
    # My first code cell (this is a comment)
    print("G'day!")  # prints a message (this is a comment too)
    print(2+5)  # prints a number
Press Ctrl+Enter to execute the code and replace the previous outputs with the new ones.
Enter a command to print some other message that is to your liking. Note that character strings in Python must be enclosed in either double quotes or apostrophes.
Press Shift+Enter to execute the code cell, create a new one below, and then enter the edit mode.
In the new cell, enter and then execute the following:
Add three more code cells, displaying some text or creating other bar plots.
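As a starting point for such cells, here is a minimal sketch of a bar plot drawn with matplotlib. The category names, bar heights, and title are invented for illustration only.

```python
import matplotlib.pyplot as plt  # the plotting library used throughout the course

labels = ["A", "B", "C", "D"]  # hypothetical category names
heights = [80, 30, 10, 15]     # hypothetical bar heights, one per category

bars = plt.bar(labels, heights)   # draw one bar per label
plt.title("An example bar plot")  # hypothetical title
plt.show()
```

Changing the two lists and rerunning the cell with Ctrl+Enter is a quick way to experiment with other bar plots.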
Change print(2+5) to PRINT(2+5). Execute the code chunk and see what happens.
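Python distinguishes between lower- and uppercase letters in names, which is what the exercise above is meant to reveal. The sketch below catches the resulting error so that the cell keeps running; in a notebook, the uncaught error would instead be displayed as a traceback.

```python
try:
    PRINT(2+5)  # deliberately wrong: there is no built-in function named PRINT
except NameError as err:
    message = str(err)
    print("NameError:", message)

print(2+5)  # the correctly spelled, lowercase name works
```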
In the Edit mode, JupyterLab behaves like an ordinary text editor. Most keyboard shortcuts known from elsewhere are available, for example:
Shift+LeftArrow, DownArrow, UpArrow, or RightArrow – select text,
Ctrl+c – copy,
Ctrl+x – cut,
Ctrl+v – paste,
Ctrl+z – undo,
Ctrl+] – indent,
Ctrl+[ – dedent,
Ctrl+/ – toggle comment.
1.2.4. Edit vs command mode
By pressing ESC, we can enter the Command mode.
In the Command mode, we can use the DownArrow and UpArrow keys to move between the code cells.
In the Command mode, pressing d,d (d followed by another d) deletes the currently selected cell.
Press z to undo the last operation.
Press a and b to insert a new blank cell, respectively, above and below the current one.
Note that a simple drag and drop can relocate cells.
ESC and Enter switch between the Command and Edit modes, respectively.
1.2.5. Markdown cells
So far we have only been playing with code cells. We can convert the current cell to a Markdown block by pressing m in the Command mode (note that by pressing y we can turn it back to a code cell).
Markdown is a lightweight, human-readable markup language widely used for formatting text documents.
Enter the following into a new Markdown cell:
# Section

## Subsection

This ~~was~~ *is* **really** nice.

* one
* two
    1. aaa
    2. bbbb
* three

```python
# some code to display (but not execute)
2+2
```

![Python](https://www.python.org/static/img/python-logo.png)
Press Ctrl+Enter to display the formatted text.
Notice that Markdown cells can be modified by entering the Edit mode as usual (Enter key).
Read the official introduction to the Markdown syntax.
Follow this interactive Markdown tutorial.
Apply what you learnt by making the current Jupyter notebook more readable. Add a header at the beginning of the report featuring your name and email address. Before and after each code cell, explain (in your own words) its purpose and how we can interpret the obtained results.
1.3. The best note-taking app
Learning, and this is what we are here for, will not be effective without making notes of the concepts that we come across during this course: many of them will be new to us. We will need to write down some definitions and noteworthy properties of the methods we discuss, draw simple diagrams and mind maps to build connections between different topics, check intermediate results, or derive simple mathematical formulae ourselves.
Let us not waste our time finding the best app for our computers, phones, or tablets. The best and most versatile note-taking solution is an ordinary piece of A4 paper and a pen or a pencil. Loose sheets of paper, 5 mm grid-ruled for graphs and diagrams, work nicely. They can be held together using a cheap landscape clip folder (the one with a clip on the long side). An advantage of this solution is that it can be browsed through like an ordinary notebook. Also, new pages can be added anywhere, and their ordering altered arbitrarily.
1.4. Initialising each session and getting example data
From now on, we assume that the following commands are issued at the beginning of each session:
    # import key packages – required:
    import numpy as np
    import scipy.stats
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # further settings – optional:
    pd.set_option("display.notebook_repr_html", False)  # disable "rich" output

    import os
    os.environ["COLUMNS"] = "74"  # output width, in characters
    np.set_printoptions(linewidth=74)
    pd.set_option("display.width", 74)

    import sklearn
    sklearn.set_config(display="text")

    plt.style.use("seaborn-v0_8")  # overall plot style

    _colours = [  # the "R4" palette
        "#000000f0", "#DF536Bf0", "#61D04Ff0", "#2297E6f0",
        "#28E2E5f0", "#CD0BBCf0", "#F5C710f0", "#999999f0"
    ]

    _linestyles = [
        "solid", "dashed", "dashdot", "dotted"
    ]

    plt.rcParams["axes.prop_cycle"] = plt.cycler(
        # each plotted line will have a different plotting style
        color=_colours, linestyle=_linestyles*2
    )
    plt.rcParams["patch.facecolor"] = _colours[0]  # a single colour is expected here

    np.random.seed(123)  # initialise the pseudorandom number generator
The above imports the most frequently used packages (together with their usual aliases; we will get to that later). Then, it sets up some further options that yours truly is particularly fond of. On a side note, for a discussion of reproducible pseudorandom number generation, please see Section 6.4.2.
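To see what initialising the pseudorandom number generator achieves, consider the minimal sketch below: resetting the seed to the same value makes NumPy reproduce the same sequence of numbers. The variable names are illustrative.

```python
import numpy as np

np.random.seed(123)        # initialise the generator with a fixed seed
first = np.random.rand(3)  # three pseudorandom numbers from [0, 1)

np.random.seed(123)        # reset the generator to the same state
second = np.random.rand(3)

print(np.array_equal(first, second))  # the two draws are identical
```

Without the second call to np.random.seed, the two arrays would in general differ.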
The software we use regularly receives feature extensions, API changes, and bug fixes. It is worthwhile to know which version of the Python environment was used to evaluate all the code included in this book:
    import sys
    print(sys.version)
    ## 3.11.4 (main, Jun 9 2023, 07:59:55) [GCC 12.3.0]
The versions of the packages that we use in this course are given below. They can usually be fetched by calling, for example, print(np.__version__).
sklearn (scikit-learn) (*)
icu (PyICU) (*)
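One possible way to list the installed versions in a single pass is to query the package metadata through the standard library, as sketched below. The list of distribution names is an assumption based on the packages mentioned in this section; packages that are not installed are reported as such rather than raising an error.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI (assumed list, based on this section).
distributions = [
    "numpy", "scipy", "pandas", "matplotlib",
    "seaborn", "scikit-learn", "PyICU",
]

versions = {}
for name in distributions:
    try:
        versions[name] = version(name)  # version string of the installed package
    except PackageNotFoundError:
        versions[name] = None           # not installed in this environment

for name, ver in versions.items():
    print(f"{name:15s} {ver if ver is not None else '(not installed)'}")
```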
We expect 99% of the code listed in this book to work in future versions of our environment. If the diligent reader discovers that this is not the case, filing a bug report at https://github.com/gagolews/datawranglingpy/issues will be much appreciated (for the benefit of other students).
All example datasets that we use throughout this course are available for download at https://github.com/gagolews/teaching-data.
Ensure you are comfortable accessing raw data files from the above repository. Choose any file, e.g., one in the marek folder, and then click Raw. It is the URL that you are redirected to, not the previous one, that should be referred to from within your Python session.
Note that each dataset starts with several comment lines explaining its structure, the meaning of the variables, etc.
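The leading comment lines need to be skipped when a dataset is loaded. The sketch below illustrates this with pandas on a small in-memory text that mimics such a file. The contents are invented, and it is an assumption that the comment lines start with "#". In practice, the raw URL of a file from the repository could be passed to pd.read_csv in place of the in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical file contents: comment lines (assumed to start with "#"),
# followed by a header row and comma-separated data.
raw_text = """# Hypothetical example dataset
# x: some measurement; y: a label
x,y
1.5,a
2.5,b
"""

# comment="#" makes pandas skip everything from "#" to the end of each line.
df = pd.read_csv(io.StringIO(raw_text), comment="#")
print(df)
```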
What is the difference between the Edit and the Command mode in Jupyter?
How can we format a table in Markdown? How can we insert an image?