The online version of the open-access textbook Minimalist Data Wrangling with Python by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). Any bug or typo reports and fixes are appreciated. Although available online, this is a whole course; it should be read from the beginning to the end. In particular, refer to the Preface for general introductory remarks.
The Art of Data Wrangling
Data science[1] aims at making sense of and generating predictions from data that have[2] been collected in copious quantities from various sources, such as physical sensors, surveys, online forms, access logs, and (pseudo)random number generators, to name a few. They can take diverse forms, e.g., vectors, matrices, and other tensors, graphs/networks, audio/video streams, text, etc.
Researchers in psychology, economics, sociology, agriculture, engineering, biotechnology, pharmacy, medicine, and genetics, amongst many others, need statistical methods to make discoveries as well as confirm or falsify existing theories. What is more, with the increased availability of open data, everyone can do remarkable work for the common good, e.g., by volunteering for not-for-profit NGOs or debunking false news and overzealous acts of wishful thinking on any side of the political spectrum.
Furthermore, data scientists, machine learning engineers, statisticians, and business analysts are amongst the best-paid specialists. This is because data-driven decision-making, modelling, and prediction have already proven themselves especially useful in many domains:
financial services (banking, insurance, investment funds),
Okay, to be frank, the above list was generated by duckduckgoing the “biggest industries” query. That was an easy task. Data science (and its different flavours, including operational research, machine learning, business and artificial intelligence) can be applied wherever we have some relevant data at hand and there is a need to improve or understand the underlying processes.
Miniaturisation, increased computing power, cheaper storage, and the popularity of various internet services all caused data to become ubiquitous. Think about how much information people consume and generate when they interact with different news feeds or social media on their phones.
Data usually do not come in a tidy and tamed form. Also, at first glance, it is rarely known what we can get out of them. And thus, data wrangling is the very broad process of appropriately curating raw information chunks and then exploring the underlying data structure so that they become analysable.
Aims, Scope, and Design Philosophy
This course is envisaged as a student’s first, broad, yet well-structured exposure to data science[3].
By no means do we have the ambition to be comprehensive with regard to any topic we cover. Time for that will come later in dedicated units related to calculus, matrix algebra, probability, mathematical statistics, continuous and combinatorial optimisation, information theory, stochastic processes, statistical/machine learning, algorithms and data structures, take a deep breath, databases and big data analytics, operational research, graphs and networks, differential equations and dynamical systems, time series analysis, signal processing, etc.
Instead, we lay very solid groundwork for all the above by introducing all the objects at an appropriate level of generality and building the most crucial connections between them. We provide the necessary intuitions behind the more advanced methods and concepts. This way, further courses do not need to waste our time introducing the most elementary definitions and answering the metaphysical questions like “but why do we need that (e.g., matrix multiplication) at all”.
And thus, in this course, we are going to explore methods for:
performing exploratory data analysis (e.g., aggregation and visualisation),
working with different types of data (e.g., numerical, categorical, text, time series),
cleaning data gathered from structured and unstructured sources, e.g., by identifying outliers, normalising strings, extracting numbers from text, imputing missing data,
transforming, selecting, and extracting features, dimensionality reduction,
identifying naturally occurring data clusters, comparing data between groups,
discovering patterns in data via approximation/modelling approaches using the most popular probability distributions and the easiest to understand statistical/machine learning algorithms,
testing whether two data distributions differ significantly from each other,
reporting the results of data analysis.
We primarily focus on methods and algorithms that have stood the test of time and that continue to inspire researchers and practitioners. They all pass a reality check comprising the following three properties, which we believe are essential in practice:
simplicity (and thus interpretability, being equipped with no or only a few underlying tunable parameters; being based on some sensible intuitions that can be explained in our own words),
mathematical analysability (at least to some extent; so that we can understand their strengths and limitations),
implementability (not too abstract on the one hand, but also not requiring any advanced computer-y hocus-pocus on the other).
Many more complex algorithms are merely variations on or clever combinations of the more basic ones. This is why we need to study the fundamentals in great detail. We might not see it now, but this will become evident as we progress.
We Need Maths
The maths we introduce is the most elementary possible, in a good sense. Namely, we do not go beyond:
simple analytic functions (affine maps, logarithms, cosines),
the natural linear ordering of points on the real line (and the lack thereof in the case of multidimensional data),
the sum of squared differences between things (including the Euclidean distance between points),
linear vector/matrix algebra, e.g., to represent the most useful geometric transforms (rotation, scaling, translation),
the frequentist interpretation (as in: in samples of large sizes, we expect that…) of some common objects from probability theory and statistics.
This is the kind of toolkit that we believe is a sine qua non requirement for every prospective data scientist. We cannot escape falling in love with it.
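To give a flavour of two items from the above list, here is a minimal sketch in base Python: the Euclidean distance written as the square root of a sum of squared differences, and a rotation of a point in the plane expressed through the entries of a 2×2 rotation matrix. The function names `euclidean_distance` and `rotate` are our own, chosen only for this illustration.

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared differences between coordinates."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def rotate(point, angle):
    """Rotate a 2D point about the origin by `angle` radians (counterclockwise)."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = point
    # This is the matrix-vector product [[c, -s], [s, c]] @ [x, y].
    return (c * x - s * y, s * x + c * y)

p, q = (0.0, 0.0), (3.0, 4.0)
d = euclidean_distance(p, q)           # 5.0
r = rotate((1.0, 0.0), math.pi / 2)    # approximately (0.0, 1.0)
print(d, r)
```

Note that a rotation does not change the distances between points; checking this for a few angles of our choice is a pleasant exercise.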
We Need Some Computing Environment
We no longer practise data analysis using a piece of paper and a pencil[4]. Even though there are some dedicated computer programs that solve the most common problems arising in the most straightforward scenarios (e.g., spreadsheet-like click-here click-there stand-alone statistical packages), we need a tool that will enable us to respond to any challenge in a scientifically rigorous (and hence well organised and documented as well as reproducible) manner.
In this course, we will be writing code in Python, which we will introduce from scratch; thus, we do not require any prior programming experience.
Still, Python and the third-party packages written in it are just some of the many software tools that can help us gain new knowledge from data. Other[5] open-source choices include, e.g., R and Julia. And many new ones will emerge in the future.
We will emphasise developing transferable skills: most of what we learn here can be applied (using a different syntax but the same kind of reasoning) in other environments. In other words, this is a course on data wrangling (with Python), not a course on Python (with examples in data wrangling).
We want the reader to become an independent user of this computing environment. Somebody who is not overwhelmed when they are faced with any intermediate-level data analysis problem. A user whose habitual response to a new challenge is not to look everything up on the internet even in the simplest possible scenarios. In other words, we value creative thinking.
We believe we have found a good trade-off between the minimal set of tools that need to be mastered and the less frequently used ones that can later be found in the documentation or on the internet. In other words, the reader will discover the joy of programming and using their logical thinking to tinker with things.
We Need Data and Domain Knowledge
There is no data science or machine learning without data, and data’s purpose is to represent a given problem domain. Mathematics allows us to study different processes at a healthy level of abstractness/specificity. Still, we should always be familiar with the reality behind the numbers we have at hand, for example, by working closely with various experts in the field of our interest or pursuing our own study/research therein.
Courses such as this one, out of necessity, must use some generic datasets that are quite familiar to most readers (e.g., data on life expectancy and GDP in different countries, time to finish a marathon, yearly household incomes).
Still, many textbooks introduce statistical concepts using carefully crafted datasets where everything runs smoothly, and all models work out of the box. This gives a false sense of security. In practice, however, most datasets are not only unpolished but also (even after some careful treatment) uninteresting. Such is life. We will not be avoiding the more difficult problems in our journey.
This book is a whole course and should be read from the beginning to the end.
The material has been divided into 5 parts.
Chapter 1 discusses how to execute the first code chunks in Jupyter Notebooks, which are a flexible tool for the reproducible generation of reports from data analyses.
Chapter 2 introduces the basic scalar types in base Python, ways to call existing and to write our own functions, and control a code chunk’s execution flow.
Chapter 3 mentions sequential and other iterable types in base Python; more advanced data structures (vectors, matrices, data frames) that we introduce below will build upon these concepts.
Chapter 4 introduces vectors from numpy, which we use for storing data on the real line (think: individual columns in a tabular dataset). Then, we look at the most common types of empirical distributions of data (e.g., bell-shaped, right-skewed, heavy-tailed ones).
In Chapter 5 we list the most basic ways for processing sequences of numbers, including methods for data aggregation, transformation (e.g., standardisation), and filtering. We also mention that a computer’s floating-point arithmetic is imprecise and what we can do about it.
Chapter 6 reviews the most common probability distributions (normal, log-normal, Pareto, uniform, and mixtures thereof), methods for assessing how well they fit empirical data, and pseudorandom number generation that is crucial for experiments based on simulations.
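To hint at what the chapters on unidimensional data are about, below is a rough sketch of aggregation, standardisation, filtering, the imprecision of floating-point arithmetic, and seeded pseudorandom number generation. It deliberately relies on base Python only (the course itself uses numpy for these tasks), and the numbers in `x` are made up for this illustration.

```python
import math
import random
import statistics

x = [5.0, 7.0, 8.0, 10.0, 15.0]   # made-up toy data

# Aggregation: summarise the whole sequence with a single number.
mean = statistics.mean(x)    # 9.0
sd = statistics.stdev(x)     # sample standard deviation

# Transformation: standardisation (subtract the mean, divide by the
# standard deviation).
z = [(xi - mean) / sd for xi in x]

# Filtering: keep only the observations above the mean.
above = [xi for xi in x if xi > mean]    # [10.0, 15.0]

# Floating-point arithmetic is imprecise: compare with a tolerance
# instead of using ==.
print(0.1 + 0.2 == 0.3)                # False
print(math.isclose(0.1 + 0.2, 0.3))    # True

# Pseudorandom numbers are reproducible once the generator is seeded.
rng = random.Random(123)
sample1 = [rng.gauss(0.0, 1.0) for _ in range(3)]
rng = random.Random(123)
sample2 = [rng.gauss(0.0, 1.0) for _ in range(3)]
print(sample1 == sample2)              # True
```

After standardisation, the values have mean 0 and standard deviation 1 (up to a tiny round-off error), which makes data measured on different scales comparable.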
Chapter 7 introduces matrices from numpy. They are a convenient means of storing multidimensional quantitative data (many points described by possibly many numerical features). We also present some methods for their visualisation (and the problems arising from our being three-dimensional creatures).
Chapter 8 is devoted to basic operations on matrices. We will see that some of them simply extend what we have learned in Chapter 5, but there is more: for instance, we discuss how to determine the set of each point’s nearest neighbours.
Chapter 9 discusses ways to explore the most basic relationships between the variables in a dataset: the Pearson and Spearman correlation coefficients (and what it means that correlation is not causation), k-nearest neighbour and linear regression (including the sad cases where a model matrix is ill-conditioned), and finding interesting combinations of variables that can help reduce the dimensionality of a problem (via the so-called principal component analysis).
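As a small preview of the correlation coefficients mentioned above, the following sketch implements them from scratch in base Python. It assumes that there are no tied observations (a proper implementation must handle ties), and the helper names `pearson`, `ranks`, and `spearman` are ours.

```python
import math

def pearson(x, y):
    """Pearson's linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(x):
    """Ranks 1..n; this simple version assumes that there are no ties."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient computed on ranks."""
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [math.exp(v) for v in x]   # increasing, but nonlinear in x

r = pearson(x, y)       # strictly less than 1: the relationship is not linear
rho = spearman(x, y)    # equal to 1: the relationship is monotone
print(round(r, 3), round(rho, 3))
```

The two coefficients disagree here by design: Pearson's measures linear association, whereas Spearman's only asks whether one variable increases whenever the other does.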
Chapter 10 introduces DataFrame objects from pandas, which we can think of as vectors and matrices on steroids. For instance, they allow rows and columns to be labelled and columns to be of different types. We emphasise that most of what we have learned in the previous chapters still applies, but there is even more: for example, methods for joining (merging) many datasets, converting between long and wide formats, etc.
In Chapter 11 we introduce the ways to represent and handle categorical data as well as how (not) to lie with statistics.
Chapter 12 covers the case of aggregating, transforming, and visualising data in groups by one or more qualitative variables, including classification with k-nearest neighbours (when we are asked to fill the gaps in a categorical variable). We will also try to discover the naturally occurring partitions with the k-means method, which is an example of a computationally hard optimisation problem that needs to be tackled with some imperfect heuristics.
Chapter 13 is an interlude where we solve some pleasant exercises on data frames and learn the basics of SQL. This will be useful when we are faced with datasets that do not fit into a computer’s memory.
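For a first taste of SQL, the sketch below uses the `sqlite3` module from Python's standard library to aggregate data in groups. The table name `marathon`, its columns, and all the values are hypothetical and serve only as an illustration; for simplicity, the database is created in memory, whereas larger-than-memory datasets would live in a file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # a temporary database for this demo
conn.execute("CREATE TABLE marathon (runner TEXT, category TEXT, minutes REAL)")
conn.executemany(
    "INSERT INTO marathon VALUES (?, ?, ?)",
    [("A", "M", 185.0), ("B", "F", 200.0), ("C", "M", 215.0), ("D", "F", 190.0)],
)

# The database engine performs the grouping and aggregation;
# only the (small) result is fetched into Python.
rows = conn.execute(
    "SELECT category, COUNT(*), AVG(minutes) "
    "FROM marathon GROUP BY category ORDER BY category"
).fetchall()
print(rows)   # [('F', 2, 195.0), ('M', 2, 200.0)]
conn.close()
```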
Other Data Types:
Chapter 14 discusses ways to handle text data and extract information from them, e.g., through regular expressions. We also briefly mention the challenges related to the processing of non-English text, including phrases like pozdro dla ziomali z Bródna, Viele Grüße und viel Spaß, and χαίρετε.
Chapter 15 emphasises that some data may be missing or questionable (e.g., censored, incorrect, rare) and discusses what we can do about them.
In Chapter 16 we cover the most basic methods for the processing of time series, because, ultimately, everything changes, and we should be able to track the evolution of things.
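The following sketch hints at what handling such data types may involve: extracting numbers from text with a regular expression, comparing non-English strings after Unicode normalisation, and imputing missing values. It uses base Python only; the example string and the values are made up, and replacing gaps with the mean is merely one of the simplest (and crudest) imputation strategies, not a recommendation.

```python
import math
import re
import unicodedata

text = "Temp: 21.5 C, humidity 40%, wind -3 km/h"   # made-up example

# Extract all (possibly signed, possibly fractional) numbers from a string.
numbers = [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", text)]
# [21.5, 40.0, -3.0]

# Non-English text: identical-looking strings can consist of different code
# points; normalising to a common form (here NFC) makes them comparable.
s1 = "Gr\u00fc\u00dfe"      # 'ü' as a single code point
s2 = "Gru\u0308\u00dfe"     # 'u' followed by a combining diaeresis
same_raw = (s1 == s2)                                  # False
same_nfc = (unicodedata.normalize("NFC", s2) == s1)    # True

# Missing data: represent gaps by NaN and replace them with the mean
# of the observed values.
x = [1.0, math.nan, 3.0, 5.0, math.nan]
observed = [v for v in x if not math.isnan(v)]
fill = sum(observed) / len(observed)    # 3.0
imputed = [fill if math.isnan(v) else v for v in x]
print(numbers, same_raw, same_nfc, imputed)
```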
(*) Parts marked with a single or double asterisk can be skipped upon first reading as they are of increased difficulty and are not essential for beginner students.
Our goal here, in the long run, is for you, dear reader, to become a skilled expert who is independent, ethical, and capable of critical thinking; one who hopefully will make a small contribution towards making this world a slightly better place.
To guide you through it, we have a few tips for you.
Follow the rules.
Technical textbooks are not belletristic. Sometimes a single page will be very meaning-intense. Do not try to consume too much at the same time. Go for a walk, reflect on what you have learned, and build connections between different concepts. In case of any doubt, go back to one of the previous sections. Learning is an iterative process, not a linear one.
Solve all the suggested exercises. We might be introducing new concepts or developing crucial intuitions there as well. Also, try implementing most of the methods you learn about from scratch instead of looking for copy-paste solutions on the internet. How else will you master the material and develop the necessary programming skills?
Code is an integral part of the text. Each piece of good code is worth 1234 words (on average). Do not skip it. On the contrary, you should play and experiment with it. Run every major line of code, inspect the results generated, and read more about the functions you use in the official documentation. What is the type (class) of the object returned? If it is an array or a data frame, what is its shape? What would happen if we replaced X with Y? Do not fret; your computer will not blow up.
Harden up[6]. Your journey towards expertise will take years; there are no shortcuts, but it will be quite enjoyable every now and then, so don’t give up. Still, sitting all day in front of your computer is unhealthy – exercise and socialise between 28 and 31 times per month, because you’re not, nor will ever be, a robot.
Learn maths. Our field has a very long history and stands on the shoulders of many giants; many methods we use these days are merely minor variations on the classical, fundamental results that date back to Newton, Leibniz, Gauss, and Laplace. Therefore, eventually, you will need some working knowledge of mathematics to understand them (linear algebra, calculus, probability and statistics). Remember that software products/APIs seem to change frequently, but they are just a facade, a flashy wrapping around the methods we have been using for quite a while.
Use only methods that you can explain. Even though they may be easily accessible, you should refrain from working with algorithms/methods/models whose definitions (pseudocode, mathematical formulae, objective functions they are trying to optimise) and properties you do not know, understand, or cannot rephrase in your own words.
Compromises are inevitable[7]. There will never be a single best metric, algorithm, or way to solve all the problems. Even though some solutions might be better than others with regard to specific criteria, this will only be true under certain assumptions (if they fit a theoretical model). Beware that focusing too much on one aspect leads to undesirable consequences with respect to other factors, especially those that cannot be measured easily. Therefore, refraining from improving things might sometimes be better than pushing too hard. Always apply common sense.
Be scientific and ethical. Make your reports reproducible, your toolkit well-organised, and all the assumptions you make explicit. Develop a dose of scepticism and impartiality towards everything, from marketing slogans, through your ideological biases, to all hotly debated topics. Most data analysis exercises end up with conclusions like: “it’s too early to tell”, “data don’t show it’s either way”, “there is a difference, but it is hardly significant”, “yeah, but our sample is not representative of the entire population” – and there is nothing wrong with this. Remember that it is highly unethical to use statistics to tell lies; this includes presenting only one side of the overly complex reality and totally ignoring all the other ones (compare Rule #8). Using statistics for doing dreadful things (tracking users to find their vulnerabilities, developing products and services which are addictive) is a huge no-no!
The best things in life are free. These include the open-source software and open-access textbooks (such as this one) we use in our journey. Spread the good news about them and – if you can – contribute something (even as small as reporting typos in their documentation or helping others in different forums when they are stuck with something) so that you are not only a taker. After all, it is our shared responsibility.
Minimalist Data Wrangling with Python is based on my experience as an author of a quite successful textbook, Przetwarzanie i analiza danych w języku Python (Data Processing and Analysis in Python) [GBC16], which I wrote (in Polish, 2016, published by PWN) with my former (successful) PhD students Maciej Bartoszuk and Anna Cena – thanks! The current book is an entirely different work; however, its predecessor served as an excellent testbed for many ideas conveyed here.
The teaching style exercised in this book has proven successful in many similar courses that yours truly has been responsible for, including at Warsaw University of Technology, Data Science Retreat (Berlin), and Deakin University (Geelong/Melbourne). I thank all my students for the feedback given over the last 10 or so years.
A thank-you to all the authors and contributors of the Python packages that we use throughout this course: numpy [H+20], scipy [V+20], matplotlib [Hun07], seaborn [Was21], and pandas [McK17], amongst others (as well as the many C/C++/Fortran libraries they provide wrappers for). Their version numbers are given in Section 1.4.
This book has been prepared in a Markdown superset. Python code chunks have been processed with the R (sic!) package knitr [Xie15]. A little help from Makefiles, custom shell scripts, and Sphinx plugins dotted the j’s and crossed the f’s. The Ubuntu Mono font is used for the display of code.
[1] Traditionally known as statistics.
[2] Yes, data are plural (datum is singular).
[3] We might have entitled it Introduction to Data Science (with Python).
[4] We acknowledge that some more theoretically inclined readers might ask the question: but why do we need programming at all? Unfortunately, some mathematicians have forgotten that probability and statistics are deeply rooted in the so-called real world. We should remember that theory beautifully supplements practice and provides us with very deep insights, but we still need to get our hands dirty from time to time.
[5] There are also some commercial solutions available on the market, but we believe that ultimately all software should be free; therefore, we are not going to talk about them here at all.
[7] Some people would refer to this rule as There is no free lunch, but in our – overall friendly – world, many things are actually free (see Rule #9); therefore, this name is strongly misleading.