Preface

The open-access textbook Minimalist Data Wrangling with Python by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF; a printed version can be ordered from Amazon). It is a non-profit project. Although available online, it is a whole course; it should be read from the beginning to the end. Any bug/typo reports/fixes are appreciated.

The Art of Data Wrangling

Data science¹ aims at making sense of and generating predictions from data that have² been collected in copious quantities from various sources, such as physical sensors, surveys, online forms, access logs, and (pseudo)random number generators, to name a few. They can take diverse forms, e.g., vectors, matrices, and other tensors, graphs/networks, audio/video streams, or text.

Researchers in psychology, economics, sociology, agriculture, engineering, cybersecurity, biotechnology, pharmacy, sports science, medicine, and genetics, amongst many others, need statistical methods to make discoveries as well as confirm or falsify existing theories. What is more, with the increased availability of open data, everyone can do remarkable work for the common good, e.g., by volunteering for non-profit NGOs or debunking false news and overzealous acts of wishful thinking on any side of the political spectrum.

Furthermore, data scientists, machine learning engineers, statisticians, and business analysts are amongst the best-paid specialists. This is because data-driven decision-making, modelling, and prediction have proved especially effective in many domains, including healthcare, food production, pharmaceuticals, transportation, financial services (banking, insurance, investment funds), real estate, and retail.

Overall, data science (and its different flavours, including operational research, machine learning, business and artificial intelligence) can be applied wherever we have some relevant data at hand and there is a need to improve or understand the underlying processes.

Exercise 1

Miniaturisation, increased computing power, cheaper storage, and the popularity of various internet services all caused data to become ubiquitous. Think about how much information people consume and generate when they interact with different news feeds or social media on their phones.

Data usually do not come in a tidy and tamed form. Data wrangling is the very broad process of appropriately curating raw information chunks and then exploring the underlying data structure so that they become analysable.

Aims, Scope, and Design Philosophy

This course is envisaged as a student’s first exposure to data science³, providing a high-level overview as well as discussing key concepts at a healthy level of detail.

By no means do we have the ambition to be comprehensive with regard to any topic we cover. Time for that will come later in separate lectures on calculus, matrix algebra, probability, mathematical statistics, continuous and combinatorial optimisation, information theory, stochastic processes, statistical/machine learning, algorithms and data structures (take a deep breath), databases and big data analytics, operational research, graphs and networks, differential equations and dynamical systems, time series analysis, signal processing, etc.

Instead, we lay very solid groundwork for the above by introducing all the objects at an appropriate level of generality and building the most crucial connections between them. We provide the necessary intuitions behind the more advanced methods and concepts. This way, further courses do not need to waste our time introducing the most elementary definitions and answering the metaphysical questions like “but why do we need that (e.g., matrix multiplication) at all”.

For those reasons, in this book, we explore methods for:

  • performing exploratory data analysis (e.g., aggregation and visualisation),

  • working with different types of data (e.g., numerical, categorical, text, time series),

  • cleaning data gathered from structured and unstructured sources, e.g., by identifying outliers, normalising strings, extracting numbers from text, imputing missing data,

  • transforming, selecting, and extracting features, as well as reducing dimensionality,

  • identifying naturally occurring data clusters,

  • discovering patterns in data via approximation/modelling approaches using the most popular probability distributions and the easiest to understand statistical/machine learning algorithms,

  • testing whether two data distributions differ significantly from each other,

  • reporting the results of data analysis.

We primarily focus on methods and algorithms that have stood the test of time and that continue to inspire researchers and practitioners. They all pass a reality check comprising the following three properties, which we believe are essential in practice:

  • simplicity (and thus interpretability, being equipped with no or only a few underlying tunable parameters; being based on some sensible intuitions that can be explained in our own words),

  • mathematical analysability (at least to some extent; so that we can understand their strengths and limitations),

  • implementability (not too abstract on the one hand, but also not requiring any advanced computer-y hocus-pocus on the other).

Note

Many more complex algorithms are merely variations on or clever combinations of the more basic ones. This is why we need to study the fundamentals in great detail. We might not see it now, but this will become evident as we progress.

We Need Maths

The maths we introduce is the most elementary possible, in a good sense. Namely, we do not go beyond:

  • simple analytic functions (affine maps, logarithms, cosines),

  • the natural linear ordering of points on the real line (and the lack thereof in the case of multidimensional data),

  • the sum of squared differences between things (including the Euclidean distance between points),

  • linear vector/matrix algebra, e.g., to represent the most useful geometric transforms (rotation, scaling, translation),

  • the frequentist interpretation (as in: in samples of large sizes, we expect that…) of some common objects from probability theory and statistics.

This is the kind of toolkit that we believe is a sine qua non requirement for every prospective data scientist. We cannot escape falling in love with it.
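To make one item from the list above concrete, here is a minimal sketch (not taken from the book) of the sum of squared differences and the Euclidean distance it gives rise to. It relies only on Python’s standard library; the function names and the two points are hypothetical, chosen purely for illustration.

```python
import math


def sum_of_squared_differences(x, y):
    """Return the sum of squared coordinate-wise differences between x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def euclidean_distance(x, y):
    """Return the Euclidean distance: the square root of the sum above."""
    return math.sqrt(sum_of_squared_differences(x, y))


# Two made-up points in the plane, chosen so that the results are exact.
p = (1.0, 2.0)
q = (4.0, 6.0)

ssd = sum_of_squared_differences(p, q)  # (1-4)^2 + (2-6)^2 = 9 + 16
dist = euclidean_distance(p, q)
print(ssd, dist)  # 25.0 5.0
```

The same two-step pattern (take differences, square and add them up) reappears in many guises, which is presumably why it earns a place on such a short list.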

We Need Some Computing Environment

We no longer practise data analysis solely using a piece of paper and a pencil⁴. Over the years, dedicated computer programs have been developed that solve the most common problems arising in the most straightforward scenarios (e.g., spreadsheet-like click-here-click-there stand-alone statistical packages). Still, we need a tool that will enable us to respond to any challenge in a manner that is scientifically rigorous (and hence well organised and reproducible).

In this course, we will be writing code in Python, which we shall introduce from scratch. Consequently, we do not require any prior programming experience.

The 2021 Stack Overflow Developer Survey lists Python as the second most popular programming language (slightly behind JavaScript, whose primary use is in Web development). Over the last few years, Python has proven to be a very robust choice for learning and applying data wrangling techniques. This is possible thanks to the famous high-quality packages written by the devoted community of open-source programmers, including but not limited to numpy, scipy, pandas, matplotlib, seaborn, and scikit-learn.

Nevertheless, Python and its third-party packages are only some of the many software tools that can help us gain new knowledge from data. Other⁵ open-source choices include, e.g., R and Julia. And many new ones will emerge in the future.

Important

We will therefore emphasise developing transferable skills: most of what we learn here can be applied (using a different syntax but the same kind of reasoning) in other environments. In other words, this is a course on data wrangling (with Python), not a course on Python (with examples in data wrangling).

We want the reader to become an independent user of this computing environment. Somebody who is not overwhelmed when they are faced with any intermediate-level data analysis problem. A user whose habitual response to a new challenge is not to look everything up on the internet even in the simplest possible scenarios. In other words, we value creative thinking.

We believe we have found a good trade-off between the minimal set of tools that need to be mastered and the less frequently used ones that can later be found in the documentation or on the internet. Along the way, the reader will discover the joy of programming and of using their logical thinking to tinker with things.

We Need Data and Domain Knowledge

There is no data science or machine learning without data, and data’s purpose is to represent a given problem domain. Mathematics allows us to study different processes at a healthy level of abstractness/specificity. Still, we should always be familiar with the reality behind the numbers we have at hand, for example, by working closely with various experts in the field of our interest or pursuing our own study/research therein.

Courses such as this one, out of necessity, must use some generic datasets that are quite familiar to most readers (e.g., data on life expectancy and GDP in different countries, time to finish a marathon, yearly household incomes).

Yet, many textbooks introduce statistical concepts using carefully crafted datasets where everything runs smoothly, and all models work out of the box. This gives a false sense of security. In practice, however, most datasets are not only unpolished but also (even after some careful treatment) uninteresting. Such is life. We will not be avoiding the more difficult problems during our journey.

Structure

This book is a whole course and should be read from the beginning to the end.

The material has been divided into five parts.

  1. Introducing Python:

    • Chapter 1 discusses how to execute the first code chunks in Jupyter Notebooks, which are a flexible tool for the reproducible generation of reports from data analyses.

    • Chapter 2 introduces the basic scalar types in base Python, ways to call existing and to write our own functions, and control a code chunk’s execution flow.

    • Chapter 3 mentions sequential and other iterable types in base Python; more advanced data structures (vectors, matrices, data frames) that we introduce below will build upon these concepts.

  2. Unidimensional Data:

    • Chapter 4 introduces vectors from numpy, which we use for storing data on the real line (think: individual columns in a tabular dataset). Then, we look at the most common types of empirical distributions of data (e.g., bell-shaped, right-skewed, heavy-tailed ones).

    • In Chapter 5, we list the most basic ways of processing sequences of numbers, including methods for data aggregation, transformation (e.g., standardisation), and filtering. We also explain that a computer’s floating-point arithmetic is imprecise and discuss what we can do about it.

    • Chapter 6 reviews the most common probability distributions (normal, log-normal, Pareto, uniform, and mixtures thereof), methods for assessing how well they fit empirical data, and pseudorandom number generation that is crucial for experiments based on simulations.

  3. Multidimensional Data:

    • Chapter 7 introduces matrices from numpy. They are a convenient means of storing multidimensional quantitative data (many points described by possibly many numerical features). We also present some methods for their visualisation (and the problems arising from our being three-dimensional creatures).

    • Chapter 8 is devoted to basic operations on matrices. We will see that some of them simply extend what we learned in Chapter 5, but there is more: for instance, we discuss how to determine the set of each point’s nearest neighbours.

    • Chapter 9 discusses ways to explore the most basic relationships between the variables in a dataset: the Pearson and Spearman correlation coefficients (and what it means that correlation is not causation), k-nearest neighbour and linear regression (including the sad cases where a model matrix is ill-conditioned), and finding interesting combinations of variables that can help reduce the dimensionality of a problem (via the so-called principal component analysis).

  4. Heterogeneous Data:

    • Chapter 10 introduces Series and DataFrame objects from pandas, which we can think of as vectors and matrices on steroids. For instance, they allow rows and columns to be labelled and columns to be of different types. We emphasise that most of what we learned in the previous chapters still applies, but there is even more: for example, methods for joining (merging) many datasets, converting between long and wide formats, etc.

    • In Chapter 11, we introduce the ways to represent and handle categorical data as well as how (not) to lie with statistics.

    • Chapter 12 covers the case of aggregating, transforming, and visualising data in groups defined by one or more qualitative variables, including classification with k-nearest neighbours (when we are asked to fill the gaps in a categorical variable). We will also try to discover the naturally occurring partitions using the k-means method, which is an example of a computationally hard optimisation problem that needs to be tackled with some imperfect heuristics.

    • Chapter 13 is an interlude where we solve some pleasant exercises on data frames and learn the basics of SQL. This will be handy when we are faced with datasets that do not fit into a computer’s memory.

  5. Other Data Types:

    • Chapter 14 discusses ways to handle text data and extract information from them, e.g., through regular expressions. We also briefly mention the challenges related to the processing of non-English text, including phrases like pozdro dla ziomali z Bródna, Viele Grüße und viel Spaß, and χαίρετε.

    • Chapter 15 emphasises that some data may be missing or questionable (e.g., censored, incorrect, rare) and discusses what we can do about them.

    • In Chapter 16, we cover the most basic methods for the processing of time series, because, ultimately, everything changes, and we should be able to track the evolution of things.
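The remark about imprecise floating-point arithmetic in the description of Chapter 5 can be illustrated with a short sketch. It is not the book’s own example; it merely assumes the standard math module and the classic case of adding 0.1 and 0.2.

```python
import math

a = 0.1 + 0.2  # neither 0.1 nor 0.2 has an exact binary representation

exact_equal = (a == 0.3)             # comparing floats for exact equality
close_enough = math.isclose(a, 0.3)  # comparing up to a small relative tolerance

print(exact_equal)   # False
print(close_enough)  # True
```

One common remedy, then, is to compare floating-point results up to a tolerance rather than test them for exact equality.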

Note

(*) The parts marked with a single or double asterisk can be skipped the first time we read this book. They are of increased difficulty and are less essential for beginner students.

The Rules

Our goal here, in the long run, is for you, dear reader, to become a skilled expert who is independent, ethical, and capable of critical thinking; one who hopefully will make a small contribution towards making this world a slightly better place.

To guide you through it, we have a few tips for you.

  1. Follow the rules.

  2. Technical textbooks are not belles-lettres written purely for shallow amusement. Sometimes a single page will be very meaning-intense. Do not try to consume too much at the same time. Go for a walk, reflect on what you learned, and build connections between different concepts. In case of any doubt, go back to one of the previous sections. Learning is an iterative process, not a linear one.

  3. Solve all the suggested exercises. We might be introducing new concepts or developing crucial intuitions there as well. Also, try implementing most of the methods you learn about instead of looking for copy-paste solutions on the internet. How else will you be able to master the material and develop the necessary programming skills?

  4. Code is an integral part of the text. Each piece of good code is worth 1234 words (on average). Do not skip it. On the contrary, you should play and experiment with it. Run every major line of code, inspect the results generated, and read more about the functions you use in the official documentation. What is the type (class) of the object returned? If it is an array or a data frame, what is its shape? What would happen if we replaced X with Y? Do not fret; your computer will not blow up.

  5. Harden up⁶. Your journey towards expertise will take years and there are no shortcuts, but it will be quite enjoyable every now and then, so don’t give up. Still, sitting all day in front of your computer is unhealthy – exercise and socialise between 28 and 31 times per month, because you’re not, nor will ever be, a robot.

  6. Learn maths. Our field has a very long history and stands on the shoulders of many giants; many methods we use these days are merely minor variations on the classical, fundamental results that date back to Newton, Leibniz, Gauss, and Laplace. Eventually, you will need some working knowledge of mathematics to understand them (linear algebra, calculus, probability and statistics). Remember that software products/APIs seem to change frequently, but they are just a facade, a flashy wrapping around the methods we have been using for quite a while.

  7. Use only methods that you can explain. You should refrain from working with algorithms/methods/models whose definitions (pseudocode, mathematical formulae, objective functions they are trying to optimise) and properties you do not know, do not understand, or cannot rephrase in your own words. That they might be accessible or easy to use should not make any difference to you. Also, prefer simple models over black boxes.

  8. Compromises are inevitable⁷. There will never be a single best metric, algorithm, or way to solve all the problems. Even though some solutions might be better than others with regard to specific criteria, this will only be true under certain assumptions (if they fit a theoretical model). Beware that focusing too much on one aspect leads to undesirable consequences with respect to other factors, especially those that cannot be measured easily. Refraining from improving things might sometimes be better than pushing too hard. Always apply common sense.

  9. Be scientific and ethical. Make your reports reproducible, your toolkit well-organised, and all the assumptions you make explicit. Develop a dose of scepticism and impartiality towards everything, from marketing slogans, through your ideological biases, to all hotly debated topics. Most data analysis exercises end up with conclusions like: “it’s too early to tell”, “data don’t show it’s either way”, “there is a difference, but it is hardly significant”, “yeah, but our sample is not representative of the entire population” – and there is nothing wrong with this. Remember that it is highly unethical to use statistics to tell lies; this includes presenting only one side of the overly complex reality and totally ignoring all the other ones (compare Rule #8). Using statistics for doing dreadful things (tracking users to find their vulnerabilities, developing products and services which are addictive) is a huge no-no!

  10. The best things in life are free. These include the open-source software and open-access textbooks (such as this one) we use in our journey. Spread the good word about them and – if you can – don’t only be a taker: contribute something valuable yourself (even as small as reporting typos in their documentation or helping others in different forums when they are stuck). After all, it is our shared responsibility.
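Rule 4 above suggests inspecting what each call returns and asking what would happen if X were replaced with Y. A minimal sketch of that habit follows, using only built-in functions; the tuple x is a made-up input, and the choice of sorted and reversed is merely illustrative. For numpy arrays and pandas data frames, the shape attribute additionally reports the size along each axis.

```python
x = (3, 1, 2)  # a hypothetical input: a tuple of three integers

# What is the type (class) of the object returned? How many items does it hold?
y = sorted(x)
print(type(y), len(y))  # a list (not a tuple) of the same length
print(y)                # [1, 2, 3]

# What would happen if we replaced sorted with reversed?
z = reversed(x)
print(type(z))          # an iterator this time, not a list

# An iterator must be materialised before we can see its elements.
elements = list(z)
print(elements)         # [2, 1, 3]
```

Neither function modifies x; checking such details by running the code is quicker than guessing, and nothing blows up.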

About the Author

I, Marek Gagolewski (pronounced like Ma’rek Gong-olive-ski), am currently a Senior Lecturer in Applied AI at Deakin University in Melbourne, VIC, Australia and an Associate Professor in Data Science (on long-term leave) at the Faculty of Mathematics and Information Science, Warsaw University of Technology, Poland, and at the Systems Research Institute of the Polish Academy of Sciences.

My research interests are related to data science, in particular: modelling complex phenomena, developing usable, general-purpose algorithms, studying their analytical properties, and finding out how people use, misuse, understand, and misunderstand methods of data analysis in research, commercial, and decision-making settings. I’m an author of 85+ publications, including journal papers in outlets such as Proceedings of the National Academy of Sciences (PNAS), Information Fusion, International Journal of Forecasting, Statistical Modelling, Journal of Statistical Software, Information Sciences, Knowledge-Based Systems, IEEE Transactions on Fuzzy Systems, and Journal of Informetrics.

In my “spare” time, I write books for my students and develop open-source (libre) data analysis software, such as stringi – one of the most often downloaded R packages, genieclust – a fast and robust clustering algorithm in both Python and R, and many others.

Acknowledgements

Minimalist Data Wrangling with Python is based on my experience as an author of a quite successful textbook Przetwarzanie i analiza danych w języku Python (Data Processing and Analysis in Python; see [30]) that I wrote (in Polish, 2016, published by PWN) with my former (successful) PhD students, Maciej Bartoszuk and Anna Cena – thanks! The current book is an entirely different work; however, its predecessor served as an excellent testbed for many ideas conveyed here.

The teaching style exercised in this book has proven successful in many similar courses that yours truly has been responsible for, including at Warsaw University of Technology, Data Science Retreat (Berlin), and Deakin University (Geelong/Melbourne). I thank all my students for the feedback given over the last 10 or so years.

A thank-you to all the authors and contributors of the Python packages that we use throughout this course: numpy [40], scipy [82], matplotlib [45], seaborn [83], and pandas [56], amongst others (as well as the many C/C++/Fortran libraries they provide wrappers for). Their version numbers are given in Section 1.4.

This book has been prepared using MyST (a Markdown superset), Sphinx, and TeX (XeLaTeX). Python code chunks were processed with the R (sic!) package knitr [89]. A little help from Makefiles, custom shell scripts, and Sphinx plugins (sphinxcontrib-bibtex, sphinxcontrib-proof) dotted the j’s and crossed the f’s. The Ubuntu Mono font is used for the display of code. Typesetting of the main text relies upon the Alegreya and Lato typefaces.

To my friends: Ania, Basia, Grzesiek, Fizz Grady, and Tessa – thanks for being so patient and for your comments about different things!


¹ Traditionally known as statistics.

² Yes, data are plural (datum is singular).

³ We might have entitled it Introduction to Data Science (with Python).

⁴ We acknowledge that some more theoretically inclined readers might ask the question: but why do we need programming at all? Unfortunately, some mathematicians forgot that probability and statistics are deeply rooted in the so-called real world. We should remember that theory beautifully supplements practice and provides us with very deep insights, but we still need to get our hands dirty from time to time.

⁵ There are also some commercial solutions available on the market, but we believe that ultimately all software should be free. Consequently, we are not going to talk about them here at all.

⁶ Cyclists know.

⁷ Some people would refer to this rule as There is no free lunch, but in our – overall friendly – world, many things are actually free (see Rule #10). Therefore, this name is misleading.