Minimalist Data Wrangling with Python

Marek Gagolewski

doi:10.5281/zenodo.6451068

Preface¶

This open-access textbook is, and will remain, freely available for everyone’s enjoyment (also in PDF; a paper copy can also be ordered). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out Deep R Programming [36] too.

The art of data wrangling¶

Data science[1] aims at making sense of and generating predictions from data that have[2] been collected in copious quantities from various sources, such as physical sensors, surveys, online forms, access logs, and (pseudo)random number generators, to name a few. They can take diverse forms, e.g., be given as vectors, matrices, or other tensors, graphs/networks, audio/video streams, or text.

Researchers in psychology, economics, sociology, agriculture, engineering, cybersecurity, biotechnology, pharmacy, sports science, medicine, and genetics, amongst many others, need statistical methods to make new discoveries and confirm or falsify existing hypotheses. What is more, with the increased availability of open data, everyone can do remarkable work for the common good, e.g., by volunteering for non-profit NGOs or debunking false news and overzealous acts of wishful thinking on any side of the political spectrum.

Data scientists, machine learning engineers, statisticians, and business analysts are among the most well-paid specialists. This is because data-driven decision-making, modelling, and prediction proved themselves especially effective in many domains, including healthcare, food production, pharmaceuticals, transportation, financial services (banking, insurance, investment funds), real estate, and retail.

Overall, data science (and its assorted flavours, including operational research, machine learning, pattern recognition, business and artificial intelligence) can be applied wherever we have some information repository at hand and there is a need to describe, understand, model, or improve the underlying processes.

Exercise 1

Miniaturisation, increased computing power, cheaper storage, and the popularity of various internet services all caused data to become ubiquitous. Think about how much information people consume and generate when they interact with news feeds or social media on their phones.

Data usually do not come in a tidy and tamed form. Data wrangling is the very broad process of appropriately curating raw information chunks and then exploring the underlying data structure so that they become analysable.

Aims, scope, and design philosophy¶

This course is envisaged as a student’s first exposure to data science[3], providing a high-level overview as well as discussing key concepts at a healthy level of detail.

By no means do we have the ambition to be comprehensive with regard to any topic we cover. Time for that will come later in separate lectures on calculus, matrix algebra, probability, mathematical statistics, continuous and combinatorial optimisation, information theory, stochastic processes, statistical/machine learning, algorithms and data structures, take a deep breath, databases and big data analytics, operational research, graphs and networks, differential equations and dynamical systems, time series analysis, signal processing, etc.

Instead, we lay solid groundwork for the aforementioned by introducing all the objects at an appropriate level of generality, and building the most crucial connections between them. We provide the necessary intuitions behind the more advanced methods and concepts. This way, further courses do not need to waste our time introducing the most elementary definitions and answering the metaphysical questions like “but why do we need that (e.g., matrix multiplication) in the first place”.

For those reasons, in this book, we explore methods for:

performing exploratory data analysis (e.g., aggregation and visualisation),
working with varied types of data (e.g., numerical, categorical, text, time series),
cleaning data gathered from structured and unstructured sources, e.g., by identifying outliers, normalising strings, extracting numbers from text, imputing missing data,
transforming, selecting, and extracting features, dimensionality reduction,
identifying naturally occurring data clusters,
discovering patterns in data via approximation/modelling approaches using the most popular probability distributions and the easiest to understand statistical/machine learning algorithms,
testing whether two data distributions are significantly different,
reporting the results of data analysis.

We primarily focus on methods and algorithms that have stood the test of time and that continue to inspire researchers and practitioners. They all meet the reality check comprised of the three undermentioned properties, which we believe are essential in practice:

simplicity (and thus interpretability, being equipped with no or only a few underlying tunable parameters; being based on some sensible intuitions that can be explained in our own words),
mathematical analysability (at least to some extent; so that we can understand their strengths and limitations),
implementability (not too abstract on the one hand, but also not requiring any advanced computer-y hocus-pocus on the other).

Note

Many more complex algorithms are merely variations on or clever combinations of the more basic ones. This is why we need to study the foundations in great detail. We might not see it now, but this will become evident as we progress.

We need maths¶

The maths we introduce is the most elementary possible, in a good sense. Namely, we do not go beyond:

simple analytic functions (affine maps, logarithms, cosines),
the natural linear ordering of points on the real line (and the lack thereof in the case of multidimensional data),
the sum of squared differences between things (including the Euclidean distance between points),
linear vector/matrix algebra, e.g., to represent the most useful geometric transforms (rotation, scaling, translation),
the frequentist interpretation (as in: in samples of large sizes, we expect that…) of some common objects from probability theory and statistics.

This is the kind of toolkit that we believe is a sine qua non requirement for every prospective data scientist. We cannot escape falling in love with it.

We need some computing environment¶

We no longer practice data analysis solely using a piece of paper and a pencil[4]. Over the years, dedicated computer programs that solve the most common problems arising in the most straightforward scenarios were developed, e.g., spreadsheet-like click-here-click-there standalone statistical packages. Still, we need a tool that will enable us to respond to any challenge in a manner that is scientifically rigorous, i.e., well organised and reproducible.

This course uses the Python language which we shall introduce from scratch. Consequently, we do not require any prior programming experience.

The 2024 StackOverflow Developer Survey lists Python as the second most popular programming language (slightly behind JavaScript, whose primary use is in Web development). Over the last couple of years, it has proven to be a quite robust choice for learning and applying data wrangling techniques. This is possible thanks to the devoted community of open-source programmers who wrote the famous high-quality packages such as numpy, scipy, matplotlib, pandas, seaborn, and scikit-learn.

Nevertheless, Python and its third-party packages are amongst many software tools which can help extract knowledge from data. Certainly, this ecosystem is not ideal for all the applications, nor is it the most polished. The R environment [36, 65, 96, 102] is one[5] of the recommended alternatives worth considering.

Important

We will focus on developing transferable skills: most of what we learn here can be applied (using different syntax but the same kind of reasoning) in other environments. Thus, this is a course on data wrangling (with Python), not a course on Python (with examples in data wrangling).

We want the reader to become an independent user of this computing environment. Somebody who is not overwhelmed when they are faced with any intermediate-level data analysis problem. A user whose habitual response to a new challenge is not to look everything up on the internet even in the simplest possible scenarios. Someone who will not be replaced by stupid artificial “intelligence” in the future.

We believe we have found a healthy trade-off between the minimal set of tools that need to be mastered and the less frequently used ones that can later be found in the documentation or online. In other words, the reader will discover the joy of programming and using logical reasoning to tinker with things.

We need data and domain knowledge¶

There is no data science or machine learning without data, and data’s purpose is to represent a given problem domain. Mathematics allows us to study different processes at a healthy level of abstractness/specificity. Still, we need to be familiar with the reality behind the numbers we have at hand, for example, by working closely with various experts or pursuing our own research in the relevant field

Courses such as this one, out of necessity, must use some generic datasets that are familiar to most readers (e.g., life expectancy and GDP by country, time to finish a marathon, yearly household incomes).

Regrettably, many textbooks introduce statistical concepts using carefully fabricated datasets where everything runs smoothly, and all models work out of the box. This gives a false sense of security and builds a too cocky a level of confidence. In practice, however, most datasets are not only unpolished; they are dull, even after some careful treatment. Such is life. We will not be avoiding the more difficult and less attractive problems during our journey.

Structure¶

This book is a whole course. We recommend reading it from the beginning to the end.

The material has been divided into five parts.

Introducing Python:
- Chapter 1 discusses how to set up the Python environment, including Jupyter Notebooks which are a flexible tool for the reproducible generation of reports from data analyses.
- Chapter 2 introduces the elementary scalar types in base Python, ways to call existing and to compose our own functions, and control a code chunk’s execution flow.
- Chapter 3 mentions sequential and other iterable types in base Python. The more advanced data structures (vectors, matrices, data frames) will build upon these concepts.
Unidimensional data:
- Chapter 4 introduces vectors from numpy, which we use for storing data on the real line (think: individual columns in a tabular dataset). Then, we look at the most common types of empirical distributions of data, e.g., bell-shaped, right-skewed, heavy-tailed ones.
- In Chapter 5, we list the most basic ways for processing sequences of numbers, including methods for data aggregation, transformation (e.g., standardisation), and filtering. We also mention that a computer’s floating-point arithmetic is imprecise and what we can do about it.
- Chapter 6 reviews the most common probability distributions (normal, log-normal, Pareto, uniform, and mixtures thereof), methods for assessing how well they fit empirical data. It also covers pseudorandom number generation which is crucial in experiments based on simulations.
Multidimensional data:
- Chapter 7 introduces matrices from numpy. They are a convenient means of storing multidimensional quantitative data, i.e., many points described by possibly many numerical features. We also present some methods for their visualisation (and the problems arising from our being three-dimensional creatures).
- Chapter 8 is devoted to operations on matrices. We will see that some of them simply extend upon what we have learnt in Chapter 5, but there is more: for instance, we discuss how to determine the set of each point’s nearest neighbours.
- Chapter 9 discusses ways to explore the most basic relationships between the variables in a dataset: the Pearson and Spearman correlation coefficients (and what it means that correlation is not causation), \(k\)-nearest neighbour and linear regression (including sad cases where a model matrix is ill-conditioned), and finding noteworthy combinations of variables that can help reduce the dimensionality of a problem (via principal component analysis).
Heterogeneous data:
- Chapter 10 introduces Series and DataFrame objects from pandas, which we can think of as vectors and matrices on steroids. For instance, they allow rows and columns to be labelled and columns to be of different types. We emphasise that most of what we learnt in the previous chapters still applies, but now we can do even more: run methods for joining (merging) many datasets, converting between long and wide formats, etc.
- In Chapter 11, we introduce the ways to represent and handle categorical data as well as how (not) to lie with statistics.
- Chapter 12 covers the case of aggregating, transforming, and visualising data in groups defined by one or more qualitative variables. It introduces an approach to data classification using the \(k\)-nearest neighbours scheme, which is useful when we are asked to fill the gaps in a categorical variable. We will also discover naturally occurring partitions using the \(k\)-means method, which is an example of a computationally hard optimisation problem that needs to be tackled with some imperfect heuristics.
- Chapter 13 is an interlude where we solve some pleasant exercises on data frames and learn the basics of SQL. This will come in handy when our datasets do not fit in a computer’s memory.
Other data types:
- Chapter 14 discusses ways to handle text data and extract information from them, e.g., through regular expressions. We also briefly mention the challenges related to the processing of non-English text, including phrases like pozdro dla ziomali z Bródna, Viele Grüße und viel Spaß, and χαίρετε.
- Chapter 15 emphasises that some data may be missing or be questionable (e.g., censored, incorrect, rare) and what we can do about it.
- In Chapter 16, we cover the most rudimentary methods for the processing of time series because, ultimately, everything changes, and we should be able to track the evolution of things.

Note

(*) Parts marked with a single or double asterisk (e.g., some sections or examples) can be skipped on first reading for they are of lesser importance or greater difficulty.

The Rules¶

Our goal here, in the long run, is for you, dear reader, to become a skilled expert who is independent, ethical, and capable of critical thinking; one who hopefully will make some contribution towards making this world a slightly better place. To guide you through this challenging journey, we have a few tips.

Follow the rules.
Technical textbooks are not belletristic – purely for shallow amusement. Sometimes a single page will be very meaning-intense. Do not try to consume too much all at once. Go for a walk, reflect on what you learnt, and build connections between different concepts. In case of any doubt, go back to the previous sections. Learning is an iterative process, not a linear one.
Solve all the suggested exercises. We might be introducing ideas or developing crucial intuitions there as well. Also, try implementing most of the methods you learn about instead of looking for copy-paste solutions on the internet. How else will you be able to master the material and develop the necessary programming skills?
Code is an integral part of the text. Each piece of good code is worth 1234 words (on average). Do not skip it. On the contrary, you are encouraged to play and experiment with it. Run every major line of code, inspect the results generated, and read more about the functions you use in the official documentation. What is the type (class) of the object returned? If it is an array or a data frame, what is its shape? What would happen if we replaced X with Y? Do not fret; your computer will not blow up.
Harden up[6]. Your journey towards expertise will take years, there are no shortcuts, but it will be fairly enjoyable every now and then, so don’t give up. Still, sitting all day in front of your computer is unhealthy. Exercise and socialise between 28 and 31 times per month for you’re not, nor will ever be, a robot.
Learn maths. Our field has a very long history and stands on the shoulders of many giants; many methods we use these days are merely minor variations on the classical, fundamental results that date back to Newton, Leibniz, Gauss, and Laplace. Eventually, you will need some working knowledge of mathematics to understand them (linear algebra, calculus, probability and statistics). Remember that software products/APIs seem to change frequently, but they are just a facade, a flashy wrapping around the methods we were using for quite a while.
Use only methods that you can explain. You ought to refrain from working with algorithms/methods/models whose definitions (pseudocode, mathematical formulae, objective functions they are trying to optimise) and properties you do not know, understand, or cannot rephrase in your own words. That they might be accessible or easy to use should not make any difference to you. Also, prefer simple models over black boxes.
Compromises are inevitable[7]. There will never be a single best metric, algorithm, or way to solve all the problems. Even though some solutions might be superior to others with regard to certain criteria, this will only be true under very specific assumptions (if they fit a theoretical model). Beware that focusing too much on one aspect leads to undesirable consequences with respect to other factors, especially those that cannot be measured easily. Refraining from improving things might sometimes be better than pushing too hard. Always apply common sense.
Be scientific and ethical. Make your reports reproducible, your toolkit well-organised, and all the assumptions you make explicit. Develop a dose of scepticism and impartiality towards everything, from marketing slogans, through your ideological biases, to all hotly debated topics. Most data analysis exercises end up with conclusions like: “it’s too early to tell”, “data don’t show it’s either way”, “there is a difference, but it is hardly significant”, “yeah, but our sample is not representative for the entire population” – and there is nothing wrong with this. Communicate in a precise manner [88]. Remember that it is highly unethical to use statistics to tell lies [98]; this includes presenting only one side of the overly complex reality and totally ignoring all others (compare Rule#8). Using statistics for doing dreadful things (tracking users to find their vulnerabilities, developing products and services which are addictive) is a huge no-no!
The best things in life are free. These include the open-source software and open-access textbooks (such as this one) we use in our journey. Spread the good news about them and – if you can – don’t only be a taker: contribute something valuable yourself (even as small as reporting typos in their documentation or helping others in different forums when they are stuck). After all, it is our shared responsibility.

About the author¶

I, Marek Gagolewski (pronounced like Maa’rek (Mark) Gong-o-leaf-ski), am currently an Associate Professor in Data Science at the Faculty of Mathematics and Information Science, Warsaw University of Technology.

My research interests are related to data science, in particular: modelling complex phenomena, developing usable, general-purpose algorithms, studying their analytical properties, and finding out how people use, misuse, understand, and misunderstand methods of data analysis in research, commercial, and decision-making settings. I am an author of ~100 publications, including journal papers in outlets such as Proceedings of the National Academy of Sciences (PNAS), Journal of Statistical Software, The R Journal, Journal of Classification, Information Fusion, International Journal of Forecasting, Statistical Modelling, Physica A: Statistical Mechanics and its Applications, Information Sciences, Knowledge-Based Systems, IEEE Transactions on Fuzzy Systems, and Journal of Informetrics.

In my “spare” time, I write books for my students: check out my Deep R Programming [36]. I also develop open-source software for data analysis, such as stringi (one of the most often downloaded R packages) and genieclust (a fast and robust clustering algorithm in both Python and R).

Acknowledgements¶

Minimalist Data Wrangling with Python is based on my experience as an author of a quite successful textbook Przetwarzanie i analiza danych w języku Python [37] that I wrote with my former (successful) PhD students, Maciej Bartoszuk and Anna Cena – thanks! Even though the current book is an entirely different work, its predecessor served as an excellent testbed for many ideas conveyed here.

The teaching style exercised in this book has proven successful in many similar courses that yours truly has been responsible for, including at Warsaw University of Technology, Data Science Retreat (Berlin), and Deakin University (Melbourne). I thank all my students and colleagues for the feedback given over the last 10 or so years.

A thank-you to all the authors and contributors of the Python packages that we use throughout this course: numpy [48], scipy [97], matplotlib [54], pandas [66], and seaborn [99], amongst others (as well as the many C/C++/Fortran libraries they provide wrappers for). Their version numbers are given in Section 1.4.

This book was prepared in a Markdown superset called MyST, Sphinx, and TeX (XeLaTeX). Python code chunks were processed with the R (sic!) package knitr [106]. A little help from Makefiles, custom shell scripts, and Sphinx plugins (sphinxcontrib-bibtex, sphinxcontrib-proof) dotted the j’s and crossed the f’s. The Ubuntu Mono font is used for the display of code. The typesetting of the main text relies on the Alegreya typeface.

This work received no funding, administrative, technical, or editorial support from Deakin University, Warsaw University of Technology, Polish Academy of Sciences, or any other source.

You can make this book better¶

When it comes to quality assurance, open, non-profit projects have to resort to the generosity of the readers’ community.

If you find a typo, a bug, or a passage that could be rewritten or extended for better readability/clarity, do not hesitate to report it via the Issues tracker available at https://github.com/gagolews/datawranglingpy. New feature requests are welcome as well.