Minimalist Data Wrangling with Python

Marek Gagolewski

doi:10.5281/zenodo.6451068

Changelog

Important

Any bug/typo reports/fixes are appreciated. The most up-to-date version of this book can be found at https://datawranglingpy.gagolewski.com/.

Below is the list of the most noteworthy changes.

2025-..-.. (v1.1.0.9xxx) (in progress):
- We now mention how to create our own function modules.
- More programming exercises.
- Minor extensions and bug fixes.
2025-02-17 (v1.1.0):
- New HTML theme (includes the light and dark mode).
- Not using seaborn where it can easily be replaced by a few calls to the lower-level matplotlib, especially in the numpy chapters. This way, we can learn how to create some popular charts from scratch. In particular, we now use our own functions to display a heat map and a pairs plot.
- Use numpy.genfromtxt more eagerly.
- A few more examples of using f-strings to pretty-print results.
- Minor extensions and bug fixes.
- Updated to Python 3.11, numpy 2.2, pandas 2.2, matplotlib 3.10 (amongst others).
2023-02-06 (v1.0.3):
- Numeric reference style; updated bibliography.
- Reduce the file size of the screen-optimised PDF at the cost of a slight decrease in the quality of some figures.
- The print-optimised PDF now uses selective rasterisation of parts of figures, not whole pages containing them. This should increase the quality of the printed version of this book.
- Bug fixes.
- Minor extensions, including: pandas.Series.dt.strftime, more details on how to avoid pitfalls in data frame indexing, etc.
2022-08-24 (v1.0.2):
- The first printed (paperback) version can be ordered from Amazon.
- Fixed page margin and header sizes.
- Minor typesetting and other fixes.
2022-08-12 (v1.0.1):
- Cover.
- ISBN 978-0-6455719-1-2 assigned.
2022-07-16 (v1.0.0):
- Preface complete.
- Handling tied observations.
- Plots now look better when printed in black and white.
- Exception handling.
- File connections.
- Other minor extensions and material reordering: more aggregation functions, pandas.unique, pandas.factorize, probability vectors representing binary categorical variables, etc.
- Final proofreading and copyediting.
2022-06-13 (v0.5.1):
- The Kolmogorov–Smirnov Test (one and two sample).
- The Pearson Chi-Squared Test (one and two sample and for independence).
- Dealing with round-off and measurement errors.
- Adding white noise (jitter).
- Lambda expressions.
- Matrices are iterable.
2022-05-31 (v0.4.1):
- The Rules.
- Matrix multiplication, dot products.
- Euclidean distance, few-nearest-neighbour and fixed-radius search.
- Aggregation of multidimensional data.
- Regression with k-nearest neighbours.
- Least squares fitting of linear regression models.
- Geometric transforms; orthonormal matrices.
- SVD and dimensionality reduction/PCA.
- Classification with k-nearest neighbours.
- Clustering with k-means.
- Text Processing and Regular Expression chapters merged.
- Unidimensional Data Aggregation and Transformation chapters merged.
- pandas.GroupBy objects are iterable.
- Semitransparent histograms.
- Contour plots.
- Argument unpacking and variadic arguments (*args, **kwargs).
2022-05-23 (v0.3.1):
- More lightweight mathematical notation.
- Some equalities related to the mathematical functions we rely on (the natural logarithm, cosine, etc.).
- A way to compute the most correlated pair of variables.
- A note on modifying elements in an array and on adding new rows and columns.
- An example seasonal plot in the time series chapter.
- Solutions to the SQL exercises added; to ignore small round-off errors, use pandas.testing.assert_frame_equal instead of pandas.DataFrame.equals.
- More details on file paths.
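
The v0.3.1 remark about comparing data frames while ignoring small round-off errors can be illustrated with a minimal sketch. The two data frames below are made up for this illustration and do not come from the book's exercises; the tolerance is the pandas default.

```python
import pandas as pd

# Hypothetical data: two frames that differ only by a tiny round-off error.
expected = pd.DataFrame({"x": [0.1 + 0.2, 1.0]})
computed = pd.DataFrame({"x": [0.3, 1.0]})

# Exact comparison: 0.1 + 0.2 is not exactly 0.3 in floating-point
# arithmetic, so the frames are reported as different.
exact_match = expected.equals(computed)

# Tolerant comparison: returns None when the frames agree up to the default
# relative and absolute tolerances, and raises AssertionError otherwise.
pd.testing.assert_frame_equal(expected, computed)
tolerant_match = True  # reached only if no exception was raised above

print(exact_match, tolerant_match)
```

Note that pandas.testing.assert_frame_equal also checks column names, dtypes, and the index by default, so it is stricter than a purely numeric comparison in those respects.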
2022-04-12 (v0.2.1):
- Many chapters merged or relocated.
- Added captions to all figures.
- Improved formatting of elements (information boxes such as note, important, exercise, example).
2022-03-27 (v0.1.1):
- The first public release: most chapters are drafted, more or less.
- Using Sphinx for building.
2022-01-05 (v0.0.0):
- Project started.