Minimalist Data Wrangling with Python
Minimalist Data Wrangling with Python is envisaged as a student’s first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, transforming, selecting, and extracting features, performing exploratory data analysis and dimensionality reduction, identifying naturally occurring data clusters, modelling patterns in data, comparing data between groups, and reporting the results.
Although available online, it is a whole course and should be read from beginning to end. In particular, refer to the Preface for general introductory remarks.
For many students around the world, educational resources are hardly affordable. Therefore, I have decided that this book should remain an independent, non-profit, open-access project (available both in PDF and HTML forms). Whilst, for some people, the presence of a “designer tag” from a major publisher might still be a proxy for quality, it is my hope that this publication will prove useful to those who seek knowledge for knowledge’s sake.
You can also order a paper copy.
Bug and typo reports, as well as fixes, are appreciated. Please submit them via this project’s GitHub repository. Thank you.
Copyright (C) 2022–2023 by Marek Gagolewski. Some rights reserved. This material is published under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).
- 1. Getting started with Python
- 2. Scalar types and control structures in Python
- 3. Sequential and other types in Python
- 4. Unidimensional numeric data and their empirical distribution
- 5. Processing unidimensional data
- 6. Continuous probability distributions
- 7. Multidimensional numeric data at a glance
- 8. Processing multidimensional data
- 9. Exploring relationships between variables
- 10. Introducing data frames
- 11. Handling categorical data
- 12. Processing data in groups
- 13. Accessing databases
- 14. Text data
- 15. Missing, censored, and questionable data
- 16. Time series