Minimalist Data Wrangling with Python

Marek Gagolewski

doi:10.5281/zenodo.6451068

References

[1]

Abramowitz, M. and Stegun, I.A., editors. (1972). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover Publications. URL: https://personal.math.ubc.ca/~cbm/aands/intro.htm.

[2]

Aggarwal, C.C. (2015). Data Mining: The Textbook. Springer.

[3]

Arnold, B.C. (2015). Pareto Distributions. Chapman and Hall/CRC. DOI: 10.1201/b18141.

[4]

Arnold, T.B. and Emerson, J.W. (2011). Nonparametric goodness-of-fit tests for discrete null distributions. The R Journal, 3(2):34–39. DOI: 10.32614/RJ-2011-016.

[5]

Bartoszyński, R. and Niewiadomska-Bugaj, M. (2007). Probability and Statistical Inference. Wiley.

[6]

Beirlant, J., Goegebeur, Y., Teugels, J., and Segers, J. (2004). Statistics of Extremes: Theory and Applications. Wiley. DOI: 10.1002/0470012382.

[7]

Benaglia, T., Chauveau, D., Hunter, D.R., and Young, D.S. (2009). Mixtools: An R package for analyzing mixture models. Journal of Statistical Software, 32(6):1–29. DOI: 10.18637/jss.v032.i06.

[8]

Bezdek, J.C., Ehrlich, R., and Full, W. (1984). FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2–3):191–203. DOI: 10.1016/0098-3004(84)90020-7.

[9]

Billingsley, P. (1995). Probability and Measure. John Wiley & Sons.

[10]

Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer-Verlag. URL: https://www.microsoft.com/en-us/research/people/cmbishop.

[11]

Blum, A., Hopcroft, J., and Kannan, R. (2020). Foundations of Data Science. Cambridge University Press. URL: https://www.cs.cornell.edu/jeh/book.pdf.

[12]

Box, G.E.P. and Cox, D.R. (1964). An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological), 26(2):211–252.

[13]

Bullen, P.S. (2003). Handbook of Means and Their Inequalities. Springer Science+Business Media.

[14]

Campello, R.J.G.B., Moulavi, D., Zimek, A., and Sander, J. (2015). Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Transactions on Knowledge Discovery from Data, 10(1):5:1–5:51. DOI: 10.1145/2733381.

[15]

Chambers, J.M. and Hastie, T. (1991). Statistical Models in S. Wadsworth & Brooks/Cole.

[16]

Clauset, A., Shalizi, C.R., and Newman, M.E.J. (2009). Power-law distributions in empirical data. SIAM Review, 51(4):661–703. DOI: 10.1137/070710111.

[17]

Connolly, T. and Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management. Pearson.

[18]

Conover, W.J. (1972). A Kolmogorov goodness-of-fit test for discontinuous distributions. Journal of the American Statistical Association, 67(339):591–596. DOI: 10.1080/01621459.1972.10481254.

[19]

Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. URL: https://archive.org/details/in.ernet.dli.2015.223699.

[20]

Dasu, T. and Johnson, T. (2003). Exploratory Data Mining and Data Cleaning. John Wiley & Sons.

[21]

Date, C.J. (2003). An Introduction to Database Systems. Pearson.

[22]

Deisenroth, M.P., Faisal, A.A., and Ong, C.S. (2020). Mathematics for Machine Learning. Cambridge University Press. URL: https://mml-book.github.io/.

[23]

Dekking, F.M., Kraaikamp, C., Lopuhaä, H.P., and Meester, L.E. (2005). A Modern Introduction to Probability and Statistics: Understanding Why and How. Springer.

[24]

Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer. DOI: 10.1007/978-1-4612-0711-5.

[25]

Deza, M.M. and Deza, E. (2014). Encyclopedia of Distances. Springer.

[26]

Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge University Press.

[27]

Ester, M., Kriegel, H.P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. KDD'96, pp. 226–231.

[28]

Feller, W. (1950). An Introduction to Probability Theory and Its Applications: Volume I. Wiley.

[29]

Forbes, C., Evans, M., Hastings, N., and Peacock, B. (2010). Statistical Distributions. Wiley.

[30]

Freedman, D. and Diaconis, P. (1981). On the histogram as a density estimator: L₂ theory. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57:453–476.

[31]

Friedl, J.E.F. (2006). Mastering Regular Expressions. O'Reilly.

[32]

Gagolewski, M. (2015). Data Fusion: Theory, Methods, and Applications. Institute of Computer Science, Polish Academy of Sciences. DOI: 10.5281/zenodo.6960306.

[33]

Gagolewski, M. (2015). Spread measures and their relation to aggregation functions. European Journal of Operational Research, 241(2):469–477. DOI: 10.1016/j.ejor.2014.08.034.

[34]

Gagolewski, M. (2021). genieclust: Fast and robust hierarchical clustering. SoftwareX, 15:100722. URL: https://genieclust.gagolewski.com/, DOI: 10.1016/j.softx.2021.100722.

[35]

Gagolewski, M. (2022). stringi: Fast and portable character string processing in R. Journal of Statistical Software, 103(2):1–59. URL: https://stringi.gagolewski.com/, DOI: 10.18637/jss.v103.i02.

[36]

Gagolewski, M. (2025). Deep R Programming. URL: https://deepr.gagolewski.com/, DOI: 10.5281/zenodo.7490464.

[37]

Gagolewski, M., Bartoszuk, M., and Cena, A. (2016). Przetwarzanie i analiza danych w języku Python (Data Processing and Analysis in Python). PWN. In Polish.

[38]

Gagolewski, M., Bartoszuk, M., and Cena, A. (2021). Are cluster validity measures (in)valid? Information Sciences, 581:620–636. DOI: 10.1016/j.ins.2021.10.004.

[39]

Gentle, J.E. (2003). Random Number Generation and Monte Carlo Methods. Springer.

[40]

Gentle, J.E. (2009). Computational Statistics. Springer-Verlag.

[41]

Gentle, J.E. (2020). Theory of Statistics. Book draft. URL: https://mason.gmu.edu/~jgentle/books/MathStat.pdf.

[42]

Gentle, J.E. (2024). Matrix Algebra: Theory, Computations and Applications in Statistics. Springer.

[43]

Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1):5–48. URL: https://perso.ens-lyon.fr/jean-michel.muller/goldberg.pdf.

[44]

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. URL: https://www.deeplearningbook.org/.

[45]

Grabisch, M., Marichal, J.-L., Mesiar, R., and Pap, E. (2009). Aggregation Functions. Cambridge University Press.

[46]

Grimmett, G.R. and Stirzaker, D.R. (2020). Probability and Random Processes. Oxford University Press.

[47]

Gumbel, E.J. (1939). La probabilité des hypothèses. Comptes Rendus de l'Académie des Sciences Paris, 209:645–647.

[48]

Harris, C.R. and others. (2020). Array programming with NumPy. Nature, 585(7825):357–362. DOI: 10.1038/s41586-020-2649-2.

[49]

Hart, E.M. and others. (2016). Ten simple rules for digital data storage. PLOS Computational Biology, 12(10):1–12. DOI: 10.1371/journal.pcbi.1005097.

[50]

Hastie, T., Tibshirani, R., and Friedman, J. (2017). The Elements of Statistical Learning. Springer-Verlag. URL: https://hastie.su.domains/ElemStatLearn.

[51]

Higham, N.J. (2002). Accuracy and Stability of Numerical Algorithms. SIAM. DOI: 10.1137/1.9780898718027.

[52]

Hopcroft, J.E. and Ullman, J.D. (1979). Introduction to Automata Theory, Languages, and Computation. Addison-Wesley.

[53]

Huber, P.J. and Ronchetti, E.M. (2009). Robust Statistics. Wiley.

[54]

Hunter, J.D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95.

[55]

Hyndman, R.J. and Athanasopoulos, G. (2021). Forecasting: Principles and Practice. OTexts. URL: https://otexts.com/fpp3.

[56]

Hyndman, R.J. and Fan, Y. (1996). Sample quantiles in statistical packages. American Statistician, 50(4):361–365. DOI: 10.2307/2684934.

[57]

Kleene, S.C. (1951). Representation of events in nerve nets and finite automata. Technical Report RM-704, The RAND Corporation, Santa Monica, CA. URL: https://www.rand.org/content/dam/rand/pubs/research_memoranda/2008/RM704.pdf.

[58]

Knuth, D.E. (1992). Literate Programming. CSLI.

[59]

Knuth, D.E. (1997). The Art of Computer Programming II: Seminumerical Algorithms. Addison-Wesley.

[60]

Kuchling, A.M. (2023). Regular Expression HOWTO. URL: https://docs.python.org/3/howto/regex.html.

[61]

Lee, J. (2011). A First Course in Combinatorial Optimisation. Cambridge University Press.

[62]

Ling, R.F. (1973). A probability theory of cluster analysis. Journal of the American Statistical Association, 68(341):159–164. DOI: 10.1080/01621459.1973.10481356.

[63]

Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis with Missing Data. John Wiley & Sons.

[64]

Lloyd, S.P. (1957 (1982)). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:128–137. Originally a 1957 Bell Telephone Laboratories Research Report; republished in 1982. DOI: 10.1109/TIT.1982.1056489.

[65]

Matloff, N.S. (2011). The Art of R Programming: A Tour of Statistical Software Design. No Starch Press.

[66]

McKinney, W. (2022). Python for Data Analysis. O'Reilly. URL: https://wesmckinney.com/book.

[67]

Modarres, M., Kaminskiy, M.P., and Krivtsov, V. (2016). Reliability Engineering and Risk Analysis: A Practical Guide. CRC Press.

[68]

Monahan, J.F. (2011). Numerical Methods of Statistics. Cambridge University Press.

[69]

Müllner, D. (2011). Modern hierarchical, agglomerative clustering algorithms. arXiv:1109.2378 [stat.ML]. URL: https://arxiv.org/abs/1109.2378v1.

[70]

Nelsen, R.B. (1999). An Introduction to Copulas. Springer-Verlag.

[71]

Newman, M.E.J. (2005). Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323–351. DOI: 10.1080/00107510500052444.

[72]

Oetiker, T. and others. (2021). The Not So Short Introduction to LaTeX 2ε. URL: https://tobi.oetiker.ch/lshort/lshort.pdf.

[73]

Olver, F.W.J. and others. (2025). NIST Digital Library of Mathematical Functions. URL: https://dlmf.nist.gov/.

[74]

Ord, J.K., Fildes, R., and Kourentzes, N. (2017). Principles of Business Forecasting. Wessex Press.

[75]

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

[76]

Poore, G.M. (2019). Codebraid: Live code in pandoc Markdown. In: Proc. 18th Python in Science Conf., pp. 54–61. DOI: 10.25080/Majora-7ddc1dd1-008.

[77]

Press, W.H., Teukolsky, S.A., Vetterling, W.T., and Flannery, B.P. (2007). Numerical Recipes. The Art of Scientific Computing. Cambridge University Press.

[78]

Pérez-Fernández, R., De Baets, B., and Gagolewski, M. (2019). A taxonomy of monotonicity properties for the aggregation of multidimensional data. Information Fusion, 52:322–334. DOI: 10.1016/j.inffus.2019.05.006.

[79]

Rabin, M. and Scott, D. (1959). Finite automata and their decision problems. IBM Journal of Research and Development, 3:114–125.

[80]

Ritchie, D.M. and Thompson, K.L. (1970). QED text editor. Technical Report 70107-002, Bell Telephone Laboratories, Inc. URL: https://wayback.archive-it.org/all/20150203071645/http://cm.bell-labs.com/cm/cs/who/dmr/qedman.pdf.

[81]

Robert, C.P. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer-Verlag.

[82]

Ross, S.M. (2020). Introduction to Probability and Statistics for Engineers and Scientists. Academic Press.

[83]

Ross, S.M. (2024). Introduction to Probability Models. Elsevier.

[84]

Rousseeuw, P.J., Ruts, I., and Tukey, J.W. (1999). The bagplot: A bivariate boxplot. The American Statistician, 53(4):382–387. DOI: 10.2307/2686061.

[85]

Rubin, D.B. (1976). Inference and missing data. Biometrika, 63(3):581–590.

[86]

Sandve, G.K., Nekrutenko, A., Taylor, J., and Hovig, E. (2013). Ten simple rules for reproducible computational research. PLOS Computational Biology, 9(10):1–4. DOI: 10.1371/journal.pcbi.1003285.

[87]

Smith, S.W. (2002). The Scientist and Engineer's Guide to Digital Signal Processing. Newnes. URL: https://www.dspguide.com/.

[88]

Spicer, A. (2018). Business Bullshit. Routledge.

[89]

Steiglitz, K. (1996). A Digital Signal Processing Primer: With Applications to Digital Audio and Computer Music. Pearson.

[90]

Tijms, H.C. (2003). A First Course in Stochastic Models. Wiley.

[91]

Tufte, E.R. (2001). The Visual Display of Quantitative Information. Graphics Press.

[92]

Tukey, J.W. (1962). The future of data analysis. Annals of Mathematical Statistics, 33(1):1–67. URL: https://projecteuclid.org/journalArticle/Download?urlId=10.1214%2Faoms%2F1177704711, DOI: 10.1214/aoms/1177704711.

[93]

Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley.

[94]

van Buuren, S. (2018). Flexible Imputation of Missing Data. CRC Press. URL: https://stefvanbuuren.name/fimd.

[95]

van der Loo, M. and de Jonge, E. (2018). Statistical Data Cleaning with Applications in R. John Wiley & Sons.

[96]

Venables, W.N., Smith, D.M., and R Core Team. (2025). An Introduction to R. URL: https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf.

[97]

Virtanen, P. and others. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272. DOI: 10.1038/s41592-019-0686-2.

[98]

Wainer, H. (1997). Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot. Copernicus.

[99]

Waskom, M.L. (2021). seaborn: Statistical data visualization. Journal of Open Source Software, 6(60):3021. DOI: 10.21105/joss.03021.

[100]

Wickham, H. (2011). The split-apply-combine strategy for data analysis. Journal of Statistical Software, 40(1):1–29. DOI: 10.18637/jss.v040.i01.

[101]

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10):1–23. DOI: 10.18637/jss.v059.i10.

[102]

Wickham, H., Çetinkaya-Rundel, M., and Grolemund, G. (2023). R for Data Science. O'Reilly. URL: https://r4ds.hadley.nz/.

[103]

Wierzchoń, S.T. and Kłopotek, M.A. (2018). Modern Algorithms for Cluster Analysis. Springer. DOI: 10.1007/978-3-319-69308-8.

[104]

Wilson, G. and others. (2014). Best practices for scientific computing. PLOS Biology, 12(1):1–7. DOI: 10.1371/journal.pbio.1001745.

[105]

Wilson, G. and others. (2017). Good enough practices in scientific computing. PLOS Computational Biology, 13(6):1–20. DOI: 10.1371/journal.pcbi.1005510.

[106]

Xie, Y. (2015). Dynamic Documents with R and knitr. Chapman and Hall/CRC.