5. Processing Unidimensional Data

The online version of the open-access textbook Minimalist Data Wrangling with Python by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). Any bug/typos reports/fixes are appreciated. Although available online, this is a whole course; it should be read from the beginning to the end. In particular, refer to the Preface for general introductory remarks.

It is extremely rare for our datasets to bring interesting and valid insights out of the box. The ones we are using for illustrational purposes in the first part of our book have already been curated. After all, this is an introductory course, and we need to build up the necessary skills and not overwhelm the kind reader with too much information at the same time. We learn simple things first, learn them well, and then we move to more complex matters with a healthy level of confidence.

In real life, various data cleansing and feature engineering techniques will need to be performed on data. Most of them are based on the simple operations on vectors that we cover in this chapter:

  • summarising data (for example, computing the median or sum),

  • transforming values (applying mathematical operations on each element, such as subtracting a scalar or taking the natural logarithm),

  • filtering (selecting or removing observations that meet specific criteria, e.g., those that are larger than the arithmetic mean ± 5 standard deviations).

Important

The same operations we are going to be applying on individual data frame columns in Chapter 10.

5.1. Aggregating Numeric Data

Recall that in the previous chapter we have been discussing the heights and income datasets whose histograms were given in Figure 4.2 and Figure 4.3, respectively. Such graphical summaries are based on binned data and hence provide us with snapshots of how much probability mass is allocated in different parts of the data domain.

Instead of dealing with large datasets, we have obtained a few counts. The process of binning and its textual or visual depictions is useful in determining whether the distribution is unimodal or multimodal, skewed or symmetric around some point, what range of values contains most of the observations, how small or large extreme values are, etc.

Still, too much information may sometimes be overwhelming. Also, revealing it might not be a clever idea for privacy or confidentially reasons (although we strongly advocate for all information of concern to the public be openly available, so that experienced statisticians can put them in good use).

Thus, oftentimes we will be interested in even more synthetic descriptions – data aggregates which reduce the whole dataset into a single number reflecting one of its many characteristics and thus providing a kind of bird’s eye view on some aspect of it. We refer to such a process as data aggregation [Gag15a, GMMP09].

In this part we discuss a few noteworthy measures of:

  • location; e.g., central tendency measures such as mean and median;

  • dispersion; e.g., standard deviation and interquartile range;

  • distribution shape; e.g., skewness.

We also introduce box and whisker plots.

5.1.1. Measures of Location

5.1.1.1. Arithmetic Mean and Median

Two main measures of central tendency are:

  • the arithmetic mean (sometimes for simplicity called the mean or average), defined as the sum of all observations divided by the sample size:

    \[ \bar{x} = \frac{(x_1+x_2+\dots+x_n)}{n} = \frac{1}{n} \sum_{i=1}^n x_i, \]
  • the median, being the middle value in a sorted version of the sample if its length is odd or the arithmetic mean of the two middle values otherwise:

    \[\begin{split} m = \left\{ \begin{array}{ll} x_{(n+1)/2} & \text{if }n\text{ is odd},\\ \frac{x_{(n/2)}+x_{(n/2+1)}}{2} & \text{if }n\text{ is even}. \end{array} \right. \end{split}\]

They can be computed using the numpy.mean and numpy.median functions.

heights = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/nhanes_adult_female_height_2020.txt")
np.mean(heights), np.median(heights)
## (160.13679222932953, 160.1)
income = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/uk_income_simulated_2020.txt")
np.mean(income), np.median(income)
## (35779.994, 30042.0)

Let us underline what follows:

  • for symmetric distributions, the arithmetic mean and the median are expected to be more or less equal,

  • for skewed distributions, the arithmetic mean will be biased towards the heavier tail.

Exercise 5.1

Compute the arithmetic mean and median for the 37_pzu_warsaw_marathon_mins dataset mentioned in Chapter 4.

Exercise 5.2

(*) Write a function that computes the median without the use of numpy.median (based on its mathematical definition and numpy.sort).

Note

(*) Technically, the arithmetic mean can also be computed using the mean method for the numpy.ndarray class – it will sometimes be the case that we have many ways to perform the same operation. We can even “implement” it manually using the sum function. Thus, all the following expressions are equivalent:

print(
    np.mean(income),
    income.mean(),
    np.sum(income)/len(income),
    income.sum()/income.shape[0]
)
## 35779.994 35779.994 35779.994 35779.994

On the other hand, there exists the numpy.median function but, unfortunately, the median method for vectors is not available. This is why we prefer sticking with functions.

5.1.1.2. Sensitive to Outliers vs Robust

The arithmetic mean is strongly influenced by very large or very small observations (which in some contexts we refer to as outliers). For instance, assume that we are inviting a billionaire to the income dataset:

income2 = np.append(income, [1_000_000_000])
print(np.mean(income), np.mean(income2))
## 35779.994 1034745.2487512487

Comparing this new result to the previous one, oh we all feel much richer now, right? In fact, the arithmetic mean reflects the income each of us would get if all the wealth were gathered in a single Santa Claus (or Robin Hood) sack and then distributed equally amongst all of us. A noble idea provided that everyone contributes equally to the society, which unfortunately is not the case.

On the other hand, the median is the value such that 50% of the observations are less than or equal to it and 50% of the remaining ones are not less than it. Hence, it is completely insensitive to most of the data points – on both the left and the right side of the distribution:

print(np.median(income), np.median(income2))
## 30042.0 30076.0

Because of this, we cannot say that one measure is better than the other. Certainly, for symmetrical distributions with no outliers (e.g., heights), the mean will be better as it uses all data (and its efficiency can be proven for certain statistical models). For skewed distributions (e.g., income), the median has a nice interpretation, as it gives the value in the middle of the ordered sample. Let us still remember that these data summaries allow us to look at a single data aspect only, and there can be many different, valid perspectives. The reality is complex.

5.1.1.3. Sample Quantiles

Quantiles generalise the notions of the sample median and of the inverse of the empirical cumulative distribution function (Section 4.3.8). They provide us with the value that is not exceeded by the elements in a given sample with a predefined probability.

Before proceeding with a formal definition, which is quite technical, let us point out that for larger sample sizes, we have the following rule of thumb.

Important

For any p between 0 and 1, the p-quantile, denoted \(q_p\), is a value dividing the sample in such a way that approximately \(100p\%\) of observations are not greater than \(q_p\), and the remaining ca. \(100(1-p)\%\) are not less than \(q_p\).

Quantiles appear under many different names, but they all refer to the same concept. In particular, we can speak about 100p-th percentiles, e.g., the 0.5-quantile is the same as the 50th percentile.

Furthermore:

  • 0-quantile (\(q_0\)) – the minimum (also: numpy.min),

  • 0.25-quantile (\(q_{0.25}\)) – the 1st quartile (denoted \(Q_1\)),

  • 0.5-quantile (\(q_{0.5}\)) – the 2nd quartile a.k.a. median (denoted \(m\) or \(Q_2\)),

  • 0.75-quantile (\(q_{0.75}\)) – the 3rd quartile (denoted \(Q_3\)),

  • 1.0-quantile (\(q_{1}\)) – the maximum (also: numpy.max).

Here are the above five aggregates for our two datasets:

np.quantile(heights, [0, 0.25, 0.5, 0.75, 1])
## array([131.1, 155.3, 160.1, 164.8, 189.3])
np.quantile(income,  [0, 0.25, 0.5, 0.75, 1])
## array([  5750.  ,  20669.75,  30042.  ,  44123.75, 199969.  ])
Exercise 5.3

What is the income bracket for 95% of the “most typical” UK taxpayers? In other words, determine the 2.5- and 97.5-percentiles.

Exercise 5.4

Compute the midrange of income and heights, being the arithmetic mean of the minimum and the maximum. Let us emphasise that this measure is extremely sensitive to outliers.

Note

(*) As we do not like the approximately part in the “asymptotic definition” given above, in this course we shall assume that for any \(p\in[0,1]\), the p-quantile is given by

\[ q_p = x_{(\lfloor k\rfloor)} + (k-\lfloor k\rfloor) (x_{(\lfloor k\rfloor+1)}-x_{(\lfloor k\rfloor)}), \]

where \(k=(n-1)p+1\) and \(\lfloor k\rfloor\) is the floor function, i.e., the greatest integer less than or equal to \(k\) (e.g., \(\lfloor 2.0\rfloor=\lfloor 2.001\rfloor=\lfloor 2.999\rfloor=2\), \(\lfloor 3.0\rfloor=\lfloor 3.999\rfloor=3\), \(\lfloor -3.0\rfloor=\lfloor -2.999\rfloor=\lfloor -2.001\rfloor=-3\), and \(\lfloor -2.0\rfloor=\lfloor -1.001\rfloor=-2\)).

\(q_p\) is thus a function that linearly interpolates between the points featuring the consecutive order statistics, \(((k - 1) / (n - 1), x_{(k)})\) for \(k=1,\dots,n\). For instance, for \(n=5\), we are linearly interpolating between the points \((0, x_{(1)})\), \((0.25, x_{(2)})\), \((0.5, x_{(3)})\), \((0.75, x_{(4)})\), \((1, x_{(5)})\), whereas for \(n=6\), we do the same for \((0, x_{(1)})\), \((0.2, x_{(2)})\), \((0.4, x_{(3)})\), \((0.6, x_{(4)})\), \((0.8, x_{(5)})\), \((1, x_{(6)})\), see Figure 5.1 for an illustration.

../_images/quantile7-1.png

Figure 5.1 \(q_p\) as a function of \(p\) for an example vector of length 5 (left subfigure) and 6 (right)

Notice that for p=0.5 we get the median regardless of whether n is even or not.

Note

(**) There are many definitions of quantiles across statistical software, see the method argument to numpy.quantile. They have been nicely summarised in [HF96] as well as in the corresponding Wikipedia article. They are all approximately equivalent for large sample sizes (i.e., asymptotically), but the best practice is to be explicit about which variant we are using in the computations when reporting data analysis results. And thus, in our case, we say that we are relying on the type-7 quantiles as described in [HF96], see also [Gum39].

Actually, simply mentioning that our computations are done with numpy version 1.xx (Section 1.4) implicitly implies that the default method parameters are used everywhere, unless otherwise stated. In many contexts, that is good enough.

5.1.2. Measures of Dispersion

Measures of central tendency quantify the location of the most typical value (whatever that means, and we already know it is complicated). That of dispersion (spread), on the other hand, will tell us how much variability is in our data. After all, when we say that the height of a group of people is 160 cm (on average) ± 14 cm (here, 2 standard deviations), the latter piece of information is a valuable addition (and is much different than the imaginary ± 4 cm case).

Some degree of variability might be good in certain contexts and bad in other ones. A bolt factory should keep the variance of the fasteners’ diameters as low as possible – after all, this is how we define quality products (assuming that they all meet, on average, the required specification). On the other hand, too much diversity in human behaviour, where everyone feels that they are special, is not really sustainable (but lack thereof would be extremely boring), and so forth.

Let us therefore look at different ways in which we can quantify it.

5.1.2.1. Standard Deviation (and Variance)

The standard deviation1, is the average distance to the arithmetic mean:

\[s = \sqrt{ \frac{ (x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \dots + (x_n-\bar{x})^2 }{n} } = \sqrt{ \frac{1}{n} \sum_{i=1}^n (x_i-\bar{x})^2, } \]

Computing the above with numpy:

np.std(heights), np.std(income)
## (7.062021850008261, 22888.77122437908)

The standard deviation quantifies the typical amount of spread around the arithmetic mean. It is overall useful for making comparisons across different samples measuring similar things (e.g., heights of males vs of females, incomes in the UK vs in South Africa). However, without further assumptions, it is quite difficult to express the meaning of a particular value of s (e.g., the statement that the standard deviation of income is £22,900 is hard to interpret). Thus, it makes most sense for data distributions that are symmetric around the mean.

Note

(*) For bell-shaped data (more precisely: for normally-distributed samples, see the next chapter) such as heights, we sometimes report \(\bar{x}\pm 2s\), because the theoretical expectancy is that ca. 95% of data points fall into the \([\bar{x}-2s, \bar{x}+2s]\) interval (the so-called 2σ rule).

Further, the variance is the square of the standard deviation, \(s^2\). Mind that if data are expressed in centimetres, then the variance is in centimetres squared, which is not very intuitive. The standard deviation does not have this drawback. Mathematicians find the square root annoying though (for many reasons); that is why we will come across the \(s^2\) every now and then too.

5.1.2.2. Interquartile Range

The interquartile range (IQR) is another popular measure of dispersion. It is defined as the difference between the 3rd and the 1st quartile:

\[ \mathrm{IQR} = q_{0.75}-q_{0.25} = Q_3-Q_1. \]

Computing the above is almost effortless:

np.quantile(heights, 0.75) - np.quantile(heights, 0.25)
## 9.5
np.quantile(income, 0.75) - np.quantile(income, 0.25)
## 23454.0

The IQR has an appealing interpretation: it is the range comprised of the 50% most typical values. It is a quite robust measure, as it ignores the 25% smallest and 25% largest observations. Standard deviation, on the other hand, is extremely sensitive to outliers.

Furthermore, by range (or support) we will mean a measure based on extremal quantiles: it is the difference between the maximal and minimal observation.

5.1.3. Measures of Shape

From a histogram we can easily read whether a dataset is symmetric or skewed. It turns out that we can easily quantify such a characteristic. Namely, the skewness is given by:

\[ g = \frac{ \frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^3 }{ \left(\sqrt{\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2}\right)^3 }. \]

For symmetric distributions, skewness is approximately zero. Positive and negative skewness indicates a heavier right and left tail, respectively.

For example, heights are an instance of an almost-symmetric distribution:

scipy.stats.skew(heights)
## 0.0811184528074054

Income, on the other hand, is right-skewed:

scipy.stats.skew(income)
## 1.9768735693998942

And now we have both expressed as a single number.

Note

(*) It is worth stressing that \(g>0\) does not imply that the sample mean is greater than the median. As an alternative measure of skewness, sometime practitioners use:

\[ g'=\frac{\bar{x}-m}{s}. \]

Furthermore, Yule’s coefficient is an example of a robust skewness measure:

\[ g''=\frac{Q_3+Q_1-2m}{Q_3-Q_1}. \]

The computation thereof on our example datasets is left as an exercise.

Furthermore, kurtosis (or Fisher’s excess kurtosis, compare scipy.stats.kurtosis) can be used to describe whether an empirical distribution is heavy- or thin-tailed.

5.1.4. Box (and Whisker) Plots

A box and whisker plot (box plot for short) depicts some noteworthy features of a data sample.

plt.subplot(2, 1, 1)  # 2 rows, 1 column, 1st subplot
sns.boxplot(data=heights, orient="h", color="lightgray")
plt.yticks([0], ["heights"])
plt.subplot(2, 1, 2)  # 2 rows, 1 column, 2nd subplot
sns.boxplot(data=income, orient="h", color="lightgray")
plt.yticks([0], ["income"])
plt.show()
../_images/boxplot-income-heights-3.png

Figure 5.2 Example box plots

Each box plot (compare Figure 5.2) consists of:

  • the box, which spans between the 1st and the 3rd quartile:

    • the median is clearly marked by a vertical bar inside the box;

    • the width of the box corresponds to the IQR;

  • the whiskers, which span2 between:

    • the smallest observation (the minimum) or \(Q_1-1.5 \mathrm{IQR}\) (the left side of the box minus 3/2 of its width), whichever is larger, and

    • the largest observation (the maximum) or \(Q_3+1.5 \mathrm{IQR}\) (the right side of the box plus 3/2 of its width), whichever is smaller.

Additionally, all observations that are less than \(Q_1-1.5 \mathrm{IQR}\) (if any) or greater than \(Q_3+1.5 \mathrm{IQR}\) (if any) are separately marked.

Note

We are used to referring to the individually marked points as outliers, however, it does not automatically mean there is anything anomalous about them. They are atypical in the sense that they are considerably farther away from the box. However, they might also indicate some problems in data quality (e.g., when someone made a typo entering the data). Actually, box plots are calibrated (via the nicely round magic constant 1.5) in such a way that we expect there to be no or only few outliers if the data are normally distributed. For skewed distributions, there will naturally be many outliers on either side. See Section 15.4 for more details.

Box plots are based solely on sample quantiles. Most of the statistical packages do not draw the arithmetic mean. If they do, it is marked with a distinctive symbol.

Exercise 5.5

Call matplotlib.pyplot.plot(numpy.mean(..data..), 0, "wX") after seaborn.boxplot to mark the arithmetic mean with a white cross.

Box plots are particularly useful for comparing data samples with each other (e.g., body measures of men and women separately); both in terms of the relative shift (location) as well as spread and skewness, see, e.g., Figure 12.1.

Example 5.6

(*) We may also sometimes be interested in a violin plot (Figure 5.3), which combines a box plot (although with no outliers marked) and the so-called kernel density estimator (which is a smoothened version of a histogram, see Section 15.4.2).

plt.subplot(2, 1, 1)  # 2 rows, 1 column, 1st subplot
sns.violinplot(data=heights, orient="h", color="lightgray")
plt.yticks([0], ["heights"])
plt.subplot(2, 1, 2)  # 2 rows, 1 column, 2nd subplot
sns.violinplot(data=income, orient="h", color="lightgray")
plt.yticks([0], ["income"])
plt.show()
../_images/violinplot-income-5.png

Figure 5.3 Eexample violin plots

5.1.5. Further Methods (*)

We have said that the arithmetic mean is overly sensitive to extreme observations. The sample median is an example of a robust aggregate — it ignores all but 1-2 middle observations (we would say it has a high breakdown point). Some measures of central tendency that are in-between the mean-median extreme include:

  • trimmed means – the arithmetic mean of all the observations except a number of, say p, the smallest and the greatest ones,

  • winsorised means – the arithmetic mean with p smallest and p greatest observations replaced with the (p+1)-smallest the (p+1)-largest one.

The arithmetic mean is not the only mean of interest. The two other famous means are the geometric and harmonic ones. The former is more meaningful for averaging growth rates and speedups whilst the latter can be used for computing the average speed from speed measurements at sections of identical lengths, see also the notion of the F-measure in Section 12.3.2. Also, the quadratic mean is featured in the definition of the standard deviation (it is the quadratic mean of the distances to the mean).

As far as spread measures are concerned, the interquartile range (IQR) is a robust statistic. If necessary, standard deviation might be replaced with:

  • mean absolute deviation from the mean: \(\frac{1}{n} \sum_{i=1}^n |x_i-\bar{x}|\),

  • mean absolute deviation from the median: \(\frac{1}{n} \sum_{i=1}^n |x_i-m|\),

  • median absolute deviation from the median: the median of \((|x_1-m|, |x_2-m|, \dots, |x_n-m|)\).

The coefficient of variation, being the standard deviation dived by the arithmetic mean, is an example of a relative (or normalised) spread measure. It can be useful for comparing data on different scales, as it is unit-less (think how standard deviation changes when you convert between metres and centimetres).

The Gini index widely used in economics, can also serve as a measure of relative dispersion, but assumes that all data points are nonnegative.

\[ G = \frac{ \sum_{i=1}^{n} \sum_{j=1}^n |x_i-x_j| }{ 2 (n-1)n\, \bar{x} } = \frac{ \sum_{i=1}^{n} (n-2i+1) x_{(n-i+1)} }{ (n-1) \sum_{i=1}^n x_i }. \]

It is normalised so that it takes values in the unit interval. An index of 0 reflects the situation where all values in a sample are the same (0 variance; perfect equality). If there is a single entity in possession of all the “wealth”, and the remaining ones are 0, then the index is equal to 1.

Exercise 5.7

(*) It may be shown that the sample variance is equal to

\[ s^2 = \frac{\sum_{i=1}^{n} \sum_{j=1}^n |x_i-x_j|}{n^2}. \]

Based on this, show the relationship between the Gini index and the coefficient of variation.

For a more generic (algebraic) treatment of aggregation functions for unidimensional data, see, e.g., [Bul03, Gag15a, Gag15b, GMMP09]. Some measures might be better than others under certain (often strict) assumptions usually explored in a course on mathematical statistics, e.g., [Gen20].

Overall, numerical aggregates should be used in cases where data are unimodal. For multimodal mixtures or data in groups, they should rather be applied to summarise each cluster/class separately, compare Chapter 12. Also, in Chapter 8 we will extend consider a few summaries for multidimensional data.

5.2. Vectorised Mathematical Functions

Numpy, just like any other comprehensive numerical computing package, library, or environment (e.g., R, GNU Octave, Scilab, and Julia), defines many basic mathematical functions:

  • absolute value: numpy.abs,

  • square and square root: numpy.square and numpy.sqrt, respectively,

  • (natural) exponential function: numpy.exp,

  • logarithms: numpy.log (the natural logarithm, i.e., base e), numpy.log10 (base 10), etc.,

  • trigonometric functions: numpy.cos, numpy.sin, numpy.tan, etc., and their inverses: numpy.arccos, etc.

  • rounding and truncating: numpy.round, numpy.floor, numpy.ceil, numpy.trunc.

Each of these functions is vectorised: if we apply, say, \(f\), on a vector like \(\boldsymbol{x}=(x_1,\dots,x_n)\), we obtain a sequence of the same size with all elements appropriately transformed:

\[ f(\boldsymbol{x}) = (f(x_1), f(x_2), \dots, f(x_n)). \]

We will frequently be using such operations for adjusting the data, e.g., as in Figure 6.7 where we discover that the logarithm of the UK incomes has a bell-shaped distribution.

An example call to the vectorised version of the rounding function:

np.round([-3.249, -3.151, 2.49, 2.51, 3.49, 3.51], 1)
## array([-3.2, -3.2,  2.5,  2.5,  3.5,  3.5])

The input list has been automatically converted to a numpy vector.

Important

Thanks to the vectorised functions, our code is not only more readable, but also runs faster: we do not have employ the generally slow Python-level while or for loops to traverse through each element in a given sequence.

Here are some significant properties of the natural logarithm and its inverse, the exponential function. By convention, the Euler’s number \(e \simeq 2.718\), \(\log x = \log_e x\), and \(\exp(x)=e^x\).

  • \(\log 1 = 0\), \(\log e = 1\); note that logarithms are only defined for \(x > 0\): in the limit as \(x \to 0\), we have that \(\log x \to -\infty\),

  • \(\log x^y = y \log x\) and hence \(\log e^x = x\),

  • \(\log (xy) = \log x + \log y\) and thus \(\log (x/y) = \log x - \log y\),

  • \(e^0 = 1\), \(e^1=e\), and \(e^x\to 0\) as \(x \to -\infty\),

  • \(e^{\log x} = x\),

  • \(e^{x+y} = e^x e^y\) and hence \(e^{x-y}=e^x / e^y\).

Both functions are strictly increasing. For \(x\ge 1\), the logarithm grows very slowly whereas the exponential function increases very rapidly, see Figure 5.4.

plt.subplot(1, 2, 1)
x = np.linspace(np.exp(-2), np.exp(3), 1001)
plt.plot(x, np.log(x), label="$y=\\log x$")
plt.legend()
plt.subplot(1, 2, 2)
x = np.linspace(-2, 3, 1001)
plt.plot(x, np.exp(x), label="$y=\\exp(x)$")
plt.legend()
plt.show()
../_images/log-exp-7.png

Figure 5.4 The natural logarithm (left) and the exponential function (right)

Logarithms of different bases and non-natural exponential functions are also available. In particular, when drawing plots, we have already used the base-10 logarithmic scales on the axes. It holds \(\log_{10} x = \frac{\log x}{\log 10}\) and its inverse is \(10^x\). For example:

10.0**np.array([-1, 0, 1, 2])  # see below
## array([  0.1,   1. ,  10. , 100. ])
np.log10([-1, 0.01, 0.1, 1, 2, 5, 10, 100, 1000, 10000])
## array([     nan, -2.     , -1.     ,  0.     ,  0.30103,  0.69897,
##         1.     ,  2.     ,  3.     ,  4.     ])
## 
## <string>:1: RuntimeWarning: invalid value encountered in log10

Take note of the warning and the not-a-number (NaN) generated.

Exercise 5.8

Check that when using the log-scale on the x-axis (plt.xscale("log")), the plot of the logarithm (of any base) is a straight line. Similarly, the log-scale on the y-axis (plt.yscale("log")) makes the exponential function linear.

Moving on, the trigonometric functions in numpy accept angles in radians; if \(x\) is the degree in angles, then to compute its cosine we will thus be writing \(\cos (x \pi /180)\) instead; see Figure 5.5.

x = np.linspace(-2*np.pi, 4*np.pi, 1001)
plt.plot(x, np.cos(x))
plt.xticks(
    [-2*np.pi, -np.pi, 0, np.pi/2, np.pi, 3*np.pi/2, 2*np.pi, 4*np.pi],
    ["$-2\\pi$", "$-\\pi$", "$0$", "$\\pi/2$", "$\\pi$",
     "$3\\pi/2$", "$2\\pi$", "$4\\pi$"]
)
plt.show()
../_images/cos-9.png

Figure 5.5 The cosine

Some identities worth memorising:

  • \(\sin x = \cos(\pi/2 - x)\),

  • \(\cos(-x) = \cos x\),

  • \(\cos^2 x + \sin^2 x = 1\), where \(\cos^2 x = (\cos x)^2\),

  • \(\cos(x+y) = \cos x \cos y - \sin x \sin y\),

  • \(\cos(x-y) = \cos x \cos y + \sin x \sin y\).

We will refer to them later.

Important

The classical handbook of mathematical functions and the (in)equalities related to them is [AS72], see [O+22] for its updated version.

5.3. Arithmetic Operators

We can apply the standard binary (two-argument) arithmetic operators `+`, `-`, `*`, `/`, `**`, `%`, and `//` on vectors too. Below we discuss the possible cases of the operands’ lengths.

5.3.1. Vector-Scalar Case

Often, we will be referring to the binary operators in contexts where one operand is a vector and the other one is a single value (scalar), for example:

np.array([-2, -1, 0, 1, 2, 3])**2
## array([4, 1, 0, 1, 4, 9])
(np.array([-2, -1, 0, 1, 2, 3])+2)/5
## array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])

In such a case, each element in the vector is being operated upon (e.g., squared, divided by 5) and we get a vector of the same length in return. Hence, in this case the operators behave just like the vectorised mathematical functions discussed above.

Mathematically, it is common to assume that scalar multiplication and, less commonly, addition is performed in this way.

\[ c \boldsymbol{x} + t = (cx_1+t, cx_2+t, \dots, cx_n+t). \]

We will also become used to writing, \((\boldsymbol{x}-t)/c\) which is of course equivalent to \((1/c) \boldsymbol{x} + (-t/c)\).

5.3.2. Application: Feature Scaling

Vector-scalar operations and aggregation functions are the basis for the most commonly applied feature scalers:

  • standardisation,

  • normalisation,

  • min-max scaling and clipping.

They can increase the interpretability of data points by bringing them onto a common, unitless scale. They might also be essential when computing pairwise distances between multidimensional points, see Section 8.4.

All the above are linear transforms of the form \(\boldsymbol{y}=c \boldsymbol{x}+t\), which geometrically we can interpret as scaling (stretching or shrinking) and shifting (translating), see Figure 5.6 for an illustration.

../_images/scale-shift-11.png

Figure 5.6 Scaled and shifted versions of an example vector

Note

Let \(\boldsymbol{y}=c \boldsymbol{x}+t\) and let \(\bar{x}, \bar{y}, s_x, s_y\) denote the vectors’ arithmetic means and standard deviations. The following properties hold.

  • The arithmetic mean and all the quantiles (including, of course, the median), are equivariant with respect to translation and scaling; it holds, for instance, \(\bar{y} = c \bar{x} + t\).

  • Thhe standard deviation, the interquartile range, and the range are invariant to translations and equivariant to scaling; e.g., \(s_y = c s_x\).

As a by-product, for variance we get of course… \(s_y^2 = c^2 s_x^2\).

5.3.2.1. Standardisation and Z-scores

A standardised version of a vector \(\boldsymbol{x}=(x_1,\dots,x_n)\) consists in subtracting, from each element, the sample arithmetic mean (which we call centring) and then dividing it by the standard deviation, i.e., \(\boldsymbol{z}=(\boldsymbol{x}-\bar{x})/s\).

Thus, we transform each \(x_i\) to obtain:

\[ z_i = \frac{x_i-\bar{x}}{s}. \]

Consider again the female heights dataset:

heights[-5:] # preview
## array([157. , 167.4, 159.6, 168.5, 147.8])

whose mean \(\bar{x}\) and standard deviation \(s\) are equal to:

np.mean(heights), np.std(heights)
## (160.13679222932953, 7.062021850008261)

With numpy, standardisation is as simple as applying 2 aggregation functions and 2 arithmetic operations:

heights_std = (heights-np.mean(heights))/np.std(heights)
heights_std[-5:]  # preview
## array([-0.44417764,  1.02848843, -0.07601113,  1.18425119, -1.74692071])

What we obtained is sometimes referred to as the z-scores. They are nicely interpretable:

  • z-score of 0 corresponds to an observation equal to the sample mean (perfectly average);

  • z-score of 1 is obtained for a datum 1 standard deviation above the mean;

  • z-score of -2 means that it is a value 2 standard deviations below the mean;

and so forth.

Because of the way they emerge, the mean of the z-scores is always 0 and standard deviation is 1 (up to a tiny numerical error, as usual; see Section 5.6.2):

np.mean(heights_std), np.std(heights_std)
## (1.8920872660373198e-15, 1.0)

Even though the original heights were measured in centimetres, the z-scores are unitless (centimetres/centimetres = 1).

Important

Standardisation enables the comparison of measurements on different scales (think: height in centimetres vs weight in kilograms or apples vs oranges).

It makes most sense for bell-shaped distributions, in particular normally-distributed ones. In the next chapter we will note that, due to the 2σ rule for the normal family, we expect that 95% of observations should have z-scores between -2 and 2. Further, z-scores less than -3 and greater than 3 are highly unlikely.

Exercise 5.9

We have a patient whose height z-score is 1 and weight z-score is -1. How to interpret this information?

Exercise 5.10

How about a patient whose weight z-score is 0 but BMI z-score is 2?

On a side note, sometimes we might be interested in performing some form of robust standardisation (e.g., for skewed data or those that feature some outliers). In such a case, we can replace the mean with the median and the standard deviation with the IQR.

5.3.2.2. Min-Max Scaling and Clipping

A less frequently but still noteworthy transformation is called min-max scaling and involves subtracting the minimum and then dividing by the range, \((x-x_{(1)})/(x_{(n)}-x_{(1)})\).

x = np.array([-1.5, 0.5, 3.5, -1.33, 0.25, 0.8])
(x - np.min(x))/(np.max(x)-np.min(x))
## array([0.   , 0.4  , 1.   , 0.034, 0.35 , 0.46 ])

Here, the smallest value is mapped to 0 and the largest becomes equal to 1. Let us stress that, in this context, 0.5 does not represent the value which is equal to the mean (unless we are incredibly lucky).

Also, clipping can be used to replace all values less than 0 with 0 and those greater than 1 with 1.

np.clip(x, 0, 1)
## array([0.  , 0.5 , 1.  , 0.  , 0.25, 0.8 ])

The function is of course flexible; another popular choice is clipping to [-1, 1]. This can also be implemented manually by means of the vectorised pairwise minimum and maximum functions.

np.minimum(1, np.maximum(0, x))
## array([0.  , 0.5 , 1.  , 0.  , 0.25, 0.8 ])

5.3.2.3. Normalisation (\(l_2\); Dividing by Magnitude)

Normalisation is the scaling of a given vector so that it is of unit length. Usually, by length we mean the square root of the sum of squares, i.e., the Euclidean (\(l_2\)) norm also known as the magnitude:

\[ \| (x_1,\dots,x_n) \| = \sqrt{ x_1^2 + x_2^2 + \dots + x_n^2 } = \sqrt{ \sum_{i=1}^n x_i^2 }, \]

Its special case for n=2 we know well from high school: the length of a vector \((a,b)\) is \(\sqrt{a^2+b^2}\), e.g., \(\|(1, 2)\| = \sqrt{5} \simeq 2.236\). Also, it is good to program our brains so that when next time we see \(\| \boldsymbol{x} \|^2\), we immediately think of the sum of squares.

We can thus write:

x = np.array([1, 5, -4, 2, 2.5])
x/np.sqrt(np.sum(x**2))  # x divided by the Euclidean norm of x
## array([ 0.13834289,  0.69171446, -0.55337157,  0.27668579,  0.34585723])

to mean the normalised vector

\[ \frac{\boldsymbol{x}}{\| \boldsymbol{x} \|} = \left( \frac{x_1}{\| \boldsymbol{x} \|}, \frac{x_2}{\| \boldsymbol{x} \|}, \dots, \frac{x_n}{\| \boldsymbol{x} \|} \right). \]
Exercise 5.11

Normalisation is similar to standardisation if data are already centred (when mean was subtracted). Show that we can obtain one from the other via the scaling by \(\sqrt{n}\).

Important

A common confusion is that normalisation is supposed to make data more “normally” distributed. This is not the case, as we only scale (stretch or shrink) the observations here.

5.3.2.4. Normalisation (\(l_1\); Dividing by Sum)

At other times, we might be interested in considering the Manhattan (\(l_1\)) norm,

\[ \| (x_1,\dots,x_n) \|_1 = |x_1| + |x_2| + \dots + |x_n| = \sum_{i=1}^n |x_i|, \]

being the sum of absolute values.

x / np.sum(np.abs(x))
## array([ 0.06896552,  0.34482759, -0.27586207,  0.13793103,  0.17241379])

\(l_1\) normalisation is frequently applied on vectors of nonnegative values, whose normalised versions can be interpreted as probabilities or proportions: values between 0 and 1 which sum to 1 (or, equivalently, 100%).

Example 5.12

Given some binned data:

c, b = np.histogram(heights, [-np.inf, 150, 160, 170, np.inf])
print(c)  # counts
## [ 306 1776 1773  366]

We can convert the counts to empirical probabilities:

p = c/np.sum(c)   # np.abs is not needed here
print(p)
## [0.07249467 0.42075338 0.42004264 0.08670931]

We did not apply numpy.abs, because the values were already nonnegative.

5.3.3. Vector-Vector Case

So far we have been applying `*`, `+`, etc., on vectors and scalars only. All arithmetic operators can also be applied on two vectors of equal lengths. In such a case, they will act elementwisely: taking each element from the first operand and combining it with the corresponding element from the second argument:

np.array([2, 3, 4, 5]) * np.array([10, 100, 1000, 10000])
## array([   20,   300,  4000, 50000])

We see that the first element in the left operand (2) was multiplied by the first element in the right operand (10). Then, we computed 3*100 (the second elements), and so forth.

Such a behaviour of the binary operators is inspired by the usual convention in vector algebra where applying \(+\) (or \(-\)) on \(\boldsymbol{x}=(x_1,\dots,x_n)\) and \(\boldsymbol{y}=(y_1,\dots,y_n)\) means exactly:

\[ \boldsymbol{x}+\boldsymbol{y} = (x_1+y_1, x_2+y_2, \dots, x_n+y_n). \]

Using other operators this way (elementwisely) is less standard in mathematics (for instance multiplication might denote the dot product), but in numpy it is really convenient.

Example 5.13

Let us compute the value of the expression \(h = - (p_1\, \log p_1 + \dots + p_n\, \log p_n)\), i.e., \(h=-\sum_{i=1}^n p_i\, \log p_i\) (the so-called entropy):

p = np.array([0.1, 0.3, 0.25, 0.15, 0.12, 0.08])  # example vector
-np.sum(p*np.log(p))
## 1.6790818544987114

The above involves the use of a unary vectorised minus (change sign), an aggregation function (sum), a vectorised mathematical function (log), and an elementwise multiplication of two vectors of the same lengths.

Example 5.14

Let us assume that – for whatever the reason – we would like to plot two mathematical functions, the sine, \(f(x)= \sin x\), and a polynomial of degree 7, \(g(x) = x - x^3/6 + x^5/120 - x^7/5040\) for \(x\) in the interval \([-\pi, 3\pi/2]\).

To do this, we can probe the values of \(f\) and \(g\) at sufficiently many points using the vectorised operations discussed so far and then use the matplotlib.pyplot.plot function to draw what we see in Figure 5.7.

x = np.linspace(-np.pi, 1.5*np.pi, 1001)  # many points in the said interval
yf = np.sin(x)
yg = x - x**3/6 + x**5/120 - x**7/5040
plt.plot(x, yf, 'k-', label="f(x)")  # black solid line
plt.plot(x, yg, 'r:', label="g(x)")  # red dotted line
plt.legend()
plt.show()
../_images/vectorised-easy-13.png

Figure 5.7 With vectorised functions, it is easy to generate plots like this one; we used different line styles so that the plot is readable also when printed in black and white

Decreasing the number of points in x will reveal that the plotting function merely draws a series of straight-line segments. Computer graphics is essentially discrete.

Exercise 5.15

Based on the nhanes_adult_female_height_2020 and nhanes_adult_female_weight_2020 datasets (discussed in the part on histograms), using a single line of code, compute the vector of BMIs of all persons. It is assumed that the i-th elements therein both refer to the same person.

5.4. Indexing Vectors

Recall from Section 3.2.1 and Section 3.2.2 that sequential objects in Python (lists, tuples, strings, ranges) support indexing using scalars and slices:

x = [10, 20, 30, 40, 50]
x[1]  # scalar index – extract
## 20
x[1:2]  # slice index – subset
## [20]

numpy vectors support two additional indexing schemes: using integer and boolean sequences.

5.4.1. Integer Indexing

Indexing with a single integer extracts a particular element:

x = np.array([10, 20, 30, 40, 50])
x[0]   # first
## 10
x[1]   # second
## 20
x[-1]  # last
## 50

We can also use lists of vectors of integer indexes, which return a subvector with elements at the specified indices:

x[ [0] ]
## array([10])
x[ [0, 1, -1, 0, 1, 0, 0] ]
## array([10, 20, 50, 10, 20, 10, 10])
x[ [] ]
## array([], dtype=int64)

We added some spaces between the square brackets, but only because, for example, x[[0]] might seem slightly more enigmatic. (What are these double square brackets? Nah, it is a list inside the index operator.)

5.4.2. Logical Indexing

Subsetting using a logical vector of the same length as the indexed vector is possible too:

x[ [True, False, True, True, False] ]
## array([10, 30, 40])

This returned the 1st, 3rd, and 4th element (select 1st, omit 2nd, select 3rd, select 4th, omit 5th).

This is particularly useful as a data filtering technique. Knowing that the relational operators `<`, `<=`, `==`, `!=`, `>=`, and `>` on vectors are performed elementwisely (just like `+`, `*`, etc.), for instance:

x >= 30
## array([False, False,  True,  True,  True])

we can write:

x[ x >= 30 ]
## array([30, 40, 50])

to mean “select the elements in x which are not less than 30”.

Of course, the indexed vector and the vector specifying the filter do not3 have to be the same:

y = (x/10) % 2  # whatever
y  # equal to 0 if a number is a multiply of 10 times an even number
## array([1., 0., 1., 0., 1.])
x[ y == 0 ]
## array([20, 40])

Important

If we would like to combine many logical vectors, sadly we cannot use the and, or, and not operators, because they are not vectorised (this is a limitation of our language per se).

In numpy, we use the `&`, `|`, and `~` operators instead. Unfortunately, they have a lower order of precedence than `<`, `<=`, `==`, etc. Therefore, the bracketing of the comparisons is obligatory.

For example:

x[ (20 <= x) & (x <= 40) ]  # check what happens if we skip the brackets
## array([20, 30, 40])

means “elements in x between 20 and 40” (greater than or equal to 20 and less than or equal to 40).

Exercise 5.16

Compute the BMIs only of the women whose height is between 150 and 170 cm.

5.4.3. Slicing

Just as with ordinary lists, slicing with “:” can be used to fetch the elements at indexes in a given range like from:to or from:to:by.

x[::-1]
## array([50, 40, 30, 20, 10])
x[3:]
## array([40, 50])
x[1:4]
## array([20, 30, 40])

Important

For efficiency reasons, slicing returns a view on existing data. It does not have to make an independent copy of the subsetted elements, because sliced ranges are regular by definition.

In other words, both x and its sliced version share the same memory. This is important when we apply operations which modify a given vector in place, such as the sort method.

y = np.array([6, 4, 8, 5, 1, 3, 2, 9, 7])
y[::2] *= 10  # modifies parts of y in place
y  # has changed
## array([60,  4, 80,  5, 10,  3, 20,  9, 70])

This multiplied every second element in y by 10 (i.e., [6, 8, 1, 2, 7]). On the other hand, indexing with an integer or logical vector always returns a copy.

y[ [1, 3, 5, 7] ] *= 10  # modifies a new object and then forgets about it
y  # has not changed since the last modification
## array([60, 40, 80, 50, 10, 30, 20, 90, 70])

This did not modify the original vector, because we have applied `*=` on a different object, which has not even been memorised after that operation took place.

5.5. Other Operations

5.5.1. Cumulative Sums and Iterated Differences

Recall that the `+` operator acts on two vectors elementwisely and that the numpy.sum function aggregates all values into a single one. We have a similar function, but vectorised in a slightly different fashion. Namely, numpy.cumsum returns the vector of cumulative sums:

np.cumsum([5, 3, -4, 1, 1, 3])
## array([5, 8, 4, 5, 6, 9])

This gave, in this order: the first element, the sum of first two elements, the sum of first three elements, …, the sum of all elements.

Further, iterative differences are a somewhat inverse operation:

np.diff([5, 8, 4, 5, 6, 9])
## array([ 3, -4,  1,  1,  3])

returned the difference between the 2nd and 1st element, then the difference between the 3rd and the 2nd, and so forth. The resulting vector is 1 element shorter than the input one.

We often make use of cumulative sums and iterated differences when processing time series, e.g., stock exchange data (e.g., by how much the price changed since the previous day?; Section 16.3.1) or determining cumulative distribution functions (Section 4.3.8).

5.5.2. Sorting

The numpy.sort function returns a sorted copy of a given vector, i.e., determines the order statistics.

x = np.array([40, 10, 20, 40, 40, 30, 20, 40, 50, 10, 10, 70, 30, 40, 30])
np.sort(x)
## array([10, 10, 10, 20, 20, 30, 30, 30, 40, 40, 40, 40, 40, 50, 70])

The sort method (as in: x.sort()), on the other hand, sorts the vector in place (and returns nothing).

Exercise 5.17

Readers interested more in chaos than in bringing order should give numpy.random.permutation a try. This function shuffles the elements in a given vector.

5.5.3. Dealing with Tied Observations

Some statistical methods, especially for continuous data4, assume that all observations in a vector are unique, i.e., there are no ties. In real life, however, a situation where one value occurs many times can happen. For instance, two marathoners might finish their runs in exactly the same time, data can be rounded up to a certain number of fractional digits, or observations are inherently integer. We should thus be able to detect it.

numpy.unique is a workhorse for dealing with tied observations.

x = np.array([40, 10, 20, 40, 40, 30, 20, 40, 50, 10, 10, 70, 30, 40, 30])
np.unique(x)
## array([10, 20, 30, 40, 50, 70])

Returns a sorted5 version of a given vector with duplicates removed.

We can also get the corresponding counts:

np.unique(x, return_counts=True)  # returns a tuple of length 2
## (array([10, 20, 30, 40, 50, 70]), array([3, 2, 3, 5, 1, 1]))

This can be used to determine if all the values in a vector are unique:

np.all(np.unique(x, return_counts=True)[1] == 1)
## False
Exercise 5.18

Play with the return_index argument to numpy.unique that allows to pinpoint the indexes of the first occurrences of each unique value.

5.6. Determining the Ordering Permutation and Ranking

numpy.argsort returns a sequence of indexes that lead to an ordered version of a given vector (i.e., an ordering permutation).

x = np.array([40, 10, 20, 40, 40, 30, 20, 40, 50, 10, 10, 70, 30, 40, 30])
np.argsort(x)
## array([ 1,  9, 10,  2,  6,  5, 12, 14,  0,  3,  4,  7, 13,  8, 11])

Which means that the smallest element is at index 1, then the 2nd smallest is at index 9, 3rd smallest at index 10, etc. Therefore:

x[np.argsort(x)]
## array([10, 10, 10, 20, 20, 30, 30, 30, 40, 40, 40, 40, 40, 50, 70])

is equivalent to numpy.sort(x).

Note

(**) If there are tied observations in a vector x, numpy.argsort(x, kind="stable") will use a stable sorting algorithm (timsort, a variant of mergesort), which guarantees that the ordering permutation is unique: tied elements are placed in the order of appearance.

Next, scipy.stats.rankdata returns a vector of ranks.

x = np.array([10, 40, 50, 20, 30])
scipy.stats.rankdata(x)
## array([1., 4., 5., 2., 3.])

Element 10 is the smallest (“the winner”, say, the quickest racer), hence it has rank 1. Element 40 is the 4th on the podium, hence its rank is 4. And so on.

There are many methods in nonparametric statistics (those that do not make any too particular assumptions about the underlying data distribution) that are based on ranks, e.g., the Spearman correlation coefficient which we cover in Section 9.1.4.

Exercise 5.19

Consult the manual page of scipy.stats.rankdata and test various methods for dealing with ties.

Note

(**) Readers with some background in discrete mathematics will be interested in the fact that calling numpy.argsort on a vector representing a permutation of elements in fact generates its inverse. In particular, np.argsort(np.argsort(x, kind="stable"))+1 is equivalent to scipy.stats.rankdata(x, method="ordinal").

5.6.1. Searching for Certain Indexes (Argmin, Argmax)

numpy.argmin and numpy.argmax return the index at which we can find the smallest and the largest observation in a given vector.

x = np.array([10, 30, 50, 40, 20, 50])
np.argmin(x), np.argmax(x)
## (0, 2)

If there are tied observations, the smallest index is returned.

Using mathematical notation, we can denote the former with:

\[ i = \mathrm{arg}\min_j x_j \]

and read it as: let \(i\) be the index of the smallest element in the sequence. Alternatively, it is the argument of the minimum, whenever:

\[ x_i = \min_j x_j, \]

i.e., the \(i\)-th element is the smallest.

We can use numpy.flatnonzero to fetch the indexes where a logical vector has elements equal to True (in Section 11.1.3 we mention that a value equal to zero is treated as the logical False, and as True in all other cases). For example:

np.flatnonzero(x == np.max(x))
## array([2, 5])

It is a version of numpy.argmax that lets us decide what we would like to do with the tied maxima (there are two).

Exercise 5.20

Let x be a vector with possible ties. Write an expression that returns a randomly chosen index pinpointing one of the sample maxima.

5.6.2. Dealing with Round-off and Measurement Errors

Mathematics tells us (the easy proof is left as an exercise to the reader) that a centred version of a given vector \(\boldsymbol{x}\), \(\boldsymbol{y} = \boldsymbol{x}-\bar{x}\), has arithmetic mean 0, i.e., \(\bar{y}=0\).

Of course, it is also true on a computer. Or is it?

heights_centred = (heights - np.mean(heights))
np.mean(heights_centred) == 0
## False

Actually, the average is:

np.mean(heights_centred)
## 1.3359078775153175e-14

which is almost zero (0.0000000000000134), but not exactly zero (it is a zero for an engineer, not a mathematician). We have seen a similar result when performing standardisation (which involves centring) in Section 5.3.2.

Important

All floating-point operations on a computer6 (not only in Python) are performed with finite precision of 15–17 decimal digits.

We know it from school – for example, some fractions cannot be represented as decimals. Therefore, when asked to add or multiply them, we will always have to apply some rounding that ultimately leads to precision loss; e.g., we know that \(1/3 + 1/3 + 1/3 = 1\), but using a decimal representation with 1 fractional digit, we get \(0.3+0.3+0.3=0.9\); with 2 digits we obtain \(0.33+0.33+0.33=0.99\), and so forth. This sum will never be equal exactly to 1 when using a finite precision.

Moreover, errors induced in one operation will propagate onto further ones. Most often they cancel out, but in extreme cases they can lead to undesirable consequences (like for some model matrices in linear regression, see Section 9.2.9).

There is no reason to panic, though. The rule to remember is:

Important

As the floating-point values are precise up to a few decimal digits, we should refrain from comparing them using the `==` operator, because it tests exact equality.

When a comparison is needed, we need to take some error margin into account. Ideally, instead of testing x == y, we should either inspect the absolute error

\[ |x-y| \le \varepsilon \]

or, assuming \(y\neq 0\), the relative error

\[ \frac{|x-y|}{|y|} \le \varepsilon, \]

where \(\varepsilon\) is some small error margin.

For instance, numpy.allclose(x, y) checks (by default) if for all corresponding elements in both vectors it holds numpy.abs(x-y) <= 1e-8 + 1e-5*numpy.abs(y), which is a combination of both tests.

np.allclose(np.mean(heights_centred), 0)
## True

To avoid sorrow surprises, even the testing of inequalities like x >= 0 should rather be performed as, say, x >= 1e-8.

Note

Our data are often imprecise by nature. When asked about people’s heights, rarely will they give a non-integer answer (assuming they know how tall they are and are not lying about it, but it is a different story) – we get data rounded to 0 decimal digits. Actually, in our dataset the precision is a bit higher:

heights[:6]  # preview
## array([160.2, 152.7, 161.2, 157.4, 154.6, 144.7])

But still, there is an inherent observational error. Therefore, even if, for example, the mean thereof was computed exactly, the fact that the inputs themselves are not necessarily ideal makes the estimate approximate as well. We can only hope that these errors will more or less cancel out in the computations.

Exercise 5.21

Compute the BMIs of all females in the NHANES study. Determine their arithmetic mean. Compare it to the arithmetic mean computed for BMIs rounded to 1, 2, 3, 4, etc., decimal digits.

Note

(*) Another problem is related to the fact that floats on a computer use the binary base, not the decimal one. Therefore, some fractional numbers that we believe to be representable exactly, in fact require an infinite number of bits and thus are subject to rounding.

0.1 + 0.1 + 0.1 == 0.3  # obviously
## False

This is because 0.1, 0.1+0.1+0.1, and 0.3 is actually represented as, respectively:

print(f"{0.1:.19f}, {0.1+0.1+0.1:.19f}, and {0.3:.19f}.")
## 0.1000000000000000056, 0.3000000000000000444, and 0.2999999999999999889.

A good introductory reference to the topic of numerical inaccuracies is [Gol91], see also [Hig02, Knu97] for a more comprehensive treatment of numerical analysis.

5.6.3. Vectorising Scalar Operations with List Comprehensions

List comprehensions of the form [ expression for name in iterable ] are part of base Python. They allow us to create lists based on transformed versions of individual elements in a given iterable object. Hence, they might work in cases where a task at hand cannot be solved by means of vectorised numpy functions.

For example, here is a way to generate squares of 10 first natural numbers:

[ i**2 for i in range(1, 11) ]
## [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

The result can be passed to numpy.array to convert it to a vector.

Further, given an example vector:

x = np.round(np.random.rand(10)*2-1, 2)
x
## array([ 0.86, -0.37, -0.63, -0.59,  0.14,  0.19,  0.93,  0.31,  0.5 ,
##         0.31])

If we wish to filter our all elements that are not greater than 0, we can write:

[ e for e in x if e > 0 ]
## [0.86, 0.14, 0.19, 0.93, 0.31, 0.5, 0.31]

We can also use the ternary operator of the form x_true if cond else x_false to return either x_true or x_false depending on the truth value of cond.

e = -2
e**0.5 if e >= 0 else (-e)**0.5
## 1.4142135623730951

Combined with a list comprehension, we can write, for instance:

[ round(e**0.5 if e >= 0 else (-e)**0.5, 2) for e in x ]
## [0.93, 0.61, 0.79, 0.77, 0.37, 0.44, 0.96, 0.56, 0.71, 0.56]

This gave the square root of absolute values.

There is also a tool which vectorises a scalar function so that it can be used on numpy vectors:

def clip01(x):
    """clip to the unit inverval"""
    if x < 0:    return 0
    elif x > 1:  return 1
    else:        return x

clip01s = np.vectorize(clip01)  # returns a function object
clip01s([0.3, -1.2, 0.7, 4, 9])
## array([0.3, 0. , 0.7, 1. , 1. ])

In the above cases, it is much better (faster, more readable code) to rely on vectorised numpy functions. However, in cases where the corresponding operations are unavailable (e.g., string processing, reading many files), list comprehensions provide a reasonable replacement therefor.

Exercise 5.22

Write equivalent versions of the above expressions using vectorised numpy functions.

Exercise 5.23

Write equivalent versions of the above expressions using base Python lists, the for loop and the list.append method (start from an empty list which will store the result).

5.7. Exercises

Exercise 5.24

What are some benefits of using a numpy vector over an ordinary Python list? What are the drawbacks?

Exercise 5.25

How can we interpret the possibly different values of the arithmetic mean, median, standard deviation, interquartile range, and skewness, when comparing between heights of men and women?

Exercise 5.26

There is something scientific and magical about numbers that makes us approach them with some kind of respect. However, taking into account that there are many possible data aggregates, there is a risk that a party may be cherry-picking – reporting the one that portrays the analysed entity in a good or bad light. For instance, reporting the mean instead of the median or vice versa. Is there anything that can be done about it?

Exercise 5.27

Although – mathematically speaking – all measures can be computed on all data, it does not mean that it always makes sense to do so. For instance, some distributions will yield skewness of 0, but we should not automatically assume that they are nicely symmetric and bell-shaped (e.g., this can be a bimodal distribution). It is thus best to visualise your data first. Give some examples of datasets and measures where we should be critical about the obtained results.

Exercise 5.28

Give some examples where simple data preprocessing can drastically change the values of chosen sample aggregates.

Exercise 5.29

Give the mathematical definitions, use cases, and interpretations of standardisation, normalisation, and min-max scaling.

Exercise 5.30

How are numpy.log and numpy.exp related to each other? How about numpy.log vs numpy.log10, numpy.cumsum vs numpy.diff, numpy.min vs numpy.argmin, numpy.sort vs numpy.argsort, and scipy.stats.rankdata vs numpy.argsort?

Exercise 5.31

What is the difference between numpy.trunc, numpy.floor, numpy.ceil, and numpy.round?

Exercise 5.32

What happens when we apply `+` on two vectors of different lengths?

Exercise 5.33

List the four ways to index a vector.

Exercise 5.34

What is wrong with the expression x[ x >= 0 and x <= 1 ], where x is a numeric vector? How about x[ x >= 0 & x <= 1 ]?

Exercise 5.35

What does it mean that slicing returns a view on existing data?

Exercise 5.36

(**) Reflect on the famous saying not everything that can be counted counts, and not everything that counts can be counted.

Exercise 5.37

(**) Being a data scientist can be a frustrating job, especially when you care for some causes. Reflect on: some things that count can be counted, but we will not count them, because there’s no budget for it.

Exercise 5.38

(**) Being a data scientist can be a frustrating job, especially when you care for the truth. Reflect on: some things that count can be counted, but we will not count them, because some people might be offended or find it unpleasant.

Exercise 5.39

(**) Assume you were to establish your own nation on some island and become the benevolent dictator thereof. How would you measure if your people are happy or not? Let us say that you need to come up with 3 quantitative measures (key performance indicators). What would happen if your policy making was solely focused on optimising those KPIs? How about the same problem but with regard to your company and employees? Think what can go wrong in other areas of life.


1

(**) Based on the so-called uncorrected for bias version of the sample variance. We prefer it here for didactical reasons (simplicity, better interpretability). Plus, it is the default one in numpy. Passing ddof=1 (delta degrees of freedom) to numpy.std will apply division by \(n-1\) instead of by \(n\). When used as an estimator of the distribution’s standard deviation, the latter has slightly better statistical properties (which we normally explore in a course on mathematical statistics, which this one is not). Interestingly, the std methods in the pandas package have ddof=1 by default, therefore we might be interested in setting ddof=0 therein.

2

The \(1.5 \mathrm{IQR}\) rule is the most popular in the statistical literature, but some plotting software may use different whisker ranges. See Section 15.4.1 for further discussion.

3

(*) This is because the indexer is computed first, and its value is passed as an argument to the index operator. Python neither is a symbolic programming language, nor does it feature any nonstandard evaluation techniques. In other words, [...] does not care how the indexer was obtained.

4

Where, theoretically, the probability of obtaining a tie is equal to 0.

5

Later we will mention pandas.unique which lists the values in the order of appearance.

6

Double precision float64 format as defined by the IEEE Standard for Floating-Point Arithmetic (IEEE 754).