# 5. Processing Unidimensional Data

The open-access textbookMinimalist Data Wrangling with Pythonby Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF; a printed version can be ordered from Amazon: AU CA DE ES FR IT JP NL PL SE UK US). It is a non-profit project. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated.

It is extremely rare for our datasets to bring interesting and valid insights out of the box. The ones we are using for illustrational purposes in the first part of our book have already been curated. After all, this is an introductory course, and we need to build up the necessary skills and not overwhelm the kind reader with too much information at the same time. We learn simple things first, learn them well, and then we move to more complex matters with a healthy level of confidence.

In real life, various *data cleansing* and *feature engineering*
techniques will need to be performed on data.
Most of them are based on the simple operations on vectors
that we cover in this chapter:

summarising data (for example, computing the median or sum),

transforming values (applying mathematical operations on each element, such as subtracting a scalar or taking the natural logarithm),

filtering (selecting or removing observations that meet specific criteria, e.g., those that are larger than the arithmetic mean ± 5 standard deviations).

Important

The same operations we are going to be applying on individual data frame columns in Chapter 10.

## 5.1. Aggregating Numeric Data

Recall that in the previous chapter we discussed
the `heights`

and `income`

datasets whose histograms
we gave in Figure 4.2
and Figure 4.3, respectively.
Such graphical summaries are based on binned data. They
provide us with snapshots of how much probability mass
is allocated in different parts of the data domain.

Instead of dealing with large datasets, we obtained a few counts. The process of binning and its textual or visual depictions is valuable in determining whether the distribution is unimodal or multimodal, skewed or symmetric around some point, what range of values contains most of the observations, and how small or large extreme values are.

Too much information may sometimes be overwhelming. Besides,
revealing it might not be a clever idea for privacy or confidentiality
reasons1.
Consequently, we might be interested in even more synthetic descriptions –
data aggregates which reduce the whole dataset into a *single*
number reflecting one of its many characteristics.
They can provide us with a kind of bird’s eye view on some aspect of it.
We refer to such processes as data *aggregation*;
see [26, 38].

In this part, we discuss a few noteworthy *measures* of:

*location*; e.g., central tendency measures such as mean and median;*dispersion*; e.g., standard deviation and interquartile range;distribution

*shape*; e.g., skewness.

We also introduce *box and whisker plots*.

### 5.1.1. Measures of Location

#### 5.1.1.1. Arithmetic Mean and Median

Two main measures of *central tendency* are:

*the arithmetic mean*(sometimes for simplicity called the mean or average), defined as the sum of all observations divided by the sample size:\[ \bar{x} = \frac{(x_1+x_2+\dots+x_n)}{n} = \frac{1}{n} \sum_{i=1}^n x_i, \]*the median*, being the middle value in a sorted version of the sample if its length is odd or the arithmetic mean of the two middle values otherwise:\[\begin{split} m = \left\{ \begin{array}{ll} x_{(n+1)/2} & \text{if }n\text{ is odd},\\ \frac{x_{(n/2)}+x_{(n/2+1)}}{2} & \text{if }n\text{ is even}. \end{array} \right. \end{split}\]

They can be computed using the **numpy.mean** and
**numpy.median** functions.

```
heights = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_height_2020.txt")
np.mean(heights), np.median(heights)
## (160.13679222932953, 160.1)
income = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/uk_income_simulated_2020.txt")
np.mean(income), np.median(income)
## (35779.994, 30042.0)
```

Let us underline what follows:

for symmetric distributions, the arithmetic mean and the median are expected to be more or less equal,

for skewed distributions, the arithmetic mean will be biased towards the heavier tail.

Get the arithmetic mean
and median for the `37_pzu_warsaw_marathon_mins`

dataset
mentioned in Chapter 4.

(*) Write a function
that computes the median without the use of **numpy.median**
(based on its mathematical definition and **numpy.sort**).

Note

(*) Technically, the arithmetic mean can also be computed
using the **mean** *method* for the `numpy.ndarray`

class –
it will sometimes be the case
that we have many ways to perform the same operation.
We can even “implement” it manually using the **sum** function.
Thus, all the following expressions are equivalent:

```
print(
np.mean(income),
income.mean(),
np.sum(income)/len(income),
income.sum()/income.shape[0]
)
## 35779.994 35779.994 35779.994 35779.994
```

On the other hand, there exists the **numpy.median** function
but, unfortunately, the **median** method for vectors is not
available. This is why we prefer sticking with functions.

#### 5.1.1.2. Sensitive to Outliers vs Robust

The arithmetic mean is strongly influenced by very large
or very small observations (which in some contexts we refer
to as *outliers*). For instance, assume that we are inviting
a billionaire to the `income`

dataset:

```
income2 = np.append(income, [1_000_000_000])
print(np.mean(income), np.mean(income2))
## 35779.994 1034745.2487512487
```

Comparing this new result to the previous one, oh we all feel much richer now, right? In fact, the arithmetic mean reflects the income each of us would get if all the wealth were gathered inside a single Santa Claus’s (or Robin Hood’s) sack and then distributed equally amongst all of us. A noble idea provided that everyone contributes equally to the society, which unfortunately is not the case.

On the other hand, the median is the value such that 50% of the observations are less than or equal to it and 50% of the remaining ones are not less than it. Hence, it is completely insensitive to most of the data points – on both the left and the right side of the distribution:

```
print(np.median(income), np.median(income2))
## 30042.0 30076.0
```

Because of this, we cannot say that one measure is better than the other.
Certainly, for symmetrical distributions with no outliers
(e.g., `heights`

), the mean
will be better as it uses *all* data (and its efficiency
can be proven for certain statistical models).
For skewed distributions (e.g., `income`

),
the median has a nice interpretation, as it gives the value in the middle
of the ordered sample. Let us still remember that these data
summaries allow us to look at a single data aspect only,
and there can be many different, valid perspectives.
The reality is complex.

#### 5.1.1.3. Sample Quantiles

Quantiles generalise the notions of the sample median and of the inverse of the empirical cumulative distribution function (Section 4.3.8). They provide us with the value that is not exceeded by the elements in a given sample with a predefined probability.

Before proceeding with a formal definition, which is quite technical, let us point out that for larger sample sizes, we have the following rule of thumb.

Important

For any \(p\) between 0 and 1, the \(p\)-quantile, denoted \(q_p\), is a value dividing the sample in such a way that approximately \(100p\%\) of observations are not greater than \(q_p\), and the remaining ca. \(100(1-p)\%\) are not less than \(q_p\).

Quantiles appear under many different names, but they
all refer to the same concept.
In particular, we can speak about the \(100p\)-th *percentiles*,
e.g., the 0.5-quantile is the same as the 50th percentile.

Furthermore:

0-quantile (\(q_0\)) – the minimum (also:

**numpy.min**),0.25-quantile (\(q_{0.25}\)) – the 1st quartile (denoted \(Q_1\)),

0.5-quantile (\(q_{0.5}\)) – the 2nd quartile a.k.a. median (denoted \(m\) or \(Q_2\)),

0.75-quantile (\(q_{0.75}\)) – the 3rd quartile (denoted \(Q_3\)),

1.0-quantile (\(q_{1}\)) – the maximum (also:

**numpy.max**).

Here are the above five aggregates for our two datasets:

```
np.quantile(heights, [0, 0.25, 0.5, 0.75, 1])
## array([131.1, 155.3, 160.1, 164.8, 189.3])
np.quantile(income, [0, 0.25, 0.5, 0.75, 1])
## array([ 5750. , 20669.75, 30042. , 44123.75, 199969. ])
```

What is the income bracket for 95% of the *most typical* UK
taxpayers? In other words, determine the 2.5- and 97.5-percentiles.

Compute the *midrange* of `income`

and `heights`

,
being the arithmetic mean of the minimum and the maximum
(this measure is extremely sensitive to outliers).

Note

(*)
As we do not like the *approximately* part in the
“asymptotic definition” given above, in this course
we shall assume that for any \(p\in[0,1]\), the *p*-quantile is given by

where \(k=(n-1)p+1\) and \(\lfloor k\rfloor\) is the floor function, i.e., the greatest integer less than or equal to \(k\) (e.g., \(\lfloor 2.0\rfloor=\lfloor 2.001\rfloor=\lfloor 2.999\rfloor=2\), \(\lfloor 3.0\rfloor=\lfloor 3.999\rfloor=3\), \(\lfloor -3.0\rfloor=\lfloor -2.999\rfloor=\lfloor -2.001\rfloor=-3\), and \(\lfloor -2.0\rfloor=\lfloor -1.001\rfloor=-2\)).

\(q_p\) is a function that linearly interpolates between the points featuring the consecutive order statistics, \(((k - 1) / (n - 1), x_{(k)})\) for \(k=1,\dots,n\). For instance, for \(n=5\), we connect the points \((0, x_{(1)})\), \((0.25, x_{(2)})\), \((0.5, x_{(3)})\), \((0.75, x_{(4)})\), \((1, x_{(5)})\). For \(n=6\), we do the same for \((0, x_{(1)})\), \((0.2, x_{(2)})\), \((0.4, x_{(3)})\), \((0.6, x_{(4)})\), \((0.8, x_{(5)})\), \((1, x_{(6)})\). See Figure 5.1 for an illustration.

Notice that for \(p=0.5\) we get the median regardless of whether \(n\) is even or not.

Note

(**) There are many definitions
of quantiles across statistical software; see
the `method`

argument to **numpy.quantile**.
They were nicely summarised in [47]
as well as in the corresponding Wikipedia article.
They are all approximately equivalent for large sample sizes
(i.e., asymptotically), but the best practice is to be explicit about
which variant we are using in the computations when reporting
data analysis results.
Accordingly, in our case, we say that we are relying on
the type-7 quantiles as described in [47],
see also [39].

In fact, simply mentioning that our computations are done with
**numpy** version 1.xx (as specified in Section 1.4)
implicitly implies that the default method parameters are used everywhere,
unless otherwise stated.
In many contexts, that is good enough.

### 5.1.2. Measures of Dispersion

Measures of central tendency quantify the location of the
most *typical* value (whatever that means, and we already
know it is complicated). That of dispersion (spread), on the other hand,
will tell us how much *variability* is in our data.
After all, when we say that the height of a group of people is 160 cm
(on average) ± 14 cm (here, 2 standard deviations), the latter
piece of information is a valuable addition
(and is very different from the imaginary ± 4 cm case).

Some degree of variability might be good in certain contexts and bad in other ones. A bolt factory should keep the variance of the fasteners’ diameters as low as possible: this is how we define quality products (assuming that they all meet, on average, the required specification). On the other hand, too much diversity in human behaviour, where everyone feels that they are special, is not really sustainable (but lack thereof would be extremely boring), and so forth.

Let us explore the different ways in which we can quantify this data aspect.

#### 5.1.2.1. Standard Deviation (and Variance)

The standard deviation2, is the average distance to the arithmetic mean:

Computing the above with **numpy**:

```
np.std(heights), np.std(income)
## (7.062021850008261, 22888.77122437908)
```

The standard deviation quantifies the typical amount of spread around the arithmetic mean. It is overall adequate for making comparisons across different samples measuring similar things (e.g., heights of males vs of females, incomes in the UK vs in South Africa). However, without further assumptions, it is quite difficult to express the meaning of a particular value of \(s\) (e.g., the statement that the standard deviation of income is £22,900 is hard to interpret). This measure makes therefore most sense for data distributions that are symmetric around the mean.

Note

(*)
For bell-shaped data (more precisely: for normally-distributed
samples; see the next chapter) such as `heights`

,
we sometimes report \(\bar{x}\pm 2s\), because the theoretical expectancy
is that ca. 95% of data points fall into the \([\bar{x}-2s, \bar{x}+2s]\)
interval (the so-called 2σ rule).

Further, the *variance* is the square of the standard deviation, \(s^2\).
Mind that if data are expressed in centimetres,
then the variance is in centimetres *squared*, which
is not very intuitive. The standard deviation does not have this
drawback. Mathematicians find the square root annoying though
(for many reasons); that is why we will come across the \(s^2\)
every now and then too.

#### 5.1.2.2. Interquartile Range

The interquartile range (IQR) is another popular measure of dispersion. It is defined as the difference between the 3rd and the 1st quartile:

Computing the above is almost effortless:

```
np.quantile(heights, 0.75) - np.quantile(heights, 0.25)
## 9.5
np.quantile(income, 0.75) - np.quantile(income, 0.25)
## 23454.0
```

The IQR has an appealing interpretation:
it is the range comprised of the 50% *most typical*
values. It is a quite robust measure, as it ignores the 25% smallest
and 25% largest observations.
Standard deviation, on the other hand, is extremely sensitive to outliers.

Furthermore, by *range* (or support) we will mean a measure based on
extremal quantiles: it is the difference between the maximal
and minimal observation.

### 5.1.3. Measures of Shape

From a histogram, we can easily read whether a dataset is
symmetric or skewed. It turns out that we can easily quantify
such a characteristic. Namely, the *skewness* is given by:

For symmetric distributions, skewness is approximately zero. Positive and negative skewness indicates a heavier right and left tail, respectively.

For example, heights are an instance of an almost-symmetric distribution:

```
scipy.stats.skew(heights)
## 0.0811184528074054
```

Income, on the other hand, is right-skewed:

```
scipy.stats.skew(income)
## 1.9768735693998942
```

Now we have them expressed as a single number.

Note

(*) It is worth stressing that \(g>0\) does not imply that the sample mean is greater than the median. As an alternative measure of skewness, sometimes the practitioners use:

*Yule’s coefficient* is an example of a robust
skewness measure:

The computation thereof on our example datasets is left as an exercise.

Furthermore, *kurtosis* (or Fisher’s excess kurtosis,
compare **scipy.stats.kurtosis**) can be used to
describe whether an empirical distribution is heavy- or thin-tailed.

### 5.1.4. Box (and Whisker) Plots

A *box and whisker* plot (*box plot* for short)
depicts some noteworthy features of a data sample.

```
plt.subplot(2, 1, 1) # 2 rows, 1 column, 1st subplot
sns.boxplot(data=heights, orient="h", color="lightgray")
plt.yticks([0], ["heights"])
plt.subplot(2, 1, 2) # 2 rows, 1 column, 2nd subplot
sns.boxplot(data=income, orient="h", color="lightgray")
plt.yticks([0], ["income"])
plt.show()
```

Each box plot (compare Figure 5.2) consists of:

the box, which spans between the 1st and the 3rd quartile:

the median is clearly marked by a vertical bar inside the box;

the width of the box corresponds to the IQR;

the whiskers, which span3 between:

the smallest observation (the minimum) or \(Q_1-1.5 \mathrm{IQR}\) (the left side of the box minus 3/2 of its width), whichever is larger, and

the largest observation (the maximum) or \(Q_3+1.5 \mathrm{IQR}\) (the right side of the box plus 3/2 of its width), whichever is smaller.

Additionally, all observations that are less than \(Q_1-1.5 \mathrm{IQR}\) (if any) or greater than \(Q_3+1.5 \mathrm{IQR}\) (if any) are separately marked.

Note

We are used to referring to the individually marked points
as *outliers*. Still, it does not automatically mean there is anything
*anomalous* about them.
They are *atypical* in the sense that they are
considerably farther away from the box.
But this might as well indicate some problems in data quality
(e.g., when someone made a typo entering the data).
Actually, box plots are calibrated
(via the nicely round magic constant 1.5)
in such a way that we expect there
to be no or only few outliers if the data are normally distributed.
For skewed distributions, there will naturally be many outliers
on either side; see Section 15.4 for more details.

Box plots are based solely on sample quantiles.
Most of the statistical packages *do not* draw the arithmetic mean.
If they do, it is marked with a distinctive symbol.

Call
**matplotlib.pyplot.plot**`(`

**numpy.mean**`(..data..), 0, "wX")`

after **seaborn.boxplot** to mark the arithmetic mean with
a white cross.

Box plots are particularly beneficial for comparing data samples with each other (e.g., body measures of men and women separately), both in terms of the relative shift (location) as well as spread and skewness; see, e.g., Figure 12.1.

(*)
We may also sometimes be interested in a *violin plot*
(Figure 5.3),
which combines a box plot (although with no outliers marked)
and the so-called kernel density estimator
(which is a smoothened version of a histogram; see Section 15.4.2).

```
plt.subplot(2, 1, 1) # 2 rows, 1 column, 1st subplot
sns.violinplot(data=heights, orient="h", color="lightgray")
plt.yticks([0], ["heights"])
plt.subplot(2, 1, 2) # 2 rows, 1 column, 2nd subplot
sns.violinplot(data=income, orient="h", color="lightgray")
plt.yticks([0], ["income"])
plt.show()
```

### 5.1.5. Further Methods (*)

We said that the arithmetic mean is overly sensitive to extreme observations.
The sample median is an example of a *robust* aggregate — it ignores
all but 1–2 middle observations (we would say it has
a high *breakdown point*). Some measures of central tendency
that are in-between the mean-median extreme include:

*trimmed means*– the arithmetic mean of all the observations except several, say \(p\), the smallest and the greatest ones,*winsorised means*– the arithmetic mean with \(p\) smallest and \(p\) greatest observations replaced with the \((p+1)\)-th smallest the \((p+1)\)-th largest one.

The arithmetic mean is not the only mean of interest.
The two other famous means are the *geometric* and *harmonic* ones.
The former is more meaningful for averaging growth rates and speedups
whilst the latter can be used for computing the average speed from
speed measurements at sections of identical lengths; see also
the notion of the F-measure in Section 12.3.2.
Also, the quadratic mean is featured in the definition
of the standard deviation (it is the quadratic mean of the
distances to the mean).

As far as spread measures are concerned, the interquartile range (IQR) is a robust statistic. If necessary, the standard deviation might be replaced with:

mean absolute deviation from the mean: \(\frac{1}{n} \sum_{i=1}^n |x_i-\bar{x}|\),

mean absolute deviation from the median: \(\frac{1}{n} \sum_{i=1}^n |x_i-m|\),

median absolute deviation from the median: the median of \((|x_1-m|, |x_2-m|, \dots, |x_n-m|)\).

The *coefficient of variation*, being the standard deviation
divided by the arithmetic mean, is an example of a *relative* (or
normalised) spread measure. It can be appropriate for comparing data
on different scales, as it is unit-less
(think how standard deviation changes when you
convert between metres and centimetres).

The *Gini index*,
widely used in economics, can also serve as a measure of relative
dispersion, but assumes that all data points are nonnegative:

It is normalised so that it takes values in the unit interval. An index of 0 reflects the situation where all values in a sample are the same (0 variance; perfect equality). If there is a single entity in possession of all the “wealth”, and the remaining ones are 0, then the index is equal to 1.

For a more generic (algebraic) treatment of aggregation functions for unidimensional data; see, e.g., [9, 26, 27, 38]. Some measures might be better than others under certain (often strict) assumptions usually explored in a course on mathematical statistics, e.g., [35].

Overall, numerical aggregates should be used in cases where data are unimodal. For multimodal mixtures or data in groups, they should rather be applied to summarise each cluster/class separately; compare Chapter 12. Also, in Chapter 8, we will extend consider a few summaries for multidimensional data.

## 5.2. Vectorised Mathematical Functions

**numpy**, just like any other comprehensive numerical computing
package, library, or environment (e.g., R, GNU Octave, Scilab, and Julia),
defines many basic mathematical functions:

absolute value:

**numpy.abs**,square and square root:

**numpy.square**and**numpy.sqrt**, respectively,(natural) exponential function:

**numpy.exp**,logarithms:

**numpy.log**(the natural logarithm, i.e., base*e*),**numpy.log10**(base 10), etc.,trigonometric functions:

**numpy.cos**,**numpy.sin**,**numpy.tan**, etc., and their inverses:**numpy.arccos**, etc.rounding and truncating:

**numpy.round**,**numpy.floor**,**numpy.ceil**,**numpy.trunc**.

Each of these functions is *vectorised*: if we apply, say, \(f\),
on a vector like \(\boldsymbol{x}=(x_1,\dots,x_n)\), we
obtain a sequence of the same size
with all elements appropriately transformed:

We will frequently be using such operations for adjusting the data,
e.g., as in Figure 6.7, where
we discover that the *logarithm* of the UK incomes
has a bell-shaped distribution.

An example call to the vectorised version of the rounding function:

```
np.round([-3.249, -3.151, 2.49, 2.51, 3.49, 3.51], 1)
## array([-3.2, -3.2, 2.5, 2.5, 3.5, 3.5])
```

The input list has been automatically converted
to a **numpy** vector.

Important

Thanks to the vectorised functions, our code is not only more
readable, but also runs faster: we do not have to employ
the generally slow Python-level **while** or **for** loops
to traverse through each element in a given sequence.

Here are some significant properties of the natural logarithm and its inverse, the exponential function. By convention, Euler’s number \(e \simeq 2.718\), \(\log x = \log_e x\), and \(\exp(x)=e^x\).

\(\log 1 = 0\), \(\log e = 1\); note that logarithms are only defined for \(x > 0\): in the limit as \(x \to 0\), we have that \(\log x \to -\infty\),

\(\log x^y = y \log x\) and hence \(\log e^x = x\),

\(\log (xy) = \log x + \log y\) and thus \(\log (x/y) = \log x - \log y\),

\(e^0 = 1\), \(e^1=e\), and \(e^x\to 0\) as \(x \to -\infty\),

\(e^{\log x} = x\),

\(e^{x+y} = e^x e^y\) and so \(e^{x-y}=e^x / e^y\).

Both functions are strictly increasing. For \(x\ge 1\), the logarithm grows very slowly whereas the exponential function increases very rapidly; see Figure 5.4.

```
plt.subplot(1, 2, 1)
x = np.linspace(np.exp(-2), np.exp(3), 1001)
plt.plot(x, np.log(x), label="$y=\\log x$")
plt.legend()
plt.subplot(1, 2, 2)
x = np.linspace(-2, 3, 1001)
plt.plot(x, np.exp(x), label="$y=\\exp(x)$")
plt.legend()
plt.show()
```

Logarithms of different bases and non-natural exponential functions are also available. In particular, when drawing plots, we used the base-10 logarithmic scales on the axes. It holds \(\log_{10} x = \frac{\log x}{\log 10}\) and its inverse is \(10^x\). For example:

```
10.0**np.array([-1, 0, 1, 2]) # see below
## array([ 0.1, 1. , 10. , 100. ])
np.log10([-1, 0.01, 0.1, 1, 2, 5, 10, 100, 1000, 10000])
## array([ nan, -2. , -1. , 0. , 0.30103, 0.69897,
## 1. , 2. , 3. , 4. ])
##
## <string>:1: RuntimeWarning: invalid value encountered in log10
```

Take note of the warning and the not-a-number (`NaN`

) generated.

Check that when using the log-scale on the x-axis
(**plt.xscale**`("log")`

), the plot of the logarithm
(of any base) is a straight line. Similarly, the log-scale on the y-axis
(**plt.yscale**`("log")`

) makes the exponential function
linear.

Moving on, the trigonometric functions in **numpy**
accept angles in radians.
If \(x\) is the degree in angles, then to compute its
cosine, we should instead write \(\cos (x \pi /180)\);
see Figure 5.5.

```
x = np.linspace(-2*np.pi, 4*np.pi, 1001)
plt.plot(x, np.cos(x))
plt.xticks(
[-2*np.pi, -np.pi, 0, np.pi/2, np.pi, 3*np.pi/2, 2*np.pi, 4*np.pi],
["$-2\\pi$", "$-\\pi$", "$0$", "$\\pi/2$", "$\\pi$",
"$3\\pi/2$", "$2\\pi$", "$4\\pi$"]
)
plt.show()
```

Some identities worth memorising:

\(\sin x = \cos(\pi/2 - x)\),

\(\cos(-x) = \cos x\),

\(\cos^2 x + \sin^2 x = 1\), where \(\cos^2 x = (\cos x)^2\),

\(\cos(x+y) = \cos x \cos y - \sin x \sin y\),

\(\cos(x-y) = \cos x \cos y + \sin x \sin y\).

We will refer to them later.

## 5.3. Arithmetic Operators

We can apply the standard binary (two-argument) arithmetic operators
`**+**`, `**-**`, `*****`,
`**/**`, `******`, `**%**`, and
`**//**` on vectors too.
Below we discuss the possible cases of the operands’ lengths.

### 5.3.1. Vector-Scalar Case

Often, we will be referring to the binary operators in contexts where one operand is a vector and the other one is a single value (scalar), for example:

```
np.array([-2, -1, 0, 1, 2, 3])**2
## array([4, 1, 0, 1, 4, 9])
(np.array([-2, -1, 0, 1, 2, 3])+2)/5
## array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])
```

In such a case, each element in the vector is being operated upon (e.g., squared, divided by 5) and we get a vector of the same length in return. Hence, in this case, the operators behave just like the vectorised mathematical functions discussed above.

Mathematically, it is common to assume that the scalar multiplication and, less commonly, the addition are performed in this way.

We will also become used to writing, \((\boldsymbol{x}-t)/c\) which is of course equivalent to \((1/c) \boldsymbol{x} + (-t/c)\).

### 5.3.2. Application: Feature Scaling

Vector-scalar operations and aggregation functions are the basis
for the most commonly applied *feature scalers*:

standardisation,

normalisation,

min-max scaling and clipping.

They can increase the interpretability of data points by bringing them onto a common, unitless scale. They might also be essential when computing pairwise distances between multidimensional points; see Section 8.4.

All the above are linear transforms of the form \(\boldsymbol{y}=c \boldsymbol{x}+t\), which geometrically we can interpret as scaling (stretching or shrinking) and shifting (translating), see Figure 5.6 for an illustration.

Note

Let \(\boldsymbol{y}=c \boldsymbol{x}+t\) and let \(\bar{x}, \bar{y}, s_x, s_y\) denote the vectors’ arithmetic means and standard deviations. The following properties hold.

The arithmetic mean and all the quantiles (including, of course, the median), are equivariant with respect to translation and scaling; it holds, for instance, \(\bar{y} = c \bar{x} + t\).

The standard deviation, the interquartile range, and the range are invariant to translations and equivariant to scaling; e.g., \(s_y = c s_x\).

As a by-product, for the variance, we get… \(s_y^2 = c^2 s_x^2\).

#### 5.3.2.1. Standardisation and Z-Scores

A *standardised* version of a vector \(\boldsymbol{x}=(x_1,\dots,x_n)\)
consists in subtracting, from each element, the sample arithmetic mean
(which we call *centring*) and then dividing it by the standard deviation,
i.e., \(\boldsymbol{z}=(\boldsymbol{x}-\bar{x})/s\).

Thus, we transform each \(x_i\) to obtain:

Consider again the female heights dataset:

```
heights[-5:] # preview
## array([157. , 167.4, 159.6, 168.5, 147.8])
```

whose mean \(\bar{x}\) and standard deviation \(s\) are equal to:

```
np.mean(heights), np.std(heights)
## (160.13679222932953, 7.062021850008261)
```

With **numpy**, standardisation is as simple
as applying two aggregation functions and two arithmetic operations:

```
heights_std = (heights-np.mean(heights))/np.std(heights)
heights_std[-5:] # preview
## array([-0.44417764, 1.02848843, -0.07601113, 1.18425119, -1.74692071])
```

What we obtained is sometimes referred to as the *z-scores*.
They are nicely interpretable:

z-score of 0 corresponds to an observation equal to the sample mean (perfectly average);

z-score of 1 is obtained for a datum 1 standard deviation above the mean;

z-score of -2 means that it is a value 2 standard deviations below the mean;

and so forth.

Because of the way they emerge, the mean of the z-scores is always 0 and standard deviation is 1 (up to a tiny numerical error, as usual; see Section 5.5.6):

```
np.mean(heights_std), np.std(heights_std)
## (1.8920872660373198e-15, 1.0)
```

Even though the original `heights`

were measured in centimetres,
the z-scores are *unitless* (centimetres divided by centimetres).

Important

Standardisation
enables the comparison of measurements on different scales
(think: height in centimetres vs weight in kilograms
or apples vs oranges).
It makes the most sense for bell-shaped distributions,
in particular normally-distributed ones.
In the next chapter we will note that, due to the 2σ rule for the
normal family, we *expect* that 95% of observations should
have z-scores between -2 and 2.
Further, z-scores less than -3 and greater than 3 are highly unlikely.

We have a patient whose height z-score is 1 and weight z-score is -1. How can we interpret this information?

How about a patient whose weight z-score is 0 but BMI z-score is 2?

On a side note, sometimes we might be interested in performing some form
of *robust* standardisation (e.g., for skewed data or
those that feature some outliers). In such a case,
we can replace the mean with the median
and the standard deviation with the IQR.

#### 5.3.2.2. Min-Max Scaling and Clipping

A less frequently but still noteworthy transformation
is called *min-max scaling* and
involves subtracting the minimum
and then dividing by the range, \((x-x_{(1)})/(x_{(n)}-x_{(1)})\).

```
x = np.array([-1.5, 0.5, 3.5, -1.33, 0.25, 0.8])
(x - np.min(x))/(np.max(x)-np.min(x))
## array([0. , 0.4 , 1. , 0.034, 0.35 , 0.46 ])
```

Here, the smallest value is mapped to 0 and the largest becomes equal to 1. Let us stress that, in this context, 0.5 does not represent the value which is equal to the mean (unless we are incredibly lucky).

Also, *clipping* can be used to replace all values less than
0 with 0 and those greater than 1 with 1.

```
np.clip(x, 0, 1)
## array([0. , 0.5 , 1. , 0. , 0.25, 0.8 ])
```

The function is of course flexible; another popular choice is clipping to \([-1, 1]\). This can also be implemented manually by means of the vectorised pairwise minimum and maximum functions.

```
np.minimum(1, np.maximum(0, x))
## array([0. , 0.5 , 1. , 0. , 0.25, 0.8 ])
```

#### 5.3.2.3. Normalisation (\(l_2\); Dividing by Magnitude)

*Normalisation* is the scaling of a given vector so that it is of
*unit length*. Usually, by *length* we mean the square root of the sum of
squares, i.e., the Euclidean (\(l_2\)) norm also known as the *magnitude*:

Its special case for \(n=2\) we know well from high school:
the length of a vector \((a,b)\)
is \(\sqrt{a^2+b^2}\), e.g., \(\|(1, 2)\| = \sqrt{5} \simeq 2.236\).
Also, it is good to program our brains so that when
next time we see \(\| \boldsymbol{x} \|^2\), we immediately
think of the *sum of squares*.

Consequently, a normalised vector:

can be determined via:

```
x = np.array([1, 5, -4, 2, 2.5]) # example vector
x/np.sqrt(np.sum(x**2)) # x divided by the Euclidean norm of x
## array([ 0.13834289, 0.69171446, -0.55337157, 0.27668579, 0.34585723])
```

Normalisation is similar to standardisation if data are already centred (when the mean was subtracted). Show that we can obtain one from the other via the scaling by \(\sqrt{n}\).

Important

A common confusion is that normalisation is supposed to make
data *more normally* distributed. This is not the case4,
as we only scale (stretch or shrink) the observations here.

#### 5.3.2.4. Normalisation (\(l_1\); Dividing by Sum)

At other times, we might be interested in considering the Manhattan (\(l_1\)) norm:

being the sum of absolute values.

```
x / np.sum(np.abs(x))
## array([ 0.06896552, 0.34482759, -0.27586207, 0.13793103, 0.17241379])
```

\(l_1\) normalisation is frequently applied on vectors of nonnegative values,
whose normalised versions can be interpreted as *probabilities*
or *proportions*: values between 0 and 1 which sum to 1 (or, equivalently,
100%).

Given some binned data:

```
c, b = np.histogram(heights, [-np.inf, 150, 160, 170, np.inf])
print(c) # counts
## [ 306 1776 1773 366]
```

We can convert the counts to empirical probabilities:

```
p = c/np.sum(c) # np.abs is not needed here
print(p)
## [0.07249467 0.42075338 0.42004264 0.08670931]
```

We did not apply **numpy.abs**, because the values were already nonnegative.

### 5.3.3. Vector-Vector Case

So far we have been applying `*****`, `**+**`, etc.,
on vectors and scalars only.
All arithmetic operators can also be applied on two vectors of equal lengths.
In such a case, they will act *elementwisely*:
taking each element from the first operand
and combining it with the *corresponding* element from the second argument:

```
np.array([2, 3, 4, 5]) * np.array([10, 100, 1000, 10000])
## array([ 20, 300, 4000, 50000])
```

We see that the first element in the left operand (2) was multiplied by the first element in the right operand (10). Then, we multiplied 3 by 100 (the second corresponding elements), and so forth.

Such a behaviour of the binary operators is inspired by the usual convention in vector algebra where applying \(+\) (or \(-\)) on \(\boldsymbol{x}=(x_1,\dots,x_n)\) and \(\boldsymbol{y}=(y_1,\dots,y_n)\) means exactly:

Using other operators this way (elementwisely)
is less standard in mathematics (for instance multiplication
might denote the dot product), but in **numpy**
it is really convenient.

Let us compute the value of the expression \(h = - (p_1\, \log p_1 + \dots + p_n\, \log p_n)\), i.e., \(h=-\sum_{i=1}^n p_i\, \log p_i\) (the so-called entropy):

```
p = np.array([0.1, 0.3, 0.25, 0.15, 0.12, 0.08]) # example vector
-np.sum(p*np.log(p))
## 1.6790818544987114
```

The above involves the use of a unary vectorised minus (change sign), an aggregation function (sum), a vectorised mathematical function (log), and an elementwise multiplication of two vectors of the same lengths.

Let us assume that – for whatever reason – we would like to plot two mathematical functions, the sine, \(f(x)= \sin x\), and a polynomial of degree 7, \(g(x) = x - x^3/6 + x^5/120 - x^7/5040\) for \(x\) in the interval \([-\pi, 3\pi/2]\).

To do this, we can probe the values of \(f\) and \(g\) at sufficiently many points
using the vectorised operations discussed so far
and then use the **matplotlib.pyplot.plot** function
to draw what we see in Figure 5.7.

```
x = np.linspace(-np.pi, 1.5*np.pi, 1001) # many points in the said interval
yf = np.sin(x)
yg = x - x**3/6 + x**5/120 - x**7/5040
plt.plot(x, yf, 'k-', label="f(x)") # black solid line
plt.plot(x, yg, 'r:', label="g(x)") # red dotted line
plt.legend()
plt.show()
```

Decreasing the number of points in `x`

will reveal that the
plotting function merely draws a series of straight-line segments.
Computer graphics is essentially discrete.

Using a single line of code, compute the vector of BMIs of all persons
based on the
`nhanes_adult_female_height_2020`

and
`nhanes_adult_female_weight_2020`

datasets.
It is assumed that the *i*-th elements therein both refer to the same
person.

## 5.4. Indexing Vectors

Recall from Section 3.2.1 and Section 3.2.2 that sequential objects in Python (lists, tuples, strings, ranges) support indexing using scalars and slices:

```
x = [10, 20, 30, 40, 50]
x[1] # scalar index – extract
## 20
x[1:2] # slice index – subset
## [20]
```

**numpy** vectors support two *additional* indexing schemes:
using integer and boolean sequences.

### 5.4.1. Integer Indexing

Indexing with a single integer *extracts*
a particular element:

```
x = np.array([10, 20, 30, 40, 50])
x[0] # first
## 10
x[1] # second
## 20
x[-1] # last
## 50
```

We can also use lists of vectors of integer indexes, which return a subvector with elements at the specified indexes:

```
x[ [0] ]
## array([10])
x[ [0, 1, -1, 0, 1, 0, 0] ]
## array([10, 20, 50, 10, 20, 10, 10])
x[ [] ]
## array([], dtype=int64)
```

We added some spaces between the square brackets,
but only because, for example, `x[[0]]`

might seem slightly more
enigmatic. (What are these double square brackets?
Nah, it is a list inside the index operator.)

### 5.4.2. Logical Indexing

Subsetting using a logical vector of the same length as the indexed vector is possible too:

```
x[ [True, False, True, True, False] ]
## array([10, 30, 40])
```

This returned the 1st, 3rd, and 4th element (select 1st, omit 2nd, select 3rd, select 4th, omit 5th).

This is particularly useful as a *data filtering* technique.
Knowing that the relational operators `**<**`, `**<=**`,
`**==**`, `**!=**`,
`**>=**`, and `**>**` on vectors are performed
elementwisely
(just like `**+**`, `*****`, etc.), for instance:

```
x >= 30
## array([False, False, True, True, True])
```

we can write:

```
x[ x >= 30 ]
## array([30, 40, 50])
```

to mean “select the elements in `x`

which are not less than 30”.

Of course, the indexed vector and the vector
specifying the *filter* do not5 have to be the same:

```
y = (x/10) % 2 # whatever
y # equal to 0 if a number is a multiply of 10 times an even number
## array([1., 0., 1., 0., 1.])
x[ y == 0 ]
## array([20, 40])
```

Important

If we would like to combine many logical vectors, sadly we cannot
use the **and**, **or**, and **not** operators,
because they are not vectorised (this is a limitation of our language
per se).

In **numpy**, we use the `**&**`, `**|**`,
and `**~**` operators instead.
Unfortunately, they have a lower order of precedence than
`**<**`, `**<=**`, `**==**`, etc.
Therefore, the bracketing of the comparisons is obligatory.

For example:

```
x[ (20 <= x) & (x <= 40) ] # check what happens if we skip the brackets
## array([20, 30, 40])
```

means “elements in `x`

between 20 and 40”
(greater than or equal to 20 and less than or equal to 40).

Compute the BMIs only of the women whose height is between 150 and 170 cm.

### 5.4.3. Slicing

Just as with ordinary lists,
slicing with “`:`

” can be used to fetch the elements
at indexes in a given range like `from:to`

or `from:to:by`

.

```
x[::-1]
## array([50, 40, 30, 20, 10])
x[3:]
## array([40, 50])
x[1:4]
## array([20, 30, 40])
```

Important

For efficiency reasons, slicing returns a
*view* on existing data. It does not have to make an independent
copy of the subsetted elements, because sliced ranges are *regular*
by definition.

In other words, both `x`

and its sliced version share the same memory.
This is important when we apply operations which modify a
given vector in place, such as the `sort`

method.

```
y = np.array([6, 4, 8, 5, 1, 3, 2, 9, 7])
y[::2] *= 10 # modifies parts of y in place
y # has changed
## array([60, 4, 80, 5, 10, 3, 20, 9, 70])
```

This multiplied every second element in `y`

by 10
(i.e., `[6, 8, 1, 2, 7]`

).
On the other hand, indexing with an integer or logical
vector always returns a copy.

```
y[ [1, 3, 5, 7] ] *= 10 # modifies a new object and then forgets about it
y # has not changed since the last modification
## array([60, 40, 80, 50, 10, 30, 20, 90, 70])
```

This *did not* modify the original vector, because
we applied `***=**` on a different object,
which has not even been memorised after that operation took place.

## 5.5. Other Operations

### 5.5.1. Cumulative Sums and Iterated Differences

Recall that the `**+**` operator acts on two vectors elementwisely
and that the **numpy.sum** function aggregates all values into
a single one. We have a similar function, but vectorised
in a slightly different fashion. Namely, **numpy.cumsum**
returns the vector of *cumulative sums*:

```
np.cumsum([5, 3, -4, 1, 1, 3])
## array([5, 8, 4, 5, 6, 9])
```

This gave, in this order: the first element, the sum of first two elements, the sum of first three elements, …, the sum of all elements.

*Iterative differences* are a somewhat inverse operation:

```
np.diff([5, 8, 4, 5, 6, 9])
## array([ 3, -4, 1, 1, 3])
```

returned the difference between the 2nd and 1st element, then the difference between the 3rd and the 2nd, and so forth. The resulting vector is one element shorter than the input one.

We often make use of cumulative sums and iterated differences when processing time series, e.g., stock exchange data (e.g., by how much the price changed since the previous day?; Section 16.3.1) or determining cumulative distribution functions (Section 4.3.8).

### 5.5.2. Sorting

The **numpy.sort** function returns a sorted copy
of a given vector, i.e., determines the order statistics.

```
x = np.array([40, 10, 20, 40, 40, 30, 20, 40, 50, 10, 10, 70, 30, 40, 30])
np.sort(x)
## array([10, 10, 10, 20, 20, 30, 30, 30, 40, 40, 40, 40, 40, 50, 70])
```

The **sort** method (as in: `x.sort()`

), on the other hand,
sorts the vector in place (and returns nothing).

Readers interested more in chaos than in bringing order
should give **numpy.random.permutation** a try.
This function shuffles the elements in a given vector.

### 5.5.3. Dealing with Tied Observations

Some statistical methods, especially for continuous data6,
assume that all observations in a vector are unique, i.e.,
there are no *ties*. In real life, however, some values
might be recorded multiple times. For instance, two marathoners
might finish their runs in exactly the same time,
data can be rounded up to a certain number
of fractional digits, or it just happens that observations are inherently
integer. Therefore, we should be able to detect duplicated entries.

**numpy.unique** is a workhorse for dealing with tied observations.

```
x = np.array([40, 10, 20, 40, 40, 30, 20, 40, 50, 10, 10, 70, 30, 40, 30])
np.unique(x)
## array([10, 20, 30, 40, 50, 70])
```

Returns a *sorted*7 version of
a given vector with duplicates removed.

We can also get the corresponding counts:

```
np.unique(x, return_counts=True) # returns a tuple of length 2
## (array([10, 20, 30, 40, 50, 70]), array([3, 2, 3, 5, 1, 1]))
```

This can be used to determine if all the values in a vector are unique:

```
np.all(np.unique(x, return_counts=True)[1] == 1)
## False
```

Play with the `return_index`

argument to **numpy.unique**
that allows pinpointing the indexes of the first occurrences
of each unique value.

### 5.5.4. Determining the Ordering Permutation and Ranking

**numpy.argsort** returns a sequence of indexes that
lead to an ordered version of a given vector
(i.e., an ordering permutation).

```
x = np.array([40, 10, 20, 40, 40, 30, 20, 40, 50, 10, 10, 70, 30, 40, 30])
np.argsort(x)
## array([ 1, 9, 10, 2, 6, 5, 12, 14, 0, 3, 4, 7, 13, 8, 11])
```

Which means that the smallest element is at index 1, then the 2nd smallest is at index 9, 3rd smallest at index 10, etc. Therefore:

```
x[np.argsort(x)]
## array([10, 10, 10, 20, 20, 30, 30, 30, 40, 40, 40, 40, 40, 50, 70])
```

is equivalent to **numpy.sort**`(x)`

.

Note

(**)
If there are tied observations in a vector `x`

,
**numpy.argsort**`(x, kind="stable")`

will use
a *stable* sorting algorithm
(timsort,
a variant of mergesort), which guarantees that the ordering
permutation is unique: tied elements are placed in the order of appearance.

Next, **scipy.stats.rankdata** returns a vector of *ranks*.

```
x = np.array([10, 40, 50, 20, 30])
scipy.stats.rankdata(x)
## array([1., 4., 5., 2., 3.])
```

Element 10 is the smallest (“the winner”, say, the quickest racer). Hence, it ranks first. Element 40 is the 4th on the podium. Thus, its rank is 4. And so on.

On a side note, there are many methods in nonparametric statistics (those that do not make any too particular assumptions about the underlying data distribution) that are based on ranks. In particular, in Section 9.1.4, we cover the Spearman correlation coefficient.

Consult the manual page of
**scipy.stats.rankdata** and test various methods for dealing
with ties.

Note

(**)
Readers with some background in discrete mathematics
will be interested in the fact that calling **numpy.argsort**
on a vector representing a permutation of elements
in fact generates its inverse. In particular,
**np.argsort**`(`

**np.argsort**`(x, kind="stable"))+1`

is equivalent to
**scipy.stats.rankdata**`(x, method="ordinal")`

.

### 5.5.5. Searching for Certain Indexes (Argmin, Argmax)

**numpy.argmin** and **numpy.argmax** return the index at
which we can find the smallest and the largest observation
in a given vector.

```
x = np.array([10, 30, 50, 40, 20, 50])
np.argmin(x), np.argmax(x)
## (0, 2)
```

If there are tied observations, the smallest index is returned.

Using mathematical notation, we can denote the former with:

and read it as: let \(i\) be the index of the smallest element in the sequence.
Alternatively, it is the *argument of the minimum*, whenever:

i.e., the \(i\)-th element is the smallest.

We can use **numpy.flatnonzero** to fetch the indexes
where a logical vector has elements equal to `True`

(in Section 11.1.2, we mention that a value equal
to zero is treated as the logical `False`

, and as `True`

in all
other cases).
For example:

```
np.flatnonzero(x == np.max(x))
## array([2, 5])
```

It is a version of **numpy.argmax** that lets us decide
what we would like to do with the tied maxima (there are two).

Let `x`

be a vector with possible ties.
Write an expression that returns a randomly chosen index
pinpointing one of the sample maxima.

### 5.5.6. Dealing with Round-off and Measurement Errors

Mathematics tells us (the easy proof is left as an exercise to the reader) that a centred version of a given vector \(\boldsymbol{x}\), \(\boldsymbol{y} = \boldsymbol{x}-\bar{x}\), has the arithmetic mean of 0, i.e., \(\bar{y}=0\).

Of course, it is also true on a computer. Or is it?

```
heights_centred = (heights - np.mean(heights))
np.mean(heights_centred) == 0
## False
```

The average is actually equal to:

```
np.mean(heights_centred)
## 1.3359078775153175e-14
```

which is *almost* zero (0.0000000000000134), but not *exactly* zero
(it is zero for an engineer, not a mathematician).
We saw a similar result
when performing standardisation (which involves centring)
in Section 5.3.2.

Important

All floating-point operations on a computer8 (not only in Python)
are performed with *finite* precision of 15–17 decimal digits.

We know it from school – for example, some fractions cannot be represented as decimals. When asked to add or multiply them, we will always have to apply some rounding that ultimately leads to precision loss. We know that \(1/3 + 1/3 + 1/3 = 1\), but using a decimal representation with one fractional digit, we get \(0.3+0.3+0.3=0.9\). With two digits, we obtain \(0.33+0.33+0.33=0.99\). And so on. This sum will never be equal exactly to 1 when using a finite precision.

Moreover, errors induced in one operation will propagate onto further ones. Most often they cancel out, but in extreme cases, they can lead to undesirable consequences (like for some model matrices in linear regression; see Section 9.2.9).

There is no reason to panic, though. The rule to remember is:

Important

As the floating-point values are precise up to a few decimal digits,
we should refrain from comparing them using the `**==**`
operator, because it tests exact equality.

When a comparison is needed, we need to take some error margin
into account. Ideally, instead of testing `x == y`

, we should either
inspect the *absolute error*:

or, assuming \(y\neq 0\), the *relative error*:

where \(\varepsilon\) is some small error margin.

For instance, **numpy.allclose**`(x, y)`

checks (by default)
if for all corresponding elements in both vectors it holds
**numpy.abs**`(x-y) <= 1e-8 + 1e-5*`

**numpy.abs**`(y)`

,
which is a combination of both tests.

```
np.allclose(np.mean(heights_centred), 0)
## True
```

To avoid sorrow surprises, even the testing of inequalities
like `x >= 0`

should rather be performed as, say, `x >= 1e-8`

.

Note

Our data are often imprecise by nature. When asked about people’s heights, rarely will they provide a non-integer answer (assuming they know how tall they are and are not lying about it, but it is a different story). We will most likely get data rounded to 0 decimal digits. In our dataset the precision is a bit higher:

```
heights[:6] # preview
## array([160.2, 152.7, 161.2, 157.4, 154.6, 144.7])
```

But still, there is an inherent *observational error*.
Even if, for example, the mean thereof was computed exactly,
the fact that the inputs themselves are not necessarily ideal
makes the estimate *approximate* as well. We can only hope
that these errors will more or less cancel out in the computations.

Compute the BMIs of all females in the NHANES study. Determine their arithmetic mean. Compare it to the arithmetic mean computed for BMIs rounded to 1, 2, 3, 4, etc., decimal digits.

Note

(*) Another problem is related to the fact that floats on a computer
use the binary base, not the decimal one. Therefore, some
fractional numbers that we *believe* to be representable exactly,
require an infinite number of bits. As a consequence, they are subject to
rounding.

```
0.1 + 0.1 + 0.1 == 0.3 # obviously
## False
```

This is because `0.1`

, `0.1+0.1+0.1`

, and `0.3`

is literally
represented as, respectively:

```
print(f"{0.1:.19f}, {0.1+0.1+0.1:.19f}, and {0.3:.19f}.")
## 0.1000000000000000056, 0.3000000000000000444, and 0.2999999999999999889.
```

A good introductory reference to the topic of numerical inaccuracies is [36]; see also [43, 50] for a more comprehensive treatment of numerical analysis.

### 5.5.7. Vectorising Scalar Operations with List Comprehensions

*List comprehensions* of the form
`[ expression `

**for**` name `

**in**` iterable ]`

are part of base Python.
They allow us to create lists based on transformed versions
of individual elements in a given iterable object.
Hence, they might work in cases where a task at hand
cannot be solved by means of vectorised **numpy** functions.

For example, here is a way to generate the squares of a few positive natural numbers:

```
[ i**2 for i in range(1, 11) ]
## [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```

The result can be passed to **numpy.array** to convert it to a vector.

Further, given an example vector:

```
x = np.round(np.random.rand(10)*2-1, 2)
x
## array([ 0.86, -0.37, -0.63, -0.59, 0.14, 0.19, 0.93, 0.31, 0.5 ,
## 0.31])
```

If we wish to filter our all elements that are not greater than 0, we can write:

```
[ e for e in x if e > 0 ]
## [0.86, 0.14, 0.19, 0.93, 0.31, 0.5, 0.31]
```

We can also use the ternary operator of the form
`x_true `

**if**` cond `

**else**` x_false`

to return either `x_true`

or
`x_false`

depending on the truth value of `cond`

.

```
e = -2
e**0.5 if e >= 0 else (-e)**0.5
## 1.4142135623730951
```

Combined with a list comprehension, we can write, for instance:

```
[ round(e**0.5 if e >= 0 else (-e)**0.5, 2) for e in x ]
## [0.93, 0.61, 0.79, 0.77, 0.37, 0.44, 0.96, 0.56, 0.71, 0.56]
```

This gave the square root of absolute values.

There is also a tool which vectorises a scalar function
so that it can be used on **numpy** vectors:

```
def clip01(x):
"""clip to the unit inverval"""
if x < 0: return 0
elif x > 1: return 1
else: return x
clip01s = np.vectorize(clip01) # returns a function object
clip01s([0.3, -1.2, 0.7, 4, 9])
## array([0.3, 0. , 0.7, 1. , 1. ])
```

In the above cases, it is much better (faster, more readable code)
to rely on vectorised **numpy** functions.
Still, if the corresponding operations are unavailable
(e.g., string processing, reading many files),
list comprehensions provide a reasonable replacement therefor.

Write equivalent versions of the above expressions
using vectorised **numpy** functions.

Write equivalent versions of the above expressions
using base Python lists,
the **for** loop and the **list.append** method (start from
an empty list that will store the result).

## 5.6. Exercises

What are some benefits of using a **numpy** vector
over an ordinary Python list? What are the drawbacks?

How can we interpret the possibly different values of the arithmetic mean, median, standard deviation, interquartile range, and skewness, when comparing between heights of men and women?

There is something scientific and magical about *numbers*
that make us approach them with some kind of respect.
However, taking into account that there are many possible data aggregates,
there is a risk that a party may be cherry-picking – reporting
the one that portrays the analysed entity in a good or bad light. For instance,
reporting the mean instead of the median or vice versa.
Is there anything that can be done about it?

Even though, mathematically speaking, all measures can be computed on all data, it does not mean that it always makes sense to do so. For instance, some distributions will have skewness of 0, but we should not automatically assume that they are delightfully symmetric and bell-shaped (e.g., this can be a bimodal distribution). This is why we always need to visualise our data. Give some examples of datasets and measures where we should be critical of the obtained results.

Give some examples where simple data preprocessing can drastically change the values of chosen sample aggregates.

Give the mathematical definitions, use cases, and interpretations of standardisation, normalisation, and min-max scaling.

How are **numpy.log** and **numpy.exp** related to each other?
How about **numpy.log** vs **numpy.log10**,
**numpy.cumsum** vs **numpy.diff**,
**numpy.min** vs **numpy.argmin**,
**numpy.sort** vs **numpy.argsort**, and
**scipy.stats.rankdata** vs **numpy.argsort**?

What is the difference between
**numpy.trunc**, **numpy.floor**, **numpy.ceil**,
and **numpy.round**?

What happens when we apply `**+**` on two vectors of different
lengths?

List the four ways to index a vector.

What is wrong with the expression
`x[ x >= 0 and x <= 1 ]`

, where `x`

is a numeric vector?
How about `x[ x >= 0 & x <= 1 ]`

?

What does it mean that slicing returns a view on existing data?

(**) Reflect on the
famous
saying: *not everything that can be counted counts,
and not everything that counts can be counted*.

(**) Being a data scientist can be a frustrating job, especially
when you care for some causes.
Reflect on: *some things that count can be counted,
but we will not count them, because there’s no budget for them.*

(**) Being a data scientist can be a frustrating job, especially
when you care for the truth.
Reflect on: *some things that count can be counted,
but we will not count them, because some people might be offended
or find it unpleasant.*

(**) Assume you were to establish your own nation on some island
and become the benevolent dictator thereof.
How would you *measure* if your people are happy or not?
Let us say that you need to come up with 3 quantitative measures
(key performance indicators).
What would happen if your policy-making was solely focused
on optimising those KPIs?
How about the same problem but with regard to your company and employees?
Think about what can go wrong in other areas of life.

- 1
Nevertheless, we strongly advocate for all information of concern to the public to be openly available, so that

*experienced*statisticians can put them to good use.- 2
(**) Based on the so-called

*uncorrected for bias*version of the sample variance. We prefer it here for didactical reasons (simplicity, better interpretability). Plus, it is the default one in**numpy**. Passing`ddof=1`

(*delta degrees of freedom*) to**numpy.std**will apply division by \(n-1\) instead of by \(n\). When used as an estimator of the distribution’s standard deviation, the latter has slightly better statistical properties (which we normally explore in a course on mathematical statistics, which this one is not). On the other hand, we will see later that the`std`

methods in the**pandas**package have`ddof=1`

by default. Therefore, we might be interested in setting`ddof=0`

therein.- 3
The \(1.5 \mathrm{IQR}\) rule is the most popular in the statistical literature, but some plotting software may use different whisker ranges. See Section 15.4.1 for further discussion.

- 4
(*) A Box–Cox transformation can help achieve this in some datasets; see [8].

- 5
(*) This is because the indexer is computed first, and its

*value*is passed as an argument to the index operator. Python neither is a symbolic programming language, nor does it feature any nonstandard evaluation techniques. In other words,**[...]**does not care how the indexer was obtained.- 6
Where, theoretically, the probability of obtaining a tie is equal to 0.

- 7
Later we will mention

**pandas.unique**which lists the values in the order of appearance.- 8
Double precision

`float64`

format as defined by the IEEE Standard for Floating-Point Arithmetic (IEEE 754).