# 7. Multidimensional Numeric Data at a Glance

The open-access textbook *Minimalist Data Wrangling with Python* by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF; a printed version can be ordered from Amazon). It is a non-profit project. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Also, make sure to check out my other book, *Deep R Programming* [34].

From the perspective of structured datasets,
a vector often represents *n* independent measurements
of the same quantitative property,
e.g., heights of *n* different patients,
incomes in *n* randomly chosen households,
or ages of *n* runners.
More generally, these are all instances of a bag of
*n* points on the real line.
By now[1] we should have become quite fluent with the methods
for processing such one-dimensional arrays.

Let us increase the level of complexity
by allowing each of the *n* entities to be described by *m* features,
for any \(m\ge 1\).
In other words, we will be dealing with *n* points
in an *m*-dimensional space, \(\mathbb{R}^m\).

We can arrange all the observations in a table with *n* rows and *m* columns
(just like in spreadsheets).
Such objects can be expressed with **numpy** as two-dimensional
arrays, which we will refer to as *matrices*.
Thanks to matrices, we can keep the *n* tuples of length
*m* together in a single object and process them all at once
(or *m* tuples of length *n*, depending on how we want to look at them).
Very convenient.

Important

Just like vectors, matrices were
designed to store data of the same type.
In Chapter 10, we will cover *data frames*,
which further increase the degree of
complexity (and freedom) by not only allowing for mixed data types
(e.g., numerical and categorical; this will enable us
to perform data analysis in subgroups more easily)
but also for the rows and columns to be named.

Many data analysis algorithms convert data frames to matrices
automatically and deal with them as such.
From the computational side, it is **numpy** that does
most of the “mathematical” work.
**pandas** implements many recipes for basic data wrangling
tasks, but we want to go way beyond that.
After all, we would like to be able to tackle *any* problem.

## 7.1. Creating Matrices

### 7.1.1. Reading CSV Files

Tabular data are often stored and distributed
in a very portable plain-text format called CSV (comma-separated values)
or variants thereof.
We can read them quite easily with **numpy.genfromtxt**
(or later with **pandas.read_csv**).

```
import numpy as np  # assumed to have been imported like this earlier in the course
body = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/nhanes_adult_female_bmx_2020.csv",
    delimiter=",")[1:, :]  # skip first row (column names)
```

Note that the file specifies column names (the first non-comment line), therefore we had to skip it manually (more on matrix indexing later). Here is a preview of the first few rows:

```
body[:6, :] # six first rows, all columns
## array([[ 97.1, 160.2, 34.7, 40.8, 35.8, 126.1, 117.9],
## [ 91.1, 152.7, 33.5, 33. , 38.5, 125.5, 103.1],
## [ 73. , 161.2, 37.4, 38. , 31.8, 106.2, 92. ],
## [ 61.7, 157.4, 38. , 34.7, 29. , 101. , 90.5],
## [ 55.4, 154.6, 34.6, 34. , 28.3, 92.5, 73.2],
## [ 62. , 144.7, 32.5, 34.2, 29.8, 106.7, 84.8]])
```

This is an extended version of the National Health and Nutrition Examination Survey (NHANES), where the consecutive columns give the following body measurements of adult females:

```
body_columns = np.array([
"weight (kg)",
"standing height (cm)",
"upper arm len. (cm)",
"upper leg len. (cm)",
"arm circ. (cm)",
"hip circ. (cm)",
"waist circ. (cm)"
])
```

**numpy** matrices do not support column naming. This is why we noted
them down separately. It is only a minor inconvenience.
**pandas** data frames will have this capability, but from the
algebraic side, they are not as convenient as matrices for the purpose of
scientific computing.

What we are dealing with is still a **numpy** array:

```
type(body) # class of this object
## <class 'numpy.ndarray'>
```

But this time it is a two-dimensional one:

```
body.ndim # number of dimensions
## 2
```

which means that the `shape` slot is now a tuple of length 2:

```
body.shape
## (4221, 7)
```

The above gave the total number of rows and columns, respectively.
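
As a side note, skipping the header line can also be delegated to **numpy.genfromtxt** itself via its `skip_header` argument. Below is a minimal sketch, assuming a tiny made-up CSV kept in memory (the `csv_text` string and its values are only illustrative; this is not the actual NHANES file). It also shows how the `shape` tuple can be unpacked.

```python
import io
import numpy as np

# a made-up, in-memory CSV standing in for a real file (an assumption)
csv_text = "weight,height\n97.1,160.2\n91.1,152.7\n73.0,161.2\n"

# skip_header=1 omits the first line (the column names)
X = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

n, m = X.shape  # the number of rows and columns, respectively
print(n, m)
```

Without `skip_header`, the first row would consist of missing values (NaNs), because the column names cannot be converted to numbers.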

### 7.1.2. Enumerating Elements

**numpy.array** can create a two-dimensional array based on a list
of lists or vector-like objects, all of the same lengths.
Each of them will constitute a separate row of the resulting matrix.

For example:

```
np.array([ # list of lists
[ 1, 2, 3, 4 ], # the 1st row
[ 5, 6, 7, 8 ], # the 2nd row
[ 9, 10, 11, 12 ] # the 3rd row
])
## array([[ 1, 2, 3, 4],
## [ 5, 6, 7, 8],
## [ 9, 10, 11, 12]])
```

gives a 3-by-4 (3×4) matrix,

```
np.array([ [1], [2], [3] ])
## array([[1],
## [2],
## [3]])
```

yields a 3-by-1 one (we call it a *column vector*,
but it is a special matrix — we will soon learn that shapes
can make a significant difference), and

```
np.array([ [1, 2, 3, 4] ])
## array([[1, 2, 3, 4]])
```

produces a 1-by-4 array (a *row vector*).

Note

An ordinary vector (a unidimensional array) only uses a single pair of square brackets:

```
np.array([1, 2, 3, 4])
## array([1, 2, 3, 4])
```
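
To see how the number of square brackets translates to the number of dimensions, we can inspect the `ndim` and `shape` slots. A small sketch (the names `v`, `r`, and `c` are ours):

```python
import numpy as np

v = np.array([1, 2, 3, 4])          # a vector
r = np.array([[1, 2, 3, 4]])        # a row vector (a matrix with one row)
c = np.array([[1], [2], [3], [4]])  # a column vector (a matrix with one column)

print(v.ndim, v.shape)  # 1 (4,)
print(r.ndim, r.shape)  # 2 (1, 4)
print(c.ndim, c.shape)  # 2 (4, 1)
```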

### 7.1.3. Repeating Arrays

The previously mentioned **numpy.tile** and **numpy.repeat**
can also generate some nice matrices.
For instance,

```
np.repeat([[1, 2, 3, 4]], 3, axis=0)
## array([[1, 2, 3, 4],
## [1, 2, 3, 4],
## [1, 2, 3, 4]])
```

repeats a row vector rowwisely (i.e., over axis `0`, the first one).

Replicating a column vector columnwisely (i.e., over axis `1`, the second one) is possible as well:

```
np.repeat([[1], [2], [3]], 4, axis=1)
## array([[1, 1, 1, 1],
## [2, 2, 2, 2],
## [3, 3, 3, 3]])
```
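
The two results above can probably also be obtained with **numpy.tile**, whose second argument gives the number of copies along each axis. A hedged sketch (the variable names are ours):

```python
import numpy as np

T1 = np.tile([1, 2, 3, 4], (3, 1))     # 3 copies downwards, 1 across
T2 = np.tile([[1], [2], [3]], (1, 4))  # 1 copy downwards, 4 across

# for comparison, the np.repeat-based versions from the text
R1 = np.repeat([[1, 2, 3, 4]], 3, axis=0)
R2 = np.repeat([[1], [2], [3]], 4, axis=1)

print(np.array_equal(T1, R1), np.array_equal(T2, R2))
```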

How can we generate matrices of the following kinds?

### 7.1.4. Stacking Arrays

**numpy.column_stack** and **numpy.row_stack** take
a tuple of array-like objects and bind them column- or rowwisely
to form a new matrix:

```
np.column_stack(([10, 20], [30, 40], [50, 60])) # a tuple of lists
## array([[10, 30, 50],
## [20, 40, 60]])
np.row_stack(([10, 20], [30, 40], [50, 60]))
## array([[10, 20],
## [30, 40],
## [50, 60]])
np.column_stack((
np.row_stack(([10, 20], [30, 40], [50, 60])),
[70, 80, 90]
))
## array([[10, 20, 70],
## [30, 40, 80],
## [50, 60, 90]])
```

Perform similar operations
using **numpy.append**, **numpy.vstack**, **numpy.hstack**,
**numpy.concatenate**, and (*) **numpy.c_**.

Using **numpy.insert**,
add a new row/column at the beginning, end, and in the middle
of an array. Let us stress that this function returns a new array.
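
As a starting point for the two exercises above, here is a minimal sketch of a few of these functions applied to a small example matrix (the names `A` and `B1` to `B5` are ours). Note that, to our knowledge, recent **numpy** releases (2.0 and later) deprecate **numpy.row_stack** in favour of the equivalent **numpy.vstack**.

```python
import numpy as np

A = np.array([[10, 20], [30, 40], [50, 60]])

B1 = np.vstack((A, [70, 80]))                 # a new row at the end
B2 = np.concatenate((A, [[70, 80]]), axis=0)  # ditto; note the two-dimensional row
B3 = np.hstack((A, [[1], [2], [3]]))          # a new column at the end
B4 = np.insert(A, 1, [0, 0], axis=0)          # a new row before the row at index 1
B5 = np.insert(A, 0, [1, 2, 3], axis=1)       # a new column at the beginning

print(B4)
print(A.shape)  # A itself is left unchanged
```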

### 7.1.5. Other Functions

Many built-in functions allow for generating arrays of arbitrary shapes (not only vectors). For example:

```
np.random.seed(123)
np.random.rand(2, 5) # not: rand((2, 5))
## array([[0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897],
## [0.42310646, 0.9807642 , 0.68482974, 0.4809319 , 0.39211752]])
```

The same with **scipy**:

```
scipy.stats.uniform.rvs(0, 1, size=(2, 5), random_state=123)
## array([[0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897],
## [0.42310646, 0.9807642 , 0.68482974, 0.4809319 , 0.39211752]])
```

The way we specify the output shapes might differ across functions and packages. Consequently, as usual, it is always best to refer to their documentation.
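
To illustrate the differing conventions, here is a short sketch with a few functions that expect the shape as a single tuple. The use of **numpy.random.default_rng** is our choice here; it is not part of the original example.

```python
import numpy as np

Z = np.zeros((2, 3))              # the shape is given as a tuple
O = np.ones((2, 3), dtype=int)    # ditto, with integer elements
rng = np.random.default_rng(123)  # a newer interface to random number generators
U = rng.random((2, 5))            # a tuple again, unlike in np.random.rand(2, 5)

print(Z.shape, O.shape, U.shape)
```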

Check out the documentation of the following functions:
**numpy.eye**, **numpy.diag**, **numpy.zeros**,
**numpy.ones**, and **numpy.empty**.

## 7.2. Reshaping Matrices

Let us take an example 3-by-4 matrix:

```
A = np.array([
[ 1, 2, 3, 4 ],
[ 5, 6, 7, 8 ],
[ 9, 10, 11, 12 ]
])
```

Internally, a matrix is represented using a *long* flat vector
where elements are stored in
the row-major[2] order:

```
A.size # total number of elements
## 12
A.ravel() # the underlying array
## array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
```

It is the `shape` slot that is causing the 12 elements to be treated as
if they were arranged on a 3-by-4 grid,
for example in different algebraic computations and during
the printing thereof.
This arrangement can be altered anytime without modifying
the underlying array:

```
A.shape = (4, 3)
A
## array([[ 1, 2, 3],
## [ 4, 5, 6],
## [ 7, 8, 9],
## [10, 11, 12]])
```

This way, we obtained a different *view* of the same data.

For convenience, there is also the **reshape** method
that returns a modified version of the object it is applied to:

```
A.reshape(-1, 6) # A.reshape(don't make me compute this for you mate!, 6)
## array([[ 1, 2, 3, 4, 5, 6],
## [ 7, 8, 9, 10, 11, 12]])
```

Here, “`-1`” means that **numpy** must deduce by itself how many rows
we want in the result. Twelve elements are supposed
to be arranged in six columns, so the maths behind it is not rocket science.
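
The following sketch illustrates that, for an ordinary row-major matrix like ours, **reshape** yields a *view*: the result shares its data with the original object (in other situations, **numpy** may need to return a copy instead). The names `A` and `B` are local to this example.

```python
import numpy as np

A = np.arange(1, 13).reshape(3, 4)
B = A.reshape(-1, 6)           # the number of rows is deduced: 12/6 = 2

print(B.shape)                 # (2, 6)
print(np.shares_memory(A, B))  # True: B is a view in this case
B[0, 0] = 100                  # hence modifying B also affects A
print(A[0, 0])                 # 100
```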

Thanks to this, generating row or column vectors is straightforward:

```
np.linspace(0, 1, 5).reshape(1, -1) # one row, auto-guess number of columns
## array([[0. , 0.25, 0.5 , 0.75, 1. ]])
np.array([9099, 2537, 1832]).reshape(-1, 1) # auto-guess rows, one column
## array([[9099],
## [2537],
## [1832]])
```

Reshaping is not the same as matrix *transpose*, which also
changes the order of elements in the underlying array:

```
A # before
## array([[ 1, 2, 3],
## [ 4, 5, 6],
## [ 7, 8, 9],
## [10, 11, 12]])
A.T # transpose of A
## array([[ 1, 4, 7, 10],
## [ 2, 5, 8, 11],
## [ 3, 6, 9, 12]])
```

We see that the rows became columns and vice versa.
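
To make the difference between reshaping and transposing more tangible, we can compare the order in which the elements are enumerated by **ravel**. A small sketch (the names `R` and `T` are ours):

```python
import numpy as np

A = np.arange(1, 13).reshape(4, 3)
R = A.reshape(3, 4)  # the same order of elements, a different grid
T = A.T              # rows become columns

print(R.ravel())     # 1, 2, 3, ..., 12
print(T.ravel())     # 1, 4, 7, 10, 2, 5, ...
print(np.array_equal(R, T))  # False
```

Both results are 3-by-4 matrices, but only the transpose satisfies `T[j, i] == A[i, j]` for all `i` and `j`.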

Note

(*) Higher-dimensional arrays are also possible. For example,

```
np.arange(24).reshape(2, 4, 3)
## array([[[ 0, 1, 2],
## [ 3, 4, 5],
## [ 6, 7, 8],
## [ 9, 10, 11]],
##
## [[12, 13, 14],
## [15, 16, 17],
## [18, 19, 20],
## [21, 22, 23]]])
```

is an array of “depth” 2, “height” 4, and “width” 3; we can see it as two 4-by-3 matrices stacked together. Theoretically, they can be used for representing contingency tables for products of many factors. Still, in our application areas, we prefer to stick with long data frames instead; see Section 10.6.2. This is due to their more aesthetic display and better handling of sparse data.
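
A brief sketch of how such a three-dimensional array can be inspected (the name `B` is ours):

```python
import numpy as np

B = np.arange(24).reshape(2, 4, 3)

print(B.shape)     # (2, 4, 3)
print(B[0].shape)  # (4, 3): the first of the two stacked matrices
print(B[1, 0, 2])  # 14: second matrix, first row, third column
```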

## 7.3. Mathematical Notation

Here is some standalone mathematical notation that we shall
be employing in this course.
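A matrix with *n* rows and *m* columns
(an *n*-by-*m* matrix) \(\mathbf{X}\) can be written as:

\[
\mathbf{X} =
\left[
\begin{array}{cccc}
x_{1,1} & x_{1,2} & \cdots & x_{1,m} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \cdots & x_{n,m} \\
\end{array}
\right].
\]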

Mathematically, we denote this as \(\mathbf{X}\in\mathbb{R}^{n\times m}\). Looking at the above, if this makes us think of how data are displayed in spreadsheets, we are correct, because the latter was inspired by the former.

We see that
\(x_{i,j}\in\mathbb{R}\) denotes
the element in the \(i\)-th row (e.g., the \(i\)-th *observation*)
and the \(j\)-th column (e.g., the \(j\)-th *feature* or *variable*), for every
\(i=1,\dots,n\), \(j=1,\dots,m\).

In particular, if \(\mathbf{X}\) denoted the `body` dataset,
then \(x_{1,2}\) would be the height of the 1st person.
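
Let us keep in mind that mathematical indexes start at 1, whereas Python counts from 0. Hence, \(x_{i,j}\) corresponds to `X[i-1, j-1]`. A tiny sketch based on an excerpt from the first two rows of `body` (the name `X` is ours):

```python
import numpy as np

X = np.array([
    [97.1, 160.2, 34.7],
    [91.1, 152.7, 33.5],
])

i, j = 1, 2           # 1-based indexes, as in the mathematical notation
print(X[i-1, j-1])    # 160.2, the height of the 1st person
```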

Important

Matrices are a convenient means of representing many different kinds of data:

- *n* points in an *m*-dimensional space (like *n* observations for which there are *m* measurements/features recorded, where each row describes a different object; exactly the case of the `body` dataset above); this is the most common scenario;

- *m* time series sampled at *n* points in time (e.g., prices of *m* different currencies on *n* consecutive days; see Chapter 16);

- a single kind of measurement for data in *m* groups, each consisting of *n* subjects (e.g., heights of *n* males and *n* females); here, the order of elements in each column does not usually matter as observations are not *paired*; there is no relationship between \(x_{i,j}\) and \(x_{i,k}\) for \(j\neq k\); a matrix is used merely as a convenient container for storing a few unrelated vectors of identical sizes; we will be dealing with a more generic case of possibly nonhomogeneous groups in Chapter 12;

- two-way contingency tables (see Section 11.2.2), where an element \(x_{i,j}\) gives the number of occurrences of items at the \(i\)-th level of the first categorical variable and, at the same time, being at the \(j\)-th level of the second variable (e.g., blue-eyed *and* blonde-haired);

- graphs and other relationships between objects, e.g., \(x_{i,j}=0\) might denote that the \(i\)-th object is not connected[3] with the \(j\)-th one and \(x_{k,l}=0.2\) that there is a weak connection between \(k\) and \(l\) (e.g., who is a friend of whom, whether a user recommends a particular item);

- images, where \(x_{i,j}\) represents the intensity of a colour component (e.g., red, green, blue or shades of grey or hue, saturation, brightness; compare Section 16.4) of a pixel in the \((n-i+1)\)-th row and the \(j\)-th column.

Note

In practice, more complex and
less-structured data can quite often be mapped to a tabular form.
For instance, a set of audio recordings
can be described by measuring the overall loudness,
timbre, and danceability of each song.
Also, a collection of documents can be described
by means of the degrees of belongingness to some
automatically discovered topics
(e.g., someone said that Joyce’s *Ulysses* is
80% travel literature, 70% comedy,
and 50% heroic fantasy, but let us not take it for granted).

### 7.3.1. Row and Column Vectors

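Additionally, we will sometimes use the following notation to emphasise that \(\mathbf{X}\) consists of \(n\) rows:

\[
\mathbf{X} =
\left[
\begin{array}{c}
\mathbf{x}_{1,\cdot} \\
\mathbf{x}_{2,\cdot} \\
\vdots \\
\mathbf{x}_{n,\cdot} \\
\end{array}
\right].
\]

Here, \(\mathbf{x}_{i,\cdot}\) is a *row vector* of length \(m\),
i.e., a \((1\times m)\)-matrix:

\[
\mathbf{x}_{i,\cdot} =
\left[
\begin{array}{cccc}
x_{i,1} & x_{i,2} & \cdots & x_{i,m} \\
\end{array}
\right].
\]

Alternatively, we can specify the \(m\) columns:

\[
\mathbf{X} =
\left[
\begin{array}{cccc}
\mathbf{x}_{\cdot,1} & \mathbf{x}_{\cdot,2} & \cdots & \mathbf{x}_{\cdot,m} \\
\end{array}
\right],
\]

where \(\mathbf{x}_{\cdot,j}\) is a *column vector* of length \(n\),
i.e., an \((n\times 1)\)-matrix:

\[
\mathbf{x}_{\cdot,j} =
\left[
\begin{array}{cccc}
x_{1,j} & x_{2,j} & \cdots & x_{n,j} \\
\end{array}
\right]^T,
\]

where \(\cdot^T\) denotes the transpose of a given matrix (thanks to which we can save some vertical space; we do not want this book to be 1000 pages long, do we?).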

Also, recall that we are used to denoting
*vectors* of length \(m\) with \(\boldsymbol{x}=(x_1, \dots, x_m)\).
A vector is a one-dimensional array (not a two-dimensional one),
hence a slightly different font in the case where ambiguity
can be troublesome.
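
The distinction between vectors and row/column vectors is also visible in **numpy**. Here is a hedged sketch that relies on the **reshape** method discussed above (matrix indexing itself is covered in more detail later; the names `vec`, `row`, and `col` are ours):

```python
import numpy as np

X = np.array([
    [1,  2,  3,  4],
    [5,  6,  7,  8],
    [9, 10, 11, 12],
])

vec = X[1, :]                 # the 2nd row as an ordinary vector
row = vec.reshape(1, -1)      # the same as a row vector, i.e., a 1-by-m matrix
col = X[:, 2].reshape(-1, 1)  # the 3rd column as a column vector, n-by-1

print(vec.shape, row.shape, col.shape)  # (4,) (1, 4) (3, 1)
```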

### 7.3.2. Transpose

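The *transpose* of a matrix
\(\mathbf{X}\in\mathbb{R}^{n\times m}\)
is an \((m\times n)\)-matrix \(\mathbf{Y}\) given by:

\[
\mathbf{Y} = \mathbf{X}^T =
\left[
\begin{array}{cccc}
x_{1,1} & x_{2,1} & \cdots & x_{n,1} \\
x_{1,2} & x_{2,2} & \cdots & x_{n,2} \\
\vdots & \vdots & \ddots & \vdots \\
x_{1,m} & x_{2,m} & \cdots & x_{n,m} \\
\end{array}
\right],
\]

i.e., it enjoys \(y_{i,j}=x_{j, i}\).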

Compare the display of an example matrix `A` and its transpose `A.T` above.

### 7.3.3. Identity and Other Diagonal Matrices

\(\mathbf{I}\) denotes the *identity matrix*,
being a square \(n\times n\) matrix
(with \(n\) most often clear from the context)
with 0s everywhere except on the main diagonal, where 1s lie.

```
np.eye(5) # I
## array([[1., 0., 0., 0., 0.],
## [0., 1., 0., 0., 0.],
## [0., 0., 1., 0., 0.],
## [0., 0., 0., 1., 0.],
## [0., 0., 0., 0., 1.]])
```

The identity matrix is the neutral element of matrix multiplication (Section 8.3).

More generally, any diagonal matrix, \(\mathrm{diag}(a_1,\dots,a_n)\), can be constructed from a given sequence of elements by calling:

```
np.diag([1, 2, 3, 4])
## array([[1, 0, 0, 0],
## [0, 2, 0, 0],
## [0, 0, 3, 0],
## [0, 0, 0, 4]])
```
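
As a preview of what “neutral element” means, the sketch below multiplies an example matrix by identity matrices of appropriate sizes using the `@` operator (matrix multiplication is the topic of Section 8.3). It also shows that **numpy.diag** applied to a matrix extracts its main diagonal. The names `A`, `I2`, `I3`, and `D` are ours.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])  # a 3-by-2 matrix
I2 = np.eye(2)
I3 = np.eye(3)

print(np.array_equal(A @ I2, A))  # True
print(np.array_equal(I3 @ A, A))  # True

D = np.diag([1, 2, 3, 4])  # a vector on input: builds a diagonal matrix
print(np.diag(D))          # a matrix on input: extracts the main diagonal
```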

## 7.4. Visualising Multidimensional Data

Let us go back to our `body` dataset:

```
body[:6, :] # preview
## array([[ 97.1, 160.2, 34.7, 40.8, 35.8, 126.1, 117.9],
## [ 91.1, 152.7, 33.5, 33. , 38.5, 125.5, 103.1],
## [ 73. , 161.2, 37.4, 38. , 31.8, 106.2, 92. ],
## [ 61.7, 157.4, 38. , 34.7, 29. , 101. , 90.5],
## [ 55.4, 154.6, 34.6, 34. , 28.3, 92.5, 73.2],
## [ 62. , 144.7, 32.5, 34.2, 29.8, 106.7, 84.8]])
body.shape
## (4221, 7)
```

This is an example of tabular (“structured”) data. The important property is that the elements in each row describe the same person. We can freely reorder the elements in all the columns at the same time, i.e., permute whole rows (change the order of participants). Still, sorting a single column and leaving the other ones unchanged will be semantically invalid.
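
The following sketch, run on a four-row excerpt with weights and heights only, contrasts the two operations: permuting whole rows preserves the (weight, height) pairs, whereas sorting a single column mixes up measurements of different people. The names `X`, `Y`, and `Z` as well as the use of **numpy.random.default_rng** are our assumptions.

```python
import numpy as np

X = np.array([
    [97.1, 160.2],
    [91.1, 152.7],
    [73.0, 161.2],
    [61.7, 157.4],
])

rng = np.random.default_rng(123)
perm = rng.permutation(X.shape[0])  # a random ordering of the row indexes
Y = X[perm, :]                      # rows reordered as a whole: still valid

Z = X.copy()
Z[:, 0] = np.sort(Z[:, 0])          # only the first column sorted: invalid

same_pairs_Y = set(map(tuple, Y)) == set(map(tuple, X))
same_pairs_Z = set(map(tuple, Z)) == set(map(tuple, X))
print(same_pairs_Y, same_pairs_Z)   # True False
```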

Mathematically, we consider the above as
a set of 4221 points in a seven-dimensional space, \(\mathbb{R}^7\).
Let us discuss how we can try visualising different
natural *projections* thereof.

### 7.4.1. 2D Data

A *scatterplot* can be used to visualise one variable against another one.

```
import matplotlib.pyplot as plt  # assumed to have been imported like this earlier
plt.plot(body[:, 1], body[:, 3], "o", c="#00000022")
plt.xlabel(body_columns[1])
plt.ylabel(body_columns[3])
plt.show()
```

Figure 7.1 depicts
upper leg length (the y-axis) vs (versus; against; as a function of)
standing height (the x-axis) in the form
of a point cloud with \((x, y)\) coordinates like
`(body[i, 1], body[i, 3])`.

Here are the exact coordinates of the point corresponding to the person of the smallest height:

```
body[np.argmin(body[:, 1]), [1, 3]]
## array([131.1, 30.8])
```

and here is the one with the greatest upper leg length:

```
body[np.argmax(body[:, 3]), [1, 3]]
## array([168.9, 49.1])
```

Locate them in Figure 7.1.
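
To see how the above indexing scheme works, we can replay it on a small excerpt: the standing heights and upper leg lengths from the six rows of `body` previewed earlier (the names `X`, `i_min`, and `i_max` are ours). **numpy.argmin** and **numpy.argmax** return the *index* of the smallest and the greatest element, which we then use to select the corresponding row.

```python
import numpy as np

X = np.array([  # columns 1 and 3 of the first six rows of `body`
    [160.2, 40.8],
    [152.7, 33.0],
    [161.2, 38.0],
    [157.4, 34.7],
    [154.6, 34.0],
    [144.7, 34.2],
])

i_min = np.argmin(X[:, 0])  # index of the row with the smallest height
i_max = np.argmax(X[:, 1])  # index of the row with the greatest upper leg length

print(i_min, X[i_min, :])   # row index 5, i.e., the sixth row
print(i_max, X[i_max, :])   # row index 0, i.e., the first row
```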

As the points are abundant, normally we cannot easily see
*where* most of them are located.
To remedy this, we applied the simple trick
of plotting the points using a semi-transparent colour.
Here, the colour specifier was of the form `#rrggbbaa`,
giving the intensity of the red, green, blue, and alpha (opaqueness)
channels in four series of two hexadecimal digits (between `00` = 0
and `ff` = 255).
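
For instance, the specifier `#00000022` used above can be decoded by hand as follows (a plain Python sketch; the variable names are ours):

```python
colour = "#00000022"

# split "rrggbbaa" into four two-digit hexadecimal numbers
channels = [int(colour[k:k+2], 16) for k in range(1, 9, 2)]
red, green, blue, alpha = channels

print(channels)     # [0, 0, 0, 34]
print(alpha / 255)  # ca. 0.133
```

In other words, we get black points that are only about 13% opaque.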

Overall, the plot reveals that there is a *general tendency*
for small heights and small upper leg lengths to occur frequently together.
The same with larger pairs. In Chapter 9,
we explore some measures of correlation that will enable
us to quantify the degree of association
between variable pairs.

### 7.4.2. 3D Data and Beyond

If we have more than two variables to visualise, we might be tempted to use, e.g., a three-dimensional scatterplot like the one in Figure 7.2.

```
fig = plt.figure()
ax = fig.add_subplot(projection="3d", facecolor="#ffffff00")
ax.scatter(body[:, 1], body[:, 3], body[:, 0], color="#00000011")
ax.view_init(elev=30, azim=60, vertical_axis="y")
ax.set_xlabel(body_columns[1])
ax.set_ylabel(body_columns[3])
ax.set_zlabel(body_columns[0])
plt.show()
```

Infrequently will such a 3D plot provide us with readable results, though. We are projecting a three-dimensional reality onto a two-dimensional screen or page. Some information must inherently be lost. Also, what we see is relative to the position of the virtual camera.

(*)
Try finding an *interesting* elevation and azimuth angle
by playing with the arguments passed to the
**mpl_toolkits.mplot3d.axes3d.Axes3D.view_init** function.
Also, depict arm circumference, hip circumference, and weight on
a 3D plot.

Note

(*) Sometimes there might be facilities available to create an interactive scatterplot (running the above from the Python console enables this), where the virtual camera can be freely repositioned with a mouse/touchpad. This can give some more insight into our data. Also, there are means of creating animated sequences, where we can fly over the data scene. Some people find it cool, others find it annoying, but the biggest problem therewith is that they cannot be included in printed material. Yet, if we are only targeting the display for the Web (this includes mobile devices), we can try some Python libraries that output HTML+CSS+JavaScript code to be rendered by a browser engine.

Instead of drawing a 3D plot, it might be better to play with different marker colours (or sometimes sizes: think of them as bubbles). Suitable colour maps can be used to distinguish between low and high values of an additional variable, as in Figure 7.3.

```
plt.scatter(
    body[:, 4],     # x
    body[:, 5],     # y
    c=body[:, 0],   # "z" - colours
    cmap="copper",  # colour map, given by name (cm.get_cmap is
                    # unavailable in recent Matplotlib versions)
    alpha=0.5       # opaqueness level between 0 and 1
)
plt.xlabel(body_columns[4])
plt.ylabel(body_columns[5])
plt.axis("equal")
plt.rcParams["axes.grid"] = False
cbar = plt.colorbar()
plt.rcParams["axes.grid"] = True
cbar.set_label(body_columns[0])
plt.show()
```

We can see some tendency for the weight to be greater as both the arm and the hip circumferences increase.
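
The “bubbles” variant mentioned above can be sketched in a similar manner by passing marker sizes via the `s` argument of **matplotlib.pyplot.scatter**. The data below are randomly generated stand-ins (not the `body` dataset), and the rescaling of the third variable to sizes between 10 and 100 is an arbitrary choice of ours.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # a non-interactive backend; omit for on-screen display
import matplotlib.pyplot as plt

rng = np.random.default_rng(123)
x = rng.normal(30, 5, size=50)     # made-up "arm circumference"
y = rng.normal(100, 10, size=50)   # made-up "hip circumference"
z = rng.uniform(40, 120, size=50)  # made-up "weight"

# rescale z linearly to marker sizes in [10, 100]
sizes = 10 + 90 * (z - z.min()) / (z.max() - z.min())

sc = plt.scatter(x, y, s=sizes, alpha=0.5)
plt.xlabel("x")
plt.ylabel("y")
```

Calling `plt.show()` (with an interactive backend) or `plt.savefig(...)` afterwards displays or stores the figure.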

Play around with different colour palettes. However, be wary that approximately 1 in 12 men (8%) and 1 in 200 women (0.5%) have colour vision deficiencies, especially in the red-green or blue-yellow spectrum. For this reason, some diverging colour maps might be worse than others.

A piece of paper is two-dimensional: we only have height and width. Looking around us, we also understand the notion of depth. So far so good. But when it comes to higher-dimensional data, well, suffice it to say that we are three-dimensional creatures and any attempts at visualising them directly will simply not work, don’t even trip.

Luckily, this is where mathematics comes to our rescue.
With some more knowledge and intuitions, and this book helps us develop them,
it will be as easy[5] as imagining a generic *m*-dimensional space,
and then assuming that, say, *m=7* or *42*.

This is exactly why data science relies on automated methods for knowledge/pattern discovery. Thanks to them, we can identify, describe, and analyse the structures that might be present in the data, but cannot be perceived with our imperfect senses.

Note

Linear and nonlinear dimensionality reduction techniques
can be applied to visualise some aspects of high-dimensional data
in the form of 2D (or 3D) plots.
In particular, the principal component analysis (PCA) finds
an *interesting* angle from which looking at the data might be
worth considering; see Section 9.3.

### 7.4.3. Scatterplot Matrix (Pairplot)

We may also try depicting all pairs of selected variables in the form of a scatterplot matrix; see Figure 7.4.

```
def pairplot(X, labels, bins=21, alpha=0.1):
    """
    Draws a scatterplot matrix, given:
    * X - data matrix,
    * labels - list of column names
    """
    assert X.shape[1] == len(labels)
    k = X.shape[1]
    fig, axes = plt.subplots(nrows=k, ncols=k, sharex="col", sharey="row",
        figsize=(plt.rcParams["figure.figsize"][0], )*2)
    for i in range(k):
        for j in range(k):
            ax = axes[i, j]
            if i == j:  # diagonal
                ax.text(0.5, 0.5, labels[i], transform=ax.transAxes,
                    ha="center", va="center", size="x-small")
            else:
                ax.plot(X[:, j], X[:, i], ".", color="black", alpha=alpha)
```

And now:

```
which = [0, 1, 4, 5]
pairplot(body[:, which], body_columns[which])
plt.show()
```

Plotting variables against themselves is uninteresting (exercise: what would that be?), therefore we printed out the feature labels on the main diagonal.

A scatterplot matrix can be a valuable tool for identifying interesting combinations of columns in our datasets. We see that some pairs of variables are more “structured” than others, e.g., hip circumference and weight are more or less aligned on a straight line. This is why in Chapter 9 we will be interested in describing the possible relationships between the variables.

Create a pairplot where weight, arm circumference, and hip circumference are on the log-scale.

(*)
Call **seaborn.pairplot** to create a scatterplot matrix with
histograms on the main diagonal, thanks to which you will be able to
see what the *marginal distributions* look like.
Note that the matrix must, unfortunately, be converted to a
**pandas** data frame first.

## 7.5. Exercises

What is the difference between
`[1, 2, 3]`, `[[1, 2, 3]]`, and `[[1], [2], [3]]` in the context
of array creation?

If `A` is a matrix with 5 rows and 6 columns,
what is the difference between `A.reshape(6, 5)` and `A.T`?

If `A` is a matrix with 5 rows and 6 columns,
what is the meaning of:
`A.reshape(-1)`, `A.reshape(3, -1)`,
`A.reshape(-1, 3)`, `A.reshape(-1, -1)`,
`A.shape = (3, 10)`, and `A.shape = (-1, 3)`?

List some methods to add a new row and add a new column to an existing matrix.

Give some ways to visualise three-dimensional data.

How can we set point opaqueness/transparency when drawing a scatterplot? Why would we be interested in this?