7. Multidimensional numeric data at a glance#

This open-access textbook is, and will remain, freely available for everyone’s enjoyment (also in PDF; a paper copy can also be ordered). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out Deep R Programming [34] too.

From the perspective of structured datasets, a vector often represents \(n\) independent measurements of the same quantitative property, e.g., heights of \(n\) different patients, incomes in \(n\) randomly chosen households, or ages of \(n\) runners. More generally, these are all instances of a bag of \(n\) points on the real line. By far[1], we should have become quite fluent with the methods for processing such one-dimensional arrays.

Let us increase the level of complexity by permitting each of the \(n\) entities to be described by \(m\) features, for any \(m\ge 1\). In other words, we will be dealing with \(n\) points in an \(m\)-dimensional space, \(\mathbb{R}^m\).

We can arrange all the observations in a table with \(n\) rows and \(m\) columns (just like in spreadsheets). Such an object can be expressed with numpy as a two-dimensional array which we will refer to as matrices. Thanks to matrices, we can keep the \(n\) tuples of length \(m\) together in a single object and process them all at once (or \(m\) tuples of length \(n\), depending on how we want to look at them). Very convenient.

Important

Just like vectors, matrices were designed to store data of the same type. Chapter 10 will cover data frames, which further increase the degree of complexity (and freedom). They tolerate mixed data types, e.g., numerical and categorical (this will enable us to perform data analysis in subgroups more easily). Moreover, they let rows and columns be named.

Many data analysis algorithms convert data frames to matrices automatically and deal with them as such. From the computational side, it is numpy that does most of the “mathematical” work. pandas implements many recipes for basic data wrangling tasks, but we want to go way beyond that. After all, we would like to be able to tackle any problem.

7.1. Creating matrices#

7.1.1. Reading CSV files#

Tabular data are often stored and distributed in a very portable plain-text format called CSV (comma-separated values) or variants thereof. We can read them quite easily with numpy.genfromtxt (or later with pandas.read_csv).

body = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/nhanes_adult_female_bmx_2020.csv",
    delimiter=",")[1:, :]  # skip first row (column names)

Note that the file specifies column names (the first non-comment line; we suggest inspecting the above CSV file in a web browser). Therefore, we had to skip it manually (more on matrix indexing later). Here is a preview of the first few rows:

body[:6, :]  # six first rows, all columns
## array([[ 97.1, 160.2,  34.7,  40.8,  35.8, 126.1, 117.9],
##        [ 91.1, 152.7,  33.5,  33. ,  38.5, 125.5, 103.1],
##        [ 73. , 161.2,  37.4,  38. ,  31.8, 106.2,  92. ],
##        [ 61.7, 157.4,  38. ,  34.7,  29. , 101. ,  90.5],
##        [ 55.4, 154.6,  34.6,  34. ,  28.3,  92.5,  73.2],
##        [ 62. , 144.7,  32.5,  34.2,  29.8, 106.7,  84.8]])

This is an extended version of the National Health and Nutrition Examination Survey (NHANES), where the consecutive columns give the following body measurements of adult females:

body_columns = np.array([
    "weight (kg)",
    "standing height (cm)",   # we know `heights` from the previous chapters
    "upper arm len. (cm)",
    "upper leg len. (cm)",
    "arm circ. (cm)",
    "hip circ. (cm)",
    "waist circ. (cm)"
])

numpy matrices do not support column naming. This is why we noted them down separately. It is only a minor inconvenience. pandas data frames will have this capability, but from the algebraic side, they are not as convenient as matrices for the purpose of scientific computing.

What we are dealing with is still a numpy array:

type(body)  # class of this object
## <class 'numpy.ndarray'>

But this time it is a two-dimensional one:

body.ndim  # number of dimensions
## 2

which means that the shape slot is now a tuple of length 2:

body.shape
## (4221, 7)

The above gave the total number of rows and columns, respectively.

7.1.2. Enumerating elements#

numpy.array can create a two-dimensional array based on a list of lists or vector-like objects, all of the same lengths. Each of them will constitute a separate row of the resulting matrix. For example:

np.array([  # list of lists
    [ 1,  2,  3,  4 ],  # the first row
    [ 5,  6,  7,  8 ],  # the second row
    [ 9, 10, 11, 12 ]   # the third row
])
## array([[ 1,  2,  3,  4],
##        [ 5,  6,  7,  8],
##        [ 9, 10, 11, 12]])

It gave a \(3\times 4\) (3-by-4) matrix.

np.array([ [1], [2], [3] ])
## array([[1],
##        [2],
##        [3]])

It yielded a \(3\times 1\) one; we call a column vector, but it is a special matrix (we will soon learn that shapes can make a significant difference). Moreover:

np.array([ [1, 2, 3, 4] ])
## array([[1, 2, 3, 4]])

It produced a \(1\times 4\) array (a row vector).

Note

An ordinary vector (a unidimensional array) only uses a single pair of square brackets:

np.array([1, 2, 3, 4])
## array([1, 2, 3, 4])

7.1.3. Repeating arrays#

The previously mentioned numpy.tile and numpy.repeat can also generate some nice matrices. For instance:

np.repeat([[1, 2, 3, 4]], 3, axis=0)
## array([[1, 2, 3, 4],
##        [1, 2, 3, 4],
##        [1, 2, 3, 4]])

repeats a row vector rowwisely (i.e., over axis 0 – the first one).

Replicating a column vector columnwisely (i.e., over axis 1 – the second one) is possible as well:

np.repeat([[1], [2], [3]], 4, axis=1)
## array([[1, 1, 1, 1],
##        [2, 2, 2, 2],
##        [3, 3, 3, 3]])
Exercise 7.1

How can we generate matrices of the following kinds?

\[\left[ \begin{array}{cc} 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 3 & 4 \\ 3 & 4 \\ 3 & 4 \\ 3 & 4 \\ \end{array} \right], \qquad \left[ \begin{array}{cccccc} 1 & 2 & 1 & 2& 1& 2 \\ 1 & 2 & 1 & 2& 1& 2 \\ \end{array} \right], \qquad \left[ \begin{array}{ccccc} 1& 1& 2& 2 & 2 \\ 3& 3& 4& 4 & 4 \\ \end{array} \right].\]

7.1.4. Stacking arrays#

numpy.column_stack and numpy.row_stack take a tuple of array-like objects and bind them column- or rowwisely to form a new matrix:

np.column_stack(([10, 20], [30, 40], [50, 60]))  # a tuple of lists
## array([[10, 30, 50],
##        [20, 40, 60]])
np.row_stack(([10, 20], [30, 40], [50, 60]))
## array([[10, 20],
##        [30, 40],
##        [50, 60]])
np.column_stack((
    np.row_stack(([10, 20], [30, 40], [50, 60])),
    [70, 80, 90]
))
## array([[10, 20, 70],
##        [30, 40, 80],
##        [50, 60, 90]])
Exercise 7.2

Perform similar operations using numpy.append, numpy.vstack, numpy.hstack, numpy.concatenate, and (*) numpy.c_. Are they worth taking note of, or are they redundant?

Exercise 7.3

Using numpy.insert, add a new row/column at the beginning, end, and in the middle of an array. Let us stress that this function returns a new array.

7.1.5. Other functions#

Many built-in functions generate arrays of arbitrary shapes (not only vectors). For example:

np.random.seed(123)
np.random.rand(2, 5)  # not: rand((2, 5))
## array([[0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897],
##        [0.42310646, 0.9807642 , 0.68482974, 0.4809319 , 0.39211752]])

The same with scipy:

scipy.stats.uniform.rvs(0, 1, size=(2, 5), random_state=123)
## array([[0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897],
##        [0.42310646, 0.9807642 , 0.68482974, 0.4809319 , 0.39211752]])

The way we specify the output shapes might differ across functions and packages. Consequently, as usual, it is always best to refer to their documentation.

Exercise 7.4

Check out the documentation of the following functions: numpy.eye, numpy.diag, numpy.zeros, numpy.ones, and numpy.empty.

7.2. Reshaping matrices#

Let us take an example \(3\times 4\) matrix:

A = np.array([
    [ 1,  2,  3,  4 ],
    [ 5,  6,  7,  8 ],
    [ 9, 10, 11, 12 ]
])

Internally, a matrix is represented using a long flat vector where elements are stored in the row-major[2] order:

A.size  # total number of elements
## 12
A.ravel()  # the underlying array
## array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

It is the shape slot that is causing the 12 elements to be treated as if they were arranged on a \(3\times 4\) grid, for example in different algebraic computations and during the printing thereof. This arrangement can be altered anytime without modifying the underlying array:

A.shape = (4, 3)
A
## array([[ 1,  2,  3],
##        [ 4,  5,  6],
##        [ 7,  8,  9],
##        [10, 11, 12]])

This way, we obtained a different view of the same data.

For convenience, there is also the reshape method that returns a modified version of the object it is applied on:

A.reshape(-1, 6)  # A.reshape(don't make me compute this for you mate!, 6)
## array([[ 1,  2,  3,  4,  5,  6],
##        [ 7,  8,  9, 10, 11, 12]])

Here, “-1” means that numpy must deduce by itself how many rows we want in the result. Twelve elements are supposed to be arranged in six columns, so the maths behind it is not rocket science.

Thanks to this, generating row or column vectors is straightforward:

np.linspace(0, 1, 5).reshape(1, -1)  # one row, guess the number of columns
## array([[0.  , 0.25, 0.5 , 0.75, 1.  ]])
np.array([9099, 2537, 1832]).reshape(-1, 1)  # one column, guess row count
## array([[9099],
##        [2537],
##        [1832]])

Reshaping is not the same as matrix transpose, which also changes the order of elements in the underlying array:

A  # before
## array([[ 1,  2,  3],
##        [ 4,  5,  6],
##        [ 7,  8,  9],
##        [10, 11, 12]])
A.T  # transpose of A
## array([[ 1,  4,  7, 10],
##        [ 2,  5,  8, 11],
##        [ 3,  6,  9, 12]])

Rows became columns and vice versa.

Note

(*) Higher-dimensional arrays are also possible. For example:

np.arange(24).reshape(2, 4, 3)
## array([[[ 0,  1,  2],
##         [ 3,  4,  5],
##         [ 6,  7,  8],
##         [ 9, 10, 11]],
## 
##        [[12, 13, 14],
##         [15, 16, 17],
##         [18, 19, 20],
##         [21, 22, 23]]])

Is an array of “depth” 2, “height” 4, and “width” 3; we can see it as two \(4\times 3\) matrices stacked together. Theoretically, they can be used for representing contingency tables for products of many factors. Still, in our application areas, we prefer to stick with long data frames instead; see Section 10.6.2. This is due to their more aesthetic display and better handling of sparse data.

7.3. Mathematical notation#

Here is some standalone mathematical notation that we shall be employing in this course. A matrix with \(n\) rows and \(m\) columns (an \(n\times m\) matrix) \(\mathbf{X}\) can be written as:

\[ \mathbf{X}= \left[ \begin{array}{cccc} x_{1,1} & x_{1,2} & \cdots & x_{1,m} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,m} \\ \end{array} \right]. \]

Mathematically, we denote this as \(\mathbf{X}\in\mathbb{R}^{n\times m}\). Spreadsheets use a similar display of data.

We see that \(x_{i,j}\in\mathbb{R}\) denotes the element in the \(i\)-th row (e.g., the \(i\)-th observation or case) and the \(j\)-th column (e.g., the \(j\)-th feature or variable), for every \(i=1,\dots,n\), \(j=1,\dots,m\).

In particular, if \(\mathbf{X}\) denoted the body dataset, then \(x_{1,2}\) would be the height of the first person.

Important

Matrices are a convenient means of representing many different kinds of data:

  • \(n\) points in an \(m\)-dimensional space (like \(n\) observations for which there are \(m\) measurements/features recorded, where each row describes a different object; exactly the case of the body dataset above) – this is the most common scenario;

  • \(m\) time series sampled at \(n\) points in time (e.g., prices of \(m\) different currencies on \(n\) consecutive days; see Chapter 16);

  • a single kind of measurement for data in \(m\) groups, each consisting of \(n\) subjects (e.g., heights of \(n\) males and \(n\) females); here, the order of elements in each column does not usually matter as observations are not paired; there is no relationship between \(x_{i,j}\) and \(x_{i,k}\) for \(j\neq k\); a matrix is used merely as a convenient container for storing a few unrelated vectors of identical sizes; we will be dealing with a more generic case of possibly nonhomogeneous groups in Chapter 12;

  • two-way contingency tables (see Section 11.2.2), where an element \(x_{i,j}\) gives the number of occurrences of items at the \(i\)-th level of the first categorical variable and, at the same time, being at the \(j\)-th level of the second variable (e.g., blue-eyed and blonde-haired);

  • graphs and other relationships between objects, e.g., \(x_{i,j}=0\) might denote that the \(i\)-th object is not connected[3] with the \(j\)-th one and \(x_{k,l}=0.2\) that there is a weak connection between \(k\) and \(l\) (e.g., who is a friend of whom, whether a user recommends a particular item);

  • images, where \(x_{i,j}\) represents the intensity of a colour component (e.g., red, green, blue or shades of grey or hue, saturation, brightness; compare Section 16.4) of a pixel in the \((n-i+1)\)-th row and the \(j\)-th column.

Note

In practice, more complex and less-structured data can quite often be mapped to a tabular form. For instance, a set of audio recordings can be described by measuring the overall loudness, timbre, and danceability of each song. Also, a collection of documents can be described by means of the degrees of belongingness to some automatically discovered topics (e.g., someone said that Joyce’s Ulysses is 80% travel literature, 70% comedy, and 50% heroic fantasy, but let us not take it for granted).

7.3.1. Row and column vectors#

Additionally, will sometimes use the following notation to emphasise that \(\mathbf{X}\) consists of \(n\) rows:

\[ \mathbf{X} = \left[ \begin{array}{c} \mathbf{x}_{1,\cdot} \\ \mathbf{x}_{2,\cdot} \\ \vdots\\ \mathbf{x}_{n,\cdot} \\ \end{array} \right]. \]

Here, \(\mathbf{x}_{i,\cdot}\) is a row vector of length \(m\), i.e., a \((1\times m)\)-matrix:

\[ \mathbf{x}_{i,\cdot} = \left[ \begin{array}{cccc} x_{i,1} & x_{i,2} & \cdots & x_{i,m} \\ \end{array} \right]. \]

Alternatively, we can specify the \(m\) columns:

\[ \mathbf{X} = \left[ \begin{array}{cccc} \mathbf{x}_{\cdot,1} & \mathbf{x}_{\cdot,2} & \cdots & \mathbf{x}_{\cdot,m} \\ \end{array} \right], \]

where \(\mathbf{x}_{\cdot,j}\) is a column vector of length \(n\), i.e., an \((n\times 1)\)-matrix:

\[ \mathbf{x}_{\cdot,j} = \left[ \begin{array}{cccc} x_{1,j} & x_{2,j} & \cdots & x_{n,j} \\ \end{array} \right]^T=\left[ \begin{array}{c} {x}_{1,j} \\ {x}_{2,j} \\ \vdots\\ {x}_{n,j} \\ \end{array} \right], \]

where \(\cdot^T\) denotes the transpose of a given matrix (thanks to which we can save some vertical space, we do not want this book to be 1000 pages long, do we?).

Also, recall that we are used to denoting vectors of length \(m\) with \(\boldsymbol{x}=(x_1, \dots, x_m)\). A vector is a one-dimensional array (not a two-dimensional one), hence a slightly different font in the case where ambiguity can be troublesome.

Note

To avoid notation clutter, we will often be implicitly promoting vectors like \(\boldsymbol{x}=(x_1,\dots,x_m)\) to row vectors \(\mathbf{x}=[x_1\,\cdots\,x_m]\). This is the behaviour that numpy[4] uses; see Chapter 8.

7.3.2. Transpose#

The transpose of a matrix \(\mathbf{X}\in\mathbb{R}^{n\times m}\) is an \((m\times n)\)-matrix \(\mathbf{Y}\) given by:

\[ \mathbf{Y}= \mathbf{X}^T= \left[ \begin{array}{cccc} x_{1,1} & x_{2,1} & \cdots & x_{m,1} \\ x_{1,2} & x_{2,2} & \cdots & x_{m,2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1,n} & x_{2,n} & \cdots & x_{m,n} \\ \end{array} \right], \]

i.e., it enjoys \(y_{i,j}=x_{j, i}\).

Exercise 7.5

Compare the display of an example matrix A and its transpose A.T above.

7.3.3. Identity and other diagonal matrices#

\(\mathbf{I}\) denotes the identity matrix, being a square \(n\times n\) matrix (with \(n\) most often clear from the context) with 0s everywhere except on the main diagonal, where 1s lie.

np.eye(5)  # I
## array([[1., 0., 0., 0., 0.],
##        [0., 1., 0., 0., 0.],
##        [0., 0., 1., 0., 0.],
##        [0., 0., 0., 1., 0.],
##        [0., 0., 0., 0., 1.]])

The identity matrix is a neutral element of the matrix multiplication (Section 8.3).

More generally, any diagonal matrix, \(\mathrm{diag}(a_1,\dots,a_n)\), can be constructed from a given sequence of elements by calling:

np.diag([1, 2, 3, 4])
## array([[1, 0, 0, 0],
##        [0, 2, 0, 0],
##        [0, 0, 3, 0],
##        [0, 0, 0, 4]])

7.4. Visualising multidimensional data#

Let us go back to our body dataset:

body[:6, :]  # preview
## array([[ 97.1, 160.2,  34.7,  40.8,  35.8, 126.1, 117.9],
##        [ 91.1, 152.7,  33.5,  33. ,  38.5, 125.5, 103.1],
##        [ 73. , 161.2,  37.4,  38. ,  31.8, 106.2,  92. ],
##        [ 61.7, 157.4,  38. ,  34.7,  29. , 101. ,  90.5],
##        [ 55.4, 154.6,  34.6,  34. ,  28.3,  92.5,  73.2],
##        [ 62. , 144.7,  32.5,  34.2,  29.8, 106.7,  84.8]])
body.shape
## (4221, 7)

This is an example of tabular (“structured”) data. The important property is that the elements in each row describe the same person. We can freely reorder all the columns at the same time (change the order of participants), and this dataset will make the same sense (but sorting a single column and leaving others unchanged will be semantically invalid).

Mathematically, we consider the above as a set of 4221 points in a seven-dimensional space, \(\mathbb{R}^7\). Let us discuss how we can try visualising different natural projections thereof.

7.4.1. 2D Data#

A scatter plot visualises one variable against another one.

plt.plot(body[:, 1], body[:, 3], "o", c="#00000022")
plt.xlabel(body_columns[1])
plt.ylabel(body_columns[3])
plt.show()
../_images/body-scatter-1-3-1.png

Figure 7.1 An example scatter plot.#

Figure 7.1 depicts upper leg length (the y-axis) vs (versus; against; as a function of) standing height (the x-axis) in the form of a point cloud with \((x, y)\) coordinates like (body[i, 1], body[i, 3]).

Example 7.6

Here are the exact coordinates of the point corresponding to the person of the smallest height:

body[np.argmin(body[:, 1]), [1, 3]]
## array([131.1,  30.8])

and here is the one with the greatest upper leg length:

body[np.argmax(body[:, 3]), [1, 3]]
## array([168.9,  49.1])

Locate them in Figure 7.1.

As the points are abundant, normally we cannot easily see where most of them are located. As a remedy, we applied the simple trick of plotting the points using a semi-transparent colour. This gave a kind of a density estimate. Here, the colour specifier was of the form #rrggbbaa, giving the intensity of the red, green, blue, and alpha (opaqueness) channel in four series of two hexadecimal digits (between 00 = 0 and ff = 255).

Overall, the plot reveals that there is a general tendency for small heights and small upper leg lengths to occur frequently together. The same with larger pairs. As there is some natural variability, this of course does not mean that always the taller the person, the longer her legs. They just seem to be longer on average. For example, looking at people of height roughly equal to 160 cm, their upper leg length can be anywhere between 25 ad 50 cm (range). But we expect the majority to lie somewhere between 35 and 40 cm.

Chapter 9 will explore two measures of correlation that will enable us to quantify the degree of association between variable pairs.

7.4.2. 3D Data and beyond#

If we have more than two variables to visualise, we might be tempted to use, e.g., a three-dimensional scatter plot like the one in Figure 7.2.

fig = plt.figure()
ax = fig.add_subplot(projection="3d", facecolor="#ffffff00")
ax.scatter(body[:, 1], body[:, 3], body[:, 0], color="#00000011")
ax.view_init(elev=30, azim=60, vertical_axis="y")
ax.set_xlabel(body_columns[1])
ax.set_ylabel(body_columns[3])
ax.set_zlabel(body_columns[0])
plt.show()
../_images/body-scatter-3d-3.png

Figure 7.2 A three-dimensional scatter plot reveals almost nothing.#

Infrequently will such a 3D plot provide us with readable results, though. We are projecting a three-dimensional reality onto a two-dimensional screen or page. Some information must inherently be lost. Also, what we see is relative to the position of the virtual camera.

Exercise 7.7

(*) Try finding an interesting elevation and azimuth angle by playing with the arguments passed to the mpl_toolkits.mplot3d.axes3d.Axes3D.view_init function. Also, depict arm circumference, hip circumference, and weight on a 3D plot.

Note

(*) Sometimes there might be facilities available to create an interactive scatter plot (running the above from the Python’s console enables this), where the virtual camera can be freely repositioned with a mouse/touch pad. This can give some more insight into our data. Also, there are means of creating animated sequences, where we can fly over the data scene. Some people find it cool, others find it annoying, but the biggest problem therewith is that they cannot be included in printed material. Yet, if we are only targeting the display for the Web (this includes mobile devices), we can try some Python libraries that output HTML+CSS+JavaScript code to be rendered by a browser engine.

Example 7.8

Instead of drawing a 3D plot, it might be better to play with different marker colours (or sometimes sizes: think of them as bubbles). Suitable colour maps can distinguish between low and high values of an additional variable, as in Figure 7.3.

plt.scatter(
    body[:, 4],    # x
    body[:, 5],    # y
    c=body[:, 0],  # "z" - colours
    cmap=plt.colormaps.get_cmap("copper"),  # colour map
    alpha=0.5  # opaqueness level between 0 and 1
)
plt.xlabel(body_columns[4])
plt.ylabel(body_columns[5])
plt.axis("equal")
plt.rcParams["axes.grid"] = False
cbar = plt.colorbar()
plt.rcParams["axes.grid"] = True
cbar.set_label(body_columns[0])
plt.show()
../_images/scatter-3d-colour-5.png

Figure 7.3 A two-dimensional scatter plot displaying three variables.#

We can see some tendency for the weight to be greater as both the arm and the hip circumferences increase.

Exercise 7.9

Play around with different colour palettes. However, be wary that every 1 in 12 men (8%) and 1 in 200 women (0.5%) have colour vision deficiencies, especially in the red-green or blue-yellow spectrum. For this reason, some diverging colour maps might be worse than others.

A piece of paper is two-dimensional. We only have height and width. Looking around us, we also understand the notion of depth. So far so good. But when the case of more-dimensional data is concerned, well, suffice it to say that we are three-dimensional creatures and any attempts towards visualising them will simply not work, don’t even trip.

Luckily, this is where mathematics comes to our rescue. With some more knowledge and intuitions, and this book helps us develop them, it will be as easy[5] as imagining a generic \(m\)-dimensional space, and then assuming that, say, \(m=7\) or \(42\).

This is exactly why data science relies on automated methods for knowledge/pattern discovery. Thanks to them, we can identify, describe, and analyse the structures that might be present in the data, but cannot be perceived with our imperfect senses.

Note

Linear and nonlinear dimensionality reduction techniques can be applied to visualise some aspects of high-dimensional data in the form of 2D (or 3D) plots. In particular, the principal component analysis (PCA) finds an interesting angle from which looking at the data might be worth considering; see Section 9.3.

7.4.3. Scatter plot matrix (pairs plot)#

We may also try depicting all pairs of selected variables in the form of a scatter plot matrix; see Figure 7.4.

def pairplot(X, labels, bins=21, alpha=0.1):
    """
    Draws a scatter plot matrix, given:
    * X - data matrix,
    * labels - list of column names
    """
    assert X.shape[1] == len(labels)

    k = X.shape[1]
    fig, axes = plt.subplots(nrows=k, ncols=k, sharex="col", sharey="row",
        figsize=(plt.rcParams["figure.figsize"][0], )*2)
    for i in range(k):
        for j in range(k):
            ax = axes[i, j]
            if i == j:  # diagonal
                ax.text(0.5, 0.5, labels[i], transform=ax.transAxes,
                    ha="center", va="center", size="x-small")
            else:
                ax.plot(X[:, j], X[:, i], ".", color="black", alpha=alpha)

And now:

which = [0, 1, 4, 5]
pairplot(body[:, which], body_columns[which])
plt.show()
../_images/body-pairplot-7.png

Figure 7.4 The scatter plot matrix for selected columns in the body dataset.#

Plotting variables against themselves is uninteresting (exercise: what would that be?). Therefore, on the main diagonal, we printed out the variable names.

A scatter plot matrix can be a valuable tool for identifying noteworthy combinations of columns in our datasets. We see that some pairs of variables are more “structured” than others, e.g., hip circumference and weight are more or less aligned on a straight line. This is why Chapter 9 will be interested in describing the possible relationships between the variables.

Exercise 7.10

Create a pairs plot where weight, arm circumference, and hip circumference are on the log-scale.

Exercise 7.11

(*) Call seaborn.pairplot to create a scatter plot matrix with histograms on the main diagonal, thanks to which you will be able to see how the marginal distributions are distributed. Note that the matrix must, unfortunately, be converted to a pandas data frame first.

Exercise 7.12

(**) Modify our pairplot function so that it displays the histograms of the marginal distributions on the main diagonal.

7.5. Exercises#

Exercise 7.13

What is the difference between [1, 2, 3], [[1, 2, 3]], and [[1], [2], [3]] in the context of array creation?

Exercise 7.14

If A is a matrix with 5 rows and 6 columns, what is the difference between A.reshape(6, 5) and A.T?

Exercise 7.15

If A is a matrix with 5 rows and 6 columns, what is the meaning of: A.reshape(-1), A.reshape(3, -1), A.reshape(-1, 3), A.reshape(-1, -1), A.shape = (3, 10), and A.shape = (-1, 3)?

Exercise 7.16

List some methods to add a new row and add a new column to an existing matrix.

Exercise 7.17

Give some ways to visualise three-dimensional data.

Exercise 7.18

How can we set point opaqueness/transparency when drawing a scatter plot? Why would we be interested in this?