7. Multidimensional Numeric Data at a Glance

The online version of the open-access textbook Minimalist Data Wrangling with Python by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). Any bug/typo reports and fixes are appreciated. Although available online, this is a whole course; it should be read from the beginning to the end. In particular, refer to the Preface for general introductory remarks.

From the perspective of structured datasets, a vector often represents n independent measurements of the same quantitative property, e.g., heights of n different patients, incomes in n randomly chosen households, or ages of n runners. More generally, these are all instances of a bag of n points on the real line. By now[1], we should have become quite fluent with the methods for processing such one-dimensional arrays.

Let us increase the level of complexity by allowing each of the n entities to be described by possibly more than one feature: say, m of them. In other words, we will be dealing with n points in an m-dimensional space, \(\mathbb{R}^m\).

We can arrange all the observations in a table with n rows and m columns (just like in spreadsheets). Such an object can be expressed with numpy as a two-dimensional array, which we will refer to as a matrix. Thanks to matrices, we can keep the n tuples of length m together in a single object and process them all at once (or m tuples of length n, depending on how we want to look at them). Very convenient.

Important

Just like vectors, matrices were designed to store data of the same type. In Chapter 10 we will cover data frames, which further increase the degree of complexity (and freedom) by not only allowing for mixed data types (e.g., numerical and categorical; this will enable us to perform data analysis in subgroups more easily) but also for the rows and columns to be named.

Many data analysis algorithms convert data frames to matrices automatically and deal with them as such. From the computational side, it is numpy that does most of the “mathematical” work. pandas implements many recipes for basic data wrangling tasks, but we want to go way beyond that. After all, we would like to be able to tackle any problem.

7.1. Creating Matrices

7.1.1. Reading CSV Files

Tabular data are often stored and distributed in a very portable plain-text format called CSV (comma-separated values) or variants thereof.

numpy.loadtxt supports them quite well as long as they do not feature column names (comment lines are accepted, though). Unfortunately, for most CSV files the opposite is the case, and hence we suggest relying on a corresponding function from the pandas package:

import numpy as np
import pandas as pd
body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/nhanes_adult_female_bmx_2020.csv",
    comment="#")
body = np.array(body)  # convert to matrix

Notice that we have converted the data frame to a matrix by calling the numpy.array function. Here is a preview of the first few rows:

body[:6, :]  # six first rows, all columns
## array([[ 97.1, 160.2,  34.7,  40.8,  35.8, 126.1, 117.9],
##        [ 91.1, 152.7,  33.5,  33. ,  38.5, 125.5, 103.1],
##        [ 73. , 161.2,  37.4,  38. ,  31.8, 106.2,  92. ],
##        [ 61.7, 157.4,  38. ,  34.7,  29. , 101. ,  90.5],
##        [ 55.4, 154.6,  34.6,  34. ,  28.3,  92.5,  73.2],
##        [ 62. , 144.7,  32.5,  34.2,  29.8, 106.7,  84.8]])

This is an extended version of the National Health and Nutrition Examination Survey (NHANES), where the consecutive columns give the following body measurements of adult females:

body_columns = np.array([
    "weight (kg)",
    "standing height (cm)",
    "upper arm length (cm)",
    "upper leg length (cm)",
    "arm circumference (cm)",
    "hip circumference (cm)",
    "waist circumference (cm)"
])

numpy matrices do not support column naming. This is why we noted them down separately. It is only a minor inconvenience. pandas data frames will have this capability, but from the algebraic side, they are not as convenient as matrices for the purpose of scientific computing.

What we are dealing with is still a numpy array:

type(body)  # class of this object
## <class 'numpy.ndarray'>

however, this time a two-dimensional one:

body.ndim  # number of dimensions
## 2

which means that the shape slot is now a tuple of length 2:

body.shape
## (4221, 7)

The above gave the total number of rows and columns, respectively.
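To make the difference between the two readers more tangible, here is a minimal sketch that does not require network access. The in-memory text `csv_text` is a hypothetical stand-in for a CSV file with a header line; the values are borrowed from the preview above.

```python
import io
import numpy as np
import pandas as pd

# hypothetical stand-in for a CSV file that starts with column names
csv_text = "weight,height\n97.1,160.2\n91.1,152.7\n73.0,161.2\n"

# pandas recognises the header line by itself; np.array gives a matrix
X_pd = np.array(pd.read_csv(io.StringIO(csv_text), comment="#"))

# numpy.loadtxt must be told explicitly to skip the header line
X_np = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

print(X_pd.shape, X_np.shape)
## (3, 2) (3, 2)
print(np.allclose(X_pd, X_np))
## True
```

Both routes lead to the same 3-by-2 matrix, but only the pandas one gives us access to the column names before the conversion.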

7.1.2. Enumerating Elements

numpy.array can create a two-dimensional array based on a list of lists or vector-like objects, all of identical lengths. Each of them will constitute a separate row of the resulting matrix.

For example:

np.array([  # list of lists
    [ 1,  2,  3,  4 ],  # the 1st row
    [ 5,  6,  7,  8 ],  # the 2nd row
    [ 9, 10, 11, 12 ]   # the 3rd row
])
## array([[ 1,  2,  3,  4],
##        [ 5,  6,  7,  8],
##        [ 9, 10, 11, 12]])

gives a 3-by-4 (3×4) matrix,

np.array([ [1], [2], [3] ])
## array([[1],
##        [2],
##        [3]])

yields a 3-by-1 one (we call it a column vector, but it is a special matrix — we will soon learn that shapes can make a significant difference), and

np.array([ [1, 2, 3, 4] ])
## array([[1, 2, 3, 4]])

produces a 1-by-4 array (a row vector).

Note

An ordinary vector (a 1-dimensional array) only uses a single pair of square brackets:

np.array([1, 2, 3, 4])
## array([1, 2, 3, 4])
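The number of nested brackets determines the dimensionality of the result. A quick way to check what we have created is to inspect the ndim and shape slots, as in this small sketch (the variable names are ours):

```python
import numpy as np

v = np.array([1, 2, 3, 4])          # vector: a single pair of brackets
r = np.array([[1, 2, 3, 4]])        # row vector: a 1-by-4 matrix
c = np.array([[1], [2], [3], [4]])  # column vector: a 4-by-1 matrix

print(v.ndim, v.shape)
## 1 (4,)
print(r.ndim, r.shape)
## 2 (1, 4)
print(c.ndim, c.shape)
## 2 (4, 1)
```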

7.1.3. Repeating Arrays

The previously mentioned numpy.tile and numpy.repeat can also generate some nice matrices. For instance,

np.repeat([[1, 2, 3, 4]], 3, axis=0)
## array([[1, 2, 3, 4],
##        [1, 2, 3, 4],
##        [1, 2, 3, 4]])

repeats a row vector rowwisely (i.e., over axis 0 – the first one).

Replicating a column vector columnwisely (i.e., over axis 1 – the second one) is possible as well:

np.repeat([[1], [2], [3]], 4, axis=1)
## array([[1, 1, 1, 1],
##        [2, 2, 2, 2],
##        [3, 3, 3, 3]])
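
numpy.tile, on the other hand, replicates a whole array like a floor tile. In the sketch below, the second argument is assumed to be a tuple giving the number of copies along each axis; the matrix B is only an illustration.

```python
import numpy as np

B = np.array([[1, 2], [3, 4]])
T = np.tile(B, (2, 3))  # 2 copies down the rows, 3 copies across the columns
print(T)
## [[1 2 1 2 1 2]
##  [3 4 3 4 3 4]
##  [1 2 1 2 1 2]
##  [3 4 3 4 3 4]]
```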

Exercise 7.1

How to generate matrices of the following kinds?

\[\begin{split}\left[ \begin{array}{cc} 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 3 & 4 \\ 3 & 4 \\ 3 & 4 \\ 3 & 4 \\ \end{array} \right], \qquad \left[ \begin{array}{cccccc} 1 & 2 & 1 & 2& 1& 2 \\ 1 & 2 & 1 & 2& 1& 2 \\ \end{array} \right], \qquad \left[ \begin{array}{ccccc} 1& 1& 2& 2 & 2 \\ 3& 3& 4& 4 & 4 \\ \end{array} \right].\end{split}\]

7.1.4. Stacking Arrays

numpy.column_stack and numpy.row_stack take a tuple of array-like objects and bind them column- or rowwisely to form a new matrix (note that numpy.row_stack is an alias of numpy.vstack and has been deprecated since NumPy 2.0):

np.column_stack(([10, 20], [30, 40], [50, 60]))  # a tuple of lists
## array([[10, 30, 50],
##        [20, 40, 60]])
np.row_stack(([10, 20], [30, 40], [50, 60]))
## array([[10, 20],
##        [30, 40],
##        [50, 60]])
np.column_stack((
    np.row_stack(([10, 20], [30, 40], [50, 60])),
    [70, 80, 90]
))
## array([[10, 20, 70],
##        [30, 40, 80],
##        [50, 60, 90]])

Exercise 7.2

Perform similar operations using numpy.append, numpy.vstack, numpy.hstack, numpy.concatenate, and (*) numpy.c_.

Exercise 7.3

Using numpy.insert, add a new row/column at the beginning, end, and in the middle of an array. Let us stress that this function returns a new array.

7.1.5. Other Functions

Many built-in functions allow for generating arrays of arbitrary shapes (not only vectors). For example:

np.random.seed(123)
np.random.rand(2, 5)  # not: rand((2, 5))
## array([[0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897],
##        [0.42310646, 0.9807642 , 0.68482974, 0.4809319 , 0.39211752]])

The same with scipy:

import scipy.stats
scipy.stats.uniform.rvs(0, 1, size=(2, 5), random_state=123)
## array([[0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897],
##        [0.42310646, 0.9807642 , 0.68482974, 0.4809319 , 0.39211752]])

The way we specify the output shapes might differ across functions and packages. Therefore, as usual, it is always best to refer to their documentation.
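As an illustration of such differences, the following sketch requests a 2-by-5 array in three ways. Whether the dimensions are passed as separate arguments or as a single tuple depends on the function at hand.

```python
import numpy as np

np.random.seed(123)
U1 = np.random.rand(2, 5)         # dimensions as separate arguments
Z = np.zeros((2, 5))              # shape as a single tuple
rng = np.random.default_rng(123)  # the newer random generator interface
U2 = rng.random((2, 5))           # a single tuple again
print(U1.shape, Z.shape, U2.shape)
## (2, 5) (2, 5) (2, 5)
```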

Exercise 7.4

Check out the documentation of the following functions: numpy.eye, numpy.diag, numpy.zeros, numpy.ones, and numpy.empty.

7.2. Reshaping Matrices

Let us take an example 3-by-4 matrix:

A = np.array([
    [ 1,  2,  3,  4 ],
    [ 5,  6,  7,  8 ],
    [ 9, 10, 11, 12 ]
])

Internally, a matrix is represented using a long flat vector where elements are stored in the row-major[2] order:

A.size  # total number of elements
## 12
A.ravel()  # the underlying array
## array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

It is the shape slot that is causing the 12 elements to be treated as if they were arranged on a 3-by-4 grid, for example in different algebraic computations and during the printing thereof. This arrangement can be modified anytime without modifying the underlying array:

A.shape = (4, 3)
A
## array([[ 1,  2,  3],
##        [ 4,  5,  6],
##        [ 7,  8,  9],
##        [10, 11, 12]])

This way, we have obtained a different view on the same data.

For convenience, there is also the reshape method that returns a modified version of the object it is applied on:

A.reshape(-1, 6)
## array([[ 1,  2,  3,  4,  5,  6],
##        [ 7,  8,  9, 10, 11, 12]])

Here, “-1” means that numpy must deduce by itself how many rows we actually want in the result (12 elements are supposed to be arranged in 6 columns, so the maths behind it is not rocket science).
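The following sketch shows the deduction of the missing dimension at work. It also demonstrates that, for an ordinary contiguous array like the one below, the reshaped object shares its data with the original one (numpy returns a view whenever this is possible and a copy otherwise).

```python
import numpy as np

A = np.arange(1, 13)   # 12 elements: 1, 2, ..., 12
B = A.reshape(-1, 6)   # 12/6 = 2 rows are deduced
C = A.reshape(4, -1)   # 12/4 = 3 columns are deduced
print(B.shape, C.shape)
## (2, 6) (4, 3)

# B is a view here: it uses the same underlying data as A
print(np.shares_memory(A, B))
## True
B[0, 0] = 100          # modifying the view...
print(A[0])            # ...changes the original vector too
## 100
```

If an independent object is needed, we can call the copy method on the result.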

Thanks to this, generating row or column vectors is very easy:

np.linspace(0, 1, 5).reshape(1, -1)
## array([[0.  , 0.25, 0.5 , 0.75, 1.  ]])
np.array([9099, 2537, 1832]).reshape(-1, 1)
## array([[9099],
##        [2537],
##        [1832]])

Reshaping is not the same as matrix transpose, which also changes the order of elements in the underlying array:

A  # before
## array([[ 1,  2,  3],
##        [ 4,  5,  6],
##        [ 7,  8,  9],
##        [10, 11, 12]])
A.T  # transpose of A
## array([[ 1,  4,  7, 10],
##        [ 2,  5,  8, 11],
##        [ 3,  6,  9, 12]])

We see that the rows became columns and vice versa.

Note

(*) Higher-dimensional arrays are also possible. For example,

np.arange(24).reshape(2, 4, 3)
## array([[[ 0,  1,  2],
##         [ 3,  4,  5],
##         [ 6,  7,  8],
##         [ 9, 10, 11]],
## 
##        [[12, 13, 14],
##         [15, 16, 17],
##         [18, 19, 20],
##         [21, 22, 23]]])

is an array of “depth” 2, “height” 4, and “width” 3; we can see it as two 4-by-3 matrices stacked together. Theoretically, they can be useful for representing contingency tables for products of many factors. However, in our application areas, we are used to sticking with long data frames instead (see Section 10.6.2), due to their more aesthetic display and better handling of sparse data.

7.3. Mathematical Notation

Here is some standalone mathematical notation that we shall be employing in this course. A matrix with n rows and m columns (an n-by-m matrix) \(\mathbf{X}\) can be written as

\[\begin{split} \mathbf{X}= \left[ \begin{array}{cccc} x_{1,1} & x_{1,2} & \cdots & x_{1,m} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,m} \\ \end{array} \right]. \end{split}\]

Mathematically, we denote this as \(\mathbf{X}\in\mathbb{R}^{n\times m}\). Looking at the above, if this makes us think of how data are displayed in spreadsheets, we are correct, because the latter were inspired by the former.

We see that \(x_{i,j}\in\mathbb{R}\) denotes the element in the \(i\)-th row (e.g., the \(i\)-th observation) and the \(j\)-th column (e.g., the \(j\)-th feature or variable), for every \(i=1,\dots,n\), \(j=1,\dots,m\).

In particular, if \(\mathbf{X}\) denoted the body dataset, then \(x_{1,2}\) would be the height of the 1st person.
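Let us keep in mind that the mathematical notation uses 1-based indices, whereas Python counts from 0. A minimal sketch, where the small matrix X merely copies the top-left corner of the preview of body:

```python
import numpy as np

# first three rows and columns of the preview of body shown earlier
X = np.array([
    [97.1, 160.2, 34.7],
    [91.1, 152.7, 33.5],
    [73.0, 161.2, 37.4]
])
i, j = 1, 2          # mathematical (1-based) indices
x_12 = X[i-1, j-1]   # the corresponding 0-based indices
print(x_12)          # the height of the 1st person
## 160.2
```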

Important

Matrices are a convenient means of representing many different kinds of data:

  • n points in an m-dimensional space (like n observations for which there are m measurements/features recorded, where each row describes a different object; exactly the case of the body dataset above) – this is the most common scenario;

  • m time series sampled at n points in time (e.g., prices of m different currencies on n consecutive days; see Chapter 16);

  • a single kind of measurement for data in m groups, each consisting of n subjects (e.g., heights of n males and n females); here, the order of elements in each column does not usually matter as observations are not paired; there is no relationship between \(x_{i,j}\) and \(x_{i,k}\) for \(j\neq k\); a matrix is used merely as a convenient container for storing a few unrelated vectors of identical sizes; we will be dealing with a more generic case of possibly nonhomogeneous groups in Chapter 12;

  • two-way contingency tables (see Section 11.2.2), where an element \(x_{i,j}\) gives the number of occurrences of items at the \(i\)-th level of the first categorical variable and, at the same time, being at the \(j\)-th level of the second variable (e.g., blue-eyed and blonde-haired);

  • graphs and other relationships between objects, e.g., \(x_{i,j}=0\) might denote that the \(i\)-th object is not connected[3] with the \(j\)-th one and \(x_{k,l}=0.2\) that there is a weak connection between \(k\) and \(l\) (e.g., who is a friend of whom, whether a user recommends a particular item);

  • images, where \(x_{i,j}\) represents the intensity of a colour component (e.g., red, green, blue or shades of grey or hue, saturation, brightness; compare Section 16.4) of a pixel in the \((n-i+1)\)-th row and the \(j\)-th column.

Note

In practice, more complex and less-structured data can quite often be mapped to a tabular form. For instance, a set of audio recordings can be described by measuring overall loudness, timbre, and danceability of each song. Also, a collection of documents can be described by means of the degrees of belongingness to some automatically discovered topics (e.g., someone said that Joyce’s Ulysses is 80% travel literature, 70% comedy, and 50% heroic fantasy, but let us not take it for granted).

7.3.1. Row and Column Vectors

Additionally, we will sometimes use the following notation to emphasise that \(\mathbf{X}\) consists of \(n\) rows:

\[\begin{split} \mathbf{X} = \left[ \begin{array}{c} \mathbf{x}_{1,\cdot} \\ \mathbf{x}_{2,\cdot} \\ \vdots\\ \mathbf{x}_{n,\cdot} \\ \end{array} \right]. \end{split}\]

Here, \(\mathbf{x}_{i,\cdot}\) is a row vector of length \(m\), i.e., a \((1\times m)\)-matrix:

\[\begin{split} \mathbf{x}_{i,\cdot} = \left[ \begin{array}{cccc} x_{i,1} & x_{i,2} & \cdots & x_{i,m} \\ \end{array} \right]. \end{split}\]

Alternatively, we can specify the \(m\) columns

\[\begin{split} \mathbf{X} = \left[ \begin{array}{cccc} \mathbf{x}_{\cdot,1} & \mathbf{x}_{\cdot,2} & \cdots & \mathbf{x}_{\cdot,m} \\ \end{array} \right], \end{split}\]

where \(\mathbf{x}_{\cdot,j}\) is a column vector of length \(n\), i.e., an \((n\times 1)\)-matrix:

\[\begin{split} \mathbf{x}_{\cdot,j} = \left[ \begin{array}{cccc} x_{1,j} & x_{2,j} & \cdots & x_{n,j} \\ \end{array} \right]^T=\left[ \begin{array}{c} {x}_{1,j} \\ {x}_{2,j} \\ \vdots\\ {x}_{n,j} \\ \end{array} \right], \end{split}\]

where \(\cdot^T\) denotes the transpose of a given matrix (thanks to which we can save some vertical space; we do not want this book to be 1000 pages long, do we?).

Also, recall that we are used to denoting vectors of length \(m\) with \(\boldsymbol{x}=(x_1, \dots, x_m)\). A vector is a 1-dimensional array (not a 2-dimensional one), hence a slightly different font in the case where ambiguity can be troublesome.

Note

To avoid notation clutter, we will often be implicitly promoting vectors like \(\boldsymbol{x}=(x_1,\dots,x_m)\) to row vectors \(\mathbf{x}=[x_1\,\cdots\,x_m]\), because this is the behaviour that numpy[4] uses; see Chapter 8.
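A small preview of this behaviour, to be treated as a sketch only (the details are deferred to Chapter 8): when a matrix with m columns is combined with a vector of length m, the vector acts as if it were a 1-by-m row vector.

```python
import numpy as np

A = np.array([
    [ 1,  2,  3,  4],
    [ 5,  6,  7,  8],
    [ 9, 10, 11, 12]
])
x = np.array([10, 20, 30, 40])  # a vector of length 4, shape (4,)

S1 = A + x                 # x is implicitly treated as a row vector
S2 = A + x.reshape(1, -1)  # an explicit 1-by-4 row vector
print(np.array_equal(S1, S2))
## True
print(S1[0, :])
## [11 22 33 44]
```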

7.3.2. Transpose

The transpose of a matrix \(\mathbf{X}\in\mathbb{R}^{n\times m}\) is an \((m\times n)\)-matrix \(\mathbf{Y}\) given by:

\[\begin{split} \mathbf{Y}= \mathbf{X}^T= \left[ \begin{array}{cccc} x_{1,1} & x_{2,1} & \cdots & x_{n,1} \\ x_{1,2} & x_{2,2} & \cdots & x_{n,2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1,m} & x_{2,m} & \cdots & x_{n,m} \\ \end{array} \right], \end{split}\]

i.e., it enjoys \(y_{i,j}=x_{j, i}\).
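We can verify this property numerically. The sketch below uses an example 4-by-3 matrix (so n=4 and m=3) and checks every pair of indices, which are 0-based on the Python side.

```python
import numpy as np

X = np.arange(1, 13).reshape(4, 3)  # n=4 rows, m=3 columns
Y = X.T                             # m=3 rows, n=4 columns
print(X.shape, Y.shape)
## (4, 3) (3, 4)

# check y_{i,j} = x_{j,i} for every valid pair of indices
ok = all(
    Y[i, j] == X[j, i]
    for i in range(Y.shape[0])
    for j in range(Y.shape[1])
)
print(ok)
## True
```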

Exercise 7.5

Compare the display of an example matrix A and its transpose A.T above.

7.3.3. Identity and Other Diagonal Matrices

\(\mathbf{I}\) denotes the identity matrix, being a square \(n\times n\) matrix (with \(n\) most often clear from the context) with 0s everywhere except on the main diagonal, where 1s lie.

np.eye(5)  # I
## array([[1., 0., 0., 0., 0.],
##        [0., 1., 0., 0., 0.],
##        [0., 0., 1., 0., 0.],
##        [0., 0., 0., 1., 0.],
##        [0., 0., 0., 0., 1.]])

The identity matrix is a neutral element of the matrix multiplication (Section 8.3).

More generally, any diagonal matrix, \(\mathrm{diag}(a_1,\dots,a_n)\), can be constructed from a given sequence of elements by calling:

np.diag([1, 2, 3, 4])
## array([[1, 0, 0, 0],
##        [0, 2, 0, 0],
##        [0, 0, 3, 0],
##        [0, 0, 0, 4]])
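
As a quick sanity check of the neutral element property, here is a sketch relying on the @ operator for matrix multiplication (assumed here ahead of its proper introduction in Section 8.3). It also shows that numpy.diag, when given a matrix, extracts its main diagonal.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # a 2-by-3 matrix
print(np.array_equal(np.eye(2) @ A, A))  # I (2-by-2) times A
## True
print(np.array_equal(A @ np.eye(3), A))  # A times I (3-by-3)
## True

D = np.diag([1, 2, 3, 4])  # vector -> diagonal matrix
d = np.diag(D)             # matrix -> its main diagonal
print(d)
## [1 2 3 4]
```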

7.4. Visualising Multidimensional Data

Let us go back to our body dataset:

body[:6, :]  # preview
## array([[ 97.1, 160.2,  34.7,  40.8,  35.8, 126.1, 117.9],
##        [ 91.1, 152.7,  33.5,  33. ,  38.5, 125.5, 103.1],
##        [ 73. , 161.2,  37.4,  38. ,  31.8, 106.2,  92. ],
##        [ 61.7, 157.4,  38. ,  34.7,  29. , 101. ,  90.5],
##        [ 55.4, 154.6,  34.6,  34. ,  28.3,  92.5,  73.2],
##        [ 62. , 144.7,  32.5,  34.2,  29.8, 106.7,  84.8]])
body.shape
## (4221, 7)

This is an example of tabular (“structured”) data. The important property is that the elements in each row describe the same person. We can freely reorder all the columns at the same time, i.e., permute whole rows (change the order of participants). However, sorting a single column and leaving the other ones unchanged will be semantically invalid.
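The difference between the two operations can be seen in the following sketch, where X is a hypothetical miniature of body consisting of the weights and heights of the first three participants:

```python
import numpy as np

X = np.array([
    [97.1, 160.2],
    [91.1, 152.7],
    [73.0, 161.2]
])  # columns: weight, height

# valid: reorder whole rows, here with respect to increasing height
X_rows = X[np.argsort(X[:, 1]), :]
print(X_rows.tolist())
## [[91.1, 152.7], [97.1, 160.2], [73.0, 161.2]]

# invalid: sort each column independently of the other one
X_broken = np.sort(X, axis=0)
print(X_broken[0, :].tolist())  # a "person" who never existed
## [73.0, 152.7]
```

In the first case, each weight stays next to the height of the same person. In the second one, the pairing is destroyed.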

Mathematically, we consider the above as a set of 4221 points in a 7-dimensional space, \(\mathbb{R}^7\). Let us discuss how we can try visualising different natural projections thereof.

7.4.1. 2D Data

A scatterplot can be used to visualise one variable against another one.

import matplotlib.pyplot as plt
plt.plot(body[:, 1], body[:, 3], "o", c="#00000022")
plt.xlabel(body_columns[1])
plt.ylabel(body_columns[3])
plt.show()

Figure 7.1 An example scatterplot

Figure 7.1 depicts upper leg length (the y-axis) vs (versus; against; as a function of) standing height (the x-axis) in the form of a point cloud with \((x, y)\) coordinates like (body[i, 1], body[i, 3]).

Example 7.6

Here are the exact coordinates of the point corresponding to the person of the smallest height:

body[np.argmin(body[:, 1]), [1, 3]]
## array([131.1,  30.8])

and here is the one with the greatest upper leg length:

body[np.argmax(body[:, 3]), [1, 3]]
## array([168.9,  49.1])

Locate them in Figure 7.1.

As the points are abundant, normally we cannot easily see where the majority of them are located. To remedy this, we applied the simple trick of plotting the points using a semi-transparent colour. Here, the colour specifier was of the form #rrggbbaa, giving the intensity of the red, green, blue, and alpha (opaqueness) channels, each as a pair of hexadecimal digits (between 00 = 0 and ff = 255).
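To see what the specifier used above stands for, we can decode it by hand. The helper below, decode_rrggbbaa, is a hypothetical function of ours written for illustration only; it is not part of matplotlib.

```python
def decode_rrggbbaa(spec):
    """Split a '#rrggbbaa' string into four integers between 0 and 255."""
    digits = spec.lstrip("#")
    return tuple(int(digits[k:k+2], 16) for k in range(0, 8, 2))

r, g, b, a = decode_rrggbbaa("#00000022")
print(r, g, b, a)
## 0 0 0 34
print(round(a / 255, 3))  # opaqueness as a fraction of the maximum
## 0.133
```

Hence, each point is black and only about 13% opaque; it takes several overlapping points to produce a dark spot.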

Overall, the plot reveals that there is a general tendency of small heights and small upper leg lengths to occur frequently together. The same with larger pairs. In Chapter 9, we explore some measures of correlation that will enable us to quantify the degree of association between variable pairs.

7.4.2. 3D Data and Beyond

If we have more than 2 variables to visualise, we might be tempted to use, e.g., a 3-dimensional scatterplot like the one in Figure 7.2.

fig = plt.figure()
ax = fig.add_subplot(projection="3d", facecolor="#ffffff00")
ax.scatter(body[:, 1], body[:, 3], body[:, 0], color="#00000011")
ax.view_init(elev=30, azim=60, vertical_axis="y")
ax.set_xlabel(body_columns[1])
ax.set_ylabel(body_columns[3])
ax.set_zlabel(body_columns[0])
plt.show()

Figure 7.2 A three-dimensional scatterplot reveals almost nothing

However, infrequently will such a 3D plot provide us with readable results: we are projecting a three-dimensional reality onto a two-dimensional screen or page. Some information must inherently be lost. Also, what we see is relative to the position of the virtual camera.

Exercise 7.7

(*) Try finding an interesting elevation and azimuth angle by playing with the arguments passed to the mpl_toolkits.mplot3d.axes3d.Axes3D.view_init function. Also, depict arm circumference, hip circumference, and weight on a 3D plot.

Note

(*) Sometimes there might be facilities available to create an interactive scatterplot (running the above from the Python console actually enables this), where the virtual camera can be freely repositioned with a mouse/touchpad. This can give some more insight into our data. Also, there are means of creating animated sequences, where we can fly over the data scene. Some people find it cool, others find it annoying, but the biggest problem therewith is that they cannot be included in printed material. However, if we are only targeting the display for the Web (this includes mobile devices), we can try some Python libraries that output HTML+CSS+JavaScript code to be rendered by a browser engine.

Example 7.8

Instead of drawing a 3D plot, it might be better to play with different marker colours (or sometimes sizes: think of them as bubbles). A suitable colour map can be used to distinguish between low and high values of an additional variable, as in Figure 7.3.

plt.scatter(
    body[:, 4],    # x
    body[:, 5],    # y
    c=body[:, 0],  # "z" - colours
    cmap="copper",  # colour map, referred to by its name
    alpha=0.5  # opaqueness level between 0 and 1
)
plt.xlabel(body_columns[4])
plt.ylabel(body_columns[5])
plt.axis("equal")
plt.rcParams["axes.grid"] = False
cbar = plt.colorbar()
plt.rcParams["axes.grid"] = True
cbar.set_label(body_columns[0])
plt.show()

Figure 7.3 A two-dimensional scatter plot displaying 3 variables

We can see some tendency for the weight to be greater as both the arm and the hip circumferences increase.

Exercise 7.9

Play around with different colour palettes. However, be wary that ca. 1 in 12 men (8%) and 1 in 200 women (0.5%) have colour vision deficiencies, especially in the red-green or blue-yellow spectrum; hence, some diverging colour maps might be worse than others.

A piece of paper is 2-dimensional. We only have height and width. The world around us is 3-dimensional; we thus also understand the notion of depth. As far as the case of more-dimensional data is concerned, well, suffice it to say that we are 3-dimensional creatures and any attempts towards visualising them will simply not work; don’t even trip.

Luckily, this is where mathematics comes to our rescue. With some more knowledge and intuitions, and this book lets us gain them, it will be as easy[5] as imagining a generic m-dimensional space, and then assuming that, say, m=7 or 42.

This is exactly why data science relies on automated methods for knowledge/pattern discovery – so that we are able to identify, describe, and analyse the structures that might be present in the data, but cannot be perceived with our imperfect senses.

Note

Linear and nonlinear dimensionality reduction techniques can be applied to visualise some aspects of high-dimensional data in the form of 2D (or 3D) plots. In particular, the principal component analysis (PCA) finds an interesting angle from which it is worth looking at the data; see Section 9.3.
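To give a rough idea of what such a projection involves, here is a minimal sketch of one common way to compute it: centring the data and applying the singular value decomposition. The synthetic matrix X is only a stand-in for a real dataset, and the approach shown is our assumption, not necessarily the one presented in Section 9.3.

```python
import numpy as np

rng = np.random.default_rng(123)
X = rng.normal(size=(100, 7))  # synthetic stand-in: 100 points in R^7
X[:, 1] += 2.0 * X[:, 0]       # make two of the columns correlated

Xc = X - X.mean(axis=0)        # centre each column at 0
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2, :].T           # coordinates along the first two principal axes
print(Z.shape)
## (100, 2)
```

The two columns of Z can then be drawn on an ordinary scatterplot; the first one captures at least as much variability as the second.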

7.4.3. Scatterplot Matrix (Pairplot)

We may also try depicting all (or most – ones that we deem interesting) pairs of variables in the form of a scatterplot matrix; see Figure 7.4.

import seaborn as sns
sns.pairplot(
    data=pd.DataFrame(  # sns.pairplot needs a DataFrame...
        body[:, [0, 1, 4, 5]],
        columns=body_columns[[0, 1, 4, 5]]
    ),
    plot_kws=dict(alpha=0.1)
)
# plt.show()  # not needed :/

Figure 7.4 Scatterplot matrix for selected columns in the body dataset: scatterplots for all unique pairs of variables together with histograms on the main diagonal

Plotting variables against themselves is uninteresting (exercise: what would that be?). Thus, we have included histograms on the main diagonal to see how they are distributed (the marginal distributions).

A scatterplot matrix can be a good tool for identifying interesting combinations of columns in our datasets. We see that some pairs of variables are more “structured” than others, e.g., hip circumference and weight are more or less aligned on a straight line. This is why in Chapter 9 we will be interested in describing the possible relationships between the variables.

Exercise 7.10

(*) Use matplotlib.pyplot.subplot and other functions we have learned in the previous part to create a scatterplot matrix manually. Draw weight, arm circumference, and hip circumference on a logarithmic scale.

7.5. Exercises

Exercise 7.11

What is the difference between [1, 2, 3], [[1, 2, 3]], and [[1], [2], [3]] in the context of array creation?

Exercise 7.12

If A is a matrix with 5 rows and 6 columns, what is the difference between A.reshape(6, 5) and A.T?

Exercise 7.13

If A is a matrix with 5 rows and 6 columns, what is the meaning of: A.reshape(-1), A.reshape(3, -1), A.reshape(-1, 3), A.reshape(-1, -1), A.shape = (3, 10), and A.shape = (-1, 3)?

Exercise 7.14

List some methods to add a new row and add a new column to an existing matrix.

Exercise 7.15

Give some ways to visualise 3-dimensional data.

Exercise 7.16

How to set point opaqueness/transparency when drawing a scatter plot? Why would we be interested in this?


[1] Assuming we solved all the suggested exercises, which we did, didn’t we? See Rule #3.

[2] (*) Sometimes referred to as a C-style array, as opposed to the Fortran-style one, which is used in, e.g., R.

[3] Such matrices are usually sparse, i.e., have many elements equal to 0. We have special, memory-efficient data structures for handling these data; see scipy.sparse for more detail, as they go beyond the scope of our introductory course.

[4] However, some textbooks assume that all vectors are column vectors; in such a case, they would define the Euclidean norm as \(\|\boldsymbol{x}\|=\sqrt{\mathbf{x}^T \mathbf{x}}\).

[5] This is an old funny joke which most funny mathematicians find funny.