11. Handling Categorical Data

The online version of the open-access textbook Minimalist Data Wrangling with Python by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). Any bug/typos reports/fixes are appreciated. Although available online, this is a whole course; it should be read from the beginning to the end. In particular, refer to the Preface for general introductory remarks.

So far we have been mostly dealing with quantitative (numeric) data – real numbers on which we can apply various mathematical operations, such as computing the arithmetic mean or taking the square thereof. Of course, not every transformation thereof must always make sense in every context (e.g., multiplying temperatures – what does it mean when we say that it is twice as hot today as compared to yesterday?), but still, the possibilities were plenty.

Qualitative data (also known as categorical data, factors, enumerated types), on the other hand, take a small number of unique values and support a very limited set of admissible operations. Usually, we can only determine where two entities are equal to each other or not (think: eye colour, blood type, or a flag whether a patient is ill).

In datasets involving many features, which we shall cover in Chapter 12, categorical variables are often used for observation grouping (e.g., so that we can compute the best and average time for marathoners in each age category or draw box plots for finish times of men and women separately). Also, they may serve as target variables in statistical classification tasks (e.g., so that we can determine if an email is “spam” or “not spam”).

Also, sometimes we might additionally be able to rank the observations (Australian school grades are linearly ordered like F (fail) < P (pass) < C (credit) < D (distinction) < HD (high distinction), some questionnaires use Likert-type scales such as “strongly disagree” < “disagree” < “neutral” < “agree” < “strongly agree”, etc.).

11.1. Representing and Generating Categorical Data

Common ways to represent a categorical variable with l distinct levels \(\{L_1,L_2,\dots,L_l\}\) is by storing it as:

  • a vector of strings,

  • a vector of integers between 0 (inclusive) and l (exclusive).

These two are easily interchangeable.

Furthermore, for \(l=2\) (binary data), another convenient representation is by means of logical vectors. This can be extended to a so-called one-hot encoded representation using a logical vector of length l.

Let us consider the data on the original whereabouts of the top 16 marathoners (the 37th PZU Warsaw Marathon dataset):

marathon = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
cntrs = np.array(marathon.loc[:, "country"], dtype="str")
cntrs16 = cntrs[:16]
## array(['KE', 'KE', 'KE', 'ET', 'KE', 'KE', 'ET', 'MA', 'PL', 'PL', 'IL',
##        'PL', 'KE', 'KE', 'PL', 'PL'], dtype='<U2')

These are two-letter ISO 3166 country codes, encoded of course as strings (notice the dtype="str" argument).

Calling numpy.unique allows us to determine the set of distinct categories:

cat_cntrs16 = np.unique(cntrs16)
## array(['ET', 'IL', 'KE', 'MA', 'PL'], dtype='<U2')

Hence, cntrs16 is a categorical vector of length n=16 (len(cntrs16)) with data assuming one of \(l=5\) different levels (len(cat_cntrs16)).


numpy.unique sorts the distinct values lexicographically. In other words, they are not listed in the order of appearance, which might be something desirable in certain contexts.

11.1.1. Encoding and Decoding Factors

In order to encode a label vector through a set of consecutive nonnegative integers, we pass the return_inverse=True argument to numpy.unique:

cat_cntrs16, codes_cntrs16 = np.unique(cntrs16, return_inverse=True)
## array(['ET', 'IL', 'KE', 'MA', 'PL'], dtype='<U2')
## array([2, 2, 2, 0, 2, 2, 0, 3, 4, 4, 1, 4, 2, 2, 4, 4])

The code sequence 2, 2, 2, 0, … corresponds to the 3rd, 3rd, 3rd, 1st, … level in cat_cntrs16, i.e., Kenya, Kenya, Kenya, Ethiopia, ….

The values between \(0\) and \(l-1=4\) can be used to index a given array of length \(l=5\). Hence, in order to decode our factor, we can write:

## array(['KE', 'KE', 'KE', 'ET', 'KE', 'KE', 'ET', 'MA', 'PL', 'PL', 'IL',
##        'PL', 'KE', 'KE', 'PL', 'PL'], dtype='<U2')

We can use any other set of labels now:

np.array(["Ethiopia", "Israel", "Kenya", "Morocco", "Poland"])[codes_cntrs16]
## array(['Kenya', 'Kenya', 'Kenya', 'Ethiopia', 'Kenya', 'Kenya',
##        'Ethiopia', 'Morocco', 'Poland', 'Poland', 'Israel', 'Poland',
##        'Kenya', 'Kenya', 'Poland', 'Poland'], dtype='<U8')

This is an instance of the recoding of a categorical variable.


Despite the fact that we can represent categorical variables using a set of integers, it does not mean that they become instances of a quantitative type. Arithmetic operations thereon do not really make sense.


When we represent categorical data using numeric codes, it is possible to introduce non-occurring levels. Such information can be useful, e.g., we could explicitly indicate that there were no runners from Australia in the top 16.

Exercise 11.1

(**) Determine the set of unique values in cntrs16 in the order of appearance (and not sorted lexicographically). Then, encode cntrs16 using this level set.

Hint: check out the return_index argument to numpy.unique and the numpy.searchsorted function.

11.1.2. Categorical Data in pandas

pandas includes a special dtype for storing categorical data. Namely, we can write:

cntrs16_series = marathon.iloc[:16, :].loc[:, "country"].astype("category")
## 0     KE
## 1     KE
## 2     KE
## 3     ET
## 4     KE
## 5     KE
## 6     ET
## 7     MA
## 8     PL
## 9     PL
## 10    IL
## 11    PL
## 12    KE
## 13    KE
## 14    PL
## 15    PL
## Name: country, dtype: category
## Categories (5, object): ['ET', 'IL', 'KE', 'MA', 'PL']

or, equivalently in our case, pd.Series(cntrs16, dtype="category"). This yields a Series object displayed as if it was represented using string labels, however, in fact it is encoded using the numeric representation. This can be revealed by accessing:

## array([2, 2, 2, 0, 2, 2, 0, 3, 4, 4, 1, 4, 2, 2, 4, 4], dtype=int8)
## Index(['ET', 'IL', 'KE', 'MA', 'PL'], dtype='object')

exactly matching what we have obtained with numpy.unique. Most often, however, categorical data in data frames will be stored as ordinary strings.

11.1.3. Binary Data and Logical Vectors

Binary data is a special case of the qualitative setting, where we only have two categories.

For convenience, we usually encode the two classes as integers:

  • 0 (or logical False, e.g., healthy/fail/off/non-spam/absent/…) and

  • 1 (or True, e.g., ill/success/on/spam/present/…).


When converting logical to numeric, False becomes 0 and True becomes 1. Conversely, 0 is converted to False and anything else (including -0.326) to True.

For example:

np.array([True, False, True, True, False]).astype(int)
## array([1, 0, 1, 1, 0])

The other way around:

np.array([-2, -0.326, -0.000001, 0.0, 0.1, 1, 7643]).astype(bool)
## array([ True,  True,  True, False,  True,  True,  True])

or, equivalently:

np.array([-2, -0.326, -0.000001, 0.0, 0.1, 1, 7643]) != 0
## array([ True,  True,  True, False,  True,  True,  True])
Exercise 11.2

Given a numeric vector x, create a vector of the same length as x whose i-th element is equal to "yes" if x[i] is in the unit interval and to "no" otherwise. Use numpy.where, which can act as a vectorised version of the if statement.

11.1.4. One-Hot Encoding (*)

Let \(\boldsymbol{x}\) be vector of n integers in \(\{0,...,l-1\}\). Its one-hot encoded version is a 0-1 (or, equivalently, logical) matrix \(\mathbf{R}\) of shape n by l such that \(r_{i,j}=1\) if and only if \(x_i = j\).

For example, if \(\boldsymbol{x}=(0, 1, 2, 1)\) and \(l=4\), then:

\[\begin{split} \mathbf{R} = \left[ \begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ \end{array} \right]. \end{split}\]

Such a representation is useful when solving a multiclass classification problem by means of l binary classifiers. For example, if spam, bacon, and hot dogs are on the menu, then spam is encoded as \((1, 0, 0)\), i.e., yeah-spam, nah-bacon, and nah-hot dog. We can build three binary classifiers, each specialising in telling whether what it encounters is a given food or something else.

Example 11.3

Write a function to one-hot encode a given categorical vector.

Example 11.4

Write a function to decode a one-hot encoded matrix.

11.1.5. Binning Numeric Data (Revisited)

Numerical data can be converted to categorical via binning (quantisation). This results in information (precision) loss, however, it also opens some new possibilities. In fact, this is what we needed to do in order to draw all the histograms in Chapter 4. Also, reporting observation counts in each bin instead of raw data enables us to include them in printed reports (in the form of tables).


As strong proponents of openness and transparency, we always encourage all entities (governments, universities, non-for-profits, corporations, etc.) to share (e.g., under the Creative Commons CC-BY-SA-4.0 license) unabridged versions of their datasets to enable public scrutiny and getting the most of the possibilities they can bring for the public good.

Of course, sometimes the sharing of unprocessed information can violate the privacy of the subjects. In such a case, it might be a good idea to communicate them in a binned form.

Consider the 16 best marathon finish times (in minutes):

mins = marathon.loc[:, "mins"].to_numpy()
mins16 = mins[:16]
## array([129.32, 130.75, 130.97, 134.17, 134.68, 135.97, 139.88, 143.2 ,
##        145.22, 145.92, 146.83, 147.8 , 149.65, 149.88, 152.65, 152.88])

numpy.searchsorted can be used to determine the interval where each value in mins falls.

bins = [130, 140, 150]
codes_mins16 = np.searchsorted(bins, mins16)
## array([0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3])

By default, the intervals are of the form (a, b] (not including a, including b). Code 0 corresponds to values less than the first bin bound, whereas code 3 – greater than or equal to the last bound:

pandas.cut us another interface to the same binning method. It returns a vector-like object of dtype "category", with very readable labels generated automatically (and ordered, see Section 11.4.7):

cut_mins16 = pd.Series(pd.cut(mins16, [-np.inf, 130, 140, 150, np.inf]))
## 0      (-inf, 130.0]
## 1     (130.0, 140.0]
## 2     (130.0, 140.0]
## 3     (130.0, 140.0]
## 4     (130.0, 140.0]
## 5     (130.0, 140.0]
## 6     (130.0, 140.0]
## 7     (140.0, 150.0]
## 8     (140.0, 150.0]
## 9     (140.0, 150.0]
## 10    (140.0, 150.0]
## 11    (140.0, 150.0]
## 12    (140.0, 150.0]
## 13    (140.0, 150.0]
## 14      (150.0, inf]
## 15      (150.0, inf]
## dtype: category
## Categories (4, interval[float64, right]): [(-inf, 130.0] < (130.0, 140.0] < (140.0, 150.0] <
##                                            (150.0, inf]]
## Index(['(-inf, 130.0]', '(130.0, 140.0]', '(140.0, 150.0]',
##        '(150.0, inf]'],
##       dtype='object')
Exercise 11.5

(*) Check out the numpy.histogram_bin_edges function which tries to determine some informative interval bounds automatically based on a range of simple heuristics. Also, numpy.linspace and numpy.geomspace, which we have covered in Chapter 4, can be useful for generating equidistant bounds on linear and logarithmic scale, respectively.

Example 11.6

(**) We can create a set of the corresponding categories manually, for example, as follows:

bins2 = np.r_[-np.inf, bins, np.inf]
cat_mins16 = np.array(
    [f"({bins2[i-1]}, {bins2[i]}]" for i in range(1, len(bins2))]
## array(['(-inf, 130.0]', '(130.0, 140.0]', '(140.0, 150.0]',
##        '(150.0, inf]'], dtype='<U14')

Recall from Section 5.6.3 that list comprehensions are a convenient substitute for a for loop and the list.append method. Recoding based on the above yields:

## array(['(-inf, 130.0]', '(130.0, 140.0]', '(130.0, 140.0]',
##        '(130.0, 140.0]', '(130.0, 140.0]', '(130.0, 140.0]',
##        '(130.0, 140.0]', '(140.0, 150.0]', '(140.0, 150.0]',
##        '(140.0, 150.0]', '(140.0, 150.0]', '(140.0, 150.0]',
##        '(140.0, 150.0]', '(140.0, 150.0]', '(150.0, inf]',
##        '(150.0, inf]'], dtype='<U14')

11.1.6. Generating Pseudorandom Labels

It is worth knowing that numpy.random.choice allows us to create a pseudorandom sample with categories picked with any probabilities:

    a=["spam", "bacon", "eggs", "tempeh"],
    p=[   0.7,     0.1,   0.15,     0.05],
## array(['spam', 'spam', 'spam', 'spam', 'bacon', 'spam', 'tempeh', 'spam',
##        'spam', 'spam', 'spam', 'bacon', 'spam', 'spam', 'spam', 'bacon'],
##       dtype='<U6')

Hence, if we generate a sufficiently large sample, we will expect "spam" to occur ca. 70% times, and "tempeh" to be drawn in 5% of the cases, etc.

11.2. Frequency Distributions

11.2.1. Counting

For arbitrary categorical data, we can call:

cat_cntrs16, counts_cntrs16 = np.unique(cntrs16, return_counts=True)
cat_cntrs16, counts_cntrs16
## (array(['ET', 'IL', 'KE', 'MA', 'PL'], dtype='<U2'), array([2, 1, 7, 1, 5]))

to get both the set of unique categories and the corresponding number of occurrences. For instance, there were 7 runners from Kenya amongst the top 16.

If we already have an array of integer codes between \(0\) and \(l-1\), there is no need to call numpy.unique, as numpy.bincount can return the number of times each code appears therein.

## array([2, 1, 7, 1, 5])

Of course, a vector of counts can easily be turned into a vector of proportions (fractions):

counts_cntrs16 / np.sum(counts_cntrs16)
## array([0.125 , 0.0625, 0.4375, 0.0625, 0.3125])

Hence, almost 31.25% of the top runners were from Poland (it is a marathon in Warsaw after all…). We can of course multiply the above by 100 to get the percentages.

We can output a nice frequency table by storing both objects in a single data frame (or a Series object with an appropriate vector of row labels):

table = pd.DataFrame({
    "Country": cat_cntrs16,
    "%": 100 * counts_cntrs16 / np.sum(counts_cntrs16)
##   Country      %
## 0      ET  12.50
## 1      IL   6.25
## 2      KE  43.75
## 3      MA   6.25
## 4      PL  31.25
Example 11.7

(*) In Section 14.3 we will discuss IPython.display.Markdown as a means to embed arbitrary Markdown code inside IPython/Jupyter reports. In the meantime, let us just note that nicely formatted tables can be created from data frames by calling:

import IPython.display













pandas.Series.value_counts is even more convenient, as it returns a Series object equipped with a readable index (element labels):

marathon.iloc[:16, :].loc[:, "country"].value_counts()
## KE    7
## PL    5
## ET    2
## MA    1
## IL    1
## Name: country, dtype: int64

By default, data are ordered with respect to the counts, decreasingly.

Exercise 11.8

In Chapter 4 we have mentioned the numpy.histogram function which applies the binning of a numeric vector and then counts the number of occurrences. This is merely a helper function: the same result can be obtained by means of the more basic numpy.searchsorted, numpy.bincount, and numpy.histogram_bin_edges. Apply numpy.histogram on the whole 37_pzu_warsaw_marathon_mins dataset.

Exercise 11.9

Using numpy.argsort, sort counts_cntrs16 increasingly together with the corresponding items in cat_cntrs16.

11.2.2. Two-Way Contingency Tables: Factor Combinations

Some datasets may feature many categorical columns, each having possibly different levels. Let us now consider the whole marathon dataset:

marathon.loc[:, "age"] = marathon.category.str.slice(1)  # first two chars
marathon.loc[marathon.age >= "60", "age"] = "60+"  # too few runners aged 70+
marathon = marathon.loc[:, ["sex", "age", "country"]]
##   sex age country
## 0   M  20      KE
## 1   M  20      KE
## 2   M  20      KE
## 3   M  20      ET
## 4   M  30      KE

The three columns are: sex, age (in 10-year brackets), and country. We can of course analyse the data distribution in each column individually, however, some interesting patterns might also arise when we study the combinations of levels of different variables.

Here are the levels of the sex and age variables:

np.unique(marathon.loc[:, "sex"])
## array(['F', 'M'], dtype=object)
np.unique(marathon.loc[:, "age"])
## array(['20', '30', '40', '50', '60+'], dtype=object)

We thus have 10 different possible combinations thereof.

A two-way contingency table is a matrix which gives the number of occurrences of each pair of values:

v = marathon.loc[:, ["sex", "age"]].value_counts().unstack(fill_value=0)
## age   20    30    40   50  60+
## sex                           
## F    240   449   262   43   19
## M    879  2200  1708  541  170

Hence, for example, there were 19 women aged 60 and over amongst the marathoners. Nice.

The marginal (one dimensional) frequency distributions can be recreated by computing the rowwise and columnwise sums.

np.sum(v, axis=1)
## sex
## F    1013
## M    5498
## dtype: int64
np.sum(v, axis=0)
## age
## 20     1119
## 30     2649
## 40     1970
## 50      584
## 60+     189
## dtype: int64

11.2.3. Combinations of Even More Factors

pandas.DataFrame.value_counts can also be used with a combination of more than 2 categorical variables:

marathon.loc[:, ["sex", "age", "country"]].value_counts().\
##     sex  age country  count
## 0     M   30      PL   2081
## 1     M   40      PL   1593
## 2     M   20      PL    824
## 3     M   50      PL    475
## 4     F   30      PL    422
## ..   ..  ...     ...    ...
## 189   M   30      SI      1
## 190   F  60+      EE      1
## 191   F  60+      FI      1
## 192   M   30      PE      1
## 193   F   20      BE      1
## [194 rows x 4 columns]

Of course, the display will be in the long format (compare Section 10.6.2) here, as high-dimensional arrays are not nicely printable.

11.3. Visualising Factors

Methods for visualising categorical data are by no means fascinating (unless we use them as grouping variables in more complex datasets, but this is a topic that we cover in Chapter 12).

11.3.1. Bar Plots

Bar plots are self-explanatory and hence will do the trick most of the time, see Figure 11.1.

ind = np.arange(len(cat_cntrs16))
plt.bar(ind, height=counts_cntrs16,
    color="lightgray", edgecolor="black", alpha=0.8)
plt.xticks(ind, cat_cntrs16)

Figure 11.1 Bar plot for the top 16 marathoners’ countries

The ind vector gives the x-coordinates of the bars, here: consecutive integers. By calling matplotlib.pyplot.xticks we assign them readable labels.

Exercise 11.10

Assign a different colour to each bar.

Exercise 11.11

Draw a bar plot which features percentages instead of counts, so that the total bar height is 100%.

Exercise 11.12

Print the frequency table and draw a bar plot for the top 100 marathoners’ countries.

Exercise 11.13

(*) Print the frequency table and draw a bar plot all the marathoners (not just the elite ones), ordered from the highest to the lowest counts. There will be too many bars, therefore replace the few last bars with a single one, labelled “All other”.

A bar plot is a versatile tool for visualising the counts also in the two-variable case, see Figure 11.2:

v = marathon.loc[:, ["sex", "age"]].value_counts().\
sns.barplot(x="age", hue="sex", y="count", data=v)

Figure 11.2 Number of runners by age category and sex

Exercise 11.14

(*) Draw a similar chart using matplotlib.pyplot.bar.

Exercise 11.15

(**) Create a stacked bar plot similar to the one in Figure 11.3, where we have horizontal bars for data that have been normalised so that for each sex their sum is 100%.


Figure 11.3 Example stacked bar plot: Age distribution for different sexes amongst all the runners

11.3.2. Political Marketing and Statistics

Even such a simple plot can be manipulated. For example, presidential elections were held in Poland in 2020. In the second round, Andrzej Duda had won against Rafał Trzaskowski. In Figure 11.4 we have the official results that might be presented by one of the infamous Polish TV conglomerate:

plt.bar([1, 2], height=[51.03, 48.97], width=0.25,
    color="lightgray", edgecolor="black", alpha=0.8)
plt.xticks([1, 2], ["Duda", "Trzaskowski"])
plt.xlim(0, 3)
plt.ylim(48.9, 51.1)

Figure 11.4 Flawless victory!

Such a great victory! Wait… it was a close vote after all! We should just take a look at the y-axis tick marks.

Another media outlet could have reported it like in Figure 11.5:

plt.bar([1, 2], height=[51.03, 48.97], width=0.25,
    color="lightgray", edgecolor="black", alpha=0.8)
plt.xticks([1, 2], ["Duda", "Trzaskowski"])
plt.xlim(0, 3)
plt.ylim(0, 250)
plt.yticks([0, 100])

Figure 11.5 It was a draw, so close!

The moral of the story is:


Always read the y-axis tick marks. And when drawing own bar plots, do not trick the reader; this is unethical.

11.3.3. Pie Cha… Don’t Even Trip

We are definitely not going to discuss the infamous pie charts, because their use in data analysis has been widely criticised for a long time (it is difficult to judge the ratios of areas of their slices). Case closed. Good morning.

11.3.4. Pareto Charts (*)

As a general (empirical) rule, it is usually the case that most instances of something’s happening (usually 70–90%) are due to only few causes (10–30%). This is known as the Pareto rule (with 80% vs 20% being an often cited rule of thumb).

Example 11.16

In Chapter 6 we modelled the US cities’ population dataset using the Pareto distribution (the very same Pareto, but a different, yet related object). We discovered that only ca. 14% of the settlements (those with 10,000 or more inhabitants) is home to as much as 84% of the population. Hence, we may say that this data domain follows the Pareto rule.

Here is a dataset fabricated by the Clinical Excellence Commission in New South Wales, Australia, listing the most frequent causes for medication errors:

cat_med = np.array([
    "Unauthorised drug", "Wrong IV rate", "Wrong patient", "Dose missed",
    "Underdose", "Wrong calculation","Wrong route", "Wrong drug",
    "Wrong time", "Technique error", "Duplicated drugs", "Overdose"
counts_med = np.array([1, 4, 53, 92, 7, 16, 27, 76, 83, 3, 9, 59])
np.sum(counts_med)  # total number of medication errors
## 430

Let us order the dataset with respect to the counts, decreasingly:

o = np.argsort(counts_med)[::-1]  # ordering permutation (decreasing)
cat_med = cat_med[o]  # order categories based on counts
counts_med = counts_med[o]  # equivalent to np.sort(counts_med)[::-1]
))  # nicer display
##              category  count
## 0         Dose missed     92
## 1          Wrong time     83
## 2          Wrong drug     76
## 3            Overdose     59
## 4       Wrong patient     53
## 5         Wrong route     27
## 6   Wrong calculation     16
## 7    Duplicated drugs      9
## 8           Underdose      7
## 9       Wrong IV rate      4
## 10    Technique error      3
## 11  Unauthorised drug      1

Pareto charts are tools which may aid in visualising the Pareto-ruled datasets. They are based on bar plots, but feature some extras:

  • bars are listed in a decreasing order,

  • the cumulative percentage curve is added.

The plotting of the Pareto chart is a little tricky, because it involves using two different Y axes (as usual, fine-tuning of the figure and studying the manual of the matplotlib package is left as an exercise.)

x = np.arange(len(cat_med))  # 0, 1, 2, ...
p = 100.0*counts_med/np.sum(counts_med)  # percentages

fig, ax1 = plt.subplots()
ax1.set_xticks(x-0.5, cat_med, rotation=60)
ax1.bar(x, height=p)
ax2 = ax1.twinx()  # creates a new coordinate system with a shared x-axis
ax2.plot(x, np.cumsum(p), "ro-")
ax2.set_ylabel("cumulative %")


Figure 11.6 The most frequent causes for medication errors

From Figure 11.6, we can read that 5 causes (less than 40%) correspond to ca. 85% of the medication errors. More precisely,

    "category": cat_med,
    "cumulative %": np.round(np.cumsum(p), 1)
##              category  cumulative %
## 0         Dose missed          21.4
## 1          Wrong time          40.7
## 2          Wrong drug          58.4
## 3            Overdose          72.1
## 4       Wrong patient          84.4
## 5         Wrong route          90.7
## 6   Wrong calculation          94.4
## 7    Duplicated drugs          96.5
## 8           Underdose          98.1
## 9       Wrong IV rate          99.1
## 10    Technique error          99.8
## 11  Unauthorised drug         100.0

Note that there is an explicit assumption here that a single error is only due to a single cause. Also, we presume that each medication error has a similar degree of severity.

Policy makers and quality controllers often rely on such simplifications, therefore they most probably are going to be addressing only the top causes. If we have ever wondered why some processes (mal)function the way they do, there is a hint above. However, coming up with something more effective yet so simple at the same time requires much more effort.

11.3.5. Heat Maps

Two-way contingency tables can be depicted by means of a heatmap, where each count affects the corresponding cell’s colour intensity, see Figure 11.7.

from matplotlib import cm
v = marathon.loc[:, ["sex", "age"]].value_counts().unstack(fill_value=0)
sns.heatmap(v, annot=True, fmt="d", cmap=cm.get_cmap("copper"))

Figure 11.7 Heatmap for the marathoners’ sex and age category

11.4. Aggregating and Comparing Factors

11.4.1. A Mode

As we have already said, the only operation on categorical data that we can rely on is counting (because we have an equivalence relation on the set of labels and nothing more). Therefore, as far as qualitative data aggregation is concerned, what we are with left is the mode, i.e., the most frequently occurring value.

cat_cntrs16, counts_cntrs16 = np.unique(cntrs16, return_counts=True)
## 'KE'

Recall that if i is numpy.argmax(counts) (argument maximum, i.e., where is it?), then counts[i] is the same as numpy.max(counts) and cat[i] is the category with the greatest counts.


A mode might be ambiguous.

For instance, amongst the fastest 22 runners, there is a tie between Kenya and Poland – both meet our definition of a mode:

cat_cntrs22, counts_cntrs22 = np.unique(cntrs[:22], return_counts=True)
cat_cntrs22[np.where(counts_cntrs22 == np.max(counts_cntrs22))]
## array(['KE', 'PL'], dtype='<U2')

To avoid any bias, it would be best to report both of them as potential mode candidates. Alternatively, we can pick one at random (calling numpy.random.choice).

11.4.2. Binary Data as Logical Vectors

Perhaps the most useful arithmetic operation on logical vectors is the sum.

cntrs16 = cntrs[:16]
cntrs16  # recall
## array(['KE', 'KE', 'KE', 'ET', 'KE', 'KE', 'ET', 'MA', 'PL', 'PL', 'IL',
##        'PL', 'KE', 'KE', 'PL', 'PL'], dtype='<U2')
np.sum(cntrs16 == "PL")
## 5

is the number of elements in cntrs16 that are equal to "PL" (because the sum of 0s and 1s is equal to the number of 1s in the sequence). Note that (cntrs16 == "PL") is a logical vector that represents a binary categorical variable with levels: not-Poland (False) and Poland (True).

If we divide the above result by the length of the vector, we will get the proportion:

np.mean(cntrs16 == "PL")
## 0.3125

Hence, 31.25% amongst the top 16 runners are from Poland.

Exercise 11.17

What is the meaning of numpy.all, numpy.any, numpy.min, numpy.max, numpy.cumsum, and numpy.cumprod applied on logical vectors?


(**) Having the 0/1 (or zero/nonzero) vs False/True correspondence allows us to perform some logical operations using integer arithmetic. In particular, assuming that p and q are logical values and a and b are numeric ones, we have, what follows:

  • p+q != 0 means that at least one value is True and p+q == 0 if and only if both are False;

  • more generally, p+q == 2 if both elements are True, p+q == 1 if only one is True (we call it exclusive-or, XOR), and p+q == 0 if both are False;

  • p*q != 0 means that both values are True and p*q == 0 holds whenever at least one is False;

  • 1-p corresponds to negation of p;

  • p*a + (1-p)*b is equal to a if p is True and equal to b otherwise.

11.4.3. Pearson’s Chi-Squared Test (*)

The Kolmogorov–Smirnov test that we described in Section 6.2.3 verifies whether a given sample differs significantly from a hypothesised continuous distribution, i.e., it works for numeric data.

For binned/categorical data, we can use a classical and easy-to-understand test developed by Karl Pearson in 1900. It is supposed to judge whether the difference between the observed proportions \(\hat{p}_1,\dots,\hat{p}_l\) and the theoretical ones \(p_1,\dots,p_l\) is significantly large or not:

\[\begin{split} \left\{ \begin{array}{rll} H_0: & \hat{p}_i=p_i\ \text{ for all }i=1,\dots,l & \text{(null hypothesis)}\\ H_1: & \hat{p}_i\neq p_i\ \text{ for some }i=1,\dots,l & \text{(alternative hypothesis)} \\ \end{array} \right. \end{split}\]

Having such a test is beneficial, e.g., when the data we have at hand are based on small surveys that are supposed to serve as estimates of what might be happening in a larger population.

Going back to our political example from Section 11.3.2, it turns out that one of the pre-election polls indicated that \(c=516\) out of \(n=1017\) people would vote for the first candidate. We have \(\hat{p}_1=50.74\%\) (Duda) and \(\hat{p}_2=49.26\%\) (Traskowski). If we would like to test whether the observed proportions are significantly different from each other, we could test them against the theoretical distribution \(p_1=50\%\) and \(p_2=50\%\), stating that there is a tie between the competitors (up to a sampling error).

As a natural test statistic is based on the relative squared differences:

\[ \hat{T} = n \sum_{i=1}^l \frac{\left(\hat{p}_i-p_i\right)^2}{p_i}. \]
c, n = 516, 1017
p_observed = np.array([c, n-c]) / n
p_expected = np.array([0.5, 0.5])
T = n * np.sum( (p_observed-p_expected)**2 / p_expected )
## 0.2212389380530986

Similarly to the continuous case in Section 6.2.3, we should reject the null hypothesis, if

\[ \hat{T} \ge K, \]

where the critical value \(K\) is based on the fact that, if the null hypothesis is true, \(\hat{T}\) follows the \(\chi^2\) (chi-squared, hence the name of the test) distribution with \(l-1\) degrees of freedom, see scipy.stats.chi2.

alpha = 0.001  # significance level
scipy.stats.chi2.ppf(1-alpha, len(p_observed)-1)
## 10.827566170662733

As \(\hat{T} < K\) (because \(0.22 < 10.83\)), we cannot deem the two proportions significantly different from each other. In other words, this poll did not indicate (at significance level \(0.1\%\)) any of the candidates as a clear winner.

Exercise 11.18

Determine the smallest \(c\), i.e., the number of respondents indicating they would vote for Duda, that leads to the rejection of the null hypothesis.

11.4.4. Two-Sample Pearson’s Chi-Squared Test (*)

Let us consider the data depicted in Figure 11.3 and test whether the runners’ age distributions differ significantly between females and males.

We have \(l=5\) categories. First, denote the total number of observations in both groups with \(n'\) and \(n''\).

d = marathon.loc[:, ["sex", "age"]].value_counts().unstack(fill_value=0)
c1, c2 = d.to_numpy()  # first row, second row
n1 = c1.sum()
n2 = c2.sum()
n1, n2
## (1013, 5498)

The observed proportions in the first group (females), denoted as \(\hat{p}_1',\dots,\hat{p}_l'\), are, respectively:

p1 = c1/n1
## array([0.23692004, 0.44323791, 0.25863771, 0.04244817, 0.01875617])

Here are the proportions in the second group (males), \(\hat{p}_1'',\dots,\hat{p}_l''\):

p2 = c2/n2
## array([0.15987632, 0.40014551, 0.31065842, 0.09839942, 0.03092033])

We would like to verify whether the corresponding proportions are equal to each other (up to some sampling error):

\[\begin{split} \left\{ \begin{array}{rll} H_0: & \hat{p}_i'=\hat{p}_i''\ \text{ for all }i=1,\dots,l & \text{(null hypothesis)}\\ H_1: & \hat{p}_i'\neq \hat{p}_i''\ \text{ for some }i=1,\dots,l & \text{(alternative hypothesis)} \\ \end{array} \right. \end{split}\]

In other words, we are interested whether the categorical data in the two groups come from the same discrete probability distribution.

Taking the estimated expected proportions,

\[\bar{p}_i = \frac{c_i'+c_i''}{n'+n''},\]

for all \(i=1,\dots,l\), the test statistic this time is equal to:

\[ \hat{T} = n' \sum_{i=1}^l \frac{\left(\hat{p}_i' -\bar{p}_i\right)^2}{\bar{p}_i} + n'' \sum_{i=1}^l \frac{\left(\hat{p}_i''-\bar{p}_i\right)^2}{\bar{p}_i}, \]

which is a variation on the one-sample theme presented in Section 11.4.4.

pp = (c1+c2)/(n1+n2)
T = n1 * np.sum( (p1-pp)**2 / pp ) + n2 * np.sum( (p2-pp)**2 / pp )
## 75.31373854741857

It can be shown that, if the null hypothesis is true, the test statistic approximately follows the \(\chi^2\) distribution with \(l-1\) degrees of freedom1. The critical value \(K\) is equal to:

alpha = 0.001  # significance level
scipy.stats.chi2.ppf(1-alpha, len(c1)-1)
## 18.46682695290317

As \(\hat{T} \ge K\) (because \(75.31 \ge 18.47\)), we reject the null hypothesis. And so, the age distribution differs across sexes (at significance level \(0.1\%\)).

11.4.5. Measuring Association (*)

Let us consider Australian Bureau of Statistics’ National Health Survey 2018 data on the prevalence of certain medical conditions as a function of age. Here is the extracted contingency table:

l = [
    ["Arthritis", "Asthma", "Back problems", "Cancer (malignant neoplasms)",
     "Chronic obstructive pulmonary disease", "Diabetes mellitus",
     "Heart, stroke and vascular disease", "Kidney disease",
     "Mental and behavioural conditions", "Osteoporosis"],
    ["15-44", "45-64", "65+"]
C = 1000*np.array([
    [ 360.2,     1489.0,      1772.2],
    [1069.7,      741.9,       433.7],
    [1469.6,     1513.3,       955.3],
    [  28.1,      162.7,       237.5],
    [ 103.8,      207.0,       251.9],
    [ 135.4,      427.3,       607.7],
    [  94.0,      344.4,       716.0],
    [  29.6,       67.7,       123.3],
    [2218.9,     1390.6,       725.0],
    [  36.1,      312.3,       564.7],
pd.DataFrame(C, index=l[0], columns=l[1])
##                                          15-44    45-64      65+
## Arthritis                               360000  1489000  1772000
## Asthma                                 1069000   741000   433000
## Back problems                          1469000  1513000   955000
## Cancer (malignant neoplasms)             28000   162000   237000
## Chronic obstructive pulmonary disease   103000   207000   251000
## Diabetes mellitus                       135000   427000   607000
## Heart, stroke and vascular disease       94000   344000   716000
## Kidney disease                           29000    67000   123000
## Mental and behavioural conditions      2218000  1390000   725000
## Osteoporosis                             36000   312000   564000

Cramér’s \(V\) is one of a few ways to measure the degree of association between two categorical variables. It is equal to 0 (lowest possible value) if the two variables are independent (there is no association between them) and 1 (highest possible value) if they are tied.

## 0.316237999724298

The above means that there might be a small association between age and the prevalence of certain conditions. In other words, it might be the case that some conditions are more prevalent in different age groups than others.

Given a two-way contingency table \(C\) with \(n\) rows and \(m\) columns and assuming that

\[ T = \sum_{i=1}^n \sum_{j=1}^m \frac{(c_{i,j} - e_{i,j})^2}{e_{i,j}}, \]


\[ e_{i,j} = \frac{ \sum_{k=1}^m v_{i,k} \sum_{k=1}^n c_{k,j} }{ \sum_{i=1}^n \sum_{j=1}^m c_{i,j} } \]

the Cramér coefficient is given by

\[ V = \sqrt{ \frac{T}{\min\{n-1, m-1\} \sum_{i=1}^n \sum_{j=1}^m c_{i,j}} }. \]

Here, \(c_{i,j}\) gives the actually observed counts and \(e_{i, j}\) denotes the number that we would expect if the two variables were actually independent.

Exercise 11.19

Compute the Cramér \(V\) using only numpy functions.

Exercise 11.20

(**) Actually we can easily verify the hypothesis whether \(V\) does not differ significantly from \(0\), i.e., whether the variables are independent. Looking at \(T\), we see that this is actually the test statistic in Pearson’s chi-squared goodness-of-fit test.

E = C.sum(axis=1).reshape(-1, 1) * C.sum(axis=0).reshape(1, -1) / C.sum()
T = np.sum((C-E)**2 / E)
## 3715440.465191512

If the data are really independent, \(T\) follows the chi-squared distribution \(n + m - 1\), hence the critical value is equal to

alpha = 0.001  # significance level
scipy.stats.chi2.ppf(1-alpha, C.shape[0] + C.shape[1] - 1)
## 32.90949040736021

as it is greater than \(T\), we conclude (at significance level \(0.1\%\)) that the conditions are not independent of age.

Exercise 11.21

(*) Take a look at Table 19: Comorbidity of selected chronic conditions of the National Health Survey 2018, where we clearly see that many disorders co-occur. Visualise them on some heatmaps and bar plots (including data grouped by sex and age).

11.4.6. Binned Numeric Data

Generally, modes do not work for continuous data, where repeated values are – at least theoretically – highly unlikely (unless someone does not report them with full digit precision). It might make sense to compute it on binned data, though.

Looking at a histogram, the mode is the interval corresponding to the highest bar (hopefully assuming there is only one). If we would like to obtain a single number, we can choose for example the middle of this interval as the mode.

Of course, for numeric data, the mode will heavily depend on the binning (recall that we can also apply logarithmic binning). Thus, the question “what is the most popular income” is overall quite difficult to answer.

Exercise 11.22

Compute some potentially informative modes for the 2020 UK income data. Play around with different numbers of bins for linear and logarithmic binning and see how it affects the mode.

11.4.7. Ordinal Data (*)

The case where the categories can be linearly ordered, is called ordinal data. This gives a few more options as except for the mode, thanks to the existence of order statistics, we can also easily define sample quantiles. However, the standard methods for resolving ties will not work, hence we need to be careful.

For example, median of a sample of student grades (P, P, C, D, HD) is C, but (P, P, C, D, HD, HD) is either C or D - we can choose one at random or just report that the solution is ambiguous.

Another option, of course, is to treat ordinal data as numbers (e.g., F=0, P=1, …, HD=4). In the latter example, the median would be equal to 2.5.

There are some cases, though, where the conversion of labels to consecutive integers is far from optimal – because it gives the impression that the “distance” between different levels is always equal (linear).

Exercise 11.23

(**) The grades_results represents the grades (F, P, C, D, HD) of 100 students attending an imaginary course in a virtual Australian university. You can load it in the form of an ordered categorical Series by calling:

grades = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/grades_results.txt", dtype="str")
grades = pd.Series(pd.Categorical(grades,
    categories=["F", "P", "C", "D", "HD"], ordered=True))
## 0       F
## 1       F
## 2       F
## 3       F
## 4       F
##        ..
## 118    HD
## 119    HD
## 120    HD
## 121    HD
## 122    HD
## Length: 123, dtype: category
## Categories (5, object): ['F' < 'P' < 'C' < 'D' < 'HD']

How would you determine the average grade represented as a number between 0 and 100, taking into account that for a P you need at least 50%, C is given for ≥ 60%, D for ≥ 70%, and HD for only 80% of the points. Come up with a pessimistic, optimistic, and a best-shot estimate and then compare your result to the true corresponding scores listed in the grades_scores dataset.

11.5. Exercises

Exercise 11.24

Does it make sense to compute the arithmetic mean of a categorical variable?

Exercise 11.25

Name the basic use cases for categorical data.

Exercise 11.26

(*) What is a Pareto chart?

Exercise 11.27

How to deal with the case of the mode being nonunique?

Exercise 11.28

What is the meaning of numpy.mean((x > 0) & (x < 1)), where x is a numeric vector?

Exercise 11.29

What is the meaning of the sum and mean for binary data (logical vectors)?

Exercise 11.30

List some ways to visualise multidimensional categorical data.

Exercise 11.31

(*) State the null hypotheses verified by the one- and two-sample chi-squared tests.

Exercise 11.32

(*) How is Cramér’s V defined and what values does it take?


Notice that [PTVF07] in Section 14.3 suggests \(l\) degrees of freedom, but do not agree with the reasoning therein. Also, simple Monte Carlo simulations suggests that \(l-1\) is a better candidate.