11. Handling Categorical Data
So far, we have been mostly dealing with quantitative (numeric) data: real numbers on which we can apply various mathematical operations, such as computing the arithmetic mean or taking the square. Not every transformation makes sense in every context (e.g., multiplying temperatures: what does it mean to say that it is twice as hot today as compared to yesterday?), but still, the possibilities were plenty.
Qualitative data (also known as categorical data, factors, or enumerated types), on the other hand, take a small number of unique values and support a very limited set of admissible operations. Usually, we can only determine whether two entities are equal to each other or not (think: eye colour, blood type, or a flag indicating whether a patient is ill).
In datasets involving many features, which we shall cover in Chapter 12, categorical variables are often used for observation grouping (e.g., so that we can compute the best and average time for marathoners in each age category or draw box plots for finish times of men and women separately). Also, they may serve as target variables in statistical classification tasks (e.g., so that we can determine if an email is “spam” or “not spam”).
Sometimes we might additionally be able to rank the observations (Australian school grades are linearly ordered like F (fail) < P (pass) < C (credit) < D (distinction) < HD (high distinction), some questionnaires use Likert-type scales such as “strongly disagree” < “disagree” < “neutral” < “agree” < “strongly agree”, etc.).
11.1. Representing and Generating Categorical Data
Common ways to represent a categorical variable with \(l\) distinct levels \(\{L_1,L_2,\dots,L_l\}\) include storing it as:
a vector of strings,
a vector of integers between 0 (inclusive) and l (exclusive).
These two are easily interchangeable.
Furthermore, for \(l=2\) (binary data), another convenient representation is by means of logical vectors. This can be extended to a so-called one-hot encoded representation using a logical vector of length l.
Let us consider the data on the original whereabouts of the top 16 marathoners (the 37th PZU Warsaw Marathon dataset):
marathon = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching_data/master/marek/37_pzu_warsaw_marathon_simplified.csv",
comment="#")
cntrs = np.array(marathon.loc[:, "country"], dtype="str")
cntrs16 = cntrs[:16]
cntrs16
## array(['KE', 'KE', 'KE', 'ET', 'KE', 'KE', 'ET', 'MA', 'PL', 'PL', 'IL',
## 'PL', 'KE', 'KE', 'PL', 'PL'], dtype='<U2')
These are two-letter ISO 3166 country codes,
encoded of course as strings (notice the dtype="str"
argument).
Calling numpy.unique allows us to determine the set of distinct categories:
cat_cntrs16 = np.unique(cntrs16)
cat_cntrs16
## array(['ET', 'IL', 'KE', 'MA', 'PL'], dtype='<U2')
Hence, cntrs16 is a categorical vector of length \(n=16\) (len(cntrs16)), with data assuming one of \(l=5\) different levels (len(cat_cntrs16)).
Important
numpy.unique sorts the distinct values lexicographically. In other words, they are not listed in the order of appearance, which might be something desirable in certain contexts.
11.1.1. Encoding and Decoding Factors
In order to encode a label vector through a set of consecutive nonnegative
integers, we pass the return_inverse=True
argument
to numpy.unique:
cat_cntrs16, codes_cntrs16 = np.unique(cntrs16, return_inverse=True)
cat_cntrs16
## array(['ET', 'IL', 'KE', 'MA', 'PL'], dtype='<U2')
codes_cntrs16
## array([2, 2, 2, 0, 2, 2, 0, 3, 4, 4, 1, 4, 2, 2, 4, 4])
The code sequence 2, 2, 2, 0, … corresponds to the 3rd, 3rd, 3rd, 1st, … level in cat_cntrs16, i.e., Kenya, Kenya, Kenya, Ethiopia, ….
The values between \(0\) and \(l-1=4\) can be used to index a given array of length \(l=5\). Hence, in order to decode our factor, we can write:
cat_cntrs16[codes_cntrs16]
## array(['KE', 'KE', 'KE', 'ET', 'KE', 'KE', 'ET', 'MA', 'PL', 'PL', 'IL',
## 'PL', 'KE', 'KE', 'PL', 'PL'], dtype='<U2')
We can use any other set of labels now:
np.array(["Ethiopia", "Israel", "Kenya", "Morocco", "Poland"])[codes_cntrs16]
## array(['Kenya', 'Kenya', 'Kenya', 'Ethiopia', 'Kenya', 'Kenya',
## 'Ethiopia', 'Morocco', 'Poland', 'Poland', 'Israel', 'Poland',
## 'Kenya', 'Kenya', 'Poland', 'Poland'], dtype='<U8')
This is an instance of the recoding of a categorical variable.
Important
The fact that we can represent categorical variables using a set of integers does not make them instances of a quantitative type. Arithmetic operations on such codes do not really make sense.
Note
When we represent categorical data using numeric codes, it is possible to introduce non-occurring levels. Such information can be useful, e.g., we could explicitly indicate that there were no runners from Australia in the top 16.
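For instance, a minimal sketch using pandas.Categorical (discussed in more detail in the next subsection) with an explicitly extended level set that includes Australia:
cntrs16_with_au = pd.Categorical(
    cntrs16, categories=["AU", "ET", "IL", "KE", "MA", "PL"])
np.unique(cntrs16_with_au.codes)  # 0, the code for "AU", never occurs
## array([1, 2, 3, 4, 5], dtype=int8)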
(**) Determine the set of unique values in cntrs16 in the order of appearance (and not sorted lexicographically). Then, encode cntrs16 using this level set. Hint: check out the return_index argument to numpy.unique and the numpy.searchsorted function.
11.1.2. Categorical Data in pandas
pandas includes
a special dtype
for storing categorical data.
Namely, we can write:
cntrs16_series = marathon.iloc[:16, :].loc[:, "country"].astype("category")
cntrs16_series
## 0 KE
## 1 KE
## 2 KE
## 3 ET
## 4 KE
## 5 KE
## 6 ET
## 7 MA
## 8 PL
## 9 PL
## 10 IL
## 11 PL
## 12 KE
## 13 KE
## 14 PL
## 15 PL
## Name: country, dtype: category
## Categories (5, object): ['ET', 'IL', 'KE', 'MA', 'PL']
or, equivalently in our case, pd.Series(cntrs16, dtype="category"). This yields a Series object displayed as if it were represented using string labels; in fact, however, it is encoded using the numeric representation. This can be revealed by accessing:
cntrs16_series.cat.codes.to_numpy()
## array([2, 2, 2, 0, 2, 2, 0, 3, 4, 4, 1, 4, 2, 2, 4, 4], dtype=int8)
cntrs16_series.cat.categories
## Index(['ET', 'IL', 'KE', 'MA', 'PL'], dtype='object')
exactly matching what we have obtained with numpy.unique.
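Conversely, a minimal sketch showing that the integer codes and the level set obtained earlier from numpy.unique can be turned into a pandas categorical directly:
pd.Series(pd.Categorical.from_codes(codes_cntrs16, categories=cat_cntrs16))
# an object equivalent to cntrs16_series above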
Most often, however, categorical data in data frames will be stored
as ordinary strings.
11.1.3. Binary Data and Logical Vectors
Binary data is a special case of the qualitative setting, where we only have two categories. For convenience, we usually encode the two classes as integers: 0 (or logical False, e.g., healthy/fail/off/non-spam/absent/…) and 1 (or True, e.g., ill/success/on/spam/present/…).
Important
When converting logical values to numeric ones, False becomes 0 and True becomes 1. Conversely, 0 is converted to False and anything else (including -0.326) to True.
For example:
np.array([True, False, True, True, False]).astype(int)
## array([1, 0, 1, 1, 0])
The other way around:
np.array([-2, -0.326, -0.000001, 0.0, 0.1, 1, 7643]).astype(bool)
## array([ True, True, True, False, True, True, True])
or, equivalently:
np.array([-2, -0.326, -0.000001, 0.0, 0.1, 1, 7643]) != 0
## array([ True, True, True, False, True, True, True])
Given a numeric vector x, create a vector of the same length as x whose i-th element is equal to "yes" if x[i] is in the unit interval and to "no" otherwise. Use numpy.where, which can act as a vectorised version of the if statement.
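To get a feel for numpy.where before attempting the exercise, here is a minimal sketch based on a different, simpler condition:
x = np.array([-0.5, 0.25, 3.0])
np.where(x > 0, "positive", "nonpositive")
# -> array(['nonpositive', 'positive', 'positive'], ...)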
11.1.4. One-Hot Encoding (*)
Let \(\boldsymbol{x}\) be a vector of \(n\) integers in \(\{0,...,l-1\}\). Its one-hot encoded version is a 0-1 (or, equivalently, logical) matrix \(\mathbf{R}\) of shape \(n\) by \(l\) such that \(r_{i,j}=1\) if and only if \(x_i = j\).
For example, if \(\boldsymbol{x}=(0, 1, 2, 1)\) and \(l=4\), then:
\[
\mathbf{R} =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
\end{bmatrix}.
\]
Such a representation is useful when solving a multiclass classification problem by means of l binary classifiers. For example, if spam, bacon, and hot dogs are on the menu, then spam is encoded as \((1, 0, 0)\), i.e., yeah-spam, nah-bacon, and nah-hot dog. We can build three binary classifiers, each specialising in telling whether what it encounters is a given food or something else.
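For instance, a minimal sketch constructing \(\mathbf{R}\) for the example above (the exercises below ask for general-purpose functions):
x = np.array([0, 1, 2, 1])
l = 4
R = (x.reshape(-1, 1) == np.arange(l)).astype(int)  # compare each code with 0, ..., l-1
R
## array([[1, 0, 0, 0],
##        [0, 1, 0, 0],
##        [0, 0, 1, 0],
##        [0, 1, 0, 0]])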
Write a function to one-hot encode a given categorical vector.
Write a function to decode a one-hot encoded matrix.
11.1.5. Binning Numeric Data (Revisited)
Numerical data can be converted to categorical via binning (quantisation). This results in a loss of information (precision); however, it also opens some new possibilities. In fact, this is what we needed to do in order to draw all the histograms in Chapter 4. Also, reporting observation counts in each bin instead of the raw data enables us to include them in printed reports (in the form of tables).
Note
As strong proponents of openness and transparency, we always encourage all entities (governments, universities, not-for-profits, corporations, etc.) to share (e.g., under the Creative Commons CC-BY-SA-4.0 license) unabridged versions of their datasets to enable public scrutiny and to make the most of the possibilities they can bring for the public good.
Of course, sometimes the sharing of unprocessed information can violate the privacy of the subjects. In such a case, it might be a good idea to communicate them in a binned form.
Consider the 16 best marathon finish times (in minutes):
mins = marathon.loc[:, "mins"].to_numpy()
mins16 = mins[:16]
mins16
## array([129.32, 130.75, 130.97, 134.17, 134.68, 135.97, 139.88, 143.2 ,
## 145.22, 145.92, 146.83, 147.8 , 149.65, 149.88, 152.65, 152.88])
numpy.searchsorted can be used to determine the interval where each value in mins16 falls.
bins = [130, 140, 150]
codes_mins16 = np.searchsorted(bins, mins16)
codes_mins16
## array([0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3])
By default, the intervals are of the form (a, b] (not including a, including b). Code 0 corresponds to values no greater than the first bin bound, whereas code 3 to those greater than the last one.
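As a side note, a minimal sketch showing how to obtain left-closed intervals of the form [a, b) instead (here the result happens to be identical, as no observation is exactly equal to any of the bin bounds):
np.searchsorted(bins, mins16, side="right")  # [a, b)-type intervals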
pandas.cut is another interface to the same binning method. It returns a vector-like object of dtype "category", with very readable labels generated automatically (and ordered; see Section 11.4.7):
cut_mins16 = pd.Series(pd.cut(mins16, [-np.inf, 130, 140, 150, np.inf]))
cut_mins16
## 0 (-inf, 130.0]
## 1 (130.0, 140.0]
## 2 (130.0, 140.0]
## 3 (130.0, 140.0]
## 4 (130.0, 140.0]
## 5 (130.0, 140.0]
## 6 (130.0, 140.0]
## 7 (140.0, 150.0]
## 8 (140.0, 150.0]
## 9 (140.0, 150.0]
## 10 (140.0, 150.0]
## 11 (140.0, 150.0]
## 12 (140.0, 150.0]
## 13 (140.0, 150.0]
## 14 (150.0, inf]
## 15 (150.0, inf]
## dtype: category
## Categories (4, interval[float64, right]): [(-inf, 130.0] < (130.0, 140.0] < (140.0, 150.0] <
## (150.0, inf]]
cut_mins16.cat.categories.astype("str")
## Index(['(-inf, 130.0]', '(130.0, 140.0]', '(140.0, 150.0]',
## '(150.0, inf]'],
## dtype='object')
(*) Check out the numpy.histogram_bin_edges function which tries to determine some informative interval bounds automatically based on a range of simple heuristics. Also, numpy.linspace and numpy.geomspace, which we have covered in Chapter 4, can be useful for generating equidistant bounds on linear and logarithmic scale, respectively.
(**) We can create a set of the corresponding categories manually, for example, as follows:
bins2 = np.r_[-np.inf, bins, np.inf]
cat_mins16 = np.array(
[f"({bins2[i-1]}, {bins2[i]}]" for i in range(1, len(bins2))]
)
cat_mins16
## array(['(-inf, 130.0]', '(130.0, 140.0]', '(140.0, 150.0]',
## '(150.0, inf]'], dtype='<U14')
Recall from Section 5.6.3 that list comprehensions are a convenient substitute for a for loop and the list.append method. Recoding based on the above yields:
cat_mins16[codes_mins16]
## array(['(-inf, 130.0]', '(130.0, 140.0]', '(130.0, 140.0]',
## '(130.0, 140.0]', '(130.0, 140.0]', '(130.0, 140.0]',
## '(130.0, 140.0]', '(140.0, 150.0]', '(140.0, 150.0]',
## '(140.0, 150.0]', '(140.0, 150.0]', '(140.0, 150.0]',
## '(140.0, 150.0]', '(140.0, 150.0]', '(150.0, inf]',
## '(150.0, inf]'], dtype='<U14')
11.1.6. Generating Pseudorandom Labels
It is worth knowing that numpy.random.choice allows us to create a pseudorandom sample with categories picked with any probabilities:
np.random.seed(123)
np.random.choice(
    a=["spam", "bacon", "eggs", "tempeh"],
    p=[ 0.7, 0.1, 0.15, 0.05],
    replace=True,
    size=16
)
## array(['spam', 'spam', 'spam', 'spam', 'bacon', 'spam', 'tempeh', 'spam',
## 'spam', 'spam', 'spam', 'bacon', 'spam', 'spam', 'spam', 'bacon'],
## dtype='<U6')
Hence, if we generate a sufficiently large sample, we will expect "spam" to occur in ca. 70% of the cases and "tempeh" to be drawn in 5% of them, etc.
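To convince ourselves empirically, here is a quick sanity check (a sketch; the exact counts will vary from run to run):
big = np.random.choice(
    a=["spam", "bacon", "eggs", "tempeh"],
    p=[ 0.7, 0.1, 0.15, 0.05],
    replace=True,
    size=10000
)
cats, counts = np.unique(big, return_counts=True)
dict(zip(cats, counts / len(big)))  # proportions roughly 0.1, 0.15, 0.7, 0.05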
11.2. Frequency Distributions
11.2.1. Counting
For arbitrary categorical data, we can call:
cat_cntrs16, counts_cntrs16 = np.unique(cntrs16, return_counts=True)
cat_cntrs16, counts_cntrs16
## (array(['ET', 'IL', 'KE', 'MA', 'PL'], dtype='<U2'), array([2, 1, 7, 1, 5]))
to get both the set of unique categories and the corresponding number of occurrences. For instance, there were 7 runners from Kenya amongst the top 16.
If we already have an array of integer codes between \(0\) and \(l-1\), there is no need to call numpy.unique, as numpy.bincount can return the number of times each code appears therein.
np.bincount(codes_cntrs16)
## array([2, 1, 7, 1, 5])
Of course, a vector of counts can easily be turned into a vector of proportions (fractions):
counts_cntrs16 / np.sum(counts_cntrs16)
## array([0.125 , 0.0625, 0.4375, 0.0625, 0.3125])
Hence, 31.25% of the top runners were from Poland (it is a marathon in Warsaw after all…). We can of course multiply the above by 100 to get the percentages.
We can output a nice frequency table by storing both objects
in a single data frame (or a Series
object with an appropriate
vector of row labels):
table = pd.DataFrame({
    "Country": cat_cntrs16,
    "%": 100 * counts_cntrs16 / np.sum(counts_cntrs16)
})
table
## Country %
## 0 ET 12.50
## 1 IL 6.25
## 2 KE 43.75
## 3 MA 6.25
## 4 PL 31.25
(*) In Section 14.3 we will discuss IPython.display.Markdown as a means to embed arbitrary Markdown code inside IPython/Jupyter reports. In the meantime, let us just note that nicely formatted tables can be created from data frames by calling:
import IPython.display
IPython.display.Markdown(table.to_markdown(index=False))
| Country |     % |
|:--------|------:|
| ET      | 12.5  |
| IL      |  6.25 |
| KE      | 43.75 |
| MA      |  6.25 |
| PL      | 31.25 |
pandas.Series.value_counts is even more convenient, as it returns a Series object equipped with a readable index (element labels):
marathon.iloc[:16, :].loc[:, "country"].value_counts()
## KE 7
## PL 5
## ET 2
## MA 1
## IL 1
## Name: country, dtype: int64
By default, data are ordered with respect to the counts, decreasingly.
In Chapter 4
we have mentioned the numpy.histogram function which applies the
binning of a numeric vector and then counts the number
of occurrences.
This is merely a helper function: the same result can be obtained
by means of the more basic
numpy.searchsorted, numpy.bincount,
and numpy.histogram_bin_edges.
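For instance, a minimal sketch reproducing numpy.histogram on mins16 (assuming all observations fall within the computed edges, which holds here by construction):
edges = np.histogram_bin_edges(mins16, bins=4)
# side="right" mimics numpy.histogram's [a, b) intervals; the maximum value
# lands in the last bin, which numpy.histogram treats as closed on both sides
codes = np.searchsorted(edges[1:-1], mins16, side="right")
counts = np.bincount(codes, minlength=len(edges) - 1)
counts_ref, _ = np.histogram(mins16, bins=edges)
np.all(counts == counts_ref)  # hopefully True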
Apply numpy.histogram
on the whole 37_pzu_warsaw_marathon_mins
dataset.
Using numpy.argsort, sort counts_cntrs16 increasingly together with the corresponding items in cat_cntrs16.
11.2.2. Two-Way Contingency Tables: Factor Combinations
Some datasets may feature many categorical columns,
each having possibly different levels.
Let us now consider the whole marathon
dataset:
marathon.loc[:, "age"] = marathon.category.str.slice(1)  # skip the first character (sex), keep the age bracket
marathon.loc[marathon.age >= "60", "age"] = "60+" # too few runners aged 70+
marathon = marathon.loc[:, ["sex", "age", "country"]]
marathon.head()
## sex age country
## 0 M 20 KE
## 1 M 20 KE
## 2 M 20 KE
## 3 M 20 ET
## 4 M 30 KE
The three columns are: sex, age (in 10-year brackets), and country. We can of course analyse the data distribution in each column individually; however, some interesting patterns might also arise when we study the combinations of levels of different variables.
Here are the levels of the sex and age variables:
np.unique(marathon.loc[:, "sex"])
## array(['F', 'M'], dtype=object)
np.unique(marathon.loc[:, "age"])
## array(['20', '30', '40', '50', '60+'], dtype=object)
We thus have 10 different possible combinations thereof.
A two-way contingency table is a matrix which gives the number of occurrences of each pair of values:
v = marathon.loc[:, ["sex", "age"]].value_counts().unstack(fill_value=0)
v
## age 20 30 40 50 60+
## sex
## F 240 449 262 43 19
## M 879 2200 1708 541 170
Hence, for example, there were 19 women aged 60 and over amongst the marathoners. Nice.
The marginal (one dimensional) frequency distributions can be recreated by computing the rowwise and columnwise sums.
np.sum(v, axis=1)
## sex
## F 1013
## M 5498
## dtype: int64
np.sum(v, axis=0)
## age
## 20 1119
## 30 2649
## 40 1970
## 50 584
## 60+ 189
## dtype: int64
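As a side note, a minimal sketch showing how the above contingency table can be converted to within-sex percentages (each row then sums to 100%):
(100 * v.div(np.sum(v, axis=1), axis=0)).round(1)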
11.2.3. Combinations of Even More Factors
pandas.DataFrame.value_counts can also be used with a combination of more than 2 categorical variables:
marathon.loc[:, ["sex", "age", "country"]].value_counts().\
rename("count").reset_index()
## sex age country count
## 0 M 30 PL 2081
## 1 M 40 PL 1593
## 2 M 20 PL 824
## 3 M 50 PL 475
## 4 F 30 PL 422
## .. .. ... ... ...
## 189 M 30 SI 1
## 190 F 60+ EE 1
## 191 F 60+ FI 1
## 192 M 30 PE 1
## 193 F 20 BE 1
##
## [194 rows x 4 columns]
Of course, the display will be in the long format (compare Section 10.6.2) here, as high-dimensional arrays are not nicely printable.
11.3. Visualising Factors
Methods for visualising categorical data are by no means fascinating (unless we use them as grouping variables in more complex datasets, but this is a topic that we cover in Chapter 12).
11.3.1. Bar Plots
Bar plots are self-explanatory and hence will do the trick most of the time, see Figure 11.1.
ind = np.arange(len(cat_cntrs16))
plt.bar(ind, height=counts_cntrs16,
color="lightgray", edgecolor="black", alpha=0.8)
plt.xticks(ind, cat_cntrs16)
plt.show()

Figure 11.1 Bar plot for the top 16 marathoners’ countries
The ind
vector gives the x-coordinates of the bars,
here: consecutive integers.
By calling matplotlib.pyplot.xticks
we assign them readable labels.
Assign a different colour to each bar.
Draw a bar plot which features percentages instead of counts, so that the total bar height is 100%.
Print the frequency table and draw a bar plot for the top 100 marathoners’ countries.
(*) Print the frequency table and draw a bar plot for all the marathoners (not just the elite ones), ordered from the highest to the lowest counts. There will be too many bars; therefore, replace the last few with a single one labelled “All other”.
A bar plot is a versatile tool for visualising the counts also in the two-variable case, see Figure 11.2:
v = marathon.loc[:, ["sex", "age"]].value_counts().\
rename("count").reset_index()
sns.barplot(x="age", hue="sex", y="count", data=v)
plt.show()

Figure 11.2 Number of runners by age category and sex
(*) Draw a similar chart using matplotlib.pyplot.bar.
(**) Create a stacked bar plot similar to the one in Figure 11.3, where we have horizontal bars for data that have been normalised so that for each sex their sum is 100%.

Figure 11.3 Example stacked bar plot: Age distribution for different sexes amongst all the runners
11.3.2. Political Marketing and Statistics
Even such a simple plot can be manipulated. For example, presidential elections were held in Poland in 2020. In the second round, Andrzej Duda won against Rafał Trzaskowski. In Figure 11.4 we have the official results as they might be presented by one of the infamous Polish TV conglomerates:
plt.bar([1, 2], height=[51.03, 48.97], width=0.25,
color="lightgray", edgecolor="black", alpha=0.8)
plt.xticks([1, 2], ["Duda", "Trzaskowski"])
plt.ylabel("%")
plt.xlim(0, 3)
plt.ylim(48.9, 51.1)
plt.show()

Figure 11.4 Flawless victory!
Such a great victory! Wait… it was a close vote after all! We should just take a look at the y-axis tick marks.
Another media outlet could have reported it like in Figure 11.5:
plt.bar([1, 2], height=[51.03, 48.97], width=0.25,
color="lightgray", edgecolor="black", alpha=0.8)
plt.xticks([1, 2], ["Duda", "Trzaskowski"])
plt.ylabel("%")
plt.xlim(0, 3)
plt.ylim(0, 250)
plt.yticks([0, 100])
plt.show()

Figure 11.5 It was a draw, so close!
The moral of the story is:
Important
Always read the y-axis tick marks. And when drawing your own bar plots, do not trick the reader; this is unethical.
11.3.3. Pie Cha… Don’t Even Trip
We are definitely not going to discuss the infamous pie charts, because their use in data analysis has been widely criticised for a long time (it is difficult to judge the ratios of areas of their slices). Case closed. Good morning.
11.3.4. Pareto Charts (*)
As a general (empirical) rule, most instances of something’s happening (usually 70–90%) are due to only a few causes (10–30%). This is known as the Pareto rule (with 80% vs 20% being an often-cited rule of thumb).
In Chapter 6 we modelled the US cities’ population dataset using the Pareto distribution (the very same Pareto, but a different, yet related object). We discovered that only ca. 14% of the settlements (those with 10,000 or more inhabitants) are home to as much as 84% of the population. Hence, we may say that this data domain follows the Pareto rule.
Here is a dataset fabricated by the Clinical Excellence Commission in New South Wales, Australia, listing the most frequent causes for medication errors:
cat_med = np.array([
    "Unauthorised drug", "Wrong IV rate", "Wrong patient", "Dose missed",
    "Underdose", "Wrong calculation", "Wrong route", "Wrong drug",
    "Wrong time", "Technique error", "Duplicated drugs", "Overdose"
])
counts_med = np.array([1, 4, 53, 92, 7, 16, 27, 76, 83, 3, 9, 59])
np.sum(counts_med) # total number of medication errors
## 430
Let us order the dataset with respect to the counts, decreasingly:
o = np.argsort(counts_med)[::-1] # ordering permutation (decreasing)
cat_med = cat_med[o] # order categories based on counts
counts_med = counts_med[o] # equivalent to np.sort(counts_med)[::-1]
pd.DataFrame(dict(
category=cat_med,
count=counts_med
)) # nicer display
## category count
## 0 Dose missed 92
## 1 Wrong time 83
## 2 Wrong drug 76
## 3 Overdose 59
## 4 Wrong patient 53
## 5 Wrong route 27
## 6 Wrong calculation 16
## 7 Duplicated drugs 9
## 8 Underdose 7
## 9 Wrong IV rate 4
## 10 Technique error 3
## 11 Unauthorised drug 1
Pareto charts are tools which may aid in visualising the Pareto-ruled datasets. They are based on bar plots, but feature some extras:
bars are listed in a decreasing order,
the cumulative percentage curve is added.
The plotting of the Pareto chart is a little tricky, because it involves using two different y-axes (as usual, fine-tuning of the figure and studying the matplotlib manual is left as an exercise).
x = np.arange(len(cat_med)) # 0, 1, 2, ...
p = 100.0*counts_med/np.sum(counts_med) # percentages
fig, ax1 = plt.subplots()
ax1.set_xticks(x-0.5, cat_med, rotation=60)
ax1.set_ylabel("%")
ax1.bar(x, height=p)
ax2 = ax1.twinx() # creates a new coordinate system with a shared x-axis
ax2.plot(x, np.cumsum(p), "ro-")
ax2.grid(visible=False)
ax2.set_ylabel("cumulative %")
fig.tight_layout()
plt.show()

Figure 11.6 The most frequent causes for medication errors
From Figure 11.6, we can read that the top 5 causes (out of the 12 listed) correspond to ca. 85% of the medication errors. More precisely,
pd.DataFrame({
"category": cat_med,
"cumulative %": np.round(np.cumsum(p), 1)
})
## category cumulative %
## 0 Dose missed 21.4
## 1 Wrong time 40.7
## 2 Wrong drug 58.4
## 3 Overdose 72.1
## 4 Wrong patient 84.4
## 5 Wrong route 90.7
## 6 Wrong calculation 94.4
## 7 Duplicated drugs 96.5
## 8 Underdose 98.1
## 9 Wrong IV rate 99.1
## 10 Technique error 99.8
## 11 Unauthorised drug 100.0
Note that there is an explicit assumption here that a single error is only due to a single cause. Also, we presume that each medication error has a similar degree of severity.
Policy makers and quality controllers often rely on such simplifications, therefore they most probably are going to be addressing only the top causes. If we have ever wondered why some processes (mal)function the way they do, there is a hint above. However, coming up with something more effective yet so simple at the same time requires much more effort.
11.3.5. Heat Maps
Two-way contingency tables can be depicted by means of a heatmap, where each count affects the corresponding cell’s colour intensity, see Figure 11.7.
from matplotlib import cm
v = marathon.loc[:, ["sex", "age"]].value_counts().unstack(fill_value=0)
sns.heatmap(v, annot=True, fmt="d", cmap=cm.get_cmap("copper"))
plt.show()

Figure 11.7 Heatmap for the marathoners’ sex and age category
11.4. Aggregating and Comparing Factors
11.4.1. A Mode
As we have already said, the only operation on categorical data that we can rely on is counting (because we have an equivalence relation on the set of labels and nothing more). Therefore, as far as qualitative data aggregation is concerned, what we are left with is the mode, i.e., the most frequently occurring value.
cat_cntrs16, counts_cntrs16 = np.unique(cntrs16, return_counts=True)
cat_cntrs16[np.argmax(counts_cntrs16)]
## 'KE'
Recall that if i is numpy.argmax(counts) (argument maximum, i.e., where is it?), then counts[i] is the same as numpy.max(counts) and cat[i] is the category with the greatest count.
Important
A mode might be ambiguous.
For instance, amongst the fastest 22 runners, there is a tie between Kenya and Poland – both meet our definition of a mode:
cat_cntrs22, counts_cntrs22 = np.unique(cntrs[:22], return_counts=True)
cat_cntrs22[np.where(counts_cntrs22 == np.max(counts_cntrs22))]
## array(['KE', 'PL'], dtype='<U2')
To avoid any bias, it would be best to report both of them as potential mode candidates. Alternatively, we can pick one at random (calling numpy.random.choice).
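As a side note, a minimal sketch: pandas.Series.mode lists all the modal values at once, which makes the ambiguity explicit.
pd.Series(cntrs[:22]).mode()  # should list both "KE" and "PL"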
11.4.2. Binary Data as Logical Vectors
Perhaps the most useful arithmetic operation on logical vectors is the sum.
cntrs16 = cntrs[:16]
cntrs16 # recall
## array(['KE', 'KE', 'KE', 'ET', 'KE', 'KE', 'ET', 'MA', 'PL', 'PL', 'IL',
## 'PL', 'KE', 'KE', 'PL', 'PL'], dtype='<U2')
np.sum(cntrs16 == "PL")
## 5
is the number of elements in cntrs16 that are equal to "PL" (because the sum of 0s and 1s is equal to the number of 1s in the sequence). Note that (cntrs16 == "PL") is a logical vector that represents a binary categorical variable with levels: not-Poland (False) and Poland (True).
If we divide the above result by the length of the vector, we will get the proportion:
np.mean(cntrs16 == "PL")
## 0.3125
Hence, 31.25% of the top 16 runners are from Poland.
What is the meaning of numpy.all, numpy.any, numpy.min, numpy.max, numpy.cumsum, and numpy.cumprod applied on logical vectors?
Note
(**) Having the 0/1 (or zero/nonzero) vs False/True correspondence allows us to perform some logical operations using integer arithmetic. In particular, assuming that p and q are logical values and a and b are numeric ones, we have what follows:
p+q != 0 means that at least one value is True, and p+q == 0 if and only if both are False;
more generally, p+q == 2 if both elements are True, p+q == 1 if only one is True (we call it exclusive-or, XOR), and p+q == 0 if both are False;
p*q != 0 means that both values are True, and p*q == 0 holds whenever at least one is False;
1-p corresponds to the negation of p;
p*a + (1-p)*b is equal to a if p is True and equal to b otherwise.
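For instance, a minimal sketch illustrating the last identity (a vectorised if-then-else):
p = np.array([True, False, True])
a, b = 10.0, -1.0
p*a + (1-p)*b  # selects a where p is True and b where p is False
## array([10., -1., 10.])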
11.4.3. Pearson’s Chi-Squared Test (*)
The Kolmogorov–Smirnov test that we described in Section 6.2.3 verifies whether a given sample differs significantly from a hypothesised continuous distribution, i.e., it works for numeric data.
For binned/categorical data, we can use a classical and easy-to-understand test developed by Karl Pearson in 1900. It is supposed to judge whether the difference between the observed proportions \(\hat{p}_1,\dots,\hat{p}_l\) and the theoretical ones \(p_1,\dots,p_l\) is significantly large or not:
\[
H_0:\ \hat{p}_i = p_i\ \text{for all}\ i=1,\dots,l
\qquad\text{vs.}\qquad
H_1:\ \hat{p}_i \neq p_i\ \text{for some}\ i.
\]
Having such a test is beneficial, e.g., when the data we have at hand are based on small surveys that are supposed to serve as estimates of what might be happening in a larger population.
Going back to our political example from Section 11.3.2, it turns out that one of the pre-election polls indicated that \(c=516\) out of \(n=1017\) people would vote for the first candidate. We have \(\hat{p}_1=50.74\%\) (Duda) and \(\hat{p}_2=49.26\%\) (Traskowski). If we would like to test whether the observed proportions are significantly different from each other, we could test them against the theoretical distribution \(p_1=50\%\) and \(p_2=50\%\), stating that there is a tie between the competitors (up to a sampling error).
A natural test statistic is based on the relative squared differences:
\[
\hat{T} = n \sum_{i=1}^{l} \frac{(\hat{p}_i - p_i)^2}{p_i}.
\]
c, n = 516, 1017
p_observed = np.array([c, n-c]) / n
p_expected = np.array([0.5, 0.5])
T = n * np.sum( (p_observed-p_expected)**2 / p_expected )
T
## 0.2212389380530986
Similarly to the continuous case in Section 6.2.3, we should reject the null hypothesis if
\[
\hat{T} \ge K,
\]
where the critical value \(K\) is based on the fact that, if the null hypothesis is true, \(\hat{T}\) follows the \(\chi^2\) (chi-squared, hence the name of the test) distribution with \(l-1\) degrees of freedom; see scipy.stats.chi2.
alpha = 0.001 # significance level
scipy.stats.chi2.ppf(1-alpha, len(p_observed)-1)
## 10.827566170662733
As \(\hat{T} < K\) (because \(0.22 < 10.83\)), we cannot deem the two proportions significantly different from each other. In other words, this poll did not indicate (at significance level \(0.1\%\)) any of the candidates as a clear winner.
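For the record, a minimal sketch showing that scipy also implements this test directly (the reported statistic should match \(\hat{T}\) above):
scipy.stats.chisquare([c, n-c], f_exp=[n/2, n/2])  # statistic and p-value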
Determine the smallest \(c\), i.e., the number of respondents indicating they would vote for Duda, that leads to the rejection of the null hypothesis.
11.4.4. Two-Sample Pearson’s Chi-Squared Test (*)
Let us consider the data depicted in Figure 11.3 and test whether the runners’ age distributions differ significantly between females and males.
We have \(l=5\) categories. First, denote the total number of observations in both groups with \(n'\) and \(n''\).
d = marathon.loc[:, ["sex", "age"]].value_counts().unstack(fill_value=0)
c1, c2 = d.to_numpy() # first row, second row
n1 = c1.sum()
n2 = c2.sum()
n1, n2
## (1013, 5498)
The observed proportions in the first group (females), denoted as \(\hat{p}_1',\dots,\hat{p}_l'\), are, respectively:
p1 = c1/n1
p1
## array([0.23692004, 0.44323791, 0.25863771, 0.04244817, 0.01875617])
Here are the proportions in the second group (males), \(\hat{p}_1'',\dots,\hat{p}_l''\):
p2 = c2/n2
p2
## array([0.15987632, 0.40014551, 0.31065842, 0.09839942, 0.03092033])
We would like to verify whether the corresponding proportions are equal to each other (up to some sampling error):
\[
H_0:\ \hat{p}_i' = \hat{p}_i''\ \text{for all}\ i=1,\dots,l
\qquad\text{vs.}\qquad
H_1:\ \hat{p}_i' \neq \hat{p}_i''\ \text{for some}\ i.
\]
In other words, we are interested whether the categorical data in the two groups come from the same discrete probability distribution.
Taking the estimated expected proportions,
\[
\bar{p}_i = \frac{n' \hat{p}_i' + n'' \hat{p}_i''}{n' + n''},
\]
for all \(i=1,\dots,l\), the test statistic this time is equal to:
\[
\hat{T} = n' \sum_{i=1}^{l} \frac{(\hat{p}_i' - \bar{p}_i)^2}{\bar{p}_i}
        + n'' \sum_{i=1}^{l} \frac{(\hat{p}_i'' - \bar{p}_i)^2}{\bar{p}_i},
\]
which is a variation on the one-sample theme presented in Section 11.4.3.
pp = (c1+c2)/(n1+n2)
T = n1 * np.sum( (p1-pp)**2 / pp ) + n2 * np.sum( (p2-pp)**2 / pp )
T
## 75.31373854741857
It can be shown that, if the null hypothesis is true, the test statistic approximately follows the \(\chi^2\) distribution with \(l-1\) degrees of freedom. The critical value \(K\) is equal to:
alpha = 0.001 # significance level
scipy.stats.chi2.ppf(1-alpha, len(c1)-1)
## 18.46682695290317
As \(\hat{T} \ge K\) (because \(75.31 \ge 18.47\)), we reject the null hypothesis. And so, the age distribution differs across sexes (at significance level \(0.1\%\)).
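For the record, a minimal sketch (assuming the contingency table d defined above): scipy.stats.chi2_contingency applied to the 2-by-5 table should report the same statistic (ca. 75.31), the corresponding p-value, and (2-1)*(5-1) = 4 degrees of freedom.
scipy.stats.chi2_contingency(d.to_numpy())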
11.4.5. Measuring Association (*)
Let us consider Australian Bureau of Statistics’ National Health Survey 2018 data on the prevalence of certain medical conditions as a function of age. Here is the extracted contingency table:
l = [
["Arthritis", "Asthma", "Back problems", "Cancer (malignant neoplasms)",
"Chronic obstructive pulmonary disease", "Diabetes mellitus",
"Heart, stroke and vascular disease", "Kidney disease",
"Mental and behavioural conditions", "Osteoporosis"],
["15-44", "45-64", "65+"]
]
C = 1000*np.array([
[ 360.2, 1489.0, 1772.2],
[1069.7, 741.9, 433.7],
[1469.6, 1513.3, 955.3],
[ 28.1, 162.7, 237.5],
[ 103.8, 207.0, 251.9],
[ 135.4, 427.3, 607.7],
[ 94.0, 344.4, 716.0],
[ 29.6, 67.7, 123.3],
[2218.9, 1390.6, 725.0],
[ 36.1, 312.3, 564.7],
]).astype(int)
pd.DataFrame(C, index=l[0], columns=l[1])
## 15-44 45-64 65+
## Arthritis 360000 1489000 1772000
## Asthma 1069000 741000 433000
## Back problems 1469000 1513000 955000
## Cancer (malignant neoplasms) 28000 162000 237000
## Chronic obstructive pulmonary disease 103000 207000 251000
## Diabetes mellitus 135000 427000 607000
## Heart, stroke and vascular disease 94000 344000 716000
## Kidney disease 29000 67000 123000
## Mental and behavioural conditions 2218000 1390000 725000
## Osteoporosis 36000 312000 564000
Cramér’s \(V\) is one of a few ways to measure the degree of association between two categorical variables. It is equal to 0 (lowest possible value) if the two variables are independent (there is no association between them) and 1 (highest possible value) if they are tied.
scipy.stats.contingency.association(C)
## 0.316237999724298
The above means that there might be a small association between age and the prevalence of certain conditions. In other words, it might be the case that some conditions are more prevalent in different age groups than others.
Given a two-way contingency table \(C\) with \(n\) rows and \(m\) columns and assuming that
\[
T = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\left(c_{i,j}-e_{i,j}\right)^2}{e_{i,j}},
\]
where
\[
e_{i,j} = \frac{\left(\sum_{k=1}^{m} c_{i,k}\right)\left(\sum_{k=1}^{n} c_{k,j}\right)}{\sum_{i=1}^{n} \sum_{j=1}^{m} c_{i,j}},
\]
the Cramér coefficient is given by
\[
V = \sqrt{\frac{T}{\min\{n-1,\, m-1\}\ \sum_{i=1}^{n} \sum_{j=1}^{m} c_{i,j}}}.
\]
Here, \(c_{i,j}\) gives the actually observed counts and \(e_{i, j}\) denotes the number that we would expect if the two variables were actually independent.
Compute the Cramér \(V\) using only numpy functions.
(**) Actually, we can easily verify the hypothesis that \(V\) does not differ significantly from \(0\), i.e., that the variables are independent. Looking at \(T\), we see that it is exactly the test statistic in Pearson’s chi-squared goodness-of-fit test.
E = C.sum(axis=1).reshape(-1, 1) * C.sum(axis=0).reshape(1, -1) / C.sum()
T = np.sum((C-E)**2 / E)
T
## 3715440.465191512
If the data are really independent, \(T\) follows the chi-squared distribution with \(n + m - 1\) degrees of freedom; hence, the critical value is equal to:
alpha = 0.001 # significance level
scipy.stats.chi2.ppf(1-alpha, C.shape[0] + C.shape[1] - 1)
## 32.90949040736021
As \(T\) is greater than the critical value, we conclude (at significance level \(0.1\%\)) that the conditions are not independent of age.
(*) Take a look at Table 19: Comorbidity of selected chronic conditions of the National Health Survey 2018, where we clearly see that many disorders co-occur. Visualise them on some heatmaps and bar plots (including data grouped by sex and age).
11.4.6. Binned Numeric Data
Generally, modes do not work for continuous data, where repeated values are – at least theoretically – highly unlikely (unless the values are rounded or reported with limited precision). It might make sense to compute the mode on binned data, though.
Looking at a histogram, the mode is the interval corresponding to the highest bar (hopefully, there is only one). If we would like to obtain a single number, we can choose, for example, the middle of this interval as the mode.
Of course, for numeric data, the mode will heavily depend on the binning (recall that we can also apply logarithmic binning). Thus, the question “what is the most popular income” is overall quite difficult to answer.
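For instance, a minimal sketch (assuming, arbitrarily, 10 equal-width bins over the full mins dataset):
counts, edges = np.histogram(mins, bins=10)
i = np.argmax(counts)        # the highest bar
(edges[i] + edges[i+1]) / 2  # its midpoint, a possible numeric mode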
Compute some potentially informative modes for the 2020 UK income data. Play around with different numbers of bins for linear and logarithmic binning and see how it affects the mode.
11.4.7. Ordinal Data (*)
The case where the categories can be linearly ordered is called ordinal data. This gives us a few more options: apart from the mode, thanks to the existence of order statistics, we can also easily define sample quantiles. However, the standard methods for resolving ties will not work, hence we need to be careful.
For example, the median of a sample of student grades (P, P, C, D, HD) is C, but for (P, P, C, D, HD, HD) it is either C or D; we can choose one at random or just report that the solution is ambiguous.
Another option, of course, is to treat ordinal data as numbers (e.g., F=0, P=1, …, HD=4). In the latter example, the median would be equal to 2.5.
There are some cases, though, where the conversion of labels to consecutive integers is far from optimal – because it gives the impression that the “distance” between different levels is always equal (linear).
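For the record, a minimal sketch verifying the above medians via the underlying integer codes (assuming the level set F < P < C < D < HD):
g = pd.Categorical(["P", "P", "C", "D", "HD", "HD"],
    categories=["F", "P", "C", "D", "HD"], ordered=True)
np.median(g.codes)  # 2.5, i.e., half-way between "C" (code 2) and "D" (code 3)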
(**) The grades_results dataset represents the grades (F, P, C, D, HD) of 123 students attending an imaginary course at a virtual Australian university. You can load it in the form of an ordered categorical Series by calling:
grades = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching_data/master/marek/grades_results.txt", dtype="str")
grades = pd.Series(pd.Categorical(grades,
categories=["F", "P", "C", "D", "HD"], ordered=True))
grades
## 0 F
## 1 F
## 2 F
## 3 F
## 4 F
## ..
## 118 HD
## 119 HD
## 120 HD
## 121 HD
## 122 HD
## Length: 123, dtype: category
## Categories (5, object): ['F' < 'P' < 'C' < 'D' < 'HD']
How would you determine the average grade represented as a number between 0 and 100, taking into account that for a P you need at least 50% of the points, C is given for ≥ 60%, D for ≥ 70%, and HD only for ≥ 80%? Come up with a pessimistic, an optimistic, and a best-shot estimate, and then compare your results to the true corresponding scores listed in the grades_scores dataset.
11.5. Exercises
Does it make sense to compute the arithmetic mean of a categorical variable?
Name the basic use cases for categorical data.
(*) What is a Pareto chart?
How to deal with the case of the mode being nonunique?
What is the meaning of numpy.mean((x > 0) & (x < 1)), where x is a numeric vector?
What is the meaning of the sum and mean for binary data (logical vectors)?
List some ways to visualise multidimensional categorical data.
(*) State the null hypotheses verified by the one- and two-sample chi-squared tests.
(*) How is Cramér’s V defined and what values does it take?