10. Introducing Data Frames#

The open-access textbook Minimalist Data Wrangling with Python by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF; a printed version can be ordered from Amazon: AU CA DE ES FR IT JP NL PL SE UK US). It is a non-profit project. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Also, make sure to check out my other book, Deep R Programming [34].

numpy arrays are an extremely versatile tool for performing data analysis exercises and other numerical computations of various kinds. Although theoretically possible otherwise, in practice we only store elements of the same type therein, most often numbers.

pandas [62] is amongst over one hundred thousand open-source packages and repositories that use numpy to provide additional data wrangling functionality. It was originally written by Wes McKinney but was heavily inspired by the data.frame[1] objects in S and R as well as tables in relational (think: SQL) databases and spreadsheets.

pandas delivers a few classes, of which the most important are:

• DataFrame – for representing tabular data (matrix-like) with columns of possibly different types, in particular a mix of numerical and categorical variables,

• Series – vector-like objects for representing individual columns,

• Index and its derivatives – vector-like (usually) objects for labelling individual rows and columns in DataFrames and items in Series objects,

• GroupBy and its descendants SeriesGroupBy and DataFrameGroupBy – representing observations grouped by a categorical variable or a combination of factors (Chapter 12),

together with many methods for:

• transforming/aggregating/processing data, also in groups determined by categorical variables or products thereof,

• reshaping (e.g., from wide to long format) and joining datasets,

• importing/exporting data from/to various sources and formats, e.g., CSV and HDF5 files or relational databases,

• handling missing data,

all of which we introduce in this part.

Before we delve into the world of pandas, let us point out that it is customary to load this package under the following alias:

import pandas as pd


Important

Let us repeat: pandas is built on top of numpy and most objects therein can be processed by numpy functions as well. Many other functions (e.g., in sklearn) accept both DataFrame and ndarray objects, but often convert the former to the latter internally to enable data processing using fast C/C++/Fortran routines.

What we have learnt so far[2] still applies. But there is of course more, hence this part.

10.1. Creating Data Frames#

Data frames can be created, amongst others, using the DataFrame class constructor, which can be fed, for example, with a numpy matrix:

np.random.seed(123)
pd.DataFrame(
np.random.rand(4, 3),
columns=["a", "b", "c"]
)
##           a         b         c
## 0  0.696469  0.286139  0.226851
## 1  0.551315  0.719469  0.423106
## 2  0.980764  0.684830  0.480932
## 3  0.392118  0.343178  0.729050


Notice that rows and columns are labelled (and how readable that is).

A dictionary of vector-like objects of equal lengths is another common option:

np.random.seed(123)
df = pd.DataFrame(dict(
a = np.round(np.random.rand(5), 2),
b = [1, 2.5, np.nan, 4, np.nan],
c = [True, True, False, False, True],
d = ["A", "B", "C", None, "E"],
e = ["spam", "spam", "bacon", "spam", "eggs"],
f = np.array([
"2021-01-01", "2022-02-02", "2023-03-03", "2024-04-04", "2025-05-05"
], dtype="datetime64[D]"),
g = [
["spam"], ["bacon", "spam"], None, ["eggs", "bacon", "spam"], ["ham"]
],
))
df
##       a    b      c     d      e          f                    g
## 0  0.70  1.0   True     A   spam 2021-01-01               [spam]
## 1  0.29  2.5   True     B   spam 2022-02-02        [bacon, spam]
## 2  0.23  NaN  False     C  bacon 2023-03-03                 None
## 3  0.55  4.0  False  None   spam 2024-04-04  [eggs, bacon, spam]
## 4  0.72  NaN   True     E   eggs 2025-05-05                [ham]


This illustrates the possibility of having columns of different types.

Exercise 10.1

Check out pandas.DataFrame.from_dict and from_records in the documentation. Use them to create some example data frames.

Further, data frames can be read from files in different formats, for instance, CSV:

body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
comment="#")
body.head()  # display first few rows (5 by default)
##    BMXWT  BMXHT  BMXARML  BMXLEG  BMXARMC  BMXHIP  BMXWAIST
## 0   97.1  160.2     34.7    40.8     35.8   126.1     117.9
## 1   91.1  152.7     33.5    33.0     38.5   125.5     103.1
## 2   73.0  161.2     37.4    38.0     31.8   106.2      92.0
## 3   61.7  157.4     38.0    34.7     29.0   101.0      90.5
## 4   55.4  154.6     34.6    34.0     28.3    92.5      73.2


Reading from URLs and local files is of course supported; compare Section 13.6.1.

Exercise 10.2

Check out the other pandas.read_* functions in the pandas documentation. We will be discussing some of them later.

10.1.1. Data Frames Are Matrix-Like#

Data frames are modelled through numpy matrices. We can thus already feel quite at home with them.

For example, given one of the above-defined data frames, it is easy to fetch its number of rows and columns:

df.shape
## (5, 7)


or the type of each column:

df.dtypes  # returns a Series object; see below
## a           float64
## b           float64
## c              bool
## d            object
## e            object
## f    datetime64[ns]
## g            object
## dtype: object


Recall that numpy arrays are equipped with the dtype slot.

10.1.2. Series#

There is a separate class for storing individual data frame columns: it is called Series.

s = df.loc[:, "a"]  # extract the a column; alternatively: df.a
s
## 0    0.70
## 1    0.29
## 2    0.23
## 3    0.55
## 4    0.72
## Name: a, dtype: float64


Data frames with one column are printed out slightly differently. We get the column name at the top, but do not have the dtype information at the bottom.

s.to_frame()  # or: pd.DataFrame(s)
##       a
## 0  0.70
## 1  0.29
## 2  0.23
## 3  0.55
## 4  0.72


Indexing will of course be discussed later.

Important

It is crucial to know when we are dealing with a Series and when with a DataFrame object, because each of them defines a slightly different set of methods.

We will now be relying upon object-oriented syntax (compare Section 2.2.3) much more frequently than before.

Example 10.3

By calling:

s.mean()
## 0.49800000000000005


we refer to pandas.Series.mean (which returns a scalar), whereas:

df.mean(numeric_only=True)
## a    0.498
## b    2.500
## c    0.600
## dtype: float64


uses pandas.DataFrame.mean (which yields a Series).

Look up these two methods in the pandas manual. Note that their argument list is slightly different.

Objects of class Series are vector-like:

s.shape
## (5,)
s.dtype
## dtype('float64')


They are wrappers around numpy arrays.

s.values
## array([0.7 , 0.29, 0.23, 0.55, 0.72])


Most importantly, numpy functions can be called directly on them:

np.mean(s)
## 0.49800000000000005


As a consequence, what we covered in the part of this book that dealt with vector processing still holds for data frame columns (but there will be more).

Series can also be named.

s.name
## 'a'


This is convenient, especially when we convert them to a data frame, because the name sets the label of the newly created column:

s.rename("spam").to_frame()
##    spam
## 0  0.70
## 1  0.29
## 2  0.23
## 3  0.55
## 4  0.72


10.1.3. Index#

Another important class is called Index[3]. We use it for storing element or axes labels.

The index (lowercase) slot of a data frame stores an object of class Index (or one of its derivatives) that gives the row names:

df.index  # row labels
## RangeIndex(start=0, stop=5, step=1)


The above represents a sequence (0, 1, 2, 3, 4).

Furthermore, the column slot gives:

df.columns  # column labels
## Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object')


Also, we can label the individual elements in Series objects:

s.index
## RangeIndex(start=0, stop=5, step=1)


The set_index method can be applied to make a data frame column act as a sequence of row labels:

df2 = df.set_index("e")
df2
##           a    b      c     d          f                    g
## e
## spam   0.70  1.0   True     A 2021-01-01               [spam]
## spam   0.29  2.5   True     B 2022-02-02        [bacon, spam]
## bacon  0.23  NaN  False     C 2023-03-03                 None
## spam   0.55  4.0  False  None 2024-04-04  [eggs, bacon, spam]
## eggs   0.72  NaN   True     E 2025-05-05                [ham]


This Index object is named:

df2.index.name
## 'e'


We can also rename the axes on the fly:

df2.rename_axis(index="ROWS", columns="COLS")
## COLS      a    b      c     d          f                    g
## ROWS
## spam   0.70  1.0   True     A 2021-01-01               [spam]
## spam   0.29  2.5   True     B 2022-02-02        [bacon, spam]
## bacon  0.23  NaN  False     C 2023-03-03                 None
## spam   0.55  4.0  False  None 2024-04-04  [eggs, bacon, spam]
## eggs   0.72  NaN   True     E 2025-05-05                [ham]


Having a named index slot is handy when we decide that we want to convert the vector of row labels back to a standalone column:

df2.rename_axis(index="NEW_COLUMN").reset_index()
##   NEW_COLUMN     a    b      c     d          f                    g
## 0       spam  0.70  1.0   True     A 2021-01-01               [spam]
## 1       spam  0.29  2.5   True     B 2022-02-02        [bacon, spam]
## 2      bacon  0.23  NaN  False     C 2023-03-03                 None
## 3       spam  0.55  4.0  False  None 2024-04-04  [eggs, bacon, spam]
## 4       eggs  0.72  NaN   True     E 2025-05-05                [ham]


There is also an option to get rid of the current index and to replace it with the default label sequence, i.e., 0, 1, 2, …:

df2.reset_index(drop=True)
##       a    b      c     d          f                    g
## 0  0.70  1.0   True     A 2021-01-01               [spam]
## 1  0.29  2.5   True     B 2022-02-02        [bacon, spam]
## 2  0.23  NaN  False     C 2023-03-03                 None
## 3  0.55  4.0  False  None 2024-04-04  [eggs, bacon, spam]
## 4  0.72  NaN   True     E 2025-05-05                [ham]


Take note of the fact that reset_index, and many other methods that we have used so far, do not modify the data frame in place.

Important

We will soon get used to calling reset_index(drop=True) quite frequently, sometimes more than once in a single series of commands.

Exercise 10.4

Use the pandas.DataFrame.rename method to change the name of the a column in df to spam.

Also, a hierarchical index – one that is comprised of more than one level – is possible. For example, here is a sorted (see Section 10.6.1) version of df with a new index based on two columns at the same time:

df.sort_values("e", ascending=False).set_index(["e", "c"])
##                 a    b     d          f                    g
## e     c
## spam  True   0.70  1.0     A 2021-01-01               [spam]
##       True   0.29  2.5     B 2022-02-02        [bacon, spam]
##       False  0.55  4.0  None 2024-04-04  [eggs, bacon, spam]
## eggs  True   0.72  NaN     E 2025-05-05                [ham]
## bacon False  0.23  NaN     C 2023-03-03                 None


For the sake of readability, the consecutive repeated spams were not printed.

Example 10.5

Hierarchical indexes might arise after aggregating data in groups. For example:

nhanes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_p_demo_bmx_2020.csv",
comment="#").rename({
"BMXBMI": "bmival",
"RIAGENDR": "gender",
"DMDBORN4": "usborn"
}, axis=1)


In Chapter 12, we will get used to writing:

res = nhanes.groupby(["gender", "usborn"])["bmival"].mean()
res  # BMI by gender and US born-ness
## gender  usborn
## 1       1         25.734110
##         2         27.405251
## 2       1         27.120261
##         2         27.579448
##         77        28.725000
##         99        32.600000
## Name: bmival, dtype: float64


This returned a Series object with a hierarchical index. Let us fret not, though: reset_index always comes to our rescue:

res.reset_index()
##    gender  usborn     bmival
## 0       1       1  25.734110
## 1       1       2  27.405251
## 2       2       1  27.120261
## 3       2       2  27.579448
## 4       2      77  28.725000
## 5       2      99  32.600000


10.2. Aggregating Data Frames#

Here is another toy data frame:

np.random.seed(123)
df = pd.DataFrame(dict(
u = np.round(np.random.rand(5), 2),
v = np.round(np.random.randn(5), 2),
w = ["spam", "bacon", "spam", "eggs", "sausage"]
), index=["a", "b", "c", "d", "e"])
df
##       u     v        w
## a  0.70  0.32     spam
## b  0.29 -0.05    bacon
## c  0.23 -0.20     spam
## d  0.55  1.98     eggs
## e  0.72 -1.62  sausage


All numpy functions can be applied directly on individual columns, i.e., objects of type Series, because they are vector-like.

u = df.loc[:, "u"]  # extract the u column (gives a Series; see below)
np.quantile(u, [0, 0.5, 1])
## array([0.23, 0.55, 0.72])


Most numpy functions also work if they are fed with data frames, but we will need to extract the numeric columns manually.

uv = df.loc[:, ["u", "v"]]  # select two columns (a DataFrame; see below)
np.quantile(uv, [0, 0.5, 1], axis=0)
## array([[ 0.23, -1.62],
##        [ 0.55, -0.05],
##        [ 0.72,  1.98]])


Sometimes the results will automatically be coerced to a Series object with the index slot set appropriately:

np.mean(uv, axis=0)
## u    0.498
## v    0.086
## dtype: float64


Many operations, for convenience, were also implemented as methods for the Series and DataFrame classes, e.g., mean, median, min, max, quantile, var, std, and skew.

df.mean(numeric_only=True)
## u    0.498
## v    0.086
## dtype: float64
df.quantile([0, 0.5, 1], numeric_only=True)
##         u     v
## 0.0  0.23 -1.62
## 0.5  0.55 -0.05
## 1.0  0.72  1.98


Also note the describe method, which returns a few statistics at the same time.

df.describe()
##               u         v
## count  5.000000  5.000000
## mean   0.498000  0.086000
## std    0.227969  1.289643
## min    0.230000 -1.620000
## 25%    0.290000 -0.200000
## 50%    0.550000 -0.050000
## 75%    0.700000  0.320000
## max    0.720000  1.980000

Exercise 10.6

Check out the pandas.DataFrame.agg method that can apply all aggregates given by a list of functions. Write a call equivalent to df.describe().

Note

(*) Let us stress that above we see the corrected for bias (but still only asymptotically unbiased) version of standard deviation, given by $$\sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2}$$; compare Section 5.1. In pandas, std methods assume ddof=1 by default, whereas we recall that numpy uses ddof=0.

np.round([u.std(), np.std(u), np.std(np.array(u)), u.std(ddof=0)], 3)
## array([0.228, 0.204, 0.204, 0.204])


This is an unfortunate inconsistency between the two packages, but please do not blame the messenger.

10.3. Transforming Data Frames#

By applying the already well-known vectorised mathematical functions from numpy, we can transform each data cell and return an object of the same type as the input one.

np.exp(df.loc[:, "u"])
## a    2.013753
## b    1.336427
## c    1.258600
## d    1.733253
## e    2.054433
## Name: u, dtype: float64
np.exp(df.loc[:, ["u", "v"]])
##           u         v
## a  2.013753  1.377128
## b  1.336427  0.951229
## c  1.258600  0.818731
## d  1.733253  7.242743
## e  2.054433  0.197899


When applying the binary arithmetic, relational, and logical operators on an object of class Series and a scalar or a numpy vector, the operations are performed elementwisely – a style with which we are already familiar.

For instance, here is a standardised version of the u column:

u = df.loc[:, "u"]
(u - np.mean(u)) / np.std(u)
## a    0.990672
## b   -1.020098
## c   -1.314357
## d    0.255025
## e    1.088759
## Name: u, dtype: float64


Binary operators act on the elements with corresponding labels. For two objects having identical index slots (this is the most common scenario), this is the same as elementwise vectorisation. For instance:

df.loc[:, "u"] > df.loc[:, "v"]  # here: elementwise comparison
## a     True
## b     True
## c     True
## d    False
## e     True
## dtype: bool


For transforming many numerical columns at once, it is a good idea either to convert them to a numeric matrix explicitly and then use the basic numpy functions:

uv = np.array(df.loc[:, ["u", "v"]])
uv2 = (uv-np.mean(uv, axis=0))/np.std(uv, axis=0)
uv2
## array([[ 0.99067229,  0.20286225],
##        [-1.0200982 , -0.11790285],
##        [-1.3143573 , -0.24794275],
##        [ 0.25502455,  1.64197052],
##        [ 1.08875866, -1.47898717]])


or to use the pandas.DataFrame.apply method which invokes a given function on each column separately:

uv2 = df.loc[:, ["u", "v"]].apply(lambda x: (x-np.mean(x))/np.std(x))
uv2
##           u         v
## a  0.990672  0.202862
## b -1.020098 -0.117903
## c -1.314357 -0.247943
## d  0.255025  1.641971
## e  1.088759 -1.478987


Anticipating what we cover in the next section, in both cases, we can write df.loc[:, ["u", "v"]] = uv2 to replace the old content. Also, new columns can be added based on the transformed versions of the existing ones, for instance:

df.loc[:, "uv_squared"] = (df.loc[:, "u"] * df.loc[:, "v"])**2
df
##       u     v        w  uv_squared
## a  0.70  0.32     spam    0.050176
## b  0.29 -0.05    bacon    0.000210
## c  0.23 -0.20     spam    0.002116
## d  0.55  1.98     eggs    1.185921
## e  0.72 -1.62  sausage    1.360489

Example 10.7

(*) Binary operations on objects with different index slots are in fact vectorised labelwisely:

x = pd.Series([1, 10, 1000, 10000, 100000], index=["a", "b", "a", "a", "c"])
x
## a         1
## b        10
## a      1000
## a     10000
## c    100000
## dtype: int64
y = pd.Series([1, 2, 3, 4, 5], index=["b", "b", "a", "d", "c"])
y
## b    1
## b    2
## a    3
## d    4
## c    5
## dtype: int64


And now:

x * y
## a         3.0
## a      3000.0
## a     30000.0
## b        10.0
## b        20.0
## c    500000.0
## d         NaN
## dtype: float64


Here, each element in the first Series named a was multiplied by each (there was only one) element labelled a in the second Series. For d, there were no matches, hence the result’s being marked as missing; compare Chapter 15. Thus, this behaves like a full outer join-type operation; see Section 10.6.4.

The above is different from elementwise vectorisation in numpy:

np.array(x) * np.array(y)
## array([     1,     20,   3000,  40000, 500000])


Labelwise vectorisation can be useful in certain contexts, but we should be aware of this (yet another) incompatibility between the two packages.

10.4. Indexing Series Objects#

Recall that each DataFrame and Series object is equipped with a slot called index, which is an object of class Index (or subclass thereof), giving the row and element labels, respectively. It turns out that we may apply the index operator, [...], to subset these objects not only through the indexers known from the numpy part (e.g., numerical ones, i.e., by position) but also ones that pinpoint the items via their labels. That is quite a lot of index-ing.

Let us study different forms thereof in very detail. For illustration, we will be playing with the two following objects of class Series:

np.random.seed(123)
b = pd.Series(np.round(np.random.rand(10), 2))
b.index = np.random.permutation(np.arange(10))
b
## 2    0.70
## 1    0.29
## 8    0.23
## 7    0.55
## 9    0.72
## 4    0.42
## 5    0.98
## 6    0.68
## 3    0.48
## 0    0.39
## dtype: float64


and:

c = b.copy()
c.index = list("abcdefghij")
c
## a    0.70
## b    0.29
## c    0.23
## d    0.55
## e    0.72
## f    0.42
## g    0.98
## h    0.68
## i    0.48
## j    0.39
## dtype: float64


They consist of the same values, in the same order, but have different labels (index slots). In particular, b’s labels are integers that do not match the physical element positions (where 0 would denote the first element, etc.).

Important

For numpy vectors, we had four different indexing schemes: via a scalar (extracts an element at a given position), a slice, an integer vector, and a logical vector. Series objects are additionally labelled. Therefore, they can also be accessed through the contents of the index slot.

10.4.1. Do Not Use [...] Directly#

Applying the index operator, [...], directly on Series is generally not a good idea:

b[0]  # do not use it
## 0.39
b[ [0] ]  # do not use it
## 0    0.39
## dtype: float64


both do not select the first item, but the item labelled 0.

However:

b[:1]  # do not use it
## 2    0.7
## dtype: float64


and

c[0]  # there is no label 0... (do not use it)
## 0.7


both[4] fall back to position-based indexing.

Confusing? Well, with some self-discipline, the solution is easy:

Important

We should never apply [...] directly on Series nor DataFrame objects.

To avoid ambiguity, we should be referring to the loc[...] and iloc[...] accessors for the label- and position-based filtering, respectively.

10.4.2. loc[...]#

Series.loc[...] implements label-based indexing.

b.loc[0]
## 0.39


This returned the element labelled 0. On the other hand, c.loc[0] will raise a KeyError, because c consists of string labels only. But in this case, we can write:

c.loc["j"]
## 0.39


Next, we can use lists of labels to select a subset.

b.loc[ [0, 1, 0] ]
## 0    0.39
## 1    0.29
## 0    0.39
## dtype: float64
c.loc[ ["j", "b", "j"] ]
## j    0.39
## b    0.29
## j    0.39
## dtype: float64


The result is always of type Series.

Slicing behaves differently as the range is inclusive (sic![5]) at both sides:

b.loc[1:7]
## 1    0.29
## 8    0.23
## 7    0.55
## dtype: float64
b.loc[0:4:-1]
## 0    0.39
## 3    0.48
## 6    0.68
## 5    0.98
## 4    0.42
## dtype: float64
c.loc["d":"g"]
## d    0.55
## e    0.72
## f    0.42
## g    0.98
## dtype: float64


The above calls return all elements between the two indicated labels.

Note

Be careful that if there are repeated labels, then we will be returning all (sic![6]) the matching items:

d = pd.Series([1, 2, 3, 4], index=["a", "b", "a", "c"])
d.loc["a"]
## a    1
## a    3
## dtype: int64


The result is not a scalar but a Series object.

10.4.3. iloc[...]#

Here are some examples of position-based indexing with the iloc[...] accessor. It is worth stressing that, fortunately, its behaviour is consistent with its numpy counterpart, i.e., the ordinary square brackets applied on objects of class ndarray.

For example:

b.iloc[0]  # the same: c.iloc[0]
## 0.7


returns the first element.

b.iloc[1:7]  # the same: b.iloc[1:7]
## 1    0.29
## 8    0.23
## 7    0.55
## 9    0.72
## 4    0.42
## 5    0.98
## dtype: float64


returns the 2nd, 3rd, …, 7th element (not including b.iloc[7], i.e., the 8th one).

10.4.4. Logical Indexing#

Indexing using a logical vector-like object is also available. For this purpose, we will usually be using loc[...] with either a logical Series object of identical index slot as the subsetted object, or a Boolean numpy vector.

b.loc[(b > 0.4) & (b < 0.6)]
## 7    0.55
## 4    0.42
## 3    0.48
## dtype: float64


For iloc[...], the indexer must be unlabelled, e.g., be an ordinary numpy vector.



10.5. Indexing Data Frames#

10.5.1. loc[...] and iloc[...]#

For data frames, iloc and loc can be used too, but they now require two arguments, serving as row and column selectors.

For example:

np.random.seed(123)
df = pd.DataFrame(dict(
u = np.round(np.random.rand(5), 2),
v = np.round(np.random.randn(5), 2),
w = ["spam", "bacon", "spam", "eggs", "sausage"],
x = [True, False, True, False, True]
))


And now:

df.loc[ df.loc[:, "u"] > 0.5, "u":"w" ]
##       u     v        w
## 0  0.70  0.32     spam
## 3  0.55  1.98     eggs
## 4  0.72 -1.62  sausage


selects the rows where the values in the u column are greater than 0.5 and then returns all columns between u and w (inclusive!).

Furthermore,

df.iloc[:3, :].loc[:, ["u", "w"]]
##       u      w
## 0  0.70   spam
## 1  0.29  bacon
## 2  0.23   spam


fetches the first three rows (by position; iloc is necessary) and then selects two indicated columns.

Compare this to:

df.loc[:3, ["u", "w"]]  # df[:3, ["u", "w"]] does not even work - don't
##       u      w
## 0  0.70   spam
## 1  0.29  bacon
## 2  0.23   spam
## 3  0.55   eggs


Which results in four (!) rows.

Important

Getting a scrambled numeric index that does not match the physical positions is quite easy, for instance in the context of data frame sorting which we discuss in Section 10.6.1:

df2 = df.sort_values("v")
df2
##       u     v        w      x
## 4  0.72 -1.62  sausage   True
## 2  0.23 -0.20     spam   True
## 1  0.29 -0.05    bacon  False
## 0  0.70  0.32     spam   True
## 3  0.55  1.98     eggs  False


Note how different are the following results

df2.loc[:3, :]  # up to label 3, inclusive
##       u     v        w      x
## 4  0.72 -1.62  sausage   True
## 2  0.23 -0.20     spam   True
## 1  0.29 -0.05    bacon  False
## 0  0.70  0.32     spam   True
## 3  0.55  1.98     eggs  False
df2.iloc[:3, :]  # always: first three
##       u     v        w      x
## 4  0.72 -1.62  sausage   True
## 2  0.23 -0.20     spam   True
## 1  0.29 -0.05    bacon  False


Important

We can frequently write df.u as a shorter version of df.loc[:, "u"]. This improves the readability in contexts such as:

df.loc[(df.u >= 0.5) & (df.u <= 0.7), ["u", "w"]]
##       u     w
## 0  0.70  spam
## 3  0.55  eggs


This accessor is, sadly, not universal. We can verify this by considering a data frame featuring a column named, e.g., mean (which clashes with a built-in method).

Exercise 10.8

Use pandas.DataFrame.drop to select all columns except v in df.

Exercise 10.9

Use pandas.Series.isin (amongst others) to select all rows with spam and bacon on the df’s menu.

Exercise 10.10

In the tips dataset, select data on male customers where the total bills were in the $$[10, 20]$$ interval. Also, select Saturday and Sunday records where the tips were greater than \$5.

loc[...] can also be used to add new columns to an existing data frame:

df.loc[:, "y"] = df.loc[:, "u"]**2  # or df.loc[:, "y"] = df.u**2
df
##       u     v        w      x       y
## 0  0.70  0.32     spam   True  0.4900
## 1  0.29 -0.05    bacon  False  0.0841
## 2  0.23 -0.20     spam   True  0.0529
## 3  0.55  1.98     eggs  False  0.3025
## 4  0.72 -1.62  sausage   True  0.5184


Important

Notation like “df.new_column = ...” does not work. As we said, only loc and iloc are universal. For other accessors, this is not necessarily the case.

Exercise 10.11

Use pandas.DataFrame.insert to add a new column not necessarily at the end of df.

Exercise 10.12

Use pandas.DataFrame.append to add a few more rows to df.

10.5.3. Modifying Items#

In the current version of pandas, modifying particular elements gives a warning:

df.loc[:, "u"].iloc[0] = 7  # the same as df.u.iloc[0] = 7
## SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame
df.loc[:, "u"].iloc[0]  # testing
## 7.0


In order to remedy this, it is best to create a copy of a column, modify it, and then to replace the old contents with the new ones.

u = df.loc[:, "u"].copy()
u.iloc[0] = 42  # or a whole for loop to process them all, or whatever
df.loc[:, "u"] = u
df.loc[:, "u"].iloc[0]  # testing
## 42.0


10.5.4. Pseudorandom Sampling and Splitting#

As a simple application of what we have covered so far, let us consider the ways to sample several rows from an existing data frame.

We can use the pandas.DataFrame.sample method in the most basic scenarios, such as:

• select five rows, without replacement,

• select 20% rows, with replacement,

• rearrange all the rows.

For example:

body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
comment="#")
body.sample(5, random_state=123)  # 5 rows without replacement
##       BMXWT  BMXHT  BMXARML  BMXLEG  BMXARMC  BMXHIP  BMXWAIST
## 4214   58.4  156.2     35.2    34.7     27.2    99.5      77.5
## 3361   73.7  161.0     36.5    34.5     29.0   107.6      98.2
## 3759   61.4  164.6     37.5    40.4     26.9    93.5      84.4
## 3733  120.4  158.8     33.5    34.6     40.5   147.2     129.3
## 1121  123.5  157.5     35.5    29.0     50.5   143.0     136.4


Notice the random_state argument which controls the seed of the pseudorandom number generator so that we get reproducible results. Alternatively, we could call numpy.random.seed.

Exercise 10.13

Show how the three aforementioned scenarios can be implemented manually using iloc[...] and numpy.random.permutation or numpy.random.choice.

In machine learning practice, we are used to training and evaluating machine learning models on different (mutually disjoint) subsets of the whole data frame.

For instance, in Section 12.3.3, we mention that we may be interested in performing the so-called training/test split (partitioning), where 80% (or 60% or 70%) of the randomly selected rows would constitute the first new data frame and the remaining 20% (or 40% or 30%, respectively) would go to the second one.

Given a data frame like:

df = body.head(10)  # this is just an example
df
##    BMXWT  BMXHT  BMXARML  BMXLEG  BMXARMC  BMXHIP  BMXWAIST
## 0   97.1  160.2     34.7    40.8     35.8   126.1     117.9
## 1   91.1  152.7     33.5    33.0     38.5   125.5     103.1
## 2   73.0  161.2     37.4    38.0     31.8   106.2      92.0
## 3   61.7  157.4     38.0    34.7     29.0   101.0      90.5
## 4   55.4  154.6     34.6    34.0     28.3    92.5      73.2
## 5   62.0  144.7     32.5    34.2     29.8   106.7      84.8
## 6   66.2  166.5     37.5    37.6     32.0    96.3      95.7
## 7   75.9  154.5     35.4    37.6     32.7   107.7      98.7
## 8   77.2  159.2     38.5    40.5     35.7   102.0      97.5
## 9   91.6  174.5     36.1    45.9     35.2   121.3     100.3


one way to perform the aforementioned split is to generate a random permutation of the set of row indexes:

np.random.seed(123)  # reproducibility matters
idx = np.random.permutation(df.shape[0])
idx
## array([4, 0, 7, 5, 8, 3, 1, 6, 9, 2])


And then to pick the first 80% of them to construct the data frame number one:

k = int(df.shape[0]*0.8)
df.iloc[idx[:k], :]
##    BMXWT  BMXHT  BMXARML  BMXLEG  BMXARMC  BMXHIP  BMXWAIST
## 4   55.4  154.6     34.6    34.0     28.3    92.5      73.2
## 0   97.1  160.2     34.7    40.8     35.8   126.1     117.9
## 7   75.9  154.5     35.4    37.6     32.7   107.7      98.7
## 5   62.0  144.7     32.5    34.2     29.8   106.7      84.8
## 8   77.2  159.2     38.5    40.5     35.7   102.0      97.5
## 3   61.7  157.4     38.0    34.7     29.0   101.0      90.5
## 1   91.1  152.7     33.5    33.0     38.5   125.5     103.1
## 6   66.2  166.5     37.5    37.6     32.0    96.3      95.7


and the remaining ones to generate the second dataset:

df.iloc[idx[k:], :]
##    BMXWT  BMXHT  BMXARML  BMXLEG  BMXARMC  BMXHIP  BMXWAIST
## 9   91.6  174.5     36.1    45.9     35.2   121.3     100.3
## 2   73.0  161.2     37.4    38.0     31.8   106.2      92.0

Exercise 10.14

In the wine_quality_all dataset, leave out all but the white wines. Partition the resulting data frame randomly into three data frames: wines_train (60% of the rows), wines_validate (another 20% of the rows), and wines_test (the remaining 20%).

Exercise 10.15

Write a function kfold which takes a data frame df and an integer $$k>1$$ as arguments. Return a list of data frames resulting in stemming from randomly partitioning df into $$k$$ disjoint chunks of equal (or almost equal if that is not possible) sizes.

10.5.5. Hierarchical Indexes (*)#

Consider the following DataFrame object with a hierarchical index:

np.random.seed(123)
df = pd.DataFrame(dict(
year = np.repeat([2023, 2024, 2025], 4),
quarter = np.tile(["Q1", "Q2", "Q3", "Q4"], 3),
data = np.round(np.random.rand(12), 2)
)).set_index(["year", "quarter"])
df
##               data
## year quarter
## 2023 Q1       0.70
##      Q2       0.29
##      Q3       0.23
##      Q4       0.55
## 2024 Q1       0.72
##      Q2       0.42
##      Q3       0.98
##      Q4       0.68
## 2025 Q1       0.48
##      Q2       0.39
##      Q3       0.34
##      Q4       0.73


The index has both levels named, but this is purely for aesthetic reasons.

Indexing using loc[...] by default relates to the first level of the hierarchy:

df.loc[2023, :]
##          data
## quarter
## Q1       0.70
## Q2       0.29
## Q3       0.23
## Q4       0.55


Note that we selected all rows corresponding to a given label and dropped (!) this level of the hierarchy.

Another example:

df.loc[ [2023, 2025], : ]
##               data
## year quarter
## 2023 Q1       0.70
##      Q2       0.29
##      Q3       0.23
##      Q4       0.55
## 2025 Q1       0.48
##      Q2       0.39
##      Q3       0.34
##      Q4       0.73


To access deeper levels, we can use tuples as indexers:

df.loc[ (2023, "Q1"), : ]
## data    0.7
## Name: (2023, Q1), dtype: float64
df.loc[ [(2023, "Q1"), (2024, "Q3")], : ]
##               data
## year quarter
## 2023 Q1       0.70
## 2024 Q3       0.98


In certain scenarios, though, it will probably be much easier to subset a hierarchical index by using reset_index and set_index creatively (together with loc[...] and pandas.Series.isin, etc.).

Let us stress again that the : operator can only be used directly within the square brackets. Nonetheless, we can always use the slice constructor to create a slice in any context:

df.loc[ (slice(None), ["Q1", "Q3"]), : ]  # :, ["Q1", "Q3"]
##               data
## year quarter
## 2023 Q1       0.70
##      Q3       0.23
## 2024 Q1       0.72
##      Q3       0.98
## 2025 Q1       0.48
##      Q3       0.34
df.loc[ (slice(None, None, -1), slice("Q2", "Q3")), : ]  # ::-1, "Q2":"Q3"
##               data
## year quarter
## 2025 Q3       0.34
##      Q2       0.39
## 2024 Q3       0.98
##      Q2       0.42
## 2023 Q3       0.23
##      Q2       0.29


10.6. Further Operations on Data Frames#

One of the many roles of data frames is to represent tables of values for their nice presentation, e.g., in reports from data analysis or research papers. Here are some functions that can aid in their formatting.

10.6.1. Sorting#

Let us consider another example dataset. Here are the yearly (for 2018) average air quality data in the Australian state of Victoria.

air = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_means.csv",
comment="#")
air = (
air.
loc[air.param_id.isin(["BPM2.5", "NO2"]), :].
reset_index(drop=True)
)


We chose two air quality parameters using pandas.Series.isin, which determines whether each element in a Series is enlisted in a given sequence. We could also have used set_index and loc[...] for that.

Notice that the above code spans many lines. We needed to enclose it in round brackets to avoid a syntax error. Alternatively, we could have used backslashes at the end of each line.

Anyway, here is the data frame:

air
##           sp_name param_id      value
## 0      Alphington   BPM2.5   7.848758
## 1      Alphington      NO2   9.558120
## 2    Altona North      NO2   9.467912
## 3       Churchill   BPM2.5   6.391230
## 4       Dandenong      NO2   9.800705
## 5       Footscray   BPM2.5   7.640948
## 6       Footscray      NO2  10.274531
## 7   Geelong South   BPM2.5   6.502762
## 8   Geelong South      NO2   5.681722
## 9   Melbourne CBD   BPM2.5   8.072998
## 10            Moe   BPM2.5   6.427079
## 11   Morwell East   BPM2.5   6.784596
## 12  Morwell South   BPM2.5   6.512849
## 13  Morwell South      NO2   5.124430
## 14      Traralgon   BPM2.5   8.024735
## 15      Traralgon      NO2   5.776333


sort_values is a convenient means to order the rows with respect to one criterion, be it numeric or categorical.

air.sort_values("value", ascending=False)
##           sp_name param_id      value
## 6       Footscray      NO2  10.274531
## 4       Dandenong      NO2   9.800705
## 1      Alphington      NO2   9.558120
## 2    Altona North      NO2   9.467912
## 9   Melbourne CBD   BPM2.5   8.072998
## 14      Traralgon   BPM2.5   8.024735
## 0      Alphington   BPM2.5   7.848758
## 5       Footscray   BPM2.5   7.640948
## 11   Morwell East   BPM2.5   6.784596
## 12  Morwell South   BPM2.5   6.512849
## 7   Geelong South   BPM2.5   6.502762
## 10            Moe   BPM2.5   6.427079
## 3       Churchill   BPM2.5   6.391230
## 15      Traralgon      NO2   5.776333
## 8   Geelong South      NO2   5.681722
## 13  Morwell South      NO2   5.124430


It is also possible to take into account more sorting criteria:

air.sort_values(["param_id", "value"], ascending=[True, False])
##           sp_name param_id      value
## 9   Melbourne CBD   BPM2.5   8.072998
## 14      Traralgon   BPM2.5   8.024735
## 0      Alphington   BPM2.5   7.848758
## 5       Footscray   BPM2.5   7.640948
## 11   Morwell East   BPM2.5   6.784596
## 12  Morwell South   BPM2.5   6.512849
## 7   Geelong South   BPM2.5   6.502762
## 10            Moe   BPM2.5   6.427079
## 3       Churchill   BPM2.5   6.391230
## 6       Footscray      NO2  10.274531
## 4       Dandenong      NO2   9.800705
## 1      Alphington      NO2   9.558120
## 2    Altona North      NO2   9.467912
## 15      Traralgon      NO2   5.776333
## 8   Geelong South      NO2   5.681722
## 13  Morwell South      NO2   5.124430


Here, in each group of identical parameters, we get a decreasing order with respect to the value.

Exercise 10.16

Compare the ordering with respect to param_id and value vs value and then param_id.

Note

(*) DataFrame.sort_values by default lamentably (!) uses a non-stable algorithm (a modified quicksort). If a data frame is sorted with respect to one criterion, and then we reorder it with respect to another one, tied observations are not guaranteed to be listed in the original order:

(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_means.csv",
comment="#")
.sort_values("sp_name")
.sort_values("param_id")
.set_index("param_id")
.loc[["BPM2.5", "NO2"], :]
.reset_index())
##    param_id        sp_name      value
## 0    BPM2.5  Melbourne CBD   8.072998
## 1    BPM2.5            Moe   6.427079
## 2    BPM2.5      Footscray   7.640948
## 3    BPM2.5   Morwell East   6.784596
## 4    BPM2.5      Churchill   6.391230
## 5    BPM2.5  Morwell South   6.512849
## 6    BPM2.5      Traralgon   8.024735
## 7    BPM2.5     Alphington   7.848758
## 8    BPM2.5  Geelong South   6.502762
## 9       NO2  Morwell South   5.124430
## 10      NO2      Traralgon   5.776333
## 11      NO2  Geelong South   5.681722
## 12      NO2   Altona North   9.467912
## 13      NO2     Alphington   9.558120
## 14      NO2      Dandenong   9.800705
## 15      NO2      Footscray  10.274531


We lost the ordering based on station names in the two subgroups. To switch to a mergesort-like method (timsort), we should pass kind="stable".

(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_means.csv",
comment="#")
.sort_values("sp_name")
.sort_values("param_id", kind="stable")  # !
.set_index("param_id")
.loc[["BPM2.5", "NO2"], :]
.reset_index())
##    param_id        sp_name      value
## 0    BPM2.5     Alphington   7.848758
## 1    BPM2.5      Churchill   6.391230
## 2    BPM2.5      Footscray   7.640948
## 3    BPM2.5  Geelong South   6.502762
## 4    BPM2.5  Melbourne CBD   8.072998
## 5    BPM2.5            Moe   6.427079
## 6    BPM2.5   Morwell East   6.784596
## 7    BPM2.5  Morwell South   6.512849
## 8    BPM2.5      Traralgon   8.024735
## 9       NO2     Alphington   9.558120
## 10      NO2   Altona North   9.467912
## 11      NO2      Dandenong   9.800705
## 12      NO2      Footscray  10.274531
## 13      NO2  Geelong South   5.681722
## 14      NO2  Morwell South   5.124430
## 15      NO2      Traralgon   5.776333

Exercise 10.17

(*) Perform identical reorderings but using only loc[...], iloc[...], and numpy.argsort.

10.6.2. Stacking and Unstacking (Long and Wide Forms)#

Let us discuss some further ways to transform data frames, that benefit from, make sense because of, or are possible due to their being able to store data of different types.

The above air dataset is in the so-called long format, where all measurements are stacked one after/below another. Such a form is quite convenient for the storing of data, especially when there are few recorded values, but many possible combinations of levels (sparse data).

The long format might not necessarily be optimal in all data processing activities, though; compare [93]. In the matrix processing part of this book, it was much more natural for us to have a single observation (e.g., data for each measurement station) in each row.

We can unstack the air data frame (convert to the wide format) quite easily:

air_wide = air.set_index(["sp_name", "param_id"]).unstack().loc[:, "value"]
air_wide
## param_id         BPM2.5        NO2
## sp_name
## Alphington     7.848758   9.558120
## Altona North        NaN   9.467912
## Churchill      6.391230        NaN
## Dandenong           NaN   9.800705
## Footscray      7.640948  10.274531
## Geelong South  6.502762   5.681722
## Melbourne CBD  8.072998        NaN
## Moe            6.427079        NaN
## Morwell East   6.784596        NaN
## Morwell South  6.512849   5.124430
## Traralgon      8.024735   5.776333


The missing values are denoted with NaNs (not-a-number); see Section 15.1 for more details. Interestingly, we got a hierarchical index in the columns (sic!) slot, hence the loc[...] part to drop the last level of the hierarchy. Also notice that the index and columns slots are named.

The other way around, we can use the stack method:

air_wide.T.rename_axis(index="location", columns="param").\
stack().rename("value").reset_index()
##    location          param      value
## 0    BPM2.5     Alphington   7.848758
## 1    BPM2.5      Churchill   6.391230
## 2    BPM2.5      Footscray   7.640948
## 3    BPM2.5  Geelong South   6.502762
## 4    BPM2.5  Melbourne CBD   8.072998
## 5    BPM2.5            Moe   6.427079
## 6    BPM2.5   Morwell East   6.784596
## 7    BPM2.5  Morwell South   6.512849
## 8    BPM2.5      Traralgon   8.024735
## 9       NO2     Alphington   9.558120
## 10      NO2   Altona North   9.467912
## 11      NO2      Dandenong   9.800705
## 12      NO2      Footscray  10.274531
## 13      NO2  Geelong South   5.681722
## 14      NO2  Morwell South   5.124430
## 15      NO2      Traralgon   5.776333


We used the data frame transpose (T) to get a location-major order (less boring an outcome in this context). Missing values are gone now. We do not need them anymore. Nevertheless, passing dropna=False would help us identify the combinations of location and param for which the readings are not provided.

10.6.3. Set-Theoretic Operations and Removing Duplicates#

Here are two not at all disjoint sets of imaginary persons:

A = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/some_birth_dates1.csv",
comment="#")
A
##                   Name   BirthDate
## 0     Paitoon Ornwimol  26.06.1958
## 1         Antónia Lata  20.05.1935
## 2    Bertoldo Mallozzi  17.08.1972
## 3        Nedeljko Bukv  19.12.1921
## 4        Micha Kitchen  17.09.1930
## 5      Mefodiy Shachar  01.10.1914
## 6         Paul Meckler  29.09.1968
## 9  Duchanee Panomyaong  19.06.1952


and:

B = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/some_birth_dates2.csv",
comment="#")
B
##                    Name   BirthDate
## 0    Hushang Naigamwala  25.08.1991
## 1              Zhen Wei  16.11.1975
## 2         Micha Kitchen  17.09.1930
## 3           Jodoc Alwin  16.11.1969
## 4            Igor Mazał  14.05.2004
## 6   Duchanee Panomyaong  19.06.1952
## 7       Mefodiy Shachar  01.10.1914
## 8          Paul Meckler  29.09.1968
## 9         Noe Tae-Woong  11.07.1970


In both datasets, there is a single categorical column whose elements uniquely identify each record (i.e., Name). In the language of relational databases, we would call it the primary key. In such a case, implementing the set-theoretic operations is relatively easy, as we can refer to the pandas.Series.isin method.

First, $$A\cap B$$ (intersection), includes only the rows that are both in A and in B:

A.loc[A.Name.isin(B.Name), :]
##                   Name   BirthDate
## 4        Micha Kitchen  17.09.1930
## 5      Mefodiy Shachar  01.10.1914
## 6         Paul Meckler  29.09.1968
## 9  Duchanee Panomyaong  19.06.1952


Second, $$A\smallsetminus B$$ (difference), gives all the rows that are in A but not in B:

A.loc[~A.Name.isin(B.Name), :]
##                 Name   BirthDate
## 0   Paitoon Ornwimol  26.06.1958
## 1       Antónia Lata  20.05.1935
## 2  Bertoldo Mallozzi  17.08.1972
## 3      Nedeljko Bukv  19.12.1921


Third, $$A\cup B$$ (union), returns the rows that exist in A or are in B:

pd.concat((A, B.loc[~B.Name.isin(A.Name), :]))
##                   Name   BirthDate
## 0     Paitoon Ornwimol  26.06.1958
## 1         Antónia Lata  20.05.1935
## 2    Bertoldo Mallozzi  17.08.1972
## 3        Nedeljko Bukv  19.12.1921
## 4        Micha Kitchen  17.09.1930
## 5      Mefodiy Shachar  01.10.1914
## 6         Paul Meckler  29.09.1968
## 9  Duchanee Panomyaong  19.06.1952
## 0   Hushang Naigamwala  25.08.1991
## 1             Zhen Wei  16.11.1975
## 3          Jodoc Alwin  16.11.1969
## 4           Igor Mazał  14.05.2004
## 9        Noe Tae-Woong  11.07.1970


There are no duplicate rows in any of the above outputs.

Exercise 10.18

Determine $$(A\cup B) \smallsetminus (A\cap B) = (A\smallsetminus B)\cup(B\smallsetminus A)$$ (symmetric difference).

Exercise 10.19

(*) Determine the union, intersection, and difference of the wine_sample1 and wine_sample2 datasets, where there is no column uniquely identifying the observations. Hint: consider using pandas.concat and pandas.DataFrame.duplicated or pandas.DataFrame.drop_duplicates.

10.6.4. Joining (Merging)#

In database design, it is common to normalise the datasets. We do this to avoid the duplication of information and pathologies stemming from them (e.g., [19]).

Example 10.20

The above air quality parameters are separately described in another data frame:

param = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_param.csv",
comment="#")
param.rename(dict(param_std_unit_of_measure="unit"), axis=1)
##   param_id                   param_name   unit      param_short_name
## 0      API      Airborne particle index   none  Visibility Reduction
## 1   BPM2.5  BAM  Particles < 2.5 micron  ug/m3                 PM2.5
## 2       CO              Carbon Monoxide    ppm                    CO
## 3    HPM10                   Hivol PM10  ug/m3                   NaN
## 4      NO2             Nitrogen Dioxide    ppb                   NO2
## 5       O3                        Ozone    ppb                    O3
## 6     PM10     TEOM Particles <10micron  ug/m3                  PM10
## 7   PPM2.5               Partisol PM2.5  ug/m3                   NaN
## 8      SO2               Sulfur Dioxide    ppb                   SO2


We could have stored them alongside the air data frame, but that would be a waste of space. Also, if we wanted to modify some datum (note, e.g., the annoying double space in param_name for BPM2.5), we would have to update all the relevant records.

Instead, we can always match the records in both data frames that have the same param_ids, and join (merge) these datasets only when we really need this.

Let us discuss the possible join operations by studying the two following toy data sets:

A = pd.DataFrame({
"x": ["a0", "a1", "a2", "a3"],
"y": ["b0", "b1", "b2", "b3"]
})
A
##     x   y
## 0  a0  b0
## 1  a1  b1
## 2  a2  b2
## 3  a3  b3


and:

B = pd.DataFrame({
"x": ["a0", "a2", "a2", "a4"],
"z": ["c0", "c1", "c2", "c3"]
})
B
##     x   z
## 0  a0  c0
## 1  a2  c1
## 2  a2  c2
## 3  a4  c3


They both have one column somewhat in common, x.

The inner (natural) join returns the records that have a match in both datasets:

pd.merge(A, B, on="x")
##     x   y   z
## 0  a0  b0  c0
## 1  a2  b2  c1
## 2  a2  b2  c2


The left join of A with B guarantees to return all the records from A, even those which are not matched by anything in B.

pd.merge(A, B, how="left", on="x")
##     x   y    z
## 0  a0  b0   c0
## 1  a1  b1  NaN
## 2  a2  b2   c1
## 3  a2  b2   c2
## 4  a3  b3  NaN


The right join of A with B is the same as the left join of B with A:

pd.merge(A, B, how="right", on="x")
##     x    y   z
## 0  a0   b0  c0
## 1  a2   b2  c1
## 2  a2   b2  c2
## 3  a4  NaN  c3


Finally, the full outer join is the set-theoretic union of the left and the right join:

pd.merge(A, B, how="outer", on="x")
##     x    y    z
## 0  a0   b0   c0
## 1  a1   b1  NaN
## 2  a2   b2   c1
## 3  a2   b2   c2
## 4  a3   b3  NaN
## 5  a4  NaN   c3

Exercise 10.22

Normalise air_quality_2018 so that you get the three separate data frames mentioned in the previous exercise (value, point, and param).

Exercise 10.23

(*) In the National Health and Nutrition Examination Survey, each participant is uniquely identified by their sequence number (SEQN). This token is mentioned in numerous datasets, including:

Join a few chosen datasets that you find interesting.

10.6.5. …And (Too) Many More#

Looking at the list of methods for the DataFrame and Series classes in the pandas package’s documentation, we can see that they are abundant. Together with the object-oriented syntax, we will often find ourselves appreciating the high readability of even quite complex operation chains such as data.drop_duplicates().groupby(["year", "month"]).mean().reset_index(); see Chapter 12.

Nevertheless, the methods are probably too plentiful to our taste. Their developers were overgenerous. They wanted to include a list of all the possible verbs related to data analysis, even if they can be trivially expressed by a composition of 2-3 simpler operations from numpy or scipy instead.

As strong advocates of minimalism, we would rather save ourselves from being overloaded with too much new information. This is why our focus in this book is on developing the most transferable[7] skills. Our approach is also slightly more hygienic. We do not want the reader to develop a hopeless mindset, the habit of looking everything up on the internet when faced with even the simplest kinds of problems. We have brains for a reason.

10.7. Exercises#

Exercise 10.24

How are data frames different from matrices?

Exercise 10.25

What are the use cases of the name slot in Series and Index objects?

Exercise 10.26

What is the purpose of set_index and reset_index?

Exercise 10.27

Why learning numpy is crucial when someone wants to become a proficient user of pandas?

Exercise 10.28

What is the difference between iloc[...] and loc[...]?

Exercise 10.29

Why applying the index operator [...] directly on a Series or DataFrame object is not necessarily a good idea?

Exercise 10.30

What is the difference between index, Index, and columns`?

Exercise 10.31

How can we compute the arithmetic mean and median of all the numeric columns in a data frame, using a single line of code?

Exercise 10.32

What is a training/test split and how to perform it using numpy and pandas?

Exercise 10.33

What is the difference between stacking and unstacking? Which one yields a wide (as opposed to long) format?

Exercise 10.34

Name different data frame join (merge) operations and explain how they work.

Exercise 10.35

How does sorting with respect to more than one criterion work?

Exercise 10.36

Name the basic set-theoretic operations on data frames.