# 10. Introducing data frames#

The open-access textbookMinimalist Data Wrangling with Pythonby Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF; a printed version can be ordered from Amazon: AU CA DE ES FR IT JP NL PL SE UK US). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out the author’s other book,Deep R Programming[34].

**numpy** arrays are an extremely versatile tool for performing data
analysis exercises and other numerical computations of various kinds.
Although theoretically possible otherwise, in practice
we only store elements of the same type therein, most often numbers.

**pandas** [63]
is amongst over one hundred thousand
open-source packages and repositories that use **numpy** to provide
additional data wrangling functionality.
It was originally written by Wes McKinney
but was heavily inspired by the `data.frame`

[1] objects in S and R
as well as tables in relational (think: SQL) databases
and spreadsheets.

**pandas** delivers a few classes, of which the most important are:

`DataFrame`

– for representing tabular data (matrix-like) with columns of possibly different types, in particular a mix of numerical and categorical variables,`Series`

– vector-like objects for storing individual columns,`Index`

and its derivatives – vector-like (usually) objects for labelling individual rows and columns in`DataFrame`

s and items in`Series`

objects,`SeriesGroupBy`

and`DataFrameGroupBy`

– which model observations grouped by a categorical variable or a combination of factors (Chapter 12),

together with many methods for:

transforming/aggregating/processing data, also in groups determined by categorical variables or products thereof,

reshaping (e.g., from wide to long format) and joining datasets,

importing/exporting data from/to various sources and formats, e.g., CSV and HDF5 files or relational databases,

handling missing data,

all of which we introduce in this part.

Before we delve into the world of **pandas**,
let us point out that it is customary to load this package
under the following alias:

```
import pandas as pd
```

Important

Let us repeat: **pandas** is built on top of
**numpy** and most objects therein can be processed by
**numpy** functions as well.
Many other functions, e.g., in **scikit-learn**,
accept both `DataFrame`

and `ndarray`

objects,
but often convert the former to the latter internally to enable data
processing using fast C/C++/Fortran routines.

What we have learnt so far[2] still applies. But there is more, hence this part.

## 10.1. Creating data frames#

Data frames can be created, amongst others,
using the `DataFrame`

class constructor, which
can be fed, for example, with a **numpy** matrix:

```
np.random.seed(123)
pd.DataFrame(
np.random.rand(4, 3),
columns=["a", "b", "c"]
)
## a b c
## 0 0.696469 0.286139 0.226851
## 1 0.551315 0.719469 0.423106
## 2 0.980764 0.684830 0.480932
## 3 0.392118 0.343178 0.729050
```

Notice that rows and columns are labelled (and how readable that is).

A dictionary of vector-like objects of equal lengths is another common option:

```
np.random.seed(123)
df = pd.DataFrame(dict(
a = np.round(np.random.rand(5), 2),
b = [1, 2.5, np.nan, 4, np.nan],
c = [True, True, False, False, True],
d = ["A", "B", "C", None, "E"],
e = ["spam", "spam", "bacon", "spam", "eggs"],
f = np.array([
"2021-01-01", "2022-02-02", "2023-03-03", "2024-04-04", "2025-05-05"
], dtype="datetime64[D]"),
g = [
["spam"], ["bacon", "spam"], None, ["eggs", "bacon", "spam"], ["ham"]
],
))
df
## a b c d e f g
## 0 0.70 1.0 True A spam 2021-01-01 [spam]
## 1 0.29 2.5 True B spam 2022-02-02 [bacon, spam]
## 2 0.23 NaN False C bacon 2023-03-03 None
## 3 0.55 4.0 False None spam 2024-04-04 [eggs, bacon, spam]
## 4 0.72 NaN True E eggs 2025-05-05 [ham]
```

This illustrates the possibility of having columns of different types.

Check out **pandas.DataFrame.from_dict** and
**from_records**
in the documentation.
Use them to create some example data frames.

Further, data frames can be read from files in different formats, for instance, CSV:

```
body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_bmx_2020.csv",
comment="#")
body.head() # display first few rows (5 by default)
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 0 97.1 160.2 34.7 40.8 35.8 126.1 117.9
## 1 91.1 152.7 33.5 33.0 38.5 125.5 103.1
## 2 73.0 161.2 37.4 38.0 31.8 106.2 92.0
## 3 61.7 157.4 38.0 34.7 29.0 101.0 90.5
## 4 55.4 154.6 34.6 34.0 28.3 92.5 73.2
```

Reading from URLs and local files is also supported; compare Section 13.6.1.

Check out other **pandas.read_*** functions
in the **pandas** documentation.
We will be discussing some of them later.

### 10.1.1. Data frames are matrix-like#

Data frames are modelled through **numpy** matrices.
We can thus already feel quite at home with them.

For example, a data frame, it is easy to fetch its number of rows and columns:

```
df.shape
## (5, 7)
```

or the type of each column:

```
df.dtypes # returns a Series object; see below
## a float64
## b float64
## c bool
## d object
## e object
## f datetime64[s]
## g object
## dtype: object
```

Recall that **numpy** arrays are equipped with the `dtype`

slot.

### 10.1.2. `Series`

#

There is a separate class for storing individual data frame
columns: it is called `Series`

.

```
s = df.loc[:, "a"] # extract the `a` column; alternatively: df.a
s
## 0 0.70
## 1 0.29
## 2 0.23
## 3 0.55
## 4 0.72
## Name: a, dtype: float64
```

Data frames with one column are printed
out slightly differently.
We get the column name at the top,
but do not have the `dtype`

information at the bottom.

```
s.to_frame() # or: pd.DataFrame(s)
## a
## 0 0.70
## 1 0.29
## 2 0.23
## 3 0.55
## 4 0.72
```

Indexing will be discussed later.

Important

It is crucial to know when we are dealing
with a `Series`

and when with a `DataFrame`

object
as each of them defines a slightly different set of methods.

We will now be relying on object-orientated syntax (compare Section 2.2.3) much more frequently than before.

By calling:

```
s.mean()
## 0.49800000000000005
```

we refer to **pandas.Series.mean** (which returns a scalar), whereas:

```
df.mean(numeric_only=True)
## a 0.498
## b 2.500
## c 0.600
## dtype: float64
```

uses **pandas.DataFrame.mean** (which yields a `Series`

).

Look up these two methods in the **pandas** manual.
Note that their argument list is slightly different.

Objects of the class `Series`

are vector-like:

```
s.shape
## (5,)
s.dtype
## dtype('float64')
```

They are wrappers around **numpy** arrays.

```
s.values
## array([0.7 , 0.29, 0.23, 0.55, 0.72])
```

Most importantly, **numpy** functions can be called directly on them:

```
np.mean(s)
## 0.49800000000000005
```

As a consequence, what we covered in the part of this book that dealt with vector processing still holds for data frame columns (but there will be more).

`Series`

can also be *named*.

```
s.name
## 'a'
```

This is convenient, especially when we convert them to a data frame
as the `name`

sets the label of the newly created column:

```
s.rename("spam").to_frame()
## spam
## 0 0.70
## 1 0.29
## 2 0.23
## 3 0.55
## 4 0.72
```

### 10.1.3. `Index`

#

Another important class is called `Index`

[3].
We use it for storing element or axes labels.

The `index`

(lowercase) *slot* of a data frame stores an
object of the class `Index`

(or one of its derivatives)
that gives the row names:

```
df.index # row labels
## RangeIndex(start=0, stop=5, step=1)
```

The above represents a sequence (0, 1, 2, 3, 4).

Furthermore, the `column`

slot gives:

```
df.columns # column labels
## Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object')
```

Also, we can label the individual elements in `Series`

objects:

```
s.index
## RangeIndex(start=0, stop=5, step=1)
```

The **set_index** method can be applied
to make a data frame column act as a sequence of row labels:

```
df2 = df.set_index("e")
df2
## a b c d f g
## e
## spam 0.70 1.0 True A 2021-01-01 [spam]
## spam 0.29 2.5 True B 2022-02-02 [bacon, spam]
## bacon 0.23 NaN False C 2023-03-03 None
## spam 0.55 4.0 False None 2024-04-04 [eggs, bacon, spam]
## eggs 0.72 NaN True E 2025-05-05 [ham]
```

This `Index`

object is named:

```
df2.index.name
## 'e'
```

We can also rename the axes on the fly:

```
df2.rename_axis(index="ROWS", columns="COLS")
## COLS a b c d f g
## ROWS
## spam 0.70 1.0 True A 2021-01-01 [spam]
## spam 0.29 2.5 True B 2022-02-02 [bacon, spam]
## bacon 0.23 NaN False C 2023-03-03 None
## spam 0.55 4.0 False None 2024-04-04 [eggs, bacon, spam]
## eggs 0.72 NaN True E 2025-05-05 [ham]
```

Having a named `index`

slot is handy when we decide
that we want to convert the vector of row labels back
to a standalone column:

```
df2.rename_axis(index="NEW_COLUMN").reset_index()
## NEW_COLUMN a b c d f g
## 0 spam 0.70 1.0 True A 2021-01-01 [spam]
## 1 spam 0.29 2.5 True B 2022-02-02 [bacon, spam]
## 2 bacon 0.23 NaN False C 2023-03-03 None
## 3 spam 0.55 4.0 False None 2024-04-04 [eggs, bacon, spam]
## 4 eggs 0.72 NaN True E 2025-05-05 [ham]
```

There is also an option to get rid of the current `index`

and to replace it with the default label sequence, i.e.,
0, 1, 2, …:

```
df2.reset_index(drop=True)
## a b c d f g
## 0 0.70 1.0 True A 2021-01-01 [spam]
## 1 0.29 2.5 True B 2022-02-02 [bacon, spam]
## 2 0.23 NaN False C 2023-03-03 None
## 3 0.55 4.0 False None 2024-04-04 [eggs, bacon, spam]
## 4 0.72 NaN True E 2025-05-05 [ham]
```

Take note of the fact that **reset_index**, and many other methods
that we have used so far, do not modify the data frame in place.

Important

We will soon get used to calling
**reset_index**`(drop=True)`

quite frequently, sometimes
more than once in a single series of commands.

Use the
**pandas.DataFrame.rename** method to
change the name of the `a`

column in `df`

to `spam`

.

Also, a *hierarchical* index – one that is comprised
of more than one level – is possible.
For example, here is a sorted (see Section 10.6.1) version
of `df`

with a new index based on two columns at the same time:

```
df.sort_values("e", ascending=False).set_index(["e", "c"])
## a b d f g
## e c
## spam True 0.70 1.0 A 2021-01-01 [spam]
## True 0.29 2.5 B 2022-02-02 [bacon, spam]
## False 0.55 4.0 None 2024-04-04 [eggs, bacon, spam]
## eggs True 0.72 NaN E 2025-05-05 [ham]
## bacon False 0.23 NaN C 2023-03-03 None
```

For the sake of readability,
the consecutive repeated `spam`

s were not printed.

Hierarchical indexes might arise after aggregating data in groups. For example:

```
nhanes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_p_demo_bmx_2020.csv",
comment="#").rename({
"BMXBMI": "bmival",
"RIAGENDR": "gender",
"DMDBORN4": "usborn"
}, axis=1)
```

In Chapter 12, we will get used to writing:

```
res = nhanes.groupby(["gender", "usborn"])["bmival"].mean()
res # BMI by gender and US born-ness
## gender usborn
## 1 1 25.734110
## 2 27.405251
## 2 1 27.120261
## 2 27.579448
## 77 28.725000
## 99 32.600000
## Name: bmival, dtype: float64
```

This returned a `Series`

object with a hierarchical index.
Let us fret not, though: **reset_index** always comes to
our rescue:

```
res.reset_index()
## gender usborn bmival
## 0 1 1 25.734110
## 1 1 2 27.405251
## 2 2 1 27.120261
## 3 2 2 27.579448
## 4 2 77 28.725000
## 5 2 99 32.600000
```

## 10.2. Aggregating data frames#

Here is another toy data frame:

```
np.random.seed(123)
df = pd.DataFrame(dict(
u = np.round(np.random.rand(5), 2),
v = np.round(np.random.randn(5), 2),
w = ["spam", "bacon", "spam", "eggs", "sausage"]
), index=["a", "b", "c", "d", "e"])
df
## u v w
## a 0.70 0.32 spam
## b 0.29 -0.05 bacon
## c 0.23 -0.20 spam
## d 0.55 1.98 eggs
## e 0.72 -1.62 sausage
```

All **numpy** functions can be applied directly on individual
columns, i.e., objects of the type `Series`

, because they are vector-like.

```
u = df.loc[:, "u"] # extract the `u` column (gives a Series; see below)
np.quantile(u, [0, 0.5, 1])
## array([0.23, 0.55, 0.72])
```

Most **numpy** functions also work if they are fed with data frames,
but we will need to extract the numeric columns manually.

```
uv = df.loc[:, ["u", "v"]] # select two columns (a DataFrame; see below)
np.quantile(uv, [0, 0.5, 1], axis=0)
## array([[ 0.23, -1.62],
## [ 0.55, -0.05],
## [ 0.72, 1.98]])
```

Sometimes the results will automatically be coerced to a `Series`

object
with the `index`

slot set appropriately:

```
np.mean(uv, axis=0)
## u 0.498
## v 0.086
## dtype: float64
```

For convenience, many operations are also available as methods for the
`Series`

and `DataFrame`

classes, e.g., **mean**, **median**,
**min**, **max**, **quantile**,
**var**, **std**, and **skew**.

```
df.mean(numeric_only=True)
## u 0.498
## v 0.086
## dtype: float64
df.quantile([0, 0.5, 1], numeric_only=True)
## u v
## 0.0 0.23 -1.62
## 0.5 0.55 -0.05
## 1.0 0.72 1.98
```

Also note the **describe** method, which
returns a few statistics at the same time.

```
df.describe()
## u v
## count 5.000000 5.000000
## mean 0.498000 0.086000
## std 0.227969 1.289643
## min 0.230000 -1.620000
## 25% 0.290000 -0.200000
## 50% 0.550000 -0.050000
## 75% 0.700000 0.320000
## max 0.720000 1.980000
```

Check out the **pandas.DataFrame.agg** method
that can apply all aggregates given by a list of functions.
Compose a call equivalent to `df.describe()`

.

Note

(*) Let us stress that above we see
the corrected for bias (but still only asymptotically unbiased)
version of standard deviation,
given by \(\sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2}\);
compare Section 5.1.
In **pandas**, **std** methods assume `ddof=1`

by default,
whereas we recall that **numpy** uses `ddof=0`

.

```
np.round([u.std(), np.std(u), np.std(np.array(u)), u.std(ddof=0)], 3)
## array([0.228, 0.204, 0.204, 0.204])
```

This is an unfortunate inconsistency between the two packages, but please do not blame the messenger.

## 10.3. Transforming data frames#

By applying the already well-known vectorised mathematical
functions from **numpy**, we can transform each data cell
and return an object of the same type as the input one.

```
np.exp(df.loc[:, "u"])
## a 2.013753
## b 1.336427
## c 1.258600
## d 1.733253
## e 2.054433
## Name: u, dtype: float64
np.exp(df.loc[:, ["u", "v"]])
## u v
## a 2.013753 1.377128
## b 1.336427 0.951229
## c 1.258600 0.818731
## d 1.733253 7.242743
## e 2.054433 0.197899
```

When applying the binary arithmetic, relational,
and logical operators on an object of the class `Series`

and
a scalar or a **numpy** vector, the operations are performed
elementwisely – a style with which we are already familiar.

For instance, here is a standardised version of the `u`

column:

```
u = df.loc[:, "u"]
(u - np.mean(u)) / np.std(u)
## a 0.990672
## b -1.020098
## c -1.314357
## d 0.255025
## e 1.088759
## Name: u, dtype: float64
```

Binary operators act on the elements with corresponding labels.
For two objects having identical `index`

slots
(this is the most common scenario), this is the same as elementwise
vectorisation. For instance:

```
df.loc[:, "u"] > df.loc[:, "v"] # here: elementwise comparison
## a True
## b True
## c True
## d False
## e True
## dtype: bool
```

For transforming many numerical columns at once,
it is worthwhile either to
convert them to a numeric matrix explicitly
and then use the basic **numpy** functions:

```
uv = np.array(df.loc[:, ["u", "v"]])
uv2 = (uv-np.mean(uv, axis=0))/np.std(uv, axis=0)
uv2
## array([[ 0.99067229, 0.20286225],
## [-1.0200982 , -0.11790285],
## [-1.3143573 , -0.24794275],
## [ 0.25502455, 1.64197052],
## [ 1.08875866, -1.47898717]])
```

or to use the **pandas.DataFrame.apply** method
which invokes a given function on each column separately:

```
uv2 = df.loc[:, ["u", "v"]].apply(lambda x: (x-np.mean(x))/np.std(x))
uv2
## u v
## a 0.990672 0.202862
## b -1.020098 -0.117903
## c -1.314357 -0.247943
## d 0.255025 1.641971
## e 1.088759 -1.478987
```

Anticipating what we cover in the next section,
in both cases, we can write `df.loc[:, ["u", "v"]] = uv2`

to replace the old content. Also, new columns can be added based on the
transformed versions of the existing ones. For instance:

```
df.loc[:, "uv_squared"] = (df.loc[:, "u"] * df.loc[:, "v"])**2
df
## u v w uv_squared
## a 0.70 0.32 spam 0.050176
## b 0.29 -0.05 bacon 0.000210
## c 0.23 -0.20 spam 0.002116
## d 0.55 1.98 eggs 1.185921
## e 0.72 -1.62 sausage 1.360489
```

(*)
Binary operations on objects with different `index`

slots
are vectorised *labelwisely*:

```
x = pd.Series([1, 10, 1000, 10000, 100000], index=["a", "b", "a", "a", "c"])
x
## a 1
## b 10
## a 1000
## a 10000
## c 100000
## dtype: int64
y = pd.Series([1, 2, 3, 4, 5], index=["b", "b", "a", "d", "c"])
y
## b 1
## b 2
## a 3
## d 4
## c 5
## dtype: int64
```

And now:

```
x * y
## a 3.0
## a 3000.0
## a 30000.0
## b 10.0
## b 20.0
## c 500000.0
## d NaN
## dtype: float64
```

Here, each element in the first `Series`

named `a`

was
multiplied by each (there was only one) element labelled `a`

in the second
`Series`

. For `d`

, there were no matches, hence the result’s being marked
as missing; compare Chapter 15.
Thus, this behaves like a full outer join-type operation;
see Section 10.6.3.

The above is different from *elementwise* vectorisation in **numpy**:

```
np.array(x) * np.array(y)
## array([ 1, 20, 3000, 40000, 500000])
```

Labelwise vectorisation *can* be useful in certain contexts.
However, we need to be aware of this (yet another) incompatibility between
the two packages.

## 10.4. Indexing `Series`

objects#

Recall that each `DataFrame`

and `Series`

object
is equipped with a slot called `index`

, which is
an object of the class `Index`

(or subclass thereof),
giving the row and element labels, respectively.
It turns out that we may apply
the *index* operator, `[...]`

, to subset these objects
not only through the *indexers* known from
the **numpy** part (e.g., numerical ones, i.e., by position)
but also ones that pinpoint the items via their
labels. That is quite a lot of index-ing.

Let us study different forms thereof in very detail.
For illustration, we will be playing with the two following objects of
class `Series`

:

```
np.random.seed(123)
b = pd.Series(np.round(np.random.rand(10), 2))
b.index = np.random.permutation(np.arange(10))
b
## 2 0.70
## 1 0.29
## 8 0.23
## 7 0.55
## 9 0.72
## 4 0.42
## 5 0.98
## 6 0.68
## 3 0.48
## 0 0.39
## dtype: float64
```

and:

```
c = b.copy()
c.index = list("abcdefghij")
c
## a 0.70
## b 0.29
## c 0.23
## d 0.55
## e 0.72
## f 0.42
## g 0.98
## h 0.68
## i 0.48
## j 0.39
## dtype: float64
```

They consist of the same values, in the same order,
but have different labels (`index`

slots).
In particular, `b`

’s labels
are integers that *do not* match the physical element
positions (where 0 would denote the first element, etc.).

Important

For **numpy** vectors, we had four different indexing schemes:
via a scalar (extracts an element at a given position),
a slice, an integer vector, and a logical vector.
`Series`

objects are *additionally* labelled. Therefore,
they can also be accessed through the contents of the `index`

slot.

### 10.4.1. Do not use `[...]`

directly#

Applying the index operator,
`[...]`

, directly on `Series`

is generally a bad idea:

```
b[0] # do not use it
## 0.39
b[ [0] ] # do not use it
## 0 0.39
## dtype: float64
```

both do not select the first item, but the item labelled `0`

.

However:

```
b[:1] # do not use it
## 2 0.7
## dtype: float64
```

and

```
c[0] # there is no label `0`... (do not use it)
## 0.7
##
## <string>:1: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
```

both[4] fall back to position-based indexing.

Confusing? Well, with some self-discipline, the solution is easy:

Important

We should never apply `[...]`

directly on `Series`

nor `DataFrame`

objects.

To avoid ambiguity, we refer to the
**loc**`[...]`

and **iloc**`[...]`

accessors
for the label- and position-based filtering, respectively.

### 10.4.2. **loc**`[...]`

#

**Series.loc**`[...]`

implements label-based indexing.

```
b.loc[0]
## 0.39
```

This returned the element labelled `0`

. On the other hand, `c.loc[0]`

will
raise a `KeyError`

because `c`

consists of string labels only.
But in this case, we can write:

```
c.loc["j"]
## 0.39
```

Next, we can use lists of labels to select a *subset*.

```
b.loc[ [0, 1, 0] ]
## 0 0.39
## 1 0.29
## 0 0.39
## dtype: float64
c.loc[ ["j", "b", "j"] ]
## j 0.39
## b 0.29
## j 0.39
## dtype: float64
```

The result is always of the type `Series`

.

Slicing behaves differently as the range is *inclusive*
(sic![5]) at both sides:

```
b.loc[1:7]
## 1 0.29
## 8 0.23
## 7 0.55
## dtype: float64
b.loc[0:4:-1]
## 0 0.39
## 3 0.48
## 6 0.68
## 5 0.98
## 4 0.42
## dtype: float64
c.loc["d":"g"]
## d 0.55
## e 0.72
## f 0.42
## g 0.98
## dtype: float64
```

The above calls return all elements between the two indicated labels.

Note

Be careful that if there are repeated labels, then
we will be returning *all* (sic![6]) the matching items:

```
d = pd.Series([1, 2, 3, 4], index=["a", "b", "a", "c"])
d.loc["a"]
## a 1
## a 3
## dtype: int64
```

The result is not a scalar but a `Series`

object.

### 10.4.3. **iloc**`[...]`

#

Here are some examples of position-based indexing with
the **iloc**`[...]`

accessor.
It is worth stressing that, fortunately, its behaviour
is consistent with its **numpy** counterpart, i.e.,
the ordinary square brackets applied on objects of the class `ndarray`

.
For example:

```
b.iloc[0] # the same: c.iloc[0]
## 0.7
```

returns the first element.

```
b.iloc[1:7] # the same: b.iloc[1:7]
## 1 0.29
## 8 0.23
## 7 0.55
## 9 0.72
## 4 0.42
## 5 0.98
## dtype: float64
```

returns the second, third, …, seventh element
(not including `b.iloc[7]`

, i.e., the eight one).

### 10.4.4. Logical indexing#

Indexing using a logical vector-like object is also available.
For this purpose, we will usually be using **loc**`[...]`

with either a logical `Series`

object of identical `index`

slot
as the subsetted object, or a Boolean **numpy** vector.

```
b.loc[(b > 0.4) & (b < 0.6)]
## 7 0.55
## 4 0.42
## 3 0.48
## dtype: float64
```

For **iloc**`[...]`

, the indexer must be unlabelled,
e.g., be an ordinary **numpy** vector.

`

## 10.5. Indexing data frames#

### 10.5.1. **loc**`[...]`

and **iloc**`[...]`

#

For data frames, `iloc`

and `loc`

can be applied too.
Now, however, they require *two* arguments: a row and a column
selector. For example:

```
np.random.seed(123)
df = pd.DataFrame(dict(
u = np.round(np.random.rand(5), 2),
v = np.round(np.random.randn(5), 2),
w = ["spam", "bacon", "spam", "eggs", "sausage"],
x = [True, False, True, False, True]
))
```

And now:

```
df.loc[ df.loc[:, "u"] > 0.5, "u":"w" ]
## u v w
## 0 0.70 0.32 spam
## 3 0.55 1.98 eggs
## 4 0.72 -1.62 sausage
```

It selected the rows where the values in the `u`

column are greater than 0.5 and then
returns all columns between `u`

and `w`

(inclusive!).

Furthermore:

```
df.iloc[:3, :].loc[:, ["u", "w"]]
## u w
## 0 0.70 spam
## 1 0.29 bacon
## 2 0.23 spam
```

It fetched the first three rows (by position; `iloc`

is necessary)
and then selects two indicated columns.

Compare this to:

```
df.loc[:3, ["u", "w"]] # df[:3, ["u", "w"]] does not even work; please don't
## u w
## 0 0.70 spam
## 1 0.29 bacon
## 2 0.23 spam
## 3 0.55 eggs
```

which has four (!) rows.

Important

Getting a scrambled numeric index that does not match the physical positions is quite easy, for instance, in the context of data frame sorting (Section 10.6.1):

```
df2 = df.sort_values("v")
df2
## u v w x
## 4 0.72 -1.62 sausage True
## 2 0.23 -0.20 spam True
## 1 0.29 -0.05 bacon False
## 0 0.70 0.32 spam True
## 3 0.55 1.98 eggs False
```

Note how different are the following results

```
df2.loc[:3, :] # up to label 3, inclusive
## u v w x
## 4 0.72 -1.62 sausage True
## 2 0.23 -0.20 spam True
## 1 0.29 -0.05 bacon False
## 0 0.70 0.32 spam True
## 3 0.55 1.98 eggs False
df2.iloc[:3, :] # always: first three
## u v w x
## 4 0.72 -1.62 sausage True
## 2 0.23 -0.20 spam True
## 1 0.29 -0.05 bacon False
```

Important

We can frequently write `df.u`

as a shorter version of `df.loc[:, "u"]`

.
This improves the readability in contexts such as:

```
df.loc[(df.u >= 0.5) & (df.u <= 0.7), ["u", "w"]]
## u w
## 0 0.70 spam
## 3 0.55 eggs
```

This accessor is, sadly, not universal.
We can verify this by considering a data frame with
a column named, e.g., `mean`

: it clashes with the built-in method.
As a workaround, we should either use **loc**`[...]`

or rename the column, for instance, like `Mean`

or `MEAN`

.

Use **pandas.DataFrame.drop**
to select all columns except `v`

in `df`

.

Use **pandas.Series.isin** (amongst others)
to select all rows with `spam`

and `bacon`

on the `df`

’s menu.

In the
`tips`

dataset, select data on male customers where the total bills were in the
\([10, 20]\) interval. Also, select Saturday and Sunday records where
the tips were greater than $5.

### 10.5.2. Adding rows and columns#

**loc**`[...]`

can also be used to add new columns to an
existing data frame:

```
df.loc[:, "y"] = df.loc[:, "u"]**2 # or df.loc[:, "y"] = df.u**2
df
## u v w x y
## 0 0.70 0.32 spam True 0.4900
## 1 0.29 -0.05 bacon False 0.0841
## 2 0.23 -0.20 spam True 0.0529
## 3 0.55 1.98 eggs False 0.3025
## 4 0.72 -1.62 sausage True 0.5184
```

Important

Notation like “`df.new_column = ...`

” does not work.
As we said, only `loc`

and `iloc`

are universal.
For other accessors, this is not necessarily the case.

Use **pandas.DataFrame.insert**
to add a new column not necessarily at the end of `df`

.

Use **pandas.DataFrame.append**
to add a few more rows to `df`

.

### 10.5.3. Modifying items#

In the current version of **pandas**, modifying particular elements
gives a warning:

```
df.loc[:, "u"].iloc[0] = 7 # the same as df.u.iloc[0] = 7
## SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame
df.loc[:, "u"].iloc[0] # testing
## 7.0
```

In order to remedy this, it is best to create a copy of a column, modify it, and then to replace the old contents with the new ones.

```
u = df.loc[:, "u"].copy()
u.iloc[0] = 42 # or a whole for loop to process them all, or whatever
df.loc[:, "u"] = u
df.loc[:, "u"].iloc[0] # testing
## 42.0
```

### 10.5.4. Pseudorandom sampling and splitting#

As a simple application of what we have covered so far, let us consider the ways to sample several rows from an existing data frame.

We can use the **pandas.DataFrame.sample** method
in the most basic scenarios, such as:

select five rows, without replacement,

select 20% rows, with replacement,

rearrange all the rows.

For example:

```
body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_bmx_2020.csv",
comment="#")
body.sample(5, random_state=123) # 5 rows without replacement
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 4214 58.4 156.2 35.2 34.7 27.2 99.5 77.5
## 3361 73.7 161.0 36.5 34.5 29.0 107.6 98.2
## 3759 61.4 164.6 37.5 40.4 26.9 93.5 84.4
## 3733 120.4 158.8 33.5 34.6 40.5 147.2 129.3
## 1121 123.5 157.5 35.5 29.0 50.5 143.0 136.4
```

Notice the `random_state`

argument which controls the seed
of the pseudorandom number generator so that we get
reproducible results.
Alternatively, we could call **numpy.random.seed**.

Show how the three aforementioned scenarios
can be implemented manually using **iloc**`[...]`

and **numpy.random.permutation**
or **numpy.random.choice**.

In machine learning practice, we are used to training and evaluating machine learning models on different (mutually disjoint) subsets of the whole data frame.

For instance, Section 12.3.3 mentions
that we may be interested in performing the
so-called *training/test split* (partitioning),
where 80% (or 60% or 70%) of the
randomly selected rows would constitute the first
new data frame and the remaining 20% (or 40% or 30%, respectively)
would go to the second one.

Given a data frame like:

```
df = body.head(10) # this is just an example
df
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 0 97.1 160.2 34.7 40.8 35.8 126.1 117.9
## 1 91.1 152.7 33.5 33.0 38.5 125.5 103.1
## 2 73.0 161.2 37.4 38.0 31.8 106.2 92.0
## 3 61.7 157.4 38.0 34.7 29.0 101.0 90.5
## 4 55.4 154.6 34.6 34.0 28.3 92.5 73.2
## 5 62.0 144.7 32.5 34.2 29.8 106.7 84.8
## 6 66.2 166.5 37.5 37.6 32.0 96.3 95.7
## 7 75.9 154.5 35.4 37.6 32.7 107.7 98.7
## 8 77.2 159.2 38.5 40.5 35.7 102.0 97.5
## 9 91.6 174.5 36.1 45.9 35.2 121.3 100.3
```

one way to perform the aforementioned split is to generate a random permutation of the set of row indexes:

```
np.random.seed(123) # reproducibility matters
idx = np.random.permutation(df.shape[0])
idx
## array([4, 0, 7, 5, 8, 3, 1, 6, 9, 2])
```

And then to pick the first 80% of them to construct the data frame number one:

```
k = int(df.shape[0]*0.8)
df.iloc[idx[:k], :]
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 4 55.4 154.6 34.6 34.0 28.3 92.5 73.2
## 0 97.1 160.2 34.7 40.8 35.8 126.1 117.9
## 7 75.9 154.5 35.4 37.6 32.7 107.7 98.7
## 5 62.0 144.7 32.5 34.2 29.8 106.7 84.8
## 8 77.2 159.2 38.5 40.5 35.7 102.0 97.5
## 3 61.7 157.4 38.0 34.7 29.0 101.0 90.5
## 1 91.1 152.7 33.5 33.0 38.5 125.5 103.1
## 6 66.2 166.5 37.5 37.6 32.0 96.3 95.7
```

and the remaining ones to generate the second dataset:

```
df.iloc[idx[k:], :]
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 9 91.6 174.5 36.1 45.9 35.2 121.3 100.3
## 2 73.0 161.2 37.4 38.0 31.8 106.2 92.0
```

In the
`wine_quality_all`

dataset, leave out all but the white wines. Partition the resulting data
frame randomly into three data frames: `wines_train`

(60% of the rows),
`wines_validate`

(another 20% of the rows), and `wines_test`

(the remaining 20%).

Compose a function **kfold** which takes a data frame `df`

and an
integer \(k>1\) as arguments. Return a list of data frames resulting
in stemming from randomly partitioning `df`

into \(k\) disjoint chunks of
equal (or almost equal if that is not possible) sizes.

### 10.5.5. Hierarchical indexes (*)#

Consider the following `DataFrame`

object
with a hierarchical index:

```
np.random.seed(123)
df = pd.DataFrame(dict(
year = np.repeat([2023, 2024, 2025], 4),
quarter = np.tile(["Q1", "Q2", "Q3", "Q4"], 3),
data = np.round(np.random.rand(12), 2)
)).set_index(["year", "quarter"])
df
## data
## year quarter
## 2023 Q1 0.70
## Q2 0.29
## Q3 0.23
## Q4 0.55
## 2024 Q1 0.72
## Q2 0.42
## Q3 0.98
## Q4 0.68
## 2025 Q1 0.48
## Q2 0.39
## Q3 0.34
## Q4 0.73
```

The index has both levels named, but this is purely for aesthetic reasons.

Indexing using **loc**`[...]`

by default relates to the
first level of the hierarchy:

```
df.loc[2023, :]
## data
## quarter
## Q1 0.70
## Q2 0.29
## Q3 0.23
## Q4 0.55
```

Note that we selected *all* rows corresponding to a given label
and dropped (!) this level of the hierarchy.

Another example:

```
df.loc[ [2023, 2025], : ]
## data
## year quarter
## 2023 Q1 0.70
## Q2 0.29
## Q3 0.23
## Q4 0.55
## 2025 Q1 0.48
## Q2 0.39
## Q3 0.34
## Q4 0.73
```

To access deeper levels, we can use tuples as indexers:

```
df.loc[ (2023, "Q1"), : ]
## data 0.7
## Name: (2023, Q1), dtype: float64
df.loc[ [(2023, "Q1"), (2024, "Q3")], : ]
## data
## year quarter
## 2023 Q1 0.70
## 2024 Q3 0.98
```

In certain scenarios, though, it will probably be much easier
to subset a hierarchical index by using **reset_index**
and **set_index** creatively (together with
**loc**`[...]`

and **pandas.Series.isin**, etc.).

Let us stress again that the `**:**` operator
can only be used *directly* within the square brackets.
Nonetheless, we can always use the **slice** constructor to create
a slice in any context:

```
df.loc[ (slice(None), ["Q1", "Q3"]), : ] # :, ["Q1", "Q3"]
## data
## year quarter
## 2023 Q1 0.70
## Q3 0.23
## 2024 Q1 0.72
## Q3 0.98
## 2025 Q1 0.48
## Q3 0.34
df.loc[ (slice(None, None, -1), slice("Q2", "Q3")), : ] # ::-1, "Q2":"Q3"
## data
## year quarter
## 2025 Q3 0.34
## Q2 0.39
## 2024 Q3 0.98
## Q2 0.42
## 2023 Q3 0.23
## Q2 0.29
```

## 10.6. Further operations on data frames#

One of the many roles of data frames is to represent tables of values for their nice presentation, e.g., in reports from data analysis or research papers. Here are some functions that can aid in their formatting.

### 10.6.1. Sorting#

Let us consider another example dataset. Here are the yearly (for 2018) average air quality data in the Australian state of Victoria.

```
air = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_means.csv",
comment="#")
air = (
air.
loc[air.param_id.isin(["BPM2.5", "NO2"]), :].
reset_index(drop=True)
)
```

We chose two air quality parameters using
**pandas.Series.isin**, which determines whether
each element in a `Series`

is enlisted in a given sequence.
We could also have used **set_index** and **loc**`[...]`

for that.

Notice that the above code spans many lines. We needed to enclose it in round brackets to avoid a syntax error. Alternatively, we could have used backslashes at the end of each line.

Anyway, here is the data frame:

```
air
## sp_name param_id value
## 0 Alphington BPM2.5 7.848758
## 1 Alphington NO2 9.558120
## 2 Altona North NO2 9.467912
## 3 Churchill BPM2.5 6.391230
## 4 Dandenong NO2 9.800705
## 5 Footscray BPM2.5 7.640948
## 6 Footscray NO2 10.274531
## 7 Geelong South BPM2.5 6.502762
## 8 Geelong South NO2 5.681722
## 9 Melbourne CBD BPM2.5 8.072998
## 10 Moe BPM2.5 6.427079
## 11 Morwell East BPM2.5 6.784596
## 12 Morwell South BPM2.5 6.512849
## 13 Morwell South NO2 5.124430
## 14 Traralgon BPM2.5 8.024735
## 15 Traralgon NO2 5.776333
```

**sort_values** is a convenient means to order the rows
with respect to one criterion, be it numeric or categorical.

```
air.sort_values("value", ascending=False)
## sp_name param_id value
## 6 Footscray NO2 10.274531
## 4 Dandenong NO2 9.800705
## 1 Alphington NO2 9.558120
## 2 Altona North NO2 9.467912
## 9 Melbourne CBD BPM2.5 8.072998
## 14 Traralgon BPM2.5 8.024735
## 0 Alphington BPM2.5 7.848758
## 5 Footscray BPM2.5 7.640948
## 11 Morwell East BPM2.5 6.784596
## 12 Morwell South BPM2.5 6.512849
## 7 Geelong South BPM2.5 6.502762
## 10 Moe BPM2.5 6.427079
## 3 Churchill BPM2.5 6.391230
## 15 Traralgon NO2 5.776333
## 8 Geelong South NO2 5.681722
## 13 Morwell South NO2 5.124430
```

It is also possible to take into account more sorting criteria:

```
air.sort_values(["param_id", "value"], ascending=[True, False])
## sp_name param_id value
## 9 Melbourne CBD BPM2.5 8.072998
## 14 Traralgon BPM2.5 8.024735
## 0 Alphington BPM2.5 7.848758
## 5 Footscray BPM2.5 7.640948
## 11 Morwell East BPM2.5 6.784596
## 12 Morwell South BPM2.5 6.512849
## 7 Geelong South BPM2.5 6.502762
## 10 Moe BPM2.5 6.427079
## 3 Churchill BPM2.5 6.391230
## 6 Footscray NO2 10.274531
## 4 Dandenong NO2 9.800705
## 1 Alphington NO2 9.558120
## 2 Altona North NO2 9.467912
## 15 Traralgon NO2 5.776333
## 8 Geelong South NO2 5.681722
## 13 Morwell South NO2 5.124430
```

Here, in each group of identical parameters, we get a decreasing order with respect to the value.

Compare the ordering with respect to `param_id`

and `value`

vs `value`

and then `param_id`

.

Note

(*)
Lamentably, **DataFrame.sort_values** by default
does not use a stable algorithm. If a data frame is sorted with respect to one criterion,
and then we reorder it with respect to another one, tied observations
are not guaranteed to be listed in the original order:

```
(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_means.csv",
comment="#")
.sort_values("sp_name")
.sort_values("param_id")
.set_index("param_id")
.loc[["BPM2.5", "NO2"], :]
.reset_index())
## param_id sp_name value
## 0 BPM2.5 Melbourne CBD 8.072998
## 1 BPM2.5 Moe 6.427079
## 2 BPM2.5 Footscray 7.640948
## 3 BPM2.5 Morwell East 6.784596
## 4 BPM2.5 Churchill 6.391230
## 5 BPM2.5 Morwell South 6.512849
## 6 BPM2.5 Traralgon 8.024735
## 7 BPM2.5 Alphington 7.848758
## 8 BPM2.5 Geelong South 6.502762
## 9 NO2 Morwell South 5.124430
## 10 NO2 Traralgon 5.776333
## 11 NO2 Geelong South 5.681722
## 12 NO2 Altona North 9.467912
## 13 NO2 Alphington 9.558120
## 14 NO2 Dandenong 9.800705
## 15 NO2 Footscray 10.274531
```

We lost the ordering based on station names in the two subgroups.
To switch to a mergesort-like method (timsort), we should pass
`kind="stable"`

.

```
(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_means.csv",
comment="#")
.sort_values("sp_name")
.sort_values("param_id", kind="stable") # !
.set_index("param_id")
.loc[["BPM2.5", "NO2"], :]
.reset_index())
## param_id sp_name value
## 0 BPM2.5 Alphington 7.848758
## 1 BPM2.5 Churchill 6.391230
## 2 BPM2.5 Footscray 7.640948
## 3 BPM2.5 Geelong South 6.502762
## 4 BPM2.5 Melbourne CBD 8.072998
## 5 BPM2.5 Moe 6.427079
## 6 BPM2.5 Morwell East 6.784596
## 7 BPM2.5 Morwell South 6.512849
## 8 BPM2.5 Traralgon 8.024735
## 9 NO2 Alphington 9.558120
## 10 NO2 Altona North 9.467912
## 11 NO2 Dandenong 9.800705
## 12 NO2 Footscray 10.274531
## 13 NO2 Geelong South 5.681722
## 14 NO2 Morwell South 5.124430
## 15 NO2 Traralgon 5.776333
```

(*) Perform identical reorderings but using only
**loc**`[...]`

, **iloc**`[...]`

,
and **numpy.argsort**.

### 10.6.2. Stacking and unstacking (long/tall and wide forms)#

Let us discuss some further ways to transform data frames that benefit from, make sense thanks to, or are possible because they can have columns of various types.

The above `air`

dataset is in the *long* (*tall*) format.
All measurements are *stacked* one after/below another.
Such a form is quite convenient for data storage,
especially where there are only a few recorded values
but many possible combinations of levels (sparse data).

The long format might not be optimal in all data processing
activities, though; compare [95].
In the part of this book on matrix processing,
it was much more natural for us to have a single *observation*
in each row (e.g., data for each measurement station).

We can *unstack* the `air`

data frame quite easily:

```
air_wide = air.set_index(["sp_name", "param_id"]).unstack().loc[:, "value"]
air_wide
## param_id BPM2.5 NO2
## sp_name
## Alphington 7.848758 9.558120
## Altona North NaN 9.467912
## Churchill 6.391230 NaN
## Dandenong NaN 9.800705
## Footscray 7.640948 10.274531
## Geelong South 6.502762 5.681722
## Melbourne CBD 8.072998 NaN
## Moe 6.427079 NaN
## Morwell East 6.784596 NaN
## Morwell South 6.512849 5.124430
## Traralgon 8.024735 5.776333
```

This is the so-called *wide* format.

The missing values are denoted with `NaN`

s (not-a-number);
see Section 15.1 for more details.
Interestingly, we got a hierarchical index in the `columns`

(sic!) slot,
hence the **loc**`[...]`

part
to drop the last level of the hierarchy.
Also notice that the `index`

and `columns`

slots are named.

The other way around, we can use the **stack** method:

```
air_wide.T.rename_axis(index="location", columns="param").\
stack().rename("value").reset_index()
## location param value
## 0 BPM2.5 Alphington 7.848758
## 1 BPM2.5 Churchill 6.391230
## 2 BPM2.5 Footscray 7.640948
## 3 BPM2.5 Geelong South 6.502762
## 4 BPM2.5 Melbourne CBD 8.072998
## 5 BPM2.5 Moe 6.427079
## 6 BPM2.5 Morwell East 6.784596
## 7 BPM2.5 Morwell South 6.512849
## 8 BPM2.5 Traralgon 8.024735
## 9 NO2 Alphington 9.558120
## 10 NO2 Altona North 9.467912
## 11 NO2 Dandenong 9.800705
## 12 NO2 Footscray 10.274531
## 13 NO2 Geelong South 5.681722
## 14 NO2 Morwell South 5.124430
## 15 NO2 Traralgon 5.776333
```

We used the data frame transpose (**T**) to
get a location-major order (less boring an outcome in this context).
Missing values are gone now. We do not need them anymore.
Nevertheless, passing `dropna=False`

would help us identify
the combinations of `location`

and `param`

for which the readings
are not provided.

### 10.6.3. Joining (merging)#

In database design, it is common to normalise the datasets. We do this to avoid the duplication of information and pathologies stemming from them (e.g., [19]).

The above air quality parameters are separately described in another data frame:

```
param = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_param.csv",
comment="#")
param = param.rename(dict(param_std_unit_of_measure="unit"), axis=1)
param
## param_id param_name unit param_short_name
## 0 API Airborne particle index none Visibility Reduction
## 1 BPM2.5 BAM Particles < 2.5 micron ug/m3 PM2.5
## 2 CO Carbon Monoxide ppm CO
## 3 HPM10 Hivol PM10 ug/m3 NaN
## 4 NO2 Nitrogen Dioxide ppb NO2
## 5 O3 Ozone ppb O3
## 6 PM10 TEOM Particles <10micron ug/m3 PM10
## 7 PPM2.5 Partisol PM2.5 ug/m3 NaN
## 8 SO2 Sulfur Dioxide ppb SO2
```

We could have stored them alongside the `air`

data frame,
but that would be a waste of space.
Also, if we wanted to modify some datum
(note, e.g., the annoying double space in `param_name`

for `BPM2.5`

),
we would have to update all the relevant records.

Instead, we can always match the records in both data frames
that have the same `param_id`

s,
and join (merge) these datasets only when we really need this.

Let us discuss the possible join operations by studying the two following toy datasets:

```
A = pd.DataFrame({
"x": ["a0", "a1", "a2", "a3"],
"y": ["b0", "b1", "b2", "b3"]
})
A
## x y
## 0 a0 b0
## 1 a1 b1
## 2 a2 b2
## 3 a3 b3
```

and:

```
B = pd.DataFrame({
"x": ["a0", "a2", "a2", "a4"],
"z": ["c0", "c1", "c2", "c3"]
})
B
## x z
## 0 a0 c0
## 1 a2 c1
## 2 a2 c2
## 3 a4 c3
```

They both have one column somewhat *in common*, `x`

.

The *inner (natural) join* returns the records that have a match in
both datasets:

```
pd.merge(A, B, on="x")
## x y z
## 0 a0 b0 c0
## 1 a2 b2 c1
## 2 a2 b2 c2
```

The *left join* of \(A\) with \(B\) guarantees to return all the records
from \(A\), even those which are not matched by anything in \(B\).

```
pd.merge(A, B, how="left", on="x")
## x y z
## 0 a0 b0 c0
## 1 a1 b1 NaN
## 2 a2 b2 c1
## 3 a2 b2 c2
## 4 a3 b3 NaN
```

The *right join* of \(A\) with \(B\) is the same as the left join of
\(B\) with \(A\):

```
pd.merge(A, B, how="right", on="x")
## x y z
## 0 a0 b0 c0
## 1 a2 b2 c1
## 2 a2 b2 c2
## 3 a4 NaN c3
```

Finally, the *full outer join* is the set-theoretic
union of the left and the right join:

```
pd.merge(A, B, how="outer", on="x")
## x y z
## 0 a0 b0 c0
## 1 a1 b1 NaN
## 2 a2 b2 c1
## 3 a2 b2 c2
## 4 a3 b3 NaN
## 5 a4 NaN c3
```

Join
`air_quality_2018_value`

with `air_quality_2018_point`

,
and then with `air_quality_2018_param`

.

Normalise
`air_quality_2018`

so that you get the three separate data frames mentioned in the previous
exercise (`value`

, `point`

, and `param`

).

(*)
In the National Health and Nutrition Examination Survey,
each participant is uniquely identified by their sequence number (`SEQN`

).
This token is mentioned in numerous datasets, including:

Join a few chosen datasets that you find interesting.

### 10.6.4. Set-theoretic operations and removing duplicates#

Here are two not at all disjoint sets of imaginary persons:

```
A = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/some_birth_dates1.csv",
comment="#")
A
## Name BirthDate
## 0 Paitoon Ornwimol 26.06.1958
## 1 Antónia Lata 20.05.1935
## 2 Bertoldo Mallozzi 17.08.1972
## 3 Nedeljko Bukv 19.12.1921
## 4 Micha Kitchen 17.09.1930
## 5 Mefodiy Shachar 01.10.1914
## 6 Paul Meckler 29.09.1968
## 7 Katarzyna Lasko 20.10.1971
## 8 Åge Trelstad 07.03.1935
## 9 Duchanee Panomyaong 19.06.1952
```

and:

```
B = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/some_birth_dates2.csv",
comment="#")
B
## Name BirthDate
## 0 Hushang Naigamwala 25.08.1991
## 1 Zhen Wei 16.11.1975
## 2 Micha Kitchen 17.09.1930
## 3 Jodoc Alwin 16.11.1969
## 4 Igor Mazał 14.05.2004
## 5 Katarzyna Lasko 20.10.1971
## 6 Duchanee Panomyaong 19.06.1952
## 7 Mefodiy Shachar 01.10.1914
## 8 Paul Meckler 29.09.1968
## 9 Noe Tae-Woong 11.07.1970
## 10 Åge Trelstad 07.03.1935
```

In both datasets, there is a single categorical column whose
elements uniquely identify each record (i.e., `Name`

).
In the language of relational databases, we would call it
the *primary key*.
In such a case, implementing the set-theoretic operations
is relatively easy, as we can refer to the **pandas.Series.isin**
method.

First, \(A\cap B\) (intersection),
includes only the rows that are *both* in \(A\) and in \(B\):

```
A.loc[A.Name.isin(B.Name), :]
## Name BirthDate
## 4 Micha Kitchen 17.09.1930
## 5 Mefodiy Shachar 01.10.1914
## 6 Paul Meckler 29.09.1968
## 7 Katarzyna Lasko 20.10.1971
## 8 Åge Trelstad 07.03.1935
## 9 Duchanee Panomyaong 19.06.1952
```

Second, \(A\smallsetminus B\) (difference), gives all
the rows that are in \(A\) *but not* in \(B\):

```
A.loc[~A.Name.isin(B.Name), :]
## Name BirthDate
## 0 Paitoon Ornwimol 26.06.1958
## 1 Antónia Lata 20.05.1935
## 2 Bertoldo Mallozzi 17.08.1972
## 3 Nedeljko Bukv 19.12.1921
```

Third, \(A\cup B\) (union), returns the rows that exist in \(A\) *or*
are in \(B\):

```
pd.concat((A, B.loc[~B.Name.isin(A.Name), :]))
## Name BirthDate
## 0 Paitoon Ornwimol 26.06.1958
## 1 Antónia Lata 20.05.1935
## 2 Bertoldo Mallozzi 17.08.1972
## 3 Nedeljko Bukv 19.12.1921
## 4 Micha Kitchen 17.09.1930
## 5 Mefodiy Shachar 01.10.1914
## 6 Paul Meckler 29.09.1968
## 7 Katarzyna Lasko 20.10.1971
## 8 Åge Trelstad 07.03.1935
## 9 Duchanee Panomyaong 19.06.1952
## 0 Hushang Naigamwala 25.08.1991
## 1 Zhen Wei 16.11.1975
## 3 Jodoc Alwin 16.11.1969
## 4 Igor Mazał 14.05.2004
## 9 Noe Tae-Woong 11.07.1970
```

There are no duplicate rows in any of the above outputs.

Determine \((A\cup B) \smallsetminus (A\cap B) = (A\smallsetminus B)\cup(B\smallsetminus A)\) (symmetric difference).

(*)
Determine the union, intersection, and difference of the
`wine_sample1`

and
`wine_sample2`

datasets,
where there is no column uniquely identifying the observations.
Hint: consider using
**pandas.concat** and
**pandas.DataFrame.duplicated** or
**pandas.DataFrame.drop_duplicates**.

### 10.6.5. …and (too) many more#

Looking at the list of methods
for the `DataFrame`

and `Series`

classes in the **pandas** package’s
documentation,
we can see that they are abundant. Together with the object-orientated
syntax, we will often find ourselves appreciating the high readability
of even quite complex operation chains such as
`data.drop_duplicates().groupby(["year", "month"]).mean().reset_index()`

;
see Chapter 12.

Nevertheless, the methods are probably too plentiful to our taste.
Their developers were overgenerous. They wanted
to include a list of all the possible *verbs* related to data analysis,
even if they can be trivially expressed by a composition of 2-3 simpler
operations from **numpy** or **scipy** instead.

As strong advocates of minimalism, we would rather
save ourselves from being overloaded with too much new information.
This is why our focus in this book is on developing the most
*transferable*[7] skills.
Our approach is also slightly more hygienic.
We do not want the reader to develop a hopeless mindset,
the habit of looking everything up on the internet
when faced with even the simplest kinds of problems.
We have brains for a reason.

## 10.7. Exercises#

How are data frames different from matrices?

What are the use cases of the `name`

slot
in `Series`

and `Index`

objects?

What is the purpose of **set_index** and **reset_index**?

Why learning **numpy** is crucial when someone wants to become
a proficient user of **pandas**?

What is the difference between **iloc**`[...]`

and **loc**`[...]`

?

Why applying the index operator `[...]`

directly on a
`Series`

or `DataFrame`

object is discouraged?

What is the difference between `index`

, `Index`

, and `columns`

?

How can we compute the arithmetic mean and median of all the numeric columns in a data frame, using a single line of code?

What is a training/test split and how to perform it using
**numpy** and **pandas**?

What is the difference between stacking and unstacking? Which one yields a wide (as opposed to long) format?

Name different data frame join (merge) operations and explain how they work.

How does sorting with respect to more than one criterion work?

Name the basic set-theoretic operations on data frames.