10. Introducing Data Frames


numpy arrays are an extremely versatile tool for performing data analysis and other numerical computations of various kinds. Although theoretically possible otherwise, in practice we only store elements of the same type therein, most often numbers.

pandas [McK17] is amongst over one hundred thousand open-source packages and repositories that use numpy to provide additional data wrangling functionality. It was originally written by Wes McKinney and was heavily inspired by the data.frame[1] objects in S and R as well as tables in relational (think: SQL) databases and spreadsheets.

It is customary to load this package with the following alias:

import pandas as pd

The pandas package delivers a few classes, of which the most important are:

  • DataFrame – for representing tabular data (matrix-like) with columns of possibly different types, in particular a mix of numerical and categorical variables,

  • Series – vector-like objects for representing individual columns,

  • Index – vector-like (usually) objects for labelling individual rows and columns of DataFrames and items in Series objects,

together with many methods for:

  • transforming/aggregating/processing data, also in groups determined by categorical variables or products thereof,

  • reshaping (e.g., from wide to long format) and joining datasets,

  • importing/exporting data from/to various sources and formats, e.g., CSV and HDF5 files or relational databases,

  • handling missing data,

which we introduce in this and further chapters.

10.1. Creating Data Frames

Important

Let us repeat: pandas is built on top of numpy, and most objects therein can be processed by numpy functions as well. Other functions (e.g., in seaborn or sklearn) accept both DataFrame and ndarray objects, but convert the former to the latter automatically to enable data processing using fast C/C++/Fortran routines. What we have learnt so far still applies. But there is of course more, hence this part.
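
For instance (a minimal sketch; numpy is assumed to be imported as np, like everywhere else in this book):

df_small = pd.DataFrame(dict(a=[1.0, 2.0], b=[3.0, 4.0]))
df_small.to_numpy()  # the underlying values, as an ordinary ndarray
## array([[1., 3.],
##        [2., 4.]])
np.log(df_small)  # vectorised numpy functions accept data frames too
##           a         b
## 0  0.000000  1.098612
## 1  0.693147  1.386294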

Data frames can be created, amongst others, using the DataFrame class constructor, which can be fed, for example, with a numpy matrix:

np.random.seed(123)
pd.DataFrame(
    np.random.rand(4, 3),
    columns=["a", "b", "c"]
)
##           a         b         c
## 0  0.696469  0.286139  0.226851
## 1  0.551315  0.719469  0.423106
## 2  0.980764  0.684830  0.480932
## 3  0.392118  0.343178  0.729050

Notice that rows and columns are labelled (and how readable that is).

A dictionary of vector-like objects of equal lengths is another common option:

np.random.seed(123)
pd.DataFrame(dict(
    u=np.round(np.random.rand(5), 2),
    v=[True, True, False, False, True],
    w=["A", "B", "C", "D", "E"],
    x=["spam", "spam", "bacon", "spam", "eggs"]
))
##       u      v  w      x
## 0  0.70   True  A   spam
## 1  0.29   True  B   spam
## 2  0.23  False  C  bacon
## 3  0.55  False  D   spam
## 4  0.72   True  E   eggs

This illustrates the possibility of having columns of different types.

Exercise 10.1

Check out the pandas.DataFrame.from_dict and pandas.DataFrame.from_records methods in the documentation. Use them to create some example data frames.
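
To get started, here is a minimal sketch (the column names and values below are, naturally, arbitrary):

pd.DataFrame.from_dict({"a": [1, 2], "b": [3, 4]})
##    a  b
## 0  1  3
## 1  2  4
pd.DataFrame.from_records([(1, 3), (2, 4)], columns=["a", "b"])
##    a  b
## 0  1  3
## 1  2  4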

Further, data frames can be read from files in different formats, for instance, CSV:

body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/nhanes_adult_female_bmx_2020.csv",
    comment="#")
body.head()  # display first few rows (5 by default)
##    BMXWT  BMXHT  BMXARML  BMXLEG  BMXARMC  BMXHIP  BMXWAIST
## 0   97.1  160.2     34.7    40.8     35.8   126.1     117.9
## 1   91.1  152.7     33.5    33.0     38.5   125.5     103.1
## 2   73.0  161.2     37.4    38.0     31.8   106.2      92.0
## 3   61.7  157.4     38.0    34.7     29.0   101.0      90.5
## 4   55.4  154.6     34.6    34.0     28.3    92.5      73.2

Reading from URLs and local files (compare Section 13.6.1) is of course supported.

Exercise 10.2

Check out the other pandas.read_* functions in the pandas documentation – some of which we shall be discussing later.

10.1.1. Data Frames are Matrix-Like

Data frames are modelled after numpy matrices, hence we can feel quite at home with them.

For example, given:

np.random.seed(123)
df = pd.DataFrame(dict(
    u=np.round(np.random.rand(5), 2),
    v=[True, True, False, False, True],
    w=["A", "B", "C", "D", "E"],
    x=["spam", "spam", "bacon", "spam", "eggs"]
))
df
##       u      v  w      x
## 0  0.70   True  A   spam
## 1  0.29   True  B   spam
## 2  0.23  False  C  bacon
## 3  0.55  False  D   spam
## 4  0.72   True  E   eggs

it is easy to fetch the number of rows and columns:

df.shape
## (5, 4)

or the type of each column:

df.dtypes
## u    float64
## v       bool
## w     object
## x     object
## dtype: object

Recall that arrays are equipped with the dtype slot.
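
As a quick check, converting a data frame with columns of differing types to a single numpy array requires finding their least common denominator (here: object):

df.to_numpy().dtype
## dtype('O')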

10.1.2. Series

There is a separate class for storing individual data frame columns: it is called Series.

s = df.loc[:, "u"]  # extract the `u` column; alternatively: df.u
s
## 0    0.70
## 1    0.29
## 2    0.23
## 3    0.55
## 4    0.72
## Name: u, dtype: float64

A data frame with one column is printed out slightly differently:

s.to_frame()
##       u
## 0  0.70
## 1  0.29
## 2  0.23
## 3  0.55
## 4  0.72

Indexing will of course be discussed later.

Important

It is crucial to know when we are dealing with a Series and when with a DataFrame object, because each of them defines a slightly different set of methods.

For example:

s.mean()
## 0.49800000000000005

refers to pandas.Series.mean (which returns a scalar), whereas

df.mean(numeric_only=True)
## u    0.498
## v    0.600
## dtype: float64

uses pandas.DataFrame.mean (which yields a Series).

Exercise 10.3

Look up the two methods in the pandas manual. Note that their argument list is slightly different.

Series are wrappers around numpy arrays.

s.values
## array([0.7 , 0.29, 0.23, 0.55, 0.72])

Most importantly, numpy functions can be called directly on them:

np.mean(s)
## 0.49800000000000005

Hence, what we have covered in the part dealing with vector processing still holds for data frame columns as well (there will be more, stay tuned).

Series can also be named.

s.name
## 'u'

This is convenient when we convert from/to a data frame.

s.rename("spam").to_frame()
##    spam
## 0  0.70
## 1  0.29
## 2  0.23
## 3  0.55
## 4  0.72

10.1.3. Index

Another important class is Index, which is used for encoding the row and column names in a data frame:

df.index  # row labels
## RangeIndex(start=0, stop=5, step=1)

The above represents a sequence (0, 1, 2, 3, 4).

df.columns  # column labels
## Index(['u', 'v', 'w', 'x'], dtype='object')

Also, we can label the individual elements in Series objects:

s.index
## RangeIndex(start=0, stop=5, step=1)

A quite frequent operation that we will be applying to make a data frame column act as a vector of row labels is implemented in the set_index method for data frames:

df2 = df.set_index("x")
df2
##           u      v  w
## x                    
## spam   0.70   True  A
## spam   0.29   True  B
## bacon  0.23  False  C
## spam   0.55  False  D
## eggs   0.72   True  E

This Index object is named:

df2.index.name
## 'x'

which is handy when we decide that we want to convert the vector of row labels back to a standalone column:

df2.reset_index()
##        x     u      v  w
## 0   spam  0.70   True  A
## 1   spam  0.29   True  B
## 2  bacon  0.23  False  C
## 3   spam  0.55  False  D
## 4   eggs  0.72   True  E

There is also an option to get rid of the current .index and to replace it with the default label sequence, i.e., 0, 1, 2, …:

df2.reset_index(drop=True)
##       u      v  w
## 0  0.70   True  A
## 1  0.29   True  B
## 2  0.23  False  C
## 3  0.55  False  D
## 4  0.72   True  E

Important

In due course, we will be calling df.reset_index(drop=True) quite frequently, sometimes more than once in a single chain of commands.

Exercise 10.4

Use the pandas.DataFrame.rename method to change the name of the u column in df to spam.

Also, a hierarchical index – one that comprises two or more levels – is possible:

df.sort_values("x", ascending=False).set_index(["x", "v"])
##                 u  w
## x     v             
## spam  True   0.70  A
##       True   0.29  B
##       False  0.55  D
## eggs  True   0.72  E
## bacon False  0.23  C

Note that a group of three consecutive spams does not have the labels repeated – this is for increased readability.

Important

We will soon see that hierarchical indexes might arise after aggregating data in groups (via the groupby method).

For example:

nhanes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/nhanes_p_demo_bmx_2020.csv",
    comment="#")
res = nhanes.groupby(["RIAGENDR", "DMDEDUC2"])["BMXBMI"].mean()
res  # BMI by gender and education
## RIAGENDR  DMDEDUC2
## 1         1.0         28.765244
##           2.0         28.534737
##           3.0         29.637793
##           4.0         30.323158
##           5.0         28.689736
##           9.0         28.080000
## 2         1.0         30.489032
##           2.0         31.251247
##           3.0         30.969200
##           4.0         31.471476
##           5.0         28.891357
##           7.0         30.200000
##           9.0         33.950000
## Name: BMXBMI, dtype: float64

But let us fret not, as there is always:

res.reset_index()
##     RIAGENDR  DMDEDUC2     BMXBMI
## 0          1       1.0  28.765244
## 1          1       2.0  28.534737
## 2          1       3.0  29.637793
## 3          1       4.0  30.323158
## 4          1       5.0  28.689736
## 5          1       9.0  28.080000
## 6          2       1.0  30.489032
## 7          2       2.0  31.251247
## 8          2       3.0  30.969200
## 9          2       4.0  31.471476
## 10         2       5.0  28.891357
## 11         2       7.0  30.200000
## 12         2       9.0  33.950000

10.2. Aggregating Data Frames

Here is an example data frame:

np.random.seed(123)
d = pd.DataFrame(dict(
    u = np.round(np.random.rand(5), 2),
    v = np.round(np.random.randn(5), 2),
    w = ["spam", "bacon", "spam", "eggs", "sausage"]
), index=["a", "b", "c", "d", "e"])
d
##       u     v        w
## a  0.70  0.32     spam
## b  0.29 -0.05    bacon
## c  0.23 -0.20     spam
## d  0.55  1.98     eggs
## e  0.72 -1.62  sausage

First, all numpy functions can be applied directly on individual columns (i.e., objects of type Series), as the latter are, after all, vector-like.

np.quantile(d.loc[:, "u"], [0, 0.5, 1])
## array([0.23, 0.55, 0.72])

For more details on loc[...]-type indexing, see below. For now, the meaning of the above should be clear from the context.

Most numpy functions also work if they are fed with whole data frames, but we will need to select the numeric columns manually.

np.quantile(d.loc[:, ["u", "v"]], [0, 0.5, 1], axis=0)
## array([[ 0.23, -1.62],
##        [ 0.55, -0.05],
##        [ 0.72,  1.98]])

Sometimes the results will automatically be coerced to a Series object with the index slot set appropriately:

np.mean(d.loc[:, ["u", "v"]], axis=0)
## u    0.498
## v    0.086
## dtype: float64

Many operations, for convenience, have also been implemented as methods for the Series and DataFrame classes, e.g., mean, median, min, max, quantile, var, std (with ddof=1 by default, interestingly), and skew.

d.loc[:, ["u", "v"]].mean(numeric_only=True)
## u    0.498
## v    0.086
## dtype: float64
d.loc[:, ["u", "v"]].quantile([0, 0.5, 1], numeric_only=True)
##         u     v
## 0.0  0.23 -1.62
## 0.5  0.55 -0.05
## 1.0  0.72  1.98

Also note the describe method, which returns a few statistics at the same time.

d.describe()
##               u         v
## count  5.000000  5.000000
## mean   0.498000  0.086000
## std    0.227969  1.289643
## min    0.230000 -1.620000
## 25%    0.290000 -0.200000
## 50%    0.550000 -0.050000
## 75%    0.700000  0.320000
## max    0.720000  1.980000

Note

(*) Let us stress that what we see above is the bias-corrected (but still only asymptotically unbiased) version of the standard deviation (ddof=1), given by \(\sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2}\); compare Section 5.1.

Exercise 10.5

Check out the pandas.DataFrame.agg method that can apply all aggregating operations from a given list of functions. Write a call equivalent to d.describe().
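
As a hint, agg accepts a list of function names (a partial sketch; extending it so that it matches describe exactly is left as part of the exercise):

d.loc[:, ["u", "v"]].agg(["mean", "std"])
##              u         v
## mean  0.498000  0.086000
## std   0.227969  1.289643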

10.3. Transforming Data Frames

By applying the already well-known vectorised mathematical numpy functions, we can transform each data cell and return an object of the same type as the input one.

np.exp(d.loc[:, "u"])
## a    2.013753
## b    1.336427
## c    1.258600
## d    1.733253
## e    2.054433
## Name: u, dtype: float64
np.exp(d.loc[:, ["u", "v"]])
##           u         v
## a  2.013753  1.377128
## b  1.336427  0.951229
## c  1.258600  0.818731
## d  1.733253  7.242743
## e  2.054433  0.197899

When applying the binary arithmetic, comparison, and logical operators on an object of class Series and a scalar or a numpy vector, the operations are performed elementwisely – a style with which we are already familiar.

For instance, here is a standardised version of the u column:

u = d.loc[:, "u"]
(u - np.mean(u)) / np.std(u)
## a    0.990672
## b   -1.020098
## c   -1.314357
## d    0.255025
## e    1.088759
## Name: u, dtype: float64

For two objects of class Series, the operators are vectorised somewhat differently from their numpy counterparts: they are applied on the pairs of elements having identical labels (i.e., labelwisely). Still, if we are transforming the columns that come from the same data frame (and thus having identical index slots; this is the most common scenario), there will be no surprises, because this is equivalent to elementwise action. For instance:

d.loc[:, "u"] > d.loc[:, "v"]
## a     True
## b     True
## c     True
## d    False
## e     True
## dtype: bool
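
If the labels did not match, though, the operands would first be aligned over the union of the two indexes, with unmatched pairs yielding NaN (missing value); a quick illustration:

pd.Series([1, 2], index=["a", "b"]) + pd.Series([10, 20], index=["b", "c"])
## a     NaN
## b    12.0
## c     NaN
## dtype: float64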

For transforming many numerical columns at once, it is a good idea either to convert them to a numeric matrix explicitly and then use the basic numpy functions:

uv = d.loc[:, ["u", "v"]].values
uv = (uv-np.mean(uv, axis=0))/np.std(uv, axis=0)
uv
## array([[ 0.99067229,  0.20286225],
##        [-1.0200982 , -0.11790285],
##        [-1.3143573 , -0.24794275],
##        [ 0.25502455,  1.64197052],
##        [ 1.08875866, -1.47898717]])

or to use the pandas.DataFrame.apply method which invokes a given function on each column separately:

def standardise(x): return (x-np.mean(x))/np.std(x)
uv = d.loc[:, ["u", "v"]].apply(standardise)
uv
##           u         v
## a  0.990672  0.202862
## b -1.020098 -0.117903
## c -1.314357 -0.247943
## d  0.255025  1.641971
## e  1.088759 -1.478987

In both cases, we can then write:

d.loc[:, ["u", "v"]] = uv

to replace the old content with the transformed one.

Also, new columns can be added based on transformed versions of the existing ones:

d.loc[:, "uv_squared"] = (d.loc[:, "u"] * d.loc[:, "v"])**2
d
##           u         v        w  uv_squared
## a  0.990672  0.202862     spam    0.040389
## b -1.020098 -0.117903    bacon    0.014465
## c -1.314357 -0.247943     spam    0.106201
## d  0.255025  1.641971     eggs    0.175346
## e  1.088759 -1.478987  sausage    2.592938

10.4. Indexing Series Objects

Recall that each DataFrame and Series object is equipped with a slot called index, which is an object of class Index (or subclass thereof), giving the row and element labels, respectively. It turns out that we may apply the index operator, [...], to subset these objects not only through the indexers known from the numpy part (e.g., numerical ones, i.e., by position) but also ones that pinpoint the items via their labels. That is quite a lot of index-ing.

Let us study the different forms thereof in detail.

In this section we play with two example objects of class Series:

np.random.seed(123)
b = pd.Series(np.round(np.random.uniform(0,1,10),2))
b.index = np.random.permutation(np.r_[0:10])
b
## 2    0.70
## 1    0.29
## 8    0.23
## 7    0.55
## 9    0.72
## 4    0.42
## 5    0.98
## 6    0.68
## 3    0.48
## 0    0.39
## dtype: float64
c = b.copy()
c.index = list("abcdefghij")
c
## a    0.70
## b    0.29
## c    0.23
## d    0.55
## e    0.72
## f    0.42
## g    0.98
## h    0.68
## i    0.48
## j    0.39
## dtype: float64

They consist of the same values, in the same order, but have different labels (.index slots). In particular, b’s labels are integers which do not match the physical element positions (where 0 would denote the first element, etc.).

Important

For numpy vectors, we had four different indexing schemes: by scalar (extracts an element at given position), slice, integer vector, and logical vector. Series objects are additionally labelled, therefore they can additionally be accessed taking into account the contents of the .index slot.

10.4.1. Do Not Use [...] Directly

Applying the index operator, [...], directly on Series is generally not a good idea:

b[0]
## 0.39
b[ [0] ]
## 0    0.39
## dtype: float64

neither of them selects the first item – both refer to the item labelled 0.

However:

b[0:1]  # a slice: selects by position (here: the first element only)
## 2    0.7
## dtype: float64

and

c[0]  # there is no label `0`
## 0.7

fall back to position-based indexing.

Confusing? Well, with some self-discipline, the solution is easy. Namely:

Important

To stay away from potential problems, we should never apply [...] directly on Series objects.

10.4.2. loc[...]

To avoid any ambiguity (which is exactly what we want), we should be referring to Series.loc[...] and Series.iloc[...] for label- and position-based filtering, respectively.

And thus:

b.loc[0]
## 0.39

returns the element labelled 0. On the other hand, c.loc[0] will raise a KeyError, because c consists of character labels only.

Next, we can use lists of labels to select a subset.

b.loc[ [0, 1, 0] ]
## 0    0.39
## 1    0.29
## 0    0.39
## dtype: float64
c.loc[ ["j", "b", "j"] ]
## j    0.39
## b    0.29
## j    0.39
## dtype: float64

The result is always of type Series.

Slicing behaves differently, as the range is inclusive on both sides:

b.loc[1:7]
## 1    0.29
## 8    0.23
## 7    0.55
## dtype: float64
b.loc[0:4:-1]
## 0    0.39
## 3    0.48
## 6    0.68
## 5    0.98
## 4    0.42
## dtype: float64
c.loc["d":"g"]
## d    0.55
## e    0.72
## f    0.42
## g    0.98
## dtype: float64

return all elements between the two indicated labels.

Be careful: if there are repeated labels, all the matching items will be returned:

d = pd.Series([1, 2, 3, 4], index=["a", "b", "a", "c"])
d.loc["a"]
## a    1
## a    3
## dtype: int64

The result is not a scalar but a Series object.

10.4.3. iloc[...]

Here are some examples of position-based indexing:

b.iloc[0]  # the same: c.iloc[0]
## 0.7

returns the first element.

b.iloc[1:7]  # the same: c.iloc[1:7]
## 1    0.29
## 8    0.23
## 7    0.55
## 9    0.72
## 4    0.42
## 5    0.98
## dtype: float64

returns the 2nd, 3rd, …, 7th element (not including b.iloc[7], i.e., the 8th one).

10.4.4. Logical Indexing

Indexing using a logical vector-like object is also available.

We will usually be using loc[...] with either a Series object whose .index slot is identical to that of the object being subsetted, or a logical numpy vector.

b.loc[(b > 0.4) & (b < 0.6)]
## 7    0.55
## 4    0.42
## 3    0.48
## dtype: float64

For iloc[...], the indexer must be unlabelled, e.g., a bare logical numpy vector.
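
For example (a small sketch; .values strips the labels from the intermediate logical Series):

b.iloc[(b > 0.4).values]
## 2    0.70
## 7    0.55
## 9    0.72
## 4    0.42
## 5    0.98
## 6    0.68
## 3    0.48
## dtype: float64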

10.5. Indexing Data Frames

10.5.1. loc[...] and iloc[...]

For data frames, iloc and loc can be used too, but they now take two indexers, serving as the row and column selectors, respectively.

For example:

np.random.seed(123)
d = pd.DataFrame(dict(
    u = np.round(np.random.rand(5), 2),
    v = np.round(np.random.randn(5), 2),
    w = ["spam", "bacon", "spam", "eggs", "sausage"],
    x = [True, False, True, False, True]
), index=["a", "b", "c", "d", "e"])

And now:

d.loc[d.loc[:, "u"] > 0.5, "u":"w"]
##       u     v        w
## a  0.70  0.32     spam
## d  0.55  1.98     eggs
## e  0.72 -1.62  sausage

selects the rows where the values in the u column are greater than 0.5 and then returns all columns between u and w.

Furthermore,

d.iloc[:3, :].loc[:, ["u", "w"]]
##       u      w
## a  0.70   spam
## b  0.29  bacon
## c  0.23   spam

fetches the first 3 rows (by position – iloc is necessary) and then selects two indicated columns.

Important

We can write d.u as a shorter version of d.loc[:, "u"]. This improves the readability in contexts such as:

d.loc[(d.u >= 0.5) & (d.u <= 0.7), ["u", "w"]]
##       u     w
## a  0.70  spam
## d  0.55  eggs

However, this accessor is not universal: consider a data frame featuring a column named, e.g., mean – the attribute of that name refers to a method of the DataFrame class, not to the data.
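
For example (a sketch based on a renamed copy of d; outputs omitted):

d2 = d.rename(columns=dict(u="mean"))
d2.mean             # refers to the method pandas.DataFrame.mean, not to the column!
d2.loc[:, "mean"]   # always unambiguous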

Exercise 10.6

Use pandas.DataFrame.drop to select all columns but v in d.

Exercise 10.7

Use pandas.Series.isin (amongst others) to select all rows with spam and bacon on the d’s menu.

Exercise 10.8

In the tips dataset, select data on male customers whose total bill was in the [10, 20] interval. Also, select the records from Saturday and Sunday where the tip was greater than $5.

10.5.2. Adding Rows and Columns

loc[...] can also be used to add new columns to an existing data frame:

d.loc[:, "y"] = d.loc[:, "u"]**2  # or d.loc[:, "y"] = d.u**2
d
##       u     v        w      x       y
## a  0.70  0.32     spam   True  0.4900
## b  0.29 -0.05    bacon  False  0.0841
## c  0.23 -0.20     spam   True  0.0529
## d  0.55  1.98     eggs  False  0.3025
## e  0.72 -1.62  sausage   True  0.5184

Notation like “d.new_column = ...” will not work. As we have said, loc and iloc are universal, other accessors – not so much.

Exercise 10.9

Use pandas.DataFrame.insert to add a new column not necessarily at the end of d.

Exercise 10.10

Use pandas.concat to add a few more rows to d (the once-available pandas.DataFrame.append method was deprecated and then removed in pandas 2.0).

10.5.3. Random Sampling

As a simple application of what we have covered so far, let us consider the case of randomly sampling a number of rows from an existing data frame.

For the most basic use cases, we can use the pandas.DataFrame.sample method. It includes scenarios such as:

  • randomly select 5 rows, without replacement,

  • randomly select 20% of the rows, with replacement,

  • randomly rearrange all the rows.

For example:

body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/nhanes_adult_female_bmx_2020.csv",
    comment="#")
body.sample(5, random_state=123)  # 5 random rows without replacement
##       BMXWT  BMXHT  BMXARML  BMXLEG  BMXARMC  BMXHIP  BMXWAIST
## 4214   58.4  156.2     35.2    34.7     27.2    99.5      77.5
## 3361   73.7  161.0     36.5    34.5     29.0   107.6      98.2
## 3759   61.4  164.6     37.5    40.4     26.9    93.5      84.4
## 3733  120.4  158.8     33.5    34.6     40.5   147.2     129.3
## 1121  123.5  157.5     35.5    29.0     50.5   143.0     136.4

Notice the random_state argument which controls the seed of the random number generator so that we get reproducible results. Alternatively, we could call numpy.random.seed.
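
The remaining two scenarios on the above list can be realised similarly (a quick sketch; outputs omitted):

body.sample(frac=0.2, replace=True, random_state=123)  # 20% of the rows, with replacement
body.sample(frac=1.0, random_state=123)                # randomly rearrange all the rows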

Exercise 10.11

Show how the 3 aforementioned scenarios can be implemented manually using iloc[...] and numpy.random.permutation or numpy.random.choice.

In machine learning practice, as we shall see later, we are used to training and evaluating models on different (mutually disjoint) subsets of the whole data frame.

For instance, we might be interested in performing the so-called train/test split, where 80% (or 60% or 70%) of the randomly selected rows would constitute the first new data frame and the remaining 20% (or 40% or 30%) would go to the second one.

Given a data frame like:

x = body.head(10)  # this is just an example
x
##    BMXWT  BMXHT  BMXARML  BMXLEG  BMXARMC  BMXHIP  BMXWAIST
## 0   97.1  160.2     34.7    40.8     35.8   126.1     117.9
## 1   91.1  152.7     33.5    33.0     38.5   125.5     103.1
## 2   73.0  161.2     37.4    38.0     31.8   106.2      92.0
## 3   61.7  157.4     38.0    34.7     29.0   101.0      90.5
## 4   55.4  154.6     34.6    34.0     28.3    92.5      73.2
## 5   62.0  144.7     32.5    34.2     29.8   106.7      84.8
## 6   66.2  166.5     37.5    37.6     32.0    96.3      95.7
## 7   75.9  154.5     35.4    37.6     32.7   107.7      98.7
## 8   77.2  159.2     38.5    40.5     35.7   102.0      97.5
## 9   91.6  174.5     36.1    45.9     35.2   121.3     100.3

one way to perform the aforementioned split is to generate a random permutation of the set of row indices:

np.random.seed(123)  # reproducibility matters
idx = np.random.permutation(x.shape[0])
idx
## array([4, 0, 7, 5, 8, 3, 1, 6, 9, 2])

And then to pick the first 80% of these indices to construct the data frame number one:

k = int(x.shape[0]*0.8)
x.iloc[idx[:k], :]
##    BMXWT  BMXHT  BMXARML  BMXLEG  BMXARMC  BMXHIP  BMXWAIST
## 4   55.4  154.6     34.6    34.0     28.3    92.5      73.2
## 0   97.1  160.2     34.7    40.8     35.8   126.1     117.9
## 7   75.9  154.5     35.4    37.6     32.7   107.7      98.7
## 5   62.0  144.7     32.5    34.2     29.8   106.7      84.8
## 8   77.2  159.2     38.5    40.5     35.7   102.0      97.5
## 3   61.7  157.4     38.0    34.7     29.0   101.0      90.5
## 1   91.1  152.7     33.5    33.0     38.5   125.5     103.1
## 6   66.2  166.5     37.5    37.6     32.0    96.3      95.7

and the remaining ones to generate the second dataset:

x.iloc[idx[k:], :]
##    BMXWT  BMXHT  BMXARML  BMXLEG  BMXARMC  BMXHIP  BMXWAIST
## 9   91.6  174.5     36.1    45.9     35.2   121.3     100.3
## 2   73.0  161.2     37.4    38.0     31.8   106.2      92.0

Exercise 10.12

In the wine_quality_all dataset, leave out all but the white wines. Partition the resulting data frame randomly into three data frames: wines_train (60% of the rows), wines_validate (another 20% of the rows), and wines_test (the remaining 20%).

Exercise 10.13

Consider the white wines dataset again. Write a function kfold which takes a data frame x and an integer k>1 on input. Return a list of data frames resulting from randomly splitting x into k disjoint chunks of equal (or almost equal, if that is not possible) sizes.
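
Hint: one possible building block (amongst others) is numpy.array_split, which divides a vector into k chunks of almost equal sizes:

np.random.seed(123)
np.array_split(np.random.permutation(10), 3)
## [array([4, 0, 7, 5]), array([8, 3, 1]), array([6, 9, 2])]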

10.5.4. Hierarchical Indices (*)

Consider the following DataFrame object with a hierarchical index:

np.random.seed(123)
d = pd.DataFrame(dict(
    year = sorted([2023, 2024, 2025]*4),
    quarter = ["Q1", "Q2", "Q3", "Q4"]*3,
    data = np.round(np.random.rand(12), 2)
)).set_index(["year", "quarter"])
d
##               data
## year quarter      
## 2023 Q1       0.70
##      Q2       0.29
##      Q3       0.23
##      Q4       0.55
## 2024 Q1       0.72
##      Q2       0.42
##      Q3       0.98
##      Q4       0.68
## 2025 Q1       0.48
##      Q2       0.39
##      Q3       0.34
##      Q4       0.73

The index has both levels named, but this is purely for aesthetic reasons.

Indexing using loc[...] by default relates to the first level of the hierarchy:

d.loc[2023, :]
##          data
## quarter      
## Q1       0.70
## Q2       0.29
## Q3       0.23
## Q4       0.55
d.loc[[2023, 2025], :]
##               data
## year quarter      
## 2023 Q1       0.70
##      Q2       0.29
##      Q3       0.23
##      Q4       0.55
## 2025 Q1       0.48
##      Q2       0.39
##      Q3       0.34
##      Q4       0.73

To access deeper levels, we can use tuples as indexers:

d.loc[(2023, "Q1"), :]
## data    0.7
## Name: (2023, Q1), dtype: float64
d.loc[[(2023, "Q1"), (2024, "Q3")], :]
##               data
## year quarter      
## 2023 Q1       0.70
## 2024 Q3       0.98

In certain scenarios, though, it will probably be much easier to subset a hierarchical index by using reset_index and set_index creatively (together with loc[...] and pandas.Series.isin, etc.).
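
For instance (one possibility of many):

d_flat = d.reset_index()
d_flat.loc[d_flat.quarter.isin(["Q1", "Q3"]), :]
##     year quarter  data
## 0   2023      Q1  0.70
## 2   2023      Q3  0.23
## 4   2024      Q1  0.72
## 6   2024      Q3  0.98
## 8   2025      Q1  0.48
## 10  2025      Q3  0.34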

The `:` operator for slicing can only be used directly within square brackets, but we can always rely on the slice constructor to create the corresponding slice objects in any other context:

d.loc[(slice(None), ["Q1", "Q3"]), :]  # `:`, ["Q1", "Q3"]
##               data
## year quarter      
## 2023 Q1       0.70
##      Q3       0.23
## 2024 Q1       0.72
##      Q3       0.98
## 2025 Q1       0.48
##      Q3       0.34
d.loc[(slice(None, None, -1), slice("Q2", "Q3")), :]  # ::-1, "Q2":"Q3"
##               data
## year quarter      
## 2025 Q3       0.34
##      Q2       0.39
## 2024 Q3       0.98
##      Q2       0.42
## 2023 Q3       0.23
##      Q2       0.29

10.6. Further Operations on Data Frames

10.6.1. Sorting

Let us consider another example dataset, featuring the 2018 yearly averages of air quality measurements in the Australian state of Victoria.

air = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/air_quality_2018_means.csv",
    comment="#")
air = (
    air.
    loc[air.param_id.isin(["BPM2.5", "NO2"]), :].
    reset_index(drop=True)
)
air
##           sp_name param_id      value
## 0      Alphington   BPM2.5   7.848758
## 1      Alphington      NO2   9.558120
## 2    Altona North      NO2   9.467912
## 3       Churchill   BPM2.5   6.391230
## 4       Dandenong      NO2   9.800705
## 5       Footscray   BPM2.5   7.640948
## 6       Footscray      NO2  10.274531
## 7   Geelong South   BPM2.5   6.502762
## 8   Geelong South      NO2   5.681722
## 9   Melbourne CBD   BPM2.5   8.072998
## 10            Moe   BPM2.5   6.427079
## 11   Morwell East   BPM2.5   6.784596
## 12  Morwell South   BPM2.5   6.512849
## 13  Morwell South      NO2   5.124430
## 14      Traralgon   BPM2.5   8.024735
## 15      Traralgon      NO2   5.776333

sort_values is a convenient means to order the rows with respect to one criterion:

air.sort_values("value", ascending=False)
##           sp_name param_id      value
## 6       Footscray      NO2  10.274531
## 4       Dandenong      NO2   9.800705
## 1      Alphington      NO2   9.558120
## 2    Altona North      NO2   9.467912
## 9   Melbourne CBD   BPM2.5   8.072998
## 14      Traralgon   BPM2.5   8.024735
## 0      Alphington   BPM2.5   7.848758
## 5       Footscray   BPM2.5   7.640948
## 11   Morwell East   BPM2.5   6.784596
## 12  Morwell South   BPM2.5   6.512849
## 7   Geelong South   BPM2.5   6.502762
## 10            Moe   BPM2.5   6.427079
## 3       Churchill   BPM2.5   6.391230
## 15      Traralgon      NO2   5.776333
## 8   Geelong South      NO2   5.681722
## 13  Morwell South      NO2   5.124430

Sorting with respect to more criteria is also possible:

air.sort_values(["param_id", "value"], ascending=[True, False])
##           sp_name param_id      value
## 9   Melbourne CBD   BPM2.5   8.072998
## 14      Traralgon   BPM2.5   8.024735
## 0      Alphington   BPM2.5   7.848758
## 5       Footscray   BPM2.5   7.640948
## 11   Morwell East   BPM2.5   6.784596
## 12  Morwell South   BPM2.5   6.512849
## 7   Geelong South   BPM2.5   6.502762
## 10            Moe   BPM2.5   6.427079
## 3       Churchill   BPM2.5   6.391230
## 6       Footscray      NO2  10.274531
## 4       Dandenong      NO2   9.800705
## 1      Alphington      NO2   9.558120
## 2    Altona North      NO2   9.467912
## 15      Traralgon      NO2   5.776333
## 8   Geelong South      NO2   5.681722
## 13  Morwell South      NO2   5.124430

Here, in each group of identical parameters, we get a decreasing order with respect to the value.

Exercise 10.14

(*) Compare the ordering with respect to param_id and value vs value and then param_id.

Exercise 10.15

(**) Perform identical reorderings but using only loc[...], iloc[...], and numpy.argsort.

Note

(*) DataFrame.sort_values unfortunately uses a non-stable algorithm by default (a modified quicksort). If a data frame is sorted with respect to one criterion, and then we reorder it with respect to another one, tied observations are not guaranteed to be listed in the original order:

(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/air_quality_2018_means.csv",
    comment="#")
    .sort_values("sp_name")
    .sort_values("param_id")
    .set_index("param_id")
    .loc[["BPM2.5", "NO2"], :]
    .reset_index())
##    param_id        sp_name      value
## 0    BPM2.5  Melbourne CBD   8.072998
## 1    BPM2.5            Moe   6.427079
## 2    BPM2.5      Footscray   7.640948
## 3    BPM2.5   Morwell East   6.784596
## 4    BPM2.5      Churchill   6.391230
## 5    BPM2.5  Morwell South   6.512849
## 6    BPM2.5      Traralgon   8.024735
## 7    BPM2.5     Alphington   7.848758
## 8    BPM2.5  Geelong South   6.502762
## 9       NO2  Morwell South   5.124430
## 10      NO2      Traralgon   5.776333
## 11      NO2  Geelong South   5.681722
## 12      NO2   Altona North   9.467912
## 13      NO2     Alphington   9.558120
## 14      NO2      Dandenong   9.800705
## 15      NO2      Footscray  10.274531

We have lost the ordering based on station names in the two subgroups. To switch to a mergesort-like method (timsort), we should pass kind="stable".

(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/air_quality_2018_means.csv",
    comment="#")
    .sort_values("sp_name")
    .sort_values("param_id", kind="stable")  # !
    .set_index("param_id")
    .loc[["BPM2.5", "NO2"], :]
    .reset_index())
##    param_id        sp_name      value
## 0    BPM2.5     Alphington   7.848758
## 1    BPM2.5      Churchill   6.391230
## 2    BPM2.5      Footscray   7.640948
## 3    BPM2.5  Geelong South   6.502762
## 4    BPM2.5  Melbourne CBD   8.072998
## 5    BPM2.5            Moe   6.427079
## 6    BPM2.5   Morwell East   6.784596
## 7    BPM2.5  Morwell South   6.512849
## 8    BPM2.5      Traralgon   8.024735
## 9       NO2     Alphington   9.558120
## 10      NO2   Altona North   9.467912
## 11      NO2      Dandenong   9.800705
## 12      NO2      Footscray  10.274531
## 13      NO2  Geelong South   5.681722
## 14      NO2  Morwell South   5.124430
## 15      NO2      Traralgon   5.776333

10.6.2. Stacking and Unstacking (Long and Wide Forms)

Let us discuss some further ways to transform data frames that benefit from, or are only possible thanks to, their ability to store data of different types.

The above air dataset is in the so-called long format, where all measurements are stacked one after/below another. Such a form is quite convenient for the purpose of storing data, but not necessarily for all processing tasks, compare [Wic14]. Recall from the matrix part that a more natural way would be to have a single observation (e.g., data for a measurement station) in each row.

We can unstack the air data frame (convert to the wide format) quite easily:

air.set_index(["sp_name", "param_id"]).unstack().loc[:, "value"]
## param_id         BPM2.5        NO2
## sp_name                           
## Alphington     7.848758   9.558120
## Altona North        NaN   9.467912
## Churchill      6.391230        NaN
## Dandenong           NaN   9.800705
## Footscray      7.640948  10.274531
## Geelong South  6.502762   5.681722
## Melbourne CBD  8.072998        NaN
## Moe            6.427079        NaN
## Morwell East   6.784596        NaN
## Morwell South  6.512849   5.124430
## Traralgon      8.024735   5.776333

The missing values are denoted with NaNs (not-a-number), see Section 15.1 for more details. Interestingly, we got a hierarchical index in the columns, hence the loc[...] part.
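
To see where the hierarchy comes from, we can peek at the columns slot of the intermediate result (a quick check; the exact printout may vary slightly across pandas versions):

air.set_index(["sp_name", "param_id"]).unstack().columns
## MultiIndex([('value', 'BPM2.5'),
##             ('value',    'NO2')],
##            names=[None, 'param_id'])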

Equivalently:

air_wide = air.pivot(index="sp_name", columns="param_id", values="value").\
    rename_axis(index=None, columns=None)
air_wide
##                  BPM2.5        NO2
## Alphington     7.848758   9.558120
## Altona North        NaN   9.467912
## Churchill      6.391230        NaN
## Dandenong           NaN   9.800705
## Footscray      7.640948  10.274531
## Geelong South  6.502762   5.681722
## Melbourne CBD  8.072998        NaN
## Moe            6.427079        NaN
## Morwell East   6.784596        NaN
## Morwell South  6.512849   5.124430
## Traralgon      8.024735   5.776333

The rename_axis part is there so as to get rid of the name parts of the index and columns slots.

The other way around, we can use the stack method:

air_wide.T.rename_axis(index="param", columns="location").\
    stack().rename("value").reset_index()
##      param       location      value
## 0    BPM2.5     Alphington   7.848758
## 1    BPM2.5      Churchill   6.391230
## 2    BPM2.5      Footscray   7.640948
## 3    BPM2.5  Geelong South   6.502762
## 4    BPM2.5  Melbourne CBD   8.072998
## 5    BPM2.5            Moe   6.427079
## 6    BPM2.5   Morwell East   6.784596
## 7    BPM2.5  Morwell South   6.512849
## 8    BPM2.5      Traralgon   8.024735
## 9       NO2     Alphington   9.558120
## 10      NO2   Altona North   9.467912
## 11      NO2      Dandenong   9.800705
## 12      NO2      Footscray  10.274531
## 13      NO2  Geelong South   5.681722
## 14      NO2  Morwell South   5.124430
## 15      NO2      Traralgon   5.776333

We used the data frame transpose (T) so that the result is grouped by parameter, with the locations varying faster (less boring an outcome in this context).

10.6.3. Set-Theoretic Operations

Here are two not at all disjoint sets of imaginary persons:

A = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/some_birth_dates1.csv",
    comment="#")
A
##                   Name   BirthDate
## 0     Paitoon Ornwimol  26.06.1958
## 1         Antónia Lata  20.05.1935
## 2    Bertoldo Mallozzi  17.08.1972
## 3        Nedeljko Bukv  19.12.1921
## 4        Micha Kitchen  17.09.1930
## 5      Mefodiy Shachar  01.10.1914
## 6         Paul Meckler  29.09.1968
## 7      Katarzyna Lasko  20.10.1971
## 8         Åge Trelstad  07.03.1935
## 9  Duchanee Panomyaong  19.06.1952

and

B = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/some_birth_dates2.csv",
    comment="#")
B
##                    Name   BirthDate
## 0    Hushang Naigamwala  25.08.1991
## 1              Zhen Wei  16.11.1975
## 2         Micha Kitchen  17.09.1930
## 3           Jodoc Alwin  16.11.1969
## 4            Igor Mazał  14.05.2004
## 5       Katarzyna Lasko  20.10.1971
## 6   Duchanee Panomyaong  19.06.1952
## 7       Mefodiy Shachar  01.10.1914
## 8          Paul Meckler  29.09.1968
## 9         Noe Tae-Woong  11.07.1970
## 10         Åge Trelstad  07.03.1935

In both datasets, there is a single column whose elements uniquely identify each record (i.e., Name). In the language of relational databases, we would call it a primary key. In such a case, implementing the set-theoretic operations is relatively easy:

\(A\cap B\) (intersection) – the rows that are both in A and in B:

A.loc[A.Name.isin(B.Name), :]
##                   Name   BirthDate
## 4        Micha Kitchen  17.09.1930
## 5      Mefodiy Shachar  01.10.1914
## 6         Paul Meckler  29.09.1968
## 7      Katarzyna Lasko  20.10.1971
## 8         Åge Trelstad  07.03.1935
## 9  Duchanee Panomyaong  19.06.1952

\(A\setminus B\) (difference) – the rows that are in A but not in B:

A.loc[~A.Name.isin(B.Name), :]
##                 Name   BirthDate
## 0   Paitoon Ornwimol  26.06.1958
## 1       Antónia Lata  20.05.1935
## 2  Bertoldo Mallozzi  17.08.1972
## 3      Nedeljko Bukv  19.12.1921

\(A\cup B\) (union) – the rows that are in A or in B (or both):

pd.concat((A, B.loc[~B.Name.isin(A.Name), :]))
##                   Name   BirthDate
## 0     Paitoon Ornwimol  26.06.1958
## 1         Antónia Lata  20.05.1935
## 2    Bertoldo Mallozzi  17.08.1972
## 3        Nedeljko Bukv  19.12.1921
## 4        Micha Kitchen  17.09.1930
## 5      Mefodiy Shachar  01.10.1914
## 6         Paul Meckler  29.09.1968
## 7      Katarzyna Lasko  20.10.1971
## 8         Åge Trelstad  07.03.1935
## 9  Duchanee Panomyaong  19.06.1952
## 0   Hushang Naigamwala  25.08.1991
## 1             Zhen Wei  16.11.1975
## 3          Jodoc Alwin  16.11.1969
## 4           Igor Mazał  14.05.2004
## 9        Noe Tae-Woong  11.07.1970

There are no duplicate rows in the output.

Exercise 10.16

Determine \((A\cup B) \setminus (A\cap B) = (A\setminus B)\cup(B\setminus A)\) (symmetric difference).

Exercise 10.17

(*) Determine the union, intersection, and difference of the wine_sample1 and wine_sample2 datasets, where there is no column uniquely identifying the observations. Hint: consider using pandas.DataFrame.duplicated or pandas.DataFrame.drop_duplicates.

10.6.4. Joining (Merging)

In database design, it is common to normalise the datasets in order to avoid the duplication of information and the pathologies stemming from it (e.g., [Dat03]).

Example 10.18

The above air quality parameters are separately described in another data frame:

param = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/air_quality_2018_param.csv",
    comment="#")
param.rename(dict(param_std_unit_of_measure="unit"), axis=1)
##   param_id                   param_name   unit      param_short_name
## 0      API      Airborne particle index   none  Visibility Reduction
## 1   BPM2.5  BAM  Particles < 2.5 micron  ug/m3                 PM2.5
## 2       CO              Carbon Monoxide    ppm                    CO
## 3    HPM10                   Hivol PM10  ug/m3                   NaN
## 4      NO2             Nitrogen Dioxide    ppb                   NO2
## 5       O3                        Ozone    ppb                    O3
## 6     PM10     TEOM Particles <10micron  ug/m3                  PM10
## 7   PPM2.5               Partisol PM2.5  ug/m3                   NaN
## 8      SO2               Sulfur Dioxide    ppb                   SO2

We could have stored them alongside the air data frame, but that would be a waste of space.

Also, if we wanted to modify some datum (note, e.g., the annoying double space in param_name for BPM2.5), we would have to update all the relevant records.

Instead, we can always look the information up by param_id and join (merge) the two data frames only if we need it.
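
For instance, such a lookup may be implemented as follows (a sketch relying on the merge function discussed below; output omitted):

pd.merge(air, param.loc[:, ["param_id", "param_name"]], on="param_id")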

Let us discuss the possible join operations by studying the two following data sets:

A = pd.DataFrame({
    "x": ["a0", "a1", "a2", "a3"],
    "y": ["b0", "b1", "b2", "b3"]
})
A
##     x   y
## 0  a0  b0
## 1  a1  b1
## 2  a2  b2
## 3  a3  b3

and:

B = pd.DataFrame({
    "x": ["a0", "a2", "a2", "a4"],
    "z": ["c0", "c1", "c2", "c3"]
})
B
##     x   z
## 0  a0  c0
## 1  a2  c1
## 2  a2  c2
## 3  a4  c3

They both have one column somewhat in common, x.

The inner (natural) join returns the records that have a match in both datasets:

pd.merge(A, B, on="x")
##     x   y   z
## 0  a0  b0  c0
## 1  a2  b2  c1
## 2  a2  b2  c2

The left join of A with B guarantees to return all the records from A, even those which are not matched by anything in B.

pd.merge(A, B, how="left", on="x")
##     x   y    z
## 0  a0  b0   c0
## 1  a1  b1  NaN
## 2  a2  b2   c1
## 3  a2  b2   c2
## 4  a3  b3  NaN

The right join of A with B is the same as the left join of B with A:

pd.merge(A, B, how="right", on="x")
##     x    y   z
## 0  a0   b0  c0
## 1  a2   b2  c1
## 2  a2   b2  c2
## 3  a4  NaN  c3

Finally, the full outer join is the set-theoretic union of the left and the right join:

pd.merge(A, B, how="outer", on="x")
##     x    y    z
## 0  a0   b0   c0
## 1  a1   b1  NaN
## 2  a2   b2   c1
## 3  a2   b2   c2
## 4  a3   b3  NaN
## 5  a4  NaN   c3
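
As an aside, pandas.merge also accepts indicator=True, which adds a column explaining the provenance of each row – a handy sanity check:

pd.merge(A, B, how="outer", on="x", indicator=True)
##     x    y    z      _merge
## 0  a0   b0   c0        both
## 1  a1   b1  NaN   left_only
## 2  a2   b2   c1        both
## 3  a2   b2   c2        both
## 4  a3   b3  NaN   left_only
## 5  a4  NaN   c3  right_only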

Exercise 10.20

Normalise air_quality_2018 yourself so that you get the three data frames mentioned in the previous exercise (value, point, param).

Exercise 10.21

(*) In the National Health and Nutrition Examination Survey (NHANES) by the US Centers for Disease Control and Prevention, each participant is uniquely identified by their sequence number (SEQN), which appears in numerous datasets published as part of the study (e.g., the demographic and body measurement tables like the ones we relied on above).

Join a few chosen datasets that are to your liking.

10.6.5. …And (Too) Many More

Looking at the list of methods for the DataFrame and Series classes in the pandas package’s documentation, we can see that they are abundant. Together with the object-oriented syntax, we will often find ourselves appreciating the high readability of even quite complex operation chains such as data.drop_duplicates().groupby(["year", "month"]).mean().reset_index().

Nevertheless, we should admit that the methods are too plentiful: they give the impression that someone was overly generous, including a separate verb for every conceivable data analysis operation, even those that can easily be expressed as a combination of two or three simpler ones. Therefore, in order to prevent the reader from being overloaded with too much new information, below we will be discussing only the most noteworthy features that often appear in data wrangling scenarios.

As strong advocates of minimalism (which is more environmentally friendly and sustainable), we will quite often be more eager to fall back on hand-crafted combinations of the more basic (universal) building blocks from numpy and scipy instead. This is also in line with our emphasis on developing transferable skills: Python with pandas is not the only environment where we can work with data frames (e.g., base R supports them too).

10.7. Exercises

Exercise 10.22

How are data frames different from matrices?

Exercise 10.23

What are the use cases of the name slot in Series and Index objects?

Exercise 10.24

What is the purpose of set_index and reset_index?

Exercise 10.25

Why is learning numpy crucial for becoming a proficient user of pandas?

Exercise 10.26

What is the difference between iloc[...] and loc[...]?

Exercise 10.27

Why is applying the index operator [...] directly on a Series or DataFrame object not necessarily a good idea?

Exercise 10.28

What is the difference between .index, Index, and .columns?

Exercise 10.29

How to compute the arithmetic mean and median of all the numeric columns in a data frame, using a single line of code?

Exercise 10.30

What is a train/test split and how to perform it using numpy and pandas?

Exercise 10.31

What is the difference between stacking and unstacking? Which one yields a wide (as opposed to long) format?

Exercise 10.32

Name different data frame join (merge) operations and explain how they work.

Exercise 10.33

How does sorting with respect to more than one criterion work?

Exercise 10.34

Name the basic set-theoretic operations on data frames.


[1] Data frames were first introduced in the 1991 version of the S language [CH91].