10. Introducing Data Frames


numpy arrays are an extremely versatile tool for performing data analysis and other numerical computations of various kinds. Although theoretically possible otherwise, in practice we only store elements of the same type therein, most often numbers.

pandas [McK17] is amongst over one hundred thousand open-source packages and repositories that use numpy to provide additional data wrangling functionality. It was originally written by Wes McKinney and was heavily inspired by the data.frame[1] objects in S and R as well as tables in relational (think: SQL) databases and spreadsheets.

It is customary to load this package with the following alias:

import pandas as pd

The pandas package delivers a few classes, of which the most important are:

  • DataFrame – for representing tabular data (matrix-like) with columns of possibly different types, in particular a mix of numerical and categorical variables,

  • Series – vector-like objects for representing individual columns,

  • Index – vector-like (usually) objects for labelling individual rows and columns of DataFrames and items in Series objects,

together with many methods for:

  • transforming/aggregating/processing data, also in groups determined by categorical variables or products thereof,

  • reshaping (e.g., from wide to long format) and joining datasets,

  • importing/exporting data from/to various sources and formats, e.g., CSV and HDF5 files or relational databases,

  • handling missing data,

which we introduce in this and further chapters.

10.1. Creating Data Frames

Important

Let us repeat: pandas is built on top of numpy, and most objects therein can be processed by numpy functions as well. Other functions (e.g., in seaborn or sklearn) accept both DataFrame and ndarray objects, but convert the former to the latter automatically to enable data processing using fast C/C++/Fortran routines. What we have learnt so far still applies. But there is of course more, hence this part.
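
For instance (a minimal sketch; numpy is assumed to be imported as np, like everywhere else in this book):

df_small = pd.DataFrame(dict(a=[1.0, 2.0], b=[3.0, 4.0]))
df_small.to_numpy()  # the underlying values, as an ordinary ndarray
## array([[1., 3.],
##        [2., 4.]])
np.log(df_small)  # vectorised numpy functions accept data frames too
##           a         b
## 0  0.000000  1.098612
## 1  0.693147  1.386294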

Data frames can be created, amongst others, using the DataFrame class constructor, which can be fed, for example, with a numpy matrix:

np.random.seed(123)
pd.DataFrame(
    np.random.rand(4, 3),
    columns=["a", "b", "c"]
)
##           a         b         c
## 0  0.696469  0.286139  0.226851
## 1  0.551315  0.719469  0.423106
## 2  0.980764  0.684830  0.480932
## 3  0.392118  0.343178  0.729050

Notice that rows and columns are labelled (and how readable that is).

A dictionary of vector-like objects of equal lengths is another common option:

np.random.seed(123)
pd.DataFrame(dict(
    u=np.round(np.random.rand(5), 2),
    v=[True, True, False, False, True],
    w=["A", "B", "C", "D", "E"],
    x=["spam", "spam", "bacon", "spam", "eggs"]
))
##       u      v  w      x
## 0  0.70   True  A   spam
## 1  0.29   True  B   spam
## 2  0.23  False  C  bacon
## 3  0.55  False  D   spam
## 4  0.72   True  E   eggs

This illustrates the possibility of having columns of different types.

Exercise 10.1

Check out the pandas.DataFrame.from_dict and pandas.DataFrame.from_records methods in the documentation. Use them to create some example data frames.
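
To get started, here is a minimal sketch (the column names and values below are, naturally, arbitrary):

pd.DataFrame.from_dict({"a": [1, 2], "b": [3, 4]})
##    a  b
## 0  1  3
## 1  2  4
pd.DataFrame.from_records([(1, 3), (2, 4)], columns=["a", "b"])
##    a  b
## 0  1  3
## 1  2  4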

Further, data frames can be read from files in different formats, for instance, CSV:

body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/nhanes_adult_female_bmx_2020.csv",
    comment="#")
body.head()  # display first few rows (5 by default)
##    BMXWT  BMXHT  BMXARML  BMXLEG  BMXARMC  BMXHIP  BMXWAIST
## 0   97.1  160.2     34.7    40.8     35.8   126.1     117.9
## 1   91.1  152.7     33.5    33.0     38.5   125.5     103.1
## 2   73.0  161.2     37.4    38.0     31.8   106.2      92.0
## 3   61.7  157.4     38.0    34.7     29.0   101.0      90.5
## 4   55.4  154.6     34.6    34.0     28.3    92.5      73.2

Reading from URLs and local files (compare Section 13.6.1) is of course supported.

Exercise 10.2

Check out the other pandas.read_* functions in the pandas documentation – some of which we shall be discussing later.

10.1.1. Data Frames are Matrix-Like

Data frames are modelled after numpy matrices, hence we can feel quite at home with them.

For example, given:

np.random.seed(123)
df = pd.DataFrame(dict(
    u=np.round(np.random.rand(5), 2),
    v=[True, True, False, False, True],
    w=["A", "B", "C", "D", "E"],
    x=["spam", "spam", "bacon", "spam", "eggs"]
))
df
##       u      v  w      x
## 0  0.70   True  A   spam
## 1  0.29   True  B   spam
## 2  0.23  False  C  bacon
## 3  0.55  False  D   spam
## 4  0.72   True  E   eggs

it is easy to fetch the number of rows and columns:

df.shape
## (5, 4)

or the type of each column:

df.dtypes
## u    float64
## v       bool
## w     object
## x     object
## dtype: object

Recall that arrays are equipped with the dtype slot.
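
As a quick check, converting a data frame with columns of differing types to a single numpy array requires finding their least common denominator (here: object):

df.to_numpy().dtype
## dtype('O')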

10.1.2. Series

There is a separate class for storing individual data frame columns: it is called Series.

s = df.loc[:, "u"]  # extract the `u` column; alternatively: df.u
s
## 0    0.70
## 1    0.29
## 2    0.23
## 3    0.55
## 4    0.72
## Name: u, dtype: float64

A data frame with one column is printed out slightly differently:

s.to_frame()
##       u
## 0  0.70
## 1  0.29
## 2  0.23
## 3  0.55
## 4  0.72

Indexing will of course be discussed later.

Important

It is crucial to know when we are dealing with a Series and when with a DataFrame object, because each of them defines a slightly different set of methods.

For example:

s.mean()
## 0.49800000000000005

refers to pandas.Series.mean (which returns a scalar), whereas

df.mean(numeric_only=True)
## u    0.498
## v    0.600
## dtype: float64

uses pandas.DataFrame.mean (which yields a Series).

Exercise 10.3

Look up the two methods in the pandas manual. Note that their argument list is slightly different.

Series are wrappers around numpy arrays.

s.values
## array([0.7 , 0.29, 0.23, 0.55, 0.72])

Most importantly, numpy functions can be called directly on them:

np.mean(s)
## 0.49800000000000005

Hence, what we have covered in the part dealing with vector processing still holds for data frame columns as well (there will be more, stay tuned).

Series can also be named.

s.name
## 'u'

This is convenient when we convert from/to a data frame.

s.rename("spam").to_frame()
##    spam
## 0  0.70
## 1  0.29
## 2  0.23
## 3  0.55
## 4  0.72

10.1.3. Index

Another important class is Index, which is used for encoding the row and column names in a data frame:

df.index  # row labels
## RangeIndex(start=0, stop=5, step=1)

The above represents a sequence (0, 1, 2, 3, 4).

df.columns  # column labels
## Index(['u', 'v', 'w', 'x'], dtype='object')

Also, we can label the individual elements in Series objects:

s.index
## RangeIndex(start=0, stop=5, step=1)

A quite frequent operation that we will be applying to make a data frame column act as a vector of row labels is implemented in the set_index method for data frames:

df2 = df.set_index("x")
df2
##           u      v  w
## x                    
## spam   0.70   True  A
## spam   0.29   True  B
## bacon  0.23  False  C
## spam   0.55  False  D
## eggs   0.72   True  E

This Index object is named:

df2.index.name
## 'x'

which is handy when we decide that we want to convert the vector of row labels back to a standalone column:

df2.reset_index()
##        x     u      v  w
## 0   spam  0.70   True  A
## 1   spam  0.29   True  B
## 2  bacon  0.23  False  C
## 3   spam  0.55  False  D
## 4   eggs  0.72   True  E

There is also an option to get rid of the current .index and to replace it with the default label sequence, i.e., 0, 1, 2, …:

df2.reset_index(drop=True)
##       u      v  w
## 0  0.70   True  A
## 1  0.29   True  B
## 2  0.23  False  C
## 3  0.55  False  D
## 4  0.72   True  E

Important

In due course, we will be calling df.reset_index(drop=True) quite frequently, sometimes more than once in a single chain of commands.

Exercise 10.4

Use the pandas.DataFrame.rename method to change the name of the u column in df to spam.

Also, a hierarchical index – one that comprises two or more levels – is possible:

df.sort_values("x", ascending=False).set_index(["x", "v"])
##                 u  w
## x     v             
## spam  True   0.70  A
##       True   0.29  B
##       False  0.55  D
## eggs  True   0.72  E
## bacon False  0.23  C

Note that a group of three consecutive spams does not have the labels repeated – this is for increased readability.

Important

We will soon see that hierarchical indexes might arise after aggregating data in groups (via the groupby method).

For example:

nhanes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/nhanes_p_demo_bmx_2020.csv",
    comment="#")
res = nhanes.groupby(["RIAGENDR", "DMDEDUC2"])["BMXBMI"].mean()
res  # BMI by gender and education
## RIAGENDR  DMDEDUC2
## 1         1.0         28.765244
##           2.0         28.534737
##           3.0         29.637793
##           4.0         30.323158
##           5.0         28.689736
##           9.0         28.080000
## 2         1.0         30.489032
##           2.0         31.251247
##           3.0         30.969200
##           4.0         31.471476
##           5.0         28.891357
##           7.0         30.200000
##           9.0         33.950000
## Name: BMXBMI, dtype: float64

But let us fret not, as there is always:

res.reset_index()
##     RIAGENDR  DMDEDUC2     BMXBMI
## 0          1       1.0  28.765244
## 1          1       2.0  28.534737
## 2          1       3.0  29.637793
## 3          1       4.0  30.323158
## 4          1       5.0  28.689736
## 5          1       9.0  28.080000
## 6          2       1.0  30.489032
## 7          2       2.0  31.251247
## 8          2       3.0  30.969200
## 9          2       4.0  31.471476
## 10         2       5.0  28.891357
## 11         2       7.0  30.200000
## 12         2       9.0  33.950000

10.2. Aggregating Data Frames

Here is an example data frame:

np.random.seed(123)
d = pd.DataFrame(dict(
    u = np.round(np.random.rand(5), 2),
    v = np.round(np.random.randn(5), 2),
    w = ["spam", "bacon", "spam", "eggs", "sausage"]
), index=["a", "b", "c", "d", "e"])
d
##       u     v        w
## a  0.70  0.32     spam
## b  0.29 -0.05    bacon
## c  0.23 -0.20     spam
## d  0.55  1.98     eggs
## e  0.72 -1.62  sausage

First, all numpy functions can be applied directly on individual columns (i.e., objects of type Series), as the latter are, after all, vector-like.

np.quantile(d.loc[:, "u"], [0, 0.5, 1])
## array([0.23, 0.55, 0.72])

For more details on loc[...]-type indexing, see below. For now, the meaning of the above should be clear from the context.

Most numpy functions also work if they are fed with whole data frames, but we will need to select the numeric columns manually.

np.quantile(d.loc[:, ["u", "v"]], [0, 0.5, 1], axis=0)
## array([[ 0.23, -1.62],
##        [ 0.55, -0.05],
##        [ 0.72,  1.98]])

Sometimes the results will automatically be coerced to a Series object with the index slot set appropriately:

np.mean(d.loc[:, ["u", "v"]], axis=0)
## u    0.498
## v    0.086
## dtype: float64

Many operations, for convenience, have also been implemented as methods for the Series and DataFrame classes, e.g., mean, median, min, max, quantile, var, std (with ddof=1 by default, interestingly), and skew.

d.loc[:, ["u", "v"]].mean(numeric_only=True)
## u    0.498
## v    0.086
## dtype: float64
d.loc[:, ["u", "v"]].quantile([0, 0.5, 1], numeric_only=True)
##         u     v
## 0.0  0.23 -1.62
## 0.5  0.55 -0.05
## 1.0  0.72  1.98

Also note the describe method, which returns a few statistics at the same time.

d.describe()
##               u         v
## count  5.000000  5.000000
## mean   0.498000  0.086000
## std    0.227969  1.289643
## min    0.230000 -1.620000
## 25%    0.290000 -0.200000
## 50%    0.550000 -0.050000
## 75%    0.700000  0.320000
## max    0.720000  1.980000

Note

(*) Let us stress that what we see above is the bias-corrected (but still only asymptotically unbiased) version of the standard deviation (ddof=1), given by \(\sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2}\); compare Section 5.1.

Exercise 10.5

Check out the pandas.DataFrame.agg method that can apply all aggregating operations from a given list of functions. Write a call equivalent to d.describe().
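
As a hint, agg accepts a list of function names (a partial sketch; extending it so that it matches describe exactly is left as part of the exercise):

d.loc[:, ["u", "v"]].agg(["mean", "std"])
##              u         v
## mean  0.498000  0.086000
## std   0.227969  1.289643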

10.3. Transforming Data Frames

By applying the already well-known vectorised mathematical numpy functions, we can transform each data cell and return an object of the same type as the input one.

np.exp(d.loc[:, "u"])
## a    2.013753
## b    1.336427
## c    1.258600
## d    1.733253
## e    2.054433
## Name: u, dtype: float64
np.exp(d.loc[:, ["u", "v"]])
##           u         v
## a  2.013753  1.377128
## b  1.336427  0.951229
## c  1.258600  0.818731
## d  1.733253  7.242743
## e  2.054433  0.197899

When applying the binary arithmetic, comparison, and logical operators on an object of class Series and a scalar or a numpy vector, the operations are performed elementwisely – a style with which we are already familiar.

For instance, here is a standardised version of the u column:

u = d.loc[:, "u"]
(u - np.mean(u)) / np.std(u)
## a    0.990672
## b   -1.020098
## c   -1.314357
## d    0.255025
## e    1.088759
## Name: u, dtype: float64

For two objects of class Series, the operators are vectorised somewhat differently from their numpy counterparts: they are applied on the pairs of elements having identical labels (i.e., labelwisely). Still, if we are transforming the columns that come from the same data frame (and thus having identical index slots; this is the most common scenario), there will be no surprises, because this is equivalent to elementwise action. For instance:

d.loc[:, "u"] > d.loc[:, "v"]
## a     True
## b     True
## c     True
## d    False
## e     True
## dtype: bool
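
If the labels did not match, though, the operands would first be aligned over the union of the two indexes, with unmatched pairs yielding NaN (missing value); a quick illustration:

pd.Series([1, 2], index=["a", "b"]) + pd.Series([10, 20], index=["b", "c"])
## a     NaN
## b    12.0
## c     NaN
## dtype: float64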

For transforming many numerical columns at once, it is a good idea either to convert them to a numeric matrix explicitly and then use the basic numpy functions:

uv = d.loc[:, ["u", "v"]].values
uv = (uv-np.mean(uv, axis=0))/np.std(uv, axis=0)
uv
## array([[ 0.99067229,  0.20286225],
##        [-1.0200982 , -0.11790285],
##        [-1.3143573 , -0.24794275],
##        [ 0.25502455,  1.64197052],
##        [ 1.08875866, -1.47898717]])

or to use the pandas.DataFrame.apply method which invokes a given function on each column separately:

def standardise(x): return (x-np.mean(x))/np.std(x)
uv = d.loc[:, ["u", "v"]].apply(standardise)
uv
##           u         v
## a  0.990672  0.202862
## b -1.020098 -0.117903
## c -1.314357 -0.247943
## d  0.255025  1.641971
## e  1.088759 -1.478987

In both cases, we can then write:

d.loc[:, ["u", "v"]] = uv

to replace the old content with the transformed one.

Also, new columns can be added based on transformed versions of the existing ones:

d.loc[:, "uv_squared"] = (d.loc[:, "u"] * d.loc[:, "v"])**2
d
##           u         v        w  uv_squared
## a  0.990672  0.202862     spam    0.040389
## b -1.020098 -0.117903    bacon    0.014465
## c -1.314357 -0.247943     spam    0.106201
## d  0.255025  1.641971     eggs    0.175346
## e  1.088759 -1.478987  sausage    2.592938

10.4. Indexing Series Objects

Recall that each DataFrame and Series object is equipped with a slot called index, which is an object of class Index (or subclass thereof), giving the row and element labels, respectively. It turns out that we may apply the index operator, [...], to subset these objects not only through the indexers known from the numpy part (e.g., numerical ones, i.e., by position) but also ones that pinpoint the items via their labels. That is quite a lot of index-ing.

Let us study the different forms thereof in detail.

In this section we play with two example objects of class Series:

np.random.seed(123)
b = pd.Series(np.round(np.random.uniform(0,1,10),2))
b.index = np.random.permutation(np.r_[0:10])
b
## 2    0.70
## 1    0.29
## 8    0.23
## 7    0.55
## 9    0.72
## 4    0.42
## 5    0.98
## 6    0.68
## 3    0.48
## 0    0.39
## dtype: float64
c = b.copy()
c.index = list("abcdefghij")
c
## a    0.70
## b    0.29
## c    0.23
## d    0.55
## e    0.72
## f    0.42
## g    0.98
## h    0.68
## i    0.48
## j    0.39
## dtype: float64

They consist of the same values, in the same order, but have different labels (.index slots). In particular, b’s labels are integers which do not match the physical element positions (where 0 would denote the first element, etc.).

Important

For numpy vectors, we had four different indexing schemes: by scalar (extracts an element at given position), slice, integer vector, and logical vector. Series objects are additionally labelled, therefore they can additionally be accessed taking into account the contents of the .index slot.

10.4.1. Do Not Use [...] Directly

Applying the index operator, [...], directly on Series is generally not a good idea:

b[0]
## 0.39
b[ [0] ]
## 0    0.39
## dtype: float64

neither of them selects the first item – both refer to the item labelled 0.

However:

b[0:1]  # a slice: selects by position (here: the first element only)
## 2    0.7
## dtype: float64

and

c[0]  # there is no label `0`
## 0.7

fall back to position-based indexing.

Confusing? Well, with some self-discipline, the solution is easy. Namely:

Important

To stay away from potential problems, we should never apply [...] directly on Series objects.

10.4.2. loc[...]

To avoid any ambiguity (which is exactly what we want), we should be referring to Series.loc[...] and Series.iloc[...] for label- and position-based filtering, respectively.

And thus:

b.loc[0]
## 0.39

returns the element labelled 0. On the other hand, c.loc[0] will raise a KeyError, because c consists of character labels only.

Next, we can use lists of labels to select a subset.

b.loc[ [0, 1, 0] ]
## 0    0.39
## 1    0.29
## 0    0.39
## dtype: float64
c.loc[ ["j", "b", "j"] ]
## j    0.39
## b    0.29
## j    0.39
## dtype: float64

The result is always of type Series.

Slicing behaves differently, as the range is inclusive on both sides:

b.loc[1:7]
## 1    0.29
## 8    0.23
## 7    0.55
## dtype: float64
b.loc[0:4:-1]
## 0    0.39
## 3    0.48
## 6    0.68
## 5    0.98
## 4    0.42
## dtype: float64
c.loc["d":"g"]
## d    0.55
## e    0.72
## f    0.42
## g    0.98
## dtype: float64

return all elements between the two indicated labels.

Be careful: if there are repeated labels, all the matching items will be returned:

d = pd.Series([1, 2, 3, 4], index=["a", "b", "a", "c"])
d.loc["a"]
## a    1
## a    3
## dtype: int64

The result is not a scalar but a Series object.

10.4.3. iloc[...]

Here are some examples of position-based indexing:

b.iloc[0]  # the same: c.iloc[0]
## 0.7

returns the first element.

b.iloc[1:7]  # the same: c.iloc[1:7]
## 1    0.29
## 8    0.23
## 7    0.55
## 9    0.72
## 4    0.42
## 5    0.98
## dtype: float64

returns the 2nd, 3rd, …, 7th element (not including b.iloc[7], i.e., the 8th one).

10.4.4. Logical Indexing

Indexing using a logical vector-like object is also available.

We will usually be using loc[...] with either a Series object whose .index slot is identical to that of the object being subsetted, or a logical numpy vector.

b.loc[(b > 0.4) & (b < 0.6)]
## 7    0.55
## 4    0.42
## 3    0.48
## dtype: float64

For iloc[...], the indexer must be unlabelled, e.g., a bare logical numpy vector.
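
For example (a small sketch; .values strips the labels from the intermediate logical Series):

b.iloc[(b > 0.4).values]
## 2    0.70
## 7    0.55
## 9    0.72
## 4    0.42
## 5    0.98
## 6    0.68
## 3    0.48
## dtype: float64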

10.5. Indexing Data Frames

10.5.1. loc[...] and iloc[...]

For data frames, iloc and loc can be used too, but they now take two indexers, serving as the row and column selectors, respectively.

For example:

np.random.seed(123)
d = pd.DataFrame(dict(
    u = np.round(np.random.rand(5), 2),
    v = np.round(np.random.randn(5), 2),
    w = ["spam", "bacon", "spam", "eggs", "sausage"],
    x = [True, False, True, False, True]
), index=["a", "b", "c", "d", "e"])

And now:

d.loc[d.loc[:, "u"] > 0.5, "u":"w"]
##       u     v        w
## a  0.70  0.32     spam
## d  0.55  1.98     eggs
## e  0.72 -1.62  sausage

selects the rows where the values in the u column are greater than 0.5 and then returns all columns between u and w.

Furthermore,

d.iloc[:3, :].loc[:, ["u", "w"]]
##       u      w
## a  0.70   spam
## b  0.29  bacon
## c  0.23   spam

fetches the first 3 rows (by position – iloc is necessary) and then selects two indicated columns.

Important

We can write d.u as a shorter version of d.loc[:, "u"]. This improves the readability in contexts such as:

d.loc[(d.u >= 0.5) & (d.u <= 0.7), ["u", "w"]]
##       u     w
## a  0.70  spam
## d  0.55  eggs

However, this accessor is not universal: consider a data frame featuring a column named, e.g., mean – the attribute of that name refers to a method of the DataFrame class, not to the data.
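
For example (a sketch based on a renamed copy of d; outputs omitted):

d2 = d.rename(columns=dict(u="mean"))
d2.mean             # refers to the method pandas.DataFrame.mean, not to the column!
d2.loc[:, "mean"]   # always unambiguous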

Exercise 10.6

Use pandas.DataFrame.drop to select all columns but v in d.

Exercise 10.7

Use pandas.Series.isin (amongst others) to select all rows with spam and bacon on the d’s menu.

Exercise 10.8

In the tips dataset, select data on male customers whose total bill was in the [10, 20] interval. Also, select the records from Saturday and Sunday where the tip was greater than $5.

10.5.2. Adding Rows and Columns

loc[...] can also be used to add new columns to an existing data frame:

d.loc[:, "y"] = d.loc[:, "u"]**2  # or d.loc[:, "y"] = d.u**2
d
##       u     v        w      x       y
## a  0.70  0.32     spam   True  0.4900
## b  0.29 -0.05    bacon  False  0.0841
## c  0.23 -0.20     spam   True  0.0529
## d  0.55  1.98     eggs  False  0.3025
## e  0.72 -1.62  sausage   True  0.5184

Notation like “d.new_column = ...” will not work. As we have said, loc and iloc are universal, other accessors – not so much.

Exercise 10.9

Use pandas.DataFrame.insert to add a new column not necessarily at the end of d.

Exercise 10.10

Use pandas.concat to add a few more rows to d (the once-available pandas.DataFrame.append method was deprecated and then removed in pandas 2.0).

10.5.3. Random Sampling

As a simple application of what we have covered so far, let us consider the case of randomly sampling a number of rows from an existing data frame.

For the most basic use cases, we can use the pandas.DataFrame.sample method. It includes scenarios such as:

  • randomly select 5 rows, without replacement,

  • randomly select 20% of the rows, with replacement,

  • randomly rearrange all the rows.

For example:

body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/nhanes_adult_female_bmx_2020.csv",
    comment="#")
body.sample(5, random_state=123)  # 5 random rows without replacement
##       BMXWT  BMXHT  BMXARML  BMXLEG  BMXARMC  BMXHIP  BMXWAIST
## 4214   58.4  156.2     35.2    34.7     27.2    99.5      77.5
## 3361   73.7  161.0     36.5    34.5     29.0   107.6      98.2
## 3759   61.4  164.6     37.5    40.4     26.9    93.5      84.4
## 3733  120.4  158.8     33.5    34.6     40.5   147.2     129.3
## 1121  123.5  157.5     35.5    29.0     50.5   143.0     136.4

Notice the random_state argument which controls the seed of the random number generator so that we get reproducible results. Alternatively, we could call numpy.random.seed.
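
The remaining two scenarios on the above list can be realised similarly (a quick sketch; outputs omitted):

body.sample(frac=0.2, replace=True, random_state=123)  # 20% of the rows, with replacement
body.sample(frac=1.0, random_state=123)                # randomly rearrange all the rows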

Exercise 10.11

Show how the 3 aforementioned scenarios can be implemented manually using iloc[...] and numpy.random.permutation or numpy.random.choice.

In machine learning practice, as we shall see later, we are used to training and evaluating models on different (mutually disjoint) subsets of the whole data frame.

For instance, we might be interested in performing the so-called train/test split, where 80% (or 60% or 70%) of the randomly selected rows would constitute the first new data frame and the remaining 20% (or 40% or 30%) would go to the second one.

Given a data frame like:

x = body.head(10)  # this is just an example
x
##    BMXWT  BMXHT  BMXARML  BMXLEG  BMXARMC  BMXHIP  BMXWAIST
## 0   97.1  160.2     34.7    40.8     35.8   126.1     117.9
## 1   91.1  152.7     33.5    33.0     38.5   125.5     103.1
## 2   73.0  161.2     37.4    38.0     31.8   106.2      92.0
## 3   61.7  157.4     38.0    34.7     29.0   101.0      90.5
## 4   55.4  154.6     34.6    34.0     28.3    92.5      73.2
## 5   62.0  144.7     32.5    34.2     29.8   106.7      84.8
## 6   66.2  166.5     37.5    37.6     32.0    96.3      95.7
## 7   75.9  154.5     35.4    37.6     32.7   107.7      98.7
## 8   77.2  159.2     38.5    40.5     35.7   102.0      97.5
## 9   91.6  174.5     36.1    45.9     35.2   121.3     100.3

one way to perform the aforementioned split is to generate a random permutation of the set of row indices:

np.random.seed(123)  # reproducibility matters
idx = np.random.permutation(x.shape[0])
idx
## array([4, 0, 7, 5, 8, 3, 1, 6, 9, 2])

And then to pick the first 80% of these indices to construct the data frame number one:

k = int(x.shape[0]*0.8)
x.iloc[idx[:k], :]
##    BMXWT  BMXHT  BMXARML  BMXLEG  BMXARMC  BMXHIP  BMXWAIST
## 4   55.4  154.6     34.6    34.0     28.3    92.5      73.2
## 0   97.1  160.2     34.7    40.8     35.8   126.1     117.9
## 7   75.9  154.5     35.4    37.6     32.7   107.7      98.7
## 5   62.0  144.7     32.5    34.2     29.8   106.7      84.8
## 8   77.2  159.2     38.5    40.5     35.7   102.0      97.5
## 3   61.7  157.4     38.0    34.7     29.0   101.0      90.5
## 1   91.1  152.7     33.5    33.0     38.5   125.5     103.1
## 6   66.2  166.5     37.5    37.6     32.0    96.3      95.7

and the remaining ones to generate the second dataset:

x.iloc[idx[k:], :]
##    BMXWT  BMXHT  BMXARML  BMXLEG  BMXARMC  BMXHIP  BMXWAIST
## 9   91.6  174.5     36.1    45.9     35.2   121.3     100.3
## 2   73.0  161.2     37.4    38.0     31.8   106.2      92.0

Exercise 10.12

In the wine_quality_all dataset, leave out all but the white wines. Partition the resulting data frame randomly into three data frames: wines_train (60% of the rows), wines_validate (another 20% of the rows), and wines_test (the remaining 20%).

Exercise 10.13

Consider the white wines dataset again. Write a function kfold which takes a data frame x and an integer k>1 on input. Return a list of data frames resulting from randomly splitting x into k disjoint chunks of equal (or almost equal, if that is not possible) sizes.
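
Hint: one possible building block (amongst others) is numpy.array_split, which divides a vector into k chunks of almost equal sizes:

np.random.seed(123)
np.array_split(np.random.permutation(10), 3)
## [array([4, 0, 7, 5]), array([8, 3, 1]), array([6, 9, 2])]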

10.5.4. Hierarchical Indices (*)

Consider the following DataFrame object with a hierarchical index:

np.random.seed(123)
d = pd.DataFrame(dict(
    year = sorted([2023, 2024, 2025]*4),
    quarter = ["Q1", "Q2", "Q3", "Q4"]*3,
    data = np.round(np.random.rand(12), 2)
)).set_index(["year", "quarter"])
d
##               data
## year quarter      
## 2023 Q1       0.70
##      Q2       0.29
##      Q3       0.23
##      Q4       0.55
## 2024 Q1       0.72
##      Q2       0.42
##      Q3       0.98
##      Q4       0.68
## 2025 Q1       0.48
##      Q2       0.39
##      Q3       0.34
##      Q4       0.73

The index has both levels named, but this is purely for aesthetic reasons.

Indexing using loc[...] by default relates to the first level of the hierarchy:

d.loc[2023, :]
##          data
## quarter      
## Q1       0.70
## Q2       0.29
## Q3       0.23
## Q4       0.55
d.loc[[2023, 2025], :]
##               data
## year quarter      
## 2023 Q1       0.70
##      Q2       0.29
##      Q3       0.23
##      Q4       0.55
## 2025 Q1       0.48
##      Q2       0.39
##      Q3       0.34
##      Q4       0.73

To access deeper levels, we can use tuples as indexers:

d.loc[(2023, "Q1"), :]
## data    0.7
## Name: (2023, Q1), dtype: float64
d.loc[[(2023, "Q1"), (2024, "Q3")], :]
##               data
## year quarter      
## 2023 Q1       0.70
## 2024 Q3       0.98

In certain scenarios, though, it will probably be much easier to subset a hierarchical index by using reset_index and set_index creatively (together with loc[...] and pandas.Series.isin, etc.).
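
For instance (one possibility of many):

d_flat = d.reset_index()
d_flat.loc[d_flat.quarter.isin(["Q1", "Q3"]), :]
##     year quarter  data
## 0   2023      Q1  0.70
## 2   2023      Q3  0.23
## 4   2024      Q1  0.72
## 6   2024      Q3  0.98
## 8   2025      Q1  0.48
## 10  2025      Q3  0.34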

The `:` operator for slicing can only be used directly within square brackets, but we can always rely on the slice constructor to create the corresponding slice objects in any other context:

d.loc[(slice(None), ["Q1", "Q3"]), :]  # `:`, ["Q1", "Q3"]
##               data
## year quarter      
## 2023 Q1       0.70
##      Q3       0.23
## 2024 Q1       0.72
##      Q3       0.98
## 2025 Q1       0.48
##      Q3       0.34
d.loc[(slice(None, None, -1), slice("Q2", "Q3")), :]  # ::-1, "Q2":"Q3"
##               data
## year quarter      
## 2025 Q3       0.34
##      Q2       0.39
## 2024 Q3       0.98
##      Q2       0.42
## 2023 Q3       0.23
##      Q2       0.29

10.6. Further Operations on Data Frames

10.6.1. Sorting

Let us consider another example dataset, featuring the 2018 yearly averages of air quality measurements in the Australian state of Victoria.

air = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/air_quality_2018_means.csv",
    comment="#")
air = (
    air.
    loc[air.param_id.isin(["BPM2.5", "NO2"]), :].
    reset_index(drop=True)
)
air
##           sp_name param_id      value
## 0      Alphington   BPM2.5   7.848758
## 1      Alphington      NO2   9.558120
## 2    Altona North      NO2   9.467912
## 3       Churchill   BPM2.5   6.391230
## 4       Dandenong      NO2   9.800705
## 5       Footscray   BPM2.5   7.640948
## 6       Footscray      NO2  10.274531
## 7   Geelong South   BPM2.5   6.502762
## 8   Geelong South      NO2   5.681722
## 9   Melbourne CBD   BPM2.5   8.072998
## 10            Moe   BPM2.5   6.427079
## 11   Morwell East   BPM2.5   6.784596
## 12  Morwell South   BPM2.5   6.512849
## 13  Morwell South      NO2   5.124430
## 14      Traralgon   BPM2.5   8.024735
## 15      Traralgon      NO2   5.776333

sort_values is a convenient means to order the rows with respect to one criterion:

air.sort_values("value", ascending=False)
##           sp_name param_id      value
## 6       Footscray      NO2  10.274531
## 4       Dandenong      NO2   9.800705
## 1      Alphington      NO2   9.558120
## 2    Altona North      NO2   9.467912
## 9   Melbourne CBD   BPM2.5   8.072998
## 14      Traralgon   BPM2.5   8.024735
## 0      Alphington   BPM2.5   7.848758
## 5       Footscray   BPM2.5   7.640948
## 11   Morwell East   BPM2.5   6.784596
## 12  Morwell South   BPM2.5   6.512849
## 7   Geelong South   BPM2.5   6.502762
## 10            Moe   BPM2.5   6.427079
## 3       Churchill   BPM2.5   6.391230
## 15      Traralgon      NO2   5.776333
## 8   Geelong South      NO2   5.681722
## 13  Morwell South      NO2   5.124430

Sorting with respect to more criteria is also possible:

air.sort_values(["param_id", "value"], ascending=[True, False])
##           sp_name param_id      value
## 9   Melbourne CBD   BPM2.5   8.072998
## 14      Traralgon   BPM2.5   8.024735
## 0      Alphington   BPM2.5   7.848758
## 5       Footscray   BPM2.5   7.640948
## 11   Morwell East   BPM2.5   6.784596
## 12  Morwell South   BPM2.5   6.512849
## 7   Geelong South   BPM2.5   6.502762
## 10            Moe   BPM2.5   6.427079
## 3       Churchill   BPM2.5   6.391230
## 6       Footscray      NO2  10.274531
## 4       Dandenong      NO2   9.800705
## 1      Alphington      NO2   9.558120
## 2    Altona North      NO2   9.467912
## 15      Traralgon      NO2   5.776333
## 8   Geelong South      NO2   5.681722
## 13  Morwell South      NO2   5.124430

Here, in each group of identical parameters, we get a decreasing order with respect to the value.

Exercise 10.14

(*) Compare the ordering with respect to param_id and value vs value and then param_id.

Exercise 10.15

(**) Perform identical reorderings but using only loc[...], iloc[...], and numpy.argsort.

Note

(*) DataFrame.sort_values unfortunately uses a non-stable algorithm by default (a modified quicksort). If a data frame is sorted with respect to one criterion, and then we reorder it with respect to another one, tied observations are not guaranteed to be listed in the original order:

(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/air_quality_2018_means.csv",
    comment="#")
    .sort_values("sp_name")
    .sort_values("param_id")
    .set_index("param_id")
    .loc[["BPM2.5", "NO2"], :]
    .reset_index())
##    param_id        sp_name      value
## 0    BPM2.5  Melbourne CBD   8.072998
## 1    BPM2.5            Moe   6.427079
## 2    BPM2.5      Footscray   7.640948
## 3    BPM2.5   Morwell East   6.784596
## 4    BPM2.5      Churchill   6.391230
## 5    BPM2.5  Morwell South   6.512849
## 6    BPM2.5      Traralgon   8.024735
## 7    BPM2.5     Alphington   7.848758
## 8    BPM2.5  Geelong South   6.502762
## 9       NO2  Morwell South   5.124430
## 10      NO2      Traralgon   5.776333
## 11      NO2  Geelong South   5.681722
## 12      NO2   Altona North   9.467912
## 13      NO2     Alphington   9.558120
## 14      NO2      Dandenong   9.800705
## 15      NO2      Footscray  10.274531

We have lost the ordering based on station names in the two subgroups. To switch to a mergesort-like method (timsort), we should pass kind="stable".

(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/air_quality_2018_means.csv",
    comment="#")
    .sort_values("sp_name")
    .sort_values("param_id", kind="stable")  # !
    .set_index("param_id")
    .loc[["BPM2.5", "NO2"], :]
    .reset_index())
##    param_id        sp_name      value
## 0    BPM2.5     Alphington   7.848758
## 1    BPM2.5      Churchill   6.391230
## 2    BPM2.5      Footscray   7.640948
## 3    BPM2.5  Geelong South   6.502762
## 4    BPM2.5  Melbourne CBD   8.072998
## 5    BPM2.5            Moe   6.427079
## 6    BPM2.5   Morwell East   6.784596
## 7    BPM2.5  Morwell South   6.512849
## 8    BPM2.5      Traralgon   8.024735
## 9       NO2     Alphington   9.558120
## 10      NO2   Altona North   9.467912
## 11      NO2      Dandenong   9.800705
## 12      NO2      Footscray  10.274531
## 13      NO2  Geelong South   5.681722
## 14      NO2  Morwell South   5.124430
## 15      NO2      Traralgon   5.776333

10.6.2. Stacking and Unstacking (Long and Wide Forms)

Let us discuss some further ways to transform data frames that benefit from, or are only possible thanks to, their ability to store data of different types.

The above air dataset is in the so-called long format, where all measurements are stacked one after/below another. Such a form is quite convenient for the purpose of storing data, but not necessarily for all processing tasks, compare [Wic14]. Recall from the matrix part that a more natural way would be to have a single observation (e.g., data for a measurement station) in each row.

We can unstack the air data frame (convert to the wide format) quite easily:

air.set_index(["sp_name", "param_id"]).unstack().loc[:, "value"]
## param_id         BPM2.5        NO2
## sp_name                           
## Alphington     7.848758   9.558120
## Altona North        NaN   9.467912
## Churchill      6.391230        NaN
## Dandenong           NaN   9.800705
## Footscray      7.640948  10.274531
## Geelong South  6.502762   5.681722
## Melbourne CBD  8.072998        NaN
## Moe            6.427079        NaN
## Morwell East   6.784596        NaN
## Morwell South  6.512849   5.124430
## Traralgon      8.024735   5.776333

The missing values are denoted with NaNs (not-a-number), see Section 15.1 for more details. Interestingly, we got a hierarchical index in the columns, hence the loc[...] part.
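
To see where the hierarchy comes from, we can peek at the columns slot of the intermediate result (a quick check; the exact printout may vary slightly across pandas versions):

air.set_index(["sp_name", "param_id"]).unstack().columns
## MultiIndex([('value', 'BPM2.5'),
##             ('value',    'NO2')],
##            names=[None, 'param_id'])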

Equivalently:

air_wide = air.pivot(index="sp_name", columns="param_id", values="value").\
    rename_axis(index=None, columns=None)
air_wide
##                  BPM2.5        NO2
## Alphington     7.848758   9.558120
## Altona North        NaN   9.467912
## Churchill      6.391230        NaN
## Dandenong           NaN   9.800705
## Footscray      7.640948  10.274531
## Geelong South  6.502762   5.681722
## Melbourne CBD  8.072998        NaN
## Moe            6.427079        NaN
## Morwell East   6.784596        NaN
## Morwell South  6.512849   5.124430
## Traralgon      8.024735   5.776333

The rename_axis part is there so as to get rid of the name parts of the index and columns slots.

The other way around, we can use the stack method:

air_wide.T.rename_axis(index="param", columns="location").\
    stack().rename("value").reset_index()
##      param       location      value
## 0    BPM2.5     Alphington   7.848758
## 1    BPM2.5      Churchill   6.391230
## 2    BPM2.5      Footscray   7.640948
## 3    BPM2.5  Geelong South   6.502762
## 4    BPM2.5  Melbourne CBD   8.072998
## 5    BPM2.5            Moe   6.427079
## 6    BPM2.5   Morwell East   6.784596
## 7    BPM2.5  Morwell South   6.512849
## 8    BPM2.5      Traralgon   8.024735
## 9       NO2     Alphington   9.558120
## 10      NO2   Altona North   9.467912
## 11      NO2      Dandenong   9.800705
## 12      NO2      Footscray  10.274531
## 13      NO2  Geelong South   5.681722
## 14      NO2  Morwell South   5.124430
## 15      NO2      Traralgon   5.776333

We used the data frame transpose (T) so that the result is grouped by parameter, with the locations varying faster (less boring an outcome in this context).

10.6.3. Set-Theoretic Operations

Here are two not at all disjoint sets of imaginary persons:

A = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/some_birth_dates1.csv",
    comment="#")
A
##                   Name   BirthDate
## 0     Paitoon Ornwimol  26.06.1958
## 1         Antónia Lata  20.05.1935
## 2    Bertoldo Mallozzi  17.08.1972
## 3        Nedeljko Bukv  19.12.1921
## 4        Micha Kitchen  17.09.1930
## 5      Mefodiy Shachar  01.10.1914
## 6         Paul Meckler  29.09.1968
## 7      Katarzyna Lasko  20.10.1971
## 8         Åge Trelstad  07.03.1935
## 9  Duchanee Panomyaong  19.06.1952

and

B = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/some_birth_dates2.csv",
    comment="#")
B
##                    Name   BirthDate
## 0    Hushang Naigamwala  25.08.1991
## 1              Zhen Wei  16.11.1975
## 2         Micha Kitchen  17.09.1930
## 3           Jodoc Alwin  16.11.1969
## 4            Igor Mazał  14.05.2004
## 5       Katarzyna Lasko  20.10.1971
## 6   Duchanee Panomyaong  19.06.1952
## 7       Mefodiy Shachar  01.10.1914
## 8          Paul Meckler  29.09.1968
## 9         Noe Tae-Woong  11.07.1970
## 10         Åge Trelstad  07.03.1935

In both datasets, there is a single column whose elements uniquely identify each record (i.e., Name). In the language of relational databases, we would call it a primary key. In such a case, implementing the set-theoretic operations is relatively easy:

\(A\cap B\) (intersection) – the rows that are both in A and in B:

A.loc[A.Name.isin(B.Name), :]
##                   Name   BirthDate
## 4        Micha Kitchen  17.09.1930
## 5      Mefodiy Shachar  01.10.1914
## 6         Paul Meckler  29.09.1968
## 7      Katarzyna Lasko  20.10.1971
## 8         Åge Trelstad  07.03.1935
## 9  Duchanee Panomyaong  19.06.1952

\(A\setminus B\) (difference) – the rows that are in A but not in B:

A.loc[~A.Name.isin(B.Name), :]
##                 Name   BirthDate
## 0   Paitoon Ornwimol  26.06.1958
## 1       Antónia Lata  20.05.1935
## 2  Bertoldo Mallozzi  17.08.1972
## 3      Nedeljko Bukv  19.12.1921

\(A\cup B\) (union) – the rows that are in A or in B (or both):

pd.concat((A, B.loc[~B.Name.isin(A.Name), :]))
##                   Name   BirthDate
## 0     Paitoon Ornwimol  26.06.1958
## 1         Antónia Lata  20.05.1935
## 2    Bertoldo Mallozzi  17.08.1972
## 3        Nedeljko Bukv  19.12.1921
## 4        Micha Kitchen  17.09.1930
## 5      Mefodiy Shachar  01.10.1914
## 6         Paul Meckler  29.09.1968
## 7      Katarzyna Lasko  20.10.1971
## 8         Åge Trelstad  07.03.1935
## 9  Duchanee Panomyaong  19.06.1952
## 0   Hushang Naigamwala  25.08.1991
## 1             Zhen Wei  16.11.1975
## 3          Jodoc Alwin  16.11.1969
## 4           Igor Mazał  14.05.2004
## 9        Noe Tae-Woong  11.07.1970

There are no duplicate rows in the output.

Exercise 10.16

Determine \((A\cup B) \setminus (A\cap B) = (A\setminus B)\cup(B\setminus A)\) (symmetric difference).

Exercise 10.17

(*) Determine the union, intersection, and difference of the wine_sample1 and wine_sample2 datasets, where there is no column uniquely identifying the observations. Hint: consider using pandas.DataFrame.duplicated or pandas.DataFrame.drop_duplicates.

10.6.4. Joining (Merging)

In database design, it is common to normalise the datasets in order to avoid the duplication of information and the pathologies stemming from it (e.g., [Dat03]).

Example 10.18

The above air quality parameters are separately described in another data frame:

param = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/air_quality_2018_param.csv",
    comment="#")
param.rename(dict(param_std_unit_of_measure="unit"), axis=1)
##   param_id                   param_name   unit      param_short_name
## 0      API      Airborne particle index   none  Visibility Reduction
## 1   BPM2.5  BAM  Particles < 2.5 micron  ug/m3                 PM2.5
## 2       CO              Carbon Monoxide    ppm                    CO
## 3    HPM10                   Hivol PM10  ug/m3                   NaN
## 4      NO2             Nitrogen Dioxide    ppb                   NO2
## 5       O3                        Ozone    ppb                    O3
## 6     PM10     TEOM Particles <10micron  ug/m3                  PM10
## 7   PPM2.5               Partisol PM2.5  ug/m3                   NaN
## 8      SO2               Sulfur Dioxide    ppb                   SO2

We could have stored them alongside the air data frame, but that would be a waste of space.

Also, if we wanted to modify some datum (note, e.g., the annoying double space in param_name for BPM2.5), we would have to update all the relevant records.

Instead, we can always look the information up by param_id and join (merge) the two data frames only if we need it.
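
For instance, such a lookup may be implemented as follows (a sketch relying on the merge function discussed below; output omitted):

pd.merge(air, param.loc[:, ["param_id", "param_name"]], on="param_id")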

Let us discuss the possible join operations by studying the two following data sets:

A = pd.DataFrame({
    "x": ["a0", "a1", "a2", "a3"],
    "y": ["b0", "b1", "b2", "b3"]
})
A
##     x   y
## 0  a0  b0
## 1  a1  b1
## 2  a2  b2
## 3  a3  b3

and:

B = pd.DataFrame({
    "x": ["a0", "a2", "a2", "a4"],
    "z": ["c0", "c1", "c2", "c3"]
})
B
##     x   z
## 0  a0  c0
## 1  a2  c1
## 2  a2  c2
## 3  a4  c3

They both have one column somewhat in common, x.

The inner (natural) join returns the records that have a match in both datasets:

pd.merge(A, B, on="x")
##     x   y   z
## 0  a0  b0  c0
## 1  a2  b2  c1
## 2  a2  b2  c2

The left join of A with B guarantees to return all the records from A, even those which are not matched by anything in B.

pd.merge(A, B, how="left", on="x")
##     x   y    z
## 0  a0  b0   c0
## 1  a1  b1  NaN
## 2  a2  b2   c1
## 3  a2  b2   c2
## 4  a3  b3  NaN

The right join of A with B is the same as the left join of B with A:

pd.merge(A, B, how="right", on="x")
##     x    y   z
## 0  a0   b0  c0
## 1  a2   b2  c1
## 2  a2   b2  c2
## 3  a4  NaN  c3

Finally, the full outer join is the set-theoretic union of the left and the right join:

pd.merge(A, B, how="outer", on="x")
##     x    y    z
## 0  a0   b0   c0
## 1  a1   b1  NaN
## 2  a2   b2   c1
## 3  a2   b2   c2
## 4  a3   b3  NaN
## 5  a4  NaN   c3
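
As an aside, pandas.merge also accepts indicator=True, which adds a column explaining the provenance of each row – a handy sanity check:

pd.merge(A, B, how="outer", on="x", indicator=True)
##     x    y    z      _merge
## 0  a0   b0   c0        both
## 1  a1   b1  NaN   left_only
## 2  a2   b2   c1        both
## 3  a2   b2   c2        both
## 4  a3   b3  NaN   left_only
## 5  a4  NaN   c3  right_only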

Exercise 10.20

Normalise air_quality_2018 yourself so that you get the three data frames mentioned in the previous exercise (value, point, param).

Exercise 10.21

(*) In the National Health and Nutrition Examination Survey (NHANES) by the US Centers for Disease Control and Prevention, each participant is uniquely identified by their sequence number (SEQN), which appears in numerous datasets published as part of the study (e.g., the demographic and body measurement tables like the ones we relied on above).

Join a few chosen datasets that are to your liking.

10.6.5. …And (Too) Many More

Looking at the list of methods for the DataFrame and Series classes in the pandas package’s documentation, we can see that they are abundant. Together with the object-oriented syntax, we will often find ourselves appreciating the high readability of even quite complex operation chains such as data.drop_duplicates().groupby(["year", "month"]).mean().reset_index().

Nevertheless, we should admit that the methods are too plentiful: they give the impression that someone was overly generous, including a separate verb for every conceivable data analysis operation, even those that can easily be expressed as a combination of two or three simpler ones. Therefore, in order to prevent the reader from being overloaded with too much new information, below we will be discussing only the most noteworthy features that often appear in data wrangling scenarios.

As strong advocates of minimalism (which is more environmentally friendly and sustainable), we will quite often be more eager to fall back on hand-crafted combinations of the more basic (universal) building blocks from numpy and scipy instead. This is also in line with our emphasis on developing transferable skills: Python with pandas is not the only environment where we can work with data frames (e.g., base R supports them too).

10.7. Exercises

Exercise 10.22

How are data frames different from matrices?

Exercise 10.23

What are the use cases of the name slot in Series and Index objects?

Exercise 10.24

What is the purpose of set_index and reset_index?

Exercise 10.25

Why is learning numpy crucial for becoming a proficient user of pandas?

Exercise 10.26

What is the difference between iloc[...] and loc[...]?

Exercise 10.27

Why is applying the index operator [...] directly on a Series or DataFrame object not necessarily a good idea?

Exercise 10.28

What is the difference between .index, Index, and .columns?

Exercise 10.29

How to compute the arithmetic mean and median of all the numeric columns in a data frame, using a single line of code?

Exercise 10.30

What is a train/test split and how to perform it using numpy and pandas?

Exercise 10.31

What is the difference between stacking and unstacking? Which one yields a wide (as opposed to long) format?

Exercise 10.32

Name different data frame join (merge) operations and explain how they work.

Exercise 10.33

How does sorting with respect to more than one criterion work?

Exercise 10.34

Name the basic set-theoretic operations on data frames.


[1] Data frames were first introduced in the 1991 version of the S language [CH91].