10. Introducing data frames¶
numpy arrays are an extremely versatile tool for performing data analysis activities and other numerical computations of various kinds. Even though it is theoretically possible otherwise, in practice, we only store elements of the same type there: most often numbers.
pandas [65] is amongst over one hundred thousand open-source packages and repositories that use numpy to provide additional data wrangling functionality. It was originally written by Wes McKinney but was heavily inspired by the data.frame [1] objects in S and R as well as by tables in relational (think: SQL) databases and spreadsheets.
pandas defines a few classes, of which the most important are:
DataFrame – for representing tabular data (matrix-like) with columns of possibly different types, in particular a mix of numerical and categorical variables,
Series – vector-like objects for storing individual columns,
Index and its derivatives – vector-like (usually) objects for labelling individual rows and columns in DataFrames and items in Series objects,
SeriesGroupBy and DataFrameGroupBy – which model observations grouped by a categorical variable or a combination of factors (Chapter 12),
together with many methods for:
transforming/aggregating/processing data, also in groups determined by categorical variables or products thereof,
reshaping (e.g., from wide to long format) and joining datasets,
importing/exporting data from/to various sources and formats, e.g., CSV and HDF5 files or relational databases,
handling missing data,
all of which we introduce in this part.
It is customary to import the pandas package under the following alias:
import pandas as pd
Important
Let’s repeat: pandas is built on top of numpy and most objects therein can be processed by numpy functions as well. Many other functions, e.g., in scikit-learn, accept both DataFrame and ndarray objects, but often convert the former to the latter internally to enable data processing using fast, lower-level C/C++/Fortran routines.
What we have learnt so far[2] still applies. But there is more, hence this part.
10.1. Creating data frames¶
Data frames can be created, amongst others,
using the DataFrame
class constructor, which
can be fed, for example, with a numpy matrix:
np.random.seed(123)
pd.DataFrame(
np.random.rand(4, 3),
columns=["a", "b", "c"]
)
## a b c
## 0 0.696469 0.286139 0.226851
## 1 0.551315 0.719469 0.423106
## 2 0.980764 0.684830 0.480932
## 3 0.392118 0.343178 0.729050
Notice that rows and columns are labelled (and how readable that is).
A dictionary of vector-like objects of equal lengths is another common data source:
np.random.seed(123)
df = pd.DataFrame(dict(
a = np.round(np.random.rand(5), 2),
b = [1, 2.5, np.nan, 4, np.nan],
c = [True, True, False, False, True],
d = ["A", "B", "C", None, "E"],
e = ["spam", "spam", "bacon", "spam", "eggs"],
f = np.array([
"2021-01-01", "2022-02-02", "2023-03-03", "2024-04-04", "2025-05-05"
], dtype="datetime64[D]"),
g = [
["spam"], ["bacon", "spam"], None, ["eggs", "bacon", "spam"], ["ham"]
],
))
df
## a b c d e f g
## 0 0.70 1.0 True A spam 2021-01-01 [spam]
## 1 0.29 2.5 True B spam 2022-02-02 [bacon, spam]
## 2 0.23 NaN False C bacon 2023-03-03 None
## 3 0.55 4.0 False None spam 2024-04-04 [eggs, bacon, spam]
## 4 0.72 NaN True E eggs 2025-05-05 [ham]
The above illustrates the possibility of having columns of different types.
Exercise
Check out pandas.DataFrame.from_dict and from_records in the documentation. Use them to create some example data frames.
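For instance, here is a minimal sketch (with made-up data) of what these two class methods expect:
pd.DataFrame.from_dict({"a": [1, 2], "b": [3.0, 4.0]})  # dictionary keys become columns
pd.DataFrame.from_records([(1, 3.0), (2, 4.0)], columns=["a", "b"])  # one tuple per row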
Further, data frames can be read from files in different formats, for instance, CSV:
body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_bmx_2020.csv",
comment="#")
body.head() # display the first few rows (five by default)
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 0 97.1 160.2 34.7 40.8 35.8 126.1 117.9
## 1 91.1 152.7 33.5 33.0 38.5 125.5 103.1
## 2 73.0 161.2 37.4 38.0 31.8 106.2 92.0
## 3 61.7 157.4 38.0 34.7 29.0 101.0 90.5
## 4 55.4 154.6 34.6 34.0 28.3 92.5 73.2
Reading from URLs and local files is also supported; compare Section 13.6.1.
Exercise
Check out other pandas.read_* functions in the pandas documentation, e.g., for importing spreadsheets, Apache Parquet and HDF5 files, scraping tables from HTML documents, or reading data from relational databases. We will discuss some of them in more detail later.
Exercise
(*) Large files that do not fit into the computer’s memory (but are not too large) can still be read with pandas.read_csv. Check out the meaning of the usecols, dtype, skiprows, and nrows arguments. On a side note, sampling is mentioned in Section 10.5.4 and chunking in Section 13.2.
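As a hint, a hypothetical call might look as follows (the file name and column choices are made up):
pd.read_csv("huge.csv",
    usecols=["BMXWT", "BMXHT"],   # read only the two listed columns
    dtype={"BMXWT": "float32"},   # request a more compact type
    skiprows=range(1, 1000),      # skip data rows 1..999, keeping the header
    nrows=10000)                  # read at most 10000 rows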
10.1.1. Data frames are matrix-like¶
Data frames are modelled through numpy matrices. We can thus already feel quite at home with them. For example, it is easy to fetch a data frame’s number of rows and columns:
df.shape
## (5, 7)
or the type of each column:
df.dtypes # returns a Series object; see below
## a float64
## b float64
## c bool
## d object
## e object
## f datetime64[s]
## g object
## dtype: object
Recall that numpy arrays are equipped with the dtype slot.
10.1.2. Series¶
There is a separate class for storing individual data frame columns: it is called Series.
s = df.loc[:, "a"] # extract the `a` column; alternatively: df.a
s
## 0 0.70
## 1 0.29
## 2 0.23
## 3 0.55
## 4 0.72
## Name: a, dtype: float64
Data frames with one column are printed
out slightly differently.
We get the column name at the top,
but do not have the dtype
information at the bottom.
s.to_frame() # or: pd.DataFrame(s)
## a
## 0 0.70
## 1 0.29
## 2 0.23
## 3 0.55
## 4 0.72
Indexing will be discussed later.
Important
It is crucial to know whether we are dealing with a Series or a DataFrame object, as each of them defines a slightly different set of methods.
We will now be relying on object-orientated syntax (compare Section 2.2.3) much more frequently than before.
By calling:
s.mean()
## 0.49800000000000005
we refer to pandas.Series.mean (which returns a scalar), whereas:
df.mean(numeric_only=True)
## a 0.498
## b 2.500
## c 0.600
## dtype: float64
uses pandas.DataFrame.mean (which yields a Series).
Exercise
Look up these two methods in the pandas manual. Note that their argument lists are slightly different.
Objects of the class Series
are vector-like:
s.shape
## (5,)
s.dtype
## dtype('float64')
They are wrappers around numpy arrays.
s.values
## array([0.7 , 0.29, 0.23, 0.55, 0.72])
Most importantly, numpy functions can be called directly on them:
np.mean(s)
## 0.49800000000000005
As a consequence, what we covered in the part of this book that dealt with vector processing still holds for data frame columns (but there will be more).
Series
can also be named.
s.name
## 'a'
This is convenient, especially when we convert them to a data frame
as the name
sets the label of the newly created column:
s.rename("spam").to_frame()
## spam
## 0 0.70
## 1 0.29
## 2 0.23
## 3 0.55
## 4 0.72
10.1.3. Index¶
Another important class is called Index[3]. We use it for storing element or axes labels.
The index
(lowercase) slot of a data frame stores an
object of the class Index
(or one of its derivatives)
that gives the row names:
df.index # row labels
## RangeIndex(start=0, stop=5, step=1)
It represents a sequence (0, 1, 2, 3, 4).
Furthermore, the columns slot gives:
df.columns # column labels
## Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object')
Also, we can label the individual elements in Series
objects:
s.index
## RangeIndex(start=0, stop=5, step=1)
The set_index method can be applied to make a data frame column act as a sequence of row labels:
df2 = df.set_index("e")
df2
## a b c d f g
## e
## spam 0.70 1.0 True A 2021-01-01 [spam]
## spam 0.29 2.5 True B 2022-02-02 [bacon, spam]
## bacon 0.23 NaN False C 2023-03-03 None
## spam 0.55 4.0 False None 2024-04-04 [eggs, bacon, spam]
## eggs 0.72 NaN True E 2025-05-05 [ham]
This Index
object is named:
df2.index.name
## 'e'
We can also rename the axes on the fly:
df2.rename_axis(index="ROWS", columns="COLS")
## COLS a b c d f g
## ROWS
## spam 0.70 1.0 True A 2021-01-01 [spam]
## spam 0.29 2.5 True B 2022-02-02 [bacon, spam]
## bacon 0.23 NaN False C 2023-03-03 None
## spam 0.55 4.0 False None 2024-04-04 [eggs, bacon, spam]
## eggs 0.72 NaN True E 2025-05-05 [ham]
Having a named index
slot is handy when converting a vector
of row labels back to a standalone column:
df2.rename_axis(index="NEW_COLUMN").reset_index()
## NEW_COLUMN a b c d f g
## 0 spam 0.70 1.0 True A 2021-01-01 [spam]
## 1 spam 0.29 2.5 True B 2022-02-02 [bacon, spam]
## 2 bacon 0.23 NaN False C 2023-03-03 None
## 3 spam 0.55 4.0 False None 2024-04-04 [eggs, bacon, spam]
## 4 eggs 0.72 NaN True E 2025-05-05 [ham]
But we can also do it this way:
df2.reset_index(names="Spanish Inquisition")
## Spanish Inquisition a b ... d f g
## 0 spam 0.70 1.0 ... A 2021-01-01 [spam]
## 1 spam 0.29 2.5 ... B 2022-02-02 [bacon, spam]
## 2 bacon 0.23 NaN ... C 2023-03-03 None
## 3 spam 0.55 4.0 ... None 2024-04-04 [eggs, bacon, spam]
## 4 eggs 0.72 NaN ... E 2025-05-05 [ham]
##
## [5 rows x 7 columns]
There is also an option to drop the current index altogether and to replace it with the default label sequence, i.e., 0, 1, 2, …:
df2.reset_index(drop=True)
## a b c d f g
## 0 0.70 1.0 True A 2021-01-01 [spam]
## 1 0.29 2.5 True B 2022-02-02 [bacon, spam]
## 2 0.23 NaN False C 2023-03-03 None
## 3 0.55 4.0 False None 2024-04-04 [eggs, bacon, spam]
## 4 0.72 NaN True E 2025-05-05 [ham]
Take note of the fact that reset_index, and many other methods that we have used so far, do not modify the data frame in place.
Important
We will soon get used to calling
reset_index(drop=True)
frequently: sometimes
more than once in a single series of commands.
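For instance, filtering rows retains the original labels, which we might prefer to discard right away (a small illustration based on the df defined earlier in this section):
df.loc[df.a > 0.5, :].reset_index(drop=True)  # filter, then renumber the rows from 0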
Exercise
Use the pandas.DataFrame.rename method to change the name of the a column in df to spam.
Also, a hierarchical index – one that is comprised
of more than one level – is possible.
For example, here is a sorted (see Section 10.6.1) version
of df
with a new index based on two columns at the same time:
df.sort_values("e", ascending=False).set_index(["e", "c"])
## a b d f g
## e c
## spam True 0.70 1.0 A 2021-01-01 [spam]
## True 0.29 2.5 B 2022-02-02 [bacon, spam]
## False 0.55 4.0 None 2024-04-04 [eggs, bacon, spam]
## eggs True 0.72 NaN E 2025-05-05 [ham]
## bacon False 0.23 NaN C 2023-03-03 None
For the sake of readability, the consecutive repeated spams were not printed.
Hierarchical indexes might arise after aggregating data in groups. For example:
nhanes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_p_demo_bmx_2020.csv",
comment="#").rename({
"BMXBMI": "bmival",
"RIAGENDR": "gender",
"DMDBORN4": "usborn"
}, axis=1)
In Chapter 12, we will get used to writing:
res = nhanes.groupby(["gender", "usborn"])["bmival"].mean()
res # BMI by gender and US born-ness
## gender usborn
## 1 1 25.734110
## 2 27.405251
## 2 1 27.120261
## 2 27.579448
## 77 28.725000
## 99 32.600000
## Name: bmival, dtype: float64
This returned a Series object with a hierarchical index.
If we do not fancy it, reset_index comes to our rescue:
res.reset_index()
## gender usborn bmival
## 0 1 1 25.734110
## 1 1 2 27.405251
## 2 2 1 27.120261
## 3 2 2 27.579448
## 4 2 77 28.725000
## 5 2 99 32.600000
10.2. Aggregating data frames¶
Here is another toy data frame:
np.random.seed(123)
df = pd.DataFrame(dict(
u = np.round(np.random.rand(5), 2),
v = np.round(np.random.randn(5), 2),
w = ["spam", "bacon", "spam", "eggs", "sausage"]
), index=["a", "b", "c", "d", "e"])
df
## u v w
## a 0.70 0.32 spam
## b 0.29 -0.05 bacon
## c 0.23 -0.20 spam
## d 0.55 1.98 eggs
## e 0.72 -1.62 sausage
All numpy functions can be applied on individual columns, i.e., objects of the type Series, because they are vector-like.
u = df.loc[:, "u"] # extract the `u` column (gives a Series; see below)
np.quantile(u, [0, 0.5, 1])
## array([0.23, 0.55, 0.72])
Most numpy functions also work if they are fed with data frames, but we will need to extract the numeric columns manually.
uv = df.loc[:, ["u", "v"]] # select two columns (a DataFrame; see below)
np.quantile(uv, [0, 0.5, 1], axis=0)
## array([[ 0.23, -1.62],
## [ 0.55, -0.05],
## [ 0.72, 1.98]])
Sometimes the results will automatically be coerced to a Series object with the index slot set appropriately:
np.mean(uv, axis=0)
## u 0.498
## v 0.086
## dtype: float64
For convenience, many operations are also available as methods for the Series and DataFrame classes, e.g., mean, median, min, max, quantile, var, std, and skew.
df.mean(numeric_only=True)
## u 0.498
## v 0.086
## dtype: float64
df.quantile([0, 0.5, 1], numeric_only=True)
## u v
## 0.0 0.23 -1.62
## 0.5 0.55 -0.05
## 1.0 0.72 1.98
Also note the describe method, which returns a few statistics at the same time.
df.describe()
## u v
## count 5.000000 5.000000
## mean 0.498000 0.086000
## std 0.227969 1.289643
## min 0.230000 -1.620000
## 25% 0.290000 -0.200000
## 50% 0.550000 -0.050000
## 75% 0.700000 0.320000
## max 0.720000 1.980000
Exercise
Check out the pandas.DataFrame.agg method, which can apply all aggregates given by a list of functions. Compose a call equivalent to df.describe().
Note
(*) Let’s stress that above we see the bias-corrected (but still only asymptotically unbiased) version of the standard deviation, given by \(\sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2}\); compare Section 5.1. In pandas, the std methods assume ddof=1 by default, whereas we recall that numpy uses ddof=0.
np.round([u.std(), np.std(u), np.std(np.array(u)), u.std(ddof=0)], 3)
## array([0.228, 0.204, 0.204, 0.204])
This is an unfortunate inconsistency between the two packages, but please do not blame the messenger.
10.3. Transforming data frames¶
By applying the already well-known vectorised mathematical functions from numpy, we can transform each data cell and return an object of the same type as the input one.
np.exp(df.loc[:, "u"])
## a 2.013753
## b 1.336427
## c 1.258600
## d 1.733253
## e 2.054433
## Name: u, dtype: float64
np.exp(df.loc[:, ["u", "v"]])
## u v
## a 2.013753 1.377128
## b 1.336427 0.951229
## c 1.258600 0.818731
## d 1.733253 7.242743
## e 2.054433 0.197899
When applying the binary arithmetic, relational,
and logical operators on an object of the class Series
and
a scalar or a numpy vector, the operations are performed
elementwisely – a style with which we are already familiar.
For instance, here is a standardised version of the u
column:
u = df.loc[:, "u"]
(u - np.mean(u)) / np.std(u)
## a 0.990672
## b -1.020098
## c -1.314357
## d 0.255025
## e 1.088759
## Name: u, dtype: float64
Binary operators act on the elements with corresponding labels.
For two objects having identical index
slots
(this is the most common scenario), this is the same as elementwise
vectorisation. For instance:
df.loc[:, "u"] > df.loc[:, "v"] # here: elementwise comparison
## a True
## b True
## c True
## d False
## e True
## dtype: bool
For transforming many numerical columns at once, it is worthwhile either to convert them to a numeric matrix explicitly and then use the basic numpy functions:
uv = np.array(df.loc[:, ["u", "v"]])
uv2 = (uv-np.mean(uv, axis=0))/np.std(uv, axis=0)
uv2
## array([[ 0.99067229, 0.20286225],
## [-1.0200982 , -0.11790285],
## [-1.3143573 , -0.24794275],
## [ 0.25502455, 1.64197052],
## [ 1.08875866, -1.47898717]])
or to use the pandas.DataFrame.apply method which invokes a given function on each column separately:
uv2 = df.loc[:, ["u", "v"]].apply(lambda x: (x-np.mean(x))/np.std(x))
uv2
## u v
## a 0.990672 0.202862
## b -1.020098 -0.117903
## c -1.314357 -0.247943
## d 0.255025 1.641971
## e 1.088759 -1.478987
Anticipating what we cover in the next section,
in both cases, we can write df.loc[:, ["u", "v"]] = uv2
to replace the old content. Also, new columns can be added based on the
transformed versions of the existing ones. For instance:
df.loc[:, "uv_squared"] = (df.loc[:, "u"] * df.loc[:, "v"])**2
df
## u v w uv_squared
## a 0.70 0.32 spam 0.050176
## b 0.29 -0.05 bacon 0.000210
## c 0.23 -0.20 spam 0.002116
## d 0.55 1.98 eggs 1.185921
## e 0.72 -1.62 sausage 1.360489
(*) Binary operations on objects with different index slots are vectorised labelwisely:
x = pd.Series([1, 10, 1000, 10000, 100000], index=["a", "b", "a", "a", "c"])
x
## a 1
## b 10
## a 1000
## a 10000
## c 100000
## dtype: int64
y = pd.Series([1, 2, 3, 4, 5], index=["b", "b", "a", "d", "c"])
y
## b 1
## b 2
## a 3
## d 4
## c 5
## dtype: int64
And now:
x * y
## a 3.0
## a 3000.0
## a 30000.0
## b 10.0
## b 20.0
## c 500000.0
## d NaN
## dtype: float64
Here, each element in the first Series labelled a was multiplied by each (there was only one) element labelled a in the second Series. For d, there were no matches; hence, the result is marked as missing; compare Chapter 15. Thus, it behaves like a full outer join-type operation; see Section 10.6.3.
The above is different from elementwise vectorisation in numpy:
np.array(x) * np.array(y)
## array([ 1, 20, 3000, 40000, 500000])
Labelwise vectorisation can be useful in certain contexts. However, we need to be aware of this (yet another) incompatibility between the two packages.
10.4. Indexing Series objects¶
Recall that each DataFrame and Series object is equipped with a slot called index, which is an object of the class Index (or a subclass thereof), giving the row and element labels, respectively. It turns out that we may apply the index operator, [...], to subset these objects not only through the indexers known from the numpy part (e.g., numerical ones, i.e., by position) but also through ones that pinpoint the items via their labels. We thus have quite a lot of index-ing to discuss.
For illustration, we will be playing with two objects of the type Series:
np.random.seed(123)
b = pd.Series(np.round(np.random.rand(10), 2))
b.index = np.random.permutation(np.arange(10))
b
## 2 0.70
## 1 0.29
## 8 0.23
## 7 0.55
## 9 0.72
## 4 0.42
## 5 0.98
## 6 0.68
## 3 0.48
## 0 0.39
## dtype: float64
and:
c = b.copy()
c.index = list("abcdefghij")
c
## a 0.70
## b 0.29
## c 0.23
## d 0.55
## e 0.72
## f 0.42
## g 0.98
## h 0.68
## i 0.48
## j 0.39
## dtype: float64
They consist of the same values, in the same order, but have different labels (index slots). In particular, b’s labels are integers that do not match the physical element positions (where 0 would denote the first element, etc.).
Important
For numpy vectors, we had four different indexing schemes: via a scalar (extracts an element at a given position), a slice, an integer vector, and a logical vector. Series objects are additionally labelled. Therefore, they can also be accessed through the contents of the index slot.
10.4.1. Do not use [...] directly (in the current version of pandas)¶
Applying the index operator, [...], directly on Series is currently not a wise idea:
b[0] # do not use it
## 0.39
b[ [0] ] # do not use it
## 0 0.39
## dtype: float64
Neither of them selects the first item; both refer to the element(s) labelled 0.
However, the undermentioned two calls fall back to position-based
indexing.
b[:1] # do not use it: it will change in the future!
## 2 0.7
## dtype: float64
c[0] # there is no label `0` (do not use it: it will change in the future!)
## 0.7
Confusing? Well, with some self-discipline, the solution is easy:
Important
In the current version of pandas, we recommend abstaining from applying [...] directly on Series and DataFrame objects. We should explicitly refer to the loc[...] and iloc[...] accessors for the label- and position-based filtering, respectively.
In the future, a direct call to [...] on Series will use label-based indexing (as we will see below, b[:5] will hence not mean “select the first five rows”). At the same time, [...] on DataFrame currently serves as a label-based selector of columns only.
10.4.2. loc[...]¶
Series.loc[...] implements label-based indexing.
b.loc[0]
## 0.39
This returned the element labelled 0. On the other hand, c.loc[0] will raise a KeyError because c consists of string labels only. But in this case, we can write:
c.loc["j"]
## 0.39
Next, we can use lists of labels to select a subset.
b.loc[ [0, 1, 0] ]
## 0 0.39
## 1 0.29
## 0 0.39
## dtype: float64
c.loc[ ["j", "b", "j"] ]
## j 0.39
## b 0.29
## j 0.39
## dtype: float64
The result is always of the type Series.
Slicing behaves differently: the range is inclusive (sic![4]) on both sides:
b.loc[1:7]
## 1 0.29
## 8 0.23
## 7 0.55
## dtype: float64
b.loc[0:4:-1]
## 0 0.39
## 3 0.48
## 6 0.68
## 5 0.98
## 4 0.42
## dtype: float64
c.loc["d":"g"]
## d 0.55
## e 0.72
## f 0.42
## g 0.98
## dtype: float64
They return all elements between the two indicated labels.
Note
Be careful: if there are repeated labels, then all (sic![5]) the matching items will be returned:
d = pd.Series([1, 2, 3, 4], index=["a", "b", "a", "c"])
d.loc["a"]
## a 1
## a 3
## dtype: int64
The result is not a scalar but a Series object.
10.4.3. iloc[...]¶
Here are some examples of position-based indexing with the iloc[...] accessor. It is worth stressing that, fortunately, its behaviour is consistent with its numpy counterpart, i.e., the ordinary square brackets applied on objects of the class ndarray. For example:
b.iloc[0] # the same: c.iloc[0]
## 0.7
returns the first element.
b.iloc[1:7]  # the same: c.iloc[1:7]
## 1 0.29
## 8 0.23
## 7 0.55
## 9 0.72
## 4 0.42
## 5 0.98
## dtype: float64
returns the second, third, …, seventh element (not including b.iloc[7], i.e., the eighth one).
10.4.4. Logical indexing¶
Indexing using a logical vector-like object is also available. For this purpose, we will usually be using loc[...] with either a logical Series object whose index slot is identical to that of the subsetted object, or a Boolean numpy vector.
b.loc[(b > 0.4) & (b < 0.6)]
## 7 0.55
## 4 0.42
## 3 0.48
## dtype: float64
For iloc[...], the indexer must be unlabelled, e.g., an ordinary numpy vector.
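For example, one possible workaround is to strip the logical mask of its labels first:
b.iloc[((b > 0.4) & (b < 0.6)).to_numpy()]  # an unlabelled boolean mask
## 7    0.55
## 4    0.42
## 3    0.48
## dtype: float64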
10.5. Indexing data frames¶
10.5.1. loc[...] and iloc[...]¶
For data frames, iloc and loc can be applied too. Now, however, they require two arguments: a row and a column selector. For example:
np.random.seed(123)
df = pd.DataFrame(dict(
u = np.round(np.random.rand(5), 2),
v = np.round(np.random.randn(5), 2),
w = ["spam", "bacon", "spam", "eggs", "sausage"],
x = [True, False, True, False, True]
))
And now:
df.loc[ df.loc[:, "u"] > 0.5, "u":"w" ]
## u v w
## 0 0.70 0.32 spam
## 3 0.55 1.98 eggs
## 4 0.72 -1.62 sausage
It selected the rows where the values in the u column are greater than 0.5 and then returned all columns between u and w (inclusive!).
Furthermore:
df.iloc[:3, :].loc[:, ["u", "w"]]
## u w
## 0 0.70 spam
## 1 0.29 bacon
## 2 0.23 spam
It fetched the first three rows (by position; iloc is necessary) and then selected the two indicated columns.
Compare this to:
df.loc[:3, ["u", "w"]] # df[:3, ["u", "w"]] does not even work; please don't
## u w
## 0 0.70 spam
## 1 0.29 bacon
## 2 0.23 spam
## 3 0.55 eggs
which has four (!) rows.
Important
Getting a scrambled numeric index that does not match the physical positions is not rare: for instance, in the context of data frame sorting (Section 10.6.1):
df2 = df.sort_values("v")
df2
## u v w x
## 4 0.72 -1.62 sausage True
## 2 0.23 -0.20 spam True
## 1 0.29 -0.05 bacon False
## 0 0.70 0.32 spam True
## 3 0.55 1.98 eggs False
Note how different are the following results:
df2.loc[:3, :] # up to label 3, inclusive
## u v w x
## 4 0.72 -1.62 sausage True
## 2 0.23 -0.20 spam True
## 1 0.29 -0.05 bacon False
## 0 0.70 0.32 spam True
## 3 0.55 1.98 eggs False
df2.iloc[:3, :] # always: the first three
## u v w x
## 4 0.72 -1.62 sausage True
## 2 0.23 -0.20 spam True
## 1 0.29 -0.05 bacon False
Important
We can frequently write df.u as a shorter version of df.loc[:, "u"].
This improves the readability in contexts such as:
df.loc[(df.u >= 0.5) & (df.u <= 0.7), ["u", "w"]]
## u w
## 0 0.70 spam
## 3 0.55 eggs
This accessor is, sadly, not universal. We can verify this by considering a data frame with a column named, e.g., mean: it clashes with the built-in method. As a workaround, we should either use loc[...] or rename the column, for instance, to Mean or MEAN.
Exercise
Use pandas.DataFrame.drop to select all columns except v in df.
Exercise
Use pandas.Series.isin (amongst others) to select all rows with spam and bacon on the df’s menu.
Exercise
In the tips dataset, select data on male customers where the total bills were in the \([10, 20]\) interval. Also, select Saturday and Sunday records where the tips were greater than $5.
10.5.2. Adding rows and columns¶
loc[...] can also be used to add new columns to an existing data frame:
df.loc[:, "y"] = df.loc[:, "u"]**2 # or df.loc[:, "y"] = df.u**2
df
## u v w x y
## 0 0.70 0.32 spam True 0.4900
## 1 0.29 -0.05 bacon False 0.0841
## 2 0.23 -0.20 spam True 0.0529
## 3 0.55 1.98 eggs False 0.3025
## 4 0.72 -1.62 sausage True 0.5184
Important
Notation like “df.new_column = ...” does not work. As we said, only loc and iloc are universal. For other accessors, this is not necessarily the case.
Exercise
Use pandas.DataFrame.insert to add a new column, not necessarily at the end of df.
Exercise
Add a few more rows to df, e.g., with pandas.concat (the pandas.DataFrame.append method mentioned in older materials has been removed in recent versions of pandas).
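As a starting point, consider this rough sketch (df2 and the z column are made up for illustration; note that insert works in place):
df2 = df.copy()
df2.insert(1, "z", df2.u + df2.v)  # new column at position 1 (in place!)
df2 = pd.concat((df2, df2.iloc[:2, :]), ignore_index=True)  # append the first two rows again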
10.5.3. Modifying items¶
In the current version of pandas, modifying particular elements gives a warning:
df.loc[:, "u"].iloc[0] = 7 # the same as df.u.iloc[0] = 7
## SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame
df.loc[:, "u"].iloc[0] # testing
## 7.0
In order to remedy this, it is best to create a copy of a column, modify it, and then to replace the old contents with the new ones.
u = df.loc[:, "u"].copy()
u.iloc[0] = 42 # or a whole for loop to process them all, or whatever
df.loc[:, "u"] = u
df.loc[:, "u"].iloc[0] # testing
## 42.0
10.5.4. Pseudorandom sampling and splitting¶
As a simple application of what we have covered so far, let’s consider some ways to sample several rows from an existing data frame. We may need them, e.g., when our datasets are too large to fit into memory or make data analysis too slow to run. In the most rudimentary scenarios, we can use the pandas.DataFrame.sample method, e.g., to:
select five rows, without replacement,
select 20% rows, with replacement,
rearrange all the rows.
For example:
body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_bmx_2020.csv",
comment="#")
body.sample(5, random_state=123) # 5 rows without replacement
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 4214 58.4 156.2 35.2 34.7 27.2 99.5 77.5
## 3361 73.7 161.0 36.5 34.5 29.0 107.6 98.2
## 3759 61.4 164.6 37.5 40.4 26.9 93.5 84.4
## 3733 120.4 158.8 33.5 34.6 40.5 147.2 129.3
## 1121 123.5 157.5 35.5 29.0 50.5 143.0 136.4
Notice the random_state argument, which controls the seed of the pseudorandom number generator so that we get reproducible results. Alternatively, we could call numpy.random.seed.
Exercise
Show how the three aforementioned scenarios can be implemented manually using iloc[...] and numpy.random.permutation or numpy.random.choice.
Exercise
(*) Can pandas.read_csv be used to read only a random sample of rows from a CSV file?
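As a hint for both exercises, here is a rough sketch (path is a placeholder, and the 1% acceptance rate is arbitrary):
body.iloc[np.random.choice(body.shape[0], 5, replace=False), :]  # like sample(5)
pd.read_csv(path, comment="#",
    skiprows=lambda i: i > 0 and np.random.rand() > 0.01)  # keep the header; retain ~1% of rows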
In machine learning practice, we are used to training and evaluating machine learning models on different (mutually disjoint) subsets of the whole data frame.
For instance, Section 12.3.3 mentions that we may be interested in performing the so-called training/test split (partitioning), where 80% (or 60% or 70%) of the randomly selected rows would constitute the first new data frame and the remaining 20% (or 40% or 30%, respectively) would go to the second one.
Given a data frame like:
df = body.head(10) # this is just an example
df
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 0 97.1 160.2 34.7 40.8 35.8 126.1 117.9
## 1 91.1 152.7 33.5 33.0 38.5 125.5 103.1
## 2 73.0 161.2 37.4 38.0 31.8 106.2 92.0
## 3 61.7 157.4 38.0 34.7 29.0 101.0 90.5
## 4 55.4 154.6 34.6 34.0 28.3 92.5 73.2
## 5 62.0 144.7 32.5 34.2 29.8 106.7 84.8
## 6 66.2 166.5 37.5 37.6 32.0 96.3 95.7
## 7 75.9 154.5 35.4 37.6 32.7 107.7 98.7
## 8 77.2 159.2 38.5 40.5 35.7 102.0 97.5
## 9 91.6 174.5 36.1 45.9 35.2 121.3 100.3
one way to perform the aforementioned split is to generate a random permutation of the set of row indexes:
np.random.seed(123) # reproducibility matters
idx = np.random.permutation(df.shape[0])
idx
## array([4, 0, 7, 5, 8, 3, 1, 6, 9, 2])
And then to pick the first 80% of them to construct the data frame number one:
k = int(df.shape[0]*0.8)
df.iloc[idx[:k], :]
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 4 55.4 154.6 34.6 34.0 28.3 92.5 73.2
## 0 97.1 160.2 34.7 40.8 35.8 126.1 117.9
## 7 75.9 154.5 35.4 37.6 32.7 107.7 98.7
## 5 62.0 144.7 32.5 34.2 29.8 106.7 84.8
## 8 77.2 159.2 38.5 40.5 35.7 102.0 97.5
## 3 61.7 157.4 38.0 34.7 29.0 101.0 90.5
## 1 91.1 152.7 33.5 33.0 38.5 125.5 103.1
## 6 66.2 166.5 37.5 37.6 32.0 96.3 95.7
and the remaining ones to generate the second dataset:
df.iloc[idx[k:], :]
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 9 91.6 174.5 36.1 45.9 35.2 121.3 100.3
## 2 73.0 161.2 37.4 38.0 31.8 106.2 92.0
Exercise
In the wine_quality_all dataset, leave out all but the white wines. Partition the resulting data frame randomly into three data frames: wines_train (60% of the rows), wines_validate (another 20% of the rows), and wines_test (the remaining 20%).
Exercise
Compose a function kfold which takes a data frame df and an integer \(k>1\) as arguments. Return a list of data frames resulting from randomly partitioning df into \(k\) disjoint chunks of equal (or almost equal, if that is not possible) sizes.
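A possible starting point for the latter exercise (a sketch, not a complete solution):
def kfold(df, k):
    # randomly partition df into a list of k disjoint chunks of roughly equal sizes
    idx = np.random.permutation(df.shape[0])
    return [df.iloc[chunk, :] for chunk in np.array_split(idx, k)]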
10.5.5. Hierarchical indexes (*)¶
Consider a data frame with a hierarchical index:
np.random.seed(123)
df = pd.DataFrame(dict(
year = np.repeat([2023, 2024, 2025], 4),
quarter = np.tile(["Q1", "Q2", "Q3", "Q4"], 3),
data = np.round(np.random.rand(12), 2)
)).set_index(["year", "quarter"])
df
## data
## year quarter
## 2023 Q1 0.70
## Q2 0.29
## Q3 0.23
## Q4 0.55
## 2024 Q1 0.72
## Q2 0.42
## Q3 0.98
## Q4 0.68
## 2025 Q1 0.48
## Q2 0.39
## Q3 0.34
## Q4 0.73
The index has both levels named, but this is purely for aesthetic reasons.
Indexing using loc[...]
by default relates to
the first level of the hierarchy:
df.loc[2023, :]
## data
## quarter
## Q1 0.70
## Q2 0.29
## Q3 0.23
## Q4 0.55
Note that we selected all rows corresponding to a given label and dropped (!) this level of the hierarchy.
Another example:
df.loc[ [2023, 2025], : ]
## data
## year quarter
## 2023 Q1 0.70
## Q2 0.29
## Q3 0.23
## Q4 0.55
## 2025 Q1 0.48
## Q2 0.39
## Q3 0.34
## Q4 0.73
To access deeper levels, we can use tuples as indexers:
df.loc[ (2023, "Q1"), : ]
## data 0.7
## Name: (2023, Q1), dtype: float64
df.loc[ [(2023, "Q1"), (2024, "Q3")], : ]
## data
## year quarter
## 2023 Q1 0.70
## 2024 Q3 0.98
In certain scenarios, though, it will probably be much easier
to subset a hierarchical index by using reset_index
and set_index creatively (together with
loc[...]
and pandas.Series.isin, etc.).
Let’s stress again that the colon operator, :, can only be used directly within square brackets. Nonetheless, we can always use the slice constructor to create a slice in any context:
df.loc[ (slice(None), ["Q1", "Q3"]), : ] # :, ["Q1", "Q3"]
## data
## year quarter
## 2023 Q1 0.70
## Q3 0.23
## 2024 Q1 0.72
## Q3 0.98
## 2025 Q1 0.48
## Q3 0.34
df.loc[ (slice(None, None, -1), slice("Q2", "Q3")), : ] # ::-1, "Q2":"Q3"
## data
## year quarter
## 2025 Q3 0.34
## Q2 0.39
## 2024 Q3 0.98
## Q2 0.42
## 2023 Q3 0.23
## Q2 0.29
10.6. Further operations on data frames¶
One of the many roles of data frames is to represent tables of values for their nice presentation, e.g., in reports from data analysis or research papers. Here are some functions that can aid in their formatting.
10.6.1. Sorting¶
Consider another example dataset: the yearly (for 2018) average air quality data in the Australian state of Victoria.
air = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_means.csv",
comment="#")
air = (
air.
loc[air.param_id.isin(["BPM2.5", "NO2"]), :].
reset_index(drop=True)
)
We chose two air quality parameters using pandas.Series.isin, which determines whether each element in a Series is contained in a given sequence. We could also have used set_index and loc[...] for that.
Notice that the preceding code spans many lines. We needed to enclose it in round brackets to avoid a syntax error. Alternatively, we could have used backslashes at the end of each line.
Anyway, here is the data frame:
air
## sp_name param_id value
## 0 Alphington BPM2.5 7.848758
## 1 Alphington NO2 9.558120
## 2 Altona North NO2 9.467912
## 3 Churchill BPM2.5 6.391230
## 4 Dandenong NO2 9.800705
## 5 Footscray BPM2.5 7.640948
## 6 Footscray NO2 10.274531
## 7 Geelong South BPM2.5 6.502762
## 8 Geelong South NO2 5.681722
## 9 Melbourne CBD BPM2.5 8.072998
## 10 Moe BPM2.5 6.427079
## 11 Morwell East BPM2.5 6.784596
## 12 Morwell South BPM2.5 6.512849
## 13 Morwell South NO2 5.124430
## 14 Traralgon BPM2.5 8.024735
## 15 Traralgon NO2 5.776333
sort_values is a convenient means to order the rows with respect to one criterion, be it numeric or categorical.
air.sort_values("value", ascending=False)
## sp_name param_id value
## 6 Footscray NO2 10.274531
## 4 Dandenong NO2 9.800705
## 1 Alphington NO2 9.558120
## 2 Altona North NO2 9.467912
## 9 Melbourne CBD BPM2.5 8.072998
## 14 Traralgon BPM2.5 8.024735
## 0 Alphington BPM2.5 7.848758
## 5 Footscray BPM2.5 7.640948
## 11 Morwell East BPM2.5 6.784596
## 12 Morwell South BPM2.5 6.512849
## 7 Geelong South BPM2.5 6.502762
## 10 Moe BPM2.5 6.427079
## 3 Churchill BPM2.5 6.391230
## 15 Traralgon NO2 5.776333
## 8 Geelong South NO2 5.681722
## 13 Morwell South NO2 5.124430
It is also possible to take into account more sorting criteria:
air.sort_values(["param_id", "value"], ascending=[True, False])
## sp_name param_id value
## 9 Melbourne CBD BPM2.5 8.072998
## 14 Traralgon BPM2.5 8.024735
## 0 Alphington BPM2.5 7.848758
## 5 Footscray BPM2.5 7.640948
## 11 Morwell East BPM2.5 6.784596
## 12 Morwell South BPM2.5 6.512849
## 7 Geelong South BPM2.5 6.502762
## 10 Moe BPM2.5 6.427079
## 3 Churchill BPM2.5 6.391230
## 6 Footscray NO2 10.274531
## 4 Dandenong NO2 9.800705
## 1 Alphington NO2 9.558120
## 2 Altona North NO2 9.467912
## 15 Traralgon NO2 5.776333
## 8 Geelong South NO2 5.681722
## 13 Morwell South NO2 5.124430
Here, in each group of identical parameters, we get a decreasing order with respect to the value.
Exercise
Compare the ordering with respect to param_id and value vs that with respect to value and then param_id.
Note
(*) Lamentably, DataFrame.sort_values by default does not use a stable algorithm. If a data frame is sorted with respect to one criterion, and then we reorder it with respect to another one, tied observations are not guaranteed to be listed in the original order:
(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_means.csv",
comment="#")
.sort_values("sp_name")
.sort_values("param_id")
.set_index("param_id")
.loc[["BPM2.5", "NO2"], :]
.reset_index())
## param_id sp_name value
## 0 BPM2.5 Melbourne CBD 8.072998
## 1 BPM2.5 Moe 6.427079
## 2 BPM2.5 Footscray 7.640948
## 3 BPM2.5 Morwell East 6.784596
## 4 BPM2.5 Churchill 6.391230
## 5 BPM2.5 Morwell South 6.512849
## 6 BPM2.5 Traralgon 8.024735
## 7 BPM2.5 Alphington 7.848758
## 8 BPM2.5 Geelong South 6.502762
## 9 NO2 Morwell South 5.124430
## 10 NO2 Traralgon 5.776333
## 11 NO2 Geelong South 5.681722
## 12 NO2 Altona North 9.467912
## 13 NO2 Alphington 9.558120
## 14 NO2 Dandenong 9.800705
## 15 NO2 Footscray 10.274531
We lost the ordering based on station names in the two subgroups. To switch to a mergesort-like method (timsort), we should pass kind="stable".
(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_means.csv",
comment="#")
.sort_values("sp_name")
.sort_values("param_id", kind="stable") # !
.set_index("param_id")
.loc[["BPM2.5", "NO2"], :]
.reset_index())
## param_id sp_name value
## 0 BPM2.5 Alphington 7.848758
## 1 BPM2.5 Churchill 6.391230
## 2 BPM2.5 Footscray 7.640948
## 3 BPM2.5 Geelong South 6.502762
## 4 BPM2.5 Melbourne CBD 8.072998
## 5 BPM2.5 Moe 6.427079
## 6 BPM2.5 Morwell East 6.784596
## 7 BPM2.5 Morwell South 6.512849
## 8 BPM2.5 Traralgon 8.024735
## 9 NO2 Alphington 9.558120
## 10 NO2 Altona North 9.467912
## 11 NO2 Dandenong 9.800705
## 12 NO2 Footscray 10.274531
## 13 NO2 Geelong South 5.681722
## 14 NO2 Morwell South 5.124430
## 15 NO2 Traralgon 5.776333
Exercise
(*) Perform identical reorderings, but using only loc[...], iloc[...], and numpy.argsort.
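As a hint, numpy.argsort yields an ordering permutation that we can feed to iloc[...] (a sketch; kind="stable" guards against reshuffling tied elements):
order = np.argsort(air.value.to_numpy(), kind="stable")
air.iloc[order, :]        # increasing with respect to `value`
air.iloc[order[::-1], :]  # decreasing (note: reversing no longer preserves the order of ties)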
10.6.2. Stacking and unstacking (long/tall and wide forms)¶
Let’s discuss some further ways to transform data frames that benefit from, make sense thanks to, or are possible because they can have columns of various types.
The air
dataset is in the long (tall) format.
All measurements are stacked one after/below another.
Such a form is convenient for data storage,
especially where there are only a few recorded values
but many possible combinations of levels (sparse data).
The long format might not be optimal in all data processing activities, though; compare [99]. In the part of this book on matrix processing, it was much more natural for us to have a single observation in each row (e.g., data for each measurement station).
We can unstack the air
data frame easily:
air_wide = air.set_index(["sp_name", "param_id"]).unstack().loc[:, "value"]
air_wide
## param_id BPM2.5 NO2
## sp_name
## Alphington 7.848758 9.558120
## Altona North NaN 9.467912
## Churchill 6.391230 NaN
## Dandenong NaN 9.800705
## Footscray 7.640948 10.274531
## Geelong South 6.502762 5.681722
## Melbourne CBD 8.072998 NaN
## Moe 6.427079 NaN
## Morwell East 6.784596 NaN
## Morwell South 6.512849 5.124430
## Traralgon 8.024735 5.776333
This is the so-called wide format.
A missing value is denoted by NaN
(not-a-number);
see Section 15.1 for more details.
Interestingly, we obtained a hierarchical index in the columns
(sic!) slot.
Hence, to drop the last level of the hierarchy,
we had to add the loc[...]
part.
Also notice that the index
and columns
slots are named.
The other way around, we can use the stack method:
air_wide.T.rename_axis(index="location", columns="param").\
stack().rename("value").reset_index()
## location param value
## 0 BPM2.5 Alphington 7.848758
## 1 BPM2.5 Churchill 6.391230
## 2 BPM2.5 Footscray 7.640948
## 3 BPM2.5 Geelong South 6.502762
## 4 BPM2.5 Melbourne CBD 8.072998
## 5 BPM2.5 Moe 6.427079
## 6 BPM2.5 Morwell East 6.784596
## 7 BPM2.5 Morwell South 6.512849
## 8 BPM2.5 Traralgon 8.024735
## 9 NO2 Alphington 9.558120
## 10 NO2 Altona North 9.467912
## 11 NO2 Dandenong 9.800705
## 12 NO2 Footscray 10.274531
## 13 NO2 Geelong South 5.681722
## 14 NO2 Morwell South 5.124430
## 15 NO2 Traralgon 5.776333
We used the data frame transpose (T) to get a location-major order (a less boring outcome in this context). Missing values are gone now: we do not need them anymore. Nevertheless, passing dropna=False would help us identify the combinations of location and param for which the readings are not provided.
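On a side note, similar long-to-wide and wide-to-long conversions are available via the pandas.DataFrame.pivot and melt methods; for instance, the following sketch should reproduce the unstacking above:
air.pivot(index="sp_name", columns="param_id", values="value")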
10.6.3. Joining (merging)¶
In database design, it is common to normalise the datasets. We do this to avoid the duplication of information and pathologies stemming from them (e.g., [21]).
The air quality parameters are separately described in another data frame:
param = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/air_quality_2018_param.csv",
comment="#")
param = param.rename(dict(param_std_unit_of_measure="unit"), axis=1)
param
## param_id param_name unit param_short_name
## 0 API Airborne particle index none Visibility Reduction
## 1 BPM2.5 BAM Particles < 2.5 micron ug/m3 PM2.5
## 2 CO Carbon Monoxide ppm CO
## 3 HPM10 Hivol PM10 ug/m3 NaN
## 4 NO2 Nitrogen Dioxide ppb NO2
## 5 O3 Ozone ppb O3
## 6 PM10 TEOM Particles <10micron ug/m3 PM10
## 7 PPM2.5 Partisol PM2.5 ug/m3 NaN
## 8 SO2 Sulfur Dioxide ppb SO2
We could have stored them alongside the air data frame, but that would be a waste of space. Also, if we wanted to modify some datum (note, e.g., the annoying double space in param_name for BPM2.5), we would have to update all the relevant records. Instead, we can always match the records in both data frames that have the same param_ids, and join (merge) these datasets only when we really need this.
Let’s discuss the possible join operations by studying two toy datasets:
A = pd.DataFrame({
"x": ["a0", "a1", "a2", "a3"],
"y": ["b0", "b1", "b2", "b3"]
})
A
## x y
## 0 a0 b0
## 1 a1 b1
## 2 a2 b2
## 3 a3 b3
and:
B = pd.DataFrame({
"x": ["a0", "a2", "a2", "a4"],
"z": ["c0", "c1", "c2", "c3"]
})
B
## x z
## 0 a0 c0
## 1 a2 c1
## 2 a2 c2
## 3 a4 c3
They both have one column somewhat in common, x.
The inner (natural) join returns the records that have a match in both datasets:
pd.merge(A, B, on="x")
## x y z
## 0 a0 b0 c0
## 1 a2 b2 c1
## 2 a2 b2 c2
The left join of \(A\) with \(B\) guarantees to return all the records from \(A\), even those which are not matched by anything in \(B\).
pd.merge(A, B, how="left", on="x")
## x y z
## 0 a0 b0 c0
## 1 a1 b1 NaN
## 2 a2 b2 c1
## 3 a2 b2 c2
## 4 a3 b3 NaN
The right join of \(A\) with \(B\) is the same as the left join of \(B\) with \(A\):
pd.merge(A, B, how="right", on="x")
## x y z
## 0 a0 b0 c0
## 1 a2 b2 c1
## 2 a2 b2 c2
## 3 a4 NaN c3
Finally, the full outer join is the set-theoretic union of the left and the right join:
pd.merge(A, B, how="outer", on="x")
## x y z
## 0 a0 b0 c0
## 1 a1 b1 NaN
## 2 a2 b2 c1
## 3 a2 b2 c2
## 4 a3 b3 NaN
## 5 a4 NaN c3
Exercise
Join air_quality_2018_value with air_quality_2018_point, and then with air_quality_2018_param.
Exercise
Normalise air_quality_2018 so that you get the three separate data frames mentioned in the previous exercise (value, point, and param).
Exercise
(*) In the National Health and Nutrition Examination Survey, each participant is uniquely identified by their sequence number (SEQN). This token is mentioned in numerous NHANES datasets, e.g., the demographics and body measures tables that we played with earlier in this chapter. Join a few chosen datasets that you find interesting.
10.6.4. Set-theoretic operations and removing duplicates¶
Here are two not at all disjoint sets of imaginary persons:
A = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/some_birth_dates1.csv",
comment="#")
A
## Name BirthDate
## 0 Paitoon Ornwimol 26.06.1958
## 1 Antónia Lata 20.05.1935
## 2 Bertoldo Mallozzi 17.08.1972
## 3 Nedeljko Bukv 19.12.1921
## 4 Micha Kitchen 17.09.1930
## 5 Mefodiy Shachar 01.10.1914
## 6 Paul Meckler 29.09.1968
## 7 Katarzyna Lasko 20.10.1971
## 8 Åge Trelstad 07.03.1935
## 9 Duchanee Panomyaong 19.06.1952
and:
B = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/some_birth_dates2.csv",
comment="#")
B
## Name BirthDate
## 0 Hushang Naigamwala 25.08.1991
## 1 Zhen Wei 16.11.1975
## 2 Micha Kitchen 17.09.1930
## 3 Jodoc Alwin 16.11.1969
## 4 Igor Mazał 14.05.2004
## 5 Katarzyna Lasko 20.10.1971
## 6 Duchanee Panomyaong 19.06.1952
## 7 Mefodiy Shachar 01.10.1914
## 8 Paul Meckler 29.09.1968
## 9 Noe Tae-Woong 11.07.1970
## 10 Åge Trelstad 07.03.1935
In both datasets, there is a single categorical column whose elements uniquely identify each record (i.e., Name). In the language of relational databases, we would call it the primary key. In such a case, implementing the set-theoretic operations is relatively easy, as we can refer to the pandas.Series.isin method.
First, \(A\cap B\) (intersection), includes only the rows that are both in \(A\) and in \(B\):
A.loc[A.Name.isin(B.Name), :]
## Name BirthDate
## 4 Micha Kitchen 17.09.1930
## 5 Mefodiy Shachar 01.10.1914
## 6 Paul Meckler 29.09.1968
## 7 Katarzyna Lasko 20.10.1971
## 8 Åge Trelstad 07.03.1935
## 9 Duchanee Panomyaong 19.06.1952
Second, \(A\smallsetminus B\) (difference), gives all the rows that are in \(A\) but not in \(B\):
A.loc[~A.Name.isin(B.Name), :]
## Name BirthDate
## 0 Paitoon Ornwimol 26.06.1958
## 1 Antónia Lata 20.05.1935
## 2 Bertoldo Mallozzi 17.08.1972
## 3 Nedeljko Bukv 19.12.1921
Third, \(A\cup B\) (union), returns the rows that exist in \(A\) or are in \(B\):
pd.concat((A, B.loc[~B.Name.isin(A.Name), :]))
## Name BirthDate
## 0 Paitoon Ornwimol 26.06.1958
## 1 Antónia Lata 20.05.1935
## 2 Bertoldo Mallozzi 17.08.1972
## 3 Nedeljko Bukv 19.12.1921
## 4 Micha Kitchen 17.09.1930
## 5 Mefodiy Shachar 01.10.1914
## 6 Paul Meckler 29.09.1968
## 7 Katarzyna Lasko 20.10.1971
## 8 Åge Trelstad 07.03.1935
## 9 Duchanee Panomyaong 19.06.1952
## 0 Hushang Naigamwala 25.08.1991
## 1 Zhen Wei 16.11.1975
## 3 Jodoc Alwin 16.11.1969
## 4 Igor Mazał 14.05.2004
## 9 Noe Tae-Woong 11.07.1970
There are no duplicate rows in any of the above outputs.
Exercise
Determine \((A\cup B) \smallsetminus (A\cap B) = (A\smallsetminus B)\cup(B\smallsetminus A)\) (symmetric difference).
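As a hint, one possible approach binds the two one-sided differences together:
pd.concat((
    A.loc[~A.Name.isin(B.Name), :],  # A minus B
    B.loc[~B.Name.isin(A.Name), :]   # B minus A
))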
Exercise
(*) Determine the union, intersection, and difference of the wine_sample1 and wine_sample2 datasets, where there is no column uniquely identifying the observations. Hint: consider using pandas.concat and pandas.DataFrame.duplicated or pandas.DataFrame.drop_duplicates.
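For instance, here is a sketch assuming that X and Y are two hypothetical data frames with identical column sets and no duplicate rows within themselves:
XY = pd.concat((X, Y))
XY.drop_duplicates()                                    # union
XY.loc[XY.duplicated(keep=False), :].drop_duplicates()  # intersection
in_both = XY.duplicated(keep=False).to_numpy()[:X.shape[0]]
X.loc[~in_both, :]                                      # difference: X minus Y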
10.6.5. …and (too) many more¶
Looking at the list of methods for the DataFrame and Series classes in the pandas package’s documentation, we can see that they are abundant. Together with the object-orientated syntax, we will often find ourselves appreciating the high readability of even complex operation chains such as data.drop_duplicates().groupby(["year", "month"]).mean().reset_index(); see Chapter 12.
Nevertheless, the methods are probably too plentiful for our taste. Their developers were overgenerous. They wanted to include a list of all the possible verbs related to data analysis, even if many of them can be trivially expressed by a composition of two or three simpler operations from numpy or scipy instead.
As strong advocates of minimalism, we would rather save ourselves from being overloaded with too much new information. This is why our focus in this book is on developing the most transferable[6] skills. Our approach is also slightly more hygienic. We do not want the reader to develop a hopeless mindset, the habit of looking everything up on the internet when faced with even the simplest kinds of problems. We have brains for a reason.
10.7. Exercises¶
How are data frames different from matrices?
What are the use cases of the name slot in Series and Index objects?
What is the purpose of set_index and reset_index?
Why is learning numpy crucial for someone who wants to become a proficient user of pandas?
What is the difference between iloc[...] and loc[...]?
Why is applying the index operator [...] directly on a Series or DataFrame object discouraged?
What is the difference between index, Index, and columns?
How can we compute the arithmetic mean and median of all the numeric columns in a data frame, using a single line of code?
What is a training/test split and how to perform it using numpy and pandas?
What is the difference between stacking and unstacking? Which one yields a wide (as opposed to long) format?
Name different data frame join (merge) operations and explain how they work.
How does sorting with respect to more than one criterion work?
Name the set-theoretic operations on data frames.