10. Introducing Data Frames
numpy arrays are an extremely versatile tool for performing data analysis and other numerical computations of various kinds. Although theoretically possible otherwise, in practice we only store elements of the same type therein, most often numbers.
pandas [McK17] is amongst over one hundred thousand open-source packages and repositories that use numpy to provide additional data wrangling functionality. It was originally written by Wes McKinney and was heavily inspired by the data.frame objects in S and R as well as tables in relational (think: SQL) databases and spreadsheets.
It is customary to load this package with the following alias:
import pandas as pd
The pandas package delivers a few classes, of which the most important are:
DataFrame – for representing tabular data (matrix-like) with columns of possibly different types, in particular a mix of numerical and categorical variables,
Series – vector-like objects for representing individual columns,
Index – vector-like (usually) objects for labelling individual rows and columns of DataFrames and items in Series objects,
together with many methods for:
transforming/aggregating/processing data, also in groups determined by categorical variables or products thereof,
reshaping (e.g., from wide to long format) and joining datasets,
importing/exporting data from/to various sources and formats, e.g., CSV and HDF5 files or relational databases,
handling missing data,
which we introduce in this and further chapters.
10.1. Creating Data Frames
Important
Let us repeat: pandas is built on top of numpy, and most objects therein can be processed by numpy functions as well. Other functions, e.g., in seaborn or sklearn, accept both DataFrame and ndarray objects, but convert the former to the latter internally so that the data can be processed by fast C/C++/Fortran routines.
What we have learnt so far still applies.
But there is of course more, hence this part.
Data frames can be created, amongst other ways, using the DataFrame class constructor, which can be fed, for example, with a numpy matrix:
np.random.seed(123)
pd.DataFrame(
np.random.rand(4, 3),
columns=["a", "b", "c"]
)
## a b c
## 0 0.696469 0.286139 0.226851
## 1 0.551315 0.719469 0.423106
## 2 0.980764 0.684830 0.480932
## 3 0.392118 0.343178 0.729050
Notice that rows and columns are labelled (and how readable that is).
A dictionary of vector-like objects of equal lengths is another common option:
np.random.seed(123)
pd.DataFrame(dict(
u=np.round(np.random.rand(5), 2),
v=[True, True, False, False, True],
w=["A", "B", "C", "D", "E"],
x=["spam", "spam", "bacon", "spam", "eggs"]
))
## u v w x
## 0 0.70 True A spam
## 1 0.29 True B spam
## 2 0.23 False C bacon
## 3 0.55 False D spam
## 4 0.72 True E eggs
This illustrates the possibility of having columns of different types.
Check out the pandas.DataFrame.from_dict and pandas.DataFrame.from_records methods in the documentation. Use them to create some example data frames.
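For instance, here is one possible way to start (a minimal sketch; the column names and values below are made up purely for illustration):
pd.DataFrame.from_dict(dict(a=[1, 2, 3], b=["x", "y", "z"]))  # dictionary keys become column names
pd.DataFrame.from_records([(1, "x"), (2, "y"), (3, "z")], columns=["a", "b"])  # one tuple per row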
Further, data frames can be read from files in different formats, for instance, CSV:
body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching_data/master/marek/nhanes_adult_female_bmx_2020.csv",
comment="#")
body.head() # display first few rows (5 by default)
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 0 97.1 160.2 34.7 40.8 35.8 126.1 117.9
## 1 91.1 152.7 33.5 33.0 38.5 125.5 103.1
## 2 73.0 161.2 37.4 38.0 31.8 106.2 92.0
## 3 61.7 157.4 38.0 34.7 29.0 101.0 90.5
## 4 55.4 154.6 34.6 34.0 28.3 92.5 73.2
Reading from URLs and local files (compare Section 13.6.1) is of course supported.
Check out the other pandas.read_* functions in the pandas documentation – some of which we shall be discussing later.
10.1.1. Data Frames are Matrix-Like
Data frames are modelled after numpy matrices, hence we can feel quite at home with them.
For example, given:
np.random.seed(123)
df = pd.DataFrame(dict(
u=np.round(np.random.rand(5), 2),
v=[True, True, False, False, True],
w=["A", "B", "C", "D", "E"],
x=["spam", "spam", "bacon", "spam", "eggs"]
))
df
## u v w x
## 0 0.70 True A spam
## 1 0.29 True B spam
## 2 0.23 False C bacon
## 3 0.55 False D spam
## 4 0.72 True E eggs
it is easy to fetch the number of rows and columns:
df.shape
## (5, 4)
or the type of each column:
df.dtypes
## u float64
## v bool
## w object
## x object
## dtype: object
Recall that arrays are equipped with the dtype slot.
10.1.2. Series
There is a separate class for storing individual data frame columns: it is called Series.
s = df.loc[:, "u"]  # extract the `u` column; alternatively: df.u
s
## 0 0.70
## 1 0.29
## 2 0.23
## 3 0.55
## 4 0.72
## Name: u, dtype: float64
A data frame with one column is printed out slightly differently:
s.to_frame()
## u
## 0 0.70
## 1 0.29
## 2 0.23
## 3 0.55
## 4 0.72
Indexing will of course be discussed later.
Important
It is crucial to know when we are dealing with a Series and when with a DataFrame object, because each of them defines a slightly different set of methods.
For example:
s.mean()
## 0.49800000000000005
refers to pandas.Series.mean (which returns a scalar), whereas
df.mean(numeric_only=True)
## u 0.498
## v 0.600
## dtype: float64
uses pandas.DataFrame.mean (which yields a Series).
Look up the two methods in the pandas manual. Note that their argument list is slightly different.
Series are wrappers around numpy arrays.
s.values
## array([0.7 , 0.29, 0.23, 0.55, 0.72])
Most importantly, numpy functions can be called directly on them:
np.mean(s)
## 0.49800000000000005
Hence, what we have covered in the part dealing with vector processing still holds for data frame columns as well (there will be more, stay tuned).
Series can also be named.
s.name
## 'u'
This is convenient when we convert from/to a data frame.
s.rename("spam").to_frame()
## spam
## 0 0.70
## 1 0.29
## 2 0.23
## 3 0.55
## 4 0.72
10.1.3. Index
Another important class is Index, which is used for encoding the row and column names in a data frame:
df.index # row labels
## RangeIndex(start=0, stop=5, step=1)
The above represents a sequence (0, 1, 2, 3, 4).
df.columns # column labels
## Index(['u', 'v', 'w', 'x'], dtype='object')
Also, we can label the individual elements in Series objects:
s.index
## RangeIndex(start=0, stop=5, step=1)
A quite frequent operation, namely making a data frame column act as the vector of row labels, is implemented in the set_index method for data frames:
df2 = df.set_index("x")
df2
## u v w
## x
## spam 0.70 True A
## spam 0.29 True B
## bacon 0.23 False C
## spam 0.55 False D
## eggs 0.72 True E
This Index object is named:
df2.index.name
## 'x'
which is handy when we decide that we want to convert the vector of row labels back to a standalone column:
df2.reset_index()
## x u v w
## 0 spam 0.70 True A
## 1 spam 0.29 True B
## 2 bacon 0.23 False C
## 3 spam 0.55 False D
## 4 eggs 0.72 True E
There is also an option to get rid of the current .index and to replace it with the default label sequence, i.e., 0, 1, 2, …:
df2.reset_index(drop=True)
## u v w
## 0 0.70 True A
## 1 0.29 True B
## 2 0.23 False C
## 3 0.55 False D
## 4 0.72 True E
Important
In due course, we will be calling
df.reset_index(drop=True)
quite frequently, sometimes
more than once in a single chain of commands.
Use the pandas.DataFrame.rename method to change the name of the u column in df to spam.
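One possible solution (a sketch; note that rename returns a new data frame rather than modifying df in place):
df.rename(columns=dict(u="spam"))  # or, equivalently: df.rename(dict(u="spam"), axis=1)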
Also, a hierarchical index – one that consists of two or more levels – is possible:
df.sort_values("x", ascending=False).set_index(["x", "v"])
## u w
## x v
## spam True 0.70 A
## True 0.29 B
## False 0.55 D
## eggs True 0.72 E
## bacon False 0.23 C
Note that a group of three consecutive spams does not have the labels repeated – this is for increased readability.
Important
We will soon see that hierarchical indexes might arise after aggregating data in groups (via the groupby method).
For example:
nhanes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching_data/master/marek/nhanes_p_demo_bmx_2020.csv",
comment="#")
res = nhanes.groupby(["RIAGENDR", "DMDEDUC2"])["BMXBMI"].mean()
res # BMI by gender and education
## RIAGENDR DMDEDUC2
## 1 1.0 28.765244
## 2.0 28.534737
## 3.0 29.637793
## 4.0 30.323158
## 5.0 28.689736
## 9.0 28.080000
## 2 1.0 30.489032
## 2.0 31.251247
## 3.0 30.969200
## 4.0 31.471476
## 5.0 28.891357
## 7.0 30.200000
## 9.0 33.950000
## Name: BMXBMI, dtype: float64
But let us fret not, as there is always:
res.reset_index()
## RIAGENDR DMDEDUC2 BMXBMI
## 0 1 1.0 28.765244
## 1 1 2.0 28.534737
## 2 1 3.0 29.637793
## 3 1 4.0 30.323158
## 4 1 5.0 28.689736
## 5 1 9.0 28.080000
## 6 2 1.0 30.489032
## 7 2 2.0 31.251247
## 8 2 3.0 30.969200
## 9 2 4.0 31.471476
## 10 2 5.0 28.891357
## 11 2 7.0 30.200000
## 12 2 9.0 33.950000
10.2. Aggregating Data Frames
Here is an example data frame:
np.random.seed(123)
d = pd.DataFrame(dict(
u = np.round(np.random.rand(5), 2),
v = np.round(np.random.randn(5), 2),
w = ["spam", "bacon", "spam", "eggs", "sausage"]
), index=["a", "b", "c", "d", "e"])
d
## u v w
## a 0.70 0.32 spam
## b 0.29 -0.05 bacon
## c 0.23 -0.20 spam
## d 0.55 1.98 eggs
## e 0.72 -1.62 sausage
First, all numpy functions can be applied directly on individual columns (i.e., objects of type Series), as the latter are, after all, vector-like.
np.quantile(d.loc[:, "u"], [0, 0.5, 1])
## array([0.23, 0.55, 0.72])
For more details on loc[...]-style indexing, see below. For the time being, the meaning of the above should be clear from the context.
Most numpy functions also work if they are fed with data frames, but we will need to select the numeric columns manually.
np.quantile(d.loc[:, ["u", "v"]], [0, 0.5, 1], axis=0)
## array([[ 0.23, -1.62],
## [ 0.55, -0.05],
## [ 0.72, 1.98]])
Sometimes the results will automatically be coerced to a Series object with the index slot set appropriately:
np.mean(d.loc[:, ["u", "v"]], axis=0)
## u 0.498
## v 0.086
## dtype: float64
Many operations, for convenience, have also been implemented as methods for the Series and DataFrame classes, e.g., mean, median, min, max, quantile, var, std (with ddof=1 by default, interestingly), and skew.
d.loc[:, ["u", "v"]].mean(numeric_only=True)
## u 0.498
## v 0.086
## dtype: float64
d.loc[:, ["u", "v"]].quantile([0, 0.5, 1], numeric_only=True)
## u v
## 0.0 0.23 -1.62
## 0.5 0.55 -0.05
## 1.0 0.72 1.98
Also note the describe method, which returns a few statistics at the same time.
d.describe()
## u v
## count 5.000000 5.000000
## mean 0.498000 0.086000
## std 0.227969 1.289643
## min 0.230000 -1.620000
## 25% 0.290000 -0.200000
## 50% 0.550000 -0.050000
## 75% 0.700000 0.320000
## max 0.720000 1.980000
Note
(*) Let us stress that above we see the corrected-for-bias (but still only asymptotically unbiased) version of the standard deviation (ddof=1), given by \(\sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2}\); compare Section 5.1.
Check out the pandas.DataFrame.agg method that can apply all aggregating operations from a given list of functions. Write a call equivalent to d.describe().
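For instance, a rough (not exact) counterpart of d.describe() could be put together as follows; the list of aggregation names is merely an example, and reproducing the 25%/75% quantiles would require custom functions:
d.loc[:, ["u", "v"]].agg(["count", "mean", "std", "min", "median", "max"])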
10.3. Transforming Data Frames
By applying the already well-known vectorised mathematical numpy functions, we can transform each data cell and return an object of the same type as the input one.
np.exp(d.loc[:, "u"])
## a 2.013753
## b 1.336427
## c 1.258600
## d 1.733253
## e 2.054433
## Name: u, dtype: float64
np.exp(d.loc[:, ["u", "v"]])
## u v
## a 2.013753 1.377128
## b 1.336427 0.951229
## c 1.258600 0.818731
## d 1.733253 7.242743
## e 2.054433 0.197899
When applying the binary arithmetic, comparison, and logical operators on an object of class Series and a scalar or a numpy vector, the operations are performed elementwise – a style with which we are already familiar. For instance, here is a standardised version of the u column:
u = d.loc[:, "u"]
(u - np.mean(u)) / np.std(u)
## a 0.990672
## b -1.020098
## c -1.314357
## d 0.255025
## e 1.088759
## Name: u, dtype: float64
For two objects of class Series, the operators are vectorised somewhat differently from their numpy counterparts: they are applied on the pairs of elements having identical labels (i.e., labelwise). Still, if we are transforming columns that come from the same data frame (and thus have identical index slots; this is the most common scenario), there will be no surprises, because this is equivalent to the elementwise action. For instance:
d.loc[:, "u"] > d.loc[:, "v"]
## a True
## b True
## c True
## d False
## e True
## dtype: bool
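To see the labelwise alignment in action, here is a small, self-contained example (the two Series below are made up for illustration):
x = pd.Series([1, 2, 3], index=["a", "b", "c"])
y = pd.Series([10, 20, 30], index=["c", "b", "a"])
x + y  # pairs are matched by label, not by position
## a    31
## b    22
## c    13
## dtype: int64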
For transforming many numerical columns at once, it is a good idea either to convert them to a numeric matrix explicitly and then use the basic numpy functions:
uv = d.loc[:, ["u", "v"]].values
uv = (uv-np.mean(uv, axis=0))/np.std(uv, axis=0)
uv
## array([[ 0.99067229, 0.20286225],
## [-1.0200982 , -0.11790285],
## [-1.3143573 , -0.24794275],
## [ 0.25502455, 1.64197052],
## [ 1.08875866, -1.47898717]])
or to use the pandas.DataFrame.apply method which invokes a given function on each column separately:
def standardise(x): return (x-np.mean(x))/np.std(x)
uv = d.loc[:, ["u", "v"]].apply(standardise)
uv
## u v
## a 0.990672 0.202862
## b -1.020098 -0.117903
## c -1.314357 -0.247943
## d 0.255025 1.641971
## e 1.088759 -1.478987
In both cases, we can then write:
d.loc[:, ["u", "v"]] = uv
to replace the old content with the transformed one.
Also, new columns can be added based on transformed versions of the existing ones:
d.loc[:, "uv_squared"] = (d.loc[:, "u"] * d.loc[:, "v"])**2
d
## u v w uv_squared
## a 0.990672 0.202862 spam 0.040389
## b -1.020098 -0.117903 bacon 0.014465
## c -1.314357 -0.247943 spam 0.106201
## d 0.255025 1.641971 eggs 0.175346
## e 1.088759 -1.478987 sausage 2.592938
10.4. Indexing Series Objects
Recall that each DataFrame and Series object is equipped with a slot called index, which is an object of class Index (or a subclass thereof), giving the row and element labels, respectively. It turns out that we may apply the index operator, [...], to subset these objects not only through the indexers known from the numpy part (e.g., numerical ones, i.e., by position) but also through ones that pinpoint the items via their labels. That is quite a lot of index-ing. Let us study the different forms thereof in detail.
In this section, we play with two example objects of class Series:
np.random.seed(123)
b = pd.Series(np.round(np.random.uniform(0,1,10),2))
b.index = np.random.permutation(np.r_[0:10])
b
## 2 0.70
## 1 0.29
## 8 0.23
## 7 0.55
## 9 0.72
## 4 0.42
## 5 0.98
## 6 0.68
## 3 0.48
## 0 0.39
## dtype: float64
c = b.copy()
c.index = list("abcdefghij")
c
## a 0.70
## b 0.29
## c 0.23
## d 0.55
## e 0.72
## f 0.42
## g 0.98
## h 0.68
## i 0.48
## j 0.39
## dtype: float64
They consist of the same values, in the same order, but have different labels (.index slots). In particular, b's labels are integers that do not match the physical element positions (where 0 would denote the first element, etc.).
Important
For numpy vectors, we had four different indexing schemes: by scalar (extracting an element at a given position), slice, integer vector, and logical vector. Series objects are additionally labelled; therefore, they can also be accessed taking into account the contents of the .index slot.
10.4.1. Do Not Use [...] Directly
Applying the index operator, [...], directly on Series is generally not a good idea:
b[0]
## 0.39
b[ [0] ]
## 0 0.39
## dtype: float64
Neither of them selects the first item; instead, both fetch the item labelled 0.
However:
b[0:1] # slice - 0 only
## 2 0.7
## dtype: float64
and
c[0] # there is no label `0`
## 0.7
fall back to position-based indexing.
Confusing? Well, with some self-discipline, the solution is easy. Namely:
Important
To stay away from potential problems, we should never apply [...] directly on Series objects.
10.4.2. loc[...]
To avoid ambiguity (which is exactly what we want), we should be referring to Series.loc[...] and Series.iloc[...] for label- and position-based filtering, respectively. And thus:
b.loc[0]
## 0.39
returns the element labelled 0. On the other hand, c.loc[0] will raise a KeyError, because c consists of character labels only.
Next, we can use lists of labels to select a subset.
b.loc[ [0, 1, 0] ]
## 0 0.39
## 1 0.29
## 0 0.39
## dtype: float64
c.loc[ ["j", "b", "j"] ]
## j 0.39
## b 0.29
## j 0.39
## dtype: float64
The result is always of type Series.
Slicing behaves differently as the range is inclusive at both sides:
b.loc[1:7]
## 1 0.29
## 8 0.23
## 7 0.55
## dtype: float64
b.loc[0:4:-1]
## 0 0.39
## 3 0.48
## 6 0.68
## 5 0.98
## 4 0.42
## dtype: float64
c.loc["d":"g"]
## d 0.55
## e 0.72
## f 0.42
## g 0.98
## dtype: float64
return all elements between the two indicated labels.
Be careful: if there are repeated labels, then all the matching items will be returned:
d = pd.Series([1, 2, 3, 4], index=["a", "b", "a", "c"])
d.loc["a"]
## a 1
## a 3
## dtype: int64
The result is not a scalar but a Series object.
10.4.3. iloc[...]
Here are some examples of position-based indexing:
b.iloc[0] # the same: c.iloc[0]
## 0.7
returns the first element.
b.iloc[1:7] # the same: c.iloc[1:7]
## 1 0.29
## 8 0.23
## 7 0.55
## 9 0.72
## 4 0.42
## 5 0.98
## dtype: float64
returns the 2nd, 3rd, …, 7th element (not including b.iloc[7], i.e., the 8th one).
10.4.4. Logical Indexing
Indexing using a logical vector-like object is also available. With loc[...], we will usually use either a logical Series object whose .index slot is identical to that of the subsetted object, or a logical numpy vector.
b.loc[(b > 0.4) & (b < 0.6)]
## 7 0.55
## 4 0.42
## 3 0.48
## dtype: float64
For iloc[...], the indexer must be unlabelled, e.g., b.iloc[((b > 0.4) & (b < 0.6)).values].
10.5. Indexing Data Frames
10.5.1. loc[...] and iloc[...]
For data frames, iloc and loc can be used too, but they now require two arguments, serving as the row and column selectors. For example:
np.random.seed(123)
d = pd.DataFrame(dict(
u = np.round(np.random.rand(5), 2),
v = np.round(np.random.randn(5), 2),
w = ["spam", "bacon", "spam", "eggs", "sausage"],
x = [True, False, True, False, True]
), index=["a", "b", "c", "d", "e"])
And now:
d.loc[d.loc[:, "u"] > 0.5, "u":"w"]
## u v w
## a 0.70 0.32 spam
## d 0.55 1.98 eggs
## e 0.72 -1.62 sausage
selects the rows where the values in the u column are greater than 0.5 and then returns all columns between u and w.
Furthermore,
d.iloc[:3, :].loc[:, ["u", "w"]]
## u w
## a 0.70 spam
## b 0.29 bacon
## c 0.23 spam
fetches the first three rows (by position – iloc is necessary here) and then selects the two indicated columns.
Important
We can write d.u as a shorter version of d.loc[:, "u"]. This improves readability in contexts such as:
d.loc[(d.u >= 0.5) & (d.u <= 0.7), ["u", "w"]]
## u w
## a 0.70 spam
## d 0.55 eggs
However, this accessor is not universal: we can check this out by considering a data frame featuring a column named, e.g., mean.
Use pandas.DataFrame.drop to select all columns but v in d.
Use pandas.Series.isin (amongst others) to select all rows with spam and bacon on the d's menu.
In the tips dataset, select data on male customers where the total bill was in the [10, 20] interval. Also, from tips, select data from Saturday and Sunday with a tip greater than $5.
10.5.2. Adding Rows and Columns
loc[...] can also be used to add new columns to an existing data frame:
d.loc[:, "y"] = d.loc[:, "u"]**2 # or d.loc[:, "y"] = d.u**2
d
## u v w x y
## a 0.70 0.32 spam True 0.4900
## b 0.29 -0.05 bacon False 0.0841
## c 0.23 -0.20 spam True 0.0529
## d 0.55 1.98 eggs False 0.3025
## e 0.72 -1.62 sausage True 0.5184
Notation like “d.new_column = ...” will not work. As we have said, loc and iloc are universal; other accessors – not so much.
Use pandas.DataFrame.insert to add a new column not necessarily at the end of d. Use pandas.concat (pandas.DataFrame.append has been removed from recent versions of pandas) to add a few more rows to d.
10.5.3. Random Sampling
As a simple application of what we have covered so far, let us consider the case of randomly sampling a number of rows from an existing data frame.
For the most basic use cases, we can use the pandas.DataFrame.sample method. It covers scenarios such as:
randomly select 5 rows, without replacement,
randomly select 20% rows, with replacement,
randomly rearrange all the rows.
For example:
body = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching_data/master/marek/nhanes_adult_female_bmx_2020.csv",
comment="#")
body.sample(5, random_state=123) # 5 random rows without replacement
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 4214 58.4 156.2 35.2 34.7 27.2 99.5 77.5
## 3361 73.7 161.0 36.5 34.5 29.0 107.6 98.2
## 3759 61.4 164.6 37.5 40.4 26.9 93.5 84.4
## 3733 120.4 158.8 33.5 34.6 40.5 147.2 129.3
## 1121 123.5 157.5 35.5 29.0 50.5 143.0 136.4
Notice the random_state argument, which controls the seed of the random number generator so that we get reproducible results. Alternatively, we could call numpy.random.seed.
Show how the three aforementioned scenarios can be implemented manually using iloc[...] and numpy.random.permutation or numpy.random.choice.
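For example (a sketch; body is the data frame read above):
np.random.seed(123)
body.iloc[np.random.choice(body.shape[0], 5, replace=False), :]         # 5 rows, no replacement
body.iloc[np.random.choice(body.shape[0], int(0.2*body.shape[0])), :]   # ~20% of rows, with replacement
body.iloc[np.random.permutation(body.shape[0]), :]                      # all rows, randomly rearranged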
In machine learning practice, as we shall see later, we are used to training and evaluating machine learning models on different (mutually disjoint) subsets of the whole data frame.
For instance, we might be interested in performing the so-called train/test split, where 80% (or 60% or 70%) of the randomly selected rows would constitute the first new data frame and the remaining 20% (or 40% or 30%) would go to the second one.
Given a data frame like:
x = body.head(10) # this is just an example
x
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 0 97.1 160.2 34.7 40.8 35.8 126.1 117.9
## 1 91.1 152.7 33.5 33.0 38.5 125.5 103.1
## 2 73.0 161.2 37.4 38.0 31.8 106.2 92.0
## 3 61.7 157.4 38.0 34.7 29.0 101.0 90.5
## 4 55.4 154.6 34.6 34.0 28.3 92.5 73.2
## 5 62.0 144.7 32.5 34.2 29.8 106.7 84.8
## 6 66.2 166.5 37.5 37.6 32.0 96.3 95.7
## 7 75.9 154.5 35.4 37.6 32.7 107.7 98.7
## 8 77.2 159.2 38.5 40.5 35.7 102.0 97.5
## 9 91.6 174.5 36.1 45.9 35.2 121.3 100.3
one way to perform the aforementioned split is to generate a random permutation of the set of row indices:
np.random.seed(123) # reproducibility matters
idx = np.random.permutation(x.shape[0])
idx
## array([4, 0, 7, 5, 8, 3, 1, 6, 9, 2])
And then to pick the first 80% of these indices to construct the data frame number one:
k = int(x.shape[0]*0.8)
x.iloc[idx[:k], :]
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 4 55.4 154.6 34.6 34.0 28.3 92.5 73.2
## 0 97.1 160.2 34.7 40.8 35.8 126.1 117.9
## 7 75.9 154.5 35.4 37.6 32.7 107.7 98.7
## 5 62.0 144.7 32.5 34.2 29.8 106.7 84.8
## 8 77.2 159.2 38.5 40.5 35.7 102.0 97.5
## 3 61.7 157.4 38.0 34.7 29.0 101.0 90.5
## 1 91.1 152.7 33.5 33.0 38.5 125.5 103.1
## 6 66.2 166.5 37.5 37.6 32.0 96.3 95.7
and the remaining ones to generate the second dataset:
x.iloc[idx[k:], :]
## BMXWT BMXHT BMXARML BMXLEG BMXARMC BMXHIP BMXWAIST
## 9 91.6 174.5 36.1 45.9 35.2 121.3 100.3
## 2 73.0 161.2 37.4 38.0 31.8 106.2 92.0
In the wine_quality_all dataset, leave out all but the white wines. Partition the resulting data frame randomly into three data frames: wines_train (60% of the rows), wines_validate (another 20% of the rows), and wines_test (the remaining 20%).
Consider the white wines dataset again. Write a function kfold which takes a data frame x and an integer k>1 as input. Return a list of data frames resulting from randomly splitting x into k disjoint chunks of equal (or almost equal, if that is not possible) sizes.
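Here is one way this function could be sketched (numpy.array_split takes care of the almost-equal chunk sizes):
def kfold(x, k):
    # randomly split the data frame x into a list of k disjoint chunks
    idx = np.random.permutation(x.shape[0])
    return [x.iloc[chunk, :] for chunk in np.array_split(idx, k)]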
10.5.4. Hierarchical Indices (*)
Consider the following DataFrame object with a hierarchical index:
np.random.seed(123)
d = pd.DataFrame(dict(
year = sorted([2023, 2024, 2025]*4),
quarter = ["Q1", "Q2", "Q3", "Q4"]*3,
data = np.round(np.random.rand(12), 2)
)).set_index(["year", "quarter"])
d
## data
## year quarter
## 2023 Q1 0.70
## Q2 0.29
## Q3 0.23
## Q4 0.55
## 2024 Q1 0.72
## Q2 0.42
## Q3 0.98
## Q4 0.68
## 2025 Q1 0.48
## Q2 0.39
## Q3 0.34
## Q4 0.73
The index has both levels named, but this is purely for aesthetic reasons.
Indexing using loc[...] by default relates to the first level of the hierarchy:
d.loc[2023, :]
## data
## quarter
## Q1 0.70
## Q2 0.29
## Q3 0.23
## Q4 0.55
d.loc[[2023, 2025], :]
## data
## year quarter
## 2023 Q1 0.70
## Q2 0.29
## Q3 0.23
## Q4 0.55
## 2025 Q1 0.48
## Q2 0.39
## Q3 0.34
## Q4 0.73
To access deeper levels, we can use tuples as indexers:
d.loc[(2023, "Q1"), :]
## data 0.7
## Name: (2023, Q1), dtype: float64
d.loc[[(2023, "Q1"), (2024, "Q3")], :]
## data
## year quarter
## 2023 Q1 0.70
## 2024 Q3 0.98
In certain scenarios, though, it will probably be much easier to subset a hierarchical index by using reset_index and set_index creatively (together with loc[...] and pandas.Series.isin, etc.).
The `:` operator for slicing can only be used directly within square brackets. However, we can always rely on the slice constructor to create the corresponding slice objects in any context:
d.loc[(slice(None), ["Q1", "Q3"]), :] # `:`, ["Q1", "Q3"]
## data
## year quarter
## 2023 Q1 0.70
## Q3 0.23
## 2024 Q1 0.72
## Q3 0.98
## 2025 Q1 0.48
## Q3 0.34
d.loc[(slice(None, None, -1), slice("Q2", "Q3")), :] # ::-1, "Q2":"Q3"
## data
## year quarter
## 2025 Q3 0.34
## Q2 0.39
## 2024 Q3 0.98
## Q2 0.42
## 2023 Q3 0.23
## Q2 0.29
10.6. Further Operations on Data Frames
10.6.1. Sorting
Let us consider another example dataset. Here are the yearly (for 2018) average air quality data in the Australian state of Victoria.
air = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching_data/master/marek/air_quality_2018_means.csv",
comment="#")
air = (
air.
loc[air.param_id.isin(["BPM2.5", "NO2"]), :].
reset_index(drop=True)
)
air
## sp_name param_id value
## 0 Alphington BPM2.5 7.848758
## 1 Alphington NO2 9.558120
## 2 Altona North NO2 9.467912
## 3 Churchill BPM2.5 6.391230
## 4 Dandenong NO2 9.800705
## 5 Footscray BPM2.5 7.640948
## 6 Footscray NO2 10.274531
## 7 Geelong South BPM2.5 6.502762
## 8 Geelong South NO2 5.681722
## 9 Melbourne CBD BPM2.5 8.072998
## 10 Moe BPM2.5 6.427079
## 11 Morwell East BPM2.5 6.784596
## 12 Morwell South BPM2.5 6.512849
## 13 Morwell South NO2 5.124430
## 14 Traralgon BPM2.5 8.024735
## 15 Traralgon NO2 5.776333
sort_values is a convenient means to order the rows with respect to one criterion:
air.sort_values("value", ascending=False)
## sp_name param_id value
## 6 Footscray NO2 10.274531
## 4 Dandenong NO2 9.800705
## 1 Alphington NO2 9.558120
## 2 Altona North NO2 9.467912
## 9 Melbourne CBD BPM2.5 8.072998
## 14 Traralgon BPM2.5 8.024735
## 0 Alphington BPM2.5 7.848758
## 5 Footscray BPM2.5 7.640948
## 11 Morwell East BPM2.5 6.784596
## 12 Morwell South BPM2.5 6.512849
## 7 Geelong South BPM2.5 6.502762
## 10 Moe BPM2.5 6.427079
## 3 Churchill BPM2.5 6.391230
## 15 Traralgon NO2 5.776333
## 8 Geelong South NO2 5.681722
## 13 Morwell South NO2 5.124430
Sorting with respect to more criteria is also possible:
air.sort_values(["param_id", "value"], ascending=[True, False])
## sp_name param_id value
## 9 Melbourne CBD BPM2.5 8.072998
## 14 Traralgon BPM2.5 8.024735
## 0 Alphington BPM2.5 7.848758
## 5 Footscray BPM2.5 7.640948
## 11 Morwell East BPM2.5 6.784596
## 12 Morwell South BPM2.5 6.512849
## 7 Geelong South BPM2.5 6.502762
## 10 Moe BPM2.5 6.427079
## 3 Churchill BPM2.5 6.391230
## 6 Footscray NO2 10.274531
## 4 Dandenong NO2 9.800705
## 1 Alphington NO2 9.558120
## 2 Altona North NO2 9.467912
## 15 Traralgon NO2 5.776333
## 8 Geelong South NO2 5.681722
## 13 Morwell South NO2 5.124430
Here, in each group of identical parameters, we get a decreasing order with respect to the value.
(*) Compare the ordering with respect to param_id and then value vs. value and then param_id.
(**) Perform identical reorderings but using only loc[...], iloc[...], and numpy.argsort.
Note
(*) DataFrame.sort_values unfortunately uses a non-stable algorithm by default (a modified quicksort). If a data frame is sorted with respect to one criterion, and then we reorder it with respect to another one, tied observations are not guaranteed to be listed in the original order:
(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching_data/master/marek/air_quality_2018_means.csv",
comment="#")
.sort_values("sp_name")
.sort_values("param_id")
.set_index("param_id")
.loc[["BPM2.5", "NO2"], :]
.reset_index())
## param_id sp_name value
## 0 BPM2.5 Melbourne CBD 8.072998
## 1 BPM2.5 Moe 6.427079
## 2 BPM2.5 Footscray 7.640948
## 3 BPM2.5 Morwell East 6.784596
## 4 BPM2.5 Churchill 6.391230
## 5 BPM2.5 Morwell South 6.512849
## 6 BPM2.5 Traralgon 8.024735
## 7 BPM2.5 Alphington 7.848758
## 8 BPM2.5 Geelong South 6.502762
## 9 NO2 Morwell South 5.124430
## 10 NO2 Traralgon 5.776333
## 11 NO2 Geelong South 5.681722
## 12 NO2 Altona North 9.467912
## 13 NO2 Alphington 9.558120
## 14 NO2 Dandenong 9.800705
## 15 NO2 Footscray 10.274531
We have lost the ordering based on station names in the two subgroups.
To switch to a mergesort-like method (timsort), we should pass kind="stable".
(pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching_data/master/marek/air_quality_2018_means.csv",
comment="#")
.sort_values("sp_name")
.sort_values("param_id", kind="stable") # !
.set_index("param_id")
.loc[["BPM2.5", "NO2"], :]
.reset_index())
## param_id sp_name value
## 0 BPM2.5 Alphington 7.848758
## 1 BPM2.5 Churchill 6.391230
## 2 BPM2.5 Footscray 7.640948
## 3 BPM2.5 Geelong South 6.502762
## 4 BPM2.5 Melbourne CBD 8.072998
## 5 BPM2.5 Moe 6.427079
## 6 BPM2.5 Morwell East 6.784596
## 7 BPM2.5 Morwell South 6.512849
## 8 BPM2.5 Traralgon 8.024735
## 9 NO2 Alphington 9.558120
## 10 NO2 Altona North 9.467912
## 11 NO2 Dandenong 9.800705
## 12 NO2 Footscray 10.274531
## 13 NO2 Geelong South 5.681722
## 14 NO2 Morwell South 5.124430
## 15 NO2 Traralgon 5.776333
10.6.2. Stacking and Unstacking (Long and Wide Forms)
Let us discuss some further ways to transform data frames that benefit from, make sense because of, or are only possible thanks to their ability to store columns of different types.
The above air dataset is in the so-called long format, where all measurements are stacked one after/below another. Such a form is quite convenient for storing data, but not necessarily for all processing tasks; compare [Wic14]. Recall from the matrix part that a more natural way would be to have a single observation (e.g., the data for one measurement station) in each row.
We can unstack the air data frame (convert it to the wide format) quite easily:
air.set_index(["sp_name", "param_id"]).unstack().loc[:, "value"]
## param_id BPM2.5 NO2
## sp_name
## Alphington 7.848758 9.558120
## Altona North NaN 9.467912
## Churchill 6.391230 NaN
## Dandenong NaN 9.800705
## Footscray 7.640948 10.274531
## Geelong South 6.502762 5.681722
## Melbourne CBD 8.072998 NaN
## Moe 6.427079 NaN
## Morwell East 6.784596 NaN
## Morwell South 6.512849 5.124430
## Traralgon 8.024735 5.776333
The missing values are denoted with NaNs (not-a-number); see Section 15.1 for more details. Interestingly, we got a hierarchical index in the columns, hence the loc[...] part.
Equivalently:
air_wide = air.pivot(index="sp_name", columns="param_id", values="value").\
    rename_axis(index=None, columns=None)
air_wide
## BPM2.5 NO2
## Alphington 7.848758 9.558120
## Altona North NaN 9.467912
## Churchill 6.391230 NaN
## Dandenong NaN 9.800705
## Footscray 7.640948 10.274531
## Geelong South 6.502762 5.681722
## Melbourne CBD 8.072998 NaN
## Moe 6.427079 NaN
## Morwell East 6.784596 NaN
## Morwell South 6.512849 5.124430
## Traralgon 8.024735 5.776333
The rename_axis part is there so as to get rid of the name parts of the index and columns slots.
To go the other way around, we can use the stack method:
air_wide.T.rename_axis(index="param", columns="location").\
    stack().rename("value").reset_index()
##      param       location     value
## 0 BPM2.5 Alphington 7.848758
## 1 BPM2.5 Churchill 6.391230
## 2 BPM2.5 Footscray 7.640948
## 3 BPM2.5 Geelong South 6.502762
## 4 BPM2.5 Melbourne CBD 8.072998
## 5 BPM2.5 Moe 6.427079
## 6 BPM2.5 Morwell East 6.784596
## 7 BPM2.5 Morwell South 6.512849
## 8 BPM2.5 Traralgon 8.024735
## 9 NO2 Alphington 9.558120
## 10 NO2 Altona North 9.467912
## 11 NO2 Dandenong 9.800705
## 12 NO2 Footscray 10.274531
## 13 NO2 Geelong South 5.681722
## 14 NO2 Morwell South 5.124430
## 15 NO2 Traralgon 5.776333
We used the data frame transpose (T) to get a param-major order (a less boring outcome in this context).
10.6.3. Set-Theoretic Operations
Here are two not at all disjoint sets of imaginary persons:
A = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching_data/master/marek/some_birth_dates1.csv",
comment="#")
A
## Name BirthDate
## 0 Paitoon Ornwimol 26.06.1958
## 1 Antónia Lata 20.05.1935
## 2 Bertoldo Mallozzi 17.08.1972
## 3 Nedeljko Bukv 19.12.1921
## 4 Micha Kitchen 17.09.1930
## 5 Mefodiy Shachar 01.10.1914
## 6 Paul Meckler 29.09.1968
## 7 Katarzyna Lasko 20.10.1971
## 8 Åge Trelstad 07.03.1935
## 9 Duchanee Panomyaong 19.06.1952
and
B = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching_data/master/marek/some_birth_dates2.csv",
comment="#")
B
## Name BirthDate
## 0 Hushang Naigamwala 25.08.1991
## 1 Zhen Wei 16.11.1975
## 2 Micha Kitchen 17.09.1930
## 3 Jodoc Alwin 16.11.1969
## 4 Igor Mazał 14.05.2004
## 5 Katarzyna Lasko 20.10.1971
## 6 Duchanee Panomyaong 19.06.1952
## 7 Mefodiy Shachar 01.10.1914
## 8 Paul Meckler 29.09.1968
## 9 Noe Tae-Woong 11.07.1970
## 10 Åge Trelstad 07.03.1935
In both datasets, there is a single column whose elements uniquely identify each record (i.e., Name). In the language of relational databases, we would call it a primary key.
In such a case, implementing the set-theoretic operations is relatively easy:
\(A\cap B\) (intersection) – the rows that are both in A and in B:
A.loc[A.Name.isin(B.Name), :]
## Name BirthDate
## 4 Micha Kitchen 17.09.1930
## 5 Mefodiy Shachar 01.10.1914
## 6 Paul Meckler 29.09.1968
## 7 Katarzyna Lasko 20.10.1971
## 8 Åge Trelstad 07.03.1935
## 9 Duchanee Panomyaong 19.06.1952
\(A\setminus B\) (difference) – the rows that are in A but not in B:
A.loc[~A.Name.isin(B.Name), :]
## Name BirthDate
## 0 Paitoon Ornwimol 26.06.1958
## 1 Antónia Lata 20.05.1935
## 2 Bertoldo Mallozzi 17.08.1972
## 3 Nedeljko Bukv 19.12.1921
\(A\cup B\) (union) – the rows that are in A or in B:
pd.concat((A, B.loc[~B.Name.isin(A.Name), :]))
## Name BirthDate
## 0 Paitoon Ornwimol 26.06.1958
## 1 Antónia Lata 20.05.1935
## 2 Bertoldo Mallozzi 17.08.1972
## 3 Nedeljko Bukv 19.12.1921
## 4 Micha Kitchen 17.09.1930
## 5 Mefodiy Shachar 01.10.1914
## 6 Paul Meckler 29.09.1968
## 7 Katarzyna Lasko 20.10.1971
## 8 Åge Trelstad 07.03.1935
## 9 Duchanee Panomyaong 19.06.1952
## 0 Hushang Naigamwala 25.08.1991
## 1 Zhen Wei 16.11.1975
## 3 Jodoc Alwin 16.11.1969
## 4 Igor Mazał 14.05.2004
## 9 Noe Tae-Woong 11.07.1970
There are no duplicate rows in the output.
Determine \((A\cup B) \setminus (A\cap B) = (A\setminus B)\cup(B\setminus A)\) (symmetric difference).
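One possible approach (a sketch): combine the two one-sided differences.
pd.concat((
    A.loc[~A.Name.isin(B.Name), :],  # A \ B
    B.loc[~B.Name.isin(A.Name), :]   # B \ A
))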
(*) Determine the union, intersection, and difference of the wine_sample1 and wine_sample2 datasets, where there is no column uniquely identifying the observations. Hint: consider using pandas.DataFrame.duplicated or pandas.DataFrame.drop_duplicates.
10.6.4. Joining (Merging)
In database design, it is common to normalise the datasets in order to avoid the duplication of information and pathologies stemming from them (e.g., [Dat03]).
The above air quality parameters are separately described in another data frame:
param = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
"teaching_data/master/marek/air_quality_2018_param.csv",
comment="#")
param.rename(dict(param_std_unit_of_measure="unit"), axis=1)
## param_id param_name unit param_short_name
## 0 API Airborne particle index none Visibility Reduction
## 1 BPM2.5 BAM Particles < 2.5 micron ug/m3 PM2.5
## 2 CO Carbon Monoxide ppm CO
## 3 HPM10 Hivol PM10 ug/m3 NaN
## 4 NO2 Nitrogen Dioxide ppb NO2
## 5 O3 Ozone ppb O3
## 6 PM10 TEOM Particles <10micron ug/m3 PM10
## 7 PPM2.5 Partisol PM2.5 ug/m3 NaN
## 8 SO2 Sulfur Dioxide ppb SO2
We could have stored them alongside the air data frame, but that would be a waste of space. Also, if we wanted to modify some datum (note, e.g., the annoying double space in param_name for BPM2.5), we would have to update all the relevant records. Instead, we can always look the information up by param_id and join (merge) the two data frames only if we need it.
Let us discuss the possible join operations by studying the two following data sets:
A = pd.DataFrame({
"x": ["a0", "a1", "a2", "a3"],
"y": ["b0", "b1", "b2", "b3"]
})
A
## x y
## 0 a0 b0
## 1 a1 b1
## 2 a2 b2
## 3 a3 b3
and:
B = pd.DataFrame({
"x": ["a0", "a2", "a2", "a4"],
"z": ["c0", "c1", "c2", "c3"]
})
B
## x z
## 0 a0 c0
## 1 a2 c1
## 2 a2 c2
## 3 a4 c3
They both have one column somewhat in common, x.
The inner (natural) join returns the records that have a match in both datasets:
pd.merge(A, B, on="x")
## x y z
## 0 a0 b0 c0
## 1 a2 b2 c1
## 2 a2 b2 c2
The left join of A with B guarantees to return all the records from A, even those which are not matched by anything in B.
pd.merge(A, B, how="left", on="x")
## x y z
## 0 a0 b0 c0
## 1 a1 b1 NaN
## 2 a2 b2 c1
## 3 a2 b2 c2
## 4 a3 b3 NaN
The right join of A with B is the same as the left join of B with A:
pd.merge(A, B, how="right", on="x")
## x y z
## 0 a0 b0 c0
## 1 a2 b2 c1
## 2 a2 b2 c2
## 3 a4 NaN c3
Finally, the full outer join is the set-theoretic union of the left and the right join:
pd.merge(A, B, how="outer", on="x")
## x y z
## 0 a0 b0 c0
## 1 a1 b1 NaN
## 2 a2 b2 c1
## 3 a2 b2 c2
## 4 a3 b3 NaN
## 5 a4 NaN c3
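Coming back to the air quality data, the look-up mentioned at the beginning of this section can now be expressed, for instance, as a left join (a sketch that only brings in the human-readable parameter names):
pd.merge(air, param.loc[:, ["param_id", "param_name"]], how="left", on="param_id")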
Join air_quality_2018_value with air_quality_2018_point and air_quality_2018_param.
Normalise air_quality_2018 yourself so that you get the three data frames mentioned in the previous exercise (value, point, param).
(*) In the National Health and Nutrition Examination Survey (NHANES) by the US Centers for Disease Control and Prevention, each participant is uniquely identified by their sequence number (SEQN), which is mentioned in numerous datasets published as part of the study. Join a few chosen datasets that are to your liking.
10.6.5. …And (Too) Many More
Looking at the list of methods for the DataFrame and Series classes in the pandas package’s documentation, we can see that they are abundant. Together with the object-oriented syntax, we will often find ourselves appreciating the high readability of even quite complex operation chains such as data.drop_duplicates().groupby(["year", "month"]).mean().reset_index().
Nevertheless, in fact, we should admit that the methods are too plentiful – they make an impression that someone was too generous by including a list of all the possible verbs related to data analysis, even if they can be easily expressed as a combination of 2-3 simpler operations. Therefore, in order to prevent the reader from being overloaded with too much new information, below we will be discussing only the most noteworthy features that often appear in data wrangling scenarios.
As strong advocates of minimalism (which is more environmentally friendly and sustainable), not rarely will we be more eager to fall back to the hand-crafted combinations of the more basic (universal) building blocks from numpy and scipy instead. This is also in line with our putting emphasis on developing transferable skills – as Python with pandas is not the only combo where we can work with data frames (e.g., base R allows that too).
10.7. Exercises
How are data frames different from matrices?
What are the use cases of the name slot in Series and Index objects?
What is the purpose of set_index and reset_index?
Why is learning numpy crucial for becoming a proficient user of pandas?
What is the difference between iloc[...] and loc[...]?
Why is applying the index operator [...] directly on a Series or DataFrame object not necessarily a good idea?
What is the difference between .index, Index, and .columns?
How to compute the arithmetic mean and median of all the numeric columns in a data frame, using a single line of code?
What is a train/test split and how to perform it using numpy and pandas?
What is the difference between stacking and unstacking? Which one yields a wide (as opposed to long) format?
Name different data frame join (merge) operations and explain how they work.
How does sorting with respect to more than one criterion work?
Name the basic set-theoretic operations on data frames.