# 16. Time Series

The online version of the open-access textbook Minimalist Data Wrangling with Python by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). Any bug/typo reports and fixes are appreciated. Although available online, this is a whole course; it should be read from the beginning to the end. In particular, refer to the Preface for general introductory remarks.

So far we have been using numpy and pandas mostly for storing:

• independent measurements, e.g., where each element is a height record of a different subject; we often consider these a sample of a representative subset of one or more populations, each recorded at a particular point in time;

• frequency distributions, i.e., count data and the corresponding categories or labels,

• data to report (as in: tables, figures, heatmaps, etc.).

In this part we will explore the most basic concepts related to the wrangling of time series data – signals indexed by discrete time. Usually a time series is a sequence of measurements sampled at equally spaced moments, e.g., a patient’s heart rate probed every second, daily opening stock market prices, or average monthly temperatures recorded in some location.

## 16.1. Temporal Ordering and Line Charts

Consider the midrange daily temperatures in degrees Celsius at Spokane International Airport (Spokane, WA, US) between 1889-08-01 (first observation) and 2021-12-31 (last observation). Note that the midrange, being the mean of the lowest and highest observed temperature on a given day, is not a particularly good estimate of the average daily reading; however, we must work with the data we have.

temps = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/spokane_temperature.txt")


Let us preview the December 2021 data:

temps[-31:]  # last 31 days
## array([ 11.9,   5.8,   0.6,   0.8,  -1.9,  -4.4,  -1.9,   1.4,  -1.9,
##         -1.4,   3.9,   1.9,   1.9,  -0.8,  -2.5,  -3.6, -10. ,  -1.1,
##         -1.7,  -5.3,  -5.3,  -0.3,   1.9,  -0.6,  -1.4,  -5. ,  -9.4,
##        -12.8, -12.2, -11.4, -11.4])


Here are some data aggregates – the popular quantiles:

np.quantile(temps, [0, 0.25, 0.5, 0.75, 1])
## array([-26.9,   2.2,   8.6,  16.4,  33.9])


as well as the arithmetic mean and standard deviation:

np.mean(temps), np.std(temps)
## (8.990273958441023, 9.16204388619955)


and a graphical summary of the data distribution in Figure 16.1:

sns.violinplot(data=temps, orient="h", color="lightgray")
plt.show()


Figure 16.1 Distribution of the midrange daily temperatures in Spokane in the period 1889-2021; observations are treated as a bag of unrelated items (temperature on a “randomly chosen day” in a version of planet Earth where there is no climate change)

When computing data aggregates or plotting histograms, the order of elements does not matter. Unlike in the case of independent measurements, however, vectors representing time series do not have to be treated simply as mixed bags of unrelated items.

Important

In time series, for any given item $$x_i$$, its neighbouring elements $$x_{i-1}$$ and $$x_{i+1}$$ denote the recordings occurring directly before and after it. We can use this temporal ordering to model how consecutive measurements depend on each other, describe how they change over time, forecast future values, detect long-time trends, and so forth.

Here are the data for 2021, plotted as a function of time:

plt.plot(temps[-365:])
plt.xticks([0, 181, 364], ["2021-01-01", "2021-07-01", "2021-12-31"])
plt.show()


Figure 16.2 Line chart of midrange daily temperatures in Spokane for 2021; plotting data as a function of the time variable reveals some seasonal pattern

What we see in Figure 16.2 is often referred to as a line chart – data points are connected by straight line segments. There are some visible seasonal variations, such as that, well, obviously, winter is colder than summer.

## 16.2. Working with Datetimes and Timedeltas

### 16.2.1. Representation: The UNIX Epoch

numpy.datetime64 is a type to represent datetimes. Usually, we will be creating dates from strings, for instance:

d = np.array([
    "1889-08-01", "1970-01-01", "2021-12-31", "today"
], dtype="datetime64")
d
## array(['1889-08-01', '1970-01-01', '2021-12-31', '2022-06-23'],
##       dtype='datetime64[D]')


Similarly with datetimes:

dt = np.array(["1970-01-01T02:01:05", "now"], dtype="datetime64")
dt
## array(['1970-01-01T02:01:05', '2022-06-23T05:58:15'],
##       dtype='datetime64[s]')


Important

Internally, the above are represented as the number of days or seconds since the so-called Unix Epoch, 1970-01-01T00:00:00 in the UTC time zone.

Let us verify that:

d.astype(float)
## array([-29372.,      0.,  18992.,  19166.])
dt.astype(float)
## array([7.2650000e+03, 1.6559639e+09])


Which, when we think about it for a while, is exactly what we expected.

Exercise 16.1

Write a regular expression that extracts all dates in the YYYY-MM-DD format from a (possibly long) string and converts them to datetime64.
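As a starting point, here is a sketch of one possible approach (the `extract_dates` helper is a name we introduce here, not part of the book's code):

```python
import re
import numpy as np

def extract_dates(text):
    # find all YYYY-MM-DD substrings and convert them to datetime64[D]
    return np.array(re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text),
                    dtype="datetime64[D]")

extract_dates("Spokane data: 1889-08-01 through 2021-12-31.")
```

Note that this pattern accepts any four-digit/two-digit/two-digit triple; validating that the matched string is a calendar-correct date is left to the conversion step, which raises an error on invalid inputs.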

### 16.2.2. Time Differences

Computing datetime differences is possible thanks to the numpy.timedelta64 objects:

d - np.timedelta64(1, "D")  # minus 1 Day
## array(['1889-07-31', '1969-12-31', '2021-12-30', '2022-06-22'],
##       dtype='datetime64[D]')
dt + np.timedelta64(12, "h")  # plus 12 hours
## array(['1970-01-01T14:01:05', '2022-06-23T17:58:15'],
##       dtype='datetime64[s]')


Also, numpy.arange (and pandas.date_range) can be used to generate sequences of equidistant datetimes:

dates = np.arange("1889-08-01", "2022-01-01", dtype="datetime64[D]")
dates[:3]   # preview
dates[-3:]  # preview
## array(['1889-08-01', '1889-08-02', '1889-08-03'], dtype='datetime64[D]')
## array(['2021-12-29', '2021-12-30', '2021-12-31'], dtype='datetime64[D]')
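The pandas counterpart mentioned above, pandas.date_range, yields a DatetimeIndex instead of a plain array; a quick sketch:

```python
import pandas as pd

# pandas.date_range also generates equidistant datetimes;
# freq="D" requests consecutive calendar days
idx = pd.date_range("1889-08-01", "1889-08-03", freq="D")
idx
```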


### 16.2.3. Datetimes in Data Frames

Dates and datetimes can of course be emplaced in pandas data frames:

spokane = pd.DataFrame(dict(
    date=np.arange("1889-08-01", "2022-01-01", dtype="datetime64[D]"),
    temp=temps
))
spokane.head()  # preview
##         date  temp
## 0 1889-08-01  21.1
## 1 1889-08-02  20.8
## 2 1889-08-03  22.2
## 3 1889-08-04  21.7
## 4 1889-08-05  18.3


Interestingly, if we ask the date column to become the data frame’s .index (i.e., row labels), we will be able to select date ranges quite easily with loc[...] and string slices (refer to the manual of pandas.DatetimeIndex for more details).

spokane.set_index("date").loc["2021-12-25":, :].reset_index()
##         date  temp
## 0 2021-12-25  -1.4
## 1 2021-12-26  -5.0
## 2 2021-12-27  -9.4
## 3 2021-12-28 -12.8
## 4 2021-12-29 -12.2
## 5 2021-12-30 -11.4
## 6 2021-12-31 -11.4

Example 16.2

Based on the above, we can plot the data for the last 5 years quite easily, see Figure 16.3.

x = spokane.set_index("date").loc["2017-01-01":, "temp"]
plt.plot(x)
plt.show()


Figure 16.3 Line chart of midrange daily temperatures in Spokane for 2017–2021

The x-axis labels have now been generated automatically.

Note that pandas.to_datetime can be used to convert strings to datetime objects:

dates = ["1991-04-05", "2021-09-12", "2042-12-31"]
dates = pd.Series(pd.to_datetime(dates))
dates
## 0   1991-04-05
## 1   2021-09-12
## 2   2042-12-31
## dtype: datetime64[ns]

Exercise 16.3

From the birth_dates dataset, select all people less than 18 years old (as of the current day). Note that pandas.to_datetime can also convert arbitrarily formatted date strings, e.g., "MM/DD/YYYY" or "DD.MM.YYYY", to Series of datetime64s.
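A sketch of the key steps, on made-up records (the birth_dates dataset would supply the real ones, and the cut-off date here is fixed for reproducibility):

```python
import pandas as pd

bd = pd.Series(["12/31/2010", "01/15/1990"])   # hypothetical sample records
bd = pd.to_datetime(bd, format="%m/%d/%Y")     # parse the MM/DD/YYYY format
today = pd.Timestamp("2023-06-01")             # stand-in for the current day
# less than 18 years old: born after (today minus 18 years)
is_minor = bd > today - pd.DateOffset(years=18)
```

In a real solution we would use pandas.Timestamp.now() instead of a hard-coded date.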

A number of datetime functions and related properties can be referred to via the pandas.Series.dt accessor (similarly to pandas.Series.str discussed in Chapter 14). In particular, they are a convenient means for extracting different date or time fields, such as:

dates_ymd = pd.DataFrame(dict(
    year  = dates.dt.year,
    month = dates.dt.month,
    day   = dates.dt.day
))
dates_ymd
##    year  month  day
## 0  1991      4    5
## 1  2021      9   12
## 2  2042     12   31


Interestingly, pandas.to_datetime can also convert data frames with columns named year, month, day, etc. back to datetime objects directly:

pd.to_datetime(dates_ymd)
## 0   1991-04-05
## 1   2021-09-12
## 2   2042-12-31
## dtype: datetime64[ns]

Example 16.4

For instance, we can extract the month and year parts of dates to compute the average monthly temperatures in the last 50-ish years:

x = spokane.set_index("date").loc["1970":, :].reset_index()
mean_monthly_temps = x.groupby([
    x.date.dt.year.rename("year"),
    x.date.dt.month.rename("month")
]).temp.mean().unstack()
mean_monthly_temps.head().round(1)  # preview
## month   1    2    3    4     5     6     7     8     9    10   11   12
## year
## 1970  -3.4  2.3  2.8  5.3  12.7  19.0  22.5  21.2  12.3  7.2  2.2 -2.4
## 1971  -0.1  0.8  1.7  7.4  13.5  14.6  21.0  23.4  12.9  6.8  1.9 -3.5
## 1972  -5.2 -0.7  5.2  5.6  13.8  16.6  20.0  21.7  13.0  8.4  3.5 -3.7
## 1973  -2.8  1.6  5.0  7.8  13.6  16.7  21.8  20.6  15.4  8.4  0.9  0.7
## 1974  -4.4  1.8  3.6  8.0  10.1  18.9  19.9  20.1  15.8  8.9  2.4 -0.8


Figure 16.4 depicts these data on a heatmap:

sns.heatmap(mean_monthly_temps)
plt.show()


Figure 16.4 Average monthly temperatures

Again we discover the ultimate truth that winters are cold, whereas in the summertime the living is easy, what a wonderful world.

### 16.2.4. Modelling Event Times with Exponential Distributions (*)

The exponential distribution family is frequently used for modelling times between different events (i.e., deltas) under the assumption that a system generates events at a constant average rate and that they occur independently of each other.

This may be the case for the times between requests to a cloud service during peak hours, wait times for the next pedestrian to appear at a crossing near the Southern Cross Station in Melbourne, or the amount of time it takes a bank teller to interact with a customer (there is a whole branch of applied mathematics – or should we refer to it more fashionably: data science – called queuing theory that deals with this type of modelling).

An exponential distribution is identified by the scale parameter $$s > 0$$, which is also its expected value. The probability density function of Exp(s) is given by

$f(x) = \tfrac{1}{s} e^{-x/s}$

for $$x\ge 0$$ and $$f(x)=0$$ otherwise. We should be careful: some textbooks parametrise the distribution by the rate $$\lambda=1/s$$ instead of $$s$$; scipy, on the other hand, uses the scale parametrisation.
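As a sanity check of the parametrisation, we can compare scipy's density with the above formula (scipy.stats.expon accepts the scale parameter s directly):

```python
import numpy as np
import scipy.stats

s = 2.0
x = np.linspace(0, 10, 101)
# scipy.stats.expon is parametrised by the scale s (pass scale=s),
# so its density must match f(x) = exp(-x/s)/s for x >= 0
assert np.allclose(scipy.stats.expon.pdf(x, scale=s), np.exp(-x/s)/s)
```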

Here is a pseudorandom sample where there are 5 events per minute on average:

np.random.seed(123)
s = 60/5  # 12 seconds between events on average (5 events per minute)
d = scipy.stats.expon.rvs(size=1200, scale=s)
np.round(d[:8], 3)  # preview
## array([14.307,  4.045,  3.087,  9.617, 15.253,  6.601, 47.412, 13.856])


This gave us the wait times between the events (deltas), in seconds.

A natural sample estimator of the scale parameter is of course:

np.mean(d)
## 11.839894504211724


This is close to what we expect, i.e., 12 seconds between the events.

We can convert the above to datetime (starting at a fixed calendar date), e.g., as follows:

t0 = np.array("2022-01-01T00:00:00", dtype="datetime64[ms]")
d_ms = np.round(d*1000).astype(int)  # in milliseconds
t = t0 + np.array(np.cumsum(d_ms), dtype="timedelta64[ms]")
t[:8], t[-2:]  # preview
## (array(['2022-01-01T00:00:14.307', '2022-01-01T00:00:18.352',
##        '2022-01-01T00:00:21.439', '2022-01-01T00:00:31.056',
##        '2022-01-01T00:00:46.309', '2022-01-01T00:00:52.910',
##        '2022-01-01T00:01:40.322', '2022-01-01T00:01:54.178'],
##       dtype='datetime64[ms]'), array(['2022-01-01T03:56:45.312', '2022-01-01T03:56:47.890'],
##       dtype='datetime64[ms]'))


We converted the deltas to milliseconds so that we did not lose precision; datetime64 is based on integers, not floating-point numbers.

As an exercise, let us apply binning and count how many events occur in each hour:

b = np.arange(
    "2022-01-01T00:00:00", "2022-01-01T05:00:00",
    1000*60*60,  # number of milliseconds in 1 hour
    dtype="datetime64[ms]"
)
np.histogram(t, bins=b)
## (array([305, 300, 274, 321]), array(['2022-01-01T00:00:00.000', '2022-01-01T01:00:00.000',
##        '2022-01-01T02:00:00.000', '2022-01-01T03:00:00.000',
##        '2022-01-01T04:00:00.000'], dtype='datetime64[ms]'))


We expect 5 events per minute, hence, 300 of them per hour. On a side note, from a course in statistics we know that for exponential inter-event times, the number of events per unit of time follows a Poisson distribution.
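This relationship is easy to probe by simulation (a sketch; the sample size and seed are arbitrary):

```python
import numpy as np
import scipy.stats

np.random.seed(123)
s = 12.0  # mean wait of 12 seconds, i.e., 5 events per minute
d = scipy.stats.expon.rvs(size=100_000, scale=s)
t = np.cumsum(d)                   # event times, in seconds
hour = (t // 3600).astype(int)     # hour in which each event occurred
counts = np.bincount(hour)[:-1]    # events per hour; drop the incomplete last hour
counts.mean()  # should be close to 3600/12 = 300
```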

Exercise 16.5

(**) Consider the wait_times dataset that gives the times between consecutive events, in seconds. Estimate the event rate per hour. Draw a histogram representing the number of events per hour.

## 16.3. Basic Operations

### 16.3.1. Iterative Differences and Cumulative Sums Revisited

Recall the numpy.diff function and its almost-inverse, numpy.cumsum. The former can turn a time series into a vector of relative changes (also called deltas, $$\Delta_i=x_{i+1}-x_i$$).

x = temps[-7:]  # last 7 days
x
## array([ -1.4,  -5. ,  -9.4, -12.8, -12.2, -11.4, -11.4])


The iterative differences (deltas) are:

d = np.diff(x)
d
## array([-3.6, -4.4, -3.4,  0.6,  0.8,  0. ])


For instance, between the first and the second day of the last week, the midrange temperature dropped by 3.6°C.

The other way around, here are the cumulative sums of the deltas:

np.cumsum(d)
## array([ -3.6,  -8. , -11.4, -10.8, -10. , -10. ])


This turned deltas back to a shifted version of the original series. However, we will need the first (root) observation therefrom to restore the dataset in full.

x[0] + np.append(0, np.cumsum(d))
## array([ -1.4,  -5. ,  -9.4, -12.8, -12.2, -11.4, -11.4])

Exercise 16.6

Consider the euraud-20200101-20200630-no-na dataset which lists daily EUR/AUD exchange rates in the first half of 2020 (remember COVID-19?), with missing observations removed. Using numpy.diff, compute the minimal, median, average, and maximal daily price changes. Also, draw a box and whisker plot for these deltas.

Exercise 16.7

(*) Consider the btcusd_ohlcv_2021_dates dataset which gives the daily BTC/USD exchange rates in 2021:

btc = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/btcusd_ohlcv_2021_dates.csv",
    comment="#").loc[:, ["Date", "Close"]]
btc["Date"] = btc["Date"].astype("datetime64[D]")
btc.head(12)  # preview
##          Date      Close
## 0  2021-01-01  29374.152
## 1  2021-01-02  32127.268
## 2  2021-01-03  32782.023
## 3  2021-01-04  31971.914
## 4  2021-01-05  33992.430
## 5  2021-01-06  36824.363
## 6  2021-01-07  39371.043
## 7  2021-01-08  40797.609
## 8  2021-01-09  40254.547
## 9  2021-01-10  38356.441
## 10 2021-01-11  35566.656
## 11 2021-01-12  33922.961


Convert it to a “lagged” representation, being a convenient form for some machine learning algorithms:

1. Add the Change column giving by how much the price changed since the previous day.

2. Add the Dir column indicating if the change was positive or negative.

3. Add the Lag1, …, Lag5 columns which give the Changes in the 5 preceding days.

The first few rows of the resulting data frame should look like this (assuming we do not want any missing values):

##       Date Close   Change Dir     Lag1     Lag2    Lag3    Lag4    Lag5
## 2021-01-07 39371  2546.68 inc  2831.93  2020.52 -810.11  654.76 2753.12
## 2021-01-08 40798  1426.57 inc  2546.68  2831.93 2020.52 -810.11  654.76
## 2021-01-09 40255  -543.06 dec  1426.57  2546.68 2831.93 2020.52 -810.11
## 2021-01-10 38356 -1898.11 dec  -543.06  1426.57 2546.68 2831.93 2020.52
## 2021-01-11 35567 -2789.78 dec -1898.11  -543.06 1426.57 2546.68 2831.93
## 2021-01-12 33923 -1643.69 dec -2789.78 -1898.11 -543.06 1426.57 2546.68


In the 6th row (representing 2021-01-12), Lag1 corresponds to Change on 2021-01-11, Lag2 gives the Change on 2021-01-10, and so forth.

To spice things up, make sure your code can generate any number, say $$k$$, of lagged variables.
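A sketch of the lag-building step using Series.shift, on toy closing prices (the column names follow the exercise; the numbers are made up):

```python
import pandas as pd

# toy closing prices; the real exercise uses the BTC/USD data
df = pd.DataFrame(dict(Close=[10.0, 12.0, 11.0, 15.0, 14.0, 18.0]))
df["Change"] = df["Close"].diff()                  # day-over-day change
df["Dir"] = df["Change"].apply(lambda c: "inc" if c > 0 else "dec")
k = 2  # number of lagged variables (5 in the exercise)
for i in range(1, k + 1):
    df[f"Lag{i}"] = df["Change"].shift(i)          # change i days earlier
df = df.dropna().reset_index(drop=True)            # no missing values wanted
```

Series.shift(i) moves values i positions forward, leaving NaNs at the start, which the final dropna removes.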

### 16.3.2. Smoothing with Moving Averages

With time series it makes sense to consider batches of consecutive points as there is a time dependence between them. In particular, we can consider computing different aggregates inside rolling windows of a particular size.

For example, given sequence $$(x_1, x_2, \dots, x_n)$$ and some $$k \le n$$, the $$k$$-moving average is a vector $$(y_1, y_2, \dots, y_{n-k+1})$$ such that

$y_i = \frac{1}{k} \sum_{j=1}^k x_{i+j-1},$

i.e., the arithmetic mean of $$k$$ consecutive observations starting at $$x_i$$.
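The formula can be implemented directly, e.g., via numpy.convolve (a minimal sketch; `moving_average` is a name we introduce here):

```python
import numpy as np

def moving_average(x, k):
    # means over all n-k+1 windows of length k;
    # mode="valid" keeps only fully-overlapping windows
    return np.convolve(x, np.ones(k) / k, mode="valid")

moving_average(np.array([1.0, 2.0, 3.0, 4.0]), 2)
# array([1.5, 2.5, 3.5])
```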

For example, here are the temperatures in the last 7 days of December 2021:

x = spokane.set_index("date").iloc[-7:, :]
x
##             temp
## date
## 2021-12-25  -1.4
## 2021-12-26  -5.0
## 2021-12-27  -9.4
## 2021-12-28 -12.8
## 2021-12-29 -12.2
## 2021-12-30 -11.4
## 2021-12-31 -11.4


The 3-moving average:

x.rolling(3, center=True).mean().round(2)
##              temp
## date
## 2021-12-25    NaN
## 2021-12-26  -5.27
## 2021-12-27  -9.07
## 2021-12-28 -11.47
## 2021-12-29 -12.13
## 2021-12-30 -11.67
## 2021-12-31    NaN


We get, in this order: the mean of the first 3 observations; the mean of the 2nd, 3rd, and 4th items; then the mean of the 3rd, 4th, and 5th; and so forth. Notice that we have centred the observations in such a way that we have the same number of missing values at the start and end of the series. This way, we treat the first 3-day moving average (the average of the temperatures on the first 3 days) as representative for the 2nd day.

And now for something completely different; the 5-moving average:

x.rolling(5, center=True).mean().round(2)
##              temp
## date
## 2021-12-25    NaN
## 2021-12-26    NaN
## 2021-12-27  -8.16
## 2021-12-28 -10.16
## 2021-12-29 -11.44
## 2021-12-30    NaN
## 2021-12-31    NaN


Applying the moving average has the nice effect of smoothing out all kinds of broadly-conceived noise. To illustrate this, compare the temperature data for the last 5 years in Figure 16.3 and in Figure 16.5.

x = spokane.set_index("date").loc["2017-01-01":, "temp"]
x30 = x.rolling(30, center=True).mean()
x100 = x.rolling(100, center=True).mean()
plt.plot(x30, label="30-day moving average")
plt.plot(x100, "r--", label="100-day moving average")
plt.legend()
plt.show()


Figure 16.5 Line chart of 30- and 100-moving averages of the midrange daily temperatures in Spokane for 2017-2021

Exercise 16.8

(*) Other aggregation functions can be applied in rolling windows too. Draw, in the same figure, the plots of the 1-year moving minimums, medians, and maximums.
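For instance, on a toy series (the exercise would use the Spokane data and a 365-day window):

```python
import pandas as pd

# rolling windows support aggregates other than the mean:
x = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])
r = x.rolling(3, center=True)
mins, meds, maxs = r.min(), r.median(), r.max()
```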

### 16.3.4. Imputing Missing Values

Missing values in time series can be imputed based on the information from the neighbouring non-missing observations. After all, it is usually the case that, e.g., today’s weather is “similar” to yesterday’s and tomorrow’s.

The most straightforward ways for dealing with missing values in time series are:

• forward-fill – propagate the last non-missing observation,

• backward-fill – get the next non-missing value,

• linearly interpolate between two adjacent non-missing values – in particular, a single missing value will be replaced by the average of its neighbours.

Example 16.9

The classic air_quality_1973 dataset gives some daily air quality measurements in New York, between May and September 1973. As an example, let us impute a first few observations in the solar radiation column:

air = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/r/air_quality_1973.csv",
    comment="#")
x = air.loc[:, "Solar.R"].iloc[:12]
pd.DataFrame(dict(
    original=x,
    ffilled=x.fillna(method="ffill"),
    bfilled=x.fillna(method="bfill"),
    interpolated=x.interpolate(method="linear")
))
##     original  ffilled  bfilled  interpolated
## 0      190.0    190.0    190.0    190.000000
## 1      118.0    118.0    118.0    118.000000
## 2      149.0    149.0    149.0    149.000000
## 3      313.0    313.0    313.0    313.000000
## 4        NaN    313.0    299.0    308.333333
## 5        NaN    313.0    299.0    303.666667
## 6      299.0    299.0    299.0    299.000000
## 7       99.0     99.0     99.0     99.000000
## 8       19.0     19.0     19.0     19.000000
## 9      194.0    194.0    194.0    194.000000
## 10       NaN    194.0    256.0    225.000000
## 11     256.0    256.0    256.0    256.000000

Exercise 16.10

(*) With the air_quality_2018 dataset:

1. Based on the hourly observations, compute the daily mean PM2.5s for Melbourne CBD and Morwell South.

For Melbourne CBD, if some hourly measurement is missing, linearly interpolate between the preceding and following non-missing data, e.g., a PM2.5 sequence of [..., 10, NaN, NaN, 40, ...] (you need to manually add the NaN rows to the dataset) should be transformed to [..., 10, 20, 30, 40, ...].

For Morwell South, impute the reading as an average of the records in the nearest air quality stations, located in Morwell East, Moe, Churchill, or Traralgon.

2. Present the daily mean PM2.5 measurements for Melbourne CBD and Morwell South on a single plot. The x-axis labels should be human-readable and intuitive.

3. For the Melbourne data, count on how many days the average PM2.5 was greater than in the preceding day.

4. Find 5 most air-polluted days for Melbourne.

### 16.3.5. Plotting Multidimensional Time Series

Multidimensional time series stored in the form of an n-by-m matrix are best viewed as m time series – possibly but not necessarily somewhat related to each other – all sampled at the same n points in time (e.g., m different stocks on n consecutive days).

Consider the currency exchange rates for the first half of 2020:

eurxxx = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/eurxxx-20200101-20200630-no-na.csv",
    delimiter=",")
eurxxx[:6, :]  # preview
## array([[1.6006 , 7.7946 , 0.84828, 4.2544 ],
##        [1.6031 , 7.7712 , 0.85115, 4.2493 ],
##        [1.6119 , 7.8049 , 0.85215, 4.2415 ],
##        [1.6251 , 7.7562 , 0.85183, 4.2457 ],
##        [1.6195 , 7.7184 , 0.84868, 4.2429 ],
##        [1.6193 , 7.7011 , 0.85285, 4.2422 ]])


This gives EUR/AUD (how many Australian Dollars we pay for 1 Euro), EUR/CNY (Chinese Yuans), EUR/GBP (British Pounds), and EUR/PLN (Polish Złotys), in this order. Let us draw the four time series, see Figure 16.8.

dates = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/euraud-20200101-20200630-dates.txt",
    dtype="datetime64")
labels = ["AUD", "CNY", "GBP", "PLN"]
styles = ["solid", "dotted", "dashed", "dashdot"]
for i in range(eurxxx.shape[1]):
    plt.plot(dates, eurxxx[:, i], ls=styles[i], label=labels[i])
plt.legend()
plt.show()


Figure 16.8 EUR/AUD, EUR/CNY, EUR/GBP, and EUR/PLN exchange rates in the first half of 2020

Unfortunately, they are all on different scales, hence the plot is not necessarily readable. It would be better to draw these time series on 4 separate plots (compare the trellis plots in Section 12.2.5).
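A minimal sketch of such a trellis-like layout via matplotlib.pyplot.subplots, on made-up random walks (in practice, the four columns of eurxxx would be plotted instead):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, for scripting
import matplotlib.pyplot as plt

# toy random walks standing in for the four exchange-rate series
rng = np.random.default_rng(1)
X = 1 + 0.01 * rng.normal(size=(100, 4)).cumsum(axis=0)
labels = ["AUD", "CNY", "GBP", "PLN"]
fig, axes = plt.subplots(2, 2, sharex=True)  # a 2x2 grid of panels
for ax, lab, col in zip(axes.ravel(), labels, X.T):
    ax.plot(col)
    ax.set_title(lab)
fig.tight_layout()
```

Each panel gets its own y-axis scale, which sidesteps the readability problem.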

Another idea is to depict the currency exchange rates relative to the prices on some day, say, the first one; see Figure 16.9.

for i in range(eurxxx.shape[1]):
    plt.plot(dates, eurxxx[:, i]/eurxxx[0, i],
        ls=styles[i], label=labels[i])
plt.legend()
plt.show()


Figure 16.9 EUR/AUD, EUR/CNY, EUR/GBP, and EUR/PLN exchange rates relative to the prices on the first day

This way, e.g., a relative EUR/AUD rate of ca. 1.15 in mid-March means that if an Australian bought some Euros on the first day and then sold them three-ish months later, they would have 15% more currency (the Euro became 15% stronger relative to the AUD).

Exercise 16.11

Based on EUR/AUD and EUR/PLN rates, compute and plot the AUD/PLN as well as PLN/AUD ones.
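The underlying arithmetic: if 1 EUR buys a AUD and p PLN, then 1 AUD buys p/a PLN (and 1 PLN buys a/p AUD). A sketch on the first two rows of the preview shown earlier:

```python
import numpy as np

euraud = np.array([1.6006, 1.6031])  # EUR/AUD, first two days
eurpln = np.array([4.2544, 4.2493])  # EUR/PLN, first two days
audpln = eurpln / euraud             # AUD/PLN cross rate
plnaud = 1.0 / audpln                # PLN/AUD is its reciprocal
```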

Exercise 16.12

(*) Draw the EUR/AUD and EUR/GBP rates on a single plot, but with each series having its own y axis.

Exercise 16.13

(*) Draw the EUR/xxx rates for your favourite currencies over a larger period of time. Use data downloaded from the European Central Bank. Add a few moving averages. For each year, identify the lowest and the highest rate.

### 16.3.6. Candlestick Plots (*)

Consider the BTC/USD data for 2021:

btcusd = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/btcusd_ohlcv_2021.csv",
    delimiter=",")
btcusd[:6, :4]  # preview
## array([[28994.01 , 29600.627, 28803.586, 29374.152],
##        [29376.455, 33155.117, 29091.182, 32127.268],
##        [32129.408, 34608.559, 32052.316, 32782.023],
##        [32810.949, 33440.219, 28722.756, 31971.914],
##        [31977.041, 34437.59 , 30221.188, 33992.43 ],
##        [34013.613, 36879.699, 33514.035, 36824.363]])


This gives the open, high, low, and close prices on the 365 consecutive days (we skipped the volume column for readability), which is a common way to summarise daily rates.

The mplfinance package (matplotlib-finance) features a few functions related to the plotting of financial data. Here, let us briefly mention the candlestick plot.

import mplfinance as mpf
dates = np.arange("2021-01-01", "2022-01-01", dtype="datetime64[D]")
mpf.plot(pd.DataFrame(
    btcusd,
    columns=["Open", "High", "Low", "Close", "Volume"]
).set_index(dates).iloc[:31, :], type="candle", style="charles")
# plt.show()  # not needed...


Figure 16.10 Candlestick plot for the BTC/USD exchange rates in January 2021

Figure 16.10 depicts the January 2021 data. Let us stress that this is not a box and whisker plot. The candlestick body denotes the difference in the market opening and the closing price. The wicks (shadows) give the range (high to low).

Green candlesticks represent bullish days, where the closing rate is greater than the opening one (an uptrend). Red candles are bearish (a decline).

Exercise 16.14

Draw the BTC/USD rates for the whole year and add the 10-day moving averages.

Exercise 16.15

(*) Draw a candlestick plot manually, without using the mplfinance package. matplotlib.pyplot.fill might be helpful.

Exercise 16.16

(*) Using matplotlib.pyplot.fill_between add a semi-transparent polygon that fills the area bounded between the Low and High prices on all the days.

## 16.4. Notes on Signal Processing (Audio, Image, Video) (*)

Data science classically deals with information that is or can be represented in tabular form and where particular observations (which can be multidimensional) are usually independent from but still to some extent similar to each other. We often treat them as samples from different larger populations which we would like to describe or compare at some level of generality (think: health data on patients being subject to two treatment plans that we wish to evaluate).

From this perspective, time series which we have briefly touched upon above are already quite distinct, because there is some dependence observed in the time domain: a price of a stock that we observe today is influenced by what was happening yesterday. There might also be some seasonal patterns or trends under the hood. Still, for data of this kind, employing statistical modelling techniques (stochastic processes) still can make sense.

Signals such as audio, images, and video are slightly different, because structured randomness does not play a dominant role there (unless it is noise that we would like to filter out). Instead, what is happening in the frequency (think: perceiving pitches when listening to music) or spatial (seeing green grass and sky on a photo) domain plays the key role.

Signal processing thus requires a quite distinct set of tools, e.g., Fourier analysis and finite impulse response (discrete convolution) filters. This course obviously cannot be about everything (also because it requires some more advanced calculus skills that we did not assume the reader to have at this time). Therefore, see, e.g., [Smi02, Ste96].

Nevertheless, we should keep in mind that these are not completely independent domains. For example, we can extract various features of audio signals (e.g., overall loudness, timbre, and danceability of each recording in a large song database) and then treat them as tabular data to be analysed using the techniques described in this course. Moreover, machine learning (e.g., deep convolutional neural networks) algorithms may also be used for tasks such as object detection on images or optical character recognition.

Note

Some Python packages for signal processing worth inspecting include:

• scipy.signal and scipy.ndimage,

• pillow (PIL),

• OpenCV (a C++ library for image and video processing with Python bindings).

## 16.5. Further Reading

Due to limited spacetime, we merely touched upon the most basic methods for dealing with time series. The reader is encouraged to take a look at the broad literature concerning statistical analysis of time series (e.g., issues in forecasting, [HA21, OFK17]) as well as on signal processing (e.g., [Smi02, Ste96]); their toolkits somewhat overlap.

For probabilistic modelling, see textbooks on stochastic processes, e.g., [Tij03].

## 16.6. Exercises

Exercise 16.17

Assume we have a time series with n observations. What is a 1- and an n-moving average? Which one is smoother, a (0.01n)- or a (0.1n)-moving average?

Exercise 16.18

What is the Unix Epoch?

Exercise 16.19

How can we recreate the original series when we are given its numpy.diff-transformed version?

Exercise 16.20

(*) In your own words, describe the key elements of a candlestick plot.