16. Time Series

The open-access textbook Minimalist Data Wrangling with Python by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF; a printed version can be ordered from Amazon). It is a non-profit project. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated.

So far, we have been using numpy and pandas mostly for storing:

  • independent measurements, where each row gives, e.g., weight, height, … records of a different subject; we often consider these a sample of a representative subset of one or more populations, each recorded at a particular point in time;

  • data summaries to be reported in the form of tables or figures, e.g., frequency distributions giving counts for the corresponding categories or labels.

In this chapter, we will explore the most basic concepts related to the wrangling of time series, i.e., signals indexed by discrete time. Usually, a time series is a sequence of measurements sampled at equally spaced moments, e.g., a patient’s heart rate probed every second, daily average currency exchange rates, or highest yearly temperatures recorded in some location.

16.1. Temporal Ordering and Line Charts

Consider the midrange¹ daily temperatures in degrees Celsius at the Spokane International Airport (Spokane, WA, US) between 1889-08-01 (first observation) and 2021-12-31 (last observation).

temps = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/spokane_temperature.txt")

Let us preview the December 2021 data:

temps[-31:]  # last 31 days
## array([ 11.9,   5.8,   0.6,   0.8,  -1.9,  -4.4,  -1.9,   1.4,  -1.9,
##         -1.4,   3.9,   1.9,   1.9,  -0.8,  -2.5,  -3.6, -10. ,  -1.1,
##         -1.7,  -5.3,  -5.3,  -0.3,   1.9,  -0.6,  -1.4,  -5. ,  -9.4,
##        -12.8, -12.2, -11.4, -11.4])

Here are some data aggregates for the whole sample. First, the popular quantiles:

np.quantile(temps, [0, 0.25, 0.5, 0.75, 1])
## array([-26.9,   2.2,   8.6,  16.4,  33.9])

Then, the arithmetic mean and standard deviation:

np.mean(temps), np.std(temps)
## (8.990273958441023, 9.16204388619955)

A graphical summary of the data distribution is depicted in Figure 16.1.

sns.violinplot(data=temps, orient="h", color="lightgray")
plt.yticks([])
plt.show()

Figure 16.1 Distribution of the midrange daily temperatures in Spokane in the period 1889-2021; observations are treated as a bag of unrelated items (temperature on a “randomly chosen day” in a version of planet Earth where there is no climate change)

When computing data aggregates or plotting histograms, the order of elements does not matter. Unlike in the case of independent measurements, however, vectors representing time series do not have to be treated simply as mixed bags of unrelated items.

Important

In time series, for any given item \(x_i\), its neighbouring elements \(x_{i-1}\) and \(x_{i+1}\) denote the recordings occurring directly before and after it. We can use this temporal ordering to model how consecutive measurements depend on each other, describe how they change over time, forecast future values, detect seasonal and long-time trends, and so forth.
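To see what this ordering buys us, we can compare the correlation between consecutive observations before and after shuffling a series (a synthetic sketch of ours; the random-walk data below are not part of the Spokane dataset):

```python
import numpy as np

rng = np.random.default_rng(123)
x = np.cumsum(rng.normal(size=1000))  # a random walk: strong temporal dependence

# correlation between each observation and its successor:
r = np.corrcoef(x[:-1], x[1:])[0, 1]           # close to 1

# shuffling keeps the same bag of values but destroys the ordering:
y = rng.permutation(x)
r_shuffled = np.corrcoef(y[:-1], y[1:])[0, 1]  # close to 0
```

The shuffled series has identical aggregates (mean, quantiles, etc.), yet the information encoded in the temporal ordering is gone.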

In Figure 16.2, we depict the data for 2021, plotted as a function of time. What we see is often referred to as a line chart: data points are connected by straight line segments. There are some visible seasonal variations, such as, well, obviously, that winter is colder than summer. There is also some natural variability on top of seasonal patterns typical for the Northern Hemisphere.

plt.plot(temps[-365:])
plt.xticks([0, 181, 364], ["2021-01-01", "2021-07-01", "2021-12-31"])
plt.show()

Figure 16.2 Line chart of midrange daily temperatures in Spokane for 2021

16.2. Working with Datetimes and Timedeltas

16.2.1. Representation: The UNIX Epoch

numpy.datetime64 is a type to represent datetimes. Usually, we will be creating dates from strings, for instance:

d = np.array([
    "1889-08-01", "1970-01-01", "1970-01-02", "2021-12-31", "today"
], dtype="datetime64")
d
## array(['1889-08-01', '1970-01-01', '1970-01-02', '2021-12-31',
##        '2022-11-13'], dtype='datetime64[D]')

Similarly with datetimes:

dt = np.array(["1970-01-01T02:01:05", "now"], dtype="datetime64")
dt
## array(['1970-01-01T02:01:05', '2022-11-13T01:58:25'],
##       dtype='datetime64[s]')

Important

Internally, the above are represented as the number of days or seconds since the so-called Unix Epoch, 1970-01-01T00:00:00 in the UTC time zone.

Let us verify the above statement:

d.astype(float)
## array([-2.9372e+04,  0.0000e+00,  1.0000e+00,  1.8992e+04,  1.9309e+04])
dt.astype(float)
## array([7.2650000e+03, 1.6683047e+09])

When we think about it for a while, this is exactly what we expected.
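The conversion also works the other way around: integers can be interpreted as offsets from the epoch in a given unit (here we reuse the two day counts displayed above):

```python
import numpy as np

# days since the Unix Epoch back to dates:
np.array([-29372, 18992], dtype="datetime64[D]")
## array(['1889-08-01', '2021-12-31'], dtype='datetime64[D]')
```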

Exercise 16.1

(*) Write a regular expression that extracts all dates in the YYYY-MM-DD format from a (possibly long) string and converts them to datetime64.

16.2.2. Time Differences

Computing datetime differences is possible thanks to the numpy.timedelta64 objects:

d - np.timedelta64(1, "D")  # minus 1 Day
## array(['1889-07-31', '1969-12-31', '1970-01-01', '2021-12-30',
##        '2022-11-12'], dtype='datetime64[D]')
dt + np.timedelta64(12, "h")  # plus 12 hours
## array(['1970-01-01T14:01:05', '2022-11-13T13:58:25'],
##       dtype='datetime64[s]')
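Subtracting one datetime64 from another yields a timedelta64 directly. For instance, the time span covered by our dataset (a quick check of ours):

```python
import numpy as np

np.datetime64("2021-12-31") - np.datetime64("1889-08-01")
## numpy.timedelta64(48364,'D')
```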

Also, numpy.arange (and pandas.date_range) can be used to generate sequences of equidistant datetimes:

dates = np.arange("1889-08-01", "2022-01-01", dtype="datetime64[D]")
dates[:3]   # preview
## array(['1889-08-01', '1889-08-02', '1889-08-03'], dtype='datetime64[D]')
dates[-3:]  # preview
## array(['2021-12-29', '2021-12-30', '2021-12-31'], dtype='datetime64[D]')
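Unlike numpy.arange, pandas.date_range also supports calendar-aware frequencies; for instance (an illustration of ours), the first day of each month in 2021:

```python
import pandas as pd

# "MS" stands for month start:
dates = pd.date_range("2021-01-01", "2021-12-31", freq="MS")
len(dates)  # 12 such days
```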

16.2.3. Datetimes in Data Frames

Dates and datetimes can of course be emplaced in pandas data frames:

spokane = pd.DataFrame(dict(
    date=np.arange("1889-08-01", "2022-01-01", dtype="datetime64[D]"),
    temp=temps
))
spokane.head()
##         date  temp
## 0 1889-08-01  21.1
## 1 1889-08-02  20.8
## 2 1889-08-03  22.2
## 3 1889-08-04  21.7
## 4 1889-08-05  18.3

Interestingly, if we ask the date column to become the data frame’s index (i.e., the row labels), we will be able to select date ranges quite easily with loc[...] and string slices (refer to the manual of pandas.DatetimeIndex for more details).

spokane.set_index("date").loc["2021-12-25":, :].reset_index()
##         date  temp
## 0 2021-12-25  -1.4
## 1 2021-12-26  -5.0
## 2 2021-12-27  -9.4
## 3 2021-12-28 -12.8
## 4 2021-12-29 -12.2
## 5 2021-12-30 -11.4
## 6 2021-12-31 -11.4
Example 16.2

Based on the above, we can plot the data for the last five years quite easily; see Figure 16.3. Note that the x-axis labels are generated automatically.

x = spokane.set_index("date").loc["2017-01-01":, "temp"]
plt.plot(x)
plt.show()

Figure 16.3 Line chart of midrange daily temperatures in Spokane for 2017–2021

The pandas.to_datetime function can also convert arbitrarily formatted date strings, e.g., "MM/DD/YYYY" or "DD.MM.YYYY", to a Series of datetime64s.

dates = ["05.04.1991", "14.07.2022", "21.12.2042"]
dates = pd.Series(pd.to_datetime(dates, format="%d.%m.%Y"))
dates
## 0   1991-04-05
## 1   2022-07-14
## 2   2042-12-21
## dtype: datetime64[ns]
Exercise 16.3

From the birth_dates dataset, select all people less than 18 years old (as of the current day).

Several datetime functions and related properties can be referred to via the pandas.Series.dt accessor (similarly to pandas.Series.str discussed in Chapter 14). In particular, they deliver a convenient means for extracting different date or time fields, such as:

dates_ymd = pd.DataFrame(dict(
    year  = dates.dt.year,
    month = dates.dt.month,
    day   = dates.dt.day
))
dates_ymd
##    year  month  day
## 0  1991      4    5
## 1  2022      7   14
## 2  2042     12   21
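Among the other fields exposed via this accessor is, e.g., the day of the week (a quick demonstration of ours, recreating the foregoing dates series):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["1991-04-05", "2022-07-14", "2042-12-21"]))
dates.dt.day_name().tolist()
## ['Friday', 'Thursday', 'Sunday']
dates.dt.dayofweek.tolist()  # Monday=0, Tuesday=1, ..., Sunday=6
## [4, 3, 6]
```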

Interestingly, pandas.to_datetime can also convert data frames with columns named year, month, day, etc., back to datetime objects directly:

pd.to_datetime(dates_ymd)
## 0   1991-04-05
## 1   2022-07-14
## 2   2042-12-21
## dtype: datetime64[ns]
Example 16.4

Let us extract the month and year parts of dates to compute the average monthly temperatures in the last 50-ish years:

x = spokane.set_index("date").loc["1970":, ].reset_index()
mean_monthly_temps = x.groupby([
    x.date.dt.year.rename("year"),
    x.date.dt.month.rename("month")
]).temp.mean().unstack()
mean_monthly_temps.head().round(1)  # preview
## month   1    2    3    4     5     6     7     8     9    10   11   12
## year                                                                  
## 1970  -3.4  2.3  2.8  5.3  12.7  19.0  22.5  21.2  12.3  7.2  2.2 -2.4
## 1971  -0.1  0.8  1.7  7.4  13.5  14.6  21.0  23.4  12.9  6.8  1.9 -3.5
## 1972  -5.2 -0.7  5.2  5.6  13.8  16.6  20.0  21.7  13.0  8.4  3.5 -3.7
## 1973  -2.8  1.6  5.0  7.8  13.6  16.7  21.8  20.6  15.4  8.4  0.9  0.7
## 1974  -4.4  1.8  3.6  8.0  10.1  18.9  19.9  20.1  15.8  8.9  2.4 -0.8

Figure 16.4 depicts these data on a heatmap. We rediscover the ultimate truth that winters are cold, whereas in the summertime the living is easy, what a wonderful world.

sns.heatmap(mean_monthly_temps)
plt.show()

Figure 16.4 Average monthly temperatures

16.3. Basic Operations

16.3.1. Iterative Differences and Cumulative Sums Revisited

Recall from Section 5.5.1 the numpy.diff function and its almost-inverse, numpy.cumsum. The former can turn a time series into a vector of relative changes (deltas), \(\Delta_i=x_{i+1}-x_i\).

x = temps[-7:]  # last 7 days
x
## array([ -1.4,  -5. ,  -9.4, -12.8, -12.2, -11.4, -11.4])

The iterative differences (deltas) are:

d = np.diff(x)
d
## array([-3.6, -4.4, -3.4,  0.6,  0.8,  0. ])

For instance, between the second and the first day of the last week, the midrange temperature dropped by 3.6°C.

The other way around, here are the cumulative sums of the deltas:

np.cumsum(d)
## array([ -3.6,  -8. , -11.4, -10.8, -10. , -10. ])

This turned the deltas back into a shifted version of the original series. We still need the first (root) observation to restore the dataset in full:

x[0] + np.append(0, np.cumsum(d))
## array([ -1.4,  -5. ,  -9.4, -12.8, -12.2, -11.4, -11.4])
Exercise 16.5

Consider the euraud-20200101-20200630-no-na dataset which lists daily EUR/AUD exchange rates in the first half of 2020 (remember COVID-19?), with missing observations removed. Using numpy.diff, compute the minimum, median, average, and maximum daily price changes. Also, draw a box and whisker plot for these deltas.

Example 16.6

(*) The exponential distribution family is sometimes used for the modelling of times between different events (deltas). It might be a sensible choice under the assumption that a system generates a constant number of events on average and that they occur independently of each other, e.g., for the times between requests to a cloud service during peak hours, wait times for the next pedestrian to appear at a crossing near the Southern Cross Station in Melbourne, or the amount of time it takes a bank teller to interact with a customer (there is a whole branch of applied mathematics called queuing theory that deals with this type of modelling).

An exponential family is identified by the scale parameter \(s > 0\), being at the same time its expected value. The probability density function of Exp(s) is given for \(x\ge 0\) by:

\[ f(x) = \tfrac{1}{s} e^{-x/s}, \]

and \(f(x)=0\) otherwise. We should be careful: some textbooks choose the parametrisation by \(\lambda=1/s\) instead of \(s\). The scipy package, on the other hand, uses the scale parametrisation, \(s\).

Here is a pseudorandom sample where there are five events per minute on average:

np.random.seed(123)
s = 60/5  # the mean wait time: 12 seconds (i.e., 5 events per minute)
d = scipy.stats.expon.rvs(size=1200, scale=s)
np.round(d[:8], 3)  # preview
## array([14.307,  4.045,  3.087,  9.617, 15.253,  6.601, 47.412, 13.856])

This gave us the wait times between the events, in seconds.

A natural sample estimator of the scale parameter is:

np.mean(d)
## 11.839894504211724

The result is close to what we expected, i.e., \(s=12\) seconds between the events.

We can convert the above to datetimes (starting at a fixed calendar date) as follows. Note that we will measure the deltas in milliseconds so that we do not lose precision; datetime64 is based on integers, not floating-point numbers.

t0 = np.array("2022-01-01T00:00:00", dtype="datetime64[ms]")
d_ms = np.round(d*1000).astype(int)  # in milliseconds
t = t0 + np.array(np.cumsum(d_ms), dtype="timedelta64[ms]")
t[:8]  # preview
## array(['2022-01-01T00:00:14.307', '2022-01-01T00:00:18.352',
##        '2022-01-01T00:00:21.439', '2022-01-01T00:00:31.056',
##        '2022-01-01T00:00:46.309', '2022-01-01T00:00:52.910',
##        '2022-01-01T00:01:40.322', '2022-01-01T00:01:54.178'],
##       dtype='datetime64[ms]')
t[-2:]  # preview
## array(['2022-01-01T03:56:45.312', '2022-01-01T03:56:47.890'],
##       dtype='datetime64[ms]')

As an exercise, let us apply binning and count how many events occur in each hour:

b = np.arange(  # four 1-hour intervals (five time points)
    "2022-01-01T00:00:00", "2022-01-01T05:00:00",
    1000*60*60,  # number of milliseconds in 1 hour
    dtype="datetime64[ms]"
)
np.histogram(t, bins=b)[0]
## array([305, 300, 274, 321])

We expect five events per minute, i.e., 300 of them per hour. On a side note, from a course in statistics we know that for exponential inter-event times, the number of events per unit of time follows a Poisson distribution.
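We can sanity-check this claim on a longer stream of wait times (a sketch of ours; a Poisson distribution implies that the mean and the variance of the hourly counts should both be close to 300):

```python
import numpy as np
import scipy.stats

np.random.seed(123)
w = scipy.stats.expon.rvs(size=100_000, scale=12)  # wait times; s=12 seconds
t = np.cumsum(w)                                   # event times, in seconds
n_hours = int(t[-1] // 3600)                       # number of complete hours
counts = np.histogram(t, bins=np.arange(n_hours + 1) * 3600)[0]
counts.mean(), counts.var()  # both should be close to 300
```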

Exercise 16.7

(*) Consider the wait_times dataset that gives the times between some consecutive events, in seconds. Estimate the event rate per hour. Draw a histogram representing the number of events per hour.

Exercise 16.8

(*) Consider the btcusd_ohlcv_2021_dates dataset which gives the daily BTC/USD exchange rates in 2021:

btc = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/btcusd_ohlcv_2021_dates.csv",
    comment="#").loc[:, ["Date", "Close"]]
btc["Date"] = btc["Date"].astype("datetime64[D]")
btc.head(12)
##          Date      Close
## 0  2021-01-01  29374.152
## 1  2021-01-02  32127.268
## 2  2021-01-03  32782.023
## 3  2021-01-04  31971.914
## 4  2021-01-05  33992.430
## 5  2021-01-06  36824.363
## 6  2021-01-07  39371.043
## 7  2021-01-08  40797.609
## 8  2021-01-09  40254.547
## 9  2021-01-10  38356.441
## 10 2021-01-11  35566.656
## 11 2021-01-12  33922.961

Write a function that converts it to a lagged representation, being a convenient form for some machine learning algorithms.

  1. Add the Change column that gives by how much the price changed since the previous day.

  2. Add the Dir column indicating if the change was positive or negative.

  3. Add the Lag1, …, Lag5 columns which give the Changes in the five preceding days.

The first few rows of the resulting data frame should look like this (assuming we do not want any missing values):

##       Date Close   Change Dir     Lag1     Lag2    Lag3    Lag4    Lag5
## 2021-01-07 39371  2546.68 inc  2831.93  2020.52 -810.11  654.76 2753.12
## 2021-01-08 40798  1426.57 inc  2546.68  2831.93 2020.52 -810.11  654.76
## 2021-01-09 40255  -543.06 dec  1426.57  2546.68 2831.93 2020.52 -810.11
## 2021-01-10 38356 -1898.11 dec  -543.06  1426.57 2546.68 2831.93 2020.52
## 2021-01-11 35567 -2789.78 dec -1898.11  -543.06 1426.57 2546.68 2831.93
## 2021-01-12 33923 -1643.69 dec -2789.78 -1898.11 -543.06 1426.57 2546.68

In the 6th row (representing 2021-01-12), Lag1 corresponds to Change on 2021-01-11, Lag2 gives the Change on 2021-01-10, and so forth.

To spice things up, make sure your code can generate any number (as defined by another parameter to the function) of lagged variables.

16.3.2. Smoothing with Moving Averages

With time series it makes sense to consider processing whole batches of consecutive points, because there is a time dependence between them. In particular, we can consider computing different aggregates inside rolling windows of a particular size. For instance, the \(k\)-moving average of a given sequence \((x_1, x_2, \dots, x_n)\) is a vector \((y_1, y_2, \dots, y_{n-k+1})\) such that:

\[ y_i = \frac{1}{k}\left( x_{i}+x_{i+1}+\dots+x_{i+k-1} \right) = \frac{1}{k} \sum_{j=1}^k x_{i+j-1}, \]

i.e., the arithmetic mean of \(k\le n\) consecutive observations starting at \(x_i\).
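The above formula can be implemented from scratch, e.g., by means of discrete convolution (a sketch of ours; in practice, the more convenient pandas.DataFrame.rolling, used below, does the job). Here, it is applied to the seven December 2021 temperatures considered next:

```python
import numpy as np

def moving_average(x, k):
    # the k-moving average of x: a vector of length len(x)-k+1
    return np.convolve(x, np.ones(k) / k, mode="valid")

x = np.array([-1.4, -5.0, -9.4, -12.8, -12.2, -11.4, -11.4])
moving_average(x, 3).round(2)
## array([ -5.27,  -9.07, -11.47, -12.13, -11.67])
```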

For example, here are the temperatures in the last seven days of December 2021:

x = spokane.set_index("date").iloc[-7:, :]
x
##             temp
## date            
## 2021-12-25  -1.4
## 2021-12-26  -5.0
## 2021-12-27  -9.4
## 2021-12-28 -12.8
## 2021-12-29 -12.2
## 2021-12-30 -11.4
## 2021-12-31 -11.4

The 3-moving (rolling) average:

x.rolling(3, center=True).mean().round(2)
##              temp
## date             
## 2021-12-25    NaN
## 2021-12-26  -5.27
## 2021-12-27  -9.07
## 2021-12-28 -11.47
## 2021-12-29 -12.13
## 2021-12-30 -11.67
## 2021-12-31    NaN

We get, in this order: the mean of the first three observations; the mean of the 2nd, 3rd, and 4th items; then the mean of the 3rd, 4th, and 5th; and so forth. Notice that the observations were centred in such a way that we have the same number of missing values at the start and end of the series. This way, we treat the first 3-day moving average (the average of the temperatures on the first three days) as representative of the 2nd day.
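By default (center=False), the window ends at the current observation instead, so all the missing values appear at the start of the series (a quick comparison of ours on the same seven temperatures):

```python
import pandas as pd

x = pd.Series([-1.4, -5.0, -9.4, -12.8, -12.2, -11.4, -11.4])
x.rolling(3).mean().round(2).tolist()  # center=False is the default
## [nan, nan, -5.27, -9.07, -11.47, -12.13, -11.67]
```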

And now for something completely different, namely the 5-moving average:

x.rolling(5, center=True).mean().round(2)
##              temp
## date             
## 2021-12-25    NaN
## 2021-12-26    NaN
## 2021-12-27  -8.16
## 2021-12-28 -10.16
## 2021-12-29 -11.44
## 2021-12-30    NaN
## 2021-12-31    NaN

Applying the moving average has the nice effect of smoothing out all kinds of broadly conceived noise. To illustrate this, compare the temperature data for the last five years in Figure 16.3 to their averaged versions in Figure 16.5.

x = spokane.set_index("date").loc["2017-01-01":, "temp"]
x30 = x.rolling(30, center=True).mean()
x100 = x.rolling(100, center=True).mean()
plt.plot(x30, label="30-day moving average")
plt.plot(x100, "r--", label="100-day moving average")
plt.legend()
plt.show()

Figure 16.5 Line chart of 30- and 100-moving averages of the midrange daily temperatures in Spokane for 2017-2021

Exercise 16.9

(*) Other aggregation functions can be applied in rolling windows as well. Draw, in the same figure, the plots of the 1-year moving minimums, medians, and maximums.

16.3.4. Imputing Missing Values

Missing values in time series can be imputed based on the information from the neighbouring non-missing observations. After all, it is usually the case that, e.g., today’s weather is “similar” to yesterday’s and tomorrow’s.

The most straightforward ways for dealing with missing values in time series are:

  • forward-fill – propagate the last non-missing observation,

  • backward-fill – get the next non-missing value,

  • linearly interpolate between two adjacent non-missing values – in particular, a single missing value will be replaced by the average of its neighbours.

Example 16.11

The classic air_quality_1973 dataset gives some daily air quality measurements in New York, between May and September 1973. Let us impute the first few observations in the solar radiation column:

air = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/r/air_quality_1973.csv",
    comment="#")
x = air.loc[:, "Solar.R"].iloc[:12]
pd.DataFrame(dict(
    original=x,
    ffilled=x.ffill(),
    bfilled=x.bfill(),
    interpolated=x.interpolate(method="linear")
))
##     original  ffilled  bfilled  interpolated
## 0      190.0    190.0    190.0    190.000000
## 1      118.0    118.0    118.0    118.000000
## 2      149.0    149.0    149.0    149.000000
## 3      313.0    313.0    313.0    313.000000
## 4        NaN    313.0    299.0    308.333333
## 5        NaN    313.0    299.0    303.666667
## 6      299.0    299.0    299.0    299.000000
## 7       99.0     99.0     99.0     99.000000
## 8       19.0     19.0     19.0     19.000000
## 9      194.0    194.0    194.0    194.000000
## 10       NaN    194.0    256.0    225.000000
## 11     256.0    256.0    256.0    256.000000
Exercise 16.12

(*) With the air_quality_2018 dataset:

  1. Based on the hourly observations, compute the daily mean PM2.5 measurements for Melbourne CBD and Morwell South.

    For Melbourne CBD, if some hourly measurement is missing, linearly interpolate between the preceding and following non-missing data, e.g., a PM2.5 sequence of [..., 10, NaN, NaN, 40, ...] (you need to manually add the NaN rows to the dataset) should be transformed to [..., 10, 20, 30, 40, ...].

    For Morwell South, impute the readings with the averages of the records in the nearest air quality stations, which are located in Morwell East, Moe, Churchill, and Traralgon.

  2. Present the daily mean PM2.5 measurements for Melbourne CBD and Morwell South on a single plot. The x-axis labels should be human-readable and intuitive.

  3. For the Melbourne data, determine the number of days where the average PM2.5 was greater than in the preceding day.

  4. Find five most air-polluted days for Melbourne.

16.3.5. Plotting Multidimensional Time Series

Multidimensional time series stored in the form of an n-by-m matrix are best viewed as m time series – possibly but not necessarily related to each other – all sampled at the same n points in time (e.g., m different stocks on n consecutive days).

Consider the currency exchange rates for the first half of 2020:

eurxxx = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/eurxxx-20200101-20200630-no-na.csv",
    delimiter=",")
eurxxx[:6, :]  # preview
## array([[1.6006 , 7.7946 , 0.84828, 4.2544 ],
##        [1.6031 , 7.7712 , 0.85115, 4.2493 ],
##        [1.6119 , 7.8049 , 0.85215, 4.2415 ],
##        [1.6251 , 7.7562 , 0.85183, 4.2457 ],
##        [1.6195 , 7.7184 , 0.84868, 4.2429 ],
##        [1.6193 , 7.7011 , 0.85285, 4.2422 ]])

This gives EUR/AUD (how many Australian Dollars we pay for 1 Euro), EUR/CNY (Chinese Yuans), EUR/GBP (British Pounds), and EUR/PLN (Polish Złotys), in this order. Let us draw the four time series; see Figure 16.8.

dates = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/euraud-20200101-20200630-dates.txt",
    dtype="datetime64")
labels = ["AUD", "CNY", "GBP", "PLN"]
styles = ["solid", "dotted", "dashed", "dashdot"]
for i in range(eurxxx.shape[1]):
    plt.plot(dates, eurxxx[:, i], ls=styles[i], label=labels[i])
plt.legend(loc="upper right", bbox_to_anchor=(1, 0.9))  # a bit lower
plt.show()

Figure 16.8 EUR/AUD, EUR/CNY, EUR/GBP, and EUR/PLN exchange rates in the first half of 2020

Unfortunately, they are all on different scales. This is why the plot is not necessarily readable. It would be better to draw these time series on four separate plots (compare the trellis plots in Section 12.2.5).
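Such a trellis layout can be sketched, e.g., with matplotlib.pyplot.subplots (a mock-up of ours: we use synthetic stand-in series so that the snippet is self-contained; in practice, we would iterate over the columns of eurxxx together with the dates vector):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
series = rng.normal(size=(100, 4)).cumsum(axis=0)  # four synthetic time series
labels = ["AUD", "CNY", "GBP", "PLN"]

fig, axes = plt.subplots(4, 1, sharex=True)  # one panel per currency
for ax, y, lab in zip(axes, series.T, labels):
    ax.plot(y)
    ax.set_ylabel(lab)
plt.show()
```

Each series now gets its own y-axis scale, at the cost of some vertical space.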

Another idea is to depict the currency exchange rates relative to the prices on some day, say, the first one; see Figure 16.9.

for i in range(eurxxx.shape[1]):
    plt.plot(dates, eurxxx[:, i]/eurxxx[0, i],
        ls=styles[i], label=labels[i])
plt.legend()
plt.show()

Figure 16.9 EUR/AUD, EUR/CNY, EUR/GBP, and EUR/PLN exchange rates relative to the prices on the first day

This way, e.g., a relative EUR/AUD rate of ca. 1.15 in mid-March means that if an Aussie bought some Euros on the first day, and then sold them three-ish months later, they would have 15% more wealth (the Euro became 15% stronger relative to the AUD).

Exercise 16.13

Based on the EUR/AUD and EUR/PLN records, compute and plot the AUD/PLN as well as PLN/AUD rates.

Exercise 16.14

(*) Draw the EUR/AUD and EUR/GBP rates on a single plot, but where each series has its own y-axis.

Exercise 16.15

(*) Draw the EUR/xxx rates for your favourite currencies over a larger period. Use data downloaded from the European Central Bank. Add a few moving averages. For each year, identify the lowest and the highest rate.

16.3.6. Candlestick Plots (*)

Consider the BTC/USD data for 2021:

btcusd = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/marek/btcusd_ohlcv_2021.csv",
    delimiter=",")
btcusd[:6, :4]  # preview (we skip the Volume column for readability)
## array([[28994.01 , 29600.627, 28803.586, 29374.152],
##        [29376.455, 33155.117, 29091.182, 32127.268],
##        [32129.408, 34608.559, 32052.316, 32782.023],
##        [32810.949, 33440.219, 28722.756, 31971.914],
##        [31977.041, 34437.59 , 30221.188, 33992.43 ],
##        [34013.613, 36879.699, 33514.035, 36824.363]])

This gives the open, high, low, and close (OHLC) prices on the 365 consecutive days, which is a common way to summarise daily rates.

The mplfinance package (matplotlib-finance) features a few functions related to the plotting of financial data. Here, let us briefly describe the well-known candlestick plot.

import mplfinance as mpf
dates = np.arange("2021-01-01", "2022-01-01", dtype="datetime64[D]")
mpf.plot(pd.DataFrame(
    btcusd,
    columns=["Open", "High", "Low", "Close", "Volume"]
).set_index(dates).iloc[:31, :], type="candle")
# plt.show() # not needed...

Figure 16.10 Candlestick plot for the BTC/USD exchange rates in January 2021

Figure 16.10 depicts the January 2021 data. Let us stress that this is not a box and whisker plot. The candlestick body denotes the difference in the market opening and the closing price. The wicks (shadows) give the range (high to low). White candlesticks represent bullish days – where the closing rate is greater than the opening one (uptrend). Black candles are bearish (decline).
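The bullish/bearish classification can be determined directly from the OHLC matrix (a quick check of ours on the six days previewed above):

```python
import numpy as np

ohlc = np.array([  # the six rows previewed above: open, high, low, close
    [28994.010, 29600.627, 28803.586, 29374.152],
    [29376.455, 33155.117, 29091.182, 32127.268],
    [32129.408, 34608.559, 32052.316, 32782.023],
    [32810.949, 33440.219, 28722.756, 31971.914],
    [31977.041, 34437.590, 30221.188, 33992.430],
    [34013.613, 36879.699, 33514.035, 36824.363],
])
bullish = ohlc[:, 3] > ohlc[:, 0]  # closing price greater than the opening one
bullish
## array([ True,  True,  True, False,  True,  True])
```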

Exercise 16.16

Draw the BTC/USD rates for the entire year and add the 10-day moving averages.

Exercise 16.17

(*) Draw a candlestick plot manually, without using the mplfinance package. Hint: matplotlib.pyplot.fill might be helpful.

Exercise 16.18

(*) Using matplotlib.pyplot.fill_between add a semi-transparent polygon that fills the area bounded between the Low and High prices on all the days.

16.4. Further Reading

Data science classically deals with information that is or can be represented in tabular form and where particular observations (which can be multidimensional) are usually independent from but still to some extent similar to each other. We often treat them as samples from different larger populations which we would like to describe or compare at some level of generality (think: health data on patients being subject to two treatment plans that we wish to evaluate).

From this perspective, time series are already quite distinct, because there is some dependence observed in the time domain: the price of a stock that we observe today is influenced by what was happening yesterday. There might also be some seasonal patterns or trends under the hood. For a good introduction to forecasting, see [46, 63]. Also, for data of this kind, employing statistical modelling techniques (stochastic processes) can make a lot of sense; see, e.g., [76].

Signals such as audio, images, and video are different, because structured randomness does not play a dominant role there (unless it is a noise that we would like to filter out). Instead, what is happening in the frequency (think: perceiving pitches when listening to music) or spatial (seeing green grass and sky in a photo) domain will play a key role there.

Signal processing thus requires a distinct set of tools, e.g., Fourier analysis and finite impulse response (discrete convolution) filters. This course obviously cannot be about everything (also because it requires some more advanced calculus skills that we did not assume the reader to have at this time); but see, e.g., [74, 75].

Nevertheless, we should keep in mind that these are not completely independent domains. For example, we can extract various features of audio signals (e.g., overall loudness, timbre, and danceability of each recording in a large song database) and then treat them as tabular data to be analysed using the techniques described in this course. Moreover, machine learning algorithms (e.g., convolutional neural networks) may also be used for tasks such as object detection on images or optical character recognition; see, e.g., [37].

16.5. Exercises

Exercise 16.19

Assume we have a time series with n observations. What is a 1- and an n-moving average? Which one is smoother: a (0.01n)- or a (0.1n)-moving average?

Exercise 16.20

What is the Unix Epoch?

Exercise 16.21

How can we recreate the original series when we are given its numpy.diff-transformed version?

Exercise 16.22

(*) In your own words, describe the key elements of a candlestick plot.


¹ Note that the midrange, being the mean of the lowest and the highest observed temperature on a given day, is not a particularly good estimate of the average daily reading. This dataset is considered for illustrational purposes only.