6. Continuous Probability Distributions


Each successful data analyst will deal with hundreds or thousands of datasets in their lifetime. In the long run, at some level, most of them will be deemed boring. This is because only a few common patterns will be occurring over and over again.

In particular, the previously mentioned bell-shapedness and right-skewness are quite prevalent in the so-called real world. But, surprisingly, this is exactly when things become scientific and interesting – allowing us to study various phenomena at an appropriate level of generality.

Mathematically, such idealised patterns in the histogram shapes can be formalised using the notion of a probability distribution of a continuous, real-valued random variable, and more precisely in this case, its probability density function.

Intuitively1, a probability density function is a nicely smooth curve that would arise if we drew a histogram for the whole population (e.g., all women living currently on Earth and beyond or otherwise an extremely large data sample obtained by independently querying the same underlying data generating process) in such a way that the total area of all the bars is equal to 1 and the bin sizes are very small.

As stated at the beginning, we do not intend this to be a course in probability theory and mathematical statistics, but one that precedes and motivates them (e.g., [Bil95, DKLM05, Gen09, Gen20]); therefore, our definitions are, out of necessity, simplified so that they are digestible. Hence, for the purpose of our illustrations, let us consider the following characterisation.

Important

(*) We call an integrable function \(f:\mathbb{R}\to\mathbb{R}\) a probability density function (PDF) if \(f(x)\ge 0\) for all \(x\) and \(\int_{-\infty}^\infty f(x)\,dx=1\), i.e., it is nonnegative and normalised in such a way that the total area under the whole curve is 1.

For any \(a < b\), we treat the area under the fragment of the \(f(x)\) curve for \(x\) between \(a\) and \(b\), i.e., \(\int_a^b f(x)\,dx\), as the probability of the underlying real-valued random variable’s (theoretical data generating process’) falling into the \([a, b]\) interval.

Some distributions appear more frequently than others and tend to fit empirical data or parts thereof particularly well; compare [FEHP10]. Let us thus review a few noteworthy probability distributions: the normal, log-normal, Pareto, and uniform families (later in this course, we will also mention the chi-squared, Kolmogorov, and exponential ones).

6.1. Normal Distribution

A normal (Gaussian) distribution is the one that has a prototypical, nicely symmetric, bell-shaped density. It is described by two parameters: \(\mu\in\mathbb{R}\) (the expected value, at which the density is centred) and \(\sigma>0\) (the standard deviation, saying how much the distribution is dispersed around μ); compare Figure 6.1.

The probability density function of N(μ, σ) is given by:

\[ f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right). \]

Figure 6.1 The probability density functions of some normal distributions N(μ, σ); note that μ is responsible for shifting and σ affects scaling/stretching of the probability mass
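As a quick sanity check (a small sketch; the parameters and evaluation points below are arbitrary), we can confirm that evaluating the above formula directly agrees with scipy.stats.norm.pdf:

x = np.array([-2.0, 0.0, 1.5])   # a few arbitrary points
mu0, sigma0 = 0.0, 1.0           # arbitrary parameters
f = np.exp(-(x-mu0)**2/(2*sigma0**2)) / np.sqrt(2*np.pi*sigma0**2)
np.allclose(f, scipy.stats.norm.pdf(x, mu0, sigma0))
## True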

6.1.1. Estimating Parameters

A course in statistics (which, again, this one is not; we are merely making an illustration here) may tell us that the sample arithmetic mean \(\bar{x}\) and standard deviation \(s\) are natural, statistically well-behaving estimators of the said parameters: if all observations were really drawn independently from N(μ, σ), then we would expect \(\bar{x}\) and \(s\) to be, more or less, equal to μ and σ (the larger the sample size, the smaller the error).

Recall the heights (females from the NHANES study) dataset and its bell-shaped histogram in Figure 4.2.

heights = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/nhanes_adult_female_height_2020.txt")
n = len(heights)
n
## 4221

Let us estimate the said parameters for this sample:

mu = np.mean(heights)
sigma = np.std(heights, ddof=1)
mu, sigma
## (160.13679222932953, 7.062858532891359)

Mathematically, we will denote these two with \(\hat{\mu}\) and \(\hat{\sigma}\) (mu and sigma with a hat) to emphasise that they are merely guesstimates2 of the unknown respective parameters \(\mu\) and \(\sigma\). On a side note, we use ddof=1 in the latter case, because this estimator has slightly better statistical properties in this context.

Let us draw the fitted density function (i.e., the PDF of N(160.1, 7.06) which we can compute using scipy.stats.norm.pdf), on top of the histogram; see Figure 6.2. We pass stat="density" to seaborn.histplot so that the histogram bars are normalised (i.e., the total area of these rectangles sums to 1).

sns.histplot(heights, stat="density", color="lightgray")
x = np.linspace(np.min(heights), np.max(heights), 1000)
plt.plot(x, scipy.stats.norm.pdf(x, mu, sigma), "r--",
    label=f"PDF of N({mu:.1f}, {sigma:.2f})")
plt.legend()
plt.show()

Figure 6.2 A histogram and the probability density function of the fitted normal distribution for the heights dataset

At first glance, this is a very nice match. Before proceeding with some ways of assessing the goodness of fit slightly more rigorously, let us appreciate the potential benefits of having an idealised model of our dataset at our disposal.

6.1.2. Data Models Are Useful

If (provided that, assuming that, on condition that) our sample is a realisation of independent random variables following a given distribution, or a data analyst judges that such an approximation might be justified or beneficial, then we have a set of many numbers reduced to merely a few parameters.

In the above case, we might want to risk the statement that data follow the normal distribution (assumption 1) with parameters \(\mu=160.1\) and \(\sigma=7.06\) (assumption 2). Still, the choice of the distribution family is one thing, and the way we estimate the underlying parameters (in our case, we use \(\hat{\mu}\) and \(\hat{\sigma}\)) is another.

This not only saves storage space and computational time, but also – based on what we can learn from a course in probability and statistics (by appropriately integrating the density function) – lets us derive facts such as the following for normally distributed data (see also the sketch after this list):

  • ca. 68% of women (i.e., a majority) are \(\mu\pm \sigma\) tall (the 1σ rule),

  • ca. 95% of women (i.e., the most typical ones) are \(\mu\pm 2\sigma\) tall (the 2σ rule),

  • ca. 99.7% of women (i.e., almost all) are \(\mu\pm 3\sigma\) tall (the 3σ rule).
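These percentages are not arbitrary constants: they arise from integrating the standard normal density (a minimal sketch; every N(μ, σ) reduces to N(0, 1) after standardisation):

Phi = scipy.stats.norm.cdf                          # CDF of N(0, 1)
np.round([Phi(k) - Phi(-k) for k in (1, 2, 3)], 4)  # the 1σ, 2σ, and 3σ rules
## array([0.6827, 0.9545, 0.9973])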

Also, if we knew that the distribution of the heights of men is also normal with some other parameters, we would be able to make comparisons between the two samples: for example, to compute the probability that a randomly selected woman is taller than a randomly selected man.
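For instance, if men's heights followed N(175.7, 7.5) – made-up parameters, purely for illustration – then the difference between a random woman's and a random man's height would be normally distributed with mean \(\mu-175.7\) and standard deviation \(\sqrt{\sigma^2+7.5^2}\), whence (a sketch):

mu_m, sigma_m = 175.7, 7.5   # hypothetical parameters of men's heights
# Prob(woman taller than man) = Prob(difference of heights > 0):
1 - scipy.stats.norm.cdf(0, mu - mu_m, np.sqrt(sigma**2 + sigma_m**2))
## ca. 0.065 for these made-up parameters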

Exercise 6.1

How can different manufacturing industries (e.g., clothing) make use of such knowledge? Are simplifications necessary when dealing with complexity? What are the alternatives?

Furthermore, there is a range of parametric statistical methods (i.e., ones that assume a particular distribution family) which can be used if we assume data normality, e.g., the t-test to compare the expected values. Such assumptions open many new doors.

Important

We should always verify the assumptions of a model that we wish to apply in practice. In particular, we will soon note that incomes are not normally distributed. Therefore, we must not refer to the above 2σ or 3σ rule in their case. A cow neither barks nor can it serve as a screwdriver. Period.

6.2. Assessing Goodness of Fit

6.2.1. Comparing Cumulative Distribution Functions

In the previous subsection we were comparing densities and histograms. It turns out that there is a better way of assessing the extent to which a sample deviates from a hypothesised distribution. Namely, we can measure the discrepancy between some theoretical cumulative distribution function (CDF) and the empirical one (\(\hat{F}_n\), see Section 4.3.8).

Important

If \(f\) is a density function, then the corresponding theoretical CDF is defined as \(F(x) = \int_{-\infty}^x f(t)\,dt\), i.e., the probability of the underlying random variable’s taking a value less than or equal to \(x\).

By definition3, each CDF takes values in the unit interval (\([0,1]\)) and is nondecreasing.
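For example, the CDF of the normal model fitted above, evaluated at, say, 170 cm, can be obtained by numerically integrating the density (a rough sketch relying on scipy.integrate.quad; compare its output with that of scipy.stats.norm.cdf, which we introduce next):

import scipy.integrate
F170, _ = scipy.integrate.quad(  # integrate the fitted density up to 170 cm
    lambda t: scipy.stats.norm.pdf(t, mu, sigma), -np.inf, 170)
np.allclose(F170, scipy.stats.norm.cdf(170, mu, sigma))  # F170 is ca. 0.92
## True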

For the normal distribution family, the values of the theoretical CDF can be computed by calling scipy.stats.norm.cdf; see Figure 6.3.

x = np.linspace(np.min(heights), np.max(heights), 1001)
probs = scipy.stats.norm.cdf(x, mu, sigma)  # sample the CDF at many points
plt.plot(x, probs, "r--", label=f"CDF of N({mu:.1f}, {sigma:.2f})")
heights_sorted = np.sort(heights)
plt.plot(heights_sorted, np.arange(1, n+1)/n,
    drawstyle="steps-post", label="Empirical CDF")
plt.xlabel("$x$")
plt.ylabel("Prob(height $\\leq$ x)")
plt.legend()
plt.show()

Figure 6.3 The empirical CDF and the fitted normal CDF for the heights dataset; the fit is superb

This looks like a superb match.

Example 6.2

\(F(b)-F(a)=\int_{a}^b f(t)\,dt\) is the probability of generating a value in the interval \([a, b]\).

Thus, let us verify the aforementioned 3σ rule for our fitted model:

F = lambda x: scipy.stats.norm.cdf(x, mu, sigma)
F(mu+3*sigma) - F(mu-3*sigma)
## 0.9973002039367398

Indeed, almost all observations are within \([\mu-3\sigma, \mu+3\sigma]\), if data are normally distributed.

Note

A common way to summarise the discrepancy between the empirical and a given theoretical CDF is by computing the greatest absolute deviation between these two functions:

\[ \hat{D}_n = \sup_{t\in\mathbb{R}} | \hat{F}_n(t) - F(t) |. \]

The supremum is a continuous version of the maximum.

It holds:

\[ \hat{D}_n = \max \left\{ \max_{k=1,\dots,n}\left\{ \left|\tfrac{k-1}{n} - F(x_{(k)})\right| \right\}, \max_{k=1,\dots,n}\left\{ \left|\tfrac{k}{n} - F(x_{(k)})\right| \right\} \right\}, \]

i.e., \(F\) needs to be probed only at the \(n\) points from the sorted input sample.

def compute_Dn(x, F):  # equivalent to scipy.stats.kstest(x, F)[0]
    Fx = F(np.sort(x))
    n = len(x)
    k = np.arange(1, n+1)  # 1, 2, ..., n
    Dn1 = np.max(np.abs((k-1)/n - Fx))
    Dn2 = np.max(np.abs(k/n - Fx))
    return max(Dn1, Dn2)

Dn = compute_Dn(heights, F)
Dn
## 0.010470976524201148

If the difference is sufficiently4 small, then we can assume that the normal model describes our data quite well. This is indeed the case here: we may estimate the probability of a person's being no taller than any given height with an error of less than about 1.05%.
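As a sanity check, scipy implements the same statistic (a one-liner; compare the comment in compute_Dn above):

scipy.stats.kstest(heights, F).statistic  # the same value as Dn above
## 0.010470976524201148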

6.2.2. Comparing Quantiles

A Q-Q (quantile-quantile) plot is another graphical method for comparing two distributions. This time, instead of working with a cumulative distribution function \(F\), we will be dealing with its (generalised) inverse, i.e., the quantile function \(Q\).

Given a CDF \(F\), the corresponding quantile function is defined for any \(p\in(0,1)\) as

\[ Q(p) = \inf\{ x: F(x) \ge p \}, \]

i.e., the smallest \(x\) such that the probability of drawing a value not greater than \(x\) is at least \(p\).

Important

If a CDF \(F\) is continuous, and this is the assumption in the current chapter, then \(Q\) is exactly its inverse, i.e., it holds \(Q(p)=F^{-1}(p)\) for all \(p\in(0, 1)\); compare Figure 6.4.


Figure 6.4 The cumulative distribution functions (left) and the quantile functions (being the inverse of the CDF; right) of some normal distributions

The theoretical quantiles can be generated by the scipy.stats.norm.ppf function. Here, ppf stands for the percent point function which is another (yet quite esoteric) name for the above \(Q\).

For instance, given our fitted N(160.1, 7.06) model for the NHANES dataset, \(Q(0.9)\) is the height not exceeded by 90% of the female population. In other words, only 10% of American women are taller than:

scipy.stats.norm.ppf(0.9, mu, sigma)
## 169.18820963937648
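Conversely, applying the CDF to this value recovers the original probability, illustrating that ppf and cdf are mutual inverses here (a quick check):

scipy.stats.norm.cdf(scipy.stats.norm.ppf(0.9, mu, sigma), mu, sigma)
## 0.9 (up to a tiny numerical round-off)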

A Q-Q plot draws a version of sample quantiles as a function of the corresponding theoretical quantiles. The sample quantiles that we have introduced in Section 5.1.1.3 are natural estimators of the theoretical quantile function. However, we have also mentioned that there are quite a few possible definitions thereof in the literature; compare [HF96].

For simplicity, instead of using numpy.quantile, we will assume that the \(\frac{i}{n+1}\)-quantile is equal to \(x_{(i)}\), i.e., the i-th smallest value in a given sample \((x_1,x_2,\dots,x_n)\) and consider only \(i=1, 2, \dots, n\).

Our simplified setting successfully avoids the problem which arises when the 0- or 1-quantile of the theoretical distribution, i.e., \(Q(0)\) or \(Q(1)\), is infinite (and this is the case for the normal distribution family).

def qq_plot(x, Q):
    """
    Draws a Q-Q plot, given:
    * x - a data sample (vector)
    * Q - a theoretical quantile function
    """
    n = len(x)
    q = np.arange(1, n+1)/(n+1)    # 1/(n+1), 2/(n+1), ..., n/(n+1)
    x_sorted = np.sort(x)          # sample quantiles
    quantiles = Q(q)               # theoretical quantiles
    plt.plot(quantiles, x_sorted, "o")
    plt.axline((x_sorted[n//2], x_sorted[n//2]), slope=1,
        linestyle=":", color="gray")  # identity line

Figure 6.5 depicts the Q-Q plot for our example dataset.

qq_plot(heights, lambda q: scipy.stats.norm.ppf(q, mu, sigma))
plt.xlabel(f"Quantiles of N({mu:.1f}, {sigma:.2f})")
plt.ylabel("Sample quantiles")
plt.show()

Figure 6.5 Q-Q plot for the heights dataset; it’s a nice fit

Ideally, the points are expected to be arranged on the \(y=x\) line (which was added for readability). This would happen if the sample quantiles matched the theoretical ones perfectly. In our case, there are small discrepancies in the tails (e.g., the smallest observation was slightly smaller than expected, and the largest one was larger than expected), although this is quite normal behaviour for small samples and certain distribution families. Still, we can say that we observe a very good fit.

6.2.3. Kolmogorov–Smirnov Test (*)

To be scientific, we should yearn for some more formal method that will enable us to test the null hypothesis stating that a given empirical distribution \(\hat{F}_n\) does not differ significantly from the theoretical one, \(F\):

\[\begin{split} \left\{ \begin{array}{rll} H_0: & \hat{F}_n = F & \text{(null hypothesis)}\\ H_1: & \hat{F}_n \neq F & \text{(two-sided alternative)} \\ \end{array} \right. \end{split}\]

The popular goodness-of-fit test by Kolmogorov and Smirnov can give us a conservative interval of the acceptable values of \(\hat{D}_n\) (again: the largest deviation between the empirical and theoretical CDF) as a function of \(n\) (within the framework of frequentist hypothesis testing).

Namely, if the test statistic \(\hat{D}_n\) is smaller than some critical value \(K_n\), then we shall deem the difference insignificant. This is to take into account the fact that the reality might slightly deviate from the ideal; in Section 6.4.4 we mention that even for samples which truly come from a hypothesised distribution, there is some inherent variability. We thus need to be somewhat tolerant.

A good textbook in statistics will tell us (and prove) that, under the assumption that \(\hat{F}_n\) is the ECDF of a sample of \(n\) independent variables really generated from \(F\), the random variable \(\hat{D}_n = \sup_{t\in\mathbb{R}} | \hat{F}_n(t) - F(t) |\) follows the Kolmogorov distribution with parameter \(n\) (available via scipy.stats.kstwo).

In other words, if we generate many samples of length \(n\) from \(F\), and compute \(\hat{D}_n\)s for each of them, we expect it to be distributed like in Figure 6.6.


Figure 6.6 Densities (left) and cumulative distribution functions (right) of some Kolmogorov distributions; the greater the sample size, the smaller the acceptable deviations between the theoretical and empirical CDFs

The choice of \(K_n\) involves a trade-off between our desire to: (a) not reject the null hypothesis when it is true (data really come from \(F\)), and (b) reject it when it is false (data follow some other distribution, i.e., the difference is significant enough). These two needs are, unfortunately, in conflict with each other.

In practice, we assume some fixed upper bound (significance level) for making the former kind of mistake, the so-called type-I error. A nicely conservative (in a good way5) value that we suggest to employ is \(\alpha=0.001=0.1\%\), i.e., only 1 out of 1000 samples that really come from \(F\) will be rejected as not coming from \(F\).

Such a \(K_n\) may be determined by considering the inverse of the CDF of the Kolmogorov distribution, \(\Xi_n\). Namely, \(K_n=\Xi_n^{-1}(1-\alpha)\):

alpha = 0.001  # significance level
scipy.stats.kstwo.ppf(1-alpha, n)
## 0.029964456376393188

In our case \(\hat{D}_n < K_n\), because \(0.01047 < 0.02996\). Thus, we conclude that our empirical (heights) distribution does not differ significantly (at significance level \(0.1\%\)) from the assumed one (N(160.1, 7.06)). In other words, we do not have enough evidence against the statement that data are normally distributed. It is the presumption of innocence: they are normal enough.

We will go back to this discussion in Section 6.4.4 and Section 12.2.6.

6.3. Other Noteworthy Distributions

6.3.1. Log-normal Distribution

We say that a sample is log-normally distributed, if its logarithm is normally distributed.

In particular, it is sometimes observed that the income of most6 individuals is distributed, at least approximately, log-normally. Let us investigate whether this is the case for the UK taxpayers.

income = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/uk_income_simulated_2020.txt")

The plotting of the histogram of the logarithm of income is left as an exercise (we can pass log_scale=True to seaborn.histplot; we will plot it soon anyway in a different way). We proceed directly with the fitting of a log-normal model, LN(μ, σ). The fitting process is similar to the normal case, but this time we determine the mean and standard deviation based on the logarithms of data:

lmu = np.mean(np.log(income))
lsigma = np.std(np.log(income), ddof=1)
lmu, lsigma
## (10.314409794364623, 0.5816585197803816)

Figure 6.7 depicts the fitted probability density function together with the histograms on the log- and original scale. When creating this plot, there are two pitfalls, though. Firstly, scipy.stats.lognorm parametrises the distribution via the parameter \(s\) equal to \(\sigma\) and scale equal to \(e^\mu\), hence computing the PDF at different points must be done as follows:

x  = np.linspace(np.min(income), np.max(income), 101)
fx = scipy.stats.lognorm.pdf(x, s=lsigma, scale=np.exp(lmu))

Secondly, passing both log_scale=True and stat="density" to seaborn.histplot does not normalise the values on the y-axis correctly. In order to make the histogram on the log-scale comparable with the true density, we need to turn on the log-scale and pass manually generated bins that are equidistant on the logarithmic scale (via numpy.geomspace).

b = np.geomspace(np.min(income), np.max(income), 30)

And now:

plt.subplot(1, 2, 1)
sns.histplot(income, stat="density", bins=b, color="lightgray")  # own bins!
plt.xscale("log")  # log-scale on the x-axis
plt.plot(x, fx, "r--")
plt.subplot(1, 2, 2)
sns.histplot(income, stat="density", color="lightgray")
plt.plot(x, fx, "r--", label=f"PDF of LN({lmu:.1f}, {lsigma:.2f})")
plt.legend()
plt.show()

Figure 6.7 A histogram and the probability density function of the fitted log-normal distribution for the income dataset, on log- (left) and original (right) scale

Overall, this fit is not too bad. However, we deal with only a sample of 1000 households here; the original data from the UK Office for National Statistics could tell us more about the quality of this model in general, but that is beyond the scope of our simple exercise.

Furthermore, Figure 6.8 gives the quantile-quantile plot on a double-logarithmic scale for the above log-normal model. Additionally, we (empirically) verify the hypothesis of normality (using a “normal” normal distribution, not its “log” version).

plt.subplot(1, 2, 1)
qq_plot(  # see above for the definition
    income,
    lambda q: scipy.stats.lognorm.ppf(q, s=lsigma, scale=np.exp(lmu))
)
plt.xlabel(f"Quantiles of LN({lmu:.1f}, {lsigma:.2f})")
plt.ylabel("Sample quantiles")
plt.xscale("log")
plt.yscale("log")

plt.subplot(1, 2, 2)
mu = np.mean(income)
sigma = np.std(income, ddof=1)
qq_plot(income, lambda q: scipy.stats.norm.ppf(q, mu, sigma))
plt.xlabel(f"Quantiles of N({mu:.1f}, {sigma:.2f})")

plt.show()

Figure 6.8 Q-Q plots for the income dataset vs a fitted log-normal (good fit; left) and normal (bad fit; right) distribution

Exercise 6.3

Graphically compare the empirical CDF for income and the theoretical CDF of LN(10.3, 0.58).

Exercise 6.4

(*) Perform the Kolmogorov–Smirnov goodness-of-fit test as in Section 6.2.3 to verify that the hypothesis of log-normality is not rejected at the \(\alpha=0.001\) significance level. At the same time, the income distribution significantly differs from a normal one.

We see that the hypothesis that our data follow a normal distribution is most likely false.

On the other hand, the log-normal model might be quite usable. It again reduces the whole dataset to merely two numbers, μ and σ, based on which (and probability theory) we may deduce, for instance (see the sketch after this list), that:

  • the expected average (mean) income is \(e^{\mu + \sigma^2/2}\),

  • the median is \(e^\mu\),

  • the most probable value (mode) is \(e^{\mu-\sigma^2}\),

etc.
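In our case, these quantities can be computed as follows (a quick sketch based on the estimates obtained above; values rounded in the comment):

np.exp(lmu + lsigma**2/2), np.exp(lmu), np.exp(lmu - lsigma**2)  # mean, median, mode
## ca. (35700, 30200, 21500)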

Note

Recall again that for skewed distributions such as this one, reporting the mean might be misleading. This is why most people get angry when they read the news about the prospering economy (“yeah, we’d like to see that kind of money in our pockets”). Hence, it is not only μ that matters, it is also σ which quantifies the discrepancy between the rich and the poor (too much inequality is bad, but also too much uniformity is to be avoided).

For a normal distribution, the situation is very different, because the mean, the median, and the most probable outcomes tend to be the same – the distribution is symmetric around μ.

Exercise 6.5

What is the fraction of people with earnings below the mean in our LN(10.3, 0.58) model? Hint: use scipy.stats.lognorm.cdf to get the answer.

6.3.2. Pareto Distribution

Consider again the dataset on the populations of the US cities in the 2000 US Census:

cities = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/other/us_cities_2000.txt")
len(cities), sum(cities)  # number of cities, total population
## (19447, 175062893.0)

Figure 6.9 gives the histogram of the city sizes with the populations on the log-scale. It kind-of looks like a log-normal distribution again, which the reader can verify themselves when they are feeling playful.

sns.histplot(cities, bins=20, log_scale=True, color="lightgray")
plt.show()

Figure 6.9 Histogram of the unabridged cities dataset; note the log-scale on the x-axis

This time, however, we will be interested not in what is typical, but in what is in some sense anomalous or extreme. Let us take a look again at the truncated version of the city size distribution by considering the cities with 10,000 or more inhabitants (i.e., we will only study the right tail of the original data, just like in Section 4.3.7).

s = 10_000
large_cities = cities[cities >= s]
len(large_cities), sum(large_cities)  # number of cities, total population
## (2696, 146199374.0)

Plotting the above on a double-logarithmic scale can be performed by passing log_scale=(True, True) to seaborn.histplot, which is left as an exercise. Anyway, doing so will lead to a picture similar to Figure 6.10. This reveals something very interesting: the bar tops on the double-log-scale are arranged more or less on a straight line. There are many datasets which exhibit this behaviour; we say that they follow a power law (power in the arithmetic sense, not the social one); see [CSN09, New05] for discussion.

Let us introduce the Pareto distribution family which has a prototypical power law-like density. It is identified by two parameters:

  • the scale parameter (as scipy calls it) \(s>0\), which is equal to the shift from 0,

  • the shape parameter, \(\alpha>0\), controls the slope of the said line on the double-log-scale.

The probability density function of P(α, s) is given by:

\[ f(x) = \frac{\alpha s^\alpha}{x^{\alpha+1}} \]

for \(x\ge s\) and \(f(x)=0\) otherwise.

\(s\) is usually taken as the sample minimum (i.e., 10000 in our case). \(\alpha\) can be estimated through the reciprocal of the mean of the scaled logarithms of our observations:

alpha = 1/np.mean(np.log(large_cities/s))
alpha
## 0.9496171695997675

We have already noticed that comparing the theoretical densities and an empirical histogram on a log-scale is quite problematic with seaborn, therefore we again must apply logarithmic binning manually in order to come up with what we see in Figure 6.10.

b = np.geomspace(s, np.max(large_cities), 21)  # bin boundaries
sns.histplot(large_cities, bins=b, stat="density", color="lightgray")
plt.plot(b, scipy.stats.pareto.pdf(b, alpha, scale=s), "r--",
    label=f"PDF of P({alpha:.3f}, {s})")
plt.xscale("log")
plt.yscale("log")
plt.legend()
plt.show()

Figure 6.10 Histogram of the large_cities dataset and the fitted density on a double log-scale

Figure 6.11 gives the corresponding Q-Q plot on a double-logarithmic scale.

qq_plot(  # defined above
    large_cities,
    lambda q: scipy.stats.pareto.ppf(q, alpha, scale=s)
)
plt.xlabel(f"Quantiles of P({alpha:.3f}, {s})")
plt.ylabel("Sample quantiles")
plt.xscale("log")
plt.yscale("log")
plt.show()

Figure 6.11 Q-Q plot for the large_cities dataset vs the fitted Paretian model

We see that the populations of the largest cities are overestimated. The model could be better, but the cities are still growing, right?

Example 6.6

(*) It might also be interesting to see how well we can predict the probability of a randomly selected city being at least of a given size. Let us thus denote by \(S(x)=1-F(x)\) the complementary cumulative distribution function (CCDF; sometimes referred to as the survival function) and by \(\hat{S}_n(x)=1-\hat{F}_n(x)\) its empirical counterpart. Figure 6.12 compares the empirical and the fitted CCDFs with probabilities on the linear- and log-scale.

x = np.geomspace(np.min(large_cities), np.max(large_cities), 1001)
probs = scipy.stats.pareto.cdf(x, alpha, scale=s)
n = len(large_cities)
for i in [1, 2]:
    plt.subplot(1, 2, i)
    plt.plot(x, 1-probs, "r--", label=f"CCDF of P({alpha:.3f}, {s})")
    plt.plot(np.sort(large_cities), 1-np.arange(1, n+1)/n,
        drawstyle="steps-post", label="Empirical Complementary CDF")
    plt.xlabel("$x$")
    plt.xscale("log")
    plt.yscale(["linear", "log"][i-1])
    if i == 1:
        plt.ylabel("Prob(city size > x)")
        plt.legend()
plt.show()

Figure 6.12 Empirical and theoretical complementary cumulative distribution functions for the large_cities dataset with probabilities on the linear- (left) and log-scale (right) and city sizes on the log-scale

In terms of the maximal absolute distance between the two functions, \(\hat{D}_n\), the left plot suggests that the fit looks pretty good (let us stress that the log-scale overemphasises the relatively small differences in the right tail and should not be used for judging the value of \(\hat{D}_n\)).

However, showing that the Kolmogorov–Smirnov goodness-of-fit test rejects the hypothesis of Paretianity (at significance level \(0.1\%\)) is left as an exercise to the reader.

6.3.3. Uniform Distribution

Consider the Polish Lotto lottery, where 6 numbered balls \(\{1,2,\dots,49\}\) are drawn without replacement from an urn. We have a dataset that summarises the number of times each ball has been drawn in all the games in the period 1957–2016.

lotto = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/lotto_table.txt")
lotto
## array([720., 720., 714., 752., 719., 753., 701., 692., 716., 694., 716.,
##        668., 749., 713., 723., 693., 777., 747., 728., 734., 762., 729.,
##        695., 761., 735., 719., 754., 741., 750., 701., 744., 729., 716.,
##        768., 715., 735., 725., 741., 697., 713., 711., 744., 652., 683.,
##        744., 714., 674., 654., 681.])

Each event seems to occur with more or less the same probability. Of course, the numbers on the balls are integers, but in our idealised scenario we may try modelling this dataset using a continuous uniform distribution, which yields arbitrary real numbers on a given interval (a, b), i.e., between some a and b. We denote such a distribution by U(a, b). It has the probability density function given by

\[ f(x) = \frac{1}{b-a}, \]

for \(x\in(a, b)\) and \(f(x)=0\) otherwise.

Notice that scipy.stats.uniform is parametrised by loc (equal to \(a\)) and scale (equal to \(b-a\)) instead.

In our case, it makes sense to set a=1 and b=50 and interpret an outcome like 49.1253 as representing the 49th ball (compare the notion of the floor function, \(\lfloor x\rfloor\)).
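For example (a small sketch of this interpretation; the seed is arbitrary), we can draw a few U(1, 50) deviates and map them onto ball numbers by taking their floor:

u = scipy.stats.uniform.rvs(1, scale=49, size=5, random_state=123)  # U(1, 50)
np.floor(u).astype(int)  # each entry becomes a ball number between 1 and 49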

x = np.linspace(1, 50, 1001)
plt.bar(np.arange(1, 50), width=1, height=lotto/np.sum(lotto),
    color="lightgray", edgecolor="black", alpha=0.8, align="edge")
plt.plot(x, scipy.stats.uniform.pdf(x, 1, scale=49), "r--",
    label="PDF of U(1, 50)")
plt.legend()
plt.show()

Figure 6.13 Histogram of the lotto dataset

Visually, see Figure 6.13, this model makes much sense. But, again, some more rigorous statistical testing would be required to determine whether someone has been tampering with the lottery results, i.e., whether the data deviate significantly from the uniform distribution.

Unfortunately, we cannot use the Kolmogorov–Smirnov test, because data are not continuous. See, however, Section 11.4.3 for the Pearson chi-squared test that is applicable here.

Exercise 6.7

Does playing lotteries and engaging in gambling make rational sense at all, from the perspective of an individual player? Well, we see that 17 has been the most frequently occurring outcome in Lotto so far, maybe there's some magic in it? Also, some people became millionaires after all, right?

Note

In data modelling (e.g., Bayesian statistics), sometimes a uniform distribution is chosen as a placeholder for “we know nothing about a phenomenon, so let us just assume that every event is equally likely”. However, overall, it is quite fascinating that the real world tends to be structured after all. Emerging patterns are plentiful, most often far from being uniformly distributed, and, furthermore, they are subject to quantitative analysis.

6.3.4. Distribution Mixtures (*)

Some datasets may fail to fit any of the simple models described above. It may sometimes be due to their non-random behaviour: statistics gives just one means of creating data idealisations; we also have partial differential equations, approximation theory, graphs and complex networks, agent-based modelling, and so forth, which might be worth studying (and then trying out).

Another reason may be that what we observe is in fact a mixture (creative combination) of simpler processes.

The dataset representing the December 2021 hourly average pedestrian counts near the Southern Cross Station in Melbourne might be a good instance of such a scenario; compare Figure 4.5.

peds = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/marek/southern_cross_station_peds_2019_dec.txt")

It might not be a bad idea to try to fit a probabilistic (convex) combination of three normal distributions \(f_1\), \(f_2\), \(f_3\), corresponding to the morning, lunch-time, and evening pedestrian count peaks. This yields the PDF

\[ f(x) = w_1 f_1(x) + w_2 f_2(x) + w_3 f_3(x) \]

for some coefficients \(w_1,w_2,w_3\ge 0\) such that \(w_1+w_2+w_3=1\).

Figure 6.14 depicts a mixture of N(8, 1), N(12, 1), and N(17, 2) with the corresponding weights of 0.35, 0.1, and 0.55. This particular dataset is quite coarse-grained (we only have 24 bar heights at our disposal), therefore these estimated parameters and coefficients should be taken with a pinch of chilli pepper.

plt.bar(np.arange(24), width=1, height=peds/np.sum(peds),
    color="lightgray", edgecolor="black", alpha=0.8)
x = np.arange(0, 25, 0.1)
p1 = scipy.stats.norm.pdf(x, 8, 1)
p2 = scipy.stats.norm.pdf(x, 12, 1)
p3 = scipy.stats.norm.pdf(x, 17, 2)
p = 0.35*p1 + 0.1*p2 + 0.55*p3  # weighted combination of 3 densities
plt.plot(x, p, "r--", label="PDF of a normal mixture")
plt.legend()
plt.show()

Figure 6.14 Histogram of the peds dataset and a guesstimated mixture of three normal distributions

Important

It will frequently be the case in data wrangling that more complex entities (models, methods) arise as combinations of simpler (primitive) components. This is why we should spend a great deal of time studying the fundamentals.

Note

Some data clustering techniques (in particular, the k-means algorithm that we briefly discuss later in this course) could be used to split a data sample into disjoint chunks corresponding to different mixture components.

Also, it might be the case that the mixture components can in fact be explained by another categorical variable which divides the dataset into natural groups, compare Chapter 12.

6.4. Generating Pseudorandom Numbers

A probability distribution is useful not only for describing a dataset. It also enables us to perform many experiments on data that we do not currently have, but we might obtain in the future, to test various scenarios and hypotheses.

To do this, we can generate a random sample of independent (not related to each other) observations.

6.4.1. Uniform Distribution

When most people say random, they implicitly mean uniformly distributed. For example:

np.random.rand(5)
## array([0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897])

gives 5 observations sampled independently from the uniform distribution on the unit interval, i.e., U(0, 1).

The same can be achieved with scipy, but this time the support will be (-10, 15):

scipy.stats.uniform.rvs(-10, scale=25, size=5)  # from -10 to -10+25
## array([ 0.5776615 , 14.51910496,  7.12074346,  2.02329754, -0.19706205])

Alternatively, we could do that ourselves by shifting and scaling the output of the random number generator on the unit interval using the formula numpy.random.rand(5)*25-10.

6.4.2. Not Exactly Random

Actually, we are generating these numbers using a computer, which is purely deterministic; hence, we shall refer to them as pseudorandom or random-like (although they are indistinguishable from truly random numbers when subjected to rigorous tests for randomness).

To prove it, we can set the initial state of the generator (the seed) via some number and see what values are output:

np.random.seed(123)  # set seed
np.random.rand(5)
## array([0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897])

Then, we set the seed once again via the same number and see how “random” the next values are:

np.random.seed(123)  # set seed
np.random.rand(5)
## array([0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897])

This enables us to perform completely reproducible numerical experiments, and this is a very good feature: truly scientific inquiries should lead to identical results under the same conditions.

Note

If we do not set the seed manually, it will be initialised based on the current wall time, which is different every… time – therefore the numbers will seem random to us.

Many Python packages that we will be using in the future, including pandas and sklearn, rely on numpy's random number generator; thus, we will be calling numpy.random.seed to make their behaviour predictable.

Additionally, some of them (e.g., sklearn.model_selection.train_test_split or pandas.DataFrame.sample) are equipped with the random_state argument, which behaves as if we temporarily changed the seed (for just one call to that function). For instance:

scipy.stats.uniform.rvs(size=5, random_state=123)
## array([0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897])

This gives the same sequence as above.

6.4.3. Sampling from Other Distributions

Generating data from other distributions is possible too; there are many rvs methods implemented in scipy.stats. For example, here is a sample from N(100, 16):

scipy.stats.norm.rvs(100, 16, size=3, random_state=50489)
## array([113.41134015,  46.99328545, 157.1304154 ])

Pseudorandom deviates from the standard normal distribution, i.e., N(0, 1), can also be generated using numpy.random.randn. Therefore, as N(100, 16) is a scaled and shifted version thereof, the above is equivalent to:

np.random.seed(50489)
np.random.randn(3)*16 + 100
## array([113.41134015,  46.99328545, 157.1304154 ])

Important

Conclusions based on simulated data are trustworthy, because they cannot be manipulated. Or can they?

The pseudorandom number generator’s seed used above, 50489, is quite suspicious. It might suggest that someone wanted to prove some point (in this case, the violation of the 3σ rule).

This is why we recommend sticking to one and only seed most of the time, e.g., 123, or – when performing simulations – setting consecutive seeds for each iteration: 1, 2, ….

Exercise 6.8

Generate 1000 pseudorandom numbers from the log-normal distribution and draw a histogram thereof.

Note

(*) Having a good pseudorandom number generator from the uniform distribution on the unit interval is crucial, because sampling from other distributions usually involves transforming independent U(0, 1) variates.

For instance, realisations of random variables following any continuous cumulative distribution function \(F\) can be constructed by means of the inverse transform sampling:

  1. Generate a sample \(x_1,\dots,x_n\) independently from U(0, 1);

  2. Transform each \(x_i\) by applying the quantile function, \(y_i=F^{-1}(x_i)\);

  3. Now \(y_1,\dots,y_n\) follows the CDF \(F\).

For more topics on random number generation, see [Gen03, RC04].
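For example, here is a minimal sketch of this scheme for the exponential distribution with rate 1 (an arbitrary choice), whose quantile function is easy to derive by hand: \(Q(p)=-\log(1-p)\).

np.random.seed(123)
u = np.random.rand(10000)             # step 1: U(0, 1) variates
y = -np.log(1-u)                      # step 2: apply Q(p) = -log(1-p)
compute_Dn(y, scipy.stats.expon.cdf)  # check: the deviation from Exp(1) is small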

Exercise 6.9

(*) Generate 1000 pseudorandom numbers from the log-normal distribution using inverse transform sampling.

Exercise 6.10

(**) Generate 1000 pseudorandom numbers from the distribution mixture discussed in Section 6.3.4.

6.4.4. Natural Variability

Even a sample which we know to be generated from a specific distribution will deviate from it, sometimes considerably. Such effects usually disappear7 when the availability of data increases, but definitely will be visible for small sample sizes.

For example, Figure 6.15 depicts the histogram of 9 different samples of size 100, all drawn independently from the normal distribution N(0, 1).

plt.figure(figsize=(plt.rcParams["figure.figsize"][0], )*2)  # width=height
x = np.linspace(-4, 4, 1001)
fx = scipy.stats.norm.pdf(x)  # PDF of N(0, 1)
for i in range(9):
    plt.subplot(3, 3, i+1)
    sample = scipy.stats.norm.rvs(size=100, random_state=i+1)
    sns.histplot(sample, stat="density", bins=10, color="lightgray")
    plt.plot(x, fx, "r--")
    plt.ylabel(None)
    plt.xlim(-4, 4)
    plt.ylim(0, 0.6)
plt.show()

Figure 6.15 All 9 samples are normally distributed!

There is some ruggedness in the bar sizes that a naïve observer might try to interpret as something meaningful; a competent data scientist must train their eye to ignore such impurities that are only due to random effects (but always be ready to detect those which are worth attention).

Exercise 6.11

Repeat the above experiment for samples of size 10, 1000, and 10000.

Example 6.12

(*) Using a simple Monte Carlo simulation, we can verify (approximately) that the Kolmogorov–Smirnov goodness-of-fit test introduced in Section 6.2.3 has been calibrated properly, i.e., that for samples which really follow the assumed distribution, the null hypothesis is rejected only in ca. 0.1% of the cases.

Let us say we are interested in the null hypothesis referencing the standard normal distribution, N(0, 1), and sample size \(n=100\). We need to generate many (we assume 10,000 below) such samples for each of which we compute and store the maximal absolute deviation from the theoretical CDF, i.e., \(\hat{D}_n\).

n = 100
distrib = scipy.stats.norm(0, 1)  # assumed distribution - N(0, 1)
Dns = []
for i in range(10000):  # increase this for better precision
    x = distrib.rvs(size=n, random_state=i+1)  # really follows distrib
    Dns.append(compute_Dn(x, distrib.cdf))
Dns = np.array(Dns)

Now let us compute the proportion of cases which lead to \(\hat{D}_n\) greater than the critical value \(K_n\):

len(Dns[Dns >= scipy.stats.kstwo.ppf(1-0.001, n)]) / len(Dns)
## 0.0016

In theory, this should be equal to 0.001, but we are relying on randomness above; therefore, our result is only approximate. Increasing the number of trials from 10,000 to, say, 1,000,000 will make the above estimate more precise.

It is also worth checking that the density histogram of Dns resembles the density of the Kolmogorov distribution, which we can compute via scipy.stats.kstwo.pdf.
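For instance (a brief sketch reusing the Dns vector computed above):

sns.histplot(Dns, stat="density", color="lightgray")
x = np.linspace(np.min(Dns), np.max(Dns), 1001)
plt.plot(x, scipy.stats.kstwo.pdf(x, n), "r--",
    label=f"PDF of the Kolmogorov distribution with parameter n={n}")
plt.legend()
plt.show()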

Exercise 6.13

(*) It might also be interesting to check out the test's power, i.e., the probability that when the null hypothesis is false, it will actually be rejected. Modify the above code in such a way that x in the for loop is not generated from N(0, 1), but from N(0.1, 1), N(0.2, 1), etc., and check the proportion of cases where we deem the sample's distribution different from N(0, 1). Note that small differences in the location parameter \(\mu\) will often go undetected, but detectability improves as the sample size \(n\) increases.

6.4.5. Adding Jitter (White Noise)

We have mentioned that measurements might be subject to observational error. Furthermore, rounding can occur as early as at the data collection phase. In particular, our heights dataset has a precision of 1 fractional digit. However, in statistics, when we say that data follow a continuous distribution, the probability of having two identical values in a sample is 0, whereas rounded data are quite likely to contain ties. This can be problematic, because some data analysis methods assume that there are no ties in the input vector, i.e., that all values are unique.

The easiest way to deal with such numerical inconveniences is to add some white noise with expected value of 0, either uniformly or normally distributed.

For example, for heights it makes sense to add some jitter from U[-0.05, 0.05]:

heights_jitter = heights + (np.random.rand(len(heights))*0.1-0.05)
heights_jitter[:6]  # preview
## array([160.21704623, 152.68870195, 161.24482407, 157.3675293 ,
##        154.61663465, 144.68964596])

Adding noise might also be done for aesthetic reasons, e.g., when drawing scatter plots.

6.5. Further Reading

For an excellent general introductory course on probability and statistics, see [Gen09, Gen20].

Topics in random number generation are covered in [Gen03, Knu97, RC04].

For a comprehensive catalogue of statistical distributions, their properties, and relationships between them, see [FEHP10].

[CSN09, New05] give a nice overview of power-law-like behaviour of some “rich” or otherwise extreme datasets.

6.6. Exercises

Exercise 6.14

Why does the notion of the mean income tend to confuse the general public?

Exercise 6.15

When does manually setting the seed of a random number generator make sense?

Exercise 6.16

Given a log-normally distributed sample x, how can we turn it into a normally distributed one, i.e., y=f(x), with f being… what?

Exercise 6.17

What is the 3σ rule for normally distributed data?

Exercise 6.18

(*) How can we verify graphically whether a sample follows a hypothesised theoretical distribution?

Exercise 6.19

(*) Explain the meaning of: type I error, significance level, and a test’s power.


1

(*) This intuition is of course theoretically grounded and is based on the asymptotic behaviour of the histograms as the estimators of the underlying probability density function, see, e.g., [FD81] and the many references therein.

2

(*) It might be the case that we will have to obtain the estimates of the probability distribution’s parameters by numerical optimisation, for example, by minimising \( \mathcal{L}(\mu, \sigma) = \sum_{i=1}^n \left( \frac{(x_i-\mu)^2}{\sigma^2} + \log \sigma^2 \right) \) with respect to \(\mu\) and \(\sigma\) (corresponding to the objective function in the maximum likelihood estimation problem for the normal distribution family). In our case, however, we are lucky; there exist closed-form formulae expressing the solution to the above, exactly in the form of the aforementioned sample mean and standard deviation. For other distributions things can get a little trickier, though. Furthermore, sometimes we will have many options for point estimators to choose from, which might be more suitable if data are not of top quality (e.g., contain outliers). For instance, in the normal model, it can be shown that we can also estimate \(\mu\) and \(\sigma\) via the sample median and \(\mathrm{IQR}/1.349\).

3

Actually, the probability distribution of any real-valued random variable \(X\) can be uniquely defined by means of a nondecreasing, right (upward) continuous function \(F:\mathbb{R}\to[0,1]\) such that \(\lim_{x\to-\infty} F(x)=0\) and \(\lim_{x\to\infty} F(x)=1\), in which case \(\Pr(X\le x)=F(x)\). The probability density function only exists for continuous random variables and is defined as the derivative of \(F\).

4

The larger the sample size, the less tolerant we should be with regard to the size of this disparity; see Section 6.2.3.

5

More details on that in Section 12.2.6.

6

Except the few richest who are interesting on their own; see Section 6.3.2 where we discuss the Pareto distribution.

7

Compare the Fundamental Theorem of Statistics (the Glivenko–Cantelli theorem).