# 6. Continuous Probability Distributions#

The open-access textbookMinimalist Data Wrangling with Pythonby Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF; a printed version can be ordered from Amazon: AU CA DE ES FR IT JP NL PL SE UK US). It is a non-profit project. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated.Also, make sure to check out my other book,Deep R Programming[34].

Each successful data analyst will deal with hundreds or thousands of
datasets in their lifetime. In the long run, at some level, most of them will
be deemed *boring*. This is because only a few common patterns
will be occurring over and over again.

In particular, the previously mentioned bell-shapedness and right-skewness are quite prevalent in the so-called real world. Surprisingly, however, this is exactly when things become scientific and interesting – allowing us to study various phenomena at an appropriate level of generality.

Mathematically, such idealised patterns in the histogram shapes
can be formalised using the notion of a
*probability density function* (PDF) of a *continuous, real-valued random
variable*.

Intuitively[1], a PDF is
a smooth curve that would arise if we drew a histogram
for the entire *population* (e.g., all women living currently on Earth and
beyond or otherwise an extremely large data sample obtained by independently
querying the same underlying data generating process) in such a way that
the total area of all the bars is equal to 1
and the bin sizes are very small.

As stated at the beginning, we do not intend this to be a course in probability theory and mathematical statistics. Rather, it precedes and motivates them (e.g., [21, 38, 40, 78]). Therefore, our definitions are out of necessity simplified so that they are digestible. For the purpose of our illustrations, we will consider the following characterisation.

Important

(*) We call an integrable function \(f:\mathbb{R}\to\mathbb{R}\)
a *probability density function*
if \(f(x)\ge 0\) for all \(x\) and \(\int_{-\infty}^\infty f(x)\,dx=1\),
i.e., it is nonnegative and normalised in such a way that
the total area under the whole curve is 1.

For any \(a < b\), we treat the area under the fragment of the \(f(x)\) curve for \(x\) between \(a\) and \(b\), i.e., \(\int_a^b f(x)\,dx\), as the probability of the underlying real-valued random variable’s (theoretical data generating process’) falling into the \([a, b]\) interval.

Some distributions appear more frequently than others and appear to fit empirical data or parts thereof particularly well; compare [27]. In this chapter, we review a few noteworthy probability distributions: the normal, log-normal, Pareto, and uniform families (we will also mention the chi-squared, Kolmogorov, and exponential ones in this course).

## 6.1. Normal Distribution#

A *normal (Gaussian) distribution* has a prototypical,
nicely symmetric, bell-shaped density.
It is described by two parameters:
\(\mu\in\mathbb{R}\) (the expected value, at which the PDF is centred)
and \(\sigma>0\) (the standard deviation, saying how
much the distribution is dispersed around μ);
compare Figure 6.1.

The probability density function of N(μ, σ) is given by:

### 6.1.1. Estimating Parameters#

A course in statistics (which, again, this one is not,
we are merely making an illustration here), may tell us
that the sample arithmetic mean \(\bar{x}\) and standard deviation \(s\)
are natural, statistically well-behaving *estimators* of the said parameters:
if all observations would really be drawn independently from N(μ, σ) each,
then we *expect* \(\bar{x}\) and \(s\) be equal to, more or less, μ and σ (the larger the sample size,
the smaller the error).

Recall the `heights`

(females from the NHANES study)
dataset and its bell-shaped histogram
in Figure 4.2.

```
heights = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_height_2020.txt")
n = len(heights)
n
## 4221
```

Let us estimate the said parameters for this sample:

```
mu = np.mean(heights)
sigma = np.std(heights, ddof=1)
mu, sigma
## (160.13679222932953, 7.062858532891359)
```

Mathematically, we will denote these two with
\(\hat{\mu}\) and \(\hat{\sigma}\) (mu and sigma with a hat) to emphasise that
they are merely guesstimates[2] of the unknown respective parameters
\(\mu\) and \(\sigma\). On a side note, we use `ddof=1`

, because in this context
this estimator has slightly better statistical properties.

Let us draw the fitted density function
(i.e., the PDF of N(160.1, 7.06) which we can compute using
**scipy.stats.norm.pdf**),
on top of the histogram; see Figure 6.2.
We pass `density=True`

to **matplotlib.pyplot.hist**
to normalise the bars’ heights so that
their total area sums to 1.

```
plt.hist(heights, density=True, color="lightgray", edgecolor="black")
x = np.linspace(np.min(heights), np.max(heights), 1000)
plt.plot(x, scipy.stats.norm.pdf(x, mu, sigma), "r--",
label=f"PDF of N({mu:.1f}, {sigma:.2f})")
plt.ylabel("Density")
plt.legend()
plt.show()
```

At first glance, this is a genuinely nice match.
Before proceeding with an overview of the ways to assess
the goodness-of-fit more rigorously, we should praise the potential
benefits of having an idealised *model* of our dataset at our disposal.

### 6.1.2. Data Models Are Useful#

*If* (provided that, assuming that, on condition that)
our sample is a realisation of independent random variables
following a given distribution, or a data analyst judges that such an
approximation might be justified or beneficial, then we have a set of many
numbers *reduced* to merely a few parameters.

In the above case, we might want to risk the statement that data follow the normal distribution (assumption 1) with parameters \(\mu=160.1\) and \(\sigma=7.06\) (assumption 2). Still, the choice of the distribution family is one thing, and the way we estimate the underlying parameters (in our case, we use \(\hat{\mu}\) and \(\hat{\sigma}\)) is another.

This not only saves storage space and computational time, but also – based on what we can learn from a course in probability and statistics (by appropriately integrating the PDF) – we can imply facts such as for normally distributed data:

ca. 68% of (i.e., a

*majority*) women are \(\mu\pm \sigma\) tall (the 1σ rule),ca. 95% of (i.e.,

*most typical*) women are \(\mu\pm 2\sigma\) tall (the 2σ rule),ca. 99.7% of (i.e.,

*almost all*) women are \(\mu\pm 3\sigma\) tall (the 3σ rule).

Also, if we knew that the distribution of heights of men is also normal with some other parameters, we could be able to make some comparisons between the two samples. For example, we could compute the probability that a woman randomly selected from the crowd is taller than a male passer-by.

Furthermore, there is a range of *parametric* (assuming some distribution
family) statistical methods that could *additionally* be used if we assumed
the data normality, e.g., the *t*-test to compare the expected values.

How different manufacturing industries (e.g., clothing) can make use of such models? Are simplifications necessary when dealing with complexity? What are the alternatives?

Important

We should always verify the assumptions of a model that we wish to apply in practice. In particular, we will soon note that incomes are not normally distributed. Therefore, we must not refer to the above 2σ or 3σ rule in their case. A cow neither barks nor can it serve as a screwdriver. Period.

## 6.2. Assessing Goodness-of-Fit#

### 6.2.1. Comparing Cumulative Distribution Functions#

In the previous subsection, we were comparing densities and histograms.
It turns out that there is a better way of assessing the extent to which
a sample deviates from a hypothesised distribution.
Namely, we can measure the discrepancy between
some theoretical *cumulative distribution function* (CDF)
and the empirical one (\(\hat{F}_n\); see Section 4.3.8).

Important

If \(f\) is a PDF, then the corresponding theoretical CDF is defined as \(F(x) = \int_{-\infty}^x f(t)\,dt\), i.e., the probability of the underlying random variable’s taking a value less than or equal to \(x\).

By definition[3], each CDF takes values in the unit interval (\([0,1]\)) and is nondecreasing.

For the normal distribution family,
the values of the theoretical CDF can be computed by calling
**scipy.stats.norm.cdf**; see Figure 6.3.

```
x = np.linspace(np.min(heights), np.max(heights), 1001)
probs = scipy.stats.norm.cdf(x, mu, sigma) # sample the CDF at many points
plt.plot(x, probs, "r--", label=f"CDF of N({mu:.1f}, {sigma:.2f})")
heights_sorted = np.sort(heights)
plt.plot(heights_sorted, np.arange(1, n+1)/n,
drawstyle="steps-post", label="Empirical CDF")
plt.xlabel("$x$")
plt.ylabel("Prob(height $\\leq$ x)")
plt.legend()
plt.show()
```

This *looks* like a superb match.

\(F(b)-F(a)=\int_{a}^b f(t)\,dt\) is the probability of generating a value in the interval \([a, b]\).

Let us empirically verify the 3σ rule:

```
F = lambda x: scipy.stats.norm.cdf(x, mu, sigma)
F(mu+3*sigma) - F(mu-3*sigma)
## 0.9973002039367398
```

Indeed, almost all observations are within \([\mu-3\sigma, \mu+3\sigma]\), if data are normally distributed.

Note

A common way to summarise the discrepancy between the empirical and a given theoretical CDF is by computing the greatest absolute deviation:

where the supremum is a continuous version of the maximum.

It holds:

i.e., \(F\) needs to be probed only at the \(n\) points from the sorted input sample.

```
def compute_Dn(x, F): # equivalent to scipy.stats.kstest(x, F)[0]
Fx = F(np.sort(x))
n = len(x)
k = np.arange(1, n+1) # 1, 2, ..., n
Dn1 = np.max(np.abs((k-1)/n - Fx))
Dn2 = np.max(np.abs(k/n - Fx))
return max(Dn1, Dn2)
Dn = compute_Dn(heights, F)
Dn
## 0.010470976524201148
```

If the difference is *sufficiently[4] small*,
then we can assume that a normal model describes data quite well.
This is indeed the case here: we may estimate the probability
of someone being as tall as any given height
with an error less than about 1.05%.

### 6.2.2. Comparing Quantiles#

A *Q-Q plot* (quantile-quantile or probability plot)
is another graphical method for comparing two distributions.
This time, instead of working with a cumulative distribution function \(F\),
we will be dealing with its (generalised) inverse, i.e.,
the quantile function \(Q\).

Given a CDF \(F\), the corresponding *quantile function* is defined
for any \(p\in(0,1)\) as:

i.e., the smallest \(x\) such that the probability of drawing a value not greater than \(x\) is at least \(p\).

Important

If a CDF \(F\) is continuous, and this is the assumption in the current chapter, then \(Q\) is exactly its inverse, i.e., it holds \(Q(p)=F^{-1}(p)\) for all \(p\in(0, 1)\); compare Figure 6.4.

The theoretical quantiles can be generated
by the **scipy.stats.norm.ppf** function.
Here, *ppf* stands for the percent point function which is another
(yet quite esoteric) name for the above \(Q\).

For instance, in our N(160.1, 7.06)-distributed `heights`

dataset,
\(Q(0.9)\) is the height not exceeded by 90% of the female population.
In other words, only 10% of American women are taller than:

```
scipy.stats.norm.ppf(0.9, mu, sigma)
## 169.18820963937648
```

A Q-Q plot draws a version of sample quantiles as a function of the corresponding theoretical quantiles. The sample quantiles that we introduced in Section 5.1.1.3 are natural estimators of the theoretical quantile function. However, we also mentioned that there are quite a few possible definitions thereof in the literature; compare [53].

For simplicity, instead of using **numpy.quantile**,
we will assume that the \(\frac{i}{n+1}\)-quantile[5]
is equal to \(x_{(i)}\), i.e., the *i*-th smallest value in
a given sample \((x_1,x_2,\dots,x_n)\)
and consider only \(i=1, 2, \dots, n\).

Our simplified setting avoids the problem which arises when the 0- or 1-quantiles of the theoretical distribution, i.e., \(Q(0)\) or \(Q(1)\), are infinite (and this is the case for the normal distribution family).

```
def qq_plot(x, Q):
"""
Draws a Q-Q plot, given:
* x - a data sample (vector)
* Q - a theoretical quantile function
"""
n = len(x)
q = np.arange(1, n+1)/(n+1) # 1/(n+1), 2/(n+2), ..., n/(n+1)
x_sorted = np.sort(x) # sample quantiles
quantiles = Q(q) # theoretical quantiles
plt.plot(quantiles, x_sorted, "o")
plt.axline((x_sorted[n//2], x_sorted[n//2]), slope=1,
linestyle=":", color="gray") # identity line
```

Figure 6.5 depicts the Q-Q plot for our example dataset.

```
qq_plot(heights, lambda q: scipy.stats.norm.ppf(q, mu, sigma))
plt.xlabel(f"Quantiles of N({mu:.1f}, {sigma:.2f})")
plt.ylabel("Sample quantiles")
plt.show()
```

Ideally, the points are expected to be arranged on the \(y=x\) line
(which was added for readability). This would happen if
the sample quantiles matched the theoretical ones perfectly.
In our case, there are small discrepancies[6] in the tails
(e.g., the smallest observation was slightly smaller than expected,
and the largest one was larger than expected),
although it is quite a *normal* behaviour for small samples
and certain distribution families.
Still, we can say that we observe a very good fit.

### 6.2.3. Kolmogorov–Smirnov Test (*)#

To be scientific, we should yearn for some more formal method that will
enable us to test the null hypothesis stating that a given empirical
distribution \(\hat{F}_n\) does not differ *significantly* from the
theoretical continuous CDF \(F\):

The popular goodness-of-fit test by Kolmogorov and Smirnov can give us a conservative interval of the acceptable values of \(\hat{D}_n\) (again: the largest deviation between the empirical and theoretical CDF) as a function of \(n\) (within the framework of frequentist hypothesis testing).

Namely, if the *test statistic* \(\hat{D}_n\) is smaller than some
*critical value* \(K_n\), then we shall deem the difference insignificant.
This is to take into account the fact that reality might deviate
from the ideal. In Section 6.4.4, we mention
that even for samples that truly come from a hypothesised distribution,
there is some inherent variability. We need to be somewhat tolerant.

A good textbook in statistics will tell us (and prove) that,
under the assumption that \(\hat{F}_n\) is the ECDF of a sample
of \(n\) independent variables *really* generated from a continuous CDF \(F\),
the random variable \(\hat{D}_n = \sup_{t\in\mathbb{R}} | \hat{F}_n(t) - F(t) |\)
follows the Kolmogorov distribution with parameter \(n\)
(available via **scipy.stats.kstwo**).

In other words, if we generate many samples of length \(n\) from \(F\), and compute \(\hat{D}_n\)s for each of them, we expect it to be distributed like in Figure 6.6.

The choice \(K_n\) involves a trade-off between our desire to:

accept the null hypothesis when it is true (data

*really*come from \(F\)), andreject it when it is false (data follow some other distribution, i.e., the difference is significant enough).

These two needs are, unfortunately, mutually exclusive.

In practice, we assume some fixed upper bound (*significance level*)
for making the former kind of mistake, the so-called *type-I error*.
A nicely conservative (in a good way[7]) value that we
suggest employing is \(\alpha=0.001=0.1\%\), i.e., only 1 out of 1,000 samples
that really come from \(F\) will be rejected as not coming from \(F\).

Such a \(K_n\) may be determined by considering the inverse of the CDF of the Kolmogorov distribution, \(\Xi_n\). Namely, \(K_n=\Xi_n^{-1}(1-\alpha)\):

```
alpha = 0.001 # significance level
scipy.stats.kstwo.ppf(1-alpha, n)
## 0.029964456376393188
```

In our case \(\hat{D}_n < K_n\), because \(0.01047 < 0.02996\).
We conclude that our empirical (`heights`

) distribution
does not differ significantly (at significance level \(0.1\%\))
from the assumed one, i.e., N(160.1, 7.06).
In other words, we do not have enough evidence against the statement
that data are normally distributed. It is the presumption of innocence:
they are normal enough.

We will go back to this discussion in Section 6.4.4 and Section 12.2.6.

## 6.3. Other Noteworthy Distributions#

### 6.3.1. Log-Normal Distribution#

We say that a sample is *log-normally distributed*,
if its logarithm is normally distributed.

In particular, Figure 6.7 reveals that this is the case for the most[8] of the UK taxpayers’ incomes.

```
income = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/uk_income_simulated_2020.txt")
plt.hist(np.log(income), bins=30, color="lightgray", edgecolor="black")
plt.ylabel("Count")
plt.show()
```

Let us thus proceed with the fitting of a log-normal model, LN(μ, σ). The fitting process is similar to the normal case, but this time we determine the mean and standard deviation based on the logarithms of data:

```
lmu = np.mean(np.log(income))
lsigma = np.std(np.log(income), ddof=1)
lmu, lsigma
## (10.314409794364623, 0.5816585197803816)
```

We need to take note of the fact that **scipy.stats.lognorm**
encodes the distribution via the parameter \(s\) equal to \(\sigma\)
and scale equal to \(e^\mu\). Computing the PDF at different points
must done as follows:

```
x = np.linspace(np.min(income), np.max(income), 101)
fx = scipy.stats.lognorm.pdf(x, s=lsigma, scale=np.exp(lmu))
```

Figure 6.8 depicts the fitted probability density function together with the histograms on the log- and original scale.

```
plt.subplot(1, 2, 1)
logbins = np.geomspace(np.min(income), np.max(income), 31)
plt.hist(income, bins=logbins, density=True,
color="lightgray", edgecolor="black")
plt.plot(x, fx, "r--")
plt.xscale("log") # log-scale on the x-axis
plt.ylabel("Density")
plt.subplot(1, 2, 2)
plt.hist(income, bins=30, density=True, color="lightgray", edgecolor="black")
plt.plot(x, fx, "r--", label=f"PDF of LN({lmu:.1f}, {lsigma:.2f})")
plt.ylabel("Density")
plt.legend()
plt.show()
```

Overall, this fit is not too bad. Nonetheless, we are only dealing with a sample of 1,000 households; the original UK Office of National Statistics data could tell us more about the quality of this model in general, but it is beyond the scope of our simple exercise.

Furthermore, Figure 6.9 gives the quantile-quantile plot on a double logarithmic scale for the above log-normal model. Additionally, we (empirically) verify the hypothesis of normality (using a “normal” normal distribution, not its “log” version).

```
plt.subplot(1, 2, 1)
qq_plot( # see above for the definition
income,
lambda q: scipy.stats.lognorm.ppf(q, s=lsigma, scale=np.exp(lmu))
)
plt.xlabel(f"Quantiles of LN({lmu:.1f}, {lsigma:.2f})")
plt.ylabel("Sample quantiles")
plt.xscale("log")
plt.yscale("log")
plt.subplot(1, 2, 2)
mu = np.mean(income)
sigma = np.std(income, ddof=1)
qq_plot(income, lambda q: scipy.stats.norm.ppf(q, mu, sigma))
plt.xlabel(f"Quantiles of N({mu:.1f}, {sigma:.2f})")
plt.show()
```

Graphically compare the empirical CDF for `income`

and the theoretical CDF of LN(10.3, 0.58).

(*) Perform the Kolmogorov–Smirnov goodness-of-fit test as in Section 6.2.3, to verify that the hypothesis of log-normality is not rejected at the \(\alpha=0.001\) significance level. At the same time, the income distribution significantly differs from a normal one.

The hypothesis that our data follow a normal distribution is most likely false. On the other hand, the log-normal model, might be quite adequate. It again reduced the whole dataset to merely two numbers, μ and σ, based on which (and probability theory), we may deduce that:

the expected average (mean) income is \(e^{\mu + \sigma^2/2}\),

median is \(e^\mu\),

most probable one (mode) in \(e^{\mu-\sigma^2}\),

etc.

Note

Recall again that for skewed distributions such as this one,
reporting the mean might be misleading.
This is why *most* people get angry when they read the news
about the prospering economy (“yeah, we’d like to see that
kind of money in our pockets”). Hence, it is not only μ that matters,
it is also σ that quantifies the discrepancy between the rich and the poor
(too much inequality is bad, but also too much uniformity is
to be avoided).

For a normal distribution, the situation is vastly different, because the mean, the median, and the most probable outcomes tend to be the same – the distribution is symmetric around μ.

What is the fraction of people with earnings
below the mean in our LN(10.3, 0.58) model?
Hint: use **scipy.stats.lognorm.cdf** to get the answer.

### 6.3.2. Pareto Distribution#

Consider again the dataset on the populations of the US cities in the 2000 US Census:

```
cities = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/other/us_cities_2000.txt")
len(cities), sum(cities) # number of cities, total population
## (19447, 175062893.0)
```

Figure 6.10 gives the histogram of the city sizes with the populations on the log-scale. It kind of looks like a log-normal distribution again, which the reader can inspect themself when they are feeling playful.

```
logbins = np.geomspace(np.min(cities), np.max(cities), 21)
plt.hist(cities, bins=logbins, color="lightgray", edgecolor="black")
plt.xscale("log")
plt.ylabel("Count")
plt.show()
```

This time, however, will be interested in not what is *typical*,
but what is in some sense *anomalous* or *extreme*.
Let us look again at the *truncated* version of the city size
distribution by considering the cities with 10,000 or more inhabitants
(i.e., we will only study the right tail of the original data,
just like in Section 4.3.7).

```
s = 10_000
large_cities = cities[cities >= s]
len(large_cities), sum(large_cities) # number of cities, total population
## (2696, 146199374.0)
```

Plotting the above on a double logarithmic scale
can be performed by calling **plt.yscale**`("log")`

,
which is left as an exercise.
Anyway, doing so will lead to a similar picture
as in Figure 6.11 below.
This reveals something remarkable: the bar tops on the
double log-scale are arranged more or less in a straight line.
There are many datasets
that exhibit this behaviour; we say that they follow a *power law*
(power in the arithmetic sense, not social one);
see [14, 67] for discussion.

Let us introduce the *Pareto distribution* family which has a prototypical
power law-like density. It is identified by two parameters:

the (what

**scipy**calls it) scale parameter \(s>0\) is equal to the shift from 0,the shape parameter, \(\alpha>0\), controls the slope of the said line on the double log-scale.

The probability density function of P(α, s) is given for \(x\ge s\) by:

and \(f(x)=0\) otherwise.

\(s\) is usually taken as the sample minimum (i.e., 10,000 in our case). \(\alpha\) can be estimated through the reciprocal of the mean of the scaled logarithms of our observations:

```
alpha = 1/np.mean(np.log(large_cities/s))
alpha
## 0.9496171695997675
```

Figure 6.11 allows us to compare the theoretical density and an empirical histogram on the log-scale.

```
logbins = np.geomspace(s, np.max(large_cities), 21) # bin boundaries
plt.hist(large_cities, bins=logbins, density=True,
color="lightgray", edgecolor="black")
plt.plot(logbins, scipy.stats.pareto.pdf(logbins, alpha, scale=s),
"r--", label=f"PDF of P({alpha:.3f}, {s})")
plt.xscale("log")
plt.yscale("log")
plt.ylabel("Density")
plt.legend()
plt.show()
```

Figure 6.12 gives the corresponding Q-Q plot on a double logarithmic scale.

```
qq_plot( # defined above
large_cities,
lambda q: scipy.stats.pareto.ppf(q, alpha, scale=s)
)
plt.xlabel(f"Quantiles of P({alpha:.3f}, {s})")
plt.ylabel("Sample quantiles")
plt.xscale("log")
plt.yscale("log")
plt.show()
```

We see that the populations of the largest cities are overestimated. The model could be better, but the cities are still growing, right?

(*) It might also be interesting to see how well we can predict the
probability of a randomly selected city being at least a given size.
Let us denote with \(S(x)=1-F(x)\) the *complementary
cumulative distribution function* (CCDF; sometimes referred to as
the survival function),
and with \(\hat{S}_n(x)=1-\hat{F}_n(x)\) its empirical version.
Figure 6.13 compares the empirical and the fitted CCDFs
with probabilities on the linear- and log-scale.

```
x = np.geomspace(np.min(large_cities), np.max(large_cities), 1001)
probs = scipy.stats.pareto.cdf(x, alpha, scale=s)
n = len(large_cities)
for i in [1, 2]:
plt.subplot(1, 2, i)
plt.plot(x, 1-probs, "r--", label=f"CCDF of P({alpha:.3f}, {s})")
plt.plot(np.sort(large_cities), 1-np.arange(1, n+1)/n,
drawstyle="steps-post", label="Empirical CCDF")
plt.xlabel("$x$")
plt.xscale("log")
plt.yscale(["linear", "log"][i-1])
if i == 1:
plt.ylabel("Prob(city size > x)")
plt.legend()
plt.show()
```

In terms of the maximal absolute distance between the two functions, \(\hat{D}_n\), from the left plot we see that the fit looks fairly good (let us stress that the log-scale overemphasises the relatively minor differences in the right tail and should not be used for judging the value of \(\hat{D}_n\)).

That the Kolmogorov–Smirnov goodness-of-fit test rejects the hypothesis of Paretianity (at a significance level \(0.1\%\)) is left as an exercise to the reader.

### 6.3.3. Uniform Distribution#

Consider the Polish *Lotto* lottery, where six numbered balls
\(\{1,2,\dots,49\}\) are drawn without replacement from an urn.
We have a dataset that summarises the number of times
each ball has been drawn in all the games in the period 1957–2016.

```
lotto = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/lotto_table.txt")
lotto
## array([720., 720., 714., 752., 719., 753., 701., 692., 716., 694., 716.,
## 668., 749., 713., 723., 693., 777., 747., 728., 734., 762., 729.,
## 695., 761., 735., 719., 754., 741., 750., 701., 744., 729., 716.,
## 768., 715., 735., 725., 741., 697., 713., 711., 744., 652., 683.,
## 744., 714., 674., 654., 681.])
```

Each event seems to occur more or less with the same probability.
Of course, the numbers on the balls are integer,
but in our idealised scenario, we may try modelling this dataset
using a continuous *uniform distribution*,
which yields arbitrary real numbers on a given interval *(a, b)*,
i.e., between some *a* and *b*.
We denote such a distribution with U(*a*, *b*). It
has the probability density function given for \(x\in(a, b)\) by:

and \(f(x)=0\) otherwise.

Notice that **scipy.stats.uniform**
uses parameters `a`

and `scale`

equal to \(b-a\) instead.

In our case, it makes sense to set \(a=1\) and \(b=50\) and interpret an outcome like 49.1253 as representing the 49th ball (compare the notion of the floor function, \(\lfloor x\rfloor\)).

```
x = np.linspace(1, 50, 1001)
plt.bar(np.arange(1, 50), width=1, height=lotto/np.sum(lotto),
color="lightgray", edgecolor="black", alpha=0.8, align="edge")
plt.plot(x, scipy.stats.uniform.pdf(x, 1, scale=49), "r--",
label="PDF of U(1, 50)")
plt.ylim(0, 0.025)
plt.legend()
plt.show()
```

Visually, see Figure 6.14, this model makes much sense, but again, some more rigorous statistical testing would be required to determine if someone has not been tampering with the lottery results, i.e., if data does not deviate from the uniform distribution significantly.

Unfortunately, we cannot use the Kolmogorov–Smirnov test in the version defined above, because data are not continuous. See, however, Section 11.4.3 for the Pearson chi-squared test that is applicable here.

Does playing lotteries
and engaging in gambling make *rational* sense at all,
from the perspective of an individual player?
Well, we see that 16 is the most frequently occurring outcome
in *Lotto*, maybe there’s some magic in it?
Also, some people sometimes became millionaires, right?

Note

In data modelling (e.g., Bayesian statistics), sometimes a uniform distribution is chosen as a placeholder for “we know nothing about a phenomenon, so let us just assume that every event is equally likely”. Nonetheless, it is quite fascinating that the real world tends to be structured after all. Emerging patterns are plentiful, most often they are far from being uniformly distributed. Even more strikingly, they are subject to quantitative analysis.

### 6.3.4. Distribution Mixtures (*)#

Some datasets may fail to fit through simple models such as the ones described above. It may sometimes be due to their non-random behaviour: statistics gives just one means to create data idealisations, we also have partial differential equations, approximation theory, graphs and complex networks, agent-based modelling, and so forth, which might be worth giving a study (and then try).

Another reason may be that what we observe is in fact a *mixture*
(creative combination) of simpler processes.

The dataset representing the December 2021 hourly averages pedestrian counts near the Southern Cross Station in Melbourne might be a good instance of such a scenario; compare Figure 4.5.

```
peds = np.genfromtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/southern_cross_station_peds_2019_dec.txt")
```

It might not be a bad idea to try to fit a probabilistic (convex) combination of three normal distributions \(f_1\), \(f_2\), \(f_3\), corresponding to the morning, lunch-time, and evening pedestrian count peaks. This yields the PDF:

for some coefficients \(w_1,w_2,w_3\ge 0\) such that \(w_1+w_2+w_3=1\).

Figure 6.15 depicts a mixture of N(8, 1), N(12, 1), and N(17, 2) with the corresponding weights of 0.35, 0.1, and 0.55. This dataset is quite coarse-grained (we only have 24 bar heights at our disposal). Consequently, the estimated coefficients should be taken with a pinch of chilli pepper.

```
plt.bar(np.arange(24), width=1, height=peds/np.sum(peds),
color="lightgray", edgecolor="black", alpha=0.8)
x = np.arange(0, 25, 0.1)
p1 = scipy.stats.norm.pdf(x, 8, 1)
p2 = scipy.stats.norm.pdf(x, 12, 1)
p3 = scipy.stats.norm.pdf(x, 17, 2)
p = 0.35*p1 + 0.1*p2 + 0.55*p3 # weighted combination of 3 densities
plt.plot(x, p, "r--", label="PDF of a normal mixture")
plt.legend()
plt.show()
```

Important

It will frequently be the case in data wrangling
that more complex entities (models, methods) will be arising as combinations
of simpler (primitive) components.
This is why we should spend a great deal of time
studying the *fundamentals*.

Note

Some data clustering techniques
(in particular, the *k*-means algorithm that we briefly discuss
later in this course)
could be used to split a data sample into disjoint chunks
corresponding to different mixture components.

Also, it might be the case that the mixture components can in fact be explained by another categorical variable that divides the dataset into natural groups; compare Chapter 12.

## 6.4. Generating Pseudorandom Numbers#

A probability distribution is useful not only for describing a dataset. It also enables us to perform many experiments on data that we do not currently have, but we might obtain in the future, to test various scenarios and hypotheses.

To do this, we can generate a random sample of independent (not related to each other) observations.

### 6.4.1. Uniform Distribution#

When most people say *random*, they implicitly mean
*uniformly distributed*. For example:

```
np.random.rand(5)
## array([0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897])
```

gives five observations sampled independently from the uniform distribution on the unit interval, i.e., U(0, 1).

The same with **scipy**, but this time the support will be
*(-10, 15)*.

```
scipy.stats.uniform.rvs(-10, scale=25, size=5) # from -10 to -10+25
## array([ 0.5776615 , 14.51910496, 7.12074346, 2.02329754, -0.19706205])
```

Alternatively, we could do that ourselves by shifting
and scaling the output of the random number generator
on the unit interval using the formula
**numpy.random.rand**`(5)*25-10`

.

### 6.4.2. Not Exactly Random#

We generate numbers using a computer, which is purely
deterministic. Hence, we shall refer to them as *pseudorandom*
or random-like ones (albeit they are non-distinguishable from truly random,
when subject to rigorous tests for randomness).

To prove it, we can set the initial state of the generator
(the *seed*) via some number and see what values are produced:

```
np.random.seed(123) # set seed
np.random.rand(5)
## array([0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897])
```

Then, we set the seed once again via the same number and see how “random” the next values are:

```
np.random.seed(123) # set seed
np.random.rand(5)
## array([0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897])
```

This enables us to perform completely *reproducible* numerical
experiments, and this is a very good feature: truly scientific
inquiries should lead to identical results under the same conditions.

Note

If we do not set the seed manually,
it will be initialised based on the current wall time, which is
different every… time. As a result, the numbers will *seem* random to us.

Many Python packages that we will be using in the future,
including **pandas** and **sklearn**, rely
on **numpy**’s random number generator.
We will become used to calling **numpy.random.seed**
to make them predictable.

Additionally, some of them
(e.g., **sklearn.model_selection.train_test_split**
or **pandas.DataFrame.sample**) are equipped with the `random_state`

argument, which behaves as if we *temporarily* changed
the seed (for just one call to that function). For instance:

```
scipy.stats.uniform.rvs(size=5, random_state=123)
## array([0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897])
```

This gives the same sequence as above.

### 6.4.3. Sampling from Other Distributions#

Generating data from other distributions is possible too;
there are many **rvs** methods implemented
in **scipy.stats**.
For example, here is a sample from N(100, 16):

```
scipy.stats.norm.rvs(100, 16, size=3, random_state=50489)
## array([113.41134015, 46.99328545, 157.1304154 ])
```

Pseudorandom deviates from the *standard* normal distribution,
i.e., N(0, 1), can also be generated using **numpy.random.randn**.
As N(100, 16) is a scaled and shifted version thereof,
the above is equivalent to:

```
np.random.seed(50489)
np.random.randn(3)*16 + 100
## array([113.41134015, 46.99328545, 157.1304154 ])
```

Important

Conclusions based on simulated data are trustworthy, because they cannot be manipulated. Or can they?

The pseudorandom number generator’s seed used above,
`50489`

, is quite suspicious. It might suggest that someone
wanted to *prove* some point (in this case, the violation
of the 3σ rule).

This is why we recommend sticking to only one seed most of the time,
e.g., `123`

, or – when performing simulations – setting
consecutive seeds for each iteration: `1`

, `2`

, ….

Generate 1,000 pseudorandom numbers from the log-normal distribution and draw a histogram thereof.

Note

(*) Having a good pseudorandom number generator from the uniform distribution on the unit interval is crucial, because sampling from other distributions usually involves transforming independent U(0, 1) variates.

For instance, realisations of random variables following any continuous
cumulative distribution function \(F\) can be constructed through
the *inverse transform sampling* (see [37, 77]):

Generate a sample \(x_1,\dots,x_n\) independently from U(0, 1).

Transform each \(x_i\) by applying the quantile function, \(y_i=F^{-1}(x_i)\).

Now \(y_1,\dots,y_n\) follows the CDF \(F\).

(*) Generate 1,000 pseudorandom numbers from the log-normal distribution using inverse transform sampling.

(**) Generate 1,000 pseudorandom numbers from the distribution mixture discussed in Section 6.3.4.

### 6.4.4. Natural Variability#

Even a sample truly generated from a specific distribution will deviate from it, sometimes considerably. Such effects will be especially visible for small sample sizes, but they usually disappear[9] when the availability of data increases.

For example, Figure 6.16 depicts the histograms of nine different samples of size 100, all drawn independently from the standard normal distribution.

```
plt.figure(figsize=(plt.rcParams["figure.figsize"][0], )*2) # width=height
for i in range(9):
plt.subplot(3, 3, i+1)
sample = scipy.stats.norm.rvs(size=100, random_state=i+1)
plt.hist(sample, density=True, bins=10,
color="lightgray", edgecolor="black")
plt.ylabel(None)
plt.xlim(-4, 4)
plt.ylim(0, 0.6)
plt.legend()
plt.show()
```

There is some ruggedness in the bar sizes that a naïve observer might try to interpret as something meaningful. A competent data scientist must train their eye to ignore such impurities (but should always be ready to detect those which are worth attention). In this case, they are only due to random effects.

Repeat the above experiment for samples of sizes 10, 1,000, and 10,000.

(*) Using a simple Monte Carlo simulation, we can verify (approximately) that the Kolmogorov–Smirnov goodness-of-fit test introduced in Section 6.2.3 has been calibrated properly, i.e., that for samples that really follow the assumed distribution, the null hypothesis is rejected only in ca. 0.1% of the cases.

Let us say we are interested in the null hypothesis referencing the standard normal distribution, N(0, 1), and sample size \(n=100\). We need to generate many (we assume 10,000 below) such samples for each of which we compute and store the maximal absolute deviation from the theoretical CDF, i.e., \(\hat{D}_n\).

```
n = 100
distrib = scipy.stats.norm(0, 1) # assumed distribution - N(0, 1)
Dns = []
for i in range(10000): # increase this for better precision
x = distrib.rvs(size=n, random_state=i+1) # really follows distrib
Dns.append(compute_Dn(x, distrib.cdf))
Dns = np.array(Dns)
```

Now let us compute the proportion of cases which lead to \(\hat{D}_n\) greater than the critical value \(K_n\):

```
len(Dns[Dns >= scipy.stats.kstwo.ppf(1-0.001, n)]) / len(Dns)
## 0.0016
```

In theory, this should be equal to 0.001. But our values are necessarily approximate, because we rely on randomness. Increasing the number of trials from 10,000 to, say, 1,000,000 will make the above estimate more precise.

It is also worth checking out that
the density histogram of `Dns`

resembles the Kolmogorov distribution
that we can compute via **scipy.stats.kstwo.pdf**.

(*) It might also be interesting to check out the test’s *power*,
i.e., the probability that when the null hypothesis is false,
it will actually be rejected.
Modify the above code in such a way that `x`

in the **for** loop
is not generated from N(0, 1), but N(0.1, 1), N(0.2, 1), etc.,
and check the proportion of cases
where we deem the sample distribution different from N(0, 1).
Small differences in the location parameter \(\mu\) are usually
ignored, and this improves with sample size \(n\).

### 6.4.5. Adding Jitter (White Noise)#

We mentioned that measurements might be subject to observational
error. Rounding can also occur as early as
the data collection phase. In particular, our `heights`

dataset is precise up to 1 fractional digit.
However, in statistics, when we say that data follow
a continuous distribution, the probability of having two identical
values in a sample is 0. Therefore, some data analysis methods
might assume that there are no ties in the input vector, i.e.,
all values are unique.

The easiest way to deal with such numerical inconveniences is to add some white noise with the expected value of 0, either uniformly or normally distributed.

For example, for `heights`

it makes sense to add some jitter from
U[-0.05, 0.05]:

```
heights_jitter = heights + (np.random.rand(len(heights))*0.1-0.05)
heights_jitter[:6] # preview
## array([160.21704623, 152.68870195, 161.24482407, 157.3675293 ,
## 154.61663465, 144.68964596])
```

Adding noise also might be performed for aesthetic reasons, e.g., when drawing scatterplots.

## 6.5. Further Reading#

For an excellent general introductory course on probability and statistics; see [38, 40] and also [78]. More advanced students are likely to enjoy other classics such as [4, 7, 17, 26]. To go beyond the fundamentals, check out [24]. Topics in random number generation are covered in [37, 56, 77].

For a more comprehensive introduction to exploratory data analysis, see the classical books by Tukey [86, 87] and Tufte [85].

We took the logarithm of the log-normally distributed incomes and obtained a normally distributed sample. In statistical practice, it is not rare to apply different non-linear transforms of the input vectors at the data preprocessing stage (see, e.g., Section 9.2.6). In particular, the Box–Cox (power) transform [10] is of the form \(x\mapsto \frac{x^\lambda-1}{\lambda}\) for some \(\lambda\). Interestingly, in the limit as \(\lambda\to 0\), this formula yields \(x\mapsto \log x\) which is exactly what we were applying in this chapter.

[14, 67] give a nice overview of the power-law-like behaviour of some “rich” or otherwise extreme datasets. It is worth noting that the logarithm of a Paretian sample divided by the minimum follows an exponential distribution (which we discuss in Chapter 16). For a comprehensive catalogue of statistical distributions, their properties, and relationships between them, see [27].

## 6.6. Exercises#

Why is the notion of the mean income confusing the general public?

When manually setting the seed of a random number generator makes sense?

Given a log-normally distributed sample `x`

, how can we turn it
to a normally distributed one, i.e., `y=`

**f**`(x)`

,
with **f** being… what?

What is the 3σ rule for normally distributed data?

(*) How can we verify graphically if a sample follows a hypothesised theoretical distribution?

(*) Explain the meaning of type I error, significance level, and a test’s power.