Minimalist Data Wrangling with Python

Marek Gagolewski

doi:10.5281/zenodo.6451068

2. Scalar types and control structures in Python¶

This open-access textbook is, and will remain, freely available for everyone’s enjoyment (also in PDF; a paper copy can also be ordered). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out Deep R Programming [36] too.

In this part, we introduce the basics of the Python language itself. As Python is a general-purpose tool, various packages supporting data wrangling operations are provided as third-party extensions. In further chapters, building upon the concepts discussed here, we will be able to use numpy, scipy, matplotlib, pandas, seaborn, and other packages with a healthy degree of confidence.

2.1. Scalar types¶

Scalars are single or atomic values. Their five ubiquitous types are:

bool – logical,
int, float, complex – numeric,
str – character.

Let’s discuss them in detail.

2.1.1. Logical values¶

There are only two possible logical (Boolean) values: True and False. By typing:

True
## True

we instantiated the former. This is a dull exercise unless we have fallen into the undermentioned pitfall.

Important

Python is a case-sensitive language. Writing “TRUE” or “true” instead of “True” is an error.

2.1.2. Numeric values¶

The three numeric scalar types are:

int – integers, e.g., 1, -42, 1_000_000;
float – floating-point (real) numbers, e.g., -1.0, 3.14159, 1.23e-4;
(*) complex – complex numbers, e.g., 1+2j.

In practice, numbers of the type int and float often interoperate seamlessly. We usually do not have to think about them as being of distinct types. On the other hand, complex numbers are rather infrequently used in data science applications (but see Section 4.1.4).
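As a brief illustration of how int and float interoperate, consider the following sketch. The variable names a, b, and c are ours and serve no purpose beyond this example.

```python
a = 3        # an int
b = 0.5      # a float
c = a + b    # mixing the two types yields a float
print(c, type(c))
## 3.5 <class 'float'>
print(2 == 2.0)  # equal values, even though the types differ
## True
print(type(2), type(2.0))
## <class 'int'> <class 'float'>
```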

Exercise 2.1

1.23e-4 and 9.8e5 are examples of numbers in scientific notation, where “e” stands for “… times 10 to the power of …”. Additionally, 1_000_000 is a decorated (more human-readable) version of 1000000. Use the print function to check out their values.

2.1.2.1. Arithmetic operators¶

Here is the list of available arithmetic operators:

1 + 2    # addition
## 3
1 - 7    # subtraction
## -6
4 * 0.5  # multiplication
## 2.0
7 / 3    # float division (results are always of the type float)
## 2.3333333333333335
7 // 3   # integer division
## 2
7 % 3    # division remainder
## 1
2 ** 4   # exponentiation
## 16
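Integer division and the division remainder are linked to each other: for nonzero b, (a // b) * b + a % b gives back a. The sketch below checks this for a negative dividend, where the behaviour may be less obvious. The names a, b, q, and r are chosen for illustration only.

```python
a = -7
b = 3
q = a // b   # integer division rounds towards minus infinity
r = a % b    # the remainder has the same sign as the divisor
print(q, r)
## -3 2
print(q * b + r == a)
## True
```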

The precedence of these operators is quite predictable, e.g., exponentiation has higher priority than multiplication and division, which in turn bind more strongly than addition and subtraction. Thus,

1 + 2 * 3 ** 4
## 163

is the same as 1+(2*(3**4)) and is different from, e.g., ((1+2)*3)**4.
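When in doubt, we can check how an expression is evaluated by comparing it with its fully parenthesised version. Here is a small sketch. The last two lines show a case that tends to surprise newcomers: exponentiation also binds more strongly than the unary minus.

```python
print(1 + 2 * 3 ** 4 == 1 + (2 * (3 ** 4)))
## True
print(((1 + 2) * 3) ** 4)
## 6561
print(-2 ** 2)    # read as -(2 ** 2)
## -4
print((-2) ** 2)
## 4
```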

Note

Keep in mind that computers’ floating-point arithmetic is precise only up to about 16 significant digits. As a consequence, the result of 7/3 is only approximate; hence the 2.3333333333333335 above. We will discuss this topic in Section 5.5.6.
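A classic way to see this limitation in action is to add 0.1 and 0.2. The sketch below also uses math.isclose from the standard library, which is one possible way of comparing floats up to a tolerance; it is not necessarily the approach taken later in this course.

```python
import math

print(0.1 + 0.2)
## 0.30000000000000004
print(0.1 + 0.2 == 0.3)
## False
print(math.isclose(0.1 + 0.2, 0.3))  # compare with a tolerance instead
## True
```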

2.1.2.2. Creating named variables¶

A named variable can be introduced through the assignment operator, `=`. It can store an arbitrary Python object which we can recall at any later time. Names of variables can include any lower- and uppercase letters, underscores, and (except at the beginning) digits.

To make our code easier to understand for humans, it is best to use names that are self-explanatory, like:

x = 7  # read: let `x` from now on be equal to 7 (or: `x` becomes 7)

“x” is a great name: it means something of general interest in mathematics. Let’s print out the value it is bound to:

print(x)  # or just `x`
## 7

New variables can be created easily based on existing ones:

my_2nd_variable = x/3 - 2  # creates `my_2nd_variable`
print(my_2nd_variable)
## 0.3333333333333335

Existing variables may be rebound to any other value freely:

x = x/3  # let the new `x` be equal to the old `x` (7) divided by 3
print(x)
## 2.3333333333333335

Exercise 2.2

Define two named variables height (in centimetres) and weight (in kilograms). Determine the corresponding body mass index (BMI).

Note

(*) Augmented assignments are also available. For example:

x *= 3
print(x)
## 7.0

In this context, the foregoing is equivalent to x = x*3. In other words, it creates a new object. Nevertheless, in some scenarios, augmented assignments may modify the objects they act upon in place; compare Section 3.5.
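Other arithmetic operators have their augmented counterparts too. A short sketch, using a fresh variable y (a name chosen for illustration only) so as not to interfere with x:

```python
y = 10
y += 5   # the same as y = y + 5
print(y)
## 15
y -= 3   # the same as y = y - 3
print(y)
## 12
y **= 2  # the same as y = y ** 2
print(y)
## 144
y /= 12  # the same as y = y / 12; float division yields a float
print(y)
## 12.0
```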

2.1.3. Character strings¶

Character strings (objects of the type str) store text data. They are created using apostrophes or double quotes:

print("spam, spam, #, bacon, and spam")
## spam, spam, #, bacon, and spam
print('Cześć! ¿Qué tal?')
## Cześć! ¿Qué tal?
print('"G\'day, how\'s it goin\'," he asked.\\\n"All good," she responded.')
## "G'day, how's it goin'," he asked.\
## "All good," she responded.

We see some examples of escape sequences here:

“\'” is a way to include an apostrophe in an apostrophe-delimited string,
“\\” enters a backslash,
“\n” inputs a newline character.
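Each escape sequence denotes a single character, even though we type two symbols to enter it. One way to convince ourselves of this is to count the characters with the built-in len function, which has not been introduced above and is used here only for illustration:

```python
print(len("a\nb"))  # three characters: a, newline, b
## 3
print(len("\\"))    # a single backslash
## 1
print(len('\''))    # a single apostrophe
## 1
```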

Multiline strings are created using three apostrophes or double quotes:

"""
spam\\spam
tasty\t"spam"
lovely\t'spam'
"""
## '\nspam\\spam\ntasty\t"spam"\nlovely\t\'spam\'\n'

Exercise 2.3

Call the print function on the above objects to reveal the meaning of the included escape sequences.

Important

Many string operations are available, e.g., for formatting and pattern searching. They are especially important in the art of data wrangling as information often arrives in textual form. Chapter 14 covers this topic in detail.

2.1.3.1. F-strings (formatted string literals)¶

F-strings are formatted string literals:

x = 2
f"x is equal to {x}"
## 'x is equal to 2'

Notice the “f” prefix. The “{x}” part was replaced with the value stored in the x variable.

The formatting of items can be fine-tuned. As usual, it is best to study the documentation in search of noteworthy features. Here, let’s just mention that we will frequently be referring to placeholders like “{value:width}” and “{value:width.precision}”, which specify the field width and the number of fractional digits of a number. This way, we can output a series of values aesthetically aligned one beneath another.

π = 3.14159265358979323846
e = 2.71828182845904523536
print(f"""
π   = {π:10.8f}
e   = {e:10.8f}
πe² = {(π*e**2):10.8f}
""")
## 
## π   = 3.14159265
## e   = 2.71828183
## πe² = 23.21340436

“10.8f” means that a value should be formatted as a float, be of width at least ten characters (text columns), and use eight fractional digits.
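The same placeholders work for strings, in which case only the width is normally of interest. Below is a minimal sketch which aligns two made-up items in columns. The names and prices are hypothetical. Note that strings are padded on the right by default, whereas numbers are padded on the left.

```python
name1 = "spam"
price1 = 3.5
name2 = "bacon"
price2 = 12.25
line1 = f"{name1:10}{price1:8.2f}"  # width 10, then width 8 with 2 digits
line2 = f"{name2:10}{price2:8.2f}"
print(line1)
## spam          3.50
print(line2)
## bacon        12.25
```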

2.2. Calling built-in functions¶

We have a few base functions at our disposal. For instance, to round Euler’s number e to two decimal digits, we can call:

e = 2.718281828459045
round(e, 2)
## 2.72

Exercise 2.4

Call help("round") to access the function’s manual. Note that the second argument, called ndigits, which we set to 2, defaults to None. Check what happens when we omit it during the call.

2.2.1. Positional and keyword arguments¶

The round function has two parameters, number and ndigits. Thus, the following calls are equivalent:

print(
    round(e, 2),                 # two arguments matched positionally
    round(e, ndigits=2),         # positional and keyword argument
    round(number=e, ndigits=2),  # two keyword arguments
    round(ndigits=2, number=e)   # the order does not matter for keyword args
)
## 2.72 2.72 2.72 2.72

Verifying that no other call scheme is permitted is left as an exercise, i.e., positionally matched arguments must be listed before the keyword ones.
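Other built-in functions follow the same rules. As an illustration, here is a sketch with int (which converts a string to an integer in a given base) and print. Neither function's parameters have been discussed above, so treat this merely as a taste of what the documentation reveals.

```python
print(int("101", base=2))  # one positional and one keyword argument
## 5
print(int("ff", 16))       # both arguments matched positionally
## 255
print(1, 2, 3, sep=", ", end="!\n")  # `sep` and `end` must be given by name
## 1, 2, 3!
```

By contrast, a call such as round(ndigits=2, e) is rejected as a syntax error because a positional argument follows a keyword one.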

2.2.2. Modules and packages¶

Python modules and packages (which are collections of modules) define thousands of additional functions. For example, math features the most common mathematical routines:

import math   # the math module must be imported before we can use it
print(math.log(2.718281828459045))  # the natural logarithm (base e)
## 1.0
print(math.floor(-7.33))  # the floor function
## -8
print(math.sin(math.pi))  # sin(pi) equals 0 (with small numeric error)
## 1.2246467991473532e-16

See the official documentation for the comprehensive list of objects available. On a side note, all floating-point computations in any programming language are subject to round-off errors and other inaccuracies. This is why the result of \(\sin\pi\) is not exactly 0, but some value very close thereto. We will elaborate on this topic in Section 5.5.6.

Packages can be given aliases, for the sake of code readability or due to our being lazy. For instance, in Chapter 4 we will get used to importing the numpy package under the np alias:

import numpy as np

And now, instead of writing, for example, numpy.random.rand(), we can call:

np.random.rand()  # a pseudorandom value in [0.0, 1.0)
## 0.6964691855978616
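Aliases are not limited to third-party packages. The sketch below applies the same mechanism to the standard math module and also shows the from ... import ... form, which brings selected objects into scope so that they can be used without a prefix. The alias m is an arbitrary choice made for this example.

```python
import math as m           # the whole module under a shorter alias
from math import pi, sqrt  # only the selected objects

print(m.floor(2.7))
## 2
print(sqrt(16))
## 4.0
print(pi == m.pi)  # both names refer to the same value
## True
```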

2.2.3. Slots and methods¶

Python is an object-orientated programming language. Each object is an instance of some class whose name we can reveal by calling the type function:

x = 1+2j
type(x)
## <class 'complex'>

Important

Classes define two kinds of attributes:

slots – associated data,
methods – associated functions.

Exercise 2.5

Call help("complex") to reveal that the complex class defines, amongst others, the conjugate method and the real and imag slots.

Here is how we can read the two slots:

print(x.real)  # access slot `real` of object `x` of the class `complex`
## 1.0
print(x.imag)
## 2.0

And here is an example of a method call:

x.conjugate()  # equivalently: complex.conjugate(x)
## (1-2j)

Notably, the documentation of this function can be accessed by typing help("complex.conjugate") (class name – dot – method name).
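Objects of other scalar types feature slots and methods as well. Here is a short sketch; the particular methods were picked arbitrarily from the many that help("str") and help("float") list.

```python
s = "spam"
print(type(s))
## <class 'str'>
print(s.upper())           # a method of the class `str`
## SPAM
print(str.upper(s))        # equivalently
## SPAM
print((2.0).is_integer())  # a method of the class `float`
## True
print((7).real, (7).imag)  # objects of the type `int` have these slots too
## 7 0
```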

2.3. Controlling program flow¶

2.3.1. Relational and logical operators¶

We have several operators which return a single logical value:

1 == 1.0  # is equal to?
## True
2 != 3  # is not equal to?
## True
"spam" < "egg" # is less than? (with respect to the lexicographic order)
## False

Some more examples:

math.sin(math.pi) == 0.0  # well, numeric error...
## False
abs(math.sin(math.pi)) <= 1e-9  # is close to 0?
## True

Logical results can be combined using and (conjunction; for testing if both operands are true) and or (alternative; for determining whether at least one operand is true). Likewise, not stands for negation.

3 <= math.pi and math.pi <= 4  # is it between 3 and 4?
## True
not (1 > 2 and 2 < 3) and not 100 <= 3
## True

Notice that not 100 <= 3 is equivalent to 100 > 3. Also, based on the de Morgan laws, not (1 > 2 and 2 < 3) is true if and only if 1 <= 2 or 2 >= 3 holds.
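Both de Morgan laws can be verified for any particular pair of logical values. In the sketch below, p and q are set arbitrarily; changing them to any other combination should not affect the outcome. The extra parentheses matter because not binds less strongly than ==.

```python
p = True
q = False
print((not (p and q)) == ((not p) or (not q)))
## True
print((not (p or q)) == ((not p) and (not q)))
## True
```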

Exercise 2.6

Assuming that p, q, r are logical and a, b, c, d are variables of the type float, simplify the following expressions:

not not p,
not p and not q,
not (not p or not q or not r),
not a == b,
not (b > a and b < c),
not (a>=b and b>=c and a>=c),
(a>b and a<c) or (a<c and a>d).

2.3.2. The if statement¶

The if statement executes a chunk of code conditionally, based on whether the provided expression is true or not. For instance, given some variable:

x = np.random.rand()  # a pseudorandom value in [0.0, 1.0)

we can react enthusiastically to its being less than 0.5:

if x < 0.5: print("spam!")  # note the colon after the tested condition

Actually, we remained cool as a cucumber (nothing was printed) because x is equal to:

print(x)
## 0.6964691855978616

Multiple elif (else-if) parts can be added. They are inspected one by one, until one of the tests turns out to be successful. At the end, we can include an optional else part. It is executed when all of the tested conditions turn out to be false.

if   x < 0.25: print("spam!")
elif x < 0.5:  print("ham!")    # i.e., x in [0.25, 0.5)
elif x < 0.75: print("bacon!")  # i.e., x in [0.5, 0.75)
else:          print("eggs!")   # i.e., x >= 0.75
## bacon!

Note that if we wrote the second condition as x >= 0.25 and x < 0.5, we would introduce some redundancy; when it is being considered, we already know that x < 0.25 (the first test) is not true. Similarly, the else part is only executed when all the tests fail, which in our case happens if neither x < 0.25, x < 0.5, nor x < 0.75 is true, i.e., if x >= 0.75.
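The same reasoning applies to any chain of mutually exclusive cases. Below is a deterministic sketch which determines the sign of a number; t and sign are names made up for this illustration.

```python
t = -3.5  # an arbitrary value chosen for illustration
if   t < 0:  sign = "negative"
elif t == 0: sign = "zero"
else:        sign = "positive"  # i.e., t > 0
print(sign)
## negative
```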

Whenever more than one statement is to be executed conditionally, an indented code block can be introduced.

if x >= 0.25 and x <= 0.75:
    print("bacon!")
    print("I love it!")
else:
    print("I'd rather eat spam!")
print("more spam!")  # executed regardless of the condition's state
## bacon!
## I love it!
## more spam!

Important

The indentation must be neat and consistent. We recommend using four spaces. Note the kind of error generated when we try executing:

if x < 0.5:
    print("spam!")
   print("ham!")    # :(

IndentationError: unindent does not match any outer indentation level

Exercise 2.7

For a given BMI, print out the corresponding category as defined by the WHO (underweight if less than 18.5 kg/m², normal range up to 25.0 kg/m², etc.). Bear in mind that the BMI is a simplistic measure. Both the medical and statistical communities pointed out its inherent limitations. Read the Wikipedia article thereon for more details (and appreciate the amount of data wrangling required for its preparation: tables, charts, calculations; something that we will be able to perform quite soon, given quality reference data, of course).

Exercise 2.8

(*) Check if it is easy to find on the internet (in reliable sources) some raw datasets related to the body mass studies, e.g., measuring subjects’ height, weight, body fat and muscle mass, etc.

2.3.3. The while loop¶

The while loop executes a given statement or a series of statements as long as a given condition is true. For example, here is a simple simulator determining how long we have to wait until drawing the first value not greater than 0.01 whilst generating numbers in the unit interval:

count = 0
while np.random.rand() > 0.01:
    count = count + 1
print(count)
## 117
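As the above result depends on the pseudorandom numbers drawn, here is an additional sketch of a loop whose outcome is fully predictable. It finds the smallest n such that 2 to the power of n exceeds 1000; the threshold is an arbitrary choice.

```python
n = 0
while 2 ** n <= 1000:  # keep going as long as the condition holds
    n = n + 1
print(n)  # the smallest n such that 2**n > 1000
## 10
```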

Exercise 2.9

Using the while loop, determine the arithmetic mean of 100 randomly generated numbers (i.e., the sum of the numbers divided by 100).

2.4. Defining functions¶

As a means for code reuse, we can define our own functions. For instance, below is a procedure that computes the minimum (with respect to the `<` relation) of three given objects:

def min3(a, b, c):
    """
    A function to determine the minimum of three given inputs.

    By the way, this is a docstring (documentation string);
    call help("min3") later to view it.
    """
    if a < b:
        if a < c:
            return a
        else:
            return c
    else:
        if b < c:
            return b
        else:
            return c

Example calls:

print(min3(10, 20, 30),
      min3(10, 30, 20),
      min3(20, 10, 30),
      min3(20, 30, 10),
      min3(30, 10, 20),
      min3(30, 20, 10))
## 10 10 10 10 10 10

Note that min3 returns a value. The result it yields can be consumed in further computations:

x = min3(np.random.rand(), 0.5, np.random.rand())  # minimum of 3 numbers
x = round(x, 3)  # transform the result somehow
print(x)
## 0.5

Exercise 2.10

Write a function named bmi which computes and returns a person’s BMI, given their weight (in kilograms) and height (in centimetres). As documenting functions constitutes a good development practice, do not forget about including a docstring.

New variables can be introduced inside a function’s body. This can help the function perform its duties.

def min3(a, b, c):
    """
    A function to determine the minimum of three given inputs
    (alternative version).
    """
    m = a  # a local (temporary/auxiliary) variable
    if b < m:
        m = b
    if c < m:   # be careful! no `else` or `elif` here — it's a separate `if`
        m = c
    return m

Example call:

m = 7
n = 10
o = 3
min3(m, n, o)
## 3

All local variables cease to exist once the function call completes. Notice that m inside the function is a variable independent of m in the global (calling) scope.

print(m)  # this is still the global `m` from before the call
## 7
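The same can be observed with any other function. In the following sketch, double_it and result are hypothetical names introduced only for this example: an assignment to result inside the function does not affect the global variable of the same name.

```python
def double_it(v):
    """Returns twice the given value."""
    result = 2 * v  # `result` is local to the function
    return result

result = "untouched"  # a global variable of the same name
print(double_it(21))
## 42
print(result)         # the global `result` was not modified by the call
## untouched
```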

Exercise 2.11

Implement a function max3 which determines the maximum of three given values.

Exercise 2.12

Write a function med3 which determines the median of three given values (the value that is in-between two other ones).

Exercise 2.13

(*) Indite a function min4 to compute the minimum of four values.

2.4.1. Lambda expressions¶

Lambda expressions give us an uncomplicated way to define functions using a single line of code. They are defined using the syntax lambda argument_name: return_expression.

square = lambda x: x**2  # i.e., def square(x): return x**2
square(4)
## 16

Objects generated through lambda expressions do not have to be assigned a name: they can remain anonymous. This is useful when calling a function which takes another function as its argument. With lambdas, the latter can be generated on the fly.

def print_x_and_fx(x, f):
    """
    Arguments: x - some object; f - a function to be called on x
    """
    print(f"x = {x} and f(x) = {f(x)}")

print_x_and_fx(4, lambda x: x**2)
## x = 4 and f(x) = 16
print_x_and_fx(math.pi/4, lambda x: round(math.cos(x), 5))
## x = 0.7853981633974483 and f(x) = 0.70711
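Some built-in functions accept other functions as arguments too. For instance, max and min feature the key parameter, which specifies how the inputs should be transformed before they are compared. This parameter has not been discussed above, so the sketch below is merely a preview.

```python
print(max(-7, 3, 5))  # the ordinary maximum
## 5
print(max(-7, 3, 5, key=lambda v: abs(v)))       # the greatest absolute value
## -7
print(min(-7, 3, 5, key=lambda v: (v - 2)**2))   # the value closest to 2
## 3
```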

2.4.2. (*) Own modules¶

Definitions of functions and other Python objects can be placed in a separate source file. This way, they can be referred to from within multiple projects. For instance, in the current working directory, if we create a file module.py featuring the definition of the above square function, we will be able to call it like:

import module
module.square(4)
## 16

Unfortunately, once a module is imported, later changes to its source file are not picked up automatically: repeating the import statement in the same Python session has no effect. Thus, in an interactive environment (such as when working with Jupyter notebooks), instead of restarting the session, we may find the importlib.reload function useful.
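Here is a self-contained sketch of how importlib.reload might be used. To avoid cluttering the current working directory, it creates a hypothetical module named demo_module in a temporary directory; the file name, its contents, and the use of tempfile and pathlib are our assumptions made for the sake of illustration. In everyday work, we would simply edit the source file in a text editor.

```python
import importlib
import pathlib
import sys
import tempfile

tmp_dir = tempfile.mkdtemp()  # a scratch directory for the example
sys.path.insert(0, tmp_dir)   # let Python search it for modules
module_path = pathlib.Path(tmp_dir) / "demo_module.py"

# the first version of the module defines `square` only
module_path.write_text("def square(x):\n    return x**2\n")
importlib.invalidate_caches()  # make sure the new file is noticed
import demo_module
print(demo_module.square(4))
## 16

# the second version additionally defines `cube`
module_path.write_text(
    "def square(x):\n    return x**2\n\n"
    "def cube(x):\n    return x**3\n"
)
importlib.reload(demo_module)  # re-executes the updated source file
print(demo_module.cube(2))
## 8
```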

2.5. Exercises¶

Exercise 2.14

What does import xxxxxx as x mean?

Exercise 2.15

What is the difference between if and while?

Exercise 2.16

Name the scalar types we introduced in this chapter.

Exercise 2.17

What is a function’s docstring and how can we create and access it?

Exercise 2.18

What are keyword arguments of a function?