2. Scalar types and control structures in Python

The open-access textbook Minimalist Data Wrangling with Python by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF; a printed version can be ordered from Amazon). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out the author’s other book, Deep R Programming [34].

In this part, we introduce the basics of the Python language itself. As it is a general-purpose tool, various packages supporting data wrangling operations are provided as third-party extensions. In further chapters, based on the concepts discussed here, we will be able to use numpy, scipy, matplotlib, pandas, seaborn, and other packages with some healthy degree of confidence.

2.1. Scalar types

The five ubiquitous scalar types (i.e., single or atomic values) are:

  • bool – logical,

  • int, float, complex – numeric,

  • str – character.

2.1.1. Logical values

There are only two possible logical (Boolean) values: True and False. We can type:

True
## True

to instantiate one of them. This might seem boring; unless, when trying to play with the above code, we fell into the following pitfall.

Important

Python is case-sensitive. Writing “TRUE” or “true” instead of “True” is an error.
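
As a minimal sketch of this pitfall, the snippet below refers to true, which Python treats as an ordinary (and here undefined) variable name. The try/except construct has not been introduced in this chapter; we use it only to capture the resulting error message.

```python
try:
    true  # not a keyword: looked up as a variable name, which does not exist
except NameError as err:
    message = str(err)

print(message)
## name 'true' is not defined
```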

2.1.2. Numeric values

The three numeric scalar types are:

  • int – integers, e.g., 1, -42, 1_000_000;

  • float – floating-point (real) numbers, e.g., -1.0, 3.14159, 1.23e-4;

  • complex (*) – complex numbers, e.g., 1+2j (these are infrequently used in our applications; however, see Section 4.1.4).

In practice, numbers of the type int and float often interoperate seamlessly. We usually do not have to think about them as being of distinct types.
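
A brief sketch of this interoperation follows. The names a and b are chosen for illustration only, and the type function, which reveals the class of an object, is discussed in Section 2.2.3.

```python
a = 2    # an int
b = 0.5  # a float

print(type(a + b), a + b)  # mixing the two types yields a float
## <class 'float'> 2.5
print(type(a * 3), a * 3)  # int-only arithmetic stays within int
## <class 'int'> 6
print(1 == 1.0)            # values are compared, not types
## True
```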

Exercise 2.1

1.23e-4 and 9.8e5 are examples of numbers entered using the so-called scientific notation, where “e” stands for “times 10 to the power of”. Additionally, 1_000_000 is a decorated (more human-readable) version of 1000000. Use the print function to check their values.

2.1.2.1. Arithmetic operators

Here is the list of available arithmetic operators:

1 + 2    # addition
## 3
1 - 7    # subtraction
## -6
4 * 0.5  # multiplication
## 2.0
7 / 3    # float division (the result is always of the type float)
## 2.3333333333333335
7 // 3   # integer division
## 2
7 % 3    # division remainder
## 1
2 ** 4   # exponentiation
## 16
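
The integer division and remainder operators are tied together: the quotient times the divisor plus the remainder gives back the dividend. The following sketch illustrates this, together with the behaviour for negative and floating-point operands, which the list above does not cover.

```python
print((7 // 3) * 3 + 7 % 3)  # quotient times divisor plus remainder
## 7
print(-7 // 3)  # the quotient is rounded down (towards minus infinity)
## -3
print(-7 % 3)   # hence the remainder takes the sign of the divisor
## 2
print(7 // 3.0, 7 % 3.0)  # both operators also accept floats
## 2.0 1.0
```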

The precedence of these operators is quite predictable, e.g., exponentiation has higher priority than multiplication and division, which in turn bind more strongly than addition and subtraction. Consequently:

1 + 2 * 3 ** 4  # the same as 1+(2*(3**4))
## 163

is different from, e.g., ((1+2)*3)**4.

Note

Keep in mind that computers’ floating-point arithmetic is precise only up to about 15 to 17 significant decimal digits. As a consequence, the result of 7/3 is only approximate (2.3333333333333335). We will get back to this topic in Section 5.5.6.
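
To see such an approximation at work, consider the following minimal sketch. The name total is introduced for illustration only (named variables are discussed in the next section).

```python
total = 0.1 + 0.2
print(total)         # the nearest representable value, not exactly 0.3
## 0.30000000000000004
print(total == 0.3)  # hence a direct equality test fails
## False
```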

2.1.2.2. Creating named variables

Named variables can be introduced using the assignment operator, `=`. They can store arbitrary Python objects and be referred to anytime. Names of variables can include any lower- and uppercase letters, underscores, and digits (but not at the beginning). It is best to make them self-explanatory, like:

x = 7  # read: let `x` from now on be equal to 7 (or: `x` becomes 7)

We can check that x (great name, by the way: it means something of general interest in mathematics) is now available for further reference by printing out the value it is bound to:

print(x)  # or just `x`
## 7

New variables can easily be created based on existing ones:

my_2nd_variable = x/3 - 2  # creates `my_2nd_variable`
print(my_2nd_variable)
## 0.3333333333333335

Also, existing variables can be rebound to any other value whenever we please:

x = x/3  # let the new `x` be equal to the old `x` (7) divided by 3
print(x)
## 2.3333333333333335

Exercise 2.2

Create two named variables height (in centimetres) and weight (in kilograms). Based on them, determine your BMI.

Note

(*) Augmented assignments are also available. For example:

x *= 3
print(x)
## 7.0

In this context, the above is equivalent to x = x*3. In other words, it creates a new object and binds x to it. Nevertheless, in other scenarios, augmented assignments may modify the objects they act upon in place; compare Section 3.5.
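
The other arithmetic operators have augmented counterparts as well. A short sketch (the variable y is made up for illustration):

```python
y = 10
y += 5   # y = y + 5
y -= 3   # y = y - 3
y /= 4   # y = y / 4 (float division, so y becomes a float)
y **= 2  # y = y ** 2
print(y)
## 9.0
```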

2.1.3. Character strings

Character strings (objects of the type str) consist of arbitrary text. They are created using either double quotes or apostrophes:

print("spam, spam, #, bacon, and spam")
## spam, spam, #, bacon, and spam
print("Cześć! ¿Qué tal?")
## Cześć! ¿Qué tal?
print('"G\'day, howya goin\'," he asked.\n"Fine, thanks," she responded.\\')
## "G'day, howya goin'," he asked.
## "Fine, thanks," she responded.\

Above, “\'” (a way to include an apostrophe in an apostrophe-delimited string), “\\” (a backslash), and “\n” (a newline character) are examples of escape sequences.
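
The sketch below illustrates two consequences of this. First, choosing the other delimiter spares us from escaping apostrophes. Second, each escape sequence denotes a single character. The names s1 and s2 are hypothetical, and len is a built-in function (not introduced in this chapter) that returns the number of characters in a string.

```python
s1 = 'G\'day'  # apostrophe escaped inside an apostrophe-delimited string
s2 = "G'day"   # no escaping needed inside double quotes
print(s1 == s2)
## True
print(len("a\\b\nc"))  # a, backslash, b, newline, c
## 5
```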

Multiline strings are also possible:

"""
spam\\spam
tasty\t"spam"
lovely\t'spam'
"""
## '\nspam\\spam\ntasty\t"spam"\nlovely\t\'spam\'\n'

Exercise 2.3

Call the print function on the above object to reveal the special meaning of the included escape sequences.

Important

Many string operations are available, e.g., for formatting, pattern searching, or extracting matching chunks. They are especially important in the art of data wrangling as information often arrives in textual form. Chapter 14 covers this topic in detail.

2.1.3.1. F-strings (formatted string literals)

F-strings (formatted string literals) help prepare nice output messages:

x = 2
f"x is equal to {x}"
## 'x is equal to 2'

Notice the “f” prefix. The “{x}” part was replaced with the value stored in the x variable.

There are many options available. As usual, it is best to study the documentation in search of interesting features. Here, let us just mention that we will frequently be referring to placeholders like “{variable:width}” and “{variable:width.precision}”, which specify the field width and the number of fractional digits of a number. This can produce a series of values nicely aligned one below another.

π = 3.14159265358979323846
e = 2.71828182845904523536
print(f"""
π = {π:10.8f}
e = {e:10.8f}
""")
## 
## π = 3.14159265
## e = 2.71828183

“10.8f” means that a value should be formatted as a float, be of width at least ten characters (text columns), and use eight fractional digits.
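
A few more placeholders of this kind are sketched below. The variables price and count are made up for illustration, and the square brackets serve only to make the padding visible.

```python
price = 1234.5
count = 7

print(f"[{count:5}]")      # width 5; numbers are right-aligned by default
## [    7]
print(f"[{price:12.2f}]")  # width 12, two fractional digits
## [     1234.50]
print(f"[{price:.1f}]")    # precision only, no minimum width
## [1234.5]
```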

2.2. Calling built-in functions

We have a few functions at our disposal. For instance:

e = 2.718281828459045
round(e, 2)
## 2.72

We rounded the Euler constant e to two decimal digits.

Exercise 2.4

Call help("round") to access the function’s manual. Note that the second argument, called ndigits, which we set to 2, has a default value of None. Check what happens when we omit it during the call.

2.2.1. Positional and keyword arguments

As round has two parameters, number and ndigits, the following (and no other) calls are equivalent:

print(
    round(e, 2),  # two arguments matched positionally
    round(e, ndigits=2),  # positional and keyword argument
    round(number=e, ndigits=2),  # two keyword arguments
    round(ndigits=2, number=e)  # the order does not matter for keyword args
)
## 2.72 2.72 2.72 2.72

Verifying that no other form is permitted is left as an exercise. In particular, positionally matched arguments must be listed before the keyword ones.
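
One way to check the ordering rule without crashing the whole script is sketched below. It relies on the built-in compile function and on try/except, neither of which has been introduced in this chapter; the name outcome is hypothetical.

```python
source = "round(number=2.718281828459045, 2)"  # keyword argument listed first

try:
    compile(source, "<example>", "eval")  # only parses the code, runs nothing
    outcome = "accepted"
except SyntaxError:
    outcome = "rejected"

print(outcome)
## rejected
```

The call is rejected already at the parsing stage, i.e., before round gets a chance to run.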

2.2.2. Modules and packages

Other functions are available in numerous Python modules and packages (which are collections of modules).

For example, math features many mathematical functions:

import math   # the math module must be imported prior to its first use
print(math.log(2.718281828459045))  # the natural logarithm (base e)
## 1.0
print(math.floor(-7.33))  # the floor function
## -8
print(math.sin(math.pi))  # sin(pi) equals 0 (with some numeric error)
## 1.2246467991473532e-16

See the official documentation for the comprehensive list of objects defined therein. On a side note, all floating-point computations in any programming language are subject to round-off errors and other inaccuracies. This is why the result of \(\sin\pi\) is not exactly 0, but some value very close thereto. We will elaborate on this topic in Section 5.5.6.

Packages can be given aliases, for the sake of code readability or due to our being lazy. For instance, we are used to importing the numpy package under the np alias:

import numpy as np

And now, instead of writing, for example, numpy.random.rand(), we can call:

np.random.rand()  # a pseudorandom value in [0.0, 1.0)
## 0.6964691855978616
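
The same aliasing mechanism works for any module. Below is a small sketch that uses the standard math module, so that it runs without third-party packages; the alias m is an arbitrary choice made for this illustration.

```python
import math
import math as m  # `m` now refers to the very same module

print(m.sqrt(16.0))     # equivalent to math.sqrt(16.0)
## 4.0
print(m.pi == math.pi)  # both names give access to the same objects
## True
```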

2.2.3. Slots and methods

Python is an object-orientated programming language. Each object is an instance of some class whose name we can reveal by calling the type function:

x = 1+2j
type(x)
## <class 'complex'>

Important

Classes define the following kinds of attributes:

  • slots – associated data,

  • methods – associated functions.

Exercise 2.5

Call help("complex") to reveal that the complex class defines, amongst others, the conjugate method and the real and imag slots.

Here is how we can read the two slots:

print(x.real)  # access slot `real` of object `x` of the class `complex`
## 1.0
print(x.imag)
## 2.0

And here is an example of a method call:

x.conjugate()  # equivalently: complex.conjugate(x)
## (1-2j)

Notably, the documentation of this function can be accessed by typing help("complex.conjugate") (class name – dot – method name).
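
Objects of the other scalar types have slots and methods too. A brief sketch (the variables s and y are introduced for illustration; upper and is_integer are standard methods of the str and float classes, respectively):

```python
s = "spam"
print(type(s))
## <class 'str'>
print(s.upper())       # a method of the class `str`
## SPAM

y = 2.5
print(y.is_integer())  # a method of the class `float`
## False
print(y.real, y.imag)  # floats expose the `real` and `imag` slots as well
## 2.5 0.0
```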

2.3. Controlling program flow

2.3.1. Relational and logical operators

We have several operators which return a single logical value:

1 == 1.0  # is equal to?
## True
2 != 3  # is not equal to?
## True
"spam" < "egg" # is less than? (with respect to the lexicographic order)
## False

Some more examples:

math.sin(math.pi) == 0.0  # well, numeric error...
## False
abs(math.sin(math.pi)) <= 1e-9  # is close to 0?
## True
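
As an alternative to comparing absolute differences by hand, the standard library offers math.isclose. The sketch below shows that its default, purely relative tolerance is of little help when comparing against zero, in which case the abs_tol argument should be set explicitly.

```python
import math

print(math.isclose(0.1 + 0.2, 0.3))  # relative tolerance (default 1e-09)
## True
print(math.isclose(math.sin(math.pi), 0.0))  # relative tolerance fails at 0
## False
print(math.isclose(math.sin(math.pi), 0.0, abs_tol=1e-9))
## True
```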

Logical results might be combined using and (conjunction; for testing if both operands are true) and or (alternative; for determining whether at least one operand is true). Likewise, not (negation) is available too.

3 <= math.pi and math.pi <= 4
## True
not (1 > 2 and 2 < 3) and not 100 <= 3
## True

Notice that not 100 <= 3 is equivalent to 100 > 3. Also, by de Morgan’s laws, not (1 > 2 and 2 < 3) is true if and only if 1 <= 2 or 2 >= 3 holds.
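
The law applied here, namely that not (p and q) is the same as (not p) or (not q), can be verified by brute force over all four combinations of logical values. The sketch below uses a for loop, which has not been introduced in this chapter, merely as a convenient way to enumerate the cases; the names lhs, rhs, and all_agree are hypothetical.

```python
all_agree = True
for p in (False, True):
    for q in (False, True):
        lhs = not (p and q)
        rhs = (not p) or (not q)
        print(p, q, lhs, rhs)
        all_agree = all_agree and (lhs == rhs)
## False False True True
## False True True True
## True False True True
## True True False False
print(all_agree)
## True
```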

Exercise 2.6

Assuming that p, q, r are logical and a, b, c, d are variables of the type float, simplify the following expressions:

  • not not p,

  • not p and not q,

  • not (not p or not q or not r),

  • not a == b,

  • not (b > a and b < c),

  • not (a>=b and b>=c and a>=c),

  • (a>b and a<c) or (a<c and a>d).

2.3.2. The if statement

The if statement allows us to execute a chunk of code conditionally, based on whether the provided expression is true or not.

For instance, given some variable:

x = np.random.rand()  # a pseudorandom value in [0.0, 1.0)

we can react enthusiastically to its being less than 0.5 (note the colon after the tested condition):

if x < 0.5: print("spam!")

We did not get excited because x is equal to:

print(x)
## 0.6964691855978616

Multiple elif (else-if) parts can also be added. They can be followed by an optional else part, which is executed if none of the tested conditions is true.

if x < 0.25:   print("spam!")
elif x < 0.5:  print("ham!")    # i.e., x in [0.25, 0.5)
elif x < 0.75: print("bacon!")  # i.e., x in [0.5, 0.75)
else:          print("eggs!")   # i.e., x >= 0.75
## bacon!

If more than one statement is to be executed conditionally, an indented code block can be introduced.

if x >= 0.25 and x <= 0.75:
    print("spam!")
    print("I love it!")
else:
    print("I'd rather eat spam!")
print("more spam!")  # executed regardless of the condition's state
## spam!
## I love it!
## more spam!
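
Python also permits chaining relational operators, so a range check such as x >= 0.25 and x <= 0.75 can be written more compactly. A minimal sketch, with x hard-coded to the value printed earlier so that the result is reproducible:

```python
x = 0.6964691855978616  # hard-coded instead of drawn at random

if 0.25 <= x <= 0.75:   # the same as: x >= 0.25 and x <= 0.75
    print("spam!")
else:
    print("no spam")
## spam!
```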

Important

The indentation must be neat and consistent. We recommend using four spaces. The reader is encouraged to try to execute the following code chunk and note what kind of error is generated:

if x < 0.5:
    print("spam!")
   print("ham!")    # :(

Exercise 2.7

For a given BMI, print out the corresponding category as defined by the WHO (underweight if below 18.5 kg/m², normal range up to 25.0 kg/m², etc.). Bear in mind that the BMI is a simplistic measure. Both the medical and statistical communities pointed out its inherent limitations. Read the Wikipedia article thereon for more details (and appreciate the amount of data wrangling required for its preparation: tables, charts, calculations; something that we will be able to do quite soon, given quality reference data, of course).

Exercise 2.8

(*) Check if it is easy to find on the internet (in reliable sources) some raw datasets related to the body mass studies, e.g., measuring subjects’ height, weight, body fat and muscle percentage, etc.

2.3.3. The while loop

The while loop executes a given statement or a series of statements as long as a given condition is true.

For example, here is a simple simulator determining how long we have to wait until drawing the first number not greater than 0.01 whilst generating numbers in the unit interval:

count = 0
while np.random.rand() > 0.01:
    count = count + 1
print(count)
## 117
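
As the output of the above simulation changes from run to run, here is also a deterministic sketch of the same construct. It counts how many times a number must be halved before it drops below 1; the variables value and steps are made up for illustration.

```python
value = 100.0
steps = 0
while value >= 1.0:
    value = value / 2  # halve the value in each iteration
    steps = steps + 1
print(steps, value)
## 7 0.78125
```
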
Exercise 2.9

Using the while loop, determine the arithmetic mean of 10 randomly generated numbers (i.e., the sum of the numbers divided by 10).

2.4. Defining functions

We can also introduce our own functions as a means for code reuse. For instance, below is one that computes the minimum (with respect to the `<` relation) of three given objects:

def min3(a, b, c):
    """
    A function to determine the minimum of three given inputs.

    By the way, this is a docstring (documentation string);
    call help(min3) later.
    """
    if a < b:
        if a < c:
            return a
        else:
            return c
    else:
        if b < c:
            return b
        else:
            return c

Example calls:

print(min3(10, 20, 30),
      min3(10, 30, 20),
      min3(20, 10, 30),
      min3(20, 30, 10),
      min3(30, 10, 20),
      min3(30, 20, 10))
## 10 10 10 10 10 10
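
A docstring placed at the top of a function's body can be retrieved later. The sketch below uses a hypothetical one-line function named double. For functions that we define ourselves, it is the function object itself (not its name enclosed in quotes) that should be passed to help.

```python
def double(x):
    """Return x multiplied by two."""
    return 2*x

print(double.__doc__)  # the raw docstring is stored in the __doc__ slot
## Return x multiplied by two.
help(double)           # displays the signature along with the docstring
```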

Note that the function returns a value. The result can be fetched and used in further computations:

x = min3(np.random.rand(), 0.5, np.random.rand())  # minimum of 3 numbers
x = round(x, 3)  # do something with the result
print(x)
## 0.5

Exercise 2.10

Write a function named bmi which computes and returns a person’s BMI, given their weight (in kilograms) and height (in centimetres). As documenting functions constitutes a good development practice, do not forget about including a docstring.

We can also introduce new variables inside a function’s body. This can help the function perform what it has been designed to do.

def min3(a, b, c):
    """
    A function to determine the minimum of three given inputs
    (alternative version).
    """
    m = a  # a local (temporary/auxiliary) variable
    if b < m:
        m = b
    if c < m:   # be careful! no `else` or `elif` here — it's a separate `if`
        m = c
    return m

Example call:

m = 7
n = 10
o = 3
min3(m, n, o)
## 3

All local variables cease to exist once the function call ends. Notice that m inside the function is a variable independent of m in the global (calling) scope.

print(m)  # this is still the global `m` from before the call
## 7

Exercise 2.11

Implement a function max3 which determines the maximum of three given values.

Exercise 2.12

Write a function med3 which determines the median of three given values (the value that is in-between two other ones).

Exercise 2.13

(*) Indite a function min4 to compute the minimum of four values.

Note

Lambda expressions give us an uncomplicated way to define functions using a single line of code. Their syntax is: lambda argument_name: return_value.

square = lambda x: x**2  # i.e., def square(x): return x**2
square(4)
## 16

Objects generated through lambda expressions do not have to be assigned a name – they can be anonymous. This is useful when calling methods that take other functions as their arguments. With lambdas, the latter can be generated on the fly.

def print_x_and_fx(x, f):
    """
    Arguments: x - some object; f - a function to be called on x
    """
    print(f"x = {x} and f(x) = {f(x)}")

print_x_and_fx(4, lambda x: x**2)
## x = 4 and f(x) = 16
print_x_and_fx(math.pi/4, lambda x: round(math.cos(x), 5))
## x = 0.7853981633974483 and f(x) = 0.70711

2.5. Exercises

Exercise 2.14

What does import xxxxxx as x mean?

Exercise 2.15

What is the difference between if and while?

Exercise 2.16

Name the scalar types we introduced in this chapter.

Exercise 2.17

What is a docstring and how can we create and access it?

Exercise 2.18

What are keyword arguments?