2. Scalar Types and Control Structures in Python

The online version of the open-access textbook Minimalist Data Wrangling with Python by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). Any bug or typo reports and fixes are appreciated. Although available online, this is a whole course; it should be read from the beginning to the end. In particular, refer to the Preface for general introductory remarks.

In this part, we introduce the basics of the Python language itself. As it is a general-purpose tool, various packages supporting data wrangling operations are provided as third-party extensions. Therefore, based on the concepts discussed here, in further chapters we will be able to use numpy, scipy, pandas, matplotlib, seaborn, and other packages with some healthy degree of confidence.

2.1. Scalar Types

The five ubiquitous scalar types (i.e., single or atomic values) are:

  • bool – logical,

  • int, float, complex – numeric,

  • str – character.

2.1.1. Logical Values

There are only two possible logical (Boolean) values: True and False. We can type:

True
## True

to instantiate one of them. This might seem boring — unless, when trying to play with the above code, the kind reader fell into the following pitfall:

Important

Python is case-sensitive. Writing “TRUE” or “true” instead of “True” is an error.

2.1.2. Numeric Values

The three numeric scalar types are:

  • int – integers, e.g., 1, -42, 1_000_000;

  • float – floating-point (real) numbers, e.g., -1.0, 3.14159, 1.23e-4;

  • complex (*) – complex numbers, e.g., 1+2j (these are infrequently used in our applications).

In practice, int and float often interoperate seamlessly. We usually do not have to think about them as being of distinctive types.
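As a brief illustration of this interoperability, here is a minimal sketch (the variable names n, z, and mixed are ours, introduced only for this example). Mixing an int with a float yields a float, and equal values compare as equal across the two types:

```python
n = 2          # an int
z = 0.5        # a float
mixed = n + z  # the int operand is converted to float first
print(type(mixed).__name__, mixed)
## float 2.5
print(1 == 1.0, 7 // 2, 7.0 // 2)
## True 3 3.0
```

Note that integer division keeps the int type only when both operands are integers.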

Exercise 2.1

1.23e-4 and 9.8e5 are examples of numbers entered using the so-called scientific notation, where “e” stands for “times 10 to the power of”. Moreover, 1_000_000 is a decorated (more human-readable) version of 1000000. Use the print function to check their values.

2.1.2.1. Arithmetic Operators

Here is the list of available arithmetic operators:

1 + 2    # addition
## 3
1 - 7    # subtraction
## -6
4 * 0.5  # multiplication
## 2.0
7 / 3    # float division (the result is always of type float)
## 2.3333333333333335
7 // 3   # integer division
## 2
7 % 3    # division remainder
## 1
2 ** 4   # exponentiation
## 16

The precedence of these operators is quite predictable, e.g., exponentiation has higher priority than multiplication and division, which in turn bind more strongly than addition and subtraction. Hence:

1 + 2 * 3 ** 4  # the same as 1+(2*(3**4))
## 163

is different from, e.g., ((1+2)*3)**4.

Note

Keep in mind that computers’ floating-point arithmetic is precise only up to roughly 15 to 17 significant decimal digits. Hence, the result of 7/3 is only approximate (2.3333333333333335). We will get back to this topic in Section 5.6.2.
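The following sketch shows one well-known consequence of this limited precision. It relies on math.isclose from the standard library, which is not otherwise used in this chapter; the name total is ours:

```python
import math

total = 0.1 + 0.2
print(total)
## 0.30000000000000004
print(total == 0.3)              # exact comparison fails
## False
print(math.isclose(total, 0.3))  # comparison with a small tolerance
## True
```

For this reason, testing floats for exact equality is usually best avoided.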

2.1.2.2. Creating Named Variables

Named variables can be introduced using the assignment operator, `=`. They can store arbitrary Python objects and be referred to at any time. Names of variables can include any lower- and uppercase letters, underscores, and digits (but not at the beginning). It is best to make them self-explanatory, like:

x = 7  # read: let `x` from now on be equal to 7 (or: `x` becomes 7)

We can check that x (great name, by the way: it means something of general interest in mathematics) is now available for further reference by printing out the value that is bound therewith:

print(x)  # or just `x`
## 7

New variables can easily be created based on existing ones:

my_2nd_variable = x/3 - 2  # creates `my_2nd_variable`
print(my_2nd_variable)
## 0.3333333333333335

Also, existing variables can be re-bound to any other value whenever we please:

x = x/3  # let the new `x` be equal to the old `x` (7) divided by 3
print(x)
## 2.3333333333333335

Exercise 2.2

Create two named variables height (in centimetres) and weight (in kilograms). Based on them, determine your BMI.

Note

(*) Augmented assignments are also available. For example:

x *= 3
print(x)
## 7.0

In this context, the above is equivalent to x = x*3, i.e., a new object has been created and the name x has been re-bound to it. However, in other scenarios, augmented assignments modify the objects they act upon in place; compare Section 3.5.
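To make the distinction more tangible, below is a small sketch. It assumes some familiarity with lists, which are not introduced in this chapter; the names a, b, u, and v are used for illustration only:

```python
a = 7
b = a        # `a` and `b` refer to the same int object
a *= 3       # ints are immutable: `a` is re-bound to a new object
print(a, b)
## 21 7

u = [1, 2]
v = u        # `u` and `v` refer to the same list object
u *= 2       # lists are mutable: the object is modified in place
print(u, v)
## [1, 2, 1, 2] [1, 2, 1, 2]
```

In the second case, v changes as well, because both names still refer to the very same object.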

2.1.3. Character Strings

Character strings (objects of type str), which can consist of arbitrary text, are created using either double quotes or apostrophes:

print("spam, spam, #, bacon, and spam")
## spam, spam, #, bacon, and spam
print("Cześć! ¿Qué tal?")
## Cześć! ¿Qué tal?
print('"G\'day, howya goin\'," he asked.\n"Fine, thanks," she responded.\\')
## "G'day, howya goin'," he asked.
## "Fine, thanks," she responded.\

Above, `\'` (a way to include an apostrophe in an apostrophe-delimited string), `\\` (a backslash), and `\n` (a newline character) are examples of escape sequences.
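As a quick check that each escape sequence stands for a single character, consider the sketch below. The raw-string prefix r, which switches the escape mechanism off, is not discussed in this chapter and is shown only for contrast:

```python
s = "a\tb\n"    # `a`, a tab, `b`, a newline
print(len(s))
## 4
print(repr(s))  # the escaped (source code-like) form
## 'a\tb\n'
raw = r"a\tb\n" # raw string: backslashes are kept literally
print(len(raw))
## 6
```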

Multiline strings are also possible:

"""
spam\\spam
tasty\t"spam"
lovely\t'spam'
"""
## '\nspam\\spam\ntasty\t"spam"\nlovely\t\'spam\'\n'

Exercise 2.3

Call the print function on the above object to reveal the special meaning of the included escape sequences.

Important

Many string operations are available. They are related, for example, to formatting, pattern searching, or extracting matching chunks. They are especially important in the art of data wrangling as oftentimes information comes to us in textual form. We shall be covering this topic in detail in Chapter 14.

2.1.3.1. F-Strings (Formatted String Literals)

Also, the so-called f-strings (formatted string literals) can be used to prepare nice output messages:

x = 2
f"x is {x}"
## 'x is 2'

Notice the f prefix. The {x} part was replaced with the value stored in the x variable.

There are many options available. As usual, it is best to study the documentation in search of interesting features. Here, let us just mention that we will frequently be referring to placeholders like {variable:width} and {variable:width.precision}, which specify the field width and the number of fractional digits of a number. This can result in a series of values nicely aligned one below another.

π = 3.14159265358979323846
e = 2.71828182845904523536
print(f"""
π = {π:10.8f}
e = {e:10.8f}
""")
## 
## π = 3.14159265
## e = 2.71828183

10.8f means that a value should be formatted as a float, be of at least width 10, and use eight fractional digits.
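Here is one more sketch of how the width and precision interact. The square brackets are added only to make the padding visible. The `<` alignment flag comes from Python's format specification mini-language and is not covered above:

```python
value = 2 / 3
right = f"[{value:10.3f}]"  # width 10, 3 fractional digits
left = f"[{value:<10.3f}]"  # as above, but aligned to the left
print(right)
## [     0.667]
print(left)
## [0.667     ]
print(f"[{42:6}]")          # width only; works for integers too
## [    42]
```

By default, numbers are padded with spaces on the left, which is what makes columns of values line up.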

2.2. Calling Built-in Functions

There are quite a few built-in functions ready for use. For instance:

e = 2.718281828459045
round(e, 2)
## 2.72

This call rounds e to two decimal digits.

Exercise 2.4

Call help("round") to access the function’s manual. Note that the second argument, called ndigits, which we have set to 2, has a default value of None. Check what happens when we omit it during the call.

2.2.1. Positional and Keyword Arguments

As round has two parameters, number and ndigits, the following (and no other) calls are equivalent:

print(
    round(e, 2),  # two arguments matched positionally
    round(e, ndigits=2),  # positional and keyword argument
    round(number=e, ndigits=2),  # two keyword arguments
    round(ndigits=2, number=e)  # the order does not matter for keyword args
)
## 2.72 2.72 2.72 2.72

Verifying that no other form is allowed is left as an exercise. The rule to remember is that positionally matched arguments must be listed before the keyword ones.
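One possible way to check the ordering rule without crashing the session is sketched below. The helper is_valid_call is hypothetical (it is not part of Python). It relies on the built-in compile function and on function definitions, which are only introduced in Section 2.4:

```python
def is_valid_call(source):
    """Returns True if `source` is a syntactically valid expression."""
    try:
        compile(source, "<example>", "eval")  # parse only; nothing is run
        return True
    except SyntaxError:
        return False

print(is_valid_call("round(e, ndigits=2)"))
## True
print(is_valid_call("round(ndigits=2, e)"))  # positional after keyword
## False
```

The second call is rejected before any code runs, as the mistake is a syntax error rather than a problem with round itself.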

2.2.2. Modules and Packages

Other functions are available in numerous Python modules and packages (which are collections of modules).

For example, math features many mathematical functions:

import math   # the math module must be imported prior to its first use
print(math.log(2.718281828459045))  # the natural logarithm (base e)
## 1.0
print(math.floor(-7.33))  # the floor function
## -8
print(math.sin(math.pi))  # sin(pi) equals 0 (with some numeric error)
## 1.2246467991473532e-16

See the official documentation for the comprehensive list of objects defined therein. On a side note, all floating-point computations in any programming language are subject to round-off errors and other inaccuracies, hence the result of \(\sin\pi\) not being exactly 0, but some value very close thereto. We will elaborate on this topic in Section 5.6.2.

Packages can be given aliases, for the sake of code readability or due to our being lazy. For instance, we are used to importing the numpy package under the np alias:

import numpy as np

And now, instead of writing, for example, numpy.random.rand(), we can call:

np.random.rand()  # a pseudorandom value in [0.0, 1.0)
## 0.6964691855978616
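An alias is merely another name for the same module object. The sketch below illustrates this with the standard math module, so that it runs even where numpy is not installed; the alias m is chosen arbitrarily:

```python
import math
import math as m  # `m` is just another name for the same module

print(m is math)
## True
print(m.floor(-7.33))  # the same function as math.floor
## -8
```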

2.2.3. Slots and Methods

Python is an object-oriented programming language. Each object is an instance of some class whose name we can reveal by calling the type function:

x = 1+2j
type(x)
## <class 'complex'>

Important

Classes define the following kinds of attributes:

  • slots – associated data,

  • methods – associated functions.

Exercise 2.5

Call help("complex") to reveal that the complex class features, amongst others, the conjugate method and the real and imag slots.

Here is how we can read the two slots:

print(x.real)  # access slot `real` of object `x` of class `complex`
## 1.0
print(x.imag)
## 2.0

And here is an example of a method call:

x.conjugate()  # equivalently: complex.conjugate(x)
## (1-2j)

Notably, the documentation of this function can be accessed by typing help("complex.conjugate") (class name – dot – method name).
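The same dot notation works for objects of other classes. Below is a short sketch with another complex number, w (a name introduced here), and with the is_integer method of the float class:

```python
w = 3 - 4j
print(w.real, w.imag)  # two slots
## 3.0 -4.0
print(w.conjugate(), complex.conjugate(w))  # two equivalent method calls
## (3+4j) (3+4j)
print((2.0).is_integer())  # a method of the float class
## True
```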

2.3. Controlling Program Flow

2.3.1. Relational and Logical Operators

Further, we have several operators which return a single logical value:

1 == 1.0  # is equal to?
## True
2 != 3  # is not equal to?
## True
"spam" < "egg" # is less than? (with respect to the lexicographic order)
## False

Some more examples:

math.sin(math.pi) == 0.0  # well, numeric error...
## False
abs(math.sin(math.pi)) <= 1e-9  # is close to 0?
## True

Logical results might be combined using and (conjunction; for testing if both operands are true) and or (alternative; for determining whether at least one operand is true). Furthermore, not (negation) is available too.

3 <= math.pi and math.pi <= 4
## True
not (1 > 2 and 2 < 3) and not 100 <= 3
## True

Notice that not 100 <= 3 is equivalent to 100 > 3. Also, by De Morgan’s laws, not (1 > 2 and 2 < 3) is true if and only if 1 <= 2 or 2 >= 3 holds.
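De Morgan's laws can be verified exhaustively, as there are only four combinations of two logical values. The sketch below uses for loops over tuples, constructs that are not introduced in this chapter; all the names in it are ours:

```python
all_hold = True
for p in (False, True):
    for q in (False, True):
        law1 = (not (p and q)) == ((not p) or (not q))
        law2 = (not (p or q)) == ((not p) and (not q))
        all_hold = all_hold and law1 and law2
print(all_hold)
## True
```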

Exercise 2.6

Assuming that p, q, r are logical and a, b, c, d are float-type variables, simplify the following expressions:

  • not not p,

  • not p and not q,

  • not (not p or not q or not r),

  • not a == b,

  • not (b > a and b < c),

  • not (a>=b and b>=c and a>=c),

  • (a>b and a<c) or (a<c and a>d).

2.3.2. The if Statement

The if statement allows us to execute a chunk of code conditionally, based on whether the provided expression is true or not.

For instance, given some variable:

x = np.random.rand()  # a pseudorandom value in [0.0, 1.0)

we can react enthusiastically to its being less than 0.5 (note the colon after the tested condition):

if x < 0.5: print("spam!")

which did not happen, because it is equal to:

print(x)
## 0.6964691855978616

Further, multiple elif (else-if) parts can be added, followed by an optional else part, which is executed if none of the tested conditions is true.

if x < 0.25:   print("spam!")
elif x < 0.5:  print("ham!")    # i.e., x in [0.25, 0.5)
elif x < 0.75: print("bacon!")  # i.e., x in [0.5, 0.75)
else:          print("eggs!")   # i.e., x >= 0.75
## bacon!

If more than one statement is to be executed conditionally, an indented code block can be introduced.

if x >= 0.25 and x <= 0.75:
    print("spam!")
    print("I love it!")
else:
    print("I'd rather eat spam!")
print("more spam!")  # executed regardless of the condition's state
## spam!
## I love it!
## more spam!

Important

The indentation must be neat and consistent. We recommend using four spaces. The reader is encouraged to try to execute the following code chunk and note what kind of error is generated:

if x < 0.5:
    print("spam!")
   print("ham!")    # :(

Exercise 2.7

For a given BMI, print out the corresponding category as defined by the WHO (underweight if below 18.5, normal range up to 25.0, etc.). Let us bear in mind that the BMI is a simplistic measure. Both the medical and statistical communities point out its inherent limitations. Read the Wikipedia article thereon for more details (and appreciate the amount of data wrangling required for its preparation – tables, charts, calculations; something that we will be able to do quite soon, given good reference data, of course).

Exercise 2.8

(*) Check if it is easy to find on the internet (in reliable sources) some raw data sets related to the body mass studies, e.g., measuring subjects’ height, weight, body fat and muscle percentage, etc.

2.3.3. The while Loop

The while loop executes a given statement or a series of statements as long as a given condition is true.
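As a minimal deterministic sketch (the names n and steps are ours), the loop below keeps doubling a number for as long as it is smaller than 1000 and counts the iterations:

```python
n = 1
steps = 0
while n < 1000:  # the condition is tested before each iteration
    n = n * 2
    steps = steps + 1
print(n, steps)
## 1024 10
```

The loop body is skipped entirely if the condition is false at the outset.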

For example, here is a simple simulator determining how long we have to wait until drawing the first number not greater than 0.01 whilst generating numbers in the unit interval:

count = 0
while np.random.rand() > 0.01:
    count = count + 1
print(count)
## 117

Exercise 2.9

Using the while loop, determine the arithmetic mean of 10 randomly generated numbers (i.e., the sum of the numbers divided by 10).

2.4. Defining Own Functions

We can also define our own functions as a means for code reuse. For instance, below is one that computes the minimum (with respect to the `<` relation) of three given objects:

def min3(a, b, c):
    """
    A function to determine the minimum of three given inputs.

    By the way, this is a docstring (documentation string);
    call help(min3) later.
    """
    if a < b:
        if a < c:
            return a
        else:
            return c
    else:
        if b < c:
            return b
        else:
            return c

Example calls:

print(min3(10, 20, 30),
      min3(10, 30, 20),
      min3(20, 10, 30),
      min3(20, 30, 10),
      min3(30, 10, 20),
      min3(30, 20, 10))
## 10 10 10 10 10 10

Note that the function returns a value. Hence, the result can be fetched and used in further computations.

x = min3(np.random.rand(), 0.5, np.random.rand())  # minimum of 3 numbers
x = round(x, 3)  # do something with the result
print(x)
## 0.5

Exercise 2.10

Write a function named bmi which computes and returns a person’s BMI, given their weight (in kilograms) and height (in centimetres). As documenting functions constitutes a good development practice, do not forget about including a docstring.

We can also introduce new variables inside a function’s body. This can help the function perform what it has been designed to do.

def min3(a, b, c):
    """
    A function to determine the minimum of three given inputs
    (alternative version).
    """
    m = a  # a local (temporary/auxiliary) variable
    if b < m:
        m = b
    if c < m:   # be careful! no `else` or `elif` here — it's a separate `if`
        m = c
    return m

Example call:

m = 7
n = 10
o = 3
min3(m, n, o)
## 3

All local variables cease to exist once the function call has finished. Notice that m inside the function is a variable independent of m in the global (calling) scope.
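That local variables are not visible from the outside can be checked with the sketch below. The function add_one and the names tmp, result, and visible are hypothetical. The try/except construct, which intercepts the error, is not covered in this chapter:

```python
def add_one(v):
    tmp = v + 1  # `tmp` is local to this call of add_one
    return tmp

result = add_one(41)
try:
    tmp  # the local `tmp` is not available in the calling scope
    visible = True
except NameError:
    visible = False
print(result, visible)
## 42 False
```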

print(m)  # this is still the global `m` from before the call
## 7

Exercise 2.11

Write a function max3 which determines the maximum of 3 given values.

Exercise 2.12

Write a function med3 which determines the median of 3 given values (the one value that is in-between the other ones).

Exercise 2.13

(*) Write a function min4 to compute the minimum of 4 values.

Note

Lambda expressions give us an uncomplicated way to define functions using a single line of code. Their syntax is: lambda argument_name: return_value.

square = lambda x: x**2  # i.e., def square(x): return x**2
square(4)
## 16

Objects generated through lambda expressions do not have to be assigned a name – they can be anonymous. This is useful when calling methods that take other functions as their arguments. With lambdas, the latter can be generated on the fly.

def print_x_and_fx(x, f):
    """
    Arguments: x - some object; f - a function to be called on x
    """
    print(f"x = {x} and f(x) = {f(x)}")

print_x_and_fx(4, lambda x: x**2)
## x = 4 and f(x) = 16
print_x_and_fx(math.pi/4, lambda x: round(math.cos(x), 5))
## x = 0.7853981633974483 and f(x) = 0.70711
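The built-in min and max functions also accept a function through their key argument, which makes them a convenient illustration of anonymous lambdas. The sketch below is ours, and so are the names closest_to_zero and longest:

```python
closest_to_zero = min(3, -7, 5, key=lambda v: abs(v))
print(closest_to_zero)  # compares the absolute values 3, 7, and 5
## 3
longest = max("spam", "bacon", "eggs", key=lambda s: len(s))
print(longest)          # compares the lengths 4, 5, and 4
## bacon
```

The values returned are the original arguments, not the results of applying the key function to them.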

2.5. Exercises

Exercise 2.14

What does import xxxxxx as x mean?

Exercise 2.15

What is the difference between if and while?

Exercise 2.16

Name the scalar types we have introduced in this chapter.

Exercise 2.17

What is a docstring, and how can we create and access it?

Exercise 2.18

What are keyword arguments?