3. Sequential and other types in Python#

The open-access textbook Minimalist Data Wrangling with Python by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF; a printed version can be ordered from Amazon: AU CA DE ES FR IT JP NL PL SE UK US). It is a non-profit project. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Also, make sure to check out my other book, Deep R Programming [34].

3.1. Sequential types#

Sequential objects store data items that can be accessed by index (position). The three main types of sequential objects are: lists, tuples, and ranges.

As a matter of fact, strings (which we often treat as scalars) can also be classified as such. Therefore, amongst sequential objects are such diverse classes as:

  • lists,

  • tuples,

  • ranges, and

  • strings.

3.1.1. Lists#

Lists consist of arbitrary Python objects. They are created using square brackets:

x = [True, "two", 3, [4j, 5, "six"], None]
print(x)
## [True, 'two', 3, [4j, 5, 'six'], None]

Above is an example list featuring objects of the types: bool, str, int, list (yes, it is possible to have a list inside another list), and None (the only object of its kind; it serves as a placeholder for nothingness), in this order.

Note

We will often be using lists when creating vectors in numpy or data frame columns in pandas. Further, lists of lists of equal lengths can be used to create matrices.

Each list is mutable. Consequently, its state may be changed arbitrarily. For instance, we can append a new object at its end:

x.append("spam")
print(x)
## [True, 'two', 3, [4j, 5, 'six'], None, 'spam']

The list.append method modified x in place.

3.1.2. Tuples#

Next, tuples are like lists, but they are immutable (read-only) – once created, they cannot be altered.

("one", [], (3j, 4))
## ('one', [], (3j, 4))

This gave us a triple (a 3-tuple) featuring a string, an empty list, and a pair (a 2-tuple). Let us stress that we can drop the round brackets and still get a tuple:

1, 2, 3  # the same as `(1, 2, 3)`
## (1, 2, 3)

Also:

42,  # equivalently: `(42, )`
## (42,)

Note the trailing comma; the above notation defines a singleton (a 1-tuple). It is not the same as the simple 42 or (42), which is an object of the type int.
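
The distinction can be checked directly. Below is a minimal sketch; the variable names a, b, and c are ours and serve only as an illustration:

```python
a = (42)     # merely the integer 42 in redundant round brackets
b = (42,)    # a 1-tuple
c = 42,      # also a 1-tuple: it is the comma that makes a tuple
print(type(a).__name__, type(b).__name__, type(c).__name__)
## int tuple tuple
print(b == c, len(b))
## True 1
```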

Note

Having a separate data type representing an immutable sequence makes sense in certain contexts. For example, a data frame’s shape is its inherent property that should not be tinkered with. If a tabular dataset has 10 rows and 5 columns, the user should not be able to set the former to 15 (without making further assumptions, providing extra data, etc.).

When creating collections of items, we usually prefer lists, as they are more flexible a data type. Yet, Section 3.4.2 will mention that many functions return tuples. We are expected to be able to handle them with confidence.

3.1.3. Ranges#

Objects defined by calling range(from, to) or range(from, to, by) represent arithmetic progressions of integers. For the sake of illustration, let us convert a few of them to ordinary lists:

list(range(0, 5))  # i.e., range(0, 5, 1) – from 0 to 5 (exclusive) by 1
## [0, 1, 2, 3, 4]
list(range(10, 0, -1))  # from 10 to 0 (exclusive) by -1
## [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

Let us point out that the rightmost boundary (to) is exclusive and that by defaults to 1.
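
A few more cases, given as a small sketch, may help fix these rules. The one-argument form, range(to), was not used above; in it, from defaults to 0:

```python
r = range(1, 10, 3)       # 1, 4, 7; the next candidate, 10, is excluded
print(list(r))
## [1, 4, 7]
print(len(r))             # a range knows its length without being converted
## 3
print(list(range(5)))     # the same as range(0, 5, 1)
## [0, 1, 2, 3, 4]
print(list(range(5, 0)))  # empty: 0 cannot be approached from 5 in steps of +1
## []
```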

3.1.4. Strings (again)#

Recall that we discussed character strings in Section 2.1.3.

print("lovely\nspam")
## lovely
## spam

Strings are often treated as scalars (atomic entities, as in: a string as a whole). However, as we will soon find out, their individual characters can also be accessed by index.

Furthermore, Chapter 14 will discuss a plethora of operations on text.

3.2. Working with sequences#

3.2.1. Extracting elements#

The index operator, `[...]`, can be applied on any sequential object to extract an element at a position specified by a single integer.

x = ["one", "two", "three", "four", "five"]
x[0]  # the first element
## 'one'
x[1]  # the second element
## 'two'
x[len(x)-1]  # the last element
## 'five'

The valid indexes are \(0, 1, \dots, n-2, n-1\), where \(n\) is the length (size) of the sequence, which can be fetched by calling len.

Important

Think of an index as the distance from the start of a sequence. For example, x[3] means “three items away from the beginning”, i.e., the fourth element.

Negative indexes count from the end:

x[-1]  # the last element (ultimate)
## 'five'
x[-2]  # the next to last (the last but one, penultimate)
## 'four'
x[-len(x)]  # the first element
## 'one'
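
As a quick illustration (a sketch only; the try/except construct used to catch the error has not been introduced in this course), x[-k] refers to the same element as x[n-k], and an index outside of the valid range triggers an error:

```python
x = ["one", "two", "three", "four", "five"]
n = len(x)
print(x[-2] == x[n-2], x[-n] == x[0])  # x[-k] is x[n-k] for k = 1, ..., n
## True True
try:
    x[n]  # one past the end; the valid indexes are -n, ..., n-1
except IndexError as err:
    print("IndexError:", err)
## IndexError: list index out of range
```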

As mentioned, the same works for other sequential objects, for instance, strings:

"string"[3]
## 'i'

Indexing a string returns a string – that is why we classified strings as scalars too.

More examples:

range(0, 10)[-1]  # the last item in an arithmetic progression
## 9
(1, )[0]  # extract from a 1-tuple
## 1

Important

The same “thing” can have different meanings in different contexts. Therefore, we must always remain vigilant.

For instance, raw square brackets are used to create a list (e.g., [1, 2, 3]) whereas their presence after a sequential object indicates some form of indexing (e.g., x[1] or even [1, 2, 3][1]).

Similarly, (1, 2) creates a 2-tuple and f(1, 2) denotes a call to a function f with two arguments.

3.2.2. Slicing#

We can also use slices of the form from:to or from:to:by to select a subsequence of a given sequence. Slices are similar to ranges, but `:` can only be used within square brackets.

x = ["one", "two", "three", "four", "five"]
x[1:4]  # from 2nd to 5th (exclusive)
## ['two', 'three', 'four']
x[-1:0:-2]  # from last to first (exclusive) by every 2nd backwards
## ['five', 'three']

In fact, from and to are optional – when omitted, they default to one of the sequence boundaries.

x[3:]  # from 4th to end
## ['four', 'five']
x[:2]  # first two
## ['one', 'two']
x[:0]  # none (first zero)
## []
x[::2]  # every 2nd from the start
## ['one', 'three', 'five']
x[::-1]  # elements in reverse order
## ['five', 'four', 'three', 'two', 'one']

And, of course, they can be applied on other sequential objects as well:

"spam, bacon, spam, and eggs"[13:17]  # fetch a substring
## 'spam'

Important

Knowing the difference between element extraction and subsetting a sequence (creating a subsequence) is crucial.

For example:

x[0]  # extraction (indexing with a single integer)
## 'one'

gives the object at that index.

x[0:1]  # subsetting (indexing with a slice)
## ['one']

gives an object of the same type as x (here, a list) featuring the items at those indexes (in this case, only the first object, but a slice can potentially select any number of elements, including none).

pandas data frames and numpy arrays will behave similarly, but there will be many more indexing options (as discussed in Section 5.4, Section 8.2, and Section 10.5).
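
To see that the same rule holds beyond lists, here is a short sketch on a tuple, a range, and a string (the example objects are ours):

```python
t = (10, 20, 30)
print(t[0], t[0:1])  # an int versus a 1-tuple
## 10 (10,)
r = range(0, 10)
print(r[2], r[2:5])  # an int versus another range
## 2 range(2, 5)
s = "spam"
print(repr(s[0]), repr(s[0:1]))  # both are strings of length one
## 's' 's'
print(t[5:7])  # an out-of-bounds slice gives an empty sequence, not an error
## ()
```

Strings are the odd one out: there is no separate character type, so both extraction and subsetting yield a string.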

3.2.3. Modifying elements#

Lists are mutable: their state may be changed. The index operator can be used to replace the elements at given indexes.

x = ["one", "two", "three", "four", "five"]
x[0] = "spam"  # replace the first element
x[-3:] = ["bacon", "eggs"]  # replace last three with given two
print(x)
## ['spam', 'two', 'bacon', 'eggs']

Exercise 3.1

There are quite a few methods that we can use to modify list elements: not only the aforementioned append, but also insert, remove, pop, etc. Invoke help("list") to access their descriptions and call them on a few example lists.

Exercise 3.2

Verify that we cannot perform similar operations on tuples, ranges, and strings. In other words, check that they are immutable.

3.2.4. Searching for specific elements#

The in operator and its negation, not in, determine whether an element exists in a given sequence:

7 in range(0, 10)
## True
[2, 3] in [ 1, [2, 3], [4, 5, 6] ]
## True

For strings, in tests whether a string features a specific substring, so we do not have to restrict ourselves to single characters:

"spam" in "lovely spams"
## True

Exercise 3.3

Check out the count and index methods for the list and other classes.

3.2.5. Arithmetic operators#

Some arithmetic operators were overloaded for certain sequential types, but they carry different meanings than those for integers and floats.

In particular, `+` can be used to join (concatenate) strings, lists, and tuples:

"spam" + " " + "bacon"
## 'spam bacon'
[1, 2, 3] + [4]
## [1, 2, 3, 4]

and `*` duplicates (recycles) a given sequence:

"spam" * 3
## 'spamspamspam'
(1, 2) * 4
## (1, 2, 1, 2, 1, 2, 1, 2)

In each case, a new object has been returned.
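
The following sketch (with example data of our own) illustrates that the operands themselves stay intact:

```python
x = [1, 2, 3]
y = x + [4]  # builds a new list; x is left as it was
print(x, y)
## [1, 2, 3] [1, 2, 3, 4]
z = x * 2
print(z)
## [1, 2, 3, 1, 2, 3]
print((1, 2) + (3,))  # tuples work too; note the 1-tuple
## (1, 2, 3)
# the operands must be of the same kind: [1, 2] + (3,) raises a TypeError
```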

3.3. Dictionaries#

Dictionaries (objects of the type dict) are sets of key:value pairs, where the values (any Python object) can be accessed by key (usually a string[1]).

x = {
    "a": [1, 2, 3],
    "b": 7,
    "z": "spam!"
}
print(x)
## {'a': [1, 2, 3], 'b': 7, 'z': 'spam!'}

We can also create a dictionary with string keys using the dict function which accepts any keyword arguments:

dict(a=[1, 2, 3], b=7, z="spam!")
## {'a': [1, 2, 3], 'b': 7, 'z': 'spam!'}

The index operator can be used to extract specific elements:

x["a"]
## [1, 2, 3]

In this context, x[0] is not valid: x is not an object of a sequential type, and the key 0 does not exist in this dictionary.
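
What this looks like in practice can be sketched as follows. Note that try/except and the dict.get method are not covered in this chapter; they are used here only for the sake of illustration:

```python
x = {"a": [1, 2, 3], "b": 7, "z": "spam!"}
try:
    x[0]  # there is no such key
except KeyError as err:
    print("KeyError:", err)
## KeyError: 0
print(x.get(0))  # dict.get returns None for a missing key...
## None
print(x.get(0, "n/a"))  # ... or a fallback value of our choice
## n/a
```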

The in operator checks whether a given key exists:

"a" in x, 0 not in x, "z" in x, "w" in x  # a tuple of 4 tests' results
## (True, True, True, False)

We can also add new elements to a dictionary:

x["f"] = "more spam!"
print(x)
## {'a': [1, 2, 3], 'b': 7, 'z': 'spam!', 'f': 'more spam!'}

Example 3.4

(*) In practice, we often import JSON files (JSON is a popular data exchange format on the internet) exactly in the form of Python dictionaries. Let us demo it quickly:

import requests
x = requests.get("https://api.github.com/users/gagolews/starred").json()

Now x is a sequence of dictionaries giving the information on the repositories starred by yours truly on GitHub. As an exercise, the reader is encouraged to inspect its structure.
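
For readers without internet access, the same idea can be tried offline with the standard json module. The snippet below is a sketch on made-up data (the field names name, stars, and topics are hypothetical and do not reflect the actual GitHub response):

```python
import json

# a made-up JSON snippet for illustration (not actual GitHub data)
text = """
[
    {"name": "repo1", "stars": 10, "topics": ["a", "b"]},
    {"name": "repo2", "stars": 3, "topics": []}
]
"""
repos = json.loads(text)  # parse the JSON text into Python objects
print(type(repos).__name__, type(repos[0]).__name__)
## list dict
print(repos[0]["name"], repos[1]["stars"])
## repo1 3
```

JSON arrays become lists and JSON objects become dictionaries, so the indexing techniques discussed above apply directly.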

3.4. Iterable types#

All the objects we discussed here are iterable. In other words, we can iterate through each element contained therein.

In particular, the list and tuple functions take any iterable object and convert it to a sequence of the corresponding type, for instance:

list("spam")
## ['s', 'p', 'a', 'm']
tuple(range(0, 10, 2))
## (0, 2, 4, 6, 8)
list({ "a": 1, "b": ["spam", "bacon", "spam"] })
## ['a', 'b']

Exercise 3.5

Take a look at the documentation of the extend method for the list class. The manual page suggests that this operation takes any iterable object. Feed it with a list, tuple, range, and a string and see what happens.

The notion of iterable objects is essential, as they appear in many contexts. There are quite a few other iterable types that are, for example, non-sequential (we cannot access their elements at random using the index operator).
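
One example of an iterable but non-sequential object is what the dict.values method returns. A minimal sketch (the dictionary d is ours; try/except is used only to catch the error):

```python
d = {"a": 1, "b": 2}
v = d.values()  # iterable, but not sequential
print(list(v))  # iterating over it (here: converting to a list) works
## [1, 2]
try:
    v[0]  # ... but there is no access by position
except TypeError:
    print("TypeError raised")
## TypeError raised
```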

Exercise 3.6

(*) Check out the enumerate, zip, and reversed functions and what kind of iterable objects they return.

3.4.1. The for loop#

The for loop iterates over every element in an iterable object, allowing us to perform a specific action. For example:

x = [1, "two", ["three", 3j, 3], False]  # some iterable object
for el in x:   # for every element in `x`, let's call it `el`
    print(el)  # do something on `el`
## 1
## two
## ['three', 3j, 3]
## False

Another example:

for i in range(len(x)):
    print(i, x[i], sep=": ")  # sep=" " is the default (element separator)
## 0: 1
## 1: two
## 2: ['three', 3j, 3]
## 3: False

One more example – computing (and printing using f-strings; see Section 2.1.3.1) the elementwise product of two vectors of equal lengths:

x = [1,  2,   3,    4,     5]  # for testing
y = [1, 10, 100, 1000, 10000]  # just a test
z = []  # result list – start with an empty one
for i in range(len(x)):
    tmp = x[i] * y[i]
    print(f"The product of {x[i]:6} and {y[i]:6} is {tmp:6}")
    z.append(tmp)
## The product of      1 and      1 is      1
## The product of      2 and     10 is     20
## The product of      3 and    100 is    300
## The product of      4 and   1000 is   4000
## The product of      5 and  10000 is  50000

The resulting list:

print(z)
## [1, 20, 300, 4000, 50000]

Yet another example: here is a function that determines the minimum of a given iterable object (compare the built-in min function, see help("min")).

import math
def mymin(x):
    """
    The smallest element in an iterable object x.
    We assume that x consists of numbers only.
    """
    curmin = math.inf  # infinity is greater than any other number
    for e in x:
        if e < curmin:
            curmin = e  # a better candidate for the minimum
    return curmin
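
A few test calls, as a sketch (the function is repeated without its docstring so that the snippet can be run on its own):

```python
import math

def mymin(x):
    # a copy of the function defined above
    curmin = math.inf
    for e in x:
        if e < curmin:
            curmin = e
    return curmin

print(mymin([5, 3, 8, 1, 9]), min([5, 3, 8, 1, 9]))
## 1 1
print(mymin(range(10, 0, -1)))  # works on any iterable, not only lists
## 1
print(mymin([]))  # empty input: nothing beats the initial value
## inf
```

Note the difference in the last case: the built-in min raises a ValueError when given an empty sequence.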
Exercise 3.7

Author some basic versions (using the for loop) of the built-in max, sum, any, and all functions.

Exercise 3.8

(*) The glob function in the glob module can be used to list all files in a given directory whose names match a specific wildcard, e.g., glob.glob(os.path.expanduser("~/Music/*.mp3")). Here, "~" denotes the current user’s home directory (see Section 13.6.1); glob does not expand it by itself, hence the call to os.path.expanduser. Moreover, getsize from the os.path module returns the size of a given file, in bytes. Compose a function that determines the total size of all the files in a given directory.

3.4.2. Tuple assignment#

We can create many variables in one line of code by using the syntax tuple_of_ids = iterable_object, which unpacks the iterable:

a, b, c = [1, "two", [3, 3j, "three"]]
print(a)
## 1
print(b)
## two
print(c)
## [3, 3j, 'three']

This is useful, for example, when the swapping of two elements is needed:

a, b = 1, 2  # the same as (a, b) = (1, 2)
a, b = b, a  # swap a and b
print(a)
## 2
print(b)
## 1

Another use case is where we fetch outputs of functions that return many objects at once. For instance, later we will learn about numpy.unique which (depending on arguments passed) may return a tuple of arrays:

import numpy as np
result = np.unique([1, 2, 1, 2, 1, 1, 3, 2, 1], return_counts=True)
print(result)
## (array([1, 2, 3]), array([5, 3, 1]))

That this is indeed a tuple of length two (which we should be able to tell already by merely looking at the result: note the round brackets and two objects separated by a comma) can be verified as follows:

type(result), len(result)
## (<class 'tuple'>, 2)

Now, instead of:

values = result[0]
counts = result[1]

we can write:

values, counts = np.unique([1, 2, 1, 2, 1, 1, 3, 2, 1], return_counts=True)

This gives two separate variables, each storing a different array:

print(values)
## [1 2 3]
print(counts)
## [5 3 1]

If only the second item is of our interest, we can write:

counts = np.unique([1, 2, 1, 2, 1, 1, 3, 2, 1], return_counts=True)[1]
print(counts)
## [5 3 1]

because a tuple is a sequential object.

Example 3.9

(*) Knowing that the dict.items method generates an iterable object that can be used to traverse through all the (key, value) pairs:

x = { "a": 1, "b": ["spam", "bacon", "spam"] }
print(list(x.items()))  # just a demo
## [('a', 1), ('b', ['spam', 'bacon', 'spam'])]

we can utilise tuple assignments in contexts such as:

for k, v in x.items():   # or: for (k, v) in x.items()...
    print(k, v, sep=": ")
## a: 1
## b: ['spam', 'bacon', 'spam']

Note

(**) If there are too many values to unpack, we can use a notation like *name inside the tuple of identifiers on the left-hand side. It serves as a placeholder that gathers all the remaining values and wraps them up in a list:

a, b, *c, d = range(10)
print(a, b, c, d, sep="\n")
## 0
## 1
## [2, 3, 4, 5, 6, 7, 8]
## 9

This placeholder may appear only once on the left-hand side of the assignment operator.

3.4.3. Argument unpacking (*)#

Sometimes we will need to call a function with many parameters or call a series of functions with similar arguments (e.g., when plotting many objects using the same plotting style like colour, shape, font). In such scenarios, it may be convenient to prepare the data to be passed as their inputs beforehand.

Consider the following function that takes four arguments and prints them out:

def test(a, b, c, d):
    "It is just a test – simply prints the arguments passed"
    print("a = ", a, ", b = ", b, ", c = ", c, ", d = ", d, sep="")

Arguments to be matched positionally can be wrapped inside any iterable object and then unpacked using the asterisk operator:

args = [1, 2, 3, 4]  # merely an example
test(*args)  # just like test(1, 2, 3, 4)
## a = 1, b = 2, c = 3, d = 4

Keyword arguments can be wrapped inside a dictionary and unpacked with a double asterisk:

kwargs = dict(a=1, c=3, d=4, b=2)
test(**kwargs)
## a = 1, b = 2, c = 3, d = 4

The unpackings can be intertwined. For instance, the following calls are all equivalent:

test(1, *range(2, 4), 4)
## a = 1, b = 2, c = 3, d = 4
test(1, **dict(d=4, c=3, b=2))
## a = 1, b = 2, c = 3, d = 4
test(*range(1, 3), **dict(d=4, c=3))
## a = 1, b = 2, c = 3, d = 4
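
Returning to the motivating scenario of sharing a style between many calls, here is a minimal sketch. The draw function is a hypothetical stand-in for a plotting routine; it only returns a description of what it was asked to do:

```python
def draw(shape, colour="black", size=1, font="serif"):
    "a hypothetical stand-in for a plotting function"
    return f"{shape}: colour={colour}, size={size}, font={font}"

style = dict(colour="red", font="sans")  # prepared once...
print(draw("circle", **style))           # ... and reused in many calls
## circle: colour=red, size=1, font=sans
print(draw("square", size=3, **style))
## square: colour=red, size=3, font=sans
```

Changing the style then requires editing a single dictionary rather than every call.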

3.4.4. Variadic arguments: *args and **kwargs (*)#

We can also construct a function that takes any number of positional or keyword arguments by including *args or **kwargs (those are customary names) in its parameter list:

def test(a, b, *args, **kwargs):
    "simply prints the arguments passed"
    print(
        "a = ", a, ", b = ", b,
        ", args = ", args, ", kwargs = ", kwargs, sep=""
    )

For example:

test(1, 2, 3, 4, 5, spam=6, eggs=7)
## a = 1, b = 2, args = (3, 4, 5), kwargs = {'spam': 6, 'eggs': 7}

We see that *args gathers all the positionally matched arguments (except a and b, which were set explicitly) into a tuple. On the other hand, **kwargs is a dictionary that stores all keyword arguments not featured in the function’s parameter list.
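
Two typical uses can be sketched as follows: accepting an arbitrary number of inputs, and forwarding whatever was received to another function. Both mysum and announce are hypothetical names introduced only for this illustration:

```python
def mysum(*args):
    "adds up any number of positionally passed arguments"
    total = 0
    for a in args:  # args is an ordinary tuple
        total += a
    return total

print(mysum(), mysum(1), mysum(1, 2, 3))
## 0 1 6

def announce(*args, **kwargs):
    "adds a prefix and forwards everything else to print"
    print("NOTE:", *args, **kwargs)

announce("spam", "eggs", sep=" | ")
## NOTE: | spam | eggs
```

In the second function, the definition gathers the arguments, and the call to print unpacks them again using the notation from the previous section.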

Exercise 3.10

From time to time, we will be coming across *args and **kwargs in various contexts. Study what matplotlib.pyplot.plot uses them for (by calling help(plt.plot), assuming that import matplotlib.pyplot as plt was run beforehand).

3.5. Object references and copying (*)#

3.5.1. Copying references#

It is important to always keep in mind that when writing:

x = [1, 2, 3]
y = x

the assignment operator does not create a copy of x; both x and y refer to the same object in the computer’s memory.

Important

If x is mutable, any change made to it will affect y (as, again, they are two different means to access the same object). This will also be true for numpy arrays and pandas data frames.

For example:

x.append(4)
print(y)
## [1, 2, 3, 4]

Verifying that a call to print(x) now gives the same result as above is left as an exercise.
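
Whether two names refer to the same object can be tested with the is operator (not discussed above), as opposed to `==`, which compares the contents. A short sketch:

```python
x = [1, 2, 3]
y = x          # another name for the same object
z = [1, 2, 3]  # a different object with equal contents
print(x is y, x is z, x == z)
## True False True
x.append(4)
print(y, z)  # y "sees" the change, z does not
## [1, 2, 3, 4] [1, 2, 3]
```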

3.5.2. Pass by assignment#

Arguments are passed to functions by assignment too. In other words, they behave as if `=` was used – what we get is another reference to the existing object.

def myadd(z, i):
    z.append(i)

And now:

myadd(x, 5)
myadd(y, 6)
print(x)
## [1, 2, 3, 4, 5, 6]
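
A consequence worth spelling out: modifying the object inside a function is visible to the caller, but assigning a new object to the parameter name is not. Below is a sketch with two hypothetical functions:

```python
def extend_in_place(z):
    z.append(0)  # changes the object that the caller also sees

def rebind(z):
    z = z + [0]  # creates a new list; only the local name z refers to it

x = [1, 2, 3]
extend_in_place(x)
print(x)
## [1, 2, 3, 0]
rebind(x)
print(x)  # not affected by rebind
## [1, 2, 3, 0]
```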

3.5.3. Object copies#

If we find the above behaviour undesirable, we can always make a copy of an object. It is customary for the mutable objects to be equipped with a relevant method:

x = [1, 2, 3]
y = x.copy()
x.append(4)
print(y)
## [1, 2, 3]

This did not change the object referred to as y because it is now a different entity.

3.5.4. Modify in place or return a modified copy?#

We now know that we can have functions or methods that change the state of a given object. Consequently, for all the functions we apply, it is important to read their documentation to determine if they modify their inputs in place or if they return an entirely new object.

Consider the following examples. The sorted function returns a sorted version of the input iterable:

x = [5, 3, 2, 4, 1]
print(sorted(x))  # returns a sorted copy of x (does not change x)
## [1, 2, 3, 4, 5]
print(x)  # unchanged
## [5, 3, 2, 4, 1]

The list.sort method modifies the list it is applied on in place:

x = [5, 3, 2, 4, 1]
x.sort()  # modifies x in place and returns nothing
print(x)
## [1, 2, 3, 4, 5]
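
Confusing the two leads to a common slip, sketched below: assigning the result of an in-place method to a variable leaves us with None.

```python
x = [5, 3, 2, 4, 1]
y = x.sort()  # sort works in place and returns None
print(y)
## None
print(x)
## [1, 2, 3, 4, 5]
z = sorted(x, reverse=True)  # sorted returns a new list; x stays as it is
print(z, x)
## [5, 4, 3, 2, 1] [1, 2, 3, 4, 5]
```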

Additionally, random.shuffle is a function (not: a method) that changes the state of the argument:

x = [5, 3, 2, 4, 1]
import random
random.shuffle(x)  # modifies x in place, returns nothing
print(x)  # the order will differ from run to run
## [1, 5, 3, 2, 4]

Later we will learn about the Series class in pandas, which represents data frame columns. It has the sort_values method which by default returns a sorted copy of the object it acts upon:

import pandas as pd
x = pd.Series([5, 3, 2, 4, 1])
print(list(x.sort_values()))  # inplace=False
## [1, 2, 3, 4, 5]
print(list(x))  # unchanged
## [5, 3, 2, 4, 1]

This behaviour can be changed:

x = pd.Series([5, 3, 2, 4, 1])
x.sort_values(inplace=True)  # note the argument now
print(list(x))  # changed
## [1, 2, 3, 4, 5]

Important

We are always advised to study the official[2] documentation of every function we call. Although surely some patterns arise (such as: a method is likely to modify an object in place whereas a similar standalone function will be returning a copy), ultimately, the functions’ developers are free to come up with some exceptions to them if they deem it more sensible or convenient.

3.6. Further reading#

Our overview of the Python language is by no means exhaustive. Still, it touches upon the most important topics from the perspective of data wrangling.

We will mention a few additional language elements in this course (list comprehensions, file handling, string formatting, regular expressions, etc.). Yet, we have deliberately decided not to introduce some language constructs which we can easily do without (e.g., else clauses on for and while loops, the match statement) or are perhaps too technical for an introductory course (yield, iter and next, sets, name binding scopes, deep copying of objects, defining own classes, overloading operators, function factories and closures).

Also, we skipped the constructs that do not work well with the third-party packages we will soon be using (e.g., a notation like x < y < z is not valid if the three involved variables are numpy vectors of lengths greater than 1).

The said simplifications were brought in so the student is not overwhelmed. We strongly advocate for minimalism in software development. Python is the basis for one of many possible programming environments for exercising data science. In the long run, it is best to focus on developing the most transferable skills, as other software solutions might not enjoy all of Python’s syntactic sugar, and vice versa.

The reader is encouraged to skim through at least the following chapters in the official Python 3 tutorial:

3.7. Exercises#

Exercise 3.11

Name the sequential objects we introduced.

Exercise 3.12

Is every iterable object sequential?

Exercise 3.13

Is dict an instance of a sequential type?

Exercise 3.14

What is the meaning of `+` and `*` operations on strings and lists?

Exercise 3.15

Given a list x featuring numeric scalars, how can we create a new list of the same length giving the squares of all the elements in the former?

Exercise 3.16

(*) How can we make an object copy and when should we do so?

Exercise 3.17

What is the difference between x[0], x[1], x[:0], and x[:1], where x is a sequential object?