Minimalist Data Wrangling with Python

Marek Gagolewski

doi:10.5281/zenodo.6451068

3. Sequential and other types in Python¶

This open-access textbook is, and will remain, freely available for everyone’s enjoyment (also in PDF; a paper copy can also be ordered). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out Deep R Programming [36] too.

3.1. Sequential types¶

Sequential objects store data items that can be accessed by index (position). The three main sequential types are: lists, tuples, and ranges.

As a matter of fact, strings (which we often treat as scalars) can also be considered of this kind. Therefore, amongst sequential objects are such diverse classes as:

lists,
tuples,
ranges, and
strings.

Nobody expected that.

3.1.1. Lists¶

Lists consist of arbitrary Python objects. They can be created using standalone square brackets:

x = [True, "two", 3, [4j, 5, "six"], None]
print(x)
## [True, 'two', 3, [4j, 5, 'six'], None]

The preceding is an example list featuring objects of the types: bool, str, int, list (yes, it is possible to have a list inside another list), and None (the None object is the only one of its kind; it represents a placeholder for nothingness).

Note

We will often be relying on lists when creating vectors in numpy or data frame columns in pandas. Furthermore, lists of lists of equal lengths can be used to create matrices.

Each list is mutable. Consequently, its state may freely be changed. For instance, we can append a new object at its end:

x.append("spam")
print(x)
## [True, 'two', 3, [4j, 5, 'six'], None, 'spam']

The call to the list.append method modified x in place.

3.1.2. Tuples¶

Next, tuples are like lists, but they are immutable (read-only): once created, they cannot be altered.

("one", [], (3j, 4))
## ('one', [], (3j, 4))

This gave us a triple (a 3-tuple) carrying a string, an empty list, and a pair (a 2-tuple). Let’s stress that we can drop the round brackets and still get a tuple:

1, 2, 3  # the same as `(1, 2, 3)`
## (1, 2, 3)

Also:

42,  # equivalently: `(42, )`
## (42,)

Note the trailing comma; we defined a singleton (a 1-tuple). It is not the same as the scalar 42 or (42), which is an object of the type int.
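To see the distinction in practice, we can ask Python for the type of each object. This is merely a sketch relying on the built-in type and len functions; the variable names are ours.

```python
singleton = 42,   # a 1-tuple, thanks to the trailing comma
scalar = (42)     # just the integer 42 wrapped in redundant parentheses
print(type(singleton), type(scalar))
## <class 'tuple'> <class 'int'>
print(len(singleton))  # a 1-tuple has exactly one element
## 1
```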

Note

Having a separate data type representing an immutable sequence makes sense in certain contexts. For example, a data frame’s shape is its inherent property that should not be tinkered with. If a tabular dataset has 10 rows and 5 columns, the user should not be allowed to set the former to 15 (without making further assumptions, providing extra data, etc.).

When creating collections of items, we usually prefer lists, as they are more flexible a data type. Yet, Section 3.4.2 will mention that many functions return tuples. We are thus expected to be able to handle them with confidence.

3.1.3. Ranges¶

Objects defined by calling range(from, to) or range(from, to, by) represent arithmetic progressions of integers.

list(range(0, 5))  # i.e., range(0, 5, 1) – from 0 to 5 (exclusive) by 1
## [0, 1, 2, 3, 4]
list(range(10, 0, -1))  # from 10 to 0 (exclusive) by -1
## [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

We converted the two ranges to ordinary lists as otherwise their display is not particularly spectacular. Let’s point out that the rightmost boundary (to) is exclusive and that by defaults to 1.
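For illustration, here is what a range looks like before the conversion and how the by argument affects it. It is a small sketch; the name evens is ours.

```python
evens = range(0, 10, 2)  # from 0 to 10 (exclusive) by 2
print(evens)  # a range is displayed as its own definition
## range(0, 10, 2)
print(len(evens), list(evens))
## 5 [0, 2, 4, 6, 8]
print(list(range(5)))  # a single argument is interpreted as `to`
## [0, 1, 2, 3, 4]
```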

3.1.4. Strings (again)¶

Recall that we have already discussed character strings in Section 2.1.3.

print("lovely\nspam")
## lovely
## spam

Strings are most often treated as scalars (atomic entities, as in: a string as a whole). However, we will soon find out that their individual characters can also be accessed by index. Furthermore, Chapter 14 will discuss a plethora of operations on parts of strings.

3.2. Working with sequences¶

3.2.1. Extracting elements¶

The index operator, `[...]`, can be applied on any sequential object to extract an element at a position specified by a single integer.

x = ["one", "two", "three", "four", "five"]
x[0]  # the first element
## 'one'
x[1]  # the second element
## 'two'
x[len(x)-1]  # the last element
## 'five'

The valid indexes are \(0, 1, \dots, n-2, n-1\), where \(n\) is the length (size) of the sequence, which can be fetched by calling len.

Important

Think of an index as the distance from the start of a sequence. For example, x[3] means “three items away from the beginning”, i.e., the fourth element.

Negative indexes count from the end:

x[-1]  # the last element (ultimate)
## 'five'
x[-2]  # the next to last (the last but one, penultimate)
## 'four'
x[-len(x)]  # the first element
## 'one'

The index operator can be applied on any sequential object:

"string"[3]
## 'i'

More examples:

range(0, 10)[-1]  # the last item in an arithmetic progression
## 9
(1, )[0]  # extract from a 1-tuple
## 1

Important

The same “thing” can have different meanings in different contexts. Therefore, we must always remain vigilant.

For instance, raw square brackets are used to create a list (e.g., [1, 2, 3]) whereas their presence after a sequential object indicates some form of indexing (e.g., x[1] or even [1, 2, 3][1]). Similarly, (1, 2) creates a 2-tuple and f(1, 2) denotes a call to a function f with two arguments.

3.2.2. Slicing¶

We can also use slices of the form from:to or from:to:by to select a subsequence of a given sequence. Slices are similar to ranges, but `:` can only be used within square brackets.

x = ["one", "two", "three", "four", "five"]
x[1:4]  # from the second to the fifth (exclusive)
## ['two', 'three', 'four']
x[-1:0:-2]  # from the last to first (exclusive) by every second backwards
## ['five', 'three']

In fact, the from and to parts of a slice are optional. When omitted, they default to one of the sequence boundaries.

x[3:]  # from the fourth element to the end
## ['four', 'five']
x[:2]  # the first two
## ['one', 'two']
x[:0]  # none (the first zero)
## []
x[::2]  # every second element from the start
## ['one', 'three', 'five']
x[::-1]  # the elements in reverse order
## ['five', 'four', 'three', 'two', 'one']

Slicing can be applied on other sequential objects as well:

"spam, bacon, spam, and eggs"[13:17]  # fetch a substring
## 'spam'

Knowing the difference between element extraction and subsetting a sequence (creating a subsequence) is crucial. For example:

x[0]  # extraction (indexing with a single integer)
## 'one'

It gave the object at that index. Moreover:

x[0:1]  # subsetting (indexing with a slice)
## ['one']

It returned the object of the same type as x (here, a list), even though, in this case, only one object was fetched. However, a slice can potentially select any number of elements, including zero.
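Another difference shows up when we go beyond the sequence boundaries. In the sketch below, a slice is quietly truncated, whereas extraction with an invalid index raises an error. The try/except construct (exception handling is covered much later in the course) is used only so that the snippet runs to completion.

```python
x = ["one", "two", "three", "four", "five"]
print(x[3:100])  # a slice is truncated to the available items
## ['four', 'five']
print(x[100:])  # nothing to select: an empty list, not an error
## []
try:
    x[100]  # extraction requires a valid index
except IndexError:
    print("there is no such index")
## there is no such index
```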

pandas data frames and numpy arrays will behave similarly, but there will be many more indexing options; see Section 5.4, Section 8.2, and Section 10.5.

3.2.3. Modifying elements of mutable sequences¶

Lists are mutable: their state can be changed. The index operator can replace the elements at given indexes.

x = ["one", "two", "three", "four", "five"]
x[0] = "spam"  # replace the first element
x[-3:] = ["bacon", "eggs"]  # replace last three with given two
print(x)
## ['spam', 'two', 'bacon', 'eggs']

Exercise 3.1

There are quite a few methods that modify list elements: not only the aforementioned append, but also insert, remove, pop, etc. Invoke help("list"), read their descriptions, and call them on a few example lists.

Exercise 3.2

Verify that similar operations cannot be performed on tuples, ranges, and strings. In other words, check that these types are immutable.

3.2.4. Searching for specific elements¶

The in operator and its negation, not in, determine whether an element exists in a given sequence:

7 in range(0, 10)
## True
[2, 3] in [ 1, [2, 3], [4, 5, 6] ]
## True

For strings, in tests whether a string includes a specific substring:

"spam" in "lovely spams"
## True
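A few further checks, including the not in variant, might look as follows. Note that, for lists, only the top-level items are inspected.

```python
print("eggs" not in "lovely spams")
## True
print(10 not in range(0, 10))  # the right boundary is exclusive
## True
print(3 in [1, [2, 3], [4, 5, 6]])  # 3 sits inside a nested list only
## False
```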

Exercise 3.3

In the documentation of the list and other classes, check out the count and index methods.

3.2.5. Arithmetic operators¶

Some arithmetic operators were overloaded for certain sequential types. However, they carry different meanings from those for integers and floats. In particular, `+` joins (concatenates) strings, lists, and tuples:

"spam" + " " + "bacon"
## 'spam bacon'
[1, 2, 3] + [4]
## [1, 2, 3, 4]

Moreover, `*` duplicates (recycles) a given sequence:

"spam" * 3
## 'spamspamspam'
(1, 2) * 4
## (1, 2, 1, 2, 1, 2, 1, 2)

In each case, a new object has been returned.
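We can convince ourselves that the operands are left untouched with a short experiment (the variable names are ours):

```python
x = [1, 2, 3]
y = x + [4]  # a new list
z = x * 2    # another new list
print(x, y, z)  # x remains as it was
## [1, 2, 3] [1, 2, 3, 4] [1, 2, 3, 1, 2, 3]
```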

3.3. Dictionaries¶

Dictionaries are sets of key:value pairs, where the value (any Python object) can be accessed by key (usually[1] a string). In other words, they map keys to values.

x = {
    "a": [1, 2, 3],
    "b": 7,
    "z": "spam!"
}
print(x)
## {'a': [1, 2, 3], 'b': 7, 'z': 'spam!'}

We can also create a dictionary with string keys using the dict function which accepts any keyword arguments:

dict(a=[1, 2, 3], b=7, z="spam!")
## {'a': [1, 2, 3], 'b': 7, 'z': 'spam!'}

The index operator extracts a specific element from a dictionary, uniquely identified by a given key:

x["a"]
## [1, 2, 3]

In this context, x[0] is not valid and raises an error: a dictionary is not an object of sequential type; a key of 0 does not exist in x. If we are unsure whether a specific key is defined, we can use the in operator:

"a" in x, 0 not in x, "z" in x, "w" in x  # a tuple of four tests' results
## (True, True, True, False)
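The error raised when indexing with a nonexistent key is a KeyError. The following sketch recreates the dictionary so that it can be run on its own; try/except is there only to keep the snippet from stopping at the error.

```python
x = dict(a=[1, 2, 3], b=7, z="spam!")
try:
    x[0]  # there is no key 0 in x
except KeyError as err:
    print("KeyError:", err)
## KeyError: 0
```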

There is also a method called get, which returns an element associated with a given key, or something else (by default, None) if the key is missing:

x.get("a")
## [1, 2, 3]
x.get("c")  # if missing, returns None by default
x.get("c") is None  # indeed
## True
x.get("c", "unknown")
## 'unknown'

We can also add new elements to a dictionary using the index operator:

x["f"] = "more spam!"
print(x)
## {'a': [1, 2, 3], 'b': 7, 'z': 'spam!', 'f': 'more spam!'}

Example 3.4

(*) In practice, we often import JSON files (JSON is a popular data exchange format on the internet) exactly in the form of Python dictionaries. Let’s demo it briefly:

import requests
x = requests.get("https://api.github.com/users/gagolews/starred").json()

Now x is a sequence of dictionaries giving the information on the repositories starred by yours truly on GitHub. As an exercise, the reader is encouraged to inspect its structure.

3.4. Iterable types¶

All the objects we discussed here are iterable. In other words, we can iterate through each element contained therein. In particular, the list and tuple functions take any iterable object and convert it to a sequence of the corresponding type. For instance:

list("spam")
## ['s', 'p', 'a', 'm']
tuple(range(0, 10, 2))
## (0, 2, 4, 6, 8)
list({ "a": 1, "b": ["spam", "bacon", "spam"] })
## ['a', 'b']

Exercise 3.5

Take a look at the documentation of the extend method for the list class. The manual page suggests that this operation takes any iterable object. Feed it with a list, tuple, range, and a string and see what happens.

The notion of iterable objects is essential, as they appear in many contexts. There exist other iterable types that are, for example, non-sequential: we cannot access their elements at random using the index operator.
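As one possible illustration, the object returned by the dict.values method can be iterated over, but it does not support the index operator. This is only a sketch; vals is our own name, and try/except merely keeps the snippet running.

```python
x = {"a": 1, "b": ["spam", "bacon", "spam"]}
vals = x.values()  # an iterable giving access to the values
print(list(vals))  # iteration works...
## [1, ['spam', 'bacon', 'spam']]
try:
    vals[0]  # ...but indexing does not
except TypeError:
    print("this object cannot be indexed")
## this object cannot be indexed
```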

Exercise 3.6

(*) Check out the enumerate, zip, and reversed functions and what kind of iterable objects they return.

3.4.1. The for loop¶

The for loop allows us to perform a specific action on each element in an iterable object. For instance, we can access consecutive items in a list as follows:

x = [1, "two", ["three", 3j, 3], False]  # some iterable object
for el in x:   # for each element in `x`, let's call it `el`...
    print(el)  # ... do something on `el`
## 1
## two
## ['three', 3j, 3]
## False

Another common pattern is to traverse a sequential object by means of element indexes:

for i in range(len(x)):  # for i = 0, 1, ..., len(x)-1
    print(i, x[i], sep=": ")  # sep (label separator) defaults to " "
## 0: 1
## 1: two
## 2: ['three', 3j, 3]
## 3: False

Example 3.7

Let’s compute the elementwise multiplication of two vectors of equal lengths, i.e., the product of their corresponding elements:

x = [1,  2,   3,    4,     5]  # for testing
y = [1, 10, 100, 1000, 10000]  # just a test
z = []  # result list – start with an empty one
for i in range(len(x)):
    tmp = x[i] * y[i]
    print(f"The product of {x[i]:6} and {y[i]:6} is {tmp:6}")
    z.append(tmp)
## The product of      1 and      1 is      1
## The product of      2 and     10 is     20
## The product of      3 and    100 is    300
## The product of      4 and   1000 is   4000
## The product of      5 and  10000 is  50000

The items were printed with a little help of f-strings; see Section 2.1.3.1. Here is the resulting list:

print(z)
## [1, 20, 300, 4000, 50000]

Example 3.8

A dictionary may be useful for recoding lists of labels:

map = dict(  # from=to
    apple="red",
    pear="yellow",
    kiwi="green",
)

And now:

x = ["apple", "pear", "apple", "kiwi", "apple", "kiwi"]
recoded_x = []
for fruit in x:
    recoded_x.append(map[fruit])  # or, e.g., map.get(fruit, "unknown")

print(recoded_x)
## ['red', 'yellow', 'red', 'green', 'red', 'green']
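The get-based variant mentioned in the comment becomes handy when some labels have no counterpart in the dictionary. Below is a self-contained sketch; the names colours, labels, and recoded, as well as the extra label "durian", are made up for the sake of illustration.

```python
colours = dict(apple="red", pear="yellow", kiwi="green")  # from=to
labels = ["apple", "durian", "kiwi"]  # "durian" is not a key in colours
recoded = []
for fruit in labels:
    # fall back to "unknown" instead of raising an error
    recoded.append(colours.get(fruit, "unknown"))

print(recoded)
## ['red', 'unknown', 'green']
```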

Exercise 3.9

Here is a function that determines the minimum of a given iterable object (compare the built-in min function, see help("min")).

import math
def mymin(x):
    """
    Fetches the smallest element in an iterable object x.
    We assume that x consists of numbers only.
    """
    curmin = math.inf  # infinity is greater than any other number
    for e in x:
        if e < curmin:
            curmin = e  # a better candidate for the minimum
    return curmin

mymin([0, 5, -1, 100])
## -1
mymin(range(5, 0, -1))
## 1
mymin((1,))
## 1

Note that due to the use of math.inf, the function operates under the assumption that all elements in x are numeric. Rewrite it so that it will work correctly, e.g., in the case of lists of strings.

Exercise 3.10

Using the for loop, author some basic versions of the built-in max, sum, any, and all functions.

Exercise 3.11

(*) The glob function in the glob module lists all files in a given directory whose names match a specific wildcard, e.g., glob.glob(os.path.expanduser("~/Music/*.mp3")) gives the list of MP3 files in the Music folder in the current user’s home directory (glob itself does not expand "~", hence the call to os.path.expanduser); see Section 13.6.1. Moreover, getsize from the os.path module returns the size of a file, in bytes. Compose a function that determines the total size of all the files in a given directory.

3.4.2. Tuple assignment¶

We can create many variables in one line of code by using the syntax tuple_of_ids = iterable_object, which unpacks the iterable object on the right side of the assignment operator:

a, b, c = [1, "two", [3, 3j, "three"]]
print(a)
## 1
print(b)
## two
print(c)
## [3, 3j, 'three']

This is useful, for example, when the swapping of two elements is needed:

a, b = 1, 2  # the same as (a, b) = (1, 2) – parentheses are optional
a, b = b, a  # swap a and b
print(a)
## 2
print(b)
## 1

Another use case is where we fetch outputs of functions that return many objects at once. For instance, later we will learn about numpy.unique which (depending on arguments passed) may return a tuple of arrays:

import numpy as np
result = np.unique([1, 2, 1, 2, 1, 1, 3, 2, 1], return_counts=True)
print(result)
## (array([1, 2, 3]), array([5, 3, 1]))

That this is a tuple of length two can be verified[2] as follows:

type(result), len(result)
## (<class 'tuple'>, 2)

Now, instead of:

values = result[0]
counts = result[1]

we can write:

values, counts = np.unique([1, 2, 1, 2, 1, 1, 3, 2, 1], return_counts=True)

This gives two separate variables, each storing a different array:

print(values)
## [1 2 3]
print(counts)
## [5 3 1]

If only the second item is of interest to us, we can write:

counts = np.unique([1, 2, 1, 2, 1, 1, 3, 2, 1], return_counts=True)[1]
print(counts)
## [5 3 1]

because a tuple is a sequential object.
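Let’s note that, in a plain tuple assignment, the number of identifiers must match the number of items in the iterable object. A minimal sketch (try/except is used only so that the error does not stop the snippet):

```python
first, second = "ab"  # any iterable will do, including a string
print(first, second)
## a b
try:
    a, b = [1, 2, 3]  # three items, but only two identifiers
except ValueError:
    print("the numbers of identifiers and items do not match")
## the numbers of identifiers and items do not match
```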

Example 3.12

(*) The dict.items method generates an iterable object that can be used to traverse through all the (key, value) pairs:

x = { "a": 1, "b": ["spam", "bacon", "spam"] }
print(list(x.items()))  # just a demo
## [('a', 1), ('b', ['spam', 'bacon', 'spam'])]

We can thus utilise tuple assignments in contexts such as:

for k, v in x.items():   # or: for (k, v) in x.items()...
    print(k, v, sep=": ")
## a: 1
## b: ['spam', 'bacon', 'spam']

Note

(**) If there are more values to unpack than the number of identifiers, we can use the notation like *name inside the tuple_of_ids on the left side of the assignment operator. Such a placeholder gathers all the surplus objects in the form of a list:

for a, b, *c, d in [range(4), range(10), range(3)]:
    print(a, b, c, d, sep="; ")
## 0; 1; [2]; 3
## 0; 1; [2, 3, 4, 5, 6, 7, 8]; 9
## 0; 1; []; 2

3.4.3. Argument unpacking (*)¶

Sometimes we will need to call a function with many parameters or call a series of functions with similar arguments, e.g., when plotting many objects using the same plotting style like colour, shape, font. In such scenarios, it may be convenient to pre-prepare the data to be passed as their inputs before making the actual call.

Consider a function that takes four arguments and prints them out obtusely:

def test(a, b, c, d):
    "It is just a test – print the given arguments"
    print("a = ", a, ", b = ", b, ", c = ", c, ", d = ", d, sep="")

Arguments to be matched positionally can be wrapped inside any iterable object and then unpacked using the asterisk operator:

args = [1, 2, 3, 4]  # merely an example
test(*args)  # just like test(1, 2, 3, 4)
## a = 1, b = 2, c = 3, d = 4

Keyword arguments can be wrapped inside a dictionary and unpacked with a double asterisk:

kwargs = dict(a=1, c=3, d=4, b=2)
test(**kwargs)
## a = 1, b = 2, c = 3, d = 4

The unpackings can be intertwined. For this reason, the following calls are equivalent:

test(1, *range(2, 4), 4)
## a = 1, b = 2, c = 3, d = 4
test(1, **dict(d=4, c=3, b=2))
## a = 1, b = 2, c = 3, d = 4
test(*range(1, 3), **dict(d=4, c=3))
## a = 1, b = 2, c = 3, d = 4
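To relate this to the motivating scenario of calling a function many times with the same settings, consider the following sketch. The label function and its parameters are hypothetical; they only stand in for, e.g., a plotting routine that accepts some style options.

```python
def label(name, colour, sep=": ", suffix=""):
    "a hypothetical helper: glues its arguments into a single string"
    return name + sep + colour + suffix

style = dict(sep=" -> ", suffix="!")  # shared keyword arguments
rows = [("apple", "red"), ["kiwi", "green"]]  # positional arguments
for row in rows:
    print(label(*row, **style))
## apple -> red!
## kiwi -> green!
```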

3.4.4. Variadic arguments: `*args` and `**kwargs` (*)¶

We can also construct a function that takes any number of positional or keyword arguments by including *args or **kwargs (those are customary names) in its parameter list:

def test(a, b, *args, **kwargs):
    "simply prints the arguments passed"
    print(
        "a = ", a, ", b = ", b,
        ", args = ", args, ", kwargs = ", kwargs, sep=""
    )

For example:

test(1, 2, 3, 4, 5, spam=6, eggs=7)
## a = 1, b = 2, args = (3, 4, 5), kwargs = {'spam': 6, 'eggs': 7}

We see that *args gathers all the positionally matched arguments (except a and b, which were set explicitly) into a tuple. On the other hand, **kwargs is a dictionary that stores all keyword arguments that are not mentioned in the function’s parameter list.
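As a slightly more practical sketch, here is a function that adds up however many numbers it is given. The name total and the start parameter are our own inventions (compare the built-in sum function).

```python
def total(*args, start=0):
    "a hypothetical helper: adds up all the positional arguments"
    result = start
    for a in args:  # args is a tuple
        result += a
    return result

print(total(), total(1, 2, 3), total(1, 2, 3, start=10))
## 0 6 16
print(total(*range(5)))  # unpacked at the call site, gathered by *args
## 10
```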

Exercise 3.13

From time to time, we will be coming across *args and **kwargs in various contexts. Study what matplotlib.pyplot.plot uses them for (by calling help(plt.plot) after import matplotlib.pyplot as plt).

3.5. Object references and copying (*)¶

3.5.1. Copying references¶

It is important to always keep in mind that when writing:

x = [1, 2, 3]
y = x

the assignment operator does not create a copy of x; both x and y refer to the same object in the computer’s memory.

Important

If x is mutable, any change made to it will affect y (as, again, they are two different means to access the same object). This will also be true for numpy arrays and pandas data frames.

For example:

x.append(4)
print(y)
## [1, 2, 3, 4]
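Whether two names refer to the very same object can be checked with the is operator, which should not be confused with the equality test, `==`. A short sketch:

```python
x = [1, 2, 3]
y = x
print(y is x)  # one object accessible under two names
## True
z = [1, 2, 3]
print(z == x, z is x)  # equal contents, yet a separate object
## True False
```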

3.5.2. Pass by assignment¶

Arguments are passed to functions by assignment too. In other words, they behave as if `=` was used: what we get is another reference to the existing object.

def myadd(z, i):
    z.append(i)

And now:

myadd(x, 5)
myadd(y, 6)
print(x)
## [1, 2, 3, 4, 5, 6]
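By contrast, assigning a new object to a parameter inside a function only rebinds the local name; the caller’s object stays intact. The function below, rebind, is a made-up example of this:

```python
def rebind(z):
    z = z + [99]  # creates a new list and binds the local name z to it
    return z

x = [1, 2, 3]
y = rebind(x)
print(x)  # the original list is unaffected
## [1, 2, 3]
print(y)
## [1, 2, 3, 99]
```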

3.5.3. Object copies¶

If we find the foregoing behaviour undesirable, we can always make a copy of a fragile object. It is customary for the mutable types to be equipped with a relevant method:

x = [1, 2, 3]
y = x.copy()
x.append(4)
print(y)
## [1, 2, 3]

This did not change the object referred to as y because it is now a different entity.
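For lists, the same effect can be achieved in a few other ways, for instance, by calling the list function or by taking a slice that spans the whole sequence. A sketch (the names y1, y2, and y3 are ours):

```python
x = [1, 2, 3]
y1 = x.copy()
y2 = list(x)  # build a new list from an iterable object
y3 = x[:]     # a slice with both boundaries omitted
x.append(4)
print(y1, y2, y3)  # none of the copies has changed
## [1, 2, 3] [1, 2, 3] [1, 2, 3]
print(y1 is x, y2 is x, y3 is x)
## False False False
```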

3.5.4. Modify in place or return a modified copy?¶

We now know that we can have functions or methods that change the state of a given object. Consequently, for all the functions we apply, it is important to read their documentation to determine if they modify their inputs in place or if they return an entirely new object.

In particular, the sorted function returns a sorted version of an iterable object:

x = [5, 3, 2, 4, 1]
print(sorted(x))  # returns a sorted copy of x (does not change x)
## [1, 2, 3, 4, 5]
print(x)  # unchanged
## [5, 3, 2, 4, 1]

The list.sort method modifies the object it is applied on in place:

x = [5, 3, 2, 4, 1]
x.sort()  # modifies x in place and returns nothing
print(x)
## [1, 2, 3, 4, 5]
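A common slip is to assign the result of list.sort to a variable. As the method works in place, what we get is None:

```python
x = [5, 3, 2, 4, 1]
y = x.sort()  # sorts x in place; there is nothing to return
print(y)
## None
print(x)
## [1, 2, 3, 4, 5]
print(sorted(x, reverse=True))  # sorted always gives a new list
## [5, 4, 3, 2, 1]
```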

Additionally, random.shuffle is a function (not: a method) that changes the state of the argument:

x = [5, 3, 2, 4, 1]
import random
random.shuffle(x)  # modifies x in place, returns nothing
print(x)
## [1, 4, 3, 5, 2]

Later we will learn about the Series class in pandas, which represents data frame columns. It has the sort_values method which, by default, returns a sorted copy of the object it acts upon:

import pandas as pd
x = pd.Series([5, 3, 2, 4, 1])
print(list(x.sort_values()))  # inplace=False
## [1, 2, 3, 4, 5]
print(list(x))  # unchanged
## [5, 3, 2, 4, 1]

This behaviour can, however, be altered:

x = pd.Series([5, 3, 2, 4, 1])
x.sort_values(inplace=True)  # note the argument now
print(list(x))  # changed
## [1, 2, 3, 4, 5]

Important

We are always advised to study the official[3] documentation of every function we call. Although surely some patterns arise (such as: a method is more likely to modify an object in place whereas a similar standalone function will be returning a copy), ultimately, the functions’ developers are free to come up with some exceptions to them if they deem it more sensible or convenient.

3.6. Further reading¶

Our overview of the Python language is by no means exhaustive. Still, it touches upon the most important topics from the perspective of data wrangling.

We will mention a few additional standard library features later in this course: list comprehensions in Section 5.5.7, exception handling in Section 13.6.3, file connection in Section 13.6.4, string formatting in Section 14.3.1, pattern searching with regular expressions in Section 14.4, etc.

We have deliberately decided not to introduce some language constructs which we can easily manage without (e.g., else clauses on for and while loops, the match statement) or are perhaps too technical for an introductory course (yield, iter and next, sets, name binding scopes, deep copying of objects, defining new classes, overloading operators, function factories and closures).

Also, we skipped the constructs that do not work well with the third-party packages we will soon be using (e.g., a notation like x < y < z is not valid if the three involved variables are numpy vectors of lengths greater than one).

The said simplifications were brought in so the student is not overwhelmed. We strongly advocate for minimalism in software development. Python is the basis for one of many possible programming environments for exercising data science. In the long run, it is best to focus on developing the most transferable skills, as other software solutions might not enjoy all of Python’s syntactic sugar, and vice versa.

The reader is encouraged to skim through at least the following chapters of the official Python 3 tutorial:

3.7. Exercises¶

Exercise 3.14

Name the sequential objects we introduced.

Exercise 3.15

Is every iterable object sequential?

Exercise 3.16

Is dict an instance of a sequential type?

Exercise 3.17

What is the meaning of `+` and `*` operations on strings and lists?

Exercise 3.18

Given a list x of numeric scalars, how can we create a new list of the same length giving the squares of all the elements in the former?

Exercise 3.19

(*) How can we make an object copy and when should we do so?

Exercise 3.20

What is the difference between x[0], x[1], x[:0], and x[:1], where x is a sequential object?
