14. Text Data

In [Gag22] it is noted that effective processing of character strings is required at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation. Pattern searching, string collation and sorting, normalisation, transliteration, and formatting are ubiquitous in text mining, natural language processing, and bioinformatics. Means for the handling of string data should be included in each statistician’s or data scientist’s repertoire to complement their numerical computing and data wrangling skills.

Diverse data cleansing and preparation operations (compare, e.g., [vdLdJ18] and [DJ03]) need to be applied before an analyst can begin to enjoy an orderly and meaningful data frame, matrix, or spreadsheet being finally at their disposal. Activities related to information retrieval, computer vision, bioinformatics, natural language processing, or even musicology can also benefit from including them in data processing pipelines.

In this part we discuss the most basic string operations in base Python, together with their vectorised versions in numpy and pandas.

14.1. Basic String Operations

Recall that the str class represents individual character strings:

x = "spam"
type(x)
## <class 'str'>

There are a few binary operators overloaded for strings, e.g., `+` stands for string concatenation:

x + " and eggs"
## 'spam and eggs'

`*` duplicates a given string:

x * 3
## 'spamspamspam'

Further, str is a sequential type, therefore we can extract individual code points and create substrings using the index operator:

x[-1]  # last letter
## 'm'

Recall that strings are immutable. However, parts of strings can always be reused in conjunction with the concatenation operator:

x[:2] + "ecial"
## 'special'

14.1.1. Unicode as the Universal Encoding

It is worth knowing that all strings in Python (from version 3.0) use Unicode (https://www.unicode.org/charts/), which is a universal encoding capable of representing ca. 150,000 characters covering letters and numbers in contemporary and historic alphabets/scripts, mathematical, political, phonetic, and other symbols, emojis, etc. It is thus a very powerful representation.

Note

Despite the wide support for Unicode, sometimes our own or other readers’ displays (e.g., web browsers when viewing an HTML version of the output report) might not be able to render all code points properly (e.g., due to missing fonts). However, we should rest assured that they are still there, and are processed correctly if string functions are applied thereon.

Note

(**) More precisely, Python strings store sequences of Unicode code points; when they are exchanged with the outside world (e.g., written to or read from a file), they are most commonly encoded using UTF-8. Most web pages and API data are nowadays served in UTF-8 as well. However, occasionally we can encounter files encoded in ISO-8859-1 (Western Europe), Windows-1250 (Eastern Europe), Windows-1251 (Cyrillic), GB18030 and Big5 (Chinese), EUC-KR (Korean), Shift-JIS and EUC-JP (Japanese), amongst others; byte sequences in such encodings can be converted to strings using the bytes.decode method.
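For instance, here is a quick illustration of decoding a byte sequence representing some ISO-8859-1-encoded text (in that encoding, "\xfc" and "\xdf" are the bytes for "ü" and "ß", respectively):

b"Gr\xfc\xdfe aus M\xfcnchen".decode("ISO-8859-1")
## 'Grüße aus München'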

14.1.2. Normalising Strings

Dirty text data are a pain, especially if semantically similar tokens are encoded in many different ways. For the sake of string matching, we might want, e.g., the German "groß", "GROSS", and " gross  " to all compare equal.

str.strip removes whitespace characters (spaces, tabs, newlines) at both ends of strings (see also str.lstrip and str.rstrip for their one-sided counterparts).
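For example:

"  groß \t\n".strip()
## 'groß'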

str.lower and str.upper change letter case. For caseless comparison/matching, str.casefold might be a slightly better option as it unfolds many more code point sequences:

"Groß".lower(), "Groß".upper(), "Groß".casefold()
## ('groß', 'GROSS', 'gross')

Important

(**) More advanced string transliteration can be performed by means of the ICU library, which the PyICU package provides wrappers for.

For instance, converting all code points to ASCII (English) might be necessary when identifiers are expected to miss some diacritics that would normally be included (as in "Gągolewski" vs "Gagolewski"):

import icu  # PyICU package
icu.Transliterator.createInstance("Lower; Any-Latin; Latin-ASCII").transliterate(
    "Χαίρετε! Groß gżegżółka — © La Niña – köszönöm – Gągolewski"
)
## 'chairete! gross gzegzolka - (C) la nina - koszonom - gagolewski'

Converting between different Unicode Normalisation Forms (also available in the unicodedata package and via pandas.Series.str.normalize) might be used for the removal of some formatting nuances:

icu.Transliterator.createInstance("NFKD; NFC").transliterate("¼ąr²︷")
## '1⁄4ąr2{'

14.1.3. Substring Searching and Replacing

Determining if a string features a particular fixed substring can be done in a number of different ways.

For instance:

food = "bacon, spam, spam, eggs, and spam"
"spam" in food
## True

verifies whether a particular substring exists,

food.count("spam")
## 3

counts the number of occurrences of a substring,

food.find("spam")
## 7

locates the first pattern occurrence (see also str.rfind as well as str.index and str.rindex),

food.replace("spam", "veggies")
## 'bacon, veggies, veggies, eggs, and veggies'

replaces matching substrings with another string.

Exercise 14.1

Read the manual of the following methods: str.startswith, str.endswith, str.removeprefix, and str.removesuffix.

The splitting of long strings at specific fixed delimiter strings can be done via:

food.split(", ")
## ['bacon', 'spam', 'spam', 'eggs', 'and spam']

see also str.partition. The str.join method implements the inverse operation:

", ".join(["spam", "bacon", "eggs", "spam"])
## 'spam, bacon, eggs, spam'

Important

In Section 14.4, we will discuss pattern matching with regular expressions, which can be useful in, amongst others, extracting more abstract data chunks (numbers, URLs, email addresses, IDs) from within strings.

14.1.4. Locale-Aware Services in ICU (*)

Recall that relational operators such as `<` and `>=` perform the lexicographic comparison of strings:

"spam" > "egg"
## True

We have: "a" < "aa" < "aaaaaaaaaaaaa" < "ab" < "aba" < "abb" < "b" < "ba" < "baaaaaaa" < "bb" < "spanish inquisition" (note the lowercase "s": the code points of the uppercase ASCII letters are all smaller than those of the lowercase ones).
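As chained comparisons in Python are conjunctions of the individual relations, we can verify a few of the above directly:

"a" < "aa" < "ab" < "b" < "ba" < "bb" < "spanish inquisition"
## True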

Lexicographic ordering (character-by-character, from left to right) is, however, not necessarily appropriate for strings featuring numerals:

"a9" < "a123"
## False

Also, it only takes into account the numeric codes corresponding to each Unicode character; therefore, it does not work well with non-English alphabets:

"MIELONECZKĄ" < "MIELONECZKI"
## False

In Polish, A with ogonek (Ą) should sort after A and before B, let alone I. However, the corresponding numeric codes in the Unicode table are: 260 (Ą), 65 (A), 66 (B), and 73 (I). Therefore, the resulting ordering is incorrect from the natural language processing perspective.

It is best to perform string collation using the services provided by ICU. Here is an example of German phone book-like collation where "ö" is treated the same as "oe":

c = icu.Collator.createInstance(icu.Locale("de_DE@collation=phonebook"))
c.setStrength(0)
c.compare("Löwe", "loewe")
## 0

A result of 0 means that the strings are deemed equal.
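Returning to the earlier Polish example, a locale-aware collator yields the expected ordering:

icu.Collator.createInstance(icu.Locale("pl_PL")).compare(
    "MIELONECZKĄ", "MIELONECZKI")
## -1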

In some languages, contractions occur, e.g., in Slovak and Czech, the two-code-point sequence "ch" is treated as a single entity and is sorted after "h":

icu.Collator.createInstance(icu.Locale("sk_SK")).compare("chladný", "hladný")
## 1

i.e., we have "chladný" > "hladný" (the first argument is greater than the second one). Compare the above to something similar in Polish:

icu.Collator.createInstance(icu.Locale("pl_PL")).compare("chłodny", "hardy")
## -1

i.e., "chłodny" < "hardy" (the first argument is less than the 2nd one).

Also, with ICU, numeric collation is possible:

c = icu.Collator.createInstance()
c.setAttribute(icu.UCollAttribute.NUMERIC_COLLATION, icu.UCollAttributeValue.ON)
c.compare("a9", "a123")
## -1

This is the correct result: "a9" is less than "a123" (compare the above to the earlier example which used `<`).

14.1.5. String Operations in pandas

By default, string sequences in a Series are stored using the broadest possible data type, object:

pd.Series(["spam", "bacon", "spam"])
## 0     spam
## 1    bacon
## 2     spam
## dtype: object

which basically means that we deal with a sequence of Python objects of arbitrary type (here, all of them are of class str). This allows for the encoding of missing values by means of the None object.

Vectorised versions of base string operations are available via the pandas.Series.str accessor, which we usually refer to by calling x.str.method_name(), for instance:

x = pd.Series(["spam", "bacon", None, "buckwheat", "spam"])
x.str.upper()
## 0         SPAM
## 1        BACON
## 2         None
## 3    BUCKWHEAT
## 4         SPAM
## dtype: object

We thus have pandas.Series.str.strip, pandas.Series.str.split, pandas.Series.str.find, and so forth.

But there is more. For example, a function to compute the length of each string:

x.str.len()
## 0    4.0
## 1    5.0
## 2    NaN
## 3    9.0
## 4    4.0
## dtype: float64

Concatenating all items into a single string:

x.str.cat(sep="; ")
## 'spam; bacon; buckwheat; spam'

Vectorised string concatenation:

x + " and spam"
## 0         spam and spam
## 1        bacon and spam
## 2                   NaN
## 3    buckwheat and spam
## 4         spam and spam
## dtype: object

Conversion to numeric:

pd.Series(["1.3", "-7", None, "3523"]).astype(float)
## 0       1.3
## 1      -7.0
## 2       NaN
## 3    3523.0
## dtype: float64

Selecting substrings:

x.str.slice(2, -1)  # like x.iloc[i][2:-1] for all i
## 0         a
## 1        co
## 2      None
## 3    ckwhea
## 4         a
## dtype: object

Replacing substrings:

x.str.slice_replace(0, 2, "tofu")  # replaces x.iloc[i][0:2] with "tofu" for all i
## 0         tofuam
## 1        tofucon
## 2           None
## 3    tofuckwheat
## 4         tofuam
## dtype: object

Exercise 14.2

Consider the nasaweather_glaciers data frame. All glaciers are assigned 11/12-character unique identifiers which, as defined by the WGMS convention, combine the following five elements. Extract all of them and store them as separate columns in the data frame.

  1. 2-character political unit,

  2. 1-digit continent code,

  3. 4-character drainage code,

  4. 2-digit free position code,

  5. 2- or 3-digit local glacier code.

14.1.6. String Operations in numpy (*)

There is a huge overlap between the numpy and pandas capabilities for string handling, with the latter being more powerful. Still, some readers will find the following useful.

As mentioned in our introduction to numpy vectors, objects of type ndarray can store not only numeric and logical data, but also character strings. For example:

x = np.array(["spam", "bacon", "egg"])
x
## array(['spam', 'bacon', 'egg'], dtype='<U5')

Here, the data type "<U5" (compare also x.dtype) means that we deal with Unicode strings of length no greater than 5. Thus, unfortunately, replacing elements with content that is too long will result in truncated strings:

x[2] = "buckwheat"
x
## array(['spam', 'bacon', 'buckw'], dtype='<U5')

In order to remedy this, we first need to recast the vector manually:

x = x.astype("<U10")
x[2] = "buckwheat"
x
## array(['spam', 'bacon', 'buckwheat'], dtype='<U10')

Conversion from/to numeric is also possible:

np.array(["1.3", "-7", "3523"]).astype(float)
## array([ 1.300e+00, -7.000e+00,  3.523e+03])
np.array([1, 3.14, -5153]).astype(str)
## array(['1.0', '3.14', '-5153.0'], dtype='<U32')

The numpy.char module includes a number of vectorised versions of string routines, most of which we have discussed above. For example:

x = np.array([
    "spam", "spam, bacon, and spam",
    "spam, eggs, bacon, spam, spam, and spam"
])
np.char.split(x, ", ")
## array([list(['spam']), list(['spam', 'bacon', 'and spam']),
##        list(['spam', 'eggs', 'bacon', 'spam', 'spam', 'and spam'])],
##       dtype=object)
np.char.count(x, "spam")
## array([1, 2, 4])

Operations that we would normally perform via the use of binary operators (i.e., `+`, `*`, `<`, etc.) are available through standalone functions:

np.char.add(["spam", "bacon"], " and spam")
## array(['spam and spam', 'bacon and spam'], dtype='<U14')
np.char.equal(["spam", "bacon", "spam"], "spam")
## array([ True, False,  True])

Also noteworthy is the function that returns the length of each string:

np.char.str_len(x)
## array([ 4, 21, 39])

14.2. Working with String Lists

A Series can also consist of lists of strings of varying lengths. They can not only be input manually (via the pandas.Series constructor), but also be generated, e.g., through string splitting. For instance:

x = pd.Series([
    "spam",
    "spam, bacon, spam",
    None,
    "spam, eggs, bacon, spam, spam"
])
xs = x.str.split(", ", regex=False)
xs
## 0                             [spam]
## 1                [spam, bacon, spam]
## 2                               None
## 3    [spam, eggs, bacon, spam, spam]
## dtype: object

and now, e.g., looking at the last element:

xs.iloc[-1]
## ['spam', 'eggs', 'bacon', 'spam', 'spam']

reveals that it is indeed a list of strings.

There are a few vectorised operations that enable us to work with such variable length lists, such as concatenating all strings:

xs.str.join("; ")
## 0                             spam
## 1                spam; bacon; spam
## 2                             None
## 3    spam; eggs; bacon; spam; spam
## dtype: object

selecting, say, the first string in each list:

xs.str.get(0)
## 0    spam
## 1    spam
## 2    None
## 3    spam
## dtype: object

or slicing:

xs.str.slice(0, -1)  # like xs.iloc[i][0:-1] for all i
## 0                           []
## 1                [spam, bacon]
## 2                         None
## 3    [spam, eggs, bacon, spam]
## dtype: object

Exercise 14.3

(*) Using pandas.merge, join the following datasets: countries, world_factbook_2020, and ssi_2016_dimensions based on the country names. Note that some manual data cleansing will be necessary.

Exercise 14.4

(**) Given a Series object featuring lists of strings:

  1. determine the list of all unique strings (e.g., for xs above we have: ["spam", "bacon", "eggs"]), call it xu;

  2. create a data frame x with xs.shape[0] rows and len(xu) columns such that x.iloc[i, j] is equal to 1 if xu[j] is amongst xs.loc[i] and equal to 0 otherwise;

  3. given x (and only x: neither xs nor xu), perform the inverse operation.

14.3. Formatted Outputs for Reproducible Report Generation

When preparing reports from data analysis (e.g., using Jupyter Notebooks or writing directly to Markdown files which we later compile to PDF or HTML using pandoc), it is important to be able to output nicely formatted content programmatically.

14.3.1. Formatting Strings

Recall that string formatting for inclusion of data stored in existing objects can easily be done using f-strings (formatted string literals) of the type f"...{expression}...". For instance:

pi = 3.14159265358979323846
f"π = {pi:.2f}"
## 'π = 3.14'

creates a string that includes the value of the variable pi formatted as a float rounded to two decimal places.
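Format specifiers can also set minimal field widths and alignment, which comes in handy when generating aligned, tabular outputs; for instance:

f"{'pi':>8} = {pi:10.4f}"
## '      pi =     3.1416'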

Note

(**) Similar functionality can be achieved using the str.format method as well as the `%` operator overloaded for strings, which uses sprintf-like value placeholders known to some readers from other programming languages (such as C):

"π = {:.2f}".format(pi), "π = %.2f" % pi
## ('π = 3.14', 'π = 3.14')

14.3.2. str and repr

The str and repr functions can create string representations of a number of objects, with the former being more human-readable and the latter slightly more technical.

x = np.array([1, 2, 3])
str(x), repr(x)
## ('[1 2 3]', 'array([1, 2, 3])')

Note that repr often returns an output that can be interpreted as executable Python code recreating the object (with no or only few adjustments required; pandas objects are amongst the many exceptions, though).
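For instance, here is a quick (somewhat naive) round-trip test:

eval(repr([1, "spam", 3.14])) == [1, "spam", 3.14]
## True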

14.3.3. Justifying Strings

str.center, str.ljust, and str.rjust can be used to centre, left-, or right-justify a string so that it is of at least a given width, which might make the display thereof more aesthetic. Very long strings, possibly containing whole text paragraphs, can be dealt with using the wrap and shorten functions from the textwrap package.
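For example:

"spam".center(10, "*"), "spam".rjust(10)
## ('***spam***', '      spam')
import textwrap
textwrap.wrap("Nobody expects the Spanish Inquisition! "
    "Our chief weapon is surprise.", 30)
## ['Nobody expects the Spanish', 'Inquisition! Our chief weapon', 'is surprise.']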

14.3.4. Direct Markdown Output in Jupyter

Further, with IPython/Jupyter, we can output strings that will be directly interpreted as Markdown-formatted:

import IPython.display
x = 2+2
out = f"*Result*: $2^2=2+2={x}$."
IPython.display.Markdown(out)

Result: $2^2 = 2+2 = 4$.

Recall that Markdown is a very flexible markup language, allowing us to insert itemised and numbered lists, mathematical formulae, tables, images, etc.

14.3.5. Manual Markdown File Output

We can also generate Markdown code programmatically in the form of standalone .md files:

f = open("/tmp/test-report.md", "w")  # open for writing (overwrite if exists)
f.write("**Yummy Foods** include, but are not limited to:\n\n")
x = ["spam", "bacon", "eggs", "spam"]
for e in x:
    f.write(f"* {e}\n")
f.write("\nAnd now for something completely different:\n\n")
f.write("Rank | Food\n")
f.write("-----|-----\n")
for i in range(len(x)):
    f.write(f"{i+1:4} | {x[i][::-1]:10}\n")
f.close()
## 50
## 7
## 8
## 7
## 7
## 46
## 12
## 12
## 18
## 18
## 18
## 18

Here is the resulting raw Markdown source file:

with open("/tmp/test-report.md", "r") as f:
    out = f.read()
print(out)
## **Yummy Foods** include, but are not limited to:
## 
## * spam
## * bacon
## * eggs
## * spam
## 
## And now for something completely different:
## 
## Rank | Food
## -----|-----
##    1 | maps      
##    2 | nocab     
##    3 | sgge      
##    4 | maps

We can run it through the pandoc tool to convert it to a number of formats, including HTML, PDF, EPUB, and ODT. We may also render it directly into our report:

IPython.display.Markdown(out)

Yummy Foods include, but are not limited to:

  • spam

  • bacon

  • eggs

  • spam

And now for something completely different:

Rank   Food
-----  -------
1      maps
2      nocab
3      sgge
4      maps

Note

Figures created in matplotlib can be exported to PNG, SVG, or PDF files using the matplotlib.pyplot.savefig function.

Note

Data frames can be nicely prepared for display in a report using pandas.DataFrame.to_markdown.
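For instance (assuming the optional tabulate package, which pandas relies on here, is installed; the exact spacing may differ across versions):

print(pd.DataFrame(dict(Rank=[1, 2], Food=["spam", "bacon"])).to_markdown())
## |    |   Rank | Food   |
## |---:|-------:|:-------|
## |  0 |      1 | spam   |
## |  1 |      2 | bacon  |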

Note

(*) Markdown is amongst many markup languages. Other learn-worthy ones include HTML (for the Web) and LaTeX (especially for beautiful typesetting of maths, print-ready articles and books, e.g., PDF; see [O+21] for a good introduction).

Jupyter Notebooks can be converted to different formats using the jupyter-nbconvert command line tool.

More generally, pandoc is a generic converter between different formats, e.g., the highly universal (although primitive) Markdown and the said LaTeX and HTML. Also, it can be used for preparing presentations (slides).

14.4. Regular Expressions (*)

This section contains large excerpts from yours truly’s other work [Gag22].

Regular expressions (regexes) provide us with a concise grammar for defining systematic patterns which can be sought in character strings. Examples of such patterns include: specific fixed substrings, emojis of any kind, stand-alone sequences of lower-case Latin letters (“words”), substrings that can be interpreted as real numbers (with or without fractional parts, also in scientific notation), telephone numbers, email addresses, or URLs.

Theoretically, the concept of regular pattern matching dates back to the so-called regular languages and finite state automata [Kle51], see also [RS59] and [HU79]. Regexes in the form as we know today have already been present in one of the pre-Unix implementations of the command-line text editor qed [RT70] (the predecessor of the Unix ed, from which the well-known sed and grep later descended).

14.4.1. Regex Matching with re

In Python, the re module implements a regular expression matching engine that accepts patterns following a similar syntax to the ones available in the Perl language.

Before we proceed with a detailed discussion on how to read and write regexes, let us first review the methods for identifying the matching substrings. Below we use the r"\bni+\b" regex as an example; it matches "n" followed by one or more "i"s, both starting and ending at a word boundary, i.e., substrings that may be considered standalone words.

In particular, re.findall extracts all non-overlapping matches to a given regex:

import re
x = "We're the knights who say ni! niiiii! ni! niiiiiiiii!"
re.findall(r"\bni+\b", x)
## ['ni', 'niiiii', 'ni', 'niiiiiiiii']

The order of arguments is look for what, where, not the other way around.

Important

We used the r"..." prefix when entering a string so that \b is not treated as an escape sequence denoting the backspace character. Otherwise, the above would have to be input as "\\bni+\\b".

If we had not insisted on matching at the word boundaries (i.e., if we used "ni+" instead), we would also match the "ni" in "knights".

The re.search function returns an object of class re.Match that enables us to get some more information about the first match:

r = re.search(r"\bni+\b", x)
r.start(), r.end(), r.group()
## (26, 28, 'ni')

The above includes the start and end position (index) and the match itself. If the regex contains capture groups (see below for more details), we can also pinpoint the matches thereto.

Moreover, re.finditer returns an iterable object that includes the same details, but now about all the matches:

rs = re.finditer(r"\bni+\b", x)
for r in rs:
    print((r.start(), r.end(), r.group()))
## (26, 28, 'ni')
## (30, 36, 'niiiii')
## (38, 40, 'ni')
## (42, 52, 'niiiiiiiii')

re.split divides a string into chunks separated by matches to a given regex:

re.split(r"!\s+", x)
## ["We're the knights who say ni", 'niiiii', 'ni', 'niiiiiiiii!']

The r"!\s+" regex matches an exclamation mark followed by one or more whitespace characters.

re.sub replaces each match with a given string:

re.sub(r"\bni+\b", "nu", x)
## "We're the knights who say nu! nu! nu! nu!"

Note

(**) More flexible replacement strings can be generated by passing a custom function as the second argument:

re.sub(r"\bni+\b", lambda m: "n" + "u"*(m.end()-m.start()-1), x)
## "We're the knights who say nu! nuuuuu! nu! nuuuuuuuuu!"

14.4.2. Regex Matching with pandas

The pandas.Series.str accessor also defines a number of vectorised functions that utilise the re package’s matcher.

Example Series object:

x = pd.Series(["ni!", "niiii, ni, nii!", None, "spam, bacon", "nii, ni!"])
x
## 0                ni!
## 1    niiii, ni, nii!
## 2               None
## 3        spam, bacon
## 4           nii, ni!
## dtype: object

Here are the most notable functions. Their names are self-explanatory, so let us simply present a series of examples instead of describing them at length.

x.str.contains(r"\bni+\b")
## 0     True
## 1     True
## 2     None
## 3    False
## 4     True
## dtype: object
x.str.count(r"\bni+\b")
## 0    1.0
## 1    3.0
## 2    NaN
## 3    0.0
## 4    2.0
## dtype: float64
x.str.replace(r"\bni+\b", "nu", regex=True)
## 0            nu!
## 1    nu, nu, nu!
## 2           None
## 3    spam, bacon
## 4        nu, nu!
## dtype: object
x.str.findall(r"\bni+\b")
## 0                [ni]
## 1    [niiii, ni, nii]
## 2                None
## 3                  []
## 4           [nii, ni]
## dtype: object
x.str.split(r",\s+")  # a comma, one or more whitespaces
## 0                [ni!]
## 1    [niiii, ni, nii!]
## 2                 None
## 3        [spam, bacon]
## 4           [nii, ni!]
## dtype: object

In the last two cases, we get lists of strings as results.

Also, later we will mention pandas.Series.str.extract and pandas.Series.str.extractall which work with regexes that include capture groups.

Note

(*) If we intend to seek matches to the same pattern in many different strings without the use of pandas, it might be a good idea to pre-compile the regex first and then use the re.Pattern.findall method instead of re.findall:

p = re.compile(r"\bni+\b")  # returns an object of class `re.Pattern`
p.findall("We're the knights who say ni! ni! niiiii! nininiiiiiiiii!")
## ['ni', 'ni', 'niiiii']

14.4.3. Matching Individual Characters

Most programming languages and text editors (including Kate, Eclipse, and VSCode) support finding or replacing patterns with regexes. Therefore, they should be amongst the instruments at every data scientist’s disposal. One general introduction to regexes is [Fri06]. The re module flavour is summarised in the official manual, see also [Kuc22]. In the following sections we review the most important elements of the regex syntax as we did in [Gag22].

We begin by discussing different ways to define character sets. In this part, determining the length of all matching substrings will be quite straightforward.

Important

The following characters have special meaning to the regex engine:

. \ | ( ) [ ] { } ^ $ * + ?

Any regular expression that contains none of the above behaves like a fixed pattern:

re.findall("spam", "spam, eggs, spam, bacon, sausage, and spam")
## ['spam', 'spam', 'spam']

There are hence 3 occurrences of a pattern that consists of 4 code points: “s” followed by “p”, then by “a”, and ending with “m”.

If we wish to include a special character as part of a regular expression – so that it is treated literally – we will need to escape it with a backslash, “\”.

re.findall(r"\.", "spam...")
## ['.', '.', '.']

14.4.3.1. Matching Any Character

The (unescaped) dot, “.”, matches any code point except the newline.

x = "Spam, ham,\njam, SPAM, eggs, and spam"
re.findall("..am", x, re.IGNORECASE)
## ['Spam', ' ham', 'SPAM', 'spam']

The above matches non-overlapping length-4 substrings that end with “am”, case insensitively.

The dot’s insensitivity to the newline character is motivated by the need to maintain compatibility with tools such as grep (when searching within text files in a line-by-line manner). This behaviour can be altered by setting the DOTALL flag.

re.findall("..am", x, re.DOTALL|re.IGNORECASE)
## ['Spam', ' ham', '\njam', 'SPAM', 'spam']

14.4.3.2. Defining Character Sets

Sets of characters can be introduced by enumerating their members within a pair of square brackets. For instance, “[abc]” denotes the set {a, b, c} – such a regular expression matches one (and only one) symbol from this set. Moreover, in:

re.findall("[hj]am", x)
## ['ham', 'jam']

the “[hj]am” regex matches: “h” or “j”, followed by “a”, followed by “m”. In other words, "ham" and "jam" are the only two strings that are matched by this pattern (unless matching is done case-insensitively).

Important

The following characters, if used within square brackets, may be treated non-literally:

\ [ ] ^ - & ~ |

Therefore, to include them as-is in a character set, the backslash-escape must be used. For example, “[\[\]\\]” matches a backslash or a square bracket.

14.4.3.3. Complementing Sets

Including “^” (the caret) after the opening square bracket denotes the set complement. Hence, “[^abc]” matches any code point except “a”, “b”, and “c”. Here is an example where we seek any substring that consists of 3 non-spaces.

x = "Nobody expects the Spanish Inquisition!"
re.findall("[^ ][^ ][^ ]", x)
## ['Nob', 'ody', 'exp', 'ect', 'the', 'Spa', 'nis', 'Inq', 'uis', 'iti', 'on!']

14.4.3.4. Defining Code Point Ranges

Each Unicode code point can be referenced by its unique numeric identifier. For instance, “a” is assigned code U+0061 and “z” is mapped to U+007A. In the pre-Unicode era (mostly with regard to the ASCII codes, ≤ U+007F, representing English letters, decimal digits, some punctuation characters, and a few control characters), we were used to relying on specific code ranges; e.g., “[a-z]” denotes the set comprised of all characters with codes between U+0061 and U+007A, i.e., lowercase letters of the English (Latin) alphabet.

re.findall("[0-9A-Za-z]", "Gągolewski")
## ['G', 'g', 'o', 'l', 'e', 'w', 's', 'k', 'i']

The above pattern denotes a union of 3 code ranges: digits and ASCII upper- and lowercase letters.

Nowadays, in the processing of text in natural languages, this notation should rather be avoided. Note the missing “ą” (Polish “a” with ogonek) in the result.

14.4.3.5. Using Predefined Character Sets

Some other noteworthy Unicode-aware code point classes include the “word characters”:

x = "aąbßÆAĄB你12𝟛๔٤,.;'! \t-+=\n[]©←→”„"
re.findall(r"\w", x)
## ['a', 'ą', 'b', 'ß', 'Æ', 'A', 'Ą', 'B', '你', '1', '2', '𝟛', '๔', '٤']

decimal digits:

re.findall(r"\d", x)
## ['1', '2', '𝟛', '๔', '٤']

and whitespaces:

re.findall(r"\s", x)
## [' ', '\t', '\n']

Moreover, e.g., “\W” is equivalent to “[^\w]”, i.e., denotes its complement.
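For instance:

re.findall(r"\W", "spam, eggs!")
## [',', ' ', '!']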

14.4.4. Alternating and Grouping Subexpressions

14.4.4.1. Alternation Operator

The alternation operator, “|” (the pipe or bar), matches either its left or its right branch, for instance:

x = "spam, egg, ham, jam, algae, and an amalgam of spam, all al dente"
re.findall("spam|ham", x)
## ['spam', 'ham', 'spam']

14.4.4.2. Grouping Subexpressions

“|” has a very low precedence. Therefore, if we wish to introduce an alternative of subexpressions, we need to group them using the “(?:...)” syntax. For instance, “(?:sp|h)am” matches either “spam” or “ham”.

Notice that the bare use of the round brackets, "(...)" (i.e., without the "?:" part), has the side effect of creating new capture groups; see below for more details.

Also, matching is always done left-to-right, on a first-come, first-served basis. Hence, if the left branch is a subset of the right one, the latter will never be matched. In particular, “(?:al|alga|algae)” can only match “al”. To fix this, we can write “(?:algae|alga|al)”.
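We can easily verify this behaviour:

re.findall("(?:al|alga|algae)", "algae"), re.findall("(?:algae|alga|al)", "algae")
## (['al'], ['algae'])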

14.4.4.3. Non-grouping Parentheses

Some parenthesised subexpressions – those in which the opening bracket is followed by the question mark – have a distinct meaning. In particular, “(?#...)” denotes a free-format comment that is ignored by the regex parser:

re.findall(
  "(?# match 'sp' or 'h')(?:sp|h)(?# and 'am')am|(?# or match 'egg')egg",
  x
)
## ['spam', 'egg', 'ham', 'spam']

This is just horrible. Luckily, constructing more sophisticated regexes by concatenating subfragments thereof is more readable:

re.findall(
       "(?:sp|h)" +   # match either 'sp' or 'h'
       "am" +         # followed by 'am'
    "|" +        # ... or ...
       "egg",         # just match 'egg'
    x
)
## ['spam', 'egg', 'ham', 'spam']

What is more, e.g., “(?i)” enables the case-insensitive mode.

re.findall("(?i)spam", "Spam spam SPAMITY spAm")
## ['Spam', 'spam', 'SPAM', 'spAm']

14.4.5. Quantifiers

More often than not, a variable number of instances of the same subexpression needs to be captured or its presence should be made optional. This can be achieved by means of the following quantifiers:

  • “?” matches 0 or 1 times;

  • “*” matches 0 or more times;

  • “+” matches 1 or more times;

  • “{n,m}” matches between n and m times;

  • “{n,}” matches at least n times;

  • “{n}” matches exactly n times.

These operators are applied onto the directly preceding atoms. For example, “ni+” captures "ni", "nii", "niii", etc., but neither "n" alone nor "ninini" altogether.
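For instance:

re.findall("ni{2,3}", "ni nii niii niiii")
## ['nii', 'niii', 'niii']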

By default, the quantifiers are greedy – they match the repeated subexpression as many times as possible. The “?” suffix (hence, quantifiers such as “??”, “*?”, “+?”, and so forth) makes them lazy: they match as few occurrences as possible while still obtaining an overall match.

Greedy:

x = "sp(AM)(maps)(SP)am"
re.findall(r"\(.+\)", x)
## ['(AM)(maps)(SP)']

Lazy:

re.findall(r"\(.+?\)", x)
## ['(AM)', '(maps)', '(SP)']

Greedy (but clever):

re.findall(r"\([^)]+\)", x)
## ['(AM)', '(maps)', '(SP)']

The first regex is greedy: it matches an opening bracket, then as many characters as possible (including “)”) that are followed by a closing bracket. The two other patterns terminate as soon as the first closing bracket is found.

More examples:

x = "spamamamnomnomnomammmmmmmmm"
re.findall("sp(?:am|nom)+", x)
## ['spamamamnomnomnomam']
re.findall("sp(?:am|nom)+?", x)
## ['spam']

And:

re.findall("sp(?:am|nom)+?m*", x)
## ['spam']
re.findall("sp(?:am|nom)+?m+", x)
## ['spamamamnomnomnomammmmmmmmm']

Let us stress that the quantifier is applied to the subexpression that stands directly before it. Grouping parentheses can be used in case they are needed.

x = "12, 34.5, 678.901234, 37...629, ..."
re.findall(r"\d+\.\d+", x)
## ['34.5', '678.901234']

matches digits, a dot, and another series of digits.

re.findall(r"\d+(?:\.\d+)?", x)
## ['12', '34.5', '678.901234', '37', '629']

finds digits which are possibly (but not necessarily) followed by a dot and a digit sequence.

Exercise 14.5

Write a regex that extracts all #hashtags from a string #omg #SoEasy.

14.4.6. Capture Groups and References Thereto

Round-bracketed subexpressions (without the "?:" prefix) form the so-called capture groups that can be extracted separately or be referred to in other parts of the same regex.

14.4.6.1. Extracting Capture Group Matches

The above is evident when we use re.findall:

x = "name='Sir Launcelot', quest='Seek the Grail', favecolour='blue'"
re.findall(r"(\w+)='(.+?)'", x)
## [('name', 'Sir Launcelot'), ('quest', 'Seek the Grail'), ('favecolour', 'blue')]

It simply returned the matches to the capture groups, not the whole matching substrings.

re.search and re.finditer can pinpoint each component:

r = re.search(r"(\w+)='(.+?)'", x)
print("all (0):", (r.start(), r.end(), r.group()))
print("     1 :", (r.start(1), r.end(1), r.group(1)))
print("     2 :", (r.start(2), r.end(2), r.group(2)))
## all (0): (0, 20, "name='Sir Launcelot'")
##      1 : (0, 4, 'name')
##      2 : (6, 19, 'Sir Launcelot')

Here is a vectorised version of the above from pandas, returning the first match:

y = pd.Series([
    "name='Sir Launcelot'",
    "quest='Seek the Grail'",
    "favecolour='blue', favecolour='yel.. Aaargh!'"
])
y.str.extract(r"(\w+)='(.+?)'")
##             0               1
## 0        name   Sir Launcelot
## 1       quest  Seek the Grail
## 2  favecolour            blue

We see that the findings are presented in a data frame form. The first column gives the matches to the first capture group, and so forth.

All matches are available too:

y.str.extractall(r"(\w+)='(.+?)'")
##                   0               1
##   match                            
## 0 0            name   Sir Launcelot
## 1 0           quest  Seek the Grail
## 2 0      favecolour            blue
##   1      favecolour   yel.. Aaargh!

Recall that if we just need the grouping part of “(...)”, i.e., without the capturing feature, “(?:...)” can be applied.

Also, named capture groups defined like “(?P<name>...)” are supported.

y.str.extract("(?:\\w+)='(?P<value>.+?)'")
##             value
## 0   Sir Launcelot
## 1  Seek the Grail
## 2            blue

14.4.6.2. Replacing with Capture Group Matches

Matches to particular capture groups can be recalled in replacement strings when using re.sub and pandas.Series.str.replace. Here, the match in its entirety is denoted with “\g<0>”, then “\g<1>” stores whatever was caught by the first capture group, “\g<2>” is the match to the second capture group, etc.

re.sub(r"(\w+)='(.+?)'", r"\g<2> is a \g<1>", x)
## 'Sir Launcelot is a name, Seek the Grail is a quest, blue is a favecolour'

Named capture groups can be referred to too:

re.sub(r"(?P<key>\w+)='(?P<value>.+?)'",
  r"\g<value> is a \g<key>", x)
## 'Sir Launcelot is a name, Seek the Grail is a quest, blue is a favecolour'

14.4.6.3. Back-Referencing

Matches to capture groups can also be part of the regexes themselves. For example, “\1” denotes whatever has been consumed by the first capture group.
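For instance, here is a pattern that detects doubled word characters:

re.findall(r"(\w)\1", "bookkeeper")
## ['o', 'k', 'e']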

Even though, in general, parsing HTML code with regexes is not recommended, let us consider the following examples:

x = "<strong><em>spam</em></strong><code>eggs</code>"
re.findall(r"<[a-z]+>.*?</[a-z]+>", x)
## ['<strong><em>spam</em>', '<code>eggs</code>']
re.findall(r"(<([a-z]+)>.*?</\2>)", x)
## [('<strong><em>spam</em></strong>', 'strong'), ('<code>eggs</code>', 'code')]

The second regex guarantees that the match will include all characters between the opening <tag> and the corresponding (not: any) closing </tag>. Named capture groups can be referenced using the “(?P=name)” syntax:

re.findall(r"(<(?P<tagname>[a-z]+)>.*?</(?P=tagname)>)", x)
## [('<strong><em>spam</em></strong>', 'strong'), ('<code>eggs</code>', 'code')]

14.4.7. Anchoring

Lastly, let us mention the ways to match a pattern at a given abstract position within a string.

14.4.7.1. Matching at the Beginning or End of a String

“^” and “$” match, respectively, the start and the end of the string (or of each line within a string, if the re.MULTILINE flag is set).

x = pd.Series(["spam egg", "bacon spam", "spam", "egg spam bacon", "sausage"])
rs = ["spam", "^spam", "spam$", "spam$|^spam", "^spam$"]
pd.concat([x.str.contains(r) for r in rs], axis=1, keys=rs)
##     spam  ^spam  spam$  spam$|^spam  ^spam$
## 0   True   True  False         True   False
## 1   True  False   True         True   False
## 2   True   True   True         True    True
## 3   True  False  False        False   False
## 4  False  False  False        False   False

The 5 regular expressions match “spam”, respectively, anywhere within the string, at the beginning, at the end, at the beginning or end, and in strings that are equal to the pattern itself.

Exercise 14.6

Write a regex that does the same job as str.strip.

14.4.7.2. Matching at Word Boundaries

Furthermore, “\b” matches at a “word boundary”, e.g., near spaces, punctuation marks, or at the start/end of a string (i.e., wherever there is a transition between a word, “\w”, and a non-word character, “\W”, or vice versa).

In the following example, we match all stand-alone numbers (this regular expression is provided for didactic purposes only):

re.findall(r"[-+]?\b\d+(?:\.\d+)?\b", "+12, 34.5, -5.3243")
## ['+12', '34.5', '-5.3243']

14.4.7.3. Looking Behind and Ahead

There are also ways to guarantee that a pattern occurrence begins or ends with a match to some subexpression: “(?<=...)...” is the so-called look-behind, whereas “...(?=...)” denotes the look-ahead. Moreover, “(?<!...)...” and “...(?!...)” are their negated (“negative look-behind/ahead”) versions.

x = "I like spam, spam, eggs, and spam."
re.findall(r"\b\w+\b(?=[,.])", x)
## ['spam', 'spam', 'eggs', 'spam']
re.findall(r"\b\w+\b(?![,.])", x)
## ['I', 'like', 'and']

The first regex captures words that are followed by “,” or “.”. The second one matches words that are followed by neither of them.
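Look-behinds work analogously; e.g., here we extract only the numbers that are preceded by the dollar sign:

re.findall(r"(?<=\$)\d+", "spam costs $3 and a lobster is $10")
## ['3', '10']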

Exercise 14.7

Write a regex that extracts all standalone numbers accepted by Python, including 12.123, -53, +1e-9, -1.2423e10, 4. and .2.

Exercise 14.8

Write a regex that matches all email addresses.

Exercise 14.9

Write a regex that matches all URLs starting with http:// or https://.

Exercise 14.10

Cleanse the warsaw_weather dataset so that it contains analysable numeric data.

14.5. Exercises

Exercise 14.11

List some ways to normalise character strings.

Exercise 14.12

(**) What are the challenges of processing non-English text?

Exercise 14.13

What are the problems with the "[A-Za-z]" and "[A-z]" character sets?

Exercise 14.14

Name the two ways to turn on case-insensitive regex matching.

Exercise 14.15

What is a word boundary?

Exercise 14.16

What is the difference between the "^" and "$" anchors?

Exercise 14.17

When would we prefer using "[0-9]" instead of "\d"?

Exercise 14.18

What is the difference between the "?", "??", "*", "*?", "+", and "+?" quantifiers?

Exercise 14.19

Does "." match all possible characters?

Exercise 14.20

What are named capture groups and how to refer to the matches thereto in re.sub?