14. Text data

In [35], it is noted that effective processing of character strings is needed at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation; compare, e.g., [93] and [20]. Pattern searching, string collation and sorting, normalisation, transliteration, and formatting are ubiquitous in text mining, natural language processing, and bioinformatics. Means for the handling of string data should be included in each statistician’s or data scientist’s repertoire to complement their numerical computing and data wrangling skills.

In this chapter, we discuss the handiest string operations in base Python, together with their vectorised versions in numpy and pandas. We also mention some more advanced features of the Unicode ICU library.

14.1. Basic string operations

Recall from Section 2.1.3 that the str class represents individual character strings:

x = "spam"
type(x)
## <class 'str'>

There are a few binary operators overloaded for strings, e.g., `+` stands for string concatenation:

x + " and eggs"
## 'spam and eggs'

`*` replicates a given string:

x * 3
## 'spamspamspam'

Chapter 3 noted that str is a sequential type. As a consequence, we can extract individual code points and create substrings using the index operator:

x[-1]  # last letter
## 'm'

Strings are immutable, but parts thereof can always be reused in conjunction with the concatenation operator:

x[:2] + "ecial"
## 'special'

14.1.1. Unicode as the universal encoding

It is worth knowing that all strings in Python (from version 3.0) use Unicode[1], which is a universal encoding capable of representing c. 150 000 characters covering letters and numbers in contemporary and historic alphabets/scripts, mathematical, political, phonetic, and other symbols, emojis, etc.

Note

Despite the wide support for Unicode, sometimes our own or other readers’ display (e.g., web browsers when viewing an HTML version of the output report) might not be able to render all code points properly, e.g., due to missing fonts. Still, we can rest assured that they are processed correctly if string functions are applied thereon.

14.1.2. Normalising strings

Dirty text data are a pain, especially if similar (semantically) tokens are encoded in many different ways. For the sake of string matching, we might want, e.g., the German "groß", "GROSS", and "  gross     " to compare as equal.

str.strip removes whitespaces (spaces, tabs, newline characters) at both ends of strings (see also str.lstrip and str.rstrip for their nonsymmetric versions).
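For instance (only the outermost whitespaces are affected):

"  gross \t\n".strip(), "  gross     ".lstrip(), "  gross     ".rstrip()
## ('gross', 'gross     ', '  gross')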

str.lower and str.upper change letter case. For caseless comparison/matching, str.casefold might be a slightly better option as it unfolds many more code point sequences:

"Groß".lower(), "Groß".upper(), "Groß".casefold()
## ('groß', 'GROSS', 'gross')

Note

(*) More advanced string transliteration can be performed by means of the ICU (International Components for Unicode) library. Its Python bindings are provided by the PyICU package. Unfortunately, the package is not easily available on W****ws.

For instance, converting all code points to ASCII (English) might be necessary when identifiers are expected to miss some diacritics that would normally be included (as in "Gągolewski" vs "Gagolewski"):

import icu  # PyICU package
(icu.Transliterator
    .createInstance("Lower; Any-Latin; Latin-ASCII")
    .transliterate(
        "Χαίρετε! Groß gżegżółka — © La Niña – köszönöm – Gągolewski"
    )
)
## 'chairete! gross gzegzolka - (C) la nina - koszonom - gagolewski'

Converting between different Unicode Normalisation Forms (also available in the unicodedata package and via pandas.Series.str.normalize) might be used for the removal of some formatting nuances:

icu.Transliterator.createInstance("NFKD; NFC").transliterate("¼ąr²")
## '1⁄4ąr2'
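Readers without access to PyICU can obtain the same normalisation (though not the transliteration) using the standard unicodedata module:

import unicodedata
unicodedata.normalize("NFC", unicodedata.normalize("NFKD", "¼ąr²"))
## '1⁄4ąr2'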

14.1.3. Substring searching and replacing

Determining if a string has a particular fixed substring can be done in several ways.

For instance, the in operator verifies whether a particular substring occurs at least once:

food = "bacon, spam, spam, srapatapam, eggs, and spam"
"spam" in food
## True

The str.count method determines the number of occurrences of a substring:

food.count("spam")
## 3

To locate the first pattern appearance, we call str.index:

food.index("spam")
## 7

str.replace substitutes matching substrings with new content:

food.replace("spam", "veggies")
## 'bacon, veggies, veggies, srapatapam, eggs, and veggies'
Exercise 14.1

Read the manual of the following methods: str.startswith, str.endswith, str.find, str.rfind, str.rindex, str.removeprefix, and str.removesuffix.

The splitting of long strings at specific fixed delimiters can be done via str.split:

food.split(", ")
## ['bacon', 'spam', 'spam', 'srapatapam', 'eggs', 'and spam']

See also str.partition. The str.join method implements the inverse operation:

", ".join(["spam", "bacon", "eggs", "spam"])
## 'spam, bacon, eggs, spam'
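For the record, str.partition splits a string into exactly three parts: the fragment before the first occurrence of the delimiter, the delimiter itself, and the remainder:

food.partition(", ")
## ('bacon', ', ', 'spam, spam, srapatapam, eggs, and spam')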

Moreover, Section 14.4 will discuss pattern matching with regular expressions. They can be useful in, amongst others, extracting more abstract data chunks (numbers, URLs, email addresses, IDs) from strings.

14.1.4. Locale-aware services in ICU (*)

Recall that relational operators such as `<` and `>=` compare strings lexicographically (like in a dictionary or an encyclopedia):

"spam" > "egg"
## True

We have: "a" < "aa" < "aaaaaaaaaaaaa" < "ab" < "aba" < "abb" < "b" < "ba" < "baaaaaaa" < "bb" < "spanish inquisition" (but not "Spanish Inquisition": all uppercase English letters have smaller code points than the lowercase ones; see below).

The lexicographic ordering (character-by-character, from left to right) is not necessarily appropriate for strings with numerals:

"a9" < "a123"  # 1 is smaller than 9
## False

Additionally, it only takes into account the numeric codes (see Section 14.4.3.4) corresponding to each Unicode character. Consequently, it does not work well with non-English alphabets:

"MIELONECZKĄ" < "MIELONECZKI"
## False

In Polish, A with ogonek (Ą) is expected to sort after A and before B, let alone I. However, their corresponding numeric codes in the Unicode table are: 260 (Ą), 65 (A), 66 (B), and 73 (I). The resulting ordering is thus incorrect, as far as natural language processing is concerned.

It is best to perform string collation using the services provided by ICU. Here is an example of German phone book-like collation where "ö" is treated the same as "oe":

c = icu.Collator.createInstance(icu.Locale("de_DE@collation=phonebook"))
c.setStrength(0)  # ignore case and some diacritics
c.compare("Löwe", "loewe")
## 0

A result of 0 means that the strings are deemed equal.

In some languages, contractions occur: e.g., in Slovak and Czech, the digraph "ch" (two code points) is treated as a single entity that sorts after "h":

icu.Collator.createInstance(icu.Locale("sk_SK")).compare("chladný", "hladný")
## 1

This means that we have "chladný" > "hladný" (the first argument is greater than the second one). Compare the above to something similar in Polish:

icu.Collator.createInstance(icu.Locale("pl_PL")).compare("chłodny", "hardy")
## -1

That is, "chłodny" < "hardy" (the first argument is less than the second one).
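Collator objects can also assist in locale-aware sorting, e.g., via the key argument to the built-in sorted function. Here is a sketch that fixes the foregoing "MIELONECZKĄ" ordering problem:

c = icu.Collator.createInstance(icu.Locale("pl_PL"))
sorted(["MIELONECZKI", "MIELONECZKĄ", "MIELONECZKA"], key=c.getSortKey)
## ['MIELONECZKA', 'MIELONECZKĄ', 'MIELONECZKI']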

Also, with ICU, numeric collation is possible:

c = icu.Collator.createInstance()
c.setAttribute(
    icu.UCollAttribute.NUMERIC_COLLATION,
    icu.UCollAttributeValue.ON
)
c.compare("a9", "a123")
## -1

This is the correct result: "a9" is less than "a123" (compare the above to the example where we used the ordinary `<`).

14.1.5. String operations in pandas

By default, string sequences in pandas.Series are stored using the broadest possible data type, object:

pd.Series(["spam", "bacon", "spam"])
## 0     spam
## 1    bacon
## 2     spam
## dtype: object

This allows missing values to be encoded by means of the None object (which is of the type NoneType, not str); compare Section 15.1.

Vectorised versions of base string operations are available via the pandas.Series.str accessor. We thus have pandas.Series.str.strip, pandas.Series.str.split, pandas.Series.str.find, and so forth. For instance:

x = pd.Series(["spam", "bacon", None, "buckwheat", "spam"])
x.str.upper()
## 0         SPAM
## 1        BACON
## 2         None
## 3    BUCKWHEAT
## 4         SPAM
## dtype: object

But there is more. For example, a function to compute the length of each string:

x.str.len()
## 0    4.0
## 1    5.0
## 2    NaN
## 3    9.0
## 4    4.0
## dtype: float64

Vectorised concatenation of strings can be performed using the overloaded `+` operator:

x + " and spam"
## 0         spam and spam
## 1        bacon and spam
## 2                   NaN
## 3    buckwheat and spam
## 4         spam and spam
## dtype: object

To concatenate all items into a single string, we call:

x.str.cat(sep="; ")
## 'spam; bacon; buckwheat; spam'

Conversion to numeric:

pd.Series(["1.3", "-7", None, "3523"]).astype(float)
## 0       1.3
## 1      -7.0
## 2       NaN
## 3    3523.0
## dtype: float64

Select substrings:

x.str.slice(2, -1)  # like x.iloc[i][2:-1] for all i
## 0         a
## 1        co
## 2      None
## 3    ckwhea
## 4         a
## dtype: object

Replace substrings:

x.str.slice_replace(0, 2, "tofu")  # like x.iloc[i][0:2] = "tofu" for all i
## 0         tofuam
## 1        tofucon
## 2           None
## 3    tofuckwheat
## 4         tofuam
## dtype: object
Exercise 14.2

Consider the nasaweather_glaciers data frame. All glaciers are assigned 11/12-character unique identifiers as defined by the WGMS convention that forms the glacier ID number by combining the following five elements:

  1. 2-character political unit (the first two letters of the ID),

  2. 1-digit continent code (the third character),

  3. 4-character drainage code (the next four),

  4. 2-digit free position code (the next two),

  5. 2- or 3-digit local glacier code (the remaining ones).

Extract the five chunks and store them as independent columns in the data frame.

14.1.6. String operations in numpy (*)

There is a huge overlap between the numpy and pandas capabilities for string handling, with the latter being more powerful. After all, numpy is a workhorse for numerical computing. Still, some readers might find what follows useful.

As mentioned in our introduction to numpy vectors, objects of the type ndarray can store not only numeric and logical data, but also character strings. For example:

x = np.array(["spam", "bacon", "egg"])
x
## array(['spam', 'bacon', 'egg'], dtype='<U5')

Here, the data type “<U5” means that we deal with Unicode strings of length no greater than five. Unfortunately, replacing an element with content that is too long results in a truncated string:

x[2] = "buckwheat"
x
## array(['spam', 'bacon', 'buckw'], dtype='<U5')

To remedy this, we first need to recast the vector manually:

x = x.astype("<U10")
x[2] = "buckwheat"
x
## array(['spam', 'bacon', 'buckwheat'], dtype='<U10')

Conversion from/to numeric is also possible:

np.array(["1.3", "-7", "3523"]).astype(float)
## array([ 1.300e+00, -7.000e+00,  3.523e+03])
np.array([1, 3.14, -5153]).astype(str)
## array(['1.0', '3.14', '-5153.0'], dtype='<U32')

The numpy.char module includes several vectorised versions of string routines, most of which we have already discussed. For example:

x = np.array([
    "spam", "spam, bacon, and spam",
    "spam, eggs, bacon, spam, spam, and spam"
])
np.char.split(x, ", ")
## array([list(['spam']), list(['spam', 'bacon', 'and spam']),
##        list(['spam', 'eggs', 'bacon', 'spam', 'spam', 'and spam'])],
##       dtype=object)
np.char.count(x, "spam")
## array([1, 2, 4])

Vectorised operations that we would normally perform through the binary operators (i.e., `+`, `*`, `<`, etc.) are available through standalone functions:

np.char.add(["spam", "bacon"], " and spam")
## array(['spam and spam', 'bacon and spam'], dtype='<U14')
np.char.equal(["spam", "bacon", "spam"], "spam")
## array([ True, False,  True])

The function that returns the length of each string is also noteworthy:

np.char.str_len(x)
## array([ 4, 21, 39])

14.2. Working with string lists

pandas nicely supports lists of strings of varying lengths. For instance:

x = pd.Series([
    "spam",
    "spam, bacon, spam",
    "potatoes",
    None,
    "spam, eggs, bacon, spam, spam"
])
xs = x.str.split(", ", regex=False)
xs
## 0                             [spam]
## 1                [spam, bacon, spam]
## 2                         [potatoes]
## 3                               None
## 4    [spam, eggs, bacon, spam, spam]
## dtype: object

And now, e.g., looking at the last element:

xs.iloc[-1]
## ['spam', 'eggs', 'bacon', 'spam', 'spam']

reveals that it is indeed a list of strings.

There are a few vectorised operations that enable us to work with such variable-length lists, such as concatenating all strings:

xs.str.join("; ")
## 0                             spam
## 1                spam; bacon; spam
## 2                         potatoes
## 3                             None
## 4    spam; eggs; bacon; spam; spam
## dtype: object

selecting, say, the first string in each list:

xs.str.get(0)
## 0        spam
## 1        spam
## 2    potatoes
## 3        None
## 4        spam
## dtype: object

or slicing:

xs.str.slice(0, -1)  # like xs.iloc[i][0:-1] for all i
## 0                           []
## 1                [spam, bacon]
## 2                           []
## 3                         None
## 4    [spam, eggs, bacon, spam]
## dtype: object
Exercise 14.3

(*) Using pandas.merge, join the countries, world_factbook_2020, and ssi_2016_dimensions datasets based on the country names. Note that some manual data cleansing will be necessary beforehand.

Exercise 14.4

(**) Given a Series object xs that includes lists of strings, convert it to a 0/1 representation.

  1. Determine the list of all unique strings; let’s call it xu.

  2. Create a data frame x with xs.shape[0] rows and len(xu) columns such that x.iloc[i, j] is equal to 1 if xu[j] is amongst xs.loc[i] and equal to 0 otherwise. Set the column names to xu.

  3. Given x (and only x: neither xs nor xu), perform the inverse operation.

For example, for the above xs object, x should look like:

##    bacon  eggs  potatoes  spam
## 0      0     0         0     1
## 1      1     0         0     1
## 2      0     0         1     0
## 3      0     0         0     0
## 4      1     1         0     1

14.3. Formatted outputs for reproducible report generation

Some good development practices related to reproducible report generation are discussed in [84, 102, 103]. Note that the paradigm of literate programming was introduced by D. Knuth in [57].

Reports from data analysis can be prepared, e.g., in Jupyter Notebooks or by writing directly to Markdown files which we can later compile to PDF or HTML. Below we briefly discuss how to output nicely formatted objects programmatically.

14.3.1. Formatting strings

Inclusion of textual representation of data stored in existing objects can easily be done using f-strings (formatted string literals; see Section 2.1.3.1) of the type f"...{expression}...". For instance:

pi = 3.14159265358979323846
f"π = {pi:.2f}"
## 'π = 3.14'

creates a string showing the value of the variable pi formatted as a float rounded to two places after the decimal separator.
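The part after the colon follows Python's format specification mini-language, which also controls, amongst others, the field width, alignment, padding, and the notation used:

f"{pi:10.4f}|{pi:<10.2f}|{pi:.3e}|{42:05d}"
## '    3.1416|3.14      |3.142e+00|00042'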

Note

(**) Similar functionality can be achieved using the str.format method:

"π = {:.2f}".format(pi)
## 'π = 3.14'

as well as the `%` operator overloaded for strings, which uses sprintf-like value placeholders known to some readers from other programming languages (such as C):

"π = %.2f" % pi
## 'π = 3.14'

14.3.2. str and repr

The str and repr functions can create string representations of many objects:

x = np.array([1, 2, 3])
str(x)
## '[1 2 3]'
repr(x)
## 'array([1, 2, 3])'

The former is more human-readable, and the latter is slightly more technical. Note that repr often returns an output that can be interpreted as executable Python code with no or few adjustments. Nonetheless, pandas objects are amongst the many exceptions to this rule.
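For basic built-in types, evaluating the output of repr usually recreates an equivalent object:

y = [1, "spam", (3.5, None)]
eval(repr(y)) == y
## True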

14.3.3. Aligning strings

str.center, str.ljust, and str.rjust can be used to centre-, left-, or right-align a string so that it is of at least a given width, which might make the display thereof more aesthetic. Very long strings, possibly containing whole text paragraphs, can be dealt with using the wrap and shorten functions from the textwrap module.
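For example (the second argument to str.center is an optional fill character):

"spam".center(10, "*"), "spam".rjust(10)
## ('***spam***', '      spam')
import textwrap
print("\n".join(textwrap.wrap(
    "A longer paragraph that we would like displayed neatly.", 20)))
## A longer paragraph
## that we would like
## displayed neatly.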

14.3.4. Direct Markdown output in Jupyter

Further, with IPython/Jupyter, we can output strings that will be directly interpreted as Markdown-formatted:

import IPython.display
x = 2+2
out = f"*Result*: $2^2=2\\cdot 2={x}$."  # LaTeX math
IPython.display.Markdown(out)

Result: 2²=2·2=4.

Recall from Section 1.2.5 that Markdown is a very flexible markup[2] language that allows us to define itemised and numbered lists, mathematical formulae, tables, images, etc.

On a side note, data frames can be nicely prepared for display in a report using pandas.DataFrame.to_markdown.

14.3.5. Manual Markdown file output (*)

We can also generate Markdown code programmatically in the form of standalone .md files:

import tempfile, os.path
filename = os.path.join(tempfile.mkdtemp(), "test-report.md")
f = open(filename, "w")  # open for writing (overwrite if exists)
f.write("**Yummy Foods** include, but are not limited to:\n\n")
x = ["spam", "bacon", "eggs", "spam"]
for e in x:
    f.write(f"* {e}\n")
f.write("\nAnd now for something *completely* different:\n\n")
f.write("Rank | Food\n")
f.write("-----|-----\n")
for i in range(len(x)):
    f.write(f"{i+1:4} | {x[i][::-1]:10}\n")
f.close()

Here is the resulting raw Markdown source file:

with open(filename, "r") as f:  # will call f.close() automatically
    out = f.read()
print(out)
## **Yummy Foods** include, but are not limited to:
## 
## * spam
## * bacon
## * eggs
## * spam
## 
## And now for something *completely* different:
## 
## Rank | Food
## -----|-----
##    1 | maps      
##    2 | nocab     
##    3 | sgge      
##    4 | maps

We can convert it to other formats, including HTML, PDF, EPUB, ODT, and even presentations by running[3] the pandoc tool. We may also embed it directly inside an IPython/Jupyter notebook:

IPython.display.Markdown(out)

Yummy Foods include, but are not limited to:

  • spam

  • bacon

  • eggs

  • spam

And now for something completely different:

Rank | Food
---- | -----
1    | maps
2    | nocab
3    | sgge
4    | maps

Note

Figures created in matplotlib can be exported to PNG, SVG, or PDF files using the matplotlib.pyplot.savefig function. We can include them manually in a Markdown document using the ![description](filename) syntax.
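For instance (a minimal sketch; the target file name is arbitrary):

import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [3, 1, 2])  # any example figure
plt.savefig("figure.png", dpi=150)  # export it to a PNG file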

Note

(*) IPython/Jupyter Notebooks can be converted to different formats using the jupyter-nbconvert command line tool. jupytext can create notebooks from ordinary text files. Literate programming with mixed R and Python is possible with the R packages knitr and reticulate. See [75] for an overview of many more options.

14.4. Regular expressions (*)

This section contains large excerpts from yours truly’s other work [35].

Regular expressions (regexes) provide a concise grammar for defining systematic patterns which can be sought in character strings. Examples of such patterns include: specific fixed substrings, emojis of any kind, standalone sequences of lower-case Latin letters (“words”), substrings that can be interpreted as real numbers (with or without fractional parts, also in scientific notation), telephone numbers, email addresses, or URLs.

Theoretically, the concept of regular pattern matching dates to the so-called regular languages and finite state automata [56]; see also [78] and [51]. Regexes, in the form we know them today, were already present in one of the pre-UNIX implementations of the command-line text editor qed [79] (a predecessor of ed and, later, sed).

14.4.1. Regex matching with re (*)

In Python, the re module implements a regular expression matching engine. It accepts patterns that follow similar syntax to the one available in the Perl language.

As a matter of fact, most programming languages and text editors (including Kate, Eclipse, and VSCodium) support finding and replacing patterns with regexes. This is why they should be amongst the instruments at every data scientist’s disposal.

Before we proceed with a detailed discussion on how to read and write regular expressions, let’s first review some of the methods for identifying the matching substrings. Below we use the r"\bni+\b" regex as an example. It catches "n" followed by one or more "i"s, with the match beginning and ending at a word boundary. In other words, we seek "ni", "nii", "niii", etc. which may be considered standalone words.

In particular, re.findall extracts all non-overlapping matches to a given regex:

import re
x = "We're the knights who say ni! niiiii! ni! niiiiiiiii!"
re.findall(r"\bni+\b", x)
## ['ni', 'niiiii', 'ni', 'niiiiiiiii']

The order of arguments is (look for what, where), not vice versa.

Important

We used the r"..." prefix to input a string so that “\b” is not treated as an escape sequence which denotes the backspace character. Otherwise, the foregoing would have to be written as “\\bni+\\b”.

If we had not insisted on matching at the word boundaries (i.e., if we used the simple "ni+" regex instead), we would also match the "ni" in "knights".

The re.search function returns an object of the class re.Match that enables us to get some more information about the first match:

r = re.search(r"\bni+\b", x)
r.start(), r.end(), r.group()
## (26, 28, 'ni')

It includes the start and the end position (index) as well as the match itself. If the regex contains capture groups (more details follow), we can also pinpoint the matches thereto.

Moreover, re.finditer returns an iterable object that includes the same details, but now about all the matches:

rs = re.finditer(r"\bni+\b", x)
for r in rs:
    print((r.start(), r.end(), r.group()))
## (26, 28, 'ni')
## (30, 36, 'niiiii')
## (38, 40, 'ni')
## (42, 52, 'niiiiiiiii')

re.split divides a string into chunks separated by matches to a given regex:

re.split(r"!\s+", x)
## ["We're the knights who say ni", 'niiiii', 'ni', 'niiiiiiiii!']

The “!\s+” regex matches an exclamation mark followed by one or more whitespace characters.

Using re.sub, each match can be replaced with a given string:

re.sub(r"\bni+\b", "nu", x)
## "We're the knights who say nu! nu! nu! nu!"

Note

(**) More flexible replacement strings can be generated by passing a custom function as the second argument:

re.sub(r"\bni+\b", lambda m: "n" + "u"*(m.end()-m.start()-1), x)
## "We're the knights who say nu! nuuuuu! nu! nuuuuuuuuu!"

14.4.2. Regex matching with pandas (*)

The pandas.Series.str accessor also defines a number of vectorised functions that utilise the re package’s matcher.

Example Series object:

x = pd.Series(["ni!", "niiii, ni, nii!", None, "spam, bacon", "nii, ni!"])
x
## 0                ni!
## 1    niiii, ni, nii!
## 2               None
## 3        spam, bacon
## 4           nii, ni!
## dtype: object

Here are the most notable functions:

x.str.contains(r"\bni+\b")
## 0     True
## 1     True
## 2     None
## 3    False
## 4     True
## dtype: object
x.str.count(r"\bni+\b")
## 0    1.0
## 1    3.0
## 2    NaN
## 3    0.0
## 4    2.0
## dtype: float64
x.str.replace(r"\bni+\b", "nu", regex=True)
## 0            nu!
## 1    nu, nu, nu!
## 2           None
## 3    spam, bacon
## 4        nu, nu!
## dtype: object
x.str.findall(r"\bni+\b")
## 0                [ni]
## 1    [niiii, ni, nii]
## 2                None
## 3                  []
## 4           [nii, ni]
## dtype: object
x.str.split(r",\s+")  # a comma, one or more whitespaces
## 0                [ni!]
## 1    [niiii, ni, nii!]
## 2                 None
## 3        [spam, bacon]
## 4           [nii, ni!]
## dtype: object

In the last two cases, we get lists of strings as results.

Also, later we will mention pandas.Series.str.extract and pandas.Series.str.extractall which work with regexes that include capture groups.

Note

(*) If we intend to seek matches to the same pattern in many different strings without the use of pandas, it might be faster to precompile a regex first, and then use the re.Pattern.findall method instead of re.findall:

p = re.compile(r"\bni+\b")  # returns an object of the class `re.Pattern`
p.findall("We're the Spanish Inquisition ni! ni! niiiii! nininiiiiiiiii!")
## ['ni', 'ni', 'niiiii']

14.4.3. Matching individual characters (*)

In the coming subsections, we review the most essential elements of the regex syntax as we did in [35]. One general introduction to regexes is [31]. The re module flavour is summarised in the official manual; see also [59].

We begin by discussing different ways to define character sets. In this part, determining the length of all matching substrings will be straightforward.

Important

The following characters have special meaning to the regex engine: “.”, “\”, “|”, “(”, “)”, “[”, “]”, “{”, “}”, “^”, “$”, “*”, “+”, and “?”.

Any regular expression that contains none of the preceding characters behaves like a fixed pattern:

re.findall("spam", "spam, eggs, spam, bacon, sausage, and spam")
## ['spam', 'spam', 'spam']

There are three occurrences of a pattern composed of four code points: “s” followed by “p”, then by “a”, and ending with “m”.

If we want to include a special character as part of a regular expression so that it is treated literally, we will need to escape it with a backslash, “\”.

re.findall(r"\.", "spam...")
## ['.', '.', '.']

14.4.3.1. Matching anything (almost) (*)

The (unescaped) dot, “.”, matches any code point except the newline.

x = "Spam, ham,\njam, SPAM, eggs, and spam"
re.findall("..am", x, re.IGNORECASE)
## ['Spam', ' ham', 'SPAM', 'spam']

It extracted non-overlapping substrings of length four that end with “am”, case-insensitively.

The dot’s insensitivity to the newline character is motivated by the need to maintain compatibility with tools such as grep (when searching within text files in a line-by-line manner). This behaviour can be altered by setting the DOTALL flag.

re.findall("..am", x, re.DOTALL|re.IGNORECASE)  # `|` is the bitwise OR
## ['Spam', ' ham', '\njam', 'SPAM', 'spam']

14.4.3.2. Defining character sets (*)

Sets of characters can be introduced by enumerating their members within a pair of square brackets. For instance, “[abc]” denotes the set {a, b, c} – such a regular expression matches one (and only one) symbol from this set. Moreover, in:

re.findall("[hj]am", x)
## ['ham', 'jam']

the “[hj]am” regex matches: “h” or “j”, followed by “a”, followed by “m”. In other words, "ham" and "jam" are the only two strings that are matched by this pattern (unless matching is done case-insensitively).

Important

The following characters, if used within square brackets, may be treated not literally: “\”, “[”, “]”, “^”, “-”, “&”, “~”, and “|”.

To include them as-is in a character set, the backslash-escape must be used. For example, “[\[\]\\]” matches a backslash or a square bracket.

14.4.3.3. Complementing sets (*)

Including “^” (the caret) after the opening square bracket denotes a set’s complement. Hence, “[^abc]” matches any code point except “a”, “b”, and “c”. Here is an example where we seek any substring that consists of four non-spaces:

x = "Nobody expects the Spanish Inquisition!"
re.findall("[^ ][^ ][^ ][^ ]", x)
## ['Nobo', 'expe', 'Span', 'Inqu', 'isit', 'ion!']

14.4.3.4. Defining code point ranges (*)

Each Unicode character can be referenced by its unique numeric code. For instance, “a” is assigned code U+0061 and “z” is mapped to U+007A. In the pre-Unicode era (mostly with regard to the ASCII codes, ≤ U+007F, representing English letters, decimal digits, as well as some punctuation and control characters), we were used to relying on specific code ranges. For example, “[a-z]” denotes the set comprised of all characters with codes between U+0061 and U+007A, i.e., lowercase letters of the English (Latin) alphabet.

re.findall("[0-9A-Za-z]", "Gągolewski")
## ['G', 'g', 'o', 'l', 'e', 'w', 's', 'k', 'i']

This pattern denotes the union of three code ranges: ASCII upper- and lowercase letters and digits. Nowadays, in the processing of text in natural languages, this notation should be avoided. Note the missing “ą” (Polish “a” with ogonek) in the result.

14.4.3.5. Using predefined character sets (*)

Consider a string:

x = "aąbßÆAĄB你12𝟛٤,.;'! \t-+=\n[]©←→”„"

Some glyphs are not available in the PDF version of this book because we did not install the required fonts, e.g., the Arabic digit 4 or left and right arrows. However, they are well-defined at the program level.

Noteworthy Unicode-aware code point classes include the word characters:

re.findall(r"\w", x)
## ['a', 'ą', 'b', 'ß', 'Æ', 'A', 'Ą', 'B', '你', '1', '2', '𝟛', '٤']

decimal digits:

re.findall(r"\d", x)
## ['1', '2', '𝟛', '٤']

and whitespaces:

re.findall(r"\s", x)
## [' ', '\t', '\n']

Moreover, e.g., “\W” is equivalent to “[^\w]”, i.e., denotes the set’s complement.

14.4.4. Alternating and grouping subexpressions (*)

14.4.4.1. Alternation operator (*)

The alternation operator, “|” (the pipe or bar), matches either its left or its right branch. For instance:

x = "spam, egg, ham, jam, algae, and an amalgam of spam, all al dente"
re.findall("spam|ham", x)
## ['spam', 'ham', 'spam']

14.4.4.2. Grouping subexpressions (*)

The “|” operator has very low precedence (otherwise, we would match "spamam" or "spaham" above instead). If we want to introduce an alternative of subexpressions, we need to group them using the “(?:...)” syntax. For instance, “(?:sp|h)am” matches either "spam" or "ham".

Notice that the bare use of the round brackets, “(...)” (i.e., without the “?:” part), has the side-effect of creating new capturing groups; see below for more details.

Also, matching is always done left-to-right, on the first-come, first-served (greedy) basis. Consequently, if the left branch is a subset of the right one, the latter will never be matched. In particular, “(?:al|alga|algae)” can only match "al". To fix this, we can write “(?:algae|alga|al)”.
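To wit, reusing the example string x defined above:

re.findall("(?:al|alga|algae)", x)
## ['al', 'al', 'al', 'al']
re.findall("(?:algae|alga|al)", x)
## ['algae', 'alga', 'al', 'al']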

14.4.4.3. Non-grouping parentheses (*)

Some parenthesised subexpressions – those in which the opening bracket is followed by the question mark – have a distinct meaning. In particular, “(?#...)” denotes a free-format comment that is ignored by the regex parser:

re.findall(
  "(?# match 'sp' or 'h')(?:sp|h)(?# and 'am')am|(?# or match 'egg')egg",
  x
)
## ['spam', 'egg', 'ham', 'spam']

This is just horrible. Luckily, constructing more sophisticated regexes by concatenating subfragments thereof is more readable:

re.findall(
       "(?:sp|h)" +   # match either 'sp' or 'h'
       "am" +         # followed by 'am'
    "|" +        # ... or ...
       "egg",         # just match 'egg'
    x
)
## ['spam', 'egg', 'ham', 'spam']

What is more, e.g., “(?i)” enables the case-insensitive mode.

re.findall("(?i)spam", "Spam spam SPAMITY spAm")
## ['Spam', 'spam', 'SPAM', 'spAm']

14.4.5. Quantifiers (*)

More often than not, a variable number of instances of the same subexpression needs to be captured. Sometimes we want to make its presence optional. These can be achieved by means of the following quantifiers:

  • “?” matches 0 or 1 time;

  • “*” matches 0 or more times;

  • “+” matches 1 or more times;

  • “{n,m}” matches between n and m times;

  • “{n,}” matches at least n times;

  • “{n}” matches exactly n times.

These operators are applied onto the directly preceding atoms. For example, “ni+” captures "ni", "nii", "niii", etc., but neither "n" alone nor "ninini" altogether.
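For instance:

re.findall("ni+", "n ni niii ninini")
## ['ni', 'niii', 'ni', 'ni', 'ni']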

By default, the quantifiers are greedy: they match the repeated subexpression as many times as possible. The “?” suffix (forming quantifiers such as “??”, “*?”, “+?”, and so forth) makes them lazy: they match as few occurrences as possible while still securing an overall match.

Greedy:

x = "sp(AM)(maps)(SP)am"
re.findall(r"\(.+\)", x)
## ['(AM)(maps)(SP)']

Lazy:

re.findall(r"\(.+?\)", x)
## ['(AM)', '(maps)', '(SP)']

Greedy (but clever):

re.findall(r"\([^)]+\)", x)
## ['(AM)', '(maps)', '(SP)']

The first regex is greedy: it matches an opening bracket, then as many characters as possible (including “)”) that are followed by a closing bracket. The two other patterns terminate as soon as the first closing bracket is found.

More examples:

x = "spamamamnomnomnomammmmmmmmm"
re.findall("sp(?:am|nom)+", x)
## ['spamamamnomnomnomam']
re.findall("sp(?:am|nom)+?", x)
## ['spam']

And:

re.findall("sp(?:am|nom)+?m*", x)
## ['spam']
re.findall("sp(?:am|nom)+?m+", x)
## ['spamamamnomnomnomammmmmmmmm']

Let’s stress that the quantifier is applied to the subexpression that stands directly before it. Grouping parentheses can be used in case they are needed.

x = "12, 34.5, 678.901234, 37...629, ..."
re.findall(r"\d+\.\d+", x)
## ['34.5', '678.901234']

matches digits, a dot, and another series of digits.

re.findall(r"\d+(?:\.\d+)?", x)
## ['12', '34.5', '678.901234', '37', '629']

finds digits which are possibly (but not necessarily) followed by a dot and a digit sequence.

Exercise 14.5

Write a regex that extracts all #hashtags from a string #omg #SoEasy.

14.4.6. Capture groups and references thereto (**)

Round-bracketed subexpressions (without the “?:” prefix) form the so-called capture groups that can be extracted separately or be referred to in other parts of the same regex.

14.4.6.1. Extracting capture group matches (**)

The preceding statement can be nicely verified by calling re.findall:

x = "name='Sir Launcelot', quest='Seek Grail', favcolour='blue'"
re.findall(r"(\w+)='(.+?)'", x)
## [('name', 'Sir Launcelot'), ('quest', 'Seek Grail'), ('favcolour', 'blue')]

It returned the matches to the individual capture groups, not the whole matching substrings.

re.search and re.finditer can pinpoint each component:

r = re.search(r"(\w+)='(.+?)'", x)
print("whole (0):", (r.start(), r.end(), r.group()))
print("       1 :", (r.start(1), r.end(1), r.group(1)))
print("       2 :", (r.start(2), r.end(2), r.group(2)))
## whole (0): (0, 20, "name='Sir Launcelot'")
##        1 : (0, 4, 'name')
##        2 : (6, 19, 'Sir Launcelot')

Here is its vectorised version in pandas, returning the first match:

y = pd.Series([
    "name='Sir Launcelot'",
    "quest='Seek Grail'",
    "favcolour='blue', favcolour='yel.. Aaargh!'"
])
y.str.extract(r"(\w+)='(.+?)'")
##            0              1
## 0       name  Sir Launcelot
## 1      quest     Seek Grail
## 2  favcolour           blue

We see that the findings are conveniently presented in the data frame form. The first column gives the matches to the first capture group. All matches can be extracted too:

y.str.extractall(r"(\w+)='(.+?)'")
##                  0              1
##   match                          
## 0 0           name  Sir Launcelot
## 1 0          quest     Seek Grail
## 2 0      favcolour           blue
##   1      favcolour  yel.. Aaargh!

Recall that if we just need the grouping part of “(...)”, i.e., without the capturing feature, “(?:...)” can be applied.

Also, named capture groups defined like “(?P<name>...)” are supported.

y.str.extract(r"(?:\w+)='(?P<value>.+?)'")
##            value
## 0  Sir Launcelot
## 1     Seek Grail
## 2           blue

14.4.6.2. Replacing with capture group matches (**)

When using re.sub and pandas.Series.str.replace, matches to particular capture groups can be recalled in replacement strings. The match in its entirety is denoted by “\g<0>”, then “\g<1>” stores whatever was caught by the first capture group, and “\g<2>” is the match to the second capture group, etc.

re.sub(r"(\w+)='(.+?)'", r"\g<2> is a \g<1>", x)
## 'Sir Launcelot is a name, Seek Grail is a quest, blue is a favcolour'

Named capture groups can be referred to too:

re.sub(r"(?P<key>\w+)='(?P<value>.+?)'",
  r"\g<value> is a \g<key>", x)
## 'Sir Launcelot is a name, Seek Grail is a quest, blue is a favcolour'

14.4.6.3. Back-referencing (**)

Matches to capture groups can also be part of the regexes themselves. In such a context, e.g., “\1” denotes whatever has been consumed by the first capture group.

In general, parsing HTML code with regexes is not recommended, unless it is well-structured (which might be the case if it is generated programmatically; but we can always use the lxml package). Despite this, let’s consider the following examples:

x = "<p><em>spam</em></p><code>eggs</code>"
re.findall(r"<[a-z]+>.*?</[a-z]+>", x)
## ['<p><em>spam</em>', '<code>eggs</code>']

It did not match the correct closing HTML tag. But we can make this happen by writing:

re.findall(r"(<([a-z]+)>.*?</\2>)", x)
## [('<p><em>spam</em></p>', 'p'), ('<code>eggs</code>', 'code')]

This regex guarantees that the match will include all characters between the opening "<tag>" and the corresponding (not: any) closing "</tag>".

Named capture groups can be referenced using the “(?P=name)” syntax:

re.findall(r"(<(?P<tagname>[a-z]+)>.*?</(?P=tagname)>)", x)
## [('<p><em>spam</em></p>', 'p'), ('<code>eggs</code>', 'code')]

Note that the angle brackets are part of the defining syntax, “(?P<name>...)”, but not of the back-reference, “(?P=name)”.

14.4.7. Anchoring (*)

Lastly, let’s mention the ways to match a pattern at a given abstract position within a string.

14.4.7.1. Matching at the beginning or end of a string (*)

“^” and “$” match, respectively, start and end of the string (or each line within a string, if the re.MULTILINE flag is set).

x = pd.Series(["spam egg", "bacon spam", "spam", "egg spam bacon", "milk"])
rs = ["spam", "^spam", "spam$", "spam$|^spam", "^spam$"]  # regexes to test

The five regular expressions match "spam", respectively, anywhere within the string, at the beginning, at the end, at the beginning or end, and in strings that are equal to the pattern itself. We can check this by calling:

pd.concat([x.str.contains(r) for r in rs], axis=1, keys=rs)
##     spam  ^spam  spam$  spam$|^spam  ^spam$
## 0   True   True  False         True   False
## 1   True  False   True         True   False
## 2   True   True   True         True    True
## 3   True  False  False        False   False
## 4  False  False  False        False   False
Exercise 14.6

Compose a regex that does the same job as str.strip.

14.4.7.2. Matching at word boundaries (*)

What is more, “\b” matches at a “word boundary”, e.g., near spaces, punctuation marks, or at the start/end of a string (i.e., wherever there is a transition between a word, “\w”, and a non-word character, “\W”, or vice versa).

In the following example, we match all stand-alone numbers (this regular expression is imperfect, though):

re.findall(r"[-+]?\b\d+(?:\.\d+)?\b", "+12, 34.5, -5.3243")
## ['+12', '34.5', '-5.3243']

14.4.7.3. Looking behind and ahead (**)

There is a way to guarantee that a pattern occurrence begins or ends with a match to a subexpression: “(?<=...)...” denotes the look-behind, whereas “...(?=...)” designates a look-ahead.

x = "I like spam, spam, eggs, and spam."
re.findall(r"\b\w+\b(?=[,.])", x)
## ['spam', 'spam', 'eggs', 'spam']

This regex captured the words that are directly followed by a comma or a dot.
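A look-behind, in turn, requires the preceding context to match a given subexpression (which, in the re module, must be of fixed length). For example, let’s extract the words directly preceded by "and ":

re.findall(r"(?<=and )\w+", x)
## ['spam']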

Moreover, “(?<!...)...” and “...(?!...)” are their negated versions (negative look-behind/ahead).

re.findall(r"\b\w+\b(?![,.])", x)
## ['I', 'like', 'and']

This time, we matched the words that are followed by neither a comma nor a dot.

14.5. Exercises

Exercise 14.7

List some ways to normalise character strings.

Exercise 14.8

(**) What are the challenges of processing non-English text?

Exercise 14.9

What are the problems with the "[A-Za-z]" and "[A-z]" character sets?

Exercise 14.10

Name the two ways to turn on case-insensitive regex matching.

Exercise 14.11

What is a word boundary?

Exercise 14.12

What is the difference between the "^" and "$" anchors?

Exercise 14.13

When would we prefer using "[0-9]" instead of "\d"?

Exercise 14.14

What is the difference between the "?", "??", "*", "*?", "+", and "+?" quantifiers?

Exercise 14.15

Does "." match all the characters?

Exercise 14.16

What are named capture groups and how can we refer to the matches thereto in re.sub?

Exercise 14.17

Write a regex that extracts all standalone numbers accepted by Python, including 12.123, -53, +1e-9, -1.2423e10, 4., and .2.

Exercise 14.18

Author a regex that matches all email addresses.

Exercise 14.19

Indite a regex that matches all URLs starting with http:// or https://.

Exercise 14.20

Cleanse the warsaw_weather dataset so that it contains analysable numeric data.