14. Text data¶
In [35], it is noted that effective processing of character strings is needed at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation; compare, e.g., [93] and [20]. Pattern searching, string collation and sorting, normalisation, transliteration, and formatting are ubiquitous in text mining, natural language processing, and bioinformatics. Means for the handling of string data should be included in each statistician’s or data scientist’s repertoire to complement their numerical computing and data wrangling skills.
In this chapter, we discuss the handiest string operations in base Python, together with their vectorised versions in numpy and pandas. We also mention some more advanced features of the Unicode ICU library.
14.1. Basic string operations¶
Recall from Section 2.1.3 that the str class represents individual character strings:
x = "spam"
type(x)
## <class 'str'>
There are a few binary operators overloaded for strings, e.g., `+` stands for string concatenation:
x + " and eggs"
## 'spam and eggs'
`*` duplicates a given string:
x * 3
## 'spamspamspam'
Chapter 3 noted that str is a sequential type. As a consequence, we can extract individual code points and create substrings using the index operator:
x[-1] # last letter
## 'm'
Strings are immutable, but parts thereof can always be reused in conjunction with the concatenation operator:
x[:2] + "ecial"
## 'special'
14.1.1. Unicode as the universal encoding¶
It is worth knowing that all strings in Python (from version 3.0) use Unicode[1], which is a universal encoding capable of representing c. 150 000 characters covering letters and numbers in contemporary and historic alphabets/scripts, mathematical, political, phonetic, and other symbols, emojis, etc.
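For instance, we can refer to any code point by its numeric code or even its name:
ord("ß"), chr(223), "\N{LATIN SMALL LETTER SHARP S}"
## (223, 'ß', 'ß')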
Note
Despite the wide support for Unicode, sometimes our own or other readers’ display (e.g., web browsers when viewing an HTML version of the output report) might not be able to render all code points properly, e.g., due to missing fonts. Still, we can rest assured that they are processed correctly if string functions are applied thereon.
14.1.2. Normalising strings¶
Dirty text data are a pain, especially if similar (semantically) tokens are encoded in many different ways. For the sake of string matching, we might want, e.g., the German "groß", "GROSS", and " gross " to compare all equal.
str.strip removes whitespaces (spaces, tabs, newline characters) at both ends of strings (see also str.lstrip and str.rstrip for their nonsymmetric versions).
str.lower and str.upper change letter case. For caseless comparison/matching, str.casefold might be a slightly better option as it unfolds many more code point sequences:
"Groß".lower(), "Groß".upper(), "Groß".casefold()
## ('groß', 'GROSS', 'gross')
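Combining the above, the three variants of gross mentioned at the beginning of this section can be brought to a single form:
" gross ".strip().casefold() == "groß".casefold() == "GROSS".casefold()
## True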
Note
(*) More advanced string transliteration can be performed by means of the ICU (International Components for Unicode) library. Its Python bindings are provided by the PyICU package. Unfortunately, the package is not easily available on W****ws.
For instance, converting all code points to ASCII (English) might be necessary when identifiers are expected to miss some diacritics that would normally be included (as in "Gągolewski" vs "Gagolewski"):
import icu # PyICU package
(icu.Transliterator
    .createInstance("Lower; Any-Latin; Latin-ASCII")
    .transliterate(
        "Χαίρετε! Groß gżegżółka — © La Niña – köszönöm – Gągolewski"
    )
)
## 'chairete! gross gzegzolka - (C) la nina - koszonom - gagolewski'
Converting between different Unicode Normalisation Forms (also available in the unicodedata package and via pandas.Series.str.normalize) might be used for the removal of some formatting nuances:
icu.Transliterator.createInstance("NFKD; NFC").transliterate("¼ąr²")
## '1⁄4ąr2'
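A similar effect can be obtained with the standard unicodedata module:
import unicodedata
unicodedata.normalize("NFKC", "¼ąr²")
## '1⁄4ąr2'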
14.1.3. Substring searching and replacing¶
Determining if a string has a particular fixed substring can be done in several ways.
For instance, the in operator verifies whether a particular substring occurs at least once:
food = "bacon, spam, spam, srapatapam, eggs, and spam"
"spam" in food
## True
The str.count method determines the number of occurrences of a substring:
food.count("spam")
## 3
To locate the first pattern appearance, we call str.index:
food.index("spam")
## 7
str.replace substitutes matching substrings with new content:
food.replace("spam", "veggies")
## 'bacon, veggies, veggies, srapatapam, eggs, and veggies'
Read the manual of the following methods: str.startswith, str.endswith, str.find, str.rfind, str.rindex, str.removeprefix, and str.removesuffix.
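For example, here is a taste of two of them:
food.startswith("bacon"), food.removesuffix(" and spam")
## (True, 'bacon, spam, spam, srapatapam, eggs,')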
The splitting of long strings at specific fixed delimiters can be done via:
food.split(", ")
## ['bacon', 'spam', 'spam', 'srapatapam', 'eggs', 'and spam']
See also str.partition. The str.join method implements the inverse operation:
", ".join(["spam", "bacon", "eggs", "spam"])
## 'spam, bacon, eggs, spam'
Moreover, Section 14.4 will discuss pattern matching with regular expressions. They can be useful in, amongst others, extracting more abstract data chunks (numbers, URLs, email addresses, IDs) from strings.
14.1.4. Locale-aware services in ICU (*)¶
Recall that relational operators such as `<` and `>=` perform lexicographic comparison of strings (like in a dictionary or an encyclopedia):
"spam" > "egg"
## True
We have: "a" < "aa" < "aaaaaaaaaaaaa" < "ab" < "aba" < "abb" < "b" < "ba" < "baaaaaaa" < "bb". Note, however, that "Spanish Inquisition" < "bb": uppercase Latin letters have smaller numeric codes than their lowercase counterparts (more on this below).
The lexicographic ordering (character-by-character, from left to right) is not necessarily appropriate for strings with numerals:
"a9" < "a123" # 1 is smaller than 9
## False
Additionally, it only takes into account the numeric codes (see Section 14.4.3.4) corresponding to each Unicode character. Consequently, it does not work well with non-English alphabets:
"MIELONECZKĄ" < "MIELONECZKI"
## False
In Polish, A with ogonek (Ą) is expected to sort after A and before B, let alone I. However, their corresponding numeric codes in the Unicode table are: 260 (Ą), 65 (A), 66 (B), and 73 (I). The resulting ordering is thus incorrect, as far as natural language processing is concerned.
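We can inspect the code points ourselves:
[ord(c) for c in "ĄABI"]
## [260, 65, 66, 73]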
It is best to perform string collation using the services provided by ICU. Here is an example of German phone book-like collation, where "ö" is treated the same as "oe":
c = icu.Collator.createInstance(icu.Locale("de_DE@collation=phonebook"))
c.setStrength(0) # ignore case and some diacritics
c.compare("Löwe", "loewe")
## 0
A result of 0 means that the strings are deemed equal.
In some languages, contractions occur. For example, in Slovak and Czech, the two code points "ch" are treated as a single entity and are sorted after "h":
icu.Collator.createInstance(icu.Locale("sk_SK")).compare("chladný", "hladný")
## 1
This means that we have "chladný" > "hladný" (the first argument is greater than the second one). Compare the above to something similar in Polish:
icu.Collator.createInstance(icu.Locale("pl_PL")).compare("chłodny", "hardy")
## -1
That is, "chłodny" < "hardy" (the first argument is less than the second one).
Also, with ICU, numeric collation is possible:
c = icu.Collator.createInstance()
c.setAttribute(
    icu.UCollAttribute.NUMERIC_COLLATION,
    icu.UCollAttributeValue.ON
)
c.compare("a9", "a123")
## -1
This is the correct result: "a9" is less than "a123" (compare the above to the example where we used the ordinary `<`).
14.1.5. String operations in pandas¶
String sequences in pandas.Series are stored, by default, using the broadest possible object data type:
pd.Series(["spam", "bacon", "spam"])
## 0 spam
## 1 bacon
## 2 spam
## dtype: object
This allows for the encoding of missing values by means of the None object (which is of the type NoneType, not str); compare Section 15.1.
Vectorised versions of base string operations are available via the pandas.Series.str accessor. We thus have pandas.Series.str.strip, pandas.Series.str.split, pandas.Series.str.find, and so forth. For instance:
x = pd.Series(["spam", "bacon", None, "buckwheat", "spam"])
x.str.upper()
## 0 SPAM
## 1 BACON
## 2 None
## 3 BUCKWHEAT
## 4 SPAM
## dtype: object
But there is more. For example, a function to compute the length of each string:
x.str.len()
## 0 4.0
## 1 5.0
## 2 NaN
## 3 9.0
## 4 4.0
## dtype: float64
Vectorised concatenation of strings can be performed using the overloaded `+` operator:
x + " and spam"
## 0 spam and spam
## 1 bacon and spam
## 2 NaN
## 3 buckwheat and spam
## 4 spam and spam
## dtype: object
To concatenate all items into a single string, we call:
x.str.cat(sep="; ")
## 'spam; bacon; buckwheat; spam'
Conversion to numeric:
pd.Series(["1.3", "-7", None, "3523"]).astype(float)
## 0 1.3
## 1 -7.0
## 2 NaN
## 3 3523.0
## dtype: float64
Select substrings:
x.str.slice(2, -1) # like x.iloc[i][2:-1] for all i
## 0 a
## 1 co
## 2 None
## 3 ckwhea
## 4 a
## dtype: object
Replace substrings:
x.str.slice_replace(0, 2, "tofu") # like "tofu" + x.iloc[i][2:] for all i
## 0 tofuam
## 1 tofucon
## 2 None
## 3 tofuckwheat
## 4 tofuam
## dtype: object
Exercise

Consider the nasaweather_glaciers data frame. All glaciers are assigned 11/12-character unique identifiers as defined by the WGMS convention, which forms the glacier ID by combining the following five elements:

- 2-character political unit (the first two letters of the ID),
- 1-digit continent code (the third character),
- 4-character drainage code (the next four),
- 2-digit free position code (the next two),
- 2- or 3-digit local glacier code (the remaining ones).

Extract the five chunks and store them as independent columns in the data frame.
14.1.6. String operations in numpy (*)¶
There is a huge overlap between the numpy and pandas capabilities for string handling, with the latter being more powerful. After all, numpy is a workhorse for numerical computing. Still, some readers might find what follows useful.
As mentioned in our introduction to numpy vectors, objects of the type ndarray can store not only numeric and logical data, but also character strings. For example:
x = np.array(["spam", "bacon", "egg"])
x
## array(['spam', 'bacon', 'egg'], dtype='<U5')
Here, the data type “<U5” means that we deal with Unicode strings of length no greater than five. Unfortunately, replacing an element with content that is too long results in a truncated string:
x[2] = "buckwheat"
x
## array(['spam', 'bacon', 'buckw'], dtype='<U5')
To remedy this, we first need to recast the vector manually:
x = x.astype("<U10")
x[2] = "buckwheat"
x
## array(['spam', 'bacon', 'buckwheat'], dtype='<U10')
Conversion from/to numeric is also possible:
np.array(["1.3", "-7", "3523"]).astype(float)
## array([ 1.300e+00, -7.000e+00, 3.523e+03])
np.array([1, 3.14, -5153]).astype(str)
## array(['1.0', '3.14', '-5153.0'], dtype='<U32')
The numpy.char module includes several vectorised versions of string routines, most of which we have already discussed. For example:
x = np.array([
    "spam", "spam, bacon, and spam",
    "spam, eggs, bacon, spam, spam, and spam"
])
np.char.split(x, ", ")
## array([list(['spam']), list(['spam', 'bacon', 'and spam']),
## list(['spam', 'eggs', 'bacon', 'spam', 'spam', 'and spam'])],
## dtype=object)
np.char.count(x, "spam")
## array([1, 2, 4])
Vectorised operations that we would normally perform through the binary operators (i.e., `+`, `*`, `<`, etc.) are available through standalone functions:
np.char.add(["spam", "bacon"], " and spam")
## array(['spam and spam', 'bacon and spam'], dtype='<U14')
np.char.equal(["spam", "bacon", "spam"], "spam")
## array([ True, False, True])
The function that returns the length of each string is also noteworthy:
np.char.str_len(x)
## array([ 4, 21, 39])
14.2. Working with string lists¶
pandas nicely supports lists of strings of varying lengths. For instance:
x = pd.Series([
    "spam",
    "spam, bacon, spam",
    "potatoes",
    None,
    "spam, eggs, bacon, spam, spam"
])
xs = x.str.split(", ", regex=False)
xs
## 0 [spam]
## 1 [spam, bacon, spam]
## 2 [potatoes]
## 3 None
## 4 [spam, eggs, bacon, spam, spam]
## dtype: object
And now, e.g., looking at the last element:
xs.iloc[-1]
## ['spam', 'eggs', 'bacon', 'spam', 'spam']
reveals that it is indeed a list of strings.
There are a few vectorised operations that enable us to work with such variable-length lists, such as joining the strings within each list:
xs.str.join("; ")
## 0 spam
## 1 spam; bacon; spam
## 2 potatoes
## 3 None
## 4 spam; eggs; bacon; spam; spam
## dtype: object
selecting, say, the first string in each list:
xs.str.get(0)
## 0 spam
## 1 spam
## 2 potatoes
## 3 None
## 4 spam
## dtype: object
or slicing:
xs.str.slice(0, -1) # like xs.iloc[i][0:-1] for all i
## 0 []
## 1 [spam, bacon]
## 2 []
## 3 None
## 4 [spam, eggs, bacon, spam]
## dtype: object
Exercise

(*) Using pandas.merge, join the countries, world_factbook_2020, and ssi_2016_dimensions datasets based on the country names. Note that some manual data cleansing will be necessary beforehand.
Exercise

(**) Given a Series object xs that includes lists of strings, convert it to a 0/1 representation.

1. Determine the list of all unique strings; let's call it xu.
2. Create a data frame x with xs.shape[0] rows and len(xu) columns such that x.iloc[i, j] is equal to 1 if xu[j] is amongst xs.loc[i], and equal to 0 otherwise. Set the column names to xu.
3. Given x (and only x: neither xs nor xu), perform the inverse operation.

For example, for the above xs object, x should look like:
## bacon eggs potatoes spam
## 0 0 0 0 1
## 1 1 0 0 1
## 2 0 0 1 0
## 3 0 0 0 0
## 4 1 1 0 1
14.3. Formatted outputs for reproducible report generation¶
Some good development practices related to reproducible report generation are discussed in [84, 102, 103]. Note that the paradigm of literate programming was introduced by D. Knuth in [57].
Reports from data analysis can be prepared, e.g., in Jupyter Notebooks or by writing directly to Markdown files which we can later compile to PDF or HTML. Below we briefly discuss how to output nicely formatted objects programmatically.
14.3.1. Formatting strings¶
Inclusion of textual representations of data stored in existing objects can easily be done using f-strings (formatted string literals; see Section 2.1.3.1) of the type f"...{expression}...". For instance:
pi = 3.14159265358979323846
f"π = {pi:.2f}"
## 'π = 3.14'
creates a string showing the value of the variable pi formatted as a float rounded to two places after the decimal separator.
Note
(**) Similar functionality can be achieved using the str.format method:
"π = {:.2f}".format(pi)
## 'π = 3.14'
as well as the `%` operator overloaded for strings, which uses sprintf-like value placeholders known to some readers from other programming languages (such as C):
"π = %.2f" % pi
## 'π = 3.14'
14.3.2. str and repr¶
The str and repr functions can create string representations of many objects:
x = np.array([1, 2, 3])
str(x)
## '[1 2 3]'
repr(x)
## 'array([1, 2, 3])'
The former is more human-readable, and the latter is slightly more technical. Note that repr often returns an output that can be interpreted as executable Python code with no or few adjustments. Nonetheless, pandas objects are amongst the many exceptions to this rule.
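For many built-in objects, the repr output can indeed be evaluated back:
y = [1, "spam", 3.14]
eval(repr(y)) == y # never call eval on untrusted input
## True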
14.3.3. Aligning strings¶
str.center, str.ljust, and str.rjust can be used to centre-, left-, or right-align a string so that it is of at least a given width, which might make its display more aesthetic. Very long strings, possibly containing whole text paragraphs, can be dealt with using the wrap and shorten functions from the textwrap package.
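For example:
"spam".center(12, "*")
## '****spam****'
import textwrap
textwrap.shorten("spam bacon eggs " * 10, 20)
## 'spam bacon [...]'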
14.3.4. Direct Markdown output in Jupyter¶
Further, with IPython/Jupyter, we can output strings that will be directly interpreted as Markdown-formatted:
import IPython.display
x = 2+2
out = f"*Result*: $2^2=2\\cdot 2={x}$." # LaTeX math
IPython.display.Markdown(out)
Result: 2² = 2·2 = 4.
Recall from Section 1.2.5 that Markdown is a very flexible markup[2] language that allows us to define itemised and numbered lists, mathematical formulae, tables, images, etc.
On a side note, data frames can be nicely prepared for display in a report using pandas.DataFrame.to_markdown.
14.3.5. Manual Markdown file output (*)¶
We can also generate Markdown code programmatically
in the form of standalone .md
files:
import tempfile, os.path
filename = os.path.join(tempfile.mkdtemp(), "test-report.md")
f = open(filename, "w") # open for writing (overwrite if exists)
f.write("**Yummy Foods** include, but are not limited to:\n\n")
x = ["spam", "bacon", "eggs", "spam"]
for e in x:
    f.write(f"* {e}\n")
f.write("\nAnd now for something *completely* different:\n\n")
f.write("Rank | Food\n")
f.write("-----|-----\n")
for i in range(len(x)):
    f.write(f"{i+1:4} | {x[i][::-1]:10}\n")
f.close()
Here is the resulting raw Markdown source file:
with open(filename, "r") as f: # will call f.close() automatically
out = f.read()
print(out)
## **Yummy Foods** include, but are not limited to:
##
## * spam
## * bacon
## * eggs
## * spam
##
## And now for something *completely* different:
##
## Rank | Food
## -----|-----
##    1 | maps
##    2 | nocab
##    3 | sgge
##    4 | maps
We can convert it to other formats, including HTML, PDF, EPUB, ODT, and even presentations by running[3] the pandoc tool. We may also embed it directly inside an IPython/Jupyter notebook:
IPython.display.Markdown(out)
Yummy Foods include, but are not limited to:

- spam
- bacon
- eggs
- spam

And now for something completely different:

Rank | Food
-----|------
1 | maps
2 | nocab
3 | sgge
4 | maps
Note
Figures created in matplotlib can be exported to PNG, SVG, or PDF files using the matplotlib.pyplot.savefig function. We can include them manually in a Markdown document using the ![description](filename) syntax.
Note
(*) IPython/Jupyter Notebooks can be converted to different formats using the jupyter-nbconvert command line tool. jupytext can create notebooks from ordinary text files. Literate programming with mixed R and Python is possible with the R packages knitr and reticulate. See [75] for an overview of many more options.
14.4. Regular expressions (*)¶
This section contains large excerpts from yours truly’s other work [35].
Regular expressions (regexes) provide concise grammar for defining systematic patterns which can be sought in character strings. Examples of such patterns include: specific fixed substrings, emojis of any kind, standalone sequences of lower-case Latin letters (“words”), substrings that can be interpreted as real numbers (with or without fractional parts, also in scientific notation), telephone numbers, email addresses, or URLs.
Theoretically, the concept of regular pattern matching dates back to the so-called regular languages and finite state automata [56]; see also [78] and [51]. Regexes, in the form we know them today, were already present in one of the pre-UNIX implementations of the command-line text editor qed [79] (the predecessor of the well-known sed).
14.4.1. Regex matching with re (*)¶
In Python, the re module implements a regular expression matching engine. It accepts patterns that follow similar syntax to the one available in the Perl language.
As a matter of fact, most programming languages and text editors (including Kate, Eclipse, and VSCodium) support finding and replacing patterns with regexes. This is why they should be amongst the instruments at every data scientist’s disposal.
Before we proceed with a detailed discussion on how to read and write regular expressions, let's first review some of the methods for identifying the matching substrings. Below we use the r"\bni+\b" regex as an example. It catches "n" followed by at least one "i" that begins and ends at a word boundary. In other words, we seek "ni", "nii", "niii", etc., which may be considered standalone words.
In particular, re.findall extracts all non-overlapping matches to a given regex:
import re
x = "We're the knights who say ni! niiiii! ni! niiiiiiiii!"
re.findall(r"\bni+\b", x)
## ['ni', 'niiiii', 'ni', 'niiiiiiiii']
The order of arguments is (look for what, where), not vice versa.
Important
We used the r"..." prefix to input a string so that “\b” is not treated as an escape sequence denoting the backspace character. Otherwise, the foregoing would have to be written as “\\bni+\\b”.
If we had not insisted on matching at the word boundaries (i.e., if we had used the simple "ni+" regex instead), we would also match the "ni" in "knights".
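We can see the difference by comparing the strings' lengths:
len("\b"), len(r"\b")
## (1, 2)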
The re.search function returns an object of the class re.Match that enables us to get some more information about the first match:
r = re.search(r"\bni+\b", x)
r.start(), r.end(), r.group()
## (26, 28, 'ni')
It includes the start and the end position (index) as well as the match itself. If the regex contains capture groups (more details follow), we can also pinpoint the matches thereto.
Moreover, re.finditer returns an iterable object that includes the same details, but now about all the matches:
rs = re.finditer(r"\bni+\b", x)
for r in rs:
    print((r.start(), r.end(), r.group()))
## (26, 28, 'ni')
## (30, 36, 'niiiii')
## (38, 40, 'ni')
## (42, 52, 'niiiiiiiii')
re.split divides a string into chunks separated by matches to a given regex:
re.split(r"!\s+", x)
## ["We're the knights who say ni", 'niiiii', 'ni', 'niiiiiiiii!']
The “!\s+” regex matches an exclamation mark followed by one or more whitespace characters.
Using re.sub, each match can be replaced with a given string:
re.sub(r"\bni+\b", "nu", x)
## "We're the knights who say nu! nu! nu! nu!"
Note
(**) More flexible replacement strings can be generated by passing a custom function as the second argument:
re.sub(r"\bni+\b", lambda m: "n" + "u"*(m.end()-m.start()-1), x)
## "We're the knights who say nu! nuuuuu! nu! nuuuuuuuuu!"
14.4.2. Regex matching with pandas (*)¶
The pandas.Series.str accessor also defines a number of vectorised functions that utilise the re module's matcher.
An example Series object:
x = pd.Series(["ni!", "niiii, ni, nii!", None, "spam, bacon", "nii, ni!"])
x
## 0 ni!
## 1 niiii, ni, nii!
## 2 None
## 3 spam, bacon
## 4 nii, ni!
## dtype: object
Here are the most notable functions:
x.str.contains(r"\bni+\b")
## 0 True
## 1 True
## 2 None
## 3 False
## 4 True
## dtype: object
x.str.count(r"\bni+\b")
## 0 1.0
## 1 3.0
## 2 NaN
## 3 0.0
## 4 2.0
## dtype: float64
x.str.replace(r"\bni+\b", "nu", regex=True)
## 0 nu!
## 1 nu, nu, nu!
## 2 None
## 3 spam, bacon
## 4 nu, nu!
## dtype: object
x.str.findall(r"\bni+\b")
## 0 [ni]
## 1 [niiii, ni, nii]
## 2 None
## 3 []
## 4 [nii, ni]
## dtype: object
x.str.split(r",\s+") # a comma, one or more whitespaces
## 0 [ni!]
## 1 [niiii, ni, nii!]
## 2 None
## 3 [spam, bacon]
## 4 [nii, ni!]
## dtype: object
In the last two cases, we get lists of strings as results.
Also, later we will mention pandas.Series.str.extract and pandas.Series.str.extractall which work with regexes that include capture groups.
Note
(*) If we intend to seek matches to the same pattern in many different strings without the use of pandas, it might be faster to precompile the regex first, and then use the re.Pattern.findall method instead of re.findall:
p = re.compile(r"\bni+\b") # returns an object of the class `re.Pattern`
p.findall("We're the Spanish Inquisition ni! ni! niiiii! nininiiiiiiiii!")
## ['ni', 'ni', 'niiiii']
14.4.3. Matching individual characters (*)¶
In the coming subsections, we review the most essential elements of the regex syntax as we did in [35]. One general introduction to regexes is [31]. The re module flavour is summarised in the official manual; see also [59].
We begin by discussing different ways to define character sets. In this part, determining the length of all matching substrings will be straightforward.
Important
The following characters have special meaning to the regex engine: “.”, “\”, “|”, “(”, “)”, “[”, “]”, “{”, “}”, “^”, “$”, “*”, “+”, and “?”.
Any regular expression that contains none of the preceding characters behaves like a fixed pattern:
re.findall("spam", "spam, eggs, spam, bacon, sausage, and spam")
## ['spam', 'spam', 'spam']
There are three occurrences of a pattern that is comprised of four code points: “s” followed by “p”, then by “a”, and ending with “m”.
If we want to include a special character as part of a regular expression so that it is treated literally, we will need to escape it with a backslash, “\”:
re.findall(r"\.", "spam...")
## ['.', '.', '.']
14.4.3.1. Matching anything (almost) (*)¶
The (unescaped) dot, “.”, matches any code point except the newline.
x = "Spam, ham,\njam, SPAM, eggs, and spam"
re.findall("..am", x, re.IGNORECASE)
## ['Spam', ' ham', 'SPAM', 'spam']
It extracted non-overlapping substrings of length four that end with “am”, case-insensitively.
The dot's insensitivity to the newline character is motivated by the need to maintain compatibility with tools such as grep (when searching within text files in a line-by-line manner). This behaviour can be altered by setting the DOTALL flag.
re.findall("..am", x, re.DOTALL|re.IGNORECASE) # `|` is the bitwise OR
## ['Spam', ' ham', '\njam', 'SPAM', 'spam']
14.4.3.2. Defining character sets (*)¶
Sets of characters can be introduced by enumerating their members within a pair of square brackets. For instance, “[abc]” denotes the set {a, b, c}; such a regular expression matches one (and only one) symbol from this set. Moreover, in:
re.findall("[hj]am", x)
## ['ham', 'jam']
the “[hj]am” regex matches “h” or “j”, followed by “a”, followed by “m”. In other words, "ham" and "jam" are the only two strings that are matched by this pattern (unless matching is done case-insensitively).
Important
The following characters, if used within square brackets, may be treated not literally: “\”, “[”, “]”, “^”, “-”, “&”, “~”, and “|”. To include them as-is in a character set, the backslash-escape must be used. For example, “[\[\]\\]” matches a backslash or a square bracket.
14.4.3.3. Complementing sets (*)¶
Including “^” (the caret) after the opening square bracket denotes a set's complement. Hence, “[^abc]” matches any code point except “a”, “b”, and “c”. Here is an example where we seek any substring that consists of four non-spaces:
x = "Nobody expects the Spanish Inquisition!"
re.findall("[^ ][^ ][^ ][^ ]", x)
## ['Nobo', 'expe', 'Span', 'Inqu', 'isit', 'ion!']
14.4.3.4. Defining code point ranges (*)¶
Each Unicode character can be referenced by its unique numeric code. For instance, “a” is assigned code U+0061 and “z” is mapped to U+007A. In the pre-Unicode era (mostly with regard to the ASCII codes, ≤ U+007F, representing English letters, decimal digits, as well as some punctuation and control characters), we used to rely on specific code ranges. For example, “[a-z]” denotes the set comprised of all characters with codes between U+0061 and U+007A, i.e., lowercase letters of the English (Latin) alphabet.
re.findall("[0-9A-Za-z]", "Gągolewski")
## ['G', 'g', 'o', 'l', 'e', 'w', 's', 'k', 'i']
This pattern denotes the union of three code ranges: ASCII upper- and lowercase letters and digits. Nowadays, in the processing of text in natural languages, this notation should be avoided. Note the missing “ą” (Polish “a” with ogonek) in the result.
14.4.3.5. Using predefined character sets (*)¶
Consider a string:
x = "aąbßÆAĄB你12𝟛٤,.;'! \t-+=\n[]©←→”„"
Some glyphs are not available in the PDF version of this book because we did not install the required fonts, e.g., the Arabic digit 4 or left and right arrows. However, they are well-defined at the program level.
Noteworthy Unicode-aware code point classes include the word characters:
re.findall(r"\w", x)
## ['a', 'ą', 'b', 'ß', 'Æ', 'A', 'Ą', 'B', '你', '1', '2', '𝟛', '٤']
decimal digits:
re.findall(r"\d", x)
## ['1', '2', '𝟛', '٤']
and whitespaces:
re.findall(r"\s", x)
## [' ', '\t', '\n']
Moreover, e.g., “\W” is equivalent to “[^\w]”, i.e., denotes the set's complement.
14.4.4. Alternating and grouping subexpressions (*)¶
14.4.4.1. Alternation operator (*)¶
The alternation operator, “|” (the pipe or bar), matches either its left or its right branch. For instance:
x = "spam, egg, ham, jam, algae, and an amalgam of spam, all al dente"
re.findall("spam|ham", x)
## ['spam', 'ham', 'spam']
14.4.4.2. Grouping subexpressions (*)¶
The “|” operator has very low precedence (otherwise, we would match "spamam" or "spaham" above instead). If we want to introduce an alternative of subexpressions, we need to group them using the “(?:...)” syntax. For instance, “(?:sp|h)am” matches either "spam" or "ham".
Notice that the bare use of the round brackets, “(...)” (i.e., without the “?:” part), has the side effect of creating new capture groups; see below for more details.
Also, matching is always done left-to-right, on a first-come, first-served (greedy) basis. Consequently, if the left branch is a subset of the right one, the latter will never be matched. In particular, “(?:al|alga|algae)” can only match "al". To fix this, we can write “(?:algae|alga|al)”.
14.4.4.3. Non-grouping parentheses (*)¶
Some parenthesised subexpressions (those in which the opening bracket is followed by the question mark) have a distinct meaning. In particular, “(?#...)” denotes a free-format comment that is ignored by the regex parser:
re.findall(
    "(?# match 'sp' or 'h')(?:sp|h)(?# and 'am')am|(?# or match 'egg')egg",
    x
)
## ['spam', 'egg', 'ham', 'spam']
This is just horrible. Luckily, constructing more sophisticated regexes by concatenating subfragments thereof is more readable:
re.findall(
    "(?:sp|h)" +  # match either 'sp' or 'h'
    "am" +        # followed by 'am'
    "|" +         # ... or ...
    "egg",        # just match 'egg'
    x
)
## ['spam', 'egg', 'ham', 'spam']
What is more, e.g., “(?i)” enables the case-insensitive mode:
re.findall("(?i)spam", "Spam spam SPAMITY spAm")
## ['Spam', 'spam', 'SPAM', 'spAm']
14.4.5. Quantifiers (*)¶
More often than not, a variable number of instances of the same subexpression needs to be captured. Sometimes we want to make its presence optional. These can be achieved by means of the following quantifiers:
- “?” matches 0 or 1 time;
- “*” matches 0 or more times;
- “+” matches 1 or more times;
- “{n,m}” matches between n and m times;
- “{n,}” matches at least n times;
- “{n}” matches exactly n times.
These operators are applied onto the directly preceding atoms. For example, “ni+” captures "ni", "nii", "niii", etc., but neither "n" alone nor "ninini" as a whole.
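For instance:
re.findall("ni+", "n ni niii ninini")
## ['ni', 'niii', 'ni', 'ni', 'ni']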
By default, the quantifiers are greedy: they match the repeated subexpression as many times as possible. The “?” suffix (forming quantifiers such as “??”, “*?”, “+?”, and so forth) tries with as few occurrences as possible (to obtain a match still).
Greedy:
x = "sp(AM)(maps)(SP)am"
re.findall(r"\(.+\)", x)
## ['(AM)(maps)(SP)']
Lazy:
re.findall(r"\(.+?\)", x)
## ['(AM)', '(maps)', '(SP)']
Greedy (but clever):
re.findall(r"\([^)]+\)", x)
## ['(AM)', '(maps)', '(SP)']
The first regex is greedy: it matches an opening bracket, then as many characters as possible (including “)”) that are followed by a closing bracket. The two other patterns terminate as soon as the first closing bracket is found.
More examples:
x = "spamamamnomnomnomammmmmmmmm"
re.findall("sp(?:am|nom)+", x)
## ['spamamamnomnomnomam']
re.findall("sp(?:am|nom)+?", x)
## ['spam']
And:
re.findall("sp(?:am|nom)+?m*", x)
## ['spam']
re.findall("sp(?:am|nom)+?m+", x)
## ['spamamamnomnomnomammmmmmmmm']
Let’s stress that the quantifier is applied to the subexpression that stands directly before it. Grouping parentheses can be used in case they are needed.
x = "12, 34.5, 678.901234, 37...629, ..."
re.findall(r"\d+\.\d+", x)
## ['34.5', '678.901234']
matches digits, a dot, and another series of digits.
re.findall(r"\d+(?:\.\d+)?", x)
## ['12', '34.5', '678.901234', '37', '629']
finds digits which are possibly (but not necessarily) followed by a dot and a digit sequence.
Exercise

Write a regex that extracts all #hashtags from a string #omg #SoEasy.
14.4.6. Capture groups and references thereto (**)¶
Round-bracketed subexpressions (without the “?:” prefix) form the so-called capture groups that can be extracted separately or be referred to in other parts of the same regex.
14.4.6.1. Extracting capture group matches (**)¶
The preceding statement can be nicely verified by calling re.findall:
x = "name='Sir Launcelot', quest='Seek Grail', favcolour='blue'"
re.findall(r"(\w+)='(.+?)'", x)
## [('name', 'Sir Launcelot'), ('quest', 'Seek Grail'), ('favcolour', 'blue')]
It returned the matches to the individual capture groups, not the whole matching substrings.
re.search and re.finditer can pinpoint each component:
r = re.search(r"(\w+)='(.+?)'", x)
print("whole (0):", (r.start(), r.end(), r.group()))
print(" 1 :", (r.start(1), r.end(1), r.group(1)))
print(" 2 :", (r.start(2), r.end(2), r.group(2)))
## whole (0): (0, 20, "name='Sir Launcelot'")
## 1 : (0, 4, 'name')
## 2 : (6, 19, 'Sir Launcelot')
Here is its vectorised version implemented in pandas, returning the first match:
y = pd.Series([
    "name='Sir Launcelot'",
    "quest='Seek Grail'",
    "favcolour='blue', favcolour='yel.. Aaargh!'"
])
y.str.extract(r"(\w+)='(.+?)'")
## 0 1
## 0 name Sir Launcelot
## 1 quest Seek Grail
## 2 favcolour blue
We see that the findings are conveniently presented in the data frame form. The first column gives the matches to the first capture group. All matches can be extracted too:
y.str.extractall(r"(\w+)='(.+?)'")
## 0 1
## match
## 0 0 name Sir Launcelot
## 1 0 quest Seek Grail
## 2 0 favcolour blue
## 1 favcolour yel.. Aaargh!
Recall that if we just need the grouping part of “(...)”, i.e., without the capturing feature, “(?:...)” can be applied. Also, named capture groups defined like “(?P<name>...)” are supported:
y.str.extract("(?:\\w+)='(?P<value>.+?)'")
## value
## 0 Sir Launcelot
## 1 Seek Grail
## 2 blue
14.4.6.2. Replacing with capture group matches (**)¶
When using re.sub and pandas.Series.str.replace, matches to particular capture groups can be recalled in replacement strings. The match in its entirety is denoted by “\g<0>”, “\g<1>” stores whatever was caught by the first capture group, “\g<2>” is the match to the second capture group, etc.
re.sub(r"(\w+)='(.+?)'", r"\g<2> is a \g<1>", x)
## 'Sir Launcelot is a name, Seek Grail is a quest, blue is a favcolour'
Named capture groups can be referred to too:
re.sub(r"(?P<key>\w+)='(?P<value>.+?)'",
r"\g<value> is a \g<key>", x)
## 'Sir Launcelot is a name, Seek Grail is a quest, blue is a favcolour'
14.4.6.3. Back-referencing (**)¶
Matches to capture groups can also be part of the regexes themselves. In such a context, e.g., “\1” denotes whatever has been consumed by the first capture group.
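For example, “(\w)\1” finds characters that occur twice in a row (recall that re.findall reports only the captured parts):
re.findall(r"(\w)\1", "bookkeeper")
## ['o', 'k', 'e']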
In general, parsing HTML code with regexes is not recommended, unless it is well-structured (which might be the case if it is generated programmatically; but we can always use the lxml package). Despite this, let’s consider the following examples:
x = "<p><em>spam</em></p><code>eggs</code>"
re.findall(r"<[a-z]+>.*?</[a-z]+>", x)
## ['<p><em>spam</em>', '<code>eggs</code>']
It did not match the correct closing HTML tag. But we can make this happen by writing:
re.findall(r"(<([a-z]+)>.*?</\2>)", x)
## [('<p><em>spam</em></p>', 'p'), ('<code>eggs</code>', 'code')]
This regex guarantees that the match will include all characters between the opening "<tag>" and the corresponding (not: any) closing "</tag>".
Named capture groups can be referenced using the “(?P=name)” syntax:
re.findall(r"(<(?P<tagname>[a-z]+)>.*?</(?P=tagname)>)", x)
## [('<p><em>spam</em></p>', 'p'), ('<code>eggs</code>', 'code')]
Note that the angle brackets are part of the group-defining syntax, “(?P<name>...)”, but not of the back-reference itself.
14.4.7. Anchoring (*)¶
Lastly, let’s mention the ways to match a pattern at a given abstract position within a string.
14.4.7.1. Matching at the beginning or end of a string (*)¶
“^” and “$” match, respectively, the start and end of the string (or of each line within a string, if the re.MULTILINE flag is set).
x = pd.Series(["spam egg", "bacon spam", "spam", "egg spam bacon", "milk"])
rs = ["spam", "^spam", "spam$", "spam$|^spam", "^spam$"] # regexes to test
The five regular expressions match "spam", respectively: anywhere within the string, at the beginning, at the end, at the beginning or end, and in strings that are equal to the pattern itself. We can check this by calling:
pd.concat([x.str.contains(r) for r in rs], axis=1, keys=rs)
## spam ^spam spam$ spam$|^spam ^spam$
## 0 True True False True False
## 1 True False True True False
## 2 True True True True True
## 3 True False False False False
## 4 False False False False False
Exercise

Compose a regex that does the same job as str.strip.
14.4.7.2. Matching at word boundaries (*)¶
What is more, “\b” matches at a “word boundary”, e.g., near spaces, punctuation marks, or at the start/end of a string (i.e., wherever there is a transition between a word character, “\w”, and a non-word character, “\W”, or vice versa).
In the following example, we match all stand-alone numbers (this regular expression is imperfect, though):
re.findall(r"[-+]?\b\d+(?:\.\d+)?\b", "+12, 34.5, -5.3243")
## ['+12', '34.5', '-5.3243']
14.4.7.3. Looking behind and ahead (**)¶
There is a way to guarantee that a pattern occurrence begins or ends with a match to a subexpression: “(?<=...)...” denotes a look-behind, whereas “...(?=...)” designates a look-ahead.
x = "I like spam, spam, eggs, and spam."
re.findall(r"\b\w+\b(?=[,.])", x)
## ['spam', 'spam', 'eggs', 'spam']
This regex captured words that end with a comma or a dot.
Moreover, “(?<!...)...” and “...(?!...)” are their negated versions (negative look-behind and look-ahead):
re.findall(r"\b\w+\b(?![,.])", x)
## ['I', 'like', 'and']
This time, we matched the words that end with neither a comma nor a dot.
14.5. Exercises¶
List some ways to normalise character strings.
(**) What are the challenges of processing non-English text?
What are the problems with the "[A-Za-z]" and "[A-z]" character sets?
Name the two ways to turn on case-insensitive regex matching.
What is a word boundary?
What is the difference between the "^" and "$" anchors?
When would we prefer using "[0-9]" instead of "\d"?
What is the difference between the "?", "??", "*", "*?", "+", and "+?" quantifiers?
Does "." match all the characters?
What are named capture groups and how can we refer to the matches thereto in re.sub?
Write a regex that extracts all standalone numbers accepted by Python, including 12.123, -53, +1e-9, -1.2423e10, 4. and .2.
Author a regex that matches all email addresses.
Indite a regex that matches all URLs starting with http:// or https://.
Cleanse the warsaw_weather dataset so that it contains analysable numeric data.