14. Text Data
In [Gag22] it is noted that effective processing of character strings is required at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation. Pattern searching, string collation and sorting, normalisation, transliteration, and formatting are ubiquitous in text mining, natural language processing, and bioinformatics. Means for the handling of string data should be included in each statistician’s or data scientist’s repertoire to complement their numerical computing and data wrangling skills.
Diverse data cleansing and preparation operations (compare, e.g., [vdLdJ18] and [DJ03]) need to be applied before an analyst can begin to enjoy an orderly and meaningful data frame, matrix, or spreadsheet being finally at their disposal. Activities related to information retrieval, computer vision, bioinformatics, natural language processing, or even musicology can also benefit from including them in data processing pipelines.
In this part we discuss the most basic string operations in base Python, together with their vectorised versions in numpy and pandas.
14.1. Basic String Operations
Recall that the str class represents individual character strings:
x = "spam"
type(x)
## <class 'str'>
There are a few binary operators overloaded for strings, e.g., `+` stands for string concatenation:
x + " and eggs"
## 'spam and eggs'
`*` duplicates a given string:
x * 3
## 'spamspamspam'
Further, str is a sequential type; therefore, we can extract individual code points and create substrings using the index operator:
x[-1] # last letter
## 'm'
Recall that strings are immutable. However, parts of strings can always be reused in conjunction with the concatenation operator:
x[:2] + "ecial"
## 'special'
14.1.1. Unicode as the Universal Encoding
It is worth knowing that all strings in Python (from version 3.0) use Unicode (https://www.unicode.org/charts/), which is a universal encoding capable of representing ca. 150,000 characters covering letters and numbers in contemporary and historic alphabets/scripts, mathematical, political, phonetic, and other symbols, emojis, etc. It is thus a very powerful representation.
Note
Despite the wide support for Unicode, sometimes our own or other readers’ display (e.g., web browsers when viewing an HTML version of the output report) might not be able to render all code points properly (e.g., due to missing fonts). However, we should rest assured that they are still there, and are processed correctly if string functions are applied thereon.
Note
(**) Most web pages and API data are nowadays served in UTF-8, which is also the default encoding assumed by Python’s str.encode and bytes.decode methods. However, occasionally we can encounter files encoded in ISO-8859-1 (Western Europe), Windows-1250 (Eastern Europe), Windows-1251 (Cyrillic), GB18030 and Big5 (Chinese), EUC-KR (Korean), Shift-JIS and EUC-JP (Japanese), amongst others; data read as raw bytes can be converted to strings using the bytes.decode method.
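Here is a minimal sketch of such a conversion (the byte string below is contrived for illustration):
raw = b"gro\xdf"  # "groß" encoded in ISO-8859-1 (0xdf denotes ß there)
raw.decode("iso-8859-1")
## 'groß'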
14.1.2. Normalising Strings
Dirty text data are a pain, especially if similar (semantically) tokens are encoded in many different ways. For the sake of string matching, we might want, e.g., the German "groß", "GROSS", and " gross " to compare all equal.
str.strip removes whitespaces (spaces, tabs, newline characters) at both ends of strings (see also str.lstrip and str.rstrip for their nonsymmetric versions).
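For instance:
"  gross\t".strip()
## 'gross'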
str.lower and str.upper change letter case. For caseless comparison/matching, str.casefold might be a slightly better option as it unfolds many more code point sequences:
"Groß".lower(), "Groß".upper(), "Groß".casefold()
## ('groß', 'GROSS', 'gross')
Important
(**) More advanced string transliteration can be performed by means of the ICU library, which the PyICU package provides wrappers for.
For instance, converting all code points to ASCII (English)
might be necessary when identifiers are expected to miss some diacritics
that would normally be included (as in "Gągolewski"
vs "Gagolewski"
):
import icu # PyICU package
icu.Transliterator.createInstance("Lower; Any-Latin; Latin-ASCII").transliterate(
"Χαίρετε! Groß gżegżółka — © La Niña – köszönöm – Gągolewski"
)
## 'chairete! gross gzegzolka - (C) la nina - koszonom - gagolewski'
Converting between different Unicode Normalisation Forms (also available in the unicodedata package and via pandas.Series.str.normalize) might be used for the removal of some formatting nuances:
icu.Transliterator.createInstance("NFKD; NFC").transliterate("¼ąr²︷")
## '1⁄4ąr2{'
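Similar conversions are available in the standard library’s unicodedata module; for instance:
import unicodedata
unicodedata.normalize("NFKC", "²")  # superscript two becomes an ordinary digit
## '2'
unicodedata.normalize("NFC", "a\u0328")  # the letter a and a combining ogonek compose to ą
## 'ą'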
14.1.3. Substring Searching and Replacing
Determining if a string features a particular fixed substring can be done in a number of different ways.
For instance:
food = "bacon, spam, spam, eggs, and spam"
"spam" in food
## True
verifies whether a particular substring exists,
food.count("spam")
## 3
counts the number of occurrences of a substring,
food.find("spam")
## 7
locates the first pattern occurrence (see also str.rfind as well as str.index and str.rindex),
food.replace("spam", "veggies")
## 'bacon, veggies, veggies, eggs, and veggies'
replaces matching substrings with another string.
Exercise
Read the manual of the following methods: str.startswith, str.endswith, str.removeprefix, and str.removesuffix.
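For a quick taste (note that removeprefix and removesuffix require Python 3.9 or newer):
food.startswith("bacon"), food.endswith("spam")
## (True, True)
food.removeprefix("bacon, ")
## 'spam, spam, eggs, and spam'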
The splitting of long strings at specific fixed delimiter strings can be done via:
food.split(", ")
## ['bacon', 'spam', 'spam', 'eggs', 'and spam']
see also str.partition. The str.join method implements the inverse operation:
", ".join(["spam", "bacon", "eggs", "spam"])
## 'spam, bacon, eggs, spam'
Important
In Section 14.4, we will discuss pattern matching with regular expressions, which can be useful in, amongst others, extracting more abstract data chunks (numbers, URLs, email addresses, IDs) from within strings.
14.1.4. Locale-Aware Services in ICU (*)
Recall that relational operators such as `<` and `>=` perform the lexicographic comparison of strings:
"spam" > "egg"
## True
We have: "a" < "aa" < "aaaaaaaaaaaaa" < "ab" < "aba" < "abb" < "b" < "ba" < "baaaaaaa" < "bb" < "spanish inquisition". Note that if we capitalised the last string, it would actually sort before "a", as the numeric codes of the uppercase ASCII letters are smaller than those of their lowercase counterparts.
Lexicographic ordering (character-by-character, from left to right) is, however, not necessarily appropriate for strings featuring numerals:
"a9" < "a123"
## False
Also, it only takes into account the numeric codes corresponding to each Unicode character, therefore does not work well with non-English alphabets:
"MIELONECZKĄ" < "MIELONECZKI"
## False
In Polish, A with ogonek (Ą) should sort after A and before B, let alone I. However, the corresponding numeric codes in the Unicode table are: 260 (Ą), 65 (A), 66 (B), and 73 (I). Therefore, the resulting ordering is incorrect from the point of view of natural language processing.
It is best to perform string collation using the services provided by ICU. Here is an example of German phone book-like collation where "ö" is treated the same as "oe":
c = icu.Collator.createInstance(icu.Locale("de_DE@collation=phonebook"))
c.setStrength(0)
c.compare("Löwe", "loewe")
## 0
A result of 0 means that the strings are deemed equal.
In some languages, contractions occur, e.g., in Slovak and Czech, the two code points "ch" are treated as a single entity and are sorted after "h":
icu.Collator.createInstance(icu.Locale("sk_SK")).compare("chladný", "hladný")
## 1
i.e., we have "chladný" > "hladný" (the first argument is greater than the second one). Compare the above to something similar in Polish:
icu.Collator.createInstance(icu.Locale("pl_PL")).compare("chłodny", "hardy")
## -1
i.e., "chłodny"
< "hardy"
(the first argument is less than
the 2nd one).
Also, with ICU, numeric collation is possible:
c = icu.Collator.createInstance()
c.setAttribute(icu.UCollAttribute.NUMERIC_COLLATION, icu.UCollAttributeValue.ON)
c.compare("a9", "a123")
## -1
which is the correct result: "a9" is less than "a123" (compare the above to the earlier example that used `<`).
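As a side note, a collator object can also drive the sorting of whole sequences via its getSortKey method; here is a minimal sketch (the exact ordering may depend on the ICU version and the locale data available on our system):
c = icu.Collator.createInstance(icu.Locale("pl_PL"))
sorted(["mleko", "mąka", "mielonka"], key=c.getSortKey)
## ['mąka', 'mielonka', 'mleko']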
14.1.5. String Operations in pandas
String sequences in Series objects are, by default, stored using the broadest possible object data type:
pd.Series(["spam", "bacon", "spam"])
## 0 spam
## 1 bacon
## 2 spam
## dtype: object
which basically means that we deal with a sequence of Python objects of arbitrary type (here, they are all of class str). This allows for the encoding of missing values by means of the None object.
Vectorised versions of the base string operations are available via the pandas.Series.str accessor, which we usually invoke by calling x.str.method_name(), for instance:
x = pd.Series(["spam", "bacon", None, "buckwheat", "spam"])
x.str.upper()
## 0 SPAM
## 1 BACON
## 2 None
## 3 BUCKWHEAT
## 4 SPAM
## dtype: object
We thus have pandas.Series.str.strip, pandas.Series.str.split, pandas.Series.str.find, and so forth.
But there is more. For example, a function to compute the length of each string:
x.str.len()
## 0 4.0
## 1 5.0
## 2 NaN
## 3 9.0
## 4 4.0
## dtype: float64
Concatenating all items into a single string:
x.str.cat(sep="; ")
## 'spam; bacon; buckwheat; spam'
Vectorised string concatenation:
x + " and spam"
## 0 spam and spam
## 1 bacon and spam
## 2 NaN
## 3 buckwheat and spam
## 4 spam and spam
## dtype: object
Conversion to numeric:
pd.Series(["1.3", "-7", None, "3523"]).astype(float)
## 0 1.3
## 1 -7.0
## 2 NaN
## 3 3523.0
## dtype: float64
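As a side note, if some strings cannot be parsed, pandas.to_numeric called with errors="coerce" marks them as missing instead of raising an error:
pd.to_numeric(pd.Series(["1.3", "spam", None]), errors="coerce")
## 0    1.3
## 1    NaN
## 2    NaN
## dtype: float64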
Selecting substrings:
x.str.slice(2, -1) # like x.iloc[i][2:-1] for all i
## 0 a
## 1 co
## 2 None
## 3 ckwhea
## 4 a
## dtype: object
Replacing substrings:
x.str.slice_replace(0, 2, "tofu") # like x.iloc[i] = "tofu" + x.iloc[i][2:] for all i
## 0 tofuam
## 1 tofucon
## 2 None
## 3 tofuckwheat
## 4 tofuam
## dtype: object
Exercise
Consider the nasaweather_glaciers data frame. All glaciers are assigned 11/12-character unique identifiers, as defined by the WGMS convention that forms the glacier ID by combining the following five elements:
2-character political unit,
1-digit continent code,
4-character drainage code,
2-digit free position code,
2- or 3-digit local glacier code.
Extract all five elements and store them as independent columns in the data frame.
14.1.6. String Operations in numpy (*)
There is a huge overlap between the numpy and pandas capabilities for string handling, with the latter being more powerful. Still, some readers will find the following useful.
As mentioned in our introduction to numpy vectors, objects of type ndarray can store not only numeric and logical data, but also character strings. For example:
x = np.array(["spam", "bacon", "egg"])
x
## array(['spam', 'bacon', 'egg'], dtype='<U5')
Here, the data type "<U5" (compare x.dtype) means that we deal with Unicode strings of length no greater than 5. Thus, unfortunately, replacing an element with longer content will result in a truncated string:
x[2] = "buckwheat"
x
## array(['spam', 'bacon', 'buckw'], dtype='<U5')
In order to remedy this, we first need to recast the vector manually:
x = x.astype("<U10")
x[2] = "buckwheat"
x
## array(['spam', 'bacon', 'buckwheat'], dtype='<U10')
Conversion from/to numeric is also possible:
np.array(["1.3", "-7", "3523"]).astype(float)
## array([ 1.300e+00, -7.000e+00, 3.523e+03])
np.array([1, 3.14, -5153]).astype(str)
## array(['1.0', '3.14', '-5153.0'], dtype='<U32')
The numpy.char module includes a number of vectorised versions of string routines, most of which we have discussed above. For example:
x = np.array([
"spam", "spam, bacon, and spam",
"spam, eggs, bacon, spam, spam, and spam"
])
np.char.split(x, ", ")
## array([list(['spam']), list(['spam', 'bacon', 'and spam']),
## list(['spam', 'eggs', 'bacon', 'spam', 'spam', 'and spam'])],
## dtype=object)
np.char.count(x, "spam")
## array([1, 2, 4])
Operations that we would normally perform via the use of binary operators (i.e., `+`, `*`, `<`, etc.) are available through standalone functions:
np.char.add(["spam", "bacon"], " and spam")
## array(['spam and spam', 'bacon and spam'], dtype='<U14')
np.char.equal(["spam", "bacon", "spam"], "spam")
## array([ True, False, True])
The function that returns the length of each string is also noteworthy:
np.char.str_len(x)
## array([ 4, 21, 39])
14.2. Working with String Lists
A Series object can also consist of lists of strings of varying lengths. These can not only be input manually (via the pandas.Series constructor), but also obtained through string splitting. For instance:
x = pd.Series([
"spam",
"spam, bacon, spam",
None,
"spam, eggs, bacon, spam, spam"
])
xs = x.str.split(", ", regex=False)
xs
## 0 [spam]
## 1 [spam, bacon, spam]
## 2 None
## 3 [spam, eggs, bacon, spam, spam]
## dtype: object
and now, e.g., looking at the last element:
xs.iloc[-1]
## ['spam', 'eggs', 'bacon', 'spam', 'spam']
reveals that it is indeed a list of strings.
There are a few vectorised operations that enable us to work with such variable length lists, such as concatenating all strings:
xs.str.join("; ")
## 0 spam
## 1 spam; bacon; spam
## 2 None
## 3 spam; eggs; bacon; spam; spam
## dtype: object
selecting, say, the first string in each list:
xs.str.get(0)
## 0 spam
## 1 spam
## 2 None
## 3 spam
## dtype: object
or slicing:
xs.str.slice(0, -1) # like xs.iloc[i][0:-1] for all i
## 0 []
## 1 [spam, bacon]
## 2 None
## 3 [spam, eggs, bacon, spam]
## dtype: object
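As an aside, pandas.Series.explode can flatten such lists into a long series of individual strings, repeating the index accordingly:
xs.explode().head(6)
## 0     spam
## 1     spam
## 1    bacon
## 1     spam
## 2     None
## 3     spam
## dtype: object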
Exercise
(*) Using pandas.merge, join the following datasets: countries, world_factbook_2020, and ssi_2016_dimensions based on the country names. Note that some manual data cleansing will be necessary.
Exercise
(**) Given a Series object featuring lists of strings:
determine the list of all unique strings (e.g., for xs above we have ["spam", "bacon", "eggs"]); call it xu;
create a data frame x with xs.shape[0] rows and len(xu) columns such that x.iloc[i, j] is equal to 1 if xu[j] is amongst xs.loc[i] and equal to 0 otherwise;
given x (and only x: neither xs nor xu), perform the inverse operation.
14.3. Formatted Outputs for Reproducible Report Generation
When preparing reports from data analysis (e.g., using Jupyter Notebooks or writing directly to Markdown files which we later compile to PDF or HTML using pandoc) it is important to be able to output nicely formatted content programmatically.
14.3.1. Formatting Strings
Recall that string formatting for the inclusion of data stored in existing objects can easily be done using f-strings (formatted string literals) of the type f"...{expression}...".
For instance:
pi = 3.14159265358979323846
f"π = {pi:.2f}"
## 'π = 3.14'
creates a string that includes the variable pi formatted as a float, rounded to two places after the decimal separator.
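Other format specifiers control, amongst others, the field width, grouping of thousands, and percentages; a few quick illustrations:
f"{pi:10.4f}"  # field of width 10, 4 digits after the decimal separator
## '    3.1416'
f"{2**30:,}"  # digit grouping
## '1,073,741,824'
f"{0.25:.0%}"  # as a percentage
## '25%'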
Note
(**) Similar functionality can be achieved using the str.format method as well as the `%` operator overloaded for strings, which uses sprintf-like value placeholders known to some readers from other programming languages (such as C):
"π = {:.2f}".format(pi), "π = %.2f" % pi
## ('π = 3.14', 'π = 3.14')
14.3.2. str and repr
The str and repr functions can create string representations of many objects, with the former being more human-readable, and the latter slightly more technical.
x = np.array([1, 2, 3])
str(x), repr(x)
## ('[1 2 3]', 'array([1, 2, 3])')
Note that repr often returns an output that can be interpreted as executable Python code recreating the object (with no or few adjustments; pandas objects are amongst the many exceptions, though).
14.3.3. Justifying Strings
str.center, str.ljust, and str.rjust can be used to centre, left-, or right-justify a string so that it is of at least a given width, which might make the display thereof more aesthetically pleasing. Very long strings, possibly containing whole text paragraphs, can be dealt with using the wrap and shorten functions from the textwrap package.
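For instance:
import textwrap
"spam".center(12, "-")
## '----spam----'
textwrap.wrap("spam bacon eggs spam", 11)
## ['spam bacon', 'eggs spam']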
14.3.4. Direct Markdown Output in Jupyter
Further, with IPython/Jupyter, we can output strings that will be directly interpreted as Markdown-formatted:
import IPython.display
x = 2+2
out = f"*Result*: $2^2=2+2={x}$."
IPython.display.Markdown(out)
Result: \(2^2=2+2=4\).
Recall that Markdown is a very flexible markup language, allowing us to insert itemised and numbered lists, mathematical formulae, tables, images, etc.
14.3.5. Manual Markdown File Output
We can also generate Markdown code programmatically in the form of standalone .md files:
f = open("/tmp/test-report.md", "w") # open for writing (overwrite if exists)
f.write("**Yummy Foods** include, but are not limited to:\n\n")
x = ["spam", "bacon", "eggs", "spam"]
for e in x:
f.write(f"* {e}\n")
f.write("\nAnd now for something completely different:\n\n")
f.write("Rank | Food\n")
f.write("-----|-----\n")
for i in range(len(x)):
f.write(f"{i+1:4} | {x[i][::-1]:10}\n")
f.close()
## 50
## 7
## 8
## 7
## 7
## 46
## 12
## 12
## 18
## 18
## 18
## 18
The integers printed above are the return values of the consecutive f.write calls: the numbers of characters written. Here is the resulting raw Markdown source file:
with open("/tmp/test-report.md", "r") as f:
out = f.read()
print(out)
## **Yummy Foods** include, but are not limited to:
##
## * spam
## * bacon
## * eggs
## * spam
##
## And now for something completely different:
##
## Rank | Food
## -----|-----
## 1 | maps
## 2 | nocab
## 3 | sgge
## 4 | maps
We can run it through the pandoc tool to convert it to a number of formats, including HTML, PDF, EPUB, and ODT. We may also render it directly into our report:
IPython.display.Markdown(out)
Yummy Foods include, but are not limited to:
spam
bacon
eggs
spam
And now for something completely different:
Rank | Food
-----|-----
1    | maps
2    | nocab
3    | sgge
4    | maps
Note
Figures created in matplotlib can be exported to PNG, SVG, or PDF files using the matplotlib.pyplot.savefig function.
Note
Data frames can be nicely prepared for display in a report using pandas.DataFrame.to_markdown.
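A minimal sketch (to_markdown relies on the optional tabulate package; the exact spacing of the output may vary between versions):
df = pd.DataFrame({"rank": [1, 2], "food": ["spam", "bacon"]})
print(df.to_markdown(index=False))  # emits a pipe-style Markdown table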
Note
(*) Markdown is one of many markup languages. Other learn-worthy ones include HTML (for the Web) and LaTeX (especially for beautiful typesetting of maths, print-ready articles, and books, e.g., in PDF; see [O+21] for a good introduction).
Jupyter Notebooks can be converted to different formats using the jupyter-nbconvert command line tool.
More generally, pandoc is a generic converter between many different formats, e.g., the highly universal (although primitive) Markdown and the aforementioned LaTeX and HTML. It can also be used for preparing presentations (slides).
14.4. Regular Expressions (*)
This section contains large excerpts from yours truly’s other work [Gag22].
Regular expressions (regexes) provide us with a concise grammar for defining systematic patterns which can be sought in character strings. Examples of such patterns include: specific fixed substrings, emojis of any kind, stand-alone sequences of lower-case Latin letters (“words”), substrings that can be interpreted as real numbers (with or without fractional parts, also in scientific notation), telephone numbers, email addresses, or URLs.
Theoretically, the concept of regular pattern matching dates back to the so-called regular languages and finite state automata [Kle51], see also [RS59] and [HU79]. Regexes in the form as we know today have already been present in one of the pre-Unix implementations of the command-line text editor qed [RT70] (the predecessor of the well-known sed).
14.4.1. Regex Matching with re
In Python, the re module implements a regular expression matching engine that accepts patterns following a similar syntax to the ones available in the Perl language.
Before we proceed with a detailed discussion on how to read and write regexes, let us first review the methods for identifying the matching substrings. Below we use the r"\bni+\b" regex as an example. It catches "n" followed by at least one "i", occurring between two word boundaries, i.e., substrings that may be considered standalone words.
In particular, re.findall extracts all non-overlapping matches to a given regex:
import re
x = "We're the knights who say ni! niiiii! ni! niiiiiiiii!"
re.findall(r"\bni+\b", x)
## ['ni', 'niiiii', 'ni', 'niiiiiiiii']
Note the order of the arguments: it is look for what, where – not the other way around.
Important
We used the r"..." prefix when entering the string so that \b is not treated as an escape sequence denoting the backspace character. Otherwise, the above would have to be input as "\\bni+\\b".
If we had not insisted on matching at the word boundaries (i.e., if we used the "ni+" regex instead), we would also match the "ni" in "knights".
The re.search function returns an object of class re.Match that enables us to get some more information about the first match:
r = re.search(r"\bni+\b", x)
r.start(), r.end(), r.group()
## (26, 28, 'ni')
The above includes the start and end position (index) and the match itself. If the regex contains capture groups (see below for more details), we can also pinpoint the matches thereto.
Moreover, re.finditer returns an iterable object that includes the same details, but now about all the matches:
rs = re.finditer(r"\bni+\b", x)
for r in rs:
print((r.start(), r.end(), r.group()))
## (26, 28, 'ni')
## (30, 36, 'niiiii')
## (38, 40, 'ni')
## (42, 52, 'niiiiiiiii')
re.split divides a string into chunks separated by matches to a given regex:
re.split(r"!\s+", x)
## ["We're the knights who say ni", 'niiiii', 'ni', 'niiiiiiiii!']
The r"!\s*"
regex matches the exclamation mark
followed by one or more whitespace characters.
re.sub replaces each match with a given string:
re.sub(r"\bni+\b", "nu", x)
## "We're the knights who say nu! nu! nu! nu!"
Note
(**) More flexible replacement strings can be generated by passing a custom function as the second argument:
re.sub(r"\bni+\b", lambda m: "n" + "u"*(m.end()-m.start()-1), x)
## "We're the knights who say nu! nuuuuu! nu! nuuuuuuuuu!"
14.4.2. Regex Matching with pandas
The pandas.Series.str accessor also defines a number of vectorised functions that utilise the re package’s matcher.
An example Series object:
x = pd.Series(["ni!", "niiii, ni, nii!", None, "spam, bacon", "nii, ni!"])
x
## 0 ni!
## 1 niiii, ni, nii!
## 2 None
## 3 spam, bacon
## 4 nii, ni!
## dtype: object
Here are the most notable functions. Their names are self-explanatory, so instead of abundant verbal descriptions, let us simply inspect their example outputs.
x.str.contains(r"\bni+\b")
## 0 True
## 1 True
## 2 None
## 3 False
## 4 True
## dtype: object
x.str.count(r"\bni+\b")
## 0 1.0
## 1 3.0
## 2 NaN
## 3 0.0
## 4 2.0
## dtype: float64
x.str.replace(r"\bni+\b", "nu", regex=True)
## 0 nu!
## 1 nu, nu, nu!
## 2 None
## 3 spam, bacon
## 4 nu, nu!
## dtype: object
x.str.findall(r"\bni+\b")
## 0 [ni]
## 1 [niiii, ni, nii]
## 2 None
## 3 []
## 4 [nii, ni]
## dtype: object
x.str.split(r",\s+") # a comma, one or more whitespaces
## 0 [ni!]
## 1 [niiii, ni, nii!]
## 2 None
## 3 [spam, bacon]
## 4 [nii, ni!]
## dtype: object
In the last two cases, we get lists of strings as results.
Also, later we will mention pandas.Series.str.extract and pandas.Series.str.extractall which work with regexes that include capture groups.
Note
(*) If we intend to seek matches to the same pattern in many different strings without the use of pandas, it might be a good idea to precompile the regex first and then use the re.Pattern.findall method instead of re.findall:
p = re.compile(r"\bni+\b") # returns an object of class `re.Pattern`
p.findall("We're the knights who say ni! ni! niiiii! nininiiiiiiiii!")
## ['ni', 'ni', 'niiiii']
14.4.3. Matching Individual Characters
Most programming languages and text editors (including Kate, Eclipse, and VSCode) support finding or replacing patterns with regexes. Therefore, they should be amongst the instruments at every data scientist’s disposal. One general introduction to regexes is [Fri06]. The re module flavour is summarised in the official manual, see also [Kuc22]. In the following sections we review the most important elements of the regex syntax as we did in [Gag22].
We begin by discussing different ways to define character sets. In this part, determining the length of all matching substrings will be quite straightforward.
Important
The following characters have special meaning to the regex engine: “.”, “\”, “|”, “(”, “)”, “[”, “]”, “{”, “}”, “^”, “$”, “*”, “+”, and “?”.
Any regular expression that contains none of the above behaves like a fixed pattern:
re.findall("spam", "spam, eggs, spam, bacon, sausage, and spam")
## ['spam', 'spam', 'spam']
There are hence three occurrences of a pattern that is comprised of four code points: “s” followed by “p”, then by “a”, and ending with “m”.
If we wish to include a special character as part of a regular expression – so that it is treated literally – we will need to escape it with a backslash, “\”.
re.findall(r"\.", "spam...")
## ['.', '.', '.']
14.4.3.1. Matching Any Character
The (unescaped) dot, “.”, matches any code point except the newline.
x = "Spam, ham,\njam, SPAM, eggs, and spam"
re.findall("..am", x, re.IGNORECASE)
## ['Spam', ' ham', 'SPAM', 'spam']
The above matches non-overlapping length-4 substrings that end with “am”, case-insensitively.
The dot’s insensitivity to the newline character is motivated by the need to maintain compatibility with tools such as grep (when searching within text files in a line-by-line manner). This behaviour can be altered by setting the DOTALL flag.
re.findall("..am", x, re.DOTALL|re.IGNORECASE)
## ['Spam', ' ham', '\njam', 'SPAM', 'spam']
14.4.3.2. Defining Character Sets
Sets of characters can be introduced by enumerating their members within a pair of square brackets. For instance, “[abc]” denotes the set {a, b, c} – such a regular expression matches one (and only one) symbol from this set. Moreover, in:
re.findall("[hj]am", x)
## ['ham', 'jam']
the “[hj]am” regex matches: “h” or “j”, followed by “a”, followed by “m”. In other words, "ham" and "jam" are the only two strings that are matched by this pattern (unless matching is done case-insensitively).
Important
The following characters, if used within square brackets, may be treated non-literally: “\”, “[”, “]”, “^”, “-”, “&”, “~”, and “|”. Therefore, to include them as-is in a character set, the backslash-escape must be used. For example, “[\[\]\\]” matches a backslash or a square bracket.
14.4.3.3. Complementing Sets
Including “^” (the caret) after the opening square bracket denotes the set complement. Hence, “[^abc]” matches any code point except “a”, “b”, and “c”. Here is an example where we seek any substring that consists of three non-spaces:
x = "Nobody expects the Spanish Inquisition!"
re.findall("[^ ][^ ][^ ]", x)
## ['Nob', 'ody', 'exp', 'ect', 'the', 'Spa', 'nis', 'Inq', 'uis', 'iti', 'on!']
14.4.3.4. Defining Code Point Ranges
Each Unicode code point can be referenced by its unique numeric identifier. For instance, “a” is assigned code U+0061 and “z” is mapped to U+007A. In the pre-Unicode era (mostly with regard to the ASCII codes, ≤ U+007F, representing English letters, decimal digits, some punctuation characters, and a few control characters), we were used to relying on specific code ranges; e.g., “[a-z]” denotes the set comprised of all characters with codes between U+0061 and U+007A, i.e., lowercase letters of the English (Latin) alphabet.
re.findall("[0-9A-Za-z]", "Gągolewski")
## ['G', 'g', 'o', 'l', 'e', 'w', 's', 'k', 'i']
The above pattern denotes a union of 3 code ranges: digits and ASCII upper- and lowercase letters.
Nowadays, in the processing of text in natural languages, this notation should rather be avoided. Note the missing “ą” (Polish “a” with ogonek) in the result.
14.4.3.5. Using Predefined Character Sets
Some other noteworthy Unicode-aware code point classes include the “word characters”:
x = "aąb߯AĄB你12𝟛๔٤,.;'! \t-+=\n[]©←→”„"
re.findall(r"\w", x)
## ['a', 'ą', 'b', 'ß', 'Æ', 'A', 'Ą', 'B', '你', '1', '2', '𝟛', '๔', '٤']
decimal digits:
re.findall(r"\d", x)
## ['1', '2', '𝟛', '๔', '٤']
and whitespaces:
re.findall(r"\s", x)
## [' ', '\t', '\n']
Moreover, e.g., “\W” is equivalent to “[^\w]”, i.e., denotes the complement of “\w”.
14.4.4. Alternating and Grouping Subexpressions
14.4.4.1. Alternation Operator
The alternation operator, “|” (the pipe or bar), matches either its left or its right branch. For instance:
x = "spam, egg, ham, jam, algae, and an amalgam of spam, all al dente"
re.findall("spam|ham", x)
## ['spam', 'ham', 'spam']
14.4.4.2. Grouping Subexpressions
“|” has a very low precedence. Therefore, if we wish to introduce an alternative of subexpressions, we need to group them using the “(?:...)” syntax. For instance, “(?:sp|h)am” matches either “spam” or “ham”.
Notice that the bare use of the round brackets, "(...)" (i.e., without the "?:" part), has the side effect of creating new capture groups; see below for more details.
Also, matching is always done left-to-right, on a first-come, first-served basis. Hence, if the left branch is a subset of the right one, the latter will never be matched. In particular, “(?:al|alga|algae)” can only match “al”. To fix this, we can write “(?:algae|alga|al)”.
14.4.4.3. Non-grouping Parentheses
Some parenthesised subexpressions – those in which the opening bracket is followed by the question mark – have a distinct meaning. In particular, “(?#...)” denotes a free-format comment that is ignored by the regex parser:
re.findall(
"(?# match 'sp' or 'h')(?:sp|h)(?# and 'am')am|(?# or match 'egg')egg",
x
)
## ['spam', 'egg', 'ham', 'spam']
This is just horrible. Luckily, constructing more sophisticated regexes by concatenating subfragments thereof is more readable:
re.findall(
"(?:sp|h)" + # match either 'sp' or 'h'
"am" + # followed by 'am'
"|" + # ... or ...
"egg", # just match 'egg'
x
)
## ['spam', 'egg', 'ham', 'spam']
What is more, e.g., “(?i)” enables the case-insensitive mode.
re.findall("(?i)spam", "Spam spam SPAMITY spAm")
## ['Spam', 'spam', 'SPAM', 'spAm']
14.4.5. Quantifiers
More often than not, a variable number of instances of the same subexpression needs to be captured or its presence should be made optional. This can be achieved by means of the following quantifiers:
“?” matches 0 or 1 times;
“*” matches 0 or more times;
“+” matches 1 or more times;
“{n,m}” matches between n and m times;
“{n,}” matches at least n times;
“{n}” matches exactly n times.
These operators are applied onto the directly preceding atoms.
For example, “ni+” captures "ni", "nii", "niii", etc., but neither "n" alone nor "ninini" altogether.
By default, the quantifiers are greedy – they match the repeated subexpression as many times as possible. The “?” suffix (hence, quantifiers such as “??”, “*?”, “+?”, and so forth) tries with as few occurrences as possible (so as to obtain a match still).
Greedy:
x = "sp(AM)(maps)(SP)am"
re.findall(r"\(.+\)", x)
## ['(AM)(maps)(SP)']
Lazy:
re.findall(r"\(.+?\)", x)
## ['(AM)', '(maps)', '(SP)']
Greedy (but clever):
re.findall(r"\([^)]+\)", x)
## ['(AM)', '(maps)', '(SP)']
The first regex is greedy: it matches an opening bracket, then as many characters as possible (including “)”) that are followed by a closing bracket. The two other patterns terminate as soon as the first closing bracket is found.
More examples:
x = "spamamamnomnomnomammmmmmmmm"
re.findall("sp(?:am|nom)+", x)
## ['spamamamnomnomnomam']
re.findall("sp(?:am|nom)+?", x)
## ['spam']
And:
re.findall("sp(?:am|nom)+?m*", x)
## ['spam']
re.findall("sp(?:am|nom)+?m+", x)
## ['spamamamnomnomnomammmmmmmmm']
Let us stress that the quantifier is applied to the subexpression that stands directly before it. Grouping parentheses can be used in case they are needed.
x = "12, 34.5, 678.901234, 37...629, ..."
re.findall(r"\d+\.\d+", x)
## ['34.5', '678.901234']
matches digits, a dot, and another series of digits.
re.findall(r"\d+(?:\.\d+)?", x)
## ['12', '34.5', '678.901234', '37', '629']
finds digits which are possibly (but not necessarily) followed by a dot and a digit sequence.
Exercise
Write a regex that extracts all #hashtags from a string #omg #SoEasy.
14.4.6. Capture Groups and References Thereto
Round-bracketed subexpressions (without the "?:" prefix) form the so-called capture groups that can be extracted separately or be referred to in other parts of the same regex.
14.4.6.1. Extracting Capture Group Matches
The above is evident when we use re.findall:
x = "name='Sir Launcelot', quest='Seek the Grail', favecolour='blue'"
re.findall(r"(\w+)='(.+?)'", x)
## [('name', 'Sir Launcelot'), ('quest', 'Seek the Grail'), ('favecolour', 'blue')]
It simply returned the matches to the capture groups, not the whole matching substrings.
re.search and re.finditer can pinpoint each component:
r = re.search(r"(\w+)='(.+?)'", x)
print("all (0):", (r.start(), r.end(), r.group()))
print(" 1 :", (r.start(1), r.end(1), r.group(1)))
print(" 2 :", (r.start(2), r.end(2), r.group(2)))
## all (0): (0, 20, "name='Sir Launcelot'")
## 1 : (0, 4, 'name')
## 2 : (6, 19, 'Sir Launcelot')
Here is a vectorised version of the above from pandas, returning the first match:
y = pd.Series([
"name='Sir Launcelot'",
"quest='Seek the Grail'",
"favecolour='blue', favecolour='yel.. Aaargh!'"
])
y.str.extract(r"(\w+)='(.+?)'")
## 0 1
## 0 name Sir Launcelot
## 1 quest Seek the Grail
## 2 favecolour blue
We see that the findings are presented in a data frame form. The first column gives the matches to the first capture group, and so forth.
All matches are available too:
y.str.extractall(r"(\w+)='(.+?)'")
## 0 1
## match
## 0 0 name Sir Launcelot
## 1 0 quest Seek the Grail
## 2 0 favecolour blue
## 1 favecolour yel.. Aaargh!
Recall that if we just need the grouping part of “(...)”, i.e., without the capturing feature, “(?:...)” can be applied. Also, named capture groups defined via the “(?P<name>...)” syntax are supported:
y.str.extract("(?:\\w+)='(?P<value>.+?)'")
## value
## 0 Sir Launcelot
## 1 Seek the Grail
## 2 blue
14.4.6.2. Replacing with Capture Group Matches
Matches to particular capture groups can be recalled in replacement
strings when using re.sub and pandas.Series.str.replace.
Here, the match in its entirety is denoted by “\g<0>”. Then, “\g<1>” stores whatever was caught by the first capture group, “\g<2>” is the match to the second capture group, and so forth.
re.sub(r"(\w+)='(.+?)'", r"\g<2> is a \g<1>", x)
## 'Sir Launcelot is a name, Seek the Grail is a quest, blue is a favecolour'
Named capture groups can be referred to too:
re.sub(r"(?P<key>\w+)='(?P<value>.+?)'",
r"\g<value> is a \g<key>", x)
## 'Sir Launcelot is a name, Seek the Grail is a quest, blue is a favecolour'
14.4.6.3. Back-Referencing
Matches to capture groups can also be part of the regexes themselves.
For example, “\1” denotes whatever has been consumed by the first capture group.
Even though, in general, parsing HTML code with regexes is not recommended, let us consider the following examples:
x = "<strong><em>spam</em></strong><code>eggs</code>"
re.findall(r"<[a-z]+>.*?</[a-z]+>", x)
## ['<strong><em>spam</em>', '<code>eggs</code>']
re.findall(r"(<([a-z]+)>.*?</\2>)", x)
## [('<strong><em>spam</em></strong>', 'strong'), ('<code>eggs</code>', 'code')]
The second regex guarantees that the match will include all characters between the opening <tag> and the corresponding (not: any) closing </tag>. Named capture groups can be referenced using the “(?P=name)” syntax:
re.findall(r"(<(?P<tagname>[a-z]+)>.*?</(?P=tagname)>)", x)
## [('<strong><em>spam</em></strong>', 'strong'), ('<code>eggs</code>', 'code')]
14.4.7. Anchoring
Lastly, let us mention the ways to match a pattern at a given abstract position within a string.
14.4.7.1. Matching at the Beginning or End of a String
“^” and “$” match, respectively, the start and the end of the string (or of each line within a string, if the re.MULTILINE flag is set).
x = pd.Series(["spam egg", "bacon spam", "spam", "egg spam bacon", "sausage"])
rs = ["spam", "^spam", "spam$", "spam$|^spam", "^spam$"]
pd.concat([x.str.contains(r) for r in rs], axis=1, keys=rs)
## spam ^spam spam$ spam$|^spam ^spam$
## 0 True True False True False
## 1 True False True True False
## 2 True True True True True
## 3 True False False False False
## 4 False False False False False
The five regular expressions match “spam”, respectively: anywhere within the string, at the beginning, at the end, at the beginning or end, and in strings that are equal to the pattern itself.
Exercise
Write a regex that does the same job as str.strip.
14.4.7.2. Matching at Word Boundaries
Furthermore, “\b” matches at a “word boundary”, e.g., near spaces, punctuation marks, or at the start/end of a string, i.e., wherever there is a transition between a word character, “\w”, and a non-word character, “\W”, or vice versa.
In the following example, we match all stand-alone numbers (this regular expression is provided for didactic purposes only):
re.findall(r"[-+]?\b\d+(?:\.\d+)?\b", "+12, 34.5, -5.3243")
## ['+12', '34.5', '-5.3243']
14.4.7.3. Looking Behind and Ahead
There are also ways to guarantee that a pattern occurrence begins or
ends with a match to some subexpression: “(?<=...)...
” is the
so-called look-behind, whereas “...(?=...)
” denotes the look-ahead.
Moreover, “(?<!...)...
” and “...(?!...)
” are their negated
(“negative look-behind/ahead”) versions.
x = "I like spam, spam, eggs, and spam."
re.findall(r"\b\w+\b(?=[,.])", x)
## ['spam', 'spam', 'eggs', 'spam']
re.findall(r"\b\w+\b(?![,.])", x)
## ['I', 'like', 'and']
The first regex captures words that end with “,” or “.”. The second one matches words that end with neither “,” nor “.”.
Exercise
Write a regex that extracts all standalone numbers accepted by Python, including 12.123, -53, +1e-9, -1.2423e10, 4. and .2.
Exercise
Write a regex that matches all email addresses.
Exercise
Write a regex that matches all URLs starting with http:// or https://.
Exercise
Cleanse the warsaw_weather dataset so that it contains analysable numeric data.
14.5. Exercises
List some ways to normalise character strings.
(**) What are the challenges of processing non-English text?
What are the problems with the "[A-Za-z]" and "[A-z]" character sets?
Name the two ways to turn on case-insensitive regex matching.
What is a word boundary?
What is the difference between the "^" and "$" anchors?
When would we prefer using "[0-9]" instead of "\d"?
What is the difference between the "?", "??", "*", "*?", "+", and "+?" quantifiers?
Does "." match all the characters?
What are named capture groups and how can we refer to the matches thereto in re.sub?