13. Accessing Databases#
The open-access textbook Minimalist Data Wrangling with Python by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF; a printed version can be ordered from Amazon: AU CA DE ES FR IT JP NL PL SE UK US). It is a non-profit project. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Also, make sure to check out my other book, Deep R Programming .
pandas is convenient for working with data that fit into memory and which can be stored in individual CSV files. Still, larger information banks in a shared environment will often be made available to us via relational (structured) databases such as PostgreSQL or MariaDB, or a wide range of commercial products.
Most commonly, we use SQL (Structured Query Language) to define the data chunks we wish to analyse. Then, we fetch them from the database driver in the form of a pandas data frame. This enables us to perform the operations we are already familiar with, e.g., various transformations or visualisations.
Below we make a quick introduction to the basics of SQL using SQLite, which is a lightweight, flat-file, and server-less open-source database management system. Overall, SQLite is a sensible choice for data of even hundreds or thousands of gigabytes in size that fit on a single computer’s disk. This is more than enough for playing with our data science projects or prototyping more complex solutions.
In this chapter, we will learn that the syntax of SQL is very readable: it is modelled after the natural (English) language. The purpose of this introduction is not to write own queries nor to design own databanks: this should be covered by a separate course on database systems; see, e.g., [15, 19].
13.1. Example Database#
In this chapter, we will be working with a simplified data dump of the Q&A site Travel Stack Exchange, which we downloaded on 2017-10-31. It consists of five separate data frames.
Tags gives, amongst others,
topic categories (
TagName) and how many questions mention them (
Tags = pd.read_csv("https://raw.githubusercontent.com/gagolews/" + "teaching-data/master/travel_stackexchange_com_2017/Tags.csv.gz", comment="#") Tags.head(3) ## Count ExcerptPostId Id TagName WikiPostId ## 0 104 2138.0 1 cruising 2137.0 ## 1 43 357.0 2 caribbean 356.0 ## 2 43 319.0 4 vacations 318.0
Users provides information on the registered users.
Users = pd.read_csv("https://raw.githubusercontent.com/gagolews/" + "teaching-data/master/travel_stackexchange_com_2017/Users.csv.gz", comment="#") Users.head(3) ## AccountId Age CreationDate ... Reputation UpVotes Views ## 0 -1.0 NaN 2011-06-21T15:16:44.253 ... 1.0 2472.0 0.0 ## 1 2.0 40.0 2011-06-21T20:10:03.720 ... 101.0 1.0 31.0 ## 2 7598.0 32.0 2011-06-21T20:11:02.490 ... 101.0 1.0 14.0 ## ## [3 rows x 11 columns]
Badges recalls all rewards handed to the users (
for their engaging in various praiseworthy activities:
Badges = pd.read_csv("https://raw.githubusercontent.com/gagolews/" + "teaching-data/master/travel_stackexchange_com_2017/Badges.csv.gz", comment="#") Badges.head(3) ## Class Date Id Name TagBased UserId ## 0 3 2011-06-21T20:16:48.910 1 Autobiographer False 2 ## 1 3 2011-06-21T20:16:48.910 2 Autobiographer False 3 ## 2 3 2011-06-21T20:16:48.910 3 Autobiographer False 4
Posts lists all the questions and answers (the latter
do not have
ParentId set to
Posts = pd.read_csv("https://raw.githubusercontent.com/gagolews/" + "teaching-data/master/travel_stackexchange_com_2017/Posts.csv.gz", comment="#") Posts.head(3) ## AcceptedAnswerId ... ViewCount ## 0 393.0 ... 419.0 ## 1 NaN ... 1399.0 ## 2 NaN ... NaN ## ## [3 rows x 17 columns]
Votes list all the up-votes (
VoteTypeId equal to 2)
and down-votes (
VoteTypeId of 3)
to all the posts.
Votes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" + "teaching-data/master/travel_stackexchange_com_2017/Votes.csv.gz", comment="#") Votes.head(3) ## BountyAmount CreationDate Id PostId UserId VoteTypeId ## 0 NaN 2011-06-21T00:00:00.000 1 1 NaN 2 ## 1 NaN 2011-06-21T00:00:00.000 2 1 NaN 2 ## 2 NaN 2011-06-21T00:00:00.000 3 2 NaN 2
file for a detailed description of each column.
Note that rows are uniquely defined by their respective
They are relations between the data frames, e.g.,
Moreover, for privacy reasons, some
UserIds might be missing.
In such a case, they are encoded with a not-a-number;
compare Chapter 15.
13.2. Exporting Data to a Database#
Let us establish a connection with the to-be SQLite database. In our case, this will be an ordinary file stored on the computer’s disk:
import tempfile, os.path dbfile = os.path.join(tempfile.mkdtemp(), "travel.db") print(dbfile) ## /tmp/tmpnwyz5knx/travel.db
The above defines the file path (compare Section 13.6.1)
where the database is going to be stored. We use a randomly generated
filename inside the local file system’s (we are on Linux)
/tmp. This is just a pleasant exercise, and
we will not be using this database afterwards.
The reader might prefer setting a filename relative to the
current working directory (as given by os.getcwd), e.g.,
dbfile = "travel.db".
We can now connect to the database:
import sqlite3 conn = sqlite3.connect(dbfile)
The database might now be queried: we can add new tables, insert new rows, and retrieve records.
In the end, we should remember to call
Our data are already in the form of pandas data frames. Therefore, exporting them to the database is straightforward. We only need to make a series of calls to the pandas.DataFrame.to_sql method.
Tags.to_sql("Tags", conn, index=False) Users.to_sql("Users", conn, index=False) Badges.to_sql("Badges", conn, index=False) Posts.to_sql("Posts", conn, index=False) Votes.to_sql("Votes", conn, index=False)
(*) It is possible to export data that do not fit into memory
by reading them in chunks of considerable, but not too large, sizes.
In particular pandas.read_csv has the
nrows argument that
lets us read several rows from a file connection; see
Section 13.6.4. Then,
(..., if_exists="append") can
be used to append new rows to an existing table.
Exporting data can of course be done without pandas as well,
e.g., when they are to be fetched from XML or JSON files
(compare Section 13.5) and processed manually, row by row.
Intermediate-level SQL users can call
conn.execute("CREATE TABLE t..."),
conn.executemany("INSERT INTO t VALUES(?, ?, ?)", l),
conn.commit(). This will create a new table (here: named
populated by a list of records (e.g., in the form of tuples or
numpy vectors). For more details, see the
of the sqlite3 package.
13.3. Exercises on SQL vs pandas#
We can use pandas to fetch the results of any SQL query in the form of a data frame. For example:
pd.read_sql_query(""" SELECT * FROM Tags LIMIT 3 """, conn) ## Count ExcerptPostId Id TagName WikiPostId ## 0 104 2138.0 1 cruising 2137.0 ## 1 43 357.0 2 caribbean 356.0 ## 2 43 319.0 4 vacations 318.0
The above query selected all columns (
SELECT *) and
the first three rows (
LIMIT 3) from the
For the above and all the following SQL queries, write the equivalent Python code that generates the same result using pandas functions and methods. In each case, there might be more than one equally fine solution. In case of any doubt about the meaning of the queries, please refer to the SQLite documentation. Example solutions are provided at the end of this section.
For a reference query:
res1a = pd.read_sql_query(""" SELECT * FROM Tags LIMIT 3 """, conn)
The equivalent pandas implementation might look like:
res1b = Tags.head(3)
To verify that the results are equal, we can call:
pd.testing.assert_frame_equal(res1a, res1b) # no error == OK
No error message means that the test is passed. The cordial thing about the assert_frame_equal function is that it ignores small round-off errors introduced by arithmetic operations.
Nonetheless, the results generated by pandas
might be the same up to the reordering of rows. In such a case,
before calling pandas.testing.assert_frame_equal,
we can invoke
DataFrame.sort_values on both data frames to sort them
with respect to 1 or 2 chosen columns.
Tags, select two columns
and rows for which
TagName is equal to one of the three choices provided.
res2a = pd.read_sql_query(""" SELECT TagName, Count FROM Tags WHERE TagName IN ('poland', 'australia', 'china') """, conn) res2a ## TagName Count ## 0 china 443 ## 1 australia 411 ## 2 poland 139
Hint: use pandas.Series.isin.
Select a set of columns from
Posts whose rows fulfil a given
set of conditions.
res3a = pd.read_sql_query(""" SELECT Title, Score, ViewCount, FavoriteCount FROM Posts WHERE PostTypeId=1 AND ViewCount>=10000 AND FavoriteCount BETWEEN 35 AND 100 """, conn) res3a ## Title ... FavoriteCount ## 0 When traveling to a country with a different c... ... 35.0 ## 1 How can I do a "broad" search for flights? ... 49.0 ## 2 Tactics to avoid getting harassed by corrupt p... ... 42.0 ## 3 Flight tickets: buy two weeks before even duri... ... 36.0 ## 4 OK we're all adults here, so really, how on ea... ... 79.0 ## 5 How to intentionally get denied entry to the U... ... 53.0 ## 6 How do you know if Americans genuinely/literal... ... 79.0 ## 7 OK, we are all adults here, so what is a bidet... ... 38.0 ## 8 How to cope with too slow Wi-Fi at hotel? ... 41.0 ## ## [9 rows x 4 columns]
Score columns from
ParentId is missing (i.e., the post is in fact a question)
Title is well-defined.
Then, sort the results by the
Score column, decreasingly
(descending order). Finally, return only the first five rows (e.g.,
top five scoring questions).
res4a = pd.read_sql_query(""" SELECT Title, Score FROM Posts WHERE ParentId IS NULL AND Title IS NOT NULL ORDER BY Score DESC LIMIT 5 """, conn) res4a ## Title Score ## 0 OK we're all adults here, so really, how on ea... 306 ## 1 How do you know if Americans genuinely/literal... 254 ## 2 How to intentionally get denied entry to the U... 219 ## 3 Why are airline passengers asked to lift up wi... 210 ## 4 Why prohibit engine braking? 178
Hint: use pandas.DataFrame.sort_values and numpy.isnan or pandas.isnull.
13.3.3. Removing Duplicates#
Get all unique badge names for the user with
res5a = pd.read_sql_query(""" SELECT DISTINCT Name FROM Badges WHERE UserId=23 """, conn) res5a ## Name ## 0 Supporter ## 1 Student ## 2 Teacher ## 3 Scholar ## 4 Beta ## 5 Nice Question ## 6 Editor ## 7 Nice Answer ## 8 Yearling ## 9 Popular Question ## 10 Taxonomist ## 11 Notable Question
Hint: use pandas.DataFrame.drop_duplicates.
For each badge handed to the user with
extract the award year store it in a new column named
Then, select only the unique pairs
res6a = pd.read_sql_query(""" SELECT DISTINCT Name, CAST(strftime('%Y', Date) AS FLOAT) AS Year FROM Badges WHERE UserId=23 """, conn) res6a ## Name Year ## 0 Supporter 2011.0 ## 1 Student 2011.0 ## 2 Teacher 2011.0 ## 3 Scholar 2011.0 ## 4 Beta 2011.0 ## 5 Nice Question 2011.0 ## 6 Editor 2012.0 ## 7 Nice Answer 2012.0 ## 8 Yearling 2012.0 ## 9 Nice Question 2012.0 ## 10 Nice Question 2013.0 ## 11 Yearling 2013.0 ## 12 Popular Question 2014.0 ## 13 Yearling 2014.0 ## 14 Taxonomist 2014.0 ## 15 Notable Question 2015.0 ## 16 Nice Question 2017.0
see Chapter 16.
13.3.4. Grouping and Aggregating#
Count how many badges of each type the user with
Also, for each badge type, compute the minimal, average, and maximal
Return only the top four badges (with respect to the counts).
res7a = pd.read_sql_query(""" SELECT Name, COUNT(*) AS Count, MIN(CAST(strftime('%Y', Date) AS FLOAT)) AS MinYear, AVG(CAST(strftime('%Y', Date) AS FLOAT)) AS MeanYear, MAX(CAST(strftime('%Y', Date) AS FLOAT)) AS MaxYear FROM Badges WHERE UserId=23 GROUP BY Name ORDER BY Count DESC LIMIT 4 """, conn) res7a ## Name Count MinYear MeanYear MaxYear ## 0 Nice Question 4 2011.0 2013.25 2017.0 ## 1 Yearling 3 2012.0 2013.00 2014.0 ## 2 Popular Question 3 2014.0 2014.00 2014.0 ## 3 Notable Question 2 2015.0 2015.00 2015.0
Count how many unique combinations of pairs
for the badges won by the user with
Id=23 are there.
Then, return only the rows having
Count greater than 1
and order the results by
In other words, list the badges received more than once in any given
res8a = pd.read_sql_query(""" SELECT Name, CAST(strftime('%Y', Date) AS FLOAT) AS Year, COUNT(*) AS Count FROM Badges WHERE UserId=23 GROUP BY Name, Year HAVING Count > 1 ORDER BY Count DESC """, conn) res8a ## Name Year Count ## 0 Popular Question 2014.0 3 ## 1 Notable Question 2015.0 2
WHERE is performed before
GROUP BY, and
is applied thereafter.
for all posts with
OwnerUserId not equal to
the tags which were created by “alive” users).
Return the top six records with respect to
res9a = pd.read_sql_query(""" SELECT Tags.TagName, Tags.Count, Posts.OwnerUserId, Users.Age, Users.Location, Users.DisplayName FROM Tags JOIN Posts ON Posts.Id=Tags.WikiPostId JOIN Users ON Users.AccountId=Posts.OwnerUserId WHERE OwnerUserId != -1 ORDER BY Tags.Count DESC, Tags.TagName ASC LIMIT 6 """, conn) res9a ## TagName Count ... Location DisplayName ## 0 canada 802 ... Mumbai, India hitec ## 1 europe 681 ... Philadelphia, PA Adam Tuttle ## 2 visa-refusals 554 ... New York, NY Benjamin Pollack ## 3 australia 411 ... Mumbai, India hitec ## 4 eu 204 ... Philadelphia, PA Adam Tuttle ## 5 new-york-city 204 ... Mumbai, India hitec ## ## [6 rows x 6 columns]
First, create an auxiliary (temporary) table named
where we store the information about the number of up-votes
VoteTypeId=2) that each post has received.
Then, join (merge) this table with
and fetch some details about the five questions (
with the most up-votes.
res10a = pd.read_sql_query(""" SELECT UpVotesTab.*, Posts.Title FROM ( SELECT PostId, COUNT(*) AS UpVotes FROM Votes WHERE VoteTypeId=2 GROUP BY PostId ) AS UpVotesTab JOIN Posts ON UpVotesTab.PostId=Posts.Id WHERE Posts.PostTypeId=1 ORDER BY UpVotesTab.UpVotes DESC LIMIT 5 """, conn) res10a ## PostId UpVotes Title ## 0 3080 307 OK we're all adults here, so really, how on ea... ## 1 38177 254 How do you know if Americans genuinely/literal... ## 2 24540 221 How to intentionally get denied entry to the U... ## 3 20207 211 Why are airline passengers asked to lift up wi... ## 4 96447 178 Why prohibit engine braking?
13.3.6. Solutions to Exercises#
In this section we provide examples of solutions to the above exercises.
To obtain a result equivalent to
res2a, we need basic filtering only:
res2b = ( Tags. loc[ Tags.TagName.isin(["poland", "australia", "china"]), ["TagName", "Count"] ]. reset_index(drop=True) )
Let us verify whether the two data frames are identical:
pd.testing.assert_frame_equal(res2a, res2b) # no error == OK
res3a with pandas only, we need some more complex
filtering with loc
res3b = ( Posts. loc[ (Posts.PostTypeId == 1) & (Posts.ViewCount >= 10000) & (Posts.FavoriteCount >= 35) & (Posts.FavoriteCount <= 100), ["Title", "Score", "ViewCount", "FavoriteCount"] ]. reset_index(drop=True) ) pd.testing.assert_frame_equal(res3a, res3b) # no error == OK
res4a, some filtering and sorting is all we need:
res4b = ( Posts. loc[ Posts.ParentId.isna() & (~Posts.Title.isna()), ["Title", "Score"] ]. sort_values("Score", ascending=False). head(5). reset_index(drop=True) ) pd.testing.assert_frame_equal(res4a, res4b) # no error == OK
The key to
res5a is the pandas.DataFrame.drop_duplicates
res5b = ( Badges. loc[Badges.UserId == 23, ["Name"]]. drop_duplicates(). reset_index(drop=True) ) pd.testing.assert_frame_equal(res5a, res5b) # no error == OK
res6a, we first need to add a new column to the copy of
Badges2 = Badges.copy() # otherwise we would destroy the original object Badges2.loc[:, "Year"] = ( Badges2.Date.astype("datetime64").dt.strftime("%Y").astype("float") )
Then, we apply some basic filtering and the removal of duplicated rows:
res6b = ( Badges2. loc[Badges2.UserId == 23, ["Name", "Year"]]. drop_duplicates(). reset_index(drop=True) ) pd.testing.assert_frame_equal(res6a, res6b) # no error == OK
res7a, we can use pandas.DataFrameGroupBy.aggregate:
Badges2 = Badges.copy() Badges2.loc[:, "Year"] = ( Badges2.Date.astype("datetime64").dt.strftime("%Y").astype("float") ) res7b = ( Badges2. loc[Badges2.UserId == 23, ["Name", "Year"]]. groupby("Name")["Year"]. aggregate([len, np.min, np.mean, np.max]). sort_values("len", ascending=False). head(4). reset_index() ) res7b.columns = ["Name", "Count", "MinYear", "MeanYear", "MaxYear"]
Had we not converted
float, we would obtain
a meaningless average year, without any warning.
Unfortunately, the rows in
res7b are ordered
differently. For testing, we need to reorder them in the same way:
pd.testing.assert_frame_equal( res7a.sort_values(["Name", "Count"]).reset_index(drop=True), res7b.sort_values(["Name", "Count"]).reset_index(drop=True) ) # no error == OK
res8a, we first count the number of values in each group:
Badges2 = Badges.copy() Badges2.loc[:, "Year"] = ( Badges2.Date.astype("datetime64").dt.strftime("%Y").astype("float") ) res8b = ( Badges2. loc[ Badges2.UserId == 23, ["Name", "Year"] ]. groupby(["Name", "Year"]). size(). rename("Count"). reset_index() )
HAVING part is performed after
res8b = ( res8b. loc[ res8b.Count > 1, : ]. sort_values("Count", ascending=False). reset_index(drop=True) ) pd.testing.assert_frame_equal(res8a, res8b) # no error == OK
To obtain a result equivalent to
res9a, we need to merge
Tags, and then merge the result with
res9b = pd.merge(Posts, Tags, left_on="Id", right_on="WikiPostId") res9b = pd.merge(Users, res9b, left_on="AccountId", right_on="OwnerUserId")
Then, some filtering and sorting will do the trick:
res9b = ( res9b. loc[ (res9b.OwnerUserId != -1) & (~res9b.OwnerUserId.isna()), ["TagName", "Count", "OwnerUserId", "Age", "Location", "DisplayName"] ]. sort_values(["Count", "TagName"], ascending=[False, True]). head(6). reset_index(drop=True) )
In SQL, “not equals to -1” implies
IS NOT NULL.
pd.testing.assert_frame_equal(res9a, res9b) # no error == OK
To obtain a result equivalent to
we first need to create an auxiliary data frame
that corresponds to the subquery.
UpVotesTab = ( Votes. loc[Votes.VoteTypeId==2, :]. groupby("PostId"). size(). rename("UpVotes"). reset_index() )
res10b = pd.merge(UpVotesTab, Posts, left_on="PostId", right_on="Id") res10b = ( res10b. loc[res10b.PostTypeId==1, ["PostId", "UpVotes", "Title"]]. sort_values("UpVotes", ascending=False). head(5). reset_index(drop=True) ) pd.testing.assert_frame_equal(res10a, res10b) # no error == OK
13.4. Closing the Database Connection#
We said we should not forget about:
This gives some sense of closure. Such a relief.
13.5. Common Data Serialisation Formats for the Web#
CSV files are an all-round way to exchange tabular data between different programming and data analysis environments.
For unstructured or non-tabularly-structured data, XML and JSON (and its superset, YAML) are the common formats of choice, especially for communicating with different Web APIs.
It is recommended that we solve some of the following exercises to make sure we can fetch data in these formats. Sadly, often this will require some quite tedious labour, neither art nor science; see also  and .
Australian Radiation Protection and Nuclear Safety Agency publishes UV data for different Aussie cities. Using, for example, the xml package, convert this XML dataset to a data frame.
(*) Check out the English Wikipedia article featuring a list of 20th-century classical composers. Using pandas.read_html, convert the Climate Data table included therein to a data frame.
(*) Using, for example, the lxml package, write a function that converts each bullet list featured in a given Wikipedia article (e.g., this one), to a list of strings.
(**) Import an archived version of a Stack Exchange site that you find interesting and store it in an SQLite database. You can find the relevant data dumps here.
(**) Download and then import an archived version of one of the wikis hosted by the Wikimedia Foundation (e.g., the whole English Wikipedia) so that it can be stored in an SQLite database.
13.6. Working with Many Files#
For the mass-processing of many files, it is worth knowing of the most basic functions for dealing with file paths, searching for files, etc. Usually, we will be looking up ways to complete specific tasks at hand, e.g., how to read data from a ZIP-like archive, on the internet. This is because – contrary to the basic operations of vectors, matrices, and data frames – these are not amongst the actions that we perform that often.
Good development practices related to data storage are described in .
13.6.1. File Paths#
UNIX-like operating systems, including GNU/Linux and macOS,
use slashes, `
/`, as path separators, e.g.,
Windows, however, uses backslashes, `
\`, which have a special meaning
in character strings (escape sequences; see Section 2.1.3).
Therefore, they should be input as, e.g.,
Alternatively, we can use raw strings, where the backslash
is treated literally, e.g.,
When constructing file paths programmatically, it is thus best to rely on os.path.join, which takes care of the system-specific nuances.
import os.path os.path.join("~", "Desktop", "file.csv") # we are on GNU/Linux ## '~/Desktop/file.csv'
The tilde, `
~`, denotes the current user’s home directory.
For storing auxiliary data, we can use the system’s temporary directory. See the tempfile module for functions that generate appropriate file paths therein. For instance, a subdirectory inside the temporary directory can be created via a call to tempfile.mkdtemp.
We will frequently be referring to file paths relative to the working directory of the currently executed Python session (e.g., from which IPython/Jupyter notebook server was started); see os.getcwd.
All non-absolute file names (ones that do not start
c:\\`, and the like), for example,
are always relative to the current working directory.
For instance, if the working directory is
..` denotes the current working directory’s parent directory.
Print the current working directory by calling os.getcwd.
Next, download the file
and save it in the current Python session’s working directory
(e.g., in your web browser, right-click on the web page’s canvas
and select Save Page As…).
Load with pandas.read_csv by passing
"air_quality_2018_param.csv" as the input path.
(*) Download the aforementioned file programmatically (if you have not done so yet) using the requests module.
13.6.2. File Search#
glob.glob and os.listdir can generate a list of files in a given directory (and possibly all its subdirectories).
os.path.isdir and os.path.isfile can be used to determine the type of a given object in the file system.
Write a function that computes the total size of all the files in a given directory and all its subdirectories.
13.6.3. Exception Handling#
Accessing resources on the disk or the internet can lead to errors, for example when the file is not found. The try..except statement can be used if we want to be able to react to any of the envisaged errors
try: # statements to execute x = pd.read_csv("file_not_found.csv") print(x.head()) # this will not be executed if the above raises an error except OSError: # if an exception occurs, we can handle it here print("File has not been found") ## File has not been found
For more details, refer to the documentation.
13.6.4. File Connections (*)#
Basic ways of opening and reading from/writing to file connections are described in the documentation. In Section 14.3.5, we show an example where we create a Markdown file manually.
They may be useful if we wish to process large
files on a chunk-by-chunk basis.
In particular, pandas.read_csv
accepts a file handler (see open).
Then, passing the
nrows argument we can indicate the number of rows
Find an example of an XML and JSON file. Which one is more human-readable? Do they differ in terms of capabilities?
What is wrong with constructing file paths like
"~" + "\\" + "filename.csv"?
What are the benefits of using a SQL database management system in data science activities?
(*) How can we populate a database with gigabytes of data read from many CSV files?