13. Accessing Databases (An Interlude)

pandas is convenient for working with data that fit into memory and that can be stored in individual CSV files. However, larger information banks, possibly in a shared environment, will often be made available to us via relational (structured) databases such as PostgreSQL or MariaDB, or one of the wide range of commercial products.

Most commonly, we will use SQL (Structured Query Language) to specify the data chunks we need for analysis, fetch them via a database driver, and import them into a pandas data frame so as to perform the already well-known operations.

Below we give a quick introduction to the basics of SQL using SQLite, which is a lightweight, flat-file, and server-less database management system. Overall, SQLite is sensible even for data hundreds or thousands of gigabytes in size, as long as they fit on a single computer’s disk (which makes it a good choice for playing with our own data science projects or for prototyping more complex solutions).

Important

The syntax of SQL is very readable – it is modelled after the natural (English) language. In this introduction, we merely assume that we wish to learn how to read the basic SQL queries, not to write our own. The latter should be covered by a separate course on databases.

13.1. Example Database

In the course of this section, we will be working with a simplified data dump of the Q&A site Travel Stack Exchange, which was downloaded on 2017-10-31.

Exercise 13.1

Before continuing, make yourself familiar with the Travel Stack Exchange site.

For example, Tags gives, amongst others, topic categories (TagName) and how many questions mention them (Count):

Tags = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/travel_stackexchange_com_2017/Tags.csv.gz",
    comment="#")
Tags.head(3)
##    Count  ExcerptPostId  Id    TagName  WikiPostId
## 0    104         2138.0   1   cruising      2137.0
## 1     43          357.0   2  caribbean       356.0
## 2     43          319.0   4  vacations       318.0

Users provides information on the registered users. In particular, the Id column uniquely identifies each member.

Users = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/travel_stackexchange_com_2017/Users.csv.gz",
    comment="#")
Users.head(3)
##    AccountId   Age             CreationDate  ... Reputation  UpVotes  Views
## 0       -1.0   NaN  2011-06-21T15:16:44.253  ...        1.0   2472.0    0.0
## 1        2.0  40.0  2011-06-21T20:10:03.720  ...      101.0      1.0   31.0
## 2     7598.0  32.0  2011-06-21T20:11:02.490  ...      101.0      1.0   14.0
## 
## [3 rows x 11 columns]

Badges recalls all the rewards given to the users (UserId) for engaging in various praiseworthy activities:

Badges = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/travel_stackexchange_com_2017/Badges.csv.gz",
    comment="#")
Badges.head(3)
##    Class                     Date  Id            Name  TagBased  UserId
## 0      3  2011-06-21T20:16:48.910   1  Autobiographer     False       2
## 1      3  2011-06-21T20:16:48.910   2  Autobiographer     False       3
## 2      3  2011-06-21T20:16:48.910   3  Autobiographer     False       4

Posts (note again that Id is the unique identifier of each record in every data frame) lists all the questions and answers; for the latter, ParentId is not missing: it identifies the question being answered.

Posts = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/travel_stackexchange_com_2017/Posts.csv.gz",
    comment="#")
Posts.head(3)
##    AcceptedAnswerId  ...  ViewCount
## 0             393.0  ...      419.0
## 1               NaN  ...     1399.0
## 2               NaN  ...        NaN
## 
## [3 rows x 17 columns]

Votes lists all the up-votes (VoteTypeId equal to 2) and down-votes (VoteTypeId equal to 3) across all the posts. For privacy reasons, all UserIds are missing (encoded as not-a-number; compare Chapter 15).

Votes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching_data/master/travel_stackexchange_com_2017/Votes.csv.gz",
    comment="#")
Votes.head(3)
##    BountyAmount             CreationDate  Id  PostId  UserId  VoteTypeId
## 0           NaN  2011-06-21T00:00:00.000   1       1     NaN           2
## 1           NaN  2011-06-21T00:00:00.000   2       1     NaN           2
## 2           NaN  2011-06-21T00:00:00.000   3       2     NaN           2
Exercise 13.2

See the README file for the detailed description of each column.

13.2. Exporting Data

Let us establish a connection with a new SQLite database. In our case, it will be an ordinary file stored on the computer’s disk:

import tempfile, os.path
dbfile = os.path.join(tempfile.mkdtemp(), "travel.db")
print(dbfile)
## /tmp/tmpkv8_qvz8/travel.db

The above gives the file path (compare Section 13.6.1) where the database is going to be stored. We use a randomly generated filename inside the local file system’s temporary directory (we are on Linux, hence /tmp), because this is just a simple exercise and we will not need the database afterwards.

And now:

import sqlite3
conn = sqlite3.connect(dbfile)

We are thus “connected”. The database can now be queried: new tables can be added, new records inserted, and information fetched.

Important

At the end, we should not forget about the call to conn.close().
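
As a side note, the standard library’s contextlib.closing wrapper guarantees that close is invoked even if an exception is raised along the way. Here is a minimal sketch of this pattern (which we do not rely on in this chapter):

from contextlib import closing
with closing(sqlite3.connect(dbfile)) as c:
    pass  # ...query the database via `c`; it is closed automatically on exit...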

We already have our data in the form of pandas data frames; therefore, exporting them to the database is straightforward: all we need to do is make a series of calls to the pandas.DataFrame.to_sql method.

Tags.to_sql("Tags", conn, index=False)
Users.to_sql("Users", conn, index=False)
Badges.to_sql("Badges", conn, index=False)
Posts.to_sql("Posts", conn, index=False)
Votes.to_sql("Votes", conn, index=False)

Note

(*) It is possible to export data that do not fit into memory by reading them in chunks of considerable, yet not too large, sizes. In particular, pandas.read_csv has the nrows argument that lets us read a given number of rows from an open file buffer; see help("open"). Then, pandas.DataFrame.to_sql(..., if_exists="append") can be used to append new rows to an existing table.
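
For illustration, here is a rough sketch based on the related chunksize argument to pandas.read_csv, which yields consecutive portions of a file as separate data frames; the input file big.csv and the table name t are, of course, hypothetical:

for chunk in pd.read_csv("big.csv", chunksize=100_000, comment="#"):
    chunk.to_sql("t", conn, if_exists="append", index=False)  # append each portion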

Exporting data can, of course, be done without pandas as well, e.g., when they are fetched from XML or JSON files (compare Section 13.5) and processed manually, row by row. Intermediate-level SQL users can call conn.execute("CREATE TABLE t..."), then conn.executemany("INSERT INTO t VALUES(?, ?, ?)", l), and finally conn.commit() to create a new table (here: named t) populated by a list of records (e.g., in the form of tuples or numpy vectors). For more details, see the manual of the sqlite3 package.
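
For example, here is a minimal sketch; the table t and the records inserted below are made up solely for illustration:

conn.execute("CREATE TABLE t (a INTEGER, b REAL, c TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 3.14, "spam"), (2, 2.71, "bacon")]  # a list of records (tuples)
)
conn.commit()  # write the pending changes to the database
conn.execute("DROP TABLE t")  # remove the illustratory table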

13.3. Exercises on SQL vs pandas

We can use pandas to fetch the results generated by executing any SQL query, for example:

pd.read_sql_query("""
    SELECT * FROM Tags LIMIT 3
""", conn)
##    Count  ExcerptPostId  Id    TagName  WikiPostId
## 0    104         2138.0   1   cruising      2137.0
## 1     43          357.0   2  caribbean       356.0
## 2     43          319.0   4  vacations       318.0

The above fetches all columns (SELECT *) and the first three rows (LIMIT 3) from the Tags table.

Exercise 13.3

For the above and all the following SQL queries, write the equivalent Python code using pandas functions and methods that generates exactly the same result. In each case, there might be more than one equally fine solution.

Let us emphasise again that the SQL syntax is very readable, hence no explanations should be necessary for the queries below. In case of any doubt, though, refer to the manual.

Example solutions are provided at the end of this section.

Example 13.4

For instance, here is the reference result:

res1a = pd.read_sql_query("""
    SELECT * FROM Tags LIMIT 3
""", conn)

Our example equivalent pandas implementation of the above operation might look like:

res1b = Tags.head(3)

To verify if the results are identical, we can call:

pd.testing.assert_frame_equal(res1a, res1b)

Sometimes, however, the results generated by pandas will be the same only up to the reordering of rows. In such a case, before calling pandas.testing.assert_frame_equal, it might thus be a good idea to call DataFrame.sort_values on both data frames to sort them with respect to one or two chosen columns. The good thing about the assert_frame_equal function is that it ignores small round-off errors introduced by some operations.
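
For instance, in the following (artificial) illustration, two differently ordered copies of the same three rows are deemed identical once both are sorted and have their indexes reset:

a = Tags.head(3).sort_values("TagName").reset_index(drop=True)
b = Tags.head(3).iloc[[2, 0, 1], :].sort_values("TagName").reset_index(drop=True)
pd.testing.assert_frame_equal(a, b)  # no error: the two frames are equal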

13.3.1. Filtering

Exercise 13.5

Fetch all tag data for the tags whose names contain "europe" as a substring:

res2a = pd.read_sql_query("""
    SELECT * FROM Tags WHERE TagName LIKE '%europe%'
""", conn)
res2a
##    Count  ExcerptPostId    Id           TagName  WikiPostId
## 0    681          286.0    30            europe       285.0
## 1     27         2384.0   174    eastern-europe      2383.0
## 2     12         6403.0   512  southeast-europe      6402.0
## 3     20         4513.0   669    central-europe      4512.0
## 4     36        16392.0  1950    western-europe     16391.0

Hint: use Tags.TagName.str.contains("europe") to obtain the same result with pandas.

Exercise 13.6

From Tags, select the two columns TagName and Count, as well as the rows for which TagName is equal to one of the three choices provided:

res3a = pd.read_sql_query("""
    SELECT TagName, Count
    FROM Tags
    WHERE TagName IN ('poland', 'australia', 'china')
""", conn)
res3a
##      TagName  Count
## 0      china    443
## 1  australia    411
## 2     poland    139

Hint: use pandas.Series.isin.

Exercise 13.7

Select a number of columns from Posts whose rows fulfil given conditions:

res4a = pd.read_sql_query("""
    SELECT Title, Score, ViewCount, FavoriteCount
    FROM Posts
    WHERE PostTypeId=1 AND
        ViewCount>=10000 AND
        FavoriteCount BETWEEN 35 AND 100
""", conn)
res4a
##                                                Title  ...  FavoriteCount
## 0  When traveling to a country with a different c...  ...           35.0
## 1         How can I do a "broad" search for flights?  ...           49.0
## 2  Tactics to avoid getting harassed by corrupt p...  ...           42.0
## 3  Flight tickets: buy two weeks before even duri...  ...           36.0
## 4  OK we're all adults here, so really, how on ea...  ...           79.0
## 5  How to intentionally get denied entry to the U...  ...           53.0
## 6  How do you know if Americans genuinely/literal...  ...           79.0
## 7  OK, we are all adults here, so what is a bidet...  ...           38.0
## 8          How to cope with too slow Wi-Fi at hotel?  ...           41.0
## 
## [9 rows x 4 columns]

13.3.2. Ordering

Exercise 13.8

Select the Title and Score columns from Posts for rows where ParentId is missing (i.e., the post is actually a question) and Title is well-defined. Then, sort the results by the Score column in descending order. Finally, return only the first five rows (i.e., the top five scoring questions):

res5a = pd.read_sql_query("""
    SELECT Title, Score
    FROM Posts
    WHERE ParentId IS NULL AND Title IS NOT NULL
    ORDER BY Score DESC
    LIMIT 5
""", conn)
res5a
##                                                Title  Score
## 0  OK we're all adults here, so really, how on ea...    306
## 1  How do you know if Americans genuinely/literal...    254
## 2  How to intentionally get denied entry to the U...    219
## 3  Why are airline passengers asked to lift up wi...    210
## 4                       Why prohibit engine braking?    178

Hint: use pandas.DataFrame.sort_values and numpy.isnan or pandas.isnull.

13.3.3. Removing Duplicates

Exercise 13.9

Get all unique badge names for the user with Id=23.

res6a = pd.read_sql_query("""
    SELECT DISTINCT Name
    FROM Badges
    WHERE UserId=23
""", conn)
res6a
##                 Name
## 0          Supporter
## 1            Student
## 2            Teacher
## 3            Scholar
## 4               Beta
## 5      Nice Question
## 6             Editor
## 7        Nice Answer
## 8           Yearling
## 9   Popular Question
## 10        Taxonomist
## 11  Notable Question

Hint: use pandas.DataFrame.drop_duplicates.

Exercise 13.10

For each badge given to the user with Id=23, extract the year it was granted and store it in a new column named Year. Then, select the unique (Name, Year) pairs.

res7a = pd.read_sql_query("""
    SELECT DISTINCT
        Name,
        CAST(strftime('%Y', Date) AS FLOAT) AS Year
    FROM Badges
    WHERE UserId=23
""", conn)
res7a
##                 Name    Year
## 0          Supporter  2011.0
## 1            Student  2011.0
## 2            Teacher  2011.0
## 3            Scholar  2011.0
## 4               Beta  2011.0
## 5      Nice Question  2011.0
## 6             Editor  2012.0
## 7        Nice Answer  2012.0
## 8           Yearling  2012.0
## 9      Nice Question  2012.0
## 10     Nice Question  2013.0
## 11          Yearling  2013.0
## 12  Popular Question  2014.0
## 13          Yearling  2014.0
## 14        Taxonomist  2014.0
## 15  Notable Question  2015.0
## 16     Nice Question  2017.0

Hint: use Badges.Date.astype("datetime64").dt.strftime("%Y").astype("float").

13.3.4. Grouping and Aggregating

Exercise 13.11

Count how many badges of each type the user with Id=23 has received. Also, for each badge type, return the minimal, average, and maximal receiving year. Return only the top 4 badges (with respect to the counts).

res8a = pd.read_sql_query("""
    SELECT
        Name,
        COUNT(*) AS Count,
        MIN(CAST(strftime('%Y', Date) AS FLOAT)) AS MinYear,
        AVG(CAST(strftime('%Y', Date) AS FLOAT)) AS MeanYear,
        MAX(CAST(strftime('%Y', Date) AS FLOAT)) AS MaxYear
    FROM Badges
    WHERE UserId=23
    GROUP BY Name
    ORDER BY Count DESC, MeanYear ASC
    LIMIT 4
""", conn)
res8a
##                Name  Count  MinYear  MeanYear  MaxYear
## 0     Nice Question      4   2011.0   2013.25   2017.0
## 1          Yearling      3   2012.0   2013.00   2014.0
## 2  Popular Question      3   2014.0   2014.00   2014.0
## 3  Notable Question      2   2015.0   2015.00   2015.0
Exercise 13.12

Similarly to the above, but now count the number of occurrences of each unique (Name, Year) pair; then, return only the rows having Count greater than 1 and order the results by Count, decreasingly.

res9a = pd.read_sql_query("""
    SELECT
        Name,
        CAST(strftime('%Y', Date) AS FLOAT) AS Year,
        COUNT(*) AS Count
    FROM Badges
    WHERE UserId=23
    GROUP BY Name, Year
    HAVING Count > 1
    ORDER BY Count DESC
""", conn)
res9a
##                Name    Year  Count
## 0  Popular Question  2014.0      3
## 1  Notable Question  2015.0      2

Note that WHERE is performed before GROUP BY, and HAVING is applied thereafter.

13.3.5. Joining

Exercise 13.13

Join (merge) Tags, Posts, and Users for all posts with OwnerUserId not equal to -1 (i.e., the tags which were created by “alive” users). Return the top 6 records with respect to Tags.Count.

res10a = pd.read_sql_query("""
    SELECT Tags.TagName, Tags.Count, Posts.OwnerUserId,
        Users.Age, Users.Location, Users.DisplayName
    FROM Tags
    JOIN Posts ON Posts.Id=Tags.WikiPostId
    JOIN Users ON Users.AccountId=Posts.OwnerUserId
    WHERE OwnerUserId != -1
    ORDER BY Tags.Count DESC, Tags.TagName ASC
    LIMIT 6
""", conn)
res10a
##          TagName  Count  ...          Location       DisplayName
## 0         canada    802  ...     Mumbai, India             hitec
## 1         europe    681  ...  Philadelphia, PA       Adam Tuttle
## 2  visa-refusals    554  ...      New York, NY  Benjamin Pollack
## 3      australia    411  ...     Mumbai, India             hitec
## 4             eu    204  ...  Philadelphia, PA       Adam Tuttle
## 5  new-york-city    204  ...     Mumbai, India             hitec
## 
## [6 rows x 6 columns]
Exercise 13.14

First, create an auxiliary (temporary) table named UpVotesTab, where we store the information about the number of up-votes (VoteTypeId=2) that each post has received. Then, join (merge) this table with Posts and fetch some details about the 5 questions (PostTypeId=1) with the most up-votes:

res11a = pd.read_sql_query("""
    SELECT UpVotesTab.*, Posts.Title FROM
    (
        SELECT PostId, COUNT(*) AS UpVotes
        FROM Votes
        WHERE VoteTypeId=2
        GROUP BY PostId
    ) AS UpVotesTab
    JOIN Posts ON UpVotesTab.PostId=Posts.Id
    WHERE Posts.PostTypeId=1
    ORDER BY UpVotesTab.UpVotes DESC LIMIT 5
""", conn)
res11a
##    PostId  UpVotes                                              Title
## 0    3080      307  OK we're all adults here, so really, how on ea...
## 1   38177      254  How do you know if Americans genuinely/literal...
## 2   24540      221  How to intentionally get denied entry to the U...
## 3   20207      211  Why are airline passengers asked to lift up wi...
## 4   96447      178                       Why prohibit engine braking?

13.3.6. Solutions to Exercises

Below are example solutions to the above exercises.

Example 13.15
res2b = (
    Tags.
    loc[Tags.TagName.str.contains("europe"), :].
    reset_index(drop=True)
)
res2b
##    Count  ExcerptPostId    Id           TagName  WikiPostId
## 0    681          286.0    30            europe       285.0
## 1     27         2384.0   174    eastern-europe      2383.0
## 2     12         6403.0   512  southeast-europe      6402.0
## 3     20         4513.0   669    central-europe      4512.0
## 4     36        16392.0  1950    western-europe     16391.0
pd.testing.assert_frame_equal(res2a, res2b)
Example 13.16
res3b = (
    Tags.
    loc[
        Tags.TagName.isin(["poland", "australia", "china"]),
        ["TagName", "Count"]
    ].
    reset_index(drop=True)
)
res3b
##      TagName  Count
## 0      china    443
## 1  australia    411
## 2     poland    139
pd.testing.assert_frame_equal(res3a, res3b)
Example 13.17
res4b = (
    Posts.
    loc[
        (Posts.PostTypeId == 1) & (Posts.ViewCount >= 10000) &
        (Posts.FavoriteCount >= 35) & (Posts.FavoriteCount <= 100),
        ["Title", "Score", "ViewCount", "FavoriteCount"]
    ].
    reset_index(drop=True)
)
res4b
##                                                Title  ...  FavoriteCount
## 0  When traveling to a country with a different c...  ...           35.0
## 1         How can I do a "broad" search for flights?  ...           49.0
## 2  Tactics to avoid getting harassed by corrupt p...  ...           42.0
## 3  Flight tickets: buy two weeks before even duri...  ...           36.0
## 4  OK we're all adults here, so really, how on ea...  ...           79.0
## 5  How to intentionally get denied entry to the U...  ...           53.0
## 6  How do you know if Americans genuinely/literal...  ...           79.0
## 7  OK, we are all adults here, so what is a bidet...  ...           38.0
## 8          How to cope with too slow Wi-Fi at hotel?  ...           41.0
## 
## [9 rows x 4 columns]
pd.testing.assert_frame_equal(res4a, res4b)
Example 13.18
res5b = (
    Posts.
    loc[
        Posts.ParentId.isna() & (~Posts.Title.isna()),
        ["Title", "Score"]
    ].
    sort_values("Score", ascending=False).
    head(5).
    reset_index(drop=True)
)
res5b
##                                                Title  Score
## 0  OK we're all adults here, so really, how on ea...    306
## 1  How do you know if Americans genuinely/literal...    254
## 2  How to intentionally get denied entry to the U...    219
## 3  Why are airline passengers asked to lift up wi...    210
## 4                       Why prohibit engine braking?    178
pd.testing.assert_frame_equal(res5a, res5b)
Example 13.19
res6b = (
    Badges.
    loc[Badges.UserId == 23, ["Name"]].
    drop_duplicates().
    reset_index(drop=True)
)
res6b
##                 Name
## 0          Supporter
## 1            Student
## 2            Teacher
## 3            Scholar
## 4               Beta
## 5      Nice Question
## 6             Editor
## 7        Nice Answer
## 8           Yearling
## 9   Popular Question
## 10        Taxonomist
## 11  Notable Question
pd.testing.assert_frame_equal(res6a, res6b)
Example 13.20
Badges2 = Badges.copy()
Badges2.loc[:, "Year"] = (
    Badges2.Date.astype("datetime64").dt.strftime("%Y").astype("float")
)
res7b = (
    Badges2.
    loc[Badges2.UserId == 23, ["Name", "Year"]].
    drop_duplicates().
    reset_index(drop=True)
)
res7b
##                 Name    Year
## 0          Supporter  2011.0
## 1            Student  2011.0
## 2            Teacher  2011.0
## 3            Scholar  2011.0
## 4               Beta  2011.0
## 5      Nice Question  2011.0
## 6             Editor  2012.0
## 7        Nice Answer  2012.0
## 8           Yearling  2012.0
## 9      Nice Question  2012.0
## 10     Nice Question  2013.0
## 11          Yearling  2013.0
## 12  Popular Question  2014.0
## 13          Yearling  2014.0
## 14        Taxonomist  2014.0
## 15  Notable Question  2015.0
## 16     Nice Question  2017.0
pd.testing.assert_frame_equal(res7a, res7b)
Example 13.21
Badges2 = Badges.copy()
Badges2.loc[:, "Year"] = (
    Badges2.Date.astype("datetime64").dt.strftime("%Y").astype("float")
)
res8b = (
    Badges2.
    loc[Badges2.UserId == 23, ["Name", "Year"]].
    groupby("Name")["Year"].
    aggregate([len, np.min, np.mean, np.max]).
    sort_values(["len", "mean"], ascending=[False, True]).
    head(4).
    reset_index()
)
res8b.columns = ["Name", "Count", "MinYear", "MeanYear", "MaxYear"]
res8b
##                Name  Count  MinYear  MeanYear  MaxYear
## 0     Nice Question      4   2011.0   2013.25   2017.0
## 1          Yearling      3   2012.0   2013.00   2014.0
## 2  Popular Question      3   2014.0   2014.00   2014.0
## 3  Notable Question      2   2015.0   2015.00   2015.0

Had we not converted Year to float, we would have obtained a meaningless average year, without any warning.

pd.testing.assert_frame_equal(res8a, res8b)
Example 13.22
Badges2 = Badges.copy()
Badges2.loc[:, "Year"] = (
    Badges2.Date.astype("datetime64").dt.strftime("%Y").astype("float")
)
res9b = (
    Badges2.
    loc[ Badges2.UserId == 23, ["Name", "Year"] ].
    groupby(["Name", "Year"]).
    size().
    rename("Count").
    reset_index()
)
res9b = (
    res9b.
    loc[ res9b.Count > 1, : ].
    sort_values("Count", ascending=False).
    reset_index(drop=True)
)
res9b
##                Name    Year  Count
## 0  Popular Question  2014.0      3
## 1  Notable Question  2015.0      2

The HAVING part is performed after WHERE and GROUP BY.

pd.testing.assert_frame_equal(res9a, res9b)
Example 13.23
res10b = pd.merge(Posts, Tags, left_on="Id", right_on="WikiPostId")
res10b = pd.merge(Users, res10b, left_on="AccountId", right_on="OwnerUserId")
res10b = (
    res10b.
    loc[
        (res10b.OwnerUserId != -1) & (~res10b.OwnerUserId.isna()),
        ["TagName", "Count", "OwnerUserId", "Age", "Location", "DisplayName"]
    ].
    sort_values(["Count", "TagName"], ascending=[False, True]).
    head(6).
    reset_index(drop=True)
)
res10b
##          TagName  Count  ...          Location       DisplayName
## 0         canada    802  ...     Mumbai, India             hitec
## 1         europe    681  ...  Philadelphia, PA       Adam Tuttle
## 2  visa-refusals    554  ...      New York, NY  Benjamin Pollack
## 3      australia    411  ...     Mumbai, India             hitec
## 4             eu    204  ...  Philadelphia, PA       Adam Tuttle
## 5  new-york-city    204  ...     Mumbai, India             hitec
## 
## [6 rows x 6 columns]

In SQL, a comparison like OwnerUserId != -1 never holds when OwnerUserId IS NULL; hence, missing values are filtered out implicitly, whereas in pandas we must exclude them explicitly.

pd.testing.assert_frame_equal(res10a, res10b)
Example 13.24

To obtain res11b, we first create an auxiliary data frame that corresponds to the subquery.

UpVotesTab = (
    Votes.
    loc[Votes.VoteTypeId==2, :].
    groupby("PostId").
    size().
    rename("UpVotes").
    reset_index()
)

And now:

res11b = pd.merge(UpVotesTab, Posts, left_on="PostId", right_on="Id")
res11b = (
    res11b.
    loc[res11b.PostTypeId==1, ["PostId", "UpVotes", "Title"]].
    sort_values("UpVotes", ascending=False).
    head(5).
    reset_index(drop=True)
)
res11b
##    PostId  UpVotes                                              Title
## 0    3080      307  OK we're all adults here, so really, how on ea...
## 1   38177      254  How do you know if Americans genuinely/literal...
## 2   24540      221  How to intentionally get denied entry to the U...
## 3   20207      211  Why are airline passengers asked to lift up wi...
## 4   96447      178                       Why prohibit engine braking?
pd.testing.assert_frame_equal(res11a, res11b)

13.4. Closing the Database Connection

We have said we should not forget about:

conn.close()

This gives some sense of closure. Such a relief.

13.5. Common Data Serialisation Formats for the Web

CSV files are an all-round way to exchange tabular data between different programming and data analysis environments.

For unstructured or non-tabularly-structured data, XML as well as JSON (and its superset, YAML) are the common formats of choice, especially for communicating with different Web APIs.

Important

It is recommended that we solve some of the following exercises to make sure we can fetch data in these formats. Warning: often, this will be quite tedious labour, neither art nor science; see also [vdLdJ18] and [DJ03]. Good luck.

Exercise 13.25

Consider the Web API for accessing the on-street parking bay sensor data in Melbourne, VIC, Australia. Using, for example, the json package, convert the data in the JSON format to a pandas data frame.
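
As a hint, the JSON-to-data-frame workflow might be sketched as follows; naturally, the inline records below are made up, and the real API returns a similar (albeit much longer) list:

import json
records = json.loads("""
    [
        {"bay_id": 1, "status": "Present"},
        {"bay_id": 2, "status": "Unoccupied"}
    ]
""")  # parse JSON: here, a list of key-value mappings
pd.DataFrame(records)
##    bay_id      status
## 0       1     Present
## 1       2  Unoccupied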

Exercise 13.26

The Australian Radiation Protection and Nuclear Safety Agency publishes UV data for different Aussie cities. Using, for example, the xml package, convert this XML dataset to a pandas data frame.

Exercise 13.27

(*) Check out the English Wikipedia article featuring a list of 20th-century classical composers. Using pandas.read_html, convert the table included therein to a data frame.

Exercise 13.28

(*) Using, for example, the lxml package, write a function that converts each bullet list featured in a given Wikipedia article (e.g., this one) to a list of strings.

Exercise 13.29

(**) Import an archived version of a Stack Exchange site that you find interesting and store it in an SQLite database. You can find the relevant data dumps here.

Exercise 13.30

(**) Download and then import an archived version of one of the wikis hosted by the Wikimedia Foundation (e.g., the whole English Wikipedia) so that it can be stored in an SQLite database.

13.6. Working with Many Files

For the mass-processing of many files, it is worth knowing the most basic functions for dealing with file paths, searching for files, and so forth. Usually, we will be searching the internet for ways to complete the specific tasks at hand, e.g., how to read data from a ZIP archive.

13.6.1. File Paths

UNIX-like operating systems, including GNU/Linux and macOS, use slashes, `/`, as path separators, e.g., "/home/marek/file.csv". Windows, however, uses backslashes, `\`, which have a special meaning in character strings (escape sequences, see Section 2.1.3), therefore they should be input as, e.g., "c:\\users\\marek\\file.csv" or r"c:\users\marek\file.csv" (note the r prefix).
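
To verify that the two notations denote exactly the same string:

"c:\\users\\marek\\file.csv" == r"c:\users\marek\file.csv"
## True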

When constructing file paths programmatically, it is thus best to rely on os.path.join, which takes care of the system-specific nuances.

import os.path
os.path.join("~", "Desktop", "file.csv")  # we are on GNU/Linux
## '~/Desktop/file.csv'

The tilde, `~`, denotes the current user’s home directory.

Note

For storing auxiliary data, we will often be using the system’s temporary directory. See the tempfile module for functions that generate appropriate file paths therein.

For instance, a subdirectory inside the temporary directory can be created via a call to tempfile.mkdtemp.

Important

We will frequently be referring to file paths relative to the working directory of the currently executed Python session (e.g., the one from which the IPython/Jupyter notebook server was started). The latter can be read by calling os.getcwd.

And thus, all non-absolute file names (ones that do not start with `~`, `/`, `c:\\`, and the like), for example, "filename.csv" or os.path.join("subdir", "filename.csv"), are always relative to the current working directory.

For instance, if the working directory is "/home/marek/projects/python", then "filename.csv" refers to "/home/marek/projects/python/filename.csv".

Also, `..` denotes the current working directory’s parent directory. Thus, "../filename2.csv" resolves to "/home/marek/projects/filename2.csv".
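
We can preview such resolutions by calling os.path.abspath, which converts any path to an absolute one; for instance, assuming the working directory mentioned above:

os.path.abspath(os.path.join("..", "filename2.csv"))
## '/home/marek/projects/filename2.csv'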

Exercise 13.31

Print the result returned by os.getcwd. Next, download the file air_quality_2018_param.csv and save it in the current Python session’s working directory (e.g., in your web browser, right-click on the web page’s canvas and select Save Page As…). Load it as a pandas data frame by passing simply "air_quality_2018_param.csv" as the input path.

Exercise 13.32

(*) Download the aforementioned file programmatically (if you have not done so) using the requests module.
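
As a hint, such a download might be sketched as follows; the URL placeholder must be replaced with the file’s actual address:

import requests
url = "..."  # the actual address of air_quality_2018_param.csv goes here
r = requests.get(url)
r.raise_for_status()  # fail early if the request was unsuccessful
with open("air_quality_2018_param.csv", "wb") as f:
    f.write(r.content)  # save the raw bytes in the working directory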

13.7. Exercises

Exercise 13.34

Find an example of an XML and JSON file. Which one is more human-readable? Do they differ in terms of capabilities?

Exercise 13.35

What is wrong with constructing file paths like "~" + "\\" + "filename.csv"?

Exercise 13.36

What are the benefits of using an SQL-supporting relational database management system in data science activities?

Exercise 13.37

(**) How to populate a database with gigabytes of data read from many CSV files?