Minimalist Data Wrangling with Python

Marek Gagolewski

doi:10.5281/zenodo.6451068

13. Accessing databases¶

This open-access textbook is, and will remain, freely available for everyone’s enjoyment (also in PDF; a paper copy can also be ordered). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out Deep R Programming [36] too.

pandas is convenient for working with data that fit into memory and which can be stored in individual CSV files. Still, larger information banks in a shared environment will often be made available to us via relational (structured) databases such as PostgreSQL or MariaDB, or a wide range of commercial products.

Most commonly, we use SQL (Structured Query Language) to define the data chunks[1] we want to analyse. Then, we fetch them from the database driver in the form of a pandas data frame. This enables us to perform the operations we are already familiar with, e.g., various transformations or visualisations.

Below we make a quick introduction to the basics of SQL using SQLite, which is a lightweight, flat-file, and server-less open-source database management system. Overall, SQLite is a sensible choice for data of even hundreds or thousands of gigabytes in size that fit on a single computer’s disk. This is more than enough for playing with our data science projects or prototyping more complex solutions.

Important

In this chapter, we will learn that the syntax of SQL is very readable: it is modelled after the natural (English) language. The purpose of this introduction is not to compose own queries nor to design new databanks. The latter is covered by a separate course on database systems; see, e.g., [17, 21].

13.1. Example database¶

In this chapter, we will be working with a simplified data dump of the Q&A site Travel Stack Exchange, which we downloaded on 2017-10-31. It consists of five separate data frames.

First, Tags gives, amongst others, topic categories (TagName) and how many questions mention them (Count):

Tags = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/travel_stackexchange_com_2017/Tags.csv.gz",
    comment="#")
Tags.head(3)
##    Count  ExcerptPostId  Id    TagName  WikiPostId
## 0    104         2138.0   1   cruising      2137.0
## 1     43          357.0   2  caribbean       356.0
## 2     43          319.0   4  vacations       318.0

Second, Users provides information on the registered users.

Users = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/travel_stackexchange_com_2017/Users.csv.gz",
    comment="#")
Users.head(3)
##    AccountId   Age             CreationDate  ... Reputation  UpVotes  Views
## 0       -1.0   NaN  2011-06-21T15:16:44.253  ...        1.0   2472.0    0.0
## 1        2.0  40.0  2011-06-21T20:10:03.720  ...      101.0      1.0   31.0
## 2     7598.0  32.0  2011-06-21T20:11:02.490  ...      101.0      1.0   14.0
## 
## [3 rows x 11 columns]

Third, Badges recalls all rewards handed to the users (UserId) for their engaging in various praiseworthy activities:

Badges = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/travel_stackexchange_com_2017/Badges.csv.gz",
    comment="#")
Badges.head(3)
##    Class                     Date  Id            Name  TagBased  UserId
## 0      3  2011-06-21T20:16:48.910   1  Autobiographer     False       2
## 1      3  2011-06-21T20:16:48.910   2  Autobiographer     False       3
## 2      3  2011-06-21T20:16:48.910   3  Autobiographer     False       4

Fourth, Posts lists all the questions and answers (the latter do not have ParentId set to NaN).

Posts = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/travel_stackexchange_com_2017/Posts.csv.gz",
    comment="#")
Posts.head(3)
##    AcceptedAnswerId  ...  ViewCount
## 0             393.0  ...      419.0
## 1               NaN  ...     1399.0
## 2               NaN  ...        NaN
## 
## [3 rows x 17 columns]

Fifth, Votes list all the up-votes (VoteTypeId equal to 2) and down-votes (VoteTypeId of 3) to all the posts.

Votes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/travel_stackexchange_com_2017/Votes.csv.gz",
    comment="#")
Votes.head(3)
##    BountyAmount             CreationDate  Id  PostId  UserId  VoteTypeId
## 0           NaN  2011-06-21T00:00:00.000   1       1     NaN           2
## 1           NaN  2011-06-21T00:00:00.000   2       1     NaN           2
## 2           NaN  2011-06-21T00:00:00.000   3       2     NaN           2

Exercise 13.1

See the README file for a detailed description of each column. Note that rows are uniquely defined by their respective Ids. They are relations between the data frames, e.g., Users.Id vs Badges.UserId, Posts.Id vs Votes.PostId, etc. Moreover, for privacy reasons, some UserIds might be missing. In such a case, they are encoded with a not-a-number; compare Chapter 15.

13.2. Exporting data to a database¶

Let’s establish a connection with a new SQLite database. In our case, this will be an ordinary file stored on the computer’s disk:

import tempfile, os.path
dbfile = os.path.join(tempfile.mkdtemp(), "travel.db")
print(dbfile)
## /tmp/tmpo5b5kdfp/travel.db

It defines the file path (compare Section 13.6.1) where the database is going to be stored. We use a randomly generated filename inside the local file system’s (we are on Linux) temporary directory, /tmp. This is just a pleasant exercise, and we will not be using this database afterwards. The reader might prefer setting a filename relative to the current working directory (as given by os.getcwd), e.g., dbfile = "travel.db".

We can now connect to the database:

import sqlite3
conn = sqlite3.connect(dbfile)

The database might now be queried: we can add new tables, insert new rows, and retrieve records.

Important

In the end, we must not forget about the call to conn.close().

Our data are already in the form of pandas data frames. Therefore, exporting them to the database is straightforward. We only need to make a series of calls to the pandas.DataFrame.to_sql method.

Tags.to_sql("Tags", conn, index=False)
Users.to_sql("Users", conn, index=False)
Badges.to_sql("Badges", conn, index=False)
Posts.to_sql("Posts", conn, index=False)
Votes.to_sql("Votes", conn, index=False)

Note

(*) It is possible to export data that do not fit into memory by reading them in chunks of considerable, but not too large, sizes. In particular pandas.read_csv has the nrows argument that lets us read several rows from a file connection; see Section 13.6.4. Then, pandas.DataFrame.to_sql(..., if_exists="append") can be used to append new rows to an existing table.

Exporting data can be done without pandas as well, e.g., when they are to be fetched from XML or JSON files (compare Section 13.5) and processed manually, row by row. Intermediate-level SQL users can call conn.execute("CREATE TABLE t..."), followed by conn.executemany("INSERT INTO t VALUES(?, ?, ?)", l), and then conn.commit(). This will create a new table (here: named t) populated by a list of records (e.g., in the form of tuples or numpy vectors). For more details, see the manual of the sqlite3 package.

13.3. Exercises on SQL vs pandas¶

We can use pandas to fetch the results of any SQL query in the form of a data frame. For example:

pd.read_sql_query("""
    SELECT * FROM Tags LIMIT 3
""", conn)
##    Count  ExcerptPostId  Id    TagName  WikiPostId
## 0    104         2138.0   1   cruising      2137.0
## 1     43          357.0   2  caribbean       356.0
## 2     43          319.0   4  vacations       318.0

This query selected all columns (SELECT *) and the first three rows (LIMIT 3) from the Tags table.

Exercise 13.2

For the afore- and all the undermentioned SQL queries, write the equivalent Python code that generates the same result using pandas functions and methods. In each case, there might be more than one equally fine solution. In case of any doubt about the meaning of the queries, please refer to the SQLite documentation. Example solutions are provided at the end of this section.

Example 13.3

For a reference query:

res1a = pd.read_sql_query("""
    SELECT * FROM Tags LIMIT 3
""", conn)

The equivalent pandas implementation might look like:

res1b = Tags.head(3)

To verify that the results are equal, we can call:

pd.testing.assert_frame_equal(res1a, res1b)  # no error == OK

No error message means that the test is passed. The cordial thing about the assert_frame_equal function is that it ignores small round-off errors introduced by arithmetic operations.

Nonetheless, the results generated by pandas might be the same up to the reordering of rows. In such a case, before calling pandas.testing.assert_frame_equal, we can invoke DataFrame.sort_values on both data frames to sort them with respect to 1 or 2 chosen columns.

13.3.1. Filtering¶

Exercise 13.4

From Tags, select two columns TagName and Count and rows for which TagName is equal to one of the three choices provided.

res2a = pd.read_sql_query("""
    SELECT TagName, Count
    FROM Tags
    WHERE TagName IN ('poland', 'australia', 'china')
""", conn)
res2a
##      TagName  Count
## 0      china    443
## 1  australia    411
## 2     poland    139

Hint: use pandas.Series.isin.

Exercise 13.5

Select a set of columns from Posts whose rows fulfil a given set of conditions.

res3a = pd.read_sql_query("""
    SELECT Title, Score, ViewCount, FavoriteCount
    FROM Posts
    WHERE PostTypeId=1 AND
        ViewCount>=10000 AND
        FavoriteCount BETWEEN 35 AND 100
""", conn)
res3a
##                                                Title  ...  FavoriteCount
## 0  When traveling to a country with a different c...  ...           35.0
## 1         How can I do a "broad" search for flights?  ...           49.0
## 2  Tactics to avoid getting harassed by corrupt p...  ...           42.0
## 3  Flight tickets: buy two weeks before even duri...  ...           36.0
## 4  OK we're all adults here, so really, how on ea...  ...           79.0
## 5  How to intentionally get denied entry to the U...  ...           53.0
## 6  How do you know if Americans genuinely/literal...  ...           79.0
## 7  OK, we are all adults here, so what is a bidet...  ...           38.0
## 8          How to cope with too slow Wi-Fi at hotel?  ...           41.0
## 
## [9 rows x 4 columns]

13.3.2. Ordering¶

Exercise 13.6

Select the Title and Score columns from Posts where ParentId is missing (i.e., the post is, in fact, a question) and Title is well-defined. Then, sort the results by the Score column, decreasingly (descending order). Finally, return only the first five rows (e.g., top five scoring questions).

res4a = pd.read_sql_query("""
    SELECT Title, Score
    FROM Posts
    WHERE ParentId IS NULL AND Title IS NOT NULL
    ORDER BY Score DESC
    LIMIT 5
""", conn)
res4a
##                                                Title  Score
## 0  OK we're all adults here, so really, how on ea...    306
## 1  How do you know if Americans genuinely/literal...    254
## 2  How to intentionally get denied entry to the U...    219
## 3  Why are airline passengers asked to lift up wi...    210
## 4                       Why prohibit engine braking?    178

Hint: use pandas.DataFrame.sort_values and numpy.isnan or pandas.isnull.

13.3.3. Removing duplicates¶

Exercise 13.7

Get all unique badge names for the user with Id=23.

res5a = pd.read_sql_query("""
    SELECT DISTINCT Name
    FROM Badges
    WHERE UserId=23
""", conn)
res5a
##                 Name
## 0          Supporter
## 1            Student
## 2            Teacher
## 3            Scholar
## 4               Beta
## 5      Nice Question
## 6             Editor
## 7        Nice Answer
## 8           Yearling
## 9   Popular Question
## 10        Taxonomist
## 11  Notable Question

Hint: use pandas.DataFrame.drop_duplicates.

Exercise 13.8

For each badge handed to the user with Id=23, extract the award year store it in a new column named Year. Then, select only the unique pairs (Name, Year).

res6a = pd.read_sql_query("""
    SELECT DISTINCT
        Name,
        CAST(strftime('%Y', Date) AS FLOAT) AS Year
    FROM Badges
    WHERE UserId=23
""", conn)
res6a
##                 Name    Year
## 0          Supporter  2011.0
## 1            Student  2011.0
## 2            Teacher  2011.0
## 3            Scholar  2011.0
## 4               Beta  2011.0
## 5      Nice Question  2011.0
## 6             Editor  2012.0
## 7        Nice Answer  2012.0
## 8           Yearling  2012.0
## 9      Nice Question  2012.0
## 10     Nice Question  2013.0
## 11          Yearling  2013.0
## 12  Popular Question  2014.0
## 13          Yearling  2014.0
## 14        Taxonomist  2014.0
## 15  Notable Question  2015.0
## 16     Nice Question  2017.0

Hint: use Badges.Date.astype("datetime64[ns]").dt.strftime("%Y").astype("float"); see Chapter 16.

13.3.4. Grouping and aggregating¶

Exercise 13.9

Count how many badges of each type the user with Id=23 won. Also, for each badge type, compute the minimal, average, and maximal receiving year. Return only the top four badges (with respect to the counts).

res7a = pd.read_sql_query("""
    SELECT
        Name,
        COUNT(*) AS Count,
        MIN(CAST(strftime('%Y', Date) AS FLOAT)) AS MinYear,
        AVG(CAST(strftime('%Y', Date) AS FLOAT)) AS MeanYear,
        MAX(CAST(strftime('%Y', Date) AS FLOAT)) AS MaxYear
    FROM Badges
    WHERE UserId=23
    GROUP BY Name
    ORDER BY Count DESC
    LIMIT 4
""", conn)
res7a
##                Name  Count  MinYear  MeanYear  MaxYear
## 0     Nice Question      4   2011.0   2013.25   2017.0
## 1          Yearling      3   2012.0   2013.00   2014.0
## 2  Popular Question      3   2014.0   2014.00   2014.0
## 3  Notable Question      2   2015.0   2015.00   2015.0

Exercise 13.10

Count how many unique combinations of pairs (Name, Year) for the badges won by the user with Id=23 are there. Then, return only the rows having Count greater than 1 and order the results by Count decreasingly. In other words, list the badges received more than once in any given year.

res8a = pd.read_sql_query("""
    SELECT
        Name,
        CAST(strftime('%Y', Date) AS FLOAT) AS Year,
        COUNT(*) AS Count
    FROM Badges
    WHERE UserId=23
    GROUP BY Name, Year
    HAVING Count > 1
    ORDER BY Count DESC
""", conn)
res8a
##                Name    Year  Count
## 0  Popular Question  2014.0      3
## 1  Notable Question  2015.0      2

Note that WHERE is performed before GROUP BY, and HAVING is applied thereafter.

13.3.5. Joining¶

Exercise 13.11

Join (merge) Tags, Posts, and Users for all posts with OwnerUserId not equal to -1 (i.e., the tags which were created by “alive” users). Return the top six records with respect to Tags.Count.

res9a = pd.read_sql_query("""
    SELECT Tags.TagName, Tags.Count, Posts.OwnerUserId,
        Users.Age, Users.Location, Users.DisplayName
    FROM Tags
    JOIN Posts ON Posts.Id=Tags.WikiPostId
    JOIN Users ON Users.AccountId=Posts.OwnerUserId
    WHERE OwnerUserId != -1
    ORDER BY Tags.Count DESC, Tags.TagName ASC
    LIMIT 6
""", conn)
res9a
##          TagName  Count  ...          Location       DisplayName
## 0         canada    802  ...     Mumbai, India             hitec
## 1         europe    681  ...  Philadelphia, PA       Adam Tuttle
## 2  visa-refusals    554  ...      New York, NY  Benjamin Pollack
## 3      australia    411  ...     Mumbai, India             hitec
## 4             eu    204  ...  Philadelphia, PA       Adam Tuttle
## 5  new-york-city    204  ...     Mumbai, India             hitec
## 
## [6 rows x 6 columns]

Exercise 13.12

First, create an auxiliary (temporary) table named UpVotesTab, where we store the information about the number of up-votes (VoteTypeId=2) that each post has received. Then, join (merge) this table with Posts and fetch some details about the five questions (PostTypeId=1) with the most up-votes.

res10a = pd.read_sql_query("""
    SELECT UpVotesTab.*, Posts.Title FROM
    (
        SELECT PostId, COUNT(*) AS UpVotes
        FROM Votes
        WHERE VoteTypeId=2
        GROUP BY PostId
    ) AS UpVotesTab
    JOIN Posts ON UpVotesTab.PostId=Posts.Id
    WHERE Posts.PostTypeId=1
    ORDER BY UpVotesTab.UpVotes DESC LIMIT 5
""", conn)
res10a
##    PostId  UpVotes                                              Title
## 0    3080      307  OK we're all adults here, so really, how on ea...
## 1   38177      254  How do you know if Americans genuinely/literal...
## 2   24540      221  How to intentionally get denied entry to the U...
## 3   20207      211  Why are airline passengers asked to lift up wi...
## 4   96447      178                       Why prohibit engine braking?

13.3.6. Solutions to exercises¶

In this section, we provide example solutions to the above exercises.

Example 13.13

To obtain a result equivalent to res2a, we need simple data filtering only:

res2b = (
    Tags.
    loc[
        Tags.TagName.isin(["poland", "australia", "china"]),
        ["TagName", "Count"]
    ].
    reset_index(drop=True)
)

Let’s verify whether the two data frames are identical:

pd.testing.assert_frame_equal(res2a, res2b)  # no error == OK

Example 13.14

To generate res3a with pandas only, we need some more complex filtering with loc[...]:

res3b = (
    Posts.
    loc[
        (Posts.PostTypeId == 1) & (Posts.ViewCount >= 10000) &
        (Posts.FavoriteCount >= 35) & (Posts.FavoriteCount <= 100),
        ["Title", "Score", "ViewCount", "FavoriteCount"]
    ].
    reset_index(drop=True)
)
pd.testing.assert_frame_equal(res3a, res3b)  # no error == OK

Example 13.15

For res4a, some filtering and sorting is all we need:

res4b = (
    Posts.
    loc[
        Posts.ParentId.isna() & (~Posts.Title.isna()),
        ["Title", "Score"]
    ].
    sort_values("Score", ascending=False).
    head(5).
    reset_index(drop=True)
)
pd.testing.assert_frame_equal(res4a, res4b)  # no error == OK

Example 13.16

The key to res5a is the pandas.DataFrame.drop_duplicates method:

res5b = (
    Badges.
    loc[Badges.UserId == 23, ["Name"]].
    drop_duplicates().
    reset_index(drop=True)
)
pd.testing.assert_frame_equal(res5a, res5b)  # no error == OK

Example 13.17

For res6a, we first need to add a new column to the copy of Badges:

Badges2 = Badges.copy()  # otherwise we would destroy the original object
Badges2.loc[:, "Year"] = (
    Badges2.Date.astype("datetime64[ns]").dt.strftime("%Y").astype("float")
)

Then, we apply some filtering and the removal of duplicated rows:

res6b = (
    Badges2.
    loc[Badges2.UserId == 23, ["Name", "Year"]].
    drop_duplicates().
    reset_index(drop=True)
)
pd.testing.assert_frame_equal(res6a, res6b)  # no error == OK

Example 13.18

For res7a, we can use pandas.DataFrameGroupBy.aggregate:

Badges2 = Badges.copy()
Badges2.loc[:, "Year"] = (
    Badges2.Date.astype("datetime64[ns]").dt.strftime("%Y").astype("float")
)
res7b = (
    Badges2.
    loc[Badges2.UserId == 23, ["Name", "Year"]].
    groupby("Name")["Year"].
    aggregate([len, "min", "mean", "max"]).
    sort_values("len", ascending=False).
    head(4).
    reset_index()
)
res7b.columns = ["Name", "Count", "MinYear", "MeanYear", "MaxYear"]

Had we not converted Year to float, we would obtain a meaningless average year, without any warning.

Unfortunately, the rows in res7a and res7b are ordered differently. For testing, we need to reorder them in the same way:

pd.testing.assert_frame_equal(
    res7a.sort_values(["Name", "Count"]).reset_index(drop=True),
    res7b.sort_values(["Name", "Count"]).reset_index(drop=True)
)  # no error == OK

Example 13.19

For res8a, we first count the number of values in each group:

Badges2 = Badges.copy()
Badges2.loc[:, "Year"] = (
    Badges2.Date.astype("datetime64[ns]").dt.strftime("%Y").astype("float")
)
res8b = (
    Badges2.
    loc[ Badges2.UserId == 23, ["Name", "Year"] ].
    groupby(["Name", "Year"]).
    size().
    rename("Count").
    reset_index()
)

The HAVING part is performed after WHERE and GROUP BY.

res8b = (
    res8b.
    loc[ res8b.Count > 1, : ].
    sort_values("Count", ascending=False).
    reset_index(drop=True)
)
pd.testing.assert_frame_equal(res8a, res8b)  # no error == OK

Example 13.20

To obtain a result equivalent to res9a, we need to merge Posts with Tags, and then merge the result with Users:

res9b = pd.merge(Posts, Tags, left_on="Id", right_on="WikiPostId")
res9b = pd.merge(Users, res9b, left_on="AccountId", right_on="OwnerUserId")

Then, some filtering and sorting will do the trick:

res9b = (
    res9b.
    loc[
        (res9b.OwnerUserId != -1) & (~res9b.OwnerUserId.isna()),
        ["TagName", "Count", "OwnerUserId", "Age", "Location", "DisplayName"]
    ].
    sort_values(["Count", "TagName"], ascending=[False, True]).
    head(6).
    reset_index(drop=True)
)

In SQL, “not equals to -1” implies IS NOT NULL.

pd.testing.assert_frame_equal(res9a, res9b)  # no error == OK

Example 13.21

To obtain a result equivalent to res10a, we first need to create an auxiliary data frame that corresponds to the subquery.

UpVotesTab = (
    Votes.
    loc[Votes.VoteTypeId==2, :].
    groupby("PostId").
    size().
    rename("UpVotes").
    reset_index()
)

And now:

res10b = pd.merge(UpVotesTab, Posts, left_on="PostId", right_on="Id")
res10b = (
    res10b.
    loc[res10b.PostTypeId==1, ["PostId", "UpVotes", "Title"]].
    sort_values("UpVotes", ascending=False).
    head(5).
    reset_index(drop=True)
)
pd.testing.assert_frame_equal(res10a, res10b)  # no error == OK

13.4. Closing the database connection¶

We said we should not forget about:

conn.close()

This gives some sense of closure. Such a relief.

13.5. Common data serialisation formats for the Web¶

CSV files are an all-round way to exchange tabular data between different programming and data analysis environments.

For unstructured or non-tabularly-structured data, XML and JSON (and its superset, YAML) are the common formats of choice, especially for communicating with different Web APIs.

To ensure we can fetch data in these formats, we recommend solving the undermentioned exercises. Sadly, often this will require some tedious labour, neither art nor science; see also [93] and [20].

Exercise 13.22

Consider the Web API for accessing the on-street parking bay sensor data in Melbourne, VIC, Australia. Using the json package, convert the data in the JSON format to a data frame.

Exercise 13.23

Australian Radiation Protection and Nuclear Safety Agency publishes UV data for different Aussie cities. Using the xml package, convert this XML dataset to a data frame.

Exercise 13.24

(*) Check out the English Wikipedia article with a list of 20th-century classical composers. Using pandas.read_html, convert the Climate Data table included therein to a data frame.

Exercise 13.25

(*) Using the lxml package, author a function that converts each bullet list featured in a given Wikipedia article (e.g., this one), to a list of strings.

Exercise 13.26

(**) Import an archived version of a Stack Exchange site that you find interesting and store it in an SQLite database. You can find the relevant data dumps here.

Exercise 13.27

(**) Download and then import an archived version of one of the wikis hosted by the Wikimedia Foundation (e.g., the whole English Wikipedia) so that it can be stored in an SQLite database.

13.6. Working with many files¶

For the mass-processing of many files, it is worth knowing of some functions for dealing with file paths, searching for files, etc. Usually, we will be looking up ways to complete specific tasks at hand, e.g., how to read data from a ZIP-like archive, on the internet. After all, contrary to the basic operations of vectors, matrices, and data frames, they are not amongst the actions that we perform frequently.

Good development practices related to data storage are described in [48].

13.6.1. File paths¶

UNIX-like operating systems, including GNU/Linux and mOS, use slashes, “/”, as path separators, e.g., "/home/marek/file.csv". Win*s, however, uses backslashes, “\”, which have a special meaning in character strings (escape sequences; see Section 2.1.3). Therefore, they should be input as, e.g., "c:\\users\\marek\\file.csv". Alternatively, we can use raw strings, where the backslash is treated literally, e.g., r"c:\users\marek\file.csv".

When constructing file paths programmatically, it is thus best to rely on os.path.join, which takes care of the system-specific nuances.

import os.path
os.path.join("~", "Desktop", "file.csv")  # we are on GNU/Linux
## '~/Desktop/file.csv'

The tilde, “~”, denotes the current user’s home directory.

For storing auxiliary data, we can use the system’s temporary directory. See the tempfile module for functions that generate appropriate file paths therein. For instance, a subdirectory inside the temporary directory can be created via a call to tempfile.mkdtemp.

Important

We will frequently be referring to file paths relative to the working directory of the currently executed Python session (e.g., from which IPython/Jupyter notebook server was started); see os.getcwd.

All non-absolute file names (ones that do not start with “~”, “/”, “c:\\”, and the like), for example, "filename.csv" or os.path.join("subdir", "filename.csv") are always relative to the current working directory.

For instance, if the working directory is "/home/marek/projects/python", then "filename.csv" refers to "/home/marek/projects/python/filename.csv".

Also, “..” denotes the current working directory’s parent directory. Thus, "../filename2.csv" resolves to "/home/marek/projects/filename2.csv".

Exercise 13.28

Print the current working directory by calling os.getcwd. Next, download the file air_quality_2018_param and save it in the current Python session’s working directory (e.g., in your web browser, right-click on the web page’s canvas and select Save Page As…). Load with pandas.read_csv by passing "air_quality_2018_param.csv" as the input path.

Exercise 13.29

(*) Download the aforementioned file programmatically (if you have not done so yet) using the requests module.

13.6.2. File search¶

glob.glob and os.listdir generate a list of files in a given directory (and possibly all its subdirectories).

os.path.isdir and os.path.isfile determine the type of a given object in the file system.

Exercise 13.30

Write a function that computes the total size of all the files in a given directory and all its subdirectories.

13.6.3. Exception handling¶

Accessing resources on the disk or the internet can lead to errors, for example, when the file is not found. The try..except statement can be used if we want to be able to react to any of the envisaged errors

try:
    # statements to execute
    x = pd.read_csv("file_not_found.csv")
    print(x.head())  # not executed if the above raises an error
except OSError:
    # if an exception occurs, we can handle it here
    print("File has not been found")
## File has not been found

For more details, refer to the documentation.

13.6.4. File connections (*)¶

Basic ways of opening and reading from/writing to file connections are described in the Python documentation. Section 14.3.5 shows an example where we create a Markdown file manually.

They may be useful for processing large files chunk by chunk. In particular, pandas.read_csv accepts a file handler (see open and numpy.DataSource). Then, by passing the nrows argument we can request a specific number of records to be read at the current position.

With pandas.to_csv we can also append portions of data frames to a file.

13.7. Further reading¶

pandas is not the only package that brings data frames to the Python world; check out polars. Furthermore, Dask and modin can be helpful with data that do not fit into memory (and when we do not want to rely on sampling or chunking).

13.8. Exercises¶

Exercise 13.31

Find an example of an XML and JSON file. Which one is more human-readable? Do they differ in terms of capabilities?

Exercise 13.32

What is wrong with constructing file paths like "~" + "\\" + "filename.csv"?

Exercise 13.33

What are the benefits of using a SQL database management system in data science activities?

Exercise 13.34

(*) How can we populate a database with gigabytes of data read from many CSV files?