13. Accessing databases
pandas is convenient for working with data that fit into memory and which can be stored in individual CSV files. Still, larger information banks in a shared environment will often be made available to us via relational (structured) databases such as PostgreSQL or MariaDB, or a wide range of commercial products.
Most commonly, we use SQL (Structured Query Language) to define the data chunks[1] we want to analyse. Then, we fetch them from the database driver in the form of a pandas data frame. This enables us to perform the operations we are already familiar with, e.g., various transformations or visualisations.
Below we give a quick introduction to the basics of SQL using SQLite, which is a lightweight, flat-file, serverless, open-source database management system. Overall, SQLite is a sensible choice for data of even hundreds or thousands of gigabytes in size that fit on a single computer’s disk. This is more than enough for playing with our data science projects or prototyping more complex solutions.
Important
In this chapter, we will learn that the syntax of SQL is very readable: it is modelled after the natural (English) language. The purpose of this introduction is not to learn to compose our own queries nor to design new databanks. The latter is covered by a separate course on database systems; see, e.g., [17, 21].
13.1. Example database
In this chapter, we will be working with a simplified data dump of the Q&A site Travel Stack Exchange, which we downloaded on 2017-10-31. It consists of five separate data frames.
First, Tags gives, amongst others, topic categories (TagName) and how many questions mention them (Count):
import pandas as pd  # if not already imported in an earlier chapter
Tags = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/travel_stackexchange_com_2017/Tags.csv.gz",
    comment="#")
Tags.head(3)
## Count ExcerptPostId Id TagName WikiPostId
## 0 104 2138.0 1 cruising 2137.0
## 1 43 357.0 2 caribbean 356.0
## 2 43 319.0 4 vacations 318.0
Second, Users provides information on the registered users.
Users = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/travel_stackexchange_com_2017/Users.csv.gz",
    comment="#")
Users.head(3)
## AccountId Age CreationDate ... Reputation UpVotes Views
## 0 -1.0 NaN 2011-06-21T15:16:44.253 ... 1.0 2472.0 0.0
## 1 2.0 40.0 2011-06-21T20:10:03.720 ... 101.0 1.0 31.0
## 2 7598.0 32.0 2011-06-21T20:11:02.490 ... 101.0 1.0 14.0
##
## [3 rows x 11 columns]
Third, Badges recalls all rewards handed to the users (UserId) for their engaging in various praiseworthy activities:
Badges = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/travel_stackexchange_com_2017/Badges.csv.gz",
    comment="#")
Badges.head(3)
## Class Date Id Name TagBased UserId
## 0 3 2011-06-21T20:16:48.910 1 Autobiographer False 2
## 1 3 2011-06-21T20:16:48.910 2 Autobiographer False 3
## 2 3 2011-06-21T20:16:48.910 3 Autobiographer False 4
Fourth, Posts lists all the questions and answers (the latter do not have ParentId set to NaN).
Posts = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/travel_stackexchange_com_2017/Posts.csv.gz",
    comment="#")
Posts.head(3)
## AcceptedAnswerId ... ViewCount
## 0 393.0 ... 419.0
## 1 NaN ... 1399.0
## 2 NaN ... NaN
##
## [3 rows x 17 columns]
Fifth, Votes lists all the up-votes (VoteTypeId equal to 2) and down-votes (VoteTypeId equal to 3) to all the posts.
Votes = pd.read_csv("https://raw.githubusercontent.com/gagolews/" +
    "teaching-data/master/travel_stackexchange_com_2017/Votes.csv.gz",
    comment="#")
Votes.head(3)
## BountyAmount CreationDate Id PostId UserId VoteTypeId
## 0 NaN 2011-06-21T00:00:00.000 1 1 NaN 2
## 1 NaN 2011-06-21T00:00:00.000 2 1 NaN 2
## 2 NaN 2011-06-21T00:00:00.000 3 2 NaN 2
See the README file for a detailed description of each column. Note that rows are uniquely defined by their respective Ids. There are relations between the data frames, e.g., Users.Id vs Badges.UserId, Posts.Id vs Votes.PostId, etc. Moreover, for privacy reasons, some UserIds might be missing. In such a case, they are encoded with a not-a-number; compare Chapter 15.
13.2. Exporting data to a database
Let’s establish a connection with a new SQLite database. In our case, this will be an ordinary file stored on the computer’s disk:
import tempfile, os.path
dbfile = os.path.join(tempfile.mkdtemp(), "travel.db")
print(dbfile)
## /tmp/tmpo5b5kdfp/travel.db
It defines the file path (compare Section 13.6.1) where the database is going to be stored. We use a randomly generated filename inside the local file system’s temporary directory (we are on Linux, so it is /tmp). This is just a pleasant exercise, and we will not be using this database afterwards. The reader might prefer setting a filename relative to the current working directory (as given by os.getcwd), e.g., dbfile = "travel.db".
We can now connect to the database:
import sqlite3
conn = sqlite3.connect(dbfile)
The database might now be queried: we can add new tables, insert new rows, and retrieve records.
Important
In the end, we must not forget about the call to conn.close().
Our data are already in the form of pandas data frames. Therefore, exporting them to the database is straightforward. We only need to make a series of calls to the pandas.DataFrame.to_sql method.
Tags.to_sql("Tags", conn, index=False)
Users.to_sql("Users", conn, index=False)
Badges.to_sql("Badges", conn, index=False)
Posts.to_sql("Posts", conn, index=False)
Votes.to_sql("Votes", conn, index=False)
Note
(*) It is possible to export data that do not fit into memory by reading them in chunks of considerable, but not too large, sizes. In particular, pandas.read_csv has the nrows argument that lets us read several rows from a file connection; see Section 13.6.4. Then, pandas.DataFrame.to_sql(..., if_exists="append") can be used to append new rows to an existing table.
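For instance, here is a minimal sketch of such a chunked export. The file and table names are hypothetical, and we rely on read_csv’s chunksize parameter (which yields consecutive data frames) instead of calling nrows manually:
import sqlite3
import pandas as pd
conn2 = sqlite3.connect("big.db")  # a hypothetical target database
for chunk in pd.read_csv("big_input.csv", chunksize=10_000):
    # each chunk is an ordinary data frame holding at most 10 000 rows:
    chunk.to_sql("big_table", conn2, index=False, if_exists="append")
conn2.close()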
Exporting data can be done without pandas as well, e.g., when they are to be fetched from XML or JSON files (compare Section 13.5) and processed manually, row by row. Intermediate-level SQL users can call conn.execute("CREATE TABLE t..."), followed by conn.executemany("INSERT INTO t VALUES(?, ?, ?)", l), and then conn.commit(). This will create a new table (here: named t) populated by a list of records (e.g., in the form of tuples or numpy vectors). For more details, see the manual of the sqlite3 package.
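As an illustration, here is a minimal sketch of this approach; the table and the records are made up, and we use a throwaway in-memory database so as not to pollute our travel.db:
import sqlite3
conn2 = sqlite3.connect(":memory:")  # a temporary, in-memory database
conn2.execute("CREATE TABLE t (x INTEGER, y REAL, z TEXT)")
records = [(1, 2.5, "spam"), (2, 3.0, "bacon"), (3, 3.5, "eggs")]
conn2.executemany("INSERT INTO t VALUES (?, ?, ?)", records)
conn2.commit()
conn2.execute("SELECT COUNT(*) FROM t").fetchone()
## (3,)
conn2.close()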
13.3. Exercises on SQL vs pandas
We can use pandas to fetch the results of any SQL query in the form of a data frame. For example:
pd.read_sql_query("""
SELECT * FROM Tags LIMIT 3
""", conn)
## Count ExcerptPostId Id TagName WikiPostId
## 0 104 2138.0 1 cruising 2137.0
## 1 43 357.0 2 caribbean 356.0
## 2 43 319.0 4 vacations 318.0
This query selected all columns (SELECT *) and the first three rows (LIMIT 3) from the Tags table.
For the afore- and all the undermentioned SQL queries, write the equivalent Python code that generates the same result using pandas functions and methods. In each case, there might be more than one equally fine solution. In case of any doubt about the meaning of the queries, please refer to the SQLite documentation. Example solutions are provided at the end of this section.
For a reference query:
res1a = pd.read_sql_query("""
SELECT * FROM Tags LIMIT 3
""", conn)
The equivalent pandas implementation might look like:
res1b = Tags.head(3)
To verify that the results are equal, we can call:
pd.testing.assert_frame_equal(res1a, res1b) # no error == OK
No error message means that the test is passed. The nice thing about the assert_frame_equal function is that it ignores small round-off errors introduced by arithmetic operations.
Nonetheless, the results generated by pandas might be the same up to the reordering of rows. In such a case, before calling pandas.testing.assert_frame_equal, we can invoke DataFrame.sort_values on both data frames to sort them with respect to 1 or 2 chosen columns.
13.3.1. Filtering
From Tags, select two columns, TagName and Count, and the rows for which TagName is equal to one of the three choices provided.
res2a = pd.read_sql_query("""
SELECT TagName, Count
FROM Tags
WHERE TagName IN ('poland', 'australia', 'china')
""", conn)
res2a
## TagName Count
## 0 china 443
## 1 australia 411
## 2 poland 139
Hint: use pandas.Series.isin.
From Posts, select a set of columns and only the rows that fulfil a given set of conditions.
res3a = pd.read_sql_query("""
SELECT Title, Score, ViewCount, FavoriteCount
FROM Posts
WHERE PostTypeId=1 AND
ViewCount>=10000 AND
FavoriteCount BETWEEN 35 AND 100
""", conn)
res3a
## Title ... FavoriteCount
## 0 When traveling to a country with a different c... ... 35.0
## 1 How can I do a "broad" search for flights? ... 49.0
## 2 Tactics to avoid getting harassed by corrupt p... ... 42.0
## 3 Flight tickets: buy two weeks before even duri... ... 36.0
## 4 OK we're all adults here, so really, how on ea... ... 79.0
## 5 How to intentionally get denied entry to the U... ... 53.0
## 6 How do you know if Americans genuinely/literal... ... 79.0
## 7 OK, we are all adults here, so what is a bidet... ... 38.0
## 8 How to cope with too slow Wi-Fi at hotel? ... 41.0
##
## [9 rows x 4 columns]
13.3.2. Ordering
Select the Title and Score columns from Posts where ParentId is missing (i.e., the post is, in fact, a question) and Title is well-defined. Then, sort the results by the Score column, decreasingly (descending order). Finally, return only the first five rows (i.e., the top five scoring questions).
res4a = pd.read_sql_query("""
SELECT Title, Score
FROM Posts
WHERE ParentId IS NULL AND Title IS NOT NULL
ORDER BY Score DESC
LIMIT 5
""", conn)
res4a
## Title Score
## 0 OK we're all adults here, so really, how on ea... 306
## 1 How do you know if Americans genuinely/literal... 254
## 2 How to intentionally get denied entry to the U... 219
## 3 Why are airline passengers asked to lift up wi... 210
## 4 Why prohibit engine braking? 178
Hint: use pandas.DataFrame.sort_values and numpy.isnan or pandas.isnull.
13.3.3. Removing duplicates
Get all unique badge names for the user with Id=23.
res5a = pd.read_sql_query("""
SELECT DISTINCT Name
FROM Badges
WHERE UserId=23
""", conn)
res5a
## Name
## 0 Supporter
## 1 Student
## 2 Teacher
## 3 Scholar
## 4 Beta
## 5 Nice Question
## 6 Editor
## 7 Nice Answer
## 8 Yearling
## 9 Popular Question
## 10 Taxonomist
## 11 Notable Question
Hint: use pandas.DataFrame.drop_duplicates.
For each badge handed to the user with Id=23, extract the award year and store it in a new column named Year. Then, select only the unique pairs (Name, Year).
res6a = pd.read_sql_query("""
SELECT DISTINCT
Name,
CAST(strftime('%Y', Date) AS FLOAT) AS Year
FROM Badges
WHERE UserId=23
""", conn)
res6a
## Name Year
## 0 Supporter 2011.0
## 1 Student 2011.0
## 2 Teacher 2011.0
## 3 Scholar 2011.0
## 4 Beta 2011.0
## 5 Nice Question 2011.0
## 6 Editor 2012.0
## 7 Nice Answer 2012.0
## 8 Yearling 2012.0
## 9 Nice Question 2012.0
## 10 Nice Question 2013.0
## 11 Yearling 2013.0
## 12 Popular Question 2014.0
## 13 Yearling 2014.0
## 14 Taxonomist 2014.0
## 15 Notable Question 2015.0
## 16 Nice Question 2017.0
Hint: use Badges.Date.astype("datetime64[ns]").dt.strftime("%Y").astype("float"); see Chapter 16.
13.3.4. Grouping and aggregating
Count how many badges of each type the user with Id=23 won. Also, for each badge type, compute the minimal, average, and maximal receiving year. Return only the top four badges (with respect to the counts).
res7a = pd.read_sql_query("""
SELECT
Name,
COUNT(*) AS Count,
MIN(CAST(strftime('%Y', Date) AS FLOAT)) AS MinYear,
AVG(CAST(strftime('%Y', Date) AS FLOAT)) AS MeanYear,
MAX(CAST(strftime('%Y', Date) AS FLOAT)) AS MaxYear
FROM Badges
WHERE UserId=23
GROUP BY Name
ORDER BY Count DESC
LIMIT 4
""", conn)
res7a
## Name Count MinYear MeanYear MaxYear
## 0 Nice Question 4 2011.0 2013.25 2017.0
## 1 Yearling 3 2012.0 2013.00 2014.0
## 2 Popular Question 3 2014.0 2014.00 2014.0
## 3 Notable Question 2 2015.0 2015.00 2015.0
Count the unique combinations of pairs (Name, Year) for the badges won by the user with Id=23. Then, return only the rows having Count greater than 1 and order the results by Count, decreasingly. In other words, list the badges received more than once in any given year.
res8a = pd.read_sql_query("""
SELECT
Name,
CAST(strftime('%Y', Date) AS FLOAT) AS Year,
COUNT(*) AS Count
FROM Badges
WHERE UserId=23
GROUP BY Name, Year
HAVING Count > 1
ORDER BY Count DESC
""", conn)
res8a
## Name Year Count
## 0 Popular Question 2014.0 3
## 1 Notable Question 2015.0 2
Note that WHERE is performed before GROUP BY, and HAVING is applied thereafter.
13.3.5. Joining
Join (merge) Tags, Posts, and Users for all posts with OwnerUserId not equal to -1 (i.e., the tags which were created by “alive” users). Return the top six records with respect to Tags.Count.
res9a = pd.read_sql_query("""
SELECT Tags.TagName, Tags.Count, Posts.OwnerUserId,
Users.Age, Users.Location, Users.DisplayName
FROM Tags
JOIN Posts ON Posts.Id=Tags.WikiPostId
JOIN Users ON Users.AccountId=Posts.OwnerUserId
WHERE OwnerUserId != -1
ORDER BY Tags.Count DESC, Tags.TagName ASC
LIMIT 6
""", conn)
res9a
## TagName Count ... Location DisplayName
## 0 canada 802 ... Mumbai, India hitec
## 1 europe 681 ... Philadelphia, PA Adam Tuttle
## 2 visa-refusals 554 ... New York, NY Benjamin Pollack
## 3 australia 411 ... Mumbai, India hitec
## 4 eu 204 ... Philadelphia, PA Adam Tuttle
## 5 new-york-city 204 ... Mumbai, India hitec
##
## [6 rows x 6 columns]
First, create an auxiliary (temporary) table named UpVotesTab, where we store the information about the number of up-votes (VoteTypeId=2) that each post has received. Then, join (merge) this table with Posts and fetch some details about the five questions (PostTypeId=1) with the most up-votes.
res10a = pd.read_sql_query("""
SELECT UpVotesTab.*, Posts.Title FROM
(
SELECT PostId, COUNT(*) AS UpVotes
FROM Votes
WHERE VoteTypeId=2
GROUP BY PostId
) AS UpVotesTab
JOIN Posts ON UpVotesTab.PostId=Posts.Id
WHERE Posts.PostTypeId=1
ORDER BY UpVotesTab.UpVotes DESC LIMIT 5
""", conn)
res10a
## PostId UpVotes Title
## 0 3080 307 OK we're all adults here, so really, how on ea...
## 1 38177 254 How do you know if Americans genuinely/literal...
## 2 24540 221 How to intentionally get denied entry to the U...
## 3 20207 211 Why are airline passengers asked to lift up wi...
## 4 96447 178 Why prohibit engine braking?
13.3.6. Solutions to exercises
In this section, we provide example solutions to the above exercises.
To obtain a result equivalent to res2a, we need simple data filtering only:
res2b = (
Tags.
loc[
Tags.TagName.isin(["poland", "australia", "china"]),
["TagName", "Count"]
].
reset_index(drop=True)
)
Let’s verify whether the two data frames are identical:
pd.testing.assert_frame_equal(res2a, res2b) # no error == OK
To generate res3a with pandas only, we need some more complex filtering with loc[...]:
res3b = (
Posts.
loc[
(Posts.PostTypeId == 1) & (Posts.ViewCount >= 10000) &
(Posts.FavoriteCount >= 35) & (Posts.FavoriteCount <= 100),
["Title", "Score", "ViewCount", "FavoriteCount"]
].
reset_index(drop=True)
)
pd.testing.assert_frame_equal(res3a, res3b) # no error == OK
For res4a, some filtering and sorting is all we need:
res4b = (
Posts.
loc[
Posts.ParentId.isna() & (~Posts.Title.isna()),
["Title", "Score"]
].
sort_values("Score", ascending=False).
head(5).
reset_index(drop=True)
)
pd.testing.assert_frame_equal(res4a, res4b) # no error == OK
The key to res5a is the pandas.DataFrame.drop_duplicates method:
res5b = (
Badges.
loc[Badges.UserId == 23, ["Name"]].
drop_duplicates().
reset_index(drop=True)
)
pd.testing.assert_frame_equal(res5a, res5b) # no error == OK
For res6a, we first need to add a new column to a copy of Badges:
Badges2 = Badges.copy() # otherwise we would destroy the original object
Badges2.loc[:, "Year"] = (
Badges2.Date.astype("datetime64[ns]").dt.strftime("%Y").astype("float")
)
Then, we apply some filtering and the removal of duplicated rows:
res6b = (
Badges2.
loc[Badges2.UserId == 23, ["Name", "Year"]].
drop_duplicates().
reset_index(drop=True)
)
pd.testing.assert_frame_equal(res6a, res6b) # no error == OK
For res7a, we can use pandas.DataFrameGroupBy.aggregate:
Badges2 = Badges.copy()
Badges2.loc[:, "Year"] = (
Badges2.Date.astype("datetime64[ns]").dt.strftime("%Y").astype("float")
)
res7b = (
Badges2.
loc[Badges2.UserId == 23, ["Name", "Year"]].
groupby("Name")["Year"].
aggregate([len, "min", "mean", "max"]).
sort_values("len", ascending=False).
head(4).
reset_index()
)
res7b.columns = ["Name", "Count", "MinYear", "MeanYear", "MaxYear"]
Had we not converted Year to float, we would obtain a meaningless average year, without any warning.
Unfortunately, the rows in res7a and res7b are ordered differently. For testing, we need to reorder them in the same way:
pd.testing.assert_frame_equal(
res7a.sort_values(["Name", "Count"]).reset_index(drop=True),
res7b.sort_values(["Name", "Count"]).reset_index(drop=True)
) # no error == OK
For res8a, we first count the number of values in each group:
Badges2 = Badges.copy()
Badges2.loc[:, "Year"] = (
Badges2.Date.astype("datetime64[ns]").dt.strftime("%Y").astype("float")
)
res8b = (
Badges2.
loc[ Badges2.UserId == 23, ["Name", "Year"] ].
groupby(["Name", "Year"]).
size().
rename("Count").
reset_index()
)
The HAVING part is performed after WHERE and GROUP BY:
res8b = (
res8b.
loc[ res8b.Count > 1, : ].
sort_values("Count", ascending=False).
reset_index(drop=True)
)
pd.testing.assert_frame_equal(res8a, res8b) # no error == OK
To obtain a result equivalent to res9a, we need to merge Posts with Tags, and then merge the result with Users:
res9b = pd.merge(Posts, Tags, left_on="Id", right_on="WikiPostId")
res9b = pd.merge(Users, res9b, left_on="AccountId", right_on="OwnerUserId")
Then, some filtering and sorting will do the trick:
res9b = (
res9b.
loc[
(res9b.OwnerUserId != -1) & (~res9b.OwnerUserId.isna()),
["TagName", "Count", "OwnerUserId", "Age", "Location", "DisplayName"]
].
sort_values(["Count", "TagName"], ascending=[False, True]).
head(6).
reset_index(drop=True)
)
In SQL, “not equal to -1” implies IS NOT NULL (missing values never satisfy such a comparison); hence the additional isna-based filter in the pandas version.
pd.testing.assert_frame_equal(res9a, res9b) # no error == OK
To obtain a result equivalent to res10a, we first need to create an auxiliary data frame that corresponds to the subquery:
UpVotesTab = (
Votes.
loc[Votes.VoteTypeId==2, :].
groupby("PostId").
size().
rename("UpVotes").
reset_index()
)
And now:
res10b = pd.merge(UpVotesTab, Posts, left_on="PostId", right_on="Id")
res10b = (
res10b.
loc[res10b.PostTypeId==1, ["PostId", "UpVotes", "Title"]].
sort_values("UpVotes", ascending=False).
head(5).
reset_index(drop=True)
)
pd.testing.assert_frame_equal(res10a, res10b) # no error == OK
13.4. Closing the database connection
We said we should not forget about:
conn.close()
This gives some sense of closure. Such a relief.
13.5. Common data serialisation formats for the Web
CSV files are an all-round way to exchange tabular data between different programming and data analysis environments.
For unstructured or non-tabularly-structured data, XML and JSON (and its superset, YAML) are the common formats of choice, especially for communicating with different Web APIs.
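As a quick taster, here is a minimal sketch (with made-up data) showing that a JSON document maps naturally onto Python dictionaries and lists, which we can then reshape into a data frame ourselves:
import json
import pandas as pd
doc = json.loads('{"city": "Sydney", "uv_index": [3.1, 5.4, 7.0]}')
doc
## {'city': 'Sydney', 'uv_index': [3.1, 5.4, 7.0]}
pd.DataFrame(dict(city=doc["city"], uv_index=doc["uv_index"]))
##      city  uv_index
## 0  Sydney       3.1
## 1  Sydney       5.4
## 2  Sydney       7.0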
To ensure we can fetch data in these formats, we recommend solving the undermentioned exercises. Sadly, often this will require some tedious labour, neither art nor science; see also [93] and [20].
The Australian Radiation Protection and Nuclear Safety Agency publishes UV data for different Aussie cities. Using the xml package, convert this XML dataset to a data frame.
(*) Check out the English Wikipedia article with a list of 20th-century classical composers. Using pandas.read_html, convert the Climate Data table included therein to a data frame.
(*) Using the lxml package, author a function that converts each bullet list featured in a given Wikipedia article (e.g., this one), to a list of strings.
(**) Import an archived version of a Stack Exchange site that you find interesting and store it in an SQLite database. You can find the relevant data dumps here.
(**) Download and then import an archived version of one of the wikis hosted by the Wikimedia Foundation (e.g., the whole English Wikipedia) so that it can be stored in an SQLite database.
13.6. Working with many files
For the mass-processing of many files, it is worth knowing some functions for dealing with file paths, searching for files, etc. Usually, we will be looking up ways to complete specific tasks at hand, e.g., how to read data from a ZIP-like archive, on the internet. After all, contrary to the basic operations on vectors, matrices, and data frames, these are not amongst the actions that we perform frequently.
Good development practices related to data storage are described in [48].
13.6.1. File paths
UNIX-like operating systems, including GNU/Linux and macOS, use slashes, “/”, as path separators, e.g., "/home/marek/file.csv". Windows, however, uses backslashes, “\”, which have a special meaning in character strings (escape sequences; see Section 2.1.3). Therefore, they should be input as, e.g., "c:\\users\\marek\\file.csv". Alternatively, we can use raw strings, where the backslash is treated literally, e.g., r"c:\users\marek\file.csv".
When constructing file paths programmatically, it is thus best to rely on os.path.join, which takes care of the system-specific nuances.
import os.path
os.path.join("~", "Desktop", "file.csv") # we are on GNU/Linux
## '~/Desktop/file.csv'
The tilde, “~”, denotes the current user’s home directory.
For storing auxiliary data, we can use the system’s temporary directory. See the tempfile module for functions that generate appropriate file paths therein. For instance, a subdirectory inside the temporary directory can be created via a call to tempfile.mkdtemp.
Important
We will frequently be referring to file paths relative to the working directory of the currently executed Python session (e.g., the directory from which the IPython/Jupyter notebook server was started); see os.getcwd.
All non-absolute file names (ones that do not start with “~”, “/”, “c:\\”, and the like), for example, "filename.csv" or os.path.join("subdir", "filename.csv"), are always relative to the current working directory. For instance, if the working directory is "/home/marek/projects/python", then "filename.csv" refers to "/home/marek/projects/python/filename.csv". Also, “..” denotes the current working directory’s parent directory. Thus, "../filename2.csv" resolves to "/home/marek/projects/filename2.csv".
Print the current working directory by calling os.getcwd. Next, download the file air_quality_2018_param and save it in the current Python session’s working directory (e.g., in your web browser, right-click on the web page’s canvas and select Save Page As…). Load it with pandas.read_csv by passing "air_quality_2018_param.csv" as the input path.
(*) Download the aforementioned file programmatically (if you have not done so yet) using the requests module.
13.6.2. File search
glob.glob and os.listdir generate a list of files in a given directory (and possibly all its subdirectories).
os.path.isdir and os.path.isfile determine the type of a given object in the file system.
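For instance, a minimal sketch (the results naturally depend on the current directory’s contents):
import glob
import os.path
# recursively list all CSV files in the current working directory and its
# subdirectories (the "**" pattern requires recursive=True), keep only the
# regular files, and fetch their sizes, in bytes:
paths = glob.glob(os.path.join(".", "**", "*.csv"), recursive=True)
sizes = {p: os.path.getsize(p) for p in paths if os.path.isfile(p)}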
Write a function that computes the total size of all the files in a given directory and all its subdirectories.
13.6.3. Exception handling
Accessing resources on the disk or the internet can lead to errors, for example, when the file is not found. The try..except statement can be used if we want to be able to react to any of the envisaged errors:
try:
# statements to execute
x = pd.read_csv("file_not_found.csv")
print(x.head()) # not executed if the above raises an error
except OSError:
# if an exception occurs, we can handle it here
print("File has not been found")
## File has not been found
For more details, refer to the documentation.
13.6.4. File connections (*)
Basic ways of opening and reading from/writing to file connections are described in the Python documentation. Section 14.3.5 shows an example where we create a Markdown file manually.
File connections may be useful for processing large files chunk by chunk. In particular, pandas.read_csv accepts a file handle (see open and numpy.DataSource). Then, by passing the nrows argument, we can request a specific number of records to be read from the current position. With pandas.DataFrame.to_csv, we can also append portions of data frames to a file.
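A minimal sketch follows; the file and column names are hypothetical, and we rely on read_csv’s chunksize parameter (an alternative to repeated nrows calls) to read the input in 1000-row portions:
import pandas as pd
first_chunk = True
for chunk in pd.read_csv("big_input.csv", chunksize=1000):
    chunk = chunk.loc[chunk.loc[:, "value"] > 0, :]  # some hypothetical processing
    # append the processed part to the output file (write the header only once):
    chunk.to_csv("big_output.csv", mode="a", header=first_chunk, index=False)
    first_chunk = False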
13.7. Further reading
pandas is not the only package that brings data frames to the Python world; check out polars. Furthermore, Dask and modin can be helpful with data that do not fit into memory (and when we do not want to rely on sampling or chunking).
13.8. Exercises
Find an example of an XML and JSON file. Which one is more human-readable? Do they differ in terms of capabilities?
What is wrong with constructing file paths like "~" + "\\" + "filename.csv"?
What are the benefits of using a SQL database management system in data science activities?
(*) How can we populate a database with gigabytes of data read from many CSV files?