Lecture 7

Databases

Project evolution

Each Project Part is a baseline; you'll probably need to do more than that to get to the end result.

Open Data Week

School of Data
- Strongly recommend
- Early-bird ticket sales end today
- As a sweetener: I'm showing a policy-related dance-theater piece on the Sunday (UnSchool of Data) 🕺
You'll be required to attend at least one event.
- Doesn't need to be (Un)School of Data.

Packages

Whenever documentation says pip install …, you probably want to add it to your requirements.txt.

SQL

Who has experience with SQL? What have you used it for?

SQLite

DuckDB

SQL syntax is largely the same across SQL databases:

SQLite
DuckDB
PostgreSQL
MySQL
BigQuery
Oracle
…

The variations are known as "dialects".

Install the command-line tool

Recommend installing through a system package manager. If you have any trouble, the default instructions are fine.

Example

Go to this dataset page.
- Country: Select all (☑️)
- Series: Access to electricity (% of total population)
- Time: Select all (☑️)
Download and unzip as a CSV.
Query it.
```
SELECT * FROM read_csv('[path].csv');
```

Specify null values

```
SELECT * FROM read_csv(
   '[path].csv',
   nullstr=['', '..']
);
```
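For comparison, pandas has a similar option. A minimal sketch, assuming a made-up two-row CSV in the same shape as the download (the values are invented for illustration): `na_values` tells `read_csv()` which strings to treat as missing, so the column is parsed as numbers rather than text.

```python
import io

import pandas as pd

# Made-up sample in the same shape as the download; ".." marks a missing value.
csv_text = "Country Name,1990 [YR1990],2022 [YR2022]\nA,50.0,90.0\nB,..,100.0\n"

# Treat ".." as missing, in addition to pandas' default missing markers.
df = pd.read_csv(io.StringIO(csv_text), na_values=[".."])

missing_1990 = int(df["1990 [YR1990]"].isna().sum())
print(df.dtypes)
print(missing_1990)
```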

Create a table

```
CREATE TABLE electricity AS SELECT ...;
```

What's a question we might want to ask?

Column names that contain spaces must be wrapped in double quotes. Alternatively, pass `normalize_names = true` to `read_csv()` to get column names that don't need quoting.

```
SELECT
   "Country name",
   "2022 [YR2022]" - "1990 [YR1990]" AS diff
FROM electricity
ORDER BY diff DESC;
```

SQL similarities to pandas

Tabular
DuckDB: read_csv()
Tables are like DataFrames
Columns have types
Column-based operations
SELECT ... WHERE is like boolean indexing
GROUP BY is like groupby()

pandas allows you to build up operations over multiple lines; harder to do that in SQL.
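The parallels above can be sketched side by side. This is an illustration only: it uses Python's built-in `sqlite3` rather than DuckDB, and the table name, columns, and values are made up.

```python
import sqlite3

import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame(
    {
        "region": ["North", "North", "South", "South"],
        "access": [80.0, 100.0, 40.0, 60.0],
    }
)
conn = sqlite3.connect(":memory:")
df.to_sql("electricity", conn, index=False)

# Filtering rows: boolean indexing vs. WHERE.
pandas_filtered = df[df["access"] > 50]["access"].tolist()
sql_filtered = [
    row[0]
    for row in conn.execute(
        "SELECT access FROM electricity WHERE access > 50 ORDER BY rowid"
    )
]

# Aggregating: groupby() vs. GROUP BY.
pandas_means = df.groupby("region")["access"].mean().to_dict()
sql_means = dict(
    conn.execute("SELECT region, AVG(access) FROM electricity GROUP BY region")
)

print(pandas_filtered, sql_filtered)
print(pandas_means, sql_means)
```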

Views
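A view is a saved query that you can select from as if it were a table, which is one way to build up a SQL analysis in steps. A minimal sketch using Python's built-in `sqlite3`; the table, the view name `electricity_change`, and the values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE electricity (country TEXT, y1990 REAL, y2022 REAL)")
conn.executemany(
    "INSERT INTO electricity VALUES (?, ?, ?)",
    [("A", 50.0, 90.0), ("B", 95.0, 100.0), ("C", 20.0, 45.0)],  # made-up values
)

# Step 1: save the intermediate query as a view.
conn.execute(
    """
    CREATE VIEW electricity_change AS
    SELECT country, y2022 - y1990 AS diff
    FROM electricity
    """
)

# Step 2: query the view like a table.
big_changes = conn.execute(
    "SELECT country, diff FROM electricity_change WHERE diff > 20 ORDER BY diff DESC"
).fetchall()
print(big_changes)
```

The view stores the query, not the results, so it reflects whatever is in the underlying table when you select from it.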

Clients

Lots of ways to connect to databases from Python, including:

pandas
SQLAlchemy
- Object Relational Mapper (ORM)

Drivers

Allow you to use the same Python syntax across databases
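In Python, most drivers follow the DB-API 2.0 interface (PEP 249): connect, get a cursor, execute, fetch. A minimal sketch with the built-in `sqlite3` standing in for any driver; the `readings` table and the `largest_readings` helper are hypothetical. One caveat: placeholder syntax for query parameters varies between drivers, so this sketch avoids parameters.

```python
import sqlite3


def largest_readings(conn, limit):
    """Fetch the top rows using only DB-API 2.0 calls."""
    cur = conn.cursor()
    cur.execute("SELECT name, value FROM readings ORDER BY value DESC")
    rows = cur.fetchmany(limit)
    cur.close()
    return rows


# Only this line is driver-specific; another driver has its own connect().
conn = sqlite3.connect(":memory:")

cur = conn.cursor()
cur.execute("CREATE TABLE readings (name TEXT, value REAL)")
cur.execute("INSERT INTO readings VALUES ('a', 1.5), ('b', 3.0), ('c', 2.0)")
conn.commit()
cur.close()

top_two = largest_readings(conn, 2)
print(top_two)
```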

Writing data

How would you take data from an API and get it into a database?

Individual `INSERT`s
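A minimal sketch of a single `INSERT` from Python, using the built-in `sqlite3`; the `items` table is made up. Passing values separately from the SQL string, rather than formatting them into it, lets the driver handle quoting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")

# The ? placeholders are filled in by the driver, so the apostrophe is safe.
conn.execute("INSERT INTO items (id, name) VALUES (?, ?)", (1, "O'Brien"))
conn.commit()

inserted = conn.execute("SELECT id, name FROM items").fetchall()
print(inserted)
```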

Loops
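One way to answer the API question is to loop over the records and insert them one at a time. A minimal sketch, assuming the API response has already been parsed into a list of dictionaries; the records and the `items` table are hypothetical.

```python
import sqlite3

# Hypothetical records, standing in for a parsed JSON API response.
records = [
    {"id": 1, "name": "Alpha", "score": 3.5},
    {"id": 2, "name": "Beta", "score": None},  # None is stored as NULL
    {"id": 3, "name": "Gamma", "score": 4.0},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, score REAL)")

# One INSERT per record; named placeholders are filled from each dictionary.
for record in records:
    conn.execute(
        "INSERT INTO items (id, name, score) VALUES (:id, :name, :score)",
        record,
    )
conn.commit()  # commit once, after the loop

row_count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
null_scores = conn.execute(
    "SELECT COUNT(*) FROM items WHERE score IS NULL"
).fetchone()[0]
print(row_count, null_scores)
```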

Writing from a DataFrame
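If the data is already in a DataFrame, pandas can write the whole table in one call. A minimal sketch using `DataFrame.to_sql()` with a built-in `sqlite3` connection; the data and the `items` table name are made up.

```python
import sqlite3

import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"id": [1, 2, 3], "name": ["Alpha", "Beta", "Gamma"]})

conn = sqlite3.connect(":memory:")

# Creates the table (replacing it if it already exists) and inserts every row.
df.to_sql("items", conn, index=False, if_exists="replace")

round_trip = pd.read_sql("SELECT * FROM items ORDER BY id", conn)
print(round_trip)
```

For databases other than SQLite, `to_sql()` expects a SQLAlchemy engine or connection instead.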

BigQuery

In the Google Cloud Console, make sure your Project is selected.
Open BigQuery.
Enable the API.
Open a public dataset.
- Try filtering by Category: Economics and Price: Free

```
SELECT company_name, COUNT(*) AS num_complaints
FROM `bigquery-public-data.cfpb_complaints.complaint_database`
GROUP BY company_name
ORDER BY num_complaints DESC;
```

Reading response questions

Code Quality

At what point does adding more code quality tools (linters, formatters, type checkers) improve productivity, and when might it slow development?
At what point does separating code into modules become fragmentation, and how do you decide where one module ends and another begins?

Databases

When is pandas used vs. SQL? What are the advantages and disadvantages of each? I’ve noticed that many databases use SQL instead of pandas.
In our quant class, we learn that R is widely used in social science research for statistical analysis. How do these languages typically work together in real data systems? Also, if these large data systems combine multiple languages, does relying on multiple languages make such systems harder to maintain?
