Databases
Project evolution
Each Project Part is a baseline; you'll probably need to do more than that to get to the end result.
- School of Data
- Strongly recommend
- Early-bird ticket sales end today
- As a sweetener: I'm showing a policy-related dance-theater piece on the Sunday (UnSchool of Data) 🕺
- You'll be required to attend at least one event.
- Doesn't need to be (Un)School of Data.
Whenever documentation says pip install …, you probably want to add it to your requirements.txt.
Who has experience with SQL? What have you used it for?
SQL syntax is largely the same across SQL databases:
- SQLite
- DuckDB
- PostgreSQL
- MySQL
- BigQuery
- Oracle
- …
The variations are known as "dialects".
Recommend installing through a system package manager. If you have any trouble, the default instructions are fine.
-
Go to this dataset page.
- Country: Select all (☑️)
- Series: Access to electricity (% of total population)
- Time: Select all (☑️)
-
Download and unzip as a CSV.
-
Query it.
SELECT * FROM read_csv('[path].csv');
SELECT * FROM read_csv(
'[path].csv',
nullstr=['', '..']
);CREATE TABLE electricity AS SELECT ...;What's a question we might want to ask?
Column names with spaces need double quotes. Alternatively, normalize_names.
SELECT
"Country name",
"2022 [YR2022]" - "1990 [YR1990]" AS diff
FROM electricity
ORDER BY diff DESC;- Tabular
- DuckDB:
read_csv() - Tables are like DataFrames
- Columns have types
- Column-based operations
SELECTis like boolean indexingGROUP BYis likegroupby()
pandas allows you to build up operations over multiple lines; harder to do that in SQL.
Lots of ways to connect to databases from Python, including:
- pandas
- SQLAlchemy
- Object Relational Mapper (ORM)
Allow you to use the same Python syntax across databases
How would you take data from an API and get it into a database?
- In the Google Cloud Console, make sure your Project is selected.
- Open BigQuery.
- Enable the API.
- Open a public dataset.
- Try
CategoryofEconomicsandPriceofFree
- Try
SELECT company_name, COUNT(*) AS num_complaints
FROM `bigquery-public-data.cfpb_complaints.complaint_database`
GROUP BY company_name
ORDER BY num_complaints DESC;- At what point does adding more code quality tools (linters, formatters, type checkers) improve productivity, and when might it slow development?
- At what point does separating code into modules become fragmentation, and how do you decide where one module ends and another begins?
- When is Pandas used vs SQL used? What are the advantages/disadvantages of both? I’ve noticed that many dbs use SQL instead of pandas.
- In our quant class, we learn that R is widely used in social science research for statistical analysis. How do these languages typically work together in real data systems? Also, if these large data systems combine multiple languages, does relying on multiple languages make such systems harder to maintain?