Skip to content

Latest commit

 

History

History
205 lines (128 loc) · 4.95 KB

File metadata and controls

205 lines (128 loc) · 4.95 KB

Lecture 7

Databases


Project evolution

Each Project Part is a baseline; you'll probably need to do more than that to get to the end result.



Packages

Whenever documentation says pip install …, you probably want to add it to your requirements.txt.


SQL

Who has experience with SQL? What have you used it for?




SQL syntax is largely the same across SQL databases:

  • SQLite
  • DuckDB
  • PostgreSQL
  • MySQL
  • BigQuery
  • Oracle

The variations are known as "dialects".


Recommend installing through a system package manager. If you have any trouble, the default instructions are fine.


Example

  1. Go to this dataset page.

    • Country: Select all (☑️)
    • Series: Access to electricity (% of total population)
    • Time: Select all (☑️)
  2. Download and unzip as a CSV.

  3. Query it.

    SELECT * FROM read_csv('[path].csv');

SELECT * FROM read_csv(
   '[path].csv',
   nullstr=['', '..']
);

CREATE TABLE electricity AS SELECT ...;

What's a question we might want to ask?


Column names with spaces need double quotes. Alternatively, normalize_names.


SELECT
   "Country name",
   "2022 [YR2022]" - "1990 [YR1990]" AS diff
FROM electricity
ORDER BY diff DESC;

SQL similarities to pandas

  • Tabular
  • DuckDB: read_csv()
  • Tables are like DataFrames
  • Columns have types
  • Column-based operations
  • SELECT is like boolean indexing
  • GROUP BY is like groupby()

pandas allows you to build up operations over multiple lines; harder to do that in SQL.


Views


Clients


Lots of ways to connect to databases from Python, including:


Drivers

Allow you to use the same Python syntax across databases


Writing data

How would you take data from an API and get it into a database?






  1. In the Google Cloud Console, make sure your Project is selected.
  2. Open BigQuery.
  3. Enable the API.
  4. Open a public dataset.
    • Try Category of Economics and Price of Free

SELECT company_name, COUNT(*) AS num_complaints
FROM `bigquery-public-data.cfpb_complaints.complaint_database`
GROUP BY company_name
ORDER BY num_complaints DESC;

Code Quality

  • At what point does adding more code quality tools (linters, formatters, type checkers) improve productivity, and when might it slow development?
  • At what point does separating code into modules become fragmentation, and how do you decide where one module ends and another begins?

Databases

  • When is Pandas used vs SQL used? What are the advantages/disadvantages of both? I’ve noticed that many dbs use SQL instead of pandas.
  • In our quant class, we learn that R is widely used in social science research for statistical analysis. How do these languages typically work together in real data systems? Also, if these large data systems combine multiple languages, does relying on multiple languages make such systems harder to maintain?