
Commit 1696fdd

cleanup + readme
1 parent 70dd5b4 commit 1696fdd

25 files changed: +631 / -862 lines

src/webapp/.env.example

Lines changed: 7 additions & 4 deletions
```diff
@@ -15,9 +15,11 @@ DB_CERT=""
 DB_ROOT_CERT=""
 DB_KEY=""
 
-# The initial user credentials for DEV (facilitates development)
-DEV_INIT_DB_PASSWORD="<PUT PASSWORD HERE>"
-DEV_INIT_DB_USER="tester@datakind.org"
+# Generate the following using `openssl rand -hex 32`
+# This is the initial API key for one-time use during setup.
+INITIAL_API_KEY=""
+# Its corresponding ID. You can generate this using uuid4, e.g. uuid.uuid4() in python3
+INITIAL_API_KEY_ID=""
 
 # GCP related env vars
 GCP_REGION="us-east4"
@@ -32,4 +34,5 @@ DATABRICKS_HOST_URL=""
 DATABRICKS_SERVICE_ACCOUNT_EMAIL=""
 
 # Datakinders allowed to issue API keys. This should be the MINIMUM set. Keep this group small. Pass as a comma-separated string structured like so: "abc@dk.org,bcd@dk.org"
-API_KEY_ISSUERS="tester@datakind.org"
+# The initial value set is "api_key_initial", the initial API key; this is needed for one-time setup. You can remove it once the API key table is populated.
+API_KEY_ISSUERS="api_key_initial"
```
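The comments in the file describe how to generate both values; an equivalent stdlib-Python sketch (`secrets.token_hex(32)` produces the same shape of value as `openssl rand -hex 32`; the printed values are random each run):

```python
import secrets
import uuid

# Equivalent of `openssl rand -hex 32` for the key, and uuid.uuid4() for its ID.
initial_api_key = secrets.token_hex(32)
initial_api_key_id = str(uuid.uuid4())

# Print in the format the .env file expects.
print(f'INITIAL_API_KEY="{initial_api_key}"')
print(f'INITIAL_API_KEY_ID="{initial_api_key_id}"')
```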

src/webapp/README.md

Lines changed: 104 additions & 34 deletions
````diff
@@ -1,22 +1,84 @@
-Notes:
+# Overview of the REST API for SST
 
-REST API for SST functionality.
+## See Swagger UI for self-documentation of all currently available API endpoints
 
-Notes:
-### API Callers
+Go to `<env>-sst.datakind.org/api/v1/docs`: e.g. https://dev-sst.datakind.org/api/v1/docs
 
-API callers will need to create a user using the backend and then generate an API token. They will also need the GCloud upload auth token.
+Note that the dev and staging links are behind a GCP Identity-Aware Proxy.
 
-### Prerequisites
+## Authentication
 
-In order to work with and test GCS related functionality, you'll need to setup default credentials:
-https://cloud.google.com/docs/authentication/set-up-adc-local-dev-environment#local-user-cred
+Authentication to the API is primarily via JWTs in the Authorization header of HTTP calls. These JWTs are short-term tokens that expire and are signed by the API. They are somewhat more secure than using API keys directly to authenticate every call, since API keys are long-term/non-expiring credentials that are more powerful if stolen. So the mechanism used here is that API keys are exchanged for JWTs, which are then used to authenticate each call and can carry additional information such as "enduser" identity.
 
-You will also need to add the permission Storage Writer or Storage Admin to your Datakind Account in GCP to allow for local interaction with the storage buckets.
+Note that, intentionally, there is no way to use the user table's email/password combination to authenticate to the API directly. An API key is required. This means that a "user" is almost exclusively a frontend concept. However, the backend does retain access to the user table to look up access type, institution, and so on. See the Databases section below.
 
-Note that to generate GCP URLS you'll need a service account key (doesn't work locally).
+There are also multiple types of API keys, and they can have the same access types as a user. Additionally, API keys can have the attribute "allows_enduser". BE CAUTIOUS WHEN SETTING THIS TO TRUE when generating API keys: it means the key can impersonate any user. Such a key should really only be used by the frontend, which needs to do enduser impersonation. Note that only DATAKINDER access types can allow endusers.
 
-### For local testing:
+Additionally, API keys, like users, can have an institution set. DATAKINDER type keys should not set an institution.
+
+### Authenticating/Generating Tokens via the Swagger UI
+
+NOTE: Treat keys as secrets. They grant full access to this API.
+
+0. Get a valid API key. In the LOCAL environment, you can use the `INITIAL_API_KEY` set in your `.env` file or `key_1`. In other environments, you can use the `INITIAL_API_KEY` set in the `.env` file or any existing generated API key.
+1. Hit the Authorize button at the top right and enter a valid API key in the `api-key` field (ignore the `api_key_scheme` username/password fields -- they exist to let FastAPI auto-populate bearer tokens generated from API keys; the actual username/password fields intentionally do not work).
+2. Generate a token using the `/token-from-api-key` POST method. This is the only endpoint you can access with the API key directly.
+3. Take the resulting token value, which you can then use to curl any endpoint. For example:
+
+```
+$ curl -X 'GET' \
+  'http://127.0.0.1:8000/api/v1/non-inst-users' \
+  -H 'Authorization: Bearer <paste_the_token_here>'
+```
+
+In the long term, look into having the API key --> token conversion handled directly by FastAPI so that the Swagger UI can do the conversion itself and you won't have to curl with your token.
+
+## Databases
+
+All data is stored in MySQL databases (for dev/staging/prod, these are databases in GCP's Cloud SQL). The main file you'll want to look at is [src/webapp/database.py](https://github.com/datakind/sst-app-api/blob/develop/src/webapp/database.py).
+
+At the time of writing, the tables the API cares about and tracks are as follows:
+
+* Institution Table ("inst"): the institutions, including info about them like PDP ID if applicable, creator/creation time, etc.
+* API Key Table ("apikey"): the API keys, including access type, valid status (you can disable a key), etc.
+* Account Table ("users"): **THIS TABLE IS (the only table) SHARED WITH THE FRONTEND**. It contains enduser email/password, access types, institution if applicable, etc. Because this table is shared with the frontend, any changes to the table definition should be reflected in the ORM handling the table in both the frontend _and_ the backend. Note that, intentionally, there is no way to create new users from the backend: the backend only uses API keys to authenticate, and it also lacks some required fields, such as the team id generation that Laravel requires to use the user table. The frontend can create users directly in the table, which the backend can then read.
+* Account History Table ("account_history"): audit trail of certain events undertaken by users. TODO: interactions with this table largely remain unimplemented.
+* File Table ("file"): tracks files.
+* Batch Table ("batch"): tracks batches.
+* Model Table ("model"): tracks models.
+* Job Table ("job"): tracks Databricks jobs, storing the per-run unique job_run_id. Job status is also partially tracked here. Note that failed jobs are currently indistinguishable from incomplete jobs.
+
+NOTE: the naming convention is a singular descriptor for the table name; however, the users table has to follow Laravel's table naming convention, which calls the users table "users".
+
+## Testing
+
+Unit test files are named `<file_under_test>_test.py` to correspond with the files they are testing. Unit tests only test behavior introduced by logic written in those files and do not test any integration with other systems. To respect test isolation, we have the following levels of testing:
+
+1. Unit tests where all other systems are mocked out (e.g. Databricks, GCP storage, etc.).
+2. Dev environment with fake data to test integration on real systems, as all integration points are connected to the real endpoints in Databricks, GCP, etc.
+3. Staging environment with real data (potentially sampled if your datasets are large) to test real data flowing through the full end-to-end setup of a real system that mimics prod.
+4. Prod environment with real data on real systems.
+
+This means that for functions mainly doing integration work, we do not write unit tests, as we assume external systems work, and mocking and testing those integration points would be near-useless. They can be tested at level 2 in the dev environment, which is set up for just this purpose.
+
+This also means it's not recommended for the local environment to connect to the dev environment. The four environments, `local`, `dev`, `staging`, `prod`, should be isolated from each other.
+
+While working in the local environment, it's recommended you mock out/stub out calls to external systems. If you don't want to do that, look into the official documentation on how to authenticate to GCS and Databricks from your local environment.
+
+### Comment on Deployment
+
+* Dev environment: deployed upon any new commit to b/develop.
+* Staging environment: requires a manual Cloud Build Trigger Run initiated by a human to pick up the most recent changes from b/develop.
+* Prod environment: requires a manual Cloud Build Trigger Run initiated by a human to pick up the most recent changes from b/develop.
+
+For more information on deployment, see the Terraform setup and the GCP setup in the GCP console.
+
+## Package Management
+
+Package management is done via [uv](https://docs.astral.sh/uv/). When adding a new package, add it according to the uv documentation and keep the `uv.lock` and `pyproject.toml` files up to date.
+
+## Local Environment Setup
 
 Enter into the root directory of the repo.
 
````
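The API-key-for-JWT exchange described in the Authentication section can be illustrated with a stdlib-only toy. This is not the service's code (the real API uses PyJWT and FastAPI and loads its signing secret from env vars); the secret and claim names here are made up for illustration:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # made up for this sketch; the real service keeps its key in env vars


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding, as JWTs use."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def mint_token(subject: str, expires_in_s: int = 7200) -> str:
    """Toy HS256 JWT: roughly the shape of what /token-from-api-key hands back."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps({"sub": subject, "exp": int(time.time()) + expires_in_s}).encode())
    signing_input = f"{header}.{payload}".encode()
    signature = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{signature}"


def token_is_valid(token: str) -> bool:
    """Check the signature and expiry, as the API does on every authenticated call."""
    header, payload, signature = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(signature, expected):
        return False
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    return claims["exp"] > time.time()


token = mint_token("api_key_initial")
print(token_is_valid(token))  # True
```

The point of the design is visible in `token_is_valid`: a stolen token is only useful until `exp`, whereas a stolen API key would be useful indefinitely.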

````diff
@@ -25,47 +87,55 @@ Enter into the root directory of the repo.
 1. `source .venv/bin/activate`
 1. `pip install uv`
 1. `uv sync --all-extras --dev`
-1. `coverage run -m pytest -v -s ./src/webapp/`
 
-For integration with Databricks, run:
+You're now in your virtual env with all your dependencies added.
 
-The workspace URL will look like `https://<some_id>.gcp.databricks.com`
+For all of the following, the steps above are prerequisites and you should be in the root folder of `sst-app-api/`.
 
-1. `databricks auth login --host <workspace_url>`
+### Spin up the app locally
 
-For all of the following, be in the repo root folder (`sst-app-api/`).
+1. `export ENV_FILE_PATH=<full_path_to_your_webapp_.env_file>`
+1. `fastapi dev src/webapp/main.py --port 8000`
+1. Go to `http://127.0.0.1:8000/api/v1/docs`
+1. Hit the `Authorize` button at the top right and enter `key_1` in the `api-key` field (scroll past/ignore the `api_key_scheme` fields).
+1. Generate a token using the `/token-from-api-key` POST method.
+1. Use the token to curl any endpoint. For example:
 
-If you need to generate signed URLs to upload data to GCS you should impersonate a service account.
-You can use the [cloud run service account](https://console.cloud.google.com/iam-admin/iam) or create
-your own with the desired permissions.
+```
+$ curl -X 'GET' \
+  'http://127.0.0.1:8000/api/v1/non-inst-users' \
+  -H 'Authorization: Bearer <paste_the_token_here>'
+```
 
-1. `gcloud auth application-default login --impersonate-service-account <service-account-email>`
+### Before committing, run the formatter and the unit tests
 
-Spin up the app locally:
+1. Formatter: `black src/webapp/.`
+1. Unit tests: `coverage run -m pytest -v -s ./src/webapp/`
 
-1. `ENV_FILE_PATH='/full/path/to/.env' fastapi dev src/webapp/main.py`
-1. Go to `http://127.0.0.1:8000/docs`
-1. Hit the `Authorize` button on the top right and enter the tester credentials:
+#### Optionally run pylint
 
-* username: `tester@datakind.org`
-* password: `tester_password`
+`uv run pylint './src/webapp/*' --errors-only` for errors only.
 
-Before committing, make sure to run:
+Non-error pylint is very opinionated, and **SOMETIMES WRONG**. For example, it warns to switch `== None` to `is None`, including in SQL query where-clauses. THIS WILL CAUSE THE SQL QUERY TO NOT WORK (it appears to be due to how SQLAlchemy interprets the clauses). So be careful when following the recommendations from pylint.
 
-1. `black src/webapp/.`
-1. Test using `coverage run -m pytest -v -s ./src/webapp/*.py`
-1. Test using `coverage run -m pytest -v -s ./src/webapp/routers/*.py`
+## Usage Notes
 
-### Notes:
+Some general things that may be helpful to call out.
 
-postgresql requires that SSL certs be 0600 or 0640 depending on group/owners. The way we configure the
+### Adding a Datakinder vs an institutional user
 
 The flow to add a Datakinder user is different from adding a user to an institution:
+
 * adding a user to an institution has to happen prior to that user creating an account (by allowlisting their email for a given institution)
 * adding a Datakinder user has to happen after the Datakinder person has already created their account; their account's access type is then updated.
 
-In general, the service account used to run this service in GCP will also need to be granted Databricks access in the equivalent environment.
+### Uploading files
+
+The process to upload a file involves three API calls:
+1. Get the GCS upload URL: `GET /institutions/{inst_id}/upload-url/{file_name}`
+1. Post to the GCS upload URL: `POST <the_gcp_url_returned_from_step_1>`
+1. Validate the file: `POST /institutions/{inst_id}/input/validate-upload/{file_name}` OR `POST /institutions/{inst_id}/input/validate-sftp/{file_name}` -- depending on which input mechanism your file used. This sets a field in the File database table indicating the source of the file (`MANUAL_UPLOAD`, etc.), which is helpful information for the frontend.
 
-### Local VSCode Debugging
+## Local VSCode Debugging
 
 From the Run & Debug panel (⇧⌘D on 🍎) you can run the [debug launch config](../../.vscode/launch.json) for the webapp or worker modules. This will allow you to set breakpoints within the source code while the applications are running.
````
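The three-call upload flow in the new "Uploading files" section can be sketched as a helper that assembles the request targets in order. The base URL and function name are illustrative, not part of the API; step 2's target is intentionally left as the placeholder from the README, since it comes from step 1's response:

```python
# Hypothetical helper sketching the three-call upload flow; the endpoint paths
# come from the README, BASE and the function name are made up for illustration.
BASE = "http://127.0.0.1:8000/api/v1"


def upload_flow_requests(inst_id: str, file_name: str, via_sftp: bool = False):
    """Return the (method, target) pairs for the three upload calls, in order."""
    get_signed_url = ("GET", f"{BASE}/institutions/{inst_id}/upload-url/{file_name}")
    # Step 2 posts to whatever signed GCS URL step 1 returned; placeholder here.
    post_to_gcs = ("POST", "<the_gcp_url_returned_from_step_1>")
    validator = "validate-sftp" if via_sftp else "validate-upload"
    validate = ("POST", f"{BASE}/institutions/{inst_id}/input/{validator}/{file_name}")
    return [get_signed_url, post_to_gcs, validate]


for method, target in upload_flow_requests("inst123", "students.csv"):
    print(method, target)
```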

src/webapp/authn.py

Lines changed: 19 additions & 34 deletions
```diff
@@ -2,30 +2,20 @@
 Functions related to authentication.
 """
 
+from datetime import timedelta, datetime, timezone
 import jwt
-
 from fastapi import Security, HTTPException, status
 from fastapi.security import (
     OAuth2PasswordBearer,
-    OAuth2PasswordRequestForm,
     APIKeyHeader,
 )
 from passlib.context import CryptContext
 from pydantic import BaseModel
-from datetime import timedelta, datetime, timezone
 from .config import env_vars
 
 
 pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")
 
-oauth2_scheme = OAuth2PasswordBearer(
-    scheme_name="user_scheme",
-    tokenUrl="token",
-    # We are using scope to sideload info on the end user. So "enduser" here is just a placeholder,
-    # but the actual username will be passed by the frontend.
-    scopes={"enduser": "end user to act as (a valid username), if frontend"},
-)
-
 oauth2_apikey_scheme = OAuth2PasswordBearer(
     scheme_name="api_key_scheme",
     tokenUrl="token-from-api-key",
@@ -43,59 +33,54 @@
 
 
 class Token(BaseModel):
+    """Info stored in the JWT."""
+
     access_token: str
     token_type: str
 
 
-class TokenData(BaseModel):
-    username: str | None = None
-
-
 def get_api_key(
-    api_key_header: str = Security(api_key_header),
-    api_key_inst_header: str = Security(api_key_inst_header),
-    api_key_enduser_header: str = Security(api_key_enduser_header),
-) -> str:
-    """Retrieve the api key and enduser header key if present.
-
-    Args:
-        api_key_header: The API key passed in the HTTP header.
-
-    Returns:
-        A tuple with the api key and enduser header if present. Authentication happens elsewhere.
-    Raises:
-        HTTPException: If the API key is invalid or missing.
-    """
-    if api_key_header:
-        return (api_key_header, api_key_inst_header, api_key_enduser_header)
+    api_key: str = Security(api_key_header),
+    api_key_inst: str = Security(api_key_inst_header),
+    api_key_enduser: str = Security(api_key_enduser_header),
+) -> tuple[str, str, str]:
+    """Retrieve the API key and enduser header key if present."""
+    if api_key:
+        return (api_key, api_key_inst, api_key_enduser)
     raise HTTPException(
         status_code=status.HTTP_401_UNAUTHORIZED,
         detail="Invalid or missing API Key",
     )
 
 
 def verify_password(plain_password: str, hashed_password: str) -> bool:
+    """Verify a plain password against a hash. Includes a 2y/2b replacement, since Laravel
+    generates hashes that start with 2y. The hashing scheme recognizes both."""
     revert_hash = hashed_password.replace("$2y", "$2b", 1)
     return pwd_context.verify(plain_password, revert_hash)
 
 
 def verify_api_key(plain_api_key: str, hashed_key: str) -> bool:
+    """Verify a plain API key against a hash."""
     return pwd_context.verify(plain_api_key, hashed_key)
 
 
 def get_api_key_hash(api_key: str):
+    """Hash a given API key."""
     return pwd_context.hash(api_key)
 
 
 def get_password_hash(password: str):
-    # to align with the password hashing used by Laravel, we have to replace the 2b
-    # generated by pwd_context with 2y and that should be the version we store.
-    # They should be functionally the same: https://stackoverflow.com/a/36225192/28478909
+    """Hash a password. To align with the password hashing used by Laravel, we have to replace
+    the 2b generated by pwd_context with 2y, and that should be the version we store.
+    They should be functionally the same: https://stackoverflow.com/a/36225192/28478909
+    """
     initial_hash = pwd_context.hash(password)
     return initial_hash.replace("$2b", "$2y", 1)
 
 
 def create_access_token(data: dict, expires_delta: timedelta | None = None):
+    """Create a JWT."""
     to_encode = data.copy()
     if expires_delta:
         expire = datetime.now(timezone.utc) + expires_delta
```
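The 2y/2b handling in `verify_password` and `get_password_hash` above boils down to a prefix swap on the bcrypt hash string: Laravel labels its bcrypt hashes `$2y$` while passlib emits `$2b$`, and the two prefixes denote the same algorithm. A stdlib-only sketch with a made-up hash value:

```python
# Laravel writes bcrypt hashes with the "$2y$" prefix; passlib's bcrypt handler
# uses "$2b$". Both denote the same algorithm, so the code swaps the prefix at
# the string level before verifying (2y -> 2b) or storing (2b -> 2y).
LARAVEL_HASH = "$2y$12$abcdefghijklmnopqrstuv"  # made-up value; shape only, not a real hash


def to_passlib_form(hashed: str) -> str:
    """The replacement verify_password applies before calling pwd_context.verify."""
    return hashed.replace("$2y", "$2b", 1)


def to_laravel_form(hashed: str) -> str:
    """The replacement get_password_hash applies before storing the hash."""
    return hashed.replace("$2b", "$2y", 1)


print(to_passlib_form(LARAVEL_HASH))  # $2b$12$abcdefghijklmnopqrstuv
print(to_laravel_form(to_passlib_form(LARAVEL_HASH)) == LARAVEL_HASH)  # True
```

The `count=1` argument to `str.replace` matters: only the identifier prefix should change, never bytes later in the hash.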

src/webapp/authn_test.py

Lines changed: 0 additions & 4 deletions
```diff
@@ -1,9 +1,5 @@
 """Test file for authn.py."""
 
-import pytest
-
-from fastapi import HTTPException
-import uuid
 from .authn import (
     get_password_hash,
     verify_password,
```

src/webapp/config.py

Lines changed: 5 additions & 4 deletions
```diff
@@ -11,6 +11,8 @@
     "ACCESS_TOKEN_EXPIRE_MINUTES": "120",
     # The Issuers env var will be stored as an array of emails.
     "API_KEY_ISSUERS": [],
+    "INITIAL_API_KEY": "",
+    "INITIAL_API_KEY_ID": "",
 }
 
 # The INSTANCE_HOST is the private IP of the Cloud SQL instance e.g. '127.0.0.1' ('172.17.0.1' if deployed to GAE Flex)
@@ -50,8 +52,8 @@
 }
 
 
-# Setup function to get environment variables. Should be called at startup time.
 def startup_env_vars():
+    """Setup function to get environment variables. Should be called at startup time."""
     env_file = os.environ.get("ENV_FILE_PATH")
     if not env_file:
         raise ValueError(
@@ -81,8 +83,7 @@ def startup_env_vars():
             "ENV environment variable not one of: PROD, STAGING, DEV, LOCAL."
         )
         if (
-            name == "ACCESS_TOKEN_EXPIRE_MINUTES"
-            or name == "ACCESS_TOKEN_EXPIRE_MINUTES"
+            name == "ACCESS_TOKEN_EXPIRE_MINUTES"
         ) and not env_var.isdigit():
             raise ValueError(
                 "ACCESS_TOKEN_EXPIRE_MINUTES and ACCESS_TOKEN_EXPIRE_MINUTES environment variables must be an int."
@@ -111,8 +112,8 @@ def startup_env_vars():
         databricks_vars[name] = env_var
 
 
-# Setup function to get db environment variables. Should be called at db startup time.
 def setup_database_vars():
+    """Setup function to get db environment variables. Should be called at db startup time."""
     global engine_vars
     for name in engine_vars:
         env_var = os.environ.get(name)
```
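The `ACCESS_TOKEN_EXPIRE_MINUTES` validation in `startup_env_vars` can be sketched in isolation (simplified: the real function reads from `os.environ` and validates several other variables too; the function name here is made up):

```python
def check_expiry_minutes(value: str) -> int:
    """Mirror of the ACCESS_TOKEN_EXPIRE_MINUTES check: the env var must parse as an int."""
    # str.isdigit() rejects signs, decimals, and empty strings, so only
    # non-negative whole-minute values pass, matching the config's intent.
    if not value.isdigit():
        raise ValueError("ACCESS_TOKEN_EXPIRE_MINUTES environment variable must be an int.")
    return int(value)


print(check_expiry_minutes("120"))  # 120
```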
