**src/webapp/.env.example** — 7 additions, 4 deletions

```diff
@@ -15,9 +15,11 @@ DB_CERT=""
 DB_ROOT_CERT=""
 DB_KEY=""
 
-# The initial user credentials for DEV (facilitates development)
-DEV_INIT_DB_PASSWORD="<PUT PASSWORD HERE>"
-DEV_INIT_DB_USER="tester@datakind.org"
+# Generate the following using `openssl rand -hex 32`
+# This is the initial API key for one-time use during setup.
+INITIAL_API_KEY=""
+# Its corresponding ID. You can generate this using uuid4 e.g. uuid.uuid4() in python3
+INITIAL_API_KEY_ID=""
 
 # GCP related env vars
 GCP_REGION="us-east4"
@@ -32,4 +34,5 @@ DATABRICKS_HOST_URL=""
 DATABRICKS_SERVICE_ACCOUNT_EMAIL=""
 
 # Datakinders allowed to issue API key. This should be the MINIMUM set. Keep this group small. Pass as a comma separated string structured like so: "abc@dk.org,bcd@dk.org"
-API_KEY_ISSUERS="tester@datakind.org"
+# The initial value set is api_key_initial which is the initial API key; this is needed for one-time setup. You can remove this once the api key table is populated.
```
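The two generation comments in the diff above can be followed directly from a shell. A quick sketch (the generated values go into your `.env`):

```shell
# Generate a 64-hex-char initial API key, per the .env.example comment.
INITIAL_API_KEY="$(openssl rand -hex 32)"
# Generate its corresponding ID with uuid4, as suggested.
INITIAL_API_KEY_ID="$(python3 -c 'import uuid; print(uuid.uuid4())')"

echo "INITIAL_API_KEY=\"${INITIAL_API_KEY}\""
echo "INITIAL_API_KEY_ID=\"${INITIAL_API_KEY_ID}\""
```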
## Authentication

Authentication of the API is primarily via JWTs in the `Authorization` header of the HTTP calls. These JWTs are short-lived tokens that expire and are signed by the API. They are somewhat more secure than authenticating every call with an API key directly, since API keys are long-lived/non-expiring credentials and are more powerful if stolen. The mechanism used here is that API keys are exchanged for JWTs, which are then used to authenticate each call and can carry additional information such as "enduser" identity.

Note that, intentionally, there is no way to use the user table's email/password combination to authenticate to the API directly. An API key is required. This means that a "user" is almost exclusively a frontend concept. However, the backend does retain access to the user table to look up access type, institution, and so on. See more in the databases section below.

There are also multiple types of API keys, and they can have the same access types as a user. Additionally, API keys can have the attribute `allows_enduser`. BE CAUTIOUS WHEN SETTING THIS TO TRUE when generating API keys: it means the key can impersonate any user. Such a key should really only be used by the frontend, which needs to do enduser impersonation. Note that only DATAKINDER access types can allow endusers.

Additionally, API keys, like users, can have an institution set. DATAKINDER type keys should not set an institution.
### Authenticate/Generating Tokens via the Swagger UI

NOTE: Treat keys as secrets. They grant full access to this API.

0. Get a valid API key. In the LOCAL environment, you can use the `INITIAL_API_KEY` set in your `.env` file or `key_1`. In other environments, you can use the `INITIAL_API_KEY` set in the `.env` file or any existing generated API key.
1. Hit the Authorize button on the top right and enter a valid API key in the `api-key` field. (Ignore the `api_key_scheme` section that contains username/password — it exists to let FastAPI auto-populate bearer tokens generated from the API keys; the actual username/password fields intentionally do not work.)
2. Generate a token using the `/token-from-api-key` POST method. This is the only endpoint you can access with the API key directly.
3. Take the resulting token value, which you can then use to curl any endpoint. For example:

```
$ curl -X 'GET' \
  'http://127.0.0.1:8000/api/v1/non-inst-users' \
  -H 'Authorization: Bearer <paste_the_token_here>'
```

In the long term, look into having the API key --> token conversion handled directly by FastAPI so that the Swagger UI can do the conversion and you won't have to curl with your token.
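Outside the Swagger UI, the same exchange can be scripted. A minimal stdlib sketch, assuming the API key is sent in an `api-key` header (as in the Swagger authorize dialog) and that the token endpoint returns JSON with an `access_token` field — check the actual response shape before relying on this:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"

def token_request(api_key: str, base_url: str = BASE_URL) -> urllib.request.Request:
    # Exchange the API key for a short-lived JWT -- the only endpoint an
    # API key can call directly.
    return urllib.request.Request(
        f"{base_url}/token-from-api-key",
        method="POST",
        headers={"api-key": api_key},
    )

def bearer_headers(token: str) -> dict:
    # All subsequent calls authenticate with the JWT, not the API key.
    return {"Authorization": f"Bearer {token}"}

if __name__ == "__main__":
    # Requires the app to be running locally.
    with urllib.request.urlopen(token_request("<your_api_key>")) as resp:
        token = json.load(resp)["access_token"]  # assumed field name
    print(bearer_headers(token))
```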
## Databases

All data is stored in MySQL databases (for dev/staging/prod, these are databases in GCP's Cloud SQL). The main file you'll want to look at is [src/webapp/database.py](https://github.com/datakind/sst-app-api/blob/develop/src/webapp/database.py).

At the time of writing, the tables the API cares about and tracks are as follows:

* Institution Table ("inst"): the institutions, including info about them such as PDP ID if applicable, creator/creation time, etc.
* API Key Table ("apikey"): the API keys, including access type, valid status (you can disable a key), etc.
* Account Table ("users"): **THIS TABLE IS (the only table) SHARED WITH THE FRONTEND**. It contains enduser email/password, access types, institution if applicable, etc. Because this table is shared with the frontend, any changes to the table definition should be reflected in the ORM handling the table in the frontend _and_ the backend. Note that, intentionally, there is no way to create new users from the backend: the backend only uses API keys to authenticate, and it also lacks some required fields, such as the team id generation that Laravel requires to use the user table. The frontend can create users directly in the table, which the backend will be able to read.
* Account History Table ("account_history"): audit trail of certain events undertaken by users. TODO: interactions with this table largely remain unimplemented.
* File Table ("file"): tracks files.
* Batch Table ("batch"): tracks batches.
* Model Table ("model"): tracks models.
* Job Table ("job"): tracks Databricks jobs, storing the per-run unique job_run_id. Job status is also partially tracked here. Note that failed jobs are currently indistinguishable from incomplete jobs.

NOTE: the naming convention is a singular descriptor for each table name; however, the "users" table has to follow Laravel's table naming convention, which calls the users table "users".
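As an illustration of the kind of ORM declaration involved (the real models live in `src/webapp/database.py`; the columns below are hypothetical, not the actual schema):

```python
# Hypothetical sketch only -- column names and types are illustrative.
from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class InstTable(Base):
    # Singular table name, per the naming convention noted above
    # (the Laravel-shared "users" table is the exception).
    __tablename__ = "inst"
    id = Column(Integer, primary_key=True)
    name = Column(String(255), nullable=False)
    pdp_id = Column(String(64), nullable=True)    # PDP ID, if applicable
    created_at = Column(DateTime, nullable=True)  # creation time
```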
## Testing

Unit test files are named `<file_under_test>_test.py` to correspond with the files they test. Unit tests only cover behavior introduced by logic written in those files and do not test integration with other systems. To respect test isolation, we have the following levels of testing:

1. Unit tests, where all other systems are mocked out (e.g. Databricks, GCP storage, etc.).
2. Dev environment with fake data, to test integration on real systems: all integration points are connected to the real endpoints in Databricks, GCP, etc.
3. Staging environment with real data (potentially sampled if your datasets are large), to test real data flowing through the full end-to-end setup of a real system that mimics prod.
4. Prod environment with real data on real systems.

This means that for functions that mainly do integration work, we do not have unit tests: we assume external systems work, and mocking and testing those integration points would be near-useless. These can be tested at level 2 in the dev environment, which is set up for just this purpose.

It is also not recommended for the local environment to connect to the dev environment. The four environments, `local`, `dev`, `staging`, `prod`, should be isolated from each other.

While working in the local environment, it's recommended you mock or stub out the calls to external systems. If you don't want to do that, see the official documentation on how to authenticate to GCS and to Databricks from your local environment.
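As a sketch of what level 1 looks like in practice, here is a hypothetical unit test that mocks the external storage client entirely (the function and client are made up for illustration; real tests live in the `*_test.py` files):

```python
from types import SimpleNamespace
from unittest import mock

def list_file_names(storage_client, bucket: str) -> list:
    # Hypothetical function under test: our own logic (sorting) layered
    # over an external-system call (listing blobs).
    return sorted(blob.name for blob in storage_client.list_blobs(bucket))

def test_list_file_names_sorts_results():
    # The external system is mocked out -- no real GCP call happens.
    fake_client = mock.Mock()
    fake_client.list_blobs.return_value = [
        SimpleNamespace(name="b.csv"),
        SimpleNamespace(name="a.csv"),
    ]
    assert list_file_names(fake_client, "some-bucket") == ["a.csv", "b.csv"]
    fake_client.list_blobs.assert_called_once_with("some-bucket")

test_list_file_names_sorts_results()
```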
### Comment on Deployment

* Dev environment: gets deployed upon any new commit to b/develop.
* Staging environment: requires a manual Cloud Build Trigger run initiated by a human to pick up the most recent changes from b/develop.
* Prod environment: requires a manual Cloud Build Trigger run initiated by a human to pick up the most recent changes from b/develop.

For more information on deployment, see the Terraform setup and the GCP setup in the GCP console.
## Package Management

Package management is done via [uv](https://docs.astral.sh/uv/). When adding a new package, add it according to the uv documentation and keep the `uv.lock` and `pyproject.toml` files up to date.
## Local Environment Setup

Enter into the root directory of the repo.

1. `source .venv/bin/activate`
1. `pip install uv`
1. `uv sync --all-extras --dev`

You're now in your virtual env with all your dependencies added.

For all of the following, the steps above are pre-requisites and you should be in the root folder of `sst-app-api/`.

### Spin up the app locally:
### Before committing, run the formatter and run the unit tests

1. Formatter: `black src/webapp/.`
1. Unit tests: `coverage run -m pytest -v -s ./src/webapp/`

#### Optionally run pylint

`uv run pylint './src/webapp/*' --errors-only` for only errors.

Non-error pylint output is very opinionated, and **SOMETIMES WRONG**. For example, it warns you to switch `== None` to `is None` in SQL query where-clauses. THIS WILL CAUSE THE SQL QUERY TO NOT WORK (it appears to be due to how SQLAlchemy interprets the clauses). So be careful when following pylint's recommendations.
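To make the `== None` caveat concrete, here is a small self-contained sketch (with a hypothetical table, purely for illustration) showing why the pylint suggestion breaks SQLAlchemy where-clauses:

```python
from sqlalchemy import Column, Integer, String, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Demo(Base):
    # Hypothetical table, purely for illustration.
    __tablename__ = "demo"
    id = Column(Integer, primary_key=True)
    deleted_at = Column(String, nullable=True)

# SQLAlchemy overloads `== None` to emit an IS NULL clause:
stmt = select(Demo).where(Demo.deleted_at == None)  # intentional == None
print("IS NULL" in str(stmt))  # True

# `is None` is evaluated eagerly by Python, yielding a plain bool instead
# of a SQL expression, so the query no longer checks for NULL at all.
print(Demo.deleted_at is None)  # False
```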
## Usage Notes
Some general things that may be helpful to call out.

### Adding a Datakinder vs an institutional user

The flow to add a Datakinder user is different from adding a user to an institution:

* Adding a user to an institution has to happen prior to that user creating an account (by allowlisting their email for a given institution).
* Adding a Datakinder user has to happen after that person has already created their account; their account's access type is then updated.

### Uploading files

The process to upload a file involves three API calls:

1. Get the GCS upload URL: `GET /institutions/{inst_id}/upload-url/{file_name}`
1. Post to the GCS upload URL: `POST <the_gcp_url_returned_from_step_1>`
1. Validate the file: `POST /institutions/{inst_id}/input/validate-upload/{file_name}` OR `POST /institutions/{inst_id}/input/validate-sftp/{file_name}`, depending on what input mechanism your file used. This sets a field in the File database table indicating the source of the file (`MANUAL_UPLOAD`, etc.), which is helpful information for the frontend.
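The three calls above can be sketched as small helpers (stdlib only; the endpoint paths come from this README, while anything about response shapes is an assumption to verify against the live API):

```python
import urllib.request

def upload_url_path(inst_id: str, file_name: str) -> str:
    # Step 1: ask the API for the GCS upload URL.
    return f"/institutions/{inst_id}/upload-url/{file_name}"

def gcs_upload_request(signed_url: str, data: bytes) -> urllib.request.Request:
    # Step 2: post the file bytes to the URL returned by step 1.
    return urllib.request.Request(signed_url, data=data, method="POST")

def validate_path(inst_id: str, file_name: str, via_sftp: bool = False) -> str:
    # Step 3: validate, picking the endpoint that matches how the file
    # arrived; this sets the File table's source field (e.g. MANUAL_UPLOAD).
    kind = "validate-sftp" if via_sftp else "validate-upload"
    return f"/institutions/{inst_id}/input/{kind}/{file_name}"
```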
## Local VSCode Debugging

From the Run & Debug panel (⇧⌘D on 🍎) you can run the [debug launch config](../../.vscode/launch.json) for the webapp or worker modules. This allows you to set breakpoints in the source code while the applications are running.