
Commit 3738af1

Merge pull request #84 from datakind/release/1.0.5
Release/1.0.5
2 parents 162f9ef + bf38b7e commit 3738af1

File tree

13 files changed: +775 -807 lines changed


CONTRIBUTING.md

Lines changed: 67 additions & 0 deletions
@@ -16,6 +16,73 @@ you can open a new issue using a relevant [issue form](https://github.com/dataki
As a general rule, we don’t assign issues to anyone. If you find an issue to work on, you are welcome to open a PR with a fix.

+### More complex configuration options
+
+All the configuration files must be located under the [config](dot/config) folder of the DOT.
+
+### Main config file
+
+The main config file must be called `dot_config.yml` and located at the top [config](dot/config) folder. Note that
+this file will be ignored for version control. You may use the [example dot_config yaml](dot/config/example/dot_config.yml)
+as a template.
+
+Besides the DOT DB connection in the paragraph above, see below for additional config options.
+
+#### Connection parameters for each of the projects to run
+
+For each of the projects you would like to run, add a key to the DOT config yaml with the following structure:
+```
+<project_name>_db:
+  type: connection type, e.g. postgres
+  host: host
+  user: username
+  pass: password
+  port: port number, e.g. 5432
+  dbname: database name
+  schema: schema name, e.g. public
+  threads: number of threads for DBT, e.g. 4
+```
+
+#### Output schema suffix
+
+The DOT generates two kinds of database objects:
+- Entities of the models that are being tested, e.g. assessments, follow-ups, patients
+- Results of the failing tests
+
+If nothing is done, these objects would be created in the same schema as the original data for the project
+(thus polluting the DB). If the key `output_schema_suffix` is added, its value will be added as a suffix; i.e. if the
+project data is stored in a certain schema, the output objects will go to `<project_schema>_<schema_suffix>`
+(e.g. to `public_tests` if the project schema is `public` and the suffix is set to `tests` in the lines above).
+
+Note that this mechanism uses a DBT feature, and that the same applies to the GE tests.
+
+#### Save passed tests
+
+The key `save_passed_tests` accepts boolean values. If set to true, the results of the passing tests will also be stored
+in the DOT DB. If not, only the results of failing tests will be stored.
+
+### Other config file locations
+Optional configuration for DBT and Great Expectations can be added, per project, in a structure as follows.
+
+```bash
+|____config
+| |____<project_name>
+| | |____dbt
+| | | |____profiles.yml
+| | | |____dbt_project.yml
+| | |____ge
+| | | |____great_expectations.yml
+| | | |____config_variables.yml
+| | | |____batch_config.json
+```
+In general these customizations will not be needed, only in some scenarios with particular requirements; they
+require a deeper knowledge of the DOT and of DBT and/or Great Expectations.
+
+There are examples for all of the files above under [this folder](dot/config/example/project_name). For each of the
+files you want to customize, you may copy and adapt the examples provided, following the directory structure above.
+
+More details in the [config README](dot/config/README.md).

## Making Code changes

## Setting up a Development Environment
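To make the newly added options concrete, here is a minimal sketch of how the keys described above might sit together in a `dot_config.yml`. This is an illustration only: the `ScanProject1_db` key name and every value are placeholders chosen for this example (following the `<project_name>_db` structure documented above), and it assumes `save_passed_tests` and `output_schema_suffix` are top-level keys in the file.

```yaml
# Illustrative sketch only -- placeholder values, assuming top-level keys
save_passed_tests: true        # also store results of passing tests in the DOT DB
output_schema_suffix: tests    # outputs go to <project_schema>_tests, e.g. public_tests

ScanProject1_db:               # one <project_name>_db block per project to run
  type: postgres
  host: localhost
  user: dot_user
  pass: change-me
  port: 5432
  dbname: scan_project1
  schema: public
  threads: 4
```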

README.md

Lines changed: 34 additions & 85 deletions
@@ -275,17 +275,16 @@ also use the DOT user interface for tests, for more details please see section s
The DOT will run tests against user-defined views onto the underlying data. These views are called "entities" and defined in table `dot.configured_entities`:

| Column | Description |
| :----------- | :----------- |
-| entity_id | UUID of the entity |
-| entity_name | Name of the entity e.g. ancview_danger_sign |
+| entity_id | Name of the entity e.g. ancview_danger_sign |
| entity_category | Category of the entity e.g. anc => needs to be in `dot.entity_categories` |
| entity_definition | String for the SQL query that defines the entity |

For example, this would be an insert command to create `ancview_danger_sign`:

```postgres-sql
-INSERT INTO dot.configured_entities VALUES('b05f1f9c-2176-46b0-8e8f-d6690f696b9b',
+INSERT INTO dot.configured_entities (project_id,entity_id,entity_category,entity_definition,date_added,date_modified,last_updated_by) VALUES('Project1',
'ancview_danger_sign', 'anc', '{{ config(materialized=''view'') }}
{% set schema = <schema> %}
@@ -294,8 +293,6 @@ from {{ schema }}.ancview_danger_sign');

```

-Note: UUID in the above statement will be overwritten with an automatically generated value.
-
All entities use Jinja macro statements - the parts between `{ ... }` - which the DOT uses to create the entity
materialized views in the correct database location. Use the above format for any new entities you create.

@@ -368,58 +365,58 @@ generated one.
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '0cdc9702-91e0-3499-b6f0-4dec12ad0f08', 'ASSESS-1', 3, '', '',
-'', 'dot_model__ancview_pregnancy', 'relationships', 'uuid', '',
+'', 'ancview_pregnancy', 'relationships', 'uuid', '',
$${"name": "danger_signs_with_no_pregnancy", "to": "ref('dot_model__ancview_danger_sign')", "field": "pregnancy_uuid"}$$,
'2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
```
2. `unique`
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '52d7352e-56ee-3084-9c67-e5ab24afc3a3', 'DUPLICATE-1', 3, '',
-'', '', '6ba8075f-6f35-4ff1-be3a-4c75d0884bf4', 'unique', 'uuid', 'alternative index?', '',
+'', '', 'ancview_pregnancy', 'unique', 'uuid', 'alternative index?', '',
'2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
```
3. `not_negative_string_column`
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '8aca2bee-9e95-3f8a-90e9-153714e05367', 'INCONSISTENT-1', 3,
-'', '', '', '95bd0f60-ab59-48fc-a62e-f256f5f3e6de', 'not_negative_string_column', 'patient_age_in_years', '',
+'', '', '', 'ancview_pregnancy', 'not_negative_string_column', 'patient_age_in_years', '',
$${"name": "patient_age_in_years"}$$, '2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
```
4. `not_null`
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '549c0575-e64c-3605-85a9-70356a23c4d2', 'MISSING-1', 3, '',
-'', '', '638ed10b-3a2f-4f18-9ca1-ebf23563fdc0', 'not_null', 'patient_id', '', '', '2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
+'', '', 'ancview_pregnancy', 'not_null', 'patient_id', '', '', '2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
```
5. `accepted_values`
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '935e6b61-b664-3eab-9d67-97c2c9c2bec0', 'INCONSISTENT-1', 3,
-'', '', '', '95bd0f60-ab59-48fc-a62e-f256f5f3e6de', 'accepted_values', 'fp_method_being_used', '',
+'', '', '', 'ancview_pregnancy', 'accepted_values', 'fp_method_being_used', '',
$${"values": ['oral mini-pill (progestogen)', 'male condom', 'female sterilization', 'iud', 'oral combination pill', 'implants', 'injectible']}$$,
'2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
```
6. `possible_duplicate_forms`
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '7f78de0e-8268-3da6-8845-9a445457cc9a', 'DUPLICATE-1', 3, '',
-'', '', '66f5d13a-8f74-4f97-836b-334d97932781', 'possible_duplicate_forms', '', '',
+'', '', 'ancview_pregnancy', 'possible_duplicate_forms', '', '',
$${"table_specific_reported_date": "delivery_date", "table_specific_patient_uuid": "patient_id", "table_specific_uuid": "uuid"}$$, '2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
```
7. `associated_columns_not_null`
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', 'd74fc600-31c3-307d-9501-5b7f6b09aff5', 'MISSING-1', 3, '',
-'', '', 'dot_model__iccmview_assessment', 'associated_columns_not_null', 'diarrhea_dx', 'diarrhea diagnosis',
+'', '', 'ancview_pregnancy', 'associated_columns_not_null', 'diarrhea_dx', 'diarrhea diagnosis',
$${"name": "diarrhea_dx_has_duration", "col_value": True, "associated_columns": ['max_symptom_duration']}$$,
'2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'your-name');
```
8. `expect_similar_means_across_reporters`
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '0cdc9702-91e0-3499-b6f0-4dec12ad0f08', 'BIAS-1', 3,
-'Test for miscalibrated thermometer', '', '', 'baf349c9-c919-40ff-a611-61ddc59c2d52', 'expect_similar_means_across_reporters',
+'Test for miscalibrated thermometer', '', '', 'ancview_pregnancy', 'expect_similar_means_across_reporters',
'child_temperature_pre_chw', '', '{"key": "reported_by","quantity": "child_temperature_pre_chw",
"form_name": "dot_model__iccmview_assessment","id_column": "reported_by"}', '2022-01-19 20:00:00.000 -0500',
'2022-01-19 20:00:00.000 -0500', 'your-name');
@@ -428,7 +425,7 @@ generated one.
<br><br>
```
INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '3081f033-e8f4-4f3b-aea8-36f8c5df05dc', 'INCONSISTENT-1', 3,
-'Wrong treatment/dosage arising from wrong age of children (WT-1)', '', '', 'baf349c9-c919-40ff-a611-61ddc59c2d52',
+'Wrong treatment/dosage arising from wrong age of children (WT-1)', '', '', 'ancview_pregnancy',
'expression_is_true', '', '',
$${"name": "t_under_24_months_wrong_dosage", "expression": "malaria_act_dosage is not null", "condition": "(patient_age_in_months<24) and (malaria_give_act is not null)"}$$,
'2022-02-14 19:00:00.000 -0500', '2022-02-14 19:00:00.000 -0500', 'your-name');
@@ -437,7 +434,7 @@ generated one.
<br><br>
Custom SQL queries require a special case because they must have `primary_table` and `primary_table_id_field` specified within the SQL query, as shown below:
```
-INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', 'c4a3da8f-32f4-4e9b-b135-354de203ca90', 'TREAT-1', 6, 'Test for new family planning method (NFP-1)', '', '', '95bd0f60-ab59-48fc-a62e-f256f5f3e6de', 'custom_sql', '', '',
+INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', 'c4a3da8f-32f4-4e9b-b135-354de203ca90', 'TREAT-1', 6, 'Test for new family planning method (NFP-1)', '', '', 'ancview_pregnancy', 'custom_sql', '', '',
format('{%s: %s}',
to_json('query'::text),
to_json($query$
@@ -527,72 +524,7 @@ custom SQL query. Given this, there is a useful Postgres function which will ret
see 'Seeing the raw data for failed tests' above.

-## More complex configuration options
-
-All the configuration files must be located under the [config](dot/config) folder of the DOT.
-
-### Main config file
-
-The main config file must be called `dot_config.yml` and located at the top [config](dot/config) folder. Note that
-this file will be ignored for version control. You may use the [example dot_config yaml](dot/config/example/dot_config.yml)
-as a template.
-
-Besides the DOT DB connection in the paragraph above, see below for additional config options.
-
-#### Connection parameters for each of the projects to run
-
-For each of the projects you would like to run, add a key to the DOT config yaml with the following structure:
-```
-<project_name>_db:
-  type: connection type e.g. postgres
-  host: host
-  user: username
-  pass: password
-  port: port number e.g 5432
-  dbname: database name
-  schema: schema name, e.g. public
-  threads: nubmer of threads for DBT, e.g. 4
-```
-
-#### Output schema suffix
-
-The DOT generates 2 kind of database objects:
-- Entities of the models that are being tested, e.g. assessments, follow ups, patients
-- Results of the failing tests
-
-If nothing is done, these objects would be created in the same schema as the original data for the project
-(thus polluting the DB). If the key `output_schema_suffix` is added, its value will be added as a suffix; i.e. if the
-project data is stored in a certain schema, the output objects will go to `<project_schema>_<schema_suffix>`
-(e.g. to `public_tests` if the project schema is `public` and the suffix is set to `tests` in the lines above).
-
-Note that this mechanism uses a DBT feature, and that the same applies to the GE tests.
-
-#### Save passed tests
-
-The key `save_passed_tests` accepts boolean values. If set to true, tha results of the passing tests will be also stored
-to the DOT DB. If not, only the results of failing tests will be stored.
-
-### Other config file locations
-Optional configuration for DBT and Great Expectations can be added, per project, in a structure as follows.
-
-```bash
-|____config
-| |____<project_name>
-| | |____dbt
-| | | |____profiles.yml
-| | | |____dbt_project.yml
-| | |____ge
-| | | |____great_expectations.yml
-| | | |____config_variables.yml
-| | | |____batch_config.json
-```
-In general these customizations will not be needed, but only in some scenarios with particular requirements; these
-require a deeper knowledge of the DOT and of either DBT and/or Great Expectations.
-
-There are examples for all the files above under [this folder](dot/config/example/project_name). For each of the
-files you want to customize, you may copy and adapt the examples provided following the directory structure above.
-
-More details in the [config README](dot/config/README.md).
+### Please refer to [CONTRIBUTING.md](./CONTRIBUTING.md) for information on more complex configuration options.

## How to visualize the results using Superset
@@ -706,7 +638,7 @@ NOTE: You might need to use docker-compose on some hosts.
`docker compose -f docker-compose-with-airflow.yml down -v`

-### Running the DOT in Airflow
+### Running the DOT in Airflow (Demo)

A DAG has been included which copies data from the uploaded DB dump into the DOT DB 'data_ScanProject1' schema, and then runs
the toolkit against this data. To do this ...
@@ -733,6 +665,23 @@ Or to run just DOT stage ...
`airflow tasks test run_dot_project run_dot 2022-03-01`

+### Running the DOT in Airflow (Connecting to external databases)
+
+The following instructions illustrate how to use a local Airflow environment, connecting to external databases for the data and the DOT.
+
+**NOTE:** These are for illustrative purposes only. If using Airflow in production, it's important that it is set up correctly,
+does not expose an HTTP connection to the internet, and has adequate network security (firewall, strong passwords, etc.).
+
+1. Edit [./dot/dot_config.yml] and set the correct parameters for your external dot_db
+2. Create a section for your data databases and set connection parameters
+3. If you have a DAG json file `dot_projects.json` already, deploy it into `./airflow/dags`
+4. Run steps 1-11 in [Configuring/Building Airflow Docker environment](#Configuring/Building Airflow Docker environment)
+5. Run steps 12 and 13, but use the values for your external databases you configured in `dot_config.yml`
+
+You will need to configure DOT tests and the DAG json file appropriately for your installation.
+
#### Adding more projects

If configuring Airflow in production, you will need to adjust `./docker/dot/dot_config.yml` accordingly. You can also
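The external-database steps added to the README above point at connection details in `./dot/dot_config.yml`. As a rough sketch of steps 1 and 2 only: this assumes the DOT DB block is keyed `dot_db` and that both it and the data-database block follow the same connection-parameter structure documented in CONTRIBUTING.md; all hostnames and credentials below are placeholders, not values from the repository.

```yaml
# Hypothetical sketch of ./dot/dot_config.yml for an external setup;
# key names and all values are assumptions/placeholders.
dot_db:                          # step 1: the external DOT database
  type: postgres
  host: dot-db.example.org
  user: dot_user
  pass: change-me
  port: 5432
  dbname: dot_db
  schema: dot
  threads: 4

ScanProject1_db:                 # step 2: one block per external data database
  type: postgres
  host: data-db.example.org
  user: data_reader
  pass: change-me
  port: 5432
  dbname: scan_project1
  schema: public
  threads: 4
```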

db/dot/4-upload_sample_dot_data.sql

Lines changed: 6 additions & 1 deletion
@@ -65,7 +65,7 @@ $${"table_specific_reported_date": "departure_time", "table_specific_patient_uui
"uuid", "table_specific_period": "day"}$$, '2021-12-23 19:00:00.000 -0500', '2022-03-21 19:00:00.000 -0500', 'Matt');

INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', 'c4a3da8f-32f4-4e9b-b135-354de203ca90', 'TREAT-1',
-5, 'Number of stops has a reasonible value', '', '', 'all_flight_data', 'custom_sql', '', '',
+5, 'Number of stops has a reasonable value', '', '', 'all_flight_data', 'custom_sql', '', '',
format('{%s: %s}',
to_json('query'::text),
to_json($query$
@@ -79,6 +79,11 @@ format('{%s: %s}',
)::json,
'2021-12-23 19:00:00.000 -0500', '2021-12-23 19:00:00.000 -0500', 'Lorenzo');

+INSERT INTO dot.configured_tests VALUES(TRUE, 'ScanProject1', '3081f033-e8f4-4f3b-aea8-36f8c5df05dc', 'INCONSISTENT-1',
+8, 'Price is a positive number for direct flights', '', '', 'all_flight_data', 'expression_is_true',
+'', '', $${"name": "t_direct_flights_positive_price", "expression": "price is not null and price > 0",
+"condition": "stops = 'non-stop'"}$$, '2022-12-10 19:00:00.000 -0500', '2022-12-10 19:00:00.000 -0500', 'Lorenzo');
+
COMMIT;

docker/appsmith/DOT App V2.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

docker/run_demo.py

Lines changed: 8 additions & 6 deletions
@@ -10,7 +10,7 @@
url_demo_data = "https://drive.google.com/uc?id=157Iad8mHnwbZ_dAeLQy5XfLihhcpD6yc"
filename_demo_data = "dot_demo_data.tar.gz"
-url_dot_ui = "http://localhost:82/app/data-observation-toolkit/run-log-634491ea0da61b0e9f38760d?embed=True"
+url_dot_ui = "http://localhost:82/app/data-observation-toolkit/run-log-634491ea0da61b0e9f38760d?embed=True" # pylint: disable=line-too-long

# Check if db, appsmith and tar file are there and if so, delete them.
os.chdir("demo/")
@@ -30,12 +30,12 @@
# Open/Extract tarfile
with tarfile.open(filename_demo_data) as my_tar:
-    my_tar.extractall('')
+    my_tar.extractall("")
    my_tar.close()

with open("./db/.env") as f:
-    demo_pwd=f.read().split("=")[1]
-    os.environ['POSTGRES_PASSWORD'] = demo_pwd
+    demo_pwd = f.read().split("=")[1]
+    os.environ["POSTGRES_PASSWORD"] = demo_pwd

# Composing and running container(s)
print("Starting DOT...\n")
@@ -49,8 +49,10 @@
webbrowser.open(url_dot_ui)

-print("In case DOT was not opened in your browser, please go to this URL: "
-      "http://localhost:82/app/data-observation-toolkit/run-log-634491ea0da61b0e9f38760d?embed=True\n")
+print(
+    "In case DOT was not opened in your browser, please go to this URL: "
+    "http://localhost:82/app/data-observation-toolkit/run-log-634491ea0da61b0e9f38760d?embed=True\n"
+)
input("Press return to stop DOT container\n")
print("Container is being stopped - we hope you enjoyed this demo :)")
docker.compose.stop()
