
Commit f2e3b82

Merge pull request #4536 from fedspendingtransparency/tst/emr-pipeline-testing

Downmerge EMR cutover branch into QAT

2 parents: eb59915 + 593367d

73 files changed: +325 additions, −273 deletions


.env.template

Lines changed: 2 additions & 2 deletions
@@ -39,8 +39,8 @@ USASPENDING_DB_PASSWORD=usaspender
 
 # The Broker configuration below supports tests creating a Broker DB on the usaspending-db
 # container as part of standing up the test suite.
-# All values of BROKER_DB_* must match what is in DATA_BROKER_DATABASE_URL if BOTH are given
-DATA_BROKER_DATABASE_URL=postgres://usaspending:usaspender@usaspending-db:5432/data_broker
+# All values of BROKER_DB_* must match what is in BROKER_DB if BOTH are given
+BROKER_DB=postgres://usaspending:usaspender@usaspending-db:5432/data_broker
 # Configuration values for a connection string to a Broker database
 # Only necessary for some management commands
 BROKER_DB_HOST=usaspending-db
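
The renamed comment says the discrete `BROKER_DB_*` values must agree with `BROKER_DB` when both are set; the GitHub Action below composes the URL from the same pieces. A minimal sketch of that consistency check (variable names come from this commit; everything else is illustrative):

```python
import os

# Compose the URL from the discrete parts, mirroring init-test-environment/action.yaml.
broker_db = (
    f"postgres://{os.environ['BROKER_DB_USER']}:{os.environ['BROKER_DB_PASSWORD']}"
    f"@{os.environ['BROKER_DB_HOST']}:{os.environ['BROKER_DB_PORT']}/{os.environ['BROKER_DB_NAME']}"
)
# Per the comment in .env.template: if both forms are given, they must match.
assert os.environ.get("BROKER_DB", broker_db) == broker_db
```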

.github/actions/init-test-environment/action.yaml

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ runs:
     - name: Set combined ENV
       shell: bash
       run: |
-        echo "DATA_BROKER_DATABASE_URL=postgres://$BROKER_DB_USER:$BROKER_DB_PASSWORD@$BROKER_DB_HOST:$BROKER_DB_PORT/$BROKER_DB_NAME" >> $GITHUB_ENV
+        echo "BROKER_DB=postgres://$BROKER_DB_USER:$BROKER_DB_PASSWORD@$BROKER_DB_HOST:$BROKER_DB_PORT/$BROKER_DB_NAME" >> $GITHUB_ENV
         echo "DATABASE_URL=postgres://$USASPENDING_DB_USER:$USASPENDING_DB_PASSWORD@$USASPENDING_DB_HOST:$USASPENDING_DB_PORT/$USASPENDING_DB_NAME" >> $GITHUB_ENV
         echo "DOWNLOAD_DATABASE_URL=postgres://$USASPENDING_DB_USER:$USASPENDING_DB_PASSWORD@$USASPENDING_DB_HOST:$USASPENDING_DB_PORT/$USASPENDING_DB_NAME" >> $GITHUB_ENV
         echo "ES_HOSTNAME=$ES_SCHEME://$ES_HOST:$ES_PORT" >> $GITHUB_ENV

README.md

Lines changed: 6 additions & 6 deletions
@@ -60,7 +60,7 @@ Create a `.envrc` file in the repo root, which will be ignored by git. Change cr
 ```shell
 export DATABASE_URL=postgres://usaspending:usaspender@localhost:5432/data_store_api
 export ES_HOSTNAME=http://localhost:9200
-export DATA_BROKER_DATABASE_URL=postgres://admin:root@localhost:5435/data_broker
+export BROKER_DB=postgres://admin:root@localhost:5435/data_broker
 ```
 
 If `direnv` does not pick this up after saving the file, type
@@ -220,10 +220,10 @@ Deployed production API endpoints and docs are found by following links here: `h
 
 3. To run all USAspending tests in the docker services run
    ```shell
-   docker compose run --rm -e DATA_BROKER_DATABASE_URL='' usaspending-test
+   docker compose run --rm -e BROKER_DB='' usaspending-test
    ```
-   _**NOTE**: If an env var named `DATA_BROKER_DATABASE_URL` is set, Broker Integration tests will attempt to be run as well. If doing so, Broker dependencies must be met (see below) or ALL tests will fail hard. Running the above command with `-e DATA_BROKER_DATABASE_URL=''` is a precaution to keep them excluded, unless you really want them (see below if so)._
+   _**NOTE**: If an env var named `BROKER_DB` is set, Broker Integration tests will attempt to be run as well. If doing so, Broker dependencies must be met (see below) or ALL tests will fail hard. Running the above command with `-e BROKER_DB=''` is a precaution to keep them excluded, unless you really want them (see below if so)._
 
 To run tests locally and not in the docker services, you need:
@@ -273,7 +273,7 @@ To satisfy these dependencies and include execution of these tests, do the follo
    ```shell
    docker build -t dataact-broker-backend ../data-act-broker-backend
    ```
-1. Ensure you have the `DATA_BROKER_DATABASE_URL` environment variable set, and it points to what will be a live PostgreSQL server (no database required) at the time tests are run.
+1. Ensure you have the `BROKER_DB` environment variable set, and it points to what will be a live PostgreSQL server (no database required) at the time tests are run.
 1. _WARNING: If this is set at all, then ALL above dependencies must be met or ALL tests will fail (Django will try this connection on ALL tests' run)_
 1. This DB could be one you always have running in a local Postgres instance, or one you spin up in a Docker container just before tests are run
 1. If invoking `pytest` within a docker container (e.g. using the `usaspending-test` container), you _must_ mount the host's docker socket. This is declared already in the `docker-compose.yml` file services, but would be done manually with: `-v /var/run/docker.sock:/var/run/docker.sock`
@@ -286,15 +286,15 @@ Re-running the test suite using `pytest -rs` with these dependencies satisfied s
 
 _From within a container_
 
-_**NOTE**: `DATA_BROKER_DATABASE_URL` is set in the `docker-compose.yml` file (and could pick up `.env` values, if set)_
+_**NOTE**: `BROKER_DB` is set in the `docker-compose.yml` file (and could pick up `.env` values, if set)_
 
 ```shell
 docker compose run --rm usaspending-test pytest --capture=no --verbose --tb=auto --no-cov --log-cli-level=INFO -k test_broker_integration
 ```
 
 _From Developer Desktop_
 
-_**NOTE**: `DATA_BROKER_DATABASE_URL` is set in the `.envrc` file and available in the shell_
+_**NOTE**: `BROKER_DB` is set in the `.envrc` file and available in the shell_
 ```shell
 pytest --capture=no --verbose --tb=auto --no-cov --log-cli-level=INFO -k test_broker_integration
 ```
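
The README note describes env-var gating of the Broker Integration tests. The project's real mechanism is heavier than a per-test skip (Django opens the Broker connection for the whole suite), but a sketch of the general env-gating idea, with a hypothetical test name, looks like this:

```python
import os
import pytest

# Illustrative only: skip a test subset when BROKER_DB is unset; the project's
# actual conftest/test wiring differs from this simple marker.
requires_broker = pytest.mark.skipif(
    not os.environ.get("BROKER_DB"),
    reason="BROKER_DB not set; Broker Integration tests excluded",
)

@requires_broker
def test_broker_integration_example():
    assert True  # placeholder body; a real test would hit the Broker DB
```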

docker-compose.yml

Lines changed: 7 additions & 3 deletions
@@ -44,7 +44,7 @@ services:
       DJANGO_DEBUG: ${DJANGO_DEBUG}
       DATABASE_URL: postgres://${USASPENDING_DB_USER}:${USASPENDING_DB_PASSWORD}@${USASPENDING_DB_HOST}:${USASPENDING_DB_PORT}/data_store_api
       ES_HOSTNAME: ${ES_HOSTNAME}
-      DATA_BROKER_DATABASE_URL: postgresql://${BROKER_DB_USER}:${BROKER_DB_PASSWORD}@${BROKER_DB_HOST}:${BROKER_DB_PORT}/data_broker
+      BROKER_DB: postgresql://${BROKER_DB_USER}:${BROKER_DB_PASSWORD}@${BROKER_DB_HOST}:${BROKER_DB_PORT}/data_broker
 
   usaspending-test:
     profiles:
@@ -68,7 +68,7 @@ services:
       DATABASE_URL: postgres://${USASPENDING_DB_USER}:${USASPENDING_DB_PASSWORD}@${USASPENDING_DB_HOST}:${USASPENDING_DB_PORT}/data_store_api
       ES_HOST: ${ES_HOST}
       ES_HOSTNAME: ${ES_HOSTNAME}
-      DATA_BROKER_DATABASE_URL: postgresql://${BROKER_DB_USER}:${BROKER_DB_PASSWORD}@${BROKER_DB_HOST}:${BROKER_DB_PORT}/data_broker
+      BROKER_DB: postgresql://${BROKER_DB_USER}:${BROKER_DB_PASSWORD}@${BROKER_DB_HOST}:${BROKER_DB_PORT}/data_broker
       MINIO_HOST: ${MINIO_HOST}
       DOWNLOAD_DATABASE_URL: postgres://${USASPENDING_DB_USER}:${USASPENDING_DB_PASSWORD}@${USASPENDING_DB_HOST}:${USASPENDING_DB_PORT}/data_store_api
       # Location in host machine where broker src code root can be found
@@ -107,7 +107,7 @@ services:
     environment:
       DATABASE_URL: postgres://${USASPENDING_DB_USER}:${USASPENDING_DB_PASSWORD}@${USASPENDING_DB_HOST}:${USASPENDING_DB_PORT}/data_store_api
       ES_HOSTNAME: ${ES_HOSTNAME}
-      DATA_BROKER_DATABASE_URL: postgresql://${BROKER_DB_USER}:${BROKER_DB_PASSWORD}@${BROKER_DB_HOST}:${BROKER_DB_PORT}/data_broker
+      BROKER_DB: postgresql://${BROKER_DB_USER}:${BROKER_DB_PASSWORD}@${BROKER_DB_HOST}:${BROKER_DB_PORT}/data_broker
       # Location in host machine where broker src code root can be found
       DATA_BROKER_SRC_PATH: "${PWD}/../data-act-broker-backend"
 
@@ -233,7 +233,11 @@ services:
       mkdir -p data/dti-da-public-files-nonprod/user_reference_docs
       # Create the bucket within MinIO used for endpoints that list generated downloads
       mkdir -p data/bulk-download
+      # Create the bucket for MinIO used for Spark
+      mkdir -p data/data/files
+      # Populate initial files in buckets
       cp dockermount/usaspending_api/data/Data_Dictionary_Crosswalk.xlsx data/dti-da-public-files-nonprod/user_reference_docs/Data_Dictionary_Crosswalk.xlsx
+      cp dockermount/usaspending_api/data/COVID-19_download_readme.txt data/data/files/COVID-19_download_readme.txt
       minio server --address ":10001" --console-address ":10002" /data
       "
     healthcheck:
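
Once the MinIO container is up, the new Spark bucket layout (`data` bucket, `files/` prefix) can be sanity-checked over the S3 API. A hedged sketch: the endpoint port comes from the `minio server --address ":10001"` line above, but the access keys below are placeholders for whatever MINIO credentials the compose file actually configures:

```python
import boto3  # assumes boto3 is installed

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:10001",  # MinIO API port from the compose entrypoint above
    aws_access_key_id="minioadmin",         # placeholder; use the compose file's MINIO credentials
    aws_secret_access_key="minioadmin",     # placeholder
)
resp = s3.list_objects_v2(Bucket="data", Prefix="files/")
for obj in resp.get("Contents", []):
    print(obj["Key"])  # expect: files/COVID-19_download_readme.txt
```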

loading_data.md

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ To load in the reference data, from the same directory as manage.py:
 
 To load certified submission data from the broker, you will need a read-only (or higher) connection string to the broker PostgreSQL database. If not running locally, you will also need to ensure your IP address has been whitelisted in the appropriate AWS Security Groups. Set this environment variable before running the **load_submission** command:
 
-    DATA_BROKER_DATABASE_URL=postgres://user:password@url:5432/data_broker
+    BROKER_DB=postgres://user:password@url:5432/data_broker
 
 To load a submission from data broker database:
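
A sketch of invoking the documented **load_submission** command with the renamed variable in the environment; the connection string and the submission ID are placeholders, and passing the submission ID as a positional argument is an assumption:

```python
import os
import subprocess

# Set BROKER_DB before manage.py starts so Django picks it up at settings load.
env = {**os.environ, "BROKER_DB": "postgres://user:password@url:5432/data_broker"}
# "1234" is a placeholder broker submission ID; run from the directory containing manage.py.
subprocess.run(["python", "manage.py", "load_submission", "1234"], env=env, check=True)
```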

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+from pyspark.sql.types import LongType, StringType, StructField, StructType, BooleanType
+
+
+AWARD_ID_LOOKUP_SCHEMA = StructType(
+    [
+        StructField("award_id", LongType(), False),
+        StructField("is_fpds", BooleanType(), False),
+        StructField("transaction_unique_id", StringType(), False),
+        StructField("generated_unique_award_id", StringType(), False),
+    ]
+)
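
A short usage sketch (not part of the commit), assuming `AWARD_ID_LOOKUP_SCHEMA` is imported from the new module (whose path is not captured in this view): the schema yields a fully typed DataFrame even with zero rows, and all four fields are declared non-nullable.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([], schema=AWARD_ID_LOOKUP_SCHEMA)  # empty but fully typed
df.printSchema()
# root
#  |-- award_id: long (nullable = false)
#  |-- is_fpds: boolean (nullable = false)
#  |-- transaction_unique_id: string (nullable = false)
#  |-- generated_unique_award_id: string (nullable = false)
```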

usaspending_api/awards/management/commands/generate_unlinked_awards_download.py

Lines changed: 6 additions & 4 deletions
@@ -16,6 +16,7 @@
 from usaspending_api.awards.management.sql.spark.unlinked_awards_summary_file import summary_file
 from usaspending_api.awards.management.sql.spark.unlinked_assistance_file_d2 import file_d2_sql_string
 from usaspending_api.awards.management.sql.spark.unlinked_accounts_file_c import file_c_sql_string
+from usaspending_api.config import CONFIG
 from usaspending_api.download.filestreaming.file_description import build_file_description, save_file_description
 from usaspending_api.download.filestreaming.zip_file import append_files_to_zip_file
 from usaspending_api.references.models.toptier_agency import ToptierAgency
@@ -108,9 +109,10 @@ def handle(self, *args, **options):
         # Save queries as delta tables for efficiency
         for delta_table_name, sql_file, final_name in self.download_file_list:
             df = self.spark.sql(sql_file)
-            df.write.format(source="delta").mode(saveMode="overwrite").option("overwriteSchema", "True").saveAsTable(
-                name=delta_table_name
-            )
+            df.write.format(source="delta").mode(saveMode="overwrite").options(
+                overwriteSchema=True,
+                path=f"s3a://{CONFIG.SPARK_S3_BUCKET}/{CONFIG.DELTA_LAKE_S3_PATH}/temp/{delta_table_name}",
+            ).saveAsTable(name=f"temp.{delta_table_name}")
 
         for agency in toptier_agencies:
             agency_name = agency["name"]
@@ -140,7 +142,7 @@ def process_data_copy_jobs(self, zip_file_path):
         self.filepaths_to_delete.append(zip_file_path)
 
         for delta_table_name, sql_file, final_name in self.download_file_list:
-            df = self.spark.sql(f"select * from {delta_table_name} where toptier_code = '{self._toptier_code}'")
+            df = self.spark.sql(f"select * from temp.{delta_table_name} where toptier_code = '{self._toptier_code}'")
             sql_file = None
             final_path = self._create_data_csv_dest_path(final_name)
             intermediate_data_file_path = final_path.parent / (final_path.name + "_temp")
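
The key change here is pinning each staging table to an explicit S3 location under a `temp.` schema, which makes it an external Delta table rather than one stored at the default warehouse location. A generic sketch of that pattern with placeholder names (the real bucket/prefix come from `CONFIG.SPARK_S3_BUCKET` and `CONFIG.DELTA_LAKE_S3_PATH`; assumes an active SparkSession with Delta Lake configured):

```python
# The temp schema/database must exist before saveAsTable targets it.
spark.sql("CREATE DATABASE IF NOT EXISTS temp")

df = spark.sql("SELECT 'sample' AS toptier_code")  # stand-in for the real download SQL
(
    df.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .option("path", "s3a://example-bucket/delta/temp/example_table")  # placeholder location
    .saveAsTable("temp.example_table")  # external table: metadata in catalog, data at the pinned path
)
```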

usaspending_api/broker/helpers/delete_fabs_transactions.py

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ def get_delete_pks_for_afa_keys(afa_ids_to_delete):
         is_active is not true
     """
 
-    with connections[settings.DATA_BROKER_DB_ALIAS].cursor() as cursor:
+    with connections[settings.BROKER_DB_ALIAS].cursor() as cursor:
         cursor.execute(sql, [uppercased])
         rows = cursor.fetchall()
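
These call sites assume a settings constant `BROKER_DB_ALIAS` naming a second entry in Django's `DATABASES`, presumably populated from the renamed `BROKER_DB` env var. A hedged sketch of that wiring; dj-database-url is illustrative and the alias string is an assumption, as the project's real settings module is not shown in this diff:

```python
import os
import dj_database_url  # illustrative dependency for parsing connection URLs

BROKER_DB_ALIAS = "data_broker"  # assumed alias string
DATABASES = {
    "default": dj_database_url.parse(os.environ["DATABASE_URL"]),
    BROKER_DB_ALIAS: dj_database_url.parse(os.environ["BROKER_DB"]),
}
```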

usaspending_api/broker/management/commands/derive_office_names.py

Lines changed: 3 additions & 3 deletions
@@ -16,16 +16,16 @@ class Command(load_base.Command):
     """
 
     help = "Derives all FABS office names from the office codes in the Office table in Data broker. The \
-        DATA_BROKER_DATABASE_URL environment variable must set so we can pull Office data from their db."
+        BROKER_DB environment variable must set so we can pull Office data from their db."
 
     def handle(self, *args, **options):
         # Grab data broker database connections
         if not options["test"]:
             try:
-                db_conn = connections[settings.DATA_BROKER_DB_ALIAS]
+                db_conn = connections[settings.BROKER_DB_ALIAS]
                 db_cursor = db_conn.cursor()
             except Exception as err:
-                logger.critical("Could not connect to database. Is DATA_BROKER_DATABASE_URL set?")
+                logger.critical("Could not connect to database. Is BROKER_DB set?")
                 logger.critical(print(err))
                 raise
         else:

usaspending_api/broker/management/commands/load_broker_table.py

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ def handle(self, *args, **options):
             f'Copying "{broker_schema_name}"."{broker_table_name}" from Broker to '
             f'"{usas_schema_name}"."{usas_table_name}" in USAspending.'
         )
-        broker_conn = connections[settings.DATA_BROKER_DB_ALIAS]
+        broker_conn = connections[settings.BROKER_DB_ALIAS]
         usas_conn = connections[settings.DEFAULT_DB_ALIAS]
         table_exists_query = f"""
             SELECT EXISTS (
