
SCHOL 399: merge etl pipeline QA env tests #1033

Open

bantucaravan wants to merge 10 commits into main from SCHOL-399/qa-tests-to-self-hosted

Conversation

@bantucaravan
Contributor

@bantucaravan bantucaravan commented Mar 30, 2026

Describe your changes

merge the gh-hosted and self-hosted workflow variants

remove unneeded setup used in the ci-self-hosted variant

retained pytest-rerunner usage

junitxml output removed because it was consumed

updated the hashFiles call based on the docs here: https://docs.github.com/en/actions/reference/workflows-and-actions/expressions#hashfiles
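For reference, per the linked docs, `hashFiles()` takes one or more glob patterns (relative to `GITHUB_WORKSPACE`) and returns a single hash over all matching files, so a pip cache key covering both requirements files might look like this (the paths and step names here are illustrative, not necessarily what this workflow uses):

```yaml
- name: Cache pip
  uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    # The key changes whenever either requirements file changes,
    # invalidating the cache on dependency updates.
    key: ${{ runner.os }}-pip-${{ hashFiles('etl-pipeline/requirements.txt', 'etl-pipeline/dev-requirements.txt') }}
```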

How to test

all tests should pass

previous core functionality worked on the self-hosted runners; see: https://github.com/NYPL/digital-research-books/actions/runs/23767475519/job/69250855487

switched authentication to the IAM roles recommended for use with the self-hosted runner, so I can confirm the tests work on the self-hosted runner before merging. It turns out the manual trigger only works if the workflow is in the "main" version, and by then we will have already triggered etl-pipeline-ci via the merge to main.
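For context on that last point: GitHub only lists a workflow for manual dispatch in the Actions UI once a version of the file exists on the default branch, which is why the trigger can't be exercised from this feature branch. A minimal trigger block (a sketch, not necessarily this workflow's exact config):

```yaml
on:
  # Manual runs: only available in the Actions UI once this file
  # exists on the default branch.
  workflow_dispatch:
  push:
    branches: [main]
```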
@vercel

vercel bot commented Mar 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: digital-research-books · Deployment: Ready (Preview, Comment) · Updated (UTC): Mar 31, 2026 2:36pm


Comment on lines +72 to +86
pip install --prefer-binary -r dev-requirements.txt
pip install --prefer-binary -r requirements.txt

- name: Run API integration tests against QA env
working-directory: ./etl-pipeline
run: |
pytest tests/integration/api --env=qa

- name: Run (some) functional tests against QA env
working-directory: ./etl-pipeline
run: |
python -m pytest tests/functional/processes/ingest \
--reruns=3 \
--reruns-delay=5 \
--env=qa
Contributor Author


we have only been running a subset of functional/integration tests on QA, post-merge of the PR into main and post-deployment of the new main HEAD to QA.

I guess this makes sense: all tests (unit, functional, QA) passed against the local env in the PR, so maybe we only need to run tests against code that is now running on QA infra rather than local infra, and we do not expect there to be differences between QA infra and local infra for tests that rely on DB data fixtures, etc.

However, I do wonder if we should be running all tests against QA?

@jdohan
@alea12

Contributor

@jdohan jdohan Mar 31, 2026


This raises a good point, since running tests against local builds requires seeding the Postgres DB to match a desired set of books in the vector DB (I'm currently building a test dataset for this in a dedicated testing namespace; #1025 achieves this for VRA integration tests). But considering we're using QA as our live environment for VRA, I think it's still best, with respect to defect prevention, to run the tests in etl-pipeline-tests.yaml against local builds prior to merge.

Contributor

@jdohan jdohan Mar 31, 2026


We can run the entire test suite after deploy, but I think only targeted checks of critical paths are necessary once the full suite has passed against local builds.

Contributor

@jdohan jdohan Mar 31, 2026


As I think further about it, restructuring the test suite soon would be beneficial, so that tests scoped strictly to legacy DRB functionality are separated from VRA tests. Since code changes for VRA (i.e. every code change being made now) have little risk of impacting legacy functionality (correct me if I'm wrong), we could aim to omit the legacy tests from checks on PRs targeting main and run them only before deploying to prod.

Top-level separation within the test suite would enable straightforward targeting of the two sets of tests, e.g.:

etl-pipeline/
└── tests/
    ├── drb/
    │   ├── unit/          # Isolated logic for legacy DRB components
    │   ├── integration/   # Database/API interactions for legacy flows
    │   └── functional/    # End-to-end legacy pipeline validation
    └── vra/
        ├── unit/          # Core logic for new VRA features
        ├── integration/   # Interaction between agent and search, DBs, API tests
        └── functional/    # End-to-end tests for VRA backend

There is likely some overlap in functional areas like ingest, which we'd have to account for when separating them in this manner.
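Under that layout, the split could be expressed directly in the workflow by path, mirroring the steps quoted above (a sketch assuming the hypothetical `drb/` and `vra/` directories):

```yaml
- name: Run legacy DRB tests (PR checks against main)
  working-directory: ./etl-pipeline
  run: python -m pytest tests/drb

- name: Run VRA tests (pre-prod deploy only)
  working-directory: ./etl-pipeline
  run: python -m pytest tests/vra --env=qa
```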

Comment on lines +84 to +85
--reruns=3 \
--reruns-delay=5 \
Contributor Author


Are these really necessary? These flags are only used here.


Contributor

@jdohan jdohan Mar 31, 2026


Unless we have known flaky tests, reruns are just a waste of time. Maybe 1, if anything.

Do note, however, that I'm not yet totally acquainted with the entire backend test suite, given its breadth, particularly the nuances of targeting different environments.
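If the reruns are kept at all, dropping to a single retry would look like this in the workflow step (the `--reruns`/`--reruns-delay` flags come from the pytest-rerunfailures plugin; whether that matches the pinned dev dependency here is an assumption):

```yaml
- name: Run (some) functional tests against QA env
  working-directory: ./etl-pipeline
  run: |
    # One rerun with a short delay absorbs transient QA-infra
    # flakiness without hiding consistently failing tests.
    python -m pytest tests/functional/processes/ingest \
      --reruns=1 \
      --reruns-delay=5 \
      --env=qa
```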
