@ValentineDragan ValentineDragan commented Oct 16, 2025

Pull Request Summary

This PR upgrades the model-engine Docker base image to use Chainguard's FIPS-compliant Python image, and fixes bugs in the CircleCI integration tests.

FIPS compliance changes:

  • Update Dockerfile to use the Chainguard base image for FIPS compliance
    • Delete the now-identical federal/Dockerfile copy
  • Upgrade SQLAlchemy to 2.0.21, which uses FIPS-compliant md5 hashing
    • This removes the need to monkey-patch the hashing library with sitecustomize.py, which was making integration tests fail because md5 is still needed for non-security hashing (e.g. generating Git/CircleCI hashes)
  • Set celery_enable_sha256: true in all configs for FIPS compliance

Fixing integration tests:

  • Update integration tests to use the current/latest model-engine image instead of a hardcoded image tag from 2024
  • Update helm chart to mount service configs in CircleCI
  • Add chainctl authentication to CircleCI to enable pulling the chainguard base image

Test Plan and Usage Guide

  • All unit tests and integration tests pass
    • (previously, integration tests weren't reflecting the latest repo changes because they used a hardcoded image)

@socket-security
Copy link

socket-security bot commented Oct 28, 2025

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff     Package                      Supply Chain Security  Vulnerability  Quality  Maintenance  License
Updated  sqlalchemy@2.0.4 ⏵ 2.0.21    97                     100            100      100          100


@ValentineDragan ValentineDragan changed the title Update Dockerfile with Chainguard base image Make model-engine FIPS compliant by updating base chainguard image Oct 28, 2025
@ValentineDragan ValentineDragan marked this pull request as ready for review October 28, 2025 23:48
{{- end }}
{{- end }}
{{- if $config_values }}
- name: service-config-volume

Collaborator

Curious whether you needed to add this for a specific reason. Do you actually use batch-job-orchestration-job?

Collaborator Author

Yes, if we don't add this change, there are integration tests running batch jobs that will fail because they can't find the service configs. See explanation below:

This is part of fixing the integration tests bug in the file below (rest_api_utils.py). Context (I debugged all this by SSH-ing into the instance running the CircleCI workflows and inspecting the kubernetes logs):

  • Some of the integration tests were using a hardcoded model engine image tag (830c81ecba2a147022e504917c6ce18b00c2af44) to run - see CREATE_DOCKER_IMAGE_BATCH_JOB_BUNDLE_REQUEST, CREATE_FINE_TUNE_DI_BATCH_JOB_BUNDLE_REQUEST..
  • The integration tests would spin up kubernetes pods for some batch jobs, and they were being created from the hardcoded model engine image. But that meant that new changes to model engine server might not actually be reflected in the integration tests, so I updated the rest_api_utils.py file to rebuild the container.
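
One way to avoid a pinned tag is to resolve it from the CI environment at test time. The sketch below is an illustration under assumptions, not the actual change to rest_api_utils.py: `MODEL_ENGINE_IMAGE_TAG` and `resolve_image_tag` are hypothetical names, while `CIRCLE_SHA1` is the commit SHA that CircleCI exposes to each job.

```python
from typing import Mapping

# Hypothetical override variable; not taken from the repository.
HYPOTHETICAL_TAG_ENV = "MODEL_ENGINE_IMAGE_TAG"


def resolve_image_tag(env: Mapping[str, str]) -> str:
    """Pick the image tag for the current build instead of a hardcoded one.

    An explicit override wins; otherwise fall back to the commit SHA
    that CircleCI provides as CIRCLE_SHA1.
    """
    for key in (HYPOTHETICAL_TAG_ENV, "CIRCLE_SHA1"):
        tag = env.get(key, "").strip()
        if tag:
            return tag
    raise RuntimeError(
        f"Set {HYPOTHETICAL_TAG_ENV} or CIRCLE_SHA1 to choose an image tag"
    )
```

In a test module this would typically be called as `resolve_image_tag(os.environ)`. Failing loudly when neither variable is set avoids silently testing a stale image.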

Fixing this bug caused the integration tests to fail because the old hardcoded image had the service configs copied inside the container image, whereas new images need to mount them instead. To confirm this is an isolated issue, I reran these tests on another branch where the only change was the model engine image tag, with no other changes (such as to the Dockerfile).

@@ -1,9 +1,9 @@
 # federal/Dockerfile.chainguard
-FROM cgr.dev/scale.com/python-fips:3.10.15-dev
+FROM cgr.dev/scale.com/python-fips:3.10.19-dev

Collaborator

@dmchoiboi dmchoiboi Nov 3, 2025

@andytang-scale can you check if this change is ok w/ your use case?

Collaborator

@dmchoiboi dmchoiboi left a comment

lgtm outside of changing the dockerfile used by fed. tagged @andytang-scale for review

Contributor

@andytang-scale andytang-scale left a comment

LGTM. Just checking, though: with the sitecustomize.py file removed, do the other md5 changes I see implemented still allow the Dockerfile to run? I added the general sitecustomize.py to make sure all hashing calls are caught and made FIPS-compliant.

@ValentineDragan
Collaborator Author

@andytang-scale Thank you 🙌🏻 Yes, previously sitecustomize.py was fixing the FIPS vulnerabilities by changing all md5 calls to sha256 (because md5 hashing for secrets/cybersecurity is not safe). However, this was causing the CircleCI integration tests to fail (because some CircleCI code does actually need md5 to compute git/image hashes). So instead, I bumped sqlalchemy to a version that fixes its FIPS-compliance issues, and marked the md5 calls we make in our code as non-security (we don't use them for secrets, we use them for image/commit hashes).

I tested with a trivy scan and the new Dockerfile.fips is fips compliant.

Also tested (with CircleCI integration tests) that both the standard Dockerfile and Dockerfile.fips are running correctly and passing all integration tests
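
For reference, marking an md5 call as non-security in Python looks roughly like the sketch below. It relies on the `usedforsecurity` keyword that `hashlib` constructors accept since Python 3.9; `short_content_tag` is a hypothetical helper for illustration, not a function from this repository.

```python
import hashlib


def short_content_tag(data: bytes, length: int = 12) -> str:
    """Derive a short, non-secret identifier (e.g. for an image or commit).

    usedforsecurity=False tells hashlib that md5 is used here only as a
    fingerprint, which lets the call be permitted in restricted
    (FIPS-style) environments instead of being rejected outright.
    """
    return hashlib.md5(data, usedforsecurity=False).hexdigest()[:length]
```

Unlike a global patch, this keeps real md5 output for callers that need it and makes the non-security intent explicit at each call site.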

@ValentineDragan ValentineDragan merged commit 210fa3e into main Nov 3, 2025
7 checks passed
@ValentineDragan ValentineDragan deleted the fix/fix-vulnerabilities-in-model-engine-image branch November 3, 2025 17:23