Description
🚀 Describe the new functionality needed
Llama Stack API Conformance Tests and API Stability
Conformance Tests
Llama Stack should have a conformance test suite for its "stable" APIs, guarding against major changes between z-streams (x1.y1.z1 -> x1.y1.z2). This test suite should ensure that the API surface (the functions, datatypes, and client usage) stays the same on each PR.
Stable APIs
In order to do this, we need to decide which APIs are "stable":
- Inference (definite)
- VectorIO (definite)
- DatasetIO (definite)
- Safety
- Scoring
- Telemetry (definite)
- Tools
- OpenAI compatible APIs like Responses
 
(Note: I chose these APIs based on the work I have seen being done; there could be others I am missing.)
These APIs, especially the ones marked "definite", should be continuously validated for backwards compatibility between z-streams and only broken in well-defined ways across y-streams.
Tests
Add a new test suite called api-conformance that runs a matrix over the above APIs. Each API should pass the following tests:
1. Schema Snapshot Test
This test compares the newly generated openapi.json file against a previously saved, correct version. It is the most effective way to catch any structural changes, such as removed endpoints or modified data fields.
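As a rough sketch (the spec and baseline file locations below are assumptions, not the repository's actual layout), such a snapshot test could simply diff the freshly generated spec against a checked-in baseline:
# tests/api_conformance/test_openapi_snapshot.py
import json
from pathlib import Path

# Assumed paths: wherever the generator writes the spec, and where the baseline is stored.
GENERATED_SPEC = Path("docs/_static/llama-stack-spec.json")
BASELINE_SPEC = Path("tests/api_conformance/baselines/openapi-v1.json")

def test_openapi_schema_matches_baseline():
    """Fail if the generated OpenAPI document drifts from the stored baseline."""
    generated = json.loads(GENERATED_SPEC.read_text())
    baseline = json.loads(BASELINE_SPEC.read_text())
    assert generated == baseline, (
        "OpenAPI schema changed; update the baseline deliberately or bump the API version."
    )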
2. Automated Contract Test
This test uses the old OpenAPI schema as a contract to automatically generate and run API requests against new code. It verifies that the new implementation still provides valid responses that conform to the old contract's rules.
This could use something like Schemathesis
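A sketch of what that could look like with Schemathesis's pytest integration (the baseline path and base URL are assumptions, and the test expects a server running the new code):
# tests/api_conformance/test_contract.py
import schemathesis

# Load the *old* spec as the contract and point it at a server running the new code.
schema = schemathesis.from_path(
    "tests/api_conformance/baselines/openapi-v1.json",  # assumed baseline location
    base_url="http://localhost:8321",                   # assumed local Llama Stack server
)

@schema.parametrize()
def test_new_code_honors_old_contract(case):
    # Generate a request from the old contract, send it, and validate the response against it.
    case.call_and_validate()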
3. Added/Removed Endpoint Test
When an endpoint is removed or added, this test confirms the expected behavior: calling a removed endpoint's URL now returns a 404 Not Found (or a reasonable deprecation notice), and any newly added endpoints are not mandatory for providers to implement. This ensures that old clients receive a predictable error instead of an unexpected one and, even more importantly, that external providers do not hit breaking changes between z-streams such as suddenly not "implementing all required methods" of an API.
Changing the required surface of an API will break things like external providers and these sorts of changes likely should only happen on major version bumps.
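A minimal sketch of the surface check (reusing the same assumed baseline/generated spec locations as above) could diff the set of (method, path) operations and fail on removals:
# tests/api_conformance/test_endpoint_surface.py
import json
from pathlib import Path

BASELINE_SPEC = Path("tests/api_conformance/baselines/openapi-v1.json")   # assumed path
GENERATED_SPEC = Path("docs/_static/llama-stack-spec.json")               # assumed path

def _operations(spec: dict) -> set[tuple[str, str]]:
    """Flatten an OpenAPI document into a set of (METHOD, path) pairs."""
    return {
        (method.upper(), path)
        for path, item in spec.get("paths", {}).items()
        for method in item
        if method.lower() in {"get", "post", "put", "patch", "delete"}
    }

def test_no_stable_endpoints_removed():
    old = _operations(json.loads(BASELINE_SPEC.read_text()))
    new = _operations(json.loads(GENERATED_SPEC.read_text()))
    removed = old - new
    assert not removed, f"Stable endpoints removed without a version bump: {sorted(removed)}"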
4. Pydantic Model Test
This test ensures that new Pydantic models can still correctly parse and validate data that is structured according to the old models. This prevents errors when clients send data in a previously valid format.
This would test both the Pydantic models of the API types AND the Pydantic models of the BuildConfig, the DistributionConfig, and the StackRunConfig. All of these types make up our "public facing API". I will explain these tests more specifically below.
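For the API and config types, a sketch of this check, assuming a hypothetical fixture directory of payloads captured from earlier releases (the directory and file layout are illustrative; the import path is the one used elsewhere in this issue), might look like:
# tests/api_conformance/test_pydantic_compat.py
import json
from pathlib import Path

from llama_stack.distribution.datatypes import StackRunConfig  # import path as used in this issue

# Hypothetical fixture directory holding run configs captured from previous releases, serialized as JSON.
OLD_PAYLOADS = Path("tests/api_conformance/fixtures/run_configs")

def test_old_run_configs_still_validate():
    """Every previously valid run config must still parse against the current model."""
    for payload_file in sorted(OLD_PAYLOADS.glob("*.json")):
        data = json.loads(payload_file.read_text())
        StackRunConfig.model_validate(data)  # raises ValidationError if the format broke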
Versioning and Version Enforcement
In addition to conformance testing, a useful tool (that we already have but don't enforce) is object versioning. Both the StackRunConfig (run.yaml) and BuildConfig (build.yaml) objects have versions.
Using pytest-snapshot, we can include a test in the api-conformance suite or in an additional object-versioning suite. It would work like this:
# tests/test_models.py
import json

from llama_stack.distribution.datatypes import BuildConfig

def test_build_config_v1_schema_is_unchanged(snapshot):
    """
    Ensures the V1 schema never changes.
    """
    # pytest-snapshot compares strings, so serialize the schema deterministically.
    snapshot.assert_match(
        json.dumps(BuildConfig.model_json_schema(), indent=2, sort_keys=True),
        'stored_build_config_v1_schema.json',
    )
pytest-snapshot generates stored_build_config_v1_schema.json when run with --snapshot-update; on normal runs it compares the freshly generated schema against the stored file and fails if any differences are found.
The stability of these datatypes is very important: consumers already have stored build and run configs for their specific scenarios, and breaking them causes upgrade hesitation and headaches.
Additionally, if we actually want to make a breaking change, we should have a test to ensure that breaking changes only come with version bumps. This allows us to keep one or two older schemas around for backwards compatibility:
# tests/test_models.py
from llama_stack.distribution.datatypes import StackRunConfigV1, StackRunConfigV2

def test_structural_changes_require_version_bump():
    """
    Compares two model schemas to ensure a structural change
    is accompanied by a version change.
    """
    v1_schema = StackRunConfigV1.model_json_schema()
    v2_schema = StackRunConfigV2.model_json_schema()

    # Isolate the data fields, excluding the version field itself
    v1_properties = {k: v for k, v in v1_schema["properties"].items() if k != "version"}
    v2_properties = {k: v for k, v in v2_schema["properties"].items() if k != "version"}

    # The declared version values (assumes the version field carries a default or const in its schema)
    v1_version = v1_schema["properties"]["version"].get("default") or v1_schema["properties"]["version"].get("const")
    v2_version = v2_schema["properties"]["version"].get("default") or v2_schema["properties"]["version"].get("const")

    # If the properties have changed...
    if v1_properties != v2_properties:
        # ...then the version MUST have also changed.
        assert v1_version != v2_version
    else:
        # Otherwise, the version should be the same.
        assert v1_version == v2_version
API Versioning
Beyond testing of the current API and its datatypes, the API likely needs to be packaged apart from the rest of Llama Stack and versioned differently. We have already seen instances of this, like #2978 (comment), where external providers rely on the Llama Stack datatypes but need to import the entirety of LLS to use them, creating a strange dependency situation.
Our API, the associated datatypes, functions, etc. should likely be packaged in a llama_stack.api installable that is versioned differently and at a slower pace than the utilities and other functionality the core llama_stack package contains.
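Purely as an illustration of the proposed split (none of these module names exist today; they are hypothetical), an external provider would then depend only on the slim API package rather than all of llama_stack:
# Hypothetical imports under the proposed llama_stack.api package split;
# the module and class names below are illustrative only.
from llama_stack.api.inference import Inference          # API protocol
from llama_stack.api.datatypes import SamplingParams     # shared datatype

class MyRemoteInferenceProvider(Inference):
    """External provider that depends only on the separately versioned API package."""
    ...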
💡 Why is this needed? What if we don't build it?
Without ensuring not only backwards compatibility of datatypes between versions but also stability of Llama Stack between y- and z-streams, it will be difficult to gain users both upstream and within products. The tests and compatibility strategies outlined above are a good way to get started and to keep the upstream maintainers/contributors aware of the kinds of changes being introduced.
Other thoughts
No response