Description
🚀 Describe the new functionality needed
Llama Stack API Conformance Tests and API Stability
Conformance Tests
Llama Stack should have a conformance test suite for its "stable" APIs, guarding against major changes between z-streams (x1.y1.z1 -> x1.y1.z2). This test suite should ensure that the API surface (the functions, datatypes, and client usage) stays the same on each PR.
Stable APIs
In order to do this, we need to decide which APIs are "stable":
- Inference (definite)
- VectorIO (definite)
- DatasetIO (definite)
- Safety
- Scoring
- Telemetry (definite)
- Tools
- OpenAI compatible APIs like Responses
 
(Note: I chose these APIs based on the work I have seen being done; there could be others I am missing.)
These APIs, especially the ones marked "definite", should be continuously validated for backwards compatibility between z-streams and only broken in well-defined ways across y-streams.
Tests
Add a new test suite called api-conformance that runs a matrix over the above APIs. Each API should pass the following tests:
1. Schema Snapshot Test
This test compares the newly generated openapi.json file against a previously saved, correct version. It is the most effective way to catch any structural changes, such as removed endpoints or modified data fields.
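As a rough sketch (the spec and baseline file locations below are assumptions, not the repository's actual layout), such a snapshot test could simply diff the freshly generated spec against a checked-in baseline:
# tests/api_conformance/test_openapi_snapshot.py
import json
from pathlib import Path

# Assumed paths: wherever the generator writes the spec, and where the baseline is stored.
GENERATED_SPEC = Path("docs/_static/llama-stack-spec.json")
BASELINE_SPEC = Path("tests/api_conformance/baselines/openapi-v1.json")

def test_openapi_schema_matches_baseline():
    """Fail if the generated OpenAPI document drifts from the stored baseline."""
    generated = json.loads(GENERATED_SPEC.read_text())
    baseline = json.loads(BASELINE_SPEC.read_text())
    assert generated == baseline, (
        "OpenAPI schema changed; update the baseline deliberately or bump the API version."
    )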
2. Automated Contract Test
This test uses the old OpenAPI schema as a contract to automatically generate and run API requests against new code. It verifies that the new implementation still provides valid responses that conform to the old contract's rules.
This could use something like Schemathesis
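A sketch of what that could look like with Schemathesis's pytest integration (the baseline path and base URL are assumptions, and the test expects a server running the new code):
# tests/api_conformance/test_contract.py
import schemathesis

# Load the *old* spec as the contract and point it at a server running the new code.
schema = schemathesis.from_path(
    "tests/api_conformance/baselines/openapi-v1.json",  # assumed baseline location
    base_url="http://localhost:8321",                   # assumed local Llama Stack server
)

@schema.parametrize()
def test_new_code_honors_old_contract(case):
    # Generate a request from the old contract, send it, and validate the response against it.
    case.call_and_validate()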
3. Added/Removed Endpoint Test
When an endpoint is removed or added, this test confirms the expected behavior: calling a removed endpoint's URL now returns a 404 Not Found (or a reasonable deprecation notice), and any newly added endpoints are not mandatory for providers to implement. This ensures that old clients receive a predictable error instead of an unexpected one and, even more importantly, that external providers do not hit breaking changes between z-streams such as suddenly not "implementing all required methods" of an API.
Changing the required surface of an API will break things like external providers and these sorts of changes likely should only happen on major version bumps.
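A minimal sketch of the surface check (reusing the same assumed baseline/generated spec locations as above) could diff the set of (method, path) operations and fail on removals:
# tests/api_conformance/test_endpoint_surface.py
import json
from pathlib import Path

BASELINE_SPEC = Path("tests/api_conformance/baselines/openapi-v1.json")   # assumed path
GENERATED_SPEC = Path("docs/_static/llama-stack-spec.json")               # assumed path

def _operations(spec: dict) -> set[tuple[str, str]]:
    """Flatten an OpenAPI document into a set of (METHOD, path) pairs."""
    return {
        (method.upper(), path)
        for path, item in spec.get("paths", {}).items()
        for method in item
        if method.lower() in {"get", "post", "put", "patch", "delete"}
    }

def test_no_stable_endpoints_removed():
    old = _operations(json.loads(BASELINE_SPEC.read_text()))
    new = _operations(json.loads(GENERATED_SPEC.read_text()))
    removed = old - new
    assert not removed, f"Stable endpoints removed without a version bump: {sorted(removed)}"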
4. Pydantic Model Test
This test ensures that new Pydantic models can still correctly parse and validate data that is structured according to the old models. This prevents errors when clients send data in a previously valid format.
This would test both the Pydantic models of the API types AND the Pydantic models of the BuildConfig, the DistributionConfig, and the StackRunConfig. All of these types make up our "public facing API". I will explain these tests more specifically below.
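For the API and config types, a sketch of this check, assuming a hypothetical fixture directory of payloads captured from earlier releases (the directory and file layout are illustrative; the import path is the one used elsewhere in this issue), might look like:
# tests/api_conformance/test_pydantic_compat.py
import json
from pathlib import Path

from llama_stack.distribution.datatypes import StackRunConfig  # import path as used in this issue

# Hypothetical fixture directory holding run configs captured from previous releases, serialized as JSON.
OLD_PAYLOADS = Path("tests/api_conformance/fixtures/run_configs")

def test_old_run_configs_still_validate():
    """Every previously valid run config must still parse against the current model."""
    for payload_file in sorted(OLD_PAYLOADS.glob("*.json")):
        data = json.loads(payload_file.read_text())
        StackRunConfig.model_validate(data)  # raises ValidationError if the format broke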
Versioning and Version Enforcement
In addition to conformance testing, a useful tool (that we already have but don't enforce) is object versioning. Both the StackRunConfig (run.yaml) and BuildConfig (build.yaml) objects have versions.
Using pytest-snapshot, we can include a test in the api-conformance suite or in an additional object-versioning suite. It would work like this:
# tests/test_models.py
import json

from llama_stack.distribution.datatypes import BuildConfig

def test_build_config_v1_schema_is_unchanged(snapshot):
    """
    Ensures the V1 schema never changes.
    """
    # pytest-snapshot compares strings, so serialize the schema deterministically.
    snapshot.assert_match(
        json.dumps(BuildConfig.model_json_schema(), indent=2, sort_keys=True),
        'stored_build_config_v1_schema.json',
    )
pytest-snapshot generates stored_build_config_v1_schema.json when run with --snapshot-update; on normal runs it compares the freshly generated schema against the stored file and fails if any differences are found.
The stability of these datatypes is very important: consumers already have stored build and run configs for their specific scenarios, and breaking them causes upgrade hesitation and headaches.
Additionally, if we actually want to make a breaking change, we should have a test to ensure that breaking changes only come with version bumps. This allows us to keep one or two older schemas around for backwards compatibility:
# tests/test_models.py
from llama_stack.distribution.datatypes import StackRunConfigV1, StackRunConfigV2

def test_structural_changes_require_version_bump():
    """
    Compares two model schemas to ensure a structural change
    is accompanied by a version change.
    """
    v1_schema = StackRunConfigV1.model_json_schema()
    v2_schema = StackRunConfigV2.model_json_schema()

    # Isolate the data fields, excluding the version field itself
    v1_properties = {k: v for k, v in v1_schema["properties"].items() if k != "version"}
    v2_properties = {k: v for k, v in v2_schema["properties"].items() if k != "version"}

    # The declared version values (assumes the version field carries a default or const in its schema)
    v1_version = v1_schema["properties"]["version"].get("default") or v1_schema["properties"]["version"].get("const")
    v2_version = v2_schema["properties"]["version"].get("default") or v2_schema["properties"]["version"].get("const")

    # If the properties have changed...
    if v1_properties != v2_properties:
        # ...then the version MUST have also changed.
        assert v1_version != v2_version
    else:
        # Otherwise, the version should be the same.
        assert v1_version == v2_version
API Versioning
Beyond testing of the current API and its datatypes, the API likely needs to be packaged apart from the rest of Llama Stack and versioned differently. We have already seen instances of this, like #2978 (comment), where external providers rely on the Llama Stack datatypes but need to import the entirety of LLS to use them, creating a strange dependency situation.
Our API, the associated datatypes, functions, etc. should likely be packaged in a llama_stack.api installable that is versioned differently and at a slower pace than the utilities and other functionality the core llama_stack package contains.
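Purely as an illustration of the proposed split (none of these module names exist today; they are hypothetical), an external provider would then depend only on the slim API package rather than all of llama_stack:
# Hypothetical imports under the proposed llama_stack.api package split;
# the module and class names below are illustrative only.
from llama_stack.api.inference import Inference          # API protocol
from llama_stack.api.datatypes import SamplingParams     # shared datatype

class MyRemoteInferenceProvider(Inference):
    """External provider that depends only on the separately versioned API package."""
    ...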
💡 Why is this needed? What if we don't build it?
Without ensuring not only backwards compatibility of datatypes between versions but also stability of Llama Stack between y- and z-streams, it will be difficult to gain users both upstream and within products. The tests and compatibility strategies outlined above are a good way to get started and to keep the upstream maintainers/contributors aware of the kinds of changes being introduced.
Other thoughts
No response