
Conversation

@avinash2692 (Contributor)

Description

This PR addresses #53 and is a follow-up to #67.

  • The PR marks stochastic tests with an llm marker so that they are expected to fail at runtime.
  • This only happens when the GITHUB_ACTION == 1 environment variable is set, so users running locally still execute the full test suite, including the stochastic tests (see the sketch below this list).
  • Also adds a GitHub Action that runs the tests once the quality checks pass.
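
As a rough illustration only (not the PR's actual hook; the environment-variable name follows the description above and is corrected later in the thread), the marking mechanism could look something like this in conftest.py:

import os

import pytest


def pytest_collection_modifyitems(config, items):
    # Only downgrade stochastic tests on the CI runner; local runs execute everything.
    if os.environ.get("GITHUB_ACTION") != "1":
        return
    for item in items:
        if item.get_closest_marker("llm"):
            # Tests asserting on stochastic LLM output are allowed to fail on CI.
            item.add_marker(pytest.mark.xfail(reason="stochastic LLM output on CI runner"))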

@jakelorocco (Contributor) left a comment


lgtm; a few suggestions

Also, is it possible to do a dry run of these actions somehow to make sure the tests run on the GitHub runners?

@pytest.fixture(scope="module")
def backend():
    """Shared HuggingFace backend for all tests in this module."""
    # TODO: find a small 1B model to do Alora stuff on github actions.

Can we remove this or open a separate issue to fix this?

test/conftest.py Outdated
Comment on lines 28 to 48
# # Check if there is a session fixture.
# try:
#     session: MelleaSession = item._request.getfixturevalue("m_session")
# except Exception:
#     # Skip test cause all llm marked tests need a session fixture.
#     pytest.skip("`llm` marked tests requires a `m_session` fixture.")
# # Get the Ollama name.
# if isinstance(session.backend, OllamaModelBackend) or isinstance(session.backend, OpenAIBackend):
#     model_id = session.backend.model_id.ollama_name
#     # Skip tests if the model name is llama 1b
#     if model_id == "llama3.2:1b":
#         pytest.skip(
#             "Skipping LLM test: got model_id == llama3.2:1b in ollama. Used only in gh workflows."
#         )
# elif isinstance(session.backend, LocalHFBackend):
#     model_id = session.backend.model_id.hf_model_name
#     # Skip tests if the model name is llama 1b
#     if model_id == "unsloth/Llama-3.2-1B":
#         pytest.skip(
#             "Skipping LLM test: got model_id == unsloth/Llama-3.2-1B in hf. Used only in gh workflows."
#         )

Can we remove this commented-out code?

def test_session_with_parameters(model_id):
    """Test contextual session with custom parameters."""
    with start_session(backend_name="ollama", model_id="granite3.3:8b") as m:
    with start_session(backend_name="ollama", model_id=META_LLAMA_3_2_1B) as m:

Should this be using the model_id fixture?
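
For reference, the change this question points at would look roughly like the following sketch (names are taken from the quoted snippet; the test body is elided):

def test_session_with_parameters(model_id):
    """Test contextual session with custom parameters."""
    # Use the shared model_id fixture instead of a hard-coded model name.
    with start_session(backend_name="ollama", model_id=model_id) as m:
        ...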

pyproject.toml Outdated

[tool.pytest.ini_options]
markers = [
    "llm: Marks the test as needing an exact output from an LLM (deselect with '-m \" not llm\"'); this depends on the session.backend.model_id"

I feel like some of the tests you've marked with this flag don't actually require an exact output. (I commented on the ones in huggingface that I feel shouldn't have been, but it looks like the same applies to the other backends.)

Also, I think the naming is a bit confusing here. Can we do something like qualitative or output_checked or something along those lines? Because some of the tests that still run use LLMs.

@avinash2692 (Contributor, Author)

Makes sense. I changed it to quantitative. Let me know if that works.
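
As an aside, a marker renamed along these lines could also be registered from conftest.py rather than pyproject.toml; a minimal sketch (the marker name here is only illustrative):

def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "qualitative: test asserts on the exact content of an LLM response "
        "(deselect with -m 'not qualitative')",
    )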

session.reset()


@pytest.mark.llm

This doesn't require the output to be a specific value.

print(result)


@pytest.mark.llm

I think it might be fair to expect the alora to output only Y/N. But maybe it gets it wrong enough that we need to mark it this way.

assert alora_output in ["Y", "N"], alora_output


@pytest.mark.llm

Same as above here. It should be forced to Y/N here.

assert str(val_result.reason) not in ["Y", "N"]


@pytest.mark.llm

This doesn't require the output to be a specific value.

)


@pytest.mark.llm

This doesn't require the output to be a specific value.

assert len(results) == len(prompts)


@pytest.mark.llm

This doesn't require the output to be a specific value.

@avinash2692 (Contributor, Author)

> lgtm; a few suggestions
>
> Also, is it possible to do a dry run of these actions somehow to make sure the tests run on the GitHub runners?

There isn't a way to do it on GitHub (or I haven't found it yet), but I have been using https://github.com/nektos/act to test it locally on a Docker machine. Some of it is failing now because of a stupid typo in the env variables (GITHUB_ACTIONS --> GITHUB_ACTION) 🤦, but I will fix that and your suggestions in a commit in a bit.

@jakelorocco (Contributor) left a comment


lgtm; I will approve but have a few comments:

  1. I think you are still applying the qualitative marker too broadly; I saw a fair number of tests that generate output but don't check its content getting flagged with it
  2. I think the watsonx skip is fine for now, but we should come up with a more systematic way of skipping backends

@avinash2692 (Contributor, Author)

> lgtm; I will approve but have a few comments:
>
> 1. I think you are still applying the qualitative marker too broadly; I saw a fair number of tests that generate output but don't check its content getting flagged with it
>
> 2. I think the watsonx skip is fine for now, but we should come up with a more systematic way of skipping backends
  1. Agree. I think this is something we might need to address in a larger PR on how we separate our tests into inference vs. non-inference.
  2. Agree. I think this is probably a job for a DummyBackend at some point (a rough sketch of one possible shape follows below).
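
Purely as a sketch of what a more systematic skip could look like (the backend marker and the SKIP_BACKENDS variable below are hypothetical, not something this PR adds):

import os

import pytest

SKIPPED_BACKENDS = {b for b in os.environ.get("SKIP_BACKENDS", "").split(",") if b}


def pytest_collection_modifyitems(config, items):
    for item in items:
        marker = item.get_closest_marker("backend")  # hypothetical per-backend marker
        if marker and marker.args and marker.args[0] in SKIPPED_BACKENDS:
            item.add_marker(
                pytest.mark.skip(reason=f"{marker.args[0]} backend disabled via SKIP_BACKENDS")
            )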

@avinash2692 merged commit 03d93b4 into main on Aug 26, 2025
3 checks passed
@avinash2692 deleted the avi/marking-test-as-stochastic branch on August 26, 2025 at 18:03