test LLM output for semantic similarity using vector embeddings#59
Closed
carl wants to merge 36 commits into thisisartium:main
Pull Request Overview
This PR adjusts the similarity threshold in the hallucination test example from 70% to 64% to better align with updated expectations. Key changes include:
- Lowering the cosine similarity threshold in the if condition and assert statement.
- Renaming file handle variables from "f" to "fp" and adding noinspection comments.
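The threshold check discussed in this review can be sketched as below. This is a minimal illustration, not the code from the PR: `cosine_similarity`, `is_similar`, and `SIMILARITY_THRESHOLD` are hypothetical names, and only the 0.64 value comes from the review text. Real embeddings would come from an embedding model rather than hand-written lists.

```python
import math

# Threshold value taken from the review summary; the right value depends on
# the embedding model and on the texts being compared.
SIMILARITY_THRESHOLD = 0.64


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_similar(a, b, threshold=SIMILARITY_THRESHOLD):
    """Hypothetical helper: True when the similarity reaches the threshold."""
    return cosine_similarity(a, b) >= threshold
```

A fixed threshold like this is sensitive to the embedding model, which may be why the later revision of the PR compares similarity scores against each other instead.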
Files not reviewed (2)
- examples/team_recommender/tests/example_1_text_response/snapshots/test_good_fit_for_project/test_llm_will_hallucinate_given_no_data/hallucination_response.txt: Language not supported
- examples/team_recommender/tests/fixtures/hallucination_response.json: Language not supported
Comments suppressed due to low confidence (2)
examples/team_recommender/tests/example_1_text_response/test_good_fit_for_project.py:124
- Verify that lowering the threshold to 0.64 accurately reflects the intended test behavior and does not inadvertently allow borderline cases to pass.
if cosine_similarity < 0.64:
examples/team_recommender/tests/example_1_text_response/test_good_fit_for_project.py:146
- [nitpick] Consider using a more descriptive variable name instead of 'fp' for the file handle to enhance readability.
) as fp:
Pull Request Overview
This PR adjusts the expectation for hallucination detection in the team recommender tests by updating the similarity comparison logic and adding a new helper function.
- Updated the test to create an embedding object using a specified model ("text-embedding-3-large").
- Replaced a cosine similarity threshold check with comparison of semantic similarity scores.
- Added a new function, compute_alignment, to the cosine_similarity module.
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| examples/team_recommender/tests/example_1_text_response/test_good_fit_for_project.py | Modified test to use semantic similarity scores for hallucination detection and updated variable names and embedding model. |
| examples/team_recommender/tests/example_1_text_response/cosine_similarity.py | Added a compute_alignment function to normalize the difference vector. |
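The table describes compute_alignment only as normalizing the difference vector. One plausible reading, sketched here under that assumption, is a unit vector pointing from one embedding toward the other. The name `compute_alignment_sketch` and the zero-vector handling are illustrative choices, not the PR's implementation.

```python
import math


def compute_alignment_sketch(a, b):
    """Assumed behavior: unit-length difference vector from a toward b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    diff = [y - x for x, y in zip(a, b)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        # Identical inputs have no direction; return zeros rather than divide by 0.
        return [0.0] * len(diff)
    return [d / norm for d in diff]
```

For example, `compute_alignment_sketch([0.0, 0.0], [3.0, 4.0])` gives a vector close to `[0.6, 0.8]`, whose length is 1.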
Files not reviewed (1)
- examples/team_recommender/tests/example_1_text_response/snapshots/test_good_fit_for_project/test_llm_will_hallucinate_given_no_data/hallucination_response.txt: Language not supported
- update cosine similarity tests
…ignment computation tests
… assertion from less than to higher than for log message # Conflicts: # examples/team_recommender/tests/helpers.py
…on with snapshot assertions
…testing for embedding object is not reliable
…plement snapshot loading for embedding equivalence tests
…t and implement snapshot loading for embedding equivalence tests" This reverts commit d72fc74.
…ood_fit_for_project.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ood_fit_for_project.py add tolerance_margin = 0.05 Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…values with enough precision
…n still fails on values close to 0
Signed-off-by: Paul Zabelin <paulzabelin@artium.ai>
Signed-off-by: Paul Zabelin <paulzabelin@artium.ai>
closing in favor of: #61
Add an example of how to test LLM output for semantic similarity using vector embeddings.
Snapshot testing captures the embedding vector and makes it noticeable when the vector changes.
This pull request includes multiple changes to enhance the functionality of the team_recommender module, particularly focusing on embedding stabilization, alignment computation, and testing improvements. The most important changes include adding new functions for embedding stabilization, implementing alignment computation, and updating tests to reflect these new functionalities.

Embedding Stabilization and Alignment Computation:
- examples/team_recommender/tests/example_1_text_response/openai_embeddings.py: Added functions stabilize_embedding_object, stabilize_float, float_to_int_same_bits, and int_to_float_same_bits to stabilize embeddings by ensuring consistent floating-point representation.
- examples/team_recommender/tests/example_1_text_response/cosine_similarity.py: Added the compute_alignment function to calculate the alignment vector between two embeddings.

Testing Enhancements:
- examples/team_recommender/tests/example_1_text_response/test_compute_alignment.py: Added tests for compute_alignment and stabilize_float to ensure the new functionalities work as expected.
- examples/team_recommender/tests/example_1_text_response/test_compute_cosine_similarity.py: Added tests to verify embedding equivalence and the ability to reproduce the same text embedding.
- examples/team_recommender/tests/helpers.py: Removed outdated test functions and parameterized tests to streamline the testing process.

Snapshot Updates:
- examples/team_recommender/tests/example_1_text_response/snapshots/test_good_fit_for_project/test_llm_will_hallucinate_given_no_data/hallucination_response.txt: Removed outdated snapshot data to reflect updated test scenarios.
- examples/team_recommender/tests/example_1_text_response/snapshots/test_good_fit_for_project/test_llm_will_hallucinate_given_no_data/please_provide_missing_information_response.txt: Removed outdated snapshot data to reflect updated test scenarios.
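
The description names float_to_int_same_bits and int_to_float_same_bits but does not show them. A rough sketch of how such bit reinterpretation might work with the standard `struct` module follows. The function bodies are assumptions, and `stabilize_float_sketch` with its `dropped_bits` parameter is a hypothetical stand-in for stabilize_float: it clears low mantissa bits so that tiny floating-point noise is less likely to change a snapshot.

```python
import struct

_MASK_64 = 0xFFFFFFFFFFFFFFFF


def float_to_int_same_bits(value: float) -> int:
    """Reinterpret the 64 bits of an IEEE 754 double as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def int_to_float_same_bits(bits: int) -> float:
    """Reinterpret a 64-bit unsigned integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def stabilize_float_sketch(value: float, dropped_bits: int = 16) -> float:
    """Assumed approach: zero the lowest mantissa bits of the double."""
    keep_mask = _MASK_64 & ~((1 << dropped_bits) - 1)
    return int_to_float_same_bits(float_to_int_same_bits(value) & keep_mask)
```

Truncation of this kind is not a full fix: two nearly equal values can still fall on opposite sides of a truncation boundary and produce different results, so a tolerance-based comparison may still be needed alongside it.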