test LLM output for semantic similarity using vector embeddings by carl · Pull Request #59 · thisisartium/continuous-alignment-testing

carl · 2025-03-19T23:58:21Z

Add an example of how to test LLM output for semantic similarity using vector embeddings.

Snapshot testing lets us capture an embedding vector and notice when it changes.

This pull request includes multiple changes to enhance the functionality of the team_recommender module, particularly focusing on embedding stabilization, alignment computation, and testing improvements. The most important changes include adding new functions for embedding stabilization, implementing alignment computation, and updating tests to reflect these new functionalities.

Embedding Stabilization and Alignment Computation:

examples/team_recommender/tests/example_1_text_response/openai_embeddings.py: Added functions stabilize_embedding_object, stabilize_float, float_to_int_same_bits, and int_to_float_same_bits to stabilize embeddings by ensuring consistent floating-point representation.
examples/team_recommender/tests/example_1_text_response/cosine_similarity.py: Added the compute_alignment function to calculate the alignment vector between two embeddings.
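A minimal sketch of what bit-level float stabilization could look like, for readers unfamiliar with the idea. The function names match those listed above, but the bodies, the `dropped_bits` parameter, and the choice to truncate rather than round are assumptions for illustration, not the code in this PR.

```python
import struct

_MASK_64 = 0xFFFFFFFFFFFFFFFF


def float_to_int_same_bits(value: float) -> int:
    """Reinterpret the 64 bits of an IEEE 754 double as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def int_to_float_same_bits(bits: int) -> float:
    """Reinterpret a 64-bit unsigned integer as an IEEE 754 double."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def stabilize_float(value: float, dropped_bits: int = 32) -> float:
    """Zero the lowest `dropped_bits` mantissa bits (assumed default: 32).

    Low-order jitter disappears, at the cost of precision. Two nearby values
    that straddle a truncation boundary will still come out different.
    """
    keep_mask = ~((1 << dropped_bits) - 1) & _MASK_64
    return int_to_float_same_bits(float_to_int_same_bits(value) & keep_mask)


# Usage: stabilize every component of a (hypothetical) embedding vector.
embedding = [0.1, -0.25, 0.1 + 1e-12]
stable = [stabilize_float(x) for x in embedding]
```

Truncating bits keeps the stored value a valid float, so snapshots stay human-readable; decimal rounding is a simpler alternative with similar boundary caveats.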

Testing Enhancements:

examples/team_recommender/tests/example_1_text_response/test_compute_alignment.py: Added tests for compute_alignment and stabilize_float to ensure the new functionalities work as expected.
examples/team_recommender/tests/example_1_text_response/test_compute_cosine_similarity.py: Added tests to verify embedding equivalence and the ability to reproduce the same text embedding.
examples/team_recommender/tests/helpers.py: Removed outdated test functions and parameterized tests to streamline the testing process.
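One way to check embedding equivalence without demanding bit-for-bit equality is to count the components that differ by more than a tolerance. The sketch below is illustrative only: `count_outside_tolerance` is a hypothetical helper, and the default tolerance of 0.0001 is an assumption, not a value taken from the test files.

```python
from typing import Sequence


def count_outside_tolerance(
    first: Sequence[float],
    second: Sequence[float],
    tolerance: float = 1e-4,  # assumed default for illustration
) -> int:
    """Count positions where two equal-length vectors differ by more than `tolerance`."""
    if len(first) != len(second):
        raise ValueError("vectors must have the same length")
    return sum(1 for x, y in zip(first, second) if abs(x - y) > tolerance)


# Usage: two (made-up) embeddings of the same text from separate API calls.
run_a = [0.1, 0.2, 0.3]
run_b = [0.1, 0.2005, 0.30001]
outliers = count_outside_tolerance(run_a, run_b)
```

A test can then assert an upper bound on `outliers` rather than exact equality, which tolerates small run-to-run differences in the returned vector.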

Snapshot Updates:

examples/team_recommender/tests/example_1_text_response/snapshots/test_good_fit_for_project/test_llm_will_hallucinate_given_no_data/hallucination_response.txt: Removed outdated snapshot data to reflect updated test scenarios.
examples/team_recommender/tests/example_1_text_response/snapshots/test_good_fit_for_project/test_llm_will_hallucinate_given_no_data/please_provide_missing_information_response.txt: Removed outdated snapshot data to reflect updated test scenarios.

Copilot

Pull Request Overview

This PR adjusts the similarity threshold in the hallucination test example from 70% to 64% to better align with updated expectations. Key changes include:

Lowering the cosine similarity threshold in the if condition and assert statement.
Renaming file handle variables from "f" to "fp" and adding noinspection comments.
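For context, a threshold check on cosine similarity can be sketched as follows. This is a pure-Python illustration; the helper names `cosine_similarity` and `is_similar_enough` are assumptions and do not reproduce the test's actual code, which only shows the comparison against 0.64.

```python
import math
from typing import Sequence

SIMILARITY_THRESHOLD = 0.64  # the value this PR lowers the threshold to


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_similar_enough(
    a: Sequence[float],
    b: Sequence[float],
    threshold: float = SIMILARITY_THRESHOLD,
) -> bool:
    """Hypothetical helper: True when the similarity meets the threshold."""
    return cosine_similarity(a, b) >= threshold
```

A fixed threshold like this is sensitive to the embedding model in use, so a value tuned for one model may not carry over to another.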

Files not reviewed (2)

examples/team_recommender/tests/example_1_text_response/snapshots/test_good_fit_for_project/test_llm_will_hallucinate_given_no_data/hallucination_response.txt: Language not supported
examples/team_recommender/tests/fixtures/hallucination_response.json: Language not supported

Comments suppressed due to low confidence (2)

examples/team_recommender/tests/example_1_text_response/test_good_fit_for_project.py:124

Verify that lowering the threshold to 0.64 accurately reflects the intended test behavior and does not inadvertently allow borderline cases to pass.

if cosine_similarity < 0.64:

examples/team_recommender/tests/example_1_text_response/test_good_fit_for_project.py:146

[nitpick] Consider using a more descriptive variable name instead of 'fp' for the file handle to enhance readability.

) as fp:

Copilot

Pull Request Overview

This PR adjusts the expectation for hallucination detection in the team recommender tests by updating the similarity comparison logic and adding a new helper function.

Updated the test to create an embedding object using a specified model ("text-embedding-3-large").
Replaced a cosine similarity threshold check with comparison of semantic similarity scores.
Added a new function, compute_alignment, to the cosine_similarity module.

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

File	Description
examples/team_recommender/tests/example_1_text_response/test_good_fit_for_project.py	Modified test to use semantic similarity scores for hallucination detection and updated variable names and embedding model.
examples/team_recommender/tests/example_1_text_response/cosine_similarity.py	Added a compute_alignment function to normalize the difference vector.
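As a rough illustration of "normalizing the difference vector", the sketch below subtracts one embedding from the other and scales the result to unit length. The direction of subtraction and the handling of identical inputs are assumptions; the PR's own compute_alignment may differ.

```python
import math
from typing import List, Sequence


def compute_alignment(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Return the unit vector pointing from `a` toward `b`.

    Assumption: identical inputs yield a zero vector instead of raising.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    difference = [y - x for x, y in zip(a, b)]
    norm = math.sqrt(sum(d * d for d in difference))
    if norm == 0:
        return [0.0] * len(difference)
    return [d / norm for d in difference]


# Usage with small made-up vectors.
alignment = compute_alignment([0.0, 0.0], [3.0, 4.0])
```

Because the result has unit length, it describes only the direction in which two embeddings differ, not how far apart they are.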

Files not reviewed (1)

examples/team_recommender/tests/example_1_text_response/snapshots/test_good_fit_for_project/test_llm_will_hallucinate_given_no_data/hallucination_response.txt: Language not supported


paulz · 2025-03-24T18:51:51Z

closing in favor of: #61

tkersey requested a review from Copilot March 20, 2025 00:21

Copilot AI reviewed Mar 20, 2025


tkersey requested a review from Copilot March 20, 2025 17:39

Copilot AI reviewed Mar 20, 2025


carl force-pushed the fix_example_1 branch from 7e60e81 to 07b22e7 Compare March 20, 2025 21:54

paulz changed the title ~~adjust expectation for hallucination example to 64%~~ test LLM output for semantic similarity using vector embeddings Mar 24, 2025

carl and others added 24 commits March 24, 2025 11:30

2752541 adjust expectation for hallucination example to 64%
c67b60a fix pycharm warning about file type object
5fe5261 implement compute_alignment function; update cosine similarity tests
e17c301 Refactor: update embedding creation and similarity computation in tests
6bf424a Enhance: modify cosine similarity function to return lists and add alignment computation tests
665a24f Refactor: moved tests to test_helpers.py, and fixed language for test assertion from less than to higher than for log message (Conflicts: examples/team_recommender/tests/helpers.py)
1961809 Refactor: update tests for embedding creation and alignment computation with snapshot assertions
9189c17 Clearly separate fixture naming from snapshot naming, wip - snapshot testing for embedding object is not reliable
881aa9c Enhance: switch to base64 encoding on OpenAI embedding request and implement snapshot loading for embedding equivalence tests
151b09a Revert "Enhance: switch to base64 encoding on OpenAI embedding request and implement snapshot loading for embedding equivalence tests" (This reverts commit d72fc74.)
940759f Reproduce comparison of the embedding failures, snapshots unstable
a6637e6 Add assertion to validate embedding differences in cosine similarity test
5932278 Add assertion to check count of elements outside tolerance in cosine similarity test
ce47441 Update examples/team_recommender/tests/example_1_text_response/test_good_fit_for_project.py (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
10dd0d3 Refactor: remove unused import of compute_alignment from cosine_similarity test
d82b11f Refactor: remove unused imports from test_good_fit_for_project.py
165b8be add variant embeddings
615430e Update examples/team_recommender/tests/example_1_text_response/test_good_fit_for_project.py, add tolerance_margin = 0.05 (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
34a3d05 lint changes and add assertion for 900 outliers with 0.0001 tolerance
9492628 snapshot up to 4 digits precision
3b6530c clearly see the difference with 4 digits precision
2592cc6 optimize imports
44b689c update alignment vector snapshot and refactor rounding to stable_embedding function
c7099da use 2 digit precision

carl and others added 12 commits March 24, 2025 11:33

71dcdfc use 1 digit precision
e79c564 refactor stable_embedding to use 3 digit precision for non-negligible values
f20fc18 fix stability of a snapshot for alignment_vector using bit massage, needs cleanup
5822ffd test_stabilize_float shows that stabilize_float creates stable float values with enough precision
17aa541 add stabilize_embedding_object to ensure stable float values in embeddings
90ff312 fix stabilize_float to improve precision by adjusting bit manipulation still fails on values close to 0
963b4e5 fix: 32 bit shift to align floats
0a15ddc add tests for confidence ranges and success rate calculations
6c000a4 add test for next_success_rate with additional case
fcfed0c refactor: remove redundant test case
fc1bdaa refactor: extract tests for openai embeddings (Signed-off-by: Paul Zabelin <paulzabelin@artium.ai>)
def8785 add tests for success rate confidence with additional cases (Signed-off-by: Paul Zabelin <paulzabelin@artium.ai>)

paulz force-pushed the fix_example_1 branch from 55b81ee to def8785 Compare March 24, 2025 18:39

paulz closed this Mar 24, 2025

carl wants to merge 36 commits into thisisartium:main from carl:fix_example_1

carl commented Mar 19, 2025 • edited by paulz


3 participants
