
Commit c6c45e3

Commit message: removed duplicate code
1 parent 5d721b3 commit c6c45e3

File tree

4 files changed (+31, -138 lines)


README.md

Lines changed: 13 additions & 12 deletions
````diff
@@ -16,11 +16,12 @@ $ make setup
 $ source ./venv/bin/activate
 $ pip install -r requirements.txt
 ```
-You can then run the analysis on OpenAI or Anthropic models by running `main.py` with the command line arguments shown below. `LLMNeedleHaystackTester` parameters can also be passed as command line arguments, except `model_to_test` and `evaluator` of course.
+You can then run the analysis on OpenAI or Anthropic models by running `main.py` with the command line arguments shown below. `LLMNeedleHaystackTester` and `LLMMultiNeedleHaystackTester` parameters can also be passed as command line arguments, except `model_to_test` and `evaluator` of course.
 * `provider` - The provider of the model, available options are `openai` and `anthropic`. Defaults to `openai`
 * `evaluator` - The evaluator, which can either be a `model` or `LangSmith`. See more on `LangSmith` below. If using a `model`, only `openai` is currently supported. Defaults to `openai`.
 * `api_key` - API key for either OpenAI or Anthropic provider. Can either be passed as a command line argument or an environment variable named `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` depending on the provider. Defaults to `None`.
 * `evaluator_api_key` - API key for OpenAI provider. Can either be passed as a command line argument or an environment variable named `OPENAI_API_KEY`. Defaults to `None`
+* `multi_needle` - Whether to run multi-needle tester or not. Default to `False`
 
 ## The Test
 1. Place a random fact or statement (the 'needle') in the middle of a long context window (the 'haystack')
@@ -57,8 +58,8 @@ I've put the results from the original tests in `/original_results`. I've upgrad
 * `print_ongoing_status` - Default: True, whether or not to print the status of test as they complete
 
 `LLMMultiNeedleHaystackTester` parameters:
-* `multi_needle` - True or False, whether to run multi-needle
 * `needles` - List of needles to insert in the context
+* `eval_set` - The evaluation set identifier.
 
 Other Parameters:
 * `api_key` - API key for either OpenAI or Anthropic provider. Can either be passed when creating the object or an environment variable
@@ -107,16 +108,16 @@ Needle 10: 40 + 9 * 6 = 94
 
 You can use LangSmith to orchestrate evals and store results.
 
-(1) Sign up for [LangSmith](https://docs.smith.langchain.com/setup)
-(2) Set env variables for LangSmith as specified in the setup.
-(3) In the `Datasets + Testing` tab, use `+ Dataset` to create a new dataset, call it `multi-needle-eval-sf` to start.
-(4) Populate the dataset with a test question:
-```
-question: What are the 5 best things to do in San Franscisco?
-answer: "The 5 best things to do in San Francisco are: 1) Go to Dolores Park. 2) Eat at Tony's Pizza Napoletana. 3) Visit Alcatraz. 4) Hike up Twin Peaks. 5) Bike across the Golden Gate Bridge"
-```
-![Screenshot 2024-03-05 at 4 54 15 PM](https://github.com/rlancemartin/LLMTest_NeedleInAHaystack/assets/122662504/2f903955-ed1d-49cc-b995-ed0407d6212a)
-(5) Run with ` --evaluator langsmith` and `--eval_set multi-needle-eval-sf` to run against our recently created eval set.
+1. Sign up for [LangSmith](https://docs.smith.langchain.com/setup)
+2. Set env variables for LangSmith as specified in the setup.
+3. In the `Datasets + Testing` tab, use `+ Dataset` to create a new dataset, call it `multi-needle-eval-sf` and set dataset type to `Key-Value`.
+4. Populate the dataset with a test question:
+```
+question: What are the 5 best things to do in San Franscisco?
+answer: "The 5 best things to do in San Francisco are: 1) Go to Dolores Park. 2) Eat at Tony's Pizza Napoletana. 3) Visit Alcatraz. 4) Hike up Twin Peaks. 5) Bike across the Golden Gate Bridge"
+```
+![Screenshot 2024-03-05 at 4 54 15 PM](https://github.com/rlancemartin/LLMTest_NeedleInAHaystack/assets/122662504/2f903955-ed1d-49cc-b995-ed0407d6212a)
+5. Run with ` --evaluator langsmith` and `--eval_set multi-needle-eval-sf` to run against our recently created eval set.
 
 Let's see all these working together on a new dataset, `multi-needle-eval-pizza`.
 
````
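The README flags above map onto tester keyword arguments. A minimal, hypothetical sketch of the parsing side, using only flag names that appear in the README (the real `main.py` may wire this differently):

```python
# Hypothetical CLI sketch: flag names taken from the README parameter list.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--provider", default="openai")
parser.add_argument("--evaluator", default="openai")
parser.add_argument("--multi_needle", action="store_true")  # defaults to False
parser.add_argument("--eval_set", default="multi-needle-eval-sf")

# Simulate: python main.py --multi_needle --eval_set multi-needle-eval-pizza
args = parser.parse_args(["--multi_needle", "--eval_set", "multi-needle-eval-pizza"])
print(args.multi_needle, args.eval_set)
```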

main.py

Lines changed: 0 additions & 2 deletions
```diff
@@ -155,10 +155,8 @@ def main():
     args.evaluator = get_evaluator(args)
 
     if args.multi_needle == True:
-        print("Testing multi-needle")
         tester = LLMMultiNeedleHaystackTester(**args.__dict__)
     else:
-        print("Testing single-needle")
         tester = LLMNeedleHaystackTester(**args.__dict__)
     tester.start_test()
 
```
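The dispatch above picks a tester class and unpacks the parsed namespace into its constructor with `**args.__dict__`. A self-contained sketch of that pattern with stand-in classes (names here are illustrative, not the repository's):

```python
# Stand-in classes to illustrate class selection plus **namespace unpacking.
from types import SimpleNamespace

class SingleTester:
    def __init__(self, multi_needle=False, **kwargs):
        self.kind = "single"

class MultiTester:
    def __init__(self, multi_needle=False, **kwargs):
        self.kind = "multi"

args = SimpleNamespace(multi_needle=True, eval_set="multi-needle-eval-sf")
cls = MultiTester if args.multi_needle else SingleTester
tester = cls(**args.__dict__)  # every CLI argument becomes a keyword argument
print(tester.kind)  # prints "multi"
```

Note that each constructor must tolerate the full set of flags (hence `**kwargs`), since the whole namespace is forwarded regardless of which tester is chosen.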

src/evaluators/langsmith_evaluator.py

Lines changed: 6 additions & 7 deletions
```diff
@@ -1,4 +1,4 @@
-from typing import Union
+import os
 import uuid
 
 from langchain_openai import ChatOpenAI
@@ -12,7 +12,7 @@
 from langsmith.schemas import Example, Run
 
 @run_evaluator
-def score_relevance(run: Run, example: Union[Example, None] = None):
+def score_relevance(run: Run, example: Example | None = None):
     """
     A custom evaluator function that grades the language model's response based on its relevance
     to a reference answer.
@@ -24,10 +24,6 @@ def score_relevance(run: Run, example: Union[Example, None] = None):
     Returns:
         EvaluationResult: The result of the evaluation, containing the relevance score.
     """
-
-    print("--LANGSMITH EVAL--")
-    #print("--MODEL: ", model_name)
-    #print("--EVAL SET: ", eval_set)
     student_answer = run.outputs["output"]
     reference = example.outputs["answer"]
 
@@ -90,7 +86,10 @@ def __init__(self, api_key: str = None):
         Args:
             api_key (str, optional): The API key for authenticating evaluator model.
         """
-        self.api_key = api_key
+        if (api_key is None) and (not os.getenv('LANGCHAIN_API_KEY')):
+            raise ValueError("Either api_key must be supplied with init, or LANGCHAIN_API_KEY must be in env. Used for evaluation model")
+
+        self.api_key = api_key or os.getenv('LANGCHAIN_API_KEY')
 
     def evaluate_chain(self, chain, context_length, depth_percent, model_name, eval_set):
         """
```
Lines changed: 12 additions & 117 deletions
```diff
@@ -1,18 +1,7 @@
-import asyncio
-import glob
-import json
-import os
-import time
-from asyncio import Semaphore
-from datetime import datetime, timezone
-
-import numpy as np
-
 from .evaluators import Evaluator
 from .llm_needle_haystack_tester import LLMNeedleHaystackTester
 from .providers import ModelProvider
 
-
 class LLMMultiNeedleHaystackTester(LLMNeedleHaystackTester):
     """
     Extends LLMNeedleHaystackTester to support testing with multiple needles in the haystack.
@@ -24,21 +13,17 @@ class LLMMultiNeedleHaystackTester(LLMNeedleHaystackTester):
         print_ongoing_status (bool): Flag to print ongoing status messages.
         eval_set (str): The evaluation set identifier.
     """
-    def __init__(self, *args,
-                 needles=[],
+    def __init__(self,
                  model_to_test: ModelProvider = None,
-                 evaluator: Evaluator = None,
-                 print_ongoing_status = True,
+                 evaluator: Evaluator = None,
+                 needles=[],
                  eval_set = "multi-needle-eval-sf",
+                 *args,
                  **kwargs):
 
-        super().__init__(*args, model_to_test=model_to_test, **kwargs)
+        super().__init__(model_to_test, evaluator, *args, **kwargs)
         self.needles = needles
-        self.evaluator = evaluator
-        self.model_to_test = model_to_test
         self.eval_set = eval_set
-        self.model_name = self.model_to_test.model_name
-        self.print_ongoing_status = print_ongoing_status
 
     async def insert_needles(self, context, depth_percent, context_length):
         """
@@ -84,9 +69,6 @@ async def insert_needles(self, context, depth_percent, context_length):
             # For simplicity, evenly distribute needles throughout the context
             insertion_point = int(len(tokens_context) * (depth_percent / 100))
             tokens_context = tokens_context[:insertion_point] + tokens_needle + tokens_context[insertion_point:]
-            # Log
-            insertion_percentage = (insertion_point / len(tokens_context)) * 100
-            print(f"Inserted '{needle}' at {insertion_percentage:.2f}% of the context, total length now: {len(tokens_context)} tokens")
             # Adjust depth for next needle
             depth_percent += depth_percent_interval
 
@@ -104,10 +86,7 @@ def encode_and_trim(self, context, context_length):
         Returns:
             str: The encoded and trimmed context.
         """
-        tokens = self.model_to_test.encode_text_to_tokens(context)
-        if len(tokens) > context_length:
-            context = self.model_to_test.decode_tokens(tokens, context_length)
-        return context
+        return super().encode_and_trim(context, context_length)
 
     async def generate_context(self, context_length, depth_percent):
         """
@@ -140,103 +119,19 @@ async def evaluate_and_log(self, context_length, depth_percent):
         # Go generate the required length context and place your needle statement in
         context = await self.generate_context(context_length, depth_percent)
 
-        test_start_time = time.time()
-
         # LangSmith
         ## TODO: Support for many evaluators
-        if self.evaluator.__class__.__name__ == "LangSmithEvaluator":
-            print("EVALUATOR: LANGSMITH")
+        if self.evaluation_model.__class__.__name__ == "LangSmithEvaluator":
             chain = self.model_to_test.get_langchain_runnable(context)
-            self.evaluator.evaluate_chain(chain, context_length, depth_percent, self.model_to_test.model_name, self.eval_set)
-            test_end_time = time.time()
-            test_elapsed_time = test_end_time - test_start_time
-
+            self.evaluation_model.evaluate_chain(chain, context_length, depth_percent, self.model_name, self.eval_set)
         else:
-            print("EVALUATOR: OpenAI Model")
-            # Prepare your message to send to the model you're going to evaluate
-            prompt = self.model_to_test.generate_prompt(context, self.retrieval_question)
-            # Go see if the model can answer the question to pull out your random fact
-            response = await self.model_to_test.evaluate_model(prompt)
-            # Compare the reponse to the actual needle you placed
-            score = self.evaluation_model.evaluate_response(response)
-
-            test_end_time = time.time()
-            test_elapsed_time = test_end_time - test_start_time
-
-            results = {
-                # 'context' : context, # Uncomment this line if you'd like to save the context the model was asked to retrieve from. Warning: This will become very large.
-                'model' : self.model_to_test.model_name,
-                'context_length' : int(context_length),
-                'depth_percent' : float(depth_percent),
-                'version' : self.results_version,
-                'needle' : self.needle,
-                'model_response' : response,
-                'score' : score,
-                'test_duration_seconds' : test_elapsed_time,
-                'test_timestamp_utc' : datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S%z')
-            }
-
-            self.testing_results.append(results)
-
-            if self.print_ongoing_status:
-                print (f"-- Test Summary -- ")
-                print (f"Duration: {test_elapsed_time:.1f} seconds")
-                print (f"Context: {context_length} tokens")
-                print (f"Depth: {depth_percent}%")
-                print (f"Score: {score}")
-                print (f"Response: {response}\n")
-
-            context_file_location = f'{self.model_name.replace(".", "_")}_len_{context_length}_depth_{int(depth_percent*100)}'
-
-            if self.save_contexts:
-                results['file_name'] = context_file_location
-
-                # Save the context to file for retesting
-                if not os.path.exists('contexts'):
-                    os.makedirs('contexts')
-
-                with open(f'contexts/{context_file_location}_context.txt', 'w') as f:
-                    f.write(context)
-
-            if self.save_results:
-                # Save the context to file for retesting
-                if not os.path.exists('results'):
-                    os.makedirs('results')
-
-                # Save the result to file for retesting
-                with open(f'results/{context_file_location}_results.json', 'w') as f:
-                    json.dump(results, f)
-
-            if self.seconds_to_sleep_between_completions:
-                await asyncio.sleep(self.seconds_to_sleep_between_completions)
-
-    async def bound_evaluate_and_log(self, sem, *args):
-        async with sem:
-            await self.evaluate_and_log(*args)
-
-    async def run_test(self):
-        sem = Semaphore(self.num_concurrent_requests)
-
-        # Run through each iteration of context_lengths and depths
-        tasks = []
-        for context_length in self.context_lengths:
-            for depth_percent in self.document_depth_percents:
-                task = self.bound_evaluate_and_log(sem, context_length, depth_percent)
-                tasks.append(task)
-
-        # Wait for all tasks to complete
-        await asyncio.gather(*tasks)
+            await super().evaluate_and_log(context, context_length, depth_percent)
 
     def print_start_test_summary(self):
         print ("\n")
-        print ("Starting Needle In A Haystack Testing...")
+        print ("Starting Needles In A Haystack Testing...")
        print (f"- Model: {self.model_name}")
         print (f"- Context Lengths: {len(self.context_lengths)}, Min: {min(self.context_lengths)}, Max: {max(self.context_lengths)}")
         print (f"- Document Depths: {len(self.document_depth_percents)}, Min: {min(self.document_depth_percents)}%, Max: {max(self.document_depth_percents)}%")
-        print (f"- Needle: {self.needle.strip()}")
-        print ("\n\n")
-
-    def start_test(self):
-        if self.print_ongoing_status:
-            self.print_start_test_summary()
-        asyncio.run(self.run_test())
+        print (f"- Needles: {[needle.strip() for needle in self.needles]}")
+        print ("\n\n")
```
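The bulk of this commit deletes subclass logic that duplicated the parent class (`run_test`, `start_test`, `encode_and_trim`, and most of `evaluate_and_log`) and delegates to `super()` instead. A minimal sketch of that deduplication pattern, with stand-in classes rather than the repository's real ones:

```python
# Stand-in classes illustrating "delete the copy, delegate to the parent".
class BaseTester:
    def encode_and_trim(self, context, limit):
        # Shared trimming logic lives once, in the parent.
        return context[:limit]

class MultiTester(BaseTester):
    # Before the commit: a duplicated re-implementation lived here.
    # After: the override simply delegates to the inherited behavior.
    def encode_and_trim(self, context, limit):
        return super().encode_and_trim(context, limit)

result = MultiTester().encode_and_trim("abcdef", 3)  # returns "abc"
```

An override whose body is only a `super()` call can usually be deleted outright, since the method is inherited anyway; keeping it is mainly useful as a hook for future subclass-specific behavior.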
