README.md
```
$ make setup
$ source ./venv/bin/activate
$ pip install -r requirements.txt
```
You can then run the analysis on OpenAI or Anthropic models by running `main.py` with the command line arguments shown below. `LLMNeedleHaystackTester` and `LLMMultiNeedleHaystackTester` parameters can also be passed as command line arguments, except for `model_to_test` and `evaluator`.
* `provider` - The provider of the model; available options are `openai` and `anthropic`. Defaults to `openai`.
* `evaluator` - The evaluator, which can either be a `model` or `LangSmith`. See more on `LangSmith` below. If using a `model`, only `openai` is currently supported. Defaults to `openai`.
* `api_key` - API key for either the OpenAI or Anthropic provider. Can either be passed as a command line argument or set as an environment variable named `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`, depending on the provider. Defaults to `None`.
* `evaluator_api_key` - API key for the OpenAI provider. Can either be passed as a command line argument or set as an environment variable named `OPENAI_API_KEY`. Defaults to `None`.
* `multi_needle` - Whether or not to run the multi-needle tester. Defaults to `False`.
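As a rough illustration only (this is not the repository's actual `main.py`), flags like these could be wired up with `argparse`. The flag names below mirror the documented parameters, but the parsing details are assumed:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch mirroring the documented flags; the real
    # main.py may parse and forward arguments differently.
    parser = argparse.ArgumentParser(description="Needle-in-a-haystack runner")
    parser.add_argument("--provider", choices=["openai", "anthropic"], default="openai")
    parser.add_argument("--evaluator", default="openai")
    parser.add_argument("--api_key", default=None)
    parser.add_argument("--evaluator_api_key", default=None)
    parser.add_argument("--multi_needle", action="store_true")  # defaults to False
    return parser

if __name__ == "__main__":
    # Parse an explicit argument list so the example is self-contained.
    args = build_parser().parse_args(["--provider", "anthropic", "--multi_needle"])
    print(args.provider, args.multi_needle)  # -> anthropic True
```

Note that `--multi_needle` is modeled as a boolean switch here, matching its `False` default above.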
## The Test

1. Place a random fact or statement (the 'needle') in the middle of a long context window (the 'haystack')
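The needle-insertion step can be sketched in a few lines of Python. `insert_needle` below is a hypothetical helper for intuition only; the actual tester places the needle on token and sentence boundaries rather than a raw character offset:

```python
def insert_needle(haystack: str, needle: str, depth_percent: float = 50.0) -> str:
    """Insert `needle` into `haystack` at roughly `depth_percent` of its length.

    Simplified: cuts at a character offset rather than a sentence or token
    boundary, which a real harness would respect.
    """
    cut = int(len(haystack) * depth_percent / 100)
    return haystack[:cut] + " " + needle + " " + haystack[cut:]

haystack = "The quick brown fox jumps over the lazy dog. " * 20
context = insert_needle(haystack, "The secret number is 42.", depth_percent=50)
print("The secret number is 42." in context)  # -> True
```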
I've put the results from the original tests in `/original_results`. I've upgrad…

* `print_ongoing_status` - Default: `True`; whether or not to print the status of tests as they complete.
`LLMMultiNeedleHaystackTester` parameters:

* `needles` - List of needles to insert into the context.
* `eval_set` - The evaluation set identifier.
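For intuition, multi-needle insertion typically spreads the needles at evenly spaced depths through the context. The helpers below are a hypothetical sketch under that assumption, not the tester's actual placement logic:

```python
def needle_depths(num_needles: int) -> list[float]:
    # Hypothetical even spacing: e.g. 3 needles -> depths 25%, 50%, 75%.
    step = 100 / (num_needles + 1)
    return [round(step * (i + 1), 2) for i in range(num_needles)]

def insert_needles(haystack: str, needles: list[str]) -> str:
    # Precompute all cut offsets on the original haystack, then insert
    # deepest-first so the shallower offsets remain valid.
    cuts = [int(len(haystack) * d / 100) for d in needle_depths(len(needles))]
    for needle, cut in sorted(zip(needles, cuts), key=lambda pair: -pair[1]):
        haystack = haystack[:cut] + " " + needle + " " + haystack[cut:]
    return haystack

print(needle_depths(3))  # -> [25.0, 50.0, 75.0]
```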
Other Parameters:

* `api_key` - API key for either the OpenAI or Anthropic provider. Can either be passed when creating the object or set as an environment variable.
Needle 10: 40 + 9 * 6 = 94
You can use LangSmith to orchestrate evals and store results.

1. Sign up for [LangSmith](https://docs.smith.langchain.com/setup)

2. Set env variables for LangSmith as specified in the setup.

3. In the `Datasets + Testing` tab, use `+ Dataset` to create a new dataset; call it `multi-needle-eval-sf` and set the dataset type to `Key-Value`.

4. Populate the dataset with a test question:

   ```
   question: What are the 5 best things to do in San Francisco?
   answer: "The 5 best things to do in San Francisco are: 1) Go to Dolores Park. 2) Eat at Tony's Pizza Napoletana. 3) Visit Alcatraz. 4) Hike up Twin Peaks. 5) Bike across the Golden Gate Bridge"
   ```

5. Run with `--evaluator langsmith` and `--eval_set multi-needle-eval-sf` to evaluate against the newly created eval set.
Let's see all these working together on a new dataset, `multi-needle-eval-pizza`.