Adds inspectai #1022
Merged
+1,181 −52
Commits (67)

2696a49 use inspect-ai to evaluate aime25 and gsm8k (NathanHB)
578d530 revert file (NathanHB)
21fa870 working for 3 tasks (NathanHB)
27b2af1 parallel evals of tasks (NathanHB)
b9a610d adds gpqa diamond to inspect (NathanHB)
25c1128 move tasks to individual files (NathanHB)
0d42edf move tasks to individual files (NathanHB)
6cc3c04 enable extended tasks as well (NathanHB)
4c38951 run precomit hook (NathanHB)
d2fd5e1 fix mkqa (NathanHB)
2ddb0f9 chaange extended suite to lighteval (NathanHB)
ee97122 chaange extended suite to lighteval (NathanHB)
e2c8e22 add metdata to tasks (NathanHB)
c980ddb add metdata to tasks (NathanHB)
57fe390 remove license notice and put docstring on top of file (NathanHB)
ee081f2 homogenize tags (NathanHB)
1ed1602 add docstring for all multilingual tasks (NathanHB)
f4b0e27 add docstring for all multilingual tasks (NathanHB)
81d9e4e add name and dataset to metadata (NathanHB)
b734532 use TASKS_TABLE for multilingual tasks (NathanHB)
c3911fc use TASKS_TABLE for default tasks (NathanHB)
e439f70 use TASKS_TABLE for default tasks (NathanHB)
6447ee7 loads all tasks correclty (NathanHB)
88754bf move community tasks to default tasks and update doc (NathanHB)
5445f5c move community tasks to default tasks and update doc (NathanHB)
f53bd76 Merge remote-tracking branch 'origin/main' into nathan-reorg-tasks (NathanHB)
6a0c615 revert uneeded changes (NathanHB)
1435e38 fix doc build (NathanHB)
15f41f2 fix doc build (NathanHB)
74e5c0f remove custom tasks and let user decide if loading multilingual tasks (NathanHB)
aad136c load-tasks multilingual fix (NathanHB)
242bc43 update doc (NathanHB)
6806bf8 remove uneeded file (NathanHB)
e94fa59 update readme (NathanHB)
8800d1a update readme (NathanHB)
970f33b update readme (NathanHB)
b8c26dc fix test (NathanHB)
764de72 add back the custom tasks (NathanHB)
a326ea8 add back the custom tasks (NathanHB)
81081cd fix tasks (NathanHB)
74b40f6 fix tasks (NathanHB)
083fb1b fix tasks (NathanHB)
2dab2bf fix tests (NathanHB)
57ca0e5 fix tests (NathanHB)
480e40a add inspect-ai (NathanHB)
ade2900 add tasks (NathanHB)
079ceaf add gpqa (NathanHB)
8d00799 make model config work (NathanHB)
cea5e99 Update src/lighteval/metrics/metrics.py (NathanHB)
fb47bb7 init (NathanHB)
2736bc9 Merge branch 'nathan-move-to-inspectai' of github.com:huggingface/lig… (NathanHB)
d5e6c9f Merge branch 'main' into nathan-move-to-inspectai (NathanHB)
e55a9af fix tests (NathanHB)
ba41f1c Merge branch 'nathan-move-to-inspectai' of github.com:huggingface/lig… (NathanHB)
59c5dcc fix tests (NathanHB)
40254db fix tests (NathanHB)
53275fe fix tests (NathanHB)
72e5c2b add correct system prompt for hle (NathanHB)
7fc1753 add correct system prompt for hle (NathanHB)
260d744 review suggestions (NathanHB)
835b799 add doc (NathanHB)
c216a27 change buttons (NathanHB)
21e6020 change buttons (NathanHB)
7e65400 change buttons (NathanHB)
0a4f6be move benchmark finder to openeval org (NathanHB)
b661d0d better help for eval (NathanHB)
f142b39 better help for eval (NathanHB)
@@ -1,28 +1,30 @@
 # Available tasks
 
+Browse and inspect tasks available in LightEval.
 <iframe
-src="https://saylortwift-benchmark-finder.hf.space"
+src="https://openevals-benchmark-finder.hf.space"
 frameborder="0"
 width="850"
 height="450"
 ></iframe>
 
 
+
-You can get a list of all available tasks by running:
+List all tasks:
 
 ```bash
 lighteval tasks list
 ```
 
-### Inspect Specific Tasks
+### Inspect specific tasks
 
-You can inspect a specific task to see its configuration, metrics, and requirements by running:
+Inspect a task to view its config, metrics, and requirements:
 
 ```bash
 lighteval tasks inspect <task_name>
 ```
 
-For example:
+Example:
 ```bash
 lighteval tasks inspect "lighteval|truthfulqa:mc|0"
 ```
  
    
@@ -0,0 +1,120 @@
+# Evaluate your model with Inspect-AI
+
+Pick the right benchmarks with our benchmark finder:
+Search by language, task type, dataset name, or keywords.
+
+> [!WARNING]
+> Not all tasks are compatible with inspect-ai's API yet; we are working on converting all of them.
+
+
+<iframe
+src="https://openevals-open-benchmark-index.hf.space"
+frameborder="0"
+width="850"
+height="450"
+></iframe>
+
+Once you've chosen a benchmark, run it with `lighteval eval`. Below are examples for common setups.
+
+### Examples
+
+1. Evaluate a model via Hugging Face Inference Providers.
+
+```bash
+lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0"
+```
+
+2. Run multiple evals at the same time.
+
+```bash
+lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0,lighteval|aime25|0"
+```
+
+3. Compare providers for the same model.
+
+```bash
+lighteval eval \
+hf-inference-providers/openai/gpt-oss-20b:fireworks-ai \
+hf-inference-providers/openai/gpt-oss-20b:together \
+hf-inference-providers/openai/gpt-oss-20b:nebius \
+"lighteval|gpqa:diamond|0"
+```
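The provider comparison in example 3 repeats one model string with a different `:<provider>` suffix each time. A minimal sketch of generating that command line from a provider list follows. The helper name `provider_comparison_command` is hypothetical, and the `:<provider>` suffix convention is inferred only from the example itself, not from lighteval's source.

```python
import shlex


def provider_comparison_command(model: str, providers: list[str], task: str) -> str:
    """Build a `lighteval eval` command with one model argument per provider.

    Assumption: a provider is selected by appending ":<provider>" to the
    model string, as the docs example does.
    """
    models = [f"{model}:{provider}" for provider in providers]
    # shlex.join quotes the task spec, whose "|" characters are shell-special.
    return shlex.join(["lighteval", "eval", *models, task])


cmd = provider_comparison_command(
    "hf-inference-providers/openai/gpt-oss-20b",
    ["fireworks-ai", "together", "nebius"],
    "lighteval|gpqa:diamond|0",
)
print(cmd)
```

Quoting the task spec matters because an unquoted `|` would be read by the shell as a pipe.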
+
+4. Evaluate a vLLM or SGLang model.
+
+```bash
+lighteval eval vllm/HuggingFaceTB/SmolLM-135M-Instruct "lighteval|gpqa:diamond|0"
+```
+
+5. See the impact of few-shot on your model.
+
+```bash
+lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|gsm8k|0,lighteval|gsm8k|5"
+```
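The task argument in these examples is a comma-separated list of `suite|task|few-shot count` entries, so example 5 runs `gsm8k` once with 0 shots and once with 5. A minimal sketch of how such a string decomposes is shown below. The `TaskSpec` type and its field names are illustrative labels, not lighteval's own parser.

```python
from typing import NamedTuple


class TaskSpec(NamedTuple):
    # Field names are illustrative, chosen from how the examples read.
    suite: str
    task: str
    num_fewshot: int


def parse_task_specs(spec: str) -> list[TaskSpec]:
    """Split a comma-separated task string into (suite, task, few-shot) triples."""
    specs = []
    for item in spec.split(","):
        suite, task, shots = item.strip().split("|")
        specs.append(TaskSpec(suite, task, int(shots)))
    return specs


specs = parse_task_specs("lighteval|gsm8k|0,lighteval|gsm8k|5")
for s in specs:
    print(s.suite, s.task, s.num_fewshot)
```

Subtask names such as `gpqa:diamond` stay intact because only `|` and `,` are treated as separators.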
+
+6. Optimize custom server connections.
+
+```bash
+lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|gsm8k|0" \
+--max-connections 50 \
+--timeout 30 \
+--retry-on-error 1 \
+--max-retries 1 \
+--max-samples 10
+```
+
+7. Use multiple epochs for more reliable results.
+
+```bash
+lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0" --epochs 16 --epochs-reducer "pass_at_4"
+```
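Example 7 runs each sample for 16 epochs and reduces the scores with `pass_at_4`. The page does not define that reducer, so the sketch below assumes it follows the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n attempts of which c are correct. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts drawn from n is correct.

    Assumption: the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts are incorrect,
    # so the estimate saturates at 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 16 epochs, 4 of them correct, reduced with k = 4.
print(round(pass_at_k(n=16, c=4, k=4), 4))
```

Under this assumption, a sample solved in only a quarter of the epochs still scores well above 0.25 with `pass_at_4`, so the reducer choice changes what the reported number means.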
+
+8. Push to the Hub to share results.
+
+```bash
+lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|hle|0" \
+--bundle-dir gpt-oss-bundle \
+--repo-id OpenEvals/evals \
+--max-samples 100
+```
+
+Resulting Space:
+
+<iframe
+src="https://openevals-evals.static.hf.space"
+frameborder="0"
+width="850"
+height="450"
+></iframe>
+
+9. Change model behaviour.
+
+You can use any argument defined in inspect-ai's API.
+
+```bash
+lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0" --temperature 0.1
+```
+
+10. Use `--model-args` to pass any inference-provider-specific argument.
+
+```bash
+lighteval eval google/gemini-2.5-pro "lighteval|aime25|0" --model-args location=us-east5
+```
+
+```bash
+lighteval eval openai/gpt-4o "lighteval|gpqa:diamond|0" --model-args service_tier=flex,client_timeout=1200
+```
+
+
+LightEval prints a per-model results table:
+
+```
+Completed all tasks in 'lighteval-logs' successfully
+
+| Model |gpqa|gpqa:diamond|
+|---------------------------------------|---:|-----------:|
+|vllm/HuggingFaceTB/SmolLM-135M-Instruct|0.01| 0.01|
+
+results saved to lighteval-logs
+run "inspect view --log-dir lighteval-logs" to view the results
+```
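The printed results table is plain pipe-delimited text, so it can be read back into a dictionary for quick comparisons across models. A minimal sketch follows, assuming the layout shown above (a header row, a separator row, then one row per model). `parse_results_table` is a hypothetical helper, not part of lighteval or inspect-ai.

```python
def parse_results_table(text: str) -> dict[str, dict[str, float]]:
    """Map each model to its {task: score} entries from a pipe-delimited table."""
    rows = [line.strip() for line in text.splitlines() if line.strip().startswith("|")]
    header = [cell.strip() for cell in rows[0].strip("|").split("|")]
    results = {}
    for row in rows[2:]:  # skip the header row and the |---| separator row
        cells = [cell.strip() for cell in row.strip("|").split("|")]
        results[cells[0]] = {task: float(v) for task, v in zip(header[1:], cells[1:])}
    return results


sample = """
Completed all tasks in 'lighteval-logs' successfully

| Model |gpqa|gpqa:diamond|
|---------------------------------------|---:|-----------:|
|vllm/HuggingFaceTB/SmolLM-135M-Instruct|0.01| 0.01|

results saved to lighteval-logs
"""
parsed = parse_results_table(sample)
print(parsed)
```

For anything beyond a quick check, the logs in `lighteval-logs` are the more reliable source, and the table itself points to `inspect view` for browsing them.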
  
    