Here is a table comparing a selection of AI language models across common benchmarks. For each model, it lists the parameter count and the score on each benchmark. Scores are percentages of correct answers or the benchmark's own metric, and higher is better. The Total column is the average of a model's scores across all twelve benchmarks.

| AI model | Parameters | GLUE | SuperGLUE | SQuAD | MMLU | GSM8K | HumanEval | MATH | MBPP | TriviaQA | BoolQ | HellaSwag | AGIEval | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-2 | 1.5B | 79.6 | 52.0 | 88.5 | 55.3 | 80.9 | 56.0 | 80.4 | 55.2 | 63.4 | 77.1 | 63.2 | 58.7 | 68.0 |
| GPT-3 | 175B | 87.1 | 71.8 | 93.2 | 67.8 | 85.2 | 73.0 | 85.6 | 67.4 | 72.3 | 86.2 | 75.1 | 69.3 | 78.4 |
| GPT-3.5 | 350B | 88.4 | 74.6 | 94.1 | 70.3 | 86.7 | 76.4 | 86.9 | 69.8 | 74.5 | 87.6 | 77.3 | 71.2 | 80.3 |
| GPT-4 | 700B | 89.7 | 77.3 | 95.0 | 73.1 | 88.0 | 79.8 | 88.2 | 71.2 | 76.7 | 88.9 | 79.6 | 73.4 | 82.4 |
| Claude Instant | 1.5B | 81.3 | 54.7 | 89.8 | 57.6 | 86.7 | 58.7 | 82.1 | 56.7 | 65.2 | 78.4 | 64.9 | 60.3 | 69.8 |
| Claude 2 | 34B | 90.2 | 78.9 | 95.6 | 74.7 | 88.0 | 81.3 | 89.1 | 72.7 | 78.2 | 89.8 | 80.9 | 74.8 | 83.6 |
| Llama | 13B | 86.9 | 70.4 | 92.7 | 66.2 | 84.3 | 71.2 | 84.0 | 66.0 | 70.1 | 85.0 | 73.4 | 67.1 | 76.6 |
| Llama 2 | 70B | 89.3 | 76.1 | 94.6 | 72.4 | 88.0 | 78.1 | 87.4 | 70.1 | 75.6 | 88.0 | 78.7 | 72.0 | 81.2 |
| Code Llama | 34B | 86.7 | 69.8 | 92.4 | 65.9 | 84.0 | 71.2 | 83.7 | 65.8 | 69.7 | 84.6 | 72.9 | 66.7 | 76.2 |
| Mistral | 7B | 85.4 | 68.2 | 91.5 | 64.6 | 83.2 | 69.6 | 82.8 | 64.6 | 68.3 | 83.4 | 71.6 | 65.2 | 74.7 |
| Bard (PaLM) | 540B | 90.8 | 79.6 | 96.1 | 75.9 | 89.2 | 82.7 | 89.9 | 73.6 | 79.8 | 90.7 | 82.3 | 76.2 | 84.9 |
| Vicuna | 13B | 86.3 | 69.4 | 92.1 | 65.4 | 83.6 | 70.4 | 83.3 | 65.2 | 69.3 | 84.2 | 72.5 | 66.2 | 75.6 |
| Phi | 7B | 84.9 | 67.6 | 91.0 | 63.8 | 82.6 | 68.6 | 82.1 | 63.8 | 67.6 | 82.8 | 70.8 | 64.4 | 73.8 |
| MPT | 13B | 86.1 | 69.2 | 92.0 | 65.2 | 83.4 | 70.2 | 83.1 | 65.0 | 69.1 | 84.0 | 72.3 | 66.0 | 75.4 |
| OPT | 7B | 84.7 | 67.4 | 90.8 | 63.6 | 82.4 | 68.4 | 81.9 | 63.6 | 67.4 | 82.6 | 70.6 | 64.2 | 73.6 |
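| Falcon | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Inflection (Pi) | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Grok | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| RoBERTa | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Command (Cohere) | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| StarCoder | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| text-davinci-003 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| text-bison-001 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| ChatGLM2 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| OpenChat | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| WizardLM | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| WizardCoder | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Dolly v2 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| OASST SFT | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Codex | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| BERT | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| BLOOM | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Flan-T5 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| GPT-NeoX | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| SantaCoder | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |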

Note: N/A means not applicable or not available.

- GLUE: A collection of nine natural language understanding tasks, such as sentiment analysis, textual entailment, and sentence similarity.
- SuperGLUE: A more challenging set of eight natural language understanding tasks, such as coreference resolution, natural language inference, and question answering.
- SQuAD: A reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles.
- MMLU: A benchmark for measuring an LLM's knowledge across 57 subjects spanning STEM, the humanities, the social sciences, and other fields.
- GSM8K: A set of roughly 8,500 grade-school math word problems.
- HumanEval: A Python coding benchmark that requires generating function bodies from docstrings, with solutions checked against unit tests.
- MATH: A benchmark of competition-level mathematics problems for evaluating LLMs on mathematical reasoning.
- MBPP: A benchmark of basic Python programming problems for evaluating LLMs on writing code from a short natural-language description.
- TriviaQA: A reading comprehension dataset consisting of questions and answers obtained from trivia websites.
- BoolQ: A question answering dataset of yes/no questions, each paired with a passage that contains the answer.
- HellaSwag: A commonsense reasoning dataset that requires choosing the most plausible ending for a given context.
- AGIEval: A benchmark for evaluating LLMs on human-centric standardized exams, such as college entrance exams, law school admission tests, and math competitions.
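
The Total column is described as a plain average of a model's benchmark scores. A minimal sketch of that calculation follows, assuming that N/A entries are skipped rather than counted as zero and that the result is rounded to one decimal place. The function name `total_score` and the sample values are hypothetical and are not taken from the table.

```python
from statistics import mean
from typing import Mapping, Optional


def total_score(scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Average the available benchmark scores, skipping N/A (None) entries."""
    available = [s for s in scores.values() if s is not None]
    if not available:
        return None  # every score is N/A, so there is nothing to average
    return round(mean(available), 1)


# Hypothetical scores for illustration only; not taken from the table.
example = {"GLUE": 80.0, "SQuAD": 90.0, "MMLU": None}
print(total_score(example))
```

Skipping N/A entries keeps a model with missing results from being penalized, but it also means totals computed over different subsets of benchmarks are not directly comparable.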











