Open source LLM performance evaluation #90

@streetycat

Description

I will list the test results of various open-source models here. You can refer to these data to select models and configure devices. Of course, the evaluation of LLMs is quite subjective, so I also suggest running evaluations tailored to your own requirements. Your opinions and suggestions on the evaluation methods and results are welcome.

I will give the overall scores in the first comment and provide performance statistics in the second comment.

At present, I plan to complete the evaluation of several mainstream models first, and may also look at some related fine-tuned models along the way.

  1. Alpaca
  2. Vicuna
  3. Mistral
  4. Bloom
  5. Aquila

The following tasks need to be handled:

  • Test cases
  • ChatGPT-4 (as a reference)
    • Execute test cases
  • ChatGPT-3.5 (as a reference)
    • Execute test cases
  • Llama 70B Chat
    • Execute test cases
  • Llama 13B Chat
    • Execute test cases
  • Alpaca
    • Download model
    • Execute test cases
  • Vicuna
    • Download model
    • Execute test cases
  • Mistral
    • Download model
    • Execute test cases
  • Falcon
    • Download model
    • Execute test cases
  • Aquila
    • Download model
    • Execute test cases
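The "execute test cases" step above could be sketched roughly as follows. This is a minimal illustration only: the `run` callable, the test-case shape, and the keyword-based scoring rule are all hypothetical placeholders, not the harness actually used for these evaluations, and real scoring would be far more nuanced than a substring check.

```python
# Minimal sketch of the "execute test cases" step, assuming each model
# is exposed behind a simple text-in/text-out callable. All names and
# the scoring rule here are hypothetical placeholders.
from typing import Callable, Dict, List

TestCase = Dict[str, str]  # {"prompt": ..., "expected_keyword": ...}

TEST_CASES: List[TestCase] = [
    {"prompt": "What is 2 + 2?", "expected_keyword": "4"},
    {"prompt": "Name the capital of France.", "expected_keyword": "Paris"},
]

def execute_test_cases(run: Callable[[str], str],
                       cases: List[TestCase]) -> float:
    """Run every case through the model and return the pass rate."""
    passed = 0
    for case in cases:
        answer = run(case["prompt"])
        # Crude keyword check; real evaluation would score answers
        # with rubrics or a reference model rather than substrings.
        if case["expected_keyword"].lower() in answer.lower():
            passed += 1
    return passed / len(cases)

# Dummy "model" so the sketch runs end to end without any weights.
def echo_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "I don't know."

score = execute_test_cases(echo_model, TEST_CASES)
print(f"pass rate: {score:.0%}")  # → pass rate: 50%
```

Swapping `echo_model` for a call into a locally downloaded model (the "Download model" steps above) lets the same harness run unchanged across every model in the list.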
