Background
We have a chatbot that uses the PAL method to generate code calling custom functions, answering user questions in combination with user data in our system.
The user data is sleep and exercise data uploaded by users through smart wearable devices. The data types are very rich (50+ data fields), and the data volume is large (every user generates multiple records per day).
The LLM generates Python code that determines the time range of the query, the data fields to fetch, the function orchestration, and other details.
Prompt template:
As a sleep and sport AI, you focus on sleep, health, and exercise. You provide Python code to answer sleep-related or sport-related questions with personal data.
...
## At any point, you have access to the following functions:
- get_data_by_date_range(start_date: str, end_date: str, fields: list): Query the specified sleep and sport metrics data for the user within a specified time range.
- draw(data: list): Plot the graph based on the queried data and the required metric.
- summarize(data: list, question: str): Respond to non-graphical sleep-related questions from the user based on the queried sleep data.
- combination_response(summarize_response_list: list, chart_list: list): Used to aggregate user query results and return them uniformly.
...
## Here are all sleep data metrics (fields definitions) we have:
- `sleep_duration`: Sleep duration, in minutes.
- `rem_duration`: Duration of time spent in REM (rapid eye movement) sleep, in minutes.
- `resting_heart_rate`: Resting heart rate.
... (The other 50+ field descriptions are omitted here)
## Here are some examples of how to use the functions:
Human: Show a chart displaying my sleep duration and heart rate this week, and give me some suggestions.
AI:
```python
start_date = "2023-04-17"
end_date = "2023-04-23"
fields = ["sleep_duration","resting_heart_rate"]
sleep_data = get_data_by_date_range(start_date, end_date, fields)
summarize_resp = summarize(sleep_data, "Please analyze my sleep data and give me some suggestions.")
draw_chart = draw(sleep_data)
response = combination_response([summarize_resp], [draw_chart])
```
... (The other 4 examples are omitted here)
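To make the execution flow concrete, here is a minimal, hypothetical sketch of how the generated code could be run PAL-style: the four documented tool functions are stubbed out (the real implementations would query the wearable-data store and render charts), and the LLM's output is executed in a namespace restricted to those functions. Everything beyond the four function names from the prompt is an assumption for illustration.

```python
# Hypothetical stubs for the prompt's tool functions (assumptions for illustration);
# real implementations would query the data store and render charts.
def get_data_by_date_range(start_date, end_date, fields):
    return [{"date": start_date, **{f: 0 for f in fields}}]

def draw(data):
    return {"type": "chart", "points": len(data)}

def summarize(data, question):
    return f"Summary of {len(data)} records for: {question}"

def combination_response(summarize_response_list, chart_list):
    return {"summaries": summarize_response_list, "charts": chart_list}

def run_generated_code(code: str):
    """Execute LLM-generated code in a namespace limited to the tool functions."""
    namespace = {
        "get_data_by_date_range": get_data_by_date_range,
        "draw": draw,
        "summarize": summarize,
        "combination_response": combination_response,
    }
    # Empty __builtins__ keeps the generated code from reaching arbitrary builtins.
    exec(code, {"__builtins__": {}}, namespace)
    return namespace.get("response")

generated = '''
start_date = "2023-04-17"
end_date = "2023-04-23"
fields = ["sleep_duration", "resting_heart_rate"]
sleep_data = get_data_by_date_range(start_date, end_date, fields)
summarize_resp = summarize(sleep_data, "Please analyze my sleep data and give me some suggestions.")
draw_chart = draw(sleep_data)
response = combination_response([summarize_resp], [draw_chart])
'''
result = run_generated_code(generated)
print(result["charts"][0]["points"])  # 1
```

Note that a bare `exec` with an emptied `__builtins__` is only a sketch; production use would need a proper sandbox.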
Thank you very much for your patience in reading this far. I included a lot of background to describe the problem, which made this text quite long.
Question
PAL is an amazing method, and we already use it in production. We want to replace the OpenAI LLM with an open-source LLM, and we have run into some problems:
- Many popular open-source LLMs have been released recently. Have any supplementary tests been conducted? Is there a recommended open-source LLM?
- Accuracy on our test cases:
| LLM | Accuracy |
|---|---|
| gpt-3.5-turbo | 96% |
| PaLM2(text-bison@001) | 72.88% |
| WizardCoder-15B | 45% |
| Vicuna-13B | 29% |
Our analysis found:
a. Compared with gpt-3.5-turbo, PaLM2, WizardCoder, and Vicuna all show a decline in date-reasoning performance. Is there any way to improve date reasoning?
b. The generalization ability of WizardCoder-15B and Vicuna-13B is insufficient: much of the output code essentially copies the few-shot examples rather than being generated for the actual question. Is this caused by an insufficient number of model parameters?
- Our prompt is very long: it contains many field descriptions, function descriptions, and few-shot examples. Can we fine-tune to reduce the number of input tokens? The few-shot examples could be dropped, but can the function descriptions and field definitions also be omitted?
- Any suggestions for building the training and testing datasets for fine-tuning?
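On the date-reasoning question, one common mitigation, offered as a sketch rather than a tested fix, is to take calendar arithmetic out of the model's hands: inject today's date into the prompt and have the model emit only a relative phrase, which deterministic code then resolves to an ISO date range. The `resolve_range` helper and its phrase set below are hypothetical:

```python
from datetime import date, timedelta

def resolve_range(phrase: str, today: date) -> tuple:
    """Map a relative-date phrase to an ISO (start, end) pair deterministically,
    so the LLM only emits the phrase instead of doing date arithmetic.
    The phrase set here is a small illustrative assumption."""
    if phrase == "this week":
        start = today - timedelta(days=today.weekday())  # Monday of this week
        end = start + timedelta(days=6)                  # Sunday of this week
    elif phrase == "last week":
        end = today - timedelta(days=today.weekday() + 1)  # most recent Sunday
        start = end - timedelta(days=6)
    elif phrase == "last 7 days":
        start, end = today - timedelta(days=6), today
    else:
        raise ValueError(f"unknown phrase: {phrase}")
    return start.isoformat(), end.isoformat()

# 2023-04-19 was a Wednesday, so "this week" spans Mon 04-17 .. Sun 04-23.
print(resolve_range("this week", date(2023, 4, 19)))  # ('2023-04-17', '2023-04-23')
```

With this split, a smaller model only has to choose a phrase, which may narrow the date-reasoning gap against gpt-3.5-turbo.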
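On the dataset question, one starting point could be to pair a shortened prompt (few-shot examples removed) with the reference PAL code as the training target, one JSON record per line. The Alpaca-style `instruction`/`input`/`output` field names below are an assumption, not a required format:

```python
import json

# One hypothetical fine-tuning record: shortened task description plus the
# user question as input, and the reference PAL code as the target output.
record = {
    "instruction": "You provide Python code to answer sleep- and sport-related "
                   "questions with personal data, using the documented functions.",
    "input": "Show a chart of my sleep duration and heart rate this week.",
    "output": (
        'start_date = "2023-04-17"\n'
        'end_date = "2023-04-23"\n'
        'fields = ["sleep_duration", "resting_heart_rate"]\n'
        'sleep_data = get_data_by_date_range(start_date, end_date, fields)\n'
        'response = combination_response([], [draw(sleep_data)])\n'
    ),
}
line = json.dumps(record, ensure_ascii=False)  # one JSONL line of the dataset
print(len(json.loads(line)))  # 3
```

For testing, holding out questions whose field combinations and date phrases never appear in training would directly probe the generalization weakness described above.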
Thanks again, everyone. If you can pick some of these questions and help me answer them, I would really appreciate it.