recipes/use_cases/end2end-recipes/raft/README.md
## Introduction:
As our Meta Llama models become more popular, we have noticed a great demand to apply them to custom domains to better serve the customers in those domains.
For example, a common scenario can be that a company has all the related documents in plain text for its custom domain and wants to build a chatbot that can help answer questions a client could have.
**NOTE** Please make sure the port is not already in use. Since the Meta Llama 3 70B Instruct model requires at least 135GB of GPU memory, we need to use multiple GPUs to host it in a tensor-parallel way.
**NOTE** When using a cloud API, you need to be aware of the RPM (requests per minute), TPM (tokens per minute) and TPD (tokens per day) limits on your account for whichever model API provider you use. This is experimental and depends entirely on your documents, the wealth of information in them, and how you prefer to handle questions, e.g. short or longer answers.
This python script will read all the documents, either from local files or the web, and split the data into text chunks of 1000 characters (defined by "chunk_size") using the RecursiveCharacterTextSplitter.
Then we apply the question_prompt_template, defined in "raft.yaml", to each chunk to get a question list out of that text chunk.
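As a rough sketch of this step (not the exact code in the recipe), the chunking and question generation could look like the following; the input file name, the chunk overlap, the simplified prompt, and the VLLM endpoint details are all assumptions.

```python
# Minimal sketch (assumptions: a local text file, a VLLM server exposing an
# OpenAI-compatible API on port 8001, and a simplified question prompt).
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI

# Split the raw document into 1000-character chunks, as described above.
text = open("data/llama_faq.txt").read()            # hypothetical input file
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(text)

# Ask the hosted model to generate questions for each chunk (simplified prompt;
# the real recipe uses the question_prompt_template from raft.yaml).
client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
system_prompt = ("Generate 5 questions that can be answered from the given text. "
                 "Output only the questions.")
question_lists = []
for chunk in chunks:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": chunk}],
    )
    question_lists.append(response.choices[0].message.content.splitlines())
```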
We now have a related context as a text chunk and a corresponding question list. For each question in the question list, we want to generate a Chain-of-Thought (COT) style answer using Llama 3 70B Instruct as well.
Once we have the COT answers, we can start to make a dataset where each sample contains an "instruction" section that includes some unrelated chunks called distractors and, with a probability P, also includes the related chunk.
Here is a RAFT format json example from our saved raft.jsonl file. We have a "question" section for the generated question, a "cot_answer" section for the generated COT answer, where the final answer is added after the "<ANSWER>" token, and we also created an "instruction" section
that has all the documents included (each document delimited by <DOCUMENT> <\/DOCUMENT> tags) and finally the question appended at the very end. This "instruction"
section will be the input during training, and the "cot_answer" will be the output label that the loss is calculated on.
```python
"instruction":"<DOCUMENT> DISTRACT_DOCS 1 <\/DOCUMENT>...<DOCUMENT> DISTRACT_DOCS 5 <\/DOCUMENT>\nWhat is the context length supported by Llama 3 models?"
}
```
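The distractor construction described above can be sketched roughly as follows; the number of distractors, the probability value, and the helper name are illustrative assumptions rather than the recipe's actual implementation.

```python
# Rough sketch of assembling one RAFT training sample (illustrative only).
import random

def build_sample(question, cot_answer, oracle_chunk, all_chunks,
                 num_distract=4, p_include_oracle=0.8):
    # Pick unrelated chunks as distractors.
    distractors = random.sample(
        [c for c in all_chunks if c != oracle_chunk], num_distract)
    docs = list(distractors)
    # With probability P, also include the related (oracle) chunk.
    if random.random() < p_include_oracle:
        docs.append(oracle_chunk)
    random.shuffle(docs)
    instruction = "".join(f"<DOCUMENT> {d} </DOCUMENT>" for d in docs)
    instruction += "\n" + question
    return {"question": question, "cot_answer": cot_answer,
            "instruction": instruction}
```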
To create an evalset, ideally we should use human annotation to create the question and answer pairs to make sure the questions are relevant and the answers are fully correct.
However, for demo purposes, we will use a subset of the training json as the eval set. We can shuffle and randomly select 100 examples out of the RAFT dataset. For evaluation purposes, we only need to keep the "question" section,
and the final answer section, marked by the <ANSWER> tag in "cot_answer". Then we can manually check each example and remove the low-quality examples, where the questions
are not related to Llama or cannot be inferred without the correct context. After the manual check, we keep 72 question and answer pairs as eval_llama.json.
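As an illustration, the shuffling and final-answer extraction could be done along these lines; the file paths and the output layout are assumptions based on the example above.

```python
# Sketch: build a small eval set from the RAFT training data (assumed file names).
import json
import random

with open("output/raft.jsonl") as f:          # assumed path of the RAFT dataset
    samples = [json.loads(line) for line in f]

random.shuffle(samples)
eval_set = []
for sample in samples[:100]:
    cot = sample["cot_answer"]
    # Keep only the question and the final answer after the <ANSWER> tag.
    if "<ANSWER>" in cot:
        final_answer = cot.split("<ANSWER>")[-1].strip(": ").strip()
        eval_set.append({"question": sample["question"], "answer": final_answer})

# After manually removing low-quality pairs, save the remaining examples.
with open("eval_llama.json", "w") as f:
    json.dump(eval_set, f, indent=2)
```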
### Step 3: Run the fine-tuning
Once the RAFT dataset is ready in json format, we can start the fine-tuning step. Unfortunately, we found that the LORA method did not produce good results, so we have to use full fine-tuning with the following commands in the llama-recipes main folder:
For more details, please check the readme in the finetuning recipe.
### Step 4: Evaluating with local inference
Once we have the fine-tuned model, we now need to evaluate it to understand its performance. We can use traditional eval methods, e.g. calculating the exact match rate or the ROUGE score.
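As a rough illustration, such traditional metrics could be computed along the lines of the sketch below, using the `rouge_score` package; the example strings are placeholders, not data from this recipe.

```python
# Sketch: exact match rate and ROUGE-L between model answers and ground truth.
from rouge_score import rouge_scorer

ground_truths = ["Llama 3 supports a context length of 8K tokens."]   # placeholder data
predictions   = ["The Llama 3 models support an 8K context length."]  # placeholder data

exact_match = sum(p.strip().lower() == g.strip().lower()
                  for p, g in zip(predictions, ground_truths)) / len(ground_truths)

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = sum(scorer.score(g, p)["rougeL"].fmeasure
              for p, g in zip(predictions, ground_truths)) / len(ground_truths)

print(f"Exact match: {exact_match:.2f}, ROUGE-L F1: {rouge_l:.2f}")
```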
In this tutorial, we also use an LLM to act as a judge to score the model generated answers.
On another terminal, we can use another Meta Llama 3 70B Instruct model as a judge.
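Below is a hedged sketch of what such an LLM-judge call could look like against an OpenAI-compatible VLLM endpoint; the judge prompt, the port, and the yes/no scoring scheme are assumptions for illustration, not the exact evaluation script in this recipe.

```python
# Sketch: ask a hosted Llama 3 70B Instruct model (OpenAI-compatible VLLM server,
# assumed to be on port 8001) to judge whether a generated answer matches the ground truth.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

judge_prompt = (
    "You are a judge. Given a question, a ground-truth answer and a model answer, "
    "reply YES if the model answer is factually consistent with the ground truth, otherwise NO."
)

def judge(question: str, ground_truth: str, model_answer: str) -> bool:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        messages=[
            {"role": "system", "content": judge_prompt},
            {"role": "user", "content": f"Question: {question}\n"
                                        f"Ground truth: {ground_truth}\n"
                                        f"Model answer: {model_answer}"},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```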
recipes/use_cases/end2end-recipes/raft/raft.yaml
COT_prompt_template: >
  <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful chatbot who can provide an answer to every question from the user given a relevant context.<|eot_id|>
  Answer this question using the information given by multiple documents in the context above. Here are things to pay attention to:
  - The context contains many documents, each document starts with <DOCUMENT> and ends with </DOCUMENT>.
  - First provide step-by-step reasoning on how to answer the question.
  - In the reasoning, if you need to copy and paste some sentences from the context, include them in ##begin_quote## and ##end_quote##. This would mean that things outside of ##begin_quote## and ##end_quote## are not directly copied from the context.
  - End your response with the final answer in the form <ANSWER>: $answer; the answer should be less than 60 words.
  You MUST begin your final answer with the tag "<ANSWER>:" <|eot_id|><|start_header_id|>assistant<|end_header_id|>
question_prompt_template: >
  <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a synthetic question-answer pair generator. Given a chunk of context about
  some topic(s), generate {num_questions} example questions a user could ask and would be answered
  using information from the chunk. For example, if the given context was a Wikipedia
  paragraph about the United States, an example question could be 'How many states are
  in the United States?'
  Your questions should be formulated in the same style as questions that users could ask in a search engine.
  This means that your questions MUST NOT mention something like "according to the passage" or "context".
  The questions should be able to be answered in 60 words or less. Include only the questions in your response.<|eot_id|>
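To make the use of this config concrete, here is a small sketch of how these templates might be loaded and filled in; the file path and the placeholder value are assumptions.

```python
# Sketch: load raft.yaml and fill in the question prompt template (assumed relative path).
import yaml

with open("recipes/use_cases/end2end-recipes/raft/raft.yaml") as f:
    config = yaml.safe_load(f)

# The question template carries a {num_questions} placeholder to control how many
# questions are requested per chunk.
question_prompt = config["question_prompt_template"].format(num_questions=5)

# The COT template is used later, when generating chain-of-thought answers.
cot_prompt = config["COT_prompt_template"]
print(question_prompt[:200])
```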