
Commit af53ee0

added refusal and adjusted prompts
1 parent ca74a84 commit af53ee0

File tree

6 files changed: +85 -74 lines changed


recipes/use_cases/end2end-recipes/raft/README.md

Lines changed: 21 additions & 19 deletions
@@ -1,5 +1,5 @@
## Introduction:
- As our Meta llama models become more popular, we noticed there is a great demand to apply our Meta Llama models toward a custom domain to better serve the customers in that domain.
+ As our Meta Llama models become more popular, we noticed that there is a great demand to apply our Meta Llama models toward a custom domain to better serve the customers in that domain.
For example, a common scenario can be that a company has all the related documents in plain text for its custom domain and wants to build a chatbot that can help answer questions a client
could have.

@@ -38,7 +38,7 @@ We can use on prem solutions such as the [TGI](../../../../inference/model_serve

```bash
# Make sure VLLM has been installed
- CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 2 --disable-log-requests --port 8001
+ CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 2 --disable-log-requests --port 8001
```

**NOTE** Please make sure the port has not been used. Since the Meta Llama 3 70B Instruct model requires at least 135GB of GPU memory, we need to use multiple GPUs to host it in a tensor-parallel way.
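As a quick sanity check that the server is up, the vLLM endpoint can be queried through its OpenAI-compatible API. The snippet below is only an illustrative sketch (the port and model name follow the command above; the prompt is made up), not something raft.py itself requires:

```python
# Minimal sketch: query the locally hosted vLLM OpenAI-compatible server.
# Assumes the server was started on port 8001 as in the command above;
# vLLM does not check the api_key by default.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "What is the context length of Llama 3 models?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```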
@@ -58,12 +58,14 @@ python raft.py -u "CLOUD_API_URL" -t 5

**NOTE** When using a cloud API, you need to be aware of the RPM (requests per minute), TPM (tokens per minute) and TPD (tokens per day) limits on your account in case you use any of the model API providers. This is experimental and totally depends on your documents, the wealth of information in them, and how you prefer to handle questions, short or longer answers, etc.

- This python program will read all the documents inside of "data" folder and transform the text into embeddings and split the data into batches by the SemanticChunker. Then we apply the question_prompt_template, defined in "raft.yaml", to each batch, and finally we will use each batch to query VLLM server and save the return a list of question list for all batches.
+ This Python script will read all the documents, either local or from the web, and split the data into text chunks of 1000 characters (defined by "chunk_size") using RecursiveCharacterTextSplitter.
+ Then we apply the question_prompt_template, defined in "raft.yaml", to each chunk to get a question list out of the text chunk.

- We now have a related context as text chunk and a corresponding question list. For each question in the question list, we want to generate a Chain-of-Thought (COT) style question using Llama 3 70B Instruct as well. Once we have the COT answers, we can start to make a dataset that contains "instruction" which includes some unrelated chunks called distractor and has a probability P to include the related chunk.
+ We now have a related context as a text chunk and a corresponding question list. For each question in the question list, we also want to generate a Chain-of-Thought (COT) style answer using Llama 3 70B Instruct.
+ Once we have the COT answers, we can start to build a dataset where each sample contains an "instruction" section that includes some unrelated chunks called distractors and, with probability P, the related chunk.

- Here is a RAFT format json example. We have a "question" section for the generated question, "cot_answer" section for generated COT answers, where the final answer will be added after "<ANSWER>" token, and we also created a "instruction" section
- that has all the documents included (each document splited by <DOCUMENT> <\/DOCUMENT>) and finally the question appended in the very end. This "instruction"
+ Here is a RAFT format JSON example from our saved raft.jsonl file. We have a "question" section for the generated question, a "cot_answer" section for the generated COT answer, where the final answer is added after the "<ANSWER>" token, and we also created an "instruction" section
+ that has all the documents included (each document wrapped by <DOCUMENT> <\/DOCUMENT> tags) and finally the question appended at the very end. This "instruction"
section will be the input during training, and the "cot_answer" will be the output label that the loss will be calculated on.

```python
@@ -98,31 +100,31 @@ section will be the input during the training, and the "cot_answer" will be the
"instruction":"<DOCUMENT> DISTRACT_DOCS 1 <\/DOCUMENT>...<DOCUMENT> DISTRACT_DOCS 5 <\/DOCUMENT>\nWhat is the context length supported by Llama 3 models?"
}
```
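To make the "instruction" field above concrete, here is a rough, hypothetical sketch of how a single RAFT sample could be assembled from one oracle (related) chunk and several distractor chunks. The helper name and signature are illustrative, not the actual add_chunk_to_dataset implementation; the defaults mirror the num_distract_docs=5 and orcale_p=0.8 values used elsewhere in this recipe.

```python
# Illustrative sketch of RAFT sample assembly (not the real add_chunk_to_dataset).
import random

def build_instruction(oracle_chunk, distractor_chunks, question, oracle_p=0.8, num_distract=5):
    docs = random.sample(distractor_chunks, num_distract)
    # With probability P, replace one distractor with the related (oracle) chunk.
    if random.random() < oracle_p:
        docs[random.randrange(num_distract)] = oracle_chunk
    context = "".join(f"<DOCUMENT>{d}</DOCUMENT>" for d in docs)
    # Documents first, question appended at the very end, as in the example above.
    return f"{context}\n{question}"
```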
- To create a evalset, we can shuffle and select 100 examples out of RAFT dataset. For evaluation purpose, we only need to keep the "question" section, and the final answer section in
- "cot_answer",
+ To create an eval set, ideally we should use human annotation to create the question and answer pairs to make sure the questions are relevant and the answers are fully correct.
+ However, for demo purposes, we will use a subset of the training JSON as the eval set. We can shuffle and randomly select 100 examples out of the RAFT dataset. For evaluation purposes, we only need to keep the "question" section,
+ and the final answer section, marked by the <ANSWER> tag in "cot_answer". Then we can manually check each example and remove those low-quality examples where the questions
+ are not related to Llama or cannot be answered without the correct context. After the manual check, we keep 72 question and answer pairs as eval_llama.json.

### Step 3: Run the fine-tuning
- Once the RAFT dataset is ready, we can start the full fine-tuning step using the following commands in the llama-recipe main folder:
+ Once the RAFT dataset is ready in JSON format, we can start the fine-tuning steps. Unfortunately, we found that the LoRA method did not produce good results, so we have to use full fine-tuning with the following command in the llama-recipes main folder:

- For distributed fine-tuning:
```bash
- CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes 1 --nproc_per_node 4 recipes/finetuning/finetuning.py --lr 1e-5 --context_length 8192 --enable_fsdp --model_name meta-llama/Meta-Llama-3-8B-Instruct --output_dir pt_ep1_full0614 --num_epochs 1 --batch_size_training 4 --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/finetuning/datasets/raft_dataset.py" --use-wandb --run_validation True --custom_dataset.data_path 'recipes/use_cases/end2end-recipes/raft/raft.jsonl'
+ torchrun --nnodes 1 --nproc_per_node 4 recipes/finetuning/finetuning.py --enable_fsdp --lr 1e-5 --context_length 8192 --num_epochs 1 --batch_size_training 1 --model_name meta-llama/Meta-Llama-3-8B-Instruct --dist_checkpoint_root_folder PATH_TO_ROOT_FOLDER --dist_checkpoint_folder fine-tuned --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/finetuning/datasets/raft_dataset.py" --use-wandb --run_validation True --custom_dataset.data_path 'PATH_TO_RAFT_JSON'
```
- ```bash
- torchrun --nnodes 1 --nproc_per_node 4 recipes/finetuning/finetuning.py --enable_fsdp --lr 1e-5 --context_length 8192 --num_epochs 1 --batch_size_training 2 --model_name meta-llama/Meta-Llama-3-8B-Instruct --dist_checkpoint_root_folder llama+pt_ep1_full0616 --dist_checkpoint_folder fine-tuned --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/finetuning/datasets/raft_dataset.py" --use-wandb --run_validation True --custom_dataset.data_path 'recipes/use_cases/end2end-recipes/raft/pytorch_data/all_17k.jsonl'
- ```
- Then convert the FSDP checkpoint to HuggingFace checkpoints using:
+
+ Then convert the FSDP checkpoint to a HuggingFace checkpoint using the following command:

```bash
- python src/llama_recipes/inference/checkpoint_converter_fsdp_hf.py --fsdp_checkpoint_path /home/kaiwu/work/llama-recipes/llama+pt_ep1_full0616/fine-tuned-meta-llama/Meta-Llama-3-8B-Instruct --consolidated_model_path /home/kaiwu/work/llama-recipes/llama+pt_ep1_full0616/fine-tuned-meta-llama --HF_model_path_or_name /home/kaiwu/work/llama-recipes/llama+pt_ep1_full0616/
+ python src/llama_recipes/inference/checkpoint_converter_fsdp_hf.py --fsdp_checkpoint_path PATH_TO_ROOT_FOLDER --consolidated_model_path PATH_TO_ROOT_FOLDER/fine-tuned-meta-llama --HF_model_path_or_name PATH_TO_ROOT_FOLDER

```

For more details, please check the readme in the finetuning recipe.

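After conversion, the consolidated checkpoint can be loaded like any other HuggingFace model. A minimal sketch follows; the paths reuse the placeholders from the command above, and reusing the base model's tokenizer is an assumption:

```python
# Sketch: load the converted HuggingFace checkpoint for local inference.
# PATH_TO_ROOT_FOLDER/fine-tuned-meta-llama is the --consolidated_model_path above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("PATH_TO_ROOT_FOLDER/fine-tuned-meta-llama")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # assumed: reuse the base tokenizer
```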
### Step 4: Evaluating with local inference

- Once we have the fine-tuned model, we now need to evaluate it to understand its performance. Normally, to create a evaluation set, we should first gather some questions and manually write the ground truth answer. In this case, we created a eval set mostly based on the Llama [Troubleshooting & FAQ](https://llama.meta.com/faq/), where the answers are written by human experts. Then we pass the evalset question to our fine-tuned model to get the model generated answers. To compare the model generated answers with ground truth, we can use either traditional eval method, eg. calcucate rouge score, or use LLM to act like a judge to score the similarity of them.
+ Once we have the fine-tuned model, we need to evaluate it to understand its performance. We can use traditional eval methods, e.g. calculating the exact match rate or the ROUGE score.
+ In this tutorial, we can also use an LLM to act as a judge and score the model generated answers.

```bash
@@ -142,10 +144,10 @@ On another terminal, we can use another Meta Llama 3 70B Instruct model as a jud
CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 2 --disable-log-requests --port 8002
```

- Then we can pass the port to the eval script:
+ Then we can pass the ports to the eval script:

```bash
- CUDA_VISIBLE_DEVICES=5 python raft_eval.py -m raft-8b -v 8000 -j 8001 -o all_rag5 -r 5
+ CUDA_VISIBLE_DEVICES=1 python raft_eval.py -m raft-8b -v 8000 -j 8001 -r 5
```


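For reference, the generated raft.jsonl used throughout these steps can be inspected with the datasets library. This is a small sketch; the file path is illustrative and simply matches the output name mentioned in the README above:

```python
# Sketch: peek at a generated RAFT dataset (path is illustrative).
from datasets import load_dataset

ds = load_dataset("json", data_files="raft.jsonl", split="train")
print(ds[0]["question"])
print(ds[0]["cot_answer"].split("<ANSWER>")[-1])  # the final answer follows the <ANSWER> token
```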
recipes/use_cases/end2end-recipes/raft/raft.py

Lines changed: 2 additions & 6 deletions
@@ -1,7 +1,4 @@
import logging
- from typing import Literal, Any
- import json
- import random
import os
import argparse
from raft_utils import generate_questions, add_chunk_to_dataset
@@ -10,8 +7,6 @@

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

- NUM_DISTRACT_DOCS = 5 # number of distracting documents to add to each chunk
- ORCALE_P = 0.8 # probability of related documents to be added to each chunk
def main(api_config):
ds = None
try:
@@ -26,7 +21,7 @@ def main(api_config):
for question in questions:
logging.info(f"Question: {question}")
logging.info(f"Successfully generated {sum([len(q) for c,q in chunk_questions_zip])} question/answer pairs.")
- ds = add_chunk_to_dataset(chunk_questions_zip,api_config,ds,NUM_DISTRACT_DOCS, ORCALE_P)
+ ds = add_chunk_to_dataset(chunk_questions_zip,api_config,ds)
ds.save_to_disk(args.output)
logging.info(f"Data successfully written to {api_config['output']}. Process completed.")
formatter = DatasetConverter()
@@ -92,6 +87,7 @@ def parse_arguments():
api_config["api_key"] = os.environ["API_KEY"]
logging.info(f"Configuration loaded. Generating {args.questions_per_chunk} question per chunk using model '{args.model}'.")
logging.info(f"Chunk size: {args.chunk_size}.")
+ logging.info(f"num_distract_docs: {api_config['num_distract_docs']}, orcale_p: {api_config['orcale_p']}")
logging.info(f"Will use endpoint_url: {args.endpoint_url}.")
logging.info(f"Output will be written to {args.output}.")
main(api_config)
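Judging from the new log line, num_distract_docs and orcale_p now travel inside api_config instead of being module-level constants. The sketch below is one hypothetical way to supply them from a config file; whether they actually live in raft.yaml or come from CLI arguments is not shown in this diff:

```python
# Hypothetical sketch: supply the distractor settings via config.
# Key names follow the new log line above; the defaults mirror the removed constants.
import yaml

with open("raft.yaml") as f:
    api_config = yaml.safe_load(f)

api_config.setdefault("num_distract_docs", 5)  # distracting documents added per sample
api_config.setdefault("orcale_p", 0.8)         # probability of including the related (oracle) chunk
```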

recipes/use_cases/end2end-recipes/raft/raft.yaml

Lines changed: 34 additions & 31 deletions
@@ -1,40 +1,43 @@
COT_prompt_template: >
- <|begin_of_text|><|start_header_id|>system<|end_header_id|> Answer the following question using the information given in the context below. Here is things to pay attention to:
- - First provide step-by-step reasoning on how to answer the question.
- - In the reasoning, if you need to copy paste some sentences from the context, include them in ##begin_quote## and ##end_quote##. This would mean that things outside of ##begin_quote## and ##end_quote## are not directly copy paste from the context.
- - End your response with final answer in the form <ANSWER>: $answer, the answer should less than 60 words.
- You MUST begin your final answer with the tag "<ANSWER>: <|eot_id|>
+ <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful chatbot who can provide an answer to every questions from the user given a relevant context.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
- Question: {question}\nContext: {context}\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>
-
- # question_prompt_template: >
- # <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a synthetic question-answer pair generator. Given a chunk of context about
- # some topic(s), generate {num_questions} example questions a user could ask and would be answered
- # using information from the chunk. For example, if the given context was a Wikipedia
- # paragraph about the United States, an example question could be 'How many states are
- # in the United States?
- # The questions should be able to be answered in 100 words or less. Include only the
- # questions in your response.<|eot_id|>
- # <|start_header_id|>user<|end_header_id|>
- # Context: {context}\n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
+ Question: {question}\nContext: {context}\n
+ Answer this question using the information given by multiple documents in the context above. Here is things to pay attention to:
+ - The context contains many documents, each document starts with <DOCUMENT> and ends </DOCUMENT>.
+ - First provide step-by-step reasoning on how to answer the question.
+ - In the reasoning, if you need to copy paste some sentences from the context, include them in ##begin_quote## and ##end_quote##. This would mean that things outside of ##begin_quote## and ##end_quote## are not directly copy paste from the context.
+ - End your response with final answer in the form <ANSWER>: $answer, the answer should less than 60 words.
+ You MUST begin your final answer with the tag "<ANSWER> <|eot_id|><|start_header_id|>assistant<|end_header_id|>

question_prompt_template: >
- <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a language model skilled in creating quiz questions.
- You will be provided with a document,
- read it and please generate factoid question and answer pairs that are most likely be asked by a user of Llama language models
- which includes LLama, Llama2, Meta Llama3, Code Llama, Meta Llama Guard 1, Meta Llama Guard 2
- Your factoid questions should be answerable with a specific, concise piece of factual information from the context.
- Your factoid questions should be formulated in the same style as questions users could ask in a search engine.
- This means that your factoid questions MUST NOT mention something like "according to the passage" or "context".
- please make sure you follow those rules:
- 1. Generate {num_questions} question answer pairs, you can generate less answer if there is nothing related to
- model, training, fine-tuning and evaluation details of Llama language models,
- 2. The questions can be answered based *solely* on the given passage.
- 3. Avoid asking questions with similar meaning.
- 4. Never use any abbreviation.
- 5. The questions should be able to be answered in 60 words or less. Include only the questions in your response. <|eot_id|>
+ <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a synthetic question-answer pair generator. Given a chunk of context about
+ some topic(s), generate {num_questions} example questions a user could ask and would be answered
+ using information from the chunk. For example, if the given context was a Wikipedia
+ paragraph about the United States, an example question could be 'How many states are
+ in the United States?
+ Your questions should be formulated in the same style as questions that users could ask in a search engine.
+ This means that your questions MUST NOT mention something like "according to the passage" or "context".
+ The questions should be able to be answered in 60 words or less. Include only the questions in your response.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Context: {context}\n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
+
+ # question_prompt_template: >
+ # <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a language model skilled in creating quiz questions.
+ # You will be provided with a document,
+ # read it and please generate factoid question and answer pairs that are most likely be asked by a user of Llama language models
+ # which includes LLama, Llama2, Meta Llama3, Code Llama, Meta Llama Guard 1, Meta Llama Guard 2
+ # Your factoid questions should be answerable with a specific, concise piece of factual information from the context.
+ # Your factoid questions should be formulated in the same style as questions users could ask in a search engine.
+ # This means that your factoid questions MUST NOT mention something like "according to the passage" or "context".
+ # please make sure you follow those rules:
+ # 1. Generate {num_questions} question answer pairs, you can generate less answer if there is nothing related to
+ # model, training, fine-tuning and evaluation details of Llama language models,
+ # 2. The questions can be answered based *solely* on the given passage.
+ # 3. Avoid asking questions with similar meaning.
+ # 4. Never use any abbreviation.
+ # 5. The questions should be able to be answered in 60 words or less. Include only the questions in your response. <|eot_id|>
+ # <|start_header_id|>user<|end_header_id|>
+ # Context: {context}\n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
data_dir: "./data"

xml_path: ""
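The templates above are plain strings with {question}, {context}, and {num_questions} placeholders. One plausible way to fill the COT template is shown below; using str.format here is an assumption, and the actual raft_utils.py code may render it differently:

```python
# Sketch: fill the COT_prompt_template from raft.yaml (str.format is assumed).
import yaml

with open("raft.yaml") as f:
    config = yaml.safe_load(f)

prompt = config["COT_prompt_template"].format(
    question="What is the context length supported by Llama 3 models?",
    context="<DOCUMENT> Llama 3 supports a context length of 8K tokens. </DOCUMENT>",
)
print(prompt)
```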

recipes/use_cases/end2end-recipes/raft/raft_eval.py

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
from langchain_openai import ChatOpenAI
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
- from langchain.text_splitter import RecursiveCharacterTextSplitter,TokenTextSplitter
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores.utils import DistanceStrategy
from datetime import datetime
from langchain_community.document_loaders import DirectoryLoader
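The imports in this file suggest the eval script builds a FAISS index over document chunks for retrieval during evaluation. A minimal sketch of that pattern follows; the embedding model name, data directory, and chunk sizes are illustrative assumptions, not necessarily what raft_eval.py uses:

```python
# Sketch: build a FAISS retriever over document chunks, mirroring the imports above.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

docs = DirectoryLoader("./data").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
db = FAISS.from_documents(chunks, embeddings)
retriever = db.as_retriever(search_kwargs={"k": 5})  # retrieve the top-5 chunks per question
```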
