
Commit af53ee0

added refusal and adjusted prompts
1 parent ca74a84 commit af53ee0

File tree

6 files changed: +85 -74 lines changed


recipes/use_cases/end2end-recipes/raft/README.md

Lines changed: 21 additions & 19 deletions
@@ -1,5 +1,5 @@
## Introduction:
- As our Meta llama models become more popular, we noticed there is a great demand to apply our Meta Llama models toward a custom domain to better serve the customers in that domain.
+ As our Meta Llama models become more popular, we noticed that there is a great demand to apply our Meta Llama models toward a custom domain to better serve the customers in that domain.
For example, a common scenario can be that a company has all the related documents in plain text for its custom domain and wants to build a chatbot that can help answer questions a client
could have.

@@ -38,7 +38,7 @@ We can use on prem solutions such as the [TGI](../../../../inference/model_serve

```bash
# Make sure VLLM has been installed
- CUDA_VISIBLE_DEVICES=6,7 python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 2 --disable-log-requests --port 8001
+ CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 2 --disable-log-requests --port 8001
```

**NOTE** Please make sure the port has not been used. Since the Meta Llama 3 70B Instruct model requires at least 135GB of GPU memory, we need to use multiple GPUs to host it in a tensor-parallel way.
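As a quick sanity check that the server is up, the vLLM endpoint can be queried through its OpenAI-compatible API. The snippet below is only an illustrative sketch (the port and model name follow the command above; the prompt is made up), not something raft.py itself requires:

```python
# Minimal sketch: query the locally hosted vLLM OpenAI-compatible server.
# Assumes the server was started on port 8001 as in the command above;
# vLLM does not check the api_key by default.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "What is the context length of Llama 3 models?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```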
@@ -58,12 +58,14 @@ python raft.py -u "CLOUD_API_URL" -t 5

**NOTE** When using a cloud API, you need to be aware of the RPM (requests per minute), TPM (tokens per minute) and TPD (tokens per day) limits on your account in case you use any of the model API providers. This is experimental and totally depends on your documents, the wealth of information in them, and how you prefer to handle questions, short or longer answers, etc.

- This python program will read all the documents inside of "data" folder and transform the text into embeddings and split the data into batches by the SemanticChunker. Then we apply the question_prompt_template, defined in "raft.yaml", to each batch, and finally we will use each batch to query VLLM server and save the return a list of question list for all batches.
+ This Python script will read all the documents, either local or from the web, and split the data into text chunks of 1000 characters (defined by "chunk_size") using RecursiveCharacterTextSplitter.
+ Then we apply the question_prompt_template, defined in "raft.yaml", to each chunk to get a question list out of the text chunk.

- We now have a related context as text chunk and a corresponding question list. For each question in the question list, we want to generate a Chain-of-Thought (COT) style question using Llama 3 70B Instruct as well. Once we have the COT answers, we can start to make a dataset that contains "instruction" which includes some unrelated chunks called distractor and has a probability P to include the related chunk.
+ We now have a related context as a text chunk and a corresponding question list. For each question in the question list, we also want to generate a Chain-of-Thought (COT) style answer using Llama 3 70B Instruct.
+ Once we have the COT answers, we can start to build a dataset where each sample contains an "instruction" section that includes some unrelated chunks called distractors and, with probability P, the related chunk.

- Here is a RAFT format json example. We have a "question" section for the generated question, "cot_answer" section for generated COT answers, where the final answer will be added after "<ANSWER>" token, and we also created a "instruction" section
- that has all the documents included (each document splited by <DOCUMENT> <\/DOCUMENT>) and finally the question appended in the very end. This "instruction"
+ Here is a RAFT format JSON example from our saved raft.jsonl file. We have a "question" section for the generated question, a "cot_answer" section for the generated COT answer, where the final answer is added after the "<ANSWER>" token, and we also created an "instruction" section
+ that has all the documents included (each document wrapped by <DOCUMENT> <\/DOCUMENT> tags) and finally the question appended at the very end. This "instruction"
section will be the input during training, and the "cot_answer" will be the output label that the loss will be calculated on.

```python
@@ -98,31 +100,31 @@ section will be the input during the training, and the "cot_answer" will be the
"instruction":"<DOCUMENT> DISTRACT_DOCS 1 <\/DOCUMENT>...<DOCUMENT> DISTRACT_DOCS 5 <\/DOCUMENT>\nWhat is the context length supported by Llama 3 models?"
}
```
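To make the "instruction" field above concrete, here is a rough, hypothetical sketch of how a single RAFT sample could be assembled from one oracle (related) chunk and several distractor chunks. The helper name and signature are illustrative, not the actual add_chunk_to_dataset implementation; the defaults mirror the num_distract_docs=5 and orcale_p=0.8 values used elsewhere in this recipe.

```python
# Illustrative sketch of RAFT sample assembly (not the real add_chunk_to_dataset).
import random

def build_instruction(oracle_chunk, distractor_chunks, question, oracle_p=0.8, num_distract=5):
    docs = random.sample(distractor_chunks, num_distract)
    # With probability P, replace one distractor with the related (oracle) chunk.
    if random.random() < oracle_p:
        docs[random.randrange(num_distract)] = oracle_chunk
    context = "".join(f"<DOCUMENT>{d}</DOCUMENT>" for d in docs)
    # Documents first, question appended at the very end, as in the example above.
    return f"{context}\n{question}"
```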
- To create a evalset, we can shuffle and select 100 examples out of RAFT dataset. For evaluation purpose, we only need to keep the "question" section, and the final answer section in
- "cot_answer",
+ To create an eval set, ideally we should use human annotation to create the question and answer pairs to make sure the questions are relevant and the answers are fully correct.
+ However, for demo purposes, we will use a subset of the training JSON as the eval set. We can shuffle and randomly select 100 examples out of the RAFT dataset. For evaluation purposes, we only need to keep the "question" section,
+ and the final answer section, marked by the <ANSWER> tag in "cot_answer". Then we can manually check each example and remove those low-quality examples where the questions
+ are not related to Llama or cannot be answered without the correct context. After the manual check, we keep 72 question and answer pairs as eval_llama.json.

### Step 3: Run the fine-tuning
- Once the RAFT dataset is ready, we can start the full fine-tuning step using the following commands in the llama-recipe main folder:
+ Once the RAFT dataset is ready in JSON format, we can start the fine-tuning steps. Unfortunately, we found that the LoRA method did not produce good results, so we have to use full fine-tuning with the following command in the llama-recipes main folder:

- For distributed fine-tuning:
```bash
- CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes 1 --nproc_per_node 4 recipes/finetuning/finetuning.py --lr 1e-5 --context_length 8192 --enable_fsdp --model_name meta-llama/Meta-Llama-3-8B-Instruct --output_dir pt_ep1_full0614 --num_epochs 1 --batch_size_training 4 --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/finetuning/datasets/raft_dataset.py" --use-wandb --run_validation True --custom_dataset.data_path 'recipes/use_cases/end2end-recipes/raft/raft.jsonl'
+ torchrun --nnodes 1 --nproc_per_node 4 recipes/finetuning/finetuning.py --enable_fsdp --lr 1e-5 --context_length 8192 --num_epochs 1 --batch_size_training 1 --model_name meta-llama/Meta-Llama-3-8B-Instruct --dist_checkpoint_root_folder PATH_TO_ROOT_FOLDER --dist_checkpoint_folder fine-tuned --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/finetuning/datasets/raft_dataset.py" --use-wandb --run_validation True --custom_dataset.data_path 'PATH_TO_RAFT_JSON'
```
- ```bash
- torchrun --nnodes 1 --nproc_per_node 4 recipes/finetuning/finetuning.py --enable_fsdp --lr 1e-5 --context_length 8192 --num_epochs 1 --batch_size_training 2 --model_name meta-llama/Meta-Llama-3-8B-Instruct --dist_checkpoint_root_folder llama+pt_ep1_full0616 --dist_checkpoint_folder fine-tuned --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/finetuning/datasets/raft_dataset.py" --use-wandb --run_validation True --custom_dataset.data_path 'recipes/use_cases/end2end-recipes/raft/pytorch_data/all_17k.jsonl'
- ```
- Then convert the FSDP checkpoint to HuggingFace checkpoints using:
+
+ Then convert the FSDP checkpoint to a HuggingFace checkpoint using the following command:

```bash
- python src/llama_recipes/inference/checkpoint_converter_fsdp_hf.py --fsdp_checkpoint_path /home/kaiwu/work/llama-recipes/llama+pt_ep1_full0616/fine-tuned-meta-llama/Meta-Llama-3-8B-Instruct --consolidated_model_path /home/kaiwu/work/llama-recipes/llama+pt_ep1_full0616/fine-tuned-meta-llama --HF_model_path_or_name /home/kaiwu/work/llama-recipes/llama+pt_ep1_full0616/
+ python src/llama_recipes/inference/checkpoint_converter_fsdp_hf.py --fsdp_checkpoint_path PATH_TO_ROOT_FOLDER --consolidated_model_path PATH_TO_ROOT_FOLDER/fine-tuned-meta-llama --HF_model_path_or_name PATH_TO_ROOT_FOLDER

```

For more details, please check the readme in the finetuning recipe.

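After conversion, the consolidated checkpoint can be loaded like any other HuggingFace model. A minimal sketch follows; the paths reuse the placeholders from the command above, and reusing the base model's tokenizer is an assumption:

```python
# Sketch: load the converted HuggingFace checkpoint for local inference.
# PATH_TO_ROOT_FOLDER/fine-tuned-meta-llama is the --consolidated_model_path above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("PATH_TO_ROOT_FOLDER/fine-tuned-meta-llama")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # assumed: reuse the base tokenizer
```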
### Step 4: Evaluating with local inference

- Once we have the fine-tuned model, we now need to evaluate it to understand its performance. Normally, to create a evaluation set, we should first gather some questions and manually write the ground truth answer. In this case, we created a eval set mostly based on the Llama [Troubleshooting & FAQ](https://llama.meta.com/faq/), where the answers are written by human experts. Then we pass the evalset question to our fine-tuned model to get the model generated answers. To compare the model generated answers with ground truth, we can use either traditional eval method, eg. calcucate rouge score, or use LLM to act like a judge to score the similarity of them.
+ Once we have the fine-tuned model, we need to evaluate it to understand its performance. We can use traditional eval methods, e.g. calculating the exact match rate or the ROUGE score.
+ In this tutorial, we can also use an LLM to act as a judge and score the model generated answers.

```bash
@@ -142,10 +144,10 @@ On another terminal, we can use another Meta Llama 3 70B Instruct model as a jud
CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 2 --disable-log-requests --port 8002
```

- Then we can pass the port to the eval script:
+ Then we can pass the ports to the eval script:

```bash
- CUDA_VISIBLE_DEVICES=5 python raft_eval.py -m raft-8b -v 8000 -j 8001 -o all_rag5 -r 5
+ CUDA_VISIBLE_DEVICES=1 python raft_eval.py -m raft-8b -v 8000 -j 8001 -r 5
```


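For reference, the generated raft.jsonl used throughout these steps can be inspected with the datasets library. This is a small sketch; the file path is illustrative and simply matches the output name mentioned in the README above:

```python
# Sketch: peek at a generated RAFT dataset (path is illustrative).
from datasets import load_dataset

ds = load_dataset("json", data_files="raft.jsonl", split="train")
print(ds[0]["question"])
print(ds[0]["cot_answer"].split("<ANSWER>")[-1])  # the final answer follows the <ANSWER> token
```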
recipes/use_cases/end2end-recipes/raft/raft.py

Lines changed: 2 additions & 6 deletions
@@ -1,7 +1,4 @@
import logging
- from typing import Literal, Any
- import json
- import random
import os
import argparse
from raft_utils import generate_questions, add_chunk_to_dataset
@@ -10,8 +7,6 @@

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

- NUM_DISTRACT_DOCS = 5 # number of distracting documents to add to each chunk
- ORCALE_P = 0.8 # probability of related documents to be added to each chunk
def main(api_config):
ds = None
try:
@@ -26,7 +21,7 @@ def main(api_config):
for question in questions:
logging.info(f"Question: {question}")
logging.info(f"Successfully generated {sum([len(q) for c,q in chunk_questions_zip])} question/answer pairs.")
- ds = add_chunk_to_dataset(chunk_questions_zip,api_config,ds,NUM_DISTRACT_DOCS, ORCALE_P)
+ ds = add_chunk_to_dataset(chunk_questions_zip,api_config,ds)
ds.save_to_disk(args.output)
logging.info(f"Data successfully written to {api_config['output']}. Process completed.")
formatter = DatasetConverter()
@@ -92,6 +87,7 @@ def parse_arguments():
api_config["api_key"] = os.environ["API_KEY"]
logging.info(f"Configuration loaded. Generating {args.questions_per_chunk} question per chunk using model '{args.model}'.")
logging.info(f"Chunk size: {args.chunk_size}.")
+ logging.info(f"num_distract_docs: {api_config['num_distract_docs']}, orcale_p: {api_config['orcale_p']}")
logging.info(f"Will use endpoint_url: {args.endpoint_url}.")
logging.info(f"Output will be written to {args.output}.")
main(api_config)
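Judging from the new log line, num_distract_docs and orcale_p now travel inside api_config instead of being module-level constants. The sketch below is one hypothetical way to supply them from a config file; whether they actually live in raft.yaml or come from CLI arguments is not shown in this diff:

```python
# Hypothetical sketch: supply the distractor settings via config.
# Key names follow the new log line above; the defaults mirror the removed constants.
import yaml

with open("raft.yaml") as f:
    api_config = yaml.safe_load(f)

api_config.setdefault("num_distract_docs", 5)  # distracting documents added per sample
api_config.setdefault("orcale_p", 0.8)         # probability of including the related (oracle) chunk
```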

recipes/use_cases/end2end-recipes/raft/raft.yaml

Lines changed: 34 additions & 31 deletions
@@ -1,40 +1,43 @@
COT_prompt_template: >
- <|begin_of_text|><|start_header_id|>system<|end_header_id|> Answer the following question using the information given in the context below. Here is things to pay attention to:
- - First provide step-by-step reasoning on how to answer the question.
- - In the reasoning, if you need to copy paste some sentences from the context, include them in ##begin_quote## and ##end_quote##. This would mean that things outside of ##begin_quote## and ##end_quote## are not directly copy paste from the context.
- - End your response with final answer in the form <ANSWER>: $answer, the answer should less than 60 words.
- You MUST begin your final answer with the tag "<ANSWER>: <|eot_id|>
+ <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful chatbot who can provide an answer to every questions from the user given a relevant context.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
- Question: {question}\nContext: {context}\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>
-
- # question_prompt_template: >
- # <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a synthetic question-answer pair generator. Given a chunk of context about
- # some topic(s), generate {num_questions} example questions a user could ask and would be answered
- # using information from the chunk. For example, if the given context was a Wikipedia
- # paragraph about the United States, an example question could be 'How many states are
- # in the United States?
- # The questions should be able to be answered in 100 words or less. Include only the
- # questions in your response.<|eot_id|>
- # <|start_header_id|>user<|end_header_id|>
- # Context: {context}\n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
+ Question: {question}\nContext: {context}\n
+ Answer this question using the information given by multiple documents in the context above. Here is things to pay attention to:
+ - The context contains many documents, each document starts with <DOCUMENT> and ends </DOCUMENT>.
+ - First provide step-by-step reasoning on how to answer the question.
+ - In the reasoning, if you need to copy paste some sentences from the context, include them in ##begin_quote## and ##end_quote##. This would mean that things outside of ##begin_quote## and ##end_quote## are not directly copy paste from the context.
+ - End your response with final answer in the form <ANSWER>: $answer, the answer should less than 60 words.
+ You MUST begin your final answer with the tag "<ANSWER> <|eot_id|><|start_header_id|>assistant<|end_header_id|>

question_prompt_template: >
- <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a language model skilled in creating quiz questions.
- You will be provided with a document,
- read it and please generate factoid question and answer pairs that are most likely be asked by a user of Llama language models
- which includes LLama, Llama2, Meta Llama3, Code Llama, Meta Llama Guard 1, Meta Llama Guard 2
- Your factoid questions should be answerable with a specific, concise piece of factual information from the context.
- Your factoid questions should be formulated in the same style as questions users could ask in a search engine.
- This means that your factoid questions MUST NOT mention something like "according to the passage" or "context".
- please make sure you follow those rules:
- 1. Generate {num_questions} question answer pairs, you can generate less answer if there is nothing related to
- model, training, fine-tuning and evaluation details of Llama language models,
- 2. The questions can be answered based *solely* on the given passage.
- 3. Avoid asking questions with similar meaning.
- 4. Never use any abbreviation.
- 5. The questions should be able to be answered in 60 words or less. Include only the questions in your response. <|eot_id|>
+ <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a synthetic question-answer pair generator. Given a chunk of context about
+ some topic(s), generate {num_questions} example questions a user could ask and would be answered
+ using information from the chunk. For example, if the given context was a Wikipedia
+ paragraph about the United States, an example question could be 'How many states are
+ in the United States?
+ Your questions should be formulated in the same style as questions that users could ask in a search engine.
+ This means that your questions MUST NOT mention something like "according to the passage" or "context".
+ The questions should be able to be answered in 60 words or less. Include only the questions in your response.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Context: {context}\n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
+
+ # question_prompt_template: >
+ # <|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a language model skilled in creating quiz questions.
+ # You will be provided with a document,
+ # read it and please generate factoid question and answer pairs that are most likely be asked by a user of Llama language models
+ # which includes LLama, Llama2, Meta Llama3, Code Llama, Meta Llama Guard 1, Meta Llama Guard 2
+ # Your factoid questions should be answerable with a specific, concise piece of factual information from the context.
+ # Your factoid questions should be formulated in the same style as questions users could ask in a search engine.
+ # This means that your factoid questions MUST NOT mention something like "according to the passage" or "context".
+ # please make sure you follow those rules:
+ # 1. Generate {num_questions} question answer pairs, you can generate less answer if there is nothing related to
+ # model, training, fine-tuning and evaluation details of Llama language models,
+ # 2. The questions can be answered based *solely* on the given passage.
+ # 3. Avoid asking questions with similar meaning.
+ # 4. Never use any abbreviation.
+ # 5. The questions should be able to be answered in 60 words or less. Include only the questions in your response. <|eot_id|>
+ # <|start_header_id|>user<|end_header_id|>
+ # Context: {context}\n <|eot_id|><|start_header_id|>assistant<|end_header_id|>
data_dir: "./data"

xml_path: ""
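The templates above are plain strings with {question}, {context}, and {num_questions} placeholders. One plausible way to fill the COT template is shown below; using str.format here is an assumption, and the actual raft_utils.py code may render it differently:

```python
# Sketch: fill the COT_prompt_template from raft.yaml (str.format is assumed).
import yaml

with open("raft.yaml") as f:
    config = yaml.safe_load(f)

prompt = config["COT_prompt_template"].format(
    question="What is the context length supported by Llama 3 models?",
    context="<DOCUMENT> Llama 3 supports a context length of 8K tokens. </DOCUMENT>",
)
print(prompt)
```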

recipes/use_cases/end2end-recipes/raft/raft_eval.py

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
from langchain_openai import ChatOpenAI
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
- from langchain.text_splitter import RecursiveCharacterTextSplitter,TokenTextSplitter
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores.utils import DistanceStrategy
from datetime import datetime
from langchain_community.document_loaders import DirectoryLoader
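The imports in this file suggest the eval script builds a FAISS index over document chunks for retrieval during evaluation. A minimal sketch of that pattern follows; the embedding model name, data directory, and chunk sizes are illustrative assumptions, not necessarily what raft_eval.py uses:

```python
# Sketch: build a FAISS retriever over document chunks, mirroring the imports above.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

docs = DirectoryLoader("./data").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
db = FAISS.from_documents(chunks, embeddings)
retriever = db.as_retriever(search_kwargs={"k": 5})  # retrieve the top-5 chunks per question
```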
