# -*- coding: utf-8 -*-
"""LLaMa 2 Prompting Guide with Gradio.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1SSv6lzX3Byu50PooYogmiwHqf5PQN68E
## Introduction
In this Colab Notebook, we are going to chat with Llama-2 7B.
By the end of this tutorial, you'll be able to interact with this model and use it to generate conversational responses.
Whether you're curious about chatbot technology or simply want to see a machine-generated response to a particular question, this notebook will serve as a comprehensive guide.
## Workflow
1. **Installations**: We'll begin by setting up our environment with the required libraries.
2. **Prerequisites**: Ensure we have access to the Llama-2 7B model on Hugging Face.
3. **Loading the Model & Tokenizer**: Retrieve the model and tokenizer for our session.
4. **Creating the Llama Pipeline**: Prepare our model for generating responses.
5. **Interacting with Llama through Gradio's ChatInterface**: Prompt the model for answers and explore its capabilities.
Let's dive in!
**First, change runtime to GPU.**
You can play with Llama-2 7B Chat here: https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat
## Installations
Before we proceed, we need to ensure that the essential libraries are installed:
- `Hugging Face Transformers`: Provides us with a straightforward way to use pre-trained models.
- `PyTorch`: Serves as the backbone for deep learning operations.
- `Accelerate`: Handles device placement for the model; it is required for the `device_map="auto"` setting used when creating the pipeline.
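
As a minimal sketch (not part of the original notebook), the check below reports which of these packages are not yet importable in the current runtime. The helper name `missing_packages` is hypothetical, and it relies only on the standard library.

```python
from importlib.util import find_spec


def missing_packages(names):
    # Return the subset of `names` that cannot be found as importable modules.
    return [name for name in names if find_spec(name) is None]


# Import names of the libraries listed above.
print(missing_packages(["transformers", "torch", "accelerate"]))
```

An empty list suggests the install cell has already done its job; any name that is printed still needs to be installed.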
"""
!pip install transformers torch accelerate
"""To use `gr.ChatInterface()` we need the newest Gradio"""
!pip install --upgrade gradio
"""If `!pip install --upgrade gradio` returns an error that says: `NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968`, do the following:
1. Uncomment the next cell.
2. Run the cell
3. Restart Runtime: `Runtime -> Restart Runtime`
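
To see whether the runtime is affected, before or after the restart, a small check like the following may help. This is only a sketch: the helper name `preferred_encoding_is_utf8` is hypothetical, and it assumes the error reflects Python's preferred encoding.

```python
import locale


def preferred_encoding_is_utf8() -> bool:
    # Normalize spellings such as "UTF-8", "utf8", or "utf_8" before comparing.
    encoding = locale.getpreferredencoding(False)
    return encoding.lower().replace("-", "").replace("_", "") == "utf8"


print(preferred_encoding_is_utf8())
```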
"""
# import locale
# locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
"""### Prerequisites
To load our desired model, `meta-llama/Llama-2-7b-chat-hf`, we first need to authenticate ourselves on Hugging Face. This ensures we have the correct permissions to fetch the model.
1. Gain access to the model on Hugging Face: [Link](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).
2. Use the Hugging Face CLI to log in, then verify your authentication status.
"""
!huggingface-cli login
!huggingface-cli whoami
"""### Loading Model & Tokenizer
Here, we are preparing our session by loading both the Llama model and its associated tokenizer.
The tokenizer will help in converting our text prompts into a format that the model can understand and process.
"""
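import torch
from transformers import AutoTokenizer

# Hugging Face model ID; access must be granted on the model page first
model = "meta-llama/Llama-2-7b-chat-hf"
# token=True reuses the credentials saved by `huggingface-cli login`
# (older transformers releases used use_auth_token=True instead)
tokenizer = AutoTokenizer.from_pretrained(model, token=True)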
"""### Creating the Llama Pipeline
We'll set up a pipeline for text generation.
This pipeline simplifies the process of feeding prompts to our model and receiving generated text as output.
*Note*: This cell takes 2-3 minutes to run
"""
from transformers import pipeline
llama_pipeline = pipeline(
    "text-generation",  # LLM task
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
"""## Comparing Methods: Basic vs. Advanced
Before diving into our advanced method for conversational interaction, let's explore a basic method for generating responses using the `get_response()` function. We will later discuss its limitations and how the advanced method overcomes them.
"""
def get_response(prompt: str) -> None:
    """
    Generate a response from the Llama model.

    Parameters:
        prompt (str): The user's input/question for the model.

    Returns:
        None: Prints the model's response.
    """
    sequences = llama_pipeline(
        prompt,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=256,
    )
    print("Chatbot:", sequences[0]['generated_text'])


get_response("Hi, I'm Kris")
get_response("What's my name?")
"""### Drawbacks of `get_response()`
1. **Lack of Conversation History**: The basic approach does not account for past interactions, making it less effective for maintaining a coherent conversation.
2. **Limited Customization**: The function doesn't allow for advanced prompt formatting or handling system-level instructions.
3. **Not Ready for UI Integration**: This basic approach isn't designed for easy integration with user interface libraries like Gradio.
## Improved Prompts
Llama 2 chat models expect prompts in the following structure:
```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>
{{ user_message }} [/INST]
```
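
As a rough illustration of filling in this template for a single turn, the sketch below joins the pieces exactly as they are laid out above. The helper name `build_prompt` is hypothetical and is not used elsewhere in this notebook.

```python
def build_prompt(system_prompt: str, user_message: str) -> str:
    # Fill the template shown above; newlines separate the system block
    # from the user message.
    lines = [
        "<s>[INST] <<SYS>>",
        system_prompt,
        "<</SYS>>",
        f"{user_message} [/INST]",
    ]
    return "\n".join(lines)


print(build_prompt("You are a helpful bot.", "Hi, I'm Kris"))
```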
### Building the Prompt
Explaining the parameters:
- `message` is the current message we're sending
- `history` is the conversation history as a list of tuples `[(user_msg1, bot_msg1), (user_msg2, bot_msg2), ...]`
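
To make the shape of `history` concrete, here is a small sketch with placeholder messages. It mirrors the trimming that `format_message` performs with its `memory_limit` parameter. The helper name `trim_history` is hypothetical and the messages are made up.

```python
def trim_history(history, memory_limit=3):
    # Keep only the most recent `memory_limit` (user, bot) pairs.
    if len(history) > memory_limit:
        return history[-memory_limit:]
    return history


# Placeholder conversation: four (user_msg, bot_msg) pairs.
history = [
    ("user message 1", "bot reply 1"),
    ("user message 2", "bot reply 2"),
    ("user message 3", "bot reply 3"),
    ("user message 4", "bot reply 4"),
]

# With the default limit of 3, the oldest pair is dropped.
print(trim_history(history))
```

Trimming keeps the prompt short, at the cost of the model forgetting anything said before the retained turns.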
"""
SYSTEM_PROMPT = """<s>[INST] <<SYS>>
You are a helpful bot. Your answers are clear and concise.
<</SYS>>
"""
# Formatting function for message and history
def format_message(message: str, history: list, memory_limit: int = 3) -> str:
    """
    Formats the message and history for the Llama model.

    Parameters:
        message (str): Current message to send.
        history (list): Past conversation history.
        memory_limit (int): Limit on how many past interactions to consider.

    Returns:
        str: Formatted message string
    """
    # always keep len(history) <= memory_limit
    if len(history) > memory_limit:
        history = history[-memory_limit:]

    if len(history) == 0:
        return SYSTEM_PROMPT + f"{message} [/INST]"

    formatted_message = SYSTEM_PROMPT + f"{history[0][0]} [/INST] {history[0][1]} </s>"

    # Handle conversation history
    for user_msg, model_answer in history[1:]:
        formatted_message += f"<s>[INST] {user_msg} [/INST] {model_answer} </s>"

    # Handle the current message
    formatted_message += f"<s>[INST] {message} [/INST]"

    return formatted_message
"""### Getting Responses
Next, we need a function that formats the prompt, calls the pipeline, and returns only the newly generated text.
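
By default, the text-generation pipeline returns the prompt followed by the completion, so the prompt has to be sliced off before the reply is shown. Below is a minimal sketch of that step, using a made-up output string instead of a real model call. `strip_prompt` is a hypothetical helper, not part of the original notebook.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    # The pipeline output starts with the prompt, so slice it off
    # and trim surrounding whitespace.
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


prompt = "<s>[INST] What's my name? [/INST]"
# Stand-in for sequences[0]['generated_text'] from a real pipeline call.
fake_output = prompt + " Your name is Kris."
print(strip_prompt(fake_output, prompt))  # prints: Your name is Kris.
```

The pipeline also accepts `return_full_text=False`, which should make this slicing unnecessary. The explicit version is kept here because it matches the function below.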
"""
# Generate a response from the Llama model
def get_llama_response(message: str, history: list) -> str:
    """
    Generates a conversational response from the Llama model.

    Parameters:
        message (str): User's input message.
        history (list): Past conversation history.

    Returns:
        str: Generated response from the Llama model.
    """
    query = format_message(message, history)
    response = ""

    sequences = llama_pipeline(
        query,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=1024,
    )

    generated_text = sequences[0]['generated_text']
    response = generated_text[len(query):]  # Remove the prompt from the output

    print("Chatbot:", response.strip())
    return response.strip()
import gradio as gr
gr.ChatInterface(get_llama_response).launch()
"""### Conclusion
Thanks to the Hugging Face Transformers library, creating a pipeline to chat with Llama 2 (or any other open-source LLM) is quite easy.
If you have worked a lot with much larger models such as GPT-4, however, you will need to adjust your expectations for a 7B model.
"""