|
5 | 5 | "cell_type": "markdown",
|
6 | 6 | "metadata": {},
|
7 | 7 | "source": [
|
8 |
| - "# Prompt Engineering with Llama 2\n", |
| 8 | + "# Prompt Engineering with Llama 3\n", |
9 | 9 | "\n",
|
10 | 10 | "Prompt engineering is the practice of using natural language to produce a desired response from a large language model (LLM).\n",
|
11 | 11 | "\n",
|
12 |
| - "This interactive guide covers prompt engineering & best practices with Llama 2." |
| 12 | + "This interactive guide covers prompt engineering & best practices with Llama 3." |
13 | 13 | ]
|
14 | 14 | },
|
15 | 15 | {
|
|
41 | 41 | "\n",
|
42 | 42 | "In 2023, Meta introduced the [Llama language models](https://ai.meta.com/llama/) (Llama Chat, Code Llama, Llama Guard). These are general purpose, state-of-the-art LLMs.\n",
|
43 | 43 | "\n",
|
44 |
| - "Llama 2 models come in 7 billion, 13 billion, and 70 billion parameter sizes. Smaller models are cheaper to deploy and run (see: deployment and performance); larger models are more capable.\n", |
| 44 | + "Llama models come in varying parameter sizes. The smaller models are cheaper to deploy and run; the larger models are more capable.\n", |
| 45 | + "\n", |
| 46 | + "#### Llama 3\n", |
| 47 | + "1. `llama-3-8b` - base pretrained 8 billion parameter model\n", |
| 48 | + "1. `llama-3-70b` - base pretrained 70 billion parameter model\n", |
| 49 | + "1. `llama-3-8b-instruct` - instruction fine-tuned 8 billion parameter model\n", |
| 50 | + "1. `llama-3-70b-instruct` - instruction fine-tuned 70 billion parameter model (flagship)\n", |
45 | 51 | "\n",
|
46 | 52 | "#### Llama 2\n",
|
47 | 53 | "1. `llama-2-7b` - base pretrained 7 billion parameter model\n",
|
|
86 | 92 | "\n",
|
87 | 93 | "Large language models are deployed and accessed in a variety of ways, including:\n",
|
88 | 94 | "\n",
|
89 |
| - "1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama 2 on your Macbook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).\n", |
| 95 | + "1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama on your MacBook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).\n", |
90 | 96 | " * Best for privacy/security or if you already have a GPU.\n",
|
91 |
| - "1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama 2 on cloud providers like AWS, Azure, GCP, and others.\n", |
| 97 | + "1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama on cloud providers like AWS, Azure, GCP, and others.\n", |
92 | 98 | " * Best for customizing models and their runtime (ex. fine-tuning a model for your use case).\n",
|
93 |
| - "1. **Hosted API**: Call LLMs directly via an API. There are many companies that provide Llama 2 inference APIs including AWS Bedrock, Replicate, Anyscale, Together and others.\n", |
| 99 | + "1. **Hosted API**: Call LLMs directly via an API. There are many companies that provide Llama inference APIs including AWS Bedrock, Replicate, Anyscale, Together and others.\n", |
94 | 100 | " * Easiest option overall."
|
95 | 101 | ]
|
96 | 102 | },
|
|
118 | 124 | "\n",
|
119 | 125 | "> Our destiny is written in the stars.\n",
|
120 | 126 | "\n",
|
121 |
| - "...is tokenized into `[\"our\", \"dest\", \"iny\", \"is\", \"written\", \"in\", \"the\", \"stars\"]` for Llama 2.\n", |
| 127 | + "...is tokenized into `[\"Our\", \"destiny\", \"is\", \"written\", \"in\", \"the\", \"stars\", \".\"]` for Llama 3.\n", |
122 | 128 | "\n",
|
123 | 129 | "Tokens matter most when you consider API pricing and internal behavior (ex. hyperparameters).\n",
|
124 | 130 | "\n",
|
125 |
| - "Each model has a maximum context length that your prompt cannot exceed. That's 4096 tokens for Llama 2 and 100K for Code Llama. \n" |
| 131 | + "Each model has a maximum context length that your prompt cannot exceed. That's 8K tokens for Llama 3 and 100K for Code Llama. \n" |
126 | 132 | ]
|
127 | 133 | },
|
128 | 134 | {
|
|
132 | 138 | "source": [
|
133 | 139 | "## Notebook Setup\n",
|
134 | 140 | "\n",
|
135 |
| - "The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 2 chat using [Replicate](https://replicate.com/meta/llama-2-70b-chat) and use LangChain to easily set up a chat completion API.\n", |
| 141 | + "The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 3 chat using [Groq](https://console.groq.com/playground?model=llama3-70b-8192).\n", |
136 | 142 | "\n",
|
137 | 143 | "To install prerequisites run:"
|
138 | 144 | ]
|
|
143 | 149 | "metadata": {},
|
144 | 150 | "outputs": [],
|
145 | 151 | "source": [
|
146 |
| - "pip install langchain replicate" |
| 152 | + "import sys\n", |
| 153 | + "!{sys.executable} -m pip install groq" |
147 | 154 | ]
|
148 | 155 | },
|
149 | 156 | {
|
|
152 | 159 | "metadata": {},
|
153 | 160 | "outputs": [],
|
154 | 161 | "source": [
|
155 |
| - "from typing import Dict, List\n", |
156 |
| - "from langchain.llms import Replicate\n", |
157 |
| - "from langchain.memory import ChatMessageHistory\n", |
158 |
| - "from langchain.schema.messages import get_buffer_string\n", |
159 | 162 | "import os\n",
|
| 163 | + "from typing import Dict, List\n", |
| 164 | + "from groq import Groq\n", |
160 | 165 | "\n",
|
161 |
| - "# Get a free API key from https://replicate.com/account/api-tokens\n", |
162 |
| - "os.environ[\"REPLICATE_API_TOKEN\"] = \"YOUR_KEY_HERE\"\n", |
| 166 | + "# Get a free API key from https://console.groq.com/keys\n", |
| 167 | + "# os.environ[\"GROQ_API_KEY\"] = \"YOUR_KEY_HERE\"\n", |
163 | 168 | "\n",
|
164 |
| - "LLAMA2_70B_CHAT = \"meta/llama-2-70b-chat:2d19859030ff705a87c746f7e96eea03aefb71f166725aee39692f1476566d48\"\n", |
165 |
| - "LLAMA2_13B_CHAT = \"meta/llama-2-13b-chat:f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d\"\n", |
| 169 | + "LLAMA3_70B_INSTRUCT = \"llama3-70b-8192\"\n", |
| 170 | + "LLAMA3_8B_INSTRUCT = \"llama3-8b-8192\"\n", |
166 | 171 | "\n",
|
167 |
| - "# We'll default to the smaller 13B model for speed; change to LLAMA2_70B_CHAT for more advanced (but slower) generations\n", |
168 |
| - "DEFAULT_MODEL = LLAMA2_13B_CHAT\n", |
| 172 | + "DEFAULT_MODEL = LLAMA3_70B_INSTRUCT\n", |
169 | 173 | "\n",
|
170 |
| - "def completion(\n", |
171 |
| - " prompt: str,\n", |
172 |
| - " model: str = DEFAULT_MODEL,\n", |
| 174 | + "client = Groq()\n", |
| 175 | + "\n", |
| 176 | + "def assistant(content: str):\n", |
| 177 | + " return { \"role\": \"assistant\", \"content\": content }\n", |
| 178 | + "\n", |
| 179 | + "def user(content: str):\n", |
| 180 | + " return { \"role\": \"user\", \"content\": content }\n", |
| 181 | + "\n", |
| 182 | + "def chat_completion(\n", |
| 183 | + " messages: List[Dict],\n", |
| 184 | + " model = DEFAULT_MODEL,\n", |
173 | 185 | " temperature: float = 0.6,\n",
|
174 | 186 | " top_p: float = 0.9,\n",
|
175 | 187 | ") -> str:\n",
|
176 |
| - " llm = Replicate(\n", |
| 188 | + " response = client.chat.completions.create(\n", |
| 189 | + " messages=messages,\n", |
177 | 190 | " model=model,\n",
|
178 |
| - " model_kwargs={\"temperature\": temperature,\"top_p\": top_p, \"max_new_tokens\": 1000}\n", |
| 191 | + " temperature=temperature,\n", |
| 192 | + " top_p=top_p,\n", |
179 | 193 | " )\n",
|
180 |
| - " return llm(prompt)\n", |
| 194 | + " return response.choices[0].message.content\n", |
| 195 | + " \n", |
181 | 196 | "\n",
|
182 |
| - "def chat_completion(\n", |
183 |
| - " messages: List[Dict],\n", |
184 |
| - " model = DEFAULT_MODEL,\n", |
| 197 | + "def completion(\n", |
| 198 | + " prompt: str,\n", |
| 199 | + " model: str = DEFAULT_MODEL,\n", |
185 | 200 | " temperature: float = 0.6,\n",
|
186 | 201 | " top_p: float = 0.9,\n",
|
187 | 202 | ") -> str:\n",
|
188 |
| - " history = ChatMessageHistory()\n", |
189 |
| - " for message in messages:\n", |
190 |
| - " if message[\"role\"] == \"user\":\n", |
191 |
| - " history.add_user_message(message[\"content\"])\n", |
192 |
| - " elif message[\"role\"] == \"assistant\":\n", |
193 |
| - " history.add_ai_message(message[\"content\"])\n", |
194 |
| - " else:\n", |
195 |
| - " raise Exception(\"Unknown role\")\n", |
196 |
| - " return completion(\n", |
197 |
| - " get_buffer_string(\n", |
198 |
| - " history.messages,\n", |
199 |
| - " human_prefix=\"USER\",\n", |
200 |
| - " ai_prefix=\"ASSISTANT\",\n", |
201 |
| - " ),\n", |
202 |
| - " model,\n", |
203 |
| - " temperature,\n", |
204 |
| - " top_p,\n", |
| 203 | + " return chat_completion(\n", |
| 204 | + " [user(prompt)],\n", |
| 205 | + " model=model,\n", |
| 206 | + " temperature=temperature,\n", |
| 207 | + " top_p=top_p,\n", |
205 | 208 | " )\n",
|
206 | 209 | "\n",
|
207 |
| - "def assistant(content: str):\n", |
208 |
| - " return { \"role\": \"assistant\", \"content\": content }\n", |
209 |
| - "\n", |
210 |
| - "def user(content: str):\n", |
211 |
| - " return { \"role\": \"user\", \"content\": content }\n", |
212 |
| - "\n", |
213 | 210 | "def complete_and_print(prompt: str, model: str = DEFAULT_MODEL):\n",
|
214 | 211 | " print(f'==============\\n{prompt}\\n==============')\n",
|
215 | 212 | " response = completion(prompt, model)\n",
|
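The `chat_completion` helper above consumes a list of role-tagged message dicts. A minimal offline sketch of that message format (the `user`/`assistant` constructors are re-defined locally so no API key or `groq` install is needed):

```python
# Offline sketch of the message format consumed by chat_completion above.
# user()/assistant() mirror the notebook helpers; no API call is made here.
from typing import Dict

def user(content: str) -> Dict[str, str]:
    return {"role": "user", "content": content}

def assistant(content: str) -> Dict[str, str]:
    return {"role": "assistant", "content": content}

# A multi-turn conversation is an alternating list of these dicts,
# ending with the user turn you want the model to answer:
messages = [
    user("My favorite color is blue."),
    assistant("Noted! I'll remember that your favorite color is blue."),
    user("What is my favorite color?"),
]

print([m["role"] for m in messages])  # → ['user', 'assistant', 'user']
```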
|
223 | 220 | "source": [
|
224 | 221 | "### Completion APIs\n",
|
225 | 222 | "\n",
|
226 |
| - "Llama 2 models tend to be wordy and explain their rationale. Later we'll explore how to manage the response length." |
| 223 | + "Let's try Llama 3!" |
227 | 224 | ]
|
228 | 225 | },
|
229 | 226 | {
|
|
345 | 342 | "cell_type": "markdown",
|
346 | 343 | "metadata": {},
|
347 | 344 | "source": [
|
348 |
| - "You can think about giving explicit instructions as using rules and restrictions to how Llama 2 responds to your prompt.\n", |
| 345 | + "You can think of giving explicit instructions as applying rules and restrictions to how Llama 3 responds to your prompt.\n", |
349 | 346 | "\n",
|
350 | 347 | "- Stylization\n",
|
351 | 348 | " - `Explain this to me like a topic on a children's educational network show teaching elementary students.`\n",
|
|
387 | 384 | "\n",
|
388 | 385 | "#### Zero-Shot Prompting\n",
|
389 | 386 | "\n",
|
390 |
| - "Large language models like Llama 2 are unique because they are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called \"zero-shot prompting\".\n", |
| 387 | + "Large language models like Llama 3 are unique because they are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called \"zero-shot prompting\".\n", |
391 | 388 | "\n",
|
392 |
| - "Let's try using Llama 2 as a sentiment detector. You may notice that output format varies - we can improve this with better prompting." |
| 389 | + "Let's try using Llama 3 as a sentiment detector. You may notice that output format varies - we can improve this with better prompting." |
393 | 390 | ]
|
394 | 391 | },
|
395 | 392 | {
|
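Since zero-shot outputs vary in format, a small parsing helper can normalize them. A hypothetical sketch (the prompt wording and `parse_sentiment` heuristic are illustrative, not from the notebook):

```python
# Hypothetical zero-shot sentiment prompt plus a normalizer for the
# model's free-form reply; pure Python, no API call.
def sentiment_prompt(text: str) -> str:
    return (
        "Classify the sentiment of the following text as "
        f"positive, negative, or neutral.\nText: {text}\nSentiment:"
    )

def parse_sentiment(response: str) -> str:
    # Return the first label (in a fixed priority order) found in the reply.
    for label in ("positive", "negative", "neutral"):
        if label in response.lower():
            return label
    return "unknown"

print(parse_sentiment("Sentiment: Positive!"))  # → positive
```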
|
459 | 456 | "source": [
|
460 | 457 | "### Role Prompting\n",
|
461 | 458 | "\n",
|
462 |
| - "Llama 2 will often give more consistent responses when given a role ([Kong et al. (2023)](https://browse.arxiv.org/pdf/2308.07702.pdf)). Roles give context to the LLM on what type of answers are desired.\n", |
| 459 | + "Llama will often give more consistent responses when given a role ([Kong et al. (2023)](https://browse.arxiv.org/pdf/2308.07702.pdf)). Roles give context to the LLM on what type of answers are desired.\n", |
463 | 460 | "\n",
|
464 |
| - "Let's use Llama 2 to create a more focused, technical response for a question around the pros and cons of using PyTorch." |
| 461 | + "Let's use Llama 3 to create a more focused, technical response for a question around the pros and cons of using PyTorch." |
465 | 462 | ]
|
466 | 463 | },
|
467 | 464 | {
|
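Role prompts like the one described here are just ordinary strings; a hypothetical builder (`with_role` is illustrative, not a notebook helper):

```python
# Hypothetical role-prompt builder: prepend a persona to steer the answer.
def with_role(role: str, question: str) -> str:
    return f"Your role is {role}. {question}"

prompt = with_role(
    "a machine learning expert who gives direct, technical answers",
    "What are the pros and cons of using PyTorch?",
)
print(prompt)
```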
|
484 | 481 | "source": [
|
485 | 482 | "### Chain-of-Thought\n",
|
486 | 483 | "\n",
|
487 |
| - "Simply adding a phrase encouraging step-by-step thinking \"significantly improves the ability of large language models to perform complex reasoning\" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called \"CoT\" or \"Chain-of-Thought\" prompting:" |
| 484 | + "Simply adding a phrase encouraging step-by-step thinking \"significantly improves the ability of large language models to perform complex reasoning\" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called \"CoT\" or \"Chain-of-Thought\" prompting.\n", |
| 485 | + "\n", |
| 486 | + "Llama 3 now reasons step-by-step naturally without needing the phrase. This section remains for completeness." |
488 | 487 | ]
|
489 | 488 | },
|
490 | 489 | {
|
|
493 | 492 | "metadata": {},
|
494 | 493 | "outputs": [],
|
495 | 494 | "source": [
|
496 |
| - "complete_and_print(\"Who lived longer Elvis Presley or Mozart?\")\n", |
497 |
| - "# Often gives incorrect answer of \"Mozart\"\n", |
| 495 | + "prompt = \"Who lived longer, Mozart or Elvis?\"\n", |
| 496 | + "\n", |
| 497 | + "complete_and_print(prompt)\n", |
| 498 | + "# Llama 2 would often give the incorrect answer of \"Mozart\"\n", |
498 | 499 | "\n",
|
499 |
| - "complete_and_print(\"Who lived longer Elvis Presley or Mozart? Let's think through this carefully, step by step.\")\n", |
| 500 | + "complete_and_print(f\"{prompt} Let's think through this carefully, step by step.\")\n", |
500 | 501 | "# Gives the correct answer \"Elvis\""
|
501 | 502 | ]
|
502 | 503 | },
|
|
523 | 524 | " response = completion(\n",
|
524 | 525 | " \"John found that the average of 15 numbers is 40.\"\n",
|
525 | 526 | " \"If 10 is added to each number then the mean of the numbers is?\"\n",
|
526 |
| - " \"Report the answer surrounded by three backticks, for example: ```123```\",\n", |
527 |
| - " model = LLAMA2_70B_CHAT\n", |
| 527 | + " \"Report the answer surrounded by backticks (example: `123`)\",\n", |
528 | 528 | " )\n",
|
529 |
| - " match = re.search(r'```(\\d+)```', response)\n", |
| 529 | + " match = re.search(r'`(\\d+)`', response)\n", |
530 | 530 | " if match is None:\n",
|
531 | 531 | " return None\n",
|
532 | 532 | " return match.group(1)\n",
|
|
538 | 538 | " f\"Final answer: {mode(answers)}\",\n",
|
539 | 539 | " )\n",
|
540 | 540 | "\n",
|
541 |
| - "# Sample runs of Llama-2-70B (all correct):\n", |
542 |
| - "# [50, 50, 750, 50, 50] -> 50\n", |
543 |
| - "# [130, 10, 750, 50, 50] -> 50\n", |
544 |
| - "# [50, None, 10, 50, 50] -> 50" |
| 541 | + "# Sample runs of Llama-3-70B (all correct):\n", |
| 542 | + "# ['60', '50', '50', '50', '50'] -> 50\n", |
| 543 | + "# ['50', '50', '50', '60', '50'] -> 50\n", |
| 544 | + "# ['50', '50', '60', '50', '50'] -> 50" |
545 | 545 | ]
|
546 | 546 | },
|
547 | 547 | {
|
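The voting step above can be reproduced offline. A minimal sketch of the extract-and-mode logic (the sample strings are made up for illustration):

```python
import re
from statistics import mode

# Extract the backtick-wrapped numeric answer from one sampled response.
def extract_answer(response: str):
    match = re.search(r"`(\d+)`", response)
    return match.group(1) if match else None

# Made-up samples standing in for repeated model generations.
samples = ["The answer is `50`.", "`50`", "I get `60` here.", "`50`", "`50`"]
answers = [a for a in (extract_answer(s) for s in samples) if a is not None]
print(mode(answers))  # → 50
```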
|
560 | 560 | "metadata": {},
|
561 | 561 | "outputs": [],
|
562 | 562 | "source": [
|
563 |
| - "complete_and_print(\"What is the capital of the California?\", model = LLAMA2_70B_CHAT)\n", |
| 563 | + "complete_and_print(\"What is the capital of the California?\")\n", |
564 | 564 | "# Gives the correct answer \"Sacramento\""
|
565 | 565 | ]
|
566 | 566 | },
|
|
677 | 677 | " \"\"\"\n",
|
678 | 678 | " # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n",
|
679 | 679 | " \"\"\",\n",
|
680 |
| - " model=\"meta/codellama-34b:67942fd0f55b66da802218a19a8f0e1d73095473674061a6ea19f2dc8c053152\"\n", |
681 | 680 | ")"
|
682 | 681 | ]
|
683 | 682 | },
|
|
687 | 686 | "metadata": {},
|
688 | 687 | "outputs": [],
|
689 | 688 | "source": [
|
690 |
| - "# The following code was generated by Code Llama 34B:\n", |
| 689 | + "# The following code was generated by Llama 3 70B:\n", |
691 | 690 | "\n",
|
692 |
| - "num1 = (-5 + 93 * 4 - 0)\n", |
693 |
| - "num2 = (4**4 + -7 + 0 * 5)\n", |
694 |
| - "answer = num1 * num2\n", |
695 |
| - "print(answer)" |
| 691 | + "result = ((-5 + 93 * 4 - 0) * (4**4 - 7 + 0 * 5))\n", |
| 692 | + "print(result)" |
696 | 693 | ]
|
697 | 694 | },
|
698 | 695 | {
|
|
715 | 712 | "source": [
|
716 | 713 | "complete_and_print(\n",
|
717 | 714 | " \"Give me the zip code for Menlo Park in JSON format with the field 'zip_code'\",\n",
|
718 |
| - " model = LLAMA2_70B_CHAT,\n", |
719 | 715 | ")\n",
|
720 | 716 | "# Likely returns the JSON and also \"Sure! Here's the JSON...\"\n",
|
721 | 717 | "\n",
|
|
726 | 722 | " Example question: What is the zip code of the Empire State Building? Example answer: {'zip_code': 10118}\n",
|
727 | 723 | " Now here is my question: What is the zip code of Menlo Park?\n",
|
728 | 724 | " \"\"\",\n",
|
729 |
| - " model = LLAMA2_70B_CHAT,\n", |
730 | 725 | ")\n",
|
731 | 726 | "# \"{'zip_code': 94025}\""
|
732 | 727 | ]
|
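Because the model may wrap the JSON in prose ("Sure! Here's the JSON..."), a common workaround is to pull out the first `{...}` span before parsing. A hedged sketch (the greedy-regex heuristic is an assumption and only handles a single top-level object):

```python
import json
import re

# Extract and parse the first {...} span from a chatty model reply.
# The greedy regex is a simplistic heuristic, fine for one top-level object.
def extract_json(response: str):
    match = re.search(r"\{.*\}", response, re.DOTALL)
    return json.loads(match.group(0)) if match else None

print(extract_json('Sure! Here is the JSON: {"zip_code": 94025}'))
# → {'zip_code': 94025}
```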
|
770 | 765 | "mimetype": "text/x-python",
|
771 | 766 | "name": "python",
|
772 | 767 | "nbconvert_exporter": "python",
|
773 |
| - "pygments_lexer": "ipython3" |
| 768 | + "pygments_lexer": "ipython3", |
| 769 | + "version": "3.12.3" |
774 | 770 | },
|
775 | 771 | "last_base_url": "https://bento.edge.x2p.facebook.net/",
|
776 | 772 | "last_kernel_id": "161e2a7b-2d2b-4995-87f3-d1539860ecac",
|
|