|
5 | 5 | "cell_type": "markdown",
|
6 | 6 | "metadata": {},
|
7 | 7 | "source": [
|
8 |
| - "# Prompt Engineering with Llama 2\n", |
| 8 | + "<a href=\"https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/quickstart/Prompt_Engineering_with_Llama_3.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n", |
| 9 | + "\n", |
| 10 | + "# Prompt Engineering with Llama 3\n", |
9 | 11 | "\n",
|
10 | 12 | "Prompt engineering is using natural language to produce a desired response from a large language model (LLM).\n",
|
11 | 13 | "\n",
|
12 |
| - "This interactive guide covers prompt engineering & best practices with Llama 2." |
| 14 | + "This interactive guide covers prompt engineering & best practices with Llama 3." |
13 | 15 | ]
|
14 | 16 | },
|
15 | 17 | {
|
|
41 | 43 | "\n",
|
42 | 44 | "In 2023, Meta introduced the [Llama language models](https://ai.meta.com/llama/) (Llama Chat, Code Llama, Llama Guard). These are general purpose, state-of-the-art LLMs.\n",
|
43 | 45 | "\n",
|
44 |
| - "Llama 2 models come in 7 billion, 13 billion, and 70 billion parameter sizes. Smaller models are cheaper to deploy and run (see: deployment and performance); larger models are more capable.\n", |
| 46 | + "Llama models come in varying parameter sizes. The smaller models are cheaper to deploy and run; the larger models are more capable.\n", |
| 47 | + "\n", |
| 48 | + "#### Llama 3\n", |
| 49 | + "1. `llama-3-8b` - base pretrained 8 billion parameter model\n", |
| 50 | + "1. `llama-3-70b` - base pretrained 70 billion parameter model\n", |
| 51 | + "1. `llama-3-8b-instruct` - instruction fine-tuned 8 billion parameter model\n", |
| 52 | + "1. `llama-3-70b-instruct` - instruction fine-tuned 70 billion parameter model (flagship)\n", |
45 | 53 | "\n",
|
46 | 54 | "#### Llama 2\n",
|
47 | 55 | "1. `llama-2-7b` - base pretrained 7 billion parameter model\n",
|
|
69 | 77 | "1. `codellama-7b` - code fine-tuned 7 billion parameter model\n",
|
70 | 78 | "1. `codellama-13b` - code fine-tuned 13 billion parameter model\n",
|
71 | 79 | "1. `codellama-34b` - code fine-tuned 34 billion parameter model\n",
|
| 80 | + "1. `codellama-70b` - code fine-tuned 70 billion parameter model\n", |
72 | 81 | "1. `codellama-7b-instruct` - code & instruct fine-tuned 7 billion parameter model\n",
|
73 | 82 | "2. `codellama-13b-instruct` - code & instruct fine-tuned 13 billion parameter model\n",
|
74 | 83 | "3. `codellama-34b-instruct` - code & instruct fine-tuned 34 billion parameter model\n",
|
| 84 | + "3. `codellama-70b-instruct` - code & instruct fine-tuned 70 billion parameter model\n", |
75 | 85 | "1. `codellama-7b-python` - Python fine-tuned 7 billion parameter model\n",
|
76 | 86 | "2. `codellama-13b-python` - Python fine-tuned 13 billion parameter model\n",
|
77 |
| - "3. `codellama-34b-python` - Python fine-tuned 34 billion parameter model" |
| 87 | + "3. `codellama-34b-python` - Python fine-tuned 34 billion parameter model\n", |
| 88 | + "3. `codellama-70b-python` - Python fine-tuned 70 billion parameter model" |
78 | 89 | ]
|
79 | 90 | },
|
80 | 91 | {
|
|
86 | 97 | "\n",
|
87 | 98 | "Large language models are deployed and accessed in a variety of ways, including:\n",
|
88 | 99 | "\n",
|
89 |
| - "1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama 2 on your Macbook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).\n", |
| 100 | + "1. **Self-hosting**: Using local hardware to run inference. Ex. running Llama on your Macbook Pro using [llama.cpp](https://github.com/ggerganov/llama.cpp).\n", |
90 | 101 | " * Best for privacy/security or if you already have a GPU.\n",
|
91 |
| - "1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama 2 on cloud providers like AWS, Azure, GCP, and others.\n", |
| 102 | + "1. **Cloud hosting**: Using a cloud provider to deploy an instance that hosts a specific model. Ex. running Llama on cloud providers like AWS, Azure, GCP, and others.\n", |
92 | 103 | " * Best for customizing models and their runtime (ex. fine-tuning a model for your use case).\n",
|
93 |
| - "1. **Hosted API**: Call LLMs directly via an API. There are many companies that provide Llama 2 inference APIs including AWS Bedrock, Replicate, Anyscale, Together and others.\n", |
| 104 | + "1. **Hosted API**: Call LLMs directly via an API. There are many companies that provide Llama inference APIs including AWS Bedrock, Replicate, Anyscale, Together and others.\n", |
94 | 105 | " * Easiest option overall."
|
95 | 106 | ]
|
96 | 107 | },
|
|
118 | 129 | "\n",
|
119 | 130 | "> Our destiny is written in the stars.\n",
|
120 | 131 | "\n",
|
121 |
| - "...is tokenized into `[\"our\", \"dest\", \"iny\", \"is\", \"written\", \"in\", \"the\", \"stars\"]` for Llama 2.\n", |
| 132 | + "...is tokenized into `[\"Our\", \" destiny\", \" is\", \" written\", \" in\", \" the\", \" stars\", \".\"]` for Llama 3. See [this](https://tiktokenizer.vercel.app/?model=meta-llama%2FMeta-Llama-3-8B) for an interactive tokenizer tool.\n", |
122 | 133 | "\n",
|
123 | 134 | "Tokens matter most when you consider API pricing and internal behavior (ex. hyperparameters).\n",
|
124 | 135 | "\n",
|
125 |
| - "Each model has a maximum context length that your prompt cannot exceed. That's 4096 tokens for Llama 2 and 100K for Code Llama. \n" |
| 136 | + "Each model has a maximum context length that your prompt cannot exceed. That's 8K tokens for Llama 3, 4K for Llama 2, and 100K for Code Llama. \n" |
126 | 137 | ]
|
127 | 138 | },
|
128 | 139 | {
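The tokenization behavior described in the cell above can be checked locally. The sketch below is not part of the notebook diff; it assumes the `transformers` package is installed and that you have been granted access to the gated `meta-llama/Meta-Llama-3-8B` tokenizer on Hugging Face.

```python
# Illustrative only -- not part of the notebook.
# Assumes `pip install transformers` plus access to the gated
# meta-llama/Meta-Llama-3-8B repo on Hugging Face (huggingface-cli login).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "Our destiny is written in the stars."
ids = tokenizer.encode(text, add_special_tokens=False)  # token ids
tokens = tokenizer.convert_ids_to_tokens(ids)           # readable token strings

print(len(ids))   # roughly 8 tokens for this sentence
print(tokens)     # leading spaces show up as the tokenizer's internal marker (e.g. 'Ġdestiny')
```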
|
|
132 | 143 | "source": [
|
133 | 144 | "## Notebook Setup\n",
|
134 | 145 | "\n",
|
135 |
| - "The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 2 chat using [Replicate](https://replicate.com/meta/llama-2-70b-chat) and use LangChain to easily set up a chat completion API.\n", |
| 146 | + "The following APIs will be used to call LLMs throughout the guide. As an example, we'll call Llama 3 chat using [Grok](https://console.groq.com/playground?model=llama3-70b-8192).\n", |
136 | 147 | "\n",
|
137 | 148 | "To install prerequisites run:"
|
138 | 149 | ]
|
|
143 | 154 | "metadata": {},
|
144 | 155 | "outputs": [],
|
145 | 156 | "source": [
|
146 |
| - "pip install langchain replicate" |
| 157 | + "import sys\n", |
| 158 | + "!{sys.executable} -m pip install groq" |
147 | 159 | ]
|
148 | 160 | },
|
149 | 161 | {
|
|
152 | 164 | "metadata": {},
|
153 | 165 | "outputs": [],
|
154 | 166 | "source": [
|
155 |
| - "from typing import Dict, List\n", |
156 |
| - "from langchain.llms import Replicate\n", |
157 |
| - "from langchain.memory import ChatMessageHistory\n", |
158 |
| - "from langchain.schema.messages import get_buffer_string\n", |
159 | 167 | "import os\n",
|
| 168 | + "from typing import Dict, List\n", |
| 169 | + "from groq import Groq\n", |
160 | 170 | "\n",
|
161 |
| - "# Get a free API key from https://replicate.com/account/api-tokens\n", |
162 |
| - "os.environ[\"REPLICATE_API_TOKEN\"] = \"YOUR_KEY_HERE\"\n", |
| 171 | + "# Get a free API key from https://console.groq.com/keys\n", |
| 172 | + "os.environ[\"GROQ_API_KEY\"] = \"YOUR_GROQ_API_KEY\"\n", |
163 | 173 | "\n",
|
164 |
| - "LLAMA2_70B_CHAT = \"meta/llama-2-70b-chat:2d19859030ff705a87c746f7e96eea03aefb71f166725aee39692f1476566d48\"\n", |
165 |
| - "LLAMA2_13B_CHAT = \"meta/llama-2-13b-chat:f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d\"\n", |
| 174 | + "LLAMA3_70B_INSTRUCT = \"llama3-70b-8192\"\n", |
| 175 | + "LLAMA3_8B_INSTRUCT = \"llama3-8b-8192\"\n", |
166 | 176 | "\n",
|
167 |
| - "# We'll default to the smaller 13B model for speed; change to LLAMA2_70B_CHAT for more advanced (but slower) generations\n", |
168 |
| - "DEFAULT_MODEL = LLAMA2_13B_CHAT\n", |
| 177 | + "DEFAULT_MODEL = LLAMA3_70B_INSTRUCT\n", |
169 | 178 | "\n",
|
170 |
| - "def completion(\n", |
171 |
| - " prompt: str,\n", |
172 |
| - " model: str = DEFAULT_MODEL,\n", |
| 179 | + "client = Groq()\n", |
| 180 | + "\n", |
| 181 | + "def assistant(content: str):\n", |
| 182 | + " return { \"role\": \"assistant\", \"content\": content }\n", |
| 183 | + "\n", |
| 184 | + "def user(content: str):\n", |
| 185 | + " return { \"role\": \"user\", \"content\": content }\n", |
| 186 | + "\n", |
| 187 | + "def chat_completion(\n", |
| 188 | + " messages: List[Dict],\n", |
| 189 | + " model = DEFAULT_MODEL,\n", |
173 | 190 | " temperature: float = 0.6,\n",
|
174 | 191 | " top_p: float = 0.9,\n",
|
175 | 192 | ") -> str:\n",
|
176 |
| - " llm = Replicate(\n", |
| 193 | + " response = client.chat.completions.create(\n", |
| 194 | + " messages=messages,\n", |
177 | 195 | " model=model,\n",
|
178 |
| - " model_kwargs={\"temperature\": temperature,\"top_p\": top_p, \"max_new_tokens\": 1000}\n", |
| 196 | + " temperature=temperature,\n", |
| 197 | + " top_p=top_p,\n", |
179 | 198 | " )\n",
|
180 |
| - " return llm(prompt)\n", |
| 199 | + " return response.choices[0].message.content\n", |
| 200 | + " \n", |
181 | 201 | "\n",
|
182 |
| - "def chat_completion(\n", |
183 |
| - " messages: List[Dict],\n", |
184 |
| - " model = DEFAULT_MODEL,\n", |
| 202 | + "def completion(\n", |
| 203 | + " prompt: str,\n", |
| 204 | + " model: str = DEFAULT_MODEL,\n", |
185 | 205 | " temperature: float = 0.6,\n",
|
186 | 206 | " top_p: float = 0.9,\n",
|
187 | 207 | ") -> str:\n",
|
188 |
| - " history = ChatMessageHistory()\n", |
189 |
| - " for message in messages:\n", |
190 |
| - " if message[\"role\"] == \"user\":\n", |
191 |
| - " history.add_user_message(message[\"content\"])\n", |
192 |
| - " elif message[\"role\"] == \"assistant\":\n", |
193 |
| - " history.add_ai_message(message[\"content\"])\n", |
194 |
| - " else:\n", |
195 |
| - " raise Exception(\"Unknown role\")\n", |
196 |
| - " return completion(\n", |
197 |
| - " get_buffer_string(\n", |
198 |
| - " history.messages,\n", |
199 |
| - " human_prefix=\"USER\",\n", |
200 |
| - " ai_prefix=\"ASSISTANT\",\n", |
201 |
| - " ),\n", |
202 |
| - " model,\n", |
203 |
| - " temperature,\n", |
204 |
| - " top_p,\n", |
| 208 | + " return chat_completion(\n", |
| 209 | + " [user(prompt)],\n", |
| 210 | + " model=model,\n", |
| 211 | + " temperature=temperature,\n", |
| 212 | + " top_p=top_p,\n", |
205 | 213 | " )\n",
|
206 | 214 | "\n",
|
207 |
| - "def assistant(content: str):\n", |
208 |
| - " return { \"role\": \"assistant\", \"content\": content }\n", |
209 |
| - "\n", |
210 |
| - "def user(content: str):\n", |
211 |
| - " return { \"role\": \"user\", \"content\": content }\n", |
212 |
| - "\n", |
213 | 215 | "def complete_and_print(prompt: str, model: str = DEFAULT_MODEL):\n",
|
214 | 216 | " print(f'==============\\n{prompt}\\n==============')\n",
|
215 | 217 | " response = completion(prompt, model)\n",
|
|
223 | 225 | "source": [
|
224 | 226 | "### Completion APIs\n",
|
225 | 227 | "\n",
|
226 |
| - "Llama 2 models tend to be wordy and explain their rationale. Later we'll explore how to manage the response length." |
| 228 | + "Let's try Llama 3!" |
227 | 229 | ]
|
228 | 230 | },
|
229 | 231 | {
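The notebook's own example cells follow in the unchanged part of the diff. For orientation, here is a minimal sketch of how the helpers defined in the setup cell are used; the prompt and message texts are illustrative, not taken from the notebook.

```python
# Illustrative only -- relies on complete_and_print(), chat_completion(),
# user() and assistant() as defined in the setup cell above.
complete_and_print("Explain the difference between a token and a word in one sentence.")

# chat_completion() takes a full message history built with the helpers:
response = chat_completion([
    user("My favorite color is blue."),
    assistant("That's great to hear!"),
    user("What is my favorite color?"),
])
print(response)  # should mention "blue", showing the model uses the conversation context
```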
|
|
345 | 347 | "cell_type": "markdown",
|
346 | 348 | "metadata": {},
|
347 | 349 | "source": [
|
348 |
| - "You can think about giving explicit instructions as using rules and restrictions to how Llama 2 responds to your prompt.\n", |
| 350 | + "You can think about giving explicit instructions as using rules and restrictions to how Llama 3 responds to your prompt.\n", |
349 | 351 | "\n",
|
350 | 352 | "- Stylization\n",
|
351 | 353 | " - `Explain this to me like a topic on a children's educational network show teaching elementary students.`\n",
|
|
387 | 389 | "\n",
|
388 | 390 | "#### Zero-Shot Prompting\n",
|
389 | 391 | "\n",
|
390 |
| - "Large language models like Llama 2 are unique because they are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called \"zero-shot prompting\".\n", |
| 392 | + "Large language models like Llama 3 are unique because they are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called \"zero-shot prompting\".\n", |
391 | 393 | "\n",
|
392 |
| - "Let's try using Llama 2 as a sentiment detector. You may notice that output format varies - we can improve this with better prompting." |
| 394 | + "Let's try using Llama 3 as a sentiment detector. You may notice that output format varies - we can improve this with better prompting." |
393 | 395 | ]
|
394 | 396 | },
|
395 | 397 | {
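The notebook's zero-shot cells sit in the unchanged portion of the diff. A minimal sketch of the kind of zero-shot sentiment prompt described above; the review text is made up for illustration.

```python
# Illustrative only -- uses complete_and_print() from the setup cell.
complete_and_print(
    "Classify the sentiment of this review as positive, negative, or neutral: "
    "'The battery life is great, but the screen scratches easily.'"
)
# The output format may vary between runs; few-shot examples and explicit
# instructions (covered in this guide) make it more consistent.
```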
|
|
459 | 461 | "source": [
|
460 | 462 | "### Role Prompting\n",
|
461 | 463 | "\n",
|
462 |
| - "Llama 2 will often give more consistent responses when given a role ([Kong et al. (2023)](https://browse.arxiv.org/pdf/2308.07702.pdf)). Roles give context to the LLM on what type of answers are desired.\n", |
| 464 | + "Llama will often give more consistent responses when given a role ([Kong et al. (2023)](https://browse.arxiv.org/pdf/2308.07702.pdf)). Roles give context to the LLM on what type of answers are desired.\n", |
463 | 465 | "\n",
|
464 |
| - "Let's use Llama 2 to create a more focused, technical response for a question around the pros and cons of using PyTorch." |
| 466 | + "Let's use Llama 3 to create a more focused, technical response for a question around the pros and cons of using PyTorch." |
465 | 467 | ]
|
466 | 468 | },
|
467 | 469 | {
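Again, the notebook's role-prompting cells are in the unchanged portion of the diff. A minimal sketch of the pattern; the role wording here is illustrative.

```python
# Illustrative only -- uses complete_and_print() from the setup cell.
complete_and_print("Explain the pros and cons of using PyTorch.")
# Without a role, the answer tends to be broad and beginner-oriented.

complete_and_print(
    "Your role is a machine learning expert who gives highly technical advice "
    "to senior engineers. Explain the pros and cons of using PyTorch."
)
# With a role, the answer is more focused and technical.
```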
|
|
484 | 486 | "source": [
|
485 | 487 | "### Chain-of-Thought\n",
|
486 | 488 | "\n",
|
487 |
| - "Simply adding a phrase encouraging step-by-step thinking \"significantly improves the ability of large language models to perform complex reasoning\" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called \"CoT\" or \"Chain-of-Thought\" prompting:" |
| 489 | + "Simply adding a phrase encouraging step-by-step thinking \"significantly improves the ability of large language models to perform complex reasoning\" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called \"CoT\" or \"Chain-of-Thought\" prompting.\n", |
| 490 | + "\n", |
| 491 | + "Llama 3 now reasons step-by-step naturally without the addition of the phrase. This section remains for completeness." |
488 | 492 | ]
|
489 | 493 | },
|
490 | 494 | {
|
|
493 | 497 | "metadata": {},
|
494 | 498 | "outputs": [],
|
495 | 499 | "source": [
|
496 |
| - "complete_and_print(\"Who lived longer Elvis Presley or Mozart?\")\n", |
497 |
| - "# Often gives incorrect answer of \"Mozart\"\n", |
| 500 | + "prompt = \"Who lived longer, Mozart or Elvis?\"\n", |
| 501 | + "\n", |
| 502 | + "complete_and_print(prompt)\n", |
| 503 | + "# Llama 2 would often give the incorrect answer of \"Mozart\"\n", |
498 | 504 | "\n",
|
499 |
| - "complete_and_print(\"Who lived longer Elvis Presley or Mozart? Let's think through this carefully, step by step.\")\n", |
| 505 | + "complete_and_print(f\"{prompt} Let's think through this carefully, step by step.\")\n", |
500 | 506 | "# Gives the correct answer \"Elvis\""
|
501 | 507 | ]
|
502 | 508 | },
|
|
523 | 529 | " response = completion(\n",
|
524 | 530 | " \"John found that the average of 15 numbers is 40.\"\n",
|
525 | 531 | " \"If 10 is added to each number then the mean of the numbers is?\"\n",
|
526 |
| - " \"Report the answer surrounded by three backticks, for example: ```123```\",\n", |
527 |
| - " model = LLAMA2_70B_CHAT\n", |
| 532 | + " \"Report the answer surrounded by backticks (example: `123`)\",\n", |
528 | 533 | " )\n",
|
529 |
| - " match = re.search(r'```(\\d+)```', response)\n", |
| 534 | + " match = re.search(r'`(\\d+)`', response)\n", |
530 | 535 | " if match is None:\n",
|
531 | 536 | " return None\n",
|
532 | 537 | " return match.group(1)\n",
|
|
538 | 543 | " f\"Final answer: {mode(answers)}\",\n",
|
539 | 544 | " )\n",
|
540 | 545 | "\n",
|
541 |
| - "# Sample runs of Llama-2-70B (all correct):\n", |
542 |
| - "# [50, 50, 750, 50, 50] -> 50\n", |
543 |
| - "# [130, 10, 750, 50, 50] -> 50\n", |
544 |
| - "# [50, None, 10, 50, 50] -> 50" |
| 546 | + "# Sample runs of Llama-3-70B (all correct):\n", |
| 547 | + "# ['60', '50', '50', '50', '50'] -> 50\n", |
| 548 | + "# ['50', '50', '50', '60', '50'] -> 50\n", |
| 549 | + "# ['50', '50', '60', '50', '50'] -> 50" |
545 | 550 | ]
|
546 | 551 | },
|
547 | 552 | {
|
|
560 | 565 | "metadata": {},
|
561 | 566 | "outputs": [],
|
562 | 567 | "source": [
|
563 |
| - "complete_and_print(\"What is the capital of the California?\", model = LLAMA2_70B_CHAT)\n", |
| 568 | + "complete_and_print(\"What is the capital of the California?\")\n", |
564 | 569 | "# Gives the correct answer \"Sacramento\""
|
565 | 570 | ]
|
566 | 571 | },
|
|
677 | 682 | " \"\"\"\n",
|
678 | 683 | " # Python code to calculate: ((-5 + 93 * 4 - 0) * (4^4 + -7 + 0 * 5))\n",
|
679 | 684 | " \"\"\",\n",
|
680 |
| - " model=\"meta/codellama-34b:67942fd0f55b66da802218a19a8f0e1d73095473674061a6ea19f2dc8c053152\"\n", |
681 | 685 | ")"
|
682 | 686 | ]
|
683 | 687 | },
|
|
687 | 691 | "metadata": {},
|
688 | 692 | "outputs": [],
|
689 | 693 | "source": [
|
690 |
| - "# The following code was generated by Code Llama 34B:\n", |
| 694 | + "# The following code was generated by Llama 3 70B:\n", |
691 | 695 | "\n",
|
692 |
| - "num1 = (-5 + 93 * 4 - 0)\n", |
693 |
| - "num2 = (4**4 + -7 + 0 * 5)\n", |
694 |
| - "answer = num1 * num2\n", |
695 |
| - "print(answer)" |
| 696 | + "result = ((-5 + 93 * 4 - 0) * (4**4 - 7 + 0 * 5))\n", |
| 697 | + "print(result)" |
696 | 698 | ]
|
697 | 699 | },
|
698 | 700 | {
|
|
702 | 704 | "source": [
|
703 | 705 | "### Limiting Extraneous Tokens\n",
|
704 | 706 | "\n",
|
705 |
| - "A common struggle is getting output without extraneous tokens (ex. \"Sure! Here's more information on...\").\n", |
| 707 | + "A common struggle with Llama 2 is getting output without extraneous tokens (ex. \"Sure! Here's more information on...\"), even if explicit instructions are given to Llama 2 to be concise and no preamble. Llama 3 can better follow instructions.\n", |
706 | 708 | "\n",
|
707 | 709 | "Check out this improvement that combines a role, rules and restrictions, explicit instructions, and an example:"
|
708 | 710 | ]
|
|
715 | 717 | "source": [
|
716 | 718 | "complete_and_print(\n",
|
717 | 719 | " \"Give me the zip code for Menlo Park in JSON format with the field 'zip_code'\",\n",
|
718 |
| - " model = LLAMA2_70B_CHAT,\n", |
719 | 720 | ")\n",
|
720 | 721 | "# Likely returns the JSON and also \"Sure! Here's the JSON...\"\n",
|
721 | 722 | "\n",
|
|
726 | 727 | " Example question: What is the zip code of the Empire State Building? Example answer: {'zip_code': 10118}\n",
|
727 | 728 | " Now here is my question: What is the zip code of Menlo Park?\n",
|
728 | 729 | " \"\"\",\n",
|
729 |
| - " model = LLAMA2_70B_CHAT,\n", |
730 | 730 | ")\n",
|
731 | 731 | "# \"{'zip_code': 94025}\""
|
732 | 732 | ]
|
|
770 | 770 | "mimetype": "text/x-python",
|
771 | 771 | "name": "python",
|
772 | 772 | "nbconvert_exporter": "python",
|
773 |
| - "pygments_lexer": "ipython3" |
| 773 | + "pygments_lexer": "ipython3", |
| 774 | + "version": "3.10.14" |
774 | 775 | },
|
775 | 776 | "last_base_url": "https://bento.edge.x2p.facebook.net/",
|
776 | 777 | "last_kernel_id": "161e2a7b-2d2b-4995-87f3-d1539860ecac",
|
|