Prompt Engineering Techniques for Reliable LLM Outputs (2023–2025)

Introduction

Prompt engineering – the craft of designing inputs to guide large language models (LLMs) – has emerged as a key tool for improving the reliability of model outputs. “Reliability” encompasses factual accuracy, logical consistency, safety/harmlessness, and robustness to variations or adversarial inputs. Between 2023 and 2025, a rich body of research (peer-reviewed papers and industry best practices) analyzed how different prompt formats and strategies impact these reliability aspects. This report surveys major techniques such as zero-shot vs. few-shot prompting, chain-of-thought prompting, instruction tuning, and other structured prompt methods. We summarize each technique, its impact on LLM reliability, and comparative findings from empirical studies, with an emphasis on factual accuracy, consistency, safety, and robustness.

Zero-Shot vs. Few-Shot Prompting

Zero-shot prompting means instructing the model with a task description alone, without example solutions. It relies on the model’s pretrained knowledge and generalization. Few-shot prompting (in-context learning) provides a handful of example question–answer pairs (demonstrations) in the prompt before the actual query, guiding the model’s predictions via pattern matching[1]. Few-shot examples often improve output format consistency and accuracy by showing the model what a correct solution looks like. Research has generally found: • Improved Accuracy with Few-Shot: Few-shot prompts tend to outperform zero-shot in many tasks, especially for classification or structured outputs[1]. By giving task-specific examples, the model is less likely to produce irrelevant or incorrect answers. For instance, one study observed a seven-point accuracy increase (67% vs. 60%) in a domain-specific Q&A task when using few-shot prompting instead of zero-shot[2].
Similarly, a comprehensive 2025 evaluation reported that few-shot learning yields higher predictive accuracy than zero-shot on average, albeit at the cost of longer inputs and higher computation[1]. • Zero-Shot for Generalized Tasks: Zero-shot prompting can still perform competitively on well-understood tasks (e.g. basic QA or sentiment analysis) when the model’s pretraining covers them well[3]. Zero-shot has the advantage of efficiency (shorter prompts, lower latency) and avoids the need to curate examples. Effective zero-shot often depends on prompt phrasing: even simple phrasing tweaks can elicit better answers. For example, adding brief instructions like “Give a step-by-step solution” or ensuring the question is explicit helps maximize zero-shot performance. • Example Quality & Relevance: The benefit of few-shot prompting depends on example quality. High-quality, relevant exemplars yield better reliability, whereas misleading or erroneous examples can degrade performance. Researchers have explored methods to select or optimize exemplars (e.g. based on similarity to the query or through automated search) to consistently improve accuracy[4]. In practice, a few well-chosen examples often calibrate the model to avoid mistakes (such as format errors or misunderstandings of the task). • Diminished Returns for Strong Models: Notably, recent studies with state-of-the-art models suggest that the gap between few-shot and zero-shot can shrink for very powerful LLMs. Cheng et al. (2024) found that for certain advanced models (e.g. Alibaba’s Qwen-7B/14B series), adding chain-of-thought exemplars provided no significant reasoning gain over a good zero-shot prompt – the main role of exemplars was just to nudge the output format[5]. This implies that highly capable, instruction-tuned models might internally handle reasoning in zero-shot mode, making few-shot less critical for those tasks.
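To make the contrast concrete, here is a minimal sketch of how the two prompt styles are assembled. The sentiment task, labels, and helper names (`zero_shot_prompt`, `few_shot_prompt`) are illustrative and not drawn from the cited studies.

```python
# Minimal sketch: assembling zero-shot vs. few-shot prompts for a toy
# sentiment task. The task, labels, and examples are illustrative.

def zero_shot_prompt(text: str) -> str:
    """Task description only; relies on the model's pretrained knowledge."""
    return (
        "Classify the sentiment of the review as Positive or Negative.\n"
        f"Review: {text}\n"
        "Sentiment:"
    )

def few_shot_prompt(text: str, examples: list[tuple[str, str]]) -> str:
    """Prepend labeled demonstrations so the model can pattern-match the
    expected format and decision boundary."""
    demos = "\n".join(
        f"Review: {review}\nSentiment: {label}" for review, label in examples
    )
    return (
        "Classify the sentiment of the review as Positive or Negative.\n"
        f"{demos}\n"
        f"Review: {text}\n"
        "Sentiment:"
    )

examples = [
    ("The battery lasts all day.", "Positive"),
    ("It broke after one week.", "Negative"),
]
prompt = few_shot_prompt("Great screen, terrible speakers.", examples)
```

Note how both prompts end with the bare "Sentiment:" cue, which constrains the model's completion to the label slot; the few-shot variant simply pays for its demonstrations with extra input tokens.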
Nonetheless, for weaker or less-tuned models and for novel domains, few-shot prompting remains a reliable way to boost accuracy.

Key Takeaways: Few-shot prompting generally improves reliability (accuracy and format consistency) compared to zero-shot, by demonstrating the desired behavior. However, it increases prompt length and latency. Zero-shot is simpler and works well with strong, aligned models or generic tasks. A practitioner should weigh the performance gains versus efficiency costs. Hybrid strategies (e.g. giving one example or using zero-shot chain-of-thought, discussed next) can sometimes capture the best of both.

Chain-of-Thought Prompting

Chain-of-Thought (CoT) prompting is a technique that directs the LLM to generate a step-by-step reasoning process before giving a final answer[6]. Instead of answering immediately, the model is prompted (explicitly or by example) to “think out loud” in natural language – listing intermediate steps, calculations, or logic. This method has been a major breakthrough for reliability on complex tasks: • Improved Reasoning and Accuracy: Asking the model to articulate its reasoning significantly enhances performance on mathematics, logical reasoning, and multi-hop question answering tasks[7]. By decomposing problems into steps, CoT reduces the chance of leaps of logic or overlooked details. Wei et al. (2022) first demonstrated that few-shot chain-of-thought prompts enabled GPT-3 to solve math word problems and logic puzzles far better than standard prompting[6]. The model’s accuracy on benchmarks like GSM8K (grade-school math) jumped dramatically when it was prompted with reasoning steps. Subsequent works confirmed that CoT improves correctness on tasks requiring multi-step inference or understanding of contextual nuance. The model essentially uses the prompt as scratch paper to work through the problem, leading to more reliable final answers. • Zero-Shot CoT: Kojima et al.
(2022) showed that even without exemplar demonstrations, simply appending a phrase like “Let’s think step by step.” to a prompt can trigger the model to produce a chain of thought on its own[8]. This zero-shot CoT approach is attractive because it doesn’t require adding examples – it relies on the model’s learned ability to follow the “think step by step” instruction. Remarkably, large models (GPT-3/4, etc.) can interpret this prompt and generate coherent reasoning steps, often improving accuracy compared to a direct answer. Other prompt variants (e.g. “Let’s work this out step by step to be sure we have the right answer”[8]) have been explored, and finding the optimal phrasing for eliciting reasoning has itself been a research theme[9]. Overall, zero-shot CoT offers a simple reliability boost: it encourages the model to double-check itself and produce a chain of logic that can be audited. • Few-Shot CoT: Providing one or more exemplars that include a question, a detailed step-by-step solution, and the answer is another effective way to induce chain-of-thought reasoning. This was the original CoT prompting method[6]. Few-shot CoT prompts can “teach” the model how to reason by example. Studies refer to this as Manual CoT prompting, and it can significantly enhance performance on complex benchmarks[10]. For example, including a worked solution as a demonstration helps the model format its reasoning properly and often leads to correct answers on similar problems. Few-shot CoT tends to be especially useful for models that are not already instruction-tuned for reasoning – it provides an on-the-fly fine-tuning of the model’s thought process. Recent research also looked at automatically generating CoT examples: methods like Auto-CoT use the model itself to create reasoning exemplars that are then fed back in as demonstrations[11]. 
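The two elicitation styles just described can be sketched as simple prompt builders. The trigger phrase is the one quoted above; the worked exemplar and helper names are illustrative.

```python
# Minimal sketch: two ways of eliciting chain-of-thought reasoning.
# The exemplar and function names are illustrative.

COT_TRIGGER = "Let's think step by step."  # zero-shot trigger (Kojima et al.)

def zero_shot_cot(question: str) -> str:
    """Append the trigger phrase so the model reasons before answering."""
    return f"Q: {question}\nA: {COT_TRIGGER}"

def few_shot_cot(question: str) -> str:
    """Manual CoT: one worked exemplar (question, step-by-step solution,
    answer) teaches the reasoning format by example."""
    exemplar = (
        "Q: A shelf has 3 rows of 4 books. How many books are there?\n"
        "A: Each row has 4 books and there are 3 rows, so 3 * 4 = 12. "
        "The answer is 12.\n"
    )
    return f"{exemplar}Q: {question}\nA:"
```

In the zero-shot variant the completion begins after the trigger, so the model's continuation is the reasoning itself; in the few-shot variant the exemplar's "The answer is …" suffix also gives downstream code a reliable pattern for extracting the final answer.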
• Consistent Logic and Reduced Errors: A major benefit of chain-of-thought is that it increases the transparency and consistency of the model’s reasoning. By examining the intermediate steps the model generates, one can often identify where an error might have occurred (either as a user or via an automated checker). The model, too, is less likely to make simple arithmetic or logical mistakes when it writes out the solution method. In essence, CoT acts as a form of self-audit. It can reduce hallucination in multi-step contexts because the model focuses on deriving the answer from given facts stepwise, rather than jumping to a possibly unfounded conclusion. However, note that CoT does not guarantee factual accuracy of each step – the model can still introduce a false statement in its reasoning. But when it does, that false step is at least evident in the chain-of-thought, which could be caught by verification techniques (more on this later). • Performance Gains: Empirical gains from CoT prompting are significant. For instance, on the GSM8K math word problem dataset, CoT prompting improved accuracy by dozens of percentage points over direct answering in early experiments[12][6]. Additionally, when chain-of-thought is combined with better decoding strategies (like self-consistency, discussed below), state-of-the-art models have reached or surpassed expert-level performance on many reasoning benchmarks. CoT prompting has been so successful that it’s now a common practice in evaluations of LLMs’ reasoning abilities.

Example: Using Chain-of-Thought Prompting
User prompt (with CoT instruction): “Q: Alice has 5 apples, Bob gives her 7 more. How many apples does Alice have now? Let’s think step by step.”
Model’s chain-of-thought: “Alice starts with 5. Bob gives 7, so we add: 5 + 7 = 12. Therefore, Alice has 12 apples.”
Final answer: “12.”

In this simple example, the model explicitly showed the reasoning, which is trivial here but illustrative.
In complex scenarios (like puzzle questions, coding tasks, etc.), CoT prompts lead the model to produce much more elaborate multi-sentence reasoning, greatly clarifying how it arrived at an answer[13]. Self-Consistency (Ensemble of Chains): An important enhancement to chain-of-thought prompting is the self-consistency decoding strategy[12]. Instead of trusting a single chain-of-thought, the model generates multiple independent reasoning paths (by sampling with diversity) for the same question. These multiple answers with rationale are then aggregated (e.g. by a majority vote on the final answer). The idea, proposed by Wang et al. (2023), is that a complex question often has a unique correct answer but many possible reasoning paths[14]. If we sample several chains, the most consistent answer across different reasoning attempts is likely correct. This approach greatly improved reliability on reasoning benchmarks: • Wang et al. reported striking performance boosts using self-consistency: e.g. +17.9% accuracy on GSM8K, +12.2% on AQuA (math QA), +6.4% on StrategyQA (commonsense) compared to single-path CoT prompting[12]. In other words, ensemble-of-thoughts decoding closed a large fraction of the gap between CoT-prompted LLMs and 100% accuracy on these tasks. The method reduces variance from picking a potentially flawed chain-of-thought. It also serves as a confidence estimator – if the chains disagree, one might conclude the model is uncertain. • Extensions of self-consistency include weighting answers by confidence or using validation questions. Google researchers introduced an “uncertainty-routed CoT” where if no single answer has a clear majority among sampled chains, the model falls back to a different strategy (or a greedy pass)[15]. This achieved gains on a challenging knowledge test (MMLU) for models like GPT-4[16], suggesting improved factual accuracy through voting on chain-of-thought outputs. 
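The core of self-consistency, sampling several chains and voting on the extracted final answers, can be sketched as follows. The canned `chains` stand in for reasoning paths sampled from an LLM at nonzero temperature, and the answer-extraction heuristic is illustrative.

```python
# Minimal sketch of self-consistency voting over sampled reasoning chains.
from collections import Counter

def self_consistent_answer(samples: list[str]) -> tuple[str, float]:
    """Majority-vote over final answers extracted from sampled chains;
    the vote share doubles as a rough confidence estimate."""
    answers = [s.rsplit("The answer is", 1)[-1].strip(" .") for s in samples]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)

chains = [
    "5 + 7 = 12. The answer is 12.",
    "Alice starts with 5, gains 7, total 12. The answer is 12.",
    "5 + 7 = 13. The answer is 13.",  # one flawed chain gets outvoted
]
answer, confidence = self_consistent_answer(chains)
```

A low vote share is exactly the disagreement signal mentioned above: it can be used to route the query to a fallback strategy instead of trusting the plurality answer.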
• Trade-offs: The trade-off for self-consistency is the extra compute – you must generate n reasoning traces instead of one. For important queries, this is often worthwhile. Another consideration is that if the model has a systematic bias or knowledge gap, all chains might be wrong (just wrong in different ways). But empirically, self-consistency has proven to almost always match or beat single-chain performance on reasoning tasks[17], making LLM outputs more robust and correct. It’s a “prompting-time” analog to ensembling methods in traditional ML.

In summary, chain-of-thought prompting is a foundation for reliable reasoning with LLMs. It transforms the prompting task from “give me the answer” to “show me your working then answer,” which has clear benefits for accuracy and interpretability[7]. Combined with strategies like self-consistency, CoT has dramatically closed the gap between LLM reasoning and human-level problem solving on many benchmarks. Most importantly, it gives users and developers a tool to inspect the model’s thought process, which is invaluable for debugging and trust.

Instruction Tuning and Aligned Model Prompts

While not a prompting technique per se, instruction tuning is a related paradigm that greatly improves an LLM’s reliability in following prompts. Instruction tuning means fine-tuning a pre-trained LLM on a dataset of tasks described by instructions and corresponding solutions, so that the model learns to interpret and obey natural language instructions. By 2023, many top models (OpenAI’s GPT-3.5/4, Google’s Flan-T5/PaLM, Meta’s LLaMA-2-Chat, Anthropic’s Claude, etc.) underwent instruction tuning or related alignment training (often coupled with Reinforcement Learning from Human Feedback). The result is models that are more responsive, accurate, and safe when prompted, compared to their base pretrained counterparts.
Reliability benefits of instruction-tuned models: • Better Task Following: Instruction-tuned LLMs exhibit improved factual accuracy and relevance because they are explicitly trained to follow user prompts correctly[18]. They are less likely to ignore or misunderstand instructions, reducing off-target or nonsensical outputs. For example, an instruction-tuned model will reliably answer in the requested format (bullet list, JSON, etc.) and stick to the question asked, whereas a base model might ramble or require very carefully phrased prompts. This alignment with user intent means fewer mistakes and more consistent results. • Fewer Hallucinations: Fine-tuning on high-quality, factually grounded responses can substantially reduce hallucinations. Tuning essentially teaches the model what accurate responses look like. A recent survey on LLM factuality notes that targeted fine-tuning (including instruction tuning) “substantially enhances reliability” on factual tasks[18]. The model learns to be more cautious and truth-seeking in its answers. OpenAI’s own evaluations of InstructGPT (early 2022) showed that it produced factually wrong answers less often than the base GPT-3, and users judged its outputs as more truthful and useful. Similarly, Flan-PaLM (an instruction-tuned 540B model by Google) achieved large jumps in benchmark performance across 1,000+ tasks, indicating it had learned to produce more correct and directly useful answers than an untuned model of equal size. • Safety and Alignment: Instruction-tuned models are typically trained on datasets filtered or annotated for safe behavior, and often incorporate explicit refusal or safe-completion behaviors for disallowed prompts. This leads to more harmless outputs and adherence to ethical guidelines out-of-the-box. Anthropic’s Claude, for instance, is instruction-tuned with a “harmlessness” objective (and further refined via Constitutional AI, discussed later). 
The result is that these models are far less likely to produce toxic or dangerous content unless deliberately tricked. Even without additional prompt engineering, an aligned model will usually respond with a refusal if asked for something obviously harmful. In practice, this built-in safety is crucial for reliability – it means the user does not have to manually craft complex prompts to steer the model away from bad content every time. Industry best practices now strongly favor using instruction/RLHF-tuned models for any application where reliability matters, as these models have been optimized to follow instructions helpfully and safely. • Consistency and Tone: Instruction tuning often imparts a consistent conversational style (for chat models) or at least a more polite, on-topic tone. This consistency means that if the user uses the same prompt at different times, the model is more likely to produce similar-quality outputs (less variability than an untuned model which might be more unstable). Also, an aligned model might implicitly handle certain prompt nuances – for example, it might automatically avoid first-person or ensure the answer remains neutral – because those patterns were present in the fine-tuning data. In essence, instruction tuning makes prompt engineering easier and more effective. A naive prompt that would confuse a base model might be correctly interpreted by an instruction-tuned model. These models come “pre-prompted” with a general understanding of following orders. For instance, an untuned model might need very explicit prompting to not exceed a word limit (“Answer in at most 50 words.”), whereas an instruction-tuned model might respect length instructions more reliably. A concrete example of instruction tuning’s impact: OpenAI’s GPT-4 (2023) is deeply instruction-tuned and integrated with RLHF. As a result, simple prompts like “Explain the cause of seasons in one paragraph.” yield a focused, correct explanation. 
A base LLM might have given an overly long essay or gone off-topic about seasons. Instruction tuning aligned GPT-4 to factual correctness and brevity in this case, without the user needing to engineer a complicated prompt. Academic studies have similarly reported that instruction-tuned models achieve higher factual QA scores and fewer reasoning errors than models of the same size that are not instruction-tuned[18]. It is important to note that instruction tuning requires collecting or generating a large instruction dataset (which can include chain-of-thought style examples, explanations, etc.). By 2023, several such datasets were available (OpenAI’s WebGPT and Instruct data, FLAN collections, Self-Instruct generated data, ShareGPT conversations, etc.), enabling even open-source models to be instruction-tuned by the community. The consensus in the community is that if you want a reliable model, fine-tuning on instructions and human feedback is a must – this is how ChatGPT was created from GPT-3.5, and how Bard was built on LaMDA and later PaLM 2. From a prompt engineering perspective, using an instruction-tuned model means you can rely on simpler prompts to get good results. However, combining instruction-tuned models with advanced prompting can stack the benefits. For example, an instruction-tuned model using chain-of-thought prompting + self-consistency might achieve the best of all worlds: the model knows how to follow the CoT instruction, and the decoding ensures correctness. This stacking has indeed been used in research to push reliability even further[15]. In summary, instruction tuning is a macro-level technique to improve reliability: it changes the model itself to be more responsive to prompts. While not “prompt engineering” in the narrow sense, it underpins many best practices. We include it here because nearly all cutting-edge results in 2023–2025 on reliable LLM outputs assume an instruction-tuned backbone model.
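The instruction datasets mentioned above are, at their core, collections of (instruction, response) records, often stored one JSON object per line. The field names vary by dataset, so the layout below is only an illustrative sketch.

```python
# Illustrative sketch of one instruction-tuning record; real datasets
# (FLAN, Self-Instruct, etc.) use varying field names and formats.
import json

record = {
    "instruction": "Explain the cause of seasons in one paragraph.",
    "input": "",  # optional extra context for the task
    "output": (
        "Seasons are caused by the tilt of Earth's axis: as Earth orbits "
        "the Sun, each hemisphere receives more direct sunlight for part "
        "of the year and less for the rest."
    ),
}

# Such records are commonly stored one JSON object per line (JSON Lines).
line = json.dumps(record)
```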
A well-aligned model combined with good prompting techniques is significantly safer, more factual, and more consistent than either alone[18].

Advanced Structured Prompting Methods

Beyond basic few-shot and chain-of-thought prompting, researchers have developed a variety of structured prompt strategies to target reliability issues. These methods often break or reframe the task in clever ways, or involve multi-step interactions with the model (sometimes called prompt chaining or meta-prompting). We highlight several notable techniques in this category and their impact:

Problem Decomposition Prompts (Least-to-Most, etc.)

Complex questions can often be answered more reliably if broken into simpler sub-questions. Problem decomposition prompting explicitly instructs the LLM to perform this breakdown. One prominent approach is Least-to-Most Prompting (Zhou et al., 2022)[19]. In least-to-most prompting, the model is first asked to list out the sub-problems or intermediate questions needed to solve the main problem (without solving them yet). Then, each sub-problem is solved one by one, and those solutions are combined to get the final answer[20]. This method mirrors how a human might tackle a complicated problem stepwise. • Reliability Impact: By focusing the model on one piece of the problem at a time, decomposition prompts reduce reasoning errors and lapses in attention, leading to more correct solutions. Zhou et al. showed significant improvements on tasks involving symbolic manipulation, compositional logic, and math reasoning using least-to-most prompting[21]. Essentially, the model’s cognitive load is managed by the prompt: it doesn’t have to solve everything at once, which avoids confusion. Breaking down questions also helps ensure that each piece can be verified or solved with higher confidence.
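The two-stage least-to-most procedure can be sketched as a small pipeline. `ask` stands in for a real LLM call and is stubbed here with canned responses so the control flow is runnable; the prompts and stub are illustrative.

```python
# Minimal sketch of the least-to-most pattern: first prompt for the list of
# sub-questions, then solve them in order, feeding earlier answers into
# later prompts.

def least_to_most(question: str, ask) -> str:
    # Stage 1: decompose -- the model lists sub-questions, one per line.
    subs = ask(
        f"To answer: {question}\n"
        "List the sub-questions needed, one per line, without answering them."
    ).splitlines()
    # Stage 2: solve each sub-question, carrying prior answers as context.
    context, answer = "", ""
    for sub in subs:
        answer = ask(f"{context}Q: {sub}\nA:")
        context += f"Q: {sub}\nA: {answer}\n"
    return answer  # the final sub-answer resolves the original question

def fake_llm(prompt: str) -> str:
    """Canned stand-in for a model API call."""
    if "List the sub-questions" in prompt:
        return ("Who invented the telephone?\n"
                "Which country was he born in?\n"
                "What is the capital of that country?")
    last_q = prompt.rsplit("Q:", 1)[-1]  # only the most recent sub-question
    if "Who invented" in last_q:
        return "Alexander Graham Bell"
    if "Which country" in last_q:
        return "Scotland"
    return "Edinburgh"

final = least_to_most(
    "What is the capital of the country where the inventor of the "
    "telephone was born?", fake_llm)
```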
This approach improves factual accuracy in multi-hop QA – rather than the model guessing a multi-hop answer in one go (and possibly missing a hop), it systematically answers each hop. • Example: “What is the capital of the country where the inventor of the telephone was born?” – This question requires multiple hops. A decomposition prompt might have the model first identify “the inventor of the telephone” (Alexander Graham Bell), then find “the country where he was born” (Scotland), then find “the capital of that country” (Edinburgh). Doing this stepwise helps ensure each fact is correct, thus improving the final answer’s accuracy. A direct one-shot answer might be prone to error or hallucination on such a query, whereas a least-to-most approach guides the model through the necessary factual steps. • Variants: Other forms of decomposition include self-asking (where the model poses a sub-question to itself and answers it) and decomposed prompting (DECOMP) for complex QA[22]. The key theme is explicitly instructing the model to handle sub-tasks. This structured reasoning often pairs well with chain-of-thought: the chain-of-thought becomes organized into sections addressing each sub-problem. Empirical studies on complex benchmarks like HotpotQA and multi-step reasoning puzzles show that such prompting yields higher exactness and lower hallucination rates.

Role Prompting and Persona Contexts

A simple yet effective structured prompt method is to assign the model a specific role or persona relevant to the task. For example: “You are a clinical diagnostic assistant. Explain the medical result…” or “Act as a critical fact-checker and answer the following question.” By doing so, the user provides context that can constrain the model’s style and knowledge used. • Reliability Impact: Setting a role can enhance consistency and factuality by priming the model with domain-specific knowledge or norms.
A model told it “is an expert mathematician” might be more systematic and less likely to make basic math errors, for instance. Bsharat et al. (2023) proposed 26 “principled instructions” for prompts – one principle is precisely to integrate the intended audience or role into the prompt[23][24]. They found that assigning a specific role can elicit outputs that better match the expected domain accuracy and style[25][23]. For example, telling the model “the audience is a 5-year-old child” will cause it to produce a simplified and clear explanation, whereas “the audience is an expert in the field” yields a detailed, jargon-rich answer[26]. This consistency with the intended level improves the usefulness and factual relevance of the output for that context. • A study applying these principles showed notable performance gains: using carefully designed roles and prompt structures led GPT-4 to produce more concise, factual, and accurate responses – the quality and accuracy scores improved by 57.7% and 36.4% on average compared to a baseline prompt across their evaluations[27]. This underscores that prompt clarity and context (like roles) indeed translate to more reliable outputs. • Example: If querying about a legal matter, prefacing with “You are a legal expert with 10 years of experience in contract law.” often yields a more precise and legally accurate answer. The model “steers into” the role, using more relevant knowledge and phrasing. Without the role, the answer might be more generic or even slightly incorrect due to lack of context. • Safety Angle: Role prompting can also incorporate safety or ethical guidelines into the prompt. For example: “You are a trustworthy AI assistant that never discloses private data and avoids any harmful language.” While aligned models have such principles inherently, stating them in the prompt can reinforce safe behavior. This is a form of prompt-based safety conditioning – instructing the model on how to behave. 
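In chat-style APIs, role and safety conditioning of this kind is usually injected as a system-level message ahead of the user's question. A minimal sketch, assuming a generic message schema (providers differ) and an illustrative role text:

```python
# Minimal sketch: injecting a role/persona plus a safety clause as a
# system-style message. The schema mirrors common chat APIs, but the exact
# format differs across providers.

def with_role(role: str, question: str) -> list[dict]:
    """Build a chat-style message list with a persona/safety preamble."""
    return [
        {"role": "system", "content": role},
        {"role": "user", "content": question},
    ]

messages = with_role(
    "You are a legal expert with 10 years of experience in contract law. "
    "If you are unsure of a fact, say you are unsure.",
    "Is a verbal agreement to sell land enforceable?",
)
```

Keeping the role in a separate system message (rather than pasting it into the user turn) lets the same persona be reused across a whole conversation.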
It’s not foolproof, but it can mitigate risky outputs. Anthropic’s Constitutional AI approach (discussed below) can be seen as giving the model an explicit “role” of a harmless, principled AI that self-corrects according to a written constitution of rules.

Self-Critique and Refinement (Reflective Prompting)

Another powerful structured approach is to have the model evaluate and refine its own output. Instead of the human checking the answer, the prompt can direct the model itself to do a self-critique and improvement cycle. In 2023, methods like Self-Refine (Madaan et al., NeurIPS 2023) were proposed, which do the following: the model generates an initial answer, then is prompted (with the same question plus its answer) to provide feedback or identify any errors in that answer, and finally it produces a revised, improved answer[28]. This can be done iteratively, improving the answer step by step. • Reliability Impact: Self-critique prompting leads to notably higher quality outputs, often fixing factual errors or reasoning flaws that the first attempt contained. Madaan et al. report that across diverse tasks (from dialogue to math), outputs generated with Self-Refine were preferred by humans and scored higher automatically than one-shot generations – about a 20% absolute improvement in task performance on average[29]. Even top models like GPT-4 saw gains, indicating that an initial draft can be polished significantly by a second pass of model-generated feedback[30]. Essentially, this leverages the model’s capacity to be its own reviewer. Because the model can sometimes catch inconsistencies or missing details upon reflection, it enhances both factual accuracy and answer completeness. • Example Workflow: • Prompt the model: “[Question]. Provide an answer, and then review your answer for any mistakes or missing info, correcting it if needed.” • The model’s first answer might have some uncertainty or a minor mistake.
• The model then (because of the prompt) generates a critique: e.g. “Upon review, I realize I didn’t address X, and Y might be incorrect because… Now I will fix the answer.” • Model provides a final refined answer. This process can often catch logical errors or hallucinations. For instance, if the model initially answers a question incorrectly, in the critique step it might notice conflict with a known fact and then correct the answer. • Automated Feedback Types: The feedback prompt can target various dimensions – factual accuracy (“Check if all claims are true using knowledge”), consistency (“Does the answer contradict itself or the context?”), safety (“Is any part of the answer inappropriate?”), etc. Researchers have even tried chain-of-thought style self-critiques, where the model lists things to verify about its answer. One such method is “verify-and-edit”, where multiple initial chains-of-thought are checked against an external knowledge source and edited if found incorrect[31]. For example, Zhao et al. (2023) generated several reasoning paths, retrieved relevant facts from a database for each, and allowed the model to correct those reasoning paths using the retrieved evidence[31]. This yielded more factually accurate answers by combining self-reflection with external verification. • Refinement without Training: A key point is that approaches like Self-Refine do not require additional fine-tuning or human labels; they work at inference time with clever prompting[32]. This makes them very appealing for deployment: one can wrap a raw model in a self-refinement loop to boost reliability without changing the model’s weights. Many practical prompt chains now include a step like “Before finalizing, double-check the answer above and correct any errors.” • According to Self-Refine’s authors, even state-of-the-art aligned models have headroom for improvement with this method – “even GPT-4 can be further improved at test-time using our simple approach”[33].
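The generate, critique, revise workflow described above can be sketched as a small loop. `llm` stands in for a model call; the critique prompt, the "LGTM" stopping rule, and the stub below are all illustrative, not the Self-Refine authors' exact prompts.

```python
# Minimal sketch of a generate -> critique -> revise loop in the spirit of
# Self-Refine. All prompts and the stub model are illustrative.

def self_refine(question: str, llm, max_rounds: int = 2) -> str:
    draft = llm(f"Q: {question}\nA:")
    for _ in range(max_rounds):
        critique = llm(
            f"Question: {question}\nAnswer: {draft}\n"
            "List any factual errors or omissions, or say 'LGTM'."
        )
        if "LGTM" in critique:
            break  # the model found nothing left to fix
        draft = llm(
            f"Question: {question}\nAnswer: {draft}\n"
            f"Feedback: {critique}\nRewrite the answer fixing the feedback:"
        )
    return draft

def fake_llm(prompt: str) -> str:
    """Canned stand-in: the first draft is flawed, the critique flags it,
    and the rewrite fixes it."""
    if "List any factual errors" in prompt:
        return "LGTM" if "12" in prompt else "5 + 7 is 12, not 13."
    if "Rewrite the answer" in prompt:
        return "Alice has 12 apples."
    return "Alice has 13 apples."  # flawed first draft

refined = self_refine("Alice has 5 apples, Bob gives her 7 more. Total?",
                      fake_llm)
```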
This suggests that LLMs sometimes know more (or can reason more carefully) than their first answer shows; a reflective prompt can draw out that latent reliability.

Tool-Augmented Prompting (ReAct, RAG)

One of the most effective ways to enhance factual accuracy and reduce hallucinations is to give the model access to external tools or reference materials via prompting. Techniques like ReAct (Reason+Act prompting) and Retrieval-Augmented Generation (RAG) incorporate tool use into the prompt format. In these approaches, the model is prompted to alternate between reasoning steps and actions (such as calling a search engine, calculator, or knowledge base)[34]. The prompt might include a few-shot demonstration of using a tool, or the conversation might be structured so that the model’s output can trigger an API call. The model then receives the tool results and continues reasoning. By the end, it produces a final answer grounded in the retrieved facts. • Reliability Impact: Tool-augmented prompting has a dramatic impact on factual reliability. By retrieving up-to-date or specific information, the model is far less likely to hallucinate facts – it can quote or incorporate actual evidence. ServiceNow researchers noted that RAG systems mitigate hallucination and increase output reliability in practice[35]. For structured knowledge tasks, a well-designed RAG prompt can ensure nearly 100% factuality because the model’s role is primarily to summarize or transform retrieved content rather than to invent content. In addition, tools like calculators help eliminate arithmetic errors (a known weak point of pure LLM reasoning). • The ReAct framework (Yao et al., 2022) demonstrated that a GPT-3 model with a prompt instructing it how to think step-by-step and use actions (like “Search[query]”) could answer open-domain questions more accurately than the model alone[34].
It would generate a thought like “I should search for X”, then the environment (or a wrapper code) executes the search and returns results, which are fed back to the model, and the model continues reasoning with those results[34]. This closed-loop prompting allows the model to fetch facts as needed, drastically cutting down hallucination. • Interleaved Retrieval CoT: A specific example from 2023 is IRCoT (Interleaved Retrieval with CoT) for multi-hop QA[36]. Trivedi et al. (2023) prompted the model to retrieve after each reasoning step – effectively interleaving chain-of-thought reasoning with document search. This guided the model to find the exact information for each sub-question, leading to highly accurate multi-hop answers[36]. By planning what to search (using CoT) and then actually searching, IRCoT ensured that each step of reasoning was grounded in reality. • Self-Correction with Tools: Another innovation is using tools to verify the model’s own draft answers. The CRITIC method (Gou et al., 2024) had the model first answer a question with no tools, then critique its answer and selectively invoke tools like web search or a code interpreter to check and fix errors[37]. For example, if the question was historical and the model wasn’t fully sure, in the critique phase it could call the search tool to confirm a date, then correct its answer if it was wrong. This approach combines self-reflection with tool use, compounding reliability: the model’s answer gets a fact-checking pass via external sources[37]. • Example: In practice, many applications now use RAG prompting. A user question is used to retrieve relevant documents (e.g. from Wikipedia or a company knowledge base), and those documents (or summaries) are inserted into the prompt as context, with an instruction “Use the above information to answer the question.” The LLM then produces a grounded answer, often with markedly higher factual accuracy than if it had to rely on memory. 
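The retrieve-then-prompt assembly described above can be sketched as follows. Retrieval here is a toy keyword match over an in-memory corpus; real systems use a search index or vector store, and the corpus contents and helper names are illustrative.

```python
# Minimal sketch of retrieval-augmented prompt assembly with a toy
# keyword-match retriever over an in-memory corpus.

CORPUS = {
    "bell": "Alexander Graham Bell was born in Edinburgh, Scotland, in 1847.",
    "seasons": "Seasons result from the tilt of Earth's rotational axis.",
}

def retrieve(question: str) -> list[str]:
    """Return documents whose key appears in the question (toy retrieval)."""
    q = question.lower()
    return [doc for key, doc in CORPUS.items() if key in q]

def rag_prompt(question: str) -> str:
    """Insert retrieved context and instruct the model to stay grounded."""
    docs = "\n".join(f"- {d}" for d in retrieve(question))
    return (
        "Use only the information below to answer. If the answer is not "
        "in the context, say you don't know.\n"
        f"Context:\n{docs}\n"
        f"Question: {question}\nAnswer:"
    )

prompt = rag_prompt("Where was Bell born?")
```

The "use only the information below / say you don't know" instruction is the grounding half of the technique: it shifts the model's job from recall to summarization of the supplied evidence.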
Empirical papers have shown near elimination of certain types of hallucinations when models have access to a reliable knowledge retrieval component[35].
• Trade-offs: The main challenge with tool-augmented prompting is complexity – one needs an external system to handle the tool interface, and prompt design becomes more involved (you may have to design a schema for how the model should request a tool, for instance). The model may also need to be steered not to over-rely on tools, and to interpret tool results correctly. Nonetheless, from a reliability standpoint, tool augmentation is one of the most effective measures for factual correctness. It directly addresses the LLM’s inherent limitation of closed-book knowledge and its tendency to fabricate unknown facts.
Contrastive and Negative Example Prompting
Another structured technique is to provide counter-examples or constraints in the prompt to shape the model’s output. For instance, contrastive prompting gives examples of incorrect reasoning or undesirable outputs alongside correct ones, to illustrate what not to do[38].
• Contrastive CoT: Chia et al. (2023) introduced Contrastive Chain-of-Thought, where a CoT prompt includes both a correct solution path and an incorrect solution path for a sample problem[38]. The model thus sees a demonstration of a mistake and its resolution. This technique showed significant improvement in arithmetic reasoning and factual QA[38]. The model learned to avoid reasoning patterns that led to the incorrect outcome in the example. In essence, negative examples serve as an additional guide: “don’t do it this way.” This reduces certain classes of errors (like a common arithmetic slip) because the model recognizes and steers away from the demonstrated bad logic.
• Instruction-based Constraints: We can also explicitly prompt constraints like “If you are unsure of a fact, say you are unsure” or “Ensure that your answer is unbiased and does not rely on stereotypes.”[24].
These constraints are structured as part of the user instruction. While a model may not always comply perfectly, studies such as the principled-instructions work found that adding such clauses can indeed make answers more cautious and aligned[24]. For example, adding “Don’t guess.” to a prompt can decrease hallucinations by causing the model to respond with uncertainty or ask for clarification when it doesn’t know. This improves reliability by avoiding confidently wrong answers.
Multi-Prompt Ensembles (Ask-Me-Anything Prompting)
To tackle the issue of prompt sensitivity (where slight rephrasing can change answers), researchers have used multiple prompts in parallel and then aggregated the results. One framework is “Ask-Me-Anything” (AMA) prompting, which uses multiple imperfect prompts and combines their answers[39]. For instance, one could prompt the model with question phrasings A, B, and C (or even with different prompt styles, say one CoT and one direct), then have the model or a separate process decide on the best answer (e.g. by majority vote or by a follow-up prompt that evaluates each answer).
• This approach improves robustness: if one prompt phrasing leads the model astray but others do not, the ensemble can cancel out the anomaly. AMA prompting improves answer accuracy in QA formats by merging the strengths of different prompts[39]. It is essentially using redundancy to get a more reliable result, similar to self-consistency but across prompt variants instead of random chains. Empirical results show that such prompt ensembling can stabilize performance and often exceeds any single prompt’s accuracy.
• For example, if a medical question is asked in three ways (one straightforward, one reworded, one in a “teach me” style), and two of the three answers agree on a diagnosis while one differs, the differing one might be an outlier due to wording. Taking the majority, or feeding all three into a final “which answer is best?” prompt, can yield the correct conclusion.
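The majority-vote variant just described can be sketched in a few lines. This is a hedged illustration, not the AMA implementation: `toy_ask` is a canned stand-in for a real model call, and the three templates are hypothetical phrasings.

```python
from collections import Counter

def ensemble_answer(question, templates, ask):
    # Ask the same question under several phrasings and take the majority.
    answers = [ask(t.format(q=question)) for t in templates]
    return Counter(answers).most_common(1)[0][0]

templates = [
    "Q: {q}\nA:",
    "Answer briefly: {q}",
    "You are a careful expert. {q}",
]

def toy_ask(prompt):
    # Stand-in model: one phrasing "tricks" it, the other two agree.
    return "5" if prompt.startswith("Q:") else "4"
```

With `toy_ask`, the first phrasing produces an outlier ("5") while the other two agree on "4", so the vote returns "4". In a real system the final step could instead be a follow-up "which answer is best?" prompt rather than a simple vote.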
This method explicitly addresses reliability across prompt perturbations, which matters because LLMs can be prompt-sensitive[40][41].
• Research on Robustness: Indeed, research has noted that “the performance of LLMs is acutely sensitive to the phrasing of prompts”, which raises concerns about reliability[42]. Multi-prompt evaluation is recommended as a standard practice for assessing and ensuring model consistency[43]. By using multiple prompts, one can gauge the stability of the answer and reduce the chance that a particular wording tricks the model into an error.
In summary, a wide range of structured prompt strategies have been developed to push LLM reliability to higher levels. These include breaking tasks into parts, giving the model roles or principles to follow, letting the model iteratively refine answers, incorporating external information, and even using the model in an ensemble-like fashion. Many of these can be combined: e.g., one could do chain-of-thought with a tool, then self-refine. The overarching trend is that prompting has evolved from one-shot queries to complex, interactive workflows that significantly enhance the factual accuracy, reasoning consistency, and safety of LLM outputs.
Comparative Analysis of Prompting Techniques
Different prompt engineering techniques address different reliability criteria, and they come with trade-offs. The table below summarizes several key techniques and their known reliability impacts, based on findings from the 2023–2025 literature:
Summary of prompt engineering techniques and their impact on LLM reliability. “Accuracy/Quality” refers to improvements in factual correctness or task performance.
“Notes/Results” highlight example empirical findings or considerations.[7][12][35]
Some comparative observations from studies:
• Factual Accuracy: Techniques that incorporate external knowledge (RAG, tools) or encourage verification (self-refinement, contrastive checking) have the largest impact on factual accuracy, virtually eliminating certain hallucinations[35]. Chain-of-thought also improves factual tasks where reasoning or multi-hop retrieval is required, but by itself it may not fix factual gaps (it needs correct facts to start with). Instruction-tuned models start off more factual than base models[18], and adding RAG on top can yield very high factual fidelity.
• Reasoning Ability: Chain-of-thought prompting (with or without few-shot examples) and decomposition strategies are the clear winners for logical and mathematical consistency[6][20]. Few-shot examples also help reasoning insofar as they demonstrate the solution method. Self-consistency further boosts reasoning correctness by resolving ambiguity in the model’s thoughts[12]. In contrast, zero-shot prompting without CoT often fails on complex reasoning that these methods can handle.
• Consistency & Robustness: Self-consistency and multi-prompt ensembles explicitly tackle consistency, making answers more robust to the stochastic nature of generation and to phrasing variations[12][39]. Role prompting can maintain a consistent tone or viewpoint across an answer (and even across a dialogue), reducing contradictions. Conversely, a model without these techniques might give inconsistent answers when asked the same thing in slightly different ways (a known issue). Instruction tuning also imbues a general consistency in following instructions, reducing variability due to phrasing changes[1].
• Safety: No prompt technique alone is bulletproof for safety, but certain strategies help. Instruction tuning and RLHF have the biggest effect by default (as they train the model to avoid unsafe content).
On the prompt side, explicitly stating safety rules or using a “Constitutional AI” self-check can significantly reduce harmful outputs[44]. For example, Anthropic’s model, when prompted to critique and revise its answer to uphold harmlessness principles, reduced the success of adversarial attacks by ~40% in one benchmark[45]. However, researchers note a trade-off: making the model very harmless can reduce helpfulness (it might refuse too often)[44]. Finding the optimal balance often requires iteration. There have also been studies on adversarially robust prompting (to resist jailbreaking), including the use of special “protected” prompt prefixes or detectors for malicious inputs[46]. These are ongoing arms races in safety.
• Efficiency vs Performance: Few-shot prompting and long CoT reasoning can be token-intensive. If latency or context length is a concern, one might choose zero-shot prompting or a minimal CoT. Some research into prompt optimization tries to find the shortest effective prompts. There is also interest in automating prompt engineering (using one model to generate better prompts for another), which we did not cover in depth but which is a meta-technique for achieving reliability without human-crafted prompts.
• Emergent vs Add-on Techniques: It is worth noting that very large models, especially GPT-4 and beyond, exhibit strong zero-shot capabilities (“emergent” reasoning abilities)[5]. This means some techniques that were crucial for smaller models (such as heavy few-shot prompting) may yield diminishing returns on the giants. The focus shifts to self-refinement and tool use even with these models, because no matter how good the base model is, checking its work or providing it tools further improves reliability. In other words, larger models reduce the need for certain prompt tricks but also enable more advanced prompt pipelines (since they can follow complex prompts accurately).
Conclusion
Prompt engineering has proven to be a powerful lever for improving LLM reliability. By thoughtfully designing how we “ask” the model to behave – whether through examples, eliciting its reasoning, providing it external information, or guiding it to critique itself – we can address many of the failure modes of these models. Academic research from 2023–2025 provides strong empirical evidence for these techniques (with performance gains ranging from modest to striking), and industry practice has incorporated many of them into state-of-the-art systems (e.g., ChatGPT uses instruction tuning plus CoT prompting internally for certain queries, and Bing Chat uses retrieval-augmented prompts for factual questions). As models continue to evolve, prompt engineering will likely remain a key piece of the puzzle in ensuring accurate, consistent, safe, and robust AI behavior[18]. The combination of advanced models with advanced prompting techniques is moving us closer to highly reliable AI assistants.
References
[1] [3] Few-Shot vs. Zero-Shot Learning: Efficiency Trade-offs in NLP Tasks. https://www.researchgate.net/publication/390805777_Few-Shot_vs_Zero-Shot_Learning_Efficiency_Trade-offs_in_NLP_Tasks
[2] Applying large language models and chain-of-thought for automatic ... https://www.sciencedirect.com/science/article/pii/S2666920X24000146
[4] [5] Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot. https://arxiv.org/html/2506.14641v1
[6] [7] [8] [9] [10] [11] [13] [15] [16] [19] [20] [21] [22] [31] [34] [36] [37] [38] [46] The Prompt Report: A Systematic Survey of Prompting Techniques (arXiv:2406.06608). https://ar5iv.labs.arxiv.org/html/2406.06608
[12] [14] [17] Self-Consistency Improves Chain of Thought Reasoning in Language Models (OpenReview). https://openreview.net/forum?id=1PL1NIMMrw
[18] Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models. https://arxiv.org/html/2508.03860
[23] [24] [25] [26] [27] [39] Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4 (arXiv:2312.16171). https://ar5iv.labs.arxiv.org/html/2312.16171v1
[28] [29] [30] [32] [33] Self-Refine: Iterative Refinement with Self-Feedback (OpenReview). https://openreview.net/forum?id=S37hOerQLB
[35] Reducing Hallucination in Structured Outputs via RAG (Prompt Engineering Guide). https://www.promptingguide.ai/research/rag_hallucinations
[40] [41] [42] On the Worst Prompt Performance of Large Language Models (NeurIPS 2024). https://neurips.cc/virtual/2024/poster/95497
[43] State of What Art? A Call for Multi-Prompt LLM Evaluation (TACL). https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00681/123885/State-of-What-Art-A-Call-for-Multi-Prompt-LLM
[44] [45] Constitution or Collapse? Exploring Constitutional AI with Llama 3-8B. https://arxiv.org/html/2504.04918v1
Techniques by Impact
Reasoning tasks:
- Tree-of-Thought: +tens of percentage points
- Maieutic: +20% on commonsense QA
- Complex CoT: +15% with self-consistency
- Contrastive CoT: +10% over vanilla
Factual accuracy:
- Chain-of-Verification: reduces hallucinations
- RAG: grounds in real facts
- Verify-and-Edit: improves multi-hop
Robustness:
- Self-Consistency: resolves ambiguity
- Ensemble methods: reduce variance
- Active-Prompt: beats standard methods
Efficiency:
- Zero-Shot CoT: no examples needed
- Implicit RAG: internal retrieval
- Self-Refine: no fine-tuning required
----
Foundational Prompting Techniques
• Zero-Shot Prompting: A prompting paradigm where the model is given an instruction or query without any example demonstrations. The LLM must rely entirely on its pre-trained knowledge to perform the task[1]. Mentioned in surveys such as the Prompt Report and Vatsal & Dubey (2024).
• Few-Shot Prompting (In-Context Learning): Providing a handful of exemplars (input-output pairs) in the prompt to illustrate the task, enabling the model to infer the pattern[2]. This technique, popularized by GPT-3, lets the model “learn” from the prompt context without parameter updates. Discussed in all prompt engineering surveys (e.g., Vatsal & Dubey 2024[3]).
• Basic/Standard Prompting: Using a direct query without special formatting or strategy – essentially the “vanilla” prompt. This serves as a baseline where no prompt engineering tricks are applied[4]. Identified as a baseline method in Vatsal & Dubey (2024).
Persona, Style, and Tone Prompting
• Role (Persona) Prompting: Assigning the LLM a specific role or persona in the prompt (e.g. “You are a travel guide...”). This can steer the model’s outputs to adopt the perspective or expertise of that role[5]. Appears in the Prompt Report and others as a way to influence style and content.
• Style/Tone Prompting: Specifying the desired style, genre, or tone of the output (e.g. “Answer in a casual tone” or “Write in the style of a Shakespearean sonnet”). This guides the model to produce output with certain stylistic characteristics[6]. Noted in the Prompt Report as a variant of persona prompting to shape outputs.
• Emotion Prompting: Including emotionally charged or related phrases in the prompt to induce a certain affective response. For example, telling the model “This is critical for my career...” can coax more emotionally-aware or emphatic answers[7]. Mentioned in the Prompt Report as a niche technique to influence model responses.
In-Context Example Selection Strategies
• Exemplar Ordering & Selection: The choice and order of few-shot examples can significantly impact performance. Research shows that reordering exemplars can cause accuracy swings from below 50% to above 90% on some tasks[8]. Best practice is to use relevant examples: e.g. nearest neighbors to the query (as in k-Nearest Neighbor prompts[9]) or ensuring diversity via methods like Vote-K (which selects candidates that are similar yet diverse)[10]. Discussed in the Prompt Report and Vatsal & Dubey (2024) as critical design factors.
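The nearest-neighbor selection idea above can be sketched as follows. This is a toy illustration only: real kNN prompting uses dense sentence embeddings, whereas the `overlap` function here is a crude word-overlap stand-in, and the exemplar pool is invented.

```python
def overlap(a, b):
    # Toy similarity via word overlap (Jaccard); real systems would
    # compare dense embeddings of the query and each exemplar.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def select_exemplars(query, pool, k=2):
    # Pick the k pool items most similar to the query (kNN-style selection),
    # which then become the few-shot demonstrations in the prompt.
    return sorted(pool, key=lambda ex: overlap(query, ex["q"]), reverse=True)[:k]

pool = [
    {"q": "What is the capital of France?", "a": "Paris"},
    {"q": "What is 2 + 2?", "a": "4"},
    {"q": "What is the capital of Japan?", "a": "Tokyo"},
]
chosen = select_exemplars("What is the capital of Spain?", pool)
```

For a capital-city query, the two capital-city exemplars rank highest and the arithmetic one is dropped, which is exactly the relevance effect the selection literature exploits.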
• Self-Generated Examples (SG-ICL): Automatically generating synthetic exemplars using an LLM itself when no labeled data is available[11]. The model produces example questions and answers which are then used in-context. This can outperform zero-shot prompting, though synthetic examples may be of lower quality than real ones. Covered in the Prompt Report.
• Active-Prompt (Active Example Selection): An active-learning-inspired approach that identifies the most informative examples to annotate for few-shot prompts[12][13]. The LLM is first prompted to solve many training queries (with its reasoning); then the instances with highest uncertainty or disagreement are selected for human annotation and added as exemplars. This yields task-specific example sets and has been shown to beat standard few-shot, self-consistency, and auto-CoT methods on reasoning tasks[14]. Introduced in 2023 (Diao et al.) and summarized by Vatsal & Dubey.
• Synthetic Prompting: Augmenting a hand-crafted few-shot prompt with additional synthetically generated examples. In one approach, the LLM first generates a new query from a known reasoning chain (backward step), then answers it with a reasoning chain (forward step), adding the most complex or informative synthetic Q&A pairs to the prompt[15][16]. This can improve performance by enriching the prompt with varied examples (up to ~15% gains on reasoning benchmarks). Described by Shao et al. (2023) and included in Vatsal & Dubey’s survey.
• Prompt Mining (Template Optimization): Discovering or learning an optimal prompt phrasing from data. For example, analyzing large text corpora to find effective “middle words” or formats (beyond the standard “Q: ... A: ...”) that elicit better responses[17]. By using more frequent or context-appropriate prompt formats, one can boost performance without changing model parameters. Referenced in the Prompt Report (Jiang et al. 2020 method).
• Advanced Example Techniques: Researchers have proposed sophisticated methods to optimize example selection. For instance, LENS (iterative filtering of exemplars), UDR (Unsupervised Dense Retrieval of exemplars), and other strategies that use reinforcement learning or embeddings to pick exemplars[18]. These aim to automate finding the best demonstrations for a given test query. (These specific methods are noted in the Prompt Report as emerging techniques.)
Step-by-Step Reasoning – Chain-of-Thought and Variants
• Chain-of-Thought (CoT) Prompting: A seminal prompting strategy where the model is prompted (often via an example) to produce a step-by-step reasoning process before giving the final answer[19]. By walking through intermediate reasoning in natural language, the model dramatically improves on complex tasks like math word problems and logical reasoning[20]. Featured in every survey (originating from Wei et al. 2022[19]).
• Zero-Shot CoT: A variant of CoT with no exemplars, instead appending a thought-inducing phrase to the query. For example: “Let’s think step by step.” encourages the model to generate a reasoning chain on the fly[21]. Despite requiring no examples, zero-shot CoT often boosts accuracy by prompting the model’s internal chain-of-thought (finding an optimal phrase or “trigger” is key[21]). Covered in the Prompt Report (Kojima et al. 2022) as an attractive, task-agnostic approach.
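The two-stage mechanics of zero-shot CoT (elicit a chain with the trigger phrase, then extract the answer) can be sketched as below. `toy_ask` is a canned stand-in for a real model call, included only so the two stages are runnable.

```python
def zero_shot_cot(question, ask):
    # Stage 1: the trigger phrase elicits a free-form reasoning chain.
    reasoning = ask(f"Q: {question}\nA: Let's think step by step.")
    # Stage 2: a second prompt extracts the final answer from that chain.
    return ask(
        f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
        "Therefore, the answer is"
    )

def toy_ask(prompt):
    # Stand-in model with canned responses for the two stages.
    if "Therefore, the answer is" in prompt:
        return "6"
    return "Two bags of three apples each make 3 + 3 = 6 apples."
```

The answer-extraction stage matters in practice because the raw reasoning chain is prose; without it, downstream code would have to parse the answer out of free text.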
• Step-Back Prompting: A two-step reasoning prompt where the model is first asked a high-level or conceptual question about the problem before solving it. This “step back” to consider relevant concepts primes better reasoning. The model’s second step then tackles the original question with improved focus[22]. This method showed significant gains on benchmarks by first eliciting key facts/concepts, then solving. Proposed by Zheng et al. (2023) and noted in the Prompt Report.
• Analogical Prompting: Prompting the model to generate and solve an analogous example problem before answering the real question[23]. By reasoning through a similar situation or analogy (often automatically generated), the model can transfer the inferred solution to the original problem. This has improved tasks like math and code generation by leveraging parallel examples[23]. Yasunaga et al. (2023) introduced this, cited in the Prompt Report and Vatsal & Dubey.
• Thread-of-Thought (ThoT): A method for handling very long or complex contexts by maintaining an “unbroken thread” of reasoning through the content[24]. The prompt guides the LLM to process a lengthy document in parts: first summarize or analyze sections, then iteratively refine and answer using those summaries. ThoT helps filter irrelevant information and retain pertinent facts across long texts[25][26]. Proposed by Zhou et al. (2023) and included in Vatsal & Dubey’s survey.
• Tabular CoT (Tab-CoT): A specialized variant where the chain-of-thought is formatted as a table or grid[27]. For example, each reasoning step could be a row with columns for “step description” and “result”. By structuring reasoning in a table, the LLM’s outputs become more organized and it can reduce reasoning errors[28]. (Jin & Lu 2023, noted in Prompt Report – particularly useful for problems like complex calculations that benefit from structured working.)
• Complex CoT: Emphasizing harder examples in few-shot CoT prompts. Fu et al. (2022) showed that using more complex training exemplars (with many reasoning steps) can lead the model to generalize better on difficult questions[29][30]. Complex CoT also involves, at inference, sampling multiple reasoning chains and selecting answers from the longest or “most complex” chains (on the premise that more elaborate reasoning tends to be correct)[31]. Listed in Vatsal & Dubey (2024) as an enhancement over standard CoT.
• Logical Chain-of-Thought (LogiCoT/LoT): Incorporating formal logic consistency checks into the reasoning chain. The model not only reasons stepwise but also verifies each step against logical principles (e.g. using reductio ad absurdum to check for contradictions)[32]. By allowing the LLM to correct its reasoning when a logical step is invalid, LogiCoT improves zero-shot reasoning accuracy on math, commonsense, and logical inference tasks[33]. Proposed by Zhao et al. (2023) – highlighted in Sahoo et al. and Vatsal & Dubey.
• Contrastive CoT (and Self-Consistency): A technique where the prompt includes both correct and incorrect reasoning examples to teach the model what not to do. By contrasting positive and negative chains-of-thought, the LLM learns to avoid mistakes. Chia et al. (2023) showed that providing a few incorrect reasoning demonstrations alongside correct ones yields ~10% accuracy gains over vanilla CoT[34]. A related idea is Contrastive Self-Consistency, where the model generates multiple reasoning paths and learns from both successful and failed attempts, outperforming standard self-consistency by 15% on math problems[35]. Covered in Vatsal & Dubey (2024).
Problem Decomposition and Planning
• Least-to-Most Prompting: A divide-and-conquer approach that explicitly breaks a complex problem into smaller sub-problems and solves them in sequence[36][37]. The model is first prompted only to plan or list sub-questions (without answering them), then prompted to answer each sub-question one by one, building up to the final answer[36]. This method has shown large improvements on tasks requiring multi-step reasoning or symbolic manipulation by reducing them to simpler chunks[37]. Introduced by Zhou et al. (2022) and cited across surveys.
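The two phases of least-to-most prompting (plan sub-questions first, then answer them in order while feeding earlier answers forward) can be sketched like this. `toy_ask` is a deterministic stand-in for a model call, and the lap-running example is invented for illustration.

```python
def least_to_most(question, ask):
    # Phase 1: ask only for sub-questions, not answers.
    subs = ask(f"Decompose into sub-questions, one per line:\n{question}").splitlines()
    # Phase 2: answer each sub-question in order, appending each
    # answer to the context so later steps can build on it.
    context = question
    answer = ""
    for sub in subs:
        answer = ask(f"{context}\nSub-question: {sub}\nAnswer:")
        context += f"\n{sub} -> {answer}"
    return answer  # the last sub-answer resolves the original question

def toy_ask(prompt):
    # Canned stand-in model for the three calls made above.
    if prompt.startswith("Decompose"):
        return "How long is one lap?\nHow long are three laps?"
    if "three laps" in prompt:
        return "30 minutes"
    return "10 minutes"
```

The accumulated `context` is the crux of the method: each sub-answer becomes part of the prompt for the next sub-question, so the final step sees the whole solved chain.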
• Decomposed Prompting (Decomp / DecoMP): A framework where an initial query is recursively decomposed into simpler queries, possibly handled by separate specialized prompts or models[38][39]. For example, a complex question is split into two easier questions; those might even be split further or sent to external tools (web search, calculator), and their answers are then combined. Khot et al. (2022) showed this outperforms straight CoT on multi-hop reasoning, by delegating sub-tasks to prompts best suited for them[39]. Discussed in Vatsal & Dubey’s survey.
• Plan-and-Solve (PS): Another two-stage prompting strategy: first the model is asked to devise a plan (outline steps) to solve the problem, then it executes the plan step-by-step[40][41]. An enhanced version “PS+” adds even more explicit instruction to make the plan detailed[40]. This addresses CoT shortcomings like skipped steps or calculation errors by ensuring the model has a thorough plan. Experiments showed >5% accuracy boosts over standard CoT on math and commonsense reasoning tasks with GPT-3.5[42]. From Wang et al. (2023), noted in Vatsal & Dubey.
• Self-Ask (Question Decomposition): A prompting tactic where the model decides if a complex query requires intermediate questions, then generates and answers those sub-questions before giving a final answer[43]. Essentially, the LLM asks itself: “What do I need to know first?” – answers that, and uses it to resolve the original query. This iterative Q&A approach helps handle multi-hop questions by ensuring the model gathers needed facts stepwise. Proposed by Press et al. (2022) and described in the Prompt Report.
• Recursion-of-Thought: A recursive variant of CoT where if the model encounters a particularly hard sub-problem during its chain-of-thought, it spawns a new prompt (a sub-query) to solve that, then inserts the result back into the original reasoning[44]. In other words, the chain-of-thought can call itself for sub-tasks, enabling hierarchical problem solving beyond a single-pass linear chain. This has shown improved performance on complex arithmetic and algorithmic tasks[44]. Referenced in the Prompt Report (Lee & Kim, 2023).
• Tree-of-Thought (ToT): Extending prompting to a search tree of thoughts instead of one linear chain[45][46]. At each step, the model explores multiple possible next steps (branches), creating a tree of reasoning paths. A heuristic or value function (possibly using the LLM itself) evaluates partial solutions to decide which branches to expand or prune. By systematically exploring different reasoning paths, ToT finds better solutions for problems requiring planning or search (improving success rates by tens of percentage points over standard CoT in some puzzles)[47][48]. Introduced by Yao et al. (2023) and highlighted as an advanced technique in surveys.
• Maieutic Prompting: A complex recursive reasoning approach that builds a tree of hypotheses and uses a form of abductive reasoning. The model generates multiple possible propositions or answers, then asks itself questions to eliminate those that lead to contradictions, gradually narrowing down to a consistent answer[49][50]. This “Socratic” process encourages the model to reflect and cross-verify facts. Jung et al. (2022) demonstrated up to 20% accuracy gains on commonsense QA over basic CoT by using maieutic prompting to systematically rule out inconsistent answers[51]. Discussed in Vatsal & Dubey (2024) as an advanced reasoning strategy.
Ensemble and Consensus Methods
• Self-Consistency: Instead of relying on a single chain-of-thought, this method samples multiple reasoning outputs (by prompting the LLM several times with randomness) and then chooses the most common answer among them[52]. The idea is that different valid reasoning paths should converge on the same answer, so the majority vote is likely correct[53]. Self-Consistency notably improves performance on math, commonsense, and logic tasks by resolving ambiguity and filtering out outlier responses[53]. Proposed by Wang et al. (2022), featured in all surveys as a key technique.
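The sample-then-vote procedure can be sketched as follows; `toy_sample` is a stand-in for sampling the model at nonzero temperature (here one of five simulated chains contains an arithmetic slip), and the `answer is` extraction convention is an assumption for illustration.

```python
from collections import Counter

def self_consistency(question, sample, n=5):
    # Draw n reasoning chains, extract each final answer, majority-vote.
    answers = []
    for i in range(n):
        chain = sample(question, seed=i)
        answers.append(chain.rsplit("answer is", 1)[-1].strip(" ."))
    return Counter(answers).most_common(1)[0][0]

def toy_sample(question, seed):
    # Stand-in sampler: one chain out of five makes an arithmetic slip.
    return ("3 + 4 = 8, so the answer is 8." if seed == 2
            else "3 + 4 = 7, so the answer is 7.")
```

The single faulty chain is outvoted by the four consistent ones, which is the whole premise: different valid reasoning paths converge on the same answer, so the majority filters out outliers.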
• Ensemble Refinement (ER): An iterative two-stage extension of CoT and self-consistency[54]. In stage 1, multiple CoT reasoning chains and answers are generated (via a few-shot prompt at various decoding temperatures). In stage 2, those chains are fed back into the model with the original question, prompting it to produce a refined answer (taking the earlier attempts into account). This second stage can be repeated, and finally a majority vote selects the answer[55]. ER further boosts accuracy over standard self-consistency and CoT, as shown by Singhal et al. (2023) on QA benchmarks[56]. Summarized in Vatsal & Dubey (2024).
• Demonstration Ensembling (DENSE): Using multiple distinct few-shot prompts in parallel and then aggregating their outputs[57]. For example, one can create several prompt variants, each with a different subset of exemplars or prompt wording, have the model answer the query with each, and then vote or combine the results. This reduces variance from any single prompt choice[57]. Proposed by Khalifa et al. (2023), noted in the Prompt Report.
• Mixture-of-Reasoning-Experts (MoRE): Crafting a set of specialized prompts, each expert prompt geared toward a certain reasoning type or subtask, and then letting them all answer the question[58]. For instance, one prompt is optimized for factual recall, another for math reasoning, another for commonsense, etc. The final answer is chosen from the expert whose response is deemed best (by an agreement score or selection function)[58]. This modular approach can outperform a one-prompt-fits-all strategy. Introduced by Si et al. (2023), cited in the Prompt Report.
• Max Mutual Information Prompting: Generating multiple prompt templates with different styles or example sets, and selecting the prompt that maximizes mutual information with the correct output[59]. In practice, this means choosing the prompt under which the model is most confident or consistent in its answer. Sorensen et al. (2022) used this to pick an optimal prompt out of many candidates, effectively tuning the prompt per query[59]. Mentioned in the Prompt Report.
• “Diverse” Prompt Ensembles (DiVeRSe): Generating diverse reasoning paths from multiple prompts and scoring each path to pick the best answer[60]. In DiVeRSe, one creates different prompts for the same problem, uses self-consistency on each to get sets of reasoning chains, then evaluates the chains (e.g. longer chains may indicate better reasoning) and chooses a final answer from the highest-scoring chain[60]. This method further reduces errors by exploring prompt and reasoning space widely. Proposed by Li et al. (2023), noted in the Prompt Report.
• Consistency-Based Self-Adaptive Prompting (COSP): A technique that learns from the model’s own consistency. First, run zero-shot CoT with self-consistency on a sample of training questions; then take the high-confidence (majority-agreed) answers and use those Q&A pairs as exemplars in a new prompt[61]. Essentially, the model’s most reliable outputs bootstrap a few-shot prompt. This adaptive prompt is then used on the test question (with another round of self-consistency)[61]. COSP was shown to improve performance by constructing exemplars the model itself agrees on. From Wan et al. (2023), included in the Prompt Report.
• Universal Self-Adaptive Prompting (USP): An extension of COSP aiming to generalize across tasks[62]. USP uses unlabeled data to generate candidate exemplars and employs a more complex scoring function (beyond simple majority) to select them. Importantly, USP doesn’t rely on self-consistency in the final stage, making it more broadly applicable[63]. Also from Wan et al. (2023), noted in Prompt Report.
• Prompt Paraphrasing (Ensemble Augmentation): Simply rephrasing the prompt instruction in different ways to create an ensemble of prompts asking the same thing[64]. For instance, one might prompt the model with several semantically equivalent questions (“Explain why X happens” vs “What causes X?”), then combine the answers. This acts as a data augmentation for prompting and can be used in voting or as additional context to the model[64]. Technique noted by Jiang et al. (2020) and cited as a basic ensembling method.
Self-Critique and Verification Techniques
• Self-Calibration: After getting an answer from the model, this method has the model reflect on that answer’s correctness[65]. For example, append to the prompt: “Here was your answer – do you think it’s correct?” The model’s self-evaluation can be used to gauge confidence or decide whether to trust the answer[66]. Kadavath et al. (2022) showed that having the model judge its own output helps identify mistakes (useful for determining if we need a second attempt or human intervention)[65].
• Self-Refine: An iterative refinement framework where the model is asked to critique and improve its own output[67]. The process: the LLM produces an initial answer; then it’s prompted to provide feedback on that answer (point out errors or shortcomings); finally, it gets another prompt to revise the answer according to that feedback[67]. This loop can repeat multiple times. Self-Refine has led to better quality in reasoning, coding, and generation tasks by leveraging the model’s capability to self-correct[67]. Proposed by Madaan et al. (2023), covered in the Prompt Report.
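The draft, critique, revise loop can be sketched as three prompt roles driven by one model. This is a minimal sketch: `toy_ask` is a canned stand-in whose critic accepts the answer once a (hypothetical) grammar fix has been applied.

```python
def self_refine(task, ask, rounds=2):
    draft = ask(f"Task: {task}\nDraft an answer:")
    for _ in range(rounds):
        feedback = ask(f"Task: {task}\nAnswer: {draft}\nCritique this answer:")
        if "no issues" in feedback.lower():
            break  # stop early once the critic is satisfied
        draft = ask(f"Task: {task}\nAnswer: {draft}\n"
                    f"Feedback: {feedback}\nRevise the answer:")
    return draft

def toy_ask(prompt):
    # Canned stand-in model playing drafter, critic, and reviser.
    if "Draft" in prompt:
        return "The mitochondria is the powerhouse."
    if "Critique" in prompt:
        return ("No issues." if "mitochondrion" in prompt
                else "Use the singular form 'mitochondrion'.")
    return "The mitochondrion is the powerhouse."
```

Note that the same `ask` function serves all three roles; only the prompt framing changes, which is what makes Self-Refine attractive as a pure prompting technique with no fine-tuning.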
• Reversing Chain-of-Thought (RCoT): A verification strategy where the model tries to reconstruct the question from its own answer[68]. The LLM generates what it thinks the original question was (given its answer), and then compares that to the actual question. Discrepancies indicate potential errors in understanding. Detected mismatches are fed back as signals for the model to revise its answer[68]. This reverse-checking approach helped catch reasoning mistakes in experiments (Xue et al. 2023). Included in the Prompt Report.
• Self-Verification: Here the model generates multiple solution candidates (often via CoT), then attempts to verify each solution by testing whether the solution fits all aspects of the question[69]. One implementation: mask out parts of the question and ask the model to predict them from the candidate answer – if it fails, the answer might be wrong[69]. In effect, the model cross-examines its answers for consistency with the question. Weng et al. (2022) showed this method improves accuracy on several reasoning benchmarks by filtering out inconsistent answers[69].
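The masking variant can be sketched as follows. This is a simplified illustration assuming a hypothetical `llm` callable and exact-match scoring; the actual method scores candidates over multiple masked conditions rather than one.

```python
def self_verify(llm, question, entity, candidates):
    """Filter candidate answers by masking `entity` in the question and
    asking the model to recover it from each candidate answer.

    A candidate that lets the model reconstruct the masked detail is
    more likely to be consistent with the question.
    """
    masked = question.replace(entity, "[MASK]")
    survivors = []
    for cand in candidates:
        probe = (
            f"Question with a hidden detail: {masked}\n"
            f"Answer: {cand}\n"
            "What value was hidden at [MASK]? Reply with the value only."
        )
        if llm(probe).strip() == entity:
            survivors.append(cand)
    return survivors
```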
• Chain-of-Verification (CoVe): A multi-step verification chain to correct factual errors or hallucinations[70][71]. CoVe has the model: (1) produce an initial answer; (2) generate a list of verification questions that would help check that answer; (3) answer each of those sub-questions; and (4) use those answers to revise the original answer if needed[70][72]. This approach effectively forces the model to fact-check itself. Dhuliawala et al. (2023) found CoVe can reduce hallucinations and outperform basic prompting and even CoT on knowledge-intensive QA tasks[71][73]. Included in multiple surveys (Vatsal & Dubey, Prompt Report).
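The four CoVe steps map directly onto a small pipeline. A minimal sketch, assuming a hypothetical `llm(prompt) -> str` callable and invented prompt phrasings; Dhuliawala et al. also study variants in which verification questions are answered without seeing the draft, to avoid copying its errors.

```python
def chain_of_verification(llm, question):
    """CoVe sketch: answer, plan verification questions, answer them, revise."""
    # (1) Initial answer.
    baseline = llm(f"Q: {question}\nA:")
    # (2) Plan verification questions for the draft.
    plan = llm(
        f"Q: {question}\nDraft answer: {baseline}\n"
        "List 2-3 short questions that would verify facts in the draft, "
        "one per line."
    )
    checks = [q.strip() for q in plan.splitlines() if q.strip()]
    # (3) Answer each verification question independently.
    verified = [(q, llm(f"Q: {q}\nA:")) for q in checks]
    evidence = "\n".join(f"{q} -> {a}" for q, a in verified)
    # (4) Revise the draft in light of the verification Q&A.
    return llm(
        f"Q: {question}\nDraft answer: {baseline}\n"
        f"Verification Q&A:\n{evidence}\n"
        "Write a final answer consistent with the verification Q&A."
    )
```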
• Verify-and-Edit (VE): Post-processing the model’s reasoning chain for factual accuracy. In VE (Zhao et al. 2023), the model first uses self-consistency to identify if any of its CoT steps are uncertain or likely incorrect[74]. If so, it retrieves supporting facts (e.g. from an external knowledge base or context) and edits the reasoning chain to align with those facts[75]. The final answer is produced from the corrected reasoning. This method improved performance on multi-hop QA and truthfulness benchmarks by fixing reasoning errors on the fly[76]. Summarized in Vatsal & Dubey’s survey.
• Cumulative Reasoning: An approach where the model generates a series of potential steps or claims and evaluates them one by one, accepting those that seem valid and rejecting those that don’t, until arriving at a final answer[77]. Essentially, the LLM builds an answer cumulatively, checking each incremental step for consistency or correctness before proceeding[78]. If an inconsistency arises, it can loop back or halt. This method, noted by Zhang et al. (2023), helps navigate logical inference problems by not committing to a full chain that might have a bad step – instead, the chain is validated as it grows[79].
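The propose-validate-accept loop can be sketched as below. This is an illustrative skeleton under stated assumptions (hypothetical `llm` callable, a simple yes/no validity check, and a `DONE` sentinel), not the authors' implementation, which uses separate proposer and verifier roles.

```python
def cumulative_reasoning(llm, question, max_steps=5):
    """Grow a reasoning chain one step at a time, keeping only steps
    the model itself judges valid."""
    accepted = []
    for _ in range(max_steps):
        context = "\n".join(accepted)
        step = llm(
            f"Problem: {question}\nAccepted steps so far:\n{context}\n"
            "Propose the next reasoning step, or say DONE if finished."
        ).strip()
        if step == "DONE":
            break
        verdict = llm(
            f"Problem: {question}\nProposed step: {step}\n"
            "Is this step valid? Reply 'yes' or 'no'."
        )
        # Only validated steps join the chain; invalid ones are dropped.
        if verdict.strip().lower().startswith("yes"):
            accepted.append(step)
    return llm(
        f"Problem: {question}\nValidated steps:\n"
        + "\n".join(accepted) + "\nState the final answer."
    )
```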
• Metacognitive Prompting: A structured five-stage prompting process inspired by human meta-cognition (thinking about one’s own thinking)[80][81]. The model is guided through: 1) understanding the problem, 2) giving an initial answer or judgment, 3) critiquing that initial answer, 4) revising or finalizing the answer with an explanation, and 5) evaluating its confidence in the answer[82]. By explicitly prompting the LLM to self-reflect and assess its confidence, this technique (Wang & Zhao 2023) consistently outperformed standard CoT and Plan-and-Solve on tasks like NLI, QA, and extraction[83][84]. Included in Vatsal & Dubey’s list as “Metacognitive Prompting.”
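Since all five stages run in a single structured prompt, the technique reduces to a template. The wording below is an illustrative paraphrase of the five stages, not the exact template from Wang & Zhao.

```python
METACOGNITIVE_TEMPLATE = """\
{question}

Work through the following stages, labelling each:
1. Understanding: restate the problem in your own words.
2. Initial judgment: give a preliminary answer.
3. Critique: assess your preliminary answer for errors.
4. Final decision: give the revised answer with a brief explanation.
5. Confidence: rate your confidence in the final answer (low/medium/high).
"""

def metacognitive_prompt(question):
    """Wrap a question in the five-stage metacognitive scaffold."""
    return METACOGNITIVE_TEMPLATE.format(question=question)
```

The stage labels double as parse anchors: downstream code can split the completion on "Final decision:" and "Confidence:" to extract the answer and a self-reported confidence.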
Retrieval and Knowledge Augmentation
• Retrieval-Augmented Generation (RAG): A technique that supplements the LLM with external knowledge by retrieving relevant text from a database or the web and inserting it into the prompt[85]. The prompt typically includes both the user query and the retrieved context (e.g., wiki articles, documents) so the model can ground its answer in up-to-date or detailed information. RAG is powerful for knowledge-intensive tasks and helps reduce hallucination by giving the model real facts to draw from[85]. Referenced in surveys as a common approach (first popularized by Lewis et al. 2020).
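A minimal RAG prompt assembly looks like the sketch below. Both `llm` and `retrieve` are hypothetical stand-ins (for a model API and a vector store or search index, respectively); the instruction wording is illustrative.

```python
def rag_answer(llm, retrieve, query, k=3):
    """Minimal RAG: fetch top-k passages and ground the answer in them.

    `retrieve(query, k)` returns a list of text snippets; `llm` maps a
    prompt string to a completion string.
    """
    passages = retrieve(query, k)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer using only the context below; cite passage numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return llm(prompt)
```

Numbering the passages and asking for citations makes the grounding checkable: an answer citing "[2]" can be audited against passage 2, which is a cheap hallucination signal.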
• Implicit RAG: Instead of an external search engine, the model itself is asked to perform retrieval on a given context. For example, given a long passage as context, the prompt might say: “Extract the 3 most relevant sections from this text and then answer the question.”[86]. The LLM thus internally chooses which chunks of the provided text to focus on (mimicking retrieval) and ignores distractors[86]. This was shown to achieve state-of-the-art results on certain QA tasks by efficiently zeroing in on the answer-bearing parts of context[87]. Proposed by Vatsal et al. (2024) and documented in their survey.
• Knowledge-Augmented Prompting: (General category) Many surveyed works incorporate additional knowledge or context into prompts beyond the task description. Examples include giving definitions of terms (for specialized domains) or adding related facts. One instance is Basic with Term Definitions, where medical term definitions were appended to prompts for a clinical task[88]. Interestingly, Vatsal et al. report that adding static definitions didn’t always help, as the narrow added text could conflict with the model’s broader knowledge[89]. On the other hand, dynamically retrieving relevant knowledge (via RAG or tools) tends to be more effective.
Tool-Use and Multi-Step Agentic Prompting
• ReAct (Reason+Act): A prompting framework that interleaves reasoning with actions[90]. The LLM is prompted to produce not just thoughts, but also actions (like calls to external tools or queries). A ReAct prompt might have the model output a thought (“I should look up X”) followed by an action (“[SEARCH] X”), and then the observation from that action, and so on[91]. By explicitly prompting the model to act (e.g., perform a lookup, calculation, or API call) and then continue reasoning, ReAct enables decision-making and tool use. Yao et al. (2022) showed that ReAct can solve complex decision-making tasks and even outperformed pure reasoning (CoT) on certain benchmarks[92]. Covered in Vatsal & Dubey’s survey as a prominent approach for interactive prompts.
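The thought/action/observation cycle can be sketched as a driver loop. This is a simplified illustration assuming a hypothetical `llm` callable and a bracketed action syntax (`[SEARCH] query`, `FINISH[answer]`) invented for the example; real ReAct prompts also include few-shot trajectories showing the format.

```python
import re

def react_loop(llm, tools, question, max_turns=5):
    """Minimal ReAct loop: alternate Thought / Action / Observation.

    `tools` maps an action name (e.g. "SEARCH") to a function of one
    string argument; `llm` continues the growing transcript.
    """
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        step = llm(transcript)  # model emits a Thought plus an Action
        transcript += step + "\n"
        final = re.search(r"FINISH\[(.*?)\]", step)
        if final:
            return final.group(1)
        act = re.search(r"\[(\w+)\]\s*(.+)", step)
        if act and act.group(1) in tools:
            # Run the tool and feed its result back as an Observation.
            obs = tools[act.group(1)](act.group(2))
            transcript += f"Observation: {obs}\n"
    return None
```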
• Tool-Augmented Prompting: More generally, prompts can be designed for agent behaviors, where the LLM uses tools or APIs during its response. For example, Toolformer (Schick et al. 2023) augments the model’s prompt with API call demonstrations (like invoking a calculator or a calendar), enabling the LLM to insert a tool-use step when needed[85][93]. Similarly, frameworks like Gorilla and others provide the model with a list of tools and a format to call them. In surveys, this is treated as an extension of prompting – the prompt includes instructions or examples of using external information sources, effectively turning the LLM into an agent that can fetch information and then respond. (This category is closely related to RAG and ReAct; the Prompt Report’s taxonomy includes tool-using agents as an extension of prompting[85].)
• Program-Aided Prompts (Code as Reasoning): Several techniques integrate code execution into the prompt. For instance, Program-of-Thoughts (PoT) prompts the model to generate a Python program as the intermediate reasoning step, which is then executed to get the answer[94]. This offloads calculation or logical inference to a reliable executor. Another, PAL (Program-Aided Language Models), has the LLM produce a mix of natural language and code statements and run them with a Python interpreter for the final result[95][96]. These approaches dramatically improved math and data processing tasks, since the model doesn’t have to do arithmetic or table manipulation itself – it writes code to do it. PoT (Chen et al. 2022) and PAL (Gao et al. 2023) are both covered in the surveys[94][95].
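The PoT/PAL pattern of offloading computation to an interpreter can be sketched as below, assuming a hypothetical `llm` callable and the convention (invented here) that the model stores its result in a variable named `answer`. Note that executing model-generated code with `exec` requires sandboxing in any real deployment.

```python
def program_of_thoughts(llm, question):
    """PoT sketch: ask for Python code, execute it, read `answer`."""
    code = llm(
        f"Question: {question}\n"
        "Write Python code that computes the result and stores it in a "
        "variable called `answer`. Output only code."
    )
    scope = {}
    exec(code, scope)  # offload arithmetic/logic to the interpreter
    return scope["answer"]
```

The reliability gain comes from the division of labor: the model only has to translate the question into code, and the interpreter guarantees the arithmetic is exact.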
• Chain-of-Code (CoC): An extension of CoT for code-oriented reasoning. The model is prompted not only to write code as part of its solution, but also to simulate the code’s output mentally[97]. For example, CoC might involve the LLM writing pseudo-code and also describing what each line would output or do. By “executing” code within the prompt (even if an external interpreter can’t actually run it), the model can handle programming tasks that involve reasoning about code behavior[97]. CoC has shown superior performance to plain CoT on tasks ranging from recommender systems to logical reasoning, by forcing a more rigorous, step-by-step code-like logic in the prompt[98]. Proposed by Li et al. (2023), noted in Vatsal & Dubey.
• Binder: A training-free neuro-symbolic prompting technique that maps natural language queries to a program (in a target language like Python or SQL) via the LLM, and then executes that program externally[99][100]. The prompt provides a few examples of how a query can be “bound” to a piece of code. The LLM acts both as the parser (translating the query to code) and as an executor for sub-tasks the program delegates back to it, while a standard interpreter runs the program itself. Binder was shown to handle complex table QA and fact-verification tasks without any fine-tuning, outperforming prior methods that required training[101][102]. From Cheng et al. (2022), included in Vatsal & Dubey’s survey.
Other Specialized Prompting Techniques
• Chain-of-Table: A prompting method tailored for tabular data reasoning[103][104]. It guides the LLM through operations on a table in multiple steps: (1) plan a sequence of table operations (e.g. “filter rows by X, then sort by Y”), (2) have the model generate the outcome of each operation, and (3) produce the final answer from the transformed table[103][105]. By prompting the model to treat table queries in a stepwise manner (much like how a person would manipulate a spreadsheet), Chain-of-Table achieved new state-of-the-art results on table QA and verification tasks[106]. Proposed by Wang et al. (2024), noted in Vatsal & Dubey.
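The plan-then-transform loop can be sketched as below. This is an illustrative skeleton: the table representation (a list of dict rows), the `op: argument` plan format, and the operation names are all assumptions for the example, and the real method lets the model choose each next operation dynamically rather than planning the whole sequence up front.

```python
def chain_of_table(llm, table, question, ops):
    """Chain-of-Table sketch: the model plans named table operations,
    which are applied one by one before answering.

    `table` is a list of dict rows; `ops` maps an operation name to a
    function (table, argument_string) -> table.
    """
    plan = llm(
        f"Table columns: {list(table[0])}\nQuestion: {question}\n"
        "Plan operations, one per line, as `op: argument`."
    )
    for line in plan.splitlines():
        if ":" not in line:
            continue
        op, arg = (s.strip() for s in line.split(":", 1))
        if op in ops:
            table = ops[op](table, arg)  # apply the planned transformation
    # Answer from the transformed (smaller, focused) table.
    return llm(f"Transformed table: {table}\nQuestion: {question}\nAnswer:")
```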
• Three-Hop Reasoning (THOR): A domain-specific prompting chain for sentiment analysis that mimics human analytic steps[107]. The LLM is prompted in three stages: identify the specific aspect being discussed (e.g. product feature mentioned), then describe the opinion or detail about that aspect, and finally infer the sentiment polarity from those details[108]. This structured approach helped the model outperform even fine-tuned models on multi-aspect sentiment tasks[109]. From Fei et al. (2023), included in Vatsal & Dubey’s survey.
• Chain-of-Event (CoE): A technique for summarization tasks where the model is prompted to extract and organize key events chronologically[110]. The prompt breaks summarization into steps: (1) list important specific events in the text; (2) generalize and compress those events; (3) filter to the most critical events; (4) produce a summary by integrating those events in order[111]. By focusing on events, CoE yields more concise summaries and was shown to beat standard CoT on summarizing long texts (with higher ROUGE scores)[112]. Proposed by Bao et al. (2024), noted in Vatsal & Dubey.
• Domain-Specific Instruction Tuning Prompts: In specialized domains like medicine, prompts have been augmented with domain knowledge. One example is adding annotation guidelines or definitions to the prompt. Hu et al. (2024) combined a basic instruction with snippets from annotation guidelines and common errors for a medical named entity recognition task[113][114]. Another example is appending terminology definitions (e.g., medical jargon explanations) to the user query[88]. These hybrid prompts aim to inject domain context so the model follows the same rules a human annotator would. Surveys note that results have been mixed – in some cases these additions didn’t help much or even confused the model[89] – but they represent attempts to tailor prompts to specific application needs.