The <5% Problem: A Prompting Toolkit for When Plain Language Isn't Enough
When you need a better answer, here's how to ask better questions; Or, Break glass in case of boiled garbage
I’ve been talking to some folks about using large language models, and it occurred to me that it would be helpful to have a resource on prompting techniques. It reminded me a bit of learning to tie knots as a kid. My dad, who worked on a shipyard in college, can tie dozens of different knots and hitches. He taught me a lot of them, and while I only remember a few, I can still tie a bowline with my eyes closed.1
I found this pre-print paper from Schulhoff et al. (2024), and I used Claude to turn it into a catalogue of prompting techniques. The authors, meanwhile, turned it into a SaaS business.
The utility of my AI writing, I’ve long (dis)claimed, is not that I’m inventing the techniques or technology, only that I’m collecting and presenting them to a mostly veterinary audience. What I’ve done here, or at least tried to do, is provide a prompt that you can copy and paste into your preferred large language model (which should probably be Claude2) and it will help guide you through the technique that should work the “best” for accomplishing your goal.
This isn’t something to try for every situation, as the LLMs have gotten so good that you usually get the output you want. This is more for the times when you have a reasonably complex task and you’re struggling to make the software do what you need.
You can also read the paper, or the LLM-produced catalogue, just for the sake of learning more about prompting. It’s a dated paper (by LLM standards) and not peer-reviewed, and neither the paper nor the catalogue is exhaustive. Still, I found myself thinking “hey, that’s interesting” or “hm, that’s useful” often enough, and using it frequently enough, that I decided to publish it on Substack.
The Prompt Report
I’ve linked the article above, and I’m posting the output mostly unedited. I’ve only removed the last paragraph where Claude offered to use this to help with other projects.3
This bit is, by design, a reference dump. More cheat sheet than encyclopedia. You don’t need to, and shouldn’t, read it as prose. It’s meant to be copied and pasted.
This is “The Prompt Report” by Schulhoff et al. (2025) — the most comprehensive systematic survey of prompt engineering techniques to date. It came out of UMD, with contributors from Stanford, OpenAI, Vanderbilt, Princeton, and others. The paper used a PRISMA-based systematic review process, starting from ~4,800 records and distilling down to 1,565 relevant papers, from which they catalogued 58 text-based prompting techniques and 40 multimodal/multilingual techniques.
What makes it genuinely useful rather than just encyclopedic is the taxonomy — they don’t just list techniques, they organize them into functional categories that help you understand when to reach for each one. They also include a real-world prompt engineering case study on suicide crisis detection that’s worth the read on its own, because it shows how messy and non-linear the actual process is.
Let me lay out the full catalogue.
TEXT-BASED PROMPTING TECHNIQUES (58)
1. In-Context Learning (ICL)
The foundational category — providing exemplars and/or instructions within the prompt so the model can perform tasks without weight updates.
Few-Shot Prompting Design Decisions: exemplar quantity, ordering, label distribution, label quality, format, and similarity to test instances all matter significantly. On some tasks, exemplar order alone can swing accuracy from below 50% to above 90%.
Few-Shot Techniques:
KNN — selects exemplars similar to the test input using k-nearest neighbor retrieval
Vote-K — proposes unlabeled candidates for annotation, then uses the labeled pool for few-shot prompting while ensuring diversity
Self-Generated ICL (SG-ICL) — uses the LLM itself to generate exemplars when training data is unavailable
Prompt Mining — discovers optimal prompt template formats by analyzing which patterns appear most frequently in large corpora
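As a rough illustration of the KNN idea above, here is a minimal sketch that picks the exemplars most similar to the test input and formats them into a few-shot prompt. Everything in it is an assumption made for illustration: the function names, the hypothetical triage pool, and especially the bag-of-words cosine similarity, which stands in for whatever embedding model a real system would use.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_exemplars(query: str, pool: list[tuple[str, str]], k: int = 2) -> list[tuple[str, str]]:
    """Return the k (input, label) pairs whose input is most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(pool, key=lambda pair: cosine(q, bag_of_words(pair[0])), reverse=True)
    return ranked[:k]


def few_shot_prompt(query: str, exemplars: list[tuple[str, str]]) -> str:
    """Format the chosen exemplars, then the unanswered query."""
    shots = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in exemplars)
    return f"{shots}\n\nInput: {query}\nLabel:"


# Hypothetical labeled pool, invented for this example.
POOL = [
    ("dog vomiting after eating chocolate", "urgent"),
    ("cat due for annual vaccines", "routine"),
    ("dog limping after a long hike", "soon"),
    ("cat vomiting after eating a houseplant", "urgent"),
]

query = "puppy vomiting after eating grapes"
chosen = knn_exemplars(query, POOL, k=2)
print(few_shot_prompt(query, chosen))
```

Note that picking only near neighbors can skew the label distribution of the exemplars, which is one of the design decisions flagged above.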
Zero-Shot Techniques:
Role Prompting (aka Persona Prompting) — assigns a specific role to the model (“Act as a travel writer”)
Style Prompting — specifies desired tone, style, or genre
Emotion Prompting — incorporates psychologically relevant phrases (“This is important to my career”) to boost performance
System 2 Attention (S2A) — asks the LLM to rewrite the prompt removing irrelevant information, then answers from the cleaned version
SimToM — for multi-person/multi-object questions, establishes what facts one person would know, then answers based only on those facts
Rephrase and Respond (RaR) — instructs the LLM to rephrase and expand the question before answering
Re-reading (RE2) — adds “Read the question again:” plus a repetition of the question; surprisingly effective on complex reasoning
Self-Ask — the LLM decides whether it needs follow-up questions, generates them, answers them, then answers the original
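Several of the zero-shot techniques above are just string transformations on the prompt, so they can be sketched as small helper functions. This is only a sketch: the function names are hypothetical, and the exact wording of each wrapper (other than the phrases quoted in the list above) is an assumption that may differ from what the original papers used.

```python
def role_prompt(role: str, task: str) -> str:
    """Role Prompting: put a persona in front of the task."""
    return f"Act as {role}.\n\n{task}"


def emotion_prompt(task: str, stake: str = "This is important to my career.") -> str:
    """Emotion Prompting: append a psychologically relevant phrase."""
    return f"{task}\n\n{stake}"


def re_reading_prompt(question: str) -> str:
    """Re-reading (RE2): repeat the question after 'Read the question again:'."""
    return f"{question}\nRead the question again: {question}"


def rephrase_and_respond_prompt(question: str) -> str:
    """Rephrase and Respond (RaR): ask for a rephrasing before the answer.

    The instruction wording here is an assumption, not a quote from the catalogue.
    """
    return f"{question}\n\nRephrase and expand the question, and respond."


# Hypothetical question, invented for this example.
question = "A 20 kg dog gets 2 mg per kg. How many mg is one dose?"
print(re_reading_prompt(question))
print()
print(role_prompt("a travel writer", "Describe Lisbon in two sentences."))
```

The wrappers compose, so `role_prompt("a tutor", re_reading_prompt(question))` is one way to layer two of them.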
2. Thought Generation
Techniques that prompt the model to articulate reasoning steps.
Chain-of-Thought (CoT) — the foundational technique: providing exemplars that include reasoning paths before the final answer.
Zero-Shot CoT variants:
Zero-Shot CoT — appending “Let’s think step by step” (or similar thought inducers) with no exemplars
Step-Back Prompting — first asks a high-level question about relevant concepts before diving into reasoning
Analogical Prompting — auto-generates exemplars that include CoT reasoning; improves math and code generation
Thread-of-Thought (ThoT) — uses “Walk me through this context in manageable parts step by step, summarizing and analyzing as we go” as the thought inducer; works well with large, complex contexts
Tabular CoT (Tab-CoT) — makes the LLM output reasoning as a markdown table, improving structure
Few-Shot CoT variants:
Contrastive CoT — includes both correct and incorrect reasoning exemplars so the model sees how not to reason
Uncertainty-Routed CoT — samples multiple CoT paths, takes majority if above a confidence threshold, otherwise samples greedily
Complexity-Based Prompting — selects complex examples for exemplars and uses majority vote among longer reasoning chains
Active Prompting — asks the LLM to solve training questions, calculates uncertainty, then has humans rewrite the highest-uncertainty exemplars
Memory-of-Thought — pre-computes CoT on unlabeled training data, then retrieves similar instances at test time
Auto-CoT — uses Zero-Shot CoT to automatically generate chains-of-thought, then assembles them into Few-Shot CoT prompts
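To make the difference between Zero-Shot CoT and Few-Shot CoT concrete, here is a minimal sketch of both prompt shapes. The `CoTExemplar` class, the function names, and the "Q:/A:" layout are assumptions chosen for illustration, and the worked exemplar is invented.

```python
from dataclasses import dataclass


@dataclass
class CoTExemplar:
    """One worked example: a question, its reasoning path, and the final answer."""
    question: str
    reasoning: str
    answer: str


def zero_shot_cot(question: str, inducer: str = "Let's think step by step.") -> str:
    """Zero-Shot CoT: no exemplars, just a thought inducer after the question."""
    return f"Q: {question}\nA: {inducer}"


def few_shot_cot(exemplars: list[CoTExemplar], question: str) -> str:
    """Few-Shot CoT: exemplars that show the reasoning before the final answer."""
    shots = "\n\n".join(
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    )
    return f"{shots}\n\nQ: {question}\nA:"


# Hypothetical exemplar, invented for this example.
exemplar = CoTExemplar(
    question="A 20 kg dog gets 2 mg per kg. How many mg is one dose?",
    reasoning="20 kg times 2 mg per kg is 40 mg.",
    answer="40 mg",
)

new_question = "A 5 kg cat gets 3 mg per kg. How many mg is one dose?"
print(zero_shot_cot(new_question))
print()
print(few_shot_cot([exemplar], new_question))
```

Swapping the default `inducer` for the Thread-of-Thought or Plan-and-Solve wording is the only change needed to try those variants with this layout.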
3. Decomposition
Breaking complex problems into simpler sub-problems.
Least-to-Most — prompts the LLM to break a problem into sub-problems without solving them, then solves sequentially, appending each answer to the prompt
DECOMP (Decomposed Prompting) — teaches the LLM to use specific functions (string splitting, internet search, etc.) via few-shot examples, then lets it route sub-problems to those functions
Plan-and-Solve — an improved Zero-Shot CoT: “Let’s first understand the problem and devise a plan to solve it. Then, let’s carry out the plan and solve the problem step by step”
Tree-of-Thought (ToT) — creates a tree-like search by generating multiple possible “thought” steps, evaluating progress, and deciding which branches to continue
Recursion-of-Thought — when a complex sub-problem arises mid-reasoning, it sends that sub-problem to a separate LLM call, then inserts the answer back
Program-of-Thoughts — generates code as reasoning steps, then executes via a code interpreter
Faithful CoT — combines natural language and symbolic language (e.g., Python) reasoning in a task-dependent fashion
Skeleton-of-Thought — generates an answer skeleton, then solves sub-parts in parallel for speed
Metacognitive Prompting — a five-part prompt chain mirroring human metacognition: clarify the question, preliminary judgment, evaluate response, confirm decision, assess confidence
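Least-to-Most is easier to follow as a loop than as a sentence, so here is a minimal sketch of its two stages. The `llm` argument is a hypothetical callable (prompt in, text out) that you would wire to your own model; the prompt wording and the `fake_llm` stand-in are assumptions so the sketch runs without any API.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: takes a prompt, returns text


def least_to_most(problem: str, llm: LLM) -> str:
    """Decompose first, then solve sub-problems in order, carrying answers forward."""
    # Stage 1: ask for sub-problems only, one per line, without solving them.
    plan = llm(
        "Break this problem into simpler sub-problems, one per line. "
        f"Do not solve them.\n\nProblem: {problem}"
    )
    sub_problems = [line.strip() for line in plan.splitlines() if line.strip()]

    # Stage 2: solve sequentially, appending each Q/A pair to the running prompt.
    context = f"Problem: {problem}\n"
    answer = ""  # stays empty if the model returned no sub-problems
    for sub in sub_problems:
        answer = llm(f"{context}\nQ: {sub}\nA:").strip()
        context += f"\nQ: {sub}\nA: {answer}\n"
    return answer


def fake_llm(prompt: str) -> str:
    """Canned replies so the sketch runs without a model."""
    if prompt.startswith("Break this problem"):
        return "How many doses per day?\nHow many doses in 7 days?"
    if prompt.endswith("Q: How many doses in 7 days?\nA:"):
        return "14"
    return "2"


print(least_to_most("A dog gets a pill twice a day. How many pills in a week?", fake_llm))
```

The sketch returns the answer to the last sub-problem, which assumes the decomposition ends with the original question; a real prompt would probably need to ask for that explicitly.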
4. Ensembling
Using multiple prompts and aggregating results.
DENSE (Demonstration Ensembling) — creates multiple few-shot prompts with distinct exemplar subsets, aggregates outputs
MoRE (Mixture of Reasoning Experts) — uses different specialized prompts for different reasoning types (retrieval for factual, CoT for math, generated knowledge for commonsense), selects best by agreement
Max Mutual Information — creates varied prompt templates, selects the one maximizing mutual information between prompt and output
Self-Consistency — prompts the LLM multiple times with non-zero temperature for diverse CoT paths, then takes majority vote
Universal Self-Consistency — like Self-Consistency but uses an LLM to select the majority answer rather than programmatic counting (useful for free-form text)
Meta-CoT (Meta-Reasoning over Multiple CoTs) — generates multiple reasoning chains, inserts all into one prompt, generates a final answer
DiVeRSe — creates multiple prompts, runs Self-Consistency on each, scores reasoning paths step-by-step
COSP (Consistency-based Self-adaptive Prompting) — runs Zero-Shot CoT with Self-Consistency on examples, selects high-agreement outputs as exemplars for a final prompt
USP (Universal Self-Adaptive Prompting) — generalizes COSP to all tasks using unlabeled data and a more complex scoring function
Prompt Paraphrasing — rewording the original prompt while preserving meaning; effectively data augmentation for ensembles
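Self-Consistency boils down to "sample several times, extract each answer, take the majority," which can be sketched in a few lines. The `sample` callable is a hypothetical stand-in for a model call at non-zero temperature, and the "answer is N" extraction pattern is an assumption about how the completions are phrased.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Assumes completions end with a phrase like "The answer is 40."
ANSWER_PATTERN = re.compile(r"answer is\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_answer(completion: str) -> Optional[str]:
    """Pull the numeric final answer out of a reasoning chain, if there is one."""
    match = ANSWER_PATTERN.search(completion)
    return match.group(1) if match else None


def self_consistency(prompt: str, sample: Callable[[str], str], n: int = 5) -> Optional[str]:
    """Sample n reasoning paths and return the most common extracted answer."""
    votes: Counter = Counter()
    for _ in range(n):
        answer = extract_answer(sample(prompt))
        if answer is not None:  # unparseable completions simply do not vote
            votes[answer] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Canned completions standing in for five samples from a model.
canned = iter([
    "20 times 2 is 40. The answer is 40.",
    "20 plus 25 is 45. The answer is 45.",
    "2 mg per kg for 20 kg. The answer is 40.",
    "The answer is 40.",
    "I am not sure.",
])
print(self_consistency("Q: ...\nA: Let's think step by step.", lambda p: next(canned)))
```

Answer extraction is the fragile part. For free-form text, where there is nothing to count, Universal Self-Consistency replaces the `Counter` with another model call.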
5. Self-Criticism
Having the LLM evaluate and improve its own outputs.
Self-Calibration — answers a question, then asks the LLM “Is this answer correct?” to gauge confidence
Self-Refine — iteratively: generate answer → get LLM feedback → improve answer → repeat until stopping condition
Reversing CoT (RCoT) — asks the LLM to reconstruct the original problem from its answer, then compares for inconsistencies
Self-Verification — generates multiple CoT solutions, scores each by masking parts of the original question and testing if the LLM can predict them
Chain-of-Verification (COVE) — generates an answer, creates verification questions, answers them, then produces a revised final answer
Cumulative Reasoning — generates potential reasoning steps, evaluates each (accept/reject), checks if the final answer is reached, repeats if not
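The Self-Refine loop above (generate, get feedback, improve, repeat) can be sketched as follows. The `llm` callable, the prompt wording, the `DONE` stop word, and the round limit are all assumptions made for illustration; the stopping condition in particular is something you would tune for your own task.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: takes a prompt, returns text


def self_refine(task: str, llm: LLM, max_rounds: int = 3, stop_word: str = "DONE") -> str:
    """Draft, ask the model for feedback, rewrite; stop on stop_word or max_rounds."""
    draft = llm(f"Task: {task}\n\nWrite a first draft.")
    for _ in range(max_rounds):
        feedback = llm(
            f"Task: {task}\n\nDraft:\n{draft}\n\n"
            f"Give specific feedback. If no changes are needed, reply only {stop_word}."
        )
        if feedback.strip() == stop_word:
            break
        draft = llm(
            f"Task: {task}\n\nDraft:\n{draft}\n\nFeedback:\n{feedback}\n\n"
            "Rewrite the draft to address the feedback."
        )
    return draft


# Scripted replies standing in for a model: draft, feedback, rewrite, stop.
script = iter(["Draft v1", "Shorten the second sentence.", "Draft v2", "DONE"])
print(self_refine("Write a discharge note.", lambda prompt: next(script)))
```

The round limit matters because a model asked for feedback will usually find some, so the stop word alone may never fire.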
6. The case study technique:
AutoDiCoT (Automatic Directed CoT) — automatically generates CoT explanations, then combines them with contrastive (incorrect reasoning) exemplars. Developed during the paper’s own suicide crisis detection case study.
PROMPT ENGINEERING TECHNIQUES (automated optimization)
Meta Prompting — prompting an LLM to generate or improve a prompt
AutoPrompt — uses backpropagation to optimize “trigger tokens” in a prompt template (soft-prompting)
APE (Automatic Prompt Engineer) — generates multiple Zero-Shot instruction prompts from exemplars, scores them, creates variations of the best
GrIPS — like APE but uses deletion, addition, swapping, and paraphrasing operations
ProTeGi — passes outputs through a criticism prompt, generates new prompts from criticisms, uses a bandit algorithm to select
RLPrompt — uses reinforcement learning (Soft Q-Learning) to optimize prompt templates; often selects grammatically nonsensical but effective text
DP2O — the most complex: combines RL, custom scoring, and LLM conversations to construct prompts
MULTILINGUAL TECHNIQUES (14)
CoT extensions:
XLT (Cross-Lingual Thought) — six-instruction template including role assignment, cross-lingual thinking, and CoT
CLSP (Cross-Lingual Self-Consistent Prompting) — constructs reasoning paths in different languages for the same question
ICL extensions:
X-InSTA — aligns in-context examples with input via semantic alignment, task-based alignment, or both
In-CLT (Cross-lingual Transfer) — uses both source and target languages for in-context examples
PARC — retrieves exemplars from a high-resource language for low-resource targets
Translation-specific:
Translate First Prompting — translates non-English input to English before processing
MAPS (Multi-Aspect Prompting and Selection) — mines knowledge from source text, generates multiple translations, selects best
Chain-of-Dictionary (CoD) — extracts words, looks up meanings in multiple languages, prepends to prompt
DiPMT — similar to CoD but only source and target language definitions
DecoMT — divides source text into chunks, translates independently, then combines with contextual information
Interactive-Chain-Prompting — generates sub-questions about translation ambiguities for human resolution
Iterative Prompting — creates draft translation, then refines with automated retrieval or human feedback
MULTIMODAL TECHNIQUES (26+)
Image:
Prompt Modifiers — appending words to change generated images (medium, lighting, etc.)
Negative Prompting — numerically weighting terms so the model considers them more/less
Paired-Image Prompting — shows before/after transformation pairs, then presents a new image
Image-as-Text Prompting — generates textual descriptions of images for inclusion in text prompts
Multimodal CoT — extending CoT to image inputs (“Solve this step by step” with a math problem image)
DDCoT (Duty Distinct CoT) — extends Least-to-Most to multimodal, creating and solving subquestions
Multimodal Graph-of-Thought — extends GoT to multimodal with image captioning for visual context
Chain-of-Images — generates images (as SVGs) as part of the reasoning process (“Let’s think image by image”)
Audio, Video, Segmentation, 3D: The paper notes these are early-stage but growing, with techniques for text-to-video generation, video editing, semantic segmentation, 3D object synthesis, surface texturing, and 4D scene generation.
AGENT TECHNIQUES
Tool Use:
MRKL System — LLM router with access to multiple tools (weather, date, calculator)
CRITIC — generates response, self-criticizes, then uses tools to verify/amend
Code Generation:
PAL — translates problems directly to code, executes via Python interpreter
ToRA — interleaves code and reasoning steps iteratively
TaskWeaver — transforms requests to code with user-defined plugins
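The PAL idea above (have the model write code, then let an interpreter compute the answer) can be sketched roughly like this. The `llm` callable, the prompt wording, and the convention that the generated code stores its result in a variable named `answer` are all assumptions for illustration. Emptying the builtins, as done here, is not a real sandbox, so treat this as a toy rather than something to run on untrusted model output.

```python
from typing import Any, Callable

LLM = Callable[[str], str]  # hypothetical interface: takes a prompt, returns text


def pal(question: str, llm: LLM) -> Any:
    """Ask the model for Python, execute it, and read back the `answer` variable."""
    code = llm(
        "Write plain Python (no imports) that solves the question and stores "
        f"the result in a variable named answer.\n\nQuestion: {question}"
    )
    namespace: dict = {}
    # Empty builtins block the obvious mischief, but this is NOT a security boundary.
    exec(code, {"__builtins__": {}}, namespace)
    return namespace["answer"]


def fake_llm(prompt: str) -> str:
    """Canned reply standing in for model-generated code."""
    return "weight_kg = 20\nmg_per_kg = 2\nanswer = weight_kg * mg_per_kg"


print(pal("A 20 kg dog gets 2 mg per kg. How many mg is one dose?", fake_llm))
```

The appeal is that the arithmetic is done by the interpreter rather than by the model, which only has to get the translation into code right.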
Observation-Based:
ReAct — think → act → observe → repeat, with all history in the prompt
Reflexion — adds introspection layer to ReAct: evaluates success/failure, generates reflections for working memory
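The think, act, observe cycle of ReAct can be sketched as a short loop. Everything concrete here is an assumption for illustration: the `llm` callable, the "Action: tool[input]" and "Final Answer:" text conventions, the `lookup_weight` tool, and the scripted replies that stand in for a model.

```python
import re
from typing import Callable, Optional

LLM = Callable[[str], str]   # hypothetical interface: takes a prompt, returns text
Tool = Callable[[str], str]  # hypothetical interface: takes an input string, returns an observation

ACTION = re.compile(r"Action:\s*(\w+)\[(.*?)\]")
FINAL = re.compile(r"Final Answer:\s*(.*)")


def react(question: str, llm: LLM, tools: dict[str, Tool], max_steps: int = 5) -> Optional[str]:
    """Loop thought -> action -> observation, keeping the full history in the prompt."""
    history = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(history)
        history += step + "\n"
        final = FINAL.search(step)
        if final:
            return final.group(1).strip()
        action = ACTION.search(step)
        if action is None:
            history += "Observation: could not parse an action.\n"
            continue
        name, arg = action.groups()
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name!r}"
        history += f"Observation: {observation}\n"
    return None  # ran out of steps without a final answer


def lookup_weight(name: str) -> str:
    """Hypothetical tool: look up a patient's weight."""
    return {"Rex": "20 kg"}.get(name, "unknown patient")


script = iter([
    "Thought: I need Rex's weight.\nAction: lookup_weight[Rex]",
    "Thought: I have what I need.\nFinal Answer: 20 kg",
])
print(react("How much does Rex weigh?", lambda prompt: next(script), {"lookup_weight": lookup_weight}))
```

Reflexion would sit one level above this loop, evaluating whether a run succeeded and feeding a written reflection into the next attempt.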
Lifelong Learning:
Voyager — proposes tasks, generates code, saves actions to long-term memory (Minecraft)
GITM (Ghost in the Minecraft) — recursive goal decomposition with structured text actions and external knowledge base
RAG:
Verify-and-Edit — generates multiple CoTs, retrieves external info to edit selected ones
Demonstrate-Search-Predict — decomposes questions, uses retrieval for sub-questions, combines
IRCoT — interleaves CoT and retrieval, each guiding the other
Iterative Retrieval Augmentation (FLARE, IRP) — generates temporary sentences as search queries during long-form generation
KEY FINDINGS FROM THEIR BENCHMARKING
They tested techniques on MMLU with GPT-3.5-turbo. Few-Shot CoT performed best at 0.692 accuracy. Surprisingly, Zero-Shot CoT dropped below plain Zero-Shot (0.547 vs 0.627). Self-Consistency helped Zero-Shot CoT but didn’t improve Few-Shot CoT. The spread between best and worst phrasing variants within a single technique was large, reinforcing how sensitive these models are to prompt wording.
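For a sense of scale, the accuracies quoted above can be turned into percentage-point differences against plain Zero-Shot. This is only arithmetic on the three numbers already given; the dictionary and function names are made up for the example.

```python
# Accuracies as quoted in the paragraph above (MMLU, GPT-3.5-turbo).
ACCURACY = {"Zero-Shot": 0.627, "Zero-Shot CoT": 0.547, "Few-Shot CoT": 0.692}


def delta_points(technique: str, baseline: str = "Zero-Shot") -> float:
    """Accuracy difference versus the baseline, in percentage points."""
    return round((ACCURACY[technique] - ACCURACY[baseline]) * 100, 1)


for name in ACCURACY:
    print(f"{name:<14} {ACCURACY[name]:.3f}  {delta_points(name):+.1f} points vs Zero-Shot")
```

So on this benchmark, adding “Let’s think step by step” cost about 8 points, while adding worked exemplars gained about 6.5.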
The suicide crisis detection case study is particularly instructive — 47 development steps, ~20 hours, going from 0% to 0.53 F1. The most interesting finding: accidentally pasting an email twice improved performance, and removing the duplicate made it worse. Anonymizing names in the email also hurt. This underscores their point that prompt engineering remains partly a “black art” where inexplicable details can matter enormously.
I’ve included this because I think sometimes you might want to reference the paper specifically, and sometimes you might only need a catalogue refresher to help you prompt the model.
Bluntly, if you get the hang of five to eight of these techniques, you are probably now LLM-capable enough to do pretty much anything you need. If you’ve got few-shot prompting, chain-of-thought, decomposition, self-refinement, and role/persona prompting in your toolbox? You almost certainly don’t need 50 more to get high-level functionality out of your LLM.
Still, every once in a while (and often enough for me that I’m sharing it), I find this useful.
The Catalogue Prompt
Here’s where this becomes a tool instead of just a neat idea.4 You can copy the prompt below, paste it into your LLM of choice, and include the catalogue or article alongside it.
Describe the problem, and it will recommend different techniques and ways to apply them.
You are an expert prompt engineering consultant. You are thoughtful and thorough, insightful and diligent. You are patient with your explanations but will not sacrifice accuracy for sycophancy. I’m going to give you two things:
1. A comprehensive catalogue of every known prompting technique (the article pasted)
2. A description of my specific problem
Your job is to:
- Analyze my problem’s characteristics (type of task, available data, complexity, whether I have labeled examples, latency/cost constraints, whether accuracy or recall matters more)
- Recommend and rank 1-3 prompting techniques from the catalogue that are the best fit
- Explain WHY each technique fits my problem
- Provide a concrete example of how I would structure my prompt using each recommended technique
- Flag any technique-specific pitfalls or design decisions I should be aware of (e.g., exemplar ordering effects, answer extraction challenges)
- If my problem would benefit from combining techniques (e.g., Few-Shot CoT + Self-Consistency), explain how to layer them and/or provide a step-wise approach
Here is the catalogue of prompting techniques:
[PASTE THE FULL ARTICLE HERE]
Here is my problem:
[DESCRIBE YOUR PROBLEM HERE — include: what task you’re trying to accomplish, what data you have available, what model you’re using, what “good” looks like for your use case, and any constraints on cost/latency/complexity]
The better you describe your problem, the more context you provide, the better the output will be. This seems to work fairly well inside of Claude’s Projects feature, but it’s also been effective in Incognito Mode.5
The other thing the meta-prompt doesn’t include that might be useful is a note about what you’ve already tried and why it didn’t work. I omit it here because I expect most folks (and, often, me too) will just copy and paste it as is as a first try.
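If you would rather assemble the prompt in a script than by hand, a minimal sketch might look like the following. The function name, the placeholder check, and the wording of the optional "already tried" section are my assumptions, not part of the prompt above; `instructions` is meant to hold the prompt text up to the catalogue placeholder, pasted in by you.

```python
from typing import Optional

# Placeholder openings taken from the prompt above; used to catch unfilled slots.
PLACEHOLDER_STARTS = ("[PASTE", "[DESCRIBE")


def build_catalogue_prompt(
    instructions: str,
    catalogue: str,
    problem: str,
    already_tried: Optional[str] = None,
) -> str:
    """Join the consultant instructions, the catalogue, and the problem description."""
    for name, value in (("catalogue", catalogue), ("problem", problem)):
        if not value.strip() or value.strip().startswith(PLACEHOLDER_STARTS):
            raise ValueError(f"{name} still looks like an unfilled placeholder")
    parts = [
        instructions.strip(),
        "Here is the catalogue of prompting techniques:",
        catalogue.strip(),
        "Here is my problem:",
        problem.strip(),
    ]
    if already_tried:
        # Optional extra section; this heading is an assumption, not the author's wording.
        parts += ["Here is what I have already tried and why it did not work:", already_tried.strip()]
    return "\n\n".join(parts)


prompt = build_catalogue_prompt(
    instructions="You are an expert prompt engineering consultant. ...",  # paste the real text here
    catalogue="(catalogue text goes here)",
    problem="Classify client emails as urgent or routine; 40 labeled examples; accuracy matters most.",
    already_tried="Plain zero-shot instructions; it over-flags routine emails as urgent.",
)
print(prompt)
```

The guard against leftover placeholders is there because pasting the template as-is is the easiest mistake to make.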
Conclusion
Honestly, this was intended for my own utility, but I’ve been using it enough to believe it’s worth sharing.
Obviously, the point is very much not to memorize a taxonomy; it’s to have a reference tool for when the model isn’t doing what you need and you’re not sure why. You don’t need to memorize 58 knots and hitches. You need to know that a bowline and a square knot exist, and when to use each one.
Most of the time, you won’t need this mini reference. Most of the time, you’ll have internalized the techniques already and, when you haven’t, the models are good enough that plain language covers you anyway. But for the other <5%ish of the time, when the task is complex or the stakes are real or you’re three rephrased prompts deep and still getting boiled garbage, knowing that a technique exists can be the difference between giving up and getting it done.
1. He insisted on that particular challenge, and it still makes me smile to think of it.
2. Opus 4.6 Extended. None of that “Adaptive” bullshit.
3. My greatest professional fear is all of my project folders leaking to the world. Not because they’re scandalous or salacious, but because the bad puns and dad jokes I use to title the projects and folders (it helps me remember, okay?!) would get me canceled and/or crucified.
4. I leave my articles up long past their useful life, but if you’re coming across this one long after it’s published, I think it’s probably worth taking another swing at those prompts. Especially the second one.
5. I find Claude’s Incognito Mode is more useful than ChatGPT’s Temporary Chat. Claude’s Incognito Mode doesn’t reference the history and memory, which is quite useful when you’re trying to minimize sycophancy. ChatGPT’s tool is just as useful for privacy’s sake, but doesn’t have the added benefit of relative neutrality.


