NAVIGATOR — human capability for the COMPASS Platform

Introduction

Most people's mental model of how to use an LLM is "type a question, get an answer." That works for trivia. It does not work for serious work.

The difference between people who get useful output from LLMs and people who don't isn't intelligence or technical background — it's whether they treat the model as a stochastic tool that needs to be configured for each task, or as a magic answer box.

This guide is about the configuration. It covers prompt structure, how to feed context, when to use examples, how to choose between different AI patterns (chat vs RAG vs agents vs fine-tuning), and the small handful of techniques that account for most of the quality difference between weak and strong prompts.

What you'll be able to do by the end

Write prompts that are specific enough to consistently get usable output

Choose the right AI pattern (chat, RAG, structured output, agent, fine-tuning) for a given problem

Diagnose why a prompt isn't working and know which lever to pull

Avoid the most common dead ends: prompt folklore that doesn't actually move quality

How to use this guide

Read sections 1–3 in order; they build on each other. Sections 4–7 are independent, so you can jump to whichever is most relevant. The exercises at the end of sections 2, 3, and 5 are worth doing; they're designed around problems engineers actually have, not toy puzzles. The reference appendix at the end is the page to bookmark.

1. The Mental Model

Before techniques, get the model right. An LLM is a function that predicts what text most likely comes next, given everything in front of it (the context window). It is not a database, not a reasoner with private state, and not a person you're having a conversation with — though it imitates all three convincingly.

Three implications fall out of this, and they explain almost every counterintuitive behavior you'll see in practice.

Implication 1: Everything has to be in the context window

If you want the model to know something — a fact, a coding style, a customer's record, last week's decision — it has to be in the prompt right now. The model doesn't "remember" the conversation you had yesterday. It doesn't have access to your codebase. It doesn't know what your team agreed in Slack. When you start a new chat, you start from zero context.

This is why people who get good results spend a lot of time on context. Pasting in the relevant schema, the docs, the error message, the existing function, the style guide — these aren't optional decoration, they're the entire substrate the model is working from.

Implication 2: The model is a probability distribution, not a single answer

Run the same prompt twice and you can get different outputs. Run it ten times and you'll see a range. The same model can be brilliant on attempt three and confidently wrong on attempt four. This isn't a bug — it's how the technology works. There's no "the" answer; there's a distribution of possible next-text continuations, and a sampler picks one.

Practically: don't read too much into a single output. If the answer matters, run the prompt twice or three times and see whether the core claim holds. Don't anchor on the first response.
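
A minimal sketch of that habit, assuming a hypothetical call_model(prompt) function that wraps whatever API or client you use (it is a placeholder, not a real library call). It runs the same prompt several times and reports how often the most common answer came back.

```python
from collections import Counter
from typing import Callable


def sample_answers(call_model: Callable[[str], str], prompt: str, runs: int = 3) -> Counter:
    """Run the same prompt several times and tally the normalized answers."""
    answers = (call_model(prompt).strip().lower() for _ in range(runs))
    return Counter(answers)


def agreement(tally: Counter) -> float:
    """Fraction of runs that produced the most common answer."""
    total = sum(tally.values())
    return tally.most_common(1)[0][1] / total if total else 0.0


if __name__ == "__main__":
    # Stand-in model for demonstration only: returns canned outputs in order.
    canned = iter(["Yes", "yes ", "No"])
    tally = sample_answers(lambda _prompt: next(canned), "Is X thread-safe?", runs=3)
    print(tally.most_common(), round(agreement(tally), 2))
```

Exact-match tallies only make sense for short answers (a label, a yes/no, a number). For prose, compare the runs by eye. High agreement tells you the output is stable, not that it is correct.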

Implication 3: Plausibility is not accuracy

The model optimizes for text that looks like what should come next. It does not optimize for text that's true. Most of the time, looking-right and being-right correlate — that's why LLMs work at all. But the two come apart at the edges, and the failure mode is silent: the wrong answer looks exactly as confident as the right answer. We'll deal with this in depth in Guide 2; for now, just internalize that the model has no internal sensation of certainty.

The mental model in one line: an LLM is a context-in, plausible-text-out function. Your job is to build the context that makes the most plausible continuation also the most useful one.

2. Prompt Structure

A weak prompt is one sentence. A strong prompt has structure. Not because the model demands a specific format — modern models are flexible — but because writing a structured prompt forces you to be specific about what you actually want.

Five components show up in almost every strong prompt:

Role / context — who the model is acting as and what situation it's in.
Task — what you want it to do, specifically.
Inputs — the actual material it's working with (code, data, document, etc.).
Constraints — anything it should or shouldn't do (length, style, dialect, things to avoid).
Output format — what the response should look like (JSON schema, table, list, prose, etc.).

You don't need labeled sections — natural language is fine. But cover all five, and your output quality will jump.
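
If you assemble prompts in code, the five components can be sketched as a small helper. This is one possible shape, not a standard API: the function name build_prompt and its parameter names are made up for illustration. It refuses to build a prompt when a component is missing, which is the point of the checklist.

```python
def build_prompt(
    role: str,
    task: str,
    inputs: str,
    constraints: list[str],
    output_format: str,
) -> str:
    """Assemble a plain-language prompt covering all five components."""
    components = {
        "role": role,
        "task": task,
        "inputs": inputs,
        "constraints": constraints,
        "output_format": output_format,
    }
    # Fail loudly if any component was left empty.
    missing = [name for name, value in components.items() if not value]
    if missing:
        raise ValueError(f"missing prompt components: {', '.join(missing)}")

    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return "\n\n".join([
        role.strip(),
        task.strip(),
        inputs.strip(),
        f"Constraints:\n{constraint_lines}",
        output_format.strip(),
    ])
```

Calling it with an empty constraints list raises a ValueError naming the gap, so the omission shows up before the prompt is ever sent.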

Worked example: weak vs strong

Suppose you want to refactor a Python function.

❌ Weak prompt

refactor this function

def process(d):
    r = []
    for x in d:
        if x['s'] == 'active':
            r.append(x['id'])
    return r

✅ Strong prompt

Act as a senior Python reviewer.

Refactor this function for readability: use descriptive names, prefer list
comprehensions where idiomatic, add type hints, and add a brief docstring.
Do not change the function signature.

def process(d):
    r = []
    for x in d:
        if x['s'] == 'active':
            r.append(x['id'])
    return r

Return the refactored function only, no explanation.

The strong version takes 20 seconds longer to write and gives consistently better output. It also tells you, the writer, exactly what you wanted before you ever ran the prompt — which is often the actual hard part of the job.
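
For reference, here is one hand-written output that would satisfy the strong prompt. It is an illustration, not a recorded model response. It reads "do not change the function signature" as "keep the name process and the single parameter d", which is an assumption about the prompt's intent.

```python
from typing import Any


def process(d: list[dict[str, Any]]) -> list[Any]:
    """Return the ids of all records whose 's' (status) field is 'active'."""
    return [record["id"] for record in d if record["s"] == "active"]
```

Notice that "add type hints" and "do not change the function signature" pull in slightly different directions. If existing callers matter, say so explicitly in the prompt rather than leaving the model to guess.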

On role-setting (and when it stops mattering)

"Act as a senior X" used to make a noticeable difference. With current models it's a smaller effect, but still useful — not because the model becomes more skilled, but because role-setting often nudges the output toward a specific style (concise, technical, opinionated) that you actually want. Use it when it tightens the output, skip it when it doesn't.

On output format (do not skip this)

Of the five components, output format is the one people skip most and lose the most quality from. If you don't specify the format, the model picks one — usually verbose prose with apologetic hedging. If you do specify it ("a bulleted list of file:line findings," "JSON matching this schema," "the refactored function only, no explanation"), the output gets dramatically more usable.

The single highest-leverage change you can make: add an explicit output format to every prompt. It costs you one sentence and saves you the cleanup.

Iteration is part of the process

Almost nobody writes the perfect prompt the first time. Strong prompt engineering is iterative: run it, see what's missing or wrong, edit the prompt (not just nudge with a follow-up), run again. After three iterations you usually have a prompt that's good enough to save somewhere reusable.

An important wrinkle: it's tempting to fix a bad output by sending a follow-up message ("that's not quite right, please also include X"). This works in chat, but the resulting context gets long, muddled, and hard to reuse. If you're building something repeatable — a script, a workflow, a template — edit the original prompt instead. The follow-up pattern is for exploration; the editing pattern is for production.

Things that don't move quality (skip these)

A surprising amount of prompt folklore doesn't survive controlled testing. You can save effort by ignoring:

Politeness phrases. "Please," "thank you," "I would really appreciate it." No measurable effect on quality. Be polite if you want to (it's your habit, not the model's).
Threats and emotional appeals. "This is very important to my career." "I'll be very upset if this is wrong." Discredited and slightly weird. Don't do this.
Tipping promises. Briefly popular online. Doesn't hold up. Doesn't pay either.
All-caps emphasis. Mostly noise. Specific constraints work; shouting doesn't.

What does work: the five-component structure above, examples (next section), and clear constraints.

Exercise 1 — Rewrite weak prompts

Below are three weak prompts. Rewrite each to include all five structure components. There's no one right answer — the test is whether your version forces specificity.

"Help me debug this error." (Assume you have a stack trace and the relevant code.)
"Write tests for my function."
"Summarize this document." (Assume a 30-page technical report.)

Hint: ask yourself "what would a junior engineer need to know to do this task well?" and put that information in the prompt.

3. Context and Examples

Section 2 was about structure. This one is about substance — what you actually put inside that structure. There are three big levers: relevant context, worked examples (few-shot prompting), and step-by-step decomposition.

3.1 Provide relevant context

Context means anything the model needs to know to do the task that isn't general knowledge. The model knows Python; it doesn't know your codebase's conventions. The model knows SQL; it doesn't know your table schema. The model knows how to write a customer support reply; it doesn't know your tone of voice.

Common context to consider including:

Task type	Useful context to paste in
Code refactor	The function, related helpers, the existing test, the codebase's style guide if you have one
SQL query	Table schemas with column types, the SQL dialect (Postgres / BigQuery / Snowflake), sample rows if shape isn't obvious
Bug investigation	The error/stack trace, the failing input, the relevant code, what you've already tried
Documentation	The code being documented, an example of existing docs in the target style
API design	Use cases / user stories, related existing endpoints, any constraints (auth, versioning, rate limits)
Data analysis	Column descriptions, sample of the data, what question you're trying to answer, what answers would mean what

The cost of including too much context is mild (slower response, slightly worse focus). The cost of including too little is severe (wrong assumptions, fabricated function names, generic advice). When in doubt, include more.

3.2 Few-shot examples

Showing the model two or three input-output examples is one of the highest-leverage techniques available, especially for tasks with a specific format or style. The mechanism is straightforward: the model is a pattern continuer, so showing it the pattern works dramatically better than describing the pattern in words.

When few-shot helps most

Tasks with a specific output format (extracting data into JSON, generating commit messages in a house style, formatting logs)
Tasks with subtle judgment calls (classifying support tickets by severity, deciding when a code smell is or isn't actually a problem)
Tasks where the rules are easier to demonstrate than describe ("good" vs "bad" code comments, in-house naming conventions)

Worked example: classifying log lines

You want the model to classify log lines as INFO / WARN / ERROR / CRITICAL.

❌ Description only

Classify each log line as INFO, WARN, ERROR, or CRITICAL based on severity.

Log line: Connection retry 3/5 to downstream API succeeded.
Classification:

✅ With examples

Classify each log line as INFO, WARN, ERROR, or CRITICAL.

Examples:
Log: User 4521 logged in.
Class: INFO

Log: Cache miss rate above 50%.
Class: WARN

Log: Payment service unreachable; falling back to retry queue.
Class: ERROR

Log: Database master unreachable; all writes failing.
Class: CRITICAL

Log: Connection retry 3/5 to downstream API succeeded.
Class:

The example-laden version doesn't just produce a more reliable classification; it also implicitly teaches the model where your team's threshold sits between WARN and ERROR. That threshold is almost impossible to describe in prose but easy to show in a handful of examples.
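
When the same classification runs from a script, the few-shot prompt above can be generated from a list of labeled examples. A minimal sketch: the helper name build_few_shot_prompt is made up, and the examples are the ones from the worked prompt.

```python
# Labeled examples taken from the worked prompt above.
EXAMPLES = [
    ("User 4521 logged in.", "INFO"),
    ("Cache miss rate above 50%.", "WARN"),
    ("Payment service unreachable; falling back to retry queue.", "ERROR"),
    ("Database master unreachable; all writes failing.", "CRITICAL"),
]


def build_few_shot_prompt(log_line: str, examples: list[tuple[str, str]] = EXAMPLES) -> str:
    """Render the instruction, the labeled examples, and the new line to classify."""
    header = "Classify each log line as INFO, WARN, ERROR, or CRITICAL.\n\nExamples:"
    shots = "\n\n".join(f"Log: {log}\nClass: {label}" for log, label in examples)
    # End on an open "Class:" so the natural continuation is a single label.
    return f"{header}\n{shots}\n\nLog: {log_line}\nClass:"


if __name__ == "__main__":
    print(build_few_shot_prompt("Connection retry 3/5 to downstream API succeeded."))
```

Keeping the examples in a list makes it cheap to add one when the model gets a new input wrong.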

How many examples?

Two to five is the sweet spot. One example doesn't establish a pattern; the model treats it as a one-off. Ten or more starts to crowd the context and rarely helps more than five. Pick examples that cover the boundaries of your task — edge cases, ambiguous cases, the easy default case.

3.3 Decompose hard tasks

If you ask the model to do five things in one prompt — "read this code, find the bugs, propose fixes, write tests for the fixes, and update the documentation" — quality drops on every one of them. Models, like humans, do better with one well-defined task at a time.

Two ways to decompose:

Sequential decomposition. Break the task into steps and run each in its own prompt, feeding the output of one into the next. This is more work but produces dramatically better results for anything non-trivial. Example: first prompt produces a code review; you accept or reject findings; second prompt writes tests for the accepted findings; third prompt drafts a PR description.

Chain-of-thought (in-prompt decomposition). Ask the model to think step-by-step within a single response. Phrases like "work through this step by step before answering," "first identify the key constraints, then propose options, then pick one," or "think out loud before giving your final answer" reliably improve output on reasoning tasks. Newer reasoning-focused models do this internally and you don't need to ask, but it doesn't hurt.

Decomposition rule of thumb: if the task takes a human five steps to do well, it takes the model five prompts (or one prompt explicitly asking for the five steps). One prompt does not equal one task.

Exercise 2 — Build a few-shot prompt

Pick a small classification or extraction task from your own work — categorizing support tickets, tagging PR titles, extracting fields from logs, deciding whether a code change needs review, etc. Write a prompt with three to five worked examples covering: the obvious case, an ambiguous case, and an edge case. Run it on five new inputs. Note which ones it gets right and which it gets wrong — the failures usually point to a missing example.

4. Structured Outputs

If your downstream code needs to parse the output, asking nicely in the prompt is the wrong approach. "Please return valid JSON" works most of the time, which means it breaks regularly — trailing commas, narrative preamble, markdown code fences, hallucinated fields, missing commas. If you've ever written a regex to clean up an LLM's JSON, you've felt this pain.

The fix is to use a structured output feature. All major API providers (Anthropic, OpenAI, Google, etc.) support some form of it. The exact name varies ("structured outputs," "JSON mode," "response format," "tool use"), and so does the guarantee: schema-enforcing features constrain the model to output that validates against a JSON schema you provide, while plain JSON modes typically guarantee only syntactically valid JSON. Check which one you are getting.

This is dramatically more reliable than prompt-only requests. The model still has to be smart enough to fill in the right values, but it can't accidentally emit invalid JSON.

The three-layer pattern for production

In practice, you want defense in depth:

1. Use structured outputs / JSON mode / a schema-enforcement feature if the API supports it. This handles ~95% of cases.
2. Validate the output against your schema in code (using a library like pydantic, zod, ajv, etc.). This catches the rest.
3. Have a repair or retry path for validation failures: log them, attempt to fix, or re-prompt with the error message.

Step 1 alone prevents most issues. Step 2 catches what slips through, and step 3 decides what happens when it does. Doing all three is what lets you run LLM-powered features in production without writing JSON parsers as a hobby.
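
Layers 2 and 3 can be sketched with the standard library alone. This is a simplified stand-in for a real validator such as pydantic. The two-field schema, the helper names, and call_model (a hypothetical wrapper around your model client) are all assumptions made for illustration.

```python
import json
from typing import Callable

# Hypothetical schema for a ticket classifier: field name -> expected type.
REQUIRED_FIELDS = {"issue_type": str, "urgency": int}


def validate(payload: object) -> list[str]:
    """Return a list of validation errors; an empty list means valid."""
    if not isinstance(payload, dict):
        return ["top-level value must be a JSON object"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            errors.append(f"{name} must be of type {expected.__name__}")
    return errors


def call_with_repair(call_model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Parse and validate the reply; on failure, re-prompt with the error message."""
    attempt_prompt = prompt
    errors: list[str] = []
    for _ in range(max_attempts):
        raw = call_model(attempt_prompt)
        try:
            payload = json.loads(raw)
            errors = validate(payload)
        except json.JSONDecodeError as exc:
            payload, errors = None, [f"invalid JSON: {exc.msg}"]
        if not errors:
            return payload
        error_text = "; ".join(errors)
        attempt_prompt = (
            f"{prompt}\n\nYour previous reply was rejected: {error_text}. "
            "Reply with corrected JSON only."
        )
    raise ValueError(f"no valid output after {max_attempts} attempts: {errors}")
```

The retry limit matters: without it, a prompt the model cannot satisfy turns into an unbounded loop of paid API calls.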

When you can't use structured outputs

Sometimes you're using a model or interface that doesn't support schema enforcement. In that case: (a) be very explicit about the format in the prompt, (b) give an example of valid output, (c) use a delimiter so you can extract just the JSON part from any surrounding text, (d) parse defensively with try/except and a repair path.

# Defensive parsing pattern when schema enforcement isn't available
import json
import re

def extract_json(text: str):
    # Strip markdown code fences if present
    text = re.sub(r'^```(?:json)?\s*|\s*```$', '', text.strip(), flags=re.M)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Try to find the first JSON object/array in the text
        match = re.search(r'(\{.*\}|\[.*\])', text, re.S)
        if match:
            return json.loads(match.group(1))
        raise
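
Option (c), the delimiter, can be sketched the same way. This assumes your prompt told the model to wrap its JSON in <json> and </json> tags. The tag name is an arbitrary choice for this example, not a provider feature.

```python
import json
import re

# Non-greedy match so only the first delimited block is captured.
JSON_BLOCK = re.compile(r"<json>(.*?)</json>", re.S)


def extract_delimited_json(text: str):
    """Return the parsed JSON found between <json> and </json> tags."""
    match = JSON_BLOCK.search(text)
    if match is None:
        raise ValueError("no <json>...</json> block found in model output")
    return json.loads(match.group(1))


if __name__ == "__main__":
    reply = 'Sure, here you go:\n<json>{"status": "ok", "ids": [1, 2]}</json>\nAnything else?'
    print(extract_delimited_json(reply))
```

A delimiter makes extraction unambiguous even when the surrounding chatter contains braces of its own, which is where a brace-matching regex tends to go wrong.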

A note on temperature

Temperature controls how "creative" or "random" the output is. For structured tasks (extraction, classification, code generation, anything where there's a right answer), set temperature low — 0 or close to it. For creative tasks (brainstorming, copywriting), higher temperatures (0.7–1.0) help.

Lowering temperature does not fix hallucinations — the model just hallucinates the same wrong thing more consistently. It does help with format consistency.
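
To see why, it helps to look at what temperature does to the next-token distribution. The sketch below uses the standard temperature-scaled softmax (divide the scores by the temperature, then normalize). The three scores are made-up numbers, and real samplers usually add further steps such as top-p or top-k filtering.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as 'always take the top token'")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
    for temperature in (0.2, 1.0):
        probs = softmax_with_temperature(scores, temperature)
        print(temperature, [round(p, 3) for p in probs])
```

At low temperature nearly all the probability lands on the top-scoring candidate. If that candidate is wrong, you get the same wrong answer every time, which is consistency, not accuracy.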

5. Choosing the Right Pattern

Once you're past chat-and-prompt, there are several distinct patterns for building AI-powered functionality. Picking the wrong pattern is one of the most expensive mistakes: most patterns can be forced to work for most tasks, so a poor fit isn't obvious up front, but each one fits a specific shape of problem. Here's how to choose.

Pattern	What it is	When to use it
Chat / prompt	Direct prompts to a model, possibly with conversation history.	One-off tasks. Exploration. Anything where context fits comfortably in a prompt and freshness doesn't matter.
Long context	Loading a large document, transcript, or codebase into a single prompt (100K+ tokens).	Analysis of one specific large artifact in one session. Not for persistent state across sessions.
RAG (retrieval-augmented generation)	Index a corpus; at query time, retrieve relevant chunks and include them in the prompt.	Grounding answers in a large, frequently-updated knowledge base (internal docs, wikis, codebases). Anywhere you need citations.
Structured outputs	Constraining the model's output to a JSON schema.	Whenever the output is consumed by code, not a human. Use defensively even for human-consumed output where format matters.
Tool use / agents	Letting the model call functions (search, calculator, code execution, API calls) and iterate.	Tasks that need fresh data, external actions, or multi-step problem solving with feedback. Use with care: more powerful but harder to control.
Fine-tuning	Training a model on examples to bake in a behavior, style, or format.	Consistent stylistic or behavioral patterns that prompts struggle to maintain. Or as a cost optimization for high-volume use cases. Not for fresh facts.

The two patterns most often picked wrong

"I want it to know our internal docs — should I fine-tune?"

Almost always no. Fine-tuning teaches behaviors and styles, not facts. Facts that come from a body of documents go in via retrieval (RAG). The reasons: (a) the docs change, and re-fine-tuning every week is wasteful, (b) RAG gives you citations back to source documents, which fine-tuning can't, (c) RAG lets the model say "I couldn't find this in the docs," which fine-tuned models tend not to do — they hallucinate confidently instead.

"I have a long context window, can I just paste everything every time?"

Works for one-off analysis. Doesn't work as a substitute for retrieval. Long context windows are per-session loading, not persistent memory: every new session starts empty, so you'd be paying to re-process the whole corpus on every query. RAG retrieves only the chunks relevant to the current query, which is cheaper and usually more accurate (the model doesn't get distracted by 200,000 tokens of irrelevant material).

RAG vs. fine-tuning, in one line: RAG is for facts you need to look up. Fine-tuning is for behaviors you need to bake in. If you're tempted to do the opposite, stop and re-check.
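
The retrieval half of RAG can be sketched in a few lines. This toy version scores documents by word overlap with the query. Production systems typically use embeddings and a vector index instead, so treat the scoring as a placeholder, and the function names and corpus shape as hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[tuple[str, str]]:
    """Return up to k (doc_id, text) pairs sharing the most words with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(body)), doc_id) for doc_id, body in corpus.items()]
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [(doc_id, corpus[doc_id]) for score, doc_id in ranked[:k] if score > 0]


def build_grounded_prompt(query: str, corpus: dict[str, str], k: int = 2) -> str:
    """Build a prompt containing only the retrieved chunks, labeled for citation."""
    chunks = retrieve(query, corpus, k)
    sources = "\n\n".join(f"[{doc_id}]\n{body}" for doc_id, body in chunks)
    return (
        "Answer using only the sources below and cite source ids in brackets. "
        "If the answer is not in the sources, say you could not find it.\n\n"
        f"Sources:\n{sources or '(no relevant sources found)'}\n\nQuestion: {query}"
    )
```

Only the retrieved chunks reach the model, each labeled with an id it can cite, and the instruction gives it an explicit way to say the answer isn't there.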

Agents and tool use — the part most often over-engineered

Agents (models that can call tools and iterate) are powerful and currently fashionable. They're also the easiest pattern to over-engineer. Start with the simplest thing that works:

Can a single well-structured prompt solve it? Use that.
Can a single prompt with retrieved context solve it? Use RAG.
Does the task genuinely require external actions (looking up live data, executing code, navigating an API) or multi-step iteration? Then use an agent.

If you find yourself building a three-step agent for something that was really one prompt plus a function call, you're working harder than necessary and getting less reliable output.
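
For a sense of what "an agent" adds over a single prompt, here is a bare-bones tool loop. The reply protocol is invented for this sketch: the model is assumed to answer with JSON that is either {"tool": ..., "args": ...} or {"final": ...}. Real provider APIs have their own tool-calling formats, and call_model is a hypothetical wrapper.

```python
import json
from typing import Callable


def run_agent(
    call_model: Callable[[str], str],
    tools: dict[str, Callable[..., str]],
    task: str,
    max_steps: int = 5,
) -> str:
    """Loop: ask the model, run the tool it requests, feed the result back."""
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        reply = json.loads(call_model(transcript))
        if "final" in reply:
            return reply["final"]
        name, args = reply["tool"], reply.get("args", {})
        if name in tools:
            result = tools[name](**args)
        else:
            # Report the mistake back instead of crashing, so the model can recover.
            result = f"error: unknown tool {name!r}"
        transcript += f"\nTool {name}({args}) returned: {result}"
    raise RuntimeError(f"no final answer within {max_steps} steps")
```

Even this toy version needs a step limit, an unknown-tool path, and a growing transcript. That overhead is why a single prompt plus an ordinary function call is preferable whenever it is enough.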

Exercise 3 — Pattern selection

For each of the following scenarios, decide which pattern (or combination) fits best:

1. Your team wants an internal assistant that answers questions from a 50,000-page engineering wiki, updated weekly. Answers should cite specific wiki pages.
2. You want an LLM to consistently write commit messages in your team's specific format (subject + body, imperative mood, includes ticket number).
3. You want to summarize one specific 300-page contract a colleague sent you this morning.
4. You want a script that reads incoming customer support tickets, classifies them by issue type and urgency, and routes to the right team.
5. You want an LLM-powered assistant that can answer questions about a customer account by pulling live data from your CRM, billing system, and support history.

Answers in the appendix. Don't peek until you've worked through your own reasoning — the reasoning is more useful than the answer.

6. Common Failure Modes and How to Fix Them

When a prompt isn't producing what you want, the fix is rarely "try harder." It's almost always one of a small set of root causes. Here's how to diagnose.

Failure: output is cut off mid-task

What's happening: you hit a token limit, either on the response or the total context. The model stopped because it ran out of budget, not because it finished.

Fix: chunk the task. If you're refactoring a 400-line file, split it into smaller logical pieces, refactor each, and reconcile. Asking the model to "continue from where you stopped" sometimes works but introduces reconciliation bugs at the seams. Re-running the same prompt almost never helps.
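
Chunking a source file can be sketched as below. This is a deliberately simple line-based splitter for Python files that cuts only before top-level def or class lines. The function names are made up, and a real tool would more likely use a parser (decorators and leading comments, for instance, end up attached to the previous block here).

```python
def split_top_level(source: str) -> list[list[str]]:
    """Split source lines into blocks, starting a new block at each top-level def/class."""
    blocks: list[list[str]] = []
    current: list[str] = []
    for line in source.splitlines():
        if line.startswith(("def ", "class ")) and current:
            blocks.append(current)
            current = []
        current.append(line)
    if current:
        blocks.append(current)
    return blocks


def chunk_source(source: str, max_lines: int = 80) -> list[str]:
    """Greedily pack whole blocks into chunks of at most max_lines lines."""
    chunks: list[str] = []
    current: list[str] = []
    for block in split_top_level(source):
        if current and len(current) + len(block) > max_lines:
            chunks.append("\n".join(current))
            current = []
        # A single block longer than max_lines is kept whole rather than cut mid-function.
        current.extend(block)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Cutting at definition boundaries keeps each chunk a logical piece, which is what makes the reconcile step afterwards manageable.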

Failure: model hallucinates function names, library APIs, or facts

What's happening: the task is in the model's knowledge gap. It generates something plausible-shaped rather than admitting uncertainty (a quirk of how these models are trained — they're rewarded for fluent output, not for refusing).

Fix: bring the knowledge into the context window. Paste the relevant docs, or use a tool/agent pattern that lets the model look things up. Asking "are you sure?" is unreliable and often makes the model flip to a different wrong answer. We cover verification in depth in Guide 2.

Failure: output is generic, vague, or hedged

What's happening: your prompt was generic. The model gives you what you ask for.

Fix: tighten the constraints. Add specific requirements (length, audience, format, what to exclude). Switch from open-ended ("tell me about X") to specific ("give me three concrete examples of X with citations"). Add a role/audience: "explain to a senior engineer who already understands the fundamentals."

Failure: structured output keeps breaking your parser

What's happening: prompt-only requests for JSON are unreliable, even with explicit instructions.

Fix: use structured outputs / JSON mode / schema enforcement (see Section 4). If unavailable, parse defensively with retry-and-repair.

Failure: model loses focus on long, multi-part tasks

What's happening: you asked for too much in one prompt. Quality degrades across each sub-task.

Fix: decompose. One prompt per task. See Section 3.3.

Failure: output is technically correct but unusable

What's happening: the model answered the question you asked, not the question you meant.

Fix: read your prompt back and ask whether a smart but new colleague could do the task from it. What did you assume that they wouldn't know? What did you mean by "good," "useful," "appropriate"? Pin those down.

Failure: model claims an authority that isn't real

What's happening: the model writes things like "this is the standard approach used by Google" or "industry best practice recommends X." It doesn't have meaningful access to which approach Google actually uses. These claims are almost always unverified pattern-matching on what sounds authoritative.

Fix: don't update on appeals to authority from an LLM. If the claim matters, verify it directly. If you're using the model for ideation, ignore the appeals; if you're using it for facts, verify or use RAG.

7. Workflow Habits

Knowing techniques isn't the same as having habits. A few habits that distinguish people who get high value from LLMs day to day:

Save your prompts. Anything you write twice should be a template. A simple prompts.md file in your repo, a snippet manager, or a team wiki page — doesn't matter where, just somewhere. Most prompts go from acceptable to good after three or four iterations; throw that work away every time and you're paying the iteration cost on every task.

Use the right interface for the job. Chat for exploration. API for anything you'll run more than a few times. IDE integrations (Copilot, Cursor) for code, where the context comes from the file you're in. Don't use chat to do what a script should do.

Run prompts twice on important things. If the answer matters, the variance matters. Two or three runs expose most one-off bad outputs. They don't expose errors the model makes consistently, so agreement across runs is a reason for less suspicion, not proof.

Keep a "wins and losses" log. When a prompt produces something genuinely useful, note what worked. When it fails badly, note why. Over a month you build personal intuition that no general guide can give you.

Don't paste a wall of text and hope. If you're including a long document, tell the model what to do with it ("answer only based on this document; if the answer isn't here, say so"). Long unstructured context produces long unstructured output.

When the task feels stuck, change the shape, not the prompt. If you've iterated three times on a prompt and it's still not working, the prompt isn't the problem — the pattern is. Maybe it should be RAG. Maybe it should be decomposed. Maybe it shouldn't be an LLM task at all. Step back.

Reference Appendix

A. The prompt template

[ROLE]      Act as a <senior X / experienced Y>.
[CONTEXT]   <Relevant background, schema, constraints, what's been tried>
[TASK]      <The specific thing to do>
[INPUTS]    <The actual material — code, data, document>
[CONSTRAINTS] <Length / style / dialect / what NOT to do>
[FORMAT]    <Exactly how the output should look>
[EXAMPLES]  <2–5 input → output examples if format or judgment is tricky>

You don't need the labels in your actual prompt. They're just a checklist.

B. Quick reference: which pattern when

If you want…	Use
To answer one question, ad hoc	Chat
To analyze one specific large document	Long context
To answer questions grounded in a large/changing corpus	RAG
Output your code will parse	Structured outputs
The model to call APIs, fetch data, or run tools	Agent / tool use
Consistent style or behavior baked in	Fine-tuning
To do many similar tasks at scale	API + saved prompts

C. Levers that work, in order of leverage

Add explicit output format — biggest cheap win
Add relevant context — usually unlocks the biggest quality jump
Add 2–5 few-shot examples — especially for format and judgment tasks
Decompose into smaller tasks — one prompt per task
Switch pattern (chat → RAG → agent) if structure isn't the issue
Lower temperature for deterministic tasks
Specify what NOT to do — sometimes more useful than what to do

D. Levers that don't work (don't bother)

Politeness, threats, emotional appeals, tipping promises
ALL-CAPS emphasis
"Are you sure?" as a verification mechanism (see Guide 2)
Re-running the same prompt to fix truncated or consistently wrong output (re-runs reveal variance; they don't fix a token limit or missing context)
Asking the model to estimate its own confidence

E. Common failure modes — quick diagnosis

Symptom	Likely cause	Fix
Output cut off	Token limit	Chunk the task
Hallucinated facts/APIs	Knowledge gap	Add context or use RAG
Generic, hedged output	Vague prompt	Tighten constraints
Broken JSON	No schema enforcement	Use structured outputs
Quality drops on multi-part tasks	Too much in one prompt	Decompose
Correct but useless	Asked the wrong question	Re-read your prompt as a stranger

F. Answers to Exercise 3 (pattern selection)

1. RAG. Large, frequently-updated corpus + need for citations = textbook RAG case. Long context wouldn't scale; fine-tuning wouldn't handle weekly updates or citations.
2. Fine-tuning, or few-shot prompting. This is a behavior (stylistic consistency), not a fact lookup. For low volume, a saved few-shot prompt is enough. For high volume, fine-tune.
3. Long context. One specific large document, one session. Don't build a RAG pipeline for a single contract.
4. Structured outputs + few-shot prompting. The classification is the task; structured outputs ensure the routing code can parse it; few-shot examples teach your team's threshold judgments.
5. Agent / tool use. Needs fresh data from multiple live systems. RAG over static exports would be stale; chat alone can't pull live data. The model needs to call APIs.

G. Further reading and habits

Save anything that worked: prompts.md, snippet manager, team wiki
Read your team's AI usage policy (covered in Guide 3)
Practice on real work, not toy puzzles — your own tasks are the best teacher
Re-run your prompts after model upgrades; what was needed before may not be needed now

Continue to Guide 2: Evaluating AI Outputs — how to spot hallucinations, build verification habits, and avoid the most common trust failures.
