Practitioner track
Module 05 of 6

Reviewing AI-Generated Work

Apply a rubric-based review to AI-generated artifacts — and refuse the ones that fail it.

75 MINFOCUS · REVIEW

Outcome: Apply rubric-based review; refuse output that fails it.

Introduction

AI makes producing artifacts cheap. That moves the bottleneck — and the risk — to review. A team that generates specs and plans ten times faster but reviews them at the old standard ships ten times the defects. The practitioner's job is to be the gate that AI fluency can't talk its way past.

This module gives you a six-dimension rubric for reviewing COMPASS artifacts, the failure modes to hunt for, and the mechanics of returning work that doesn't pass. It assumes the verification mindset from AI Literacy Guide 2 — that plausible and correct come apart, silently.

What you'll be able to do by the end

  • Score an AI-generated artifact across six review dimensions
  • Spot the failure modes that fluent output hides
  • Return an artifact with actionable feedback instead of silently fixing or rubber-stamping it

Prerequisite: AI Literacy Series — Guide 2 (Evaluating AI Outputs). The "verify against the source, not the model" habit is the foundation this builds on.


1. Review is a gate, not a courtesy

In COMPASS, an artifact doesn't reach ACCEPTED because it exists — it earns the status by passing review. Accepting AI-generated work because "it compiled" or "it reads well" is the exact trap Guide 2 warned about: fluency is what the model is best at, and it tells you nothing about correctness.

The bar is simple and non-negotiable: AI-generated artifacts meet the same standard as human-generated ones. Same rubric, same right to be rejected.


2. The six-dimension review rubric

Score each artifact on six dimensions. Any dimension can be a blocker — a single critical failure is enough to keep the artifact in REVIEW.

#DimensionThe questionBlocker when…
1GroundingIs every specific claim traceable to a real source — the upstream artifact, the docs, the codebase?Specifics are fabricated or unverifiable
2CompletenessDoes it satisfy its acceptance criteria and cover the obvious edge cases?A stated criterion is unaddressed
3TraceabilityDoes it link correctly to what it derives from / implements?Links are missing or point at the wrong parent
4Fidelity fitIs the depth right for this pass (Discovery / Definition / Execution)?Too shallow to build from, or false precision
5Failure-mode hygieneAny hallucinated APIs, invented citations, forced quantities, plausible-but-wrong specifics?A single fabricated API or stat survives
6ActionabilityCan the next phase consume this without asking a clarifying question?The consumer would be blocked

The rubric isn't a scorecard you tally for a grade — it's a checklist that makes you look in the places fluent output hides problems. Dimensions 1, 5, and 6 are where AI-generated work fails most often and most quietly.


3. Failure modes to hunt for

Fluent output is designed to pass a casual read. These are the failures worth slowing down for:

  • Fabricated specifics in a grounded artifact. The spec is mostly faithful, but one number, function name, or API was filled in from pattern rather than the source. Check specifics against the original — don't trust the summary.
  • Plausible-but-wrong design. A reasonable-sounding architecture that doesn't actually satisfy a constraint from Clarify. Read it against the constraints, not in isolation.
  • Forced quantities. You asked for "five risks" and got five — including two that don't exist. Real coverage, not the count you requested.
  • Scope drift. A decomposition that quietly invents stories outside the spec, or a plan that solves a bigger problem than the one assigned.
  • Broken traceability. A plan that implements the wrong story, or a spec that doesn't link to the request it came from. The lineage lies, and everything downstream inherits the lie.
  • Confident appeals to authority. "This is the standard pattern used by [BigCo]." Treat as zero evidence; evaluate the design on its merits.

The reviewer's reflex When a claim is specific, verifiable, and you can't immediately verify it — stop and verify it, no matter how confident the artifact sounds. That single habit catches the majority of what slips through.


4. Returning an artifact

When something fails the rubric, you have three moves — and "silently fix it yourself" is not one of them (the producer, human or agent, never learns; the next artifact has the same flaw).

  • Comment on the version. Thread specific, actionable feedback on the exact artifact version: which dimension failed, where, and what "good" would be. "Acceptance criterion 3 isn't testable — give it a measurable threshold" beats "needs work."
  • Hold the status. Leave it in REVIEW. Don't advance it to ACCEPTED. The status machine is the team's signal of what's trustworthy; protect it.
  • Retreat the Journey, with a reason. If the gap is structural — the spec is wrong, not just rough — retreat to the earlier phase rather than patching forward. Log the reason; it's both feedback and audit.

And log a decision when the rejection reflects a real call ("rejected the agent's caching design — violates the latency budget from Clarify; here's the alternative"). That's the rationale the next reviewer, and the next agent, will need.

Actionable beats nice

Decline isn't rudeness; it's the kindest thing you can do for the work. A vague "looks off" forces a guessing game. A precise "dimension 5: cache.warm_async() doesn't exist in the library — verify against the docs and replace" can be acted on immediately, by a human or an agent. Write feedback the producer can execute without finding you.


Exercise — review against the rubric

Take an AI-generated feature_specification (generate one if you don't have it handy):

  1. Score it on all six dimensions. Force yourself to find at least one real issue per artifact — there almost always is one.
  2. Verify every specific claim against its source: does each referenced API, number, or constraint actually exist where it should?
  3. Write version-level comments for each issue, phrased so the producer could act without asking you anything.
  4. Decide: ACCEPT, hold in REVIEW with comments, or retreat with a reason — and justify the call in one sentence.

If you found nothing, you reviewed too fast. Re-read dimension 1 and dimension 5 — the fabrications are usually hiding exactly where the prose is smoothest. Module 06 turns this into a team skill: handoffs that survive a reviewer who wasn't in the room.