Prompt Engineering
& Structured Output
Domain 4 is the precision domain. It tests whether you know the difference between vague confidence-based instructions and specific categorical criteria, whether you understand that tool_use eliminates syntax errors but not semantic ones, and whether you know exactly when a retry will succeed versus when the information simply isn't there.
This domain maps to Scenario 5 (CI/CD) and Scenario 6 (Structured Data Extraction) — but its techniques appear in every scenario. Explicit criteria, few-shot examples, and schema-enforced output are universal production skills.
Design prompts with explicit criteria to improve precision and reduce false positives
The Core Concept
False positives are not a model quality problem — they are a prompt precision problem. When you tell a model to "check that comments are accurate," it will flag anything that might be inaccurate, including stylistic issues, subjective interpretations, and local conventions. When you tell it to "flag comments only when the claimed behaviour contradicts actual code behaviour," it has an unambiguous criterion to apply.
Vague vs Explicit Criteria
## Code Review Severity Criteria
### CRITICAL — Report immediately
- Null pointer dereference with no guard
- SQL injection via string concatenation
- Hardcoded credentials or API keys
Example: `db.query("SELECT * FROM users WHERE id=" + userId)`
→ Report as CRITICAL
### HIGH — Report with suggested fix
- Logic errors where output contradicts specification
- Race conditions in shared state
Example: `if (user.role = "admin")` (assignment not comparison)
→ Report as HIGH
### SKIP — Do not report
- Minor style differences from team convention
- Missing JSDoc on private methods
- Variable naming preferences
- Comments that could be more descriptive (but aren't wrong)
Example: `// get user` above getUserById() → SKIP
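The categorical scope above can also be enforced downstream. A minimal sketch (function and field names are illustrative, not from any specific library) of a post-filter that drops any finding outside the explicit REPORT severities before developers ever see it:

```python
# Hypothetical post-filter: only severities explicitly defined as
# reportable survive; everything else (SKIP, unknown labels) is dropped.
REPORT_SEVERITIES = {"CRITICAL", "HIGH"}

def filter_findings(findings):
    """Keep only findings whose severity is in the explicit REPORT set."""
    return [f for f in findings if f.get("severity") in REPORT_SEVERITIES]
```

This mirrors the prompt-side principle in code: scope is categorical, not confidence-based.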
The False Positive Trust Problem
High false positive rates cause cascading trust failure. When developers learn that a particular review category generates noise, they start ignoring all findings from that category — including the real bugs it correctly identifies. A 30% false positive rate in comment accuracy doesn't just waste time on 30% of reviews — it erodes confidence in the 70% of findings that were correct.
- Write criteria that define which issues to report (bugs, security vulnerabilities, logic errors) and which to skip (style preferences, local patterns, subjective quality)
- Temporarily disable high false-positive categories while improving their prompts — restores developer trust in the remaining categories
- Define explicit severity levels (CRITICAL / HIGH / MEDIUM / SKIP) with concrete code examples for each — eliminates ambiguous classifications
- Never rely on "be conservative" or confidence thresholds as substitutes for categorical scope definitions
Exam Traps for Task 4.1
| The Trap | Why It Fails | Correct Pattern |
|---|---|---|
| Add "only report high-confidence findings" to reduce false positives | Confidence-based filtering doesn't change what the model considers in scope — it still makes the same borderline decisions, just reports fewer of them. Root cause (vague scope) is unchanged. | Define categorical scope: explicitly list what IS in scope vs what is NOT. Remove ambiguous categories entirely. |
| Keep a high false-positive category enabled to avoid missing real issues | A category with 30% false positive rate destroys trust in all its findings, including the 70% that are real. Net effect: developers ignore more real issues than the category catches. | Temporarily disable the category; fix the criteria; re-enable once false positive rate is acceptable |
| Use vague severity labels ("low/medium/high") without definitions | Inconsistent classification across sessions — the same issue gets different severity labels on different runs. Breaks automated triage workflows. | Define each severity level with concrete code examples for each classification |
🔨 Implementation Task
Build an Explicit-Criteria Code Review Prompt
- Start with a vague prompt: "Review this code for bugs and style issues." Run it on 10 files. Classify each finding as true or false positive.
- Rewrite with explicit categorical criteria: define which 3 issue types to report and which 3 to skip. Run the same 10 files.
- Compare false positive rates — document the reduction.
- Add severity definitions (CRITICAL/HIGH/SKIP) with one concrete code example each.
- Identify the highest false-positive category in your results; write a prompt that disables it and removes the noise from developer view.
Exam Simulation — Task 4.1
Apply few-shot prompting to improve output consistency and quality
The Core Concept
Few-shot examples demonstrate the reasoning behind a decision, not just the decision itself. A good example shows why one choice was made over a plausible alternative — enabling Claude to apply that same reasoning to inputs it has never seen. Examples that only show inputs and outputs (without explaining the why) improve consistency but not generalisation.
Anatomy of an Effective Few-Shot Example
## Example 1 — Genuine Bug (REPORT) Code: const total = items.reduce((sum, item) => sum + item.price, 0); // Returns total including tax Finding: REPORT as HIGH Location: checkout.js line 42 Issue: Comment claims tax is included, but price field does not include tax per the schema in types.ts Suggested fix: Either update the comment to "Returns subtotal (excluding tax)" or add tax calculation Why reported: Claimed behaviour contradicts code behaviour. This could cause double-counting in tax reporting. ## Example 2 — Style Issue (SKIP) Code: // Get the user from db const user = await db.getUser(userId); Finding: SKIP Why skipped: Comment accurately describes what the code does. "From db" vs "from database" is a wording preference, not a behavioural contradiction. ## Example 3 — Ambiguous (Demonstrate the Decision) Code: // Validate input if (!user.email) throw new Error('Email required'); Finding: SKIP Why skipped: Comment describes the general category (validation) but not every validation rule. Missing specificity is not a contradiction. No false claim.
Show Reasoning, Not Just Answers
Include "Why reported" or "Why skipped" in every example. This enables Claude to apply the same reasoning to patterns it hasn't seen, rather than pattern-matching to the specific examples.
Include Ambiguous Cases
The most valuable examples show how to handle the grey area — the cases that instructions alone can't resolve. Show a borderline case and explain why it falls on one side of the line.
Demonstrate Exact Output Format
Use the same output structure in every example: Location, Issue, Severity, Suggested fix. Models reproduce the structure they're shown — inconsistent examples produce inconsistent output.
Varied Document Structures
For extraction tasks, include examples from documents with different formats: inline citations vs bibliographies, narrative descriptions vs structured tables. One format example fails to generalise.
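Consistency across runs is the measurable outcome of good few-shot examples. A small sketch (helper name is illustrative) of how the "same classification across 3 runs" metric could be computed:

```python
def consistency_rate(runs):
    """runs: one dict per run, mapping sample_id -> classification.
    Returns the fraction of samples classified identically in every run."""
    sample_ids = runs[0].keys()
    stable = sum(
        1 for sid in sample_ids
        if len({run[sid] for run in runs}) == 1  # one unique label = stable
    )
    return stable / len(sample_ids)
```

Run the same prompt three times over the sample set, then compare the rate with and without few-shot examples.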
Key Applications
- Create 2–4 targeted examples for ambiguous scenarios — show reasoning for why one action was chosen over plausible alternatives
- Demonstrate exact desired output format (location, issue, severity, suggested fix) in every example to achieve structural consistency
- Include examples showing acceptable patterns that are NOT issues — this reduces false positives while preserving generalisation to genuine bugs
- For extraction tasks: provide examples from multiple document formats (inline citations, bibliographies, narrative, tables) — single-format examples fail on novel structures
- For extraction of optional fields: include an example showing correct `null` output when information is absent — prevents hallucinated values
Exam Traps for Task 4.2
| The Trap | Why It Fails | Correct Pattern |
|---|---|---|
| Provide 10+ examples covering every possible case | Overly comprehensive examples inflate token usage without improving quality — 2–4 targeted examples at the ambiguous boundary are more effective than exhaustive coverage | 2–4 examples focused on the hardest cases — especially ambiguous ones where instructions alone are insufficient |
| Show only correct answers without reasoning | Without "why," Claude can only pattern-match to the specific examples — it cannot generalise the decision rule to novel patterns | Include explicit reasoning in each example: "Why reported / Why skipped" so Claude can apply the same logic to new inputs |
| Use one document format in extraction examples | Model learns to extract from that format but fails on documents with different structures (e.g., inline vs bibliography citations) | Include examples from each major document variant the system will encounter in production |
🔨 Implementation Task
Build a Few-Shot Prompt and Measure Consistency
- Design a code review prompt with 3 few-shot examples: one genuine bug, one false-positive-prone case that should be skipped, one genuinely ambiguous case showing the decision
- For each example, include the reasoning ("Why reported / Why skipped") — not just the verdict
- Run the prompt on 20 code samples. Measure consistency: how often does the same code get the same classification across 3 runs?
- Compare against the same 20 samples with instructions-only (no examples). Document the consistency improvement.
- For a structured extraction task: add examples from 3 different document formats. Measure how often required fields are correctly extracted vs left null vs hallucinated.
Exam Simulation — Task 4.2
…methodology field null in 35% of documents, even though the papers do contain methodology information — it's just described inline in the results section rather than in a dedicated "Methods" section. Detailed instructions about finding methodology information haven't helped. What is the most effective fix?

Enforce structured output using tool use and JSON schemas
tool_choice has three modes with very different guarantees.

The Core Concept
Asking Claude to "return JSON" in a prompt is unreliable — it can return markdown-wrapped JSON, prose before/after the JSON, or malformed JSON under pressure. Defining an extraction function via tool_use with a JSON schema guarantees syntax-valid output because the API enforces the schema structurally before returning the response.
tool_choice: Three Modes
"auto"
Model decides whether to call a tool or return text. May return conversational text instead of calling a tool. Use when tool calling is optional — e.g., the model might answer a simple question directly without needing a tool.
"any"
Model must call one of the available tools — cannot return plain text. Use when structured output is required and multiple extraction schemas exist (e.g., invoice vs contract vs receipt — the model selects which tool fits). Guarantees a tool is called.
{"type":"tool","name":"X"}
Forced tool selection — the model must call exactly this tool. Use when a specific extraction must run first — e.g., extract_metadata before enrichment tools. Guarantees not just that a tool is called, but which one.
The "auto" Trap
Setting tool_choice: "auto" when you need guaranteed structured output is a common exam distractor. "auto" means Claude may return text. For guaranteed output, use "any" or forced selection.
tools = [{
"name": "extract_invoice",
"description": "Extract structured data from an invoice document",
"input_schema": {
"type": "object",
"properties": {
"vendor_name": {"type": "string"},
"invoice_date": {"type": ["string", "null"]}, # nullable!
"total_amount": {"type": "number"},
"line_items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"description": {"type": "string"},
"amount": {"type": "number"}
}
}
},
"currency": {
"type": "string",
"enum": ["USD", "EUR", "GBP", "other"] # "other" for extensibility
}
},
"required": ["vendor_name", "total_amount", "line_items"]
# invoice_date is NOT required — may be absent
}
}]
response = client.messages.create(
model="claude-sonnet-4-20250514",
tools=tools,
tool_choice={"type": "any"}, # ← guarantees structured output
messages=[{"role": "user", "content": document_text}]
)
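The structured result still has to be pulled out of the response. A minimal sketch of a `parse_tool_result` helper, assuming the Messages API response shape (content is a list of blocks; `tool_use` blocks carry the schema-validated arguments in `input`):

```python
def parse_tool_result(response):
    """Return the structured input of the first tool_use block.
    Assumes Messages API shape: response.content is a list of blocks,
    each with a .type, and tool_use blocks carry an .input dict."""
    for block in response.content:
        if block.type == "tool_use":
            return block.input
    # With tool_choice "any" or forced selection this should never happen;
    # with "auto" it can — the model may have returned plain text.
    raise ValueError("No tool_use block — check tool_choice setting")
```

With `tool_choice: "auto"` this helper can raise; with `"any"` or forced selection the tool_use block is structurally guaranteed.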
Schema Design Principles
- Make fields nullable (optional) when source documents may not contain the information — prevents the model from fabricating values to satisfy a required field
- Use `enum` fields with an "other" + detail string pattern for extensible categorisation — captures known categories precisely while preserving unknown ones
- Add an "unclear" enum value for fields where the source data is ambiguous — explicit rather than guessing
- Include format normalisation rules in prompts alongside strict schemas — e.g., "dates should be ISO 8601 format" — schemas enforce structure but not format consistency
# ✗ Closed enum — fails on unknown contract types
"contract_type": {"type": "string", "enum": ["NDA", "SLA", "MSA"]}

# ✓ Open enum with fallback — handles unknown types
"contract_type": {
    "type": "string",
    "enum": ["NDA", "SLA", "MSA", "other"]
},
"contract_type_detail": {
    "type": ["string", "null"],
    "description": "Populated when contract_type is 'other'"
}
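The open-enum pattern implies a cross-field invariant the schema itself cannot enforce. A small post-validation sketch (field names follow the example above) that checks it:

```python
def check_open_enum(record):
    """Cross-field check for the open-enum pattern: 'other' requires a
    populated detail field, and a known type should leave it empty."""
    errors = []
    if record.get("contract_type") == "other" and not record.get("contract_type_detail"):
        errors.append("contract_type is 'other' but contract_type_detail is empty")
    if record.get("contract_type") != "other" and record.get("contract_type_detail"):
        errors.append("contract_type_detail populated for a known contract_type")
    return errors
```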
What Schemas Cannot Do
Tool use schemas enforce syntactic correctness. They cannot detect semantic errors — values that are syntactically valid but logically wrong: mismatched sums, dates outside valid ranges, data extracted into the wrong field.
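A minimal semantic-validation sketch (function name and tolerance are illustrative): it checks that line item amounts actually sum to the stated total — a constraint no JSON schema can express:

```python
def validate_invoice_semantics(invoice, tolerance=0.01):
    """Checks the schema cannot make: do line items sum to the total?"""
    errors = []
    calculated = sum(item["amount"] for item in invoice["line_items"])
    if abs(calculated - invoice["total_amount"]) > tolerance:
        errors.append(
            f"line_items sum to {calculated:.2f} but total_amount is "
            f"{invoice['total_amount']:.2f}"
        )
    return errors
```

Run checks like this after schema validation; an empty error list means the extraction passed both layers.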
Exam Traps for Task 4.3
| The Trap | Why It Fails | Correct Pattern |
|---|---|---|
| Use tool_choice: "auto" when structured output is required | "auto" allows Claude to return plain text instead of calling the tool — doesn't guarantee structured output | Use "any" (any tool must be called) or forced selection when extraction is required every time |
| Mark all fields as required in the schema to ensure complete extraction | Required fields on optional information force the model to hallucinate values rather than leave them null — produces worse output than nullable fields | Only mark truly mandatory fields as required; use nullable types for fields that may be absent in source documents |
| Assume schema validation eliminates all extraction errors | Schemas eliminate syntax errors only — semantic errors (wrong values, mismatched sums, misplaced data) pass schema validation | Add semantic validation layer: check computed totals, validate date ranges, verify cross-field consistency after schema validation |
🔨 Implementation Task
Build a Schema-Enforced Extraction Pipeline
- Define an invoice extraction tool with a JSON schema including: required fields (vendor, total), nullable fields (date, PO number), and an enum with "other" fallback for invoice type
- Test with `tool_choice: "auto"` — observe when Claude returns text instead of calling the tool. Then switch to `"any"` and confirm 100% tool calls
- Deliberately create a document where line items don't sum to the total. Verify the schema accepts the output (syntax valid). Add a post-validation check that flags the discrepancy (semantic validation)
- Test with a document missing the invoice date — verify the nullable field correctly returns null rather than a fabricated date
- Test forced tool selection: set `tool_choice: {"type": "tool", "name": "extract_metadata"}` and confirm only that tool runs on the first turn
Exam Simulation — Task 4.3
- …tool_use with a strict JSON schema, the team celebrates that JSON syntax errors have been eliminated. However, they are still finding invoices where the line_items amounts don't add up to the total_amount. What does this indicate?
- …tool_choice configuration ensures the model always calls one of the extraction tools rather than returning conversational text? Answer: "any" is the right choice when: (1) structured output is guaranteed (cannot return text), and (2) multiple tools exist and the model should choose the appropriate one. A is wrong: "auto" allows the model to return plain text — doesn't guarantee extraction. C is wrong: forcing a specific tool runs invoice extraction on contracts and POs, producing incorrect schema mappings. D is wrong: system prompt instructions are probabilistic — "any" provides a structural guarantee.
- …termination_date is before the start_date. What additional measure would most reduce the rejection rate?

Implement validation, retry, and feedback loops for extraction quality
The Core Concept
Retry-with-error-feedback is a powerful pattern: rather than sending the same prompt again, you append the specific validation error and ask the model to correct its output. This gives the model the information it needs to fix the error — but only if the error is fixable given the source document.
Retry-with-Error-Feedback
def extract_with_retry(document, max_retries=3):
    messages = [{"role": "user", "content": document}]
    for attempt in range(max_retries):
        response = extract(messages)
        result = parse_tool_result(response)

        # Validate the extraction result
        errors = validate_extraction(result)
        if not errors:
            return result  # ✓ Success

        # ✓ Append assistant response AND specific errors
        messages.append({"role": "assistant", "content": response.content})
        messages.append({
            "role": "user",
            "content": f"""Your extraction has validation errors. Please correct them:

Errors found:
{chr(10).join(f"- {e}" for e in errors)}

Original document for reference:
{document}

Please re-extract with these corrections applied."""
        })
    return result  # Return best attempt after max retries
When Retry Won't Help
The most important distinction in 4.4: retries fix errors caused by format or structural output problems. They cannot fix errors caused by missing information in the source document.
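That distinction can be made explicit in code. A hypothetical triage sketch (the error-type labels are illustrative, not from any API) separating retryable format errors from unfixable missing-information cases:

```python
# Illustrative error taxonomy: these labels are assumptions for the sketch.
RETRYABLE = {"format_error", "schema_mismatch", "malformed_value"}

def triage_error(error_type, field_present_in_source):
    """Decide whether retry-with-feedback can possibly help."""
    if error_type in RETRYABLE:
        return "retry"  # the model has something new to act on
    if error_type == "missing_required" and not field_present_in_source:
        # Retrying would only repeat null or fabricate a value
        return "human_review"
    return "retry"
```

Routing absent-information failures to human review (or accepting null) avoids burning retries that cannot succeed.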
Feedback Loop Design
detected_pattern Field
Add a detected_pattern field to structured findings so that when developers dismiss a finding, you can track which code constructs triggered false positives. Enables systematic prompt improvement based on real dismissal patterns.
Self-Validation Fields
Add calculated_total alongside stated_total — the model computes the sum of line items and compares. Add conflict_detected: boolean when source data has inconsistencies. Makes semantic errors visible without external logic.
Confidence Alongside Findings
Have the model self-report a confidence score per finding. Use this to route low-confidence extractions to human review. Enables calibrated routing without making every extraction manual.
Sample Before Batch
Run prompt refinement on a representative sample set (10–20 documents) before batch-processing large volumes. Maximises first-pass success rates and dramatically reduces costly iterative resubmission on thousands of documents.
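The confidence-routing card above can be sketched as a small function. The thresholds mirror the 1–5 scale used later in this guide (≥4 automated, ≤2 human review); the middle band name is an assumption:

```python
def route_finding(finding, auto_threshold=4, review_threshold=2):
    """Route by model-reported confidence on a 1-5 scale.
    Thresholds are illustrative defaults."""
    c = finding["confidence"]
    if c >= auto_threshold:
        return "automated"     # high confidence: act without a human
    if c <= review_threshold:
        return "human_review"  # low confidence: always show a human
    return "queue"             # middle band: hold for batch triage (assumed)
```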
Exam Traps for Task 4.4
| The Trap | Why It Fails | Correct Pattern |
|---|---|---|
| Retry when a required field is null because information isn't in the document | The model has nothing new to work with — it will either repeat null or fabricate a value to satisfy the requirement | Distinguish format/structural errors (retry will help) from absent information (route to human review or accept null) |
| Retry without providing the specific validation error | Sending the same prompt again produces the same output — the model has no information about what was wrong | Include the original document, the failed extraction, and the specific validation errors in the retry request |
| Run batch processing on 10,000 documents without sample testing | If the prompt has extraction gaps, they affect all 10,000 documents — expensive to reprocess the entire batch | Test on a representative 10–20 document sample first; fix extraction issues before running at scale |
🔨 Implementation Task
Build a Validation-Retry Pipeline with Semantic Checking
- Implement retry-with-error-feedback: include original document, failed extraction, and specific errors in the retry message
- Create test cases: (a) date format mismatch — confirm retry fixes it; (b) required field absent from document — confirm retry produces fabrication or repeated null, not a correct answer
- Add semantic validation: check that line item amounts sum to stated total; add `calculated_total` and `conflict_detected` fields to the schema
- Add `detected_pattern` field to code review findings; track which patterns developers dismiss; identify top 3 false-positive patterns after 50 reviews
- Test sample-before-batch: run on 10 documents, identify extraction gaps, fix prompt, then run on 100 — measure first-pass success rate improvement
Exam Simulation — Task 4.4
…publication_date fields. Investigation reveals that 8% of documents genuinely don't contain a publication date, while 7% have a date but in a non-standard format (e.g., "Published Spring 2023"). The team has a retry loop that retries all null extractions. What outcome should they expect for each group?

Design efficient batch processing strategies
The Core Concept
The Message Batches API is designed for workloads that can tolerate async processing — overnight reports, weekly audits, bulk document processing. Its 50% cost savings make it compelling. Its non-deterministic completion time (up to 24 hours) makes it incompatible with any blocking workflow.
Batch API Trade-offs and SLA Calculation
50% Cost Savings
Half the per-token cost of the synchronous API. For high-volume extraction on non-urgent workloads, this compounds significantly across millions of tokens.
Up to 24-Hour Window
No guaranteed completion time. Batches complete within 24 hours — but may complete in minutes. Never build a blocking workflow dependency on "usually faster" completion.
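The SLA arithmetic follows directly from the 24-hour worst case. A small sketch of the planning calculation (the implementation task later uses a 30-hour SLA as its example):

```python
from datetime import timedelta

BATCH_WINDOW = timedelta(hours=24)  # worst-case batch completion time

def latest_submission_lag(sla: timedelta) -> timedelta:
    """How long after data becomes available can you wait to submit and
    still meet the SLA, assuming worst-case batch completion?"""
    slack = sla - BATCH_WINDOW
    if slack < timedelta(0):
        raise ValueError("SLA tighter than the batch window — use the sync API")
    return slack
```

For a 30-hour SLA, batches must go out within 6 hours of data availability; anything tighter than 24 hours rules the batch API out entirely.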
custom_id Correlation
Every batch request takes a custom_id. Results are returned with the same ID, enabling correlation regardless of completion order. Essential for partial failure recovery.
No Multi-Turn Tool Calling
Single-turn only. If your workflow needs to call a tool, receive the result, then make a follow-up decision — use the synchronous API. Batch API cannot execute tools mid-request.
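Putting the custom_id principle into practice: a sketch of batch request construction. The request shape mirrors the Message Batches API as described above (each `params` is an ordinary messages.create payload); treat the exact field layout as an assumption to verify against current API docs:

```python
def build_batch_requests(documents, model="claude-sonnet-4-20250514"):
    """One batch request per document, keyed by a unique custom_id so
    results can be correlated regardless of completion order."""
    return [
        {
            "custom_id": doc_id,  # returned with the result for correlation
            "params": {
                "model": model,
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for doc_id, text in documents.items()
    ]
```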
Batch Failure Handling
def process_batch_results(batch_results, original_documents):
    successful = []
    failed = []
    for result in batch_results:
        doc_id = result.custom_id  # ← correlate using custom_id
        if result.result.type == "succeeded":
            successful.append(result)
        else:
            original_doc = original_documents[doc_id]
            failure_reason = result.result.error.type
            if failure_reason == "token_limit_exceeded":
                # ✓ Chunk the document and resubmit
                chunks = chunk_document(original_doc)
                failed.extend([(doc_id, chunk) for chunk in chunks])
            else:
                # ✓ Resubmit with same content
                failed.append((doc_id, original_doc))
    # ✓ Only resubmit failures, not the entire batch
    if failed:
        resubmit_batch(failed)
    return successful
- Match API approach to latency: synchronous for blocking pre-merge checks, batch for overnight/weekly non-urgent workloads
- Resubmit only failed documents identified by `custom_id` — not the entire batch
- For documents that exceeded context limits: chunk them before resubmitting rather than retrying as-is
- Test on a representative sample before batch-processing large volumes — maximises first-pass success and reduces costly resubmission cycles
Exam Traps for Task 4.5
| The Trap | Why It Fails | Correct Pattern |
|---|---|---|
| Use batch API for a blocking pre-merge check that developers wait for | Batch API has up to 24-hour completion time — developers cannot wait for a pre-merge check to complete in an unbounded window | Synchronous API for any blocking workflow; batch only for non-blocking async workloads |
| Use batch API for an agentic extraction workflow that calls tools mid-extraction | Batch API does not support multi-turn tool calling within a single request — tool calling requires the synchronous API | Use synchronous API for tool-calling workflows; batch for single-turn extraction where results can be returned in one pass |
| Resubmit the entire batch when a few documents fail | Reprocessing successful documents wastes 50x the cost of targeted resubmission — and correct results may be overwritten | Use custom_id to identify and resubmit only failed documents |
🔨 Implementation Task
Build a Batch Extraction Pipeline with Failure Recovery
- Submit a batch of 50 documents with unique `custom_id` values for each. Verify result correlation works correctly.
- Deliberately include 3 oversized documents that exceed context limits. Implement chunking-based resubmission only for those failures.
- Calculate the submission frequency for a 30-hour SLA given the batch API's 24-hour maximum processing window.
- Run a sample of 10 documents with your extraction prompt before batch-processing all 50. Identify and fix any extraction gaps. Measure first-pass success improvement.
- Design a decision matrix: given a new workflow, what 3 questions determine whether it should use batch vs synchronous API?
Exam Simulation — Task 4.5
Design multi-instance and multi-pass review architectures
The Core Concept
When Claude generates code and then reviews it in the same session, it retains the reasoning context from generation — making it less likely to question its own architectural decisions. An independent review instance (a fresh session with no generation context) approaches the code with fresh eyes and is measurably more effective at catching subtle issues.
The Self-Review Limitation
Review Architectures
Independent Instance Review
After generation, spawn a second Claude instance with no generation history. Provide only the code. The independent instance reviews without reasoning bias from the generation phase.
Per-File + Integration Pass
For multi-file reviews: pass 1 analyzes each file individually for local issues (consistent depth, no attention dilution). Pass 2 is a separate cross-file integration review examining data flow, interface contracts, and cross-module patterns.
Confidence-Scored Routing
Run a verification pass where the model self-reports confidence alongside each finding. Route high-confidence findings to automated action, low-confidence findings to human review. Calibrates automation vs human oversight.
When to Use Each
Independent instance: after any generation. Per-file + integration: PRs with 5+ files. Confidence routing: high-stakes automated actions. Combination: large PRs with auto-fix capability.
- Use a second independent Claude instance without the generator's reasoning context — not self-review instructions in the same session
- Split large multi-file reviews into per-file local analysis passes plus a separate cross-file integration pass — avoids attention dilution and contradictory findings
- Run verification passes with model-reported confidence scores to enable calibrated routing between automated action and human review
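The first point above is structural, not instructional: the review conversation simply never contains the generation turns. A minimal sketch (helper name is illustrative):

```python
def build_independent_review(generation_messages, generated_code):
    """Start the review conversation fresh: only the code, none of the
    generation history, so the reviewer carries no reasoning bias."""
    review_messages = [{
        "role": "user",
        "content": f"Review the following code for bugs:\n\n{generated_code}",
    }]
    # generation_messages is deliberately ignored — carrying it over would
    # reintroduce the generator's design decisions into the review context.
    return review_messages
```

Send `review_messages` to a new session; the independent instance sees the code exactly as an outside reviewer would.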
Exam Traps for Task 4.6
| The Trap | Why It Fails | Correct Pattern |
|---|---|---|
| Add "critically review your own output" instruction in the same session | The reasoning context from generation is still present — the model retains its design decisions and is less likely to question them, regardless of the instruction | Use a second independent instance with no generation history — provides architectural bias removal, not just instructional reminder |
| Switch to a larger context model to review 14 files in one pass | Context window size doesn't solve attention dilution — models still produce inconsistent depth across many files in a single pass | Per-file passes for consistent depth, plus a separate integration pass for cross-file issues |
| Run three consensus-based passes to filter findings — only report issues appearing in 2/3 runs | Consensus suppresses intermittently-detected real bugs — issues that are only sometimes caught are still real issues, just hard to detect | Independent instance review; per-file passes; confidence-scored routing — not consensus filtering |
🔨 Implementation Task
Build a Two-Instance Review Pipeline
- Generate a sorting algorithm in one Claude session. In the same session, ask Claude to review it for bugs. Document what it finds (or doesn't).
- In a fresh Claude session, provide only the generated code with no generation history. Ask for a review. Compare findings with the self-review. Document the difference.
- Create a 6-file PR. Run a single-pass review. Document inconsistency in depth and any contradictory findings.
- Re-run with per-file passes (6 separate reviews) + 1 integration pass. Compare quality and consistency.
- Add confidence scoring to the review output: have Claude rate each finding 1–5. Define routing rules: ≥4 → automated comment, ≤2 → human review queue.