Prompt Engineering & Structured Output — Complete Lesson

Domain 4 · 20% of exam · 6 Task Statements

Domain 4 is the precision domain. It tests whether you know the difference between vague confidence-based instructions and specific categorical criteria, whether you understand that tool_use eliminates syntax errors but not semantic ones, and whether you know exactly when a retry will succeed versus when the information simply isn't there.

This domain maps to Scenario 5 (CI/CD) and Scenario 6 (Structured Data Extraction) — but its techniques appear in every scenario. Explicit criteria, few-shot examples, and schema-enforced output are universal production skills.

Task Statement 4.1

Design prompts with explicit criteria to improve precision and reduce false positives

Vague instructions like "be conservative" or "only report high-confidence findings" sound useful but consistently fail in production. Only explicit categorical criteria — specifying exactly which issues to report and which to skip — reliably reduce false positive rates.

The Core Concept

False positives are not a model quality problem — they are a prompt precision problem. When you tell a model to "check that comments are accurate," it will flag anything that might be inaccurate, including stylistic issues, subjective interpretations, and local conventions. When you tell it to "flag comments only when the claimed behaviour contradicts actual code behaviour," it has an unambiguous criterion to apply.

The Exam Principle: General instructions like "be conservative" or "only report high-confidence findings" do not reduce false positives. They are confidence-based filters applied over an unchanged decision process — the model still makes the same borderline calls, just with lower stated confidence. Only specific categorical criteria change which issues are considered in scope.

Vague vs Explicit Criteria

✗ Vague — Produces False Positives
System prompt: "Check that comments are accurate and flag any inaccurate ones."

What gets flagged:
- Comment says "returns null on error" but actually throws → VALID
- Comment uses different wording than the code → FALSE POSITIVE
- Comment describes old behaviour but code was updated → could go either way
- Comment missing edge case description → FALSE POSITIVE
✓ Explicit — Precise Scope
System prompt: "Flag comments ONLY when the claimed behaviour contradicts the actual code behaviour. Skip:
- Stylistic or wording differences
- Missing details or edge cases
- Local naming conventions
- Subjective quality opinions"

Result: Only genuine contradictions are flagged. False positive rate ↓↓
Explicit severity criteria with code examples — consistent classification PRODUCTION PATTERN
## Code Review Severity Criteria

### CRITICAL — Report immediately
- Null pointer dereference with no guard
- SQL injection via string concatenation
- Hardcoded credentials or API keys
Example: `db.query("SELECT * FROM users WHERE id=" + userId)`
→ Report as CRITICAL

### HIGH — Report with suggested fix
- Logic errors where output contradicts specification
- Race conditions in shared state
Example: `if (user.role = "admin")` (assignment not comparison)
→ Report as HIGH

### SKIP — Do not report
- Minor style differences from team convention
- Missing JSDoc on private methods
- Variable naming preferences
- Comments that could be more descriptive (but aren't wrong)
Example: `// get user` above getUserById() → SKIP

The False Positive Trust Problem

High false positive rates cause cascading trust failure. When developers learn that a particular review category generates noise, they start ignoring all findings from that category — including the real bugs it correctly identifies. A 30% false positive rate in comment accuracy doesn't just waste time on 30% of reviews — it erodes confidence in the 70% of findings that were correct.

🚨
The Correct Response: When a category has unacceptably high false positive rates, temporarily disable that category and fix its criteria before re-enabling. Leaving it enabled — even with a note that it's unreliable — continues to erode developer trust in the entire review system.
  • Write criteria that define which issues to report (bugs, security vulnerabilities, logic errors) and which to skip (style preferences, local patterns, subjective quality)
  • Temporarily disable high false-positive categories while improving their prompts — restores developer trust in the remaining categories
  • Define explicit severity levels (CRITICAL / HIGH / MEDIUM / SKIP) with concrete code examples for each — eliminates ambiguous classifications
  • Never rely on "be conservative" or confidence thresholds as substitutes for categorical scope definitions
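The scope rules above can be assembled into a system prompt programmatically. A minimal sketch — `build_review_prompt` and its category lists are hypothetical illustrations, not a real pipeline's configuration:

```python
# Hypothetical helper: builds a categorical-scope review prompt.
# Note there is no confidence language — scope is defined by
# explicit REPORT and SKIP lists, per the principle above.
def build_review_prompt(report: list[str], skip: list[str]) -> str:
    report_lines = "\n".join(f"- {item}" for item in report)
    skip_lines = "\n".join(f"- {item}" for item in skip)
    return (
        "Review the code below.\n\n"
        "REPORT only these issue types:\n" + report_lines + "\n\n"
        "SKIP (never report):\n" + skip_lines + "\n\n"
        "Flag a comment ONLY when its claimed behaviour contradicts "
        "the actual code behaviour."
    )

prompt = build_review_prompt(
    report=["Null dereference with no guard",
            "SQL injection via string concatenation",
            "Hardcoded credentials or API keys"],
    skip=["Stylistic or wording differences",
          "Missing edge-case documentation",
          "Naming preferences"],
)
```

The resulting string goes in the `system` parameter of the API call; the point of the helper is that adding or removing a category is a one-line change to a list, not a prose edit.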

Exam Traps for Task 4.1

Trap: Add "only report high-confidence findings" to reduce false positives
Why it fails: Confidence-based filtering doesn't change what the model considers in scope — it still makes the same borderline decisions, just reports fewer of them. The root cause (vague scope) is unchanged.
Correct pattern: Define categorical scope: explicitly list what IS in scope vs what is NOT. Remove ambiguous categories entirely.

Trap: Keep a high false-positive category enabled to avoid missing real issues
Why it fails: A category with a 30% false positive rate destroys trust in all its findings, including the 70% that are real. Net effect: developers ignore more real issues than the category catches.
Correct pattern: Temporarily disable the category; fix the criteria; re-enable once the false positive rate is acceptable.

Trap: Use vague severity labels ("low/medium/high") without definitions
Why it fails: Inconsistent classification across sessions — the same issue gets different severity labels on different runs. Breaks automated triage workflows.
Correct pattern: Define each severity level with concrete code examples for each classification.

🔨 Implementation Task

T1

Build an Explicit-Criteria Code Review Prompt

  • Start with a vague prompt: "Review this code for bugs and style issues." Run it on 10 files. Classify each finding as true or false positive.
  • Rewrite with explicit categorical criteria: define which 3 issue types to report and which 3 to skip. Run the same 10 files.
  • Compare false positive rates — document the reduction.
  • Add severity definitions (CRITICAL/HIGH/SKIP) with one concrete code example each.
  • Identify the highest false-positive category in your results; write a prompt that disables it and removes the noise from developer view.

Exam Simulation — Task 4.1

Question 1 — Task 4.1 CI/CD Code Review Pipeline
A CI code review pipeline is flagging 40% of valid comment blocks as inaccurate, generating constant noise. The current prompt says "check that comments are accurate and flag inaccurate ones." A developer suggests adding "only flag when you're highly confident the comment is wrong." What outcome should you expect?
  • A. False positive rate drops significantly — confidence thresholds are an effective filter for borderline cases
  • B. False positive rate improves marginally at best — the root cause is vague scope, not confidence calibration. The prompt must explicitly define which comment inaccuracies are in scope vs out of scope
  • C. Total findings drop significantly — fewer reviews means less developer friction even if some real issues are missed
  • D. False positive rate improves because the model will re-evaluate all its borderline decisions with a stricter threshold
Correct: B
B is correct. "Only flag when highly confident" is a confidence filter — it doesn't change the model's scope of what counts as an inaccuracy. The model still considers style differences, missing details, and subjective quality as potential inaccuracies; it just suppresses some of them. The fix is categorical scope: "flag ONLY when claimed behaviour contradicts actual code behaviour." A is wrong: Confidence filters don't fix scope ambiguity. C is a net negative: reducing total findings by suppression means real bugs are also suppressed. D is wrong: confidence thresholds don't make the model reconsider what it considers in scope.
Question 2 — Task 4.1 CI/CD Code Review Pipeline
Your CI review has 5 categories: security vulnerabilities, logic errors, performance issues, comment accuracy, and naming conventions. Developers report they've stopped reading any findings because there are too many false positives in comment accuracy and naming. What is the most effective immediate fix?
  • A. Add a disclaimer to all findings: "This category has elevated false positive rates — verify before acting"
  • B. Reduce the review to only 2 categories — security vulnerabilities and logic errors — and never add more categories
  • C. Temporarily disable comment accuracy and naming categories from the pipeline while improving their criteria, so developers can immediately trust the remaining 3 categories
  • D. Add "only flag when confidence is above 90%" to the comment accuracy and naming prompts
Correct: C
C is correct. When false positive rates destroy trust, the right move is to temporarily disable those categories so developers can confidently act on the reliable categories. This restores trust immediately without requiring a full prompt rewrite — then improve the disabled categories offline before re-enabling. A adds disclaimers that further erode trust and don't help developers distinguish real findings. B is too aggressive — permanently removing categories is different from temporarily disabling them for improvement. D is the confidence-filter anti-pattern — doesn't fix the scope problem.
Question 3 — Task 4.1 CI/CD Code Review Pipeline
Your Claude-powered code review pipeline runs on every pull request. After deploying, you observe that 40% of flagged issues are false positives — Claude flags well-written code as problematic. A colleague suggests adding more examples of good code to the training prompt. Which prompt engineering change most directly reduces false positives?
  • A. Add five examples of good code that should NOT be flagged, with no explanation.
  • B. Add explicit exclusion criteria to the system prompt, such as "Do not flag functions that have full test coverage, are under 20 lines, and have no branching complexity."
  • C. Increase the temperature setting to allow more nuanced judgements.
  • D. Add a second Claude pass that reviews the first pass's output for false positives.
Correct: B
B is correct. Explicit exclusion criteria tell Claude exactly which conditions disqualify a finding. Without them, Claude applies its own judgment about what 'problematic' means, producing high false-positive rates on well-written code. A (few-shot without explanation) helps less than criteria because it doesn't generalise to unseen patterns. C (temperature) controls randomness, not precision. D (second pass) can help but is a cost-multiplier and doesn't address the root cause: the first pass lacks a clear definition of what NOT to flag.
Task Statement 4.2

Apply few-shot prompting to improve output consistency and quality

When detailed instructions alone produce inconsistent results, few-shot examples are the most effective remedy. They demonstrate judgment in ambiguous cases, show exact output format, and enable Claude to generalise to novel patterns — not just match pre-specified cases.

The Core Concept

Few-shot examples demonstrate the reasoning behind a decision, not just the decision itself. A good example shows why one choice was made over a plausible alternative — enabling Claude to apply that same reasoning to inputs it has never seen. Examples that only show inputs and outputs (without explaining the why) improve consistency but not generalisation.

The Exam Principle: Few-shot examples are the most effective technique when detailed instructions alone produce inconsistent results. They are especially powerful for: (1) ambiguous-case handling, (2) exact output format enforcement, (3) reducing hallucination in extraction tasks with varied document structures.

Anatomy of an Effective Few-Shot Example

Code review few-shot — showing reasoning for ambiguous cases EFFECTIVE PATTERN
## Example 1 — Genuine Bug (REPORT)
Code:
  const total = items.reduce((sum, item) => sum + item.price, 0);
  // Returns total including tax

Finding: REPORT as HIGH
Location: checkout.js line 42
Issue: Comment claims tax is included, but price field
  does not include tax per the schema in types.ts
Suggested fix: Either update the comment to "Returns
  subtotal (excluding tax)" or add tax calculation
Why reported: Claimed behaviour contradicts code behaviour.
  This could cause double-counting in tax reporting.

## Example 2 — Style Issue (SKIP)
Code:
  // Get the user from db
  const user = await db.getUser(userId);

Finding: SKIP
Why skipped: Comment accurately describes what the code
  does. "From db" vs "from database" is a wording
  preference, not a behavioural contradiction.

## Example 3 — Ambiguous (Demonstrate the Decision)
Code:
  // Validate input
  if (!user.email) throw new Error('Email required');

Finding: SKIP
Why skipped: Comment describes the general category
  (validation) but not every validation rule. Missing
  specificity is not a contradiction. No false claim.
🎯

Show Reasoning, Not Just Answers

Include "Why reported" or "Why skipped" in every example. This enables Claude to apply the same reasoning to patterns it hasn't seen, rather than pattern-matching to the specific examples.

🔀

Include Ambiguous Cases

The most valuable examples show how to handle the grey area — the cases that instructions alone can't resolve. Show a borderline case and explain why it falls on one side of the line.

📐

Demonstrate Exact Output Format

Use the same output structure in every example: Location, Issue, Severity, Suggested fix. Models reproduce the structure they're shown — inconsistent examples produce inconsistent output.

📄

Varied Document Structures

For extraction tasks, include examples from documents with different formats: inline citations vs bibliographies, narrative descriptions vs structured tables. One format example fails to generalise.

Key Applications

  • Create 2–4 targeted examples for ambiguous scenarios — show reasoning for why one action was chosen over plausible alternatives
  • Demonstrate exact desired output format (location, issue, severity, suggested fix) in every example to achieve structural consistency
  • Include examples showing acceptable patterns that are NOT issues — this reduces false positives while preserving generalisation to genuine bugs
  • For extraction tasks: provide examples from multiple document formats (inline citations, bibliographies, narrative, tables) — single-format examples fail on novel structures
  • For extraction of optional fields: include an example showing correct null output when information is absent — prevents hallucinated values
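The last point above — demonstrating a correct null — can be encoded directly in the few-shot block. A minimal sketch; `render_few_shot` and the example documents are illustrative, not taken from a real extraction system:

```python
import json

# Hypothetical few-shot block for a methodology-extraction prompt.
# The third example shows correct null output when the information
# is absent — this is what discourages hallucinated values.
FEW_SHOT = [
    {"doc": "Methods: We surveyed 120 nurses using a Likert scale.",
     "out": {"methodology": "survey of 120 nurses (Likert scale)"}},
    {"doc": "We trained the model on 10k images; setup details "
            "appear inline in the Results section.",
     "out": {"methodology": "supervised training on 10k images"}},
    # Document with NO methodology information → null, not a guess.
    {"doc": "This editorial argues for open peer review.",
     "out": {"methodology": None}},
]

def render_few_shot(examples: list[dict]) -> str:
    parts = []
    for i, ex in enumerate(examples, 1):
        parts.append(
            f"## Example {i}\nDocument: {ex['doc']}\n"
            f"Output: {json.dumps(ex['out'])}"
        )
    return "\n\n".join(parts)

prompt_block = render_few_shot(FEW_SHOT)
```

Note that the second example demonstrates extraction from an inline description rather than a dedicated Methods section — the varied-structure principle from the list above.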

Exam Traps for Task 4.2

Trap: Provide 10+ examples covering every possible case
Why it fails: Overly comprehensive examples inflate token usage without improving quality — 2–4 targeted examples at the ambiguous boundary are more effective than exhaustive coverage.
Correct pattern: 2–4 examples focused on the hardest cases — especially ambiguous ones where instructions alone are insufficient.

Trap: Show only correct answers without reasoning
Why it fails: Without "why," Claude can only pattern-match to the specific examples — it cannot generalise the decision rule to novel patterns.
Correct pattern: Include explicit reasoning in each example ("Why reported / Why skipped") so Claude can apply the same logic to new inputs.

Trap: Use one document format in extraction examples
Why it fails: The model learns to extract from that format but fails on documents with different structures (e.g., inline vs bibliography citations).
Correct pattern: Include examples from each major document variant the system will encounter in production.

🔨 Implementation Task

T2

Build a Few-Shot Prompt and Measure Consistency

  • Design a code review prompt with 3 few-shot examples: one genuine bug, one false-positive-prone case that should be skipped, one genuinely ambiguous case showing the decision
  • For each example, include the reasoning ("Why reported / Why skipped") — not just the verdict
  • Run the prompt on 20 code samples. Measure consistency: how often does the same code get the same classification across 3 runs?
  • Compare against the same 20 samples with instructions-only (no examples). Document the consistency improvement.
  • For a structured extraction task: add examples from 3 different document formats. Measure how often required fields are correctly extracted vs left null vs hallucinated.
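The consistency measurement in the steps above can be sketched as a small helper. `consistency_rate` and the sample data are hypothetical; in practice the inner lists would come from repeated API calls on the same input:

```python
# Hypothetical consistency metric: for each code sample, run the
# classifier N times and check whether every run agrees.
def consistency_rate(runs_per_sample: list[list[str]]) -> float:
    """runs_per_sample[i] holds the classifications from repeated
    runs on sample i, e.g. ["REPORT", "REPORT", "SKIP"]."""
    stable = sum(1 for runs in runs_per_sample if len(set(runs)) == 1)
    return stable / len(runs_per_sample)

# Illustrative data: 3 samples × 3 runs each.
results = [
    ["REPORT", "REPORT", "REPORT"],  # consistent
    ["SKIP", "SKIP", "SKIP"],        # consistent
    ["REPORT", "SKIP", "REPORT"],    # inconsistent — 2 of 3 stable overall
]
rate = consistency_rate(results)
```

Run the same metric for the instructions-only prompt and the few-shot prompt; the difference between the two rates is the consistency improvement to document.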

Exam Simulation — Task 4.2

Question 1 — Task 4.2 Structured Data Extraction
An extraction pipeline processing research papers is leaving the methodology field null in 35% of documents, even though the papers do contain methodology information — it's just described inline in the results section rather than in a dedicated "Methods" section. Detailed instructions about finding methodology information haven't helped. What is the most effective fix?
  • A. Make the methodology field required in the schema so the model is forced to populate it
  • B. Add few-shot examples demonstrating correct extraction from papers that describe methodology inline in the results section, without a dedicated Methods heading
  • C. Add a retry loop that resubmits documents where methodology is null, with instructions to look harder
  • D. Switch to a larger model with better reading comprehension for research paper formats
Correct: B
B is correct. The model has been trained to look for methodology in a dedicated section — it doesn't generalise to inline descriptions without examples showing that structure. Few-shot examples demonstrating exactly how to find methodology embedded in results sections teach the model the pattern it needs. A is counterproductive: making the field required without fixing the extraction prompt causes hallucination — the model invents methodology information to satisfy the schema. C won't help: retrying doesn't fix the extraction approach — the model will look in the same place again. D misdiagnoses the problem as a model capability issue when it's a prompt format issue.
Question 2 — Task 4.2 Structured Data Extraction
Your Claude extraction pipeline converts scanned invoices to JSON. Two failure modes appear in production: (1) vendor names with commas are split incorrectly, and (2) dates in DD/MM/YYYY format are transposed to MM/DD/YYYY. You plan to add few-shot examples to fix both. Which approach is most efficient?
  • A. Add one comprehensive example that demonstrates both correct vendor parsing and correct date parsing in a single invoice.
  • B. Add two separate examples — one targeting only comma-containing vendor names, one targeting only DD/MM/YYYY dates.
  • C. Add one example for each failure mode, placing the most frequent failure first.
  • D. Add three examples with diverse invoice layouts to improve generalisation.
Correct: C
C is correct. Targeted few-shot examples that isolate each failure mode are more efficient than broad examples, because the model learns the specific pattern being corrected rather than trying to generalise from mixed signals. Placing the most frequent failure first ensures it gets more implicit weighting. A (single combined example) makes it harder to isolate which correction fixed which problem if you need to iterate. B is essentially C but misses the ordering principle. D (diverse layouts) improves generalisation but doesn't directly target the two known failures.
Question 3 — Task 4.2 Structured Data Extraction
You're building a prompt to extract financial data from earnings call transcripts. Early tests show Claude inconsistently formats revenue figures — sometimes '$1.2B', sometimes '1,200,000,000', sometimes '1.2 billion'. You have 50 real transcripts available. What is the most effective use of few-shot examples for this problem?
  • A. Add all 50 transcripts as examples with annotated correct outputs.
  • B. Add 2–3 targeted examples showing the exact output format required, each demonstrating a different revenue magnitude.
  • C. Add 10 examples covering all revenue format variations seen in the 50 transcripts.
  • D. Describe the format rule in the system prompt instead of using examples.
Correct: B
B is correct. 2–3 well-chosen examples that directly demonstrate the desired format are sufficient and efficient — they show the model exactly what's expected without creating a bloated prompt. A (50 examples) dramatically increases token cost with diminishing returns. C (10 examples) is better than A but still more than needed to convey a single format convention. D (description only) is weaker than examples for format enforcement — showing is more reliable than telling for consistent structured output.
Task Statement 4.3

Enforce structured output using tool use and JSON schemas

Tool use with JSON schemas is the most reliable mechanism for structured output — it eliminates JSON syntax errors. But it does not eliminate semantic errors. And tool_choice has three modes with very different guarantees.

The Core Concept

Asking Claude to "return JSON" in a prompt is unreliable — it can return markdown-wrapped JSON, prose before/after the JSON, or malformed JSON under pressure. Defining an extraction function via tool_use with a JSON schema guarantees syntax-valid output because the API enforces the schema structurally before returning the response.

The Critical Limit: Tool use eliminates syntax errors (malformed JSON, missing brackets, wrong types). It does not eliminate semantic errors — line items that don't sum to the stated total, values placed in the wrong fields, or logically inconsistent data. Schema validation is syntactic; business logic validation is a separate layer.

tool_choice: Three Modes

🔄

"auto"

Model decides whether to call a tool or return text. May return conversational text instead of calling a tool. Use when tool calling is optional — e.g., the model might answer a simple question directly without needing a tool.

"any"

Model must call one of the available tools — cannot return plain text. Use when structured output is required and multiple extraction schemas exist (e.g., invoice vs contract vs receipt — the model selects which tool fits). Guarantees a tool is called.

🎯

{"type":"tool","name":"X"}

Forced tool selection — the model must call exactly this tool. Use when a specific extraction must run first — e.g., extract_metadata before enrichment tools. Guarantees not just that a tool is called, but which one.

⚠️

The "auto" Trap

Setting tool_choice: "auto" when you need guaranteed structured output is a common exam distractor. "auto" means Claude may return text. For guaranteed output, use "any" or forced selection.

python — tool_use structured extraction with tool_choice PRODUCTION PATTERN
tools = [{
    "name": "extract_invoice",
    "description": "Extract structured data from an invoice document",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor_name":    {"type": "string"},
            "invoice_date":   {"type": ["string", "null"]},  # nullable!
            "total_amount":   {"type": "number"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "amount": {"type": "number"}
                    }
                }
            },
            "currency": {
                "type": "string",
                "enum": ["USD", "EUR", "GBP", "other"]  # "other" for extensibility
            }
        },
        "required": ["vendor_name", "total_amount", "line_items"]
        # invoice_date is NOT required — may be absent
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    tools=tools,
    tool_choice={"type": "any"},  # ← guarantees structured output
    messages=[{"role": "user", "content": document_text}]
)
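The forced-selection mode described above follows the same shape with a different `tool_choice` value. A sketch of the request — the `extract_metadata` tool and its schema are hypothetical:

```python
# Hypothetical metadata tool that must run before enrichment tools.
metadata_tool = {
    "name": "extract_metadata",
    "description": "Extract document type, language, and page count",
    "input_schema": {
        "type": "object",
        "properties": {
            "doc_type": {"type": "string"},
            "language": {"type": "string"},
            "page_count": {"type": ["integer", "null"]},  # may be unknown
        },
        "required": ["doc_type"],
    },
}

request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "tools": [metadata_tool],
    # Forced selection: guarantees not just that a tool is called,
    # but that THIS tool is called on this turn.
    "tool_choice": {"type": "tool", "name": "extract_metadata"},
}
# response = client.messages.create(**request, messages=[...])
```

Compare with the `{"type": "any"}` call above: "any" guarantees some tool runs; the forced form pins which one.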

Schema Design Principles

  • Make fields nullable (optional) when source documents may not contain the information — prevents the model from fabricating values to satisfy a required field
  • Use enum fields with an "other" + detail string pattern for extensible categorisation — captures known categories precisely while preserving unknown ones
  • Add an "unclear" enum value for fields where the source data is ambiguous — explicit rather than guessing
  • Include format normalisation rules in prompts alongside strict schemas — e.g., "dates should be ISO 8601 format" — schemas enforce structure but not format consistency
Schema — "other" + detail pattern for extensible enums SCHEMA DESIGN
# ✗ Closed enum — fails on unknown contract types
"contract_type": {"type": "string", "enum": ["NDA", "SLA", "MSA"]}

# ✓ Open enum with fallback — handles unknown types
"contract_type": {
    "type": "string",
    "enum": ["NDA", "SLA", "MSA", "other"]
},
"contract_type_detail": {
    "type": ["string", "null"],
    "description": "Populated when contract_type is 'other'"
}

What Schemas Cannot Do

Tool use schemas enforce syntactic correctness. They cannot detect semantic errors — when the values are syntactically valid but logically wrong:

✓ Schema Catches (Syntax Errors)
- Wrong type: "amount": "50.00" → should be number, not string
- Missing required field: no vendor_name → required field absent
- Invalid enum: "currency": "CAD" → not in allowed enum values
- Malformed JSON: missing closing brace → syntax error caught
✗ Schema Misses (Semantic Errors)
- Line items: [50, 75] but total: 200 → values don't sum — schema can't check
- Invoice date: "2025-13-45" → invalid date — passes as a string
- Amount in wrong currency unit → value is a number — schema satisfied
- Vendor name in address field → both strings — schema can't detect the swap
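The semantic errors schemas miss can be caught by a post-extraction validation layer. A minimal sketch — `validate_invoice`, its error messages, and the 0.01 tolerance are illustrative, not a complete rule set:

```python
from datetime import date

# Hypothetical semantic checks layered AFTER schema validation.
# The schema guarantees types; these rules check business logic.
def validate_invoice(result: dict) -> list[str]:
    errors = []

    # Cross-field arithmetic: line items must sum to the total.
    items_sum = sum(item["amount"] for item in result.get("line_items", []))
    if abs(items_sum - result["total_amount"]) > 0.01:
        errors.append(
            f"line_items sum to {items_sum}, "
            f"but total_amount is {result['total_amount']}"
        )

    # Format semantics: "2025-13-45" is a valid string, not a valid date.
    if result.get("invoice_date"):
        try:
            date.fromisoformat(result["invoice_date"])
        except ValueError:
            errors.append(
                f"invoice_date {result['invoice_date']!r} is not a valid date"
            )
    return errors

# A syntactically valid extraction that any schema would accept:
bad = {"vendor_name": "Acme", "total_amount": 200.0,
       "invoice_date": "2025-13-45",
       "line_items": [{"description": "A", "amount": 50.0},
                      {"description": "B", "amount": 75.0}]}
semantic_errors = validate_invoice(bad)  # two findings the schema missed
```

The returned error strings are exactly what feeds the retry-with-error-feedback loop covered in Task 4.4.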

Exam Traps for Task 4.3

Trap: Use tool_choice: "auto" when structured output is required
Why it fails: "auto" allows Claude to return plain text instead of calling the tool — it doesn't guarantee structured output.
Correct pattern: Use "any" (a tool must be called) or forced selection when extraction is required every time.

Trap: Mark all fields as required in the schema to ensure complete extraction
Why it fails: Required fields on optional information force the model to hallucinate values rather than leave them null — producing worse output than nullable fields.
Correct pattern: Only mark truly mandatory fields as required; use nullable types for fields that may be absent in source documents.

Trap: Assume schema validation eliminates all extraction errors
Why it fails: Schemas eliminate syntax errors only — semantic errors (wrong values, mismatched sums, misplaced data) pass schema validation.
Correct pattern: Add a semantic validation layer: check computed totals, validate date ranges, verify cross-field consistency after schema validation.

🔨 Implementation Task

T3

Build a Schema-Enforced Extraction Pipeline

  • Define an invoice extraction tool with a JSON schema including: required fields (vendor, total), nullable fields (date, PO number), and an enum with "other" fallback for invoice type
  • Test with tool_choice: "auto" — observe when Claude returns text instead of calling the tool. Then switch to "any" and confirm 100% tool calls
  • Deliberately create a document where line items don't sum to the total. Verify the schema accepts the output (syntax valid). Add a post-validation check that flags the discrepancy (semantic validation)
  • Test with a document missing the invoice date — verify the nullable field correctly returns null rather than a fabricated date
  • Test forced tool selection: set tool_choice: {"type":"tool","name":"extract_metadata"} and confirm only that tool runs on the first turn

Exam Simulation — Task 4.3

Question 1 — Task 4.3 Structured Data Extraction
After switching invoice extraction to use tool_use with a strict JSON schema, the team celebrates that JSON syntax errors have been eliminated. However, they are still finding invoices where the line_items amounts don't add up to the total_amount. What does this indicate?
  • A. The JSON schema is not strict enough — add a minimum and maximum constraint to the total_amount field
  • B. The tool descriptions are unclear — rewrite them to explain that line items should sum to the total
  • C. Tool use eliminates syntax errors but not semantic errors — a separate validation layer is needed to check business logic like sum consistency
  • D. The model needs a larger context window to process both the line items and total in a single pass
Correct: C
C is correct. This is the central exam principle for 4.3: tool use provides syntactic guarantees (valid JSON, correct types, required fields present) but cannot enforce business logic like "sum of line items must equal total." JSON schemas have no mechanism to express cross-field arithmetic constraints. A post-extraction validation layer must check semantic consistency. A is wrong: min/max constraints on a single field can't enforce relationships between fields. B is wrong: better tool descriptions improve selection, not arithmetic consistency. D is wrong: this is a validation architecture problem, not a context problem.
Question 2 — Task 4.3 Structured Data Extraction
An extraction pipeline processes three document types: invoices, contracts, and purchase orders — each with a different extraction schema. The document type is not always known in advance. What tool_choice configuration ensures the model always calls one of the extraction tools rather than returning conversational text?
  • A. tool_choice: "auto" — the model will select the appropriate tool based on document content
  • B. tool_choice: "any" — the model must call one of the available tools and cannot return plain text, while still choosing which tool fits the document
  • C. tool_choice: {"type":"tool","name":"extract_invoice"} — forces the invoice tool to run for all documents
  • D. No tool_choice setting — include "always use a tool" in the system prompt
Correct: B
B is correct. "any" is the right choice when: (1) structured output is guaranteed (cannot return text), and (2) multiple tools exist and the model should choose the appropriate one. A is wrong: "auto" allows the model to return plain text — doesn't guarantee extraction. C is wrong: forcing a specific tool runs invoice extraction on contracts and POs, producing incorrect schema mappings. D is wrong: system prompt instructions are probabilistic — "any" provides a structural guarantee.
Question 3 — Task 4.3 Structured Data Extraction
Your pipeline extracts contract terms using a strict JSON schema enforced via tool use. Schema validation passes 100% of the time, but downstream legal review still rejects 18% of extractions because key relationships between fields are wrong — e.g., the termination_date is before the start_date. What additional measure would most reduce the rejection rate?
  • A. Switch from tool use enforcement to markdown JSON output and add format instructions.
  • B. Add a post-extraction validation layer that checks semantic constraints across fields, such as date ordering and required field co-occurrence.
  • C. Increase the schema strictness by adding more required fields.
  • D. Add few-shot examples showing correct field relationships.
Correct: B
B is correct. JSON schema validation ensures structural correctness but cannot enforce semantic constraints between fields (e.g., end > start). A semantic validation layer applies cross-field business rules that the schema cannot express. A (markdown JSON) is strictly worse — it removes schema enforcement without adding semantic validation. C (more required fields) increases structural coverage but still doesn't catch relational errors between existing fields. D (few-shot) helps Claude learn patterns but doesn't provide a deterministic safety net — a validation layer is more reliable for cross-field constraints.
Task Statement 4.4
Task Statement 4.4

Implement validation, retry, and feedback loops for extraction quality

Retries work when the error is fixable — format mismatches, structural output errors. They don't work when the required information simply isn't in the document. Knowing the difference prevents wasted API calls and incorrect data.

The Core Concept

Retry-with-error-feedback is a powerful pattern: rather than sending the same prompt again, you append the specific validation error and ask the model to correct its output. This gives the model the information it needs to fix the error — but only if the error is fixable given the source document.

Retry-with-Error-Feedback

python — retry loop with specific error feedback EFFECTIVE RETRY PATTERN
def extract_with_retry(document, max_retries=3):
    # extract, parse_tool_result, validate_extraction: pipeline-specific helpers
    messages = [{"role": "user", "content": document}]

    for attempt in range(max_retries):
        response = extract(messages)
        result = parse_tool_result(response)

        # Validate the extraction result
        errors = validate_extraction(result)

        if not errors:
            return result  # ✓ Success

        # ✓ Append assistant response AND specific errors
        messages.append({"role": "assistant", "content": response.content})
        messages.append({
            "role": "user",
            "content": f"""Your extraction has validation errors. Please correct them:

Errors found:
{chr(10).join(f"- {e}" for e in errors)}

Original document for reference:
{document}

Please re-extract with these corrections applied."""
        })

    return result  # Last attempt after max retries; still failing, so flag for review

When Retry Won't Help

The most important distinction in 4.4: retries fix errors caused by format or structural output problems. They cannot fix errors caused by missing information in the source document.

✓ Retry Will Succeed
  • Format mismatch: date extracted as "Jan 15, 2024"; expected ISO 8601 "2024-01-15" → retry with a format instruction will fix it
  • Structural output error: amount extracted as the string "50.00"; expected the number 50.00 → schema feedback will fix it
  • Field misplacement: vendor name in the address field → specific error feedback will fix it
✗ Retry Won't Help
  • Missing information: invoice date not present anywhere in the document → retrying just produces fabrication
  • External reference: "As per the attached rate card", but the rate card is not provided → the model cannot extract what's not in the context
  • Ambiguous source data: two conflicting totals in the document → retry produces inconsistent picks, not a correct answer
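This distinction can be enforced in code: classify each validation error before deciding whether a retry is even worth an API call. A minimal sketch with assumed error-kind labels:

```python
# Error kinds a validator might emit; the retryable/terminal split is the key decision
RETRYABLE = {"format_mismatch", "structural_error", "field_misplacement"}
TERMINAL = {"missing_information", "external_reference", "ambiguous_source"}

def next_action(errors: list[dict]) -> str:
    """Decide whether a retry can possibly help before spending an API call."""
    kinds = {e["kind"] for e in errors}
    if kinds & TERMINAL:
        return "human_review"         # retrying would only invite fabrication
    if kinds & RETRYABLE:
        return "retry_with_feedback"  # fixable: resend with the specific error
    return "accept"                   # no errors: extraction is good
```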

Feedback Loop Design

🔍

detected_pattern Field

Add a detected_pattern field to structured findings so that when developers dismiss a finding, you can track which code constructs triggered false positives. Enables systematic prompt improvement based on real dismissal patterns.

🧮

Self-Validation Fields

Add calculated_total alongside stated_total — the model computes the sum of line items and compares. Add conflict_detected: boolean when source data has inconsistencies. Makes semantic errors visible without external logic.
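A sketch of what these self-validation fields and the accompanying deterministic cross-check might look like (field names and schema fragments are illustrative):

```python
# Illustrative additions to an extraction tool's input_schema properties
SELF_VALIDATION_FIELDS = {
    "calculated_total": {
        "type": "number",
        "description": "Sum of all extracted line item amounts, computed by the model",
    },
    "conflict_detected": {
        "type": "boolean",
        "description": "True if the source document contains inconsistent totals",
    },
}

def totals_consistent(result: dict, tolerance: float = 0.01) -> bool:
    """Deterministic cross-check of the model's arithmetic against the stated total."""
    return abs(result["calculated_total"] - result["stated_total"]) <= tolerance
```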

📊

Confidence Alongside Findings

Have the model self-report a confidence score per finding. Use this to route low-confidence extractions to human review. Enables calibrated routing without making every extraction manual.
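A minimal routing sketch, assuming the model reports confidence as a 0–1 score (the threshold is illustrative and should be calibrated against observed accuracy per confidence band):

```python
def route_extraction(record: dict, threshold: float = 0.8) -> str:
    """Route by model-reported confidence: high goes to automated action,
    everything else (including missing scores) goes to human review."""
    if record.get("confidence", 0.0) >= threshold:
        return "automated"
    return "human_review"
```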

Sample Before Batch

Run prompt refinement on a representative sample set (10–20 documents) before batch-processing large volumes. Maximises first-pass success rates and dramatically reduces costly iterative resubmission on thousands of documents.
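The gate can be a small function that blocks full submission until the sample clears a success threshold. A sketch, where run_one and is_valid are assumed pipeline hooks:

```python
def sample_gate(documents, run_one, is_valid, sample_size=15, threshold=0.9):
    """Run the prompt on a small sample and block the full batch until the
    first-pass success rate clears the threshold."""
    sample = documents[:sample_size]
    successes = sum(1 for doc in sample if is_valid(run_one(doc)))
    rate = successes / len(sample)
    if rate < threshold:
        raise RuntimeError(
            f"First-pass success {rate:.0%} below {threshold:.0%}; "
            "refine the prompt before batch submission"
        )
    return rate
```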

Exam Traps for Task 4.4

  • Trap: Retrying when a required field is null because the information isn't in the document. Why it fails: the model has nothing new to work with — it will either repeat null or fabricate a value to satisfy the requirement. Correct pattern: distinguish format/structural errors (retry will help) from absent information (route to human review or accept null).
  • Trap: Retrying without providing the specific validation error. Why it fails: sending the same prompt again produces the same output — the model has no information about what was wrong. Correct pattern: include the original document, the failed extraction, and the specific validation errors in the retry request.
  • Trap: Running batch processing on 10,000 documents without sample testing. Why it fails: if the prompt has extraction gaps, they affect all 10,000 documents — expensive to reprocess the entire batch. Correct pattern: test on a representative 10–20 document sample first; fix extraction issues before running at scale.

🔨 Implementation Task

T4

Build a Validation-Retry Pipeline with Semantic Checking

  • Implement retry-with-error-feedback: include original document, failed extraction, and specific errors in the retry message
  • Create test cases: (a) date format mismatch — confirm retry fixes it; (b) required field absent from document — confirm retry produces fabrication or repeated null, not a correct answer
  • Add semantic validation: check that line item amounts sum to stated total; add calculated_total and conflict_detected fields to the schema
  • Add detected_pattern field to code review findings; track which patterns developers dismiss; identify top 3 false-positive patterns after 50 reviews
  • Test sample-before-batch: run on 10 documents, identify extraction gaps, fix prompt, then run on 100 — measure first-pass success rate improvement

Exam Simulation — Task 4.4

Question 1 — Task 4.4 Structured Data Extraction
An extraction pipeline has a 15% rate of null publication_date fields. Investigation reveals that 8% of documents genuinely don't contain a publication date, while 7% have a date but in a non-standard format (e.g., "Published Spring 2023"). The team has a retry loop that retries all null extractions. What outcome should they expect for each group?
  • ARetry fixes both groups — the model will try harder to find dates it missed the first time
  • BRetry fixes neither group — retries without new information produce the same output
  • CRetry with format guidance fixes the 7% with non-standard formats (format mismatch is correctable) but the 8% with truly absent dates will either remain null or be fabricated
  • DRetry fixes the 8% with absent dates because the model will infer approximate dates from document context
Correct: C
C is correct. Format mismatches are correctable — a retry telling the model "extract dates in ISO 8601 format, including informal formats like 'Spring 2023' → '2023-01'" will fix the 7%. But the 8% where no date exists cannot be fixed by retrying — the model either keeps returning null (correct) or hallucinates a date (incorrect). The correct handling for absent dates is to accept null and route to human review if needed. A is wrong: retrying without new information doesn't help absent-date documents. B is too pessimistic: retries with specific format feedback do fix format mismatches. D is wrong and dangerous — inferring dates from context produces fabrication, not extraction.
Question 2 — Task 4.4 Structured Data Extraction
Your contract extraction pipeline retries any extraction that fails JSON schema validation. After 3 retries, the error rate is still 12%. You analyse failures and find three distinct root causes: (A) Claude misidentifies section headings as clause content, (B) multi-paragraph clauses are truncated at the first line break, and (C) cross-references like 'see Section 4.3' are extracted literally instead of being resolved. Which retry strategy would most effectively reduce errors across all three root causes?
  • ARetry with temperature=0 for all failures to maximise determinism.
  • BClassify the failure type before retrying, then apply a targeted correction prompt specific to each root cause.
  • CRetry with a longer context window to provide more surrounding text.
  • DAdd all three root causes as system prompt warnings on the first attempt.
Correct: B
B is correct. When failures have multiple distinct root causes, a classification-based retry is far more effective than a single generic retry. Each root cause requires a different correction: for (A), a prompt distinguishing headings from content; for (B), instructions for multi-paragraph continuation; for (C), a resolution step for cross-references. A (temperature=0) helps with consistency but doesn't address structural misunderstandings. C (longer context) helps with (B) but not (A) or (C). D (upfront warnings) can reduce errors on the first pass but doesn't fully eliminate them — retries still need to be targeted.
Question 3 — Task 4.4 Structured Data Extraction
You implement a feedback loop where a validator Claude instance checks extraction quality. The validator is given only the extracted JSON and the schema to assess whether the extraction is correct. After deployment, you find the validator approves extractions that human reviewers reject. What is the most likely design flaw?
  • AThe validator prompt is too strict and rejecting valid extractions.
  • BThe validator model is less capable than the extractor model.
  • CThe validator only has the extracted JSON, not the source document, so it cannot verify whether extracted values actually appear in the source.
  • DThe schema used by the validator is outdated compared to the extractor's schema.
Correct: C
C is correct. A validator checking extraction quality without access to the source document is fundamentally limited — it can only check structural validity, not factual accuracy (whether the extracted values actually exist in the document). Human reviewers reject extractions because values are wrong or hallucinated, not because the JSON structure is invalid. A (too strict) is the opposite problem — the validator is too permissive. B (less capable model) might cause inconsistent behavior but doesn't explain systematic approval of factually wrong extractions. D (schema mismatch) would cause structural validation failures, not factual accuracy failures.
Task Statement 4.5
Task Statement 4.5

Design efficient batch processing strategies

50% cost savings — but no guaranteed latency SLA, no multi-turn tool calling, and up to 24 hours. Matching the API to the workflow latency requirement is the core decision. The exam tests SLA calculation and failure recovery design.

The Core Concept

The Message Batches API is designed for workloads that can tolerate async processing — overnight reports, weekly audits, bulk document processing. Its 50% cost savings make it compelling. Its non-deterministic completion time (up to 24 hours) makes it incompatible with any blocking workflow.

Critical Limitation: The batch API does not support multi-turn tool calling within a single request. If your extraction workflow requires calling tools mid-request and returning results, you cannot use the batch API for it — use the synchronous API.

Batch API Trade-offs and SLA Calculation

💰

50% Cost Savings

Half the per-token cost of the synchronous API. For high-volume extraction on non-urgent workloads, this compounds significantly across millions of tokens.

⏱️

Up to 24-Hour Window

No guaranteed completion time. Batches complete within 24 hours — but may complete in minutes. Never build a blocking workflow dependency on "usually faster" completion.

🔑

custom_id Correlation

Every batch request takes a custom_id. Results are returned with the same ID, enabling correlation regardless of completion order. Essential for partial failure recovery.

🚫

No Multi-Turn Tool Calling

Single-turn only. If your workflow needs to call a tool, receive the result, then make a follow-up decision — use the synchronous API. Batch API cannot execute tools mid-request.

📐
SLA Calculation Example: If your SLA requires results within 30 hours of document receipt, and the batch API takes up to 24 hours: you must submit batches within 6 hours of document receipt (30 − 24 = 6-hour submission window). If documents arrive continuously, submit every ≤6 hours to guarantee the 30-hour SLA.
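The arithmetic generalises to any SLA; a minimal sketch:

```python
def submission_interval_hours(sla_hours: float, max_batch_hours: float = 24.0) -> float:
    """Maximum gap between batch submissions that still guarantees the SLA:
    SLA minus the worst-case batch processing window."""
    window = sla_hours - max_batch_hours
    if window <= 0:
        raise ValueError("SLA is tighter than the batch window; use the synchronous API")
    return window
```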

Batch Failure Handling

python — batch failure recovery using custom_id FAILURE RECOVERY PATTERN
def process_batch_results(batch_results, original_documents):
    # chunk_document and resubmit_batch are pipeline-specific helpers
    successful = []
    failed = []

    for result in batch_results:
        doc_id = result.custom_id  # ← correlate using custom_id

        if result.result.type == "succeeded":
            successful.append(result)
        else:
            original_doc = original_documents[doc_id]
            failure_reason = result.result.error.type

            if failure_reason == "token_limit_exceeded":
                # ✓ Chunk the oversized document; give each chunk its own id
                chunks = chunk_document(original_doc)
                failed.extend(
                    (f"{doc_id}-chunk-{i}", chunk)
                    for i, chunk in enumerate(chunks)
                )
            else:
                # ✓ Transient failure: resubmit the same content
                failed.append((doc_id, original_doc))

    # ✓ Only resubmit failures, not the entire batch
    if failed:
        resubmit_batch(failed)

    return successful
  • Match API approach to latency: synchronous for blocking pre-merge checks, batch for overnight/weekly non-urgent workloads
  • Resubmit only failed documents identified by custom_id — not the entire batch
  • For documents that exceeded context limits: chunk them before resubmitting rather than retrying as-is
  • Test on a representative sample before batch-processing large volumes — maximises first-pass success and reduces costly resubmission cycles

Exam Traps for Task 4.5

  • Trap: Using the batch API for a blocking pre-merge check that developers wait for. Why it fails: the batch API has an up-to-24-hour completion time — developers cannot wait on an unbounded window. Correct pattern: synchronous API for any blocking workflow; batch only for non-blocking async workloads.
  • Trap: Using the batch API for an agentic extraction workflow that calls tools mid-extraction. Why it fails: the batch API does not support multi-turn tool calling within a single request — tool calling requires the synchronous API. Correct pattern: synchronous API for tool-calling workflows; batch for single-turn extraction where results can be returned in one pass.
  • Trap: Resubmitting the entire batch when a few documents fail. Why it fails: reprocessing successful documents wastes 50x the cost of targeted resubmission — and correct results may be overwritten. Correct pattern: use custom_id to identify and resubmit only failed documents.

🔨 Implementation Task

T5

Build a Batch Extraction Pipeline with Failure Recovery

  • Submit a batch of 50 documents with unique custom_id values for each. Verify result correlation works correctly.
  • Deliberately include 3 oversized documents that exceed context limits. Implement chunking-based resubmission only for those failures.
  • Calculate the submission frequency for a 30-hour SLA given the batch API's 24-hour maximum processing window.
  • Run a sample of 10 documents with your extraction prompt before batch-processing all 50. Identify and fix any extraction gaps. Measure first-pass success improvement.
  • Design a decision matrix: given a new workflow, what 3 questions determine whether it should use batch vs synchronous API?

Exam Simulation — Task 4.5

Question 1 — Task 4.5 Structured Data Extraction
A document processing system has an SLA requiring all documents to be processed within 36 hours of receipt. Documents arrive continuously throughout the day. The team wants to use the batch API (up to 24-hour processing window). What batch submission frequency is required to guarantee the SLA?
  • ASubmit batches every 24 hours — batches complete within 24 hours so results will meet the 36-hour SLA
  • BSubmit batches every 36 hours — the full SLA window is the maximum submission interval
  • CSubmit batches every 12 hours — documents must be submitted within 12 hours of receipt so that even a 24-hour batch processing window still completes within the 36-hour SLA (36 − 24 = 12)
  • DBatch API is unsuitable — the 24-hour window makes any SLA under 24 hours impossible to guarantee
Correct: C
C is correct. SLA calculation: 36-hour SLA minus 24-hour maximum batch processing time leaves a 12-hour submission window. Documents must be submitted to the batch API within 12 hours of receipt. Submitting every 12 hours guarantees the latest-arriving document in each batch is submitted within 12 hours and processed within 24 hours — total ≤36 hours. A is wrong: submitting every 24 hours means a document arriving just after a submission could wait up to 24 hours before being submitted, then another 24 hours processing = 48 hours total. B is wrong: same logical error. D is wrong: a 36-hour SLA is achievable with a 24-hour batch window — with the right submission frequency.
Question 2 — Task 4.5 Customer Support Resolution Agent
Your customer support pipeline processes 10,000 support tickets per day. Your SLA requires ticket responses to be submitted within 36 hours of receipt, and processing latency is 24 hours end-to-end. Your batch job currently runs once per day at midnight. A ticket arrives at 11:58 PM. How much time remains before the SLA is breached from the moment the ticket arrives?
  • A36 hours
  • B24 hours
  • C12 hours
  • D0 hours
Correct: C
C is correct. Trace the timeline: the ticket arrives at 11:58 PM (T=0) → SLA deadline is T+36h = 11:58 AM on day+2. The next batch runs at midnight (T+2 min) and processing takes 24 hours, so the batch completes at ~12:00 AM on day+2. From that completion point to the SLA deadline (11:58 AM day+2) is approximately 12 hours of buffer remaining. A (36 hours) treats the full SLA window as available buffer, ignoring that 24 hours are consumed by processing. B (24 hours) misidentifies the processing latency itself as the buffer. D (0 hours) would only be correct if the batch ran after the SLA window had already closed — it runs just 2 minutes after arrival.
Question 3 — Task 4.5 Customer Support Resolution Agent
Your batch processing pipeline sends all 10,000 daily tickets to Claude in a single API request batch. Occasionally the entire batch fails due to a single malformed ticket causing a processing error, requiring a full re-run. Which batch design change most directly reduces the blast radius of single-ticket failures?
  • AAdd retry logic that re-submits the entire batch if any ticket fails.
  • BProcess tickets in smaller batches (e.g., 100 tickets per batch) so a single failure only affects that batch.
  • CAdd input validation before submission to catch malformed tickets.
  • DUse a dead-letter queue to route failed tickets to a separate processing pipeline.
Correct: B
B is correct. Smaller batches directly limit blast radius — if one ticket fails in a 100-ticket batch, only 99 others are affected rather than 9,999. A (retry entire batch) makes things worse — it repeats the problem at scale. C (input validation) is complementary but novel failures can still slip through. D (dead-letter queue) handles failures after they occur but doesn't prevent the batch from failing in the first place.
Task Statement 4.6
Task Statement 4.6

Design multi-instance and multi-pass review architectures

A model reviewing its own output is less effective than an independent reviewer — not because it's incapable, but because it retains reasoning context from generation. Independence eliminates the bias. Multi-pass eliminates attention dilution.

The Core Concept

When Claude generates code and then reviews it in the same session, it retains the reasoning context from generation — making it less likely to question its own architectural decisions. An independent review instance (a fresh session with no generation context) approaches the code with fresh eyes and is measurably more effective at catching subtle issues.

The Exam Principle: Self-review in the same session is less effective than independent review — not because "Claude can't review its own code" in principle, but because the retained reasoning context makes it biased toward its own decisions. The fix is architectural: use a separate instance that has only the code, not the generation history.

The Self-Review Limitation

✗ Self-Review (Same Session)
Session history:
  1. [User] "Generate a sorting algorithm"
  2. [Claude] "Here's merge sort: ..."
  3. [User] "Review this code for bugs"
  4. [Claude] "The code looks correct..."
Why it fails: Claude still has step 2's reasoning in context. It "knows" why it made each decision and is unlikely to question its own design choices or notice assumptions it encoded.
✓ Independent Review (New Session)
Review session (fresh):
  1. [User] "Review this code for bugs" [code provided directly]
  2. [Claude] "I notice the merge operation doesn't handle..."
Why it works: No generation context, no reasoning bias. Claude evaluates the code on its own merits. Same capability, different perspective.

Review Architectures

🔄

Independent Instance Review

After generation, spawn a second Claude instance with no generation history. Provide only the code. The independent instance reviews without reasoning bias from the generation phase.
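A sketch of the request the independent instance receives; note the single-message history containing only the code (the model id is illustrative):

```python
def build_independent_review_request(code: str) -> dict:
    """Build a fresh-context request: the reviewer instance sees only the
    code, never the generator's conversation history."""
    return {
        "model": "claude-sonnet-4-5",  # illustrative model id
        "max_tokens": 2048,
        "messages": [{
            "role": "user",
            "content": f"Review this code for bugs:\n\n{code}",
        }],
    }

# Sent with a real client as:
# response = client.messages.create(**build_independent_review_request(code))
```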

📄

Per-File + Integration Pass

For multi-file reviews: pass 1 analyzes each file individually for local issues (consistent depth, no attention dilution). Pass 2 is a separate cross-file integration review examining data flow, interface contracts, and cross-module patterns.
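The two-pass structure can be sketched as follows, with `review` standing in for any callable that sends a prompt to the model:

```python
def review_pr(files: dict, review) -> dict:
    """Pass 1: one focused review per file (full attention each).
    Pass 2: a single cross-file pass for data flow and interface contracts."""
    per_file = {
        path: review(f"Review this file for local issues only:\n\n{src}")
        for path, src in files.items()
    }
    combined = "\n\n".join(f"### {path}\n{src}" for path, src in files.items())
    integration = review(
        "Review cross-file concerns only (data flow, interface contracts, "
        "cross-module patterns):\n\n" + combined
    )
    return {"per_file": per_file, "integration": integration}
```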

🎯

Confidence-Scored Routing

Run a verification pass where the model self-reports confidence alongside each finding. Route high-confidence findings to automated action, low-confidence findings to human review. Calibrates automation vs human oversight.

When to Use Each

  • Independent instance: after any generation
  • Per-file + integration: PRs with 5+ files
  • Confidence routing: high-stakes automated actions
  • Combination: large PRs with auto-fix capability

  • Use a second independent Claude instance without the generator's reasoning context — not self-review instructions in the same session
  • Split large multi-file reviews into per-file local analysis passes plus a separate cross-file integration pass — avoids attention dilution and contradictory findings
  • Run verification passes with model-reported confidence scores to enable calibrated routing between automated action and human review

Exam Traps for Task 4.6

  • Trap: Adding a "critically review your own output" instruction in the same session. Why it fails: the reasoning context from generation is still present — the model retains its design decisions and is less likely to question them, regardless of the instruction. Correct pattern: use a second independent instance with no generation history — architectural bias removal, not just an instructional reminder.
  • Trap: Switching to a larger-context model to review 14 files in one pass. Why it fails: context window size doesn't solve attention dilution — models still produce inconsistent depth across many files in a single pass. Correct pattern: per-file passes for consistent depth, plus a separate integration pass for cross-file issues.
  • Trap: Running three consensus-based passes and only reporting issues appearing in 2/3 runs. Why it fails: consensus suppresses intermittently detected real bugs — issues that are only sometimes caught are still real issues, just hard to detect. Correct pattern: independent instance review, per-file passes, and confidence-scored routing — not consensus filtering.

🔨 Implementation Task

T6

Build a Two-Instance Review Pipeline

  • Generate a sorting algorithm in one Claude session. In the same session, ask Claude to review it for bugs. Document what it finds (or doesn't).
  • In a fresh Claude session, provide only the generated code with no generation history. Ask for a review. Compare findings with the self-review. Document the difference.
  • Create a 6-file PR. Run a single-pass review. Document inconsistency in depth and any contradictory findings.
  • Re-run with per-file passes (6 separate reviews) + 1 integration pass. Compare quality and consistency.
  • Add confidence scoring to the review output: have Claude rate each finding 1–5. Define routing rules: ≥4 → automated comment, ≤2 → human review queue.

Exam Simulation — Task 4.6

Question 1 — Task 4.6 CI/CD Code Review Pipeline
A CI pipeline uses the same Claude session to generate a function and then immediately review it. Developers report the review misses subtle issues that human reviewers catch. A developer suggests adding the instruction "Be critical of your own code — find all bugs including subtle ones." What outcome should you expect?
  • AReview quality improves significantly — explicit self-critique instructions activate a different reasoning mode
  • BReview quality improves marginally at best — the root cause is retained reasoning context from generation, which the instruction cannot eliminate. An independent review instance without generation history is needed
  • CReview quality gets worse — self-critique instructions cause the model to flag too many false positives
  • DReview quality improves because the instruction triggers extended thinking which bypasses the generation context
Correct: B
B is correct. The self-review limitation is architectural — the model retains the reasoning and assumptions from generation in its context window. Adding a self-critique instruction doesn't remove that reasoning context; the model still "knows" why it made each decision and is biased toward its own choices. The correct fix is structural: use an independent review instance with a fresh context containing only the code, not the generation history. A is wrong: no instruction mode bypasses retained context. C is wrong: there's no evidence self-critique increases false positives specifically. D is wrong: extended thinking operates within the same session context — it doesn't remove the generation history.
Question 2 — Task 4.6 CI/CD Code Review Pipeline
A PR review of 14 files is producing inconsistent results — detailed feedback on some files, superficial comments on others, and contradictory findings (flagging a pattern in one file, approving it in another). What is the correct architectural fix?
  • ASplit into focused per-file passes for local issue analysis, then run a separate cross-file integration pass examining data flow and interface consistency
  • BSwitch to a model with a 200k token context window to accommodate all 14 files with adequate attention
  • CRun three independent passes on the full PR and only report findings that appear in at least two of three runs
  • DAdd "maintain consistent analysis depth across all files" to the review system prompt
Correct: A
A is correct. This is the official exam question Q12 pattern. Per-file passes ensure consistent depth by limiting each pass to one file — attention is fully available. The integration pass catches cross-file issues that per-file analysis can't see. B is the canonical trap: larger context windows don't solve attention dilution — the model still processes middle content with lower reliability regardless of window size. C is wrong: consensus filtering suppresses real bugs that are only intermittently detected — not a quality improvement. D won't work: instruction-based fixes for attention dilution are probabilistic and fail on large inputs.
Question 3 — Task 4.6 Multi-Agent Research System
Your multi-pass research system uses Pass 1 to gather evidence from 20 sources and Pass 2 to synthesise findings into a structured report. In testing, Pass 2 produces reports where minor contradictions between low-credibility sources dominate the conclusions, while strong evidence from high-credibility sources is underweighted. What is the most effective prompt engineering change for Pass 2?
  • AInstruct Pass 2 to cite all sources equally to avoid bias.
  • BAdd explicit prioritisation criteria to the Pass 2 prompt, such as 'Weight findings by source credibility tier, recency, and consistency with majority evidence.'
  • CIncrease the number of sources in Pass 1 to 40 to dilute low-credibility outliers.
  • DRun Pass 2 twice and take the consensus between the two runs.
Correct: B
B is correct. Without explicit prioritisation criteria, Claude synthesises sources by giving roughly equal weight to all of them, causing minor contradictions from low-credibility sources to distort conclusions. Adding explicit weighting rules (credibility tier, recency, majority consistency) gives Pass 2 a decision framework for resolving conflicts. A (equal citation) is the exact problem — it enforces the flawed default behaviour. C (more sources in Pass 1) dilutes the signal further — adding more low-credibility sources doesn't help if Pass 2 lacks rules for handling credibility differences. D (two Pass 2 runs) averages the same flawed synthesis twice, producing consistent but still wrong results.