Domain 5 · 15% of exam · 6 Task Statements

Context Management & Reliability — Complete Lesson

Domain 5 is the reliability domain — everything that prevents production systems from silently degrading. It covers how context windows fill and fail, how errors should propagate to enable intelligent recovery, how to preserve source attribution through multi-agent pipelines, and when to keep humans in the loop.

This domain appears across all six exam scenarios. The "lost in the middle" effect, structured error propagation, escalation calibration, and provenance preservation are universal concerns that appear in customer support, research pipelines, CI/CD, and extraction systems alike.

Task Statement 5.1

Manage conversation context to preserve critical information across long interactions

Context windows are finite and degrade silently. Numbers disappear in summaries, tool results bloat with irrelevant fields, and findings buried in the middle of long inputs get dropped. The exam tests whether you know how to prevent each failure mode architecturally.

The Core Concept

Long conversations fail in predictable ways. Context management is the set of architectural choices that prevent those failures: extracting critical facts into a persistent layer, trimming tool outputs before they accumulate, positioning key findings where models reliably process them.

The Exam Principle: Three distinct failure modes, three distinct fixes. (1) Summarisation loses numbers → extract facts into a persistent "case facts" block. (2) Tool results bloat → trim to relevant fields before storing. (3) Lost-in-the-middle → place key findings at the beginning of aggregated inputs.

Progressive Summarisation Risks

When conversation history is summarised to save context, quantitative information collapses. "Customer reported order #ORD-4821 was delivered to the wrong address; the replacement cost is $127.49 and the expected delivery was 14 March 2025" becomes "Customer had a delivery issue and wants compensation." The order number, the amount, and the date — the facts needed to take action — are gone.

✗ Progressive Summary (Loses Facts)
After 5 turns of conversation, the summary reads: "Customer had delivery issues with two orders and is requesting refunds totalling a significant amount. They have been frustrated by multiple interactions."
Lost: order IDs, specific amounts, exact dates, original complaint text, stated expectations.

✓ Persistent Case Facts Block
Case Facts (always included):
  Order 1: ORD-4821 | $127.49 | Status: wrong_address_delivered | Expected: 2025-03-14
  Order 2: ORD-5103 | $43.00 | Status: damaged_on_arrival
  Customer expectation: full refund on both + apology voucher
  Summary: general conversation tone
  • Extract transactional facts (amounts, dates, order numbers, statuses) into a persistent "case facts" block that is always included in each prompt, outside the summarised history
  • For multi-issue sessions, extract each issue's data into a separate structured context layer — prevents individual issues from collapsing into each other during summarisation
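The case-facts pattern can be sketched in a few lines. This is a minimal illustration — the helper names (`extract_case_facts`, `build_prompt`) and the field choices are hypothetical, not a prescribed API:

```python
def extract_case_facts(tool_result, case_facts):
    """Copy transactional fields from a tool result into the persistent
    case-facts store so they survive later summarisation."""
    order_id = tool_result.get("order_id")
    if order_id:
        case_facts[order_id] = {
            "amount": tool_result.get("total_amount"),
            "status": tool_result.get("status"),
            "expected_date": tool_result.get("expected_date"),
        }
    return case_facts

def build_prompt(case_facts, summarised_history, user_message):
    """Case facts are always included verbatim, outside the summary."""
    facts_block = "\n".join(
        f"- {oid}: ${f['amount']} | {f['status']} | expected {f['expected_date']}"
        for oid, f in case_facts.items()
    )
    return (
        "## Case Facts (authoritative — do not paraphrase)\n"
        f"{facts_block}\n\n"
        "## Conversation Summary\n"
        f"{summarised_history}\n\n"
        f"User: {user_message}"
    )
```

The summary can be regenerated as aggressively as needed; the facts block is rebuilt from structured state on every turn, so exact values never pass through the lossy summarisation step.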

The "Lost in the Middle" Effect

Models reliably process information at the beginning and end of long inputs. Information in the middle sections of large aggregated inputs is processed less reliably — findings may be omitted from synthesised outputs even when they were present in the input.

📌

Key Findings First

Place the most critical findings summary at the beginning of aggregated inputs — not buried in the middle. If the synthesis agent must read 10 subagent reports, start with a condensed findings overview before the full reports.

🏷️

Explicit Section Headers

Organise long aggregated inputs with explicit section headers (## Web Search Results, ## Document Analysis, ## Prior Findings). Headers serve as attention anchors — findings under labelled sections are less likely to be dropped.

📊

Structured Subagent Output

Require subagents to include metadata (dates, source locations, methodological context) in structured outputs. Downstream synthesis agents need this context to interpret findings correctly.

✂️

Upstream Compression

Modify upstream agents to return structured data (key facts, citations, relevance scores) instead of verbose content and reasoning chains when downstream agents have limited context budgets.
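The first two mitigations — key findings first, explicit section headers — can be combined in one small assembly step. A sketch, assuming each subagent report is a dict with hypothetical `source`, `key_finding`, and `full_report` fields:

```python
def assemble_synthesis_input(reports):
    """Order aggregated subagent reports to counter lost-in-the-middle:
    a condensed key-findings overview goes first, and every full report
    sits under an explicit section header that acts as an attention anchor."""
    overview = "\n".join(
        f"- [{r['source']}] {r['key_finding']}" for r in reports
    )
    sections = "\n\n".join(
        f"## {r['source']}\n{r['full_report']}" for r in reports
    )
    return f"## Key Findings Overview\n{overview}\n\n{sections}"
```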

Tool Result Bloat

Every tool result appended to conversation history consumes tokens. A single lookup_order call that returns 40+ fields (shipping carrier, warehouse ID, pick timestamp, scan events, carrier contract details…) when only 5 fields are relevant (status, amount, dates, address) will bloat the context in proportion to the number of order lookups made during a session.

python — trim tool results before appending to context CONTEXT EFFICIENCY
def trim_order_result(full_result):
    # ✓ Keep only fields relevant to customer support
    return {
        "order_id":      full_result["order_id"],
        "status":        full_result["status"],
        "total_amount":  full_result["total_amount"],
        "expected_date": full_result["expected_date"],
        "delivery_address": full_result["delivery_address"]
        # ✗ Dropped: warehouse_id, carrier_contract, scan_events,
        #   pick_timestamp, route_code, manifest_id, ...
        # These 35 fields are never needed by the support agent
    }

# Use trimmed result when appending tool result to history
tool_results.append({
    "type": "tool_result",
    "tool_use_id": block.id,
    "content": trim_order_result(raw_result)  # ← trimmed
})

Exam Traps for Task 5.1

Trap: Summarise conversation history to save tokens
Why it fails: Summaries collapse quantitative facts (amounts, dates, order IDs) into vague prose — the specific values needed for action are lost.
Correct pattern: Keep transactional facts in a persistent case facts block outside the summarised history; only summarise conversational context.

Trap: Place key findings in the middle of a large aggregated input
Why it fails: The "lost in the middle" effect means models process middle-section content less reliably — critical findings may be dropped.
Correct pattern: Place the key findings summary at the beginning of aggregated inputs; use explicit section headers throughout.

Trap: Append full tool result objects to conversation history
Why it fails: Tool results with 40+ fields consume tokens disproportionate to their relevance — bloating context and reducing space for the actual conversation.
Correct pattern: Trim tool outputs to relevant fields before appending to context; keep only what the agent actually needs.

🔨 Implementation Task

T1

Build a Context-Managed Customer Support Session

  • Implement a persistent "case facts" block: extract all transactional facts (amounts, dates, order IDs) after each tool call and maintain them in a structured block outside conversation history
  • Simulate 10 turns of conversation with 5 order lookups. Compare context token usage with full tool results vs trimmed results. Document the difference.
  • Deliberately trigger "lost in the middle": put key findings in position 5 of a 10-item aggregated input. Verify they get dropped. Then move them to position 1 and verify they're preserved.
  • Implement section headers in aggregated subagent outputs: ## Web Results, ## Document Analysis, ## Prior Case Facts. Measure synthesis quality improvement.
  • Modify the upstream order lookup to return only 5 fields relevant to support resolution. Document how many tokens are saved over a 20-turn session.

Exam Simulation — Task 5.1

Question 1 — Task 5.1 Customer Support Resolution Agent
A customer support agent handles multi-issue sessions. After 8 turns, the agent incorrectly states a refund amount that the customer had mentioned explicitly in turn 2. Investigation reveals the history was summarised at turn 5, collapsing "Customer wants a $127.49 refund for order #ORD-4821" into "Customer wants refund for delivery issue." What is the correct architectural fix?
  • A. Increase the summarisation interval to every 15 turns instead of every 5, giving more context before facts are lost
  • B. Extract transactional facts (order IDs, amounts, dates, statuses) into a persistent "case facts" block that is always included in each prompt, separate from the summarised history
  • C. Switch to a larger context window model so summarisation is never needed
  • D. Include the full unsummarised conversation in every prompt to prevent fact loss
Correct: B
B is correct. The case facts block is the specific exam pattern for preserving quantitative transactional facts across summarisation. It keeps exact values (amounts, order IDs, dates) in a structured layer that survives compression, while the conversational history is still summarised to save tokens. A delays the problem but doesn't solve it — facts will still be lost at turn 15 instead of turn 5. C misses the point — larger context windows don't prevent summarisation-induced fact loss if you're still summarising. D works but defeats the purpose of context management — context windows still fill up.
Question 2 — Task 5.1 Multi-Agent Research System
A synthesis agent receives aggregated reports from 8 subagents. Analysis shows it consistently drops findings from reports 3–6 while reliably incorporating reports 1–2 and 7–8. What is the most likely cause and correct fix?
  • A. Reports 3–6 have lower quality findings — improve the subagents assigned to those topics
  • B. The "lost in the middle" effect — the synthesis agent processes reports at the beginning and end reliably but not the middle. Fix by placing a key findings summary at the beginning of the aggregated input and using explicit section headers
  • C. The synthesis agent's context window is too small — route it to a higher-tier model
  • D. Reports 3–6 are arriving in a different format — standardise all reports to the same structure
Correct: B
B is correct. The pattern — reliable at the start and end, dropping from the middle — is the diagnostic signature of the "lost in the middle" effect. The fix: place a condensed key findings overview at position 1 (before any detailed reports) and use explicit section headers. A incorrectly blames report quality when the positional pattern is the diagnostic. C adds cost but doesn't fix the positional processing pattern — larger windows still exhibit "lost in the middle." D addresses format consistency but not positional processing reliability.
Question 3 — Task 5.1 Customer Support Resolution Agent
Your customer support agent handles multi-session customer interactions. Each conversation is stored in a rolling 10-turn window. A customer reports that the agent asked for their account number three times across different sessions, despite the customer providing it each time. What is the root cause of this failure?
  • A. The rolling window was too short — increase it to 20 turns.
  • B. The agent failed to write the account number to its internal state correctly.
  • C. The account number is critical persistent context that must be stored in a persistent "active commitments" block outside the rolling window, not in the conversation turns themselves.
  • D. The agent should summarise each session before closing and prepend the summary to the next session.
Correct: C
C is correct. A rolling window discards older turns — if the account number was provided 11+ turns ago, it's gone. Critical persistent context (account number, customer preferences, unresolved issues) must be extracted from the conversation and stored separately in a persistent block that is always prepended to every session. A (longer window) delays the problem but doesn't solve it — eventually the number is still lost. B (write failure) may be a contributing bug, but the structural problem is that conversational turns are the wrong storage mechanism for critical context. D (session summary) is a partial mitigation but summaries can omit specific values unless the prompt explicitly requires their preservation.
Task Statement 5.2

Design effective escalation and ambiguity resolution patterns

Escalation miscalibration cuts both ways — agents either escalate too readily (dragging first-contact resolution down to 55%) or act too autonomously (attempting policy exceptions without approval). The exam tests whether you know which signals reliably indicate escalation need and which are unreliable proxies.

The Core Concept

Calibrated escalation requires explicit decision criteria, not self-reported confidence or sentiment detection. The exam distinguishes between three categories: cases where escalation is mandatory (explicit customer request, policy gaps), cases where the agent should offer to resolve first (straightforward issues within capability), and situations requiring clarification (ambiguous customer identity).

Official Q3 Pattern: When an agent has 55% first-contact resolution with incorrect escalation calibration, the correct first fix is always explicit escalation criteria with few-shot examples — not confidence scoring, not sentiment analysis, not ML classifiers. These other approaches add infrastructure without addressing the root cause: unclear decision boundaries.

Correct Escalation Triggers

🙋

Explicit Customer Request — Immediate

When a customer explicitly asks for a human agent, escalate immediately without attempting investigation first. Attempting to resolve despite an explicit request for escalation violates customer trust.

📋

Policy Gap — Escalate

When policy is ambiguous or silent on the specific request (e.g., customer requests competitor price matching when policy only covers own-site adjustments), escalate rather than attempting to extrapolate policy.

🔄

No Meaningful Progress — Escalate

When the agent cannot make meaningful progress after reasonable investigation — not just "complex" cases. Complexity alone is insufficient; inability to progress is the correct trigger.

🤝

Frustration + Capability — Acknowledge, Offer to Resolve

When a customer is frustrated but the issue is within the agent's capability: acknowledge the frustration, offer to resolve it. Escalate only if the customer reiterates their preference for a human after the offer.

💡
Ambiguous Identity: When a tool lookup returns multiple customer matches, ask for additional identifiers (email, phone, account number) rather than selecting based on heuristics like "most recent order" or "closest name match." Heuristic selection risks operating on the wrong account.

Unreliable Escalation Proxies

✗ Self-Reported Confidence Score
Agent self-reports "Confidence: 3/10"; escalate if below 5.
Why it fails: the agent is already making wrong decisions, and its confidence scores are just as poorly calibrated. An agent confidently doing the wrong thing still scores 8/10.

✗ Sentiment-Based Escalation
Escalate when negative sentiment exceeds a threshold.
Why it fails: sentiment ≠ complexity. A customer calmly requesting a policy exception needs escalation; a furious customer with a standard return does not. Sentiment solves a different problem from case complexity.
  • Add explicit escalation criteria to the system prompt with few-shot examples demonstrating when to escalate vs resolve autonomously — directly addresses unclear decision boundaries
  • Honor explicit customer requests for a human agent immediately — do not attempt investigation or resolution before transferring
  • Escalate on policy gaps (policy is silent on the specific case) rather than attempting to extrapolate or improvise
  • Ask for additional identifiers on multiple customer matches — never use heuristic selection on ambiguous identity
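One way to encode these criteria is a system-prompt block with few-shot examples. The wording, case labels, and examples below are illustrative only — adapt them to your actual policies:

```python
# Illustrative escalation criteria for a support agent's system prompt.
# Every rule and example here is hypothetical — substitute your own policies.
ESCALATION_CRITERIA = """\
## Escalation Rules
Escalate IMMEDIATELY (no investigation first) when:
- The customer explicitly asks for a human agent.
- Policy is silent or ambiguous on the request.
- You cannot make meaningful progress after reasonable investigation.

Do NOT escalate when:
- The customer is frustrated but the issue is within your capability:
  acknowledge the frustration, offer to resolve, and escalate only if
  they reiterate their preference for a human.

If a lookup returns multiple customer matches, ask for another
identifier (email, phone, account number) — never pick one by heuristic.

## Examples
Customer: "I want to talk to a real person." -> ESCALATE (explicit request)
Customer: "This is ridiculous, my package arrived broken!" -> RESOLVE
  (frustrated, but a standard damage replacement is within capability)
Customer: "Your competitor sells this for $20 less - match it?" -> ESCALATE
  (policy covers only own-site adjustments: policy gap)
"""
```

Note the criteria are categorical (case types), not numeric (confidence or sentiment thresholds) — that is the distinction the exam rewards.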

Exam Traps for Task 5.2

Trap: Use self-reported confidence scores to trigger escalation
Why it fails: LLM self-reported confidence is poorly calibrated — the same model making wrong decisions also produces unreliable confidence scores for those decisions.
Correct pattern: Explicit categorical escalation criteria: policy gap, explicit customer request, inability to make progress.

Trap: Escalate when sentiment analysis detects high frustration
Why it fails: Frustration level doesn't correlate with case complexity — a frustrated customer with a standard return doesn't need escalation; a calm customer requesting a policy exception does.
Correct pattern: Escalate on case type (policy gap, explicit request), not emotional state.

Trap: Attempt to resolve before transferring when the customer explicitly requests a human
Why it fails: Attempting investigation after an explicit escalation request frustrates the customer further and reduces trust — the explicit request overrides all other considerations.
Correct pattern: Honor explicit requests immediately; offer resolution only when the customer hasn't explicitly requested escalation.

Trap: Use heuristic selection when multiple customer matches are returned
Why it fails: Selecting "most recent order" or "closest name match" risks operating on the wrong customer's account — with financial and privacy consequences.
Correct pattern: Request additional identifiers (email, phone, account number) to narrow to a single match before proceeding.

Trap: Verify the triggering event but not the claimed causal consequence
Why it fails: Confirming that event X occurred does not confirm that event X caused outcome Y. A shipping delay happened — but did it cause the spoiled groceries, or were they left out for unrelated reasons?
Correct pattern: Verify both (1) that the triggering event occurred and (2) that it causally produced the claimed outcome before taking remedial action.

🔨 Implementation Task

T2

Implement Calibrated Escalation with Few-Shot Criteria

  • Write explicit escalation criteria covering: explicit human request, policy gap, inability to progress, and the frustration-but-resolvable case — each with a concrete few-shot example
  • Test: send a frustrated customer with a standard return. Verify the agent offers to resolve rather than escalating. Then have the customer reiterate their preference for a human — verify escalation fires.
  • Test: send a request for competitor price matching when policy only covers own-site adjustments. Verify escalation fires due to policy gap, not complexity.
  • Test: simulate multiple customer matches from a lookup. Verify the agent requests additional identifiers rather than selecting the "best" match.
  • Compare the agent's resolution rate before and after adding explicit criteria + few-shot examples. Document the improvement.

Exam Simulation — Task 5.2

Question 1 — Task 5.2 Customer Support Resolution Agent
Your agent achieves 55% first-contact resolution, well below the 80% target. Logs show it escalates straightforward cases (standard damage replacements with photo evidence) while attempting to autonomously handle complex situations requiring policy exceptions. What is the most effective first step?
  • A. Add explicit escalation criteria to the system prompt with few-shot examples demonstrating when to escalate versus resolve autonomously
  • B. Have the agent self-report a confidence score (1–10) and automatically escalate when confidence falls below 5
  • C. Deploy a separate ML classifier trained on historical tickets to predict which requests need escalation
  • D. Implement sentiment analysis and automatically escalate when negative sentiment exceeds a threshold
Correct: A
A is correct. This is the official exam question Q3. The root cause is unclear decision boundaries — the agent doesn't know which case types require escalation versus autonomous resolution. Explicit criteria with few-shot examples directly address this. B fails because self-reported confidence is poorly calibrated — the agent is already incorrectly confident on the cases it misclassifies. C is over-engineered — it requires labelled training data and ML infrastructure when prompt optimisation hasn't been tried. D solves a different problem — sentiment doesn't correlate with case complexity.
Question 2 — Task 5.2 Customer Support Resolution Agent
A customer writes: "I'm extremely frustrated — I've contacted you 3 times. I just want to speak to a real person." The agent begins investigating the order before transferring. What is wrong with this behaviour and what should the agent do?
  • A. The agent should complete its investigation first so the human agent receives a complete case summary
  • B. The customer explicitly requested a human — the agent must escalate immediately without attempting investigation. Attempting to resolve after an explicit escalation request further damages trust.
  • C. The agent should acknowledge the frustration and offer to resolve the issue before escalating, since the issue may be straightforward
  • D. The agent should escalate after one investigation attempt — give it one chance to resolve before transferring
Correct: B
B is correct. An explicit request for a human agent is an unconditional escalation trigger. The agent must transfer immediately. Attempting to investigate first — even with good intentions — directly contradicts the customer's stated preference and increases frustration. A is wrong: preparing a case summary is appropriate only when escalation is decided internally, not when the customer has explicitly demanded it. C describes the correct behaviour for an implicit frustration signal, not an explicit escalation request. D allows one investigation attempt which still violates the explicit request.
Question 3 — Task 5.2 Customer Support Resolution Agent
Your customer support agent has authority to issue credits up to $50 without human approval. A customer claims a delayed shipment cost them $45 in spoiled groceries. The agent verifies the shipping delay, confirms it occurred, and issues a $45 credit. Post-incident review finds the agent skipped a required step. What step did the agent skip?
  • A. The agent should have escalated all claims above $25 to a human.
  • B. The agent issued credit without verifying the customer's claim that the spoiled groceries were caused by the delay — it only verified the delay itself.
  • C. The agent should have applied a standard processing fee before issuing credit.
  • D. The agent exceeded its authority because $45 is too close to the $50 limit and requires human review.
Correct: B
B is correct. The agent correctly identified that a delay occurred, but incorrectly assumed that confirming the delay was sufficient to verify the customer's claim. The missing step was verifying causation — whether the spoiled groceries were actually caused by the delay. A (escalate at $25) is not a stated policy — the threshold is $50. C (processing fee) is invented. D ($45 near limit) is not a rule — the agent's authority is up to $50, and $45 is within that limit.
Task Statement 5.3

Implement error propagation strategies across multi-agent systems

Two anti-patterns dominate: silently suppressing errors (returning empty results as success) and terminating entire workflows on single failures. Both prevent intelligent recovery. Structured error context enables the coordinator to make informed decisions about how to proceed.

The Core Concept

When a subagent fails, the coordinator needs enough information to decide whether to retry, use an alternative source, proceed with partial results, or escalate. A generic "search unavailable" status tells the coordinator nothing useful. A structured error with failure type, what was attempted, partial results, and potential alternatives gives the coordinator everything it needs.

The Official Q8 Pattern: When a web search subagent times out, the correct error propagation approach is to return structured error context — failure type, attempted query, any partial results, and potential alternative approaches — so the coordinator can decide whether to retry with a modified query, try an alternative, or proceed with partial findings.

Structured Error Context

python — structured error propagation from subagent to coordinator CORRECT PATTERN
# ✗ Anti-pattern: generic status hides context
return {"status": "search_unavailable"}
# Coordinator can't distinguish timeout from rate limit
# from invalid query from empty result — all look the same

# ✓ Structured error context enables intelligent recovery
return {
    "isError": True,
    "errorType": "timeout",              # what kind of failure
    "attemptedQuery": query,          # what was tried
    "partialResults": partial_data,   # what was found before failure
    "alternativeApproaches": [         # what coordinator can try next
        "Retry with shorter query",
        "Try document_search instead",
        "Use cached results from prior session"
    ],
    "isRetryable": True                # should coordinator retry?
}
  • Subagents should implement local recovery for transient failures — retry with exponential backoff internally. Only propagate errors they cannot resolve locally.
  • When propagating: include what was attempted, any partial results found, and potential alternatives — give the coordinator what it needs to decide.
  • Structure synthesis output with coverage annotations — indicate which findings are well-supported vs which topic areas have gaps due to unavailable sources.
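The local-recovery rule in the first bullet might look like this. A sketch: `run_search` is a hypothetical callable that raises `TimeoutError` on transient failures, and the backoff schedule is arbitrary:

```python
import time

def search_with_local_recovery(run_search, query, max_retries=3):
    """Retry transient failures locally with exponential backoff;
    propagate a structured error only after local recovery fails."""
    for attempt in range(max_retries):
        try:
            results = run_search(query)
            return {"isError": False, "results": results}
        except TimeoutError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s ...
    # Local recovery exhausted — give the coordinator structured context.
    return {
        "isError": True,
        "errorType": "timeout",
        "attemptedQuery": query,
        "partialResults": [],
        "alternativeApproaches": [
            "Retry with shorter query",
            "Try document_search instead",
        ],
        "isRetryable": True,
    }
```

The coordinator only ever sees failures the subagent could not absorb, and every propagated failure arrives with the fields it needs to choose the next move.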

Empty Results vs Access Failures

This distinction appears directly in the exam. A search that finds no matching documents is a successful operation with an empty result — the coordinator should not retry it. A search that timed out before completing is an access failure — the coordinator may choose to retry.

✓ Valid Empty Result (Success)
Query: "EU AI Act penalties 2023" → 0 documents found
Response: { "isError": false, "results": [], "message": "No documents match the query in this corpus" }
Coordinator: the topic may not be covered — try a different approach; don't blindly retry the same query.

✗ Access Failure (Error)
Query: "EU AI Act penalties 2023" → network timeout at 30s
Response: { "isError": true, "errorType": "timeout", "attemptedQuery": "...", "isRetryable": true }
Coordinator: a retry may succeed, possibly with a modified query or a different source.

Exam Traps for Task 5.3

Trap: Return empty results as success to hide the error from the coordinator
Why it fails: Silently suppressing errors prevents any recovery decision — the coordinator treats empty results as "no findings on this topic" rather than "source was unavailable."
Correct pattern: Report access failures as errors with structured context; reserve empty results for genuinely successful queries that found no matches.

Trap: Terminate the entire research workflow when one subagent fails
Why it fails: A single source failure doesn't invalidate all research — the coordinator can proceed with partial results from other subagents, noting the coverage gap.
Correct pattern: Propagate structured error context to the coordinator, which decides whether to retry, use alternatives, or proceed with partial results and a coverage annotation.

Trap: Return a generic "search unavailable" on timeout
Why it fails: A generic status hides whether the failure was a timeout (retryable), an invalid query (not retryable), or a rate limit (wait and retry) — the coordinator can't decide correctly.
Correct pattern: Include errorType, isRetryable, attemptedQuery, partialResults, and alternativeApproaches in structured error responses.

🔨 Implementation Task

T3

Build Structured Error Propagation for a Research Pipeline

  • Implement structured error responses in a web search subagent: include errorType, attemptedQuery, partialResults, alternativeApproaches, isRetryable
  • Simulate a timeout: verify the coordinator receives structured context and can decide to retry with a modified query
  • Simulate a valid empty result: verify the coordinator correctly distinguishes it from an access failure and doesn't retry the same query
  • Implement local retry with exponential backoff in the subagent for transient failures — verify errors only propagate to coordinator when local recovery fails
  • Implement coverage annotations in synthesis output: flag which topic areas have gaps due to subagent failures — include the structured gap information in the final report

Exam Simulation — Task 5.3

Question 1 — Task 5.3 Multi-Agent Research System
A web search subagent times out while researching a complex topic. You need to design how this failure flows back to the coordinator agent. Which approach best enables intelligent recovery?
  • A. Return structured error context to the coordinator including the failure type, the attempted query, any partial results, and potential alternative approaches
  • B. Implement automatic retry with exponential backoff within the subagent, returning a generic "search unavailable" status only after all retries are exhausted
  • C. Catch the timeout within the subagent and return an empty result set marked as successful
  • D. Propagate the timeout exception directly to a top-level handler that terminates the entire research workflow
Correct: A
A is correct. This is the official exam question Q8. Structured error context gives the coordinator the information it needs to make intelligent recovery decisions. B has the right idea about local retry (handle transient failures locally before propagating) but the generic "search unavailable" status after all retries still hides valuable context — the coordinator can't decide whether to try a modified query or alternative source. C is the worst outcome — silently suppressing the error by returning empty results as success prevents any recovery and risks incomplete research being treated as complete. D terminates unnecessarily — a single subagent timeout doesn't invalidate the entire research effort.
Question 2 — Task 5.3 Multi-Agent Research System
Your multi-agent research pipeline has 6 sub-agents running in parallel. Sub-agent 3 fails partway through its analysis after completing 60% of its assigned sources. The coordinator must produce a partial synthesis using the 5 completed sub-agents plus the 60% partial output from sub-agent 3. What should the coordinator include in the synthesis context to handle this correctly?
  • A. Exclude sub-agent 3's output entirely and note the gap in the final report.
  • B. Include sub-agent 3's completed 60% output but add a structured flag indicating partial completion, which sources were covered, and which were not.
  • C. Include a structured partial-completion context block for sub-agent 3 that lists completed sources, pending sources, and any preliminary findings — enabling the synthesis to incorporate available findings while clearly attributing what is missing.
  • D. Re-run sub-agent 3 from the beginning and wait for full completion before synthesising.
Correct: C
C is correct. A structured partial-completion context block gives the synthesis agent everything it needs: what was completed (so findings can be incorporated), what is missing (so the synthesis doesn't over-claim), and the preliminary findings themselves. This is more informative than B's simple flag, which acknowledges the gap but doesn't help the coordinator integrate the partial findings. A (exclude entirely) loses the 60% of valid work already completed. D (re-run) defeats the purpose of partial recovery and adds latency.
Question 3 — Task 5.3 Multi-Agent Research System
Your research pipeline processes each topic with three sequential agents: Retrieval → Analysis → Synthesis. In production, the Analysis agent occasionally fails on complex topics, requiring a restart of the entire three-stage pipeline from Retrieval. The Retrieval stage takes 45 minutes. What architectural change most effectively reduces recovery time when Analysis fails?
  • A. Run all three stages in parallel and discard whichever fails.
  • B. Cache Retrieval outputs and add retry logic to the Analysis agent.
  • C. Implement checkpointing: persist Retrieval outputs to durable storage after completion, so that an Analysis failure triggers a resume from the persisted checkpoint rather than a full restart.
  • D. Merge Retrieval and Analysis into a single agent to eliminate the handoff failure point.
Correct: C
C is correct. Checkpointing — persisting the Retrieval output to durable storage — means that when Analysis fails, the pipeline resumes from the checkpoint rather than restarting from Retrieval, eliminating the 45-minute re-run. B (cache + retry) is similar but 'cache' implies in-memory storage that may not survive a crash, while 'durable storage' in C provides true fault tolerance. A (parallel + discard) doesn't make sense for sequential dependencies. D (merge agents) creates a larger single point of failure and makes the combined agent harder to retry selectively.
Task Statement 5.4

Manage context effectively in large codebase exploration

Extended exploration sessions degrade silently — the model starts giving inconsistent answers and referencing "typical patterns" instead of specific classes it discovered earlier. Scratchpad files, subagent delegation, and structured state persistence are the three tools that prevent this.

The Core Concept

Context degradation in long exploration sessions is insidious — the model doesn't announce that it's losing track of earlier findings. It simply starts answering from training-data patterns rather than from the codebase it was exploring. Scratchpad files externalise key findings so they survive context boundaries.

Scratchpad Files and Subagent Delegation

📝

Scratchpad Files

Agents maintain files recording key findings. When context degrades, reference the scratchpad rather than relying on in-context memory. Files persist across context boundaries — memory does not.

🔬

Subagent Delegation

Spawn subagents for focused investigation questions ("find all test files," "trace refund flow dependencies"). Verbose exploration output stays in the subagent context. The main agent receives only structured summaries.

🔄

Phase Summaries

Before spawning subagents for the next exploration phase, summarise key findings from the current phase. Inject these summaries into the subagent's initial context — prevents each phase from starting blind.

/compact Command

Use /compact during extended sessions when context fills with verbose discovery output. Reduces token usage while preserving the key findings that have been referenced explicitly.

Scratchpad file pattern — persist findings across context boundaries CONTEXT PERSISTENCE
# .claude/scratchpad/exploration-state.md

## Session: Refund Flow Investigation
## Last Updated: Phase 2 complete

### Key Findings
- Entry point: src/api/orders/refunds.ts:42 (processRefund)
- Validation gate: RefundValidator.validate() must pass first
- Amount threshold: $500 hardcoded in RefundPolicy.ts:18
- DB write: OrderRepository.updateStatus() is the final step
- Test coverage: 0 tests for >$500 threshold path

### Unresolved Questions
- What happens when RefundValidator fails? (not traced yet)
- Is OrderRepository.updateStatus() atomic? (check next phase)

### Files Analysed
src/api/orders/refunds.ts, src/validators/RefundValidator.ts,
src/policies/RefundPolicy.ts (partial)

Crash Recovery with Structured State Exports

Long-running multi-agent exploration can be interrupted. Structured state persistence enables recovery without full re-exploration:

  • Each agent exports its state to a known location at checkpoints — current task, findings so far, unresolved questions
  • A coordinator loads a manifest on resume — which agents completed, which were in progress, what each found
  • Inject the manifest into resumed agent prompts — agents continue from where they left off rather than restarting
  • Prefer structured state exports over session resumption when agents were mid-exploration — prior tool results may be stale if files changed
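The recovery flow above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the state-file layout, agent names, and `done` flag are all assumptions invented for the example (a real system would define its own manifest schema).

```python
# Sketch of manifest-based crash recovery: agents export state at
# checkpoints; on resume, a coordinator builds a manifest and resumes
# only the interrupted agents. File layout and field names are illustrative.
import json
import tempfile
from pathlib import Path

STATE_DIR = Path(tempfile.mkdtemp()) / "state"  # stand-in for a durable store

def export_checkpoint(agent: str, state: dict) -> None:
    """Each agent persists its current state at a checkpoint."""
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    (STATE_DIR / f"{agent}.json").write_text(json.dumps(state))

def load_manifest() -> dict:
    """Coordinator on resume: which agents completed, which were mid-task."""
    manifest = {"completed": {}, "in_progress": {}}
    for path in STATE_DIR.glob("*.json"):
        state = json.loads(path.read_text())
        bucket = "completed" if state.get("done") else "in_progress"
        manifest[bucket][path.stem] = state
    return manifest

# Simulate a crash: one agent finished, one was mid-exploration.
export_checkpoint("retrieval", {"done": True, "findings": ["entry point found"]})
export_checkpoint("analysis", {"done": False, "current_task": "trace refund flow"})

manifest = load_manifest()
to_resume = sorted(manifest["in_progress"])  # only interrupted agents restart
print(to_resume)  # → ['analysis']
```

Completed agents' findings (here, `retrieval`'s) would be injected into the resumed agents' prompts as context, so no finished work is re-run.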

Exam Traps for Task 5.4

Trap: Rely on in-context memory for key findings across a 3-hour exploration session.
Why it fails: Context degrades silently — the model starts referencing "typical patterns" instead of the specific classes found in phase 1. No warning is given.
Correct pattern: Maintain scratchpad files recording key findings; reference them explicitly for subsequent questions rather than relying on in-context memory.

Trap: Run verbose exploration in the main agent context.
Why it fails: Verbose discovery output fills the main agent's context window, leaving insufficient space for coordination and synthesis.
Correct pattern: Delegate focused exploration questions to subagents; the main agent receives structured summaries and maintains high-level coordination context.

Trap: Resume a crashed multi-agent session with --resume without checking what changed.
Why it fails: If files changed since the last session, prior tool results are stale — the resumed agent reasons from outdated data.
Correct pattern: Load the structured state manifest on resume; identify which findings are still valid; re-explore only what changed.

🔨 Implementation Task

T4

Build a Scratchpad-Backed Codebase Explorer

  • Create a scratchpad file structure: key findings, unresolved questions, files analysed, session phase
  • After each major discovery, write findings to the scratchpad before continuing. After 10 discoveries, ask a question about finding #3 — verify the agent references the scratchpad correctly
  • Deliberately fill the context with verbose output, then use /compact. Verify key scratchpad-referenced findings are preserved in the compacted context
  • Implement a 3-subagent exploration: each writes its findings to a separate scratchpad file. The coordinator reads all three at synthesis time.
  • Simulate a crash mid-exploration. Implement manifest-based recovery: coordinator reads the manifest, identifies which subagents completed, resumes only the interrupted ones.

Exam Simulation — Task 5.4

Question 1 — Task 5.4 Developer Productivity with Claude
A developer uses Claude Code to explore a large legacy codebase over 4 hours. After 3 hours, they notice Claude is giving inconsistent answers — referring to "a typical service layer pattern" instead of the specific OrderService class it analysed in hour 1. What is happening and what is the correct architectural fix?
  • A. Claude is making errors — restart the session and re-analyse the problematic files
  • B. Context degradation — early findings are being pushed out of reliable processing range. Fix by maintaining scratchpad files recording key findings, and having agents reference them explicitly for subsequent questions
  • C. The model needs a larger context window — switch to a model that can hold 4 hours of exploration in context
  • D. Use /clear to reset the context and re-ask all questions from scratch with better prompts
Correct: B
B is correct. The diagnostic — referencing generic "typical patterns" instead of specific classes that were explicitly analysed — is the context degradation signature described in the exam guide. Scratchpad files externalise findings so they survive context window pressure; the agent references the file rather than relying on in-context memory. A misdiagnoses as model error when it's a context management architecture problem. C delays the problem — even a 200k token window fills in a long enough session; scratchpad files don't have this limit. D clears all context including useful findings from earlier phases — the scratchpad approach preserves what was found while resetting verbose discovery output.
Question 2 — Task 5.4 Developer Productivity Pipeline
A developer asks your Claude-powered assistant to explain a bug fix in a large codebase. The assistant has already used 80% of its context window loading related files. The developer follows up with 'Can you also check if this pattern appears elsewhere in the codebase?' Claude's response becomes less coherent and misses two obvious file matches. What is the most likely cause?
  • A. Claude's temperature was set too high, causing hallucinations.
  • B. Context degradation — at 80%+ context utilisation, Claude's ability to accurately reference earlier content in the window degrades, causing it to miss patterns it loaded earlier.
  • C. The search request was too vague for Claude to interpret correctly.
  • D. Claude cannot perform file pattern matching without a dedicated search tool.
Correct: B
B is correct. Claude's ability to recall and cross-reference content earlier in its context window degrades as the window fills. At 80%+, the model may fail to surface patterns it accurately processed earlier in the conversation. A (temperature) affects randomness, not recall quality. C (vague request) might be a factor in an ambiguous case, but the developer asked a clear pattern-matching question — the issue is context saturation. D is wrong — Claude can perform pattern matching given the file content in context, but context saturation is the limiting factor here.
Question 3 — Task 5.4 Developer Productivity Pipeline
You need to audit a monorepo with 2,000 TypeScript files for a specific deprecated API usage pattern. Using a single Claude session to review all files would exhaust the context window. What is the most reliable architecture for exhaustive coverage?
  • A. Use Claude with a very large context window model to fit all files at once.
  • B. Sample 200 representative files and extrapolate the findings to the full codebase.
  • C. Use Grep to generate an exhaustive list of candidate files matching the pattern, then spawn separate Claude sessions per file (or small batches) so each session has full context for accurate analysis.
  • D. Ask Claude to scan the directory structure first and self-select which files to examine.
Correct: C
C is correct. The two-step approach — Grep for candidate identification, then per-file Claude sessions — provides both exhaustive coverage and full context quality for each analysis. Grep is deterministic and fast; Claude adds semantic judgment (is this usage actually deprecated in this call context?). A (large context model) helps but still has limits at 2,000 files and is expensive. B (sample) is fundamentally non-exhaustive — it misses files outside the sample. D (self-select) introduces coverage gaps because Claude's file selection is based on naming conventions, not a deterministic scan of actual file contents.
Task Statement 5.5

Design human review workflows and confidence calibration

97% overall accuracy sounds production-ready — until you discover it's 99.8% on standard invoices and 71% on handwritten receipts. Aggregate metrics hide segment failures. Calibrated confidence routing prevents both over-automation and under-utilisation of human review capacity.

The Core Concept

Human review capacity is finite. Routing all low-confidence extractions to humans wastes capacity on cases where the model is actually reliable. Routing high-confidence extractions without sampling assumes uniform performance — which is rarely true across document types and fields.

The Exam Principle: Validate accuracy by document type and field segment before reducing human review for high-confidence extractions. Aggregate accuracy (97%) can mask 71% accuracy on a specific segment. Always stratify before automating.

Aggregate Metric Risk

A system processing 90% standard invoices (99% accuracy) and 10% handwritten receipts (71% accuracy) has an aggregate accuracy of 96.2%. That number looks good. But if you use it to justify removing human review, the handwritten receipt segment — which may contain high-value transactions — proceeds with 71% accuracy undetected.

✗ Aggregate Accuracy Report
Overall: 97% accuracy ✓ → Reduce human review threshold → Route high-confidence directly → Save reviewer capacity
Hidden reality:
  • Standard invoices: 99.8% ✓
  • Handwritten receipts: 71.0% ✗
  • Scanned forms: 88.5% ✓
  • Foreign-language: 62.0% ✗
Two segment failures invisible in the aggregate number.
✓ Stratified Analysis
By document type:
  • Standard invoices: 99.8% → automate
  • Handwritten receipts: 71.0% → human review
  • Scanned forms: 88.5% → borderline
  • Foreign-language: 62.0% → human review
By field within type:
  • vendor_name: 98% → automate
  • line_items: 73% → human review
Segment-aware routing.
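The aggregate-versus-stratified contrast is easy to verify arithmetically. A minimal sketch, reusing the two-segment example from this section (segment shares, accuracy rates, and the 95% target are illustrative numbers, not a prescribed policy):

```python
# Aggregate accuracy vs per-segment accuracy: the same data produces a
# reassuring aggregate number while one segment fails badly.
segments = {
    "standard_invoice":    {"share": 0.90, "accuracy": 0.99},
    "handwritten_receipt": {"share": 0.10, "accuracy": 0.71},
}

# Weighted aggregate: 0.9 * 0.99 + 0.1 * 0.71 = 0.962
aggregate = sum(s["share"] * s["accuracy"] for s in segments.values())
print(f"aggregate: {aggregate:.1%}")  # → aggregate: 96.2%

# Stratified routing: automate only segments that individually meet target.
TARGET = 0.95
routing = {name: ("automate" if s["accuracy"] >= TARGET else "human_review")
           for name, s in segments.items()}
print(routing)  # handwritten receipts stay under human review
```

The aggregate figure (96.2%) would pass most dashboards; only the per-segment view reveals that 1 in 10 documents comes from a segment with a 29% error rate.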

Confidence Calibration and Routing

🎯

Field-Level Confidence Scores

Have the model output confidence scores per field, not per document. A document may have 98% confidence on vendor_name but 60% on line_items — route only the uncertain fields for review, not the whole document.

📊

Calibrate with Labeled Sets

Raw model confidence scores are not inherently calibrated. Calibrate thresholds using labeled validation sets — find the confidence score above which actual accuracy exceeds your target. Don't assume a 90% confidence score means 90% accuracy.

🔍

Stratified Random Sampling

For extractions the model is confident about, implement ongoing stratified random sampling — periodically review a random sample of "high confidence" outputs to detect novel error patterns before they accumulate.

🚥

Routing Priority

When reviewer capacity is limited, prioritise: low-confidence fields first, then document types with known poor performance, then ambiguous/contradictory source documents. High-confidence standard-format documents last.

  • Validate accuracy by document type and field before reducing human review — aggregate metrics mask segment-level failures
  • Implement stratified random sampling of high-confidence extractions — detect novel error patterns before they scale
  • Route ambiguous or contradictory source documents to human review regardless of model confidence score
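The field-level routing idea above can be sketched concretely. The 0.85 threshold stands in for a value found by calibration against a labeled set (not a recommended constant), and the field names and confidence values are hypothetical:

```python
# Field-level confidence routing: route only the uncertain fields of a
# document to human review, not the whole document.
CALIBRATED_THRESHOLD = 0.85  # would be derived from a labeled validation set

extraction = {
    "vendor_name": {"value": "Acme Corp",           "confidence": 0.98},
    "line_items":  {"value": "3x widget @ $9.99",   "confidence": 0.60},
    "total":       {"value": "$127.49",             "confidence": 0.92},
}

needs_review = [field for field, data in extraction.items()
                if data["confidence"] < CALIBRATED_THRESHOLD]
auto_accept = [field for field in extraction if field not in needs_review]

print(needs_review)  # → ['line_items']  (only the uncertain field is routed)
print(auto_accept)   # → ['vendor_name', 'total']
```

The whole document is neither blocked nor blindly accepted: two fields flow through automatically while `line_items` alone consumes reviewer capacity.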

Exam Traps for Task 5.5

Trap: Use aggregate accuracy (97%) to justify reducing human review.
Why it fails: Aggregate masks segment-level failures — a 71%-accuracy segment invisible in the aggregate will proceed without review.
Correct pattern: Stratify by document type and field before making any automation decisions; automate only segments that meet accuracy targets individually.

Trap: Route all documents with confidence below X% to human review.
Why it fails: Document-level confidence ignores field-level variation — a document can have 98% confidence on vendor_name and 60% on line_items.
Correct pattern: Use field-level confidence scores; route uncertain fields to review independently from high-confidence fields in the same document.

Trap: Stop sampling once a 97% accuracy rate is established.
Why it fails: Novel document formats or data distributions can degrade accuracy silently — ongoing stratified sampling detects new error patterns before they accumulate at scale.
Correct pattern: Maintain ongoing stratified random sampling of high-confidence extractions for continuous error-rate measurement.

🔨 Implementation Task

T5

Build a Segmented Accuracy Analysis and Routing System

  • Run extraction on a mixed document set (standard invoices, scanned forms, handwritten receipts). Compute accuracy both in aggregate and by document type — observe the difference.
  • Implement field-level confidence scores in the extraction schema. Route fields with confidence below threshold to human review independently of other fields in the same document.
  • Calibrate confidence thresholds using a labeled validation set of 50 documents — find the confidence score above which accuracy exceeds 95%.
  • Implement stratified random sampling: route 5% of high-confidence extractions to random review to detect novel error patterns.
  • Build a routing priority queue: low-confidence fields → poor-performance document types → contradictory sources → high-confidence standard formats. Verify routing order under reviewer capacity constraints.

Exam Simulation — Task 5.5

Question 1 — Task 5.5 Structured Data Extraction
An extraction pipeline achieves 97% overall accuracy across 10,000 documents. The team proposes routing all high-confidence model outputs directly to downstream systems without human review, relying on the 97% accuracy as justification. What is the primary risk of this approach?
  • A. 97% accuracy is not high enough for production — target should be 99.5% before removing human review
  • B. Aggregate accuracy masks poor performance on specific document types or fields — a subset may have 65% accuracy invisible in the 97% overall figure. Accuracy must be validated by segment before reducing review.
  • C. Model confidence scores are not calibrated — high confidence doesn't mean high accuracy without calibration against a labeled set
  • D. The system has no mechanism to detect when new document formats appear that reduce accuracy
Correct: B
B is correct. The exam guide explicitly states: "aggregate accuracy metrics (e.g., 97% overall) may mask poor performance on specific document types or fields." This is the primary risk of aggregate-based automation decisions. C is also a real risk and is mentioned in the exam guide, but it's the secondary concern here — the question is about the primary risk of the proposed approach, which is aggregate-metric blindness. D is valid (handled by ongoing stratified sampling) but again secondary to the immediate risk of masking. A is wrong — there's no universal accuracy threshold for removing human review; it depends on business stakes and segment performance.
Question 2 — Task 5.5 Customer Support Resolution Agent
Your support agent has been running for 3 months. Based on outcome data, 85% of ticket types are handled correctly without human review. Leadership proposes removing human review entirely for the 85% to reduce costs. Your analysis shows the 15% failure rate is NOT randomly distributed — it clusters around tickets from a specific new customer segment added 6 weeks ago. What is the most appropriate response to leadership's proposal?
  • A. Accept the proposal — 85% accuracy is sufficient for cost reduction.
  • B. Reject blanket removal of human review; instead route by ticket segment, maintaining review for the new segment until the failure rate is resolved for that segment.
  • C. Accept the proposal but add a post-hoc weekly audit of a 5% random sample.
  • D. Reject the proposal entirely until accuracy reaches 99% across all segments.
Correct: B
B is correct. The 85% accuracy figure masks a distribution mismatch — the 15% failure rate is concentrated in a new customer segment. Removing human review uniformly would eliminate oversight exactly where it is most needed. Segment-specific routing maintains cost reduction for well-performing segments while preserving review for the problematic segment. A (blanket removal) ignores the failure clustering. C (5% post-hoc audit) is not sufficient oversight for a segment with a 15% failure rate. D (99% threshold) ignores that even high overall accuracy can mask concentrated failures.
Question 3 — Task 5.5 Customer Support Resolution Agent
Your support automation pipeline was initially scoped for billing and shipping inquiries. After 4 months, leadership expands scope to include product defect claims, which involve potential safety implications and regulatory reporting requirements. Leadership requests reducing human review to maintain throughput. What is the correct response?
  • A. Reduce human review for billing and shipping to compensate for the new overhead from defect claims.
  • B. Accept reduced human review if the existing accuracy threshold is maintained across all ticket types.
  • C. Reject oversight reduction: scope expansion to a higher-risk domain (safety implications, regulatory requirements) represents a distribution shift that requires re-calibrating human review thresholds, not reducing them.
  • D. Pause all automation until the pipeline is retrained on defect claim data.
Correct: C
C is correct. When system scope expands to a domain with higher stakes (safety, regulatory), the appropriate response is to increase or re-calibrate oversight, not reduce it. The original accuracy thresholds were calibrated for lower-stakes billing and shipping decisions — they do not transfer to defect claims without re-evaluation. A (reduce billing/shipping review) reduces oversight in an already-validated domain to offset new domain costs — the wrong tradeoff direction. B (maintain threshold) assumes the existing threshold is appropriate for the new domain, which hasn't been validated. D (pause all automation) is overly disruptive — the existing billing/shipping scope continues to perform well.
Task Statement 5.6

Preserve information provenance and handle uncertainty in multi-source synthesis

Source attribution is silently lost during summarisation. Conflicting statistics from credible sources require annotation, not arbitrary selection. Temporal differences in data need dates to distinguish genuine contradictions from time-based evolution. These are the provenance preservation skills the exam tests.

The Core Concept

When a synthesis agent combines findings from multiple subagents, source attribution can disappear at every summarisation step. A finding that started as "According to WHO 2024 report (p.47): 78% of cases show early onset" becomes "Most cases show early onset" — no source, no statistic, no date. Structured claim-source mappings prevent this.

Claim-Source Mappings

Subagents must output structured mappings that the synthesis agent is required to preserve and merge — not summarise away:

Structured claim-source mapping — preserved through synthesis PROVENANCE PATTERN
# ✗ What gets lost with prose summarisation
"Web research shows most companies adopted AI by 2023."
# No source. No statistic. No date. Unverifiable.

# ✓ Structured claim-source mapping
{
  "claim": "67% of Fortune 500 companies had deployed AI in production",
  "source_url": "https://mckinsey.com/ai-survey-2023",
  "source_name": "McKinsey Global AI Survey",
  "publication_date": "2023-10-15",
  "relevant_excerpt": "67% of respondents reported...",
  "confidence": "high"
}

# ✓ Synthesis agent merges mappings, preserves attribution
# Report: "McKinsey (Oct 2023) found 67% of Fortune 500..."
# [citation preserved through all synthesis steps]
  • Require subagents to output structured claim-source mappings (source URL, document name, relevant excerpt, publication date) that downstream agents must preserve through synthesis
  • Structure reports with explicit sections distinguishing well-established findings from contested ones, preserving original source characterisations
  • Render different content types appropriately: financial data as tables, news as prose, technical findings as structured lists — rather than converting everything to a uniform format
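The merge step the bullets describe can be sketched as a small function: the synthesis stage consumes structured mappings and generates prose from them, so attribution is carried into the final text rather than recovered after the fact. The claims and source metadata below mirror this section's McKinsey/Gartner example and are illustrative:

```python
# Synthesis from structured claim-source mappings: prose is generated
# FROM the mappings, so citations survive every synthesis step.
claims = [
    {"claim": "67% of Fortune 500 companies had deployed AI in production",
     "source_name": "McKinsey Global AI Survey",
     "publication_date": "2023-10-15"},
    {"claim": "45% AI adoption among surveyed US companies",
     "source_name": "Gartner",
     "publication_date": "2023-03-02"},
]

def render_with_attribution(mapping: dict) -> str:
    """Turn one structured mapping into an attributed report sentence."""
    year = mapping["publication_date"][:4]
    return f'{mapping["source_name"]} ({year}): {mapping["claim"]}'

report_lines = [render_with_attribution(c) for c in claims]
print(report_lines[0])
# → McKinsey Global AI Survey (2023): 67% of Fortune 500 companies had deployed AI in production
```

Contrast this with prose summarisation, where the source name and date would have to be parsed back out of free text — fragile at best, impossible once a summarisation step has dropped them.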

Handling Conflicting Sources and Temporal Data

⚖️

Conflicting Statistics — Annotate, Don't Choose

When two credible sources report different statistics on the same topic, annotate the conflict with both sources rather than arbitrarily selecting one value. The synthesis report should surface the conflict and attribute each figure — the reader decides which to use.

📅

Temporal Differences — Require Dates

Require publication or data collection dates in all structured outputs. Two statistics that look conflicting may simply reflect different time periods — 35% in 2020 and 67% in 2023 are not contradictory. Without dates, temporal evolution looks like a disagreement.

🔀

Upstream Conflict Passing

Complete document analysis with conflicting values included and explicitly annotated. Let the coordinator decide how to reconcile before passing to synthesis — don't resolve conflicts at the document analysis stage by silently selecting one value.

📊

Methodological Context

Include methodological context in structured outputs — sample size, methodology, geographic scope. Two statistics can disagree because they measure different populations, not because one is wrong. Methodology context enables correct interpretation.

💡
Conflict Annotation Pattern: "Adoption rate: McKinsey (Oct 2023, n=1,500 enterprises, global): 67%. Gartner (Mar 2023, n=800 US companies): 45%. Note: different survey populations and methodologies — figures are not directly comparable." This is better than either silently picking 67% or averaging to 56%.

Exam Traps for Task 5.6

Trap: Summarise subagent findings into prose before passing to synthesis.
Why it fails: Prose summaries strip source URLs, statistics, and publication dates — attribution is permanently lost once summarised.
Correct pattern: Pass structured claim-source mappings; the synthesis agent merges structured data and generates prose with attribution preserved.

Trap: When two credible sources conflict, use the higher-quality source and discard the other.
Why it fails: Both sources may be correct in different contexts (time periods, populations, methodologies) — silently discarding one hides genuine uncertainty from the reader.
Correct pattern: Annotate the conflict with both sources, their dates, and methodological context; let the coordinator or reader decide how to interpret.

Trap: Omit publication dates from structured outputs to save tokens.
Why it fails: Without dates, temporal evolution (35% in 2020, 67% in 2023) looks like a contradiction — the synthesis agent may flag a non-existent disagreement.
Correct pattern: Require publication and data-collection dates in all subagent structured outputs — essential for temporal interpretation.

🔨 Implementation Task

T6

Build a Provenance-Preserving Multi-Source Synthesis Pipeline

  • Design a subagent output schema that includes: claim, source_url, source_name, publication_date, relevant_excerpt, methodology_notes, confidence
  • Instruct the synthesis agent to preserve all claim-source mappings when combining findings — verify citations survive through final output
  • Deliberately provide two conflicting statistics from credible sources. Verify the synthesis annotates both with attribution rather than selecting one.
  • Provide findings with different publication years (2020, 2022, 2024). Verify the synthesis correctly interprets temporal differences as evolution, not contradiction.
  • Build a final report with distinct sections: well-established findings, contested findings (with source conflict annotations), and gaps in coverage.

Exam Simulation — Task 5.6

Question 1 — Task 5.6 Multi-Agent Research System
A synthesis agent combines findings from a web search agent and a document analysis agent. Each subagent returns prose summaries. The final report contains statistics with no sources, claims without attribution, and no way to trace findings back to original documents. What structural change would fix this?
  • A. Instruct the synthesis agent: "Always include source citations in the final report"
  • B. Require subagents to output structured claim-source mappings (claim, source URL, excerpt, publication date) that the synthesis agent must preserve and merge rather than summarising into prose
  • C. Instruct subagents to include citations in their prose summaries so the synthesis agent can extract them
  • D. Run a post-synthesis citation extraction pass that identifies claims and attempts to trace them back to original sources
Correct: B
B is correct. The root cause is architectural — prose summaries strip attribution at the subagent output stage, before the synthesis agent even sees the data. The fix must be upstream: require structured mappings from subagents so attribution is never lost. A instructs the synthesis agent to cite sources it no longer has access to — the attribution was lost before it arrived. C improves the prose summaries but attribution embedded in prose is fragile — parsing prose to extract citations is error-prone. D attempts to recover attribution that was already lost — expensive and unreliable.
Question 2 — Task 5.6 Multi-Agent Research System
A research pipeline finds two conflicting statistics: McKinsey reports 67% AI adoption among Fortune 500 companies; Gartner reports 45%. Both are credible, widely-cited analyst reports from the same year. How should the synthesis agent handle this conflict?
  • A. Use the McKinsey figure (67%) as it comes from a more established research firm and discard the Gartner figure
  • B. Average the two figures to report 56% as a balanced estimate
  • C. Annotate the conflict with both figures, their source attribution, and methodological context (sample size, geography, definition of "AI adoption") — let the report reader interpret based on full context
  • D. Flag the conflict for human review and omit both statistics from the report pending resolution
Correct: C
C is correct. The exam guide explicitly states: "annotating conflicts with source attribution rather than arbitrarily selecting one value." Both figures may be correct given different survey methodologies, populations, or definitions of "adoption." The synthesis agent's job is to surface the conflict with full context, not resolve it. A arbitrarily selects — brand reputation doesn't resolve methodological differences. B creates a figure that neither source reported — mathematically invalid and methodologically meaningless. D omits potentially valid statistics — both figures provide useful information even in conflict; the annotation pattern makes them usable.
Question 3 — Task 5.6 Multi-Agent Research System
Your synthesis agent produces a report stating: 'Revenue growth is accelerating, with compound annual growth rates of 23–31% confirmed across three independent sources.' Post-review reveals that Source A reported 23% CAGR for North American operations, Source B reported 31% CAGR for Asia-Pacific, and Source C reported 27% CAGR for digital channels only. The synthesis is factually wrong. What is the root cause of the failure?
  • A. The synthesis agent hallucinated the figures — none of the sources reported those exact numbers.
  • B. The synthesis agent used an incorrect averaging formula to combine regional figures.
  • C. The synthesis agent conflated figures from different geographic and channel scopes into a single global claim, and lost the source provenance that would have revealed the incompatibility.
  • D. The three sources are contradictory and should have been excluded from the synthesis.
Correct: C
C is correct. The figures (23%, 31%, 27%) are all accurate within their respective sources — there is no hallucination. The failure is semantic conflation: the synthesis agent treated three figures with different scope definitions (regional, regional, channel-specific) as interchangeable data points about the same metric. Preserving provenance — keeping each figure tagged with its source, scope, and definition — would have made the incompatibility visible before synthesis. A (hallucination) is wrong — all figures appear verbatim in the sources. B (averaging formula) is wrong — no average was computed; the range was directly reported. D (contradictory) is wrong — the sources measure different things and are all accurate within their scope.