Chanl
Learning AI

Context Engineering Is What Your Agent Actually Needs

Prompt engineering hits a wall with production AI agents. Context engineering fixes it. Build a full context pipeline with memory, RAG, history compression, and tool resolution.

Dean Grover, Co-founder
March 20, 2026
21 min read
Illustration of an engineer assembling context layers for an AI agent, with memory, tools, and knowledge sources flowing into a central pipeline

You've spent hours perfecting a system prompt. The agent still hallucinates, forgets what the customer said three turns ago, and ignores half its tools. Sound familiar?

The problem isn't your prompt. It's everything else in the context window.

Anthropic's Applied AI team published their guide to context engineering in late 2025, and by early 2026 the term had swept through the developer community. Shopify CEO Tobi Lutke put it bluntly: context engineering "describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM." Andrej Karpathy called it "the delicate art and science of filling the context window with just the right information for the next step."

This isn't a rebrand. It's a recognition that the hard engineering problem behind production AI agents was never about writing better sentences. It's about building systems that dynamically assemble the right information -- system instructions, conversation history, retrieved knowledge, persistent memory, tool definitions -- into a limited context window, on every single inference call, across multi-step workflows that can run for dozens of turns.

This tutorial teaches you to build that system. We'll go from the conceptual shift through each layer of a production context pipeline, with working code in both TypeScript and Python. By the end, you'll have a context engine that handles memory injection, knowledge retrieval, history compression, and dynamic tool resolution.

Prerequisites

You'll need Node.js 20+ and Python 3.11+ installed. We'll use the Vercel AI SDK for TypeScript and the openai package for Python.

TypeScript setup:

bash
mkdir context-engine && cd context-engine
npm init -y
npm install ai @ai-sdk/openai zod
npm install -D typescript @types/node tsx
npx tsc --init --target ES2022 --module NodeNext --moduleResolution NodeNext --outDir dist

Python setup:

bash
mkdir context_engine && cd context_engine
python -m venv .venv && source .venv/bin/activate
pip install openai numpy tiktoken

We install tiktoken alongside openai because accurate token counting matters -- the naive "divide by 4" heuristic can be off by 20-30% on code-heavy or multilingual content, and that margin is the difference between fitting in your budget and silently truncating context.
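To see what that margin looks like, here's a small sketch comparing the heuristic against an exact count. The `o200k_base` encoding is the one used by the gpt-4o model family; the broad `except` covers environments where tiktoken or its cached encoding files aren't available:

```python
def count_tokens(text: str) -> int:
    """Exact count via tiktoken when available; /4 heuristic otherwise."""
    try:
        import tiktoken
        enc = tiktoken.get_encoding("o200k_base")  # gpt-4o-family encoding
        return len(enc.encode(text))
    except Exception:
        # Fallback heuristic: ~4 chars per token for English prose
        return max(1, len(text) // 4)

code_heavy = 'const x = {"a": [1, 2, 3], "b": null};'
print(count_tokens(code_heavy))      # exact count if tiktoken is installed
print(max(1, len(code_heavy) // 4))  # heuristic, for comparison
```

Run both on a code-heavy or multilingual sample and you'll see the gap the paragraph above describes.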

Environment:

bash
export OPENAI_API_KEY="sk-..."

If you've worked through our RAG tutorial or MCP server guide, you already have the foundation. Context engineering is what ties those pieces together into a coherent runtime.

What Is Context Engineering

Context engineering is the discipline of designing and managing everything a model sees at inference time -- not just the prompt, but the entire assembled context window including system instructions, tool schemas, retrieved documents, conversation history, and injected memories. It's the difference between writing a good question and building the information system that makes good answers possible.

The scale of the shift is hard to overstate. According to LangChain's 2026 State of Agent Engineering report, 57% of organizations now have agents in production -- but among large enterprises, "quality" remains the number-one barrier to scaling, cited by 32% of respondents. The report specifically calls out "context engineering and managing context at scale" as an ongoing difficulty. In other words: most teams can get an agent demo working. The context pipeline is what separates the demo from production.

Here's how Anthropic's team defines it: "Context engineering is the art and science of curating what will go into the limited context window from that constantly evolving universe of possible information."

That word "curating" is doing a lot of work. Remember the agent that hallucinates and forgets what the customer said three turns ago? Here's exactly why. When your agent handles a customer support call, the universe of possibly-relevant information includes:

  • The system prompt defining agent behavior
  • The customer's full conversation history (maybe 50+ turns)
  • Their profile and previous interactions from your CRM
  • Knowledge base articles matching their question
  • Definitions and schemas for 20+ available tools
  • Memories from previous conversations ("prefers email over phone", "has a premium plan")
  • Results from tools already called in this conversation

All of that needs to fit in a context window. Claude Opus 4.6 gives you 1 million tokens -- roughly 750,000 words. That sounds like plenty until you realize that a 30-minute customer support conversation with tool calls, knowledge retrieval, and memory injection can easily consume 50,000+ tokens. And more isn't better: as deepset's research notes, "models can actually get worse at recalling specific facts as the context gets very large." They call this phenomenon context rot.

The transformer architecture processes tokens with attention that scales quadratically -- n tokens create n-squared pairwise relationships. Stuff in everything, and the model's attention budget gets diluted across irrelevant information.
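The arithmetic behind that dilution, as a quick sketch:

```python
# Pairwise attention relationships grow quadratically with context length
def attention_pairs(n_tokens: int) -> int:
    return n_tokens * n_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):>18,} pairwise relationships")

# A 10x jump in context (10K -> 100K tokens) means 100x the relationships,
# while each token's attention mass still has to normalize to sum to 1.
```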

Prompt Engineering vs. Context Engineering

If you've read our prompt engineering techniques guide, you know that individual prompt techniques -- chain-of-thought, few-shot examples, role prompting -- are powerful. Context engineering doesn't replace those techniques. It's the layer above them.

| Dimension | Prompt Engineering | Context Engineering |
| --- | --- | --- |
| Scope | The instruction text you write | Everything the model sees at inference |
| Nature | Mostly static templates | Dynamic, assembled per-request at runtime |
| Focus | How to phrase the question | What information to provide and when |
| Integration | Text-based | Tools, APIs, memory, retrieved docs, history |
| Lifecycle | Write once, iterate | Evolves with every conversation turn |
| Debugging | Read the prompt | Inspect the full assembled context |
| Cost driver | Prompt length (small, fixed) | Retrieved context volume (variable, 5-50x prompt size) |
| Performance ceiling | ~15-20% improvement from prompt rewrites | ~40-60% improvement from context pipeline changes |

As deepset puts it: "Performance gains increasingly come not from better models, but from smarter context." You could swap from GPT-4o to Claude Opus and see a 10% improvement. Or you could fix your context pipeline and see a 50% improvement on the same model.

Here's the counterintuitive part that most teams learn the hard way: more context often makes agents worse, not better. Google's "Needle in a Haystack" evaluations and subsequent research show that recall accuracy drops as context fills up -- even on models that technically support the full window. A model at 20% context utilization will reliably retrieve a specific fact. At 80% utilization, that same fact can get lost -- especially information buried in the middle third of the window (the "lost in the middle" effect documented by Liu et al., 2023). A well-curated 30K-token context will outperform a carelessly assembled 120K-token context every time. Context engineering is as much about what you exclude as what you include.

The Five Layers

A production context pipeline has five layers, each assembled dynamically at runtime: system instructions at the foundation, then retrieved knowledge, persistent memory, conversation history, and tool definitions. The order matters -- models pay more attention to the beginning and end of context windows.

Here's the architecture:

User Message → Context Engine [1. System Instructions · 2. Retrieved Knowledge (RAG) · 3. Persistent Memory · 4. Conversation History · 5. Tool Definitions] → Assembled Context Window → LLM Inference → Response + Tool Calls → Tool Execution
Five layers of a production context pipeline, assembled per-request

Each layer has its own engineering challenges. Let's build them.

Layer 1: System Instructions

System instructions should be the smallest set of high-signal tokens that fully define your agent's behavior. Anthropic's team advises aiming for "minimal information that fully outlines expected behavior" -- specific enough to guide decisions, flexible enough to handle edge cases.

The mistake most teams make is treating the system prompt like a feature specification. They stuff in every edge case, every policy rule, every exception. A 5,000-token system prompt might sound thorough, but you've just consumed 5% of a 100K context window before the conversation even starts.

TypeScript -- Structured System Prompt Builder:

typescript
import { z } from "zod";
 
// Zod validates config at runtime -- catches missing fields before they silently degrade output
const SystemPromptConfig = z.object({
  role: z.string(),
  constraints: z.array(z.string()),
  personality: z.string().optional(),
  escalationRules: z.array(z.string()).optional(),
  outputFormat: z.string().optional(),
});
 
type SystemPromptConfig = z.infer<typeof SystemPromptConfig>;
 
function buildSystemPrompt(
  config: SystemPromptConfig,
  injectedContext: { memories?: string; knowledge?: string }
): string {
  const sections: string[] = [];
 
  // Role first: models weight the beginning of context most heavily
  sections.push(`<role>\n${config.role}\n</role>`);
 
  // Constraints next: behavioral boundaries the model must not cross
  if (config.constraints.length > 0) {
    sections.push(
      `<constraints>\n${config.constraints.map((c) => `- ${c}`).join("\n")}\n</constraints>`
    );
  }
 
  // Memories injected per-request -- this section doesn't exist at design time
  if (injectedContext.memories) {
    sections.push(
      `<customer_context>\nRelevant information from previous interactions:\n${injectedContext.memories}\n</customer_context>`
    );
  }
 
  // RAG results injected per-request -- grounds the model in facts, prevents hallucination
  if (injectedContext.knowledge) {
    sections.push(
      `<knowledge>\nRelevant documentation:\n${injectedContext.knowledge}\n</knowledge>`
    );
  }
 
  // Escalation rules: safety net for situations the agent shouldn't handle alone
  if (config.escalationRules?.length) {
    sections.push(
      `<escalation>\n${config.escalationRules.map((r) => `- ${r}`).join("\n")}\n</escalation>`
    );
  }
 
  // XML sections help the model attend to specific blocks vs. parsing a wall of text
  return sections.join("\n\n");
}
 
// Usage
const systemPrompt = buildSystemPrompt(
  {
    role: "You are a customer support agent for Acme Corp. You help customers with billing questions, account issues, and product information.",
    constraints: [
      "Never reveal internal pricing formulas or discount logic",
      "Always verify customer identity before accessing account details",
      "If you cannot resolve an issue in 3 attempts, offer to escalate to a human agent",
    ],
    escalationRules: [
      "Escalate immediately if customer mentions legal action",
      "Escalate if customer sentiment is consistently negative for 3+ turns",
    ],
  },
  {
    memories: "Customer prefers email communication. Has been a subscriber since 2024. Previously had a billing dispute resolved in their favor.",
    knowledge: "Current promotion: 20% off annual plans through March 31. Refund policy: full refund within 30 days, prorated after.",
  }
);

Python -- Structured System Prompt Builder:

python
from dataclasses import dataclass, field
 
@dataclass
class SystemPromptConfig:
    role: str
    constraints: list[str] = field(default_factory=list)
    personality: str | None = None
    escalation_rules: list[str] = field(default_factory=list)
    output_format: str | None = None
 
def build_system_prompt(
    config: SystemPromptConfig,
    memories: str | None = None,
    knowledge: str | None = None,
) -> str:
    sections = []
 
    # Role first: models weight the beginning of context most heavily
    sections.append(f"<role>\n{config.role}\n</role>")
 
    # Hard behavioral boundaries the model must not cross
    if config.constraints:
        items = "\n".join(f"- {c}" for c in config.constraints)
        sections.append(f"<constraints>\n{items}\n</constraints>")
 
    # Per-request injection -- doesn't exist at design time, populated from memory store
    if memories:
        sections.append(
            f"<customer_context>\nRelevant information from previous interactions:\n{memories}\n</customer_context>"
        )
 
    # Per-request injection -- RAG results ground the model in facts
    if knowledge:
        sections.append(
            f"<knowledge>\nRelevant documentation:\n{knowledge}\n</knowledge>"
        )
 
    # Safety net for situations the agent shouldn't handle alone
    if config.escalation_rules:
        items = "\n".join(f"- {r}" for r in config.escalation_rules)
        sections.append(f"<escalation>\n{items}\n</escalation>")
 
    return "\n\n".join(sections)
 
# Usage
config = SystemPromptConfig(
    role="You are a customer support agent for Acme Corp.",
    constraints=[
        "Never reveal internal pricing formulas",
        "Verify customer identity before accessing account details",
    ],
    escalation_rules=[
        "Escalate immediately if customer mentions legal action",
    ],
)
 
system_prompt = build_system_prompt(
    config,
    memories="Customer prefers email. Subscriber since 2024.",
    knowledge="Current promotion: 20% off annual plans through March 31.",
)

The key insight is that this system prompt is assembled at runtime. The memories and knowledge sections don't exist at design time -- they're populated per-request based on who the customer is and what they're asking about. That's context engineering. That's why your carefully crafted static prompt wasn't enough -- it had no mechanism to inject what it needed to know right now.

Layer 2: Knowledge Retrieval

Before each LLM call, retrieve the 3-5 most relevant knowledge base documents using the current message as a query. This grounds the model in facts and prevents hallucination. (New to RAG? Start with our RAG from Scratch tutorial and come back here.)

The challenge isn't retrieval -- it's deciding how much to retrieve. A naive approach dumps 10 chunks into the context. That's 2,000-5,000 tokens that might be 80% irrelevant.

TypeScript -- Token-Budgeted RAG Retrieval:

typescript
interface RetrievedChunk {
  content: string;
  score: number;
  source: string;
  tokenCount: number;
}
 
interface TokenBudget {
  knowledge: number;  // Hard ceiling -- RAG cannot exceed this
  memory: number;     // Hard ceiling -- memory injection
  history: number;    // Hard ceiling -- conversation history
  tools: number;      // Hard ceiling -- tool schemas
  system: number;     // Hard ceiling -- system prompt
}
 
// Default per-layer ceilings -- tune the ratios toward the history-heavy split described below
const DEFAULT_BUDGET: TokenBudget = {
  system: 2000,
  knowledge: 3000,
  memory: 1000,
  history: 8000,
  tools: 4000,
};
 
async function retrieveWithBudget(
  query: string,
  budget: number,
  retriever: { search(query: string, topK: number): Promise<RetrievedChunk[]> }
): Promise<string> {
  // Over-fetch, then trim -- cheaper than multiple retrieval calls
  const chunks = await retriever.search(query, 10);
 
  // 0.7 threshold: below this, chunks inject noise that dilutes attention
  const relevant = chunks.filter((c) => c.score >= 0.7);
 
  // Greedy packing: highest-score chunks first until budget exhausted
  const selected: RetrievedChunk[] = [];
  let tokenCount = 0;
 
  for (const chunk of relevant) {
    if (tokenCount + chunk.tokenCount > budget) break;
    selected.push(chunk);
    tokenCount += chunk.tokenCount;
  }
 
  if (selected.length === 0) return "";
 
  // Source attribution helps the model cite its reasoning
  return selected
    .map((c, i) => `[Source ${i + 1}: ${c.source}]\n${c.content}`)
    .join("\n\n---\n\n");
}

Python -- Token-Budgeted RAG Retrieval:

python
from dataclasses import dataclass
 
@dataclass
class RetrievedChunk:
    content: str
    score: float
    source: str
    token_count: int
 
# Hard ceilings per layer -- prevents any single layer from starving the others
DEFAULT_BUDGET = {
    "system": 2000,
    "knowledge": 3000,
    "memory": 1000,
    "history": 8000,
    "tools": 4000,
}
 
async def retrieve_with_budget(
    query: str,
    budget: int,
    retriever,  # has async search(query, top_k) -> list[RetrievedChunk]
) -> str:
    # Over-fetch, then trim -- cheaper than multiple retrieval calls
    chunks = await retriever.search(query, top_k=10)
 
    # 0.7 threshold: below this, chunks inject noise that dilutes attention
    relevant = [c for c in chunks if c.score >= 0.7]
 
    # Greedy packing: highest-score chunks first until budget exhausted
    selected = []
    token_count = 0
 
    for chunk in relevant:
        if token_count + chunk.token_count > budget:
            break
        selected.append(chunk)
        token_count += chunk.token_count
 
    if not selected:
        return ""
 
    sections = []
    for i, c in enumerate(selected, 1):
        sections.append(f"[Source {i}: {c.source}]\n{c.content}")
 
    return "\n\n---\n\n".join(sections)

The token budget forces a critical design decision: how do you allocate your context window? In production, we've found a roughly 60/20/20 split works well for customer-facing agents -- 60% for conversation history (the customer expects you to remember what they said), 20% for knowledge and memory, 20% for system prompt and tools.
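One way to turn that split into hard per-layer ceilings -- a minimal sketch where the total allowance and the fractions are illustrative knobs, not prescriptions:

```python
def allocate_budget(total_tokens: int, fractions: dict[str, float]) -> dict[str, int]:
    """Turn a fractional split into hard per-layer token ceilings."""
    assert abs(sum(fractions.values()) - 1.0) < 1e-9, "fractions must sum to 1"
    return {layer: int(total_tokens * frac) for layer, frac in fractions.items()}

# 60/20/20: history-heavy split for customer-facing agents
budget = allocate_budget(30_000, {
    "history": 0.60,
    "knowledge": 0.10,
    "memory": 0.10,
    "system": 0.10,
    "tools": 0.10,
})
print(budget)  # history gets 18,000 of the 30K allowance
```

Expressing the budget as fractions rather than fixed numbers makes it easy to re-derive ceilings when you change models or context allowances.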

Layer 3: Memory Injection

Memory injection transforms a stateless chatbot into something that knows you. Before each LLM call, the system performs a semantic search against the user's stored memories and injects relevant hits into the system prompt.

Without memory, every conversation starts from zero. The customer re-explains their situation, preferences, history. With memory injection, the agent already knows this customer prefers email, has been a subscriber for two years, and had a billing dispute last month. Here's the flow:

User: "I need to update my billing address" → Context Engine: semantic search for "billing address update" → Memory Store returns ["Prefers email confirmations", "Changed address in Jan 2026", "Premium plan"] → memories injected into system prompt → assembled context → LLM: "I can update that for you. Should I send the confirmation to your email like last time?"
Memory injection: semantic search retrieves relevant memories before each LLM call

TypeScript -- Memory Injection Service:

typescript
interface Memory {
  id: string;
  content: string;
  category: "preference" | "fact" | "interaction" | "feedback";
  createdAt: Date;
  embedding: number[];
}
 
interface MemorySearchResult {
  memory: Memory;
  score: number;
}
 
class MemoryInjector {
  private minScore = 0.3;   // Lower than RAG (0.7) because memories are broader signals
  private maxMemories = 10;  // Cap prevents flooding context with stale facts
  private tokenBudget = 1000;
 
  async injectMemories(
    customerId: string,
    currentMessage: string,
    memoryStore: {
      search(
        customerId: string,
        query: string,
        options: { minScore: number; limit: number }
      ): Promise<MemorySearchResult[]>;
    }
  ): Promise<string> {
    // The current message IS the search query -- "billing address" retrieves billing-related memories
    const results = await memoryStore.search(customerId, currentMessage, {
      minScore: this.minScore,
      limit: this.maxMemories,
    });
 
    if (results.length === 0) return "";  // No memories = no tokens wasted
 
    // Grouping by category creates structure the model can attend to selectively
    const grouped = new Map<string, string[]>();
    for (const { memory } of results) {
      const items = grouped.get(memory.category) || [];
      items.push(memory.content);
      grouped.set(memory.category, items);
    }
 
    const sections: string[] = [];
 
    if (grouped.has("preference")) {
      sections.push(
        `Customer preferences:\n${grouped.get("preference")!.map((m) => `- ${m}`).join("\n")}`
      );
    }
 
    if (grouped.has("fact")) {
      sections.push(
        `Known facts:\n${grouped.get("fact")!.map((m) => `- ${m}`).join("\n")}`
      );
    }
 
    if (grouped.has("interaction")) {
      sections.push(
        `Previous interactions:\n${grouped.get("interaction")!.map((m) => `- ${m}`).join("\n")}`
      );
    }
 
    return sections.join("\n\n");
  }
}

Python -- Memory Injection Service:

python
from dataclasses import dataclass
from collections import defaultdict
 
@dataclass
class Memory:
    id: str
    content: str
    category: str  # "preference" | "fact" | "interaction" | "feedback"
    created_at: str
    embedding: list[float]
 
@dataclass
class MemorySearchResult:
    memory: Memory
    score: float
 
class MemoryInjector:
    def __init__(self, min_score: float = 0.3, max_memories: int = 10):
        self.min_score = min_score
        self.max_memories = max_memories
 
    async def inject_memories(
        self,
        customer_id: str,
        current_message: str,
        memory_store,  # has async search(customer_id, query, min_score, limit)
    ) -> str:
        results = await memory_store.search(
            customer_id,
            current_message,
            min_score=self.min_score,
            limit=self.max_memories,
        )
 
        if not results:
            return ""
 
        # Group by category
        grouped: dict[str, list[str]] = defaultdict(list)
        for result in results:
            grouped[result.memory.category].append(result.memory.content)
 
        sections = []
 
        if "preference" in grouped:
            items = "\n".join(f"- {m}" for m in grouped["preference"])
            sections.append(f"Customer preferences:\n{items}")
 
        if "fact" in grouped:
            items = "\n".join(f"- {m}" for m in grouped["fact"])
            sections.append(f"Known facts:\n{items}")
 
        if "interaction" in grouped:
            items = "\n".join(f"- {m}" for m in grouped["interaction"])
            sections.append(f"Previous interactions:\n{items}")
 
        return "\n\n".join(sections)

The minScore: 0.3 threshold is important. Too high and you miss relevant memories. Too low and you inject noise. In production, 0.3 is a good starting point for cosine similarity with modern embedding models -- it catches memories that are topically related without flooding the context with tangential facts.

Memory injection feeds directly into the system prompt builder from Layer 1. The <customer_context> section is populated by whatever the memory injector returns. If there are no relevant memories, that section is simply omitted -- no wasted tokens.

This is the layer that directly fixes the problem we opened with. Your agent "forgot" what the customer said three turns ago? That's a history problem. But your agent doesn't know the customer at all -- their preferences, their plan, their last dispute? That's a memory problem. And no amount of prompt engineering can solve it, because the information simply isn't there.
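To make the Layer 1 + Layer 3 handoff concrete, here's a compressed, self-contained sketch of the wiring: a stub store stands in for a real embedding-backed memory search, and a simplified prompt builder receives whatever the injector returns. The names mirror the code above, but the store and its canned contents are hypothetical:

```python
import asyncio

class StubMemoryStore:
    """Stands in for a real vector store -- returns canned (category, content) hits."""
    async def search(self, customer_id: str, query: str, min_score: float, limit: int):
        return [
            ("preference", "Prefers email communication"),
            ("fact", "Premium plan since 2024"),
        ]

async def inject_memories(customer_id: str, message: str, store) -> str:
    results = await store.search(customer_id, message, min_score=0.3, limit=10)
    if not results:
        return ""
    return "\n".join(f"- [{cat}] {content}" for cat, content in results)

def build_system_prompt(role: str, memories: str) -> str:
    sections = [f"<role>\n{role}\n</role>"]
    if memories:  # section omitted entirely when empty -- no wasted tokens
        sections.append(f"<customer_context>\n{memories}\n</customer_context>")
    return "\n\n".join(sections)

async def main():
    memories = await inject_memories(
        "cust_42", "update my billing address", StubMemoryStore()
    )
    print(build_system_prompt("You are a support agent for Acme Corp.", memories))

asyncio.run(main())
```

The important property is the conditional: when the injector finds nothing, the `<customer_context>` section never exists, and the prompt shrinks accordingly.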

Layer 4: History Compression

When conversations exceed a token threshold, older messages get summarized while recent ones stay verbatim. Anthropic calls this compaction: "summarizing conversation nearing context window limits, reinitializing new windows with summaries." It reduces token usage by 70-90%.

A 50-turn conversation can easily hit 20,000 tokens. Without compression, you're burning budget on "Hi, how can I help you?" With compression, those 50 turns become a 500-token summary plus the last 5-10 turns verbatim.

TypeScript -- History Compressor:

typescript
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
 
interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}
 
interface CompressedHistory {
  summary: string | null;
  recentMessages: Message[];
  originalCount: number;
  compressedTokens: number;
}
 
class HistoryCompressor {
  private maxTokens: number;
  private recentTurnCount: number;
  // gpt-4o-mini for summaries: 16x cheaper than gpt-4o, good enough for compression
  private summaryModel = openai("gpt-4o-mini");
 
  constructor(maxTokens = 8000, recentTurnCount = 10) {
    this.maxTokens = maxTokens;
    this.recentTurnCount = recentTurnCount;
  }
 
  async compress(messages: Message[]): Promise<CompressedHistory> {
    const estimatedTokens = this.estimateTokens(messages);
 
    // No compression needed
    if (estimatedTokens <= this.maxTokens) {
      return {
        summary: null,
        recentMessages: messages,
        originalCount: messages.length,
        compressedTokens: estimatedTokens,
      };
    }
 
    // Split point: recent turns stay verbatim (customer expects you to remember what was just said)
    const recentMessages = messages.slice(-this.recentTurnCount * 2); // *2 for user+assistant pairs
    const olderMessages = messages.slice(0, -this.recentTurnCount * 2);
 
    if (olderMessages.length === 0) {
      return {
        summary: null,
        recentMessages: messages,
        originalCount: messages.length,
        compressedTokens: estimatedTokens,
      };
    }
 
    // Summarize older messages
    const conversationText = olderMessages
      .map((m) => `${m.role}: ${m.content}`)
      .join("\n");
 
    const { text: summary } = await generateText({
      model: this.summaryModel,
      system: `You are a conversation summarizer. Extract key facts, decisions,
and unresolved issues from the conversation. Be concise but preserve all
actionable information. Format as bullet points.`,
      prompt: `Summarize this conversation:\n\n${conversationText}`,
    });
 
    return {
      summary,
      recentMessages,
      originalCount: messages.length,
      compressedTokens: this.estimateTokens(recentMessages) + Math.ceil(summary.length / 4),
    };
  }
 
  private estimateTokens(messages: Message[]): number {
    // Rough estimate: 1 token ~= 4 chars for English text.
    // For production, use tiktoken-node or the AI SDK's token counting.
    // This heuristic overestimates code/JSON by ~20% and underestimates CJK by ~40%.
    return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
  }
}

Python -- History Compressor:

python
from openai import AsyncOpenAI
from dataclasses import dataclass
 
@dataclass
class CompressedHistory:
    summary: str | None
    recent_messages: list[dict]
    original_count: int
    compressed_tokens: int
 
class HistoryCompressor:
    def __init__(
        self,
        max_tokens: int = 8000,
        recent_turn_count: int = 10,
    ):
        self.max_tokens = max_tokens
        self.recent_turn_count = recent_turn_count
        self.client = AsyncOpenAI()
 
    async def compress(self, messages: list[dict]) -> CompressedHistory:
        estimated = self._estimate_tokens(messages)
 
        if estimated <= self.max_tokens:
            return CompressedHistory(
                summary=None,
                recent_messages=messages,
                original_count=len(messages),
                compressed_tokens=estimated,
            )
 
        # Split into older (summarize) and recent (keep verbatim)
        split_idx = -self.recent_turn_count * 2
        recent = messages[split_idx:]
        older = messages[:split_idx]
 
        if not older:
            return CompressedHistory(
                summary=None,
                recent_messages=messages,
                original_count=len(messages),
                compressed_tokens=estimated,
            )
 
        conversation_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in older
        )
 
        response = await self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a conversation summarizer. Extract key facts, "
                        "decisions, and unresolved issues. Be concise but preserve "
                        "all actionable information. Format as bullet points."
                    ),
                },
                {"role": "user", "content": f"Summarize:\n\n{conversation_text}"},
            ],
        )
 
        summary = response.choices[0].message.content
 
        return CompressedHistory(
            summary=summary,
            recent_messages=recent,
            original_count=len(messages),
            compressed_tokens=(
                self._estimate_tokens(recent) + len(summary or "") // 4
            ),
        )
 
    def _estimate_tokens(self, messages: list[dict]) -> int:
        # Fast approximation. For production accuracy, use:
        # import tiktoken; enc = tiktoken.encoding_for_model("gpt-4o")
        # return sum(len(enc.encode(m.get("content", ""))) for m in messages)
        return sum(len(m.get("content", "")) // 4 for m in messages)

Two production details worth calling out. First, we use gpt-4o-mini for summarization -- 16x cheaper than gpt-4o. At 10,000 conversations per day, compression costs roughly $20/day versus $320/day for naively sending full history to a frontier model. The compression pays for itself by reducing input tokens on every subsequent turn. Second, we keep recent turns verbatim. Summarizing the last message ("Customer asked about billing") when the customer is actively waiting for a billing answer would be jarring.
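The back-of-envelope math behind those daily figures, with illustrative per-token prices (assumptions for the sketch -- verify against current pricing):

```python
# Assumed list prices per 1M input tokens -- illustrative, not authoritative
PRICE_PER_M = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}

def daily_cost(conversations: int, tokens_each: int, model: str) -> float:
    """Daily input-token spend for summarizing (or re-sending) older history."""
    return conversations * tokens_each * PRICE_PER_M[model] / 1_000_000

# 10K conversations/day, ~12.8K tokens of older history each
full_history = daily_cost(10_000, 12_800, "gpt-4o")
summarized = daily_cost(10_000, 12_800, "gpt-4o-mini")
print(f"full history to frontier model: ${full_history:.0f}/day")
print(f"same tokens to cheap summarizer: ${summarized:.0f}/day")
```

At these assumed prices the ratio is ~16x, and the summary additionally shrinks input tokens on every subsequent turn, which is where the real savings compound.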

Layer 5: Tool Resolution

Tool definitions should be resolved per-request, not hardcoded. An agent with 30 tools wastes thousands of tokens on irrelevant schemas every call. This is where MCP (Model Context Protocol) transforms context engineering: instead of static tool lists, MCP servers advertise capabilities at runtime, and the context engine includes only what's relevant.

Anthropic's team warns: "If engineers cannot definitively choose which tool applies, agents cannot either."

TypeScript -- Dynamic Tool Resolution:

typescript
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>;
  tokenCost: number; // Estimated tokens for this tool's schema
}
 
interface ToolSet {
  id: string;
  name: string;
  tools: ToolDefinition[];
}
 
class ToolResolver {
  private tokenBudget: number;
 
  constructor(tokenBudget = 4000) {
    this.tokenBudget = tokenBudget;
  }
 
  async resolveTools(
    agentConfig: {
      toolsetIds: string[];
      toolIds: string[];
    },
    toolsetStore: { get(id: string): Promise<ToolSet> },
    conversationContext?: string // Current conversation topic for relevance filtering
  ): Promise<ToolDefinition[]> {
    const allTools: ToolDefinition[] = [];
 
    // Toolsets group related tools (e.g., "Billing", "CRM") -- load only what's relevant
    for (const tsId of agentConfig.toolsetIds) {
      const toolset = await toolsetStore.get(tsId);
      allTools.push(...toolset.tools);
    }
 
    // Fallback: if no toolsets configured, resolve individual tool IDs
    if (agentConfig.toolsetIds.length === 0 && agentConfig.toolIds.length > 0) {
      // Agent-scoped resolution -- implementation depends on your registry
    }
 
    // Hard budget enforcement: surplus tools get dropped, not squeezed in
    let tokenCount = 0;
    const selected: ToolDefinition[] = [];
 
    for (const tool of allTools) {
      if (tokenCount + tool.tokenCost > this.tokenBudget) {
        console.warn(
          `Tool budget exceeded. Included ${selected.length}/${allTools.length} tools.`
        );
        break;
      }
      selected.push(tool);
      tokenCount += tool.tokenCost;
    }
 
    return selected;
  }
}

Python -- Dynamic Tool Resolution:

python
from dataclasses import dataclass
 
@dataclass
class ToolDefinition:
    name: str
    description: str
    input_schema: dict
    token_cost: int  # Estimated tokens for schema
 
@dataclass
class ToolSet:
    id: str
    name: str
    tools: list[ToolDefinition]
 
class ToolResolver:
    def __init__(self, token_budget: int = 4000):
        self.token_budget = token_budget
 
    async def resolve_tools(
        self,
        toolset_ids: list[str],
        tool_ids: list[str],
        toolset_store,  # has async get(id) -> ToolSet
    ) -> list[ToolDefinition]:
        all_tools: list[ToolDefinition] = []
 
        # Resolve from toolsets (MCP-style grouped tools)
        for ts_id in toolset_ids:
            toolset = await toolset_store.get(ts_id)
            all_tools.extend(toolset.tools)
 
        # Agent-scoped fallback
        if not toolset_ids and tool_ids:
            pass  # Resolve individual tools from registry
 
        # Budget enforcement
        token_count = 0
        selected: list[ToolDefinition] = []
 
        for tool in all_tools:
            if token_count + tool.token_cost > self.token_budget:
                print(
                    f"Tool budget exceeded. "
                    f"Included {len(selected)}/{len(all_tools)} tools."
                )
                break
            selected.append(tool)
            token_count += tool.token_cost
 
        return selected

Rather than one giant bag of tools, agents organize tools into logical groups (a "Customer Management" toolset, a "Billing" toolset). The context engine creates separate MCP connections per toolset and merges the results. This is why the agent in our opening scenario "ignored half its tools" -- they were all dumped in with no relevance filtering, and the model's attention was spread too thin to use them effectively.

Assembling the Engine

Now we wire all five layers into one engine. This is the core of the system -- the thing that runs on every single inference call.

Diagram: sequence of a single inference call through the context engine. The engine estimates a token budget per layer, then runs knowledge search, memory search, and tool resolution in parallel; compresses history if needed; builds the system prompt with injections; assembles the final context; calls the LLM; executes any tool calls and appends their results to history; and returns the final response.

TypeScript -- Full Context Engine:

typescript
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
 
interface ContextEngineConfig {
  model: Parameters<typeof openai>[0];
  maxContextTokens: number;
  budget: TokenBudget;
}
 
class ContextEngine {
  private systemPromptBuilder: typeof buildSystemPrompt;
  private memoryInjector: MemoryInjector;
  private historyCompressor: HistoryCompressor;
  private toolResolver: ToolResolver;
  private config: ContextEngineConfig;
 
  constructor(config: ContextEngineConfig) {
    this.config = config;
    this.memoryInjector = new MemoryInjector();
    this.historyCompressor = new HistoryCompressor(config.budget.history);
    this.toolResolver = new ToolResolver(config.budget.tools);
    this.systemPromptBuilder = buildSystemPrompt;
  }
 
  async processMessage(
    message: string,
    context: {
      customerId: string;
      agentConfig: SystemPromptConfig;
      toolsetIds: string[];
      conversationHistory: Message[];
      // Injected dependencies
      memoryStore: any;
      knowledgeRetriever: any;
      toolsetStore: any;
    }
  ) {
    // 1. Parallel retrieval saves 200-300ms vs. sequential -- critical for voice agents
    const [memories, knowledge, tools] = await Promise.all([
      this.memoryInjector.injectMemories(
        context.customerId,
        message,
        context.memoryStore
      ),
      retrieveWithBudget(
        message,
        this.config.budget.knowledge,
        context.knowledgeRetriever
      ),
      this.toolResolver.resolveTools(
        { toolsetIds: context.toolsetIds, toolIds: [] },
        context.toolsetStore
      ),
    ]);
 
    // 2. Compress history if needed
    const history = await this.historyCompressor.compress(
      context.conversationHistory
    );
 
    // 3. Build system prompt with injected context
    const systemPrompt = this.systemPromptBuilder(context.agentConfig, {
      memories,
      knowledge,
    });
 
    // 4. Assemble messages array
    const messages: Message[] = [];
 
    // Add compressed history summary if it exists
    if (history.summary) {
      messages.push({
        role: "system",
        content: `Previous conversation summary:\n${history.summary}`,
      });
    }
 
    // Add recent messages
    messages.push(...history.recentMessages);
 
    // Add current message
    messages.push({ role: "user", content: message });
 
    // 5. maxSteps: 10 allows multi-turn tool use (call tool, read result, call another)
    const result = await generateText({
      model: openai(this.config.model),
      system: systemPrompt,
      messages,
      tools: this.convertTools(tools),
      maxSteps: 10,
    });
 
    return {
      response: result.text,
      toolCalls: result.steps.flatMap((s) => s.toolCalls),
      tokenUsage: result.usage,
    };
  }
 
  private convertTools(tools: ToolDefinition[]) {
    // Convert to AI SDK tool format
    const converted: Record<string, any> = {};
    for (const tool of tools) {
      converted[tool.name] = {
        description: tool.description,
        parameters: tool.inputSchema,
      };
    }
    return converted;
  }
}
 
// Usage
const engine = new ContextEngine({
  model: "gpt-4o",
  maxContextTokens: 128000,
  budget: DEFAULT_BUDGET,
});

Python -- Full Context Engine:

python
import asyncio
from openai import AsyncOpenAI
from dataclasses import dataclass
 
@dataclass
class ContextEngineConfig:
    model: str = "gpt-4o"
    max_context_tokens: int = 128000
    budget: dict | None = None
 
    def __post_init__(self):
        if self.budget is None:
            self.budget = DEFAULT_BUDGET
 
class ContextEngine:
    def __init__(self, config: ContextEngineConfig):
        self.config = config
        self.client = AsyncOpenAI()
        self.memory_injector = MemoryInjector()
        self.history_compressor = HistoryCompressor(
            max_tokens=config.budget["history"]
        )
        self.tool_resolver = ToolResolver(
            token_budget=config.budget["tools"]
        )
 
    async def process_message(
        self,
        message: str,
        customer_id: str,
        agent_config: SystemPromptConfig,
        toolset_ids: list[str],
        conversation_history: list[dict],
        memory_store,
        knowledge_retriever,
        toolset_store,
    ) -> dict:
        # 1. Parallel retrieval saves 200-300ms vs. sequential -- critical for voice agents
        memories, knowledge, tools = await asyncio.gather(
            self.memory_injector.inject_memories(
                customer_id, message, memory_store
            ),
            retrieve_with_budget(
                message, self.config.budget["knowledge"], knowledge_retriever
            ),
            self.tool_resolver.resolve_tools(
                toolset_ids, [], toolset_store
            ),
        )
 
        # 2. Compress history
        history = await self.history_compressor.compress(
            conversation_history
        )
 
        # 3. Build system prompt
        system_prompt = build_system_prompt(
            agent_config, memories=memories, knowledge=knowledge
        )
 
        # 4. Assemble messages
        messages = []
 
        if history.summary:
            messages.append({
                "role": "system",
                "content": f"Previous conversation summary:\n{history.summary}",
            })
 
        messages.extend(history.recent_messages)
        messages.append({"role": "user", "content": message})
 
        # 5. Call LLM
        tool_defs = [
            {
                "type": "function",
                "function": {
                    "name": t.name,
                    "description": t.description,
                    "parameters": t.input_schema,
                },
            }
            for t in tools
        ]
 
        response = await self.client.chat.completions.create(
            model=self.config.model,
            messages=[
                {"role": "system", "content": system_prompt},
                *messages,
            ],
            tools=tool_defs if tool_defs else None,
        )
 
        return {
            "response": response.choices[0].message.content,
            "usage": {
                "prompt_tokens": response.usage.prompt_tokens,
                "completion_tokens": response.usage.completion_tokens,
            },
        }

The Promise.all / asyncio.gather pattern matters more than most teams expect. Memory search, knowledge retrieval, and tool resolution each take 50-150ms. Sequentially: 150-450ms before the LLM even starts. In parallel: 80-150ms total. On a voice agent, that 200-300ms savings is the difference between a natural response and an awkward pause.
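The latency math is easy to demonstrate with a toy benchmark. This is an illustrative sketch only: the `asyncio.sleep` calls stand in for real memory, knowledge, and tool lookups, and `lookup` is a hypothetical helper.

```python
import asyncio
import time

async def lookup(name: str, latency: float = 0.1) -> str:
    # Stand-in for a memory search, RAG query, or tool resolution call
    await asyncio.sleep(latency)
    return f"{name}-result"

async def sequential() -> float:
    start = time.perf_counter()
    await lookup("memories")
    await lookup("knowledge")
    await lookup("tools")
    return time.perf_counter() - start

async def parallel() -> float:
    start = time.perf_counter()
    await asyncio.gather(
        lookup("memories"), lookup("knowledge"), lookup("tools")
    )
    return time.perf_counter() - start

seq = asyncio.run(sequential())  # latencies add up: roughly 0.3s
par = asyncio.run(parallel())    # latencies overlap: roughly 0.1s
```

The same shape applies whether the lookups are 100ms sleeps or real network calls: sequential awaits sum the latencies, `gather` takes only the slowest one.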

Just-In-Time vs. Upfront Retrieval

The most important architecture decision in context engineering is when to load information: upfront (before the conversation starts) or just-in-time (when the agent needs it).

Anthropic describes the just-in-time pattern as one that "mirrors human cognition by using external organization systems." Instead of memorizing an entire codebase, you remember where to look. The agent maintains lightweight identifiers -- file paths, search queries, tool names -- and dynamically loads full content at runtime. In practice, production systems use a hybrid:

| Strategy | Load When | Tokens Used | Best For |
| --- | --- | --- | --- |
| Upfront | Before first message | Fixed per session | Agent config, system prompt, customer profile |
| Just-in-time | When needed during conversation | Variable per turn | Knowledge base, tool results, detailed records |
| Cached | First access, then reuse | Amortized | Frequently accessed documents, pricing tables |
| Compressed | After threshold exceeded | Decreases over time | Conversation history, earlier tool results |

Claude Code is a good example of this hybrid pattern. It loads CLAUDE.md files upfront (your project context), but uses glob and grep tools for just-in-time retrieval of specific code files. The model decides what to look up based on what the conversation needs.

For customer-facing agents, we typically load upfront: the agent configuration, system prompt template, and customer profile. Everything else -- knowledge base results, tool outputs, detailed account records -- is loaded just-in-time as the conversation evolves.
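The "cached" row of the table is the easiest strategy to get wrong, so here is a minimal sketch of it: fetch on first access, serve from a session cache afterward. `CachedRetriever` and `fake_loader` are hypothetical names, not part of any library.

```python
import asyncio

class CachedRetriever:
    """Just-in-time retrieval with first-access caching.

    Upfront data (agent config, customer profile) is loaded once at
    session start; everything else is fetched lazily and cached.
    """

    def __init__(self, loader):
        self.loader = loader  # async fn: doc_id -> content
        self.cache: dict[str, str] = {}
        self.load_count = 0

    async def get(self, doc_id: str) -> str:
        if doc_id not in self.cache:  # first access: fetch and store
            self.load_count += 1
            self.cache[doc_id] = await self.loader(doc_id)
        return self.cache[doc_id]     # later accesses: amortized

async def fake_loader(doc_id: str) -> str:
    return f"content of {doc_id}"

async def demo() -> int:
    r = CachedRetriever(fake_loader)
    await r.get("pricing-table")  # loaded from the store
    await r.get("pricing-table")  # served from cache
    return r.load_count

loads = asyncio.run(demo())  # 1: the second access hit the cache
```

The token cost of a frequently accessed document is paid once per session instead of once per turn, which is what "amortized" means in the table above.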

Sub-Agents for Complex Tasks

When a single context window isn't enough, sub-agents handle focused sub-tasks and return condensed summaries -- typically "1,000-2,000 tokens" each, per Anthropic. A research task might need 50 documents. A debugging task might need dozens of files. No single context window can hold all of that.

The pattern: an orchestrator dispatches sub-tasks. Each sub-agent gets a clean, focused context window. When finished, it returns a condensed summary. The orchestrator synthesizes the final result.

Diagram: sub-agent architecture. An orchestrator agent dispatches focused tasks to sub-agents (billing research, account history, product lookup); each returns a condensed 1-2K token summary, and the orchestrator synthesizes the final response.

The key insight: each sub-agent has an optimized context window. The billing research agent doesn't need product documentation. The product lookup agent doesn't need account history. By splitting context, each agent performs better than a single agent would with everything crammed in.

The economics are counterintuitive. Three sub-agent calls at 10K tokens each cost the same as one call at 30K tokens -- but the sub-agents produce better results because each operates in the high-accuracy zone of context utilization. You pay the same and get better output.
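The dispatch-and-synthesize loop can be sketched in a few lines. This is a skeleton under stated assumptions: `sub_agent` stands in for a separate LLM call with its own focused context window, and the synthesis step is reduced to string joining.

```python
import asyncio

async def sub_agent(task: str, context: str) -> str:
    # In production this is an LLM call with a clean, task-specific
    # context window; here it just returns a condensed summary.
    return f"[summary of {task}]"

async def orchestrate(tasks: dict[str, str]) -> str:
    # Each sub-agent receives only the context relevant to its task
    summaries = await asyncio.gather(
        *(sub_agent(name, ctx) for name, ctx in tasks.items())
    )
    # The orchestrator synthesizes from 1-2K token summaries,
    # never from the raw documents the sub-agents read
    return "\n".join(summaries)

result = asyncio.run(orchestrate({
    "billing research": "billing docs...",
    "account history": "account records...",
    "product lookup": "product catalog...",
}))
```

The design choice worth noting: the orchestrator's context only ever holds summaries, so its window stays in the high-accuracy zone no matter how much raw material the sub-agents consumed.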

Token Budgeting

Without explicit budgets, a long conversation history silently consumes your entire context window, leaving no room for the knowledge the agent actually needs. Here's a practical framework for a 128K token window:

| Layer | Budget | Tokens | Notes |
| --- | --- | --- | --- |
| System prompt | 2% | 2,560 | Role, constraints, escalation rules |
| Injected memories | 1% | 1,280 | 10-15 memories max |
| Retrieved knowledge | 3% | 3,840 | 3-5 RAG chunks |
| Tool definitions | 4% | 5,120 | 15-20 tools with schemas |
| Conversation history | 15% | 19,200 | Compressed if needed |
| Current turn | 5% | 6,400 | User message + immediate context |
| Reserved for output | 10% | 12,800 | Response generation |
| Safety margin | 60% | 76,800 | Breathing room for attention quality |

That 60% safety margin looks wasteful. It's not. Anthropic notes that LLMs have "an attention budget that they draw on when parsing large volumes of context." Using 40% of a 128K window gives better results than using 90%. And the cost angle seals it: at GPT-4o's pricing ($2.50 per million input tokens), sending 115K tokens per call instead of 50K adds about $0.16 per call -- over $16,000 per day at 100K daily requests -- for worse quality. You're paying more for degraded performance.

The goal, as Anthropic puts it, is to "find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome."
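The percentage framework maps directly to the budget dictionary the engine snippets consume. A minimal sketch, assuming the layer names match the `DEFAULT_BUDGET` keys used earlier (`history`, `knowledge`, `tools`); treat the exact split as a starting point, not a rule.

```python
def make_budget(window: int = 128_000) -> dict[str, int]:
    # Percentages from the budgeting table; the remaining ~60%
    # is deliberately left unallocated as an attention safety margin.
    shares = {
        "system_prompt": 0.02,
        "memories": 0.01,
        "knowledge": 0.03,
        "tools": 0.04,
        "history": 0.15,
        "current_turn": 0.05,
        "output": 0.10,
    }
    return {layer: int(window * pct) for layer, pct in shares.items()}

DEFAULT_BUDGET = make_budget()
# All allocated layers together use 40% of the window; the rest is margin
```

Deriving budgets from percentages rather than hardcoding token counts means the same config scales cleanly if you move to a model with a different window size.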

Monitoring

You can't improve what you can't measure. LangChain's 2026 survey found that 89% of organizations with production agents have implemented some form of observability, and 62% have detailed tracing down to individual tool calls. The teams that skip this step are the ones stuck debugging context pipelines blind.

Track token usage per layer, retrieval relevance scores, compression ratios, and which tools actually get called. You need observability into what's actually happening inside that context window:

What to track:

| Metric | Why It Matters | Target |
| --- | --- | --- |
| Tokens per layer | Identify budget violations | Within 10% of budget |
| RAG relevance scores | Are you retrieving useful knowledge? | Mean score > 0.75 |
| Memory hit rate | Are injected memories being used? | > 60% utilization |
| Compression ratio | How much history are you losing? | 5:1 to 10:1 |
| Tool usage rate | Are defined tools actually called? | > 30% per session |
| Context window utilization | How full is the window? | 20-50% for best quality |

If your tool usage rate is below 10%, you're wasting tokens on tool definitions the model never uses. If your RAG relevance scores are below 0.5, your retrieval is injecting noise. If your context window utilization is consistently above 70%, you're likely seeing context rot -- the model starts forgetting earlier parts of the conversation.
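Per-layer token accounting is the simplest of these metrics to wire up. A sketch, with hypothetical names (`ContextMetrics` is not a library class), implementing the "within 10% of budget" check from the table:

```python
class ContextMetrics:
    """Per-request token accounting against the layer budgets."""

    def __init__(self, budget: dict[str, int], tolerance: float = 0.10):
        self.budget = budget
        self.tolerance = tolerance
        self.usage: dict[str, int] = {}

    def record(self, layer: str, tokens: int) -> None:
        self.usage[layer] = self.usage.get(layer, 0) + tokens

    def violations(self) -> list[str]:
        # Flag layers that exceed their budget by more than the tolerance
        return [
            layer
            for layer, used in self.usage.items()
            if used > self.budget.get(layer, 0) * (1 + self.tolerance)
        ]

metrics = ContextMetrics({"history": 19_200, "tools": 5_120})
metrics.record("history", 18_000)  # within budget
metrics.record("tools", 6_000)     # ~17% over: flagged
over = metrics.violations()        # ["tools"]
```

In production you would emit these counts to your tracing backend per request; the point is that every layer's usage is measured against an explicit budget, not eyeballed.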

Common Pitfalls

The most common mistake is treating the context window like a database -- stuffing everything in and hoping the model finds what it needs.

Kitchen sink system prompt. You write a 6,000-token system prompt covering every edge case. The model ignores half of it. Fix: start under 1,000 tokens. Add rules only when you see the model fail without them.

Retrieving everything, filtering nothing. Your RAG pipeline returns 10 chunks per query, half marginally relevant. Fix: set a minimum relevance threshold (0.7 for knowledge, 0.3 for memories). Three highly relevant chunks beat ten mediocre ones.

Ignoring history growth. The conversation is 100 turns deep, 40,000 tokens of "Let me check on that for you." Fix: compress when history exceeds 8,000 tokens. Keep the last 10 turns verbatim, summarize the rest.
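The keep-last-N-verbatim fix is a few lines of code. A minimal sketch, assuming a stub summarizer (in production that lambda would be an LLM call):

```python
def compress_history(
    turns: list[str],
    keep_verbatim: int = 10,
    summarize=lambda old: f"[summary of {len(old)} earlier turns]",
) -> list[str]:
    # Keep the most recent turns word-for-word; collapse everything
    # older into a single summary entry.
    if len(turns) <= keep_verbatim:
        return turns
    old, recent = turns[:-keep_verbatim], turns[-keep_verbatim:]
    return [summarize(old)] + recent

history = [f"turn {i}" for i in range(100)]
compressed = compress_history(history)
# 11 entries: one summary plus the last 10 turns verbatim
```

Triggering this only when the history exceeds a token threshold (rather than a turn count) keeps short, dense conversations intact while still bounding long ones.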

Static tool lists. All 30 tool schemas included on every request -- 4,000+ tokens -- even for a simple FAQ. Fix: resolve tools per-request. Use toolsets to load only relevant groups.

No budget enforcement. Each layer decides independently how much to include. On a bad day, the window overflows. Fix: hard token limits per layer. The context engine is the budget authority.

What's Next

Remember the agent we opened with -- the one that hallucinates, forgets the customer, and ignores its tools? Every one of those failures traces back to a context problem. The hallucination happens because no knowledge was retrieved. The forgetting happens because history wasn't compressed and memories weren't injected. The ignored tools happen because 30 schemas were dumped in with no budget enforcement.

The fix was never a better prompt. It was a better pipeline.

Start simple: a structured system prompt, a single RAG source, basic history management. Add memory injection when you see the agent forgetting customer context. Add compression when conversations get long enough to hurt quality. Add sub-agents when tasks outgrow a single context window.

As models get larger context windows -- Claude Opus 4.6 now offers 1 million tokens -- the temptation will be to skip all of this. Don't. A well-engineered 50K-token context will outperform a carelessly assembled 500K-token context every time. The discipline of curating what goes in, and what stays out, is what separates agents that work in demos from agents that work in production.

Build agents with context engineering built in

Chanl handles memory injection, tool resolution, and history compression at runtime so you can focus on what your agent does, not how it remembers.

Start building
Dean Grover, Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.

Learn Agentic AI

One lesson a week — practical techniques for building, testing, and shipping AI agents. From prompt engineering to production monitoring. Learn by doing.

500+ engineers subscribed

Frequently Asked Questions