The Five-Layer AI Stack

The Vocabulary Problem Nobody Wants to Admit

Every enterprise AI conversation eventually bottoms out at the same word: AI.

"We need AI for this." "We're building an AI solution." "Our competitors are already doing AI." The word is load-bearing and hollow at the same time. It carries the weight of executive expectation while describing nothing actionable about architecture, infrastructure, or capability.

Here's what's actually underneath it: almost every real-world enterprise "AI" system is a composition of five distinct building blocks, each solving a different problem, each failing in different ways, and each requiring different tradeoffs. Call them the five layers of the AI capability stack.

  • RAG — what the model knows at runtime
  • MCP — how the model interacts with the outside world
  • Tool Calls — lightweight in-process actions
  • Skills — procedural expertise the model can execute
  • Agents — orchestration under uncertainty

The LLM itself sits beneath all of them as the reasoning substrate. It doesn't belong to any one layer — it powers all of them. And its ceiling is a hard constraint that every layer above inherits.

Most enterprise teams are building with three of these layers while calling it all "AI." The mismatch between what they picked and what the problem actually needed is usually where the project quietly falls apart.

This article draws the map.


The Stack Mental Model

Before going layer by layer, establish the frame. These five layers are not interchangeable. They are not a menu where you pick one. They are composable building blocks that solve different capability gaps:

Capability layers, top to bottom:

  • Agent (Orchestration Layer): decides what, when, in what order
  • Skills (Procedural Layer): how to execute known workflows
  • MCP + Tool Calls (Action Layer): how the model interacts with the world
  • RAG (Knowledge Layer): what the model knows at runtime
  • LLM (Reasoning Substrate, foundation): the ceiling every layer above inherits

The agent orchestrates and composes the layers beneath it. All layers sit on the model; its ceiling is their ceiling.

Layer | What it solves | What it does NOT do
RAG | Knowledge currency and grounding | Take actions or execute workflows
MCP / Tool Calls | External system connectivity | Decide what to connect to or why
Skills | Repeatable procedural expertise | Replace reasoning or judgment
Agents | Orchestration across all other layers | Replace any individual layer
LLM (base) | Reasoning, language, inference | Real-time data access or action

A system that only has RAG can answer questions grounded in documents. It cannot update a CRM, trigger a workflow, or remember how to do a complex multi-step task. A system that only has MCP can reach external systems — but it has no knowledge layer, no procedural memory, no orchestrator deciding when to reach what.

The confusion in most enterprise AI projects is not technical. It is vocabulary: teams conflate these layers and end up architecting a retrieval system for a problem that needs an action layer, or building an agent when a single well-structured RAG query would have been enough.


Layer 1 — RAG: The Knowledge Layer

The LLM you deploy has a training cutoff. It knows what it was trained on and nothing more. For most enterprise use cases — internal documentation, compliance policy, product specs, customer contracts — that means the model is reasoning about a world it has never seen.

RAG (Retrieval-Augmented Generation) solves this without retraining. The pattern is architecturally simple: when a query arrives, a retriever searches a knowledge base (usually a vector store) and fetches the most relevant chunks, those chunks are injected into the model's context, and the model generates a grounded answer. The knowledge base gets updated without touching the model.
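The retrieve-then-inject pattern can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the knowledge base is a hypothetical in-memory list, and `embed` is a bag-of-words stand-in for a real embedding model and vector store.

```python
import math
import re
from collections import Counter

# Hypothetical toy knowledge base; a real system would keep embedded chunks in a vector store.
KNOWLEDGE_BASE = [
    "Plan A has a deductible of 500 dollars per year.",
    "Plan B has a deductible of 1000 dollars per year.",
    "Claims must be filed within 30 days of the incident.",
]

def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Inject the retrieved chunks into the prompt the model will see."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Updating `KNOWLEDGE_BASE` changes what the model can answer without touching the model itself, which is the point of the pattern.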

This is the right layer for: answering questions over proprietary documents, grounding model output in current data, reducing hallucination on domain-specific facts.

This is the wrong layer for: executing workflows, multi-hop reasoning across thousands of interconnected facts, or corpus sizes that dwarf what a context window can hold.

The failure mode of standard RAG in enterprise settings is predictable. You chunk a 10,000-page policy corpus into paragraphs, embed them, and your retriever fetches the top-5 passages most similar to the query. For a question like "what's the deductible on Plan B?" that works. For a question like "how did the 2023 regulatory change in the EU affect our reinsurance exposure across all products?", the top-5 chunks will almost always fall short. The answer isn't in any single chunk; it's in the relationship between dozens of them.

Standard RAG solves factual lookup. It doesn't solve reasoning across interconnected facts.

📐 RAG Variant: GraphRAG
Microsoft's GraphRAG (open-sourced 2024) takes a different approach: instead of relying only on embedded chunks, it uses an LLM to build an entity-relationship graph over your entire corpus and generates summaries of the communities in that graph. When a query arrives, it draws on the graph and those summaries rather than only fetching similar vectors. The result is the ability to answer "global" questions: theme synthesis, cross-document reasoning, multi-hop inference. The cost is real: graph indexing can run 100–1000× more expensive than vector RAG, and it requires domain-specific tuning to extract entities worth keeping. LazyGraphRAG (Microsoft Research, announced in late 2024) reduces indexing cost to ~0.1% of full GraphRAG by deferring LLM use until query time. Use GraphRAG when your queries require relationship-aware reasoning across a large, interconnected corpus. Skip it when you need fast factual lookup.
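To make the difference from vector lookup concrete, here is a toy sketch of relationship-aware retrieval. It is not Microsoft's GraphRAG API: the edges are hypothetical hand-written triples (a real system would extract them with an LLM), and the traversal is a plain breadth-first search.

```python
from collections import deque

# Hypothetical (subject, relation, object) triples extracted from a corpus.
EDGES = [
    ("EU Regulation 2023", "applies_to", "Reinsurance Treaty X"),
    ("Reinsurance Treaty X", "covers", "Product Alpha"),
    ("Reinsurance Treaty X", "covers", "Product Beta"),
    ("Product Gamma", "covered_by", "Local Policy Y"),
]

def build_graph(edges):
    """Adjacency list keyed by subject entity."""
    graph = {}
    for subj, rel, obj in edges:
        graph.setdefault(subj, []).append((rel, obj))
    return graph

def reachable(graph, start, max_hops=3):
    """Breadth-first traversal: entities connected to `start` within max_hops."""
    seen, queue, found = {start}, deque([(start, 0)]), []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for _rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, depth + 1))
    return found
```

Starting from the regulation, the traversal surfaces both affected products even though no single chunk mentions the regulation and the products together.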
📐 RAG Variant: CAG (Cache-Augmented Generation)
If your corpus is small enough to fit in the model's context window, you may not need retrieval at all. Cache-Augmented Generation preloads your entire knowledge base into context (or into a pre-computed KV cache) and lets the model attend globally: no chunking, no embedding, no retrieval errors. The CAG paper (Chan et al., 2025) reports accuracy gains of 15–20% and latency improvements over standard RAG for sub-1M token corpora. The constraint is computational: self-attention cost grows as O(n²) in sequence length, and once the corpus outgrows the context window, CAG stops being an option. Precomputing the KV cache yourself also requires direct access to model memory, which most hosted API deployments do not expose today. The decision surface: if your corpus fits and rarely changes, CAG wins. If it's large, live, or frequently updated, RAG is still the right tool.
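The decision surface above can be written down as a small check. This is a sketch under stated assumptions: the 4-characters-per-token estimate is a rough heuristic, the function and class names are hypothetical, and `PreloadedContext` only reuses a text prefix, whereas a real CAG setup would cache the model's KV state.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token for English text.
    return max(1, len(text) // 4)

def choose_knowledge_strategy(docs, context_window_tokens, reserve_tokens=2000, changes_often=False):
    """Pick CAG only when the whole corpus fits with room left for the question and answer."""
    corpus_tokens = sum(estimate_tokens(d) for d in docs)
    fits = corpus_tokens + reserve_tokens <= context_window_tokens
    return "CAG" if fits and not changes_often else "RAG"

class PreloadedContext:
    """Builds the knowledge prefix once and reuses it for every question."""

    def __init__(self, docs):
        self.prefix = "\n\n".join(docs)

    def prompt_for(self, question: str) -> str:
        return f"{self.prefix}\n\nQuestion: {question}"
```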
📐 RAG Variant: Agentic RAG
Standard RAG is a fixed pipeline: one query, one retrieval pass, one generation. Agentic RAG replaces the fixed pipeline with an agent that plans multiple retrieval steps, chooses which retriever to call, reflects on intermediate answers, and adapts its strategy. Think of it as giving the retrieval layer an orchestration brain. The tradeoff is latency and complexity — you're now managing an agentic loop inside your knowledge layer. Worth it for complex, exploratory queries in regulated environments where accuracy matters more than response time.
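A minimal sketch of that loop, assuming two hypothetical keyword-indexed retrievers and a stand-in planner where a real system would call the LLM. The planner deliberately guesses the wrong retriever for one topic so the adapt step is visible.

```python
# Hypothetical retrievers: each maps a topic keyword to a passage.
RETRIEVERS = {
    "policies": {"deductible": "Plan B deductible is 1000 dollars."},
    "regulations": {"capital": "The 2023 EU rule raises capital requirements."},
}
TOPICS = ["deductible", "capital"]

def plan(question):
    """Stand-in planner: a real system would ask the LLM for these steps.
    It naively guesses the 'policies' retriever for every topic it spots."""
    return [("policies", topic) for topic in TOPICS if topic in question.lower()]

def agentic_retrieve(question, max_steps=6):
    """Plan, retrieve, reflect on misses, and retry with another retriever."""
    evidence, trace = [], []
    pending = plan(question)
    while pending and len(trace) < max_steps:
        source, topic = pending.pop(0)
        hit = RETRIEVERS[source].get(topic)
        trace.append((source, topic, hit is not None))
        if hit is not None:
            evidence.append(hit)
            continue
        # Reflect and adapt: on a miss, queue retrievers not yet tried for this topic.
        tried = {s for s, t, _ in trace if t == topic}
        retries = [(s, topic) for s in RETRIEVERS if s not in tried and (s, topic) not in pending]
        pending.extend(retries)
    return evidence, trace
```

The `max_steps` budget is the latency tradeoff in miniature: each extra pass improves coverage but adds another round trip.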

Layer 2 — MCP and Tool Calls: The Action Layer

RAG makes the model better informed. The action layer makes it capable of changing things.

Before late 2024, when an LLM needed to reach an external system (a CRM, a SQL database, a third-party API), developers hand-coded the integration. Function definitions were written in whatever schema the framework expected, passed to the model, and parsed back. This worked within a single project. It didn't compose across projects, vendors, or runtimes. Every new integration started from scratch. The LLM ecosystem looked like the server room before containers: powerful machines with proprietary cables everywhere.

MCP (Model Context Protocol) is the standardization event that changed this. Announced by Anthropic in late 2024 and rapidly adopted across the ecosystem, MCP defines an open, vendor-neutral protocol for connecting LLMs to external systems. By April 2026, the MCP registry had over 9,400 entries and every major LLM vendor shipped first-class support.

MCP Is to LLMs What Containers Were to Infrastructure

The analogy is precise, not rhetorical.

Before containers, deploying an application meant negotiating with the server: What OS? What runtime version? What port? Each deployment was bespoke. Docker abstracted all of that into a standard unit that could run anywhere. Before MCP, integrating an LLM with an external system meant negotiating with the integration: What schema? What auth model? What error format? Each integration was bespoke.

MCP abstracts all of that into a standard protocol. An MCP server exposes resources, tools, and prompts through a defined interface. An MCP client (your LLM framework) speaks the protocol without caring about what's on the other end. Write the MCP server once — your model, your agent framework, your IDE, your orchestration layer can all use it without rewiring.
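Under the hood, MCP messages are JSON-RPC 2.0, and tools are discovered and invoked through the `tools/list` and `tools/call` methods. The sketch below shows only that message shape with a toy in-process handler. It is not the official SDK, it omits the initialization handshake and transport, and the `get_invoice_total` tool is hypothetical.

```python
import json

# Hypothetical tool exposed by a toy server; name, schema, and data are illustrative.
TOOLS = {
    "get_invoice_total": {
        "description": "Return the total for an invoice id.",
        "inputSchema": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
        "handler": lambda args: {"INV-1": "120.50"}.get(args["invoice_id"], "unknown"),
    }
}

def handle(request_json: str) -> str:
    """Answer a single JSON-RPC request for tools/list or tools/call."""
    req = json.loads(request_json)
    if req["method"] == "tools/list":
        result = {"tools": [
            {"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
            for name, t in TOOLS.items()
        ]}
    elif req["method"] == "tools/call":
        tool = TOOLS[req["params"]["name"]]
        text = tool["handler"](req["params"].get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}], "isError": False}
    else:
        # -32601 is the standard JSON-RPC "method not found" code.
        error = {"code": -32601, "message": "Method not found"}
        return json.dumps({"jsonrpc": "2.0", "id": req.get("id"), "error": error})
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})
```

Because every server answers the same two methods in the same shape, a client written once can discover and call tools it has never seen before.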

The container analogy extends further:

Containers | MCP
Package apps + dependencies into a standard unit | Package LLM tools + context into a standard protocol
Run anywhere (dev, staging, prod) | Connect any LLM to any external system
Orchestrated by Kubernetes | Orchestrated by agents and frameworks
Enabled the microservices ecosystem | Enabling the tool/integration ecosystem
Docker Hub as central registry | MCP Registry (9,400+ servers, April 2026)

Just as containers enabled an explosion of infrastructure reusability, MCP is enabling an explosion of LLM integrations that are no longer locked to a single framework or vendor.

Common Use Cases for MCP

MCP servers exist across every enterprise integration surface. The most common production patterns:

Internal data access

  • Connect to SQL databases and data warehouses — agents query live operational data without requiring a data pipeline to pre-process it
  • Salesforce, HubSpot, and CRM platforms — agents can read deal data, update records, log interactions
  • Confluence, Notion, SharePoint — agents pull from internal wikis without a separate RAG pipeline for structured, navigable content

Developer tooling

  • GitHub MCP server — agents read code, open PRs, create issues, and query commit history
  • Linear, Jira — triage tickets, assign work, update statuses
  • CI/CD systems — query build status, trigger deployments, read test results

Communication systems

  • Slack, Teams — read channel messages, post updates, surface information across conversations
  • Email (Gmail, Outlook) — draft, send, search, and organize with full inbox access

Compute and execution

  • File systems — read, write, move files without leaving the agent loop
  • Terminals and shell environments — run commands, execute scripts, read outputs
  • Browser automation — navigate web UIs, fill forms, extract structured data from pages

Enterprise systems

  • ERP and finance systems — pull invoices, trigger purchase orders, validate line items
  • HR platforms — read org structure, check leave balances, manage approvals
  • Compliance and audit systems — log decisions, query regulatory data, verify controls

How to Access and Use MCPs

There are three integration modes, each suited to different contexts:

1. Hosted MCP servers (remote). The simplest path. Vendors publish an MCP-compliant server accessible over HTTPS. Your client sends requests; the server handles auth, execution, and response formatting. Most major SaaS integrations (GitHub, Slack, Notion, Google Drive) now offer hosted MCP endpoints. Auth is typically OAuth 2.1: the spec mandates PKCE for remote servers, and 81% of remote MCP servers now implement this. Suitable for: production integrations with well-supported SaaS tools.

2. Local MCP servers (stdio transport). The server runs as a local process on the same machine as the client, communicating over stdio. This is how tools such as Claude Code, Cursor, and Cline connect to local tool integrations. Suitable for: development tooling, file system access, local shell execution, any integration that shouldn't leave the machine.

3. Self-hosted MCP servers. Your team builds and deploys an MCP server that wraps your internal systems: your proprietary APIs, internal databases, legacy systems with no public SDK. The server handles auth, business logic, and response formatting. The LLM sees a clean tool interface without knowing anything about what's behind it. Suitable for: enterprise-grade integrations with internal systems, regulated environments where data cannot touch third-party infrastructure.

Framework integration: All major agent frameworks (LangGraph, CrewAI, OpenAI Agents SDK, AutoGen) support MCP, either natively or through an official adapter. Connecting an MCP server is mostly a configuration declaration rather than custom integration code.

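# LangGraph + MCP: connecting a server is mostly configuration.
# Uses the langchain-mcp-adapters package. The URL and the auth header are
# placeholders; substitute the endpoint and auth your MCP server documents.
import asyncio
from langchain_mcp_adapters.client import MultiServerMCPClient

client = MultiServerMCPClient({
    "github": {
        "transport": "sse",
        "url": "https://mcp.github.com/sse",
        "headers": {"Authorization": "Bearer ..."},
    }
})
tools = asyncio.run(client.get_tools())  # LangChain tools to hand to a LangGraph agent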

Tool Calls: Still Relevant, Different Scope

Tool Calls predate MCP and still belong in your toolkit — just for a different use case. A Tool Call is a function defined inline within your agent or chain, invoked by the model within the same runtime. No separate server, no protocol negotiation.

Use Tool Calls when: the tool is scoped to a single project, the function is simple enough to define inline, and you don't need it to be reusable across agents or frameworks. They're excellent for project-specific logic — a custom formatter, a business rule evaluator, a local computation — that doesn't warrant the overhead of an MCP server.

The distinction is operational scope. MCP is infrastructure. Tool Calls are code.
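A minimal sketch of an inline Tool Call, assuming the model emits a JSON object with `name` and `arguments`. The `apply_discount` rule and the spec shape are illustrative; each chat API has its own exact wire format.

```python
import json

def apply_discount(amount: float, tier: str) -> float:
    """Project-specific business rule, defined inline next to the agent code."""
    rates = {"gold": 0.20, "silver": 0.10}
    return round(amount * (1 - rates.get(tier, 0.0)), 2)

# Schema the model would see; JSON Schema style, as most chat APIs expect.
TOOL_SPECS = [{
    "name": "apply_discount",
    "description": "Apply the customer tier discount to an amount.",
    "parameters": {
        "type": "object",
        "properties": {"amount": {"type": "number"}, "tier": {"type": "string"}},
        "required": ["amount", "tier"],
    },
}]

# Same-process registry: no server, no protocol negotiation.
REGISTRY = {"apply_discount": apply_discount}

def dispatch(tool_call_json: str):
    """Run a model-emitted call such as {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    return REGISTRY[call["name"]](**call["arguments"])
```

Everything lives in one file and one runtime, which is exactly why it is cheap to write and hard to reuse elsewhere.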


Layer 3 — Skills: The Procedural Layer

Here's the layer most enterprise teams haven't built yet.

RAG gives the model declarative knowledge — facts, documents, recorded information. It answers "what." But enterprise work is dominated by procedural tasks: how you process an insurance claim, how you onboard a new vendor, how you handle an escalation that crosses three systems. A model with excellent factual recall can still be useless at complex multi-step execution because it has to reason through every step from scratch, every single time.

Consider the human analogy. You wouldn't hand a new employee a policy manual and call them trained. The manual covers the what — the rules, the definitions, the documented process. But expertise is the how: the sequence of steps, the edge case handling, the judgment calls that aren't in the manual. That expertise is procedural. It lives in demonstrated practice, not documentation.

Skills are how you give an LLM agent that procedural layer.

A Skill is a reusable, callable module that encapsulates a sequence of actions: a workflow the agent can execute without reasoning through it from first principles. In the research framing, Skills are simultaneously executable, reusable, and governable, which distinguishes them from tools (atomic, single-step), plans (one-off reasoning scaffolds), and episodic memories (stored past observations).

In practice, a Skill looks like a structured instruction set — often a Markdown file with YAML frontmatter — that tells the agent precisely when to apply it, what sequence to follow, and how to handle failure conditions. The agent reads the Skill at runtime, not training time. This means Skills are updatable without retraining the model.
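As an illustration of that format, the sketch below loads a Skill at runtime and matches it to a task. The file contents and the `trigger` field are hypothetical, and the parser handles only flat `key: value` frontmatter; a real loader would use a proper YAML library.

```python
SKILL_FILE = """---
name: kyc-check
description: Run the standard KYC verification sequence
trigger: kyc
---
1. Verify the identity document.
2. Screen the customer against the sanctions list.
3. Record the outcome; escalate to a human on any mismatch.
"""

def parse_skill(text: str) -> dict:
    """Split frontmatter from body. Handles flat `key: value` lines, not full YAML."""
    _, front, body = text.split("---", 2)
    meta = {}
    for line in front.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return {"meta": meta, "steps": body.strip().splitlines()}

def select_skill(task: str, skills: list[dict]):
    """Return the first skill whose trigger keyword appears in the task, else None."""
    for skill in skills:
        if skill["meta"]["trigger"] in task.lower():
            return skill
    return None
```

Editing the Markdown file changes the agent's procedure on the next run, with no retraining involved.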

Where Skills matter most in enterprise AI:

  • Compliance workflows — the model doesn't reason through a 14-step KYC check every time; it loads the KYC skill and executes
  • Support escalation — when a ticket meets specific conditions, the agent follows a defined diagnosis-escalation-documentation sequence, not an improvised one
  • Data processing pipelines — the agent knows the exact sequence of validation, transformation, and output steps for a given data type
  • Report generation — structured output tasks where the format, sections, and validation rules need to be consistent across every invocation

The gap in most enterprise AI stacks: teams invest in RAG (knowledge) and MCP (connectivity) but leave the agent reasoning from scratch every time it encounters a known workflow. That's the equivalent of expecting a doctor to re-derive a medical procedure from anatomy textbooks at every appointment.


Layer 4 — Agents: The Orchestration Layer

An agent is not smarter than the other layers. It doesn't make RAG better or MCP more capable or Skills more precise. An agent is the thing that decides which of these to use, when, and in what sequence, given a goal it hasn't been given exact instructions for.

The core loop: think → act → observe → think. The agent receives a goal, reasons about what step to take first, takes that step (retrieve something, call a tool, execute a skill), observes the result, and decides what to do next. This loop continues until the goal is met or the agent determines it cannot proceed.
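The loop itself is small. In this sketch the `policy` argument stands in for the LLM call that chooses the next action; `scripted_policy` and the `lookup_balance` tool are hypothetical placeholders so the example runs without a model.

```python
def run_agent(goal, policy, tools, max_steps=10):
    """think -> act -> observe, until the policy finishes or the step budget runs out."""
    history = []
    for _ in range(max_steps):
        action = policy(goal, history)                          # think
        if action["type"] == "finish":
            return action["answer"], history
        observation = tools[action["tool"]](**action["args"])   # act
        history.append((action, observation))                   # observe
    return None, history  # the agent could not reach the goal within its budget

def scripted_policy(goal, history):
    """Stand-in for the LLM: look up a balance once, then report it."""
    if not history:
        return {"type": "call", "tool": "lookup_balance", "args": {"account": "ACME"}}
    _, balance = history[-1]
    return {"type": "finish", "answer": f"{goal}: balance is {balance}"}

TOOLS = {"lookup_balance": lambda account: {"ACME": 42}.get(account, 0)}
```

The explicit `max_steps` budget covers the second exit condition in the text: the agent stops when it determines, or is forced to accept, that it cannot proceed.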

What this enables is orchestration under uncertainty — the ability to decompose an ambiguous goal into concrete sub-tasks across multiple systems without being given an exact script. "Prepare the monthly compliance report" is not a prompt. It's a goal. An agent that can execute it is doing something qualitatively different from a retrieval system or a fixed pipeline.

In enterprise deployments in 2026, single-agent loops are giving way to multi-agent architectures:

Executive Agent (receives goal, decomposes into subtasks)
  ├── Research Agent (RAG + web search)
  ├── Data Agent (SQL + MCP integrations)
  ├── Drafting Agent (generation + formatting)
  └── Review Agent (validation + compliance check)

The orchestration challenge scales with the number of agents and the complexity of their inter-dependencies. This is where failures compound: a hallucinated intermediate result in agent 2 propagates to agents 3 and 4 without a human noticing. Production multi-agent systems need observability, intermediate result validation, and explicit human-in-the-loop checkpoints for high-stakes decisions.
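One way to keep a bad intermediate result from propagating is to validate each stage's output and gate high-stakes stages on human approval. The sketch below assumes each agent is a plain function; the stage names, validators, and `approve` callback are hypothetical.

```python
class NeedsHumanReview(Exception):
    """Raised when a high-stakes stage has not been approved by a person."""

def run_pipeline(stages, payload, approve=lambda name, output: False):
    """stages: list of (name, agent_fn, validator_fn, high_stakes)."""
    audit_log = []
    for name, agent, validate, high_stakes in stages:
        payload = agent(payload)
        ok = validate(payload)
        audit_log.append({"stage": name, "valid": ok})  # observability trail
        if not ok:
            raise ValueError(f"{name} produced an invalid result; stopping before it propagates")
        if high_stakes and not approve(name, payload):
            raise NeedsHumanReview(name)
    return payload, audit_log

# Hypothetical agents: each adds a field to a shared payload.
STAGES = [
    ("research", lambda p: {**p, "sources": ["policy.pdf"]}, lambda p: len(p["sources"]) > 0, False),
    ("data", lambda p: {**p, "total": 1250}, lambda p: p["total"] >= 0, False),
    ("review", lambda p: {**p, "approved_text": f"Total: {p['total']}"}, lambda p: "approved_text" in p, True),
]
```

By default nothing high-stakes passes without an explicit approval, so forgetting to wire in the human checkpoint fails closed rather than open.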

One framing that clarifies the agent's role: an agent is not a replacement for the other layers. It is a coordinator that decides when each layer is needed. An agent without RAG is blind. An agent without MCP is isolated. An agent without Skills is improvising every known workflow from scratch. An agent with all three is the composable unit most enterprise AI systems are actually trying to build.

Agent (think → act → observe) coordinates all layers:

  • RAG: Knowledge
  • MCP / Tools: Action
  • Skills: Procedural
  • Human: Escalation

The Foundation: LLM as the Reasoning Substrate

All five layers sit on top of a model. And the model has a ceiling.

A weak model with excellent RAG still produces poorly reasoned answers. A weak model with an MCP integration still makes poor decisions about when to call which tool. A weak model with a well-crafted Skill still fumbles the execution. The capability of every layer above is bounded by the reasoning quality of the model beneath.

This is where a second vertical of innovation operates — not in the architectural layers, but in the model itself:

🔧 Raising the Foundation — Model-Level Investments

Fine-tuning — adapt the model's weights to your domain. The model internalizes domain knowledge, style, and task patterns. Best for stable, proprietary knowledge that doesn't change often. Expensive to iterate, but the capability gain is permanent and doesn't require runtime retrieval.

RLHF / Reinforcement Learning — align the model's outputs to human preferences through feedback loops. Used to tune helpfulness, safety, and task-specific behavior. This is how base models become products.

Re-ranking — after RAG retrieves candidate chunks, a cross-encoder re-scores them for relevance before generation. Cheap accuracy improvement that doesn't touch the model — it just puts better context into the prompt.

LLM-as-Judge — use a separate model to evaluate the quality of the primary model's output. Enables automated quality loops without human reviewers at every step. Core to building reliable agent evals.
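Of the four, re-ranking is the easiest to show in code. In this sketch `pair_score` is a term-overlap stand-in for a cross-encoder, which would instead run a model over each (query, passage) pair; the function names are hypothetical.

```python
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def pair_score(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder: scores the (query, passage) pair jointly.
    Here it is the fraction of query terms that appear in the passage."""
    q = tokens(query)
    return len(q & tokens(passage)) / len(q) if q else 0.0

def rerank(query: str, candidates: list[str], top_n: int = 2) -> list[str]:
    """Re-score first-pass retrieval candidates and keep the best top_n."""
    return sorted(candidates, key=lambda p: pair_score(query, p), reverse=True)[:top_n]
```

The first-pass retriever can stay fast and recall-oriented, because the re-ranker decides what actually reaches the prompt.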

Each of these will get its own deep dive. The point here is positioning: these are investments into the foundation. Raise the ceiling, and every layer above gets better.

How to Read Your Own System

Given a business problem, which layers does it actually need?

Run this diagnostic before you architect:

1. Does the system need to know things the LLM wasn't trained on? If yes → RAG layer required. Now ask: are the facts independent or deeply interrelated? Independent → vector RAG. Interrelated across a large corpus → GraphRAG. Small stable corpus → consider CAG.

2. Does the system need to take actions in external systems? If yes → Action layer required. Is this integration scoped to one project and simple enough to define inline? → Tool Call. Is it reusable across projects, or connecting to a major external system? → MCP server.

3. Does the system need to execute known, repeatable workflows reliably? If yes → Skills layer required. Define the workflow once as a Skill. Stop making the model reinvent the process every invocation.

4. Does the system need to make decisions about what to do and in what order, without being given an exact script? If yes → Agents required. Now ask: how complex is the goal decomposition? Single agent or multi-agent? Where are the human escalation points?

5. How good is the underlying model for this domain? If the model is weak on your domain → vertical model investment (fine-tuning, RLHF, re-ranking) before the architectural layers. Optimizing a retrieval pipeline on top of an inadequate model is building the second floor of a house with a broken foundation.
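The five questions can be encoded as a small checklist function. This is a sketch of the diagnostic as stated above, not a substitute for judgment; the field names and the returned labels are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    needs_private_knowledge: bool = False    # question 1
    facts_interrelated: bool = False
    corpus_small_and_stable: bool = False
    needs_external_actions: bool = False     # question 2
    integration_reusable: bool = False
    has_known_workflows: bool = False        # question 3
    needs_open_ended_planning: bool = False  # question 4
    model_strong_on_domain: bool = True      # question 5

def recommend_layers(p: Problem) -> list[str]:
    """Map the diagnostic answers to the layers the system needs."""
    layers = []
    if not p.model_strong_on_domain:
        layers.append("model investment first")
    if p.needs_private_knowledge:
        if p.corpus_small_and_stable:
            layers.append("CAG")
        elif p.facts_interrelated:
            layers.append("GraphRAG")
        else:
            layers.append("vector RAG")
    if p.needs_external_actions:
        layers.append("MCP server" if p.integration_reusable else "Tool Call")
    if p.has_known_workflows:
        layers.append("Skills")
    if p.needs_open_ended_planning:
        layers.append("Agent")
    return layers or ["plain LLM prompt"]
```

A problem that answers "no" to everything returns a plain prompt, which is the guard against the over-engineered failure mode.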

Most enterprise teams that are struggling with AI reliability are failing at either question 1 (they assumed the model knew their domain — it doesn't) or question 3 (they're making the agent reason through known workflows every time instead of encoding procedural expertise as Skills).

The over-engineered failure mode: deploying a multi-agent system with MCP and GraphRAG for a problem that needed a single RAG query and a well-written prompt.

The under-built failure mode: deploying a RAG system for a problem that needed actions, orchestration, and procedural memory — and wondering why it can only answer questions but not actually get anything done.


Where This Goes Next

The five-layer stack is stabilizing in 2026, but each layer is still evolving:

  • RAG is bifurcating — vector RAG for large, dynamic corpora; CAG for small, stable ones; GraphRAG for relationship-dense knowledge. The middle ground (mid-sized, semi-structured corpora with moderate update frequency) is becoming contested.
  • MCP is becoming infrastructure. The 2026 ecosystem looks like Docker in 2016 — the standard is set, the registry is growing, the question is now governance and enterprise-grade security, not whether to adopt.
  • Skills are the least mature layer. Formalization is recent, tooling is immature, and most enterprise teams don't have a Skills strategy at all. This is the highest-leverage gap in most stacks.
  • Agents are moving from single-agent loops to multi-agent systems with memory, observability, and human-in-the-loop checkpoints. The orchestration problem is not solved.

The LLM beneath all of it continues to get better, and each improvement lifts every layer above it. The teams that will win the next two years are not the ones with the most sophisticated RAG pipeline. They are the ones that understand which layer each problem actually needs — and build accordingly.