Building a Multi-Agent Document Intelligence System with Kiro CLI
I needed a way to ask questions about a pile of documents — PDFs, Word files, images — and get a cited answer back. Not a summary of one file, but a synthesis across dozens of them. The kind of thing where you’d normally spend an afternoon reading, cross-referencing, and taking notes.
The problem is that LLMs choke when you try to stuff 50 documents into a single context window. So I built a multi-agent system where specialized workers each handle one file, and an orchestrator stitches the results together. Here’s how it works and why the architecture decisions matter.
The Core Problem: Context Overflow
Every LLM has a context window — a fixed budget of tokens it can process at once. Load a 200-page PDF and you’ve already consumed a significant chunk. Try loading ten of them and you’re either truncating content or hitting hard limits.
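To see why one large PDF already strains the budget, here is a rough back-of-envelope calculation. The per-page word count and words-per-token ratio are ballpark assumptions, not measured values:

```python
# Rough estimate of tokens consumed by a 200-page PDF.
# ~500 words per page and ~0.75 words per token are ballpark
# assumptions; real documents and tokenizers vary.
words_per_page = 500
words_per_token = 0.75
pages = 200

tokens = int(pages * words_per_page / words_per_token)  # ~133,000 tokens
```

At roughly 133k tokens for a single document, even a large context window leaves little room for a second file, let alone ten.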
The naive approach is to concatenate everything and hope the model figures it out. That fails in predictable ways: the model loses track of which information came from which source, earlier documents get “forgotten” as the window fills, and you can’t scale beyond a handful of files.
The solution is to never put more than one document in a single agent’s context.
The Architecture
The system uses a hierarchical pipeline where each agent has a specific job and a restricted set of tools:
```text
User Query
    │
    ▼
Orchestrator (glob + subagents only)
    │
    ├── File Worker (1 per file, MCP document-loader)
    │     └── Returns JSON summary
    │
    ├── Image Worker (1 per image, MCP document-loader)
    │     └── Returns JSON analysis
    │
    └── Writer (fs_write, only when asked)
          └── Requires user approval
```
The orchestrator can discover files and spawn workers, but it physically can’t read file
contents — it doesn’t have fs_read in its tool list. Workers can read files but can’t spawn
other agents. The writer can modify files but requires explicit user approval for every operation.
This isn’t just a prompt instruction like “please don’t read files directly.” The tools are restricted at the configuration level. The orchestrator literally doesn’t have the capability to read a file, the same way a process without filesystem permissions can’t open a file regardless of what code it runs.
Why Tool Restriction Matters More Than Prompt Instructions
Early in the process, I tried a different approach: a single SOP (Standard Operating Procedure) that told the default agent to delegate file reading to subagents. The SOP said “You MUST NOT use fs_read to read file contents.” The agent ignored it and read the files directly every time.
This makes sense. The model sees fs_read in its available tools, sees that the task involves
reading files, and takes the shortest path. Prompt instructions are suggestions. Tool
restrictions are constraints. When you need guaranteed behavior, restrict the tools.
This is the same principle behind the security concept of least privilege — don’t give a process permissions it doesn’t need, because you can’t rely on the process choosing not to use them.
The Agent Configuration
Each agent is a JSON file that defines its name, description, system prompt, and — critically — which tools it can access. Here’s the orchestrator:
```json
{
  "name": "doc-intel",
  "tools": ["use_subagent", "glob"],
  "allowedTools": ["use_subagent", "glob"],
  "toolsSettings": {
    "subagent": {
      "availableAgents": [
        "doc-intel-file-worker",
        "doc-intel-image-worker",
        "doc-intel-writer"
      ],
      "trustedAgents": ["doc-intel-file-worker", "doc-intel-image-worker"]
    }
  }
}
```
Notice that doc-intel-writer is in availableAgents but not in trustedAgents. That means the
orchestrator can invoke it, but every invocation requires the user to approve it. Read operations
are trusted and run automatically. Write operations need a human in the loop.
The file worker is even more locked down:
```json
{
  "name": "doc-intel-file-worker",
  "mcpServers": {
    "document-loader": {
      "command": "uvx",
      "args": ["awslabs.document-loader-mcp-server"]
    }
  },
  "tools": ["@document-loader"],
  "allowedTools": ["@document-loader"],
  "includeMcpJson": false
}
```
It can only use the MCP document-loader. It can’t read arbitrary files, can’t run shell commands,
can’t access the internet. It loads one document, summarizes it into a JSON envelope, and returns.
The includeMcpJson: false flag means it doesn’t inherit the global MCP server configuration —
it only gets the document-loader.
Context Isolation by Design
Each subagent runs with its own isolated context window. When the orchestrator spawns four file workers in parallel, each one starts fresh with only:
- The file path it needs to process
- The user’s query for relevance scoring
- Its system prompt
No cross-contamination between files. No accumulated context from previous workers. This is what makes the system scale — processing 100 files uses the same amount of context per worker as processing 5.
The workers return compressed JSON summaries capped at 500 tokens each. The orchestrator never sees the raw document text. It works with structured envelopes:
```json
{
  "file_path": "/docs/report.pdf",
  "file_type": "pdf",
  "summary": "Q4 revenue was $12.3M, up 15% YoY...",
  "entities": [{ "name": "Q4 2025", "type": "period" }],
  "key_information": ["Revenue increased 15% year-over-year"],
  "relevance_score": 0.95,
  "images_detected": [
    { "image_id": "chart_p3", "location": "page 3", "estimated_relevance": 0.8 }
  ]
}
```
This is the same pattern used in distributed systems: services communicate through well-defined contracts, not by sharing internal state.
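Treating the envelope as a contract also means the orchestrator can check it before aggregating. A minimal sketch of that check, using the field names from the envelope above (the validation rules themselves are my assumption, not documented Kiro CLI behavior):

```python
# Sketch: validate one worker's JSON envelope before aggregation.
# Field names follow the envelope example above; the specific
# checks are an assumption, not part of the actual system.
import json

REQUIRED_FIELDS = {"file_path", "file_type", "summary", "relevance_score"}

def parse_envelope(raw: str) -> dict:
    """Parse and sanity-check a single worker response."""
    envelope = json.loads(raw)
    missing = REQUIRED_FIELDS - envelope.keys()
    if missing:
        raise ValueError(f"envelope missing fields: {sorted(missing)}")
    score = envelope["relevance_score"]
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"relevance_score out of range: {score}")
    return envelope

# Usage: reject anything that doesn't conform to the contract.
env = parse_envelope(
    '{"file_path": "/docs/report.pdf", "file_type": "pdf", '
    '"summary": "Q4 revenue was $12.3M...", "relevance_score": 0.95}'
)
```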
Batching Around Platform Limits
Kiro CLI supports up to 4 parallel subagents. With 50 files to process, the orchestrator batches them into groups of 4, waits for each batch to complete, then spawns the next. It’s not as fast as spawning all 50 at once, but it works within the platform’s constraints and still provides significant parallelism.
The orchestrator also handles failures gracefully: if a batch fails, it retries once, then skips those files and notes them as errors in the final answer. The pipeline doesn’t stop because one PDF is corrupted.
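The batching-with-retry loop can be sketched as follows. `spawn_worker` is a hypothetical stand-in for launching a file-worker subagent and returning its envelope, and the retry here is per file rather than per batch, a simplification of what the article describes:

```python
# Sketch: process files in batches of 4 (the platform's parallel
# subagent cap), retry each failure once, then skip it and record
# it as an error. spawn_worker is a hypothetical stand-in for the
# call that launches a subagent and returns its JSON envelope.
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 4  # Kiro CLI's parallel subagent limit

def process_all(files, spawn_worker):
    results, failed = [], []
    for i in range(0, len(files), BATCH_SIZE):
        batch = files[i:i + BATCH_SIZE]
        with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
            futures = [(path, pool.submit(spawn_worker, path)) for path in batch]
            for path, future in futures:
                try:
                    results.append(future.result())
                except Exception:
                    try:
                        results.append(spawn_worker(path))  # one retry
                    except Exception:
                        failed.append(path)  # skip; noted in the final answer
    return results, failed
```

The `failed` list is what lets the final answer say "these files could not be processed" instead of silently dropping them.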
Image Analysis: Selective, Not Exhaustive
File workers detect embedded images and estimate their relevance to the query. The orchestrator reviews these estimates and only spawns image analysis workers for images that are likely to contain useful information — diagrams, charts, data tables. Decorative images, logos, and headers get skipped.
This matters because image analysis is slow and expensive. A Word document with 30 embedded images doesn’t need all 30 analyzed to answer a question about budget figures. The system routes intelligently instead of processing everything.
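The routing step reduces to a filter over the `images_detected` field of each envelope. A sketch, using a 0.6 relevance cutoff (the threshold value the project's SOP specifies; the surrounding code is mine):

```python
# Sketch: select only images worth spawning an analysis worker for.
# Data shape follows the images_detected field of the envelope;
# the 0.6 cutoff is the SOP's threshold value.
RELEVANCE_THRESHOLD = 0.6

def select_images(envelopes: list[dict]) -> list[dict]:
    """Return images likely to contain useful information."""
    selected = []
    for env in envelopes:
        for img in env.get("images_detected", []):
            if img["estimated_relevance"] >= RELEVANCE_THRESHOLD:
                selected.append({"file": env["file_path"], **img})
    return selected
```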
Lessons from the First Real Test
The first test was against real work documents — a meeting preparation file, two venue specification documents full of images, and a project timeline. Several things happened:
A subagent returned XML instead of JSON. The worker prompt said “return only JSON” but one model decided to wrap it in XML tags. The fix was adding explicit “NEVER return XML, NEVER use XML tags” to the prompt. Defensive prompting works.
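Defensive prompting reduced the failures, but a parse-time fallback is cheap insurance against a worker that wraps its output in tags anyway. The regex salvage below is my own approach, not something the system implements:

```python
# Sketch: salvage JSON from model output that was wrapped in XML
# tags despite the prompt. Falls back to extracting the outermost
# {...} span; this heuristic is an assumption, not the project's fix.
import json
import re

def extract_json(raw: str) -> dict:
    """Parse raw worker output, stripping any surrounding tags."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw, re.DOTALL)  # outermost braces
        if match:
            return json.loads(match.group(0))
        raise
```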
A batch timed out after 5+ minutes. Large DOCX files with many high-resolution images take time to process through the MCP document-loader. The orchestrator’s retry logic handled this, but it highlighted the need for reasonable timeout expectations.
The orchestrator tried to read files directly. When I first tried running this as an SOP from the default agent (which has all tools), the model ignored the delegation instructions and just used fs_read. This is what led to the tool restriction approach: the dedicated doc-intel agent that physically cannot read files.
Each of these failures improved the system. That’s the value of testing with real documents instead of synthetic examples.
What About Other Tools?
This architecture isn’t unique to Kiro CLI. The principles — context isolation, tool restriction, structured communication, hierarchical delegation — apply anywhere you’re building multi-agent systems. The implementation details differ:
GitHub Copilot has evolved into a multi-agent development platform with specialized agents for different workflows, but these are primarily fixed-purpose agents — not user-configurable for arbitrary workflows like document analysis. Custom agents are now supported in VS Code, allowing developers to define specialized AI assistants with specific prompts and tool access.
Roo Code takes a cloud-based approach with 5 role-based agents (Explainer, Planner, Coder, PR Reviewer, PR Fixer). Each runs in an isolated cloud environment, giving strong context isolation but requiring cloud infrastructure. Locally, Roo Code uses “Custom Modes” — specialized AI personas with scoped tool permissions — rather than a subagent delegation model.
Cline supports parallel subagents that are read-only by design — they can explore the codebase but can’t edit files or execute commands. Each subagent gets its own context window. The restriction model is similar in spirit to what we built, though the configuration is less declarative.
OpenCode offers configurable agents with tool restrictions through its JSON-based configuration system, providing flexibility similar to Kiro’s approach. Agents can have permissions configured to control tool access.
The key differentiator in Kiro’s approach is the declarative JSON configuration with explicit
availableAgents, trustedAgents, and per-agent tool lists. You define the security model in
configuration, not in prompts.
What the Architecture Review Found
After building the system, I ran a formal architecture review — reading every file, comparing against documented multi-agent patterns, and rating the implementation. The results were instructive.
The strongest aspect was context safety (8/10). Single-file workers with isolated context,
structured JSON envelopes, and MCP isolation via includeMcpJson: false form a solid foundation.
The tool-level enforcement of separation of concerns — where the orchestrator physically cannot
read files — was validated as the right approach.
The weakest aspect was operational reliability (5/10). The system works, but it has no observability. There’s no structured logging of which files were processed, which failed, how long each batch took, or how many tokens were consumed. When something goes wrong, you’re debugging blind.
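A minimal version of the missing observability layer would be one structured log record per batch. The field choices here are mine, not part of the reviewed implementation:

```python
# Sketch: emit one structured log line per batch so failures and
# timings are visible after the fact. Field names are assumptions.
import json
import time

def log_batch(batch_id: int, files: list[str], failed: list[str], started: float) -> str:
    """Print and return a JSON log record for one completed batch."""
    record = {
        "batch": batch_id,
        "files": len(files),
        "failed": failed,
        "seconds": round(time.time() - started, 2),
    }
    line = json.dumps(record)
    print(line)
    return line
```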
Several concrete risks emerged:
The 500-token summary budget is enforced only by prompt instruction. There’s no validation that workers actually comply. With 100+ files, oversized summaries could silently accumulate and overflow the orchestrator’s context. The fix is straightforward — add a length check in the orchestrator’s aggregation step.
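That aggregation-step check might look like this. Counting words rather than tokens is a simplification; a real implementation would use the model's tokenizer:

```python
# Sketch: enforce the 500-token summary budget at aggregation time
# instead of trusting the prompt. Word count stands in for token
# count here, which is a rough proxy, not an exact measure.
MAX_SUMMARY_TOKENS = 500

def clamp_summary(envelope: dict) -> dict:
    """Truncate an oversized worker summary before aggregation."""
    words = envelope["summary"].split()
    if len(words) > MAX_SUMMARY_TOKENS:
        envelope["summary"] = " ".join(words[:MAX_SUMMARY_TOKENS]) + " [truncated]"
    return envelope
```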
The writer agent has fs_read access it doesn’t need. It receives content to write via the
subagent query context, so arbitrary file reading is unnecessary scope. Removing it tightens the
permission model.
The image relevance threshold (0.6) exists in an SOP document that the orchestrator agent doesn’t load. The orchestrator’s prompt says “skip decorative images” without a numeric threshold, leaving the filtering to model interpretation. Adding an explicit number makes it deterministic.
There’s also no caching. If you ask about the same file in two different queries, it gets re-processed both times. A summary cache keyed by file path and modification time would eliminate redundant MCP calls.
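A sketch of such a cache, where `load_and_summarize` is a hypothetical stand-in for the MCP document-loader round trip:

```python
# Sketch: cache summaries keyed by (file path, mtime) so a file is
# only re-summarized when it changes. load_and_summarize is a
# hypothetical stand-in for the worker's MCP document-loader call.
import os

_cache: dict[tuple[str, float], dict] = {}

def cached_summary(path: str, load_and_summarize) -> dict:
    """Return a cached envelope, recomputing only if the file changed."""
    key = (path, os.path.getmtime(path))
    if key not in _cache:
        _cache[key] = load_and_summarize(path)
    return _cache[key]
```

Because the modification time is part of the key, editing a file naturally invalidates its cached summary.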
These are the kinds of issues you only find by auditing the implementation against the design — not by testing happy paths.
The Improvement Roadmap
Based on the review, the next iteration focuses on:
- Summary length verification in the orchestrator
- Explicit numeric thresholds for image routing
- Removing unnecessary fs_read from the writer
- Batch progress summaries for observability
- Aligning documentation with the actual 4-agent implementation
- Eventually: summary caching and large document chunking
The Takeaway
If you’re building multi-agent systems for document analysis — or any task where context overflow is a risk — the architecture matters more than the prompts:
Restrict tools, don’t just instruct. If an agent shouldn’t read files, remove fs_read from its tool list. Don’t rely on “please don’t read files” in the prompt.

One document per worker. Context isolation isn’t just about preventing overflow — it ensures each document gets the model’s full attention.
Structured communication. JSON envelopes with fixed schemas prevent raw text from leaking between agents and keep the orchestrator’s context predictable.
Trust boundaries. Read operations can be trusted. Write operations should require human approval. Not every subagent needs the same permission level.
Test with real documents. Synthetic tests won’t surface the XML response issue, the timeout on image-heavy DOCX files, or the model’s preference for taking shortcuts when tools are available.
Audit your own implementation. Run a formal review comparing your code against documented patterns. The gaps between design intent and actual behavior are where the real improvements hide.
The system processes real work documents now — meeting preparations, venue specifications, project timelines — and produces cited answers that would’ve taken an afternoon of manual reading. The architecture is 4 JSON files and a handful of SOPs. The complexity is in the design decisions, not the implementation.
References
- Kiro CLI Documentation - Agent configuration and subagent system
- Model Context Protocol - Open standard for AI tool integration
- AWS Labs Document Loader MCP Server - MCP server for document loading
- AWS Open Source: Introducing Strands Agent SOPs - Agent SOP methodology
- Microsoft ISE: Patterns for Building a Scalable Multi-Agent System - Orchestration patterns at scale
- Roo Code Cloud Agents - Cloud-based agent team architecture
- Cline Subagents - Read-only parallel research agents