Designing Agentic AI Systems with the ORCHIDEAS Framework

Published 06/05/2026

Written by Ken Huang, CEO & Chief AI Officer, DistributedApps.ai.

A secure-by-construction approach to nine-pillar agentic AI design, integrated with the Cloud Security Alliance MAESTRO threat modeling framework

Introduction: Security as a Structural Property

Most security failures in software systems come from treating security as something added on top of an otherwise-complete design. A team builds the application, then adds authentication; ships the feature, then writes the audit log; designs the architecture, then performs a penetration test. The defects this approach produces are not usually exotic — they are predictable consequences of asking security to retrofit a structure that was not built for it.

Agentic AI systems make this approach untenable. The principal making decisions is probabilistic, partially opaque, and capable of being steered by adversarial inputs that look indistinguishable from legitimate ones. The boundary between data and instruction is porous. The system's behavior emerges from the interaction of components in ways that cannot be fully predicted from any single component's specification. A bolt-on security model — runtime checks layered onto an otherwise-trusting architecture — produces a system whose security properties depend on the perfection of those checks, and the checks are never perfect.

The alternative is secure by construction: design the system so that the security properties you want are invariants of the architecture itself. Violating those invariants requires explicitly bypassing the structure, not merely neglecting to add a check. The agent cannot act without an attested intent token, because the action interface only accepts calls bound to one. The model cannot be invoked directly, because all calls flow through a gateway by construction. Tool authority cannot expand across handoffs, because capability tokens are attenuated by construction. Context content from untrusted sources cannot enter privileged context slots, because the segregation is structural. The threat model still applies — but it applies to the architecture's invariants rather than to a runtime-checking surface that depends on every check firing correctly.

This document presents ORCHIDEAS, a nine-pillar design framework for agentic AI systems built on the secure-by-construction premise, integrated with the Cloud Security Alliance's MAESTRO threat modeling framework. ORCHIDEAS organizes the design space; MAESTRO organizes the threat space. Together they let a team build systems whose security properties are structural rather than aspirational.

The framework name is a mnemonic rather than a construction sequence. The design sequence is Autonomy, Identity & Intent, Data & Memory Governance, Context, Runtime, Human Oversight & Override, Observability, Eval/Environment/Ecosystem, and Scalability. The same nine pillars are remembered as ORCHIDEAS, whose letter order emphasizes the complete set rather than the build order. ORCHIDEAS is Spanish for "orchids," a fitting metaphor for systems whose health depends on every environmental condition being right and whose ecosystem is intricate enough to require deliberate cultivation.

Figure 1. Secure-by-construction control plane showing how autonomy policy, identity, data governance, and context controls converge at the gateway and tool broker.

Secure-by-construction control plane

The Engineering Disciplines That Ground ORCHIDEAS

Before walking through the pillars, it helps to name the engineering disciplines they embody. Each pillar is not an arbitrary collection of concerns but an application of well-established principles to the specific surfaces that agentic AI presents. Six disciplines run through the entire framework.

The first is the Saltzer and Schroeder security principles, articulated in 1975 and still the bedrock of security engineering. Economy of mechanism (keep designs simple enough to reason about), fail-safe defaults (default-deny rather than default-allow), complete mediation (every access checked, no shortcuts), open design (security does not depend on secrecy of the mechanism), separation of privilege (require more than one condition for sensitive operations), least privilege (each component gets only the authority it needs), least common mechanism (minimize shared mechanism between principals), and psychological acceptability (controls users will actually use). These principles map directly onto agentic system design: complete mediation becomes the LLM gateway and tool broker pattern; least privilege becomes attenuated capability tokens; fail-safe defaults becomes the default-deny posture at autonomy boundaries.

The second is defense in depth: no single control prevents the failure mode you fear most. The intent of layering controls is not redundancy for its own sake but the assumption that any individual control will sometimes fail. A robust agentic system has prompt injection defenses at the input filter, at the context assembly, at the model conditioning, at the tool broker, at the action confirmation, and at the observability layer. The attack must defeat all of them to succeed, while the defender must succeed at only one to catch it.

The third is zero trust, applied to internal architecture as much as to external boundaries. The agent does not trust its own context content unconditionally. The orchestrator does not trust the agent's stated intent unconditionally. The tool broker does not trust the agent's tool selection unconditionally. The MCP server does not trust the calling agent's identity claim unconditionally. Trust is established through cryptographic attestation and bounded by policy, not assumed from network location or prior interaction.

The fourth is capability-based security. Authority in the system is conveyed through unforgeable tokens that name exactly what may be done. An agent that holds a capability token for "read customer record 12345 within intent task-xyz for the next 60 seconds" cannot, by construction, read customer record 67890, or read 12345 after the TTL, or read 12345 under a different intent. The capability defines the authority; possession is sufficient; nothing more is granted. This contrasts with ambient authority models where principals possess broad permissions and rely on runtime checks to narrow them — a model that fails predictably under prompt injection.

The fifth is compositionality: secure components must compose into secure systems, with the security properties of the whole derivable from the properties of the parts. This is the discipline that gives "secure by construction" its meaning. Without compositionality, every interaction between components is its own threat model; with compositionality, the architectural choices ensure that combinations of secure components do not produce insecure emergent behavior. Compositionality requires clear trust boundaries, well-typed interfaces, and explicit reasoning about what each component requires from and provides to its neighbors.

A sixth principle, reliability engineering, underlies the operational pillars (Observability, Scalability) and provides the framing for graceful degradation: when a component fails, the system should fail safely rather than catastrophically, and the failure should be visible. Agentic systems that "succeed silently with degraded quality" — for instance, falling back to a less-capable model without alerting on the quality drop — accumulate latent problems that surface as incidents later.

The ORCHIDEAS pillars are organized so that each one applies one or more of these disciplines to a specific design surface. Autonomy applies fail-safe defaults and separation of privilege to action boundaries. Identity & Intent applies capability-based security and complete mediation to authorization. Data & Memory Governance applies least privilege and compositionality to data lifecycle. Context applies zero trust to the runtime window. Runtime applies complete mediation and defense in depth to in-flight decisions. Human Oversight & Override applies psychological acceptability and separation of privilege to the human-AI interface. Observability applies open design and reliability engineering to the surveillance layer. Eval applies compositionality (in CI/CD form) to validation. Scalability applies economy of mechanism and reliability engineering to operational design.

Figure 2. Engineering disciplines that ground ORCHIDEAS and keep the pillars tied to established security and reliability practices.

Six engineering disciplines under the nine pillars

The Nine Pillars in Design Sequence

The pillars are presented below in the order they should be addressed when designing a new agentic AI system. The acronym ORCHIDEAS is a mnemonic; the design order is the implementation order. Skipping pillars early produces systems where later pillars cannot be cleanly applied — for instance, an agent built without explicit autonomy boundaries cannot have IBAC retrofitted later without rewriting the orchestration layer.

The mapping between ORCHIDEAS and the design sequence is given here for reference. ORCHIDEAS is the mnemonic order; the design sequence is the implementation order. The mnemonic order is O, R, C, H, I, D, E, A, S. The design sequence is A, I, D, C, R, H, O, E, S. Both refer to the same nine pillars; the difference prevents the acronym from implying a build order that would be unsafe.

Figure 3. Design-order versus mnemonic-order view of the same nine ORCHIDEAS pillars.

ORCHIDEAS: one pillar set, two useful orders

A — Autonomy

The first design decision

The autonomy question is the first architectural decision because every subsequent pillar depends on the answer. A platform building toward Level 1 agents (narrow assistive autonomy) has very different identity, context, runtime, and data requirements than one building toward Level 3+ agents (autonomous within domain). Deciding autonomy levels late forces all earlier pillars to be redesigned. Deciding them first allows each subsequent pillar to be sized appropriately.

The autonomy question is not binary but multi-dimensional. Each capability the agent has — read this resource, write that resource, call this external API, spend that much money, communicate with this counterparty, modify its own configuration — has its own autonomy setting that ranges from fully autonomous through autonomous-with-logging, autonomous-with-notification, autonomous-within-budget, requires-synchronous-approval, requires-asynchronous-review, all the way to never-permitted.

Several axes structure the decisions. Reversibility is the most important: actions that can be undone tolerate substantially more autonomy than actions that cannot. Blast radius matters: actions affecting one user warrant less scrutiny than actions affecting an entire tenant or all users. Cost matters: actions with significant monetary impact need explicit ceilings. External visibility matters: actions visible to customers, partners, or the public need confirmation paths that internal actions may not require. Compliance sensitivity matters: regulated actions (financial transactions, healthcare decisions, employment decisions) typically require explicit human gates regardless of the agent's track record. And trust history matters: trust-progressive autonomy allows agents to earn broader scope through demonstrated reliable behavior, with rollback when behavior degrades.

A useful framework for stratifying autonomy is a five-plus-one level model, analogous to the SAE levels for autonomous vehicles but scoped to agentic action. Level 0 agents recommend and a human always acts. Level 1 agents perform narrow, low-risk tasks with human review of outcomes. Level 2 agents act autonomously within a defined scope and escalate at scope boundaries. Level 3 agents operate autonomously within a domain and surface exceptions. Level 4 agents operate autonomously across a bounded environment while humans supervise the fleet rather than individual actions. Level 5, full autonomy across all dimensions, remains aspirational and is not deployment-ready for consequential domains. Most production deployments should target Level 1 or Level 2 for consequential actions and reserve Level 3+ for narrow domains with strong reversibility, containment, and observability.

Engineering principles embodied

Autonomy is the pillar that operationalizes fail-safe defaults and separation of privilege. The default at every boundary is "this is not permitted unless explicitly authorized"; expanding autonomy requires deliberate sign-off rather than emerging from negative space. Separation of privilege manifests as multi-party approval for catastrophic-risk actions — two humans, or two independent agents plus a human — ensuring no single principal (human or AI) can authorize the most consequential operations alone.

The structural property to design for: an agent that wants to perform an action outside its autonomy bounds should find no path to do so within the architecture. The boundary is not a runtime check that could be bypassed; it is the shape of the interface itself.

MAESTRO threats relevant to Autonomy

Autonomy maps primarily to MAESTRO L3 (Agent Frameworks) where orchestration enforces boundaries, L7 (Agent Ecosystem) where multi-agent autonomy interactions occur, and L6 (Security & Compliance, vertical) where autonomy policy is governed.

Autonomy creep (L3, L6): gradual expansion without re-attestation. A team grants narrow autonomy at launch, encounters friction from approval gates, loosens boundaries to improve UX, and over months ends up with an agent operating well beyond what was originally authorized. The threat is organizational as much as technical; the mitigation is periodic re-attestation of autonomy levels with explicit sign-off and automated tracking of boundary changes.

Autonomy ambiguity (L3): unclear boundaries lead to either over-restriction (agent refuses legitimate actions) or under-restriction (agent takes unauthorized actions). The mitigation is explicit, machine-readable autonomy policy with clear precedence rules.

Approval fatigue (L6, L7): too many approval requests train human reviewers to rubber-stamp. The mitigation is risk-based approval routing that batches low-risk reviews and surfaces high-risk decisions with clear context.

Autonomy shopping (L3, L7): agents or users find paths achieving a goal without crossing approval gates — an agent forbidden from sending external email invokes a tool that triggers a webhook that sends external email. The mitigation is autonomy policy expressed at the effect level (outcomes), not just the action level (specific API calls), and cross-tool composition analysis.

Cascading autonomy failures (cross-layer): broad autonomy in one domain inherits effective autonomy in adjacent domains through tool composition. The mitigation is per-tool autonomy scoping, not per-agent.

Design patterns

The reference pattern for high-stakes deployments uses tiered authorization. Low-risk actions flow through a fast path with logging only. Medium-risk actions flow through an enriched-context path with anomaly checks. High-risk actions require synchronous approval. Catastrophic-risk actions require multi-party approval. Budget-based autonomy provides a backstop: regardless of risk level, an agent that exceeds a configured budget of tokens, dollars, calls, or affected records halts and requires human review.

The structural guarantee: every action the agent takes flows through the autonomy decision point. There is no path that bypasses it. The decision point reads from machine-readable policy (OPA/Rego, Cedar) that is itself versioned, signed, and reviewed.

Figure 4. Autonomy boundary routing from proposed action to logging, notification, approval, multi-party approval, or denial.

Autonomy boundaries turn risk into routing

I — Identity & Intent

Once autonomy boundaries are defined, the next question is who is doing the acting and on whose behalf. Identity establishes the principal; intent establishes the purpose. Both are required before any data access, tool call, or action can be authorized. Building data governance, context handling, or runtime enforcement before identity is solved produces systems whose authorization model is implicit and inconsistent across components — exactly the seam where attacks succeed.

The uncanny valley of agentic identity

Traditional identity systems split into two regimes. Human identity is built around interactive authentication, contextual signals, and slow-changing entitlements expressed through roles and group memberships. Workload identity — service accounts, API keys, mTLS certificates, SPIFFE IDs, OIDC tokens — is built around cryptographic attestation, short-lived credentials, and call graphs that can be statically reasoned about.

Agentic systems fall into the uncanny valley between these regimes. An agent is a workload by execution model: it runs in a container, holds credentials, makes API calls. But it is human-like in decision-making: it interprets ambiguous instructions, exercises judgment, composes novel sequences of actions, and acts on behalf of a human whose authority it inherits in some delegated form. Applying human identity controls (MFA, behavioral analytics tuned for humans) misfits the workload nature; applying pure workload identity controls (static service accounts with broad scopes) ignores that agents make value-laden decisions and can be steered by adversaries. Organizations that fail to recognize this valley grant agents service-account-style blanket permissions, then discover that prompt injection turns those permissions into an attacker capability.

The maturation path layers three components. The base layer is cryptographic workload identity (SPIFFE/SPIRE, cloud-native workload identity federation, TPM-rooted attested certificates) that proves what code is running. The middle layer is delegated authority (OAuth 2.0 token exchange RFC 8693, capability tokens, step-up authentication callbacks) that proves on whose behalf the agent acts. The top layer, still maturing, is intent attestation: a verifiable claim about what the agent is currently trying to do, bound to the specific task and revocable when the task completes.

Intent-Based Access Control

Conventional access control answers the question "is this principal allowed to perform this action on this resource?" For agents, that question is insufficient. An agent with legitimate access to a customer database may have a legitimate need to read one customer's record and no legitimate need to exfiltrate the entire table; both operations may satisfy traditional RBAC or ABAC policies. Intent-Based Access Control (IBAC) adds a third dimension: the action must be consistent with the agent's currently authorized intent.

When a user initiates a task, the orchestrator mints an intent token capturing the natural-language goal, a structured representation extracted by a classification model, the scope of resources the goal could legitimately touch, expected action types, a budget (in API calls, tokens, or time), and a TTL. Downstream policy decision points evaluate not just identity and resource but the active intent token, rejecting actions that fall outside the declared scope.

Intent misalignment becomes a first-class security threat with three primary vectors. External manipulation occurs when a user crafts a request that maps to a benign intent but is designed to be elaborated into a harmful action sequence. Adversarial prompt injection occurs when untrusted content mutates the agent's working intent mid-task. Model hallucination occurs when the model, given a vague task, invents a plausible but unauthorized sub-goal. IBAC defends against all three by requiring every consequential action to trace back to the attested intent.

Engineering principles embodied

Identity & Intent applies capability-based security and complete mediation. Authority is conveyed through unforgeable, narrowly scoped capability tokens — never through ambient permissions tied to a session or role. Every action passes through a mediation point that checks the capability against the policy; there are no privileged code paths that skip the check.

The structural property: an agent in possession of authority for one task cannot use that authority for a different task, because the capability binds to the intent ID and the authorization check verifies the binding by construction.

MAESTRO threats relevant to Identity & Intent

Maps primarily to L6 (Security & Compliance), L3 (Agent Frameworks), with secondary impact on L4 (Deployment) and L7 (Ecosystem).

Agent impersonation (L7, L6): mutual authentication, capability tokens that cannot be replayed, workload attestation binding credentials to verified code.

Credential theft and replay (L4, L6): short TTLs (minutes, not hours), narrow scopes, binding credentials to workload attestation so they cannot be used outside the verified execution environment.

Confused deputy across handoffs (L3, L7): attenuated delegation — Agent B receives a capability token strictly narrower than Agent A's, scoped to the specific subtask.

Intent drift through prompt injection (L1, L3): re-deriving authorization from the originally attested intent rather than from the agent's current reasoning.

Hallucinated intent (L1): vague tasks should fail attestation rather than be elaborated into plausible-sounding goals.

Privilege escalation through tool composition (L3, L7): composition-aware policy that reasons about effect chains, not just individual tool permissions.

Design patterns and anti-patterns

The reference architecture binds four elements at every consequential action: workload identity, delegated user identity, active intent token, and a per-action capability token derived from the three. The capability token is what passes to downstream services. It is short-lived, narrowly scoped, and embeds the intent ID.

Two anti-patterns recur. The omnipotent agent service account — a single broadly scoped account used by all agent instances — collapses the identity dimensions, makes auditing meaningless, and converts every prompt injection into worst-case escalation. Passing user credentials through to the agent — forwarding the user's bearer token so the agent can act "as the user" — breaks delegation hygiene and creates significant credential exposure. Token exchange to a narrower delegated credential is always preferable.

Figure 5. Identity and intent chain that binds workload identity, delegated authority, intent, capability, and broker mediation.

Identity and intent bind authority to purpose

D — Data & Memory Governance

Identity authorizes principals; data governance defines what the principals are authorized over. Before designing context, runtime, or observability, the platform team must answer: what data exists, where does it live, how is it classified, who can access it, how does classification propagate, and how long is it retained. Building agent context handling before the data classification is settled produces systems where sensitive data leaks into unprivileged context paths and where deletion requests have no clean propagation path.

The data lifecycle of an agent

Agentic systems depend on a wide data lifecycle. The training and fine-tuning data behind any custom models. The RAG corpora the agent retrieves from. The agent memory it persists across sessions. The data it generates as derivative artifacts (summaries, embeddings, extracted facts). The classification labels that should propagate from source systems through every downstream operation. None of this lives in the runtime context window, and none of it is owned by other pillars.

Data & Memory Governance owns: training data lineage (consent, license, bias evaluation); RAG corpus governance (source vetting, freshness, poisoning monitoring, access alignment); agent memory persistence (vector databases, episodic memory, fact stores, preference profiles); PII handling (detection, redaction, tokenization, audit, retention); data residency and sovereignty (where data can reside and be processed); data classification propagation (labels flow through retrieval, context, output, logs); and retention and right-to-erasure (deletion semantics that propagate through derived stores).

Engineering principles embodied

Data & Memory Governance applies least privilege and compositionality to the data lifecycle. Least privilege at the data layer means: the agent retrieves only the data needed for the task, classification flows with the data, and downstream operations inherit restrictions. Compositionality means that the security properties of derived data (embeddings, summaries) are derivable from the source — there is no place where classification can be "lost" because the data went through an operation that did not understand it.

The structural property: any data the agent touches carries its classification with it through every subsequent operation. There is no path where a "restricted" document gets summarized into an "unclassified" output by accident, because the classification is propagated by construction.

MAESTRO threats relevant to Data & Memory Governance

Maps primarily to L2 (Data Operations), with L1 (Foundation Models) for training data and L6 (Security & Compliance) for governance.

Data poisoning of training and fine-tuning data (L2 → L1): the canonical cross-layer cascade. Poisoned data ingested during a retraining cycle embeds corruption into model weights, manifesting as compromised behavior at L7. Mitigations: signed datasets with provenance, automated anomaly checks on incoming data, isolation between training data sources and production retrieval corpora, red-team testing for backdoor triggers post-training.

RAG corpus poisoning (L2, L3): an insider or compromised contributor adds malicious content. When retrieved into context, the content contains prompt injection payloads. This is one of the most underestimated production threats because it weaponizes the very content the system was designed to use. Mitigations: source vetting, content classification on ingestion, change tracking with audit, periodic adversarial retrieval testing, isolation between user-contributed and curated content.

Embedding inversion attacks (L2): given access to a vector database, an attacker reconstructs the original text from embeddings — sometimes substantially. Mitigations: treat vector databases as containing the original text for access control purposes; encrypt embeddings at rest where warranted; consider differentially-private embedding techniques for highly sensitive data.

Memory contamination across sessions or tenants (L2): one user's data leaks into another user's context through memory persistence. Mitigations: strict per-tenant memory scoping; separate physical or logical vector indexes for confidential data; explicit access control on memory retrieval.

PII leakage through derived data (L2, L5): embeddings, summaries, and logs derived from PII may not themselves be classified as PII in source systems, but reconstruction or correlation attacks can extract personal information. Mitigation: classification inheritance — any data derived from classified inputs inherits at least the classification of its inputs.

Right-to-erasure failures (L2, L6): when a user requests deletion, agents may have copies in memory stores, embeddings, summaries, fine-tuning data, and logs. Mitigations: per-user data inventory across all stores; deletion workflows that propagate to derived data; documentation of any data that cannot be deleted with explicit user notice; architectural choices that minimize the proliferation of personal data copies.

Data residency violations (L4, L6): an agent retrieves EU-resident data and processes it through a model API in a non-compliant region. Mitigations: residency labels on all data; routing logic that respects residency at the inference layer.

Design patterns

The reference architecture centers on a data classification service that every data-producing and data-consuming component consults. Source data is classified at ingestion; derived data inherits classifications from inputs; access control at retrieval, context assembly, and output enforces classifications. Memory stores are partitioned by tenant and by classification level. RAG pipelines treat retrieved content as untrusted by default. Embedding stores are governed at the same level as the raw text they were derived from.

A separate but related pattern is the data inventory: a continuously updated map of what personal and sensitive data exists where, how it flows through the agent system, who has access, and how long it persists. This is the artifact compliance teams need to demonstrate control and the operational tool incident response teams need when investigating a breach.

A common anti-pattern is treating the vector database as "just" a search index with relaxed access controls because "it's only embeddings." Embedding inversion plus the accumulation of context summaries means vector stores end up containing reconstruction-grade representations of sensitive data. Treat them as primary data stores for governance purposes.

C — Context

With autonomy, identity, and data governance in place, the next question is what the agent perceives at decision time. Context is where data, instructions, and untrusted content meet inside the model's working window. Designing context before identity and data is unsolved produces structurally compromised agents — the context handling has no way to know what trust level to assign each segment because the classifications were never established.

Context as a security dimension

The context window is not a working buffer; it is the entire universe the agent perceives at the moment of decision. Anything in that window shapes behavior, and anything that shapes behavior is in scope for security analysis. Treating context as a purely architectural concern misses that context window integrity has direct security consequences comparable to a misconfigured firewall.

Applying the CIA triad to the context window itself yields a useful framing. Confidentiality asks what is in the context, who can see it, and what happens if it is logged or leaked. Integrity asks whether the context can be modified by untrusted sources between read and act — indirect prompt injection is a pure integrity attack. Availability asks whether an adversary can exhaust, truncate, or fill the context with noise to push legitimate instructions out.

A typical agent context contains layered content with very different trust properties. At the top, a system prompt and policy block, authored by the platform team and effectively immutable from the agent's perspective. Below that, user-provided input, untrusted by default. Below that, retrieved content from RAG pipelines, often the lowest-trust segment because it can be authored by anyone whose documents end up in the corpus. Interleaved throughout, tool call results, carrying the trust of the tool. Finally, agent-generated content — prior reasoning, scratchpad notes — which inherits the integrity properties of everything that came before.

Engineering principles embodied

Context applies zero trust to the runtime window. The architecture conditions the model on segment provenance: instructions in low-trust segments are data, not directives. The structural property: there is no path by which content from an untrusted source becomes interpreted as instruction without crossing an explicit trust-elevation boundary that requires authorization.

The error most systems make is treating context as a flat token stream. A more defensible architecture tags each context segment with provenance and trust level, and the model is conditioned to respect the tags. No language model perfectly honors such conditioning, but the combination of trust tagging, structural separation, and downstream policy enforcement raises the bar significantly.

MAESTRO threats relevant to Context

Maps primarily to L3 (Agent Frameworks) and L1 (Foundation Models), with L2 (Data Operations) secondary.

Indirect prompt injection (L3, L1): the canonical attack. A user asks the agent to summarize a webpage; the page contains text designed to manipulate the agent. Layered mitigations: input transformations on untrusted content; prompt-layer trust tagging; runtime-layer tool call validation against current intent; action-layer confirmation for sensitive operations.

Context corruption through compaction (L3): summarization drops policy instructions; the agent loses constraints. The fix is hierarchical context — a sealed top layer (system prompts, policies) never compacted, a sticky middle layer of session-critical facts that resists summarization, and a rolling tail that can be compacted. Many systems re-inject the policy block before every model call rather than relying on it remaining in context.

Cross-tenant context leakage (L3, L4): shared caches, embedding stores, or KV cache reuse mix context between tenants. Mitigations: tenant-scoped cache keys, separate vector indexes for confidential corpora, explicit context isolation in the inference layer.

Context exhaustion attacks (L3, L1): adversarial inputs consume context budget and push policy out. Mitigation: sealed policy regions that consume from a separate budget.

Context-as-exfiltration-channel (L3, L6): agents include context content in tool calls or external responses, leaking what should remain internal. Output filtering and content classification on outgoing data are necessary.

Design patterns

A robust context security architecture maintains a provenance graph for every context element, so any token can be traced to its source. It enforces segregation through structural delimiters and role-based channels. It performs input transformations on untrusted content. It preserves security invariants across summarization. It logs context snapshots at decision points for post-incident forensics.

Figure 6. Data, memory, and context governance showing how classification, retention, retrieval, and trust segmentation shape the model window.

Safe context depends on governed data and memory

R — Runtime

With autonomy, identity, data, and context in place, the system has the substrate to make decisions. Runtime is what watches those decisions as they happen and intervenes when they go wrong. Building runtime enforcement before the prior pillars are settled produces enforcement that has nothing meaningful to check against — runtime rules need attested intents to verify alignment, classified data to enforce restrictions, and trust-tagged context to evaluate.

Runtime as last line of defense

Pre-deployment evaluation is necessary but insufficient. No eval suite covers every adversarial input, every tool composition, every emergent interaction. Runtime controls are the last line of defense — for many threats, the only meaningful one. A prompt injection variant invented yesterday will not be in your eval set; runtime detection and enforcement is what catches it.

Runtime comprises three capabilities. Verification checks each step against policy and intent. Enforcement takes action when verification fails — block, redact, transform, escalate. Dynamic intervention updates behavior in flight without redeploying, isolates misbehaving agents, rolls back actions.

Architecture for runtime control

The reference architecture places an LLM gateway (sometimes called AI proxy or AI firewall) in front of every model invocation. Direct model API access from application code is disallowed; all calls flow through the gateway, which enforces authentication, applies content policies on input and output, performs PII detection and redaction, rate-limits, attaches cost accounting, and emits telemetry. This is the chokepoint that makes uniform policy enforcement possible across a heterogeneous fleet.

A complementary chokepoint sits at the tool invocation layer. Tools are called through a tool broker that validates each call against the agent's identity, the active intent token, and policy. The broker is where IBAC is enforced.

Sandboxing matters for any tool that executes generated code or processes untrusted data. Code execution belongs in a properly isolated sandbox: containers with strict resource limits, no outbound network access except through the broker, ephemeral filesystems, no access to the agent's credentials.

Engineering principles embodied

Runtime applies complete mediation and defense in depth. Complete mediation: every action passes through a verification point; there are no privileged paths. Defense in depth: multiple verification layers (schema validation, intent alignment, policy check, content filter, anomaly detection) operate independently, so the attacker must defeat all of them while the defender needs only one to catch the attack.

The structural property: an action that does not pass through the gateway and broker cannot reach its target. There is no out-of-band path to the model or to the tools.

MAESTRO threats relevant to Runtime

Maps primarily to L3 (Agent Frameworks) and L4 (Deployment), with L6 (Security & Compliance) for policy enforcement.

Tool misuse and unsafe tool calls (L3, L7): tool-call validation gates — schema validation, allowlisted tools/actions, parameter constraints. Schema validation on every call is the cheapest and most effective check.

Container escape from sandboxed code execution (L4): gVisor or Firecracker-style isolation; no host filesystem access; kernel-level resource limits; no outbound network except through broker.

Pipeline compromise (L4): signed artifacts, reproducible builds, deployment gate reviews, scoped agent RBAC, container signing.

Goal misalignment cascades (L3 → L7): an agent under prompt injection performs a tool call that returns a result further corrupting reasoning. Mitigation: intent re-verification at each call, output filtering on tool responses.

Rate limit and resource exhaustion (L3, L4): per-task and per-agent budgets, circuit breakers, timeout enforcement.

Verification, enforcement, and dynamic intervention

Output schema validation is the cheapest and most effective runtime check. If a tool call is supposed to produce structured output, validate that it does — agents under prompt injection often produce malformed responses, and refusing to proceed on schema violation interrupts many attacks. Action confirmation for high-risk operations requires either explicit human approval or a second model invocation with adversarial framing to challenge the proposed action.

Intent re-verification is the IBAC counterpart at runtime. Before any consequential action, the system re-derives whether the action falls within the declared intent, operating from the originally attested intent rather than from the agent's current reasoning (which may have been corrupted).

When verification fails, enforcement options include blocking, redacting, transforming, escalating, or quarantining. Dynamic intervention enables responses without redeployment — policy bundles hot-load, rate limits tighten in real time, specific tool capabilities are temporarily revoked across the fleet when a vulnerability is disclosed.

Action rollback is the most ambitious capability. Designing agent tools with reversibility in mind — soft-delete defaults, transactional staging, two-phase commit for high-stakes actions — preserves the option to undo when something goes wrong.

H — Human Oversight & Override

With autonomy, identity, data, context, and runtime in place, the agent system has its automated defenses. But automated defenses do not cover the cases where automated decisions should defer to humans — high-stakes irreversible actions, ambiguous edge cases, regulatory requirements for human review, or situations where the agent's confidence is low. Human Oversight & Override is the discipline that designs these intervention points into the architecture rather than leaving them to be added later as "escalations" without structural support.

Human Oversight is distinct from Autonomy. Autonomy defines what's permitted for the agent to do unilaterally; Human Oversight defines how humans stay in the loop within the agent's permitted scope, and how they override outside it. An agent can have broad autonomy within a domain while still being subject to human oversight at specific decision points, sampled audits, deadman switches, and abort signals.

Human Oversight is also distinct from Runtime. Runtime enforcement is automated — block, transform, escalate. Human Oversight is human-mediated — review, approve, override. Runtime escalates to Human Oversight when policy cannot auto-decide; Human Oversight provides the human-facing workflow that makes that escalation actionable.

Components of human oversight

A complete human oversight architecture includes several distinct components.

Pre-action approval gates require human consent before specific actions execute. These fire on high-stakes actions identified by policy: financial transactions above a threshold, communications to external parties, irreversible operations, actions affecting many users. The approval interface presents the agent's proposed action, the reasoning that led to it, the data the agent considered, and the policy reason an approval is required. Approval gates that fire too often produce approval fatigue; ones that fire too rarely catch nothing. Risk-based routing — batching low-risk approvals for asynchronous review, surfacing high-risk in real time — preserves the value of the gate.

Post-action review queues sample completed actions for human review. Not every action can be reviewed before it happens (latency and volume don't permit it), but every action can be subject to retrospective review with some probability. Sampling biases toward high-stakes actions, anomalous patterns, and actions from agents with recent reliability concerns. Reviewers can confirm acceptable behavior, flag drift, or trigger rollback for reversible actions.

Real-time override mechanisms — sometimes called "stop buttons" or "abort signals" — let an authorized human halt an agent's execution mid-task. The structural requirement is that the override signal must reach the agent reliably, take effect promptly, and leave the system in a coherent state. Long-running agent tasks should poll for abort signals at safe interruption points; transactional state should be designed so abort yields rollback rather than partial state.

Deadman switches pause agent fleets when expected signals are absent. If the platform team loses communication with the agent fleet for longer than a configured interval, agents should default to a safe state — pausing high-stakes actions, requiring re-attestation to resume — rather than continuing autonomously without oversight.

Escalation policies route specific situations to specific humans. A medical diagnosis agent escalates differential diagnoses to a physician; a financial advice agent escalates positions above a threshold to a licensed advisor; a customer service agent escalates complaints involving regulatory matters to a compliance officer. The policy is part of the architecture, not a runtime decision.

Override audit logs every human override with the human's identity, the reason given, the prior agent decision, and the override outcome. Overrides are legitimate but they create accountability burden; without audit, "the human overrode it" becomes an excuse that defeats the oversight function.

Engineering principles embodied

Human Oversight applies psychological acceptability and separation of privilege. Psychological acceptability is the principle that controls must be usable in practice, or operators will route around them. Approval interfaces must give reviewers enough context to make real decisions without being so verbose that reviewers skim. Override mechanisms must be discoverable and fast enough to use in a crisis. The structural design is what makes oversight psychologically acceptable — not training or exhortation.

Separation of privilege shows up in multi-party approval (catastrophic actions require two humans), in the distinction between agent operators and audit reviewers (the team running the agents is not the team reviewing them), and in the separation between override authority and execution authority (a person can stop the agent without being able to make it act differently).

The structural property: there is no agent action so urgent or so deeply embedded in automation that no human can intervene. Every action path crosses points where oversight can be exercised.

MAESTRO threats relevant to Human Oversight & Override

Maps primarily to L6 (Security & Compliance) for governance, L3 (Agent Frameworks) for implementation, and L5 (Evaluation & Observability) for audit.

Approval gate bypass (L3, L6): the agent finds a path that achieves the desired effect without crossing the approval gate. Mitigation: effect-level policy (approvals fire on outcomes, not specific tool calls) and composition analysis.

Approval fatigue / rubber-stamping (L5, L6): high-volume low-context approvals train reviewers to approve without reading. Mitigations: risk-based routing, decision quality monitoring (reviewers whose approval rate is suspiciously high get sampled for review themselves), and explicit attestation language that makes approvals harder to mass-approve.

Override misuse (L6, insider threat): a human with override authority uses it inappropriately. Mitigations: multi-party override for high-stakes cases, override audit, periodic review of override patterns.

Lack of accountability (L6): "the agent did it" becomes an excuse. Every consequential action has a traceable human accountability path — either an approving human, an overriding human, or a human who authorized the autonomy boundary the agent operated within.

Audit blind spots in human decisions (L5, L6): human approvals and overrides are not subject to the same observability as agent actions. Mitigations: structured logging of all human decisions in the same audit stream as agent actions.

Delegated approval to under-qualified reviewers (L6): approval workflows route to people who lack the context or expertise to evaluate the decision. Mitigations: explicit reviewer qualification policy, training requirements, and reviewer rotation to surface knowledge gaps.

Time-based attacks on approval windows (L3, L6): approval windows that auto-approve on timeout enable attackers who can stall human review (by triggering many simultaneous approvals to exhaust attention). Mitigations: timeout-defaults-to-deny rather than timeout-defaults-to-approve.

Regulatory grounding

Human Oversight is explicitly required by major AI governance regimes. EU AI Act Article 14 specifies that "high-risk AI systems shall be designed and developed in such a way... that they can be effectively overseen by natural persons during the period in which the AI system is in use." NIST AI Risk Management Framework includes human oversight as a Govern function requirement. ISO 42001 (the AI management system standard) requires defined human oversight roles. Sector-specific regimes — FDA AI/ML guidance, financial supervisory expectations, healthcare regulations — impose additional human oversight requirements.

Building human oversight as a first-class pillar rather than an afterthought converts these requirements from compliance burden into operational capability.

O — Observability

With the agent's actions, authorization, data handling, perception, runtime enforcement, and human intervention points in place, the system has its complete decision-making architecture. Observability is what makes that architecture inspectable, debuggable, and accountable. It comes after the decision-making pillars because what's worth observing depends on what the system is doing; placing observability earlier produces telemetry without context.

Why observability is non-negotiable

Agentic systems are non-deterministic, partially opaque, and operate across many components per task. The same input may produce different outputs on different runs. The reasoning path is not directly inspectable from model parameters. A single user request may fan out into dozens of LLM calls, tool invocations, and agent handoffs. Without comprehensive observability, debugging is impossible, security incident response is blind, cost attribution is guesswork, and compliance attestation has nothing to attest to.

Observability is the substrate on which every other pillar depends in operation. IBAC needs traces to verify intent alignment. Context integrity needs logs of what was in context at each decision. Runtime enforcement needs telemetry to detect anomalies. Evaluation needs production data to identify regressions. Scalability needs metrics to drive autoscaling and cost optimization. Human Oversight needs decision context to make review meaningful.

What to observe

The observable surface includes every LLM call (model, version, parameters, prompt or its hash, completion, tokens, latency, cost, upstream context); every tool invocation (name, arguments, response, latency, errors, policy decision); every agent handoff (source, target, task description, capability token issued, context transferred); every policy decision (rule, inputs, decision, policy version); every human approval, override, or escalation; authentication and authorization events at every layer; and token and cost accounting at user, tenant, and task granularity.

The OpenTelemetry semantic conventions for GenAI — gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, span types for inference, embedding, and tool calls — provide a starting vocabulary. Adopting these conventions early prevents lock-in and enables cross-stack tracing.

Engineering principles embodied

Observability applies open design and reliability engineering. Open design means that the system's behavior is inspectable by authorized parties — no hidden code paths, no actions that escape telemetry, no decisions that lack provenance. Reliability engineering means that telemetry itself must be reliable: logs must survive the failures they describe, traces must be reconstructible, and observability infrastructure must not fail open (which would hide failures).

The structural property: there is no action the agent can take, no decision the system can make, that does not leave a trace. Telemetry is produced by construction at every chokepoint, not as an explicit addition.

MAESTRO threats relevant to Observability

Observability is itself MAESTRO L5 (Evaluation and Observability). Telemetry collection touches all other layers.

Log tampering (L5, L6): an attacker who compromises an agent attempts to delete or modify audit trails. Mitigations: tamper-evident audit logs (write-once storage, signed entries, append-only ledgers), out-of-band shipping to a SIEM with separate access controls, pre-action commit so logs exist before action takes effect.

Observability gaps as attack targets (L5): adversaries probe for blind spots where actions are not logged. Mitigation: comprehensive instrumentation by default, explicit policy review for any opt-out.

PII leakage through logs (L5, L6, L2): full prompts contain PII; completions may contain inferences about individuals. Architectures must support per-tenant data residency, configurable redaction at ingestion with reversible tokenization for authorized investigation, and retention aligned with regulatory regimes.

Evaluation/observability inconsistency (L5): the agent behaves differently when it detects evaluation versus production. Mitigation: indistinguishable shadow evaluation in production.

Cost-as-DoS / cost anomalies (L5, L4): a runaway agent loop or adversary leveraging your agent for their own LLM workload can generate substantial bills in minutes. Cost anomaly detection is its own discipline and should fire faster than human notification cycles.

Tracing across the agent graph

Distributed tracing with a stable trace ID propagated through every hop — including through MCP servers, A2A handoffs, and asynchronous queues — is essential. Volume and shape are challenges. A single task can generate thousands of spans; sampling strategies that work for HTTP services discard exactly the spans needed for forensics. The pragmatic approach is full retention for security-relevant events and intelligent sampling elsewhere, with the ability to rehydrate full traces on demand.

Forensic replay — reconstructing exactly what an agent saw and produced for any past task — is invaluable for security incidents and customer disputes. Tamper-evident audit logs, real-time anomaly detection, and privacy-preserving redaction round out the mature state.

E — Eval, Environment, and Ecosystem

Why this comes eighth

The pillars above describe the agent system in isolation. Eval, Environment, and Ecosystem describes how it is validated, integrated into the surrounding IT environment, and connected to other agents and tools through emerging protocols. This comes after the core architecture because evaluation is meaningful only against a defined behavior, integration is meaningful only against a defined interface, and ecosystem participation is meaningful only against defined trust boundaries.

Evaluation as continuous engineering

The most damaging misconception about agentic AI security is that red-teaming is an event. A two-week red-team engagement before launch finds the issues those red-teamers think of; production attackers have years and motivations red-teamers don't share. Security evaluation must be continuous, automated, and integrated into the same CI/CD pipeline that ships application code.

The foundation is a golden dataset: a curated set of inputs covering the security and safety surface with expected behaviors. It includes known prompt injection variants, jailbreak attempts, edge cases from past bugs, representative legitimate inputs to guard against over-blocking, and inputs that probe specific policy boundaries. Every change to prompts, models, tools, or policies runs against this dataset; regressions block merges. Over time the dataset grows.

A robust eval harness measures task success rates, refusal rates, tool selection quality, cost per task, latency, and consistency. A model change that improves safety scores but tanks task success is a regression. Multi-dimensional evaluation prevents shipping changes that look good on one metric but are net-negative.

Automated red-teaming tools — Garak, PyRIT, and others — generate adversarial inputs at scale and should run on every release candidate. Their output feeds back into the golden dataset.

Production evaluation closes the loop. Shadow-mode evaluation runs a new version against production traffic without affecting users. Canary deployments expose a small fraction of traffic and watch for anomalies. Online metrics feed back into the eval suite. The boundary between testing and production becomes a gradient.

Environment: integration with existing IT

Agentic systems do not deploy onto a green field. They integrate into existing IT with established identity providers, secrets managers, network architectures, data classification regimes, compliance frameworks, and monitoring stacks. Treating the agent platform as a parallel universe creates security gaps at every seam.

The integration pattern that works: agent platform consumes identity from the enterprise IdP via OIDC/SAML with token exchange to derived agent credentials. Secrets are retrieved just-in-time from the enterprise secrets manager using workload identity. Network policy applies to agent workloads as to other services. Logs flow to the enterprise SIEM. Data classification labels propagate into agent context handling.

Compliance frameworks — SOC 2, ISO 27001, HIPAA, NIST AI RMF, EU AI Act — increasingly require demonstrable controls. The control surface required for compliance largely overlaps with security controls required for safety. Designing the platform to emit compliance evidence as a byproduct of normal operation saves enormous effort.

Ecosystem: MCP, A2A, and the connectivity attack surface

Two protocols are becoming the connective tissue of agentic systems: the Model Context Protocol (MCP) and Agent-to-Agent (A2A) protocols. Both expand capability and attack surface in equal measure.

MCP servers expose tool capabilities to agents. Every MCP server is a privileged endpoint. The server must authenticate calling agents — anonymous MCP is a non-starter for anything sensitive. It must enforce its own authorization. Tool descriptions returned by the server become part of the agent's context and can themselves be vectors for prompt injection. Response data is also a vector. The MCP ecosystem currently lacks a mature trust model for third-party tool catalogs; treating any third-party MCP server as high-trust without scrutiny is a serious risk.

Engineering principles embodied

Eval applies compositionality in its CI/CD form: changes are validated against the same invariants the system was originally designed to maintain. Each release composes against the previous release's evidence base. Environment integration applies least common mechanism — share enterprise infrastructure where possible rather than building parallel systems whose security properties drift. Ecosystem applies zero trust at the protocol boundary — every cross-organizational interaction is verified.

MAESTRO threats relevant to Eval/Environment/Ecosystem

Maps to L5 (Evaluation & Observability), L4 (Deployment), and L7 (Agent Ecosystem).

Evaluation bypass (L5, L1): inputs benign during testing, malicious in production. Mitigation: indistinguishable shadow evaluation.

Golden dataset poisoning (L2, L5): modified eval dataset hides regressions. Mitigation: signed datasets, version control, adversarial dataset generation.

Eval-production drift (L5): production differs from eval in ways that invalidate results. Mitigation: shadow-mode evaluation, canary deployments with rollback triggers.

Agent impersonation and collusion (L7): mutual authentication, capability attenuation, monitoring for unusual agent-to-agent patterns.

MCP server compromise / malicious MCP servers (L7, L3): MCP server vetting before connection, signed tool catalogs, treatment of all MCP-sourced content as untrusted.

Marketplace manipulation (L7): curated catalogs, code review for tool implementations, sandboxing, reputation/audit trails.

Supply chain compromise (cross-layer): signed artifacts, vulnerability scanning, internal mirrors of vetted dependencies.

A2A confused deputy (L7, L6): attenuated capability tokens across handoffs, audit logs at every handoff.

S — Scalability

Scalability is the final design pillar because optimizing for scale before the system's correctness and security properties are settled produces a system that scales the wrong things. A platform that autoscales an insecure architecture deploys more copies of the insecurity. A platform optimized for token cost before its trust model is solid replaces expensive frontier models with cheaper but less-vetted ones, often reducing security posture. Solve correctness and security first; scale is a refinement of an already-working system.

Scalability as a security and economics problem

Scalability means three different things that interact: handling more concurrent users, handling larger or more complex tasks, and doing both economically. Conventional autoscaling solves the first; the second and third require AI-specific architectural decisions. All three have security implications: a system that scales poorly is a denial-of-service target, and a system whose costs scale unpredictably is a denial-of-wallet target.

Compute autoscaling

The Kubernetes-native baseline involves horizontal pod autoscaling driven by request queue depth, in-flight task count, or token-per-second throughput rather than CPU alone. CPU is a poor signal because most time is spent waiting on upstream model APIs. KEDA enables event-driven scaling from message queues, the appropriate pattern for asynchronous agent workloads. Cluster autoscalers handle node-level capacity; GPU-backed inference (when running models locally) should scale to zero when idle.

Multi-region deployment serves both availability and latency. State replication (memory, vector stores, audit logs) must align with data residency requirements.

Provider rate limits are increasingly the binding constraint on agentic scale; negotiating committed throughput with providers and architecting around quota becomes a planning function.

Token cost as a first-class concern

LLM API costs scale with token volume, influenced by prompt size, response size, retry behavior, multi-agent fan-out, and reasoning length. A single user task in a multi-agent system can consume tens of thousands of tokens. Cost-aware architecture is not an optimization — for many deployments, it is the difference between a viable and unviable product.

Prompt caching reuses computation for shared prefixes; structuring prompts so the cacheable system block comes first dramatically reduces input token costs. Semantic caching returns cached responses for similar inputs. Prompt compression reduces tokens at some quality cost. Batched inference processes multiple requests in one call at lower per-request cost.

Multi-model architecture

Not every task needs the largest, most capable model. A multi-model architecture routes each task to the smallest, cheapest, fastest model that can do it well. The router is typically a small classifier or rules-based system. Simple classification and short summaries go to small models. Complex reasoning and high-stakes generation go to large frontier models. Specialized tasks go to domain-tuned models when volume justifies operational overhead.

Local models earn their place when latency, privacy, or cost dominates. Local inference removes round-trip latency, removes the need to send data outside the trust boundary, and replaces per-token cost with amortized compute cost. Tradeoffs are operational complexity, capability ceiling, and engineering cost.

Provider diversity provides resilience and pricing leverage. Architecting behind a provider abstraction so switching between Anthropic, OpenAI, Google, AWS Bedrock, or open-weight models is a configuration change protects against single-provider issues. Fallback chains formalize resilience: primary route, secondary on failure, tertiary for degraded service.

Engineering principles embodied

Scalability applies economy of mechanism and reliability engineering. Economy of mechanism: prefer simpler scaling architectures that can be reasoned about over complex ones whose failure modes are unpredictable. Reliability engineering: graceful degradation when components fail (fallback to a smaller model rather than failing the request) and visible failure modes (a degradation that the platform team sees rather than one that silently changes quality).

The structural property: scaling behavior is predictable, cost is bounded, and degradation modes are visible. There is no scaling decision that silently changes the system's security properties.

MAESTRO threats relevant to Scalability

Maps primarily to L4 (Deployment), with secondary impact on L1 and L3.

Denial-of-wallet attacks (L4, L3): adversary inputs trigger expensive operations. Mitigations: per-user budget caps, per-tenant cost ceilings, anomaly detection on cost rate, circuit breakers.

Provider abuse via the platform (L4, L7): platform's provider credentials leveraged by attacker. Mitigations: per-workload credential isolation, the same controls as DoW.

Routing manipulation (L3): adversarial inputs cause router to direct to less-defended or higher-capability model. Mitigation: router decisions logged and auditable; security policy enforced uniformly across all routes.

Cross-tenant interference (L4): noisy neighbors cause latency or availability degradation. Mitigations: resource isolation at namespace and node-pool level.

Supply chain risk in fallback chains (L1, L7): fallback to less-vetted provider carries different security properties. Mitigation: treat fallback paths as production paths for security review.

Latency engineering

Streaming returns tokens as generated. Speculative execution starts likely follow-up calls in parallel. Parallel tool calls collapse chains where the agent doesn't need results in order. Edge inference for routing and classification puts the first hop close to the user.

A mature platform routes through an edge classifier, dispatches to the right model tier on the right provider, with appropriate caching and parallelization, streaming back to the user, while emitting telemetry that lets the team observe and optimize. None of this is optional at scale; all of it is engineering work that compounds.

Figure 7. Assurance loop across evaluation, controlled release, observability, incident learning, and scale decisions.

Evaluation, observability, and scale form the assurance loop

MAESTRO: The Threat Modeling Companion

The MAESTRO framework — Multi-Agent Environment, Security, Threat, Risk, and Outcome — is the threat modeling discipline that ORCHIDEAS pairs with. Where ORCHIDEAS organizes the design space across nine pillars, MAESTRO organizes the threat space across seven architectural layers.

The seven layers are: L1 Foundation Models (LLMs and the core AI capabilities); L2 Data Operations (storage, processing, embeddings, training data, RAG corpora, agent memory); L3 Agent Frameworks (orchestration platforms like LangChain, AutoGen, APIs); L4 Deployment and Infrastructure (servers, containers, networks); L5 Evaluation and Observability (monitoring, debugging, telemetry); L6 Security and Compliance (vertical layer cutting across all others, providing governance, auditability, regulatory alignment); and L7 Agent Ecosystem (the environment where multiple agents interact, including MCP/A2A and external integrations).

MAESTRO's central insight: threats rarely exist in isolation within a single layer. Cross-layer dependencies create attack paths that span multiple layers, enabling sophisticated attacks that traditional single-layer security approaches cannot detect or prevent. An attack might begin with data poisoning at Layer 2, influence decision-making at Layer 3, and ultimately manifest as unauthorized actions at Layer 7. Attackers can also move laterally across systems within the same layer.

The pillar-to-layer mapping (summarized below) shows how each ORCHIDEAS pillar contributes defenses at specific MAESTRO layers. Pillars and layers are orthogonal but interlocking: every pillar touches multiple layers, every layer is touched by multiple pillars, and the cross-cutting Security & Compliance layer (L6) is in scope for every pillar.

Figure 8. ORCHIDEAS controls mapped across MAESTRO layers to show why defenses must cross architectural boundaries.

ORCHIDEAS controls mapped across MAESTRO layers

Cross-layer attack paths

Three example attack paths illustrate why pillar-by-pillar thinking is insufficient.

Path 1 — Data poisoning to ecosystem damage (L2 → L1 → L3 → L7). An attacker contributes a document to a shared knowledge base used for RAG. The document contains both legitimate content (passing review) and a prompt injection payload activating on specific query patterns. A user later asks a question triggering retrieval. The agent's context (L3) contains the injection. The model (L1) interprets it as instruction. The agent performs an unauthorized action visible in the ecosystem (L7). The defense crosses pillars: Data Governance (D) failed at ingestion; Context (C) failed at trust segregation; Runtime (R) failed at intent re-verification; Human Oversight (H) might have caught it at a sampled review; Observability (O) might have caught it at anomaly detection. Any single layer might catch it; the combination is what makes catching it reliable.

Path 2 — Tool ecosystem to credential theft (L7 → L3 → L4 → L6). An organization adopts a third-party MCP server. The server is compromised at the supplier. Tool descriptions contain instructions to exfiltrate environment variables. The agent (L3) follows the embedded instruction, reads environment variables (L4) where credentials are stored, and calls an external URL with the contents (L7), violating compliance posture (L6). Defenses combine ecosystem vetting (E), context handling (C), runtime egress controls (R), credential handling (I, no secrets in env vars), and observability (O).

Path 3 — Pipeline compromise to fleet behavior change (L4 → L3 → L7). An attacker compromises the deployment pipeline, inserting a modified system prompt that subtly lowers the bar for one action category. The change passes initial review as a routine update. Deployed, it increases the rate of unauthorized actions across the fleet. Defenses: signed prompts and configuration (E), pre-deployment evaluation (E), canary deployment with anomaly detection (S, O), dynamic rollback (R).

The recurring shape: any layer can be a beachhead, and attack value is realized at a different layer than the initial exploit. Threat modeling cannot stop at "the model gets jailbroken" or "the container gets escaped." The threat model must trace plausible attack chains from any layer through to consequential effects, and the ORCHIDEAS pillars must each contribute defenses at the layers they touch.

Secure by Construction: Synthesis

The phrase "secure by construction" captures the difference between a system whose security depends on runtime checks firing correctly and a system whose security is structural — where the architecture itself forecloses the attacks rather than relying on checks to catch them.

Several properties characterize a secure-by-construction agentic system, each emerging from the combination of pillars rather than from any single one.

Figure 9. Secure-by-construction invariants showing how structural guarantees emerge from combinations of ORCHIDEAS pillars.

Authority cannot expand. Capability tokens are attenuated at every handoff. An agent that delegates to another agent passes a strictly narrower capability, never a broader one. There is no path within the architecture by which an agent can gain authority it did not start with. This property emerges from Identity & Intent (capability-based security), Autonomy (boundary policy), and Runtime (the broker that mediates tool calls).

Trust does not leak. Content from untrusted sources is tagged at ingestion and the tags follow the content through every transformation. There is no path by which untrusted content becomes interpreted as instruction without crossing an explicit trust-elevation boundary. This property emerges from Data Governance (classification at ingestion), Context (trust tagging in the runtime window), and Runtime (output and input filters).

Actions cannot bypass mediation. Every model call flows through the gateway; every tool call flows through the broker. There is no privileged path that skips the check. This property emerges from Runtime (the chokepoint pattern) and is reinforced by Observability (any action without telemetry is detected as a violation).

Intent binds to action. Every consequential action references the originally attested intent, and authorization is re-derived from that intent rather than from the agent's current reasoning. Prompt injection that mutates the agent's working understanding cannot change what the authorization layer sees. This property emerges from Identity & Intent (IBAC), Runtime (intent re-verification at decision points), and Context (segregation of policy from manipulable content).

Humans can always intervene. No action is so urgent or so deeply automated that override is impossible. Every action path crosses points where oversight can be exercised. This property emerges from Human Oversight (intervention points by design), Autonomy (boundaries that route to humans), and Observability (humans can see what's happening).

Data classification cannot be lost. Any derived data inherits the classification of its inputs. There is no transformation that produces unclassified output from classified input by default. This property emerges from Data Governance (classification propagation), Context (segregation by trust level), and Observability (logs inherit data classifications).

The threat model survives change. Adding a new model provider, a new MCP server, a new tool, or a new agent type passes through the same architectural commitments. The new component must conform to the existing identity, data, context, and runtime contracts. Components that cannot conform are not adopted. This property emerges from Eval/Environment/Ecosystem (the validation gate) and from the compositional discipline running through the whole framework.

Together, these properties define what it means for an agentic AI system to be secure by construction. Each is achievable; none is sufficient alone. The framework is the discipline of achieving all of them simultaneously through structural choices rather than vigilance.

Adoption Roadmap

A pragmatic adoption sequence for a team starting from scratch follows the design order of the pillars.

Phase 1 — Foundations. Establish explicit autonomy levels (A) for the first agent use case. Stand up workload identity (I) for the agent platform and integrate with the enterprise IdP. Build the data classification baseline (D) and inventory the data sources the agent will touch.

Phase 2 — Substrate. Deploy the LLM gateway and tool broker as the runtime chokepoints (R). Build the context assembly pipeline with trust tagging (C). Implement the basic policy enforcement at the gateway and broker. Wire IBAC into the policy layer (I + R together).

Phase 3 — Oversight and visibility. Instrument comprehensive telemetry (O) at every chokepoint. Design the human oversight workflows for the first use case (H): approval gates for high-stakes actions, override mechanisms, deadman switches. Build the audit pipeline.

Phase 4 — Validation and integration. Build the golden dataset and evaluation harness (E). Wire it into CI/CD with regression gates. Integrate with enterprise SIEM, secrets management, and compliance reporting. Conduct first formal MAESTRO threat modeling pass; remediate findings.

Phase 5 — Scale. Build out multi-model routing and cost optimization (S). Add provider diversity. Adopt MCP and A2A protocols with appropriate trust scrutiny. Continuous evaluation and continuous MAESTRO threat modeling become the ongoing engineering discipline.

The phases overlap, and small teams may compress them. The order matters more than the calendar: skipping pillars early forces rework later. A team that builds runtime enforcement before identity is settled produces enforcement that has nothing meaningful to check; one that builds observability before the decision-making pillars are in place produces telemetry without context; one that scales before the security properties are structural deploys insecurity at scale.

Teams that succeed treat ORCHIDEAS as a continuous engineering discipline rather than a checklist. Threats evolve, protocols mature, models change, costs shift. The platform that survives is the one whose architecture admits change in each dimension without requiring rewrites elsewhere — where adding a new model provider is a config change, adopting a new MCP server passes through a standard review, a new class of prompt injection becomes a new golden dataset entry and a new gateway filter, a new data residency requirement maps to existing classification infrastructure, a new compliance requirement maps to existing audit infrastructure, and a new human oversight obligation maps to existing approval workflows. That is the bar for production agentic AI, and the combination of ORCHIDEAS and MAESTRO is the framework that makes the bar reachable.

Zero Trust Identity and Access Management Data Security Compliance Risk Management Artificial Intelligence DevSecOps