
Leveling Up Autonomy in Agentic AI

Published 01/28/2026

Written by Jim Reavis, Co-founder and Chief Executive Officer, CSA.

The conversation around artificial intelligence has shifted dramatically over the past two years. We've moved from debating whether AI can write a decent email to grappling with AI systems that can autonomously execute code, manage infrastructure, conduct financial transactions, and orchestrate complex multi-step operations with minimal human involvement. This isn't a future scenario. It's happening now in enterprises around the world.

As I've watched this evolution unfold, a question has been nagging at me: Do we have the right conceptual frameworks to think about AI autonomy in a structured way? More specifically, do we need something analogous to what other industries have developed when confronting autonomous systems: a clear taxonomy of autonomy levels that helps organizations understand what they're deploying, what controls are appropriate, and how to govern these systems responsibly?

I've been sketching out some ideas on this front, drawing inspiration from established frameworks in other domains, and I want to share my thinking with the CSA community. I'm genuinely uncertain whether this is worth developing further or whether existing approaches are sufficient. So consider this an invitation to a conversation rather than a finished proposal.

 

Lessons from Other Autonomous Domains

When I started thinking about this problem, I naturally looked at how other industries have tackled the challenge of defining and governing autonomy. Two examples stand out as particularly instructive.

The Society of Automotive Engineers' J3016 standard has become the lingua franca for discussing vehicle automation. Its six levels, from Level 0 (no automation) through Level 5 (full automation), provide a shared vocabulary that regulators, manufacturers, insurers, and consumers all understand. When someone says a vehicle has "Level 2 automation," there's a common understanding of what that means: the vehicle can control steering and acceleration/deceleration, but the human driver must remain engaged and monitor the environment at all times. This shared vocabulary has proven invaluable for setting expectations, defining liability, establishing insurance frameworks, and creating appropriate regulations.

Similarly, the unmanned aircraft systems (UAS) community has developed frameworks for categorizing drone autonomy and the associated operational requirements. These frameworks consider factors like the complexity of the mission, the environment in which the system operates, and the potential consequences of failure. Higher autonomy in more complex or populated environments triggers more stringent requirements for certification, operator training, and operational controls.

What strikes me about both examples is that they create a foundation for proportionate governance, rather than just classifying systems. The controls, oversight requirements, and accountability mechanisms scale with the autonomy level. A Level 2 vehicle doesn't need the same validation rigor as a Level 4 vehicle, and a drone operating autonomously in a populated urban area faces different requirements than one under direct human control in a rural setting.

 

The Agentic AI Gap

Now consider where we are with agentic AI. Organizations are deploying AI systems with wildly different levels of autonomy, but we lack a common vocabulary to describe these differences. An "AI agent" might mean a chatbot that suggests responses for human review, or it might mean a system that autonomously executes code changes in production environments, manages cloud infrastructure, or conducts financial transactions, all without human approval for individual actions.

This vocabulary gap creates real problems. How do you assess risk when you can't clearly categorize what you're assessing? How do you define appropriate controls when there's no shared understanding of what autonomy level you're trying to govern? How do you communicate with boards, regulators, or insurers about your AI deployments when there's no common framework for describing what these systems actually do?

The picture of current practice is concerning. Based on conversations with practitioners and various industry surveys, I estimate that the majority of organizations deploying agentic AI have no formal classification system for autonomy levels, make autonomy decisions on an ad hoc basis, lack technical enforcement of autonomy boundaries, and have unclear or nonexistent policies governing AI autonomy. This isn't a criticism of these organizations. They're working with the tools and frameworks available. But it does suggest we have a gap that needs filling.

 

A Proposed Autonomy Level Taxonomy

With that context, let me share the framework I've been developing. I'm proposing six autonomy levels for agentic AI, deliberately echoing the structure of SAE J3016 while adapting it to the unique characteristics of AI systems.

 

Level 0: No Autonomy (Human Execution)

At this level, the AI system provides information, analysis, or recommendations, but humans perform all actions. Think of a chatbot that answers questions or an analytics tool that presents insights. The AI cannot execute any actions in the world; it can only inform human decision-making.

The risk profile here is minimal because the AI has no action capability. Controls focus primarily on output quality and preventing information leakage. This is where most traditional AI systems have operated, and it's the appropriate level for many use cases where AI adds value through insight rather than action.

 

Level 1: Assisted (Human Decision + AI Execution)

At Level 1, the AI can execute actions, but each action requires explicit human approval before execution. The AI proposes what it wants to do, a human reviews the proposal, and only upon approval does the AI proceed.

Consider an AI coding assistant that can write and run code, but presents each proposed action to the developer for review before execution. Or an AI that can send emails on your behalf, but shows you each draft and waits for your explicit approval before sending. The human remains the decision-maker for every action; the AI handles execution.

The control model here is straightforward: an approval gate between intention and action. Every proposed action enters a queue, a human reviews and approves (or rejects or modifies), and only then does execution occur. This creates a clear accountability chain and provides a natural checkpoint for catching errors or inappropriate actions.
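To make the approval gate concrete, here is a minimal sketch in Python of what the pattern might look like. The names (ProposedAction, ApprovalGate, review) and structure are my own illustrative assumptions rather than any particular product's API; the point is simply that the AI can only enqueue proposals, and nothing executes until a human records an explicit decision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Decision(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ProposedAction:
    """An action the AI wants to take, described before it is executed."""
    description: str
    execute: Callable[[], None]            # the actual side effect, deferred
    decision: Optional[Decision] = None


class ApprovalGate:
    """Level 1: every proposed action waits for explicit human approval."""

    def __init__(self) -> None:
        self.queue: List[ProposedAction] = []
        self.audit_log: List[str] = []

    def propose(self, action: ProposedAction) -> None:
        # The AI can only enqueue proposals; it cannot execute directly.
        self.queue.append(action)

    def review(self, action: ProposedAction, approved: bool) -> None:
        # A human records an explicit decision for each individual action.
        action.decision = Decision.APPROVED if approved else Decision.REJECTED
        self.audit_log.append(f"{action.description}: {action.decision.value}")
        if action.decision is Decision.APPROVED:
            action.execute()


# Usage: the AI proposes, the human decides, and only then does execution occur.
gate = ApprovalGate()
gate.propose(ProposedAction("send weekly status email",
                            execute=lambda: print("email sent")))
gate.review(gate.queue[0], approved=True)
```

The audit log is what gives you the accountability chain: every approval or rejection is attributable to a human decision.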

 

Level 2: Supervised (Human Approval + Batch Execution)

Level 2 introduces plan-level approval. Instead of approving each individual action, humans review and approve a plan or batch of actions, and the AI then executes autonomously within that approved scope.

Imagine approving an AI's plan to refactor a codebase, after which the AI executes the dozens of individual changes required without seeking approval for each one. Or approving a workflow for the AI to process a set of customer requests, then letting it work through the queue independently.

This level represents a meaningful increase in autonomy and efficiency. The human still provides authorization, but at a higher level of abstraction. Controls must ensure that the AI stays within the approved plan scope, that humans can monitor execution progress, and that there are mechanisms to pause or cancel if something goes wrong. Checkpoint-based rollback capabilities become important here.
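Here is a rough sketch of what plan-level approval might look like, under the same caveat that the structure and names are illustrative assumptions. The plan is authorized once, each step is checked against the approved scope during execution, completed steps are recorded as a simple checkpoint, and a human can pause the batch at any time.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class PlanStep:
    description: str
    target: str                         # e.g. a file, queue item, or resource
    execute: Callable[[], None]


class SupervisedPlanRunner:
    """Level 2: a human approves the plan once; the AI then executes the
    batch autonomously, but only against targets inside the approved scope."""

    def __init__(self, steps: List[PlanStep], approved_scope: Set[str]) -> None:
        self.steps = steps
        self.approved_scope = approved_scope
        self.plan_approved = False
        self.paused = False
        self.completed: List[str] = []  # simple checkpoint of finished steps

    def approve_plan(self) -> None:
        # Human authorization happens once, at the plan level of abstraction.
        self.plan_approved = True

    def pause(self) -> None:
        # Humans monitoring progress can halt the batch at any time.
        self.paused = True

    def run(self) -> None:
        if not self.plan_approved:
            raise PermissionError("plan has not been approved")
        for step in self.steps:
            if self.paused:
                break
            if step.target not in self.approved_scope:
                # Scope drift: stop rather than silently exceed the approval.
                raise PermissionError(f"{step.target} is outside the approved scope")
            step.execute()
            self.completed.append(step.description)


# Usage: approve the refactoring plan, then let the batch run within its scope.
runner = SupervisedPlanRunner(
    steps=[PlanStep("rename helper module", "repo/utils.py",
                    execute=lambda: print("renamed"))],
    approved_scope={"repo/utils.py"},
)
runner.approve_plan()
runner.run()
```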

 

Level 3: Conditional (AI Decision within Boundaries)

This is where things get interesting and where I think many organizations are heading without necessarily realizing it. At Level 3, the AI makes decisions and takes actions autonomously within defined boundaries, escalating to humans only when it encounters situations that exceed those boundaries.

The boundaries might be defined by action type (the AI can modify configuration files but not delete them), by scope (the AI can operate on development systems but not production), by value (the AI can approve expenses up to $1,000 but must escalate larger amounts), or by various other parameters. Within these boundaries, the AI operates independently. When it encounters a situation at or beyond the boundaries, it pauses and requests human guidance.

This level requires a fundamental shift in how we think about control. Instead of approving actions or plans, humans define the boundaries within which the AI may operate. The critical requirements become: machine-readable boundary definitions that the AI can evaluate against, technical enforcement mechanisms that actually prevent boundary violations (not just detect them after the fact), robust escalation workflows for handling boundary cases, and comprehensive logging of the AI's autonomous decisions.
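As a minimal sketch of how machine-readable boundaries, enforcement, and escalation could fit together, consider the following. The boundary fields (action type, environment, amount) mirror the examples above, while the class and method names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Boundary:
    """A machine-readable boundary the AI is evaluated against before acting."""
    allowed_action_types: Set[str]      # e.g. {"modify_config"}
    allowed_environments: Set[str]      # e.g. {"dev", "staging"}
    max_amount: float                   # e.g. a spend limit in dollars


@dataclass
class Action:
    action_type: str
    environment: str
    amount: float = 0.0


class BoundaryEnforcer:
    """Level 3: the AI acts autonomously inside the boundary and escalates
    anything at or beyond it. Violations are blocked, not merely detected."""

    def __init__(self, boundary: Boundary) -> None:
        self.boundary = boundary
        self.escalation_queue: List[Action] = []
        self.decision_log: List[str] = []   # every autonomous decision is logged

    def authorize(self, action: Action) -> bool:
        within = (
            action.action_type in self.boundary.allowed_action_types
            and action.environment in self.boundary.allowed_environments
            and action.amount <= self.boundary.max_amount
        )
        if within:
            self.decision_log.append(f"autonomous: {action}")
            return True
        # Outside the boundary: pause the action and request human guidance.
        self.escalation_queue.append(action)
        self.decision_log.append(f"escalated: {action}")
        return False


# Usage: config changes in dev proceed autonomously; production changes escalate.
enforcer = BoundaryEnforcer(Boundary({"modify_config"}, {"dev"}, max_amount=1000))
enforcer.authorize(Action("modify_config", "dev"))          # permitted
enforcer.authorize(Action("modify_config", "production"))   # escalated to a human
```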

The risk profile at Level 3 depends heavily on how boundaries are defined. Narrow, well-chosen boundaries with strong enforcement can make Level 3 quite safe. Poorly defined or weakly enforced boundaries can create significant exposure.

 

Level 4: High Autonomy (Minimal Supervision)

At Level 4, the AI operates autonomously across a broad scope, with human involvement shifting from decision approval to monitoring and exception handling. Humans don't approve actions or even define narrow boundaries. They watch for anomalies and intervene when something appears wrong.

This level might be appropriate for AI systems managing routine operational tasks at scale, where human approval for individual actions or plans would be impractical. Think of an AI conducting continuous security monitoring and auto-remediation, or managing dynamic resource scaling across cloud infrastructure.

The control requirements at Level 4 are substantial: continuous monitoring capabilities (potentially 24/7), sophisticated anomaly detection to identify when the AI is behaving unexpectedly, immediate kill-switch capabilities with rapid response times, comprehensive logging and attribution for all actions, and executive-level authorization and risk acceptance before deployment. This level should require explicit sign-off from senior leadership who understand and accept the risks involved.
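As one simplified illustration of the monitoring and kill-switch requirements, the sketch below flags a metric that drifts far from its recent baseline and halts the agent. The z-score check and thresholds are illustrative assumptions; real deployments would use far more sophisticated anomaly detection.

```python
import statistics
from collections import deque
from typing import Deque


class Level4Monitor:
    """Level 4: humans don't approve actions; they watch for anomalies and can
    halt the system immediately. This sketch flags a metric that drifts far
    from its recent baseline and trips a kill switch."""

    def __init__(self, window: int = 50, z_threshold: float = 4.0) -> None:
        self.history: Deque[float] = deque(maxlen=window)
        self.z_threshold = z_threshold
        self.killed = False

    def record(self, value: float) -> None:
        if len(self.history) >= 10:                 # wait for a baseline first
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            if abs(value - mean) / stdev > self.z_threshold:
                self.kill(f"anomalous metric value {value:.2f}")
        self.history.append(value)

    def kill(self, reason: str) -> None:
        # In a real deployment this would revoke credentials and stop the agent.
        self.killed = True
        print(f"KILL SWITCH TRIPPED: {reason}")

    def allowed_to_act(self) -> bool:
        return not self.killed
```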

 

Level 5: Full Autonomy (Self-Directed)

Level 5 represents full autonomy: the AI sets its own goals, potentially modifies its own behavior, and operates with only strategic oversight from humans.

I want to be direct about this: I don't believe Level 5 is appropriate for enterprise deployment today. I'm including it in the taxonomy for completeness and to clearly mark the end of the spectrum, but the control mechanisms required to safely operate at this level don't exist yet. Any organization considering Level 5 autonomy should treat it as a research question, not a deployment option.

[Infographic: Levels of Agentic AI Autonomy]

 

Connecting Autonomy to Capability

One thing I've learned from working on the Capabilities-Based Risk Assessment (CBRA) framework is that autonomy alone doesn't determine risk. Capability matters enormously. An AI with Level 3 autonomy over low-risk, easily reversible actions presents a very different risk profile than an AI with Level 3 autonomy over financial transactions or code execution in production systems.

This suggests we need something like a capability-control matrix that maps the intersection of autonomy levels and capability types to appropriate controls. For instance, financial transaction capabilities might warrant additional controls at any autonomy level: per-transaction approval at Level 1, transaction batch limits at Level 2, amount and vendor boundaries at Level 3, and so on. Code execution capabilities might require sandboxing, resource limits, and output verification regardless of autonomy level.

Certain capability-autonomy combinations might simply be inadvisable. Should any AI have Level 4 autonomy for financial transactions? For actions affecting physical systems? For operations that could cause irreversible harm? These are questions each organization must answer based on their risk tolerance, but having a framework that surfaces these questions is valuable.
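To show the shape such a capability-control matrix might take, here is a small sketch keyed on (capability, autonomy level) pairs. The specific controls and the prohibited combination reflect the examples above, but the entries themselves are illustrative assumptions, not a published CSA mapping.

```python
# A sketch of a capability-control matrix: required controls (and outright
# prohibitions) are keyed on the pair (capability, autonomy level). The
# entries below are illustrative assumptions, not a published mapping.

DISALLOWED = object()   # sentinel for combinations the organization rules out

CAPABILITY_CONTROL_MATRIX = {
    ("financial_transaction", 1): ["per-transaction approval"],
    ("financial_transaction", 2): ["transaction batch limits"],
    ("financial_transaction", 3): ["amount and vendor boundaries"],
    ("financial_transaction", 4): DISALLOWED,
    ("code_execution", 1): ["sandboxing", "resource limits", "output verification"],
    ("code_execution", 3): ["sandboxing", "resource limits", "output verification",
                            "environment boundaries (no production)"],
}


def required_controls(capability: str, autonomy_level: int) -> list:
    entry = CAPABILITY_CONTROL_MATRIX.get((capability, autonomy_level))
    if entry is DISALLOWED:
        raise PermissionError(
            f"{capability} at autonomy level {autonomy_level} is not permitted")
    if entry is None:
        raise LookupError("combination not yet assessed; default to deny")
    return entry


print(required_controls("financial_transaction", 3))
```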

 

Integration with Existing CSA Frameworks

If this autonomy framework has value, it needs to work in concert with the frameworks and tools CSA has already developed.

The AI Controls Matrix (AICM) provides an extensive catalog of security and governance controls across 18 domains. An autonomy level framework could help organizations determine which AICM controls are most relevant for their deployments. Lower autonomy levels might require a baseline set of controls focused on fundamental security and data protection. Higher autonomy levels would progressively require more controls from additional domains: model security, monitoring and logging, incident response, business continuity, and so on.

The CBRA framework already addresses the autonomy dimension as one of four factors in its risk scoring model (alongside system criticality, access permissions, and impact radius). A formal autonomy level taxonomy could make the autonomy assessment more structured and consistent. Instead of subjectively rating autonomy on a scale, assessors could classify the system into a defined level based on clear criteria.

We're also developing an AI Security Maturity Model that addresses organizational capabilities for AI security. That work focuses on organizational maturity: Do you have the policies, processes, skills, and tools to manage AI securely? The autonomy framework I'm describing here focuses on individual systems: How much autonomy does this specific AI system have, and what controls are appropriate for that level? The two are complementary. Organizational maturity determines whether you can safely deploy and govern systems at various autonomy levels.

 

Governance Implications

If organizations adopt an autonomy level taxonomy, it has significant implications for governance structures and processes.

Different autonomy levels should require different authorization authority. Deploying a Level 1 system might need only business owner approval. Level 2 might require manager and security team sign-off. Level 3 might need a formal review board. Level 4 should require executive authorization and documented risk acceptance. This creates a natural forcing function: if you want higher autonomy, you need to make the case to increasingly senior stakeholders.

Review cadence should also scale with autonomy. A Level 1 system might need only annual review. A Level 4 system might warrant weekly review to ensure that the monitoring is effective, that no anomalies have been missed, and that the risk acceptance remains appropriate given any changes in the threat landscape or operational context.

There's also the question of dynamic autonomy adjustment. Should autonomy levels be static, or should they adjust based on context? An AI might normally operate at Level 3, but automatically drop to Level 1 if anomalies are detected, if a security incident is underway, or if it's operating in an unusually sensitive context. This adds complexity but could provide an important safety mechanism.
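A simple sketch of what dynamic adjustment might look like: the system runs at Level 3 by default and drops to Level 1 while any risk trigger is active. The trigger names and levels are illustrative assumptions.

```python
class DynamicAutonomy:
    """Context-dependent autonomy: the system runs at Level 3 by default and
    drops to Level 1 while any risk trigger is active."""

    NORMAL_LEVEL = 3
    DEGRADED_LEVEL = 1

    def __init__(self) -> None:
        self.active_triggers: set = set()

    def report(self, trigger: str) -> None:
        # e.g. "anomaly_detected", "security_incident", "sensitive_context"
        self.active_triggers.add(trigger)

    def clear(self, trigger: str) -> None:
        self.active_triggers.discard(trigger)

    @property
    def current_level(self) -> int:
        return self.DEGRADED_LEVEL if self.active_triggers else self.NORMAL_LEVEL


autonomy = DynamicAutonomy()
print(autonomy.current_level)           # 3 in normal operation
autonomy.report("security_incident")
print(autonomy.current_level)           # 1 while the incident is active
```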

 

Technical Enforcement

A taxonomy is only useful if it can be operationalized. This means technical architectures that actually enforce the autonomy boundaries, not just policies that describe them.

At Level 1, this might be an approval gate: a technical component that intercepts all proposed actions, presents them for human review, and only forwards approved actions for execution. At Level 3, it might be a boundary enforcement layer that evaluates each proposed action against machine-readable boundary definitions and either permits execution, blocks and logs the attempt, or routes to an escalation queue.

The principle I keep coming back to is that autonomy boundaries must be technically enforced, not just policy-documented. A policy saying "this AI should only modify development systems" is meaningless if the AI technically has access to production and there's no mechanism preventing it from acting there. Technical enforcement is what makes autonomy levels real.

 

Open Questions

As I've developed this framework, several questions remain unresolved in my mind:

Granularity: Is a six-level taxonomy the right granularity? SAE J3016 uses six levels, but perhaps AI autonomy is different enough to warrant more or fewer distinctions. Could we accomplish the same goals with three levels (human-controlled, bounded autonomy, broad autonomy)?

Dynamic classification: Should a single AI system have one autonomy level, or should it vary by capability or context? A system might warrant Level 3 autonomy for some actions and Level 1 for others. How do we handle this without making the framework unwieldy?

Measurement and verification: How do we verify that a system is actually operating at its claimed autonomy level? What would an audit of autonomy level look like? What evidence would demonstrate compliance?

Industry variation: Should the framework be universal, or do different industries need different taxonomies? A healthcare AI and a financial services AI face different regulatory contexts and risk profiles, but there's value in a common vocabulary across industries.

Regulatory alignment: How does this framework map to emerging AI regulations? The EU AI Act takes a risk-based approach that aligns conceptually with this thinking, but the specific levels and requirements might not map cleanly.

 

An Invitation to Collaborate

I've shared this framework not because I think it's finished, but because I think the underlying problem is important and I want to hear from the community.

Do you see value in a structured autonomy level taxonomy for agentic AI? Would this help your organization think more clearly about AI governance? Are there aspects of this framework that seem wrong or missing? Are there existing frameworks I should be learning from?

If there's sufficient interest, I'd like to propose this as a topic for development within CSA's research community. We could refine the taxonomy, develop more detailed control mappings, create assessment methodologies, and potentially integrate this with our existing frameworks like AICM and CBRA.

But first, I want to know whether this resonates with practitioners who are actually grappling with agentic AI in their organizations. The best frameworks are developed collaboratively, grounded in real-world needs, and refined through practical application. That process starts with a conversation.

What do you think? I'd genuinely like to hear your perspective, whether you think this is a valuable direction, a solution in search of a problem, or something that needs significant revision before it would be useful. Reach out to [email protected] and let me know.
