
Governing AI Agents When They Write Your Code: A Framework for Bounded Autonomy

Most governance frameworks were built for tools humans operate. Autonomous AI agents need something different. Here is a practical framework for bounded autonomy.

Antonio J. del Águila

Knaisoma

Seventy-five percent of enterprises plan to deploy AI agents by the end of 2026. Ninety-three percent of mid-market companies have no governance framework specific to agentic systems. That gap is not an oversight; it is a structural failure to recognise that autonomous agents are categorically different from the AI tools most organisations have spent the past three years learning to manage.

When a developer uses GitHub Copilot to autocomplete a function, a human reviews the suggestion before it lands in the codebase. When an AI agent is handed a task (“resolve this bug and open a pull request”), it reads files, modifies code, runs tests, commits, and pushes, making dozens of decisions along the way with no human in the loop at each step. The Anthropic 2026 Agentic Coding Trends Report captures this shift precisely: task horizons are expanding from minutes to days and weeks, and the scope of what an agent can touch in a single autonomous run is growing proportionally. Governance frameworks built for human-operated tools simply do not cover this surface. They were never designed to.

The governance gap

Traditional application security assumes a human actor at the keyboard. Policies govern what people can do: who can merge to main, who can access production credentials, who can invoke a cloud API that spins up infrastructure. Those policies are enforced through identity systems, access controls, and audit trails. They work because the unit of action is a human decision.

AI agents break this model in two ways. First, an agent’s decision-making is opaque in a way that human decision-making is not. You can ask a developer why they made a call; you cannot always reconstruct why an agent chose a particular path through a task. Second, an agent’s blast radius can dwarf anything a human could accomplish in the same time window. A developer with production credentials who went rogue could cause significant damage. An agent with the same credentials, running at machine speed across thousands of files, could cause catastrophic damage before any human noticed something was wrong.

The governance models that have served us well (IAM policies, RBAC, change control processes) are necessary but not sufficient. They need to be extended with controls designed specifically for autonomous actors: permission boundaries that scope what an agent can touch, decision gates where human approval is required before irreversible actions, runtime monitoring that can detect when an agent’s behaviour departs from expected patterns, and audit infrastructure built to capture agent decisions at the granularity needed for forensics.

What OWASP tells us about the real attack surface

In early 2026, OWASP published the Top 10 for Agentic Applications, the first formal attempt to map the security risks specific to autonomous AI systems. It is a document worth reading in full, but four risks stand out as immediately relevant to engineering teams.

Agent Goal Hijack sits at the top of the list. An agent operating on natural-language instructions can be redirected by injected content: a comment in a file the agent reads, a fabricated tool response, a malicious prompt embedded in data the agent processes as part of its task. Unlike SQL injection or XSS, prompt injection attacks target the agent’s reasoning process rather than a code execution path, which makes them harder to detect and harder to filter. The attack surface is every document, file, or API response the agent is permitted to read.
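There is no reliable pattern-level defence against goal hijack, but teams often layer a heuristic screen over content before an agent ingests it, as one signal among several. A minimal sketch of that idea; the pattern list and function name are illustrative assumptions, not a real filter, and pattern matching alone cannot catch injection that targets the model's reasoning:

```python
import re

# Hypothetical heuristic screen for content an agent is about to read.
# Illustrative only: a static pattern list catches crude injections at best.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"disregard .* and instead",
    r"system prompt",
]

def flag_suspect_content(text: str) -> list:
    """Return the patterns matched in text, for alerting before the agent reads it."""
    lowered = text.lower()
    return [p for p in INJECTION_PATTERNS if re.search(p, lowered)]
```

A screen like this belongs in front of every read path the agent has: file contents, tool responses, and API payloads alike, since each is part of the attack surface.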

Rogue Agents describes the risk of an agent that pursues its assigned goal through means outside any sanctioned boundary, not because of adversarial input, but because the agent’s optimisation path led it somewhere the designers did not anticipate. This is the principal-agent problem dressed in new clothes. An agent told to “reduce CI build times” might disable test suites. An agent told to “fix the failing tests” might delete the assertions. Neither action is the result of a security failure in the traditional sense. Both are the result of an autonomous system operating without sufficient constraint.

Excessive Agency captures the structural risk that most teams are already creating without realising it. When an agent is granted broad access (read/write across the entire codebase, permission to invoke deployment scripts, access to production secrets), it carries a blast radius that is larger than the task requires. The principle of least agency, directly analogous to least privilege, holds that an agent should be granted only the permissions needed to accomplish its current task, scoped to the context of that task and no further.

Human-Agent Trust Exploitation is subtler and more insidious. As agents become normalised in engineering workflows, humans begin to treat their output with less scrutiny. A PR opened by an AI agent gets reviewed as quickly as a PR from a trusted colleague. A commit message written by an agent is read at face value. Attackers who can influence an agent’s output, through goal hijack or through compromise of the data it reads, can leverage this reduced scrutiny to land malicious changes that a more sceptical review would catch. The attack surface is not just the agent; it is the human trust the agent has inherited.

A governance architecture for engineering teams

The principle of least agency provides the right mental model, but it requires operational structure to be useful. We have developed a four-layer governance architecture that we apply when helping engineering teams deploy agents responsibly.

```mermaid
flowchart TD
    subgraph Agents["AI Agents"]
        A1["Coding Agent"]
        A2["Review Agent"]
        A3["Deploy Agent"]
    end

    subgraph L1["Layer 1: Permission Boundaries"]
        B["Dynamic least-privilege\nscoped to task context\nRead/write ACLs, tool allow-lists\nSecret access restrictions"]
    end

    subgraph L2["Layer 2: Decision Gates"]
        C{"Requires human\napproval?"}
        C1["Low blast radius\nReversible action\n→ Proceed autonomously"]
        C2["High blast radius\nIrreversible action\n→ Human checkpoint required"]
    end

    subgraph L3["Layer 3: Behavioral Monitoring"]
        D["Runtime observation\nDrift detection\nAnomaly alerting\nGoal adherence checks"]
    end

    subgraph L4["Layer 4: Audit Trail"]
        E["Immutable log:\nEvery tool invocation\nEvery decision point\nEvery external call\nOutcome tracking"]
    end

    Human["Engineering Team\n(Approvers & Reviewers)"]

    Agents --> L1
    L1 --> L2
    C --> C1
    C --> C2
    C2 --> Human
    Human -->|"Approved"| C1
    C1 --> L3
    L3 --> L4
    L4 --> Human
```

Permission boundaries define what an agent can access and modify, scoped dynamically to the task at hand. An agent assigned to fix a bug in the authentication module should have read/write access to authentication code, read access to related tests, and no access to billing infrastructure, deployment scripts, or credential stores it has no business touching. This is not a static ACL; it is a context-aware policy that changes with the task. When the task is complete, permissions revert. The Anthropic report’s finding that teams with well-maintained context files see 40% fewer agent errors is partly a capability finding, but it is also a governance finding: a well-scoped context limits the surface on which errors can occur.
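The dynamic scoping described above can be sketched as a policy lookup with default deny. The task labels, path patterns, and helper names here are illustrative assumptions, not a real policy engine:

```python
from fnmatch import fnmatch

# Hypothetical task-scoped policies: deny rules win, and anything outside
# the allow list is blocked by default. Labels and patterns are illustrative.
TASK_POLICIES = {
    "bug_fix:auth": {
        "allow": ["src/auth/**", "tests/auth/**"],
        "deny": ["infrastructure/**", ".env*", "**/*.pem", "**/*.key"],
    },
}

def is_permitted(task: str, path: str) -> bool:
    """Check a file access against the policy for the agent's current task."""
    policy = TASK_POLICIES.get(task)
    if policy is None:
        return False  # default deny: an unknown task context gets nothing
    if any(fnmatch(path, pat) for pat in policy["deny"]):
        return False  # an explicit deny always wins over an allow
    return any(fnmatch(path, pat) for pat in policy["allow"])
```

A real implementation would attach this check to the agent's file tools at runtime and drop the policy entry when the task completes, so permissions revert automatically.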

Decision gates identify which actions require human approval before execution. The criterion is not action type; it is blast radius combined with reversibility. Writing a unit test is low blast radius and reversible; a gate here adds friction without safety value. Merging to a release branch, modifying a database schema, invoking a cloud API that provisions infrastructure, or triggering a deployment pipeline are a different matter: these actions are high blast radius, often irreversible, and warrant a human checkpoint. Decision gates are not an escape hatch that agents route around; they are a defined interface between autonomous action and human judgment.
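The gating criterion reduces to a small predicate over the two variables. The thresholds below are illustrative assumptions, to be calibrated per deployment:

```python
from enum import Enum

class Blast(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

# Sketch of the gate criterion: approval is driven by blast radius combined
# with reversibility, not by action type. Thresholds are assumptions.
def needs_human_approval(blast: Blast, reversible: bool) -> bool:
    if blast is Blast.HIGH:
        return True   # high blast radius always gets a human checkpoint
    if blast is Blast.MEDIUM and not reversible:
        return True   # irreversible medium-impact actions gate as well
    return False      # low-impact, reversible work proceeds autonomously
```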

Behavioral monitoring provides runtime visibility into what an agent is actually doing, not just what it was asked to do. This means tracking which tools an agent invokes, in what sequence, with what arguments. It means alerting when an agent’s behaviour departs from the expected pattern for its task type: an agent assigned a documentation task that starts reading secrets files is anomalous, regardless of whether its instructions technically permit that access. Drift detection and anomaly alerting at the agent layer are the runtime equivalent of the static analysis tools we already apply to code.
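One simple form of this is a per-task tool profile: flag any invocation outside the expected set for the agent's current task type. The profile table and alert hook below are illustrative assumptions; a real system would learn expected patterns from historical runs rather than hard-code them:

```python
# Hypothetical expected-tool profiles per task type; illustrative only.
EXPECTED_TOOLS = {
    "documentation": {"read_file", "search_code", "write_docs"},
    "bug_fix": {"read_file", "write_file", "run_tests", "git_commit"},
}

def check_invocation(task_type: str, tool: str, alerts: list) -> bool:
    """Return True if the invocation fits the task profile; record an alert otherwise."""
    expected = EXPECTED_TOOLS.get(task_type, set())
    if tool not in expected:
        alerts.append(
            f"anomaly: task '{task_type}' invoked unexpected tool '{tool}'"
        )
        return False
    return True
```

This is the check that catches the documentation agent reading secrets files: the access may be technically permitted, but it is anomalous for the task, and that is the signal that matters.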

Audit trails make the governance architecture durable and defensible. Every tool invocation, every decision point, every external call an agent makes should be logged immutably with enough context to reconstruct the agent’s reasoning path. This is not just a compliance requirement, though it will increasingly be one. It is the foundation for incident response when something goes wrong, for improving agent behaviour over time, and for demonstrating to stakeholders that autonomous systems are operating within sanctioned boundaries.
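Immutability can be approximated in application code by hash-chaining entries, so any retroactive edit breaks the chain and is detectable. A sketch under that assumption; the class and field names are illustrative, not a production log store:

```python
import hashlib
import json

# Sketch of a tamper-evident audit trail: each entry carries the hash of the
# previous one, so editing any past entry invalidates everything after it.
class AuditTrail:
    def __init__(self):
        self.entries = []
        self._prev_hash = "0" * 64  # genesis value for the first entry

    def record(self, agent_id: str, action: str, context: dict) -> dict:
        entry = {
            "agent_id": agent_id,
            "action": action,
            "context": context,
            "prev_hash": self._prev_hash,
        }
        self._prev_hash = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        entry["hash"] = self._prev_hash
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; False means some entry was altered or reordered."""
        prev = "0" * 64
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            if e["prev_hash"] != prev:
                return False
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if digest != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

In practice the chain would be persisted to an append-only store and anchored externally; the point of the sketch is that "immutable" is a verifiable property, not a policy statement.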

The human-in-the-loop spectrum

Not every agent action warrants the same level of oversight. The question is how to calibrate human involvement to match the risk profile of the action, without creating so much friction that agents become impractical to use.

The Anthropic report and broader industry practice describe three operating modes. “In the loop” means a human reviews every significant agent output before it is applied. “On the loop” means a human sets the constraints and reviews outcomes retrospectively, with the agent operating autonomously between checkpoints. “Out of the loop” means the agent operates fully autonomously within defined guardrails, with human involvement triggered only by anomalies or gate conditions.

The appropriate mode for any given action class is a function of two variables: blast radius and reversibility. Low blast radius, easily reversible actions (running tests, generating documentation drafts, proposing code changes for review) are candidates for “out of the loop” operation. High blast radius or irreversible actions (deploying to production, modifying database schemas, committing directly to a protected branch, calling APIs that incur cost or create external dependencies) require “in the loop” or “on the loop” oversight.

| Action Type | Example | Blast Radius | Reversibility | Recommended Mode |
| --- | --- | --- | --- | --- |
| Code suggestion | Propose a fix for a failing test | Low | High | Out of the loop |
| Test execution | Run the test suite against proposed changes | Low | High | Out of the loop |
| PR creation | Open a pull request for human review | Low | High | Out of the loop |
| Branch merge | Merge a feature branch to main | Medium | Medium | On the loop |
| Config change | Modify application configuration | Medium | Medium | On the loop |
| Schema migration | Alter a production database schema | High | Low | In the loop |
| Deployment trigger | Invoke a production deployment pipeline | High | Low | In the loop |
| Secret access | Read or write credentials | High | Low | In the loop |
| Infrastructure provisioning | Spin up cloud resources | High | Low | In the loop |

The table is a starting point, not a fixed policy. A schema migration in a development environment has a different blast radius than the same operation against production. Calibration requires understanding your specific deployment topology, your organisation’s risk tolerance, and the agent’s track record on the task type in question.
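That calibration logic can be made explicit in code. The mapping below is a sketch of the two-variable rule with an environment override reflecting the dev-versus-production point above; the thresholds and the override table are illustrative assumptions:

```python
# Hypothetical mode calibration: blast radius and reversibility drive the
# oversight mode, with environment downgrading blast radius outside production.
def oversight_mode(blast: str, reversibility: str,
                   environment: str = "production") -> str:
    if environment == "development":
        # the same action class carries a smaller blast radius outside production
        blast = {"high": "medium", "medium": "low"}.get(blast, blast)
    if blast == "high" or reversibility == "low":
        return "in the loop"
    if blast == "medium":
        return "on the loop"
    return "out of the loop"
```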

What this looks like in practice

The governance architecture above is not abstract. Here is a policy-as-code configuration showing what a well-governed AI coding agent workflow looks like in a concrete deployment:

```yaml
# agent-governance-policy.yaml
# Policy-as-code for AI coding agent governance
# Apply per agent instance; scope policies to task context

agent:
  id: coding-agent-prod
  display_name: "Coding Agent (Production)"
  version: "1.0.0"

permission_boundaries:
  # Define default access — deny everything not explicitly permitted
  default: deny

  # Scope permissions to task context using label selectors
  rules:
    - context_labels: [task_type=bug_fix]
      allow:
        - resource: codebase
          operations: [read, write]
          path_patterns:
            - "src/**"
            - "tests/**"
          # Explicitly deny access to infrastructure and secrets
          deny_patterns:
            - "infrastructure/**"
            - ".env*"
            - "**/*.pem"
            - "**/*.key"
        - resource: test_runner
          operations: [invoke]
        - resource: version_control
          operations: [branch_create, commit, push_to_feature_branch]

    - context_labels: [task_type=documentation]
      allow:
        - resource: codebase
          operations: [read]
          path_patterns: ["**"]
        - resource: documentation
          operations: [read, write]
          path_patterns: ["docs/**", "*.md"]

decision_gates:
  # Actions that require human approval before execution
  require_approval:
    - action: merge_to_protected_branch
      approvers: [engineering_lead]
      timeout: 4h
      on_timeout: block  # Never auto-approve on timeout

    - action: schema_migration
      approvers: [db_owner, engineering_lead]
      require_all: true
      timeout: 8h
      on_timeout: block

    - action: deployment_trigger
      approvers: [engineering_lead]
      environments: [staging, production]
      timeout: 1h
      on_timeout: block

    - action: external_api_call
      # Calls outside the approved domain list require review
      approved_domains:
        - "api.github.com"
        - "registry.npmjs.org"
      unapproved: require_approval
      approvers: [security_owner]

    - action: secret_access
      approvers: [security_owner]
      timeout: 30m
      on_timeout: block

behavioral_monitoring:
  # Alert on actions anomalous for the agent's current task context
  anomaly_rules:
    - name: unexpected_file_access
      condition: "file_read outside permitted path_patterns for current task_context"
      severity: high
      action: alert_and_suspend

    - name: secret_file_pattern_access
      condition: "file_read matches pattern [*.pem, *.key, .env*, *credentials*]"
      severity: critical
      action: terminate_and_alert

    - name: unexpected_network_egress
      condition: "outbound_request to domain not in approved_domains"
      severity: high
      action: block_and_alert

    - name: goal_drift_detection
      condition: "tool_sequence diverges from expected_patterns for task_type by > 2 sigma"
      severity: medium
      action: checkpoint_with_human

audit_trail:
  # Immutable log of all agent actions
  log_targets:
    - type: append_only_log
      destination: "audit/agent-decisions.jsonl"
      fields: [timestamp, agent_id, task_id, action, tool, arguments, outcome, run_id]
      retention: 2y

  # Flag for compliance review
  compliance_tags:
    - pattern: "action in [schema_migration, deployment_trigger, secret_access]"
      tag: compliance_review_required
```

The Anthropic report’s finding that well-maintained context files reduce agent errors by 40% aligns with what we see in practice. When an agent knows exactly what it is permitted to touch, and the policy enforces that boundary at runtime, it spends less effort exploring paths that will be blocked and more effort on the actual task. Governance and capability are not in tension here. A well-bounded agent is a more reliable agent.

The governance decisions you need to make this quarter

Engineering teams deploying agents today are making governance decisions by default, not by design. The absence of a policy is itself a policy: implicit, permissive, and unauditable. Here are the questions that need explicit answers before you expand your agentic footprint.

Who owns agent governance in your organisation? In most teams, no one does. It falls between security (who owns identity and access management), engineering (who owns the tooling), and platform (who owns the infrastructure). Governance without an owner is governance that does not get maintained. Assign it, budget for it, and give the owner authority to block agent deployments that do not meet the policy.

What is your agent permission model? Blanket access to the codebase is not a permission model; it is the absence of one. Define what each deployed agent is permitted to read, write, invoke, and modify, scoped to its task class. Treat this as you would treat a service account in your IAM system: least privilege by default, expansion only with justification.

Where are your human checkpoints? Map your agent workflows to the oversight matrix above. Every workflow that touches a high-blast-radius, low-reversibility action needs an explicit gate. If you cannot name the approver and the process, the gate does not exist.

How do you audit agent behaviour? If an agent made a decision yesterday that contributed to a production incident today, can you reconstruct what it did, when, with what context, and why? If the answer is no, you do not have an audit trail; you have logs. Logs and audit trails are different things. An audit trail captures decision context, not just events.

What is your incident response plan for agent failure? Not for human failures that agents exacerbate, but for agent-specific failure modes: a goal hijack, an agent that exits a permitted boundary, an agent that takes an irreversible action it should not have been able to take. Who detects it? Who stops it? What is the rollback path? How do you know the agent has been stopped rather than just disconnected?

The organisations that will navigate this period well are not necessarily the ones deploying the most advanced agents. They are the ones that have answered these questions before the first incident forces them to. The gap between enterprise intent and governance readiness is wide enough that most teams have a meaningful window to close it before it becomes a crisis. That window is not unlimited.

If you are working through how to apply these frameworks to your specific deployment context, or if you are further along and finding that your current governance approach is not keeping pace with what your agents can now do, we are glad to think through it together.
