The first AI agent most companies hire is an intern with root access.
It gets a vague brief, too much context, no clear owner, and a queue full of work nobody has properly shaped. Then people act surprised when the output is uneven. Some pull requests are useful. Some are noisy.
The board hears "agent" and expects leverage. The engineering team sees another source of review load. Security sees a production risk with a friendly interface. Finance sees a licence bill and asks when the productivity line moves. Everyone is right to be suspicious. Most agent pilots are not designed like operational change. They are designed like demos.
The problem is not that AI agents are useless. The problem is that many organisations are managing them like magic, not labour. They buy a tool, point it at a codebase, and expect ROI. That is not how leverage works.
If you want ROI from AI agents, stop treating them as general-purpose assistants. Build the basics: scoped roles, accountable humans, trusted inputs, risk-based controls, measurable outcomes, and a clear path for stopping, narrowing, or expanding the work.
The boring operating model matters more than the model choice.

Agents need jobs, not vibes
A useful employee has a job. A useless employee has access and enthusiasm. The same is true for agents.
"Help our engineers move faster" is not a role. "Keep these five low-risk frontend packages current by opening one pull request per package with tests and a risk note" is a role. The first creates theatre. The second creates measurable work.
Most failed agent pilots have the same shape. The organisation starts with a broad ambition because broad ambition sounds strategic. The agent is asked to understand a large system, make judgement calls across unclear boundaries, and produce changes that cross several ownership lines. The pilot then gets judged on whether humans feel more productive. That is a bad experiment. It has no control surface.
A typical bad assignment is: "Refactor billing." It sounds concrete because billing is a real area of the system. It is not concrete enough to be safe.
The scope is too broad. The success criteria are unclear. The business logic is risky. The dependencies hide in payment providers, invoices, tax rules, customer contracts, reporting, and support workflows.
Review becomes expensive because a senior engineer has to rediscover the whole problem before they can trust the change. No CTO would hand that brief to a new hire without context, guardrails, and a domain-aware reviewer. An agent should not get it either.
A better first role is narrow enough that you can write down what good looks like:
- triage flaky tests in one repository
- open minor dependency upgrades for approved packages
- draft migration pull requests for a repeated API change
- summarise production incidents from logs, traces, and the incident channel
- route support tickets to the right product area with evidence
- prepare release notes from merged pull requests and tickets
None of these sound like science fiction. That is the point. Good automation usually starts as boring work with sharp edges.
Human teams already work this way when they are healthy. You do not ask one generic engineer to own backend changes, platform reliability, QA coverage, security checks, and product trade-offs at the same time. You use backend, platform, QA, security, and product roles because each brings a different lens and a different failure mode.
Agent teams need the same shape. One agent can handle dependency upgrades. Another can generate tests. Another can update an API client after a contract change. Another can reproduce a bug from a ticket. Another can check whether a pull request follows local conventions.
Specialisation is not bureaucracy. It is how work becomes reliable.
The role should say what the agent is allowed to touch, what it must not touch, what evidence it must provide, when it should stop, and who checks the output. If that sounds like writing a job description, yes. That is exactly what it is.
The unit of AI management is not the model. It is the role.
The CTO job is the operating model
An agent cannot be accountable. A person can.
That sentence cuts through a lot of nonsense. Agents can propose, draft, test, summarise, and sometimes execute within constraints. They cannot own the business consequence of a bad change. They cannot decide that a customer impact is acceptable. They cannot explain to the board why a compliance control was bypassed. A human still owns the outcome.
This is where agent adoption becomes a CTO-level problem, not a tooling problem. The platform team may install the runtime. Security may define access rules. Product teams may benefit from faster work. Finance may approve the budget. But somebody has to design how the capability works across all of them.
Without that design, you get a shadow org chart. Nobody is quite sure who manages the agent. The product team wants the output. The owning engineering team gets the review load. Security gets nervous after the fact. When something goes wrong, everyone can plausibly say it was not their decision.
Do not let that happen. Every agent role needs one accountable human. Not a committee. Not "engineering". One person who can answer three questions quickly:
- What outcome is this agent role meant to improve?
- What risk are we accepting by running it?
- What evidence would make us stop, narrow, or expand it?
That person does not have to inspect every line. They do have to own the shape of the work: scope, inputs, permissions, thresholds, success metrics, escalation paths, and stop conditions. They make sure the agent does not become an unowned production system that happens to write code.
The CTO does not need to personally manage every agent. The CTO does need to set the pattern. Which work is eligible? Which systems are off limits? Who approves new roles? Which evidence is required before autonomy increases? How does security know what is running? How does finance know whether value was created rather than shifted into checking queues?
That is the model. It is not glamorous. It is also the difference between agent leverage and AI sprawl.
The minimum agent operating model
Treat each role like something you are putting on the org chart. Name why it exists, who owns it, what it may touch, what it must never touch, which tools and inputs it can use, what evidence it must produce, how checks work, how success is measured, which controls apply, where it escalates, when you stop it, and what would justify expanding it. If you cannot write that page, the role is not ready.
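To make that concrete, here is one way the page could look if you keep it next to the code instead of in a wiki. This is a minimal sketch: the structure, field names, and example values are illustrative assumptions, not a standard, and the owner and paths are hypothetical.

```python
# Illustrative one-page role definition, kept in version control next to the
# code it governs. Field names and example values are assumptions, not a
# standard; adapt the shape to whatever your teams already use.
from dataclasses import dataclass

@dataclass
class AgentRole:
    name: str                       # the job this agent holds, not the model
    purpose: str                    # why the role exists
    owner: str                      # one accountable human, not a team alias
    may_touch: list[str]            # repositories or paths in scope
    must_not_touch: list[str]       # explicit exclusions
    allowed_tools: list[str]        # tools and inputs the agent can use
    required_evidence: list[str]    # what every output must include
    review: str                     # how checks work and who merges
    success_metrics: list[str]      # how success is measured
    controls: list[str]             # which controls apply
    escalation: str                 # where it escalates outside its lane
    stop_conditions: list[str]      # when you stop it
    expansion_criteria: list[str]   # what would justify expanding it

# Hypothetical example based on the dependency-upgrade role described earlier.
dependency_upgrades = AgentRole(
    name="frontend-dependency-upgrades",
    purpose="Keep five low-risk frontend packages current",
    owner="a named senior engineer",
    may_touch=["web/packages/ui", "web/packages/icons"],
    must_not_touch=["services/billing", "infra/"],
    allowed_tools=["read repository", "run CI", "open pull request"],
    required_evidence=["passing tests", "short risk note per pull request"],
    review="a human reviews and merges every pull request",
    success_metrics=["lead time", "acceptance rate", "review time per change"],
    controls=["no write access outside scope", "no direct pushes to main"],
    escalation="comment on the pull request, notify the owner, and stop",
    stop_conditions=["any change outside scope", "two consecutive CI failures"],
    expansion_criteria=["six weeks of stable acceptance with flat review time"],
)
```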
Small systems create agent leverage
Agents perform better when the system is legible. That should make every CTO slightly uncomfortable, because many codebases are legible only to the five people who have suffered through them for years.
A human can walk across organisational memory. They know why the weird adapter exists. They remember the customer migration from 2021. They know the "temporary" flag cannot be removed because a large enterprise client still depends on it. An agent gets whatever context you provide. If the important context lives in folklore, the agent will miss it.
This is why architecture suddenly matters in a very practical way. Big tangled systems are not just hard for humans to change. They are hard for agents to reason about safely. When one change requires understanding six domains, three teams, and a decade of exceptions, your agent pilot is not testing AI. It is testing how much accidental complexity your organisation has normalised.
Small, well-owned units create leverage. Clear module boundaries create leverage. Good tests create leverage. Up-to-date runbooks create leverage. A service catalogue creates leverage. These things were already good engineering practice. Agents make the payoff more visible because they amplify whatever structure exists.
This does not mean splitting everything into microservices. That is the lazy version of the argument. A small unit can be a module, package, workflow, bounded service, data pipeline, or test suite. The important property is that the unit has a clear purpose, clear ownership, clear inputs and outputs, and a path to approval that does not require a tribal council.
An agent is a brutal architecture reviewer. If it needs to read half the company to make a safe change, the problem is not the prompt.
Governance is how you earn autonomy
Trusting agents is not a moral position. It is an engineering decision about blast radius.
The wrong debate is "should we let agents do production work?" The right debate is "which production work, under which controls, with which rollback path, and after what evidence?" Those questions are much easier to answer when the role is specific.
You can build trust in stages. First the agent drafts and a human executes. Then it opens pull requests and humans merge. Then it handles low-risk changes inside a narrow boundary. Then it gets more autonomy where the cost of being wrong is low and the signals are strong. This is not anti-AI caution. It is how mature organisations adopt any operational capability without betting the company on a demo.
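It helps to write those stages down somewhere enforceable, so that "more autonomy" is a recorded decision rather than drift. A minimal sketch follows; the levels mirror the stages above, while the thresholds are made-up examples, not recommendations.

```python
# Illustrative autonomy ladder. Thresholds are assumptions; the point is that
# widening autonomy is an explicit, evidence-based decision.
from enum import IntEnum

class Autonomy(IntEnum):
    DRAFT_ONLY = 0      # agent drafts, a human executes
    OPEN_PR = 1         # agent opens pull requests, humans merge
    LOW_RISK_AUTO = 2   # agent completes low-risk changes inside its boundary

def promotion_justified(acceptance_rate: float,
                        defect_rate: float,
                        weeks_at_level: int) -> bool:
    """Evidence check before moving a role up one level (example thresholds)."""
    return acceptance_rate >= 0.80 and defect_rate <= 0.02 and weeks_at_level >= 6
```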
The controls should match the risk. A release-note draft needs a light edit. A dependency upgrade needs tests, ownership metadata, and human review. A change to billing logic needs a much higher bar. A production remediation needs explicit runbooks, limits, observability, and a fast way to stop the agent when reality diverges from the plan.
Instructions are not controls. A prompt that says "be careful" is not a permission boundary. A policy document nobody can enforce is not governance. Destructive operations need hard confirmations, narrowly scoped credentials, approvals that cannot be bypassed, logs that humans can inspect, and backups outside the same blast radius.
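The difference between an instruction and a control is where it lives. A control sits in the execution path and fails closed. Here is a minimal sketch of that idea; the action names, scopes, and approval mechanism are illustrative, not a real tool's API.

```python
# Sketch of a control rather than an instruction: the boundary lives in the
# execution path and fails closed, regardless of what the prompt said.
# Action names, scopes, and the approval mechanism are illustrative.
ALLOWED_WRITE_SCOPES = {"web/packages/ui", "web/packages/icons"}
DESTRUCTIVE_ACTIONS = {"drop_table", "delete_branch", "force_push"}

class PermissionDenied(Exception):
    pass

def execute(action: str, target: str, approved_by: str | None = None) -> None:
    # Destructive operations always require a named human approval.
    if action in DESTRUCTIVE_ACTIONS and approved_by is None:
        raise PermissionDenied(f"{action} requires explicit human approval")
    # Writes outside the role's scope fail no matter how the request was worded.
    if not any(target.startswith(scope) for scope in ALLOWED_WRITE_SCOPES):
        raise PermissionDenied(f"{target} is outside this role's write scope")
    # Hand off to credentials scoped to this role only, with an audit log entry.
    ...
```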

In one incident report, Jer Crane at PocketOS described an AI coding agent failure where safety prompts were not enough. The lesson is not to panic about agents. It is to stop pretending instructions are controls. If an agent can delete production data with the same credentials it uses for routine work, you do not have an agent problem. You have an access design problem.
Good governance is not a monthly AI committee checking slide decks. It is operational hygiene: role approval, access review, audit trails, incident playbooks, and clear escalation when the agent encounters something outside its lane. Keep it boring. Boring is what lets you widen autonomy without lying to yourself about risk.
Measure constraint relief, not demo magic
The easiest way to fool yourself with AI agents is to measure activity. Pull requests opened. Lines changed. Tickets touched. Summaries generated. These numbers make a pilot look busy. They do not prove ROI.
The useful question is whether the agent reduces a constraint without moving the cost somewhere else. If the agent opens twice as many pull requests but senior review time triples, you have not created leverage. You have created queue pressure. If incident summaries arrive faster but responders distrust them, you have created another artefact to check. If the agent drafts migration changes but every change needs heavy rework, the savings may be imaginary.
Measure the work like operational work:
- lead time from trigger to accepted output
- human review time per accepted output
- acceptance rate after review
- defect, rollback, or correction rate
- amount of previously neglected work cleared
- cost of ownership, including licences, infrastructure, review, and governance
The ROI model is simple enough to fit on one page. Start with the before state: how much time the work currently consumes, how long it waits, what risk it creates, and what valuable work gets displaced. Add the agent cost: tooling, runtime, maintenance, review, security controls, and false positives. Then count accepted outputs, avoided delay, reduced manual effort, and reduced operational risk. If you cannot see the constraint moving, the agent may still be interesting, but it is not yet producing ROI.
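As a sketch, the arithmetic really can be that plain. Every number below is an assumption to be replaced with your own before state and observed output; the only thing worth copying is the shape of the comparison.

```python
# Back-of-envelope version of the one-page ROI model. All figures are
# illustrative assumptions, not benchmarks.

senior_hourly_cost = 120                # loaded cost assumption, per hour

# Before state: what the neglected work costs today
baseline_hours_per_month = 16           # senior time the queue currently consumes

# Agent cost of ownership, per month
licence_and_runtime = 400
review_hours_per_month = 6              # human review of agent output
governance_hours_per_month = 2          # access reviews, audits, playbook upkeep

# Observed output, per month
accepted_outputs = 20
hours_saved_per_accepted_output = 0.75  # manual effort each accepted output replaces

# Saved effort cannot exceed what the work actually cost before the agent.
hours_saved = min(accepted_outputs * hours_saved_per_accepted_output,
                  baseline_hours_per_month)

value = hours_saved * senior_hourly_cost
cost = (licence_and_runtime
        + (review_hours_per_month + governance_hours_per_month) * senior_hourly_cost)

# Avoided delay and reduced operational risk would be added on the value side
# in the same way, once you can estimate them honestly.
print(f"monthly value ~{value:.0f}, cost ~{cost:.0f}, net ~{value - cost:.0f}")
```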
In one team the first working agent role was minor dependency upgrades on a small set of frontend packages. Before the pilot those upgrades were batched every few months. Each batch consumed a senior engineer for two to three days of scattered work. Nobody argued the upgrades were unimportant, but they kept losing to feature work until the batch became annoying enough to force a cleanup.
After the role was narrowed to low-risk packages only, with one pull request per package plus required tests and a short risk note, the agent opened 38 pull requests in six weeks. Thirty-one merged after minor human edits. Median time from alert to merged change fell from roughly three weeks to under four days. Review time per change stayed flat.
In another organisation, support escalations for one B2B product were routinely misrouted. Tickets bounced between product, engineering, and customer success before landing with the right team. The pain was not only the delay; every bounce forced someone to reread the ticket, guess the context, and apologise for a handover they did not control.
The first agent role did not answer customers. It read the ticket, matched it to a product area, attached the evidence, and suggested a route. Over a month, 214 escalations went through the role. Seventy-six per cent were accepted without rerouting. Median time to first technical owner fell from 9 hours to 2.5 hours. The team kept humans in control of customer replies, but removed a chunk of coordination drag.
Neither example is glamorous. That is why they worked. The roles were narrow. The before state was known. The review cost was visible. The risk was bounded. The result was not "AI transformation". It was work moving faster without quietly dumping cost onto senior people.
Start with one owned role
The best first agent role is rarely the flashiest one. It is usually the neglected work that has a clear trigger, a repeatable process, and an annoying human cost.
Look for work that already falls between the cracks: stale dependencies, flaky tests, missing release notes, outdated internal documentation, repetitive migration tasks, noisy alerts, support tickets that need routing. These are not strategy problems. They are drag. Drag is where agents can earn their keep early.
Pick one role. Make the scope painfully clear. Give it an owner. Measure the before state. Run it for six weeks. Count accepted outputs, review time, lead time, defects, and the amount of work cleared that would otherwise have stayed stuck. Treat it like an operational service: keep it, narrow it, expand it, or shut it down.
Do not start by creating an AI transformation programme. Start by proving that one agent can remove one real constraint without increasing risk or review load. Then repeat the pattern. The model scales better than enthusiasm.
AI agents will not remove the need for management. They will punish organisations that were already bad at it. The companies that win will not simply have better models. They will have clearer roles, cleaner systems, tighter controls, tighter feedback loops, and humans who remain accountable for the work. After six weeks, a CTO should be able to point to one cleared queue, stable review cost, and a clear decision: expand, narrow, or stop.