For the last decade, "DevOps" was a sufficient frame for most operational thinking. Continuous integration, continuous deployment, automated testing, observability — the practices were clear enough, the tooling was mature, and teams that invested in getting it right saw real returns.
That frame is still useful. It is just no longer complete.
The Problem with Applying DevOps Thinking to AI Systems
DevOps was designed around a key assumption: given the same input, you get the same output. A deterministic system. That assumption is what makes most of the tooling work — your CI pipeline passes or fails, your tests reproduce reliably, your deployment either succeeds or it does not.
Large language models break that assumption. You can deploy the exact same model with the exact same prompt and get meaningfully different outputs across requests. The system is probabilistic by design. The operational questions that follow from this are genuinely different ones.
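To make that concrete, here is a minimal, self-contained sketch of why identical inputs yield different outputs: temperature-scaled sampling over a fixed set of logits, no real model involved. The logits and temperature are arbitrary illustrative values.

```python
import math
import random

def sample_token(logits, temperature=0.8, rng=random):
    """Sample one token index from logits via temperature-scaled softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# The "model state" is identical on every call; the outputs are not.
rng = random.Random(0)                    # seeded only so the demo is reproducible
logits = [2.0, 1.5, 0.5]
samples = {sample_token(logits, rng=rng) for _ in range(50)}
print(samples)                            # more than one distinct token for the same input
```

The same mechanism — sampling from a distribution rather than returning a single value — is what undercuts exact-match testing and bitwise output comparison downstream.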
What does "this deployment is healthy" mean when the output cannot be directly compared? How do you version a prompt the way you version code, when a small change in phrasing can shift behaviour substantially? What does a rollback look like when you have switched models? How do you test for quality when "correct" is context-dependent?
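None of these questions have canonical answers yet, but the versioning one at least has a workable shape: treat the prompt text, the model identifier, and the sampling parameters as a single content-addressed artifact, so any change to any of them produces a new version that can be evaluated and rolled back as a unit. A hypothetical sketch — the `PromptRelease` name and `model-a` identifier are mine, not an established API:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class PromptRelease:
    """One deployable unit: prompt pinned together with model and sampling params."""
    prompt_template: str
    model: str
    temperature: float

    def version_id(self) -> str:
        # Content hash: any change to wording, model, or params is a new version.
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]

v1 = PromptRelease("Summarise the ticket:\n{ticket}", "model-a", 0.2)
v2 = PromptRelease("Summarise this support ticket:\n{ticket}", "model-a", 0.2)
print(v1.version_id(), v2.version_id())   # a small wording change yields a new version
```

The point of the content hash is that a "small change in phrasing" can no longer slip through unversioned: it is a new release, subject to the same evaluation gate as a model swap.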
These are not theoretical questions. They are the ones that teams running LLMs in production are working through right now, with varying degrees of success.
The Vocabulary Has Expanded
The industry has responded by fragmenting the ops stack into more specialised disciplines: MLOps for machine learning systems more broadly, LLMOps specifically for large language models, and increasingly AgentOps for systems where AI agents take autonomous actions over time.
Each layer has genuinely different concerns.
LLMOps focuses on prompt versioning and evaluation, token cost monitoring, retrieval-augmented generation quality, hallucination guardrails, and the operational patterns around model serving. Where a DevOps engineer monitors CPU usage and error rates, an LLMOps practitioner monitors output quality scores and the distribution of user corrections.
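As an illustration of what monitoring quality scores rather than error rates can look like, here is a toy rolling-window monitor. The window size, warm-up count, and 0.7 floor are arbitrary assumptions, not recommended values:

```python
from collections import deque
from statistics import mean

class QualityMonitor:
    """Rolling-window monitor over per-response quality scores in [0.0, 1.0]."""
    def __init__(self, window=100, floor=0.7):
        self.scores = deque(maxlen=window)   # old scores age out automatically
        self.floor = floor

    def record(self, score: float) -> bool:
        """Record one eval score; return True when the window mean is unhealthy."""
        self.scores.append(score)
        # Require a minimum sample before alerting, to avoid noise at startup.
        return len(self.scores) >= 20 and mean(self.scores) < self.floor

monitor = QualityMonitor(window=50, floor=0.7)
healthy = [monitor.record(0.9) for _ in range(30)]    # no alerts while quality holds
degraded = [monitor.record(0.3) for _ in range(30)]   # alert fires as the mean sinks
print(any(healthy), any(degraded))                    # → False True
```

The health signal here is a distribution drifting below a floor, not a binary pass/fail — which is the shape most LLMOps "is this deployment healthy" checks end up taking.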
AgentOps adds another set of concerns on top of that. When an agent is taking actions in the world — calling APIs, executing code, managing infrastructure — the observability requirements expand again. You need to understand not just what the agent produced, but what it decided, why it decided it, and what it did with that decision. Permission scoping becomes critical. Multi-agent coordination introduces failure modes that single-system pipelines simply do not have.
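A minimal sketch of permission scoping combined with decision-level audit logging — the `ScopedAgent` class and tool names are hypothetical, not any particular framework's API:

```python
from datetime import datetime, timezone

class ScopedAgent:
    """Wraps tool calls with an allow-list and a structured audit trail."""
    def __init__(self, allowed_tools):
        self.allowed = set(allowed_tools)
        self.audit_log = []

    def act(self, tool: str, args: dict, reason: str):
        entry = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "tool": tool,
            "args": args,
            "reason": reason,                 # why the agent decided to act
            "permitted": tool in self.allowed,
        }
        self.audit_log.append(entry)          # log the decision even when it is blocked
        if not entry["permitted"]:
            raise PermissionError(f"tool {tool!r} is outside this agent's scope")
        return entry

agent = ScopedAgent(allowed_tools={"read_ticket", "post_comment"})
agent.act("post_comment", {"ticket": 42, "body": "Looking into it."},
          reason="user asked for status")
try:
    agent.act("delete_ticket", {"ticket": 42}, reason="cleanup")
except PermissionError as e:
    print(e)                                  # blocked, but the attempt is still recorded
```

The design choice worth noting: the attempted action is logged before the permission check raises, so the trail captures what the agent *tried* to do, not just what it was allowed to do.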
The Organisations Navigating This Well
The teams I have seen handle this transition thoughtfully tend not to be the ones who adopted every new framework immediately. They are the ones who took the time to understand what actually changed at each layer — and what stayed the same.
The fundamentals of good engineering practice hold up reasonably well: version control, testability, observability, gradual rollout, rollback capability. What changes is what those things mean in practice when the system is probabilistic. That translation work is where most of the difficulty lives.
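As one example of that translation work, "gradual rollout with rollback capability" becomes a statistical gate over evaluation scores rather than an exact-match comparison. A toy sketch with made-up scores and an arbitrary 0.05 tolerance:

```python
from statistics import mean

def canary_verdict(control_scores, candidate_scores, max_drop=0.05):
    """Compare eval-score samples from the current and candidate releases.

    'rollback' here means the candidate's mean quality fell more than
    max_drop below the control's — a statistical gate, not an output diff.
    """
    delta = mean(candidate_scores) - mean(control_scores)
    return "promote" if delta >= -max_drop else "rollback"

control = [0.82, 0.85, 0.80, 0.84, 0.83]
good_candidate = [0.81, 0.86, 0.84, 0.80, 0.85]
bad_candidate = [0.60, 0.65, 0.58, 0.62, 0.61]
print(canary_verdict(control, good_candidate))  # promote
print(canary_verdict(control, bad_candidate))   # rollback
```

A real gate would use far larger samples and a proper significance test; the point is the shape of the check — distributions compared against a tolerance, not outputs compared for equality.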
A 2026 analysis of enterprise AI operations found that organisations fluent across the full stack — DevOps through AgentOps — were running 30 to 40 per cent faster than those still applying traditional DevOps patterns to AI systems. That gap is plausible. It matches what I see in practice.
An Honest Note on Where Things Stand
Most of this is still being worked out. The tooling for AgentOps in particular is early. The patterns that will hold up over the next few years are not yet clear, and anyone who tells you they have it all figured out is probably either describing a very specific narrow context or extrapolating optimistically.
What does seem settled is the direction. Software systems are going to contain more AI components, more agentic behaviour, more probabilistic outputs. Treating that as a temporary anomaly to be managed with existing tools is going to become increasingly costly.
The teams that invest now in understanding how the operational requirements differ — and building skills accordingly — are likely to find that investment compounding over time. It is not the most exciting framing, but it tends to be the accurate one.
If you are working through where your organisation sits on this spectrum, I am glad to think through it with you.