There is a pattern I've noticed in conversations about AI adoption: teams often apply their existing DevOps mental model to AI systems and then get confused when things don't work the way they expect.
The confusion is understandable. DevOps was built around deterministic software — the same input reliably produces the same output, you can write tests that verify behaviour, and you can reason about what a system will do in production based on what it did in staging. Most of what we built over the last decade in CI/CD, testing, monitoring, and incident response was designed around these properties.
LLMs don't have these properties. AI agents, which are built on top of LLMs, have them even less.
What breaks when you treat an LLM pipeline like code
The most obvious difference is determinism. You can test a function. You cannot write a unit test that tells you whether an LLM response is "correct" in the same way. The output is probabilistic — shaped by temperature settings, model versions, system prompt changes, and context — and "correct" is often a fuzzy, context-dependent judgment.
This breaks several things teams rely on:
Testing: Traditional test suites don't work well for LLM outputs. You end up with either overly brittle tests that break on minor phrasing changes, or tests so loose they don't catch actual regressions. Evaluating LLM quality requires different methods — sampling, human review pipelines, scoring models.
Deployment confidence: In a standard CI/CD pipeline, passing tests means you can deploy with reasonable confidence. With an LLM pipeline, passing tests means you've verified the plumbing works; whether the outputs are actually good is a separate question.
Incident response: When an LLM-powered feature behaves unexpectedly, the root cause is often not a code change. It might be a shift in user input distribution, model drift after a provider update, a prompt that worked in testing but degrades on real traffic, or retrieved context that's silently stale. None of these show up in standard error logs.
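The testing point above can be sketched concretely. This is a minimal example of sampling-based evaluation, assuming a hypothetical `call_llm` function (stubbed here so the example runs on its own): instead of asserting one exact output string, you sample several generations, score each one loosely, and gate on a pass rate.

```python
# A minimal sketch of sampling-based evaluation, as an alternative to
# exact-match unit tests. `call_llm` is a stand-in for a real model call;
# here it is stubbed so the example is self-contained.

def call_llm(prompt: str, seed: int) -> str:
    # Stub for a real, non-deterministic LLM call: the same prompt
    # produces different phrasings on different samples.
    phrasings = [
        "You can reset your password from the account settings page.",
        "Go to account settings to reset your password.",
        "Password resets are done under account settings.",
    ]
    return phrasings[seed % len(phrasings)]

def score(output: str, required_terms: list[str]) -> bool:
    # A crude scorer: checks that the key facts survive rephrasing.
    # Real pipelines might use an LLM judge or human review instead.
    text = output.lower()
    return all(term in text for term in required_terms)

def eval_pass_rate(prompt: str, required_terms: list[str], samples: int = 10) -> float:
    # Sample several generations and report the fraction that pass,
    # rather than asserting any single output string.
    passes = sum(score(call_llm(prompt, s), required_terms) for s in range(samples))
    return passes / samples

rate = eval_pass_rate("How do I reset my password?", ["password", "account settings"])
assert rate >= 0.8  # gate on a pass-rate threshold, not on exact output
```

The design choice that matters here is the threshold: a pass-rate gate tolerates phrasing variance while still catching regressions where the required facts start dropping out.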
LLMOps: the operational layer that had to be invented
The term is awkward but the concept is real. LLMOps is the set of practices for operating LLM-based systems in production — the bits that DevOps doesn't cover.
A few things that actually matter:
Prompt versioning: Prompts are code, but most teams don't treat them that way. They live in config files, environment variables, or, worse, hard-coded in functions. When a prompt change causes a regression, you want to know what changed, when, and why — the same as any other code change.
Token cost monitoring: LLM calls have variable cost based on input/output length, model choice, and caching behaviour. Without monitoring, costs can spike unexpectedly. This is especially true when retrieval-augmented generation (RAG) is involved and the retrieved context is longer than expected.
Evaluation pipelines: Some teams build automated scoring with a secondary LLM as judge. Others do periodic human review on sampled outputs. The right approach depends on the risk profile of the use case. Both are better than nothing.
Hallucination guardrails: For use cases where factual accuracy matters, output validation is worth investing in — whether that's retrieval verification, structured output schemas, or downstream checks that catch responses before they reach users.
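The prompt-versioning point can be made concrete with a small sketch. The names here are illustrative, not from any particular framework: each prompt version carries a changelog note and a content hash, so a logged LLM call can be traced back to exactly the prompt text it used.

```python
# A minimal sketch of prompt versioning: prompts stored as data with a
# version number, a changelog note, and a content hash. Names are
# illustrative, not from any particular framework.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: int
    text: str
    note: str  # why this change was made

    @property
    def digest(self) -> str:
        # A content hash makes it easy to tell exactly which prompt
        # text a logged LLM call used.
        return hashlib.sha256(self.text.encode()).hexdigest()[:12]

class PromptRegistry:
    def __init__(self):
        self._versions: dict[str, list[PromptVersion]] = {}

    def register(self, name: str, text: str, note: str) -> PromptVersion:
        history = self._versions.setdefault(name, [])
        pv = PromptVersion(name, len(history) + 1, text, note)
        history.append(pv)
        return pv

    def latest(self, name: str) -> PromptVersion:
        return self._versions[name][-1]

    def history(self, name: str) -> list[PromptVersion]:
        return list(self._versions[name])

registry = PromptRegistry()
registry.register("support_reply", "You are a helpful support agent.", "initial")
registry.register("support_reply", "You are a concise, helpful support agent.",
                  "users complained replies were too long")

current = registry.latest("support_reply")
assert current.version == 2
```

Whether the registry lives in code, a database, or a git repository matters less than the discipline: every prompt change gets a version, a reason, and an identifier you can find in your logs.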
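Token cost monitoring can likewise be sketched in a few lines. The prices below are placeholders, not real rates, and a real system would read token counts from the provider's API response rather than trusting its own estimates — but the shape of the accounting is the same.

```python
# A minimal sketch of per-call cost tracking. The model names and
# per-1K-token prices are placeholders, not real rates.
PRICES = {
    # model: (input price per 1K tokens, output price per 1K tokens), in dollars
    "small-model": (0.0005, 0.0015),
    "large-model": (0.005, 0.015),
}

class CostMeter:
    def __init__(self):
        self.total = 0.0
        self.calls = 0

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        # Input and output tokens are priced separately, which is why
        # long retrieved context inflates cost even for short answers.
        in_price, out_price = PRICES[model]
        cost = (input_tokens / 1000) * in_price + (output_tokens / 1000) * out_price
        self.total += cost
        self.calls += 1
        return cost

meter = CostMeter()
# A RAG call where retrieved context inflates the input side --
# the case the text above warns about.
meter.record("large-model", input_tokens=6000, output_tokens=400)
meter.record("small-model", input_tokens=300, output_tokens=150)
```

Even a crude meter like this, attached to every call site, is enough to turn a surprise invoice into a graph you can alert on.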
AgentOps: the next layer
If LLMOps is about operating a single LLM call reliably, AgentOps is about operating systems where multiple LLM calls are chained together, with tool use, memory, and branching logic in between.
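The chained structure described above can be sketched as a loop. Everything here is illustrative — `call_llm` is a stub standing in for a real model, and `lookup` is a hypothetical tool — but the control flow is the point: each step either calls a tool (whose result is fed back into the next model call) or produces a final answer.

```python
# A minimal sketch of an agent loop: LLM calls chained together with
# tool use in between. `call_llm` and `lookup` are stubs; the control
# flow is what the sketch is illustrating.

def call_llm(history: list[str]) -> dict:
    # Stub: a real model would decide based on the conversation so far.
    if not any(msg.startswith("tool:") for msg in history):
        return {"action": "tool", "tool": "lookup", "arg": "order 1234"}
    return {"action": "final", "text": "Your order has shipped."}

def lookup(arg: str) -> str:
    # Hypothetical read-only tool.
    return f"status({arg}) = shipped"

def run_agent(question: str, max_steps: int = 5) -> str:
    history = [f"user: {question}"]
    for _ in range(max_steps):
        step = call_llm(history)
        if step["action"] == "final":
            return step["text"]
        # The tool result is appended to memory and shapes the next call --
        # this feedback loop is where multi-step failures hide.
        history.append("tool: " + lookup(step["arg"]))
    raise RuntimeError("agent did not finish within max_steps")

answer = run_agent("Where is my order?")
```

Note the `max_steps` bound: even in a toy version, an agent loop needs a hard stop, because a confused model can otherwise cycle indefinitely.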
This introduces new failure modes that are harder to observe and debug:
Multi-step failures: An error in step three of a five-step agent may only manifest in the final output. The cause and effect are separated in time and in the trace.
Tool permission boundaries: Agents with tool access can take actions with real-world effects — sending emails, writing files, calling APIs. Getting the permission model right (what can the agent do, under what conditions, with what confirmation requirements) is a safety concern, not just an operational one.
Memory and context accumulation: Long-running agents accumulate context. How that context is managed — what gets summarized, what gets dropped, what gets stored and retrieved — affects both output quality and cost.
Observability gaps: Standard APM tools weren't designed for multi-step LLM traces. You need tooling that can show you the reasoning path, the retrieved context, the tool calls made, and the cost of each step in a single view. This tooling exists now, but adoption is still uneven.
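Of the failure modes above, the permission boundary is the easiest to sketch. This is a minimal illustration under assumed names (the tools and policy are invented for the example): each tool declares whether it has real-world side effects, and side-effecting tools refuse to run without explicit confirmation.

```python
# A minimal sketch of a tool permission boundary: read-only tools run
# freely, side-effecting tools require explicit confirmation. Tool names
# and the policy itself are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    side_effects: bool  # does this action change the outside world?
    run: Callable[[str], str]

def execute(tool: Tool, arg: str, confirmed: bool = False) -> str:
    # The boundary: an agent can call read-only tools on its own, but a
    # side-effecting tool without confirmation is rejected outright.
    if tool.side_effects and not confirmed:
        raise PermissionError(f"{tool.name} requires human confirmation")
    return tool.run(arg)

search = Tool("search_docs", side_effects=False, run=lambda q: f"results for {q}")
send = Tool("send_email", side_effects=True, run=lambda body: "sent")

assert execute(search, "refund policy") == "results for refund policy"

blocked = False
try:
    execute(send, "hello")  # blocked: no confirmation given
except PermissionError:
    blocked = True
assert blocked

assert execute(send, "hello", confirmed=True) == "sent"
```

Real systems layer more on top of this — per-tool conditions, rate limits, audit logs — but the core idea is that the permission check lives outside the model, in code the agent cannot talk its way around.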
What this means in practice
I don't think teams should be intimidated by this. Most of what I've described is learnable, and the tooling has matured significantly.
What I'd push back on is the idea that you can get the benefits of agentic AI without investing in the operational infrastructure to match. Teams that ship agents without observability, without prompt versioning, without a coherent evaluation approach — they often end up with systems that work in demos and degrade in production in ways that are hard to diagnose.
The teams I've seen do this well tend to have a few things in common: they treat the operational work as engineering work (not an afterthought), they invest in evaluation early, and they are disciplined about what they allow agents to do autonomously versus what requires a human in the loop.
None of that requires a specialised team or a new set of job titles. It requires applying the same engineering rigour to AI systems that you'd apply to any other production system — adjusted for the ways AI systems are different.
If you're building something in this space and want to think through the operational model, feel free to get in touch.