When a company tells me they want to adopt AI, the conversation usually starts in the wrong place.
They want to talk about which model to use, which vendor, which use case to prioritise. These are reasonable things to think about. But they're not where most AI initiatives actually get stuck.
Most AI projects stall because the data infrastructure underneath isn't ready. And the companies that don't realise this early spend six to twelve months finding it out the hard way.
The pattern I keep seeing
I've seen versions of this across different contexts — at larger organisations with established data estates, and at early-stage companies that are essentially building everything for the first time.
At one company, the AI initiative looked well-funded and well-resourced. The model selection was sensible. The use case was clear. But the underlying customer data was scattered across four different systems, only partially deduplicated, and had no consistent identifier across them. The first thing the team had to build was not an AI feature. It was an entity resolution layer. That took five months and wasn't in anyone's original plan.
At a startup with a much smaller data footprint, the constraint was different: event data existed but had never been modelled with any consistency. The meaning of a given field had drifted over eighteen months of rapid feature development. There was no schema registry, no version history, no documentation. The AI work got blocked on a data archaeology project.
Neither team was underprepared in the usual sense. Both had reasonable people and adequate resources. The issue was sequencing — they'd planned the AI layer before they understood the state of the data layer it needed to sit on top of.
Why AI makes existing problems visible
Legacy data problems are usually manageable until you try to do something ambitious with the data.
A reporting system can tolerate inconsistent identifiers across systems — you just run a join and accept some noise. A recommendation engine cannot. The model will encode the inconsistency into its outputs and there will be no obvious way to know when it's doing so.
A data warehouse built for analytics can survive schema drift over time — someone queries it, notices the anomaly, and cleans it up. A real-time inference system cannot. The pipeline will process the anomaly silently and the errors will surface in production behaviour, not in error logs.
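To make that distinction concrete, here is a minimal sketch in Python (the field names and thresholds are hypothetical) of the kind of validation gate a real-time pipeline needs in front of the model: the explicit check that a reporting workflow can afford to skip, because a human would eventually catch the anomaly downstream.

```python
# Illustrative sketch only: field names and thresholds are hypothetical.
# The point is that a real-time pipeline needs an explicit gate that
# rejects anomalous records instead of silently passing them to the model.

from dataclasses import dataclass

REQUIRED_FIELDS = {"customer_id", "plan_tier", "events_last_30d"}
VALID_PLAN_TIERS = {"free", "pro", "enterprise"}


@dataclass
class ValidationResult:
    ok: bool
    reasons: list


def validate_for_inference(record: dict) -> ValidationResult:
    """Check a single record before it reaches the model."""
    reasons = []

    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        reasons.append(f"missing fields: {sorted(missing)}")

    tier = record.get("plan_tier")
    if tier is not None and tier not in VALID_PLAN_TIERS:
        # Schema drift tends to show up here first: a value nobody documented.
        reasons.append(f"unknown plan_tier: {tier!r}")

    events = record.get("events_last_30d")
    if events is not None and not (0 <= events <= 100_000):
        reasons.append(f"events_last_30d out of range: {events}")

    return ValidationResult(ok=not reasons, reasons=reasons)


if __name__ == "__main__":
    result = validate_for_inference(
        {"customer_id": "c-42", "plan_tier": "trial", "events_last_30d": 17}
    )
    if not result.ok:
        # In production this would route to a quarantine queue and raise an alert,
        # rather than letting the model encode the anomaly into its output.
        print("rejected:", result.reasons)
```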
AI raises the quality bar on data in a specific way: it requires data to be consistent, documented, and semantically stable in ways that most legacy systems were never designed to guarantee. This is not a criticism of how those systems were built — they were built for different purposes. It's an observation about what changes when you try to use them for AI.
Four questions worth asking before you commit
Before scoping an AI initiative, I'd want to understand the data situation along four dimensions. This isn't a comprehensive audit — it's a quick readiness check to surface what else might need to happen before the AI work can actually proceed.
1. Do you know where the data lives and who owns it?
Not in a theoretical sense. In a practical one. For the specific data the AI system will need: which team created it, which system stores it, who has access, and who is accountable for its quality. If the answer involves phrases like "it should be in the warehouse" or "you'd have to ask the analytics team," that's a gap worth understanding before it becomes a project dependency.
This matters because AI initiatives tend to surface ownership questions that nobody has had to answer before. Resolving ownership mid-project is slow. Resolving it before the project starts is a conversation.
2. How consistent is the data over time?
Has the schema changed significantly in the last year? Are there fields whose meaning has shifted without the underlying data being updated? Are there event types that were deprecated, renamed, or never reliably populated?
The question isn't whether the data is perfect. It isn't. The question is whether the inconsistencies are understood and bounded, or whether they're latent surprises. A model trained on five years of data will encode every schema evolution and every undocumented convention into its weights. If you don't know what those are, you won't know where the model's behaviour is coming from.
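One way to make this question answerable rather than rhetorical is to compare an old and a recent sample of the same table and see which fields have appeared, disappeared, or changed in how often they are actually populated. A rough sketch, with hypothetical table contents:

```python
# Rough sketch: the sample data and column names are hypothetical.
# The idea is to compare an old and a recent snapshot of the same table and
# flag fields that have appeared, disappeared, or drifted in how often they
# are populated -- turning latent surprises into a known, bounded list.

from collections import Counter


def field_stats(rows: list[dict]) -> dict[str, float]:
    """Fraction of rows in which each field is present and non-null."""
    counts = Counter()
    for row in rows:
        for key, value in row.items():
            if value is not None:
                counts[key] += 1
    return {key: counts[key] / len(rows) for key in counts}


def schema_drift_report(old_rows: list[dict], new_rows: list[dict], threshold: float = 0.2):
    old_stats, new_stats = field_stats(old_rows), field_stats(new_rows)
    report = {
        "added": sorted(set(new_stats) - set(old_stats)),
        "removed": sorted(set(old_stats) - set(new_stats)),
        "drifted": [],
    }
    for field in set(old_stats) & set(new_stats):
        delta = new_stats[field] - old_stats[field]
        if abs(delta) >= threshold:
            report["drifted"].append((field, round(old_stats[field], 2), round(new_stats[field], 2)))
    return report


if __name__ == "__main__":
    old = [{"user_id": 1, "signup_channel": "web", "region": "eu"}] * 80 + \
          [{"user_id": 2, "signup_channel": None, "region": "eu"}] * 20
    new = [{"user_id": 3, "signup_channel": None, "region": "eu", "utm_source": "ads"}] * 60 + \
          [{"user_id": 4, "signup_channel": "app", "region": "eu", "utm_source": None}] * 40
    print(schema_drift_report(old, new))
```

A schema registry or a data quality framework does this properly; the point is less the tooling than having the drift written down before the model trains on it.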
3. Can you link records across the systems the AI will need to use?
Most useful AI applications require joining signals from more than one source. Recommending next actions requires linking user behaviour to account status. Detecting anomalies requires linking operational data to historical baselines. Summarising customer context requires linking CRM records to support history.
The ability to join these signals reliably — on a consistent key, without significant data loss — is not a given. Many companies discover their entity resolution problem only when they try to build something that requires it.
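A cheap way to test this before committing is to measure join coverage directly: take the two sources the feature would need, join on the candidate key, and report the match rate and what falls through. A sketch, with hypothetical source and key names:

```python
# Illustrative sketch: the two sources and the join key are hypothetical.
# Before committing to a feature that needs both signals, measure how
# reliably records actually link -- match rate and loss, not just "a join exists".

def join_coverage(left: list[dict], right: list[dict], key: str) -> dict:
    """Report how well two datasets link on a shared key."""
    left_keys = {row[key] for row in left if row.get(key) is not None}
    right_keys = {row[key] for row in right if row.get(key) is not None}
    matched = left_keys & right_keys
    return {
        "left_total": len(left_keys),
        "right_total": len(right_keys),
        "matched": len(matched),
        "left_match_rate": round(len(matched) / len(left_keys), 3) if left_keys else 0.0,
        "left_unmatched_sample": sorted(left_keys - matched)[:5],
    }


if __name__ == "__main__":
    crm_records = [{"account_id": f"A{i}"} for i in range(100)]
    support_tickets = [{"account_id": f"A{i}"} for i in range(0, 100, 3)]  # only every third account links
    print(join_coverage(crm_records, support_tickets, key="account_id"))
```

If the match rate comes back at 60%, that's the entity resolution project announcing itself before the AI project starts — which is exactly when you want to hear from it.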
4. Is there a team that can own the data pipeline, or only the model?
AI in production is a data engineering problem as much as it is a machine learning problem. The model is a component. The pipeline that feeds it — data collection, transformation, validation, monitoring — is what actually keeps it running.
If the only team with capacity to work on this is the team building the model, that's a resourcing gap that will surface as a reliability problem later. The question isn't just whether the team can build the AI feature. It's whether the team can maintain the data infrastructure the feature depends on.
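As a rough illustration of what that ownership covers (the stage names, metrics, and thresholds here are hypothetical), the pipeline is a sequence of stages that has to be run, checked, and monitored on every cycle, with the model call as just one of those stages:

```python
# Sketch only: stage names, metrics, and thresholds are hypothetical.
# The model call is one stage; the surrounding collection, transformation,
# validation, and monitoring are what someone has to own long-term.

import logging
from typing import Callable

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")


def run_pipeline(rows: list[dict], stages: list[tuple[str, Callable]]) -> list[dict]:
    """Run each stage, log row counts, and stop if a stage drops too much data."""
    for name, stage in stages:
        before = len(rows)
        rows = stage(rows)
        dropped = before - len(rows)
        log.info("stage=%s in=%d out=%d dropped=%d", name, before, len(rows), dropped)
        if before and dropped / before > 0.1:  # arbitrary threshold: alert on heavy loss
            raise RuntimeError(f"stage {name!r} dropped {dropped}/{before} rows")
    return rows


if __name__ == "__main__":
    collect = lambda rows: rows  # stand-in for pulling from source systems
    transform = lambda rows: [{**r, "events": int(r["events"])} for r in rows]
    validate = lambda rows: [r for r in rows if r["events"] >= 0]
    score = lambda rows: [{**r, "score": min(1.0, r["events"] / 100)} for r in rows]  # stand-in for the model

    data = [{"account_id": i, "events": str(i * 7)} for i in range(20)]
    result = run_pipeline(data, [("collect", collect), ("transform", transform),
                                 ("validate", validate), ("score", score)])
    print(result[:2])
```

Someone has to watch those counts, own the thresholds, and respond when a stage starts dropping rows. If that someone is only the modelling team, the question above hasn't really been answered.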
What to do with the answers
If these four questions surface significant gaps, the implication isn't that the AI initiative should be abandoned. It's that there's prior sequencing work to do.
The order that tends to work: understand the data landscape first, address the most critical gaps (entity resolution, schema documentation, access controls), then build the AI layer on top of a foundation that's actually ready for it. This is slower in the planning phase. It is significantly faster in the execution phase.
The alternative — starting with the model and discovering the data problems as they become blockers — is not faster. It just distributes the same time budget differently, with more of it spent on unplanned work under project pressure.
I've seen both approaches. The sequencing-first version is less exciting to pitch but more reliable to deliver.
If you're planning an AI initiative and are trying to get a clear read on your data readiness, I'm happy to think through it — get in touch.
You might also find the Technology Decisions page useful for adjacent questions about how to frame build-versus-buy choices and platform decisions at different stages.