Lee Goodenough Consulting

Is Digital Transformation Now Exposed to a New Class of Failure? (AI, Agents, and the Invisible Mess)

Introduction

Not so long ago, when platform implementations struggled or outright failed, you could spot the warning signs a mile off. Bad processes, rubbish documentation, delayed go-lives, and plenty of unhappy stakeholders. Fingers were pointed, and usually at the usual suspects: scope creep, miscommunication, technical debt, or plain lack of product understanding. Even if no one wanted to own up, the evidence was all there, in plain sight.

Fast-forward to today, and as AI agents and large language models take the wheel in more and more business processes, a subtler risk is emerging. We’ve been handed the keys to the automation dream: systems that are brilliant at ticking boxes, but less brilliant at telling you when they’ve ticked the wrong ones. What happens when an “intelligent” agent quietly takes a wrong turn? It might deliver a plausible but false answer, invent a data point, or execute a decision nobody notices for weeks, if ever.

In June 2025, Salesforce’s own researchers benchmarked their flagship LLM-powered agents on realistic CRM tasks. The results? Success rates barely above chance for anything complex, with most failures hidden in the details: silent, hard to detect, and largely invisible unless someone goes digging. If even the giants of enterprise tech are airing these limitations, what should the rest of us make of it? (source)

Are we entering an era where digital transformation is defined not by obvious breakdowns, but by subtle, AI-induced drift? Or is this just another chapter in the long-running saga of hidden project risks?

From Obvious Failures to Invisible Risks

The real danger with AI-driven processes isn’t just that failure has become less obvious, it’s that the tools we trust to monitor success are themselves being fooled.

Traditional KPIs, dashboards, and status reports aren’t designed to catch an agent quietly making things up. The system keeps ticking along, the green lights stay on, and by the time anyone suspects a problem, it’s often too late to trace what went wrong.

Why does this happen?

AI agents don’t fail like humans. When a person misreads a spec, they usually leave a trail: a dodgy email, an angry meeting, a status slide full of question marks. An agent, though, can “hallucinate” a perfectly plausible answer, update the CRM, and move on. No complaint, no confession, just silent error propagation.

Worse, these systems are increasingly judged by their own output. If an agent’s updates tick all the reporting boxes, nobody asks if the boxes were the right ones in the first place. In effect, the monitor is marking its own homework.

This isn’t just theory. In the Salesforce benchmark, most failures were not obvious meltdowns, but nuanced errors that went undetected until someone checked the fine print. It’s not hard to imagine these issues compounding over weeks or months, especially in environments where AI recommendations are taken at face value.

Why don’t we spot these issues sooner?

A few reasons:

  • Blind trust in automation: The more we automate, the more we assume things “just work.”
  • Confirmation bias: When the dashboard is green, people see what they want to see. Nobody goes digging unless something breaks publicly.
  • Volume and velocity: AI can generate and process so much data, so quickly, that humans struggle to keep up—let alone investigate edge cases.

None of this means every project is doomed to subtle failure. But it does mean we need to rethink how we define and detect risk in an agent-driven world. The old markers (late delivery, angry users, missed revenue) may be replaced by something much harder to spot: slow drift, hidden inconsistencies, or, worst of all, confidently reported “success” that’s quietly off the mark.

Salesforce’s Moment of Humility

It’s rare to see a tech giant publicly admit their own tools might not be ready for prime time. But in June 2025, Salesforce put its flagship AI agents through their paces, using a set of realistic CRM tasks designed to reflect the kind of work these systems are meant to automate every day.

The results were sobering. On single-step tasks (things that, in theory, should be AI bread and butter) the agents only managed a 58% success rate. For multi-step processes, where the agent needed to follow instructions across several actions, that figure dropped to 35%. Worse, the failures didn’t always come with obvious error messages or failed processes. Most of the time, the agents confidently delivered the wrong answer, filled in the wrong field, or missed the point entirely. 

There’s also the awkward issue of confidentiality. The benchmark showed agents had almost no awareness of when they were handling sensitive information. In an enterprise setting, that’s more than a footnote; it’s a risk vector waiting to be exploited.

What’s most striking isn’t the specific numbers, but what they represent. Salesforce has every reason to show their technology in the best light. If even they are raising a red flag, it’s a clear sign that the risks aren’t hypothetical or limited to dodgy pilot projects, they’re baked into the current reality of deploying agents at scale.

The takeaway? Is the benchmark simply a yardstick to measure others by, or a genuine warning about the reality of the solution today?

What Makes AI Agent Failure Different?

It’s easy to write off AI failures as “just another IT problem”: bad data, dodgy integration, or some forgotten bit of requirements gathering. But agent-driven systems aren’t just faster or more automated versions of what came before. The nature of their mistakes is fundamentally different.

First, there’s the confidence. An old system usually failed with a bang (an error message, a crash, or a workflow grinding to a halt). AI agents, on the other hand, fail with a smile. They don’t just guess, they guess convincingly. The output looks plausible, the task gets marked as complete, and nobody’s the wiser until something doesn’t add up later.

Second, there’s the opacity. Traditional software generally allowed you to trace a bad outcome to a piece of logic or a line of code. With language models and AI agents, tracing an outcome often means untangling a string of probabilistic decisions, buried deep in a black box. Ask for an explanation, and you’ll get something that sounds right, even if it isn’t.

With the pace at which we now operate and the speed of the advances we’re adopting, have we forgotten our ability to challenge the outcome? There are echoes of this in other sectors (finance, medicine, aviation) where black box models have taught everyone the hard way to build in layers of validation and independent oversight. But business software is only just beginning to face these realities at scale.

So yes, in some ways these are familiar risks dressed in new kit. But the combination of speed, opacity, and unfounded confidence makes for a new class of failure. Not louder. Not always bigger. Just much, much harder to see, and less frequent, which makes it all the easier to stop looking. The technology isn’t going away. If we want better results, our mindset needs to shift from passive scepticism to active, structured challenge. That’s not about slowing progress; it’s about making sure progress is real.

Are We Equipped to Triage the Invisible?

It’s one thing to spot a train wreck when the carriages are already off the rails. It’s another to notice a slow, silent drift happening in the background, especially when the system keeps assuring you everything’s on track.

Most organisations still rely on the usual indicators: green dashboards, completed tasks, the odd user complaint. But if AI agents are quietly making the wrong decisions with unwavering confidence, these signals don’t go far enough.

So, are we actually set up to catch these new kinds of failures? In my opinion, not really (although I welcome people explaining exactly what they are doing to prevent it). 

Here’s where I believe most teams are falling short:

1. Skills and Experience

People know how to challenge a spreadsheet or question a bad business process, flow, or even a trigger. But interrogating an AI agent’s output (especially when it “sounds” plausible) requires a different toolkit (I am also guilty of this). You need people who can ask awkward questions, run independent checks, and aren’t intimidated by the jargon or the technology. Right now, the speed of the answer is a gigantic dopamine kick; it’s as if the knowledge you’re thirsty for is thrown straight at you, and when it’s 99.9% correct, then it’s always correct… right?

2. Culture

There’s a tendency to treat automation as infallible, because traditional automation largely was: it followed a defined process and didn’t offer interpretation. That makes it harder to foster a culture where team members feel comfortable challenging outcomes, flagging anomalies, or even suggesting that something might be off. The bigger the hype, the more pressure there is to avoid rocking the boat.

3. Governance and Tools

Most governance frameworks were built for systems that fail noisily, not quietly. Routine audits, exception reporting, and good old-fashioned user feedback are still useful, but they’re not enough. AI observability (being able to trace, validate, and explain agent decisions) needs to become a standard part of the playbook.
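
To make that less abstract, here is a minimal sketch of one way “observability” could start: every agent decision is written to an append-only audit log before its output touches the system of record, so there is always something to trace and review later. This is illustrative only; the helper, the record fields, and the log file are assumptions of mine, not any vendor’s API.

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict

@dataclass
class AgentDecisionRecord:
    """One auditable record per agent action: what it saw, what it did, and why it says it did it."""
    decision_id: str
    timestamp: float
    task: str        # e.g. "update_crm_field" (hypothetical task name)
    inputs: dict     # the context the agent was given
    output: dict     # what the agent actually produced
    rationale: str   # the agent's own explanation (useful context, not proof)
    reviewed: bool = False  # flipped once a human or independent check signs off

def log_agent_decision(task, inputs, output, rationale, log_path="agent_audit.jsonl"):
    """Append the decision to an audit log before the output is applied anywhere."""
    record = AgentDecisionRecord(
        decision_id=str(uuid.uuid4()),
        timestamp=time.time(),
        task=task,
        inputs=inputs,
        output=output,
        rationale=rationale,
    )
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
    return record

# Example: log a hypothetical CRM update before it is written to the system of record.
record = log_agent_decision(
    task="update_crm_field",
    inputs={"account": "ACME-001", "field": "renewal_date"},
    output={"renewal_date": "2026-03-31"},
    rationale="Inferred from the latest contract attachment.",
)
```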

4. Practical Questions for Leaders

  • Are you regularly validating agent-driven outputs against reality, not just internal system logic? (see the sketch after this list)
  • Do you have independent review processes for critical agent decisions?
  • Are your teams empowered (and incentivised) to call out suspect outputs, even if nothing appears broken?
  • When an agent “hallucinates”, how quickly do you find out, and what do you do next?
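
To pick up the first question, here’s a minimal sketch of what checking an agent-driven output “against reality” might look like, building on the hypothetical audit record sketched above. The source_of_truth_lookup function is an assumption: any lookup into a system the agent did not write to (a contract store, a billing platform, and so on).

```python
def independently_verify(record, source_of_truth_lookup):
    """Compare an agent-written output against a system the agent did not write to."""
    discrepancies = {}
    account = record.inputs.get("account")
    for field, agent_value in record.output.items():
        trusted_value = source_of_truth_lookup(account, field)
        if trusted_value is not None and trusted_value != agent_value:
            discrepancies[field] = {"agent": agent_value, "trusted": trusted_value}
    # An empty result means no disagreement was found, not proof that the agent was right.
    return discrepancies
```

Even a check this crude changes the dynamic: the agent’s output stops being the only evidence of its own correctness.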

The bottom line: You can’t just “set and forget” with AI agents. It’s time for transformation leaders to get uncomfortable, ask better questions, and build a habit of structured challenge into the day-to-day. The invisible mess is only invisible if you choose not to look for it.

Conclusion: Progress Is Good—But So Is Scrutiny

It’s easy to get lost in the noise around AI agents, whether it’s the hype about what they’ll revolutionise or the hand-wringing about what they might break. The truth is, the genie is out of the bottle. We’re not going back to a world without agents in our processes, and nor should we want to. The potential for progress is real.

I’ll admit it: I’m just as guilty as anyone of finding the shortest path to the answer. Speed is critical to me—get it done, move on, next problem. But earlier in my career, working for Spirent (a test and measurement company full of brilliant people), I saw how the best technical teams treat speed and accuracy as twin priorities. In mission-critical environments, you don’t just chase the outcome, you continuously test, validate, and re-examine every assumption. You plan for risks before they’re even on the radar.

It’s the same discipline we use in data, infrastructure, and cyber security. We don’t just test the flow of information; we test the output and its impact, over and over again. It’s relentless, but that’s what prevents disaster and keeps trust intact, even as we push boundaries.

Of course, automated data validation tools have their place, flagging missing, inconsistent, or outlier data before it flows downstream. But when it comes to the trickier business of validating whether assumptions and outcomes are actually correct, we’re still largely in human territory. For now, true reliability in an AI-driven world means using the best of both: automation to catch what it can, and relentless human challenge for the rest.
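
As a rough illustration of the kind of checks those tools run, here’s a minimal sketch that flags missing fields, caller-defined inconsistencies, and crude numeric outliers in a batch of records. The function, the rule format, and the z-score threshold are my own assumptions for illustration; real validation tooling goes much further.

```python
import statistics

def basic_data_checks(records, required_fields, consistency_rules=(), numeric_field=None, z_threshold=3.0):
    """Flag missing fields, rule violations, and simple numeric outliers in a batch of rows.

    records           -- list of dicts heading for a downstream system
    required_fields   -- fields that must be present and non-empty
    consistency_rules -- (description, looks_wrong) pairs, where looks_wrong(row) returns True on a bad row
    numeric_field     -- optional field to screen for outliers with a crude z-score
    """
    issues = []

    for i, row in enumerate(records):
        # 1. Missing or empty required fields
        for field in required_fields:
            if not row.get(field):
                issues.append((i, f"missing or empty '{field}'"))
        # 2. Cross-field consistency rules supplied by the caller
        for description, looks_wrong in consistency_rules:
            if looks_wrong(row):
                issues.append((i, f"inconsistent: {description}"))

    # 3. Crude outlier screen on one numeric field, judged against the batch itself
    if numeric_field:
        values = [r[numeric_field] for r in records if isinstance(r.get(numeric_field), (int, float))]
        if len(values) > 2 and statistics.stdev(values) > 0:
            mean, stdev = statistics.mean(values), statistics.stdev(values)
            for i, row in enumerate(records):
                v = row.get(numeric_field)
                if isinstance(v, (int, float)) and abs(v - mean) / stdev > z_threshold:
                    issues.append((i, f"outlier {numeric_field}={v}"))

    return issues

# Example rule: a deal whose close date precedes its open date (hypothetical field names).
rules = [("close_date before open_date",
          lambda r: bool(r.get("close_date") and r.get("open_date") and r["close_date"] < r["open_date"]))]
```

Note what the sketch cannot do: it will happily pass a record that is well-formed but built on an assumption the agent got wrong, which is exactly where human challenge still earns its keep.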

So, to answer the question: yes, digital transformation faces a new class of risk. But the tools for managing it aren’t new: disciplined challenge, validation, and a culture that welcomes uncomfortable questions. That’s what keeps progress real, and trust intact.

This isn’t a call to slam on the brakes. It’s a prompt to keep our heads up as we accelerate. Are we building the skills, the habits, and the culture to spot the invisible mess, not just the obvious blow-ups? Are we taking our own medicine, challenging outputs, seeking second opinions, and ensuring the “facts” in our systems are actually true?

Most importantly: are we doing the work to maintain trust? Not just in the technology, but in the people using it, and in the decisions it helps make. Progress doesn’t come from pretending there are no risks. It comes from being honest about what we know, what we don’t, and what we’re willing to question.

Embrace the agents. But never stop asking if they’re giving you the right answer.
