
The Gap Between AI Spending and AI Proof
Hyperscalers are on track to spend $675 billion on AI infrastructure in 2026, up 63% from the prior year. Virtually every major enterprise in America is buying AI. The question almost none of them can answer is whether it’s working.
The data is stark. MIT found that 95% of AI pilots deliver zero measurable P&L impact. S&P Global found that 42% of companies abandoned most of their AI projects in 2025. IBM put the number of initiatives delivering expected ROI at 25%. Morgan Stanley found that only 21% of S&P 500 companies could cite a measurable AI benefit at all. The gap between AI spending and AI proof is the central tension of the current cycle.
The companies pulling ahead aren’t buying better models. They built three layers underneath the technology before deploying it: measurement that proves whether AI tasks are working, infrastructure that connects those tasks into automated workflows, and strategy that keeps the whole system learning. The layers are nested and sequential. Most companies never built the first one.
The market is already pricing the difference. Companies that score as dual leaders on measurement and infrastructure returned 41.38% over twelve months versus the S&P 500’s 29.40%, a spread of nearly 1,200 basis points. Companies with only one layer trail the benchmark. Citi identified a 30 basis point credit spread penalty for companies spending on AI without evidence of return. The gap is showing up in both equity and debt.
This piece synthesizes twelve original Terminal X research reports analyzing earnings transcripts, 10-K filings, and analyst Q&A across five sectors to map the full picture: where the measurement gaps are, what infrastructure closes them, and which companies are compounding their advantage while the rest retrofit.
Why does most enterprise AI spending produce zero measurable ROI?
Every major enterprise in America is buying AI, but how many of them can tell you whether it’s working? The gap between spending and proof is the central tension of the current AI cycle, and closing it requires something most companies skipped entirely: a measurement layer designed into the system before the technology was deployed.
What separates the companies pulling ahead is the foundation they built underneath the technology. Measurement that proves whether AI tasks are working, infrastructure that connects those tasks into automated workflows, and strategy that keeps the whole system learning. Three layers, each enabling the next, each a precondition for the one above it. The pattern maps to a broader framework that banking advisory analysts have described as task-level AI, automation-level AI, and decisioning-level AI. Most companies never built the first layer, and the layers above it collapsed as a result.
MIT’s 2025 study on AI implementation found that 95% of pilots deliver zero measurable P&L impact. The reason is less about the technology than the process: roughly 80% of the work required to move from pilot to production is data engineering, governance, workflow integration, and measurement infrastructure. Most pilots launch without predefined success criteria, which means there is no way to declare success even if the technology performs exactly as designed.
The early era of enterprise AI adoption was built on usage metrics: how many employees were on the platform, how many hours they logged, which teams had access. Those numbers were easy to collect and satisfying to report. They were also irrelevant to the question that matters, which is whether the AI produced better outcomes than what it replaced.
S&P Global found that 42% of companies abandoned most of their AI projects in 2025, more than double the prior year. IBM’s CEO study put the number of initiatives delivering expected ROI at 25%, with 56% of CEOs reporting zero significant financial benefit. By Q4 2025, Morgan Stanley found that only 21% of S&P 500 companies could cite a measurable AI benefit at all.
Investors noticed before most boards did. Citi identified a 30 basis point credit spread penalty for companies classified as AI “adopters” versus “enablers,” meaning the debt market is already charging a premium for spending without evidence of return. The difference between measuring activity and measuring proof is now priced into the cost of capital.

Hyperscalers are on track to spend $675 billion on AI infrastructure in 2026, up 63% from the prior year, with cumulative investment approaching $3 to $4 trillion by the end of the decade. That capital built the data centers and trained the models. What it has not built, in most cases, is any reliable way to know whether those tools are working in the broader corporate economy.
The finance industry illustrates the gap most clearly, in part because of measurement challenges unique to the sector, addressed later in this article. Bank Director’s 2025 survey of 141 directors at banks under $100 billion found that 82% don’t measure ROI on any technology investment, not just AI. S&P Global’s banking survey revealed that 91% of boards approved AI programs while only 26% had the capability to execute them.
For this article, Terminal X ran a twelve-report analysis using its AI agent to study earnings transcripts, investor presentations, 10-K filings, and analyst Q&A across five sectors: Financial Services, Defense/Aerospace, Healthcare, Manufacturing/Energy, and Enterprise Technology. The methodology scored companies on the distance between their AI rhetoric and their concrete financial KPIs. The measurement gap in banking turned out to be the same gap in defense, healthcare, manufacturing, and enterprise tech. The structure is universal.
The companies pulling ahead didn’t buy better models; they built the foundation to prove what those models produce. And the market is already pricing the difference.
What does it actually mean to measure AI ROI?
Most companies fail at the first layer because they treat measurement as a reporting exercise rather than a design requirement. The failure happens in three predictable stages, and even the companies that appear to be leading are mostly measuring theater.
The first stage is not measuring at all. That 82% figure from Bank Director isn't an outlier. Across sectors, proofs of concept launch without predefined KPIs and produce activity but not evidence. IIF and EY's 2025 survey found that 96% of financial institutions cite noisy, untimely, or inaccurate data as their primary AI challenge. When the data environment can't support basic measurement, everything built on top of it is speculation.
The second stage is measuring the wrong thing. Task-level metrics like hours saved, FTE equivalents avoided, and tokens processed sound rigorous but never connect to the income statement. The Terminal X reports stress-tested the productivity claims major banks made in recent earnings transcripts by tracing them to actual headcount and compensation data in SEC filings. The results were consistent.
JPMorgan Chase (JPM) claimed roughly 10% developer productivity savings. Actual 2025 compensation expense grew approximately 6%, and headcount rose to 318,512. Bank of America (BAC) claimed similar gains and cited 2,000 FTEs avoided while compensation expense climbed $2.9 billion. Morgan Stanley (MS) reported around 20% productivity improvement while comp expense grew 12%. BNY Mellon's (BK) $550 million in AI savings was offset by $500 million in new AI investment. Every bank in the sample told a story about efficiency on the earnings call. The 10-K told a different one.
The "we didn't add headcount" defense doesn't hold up under scrutiny. Most enterprises don't actually reduce headcount after deploying AI. They absorb natural attrition without backfilling, redeploy people to adjacent roles, or simply grow into the capacity without hiring as fast as they otherwise would have. That makes "FTEs avoided" a modeled counterfactual, not a realized saving. It never hits the income statement as a line item, can't be audited against actual payroll data, and relies on assumptions about what hiring would have looked like in a world that didn't happen. It's a reasonable internal planning metric, but it's not proof of ROI. The gap between the productivity narrative and the compensation data in SEC filings exists because these two things are measuring fundamentally different questions.

The same gap between rhetoric and reality appears outside banking. In analysis of earnings transcript language, companies were scored on the distance between AI mention volume and concrete financial KPIs. Salesforce (CRM) made roughly 150 generic AI references in recent transcripts with essentially zero concrete financial metrics attached. The company's headline metric was 2.4 billion "Agentic Work Units" and 19 trillion tokens processed, numbers with zero GAAP impact and no presence in SEC filings.
Intuit (INTU) reported "60 billion ML predictions per day" while categorizing AI costs as "unallocated" in financial reporting. Mastercard (MA), PayPal (PYPL), and Synchrony (SYF) used permanent "early innings" language across multiple quarters despite years of deployment. The pattern is the same regardless of sector: high mention volume, vanity metrics that count activity rather than outcomes, and deflection when analysts ask what it means in dollars.

The third stage is genuine outcome measurement, and almost nobody is there. Only about 5% of enterprises achieve substantial ROI at scale. The companies that reach this stage designed measurement into the workflow before deployment, not as a dashboard bolted on after the fact. What they built, in every case, was some version of the same value model.
The AI Value Model: Seven Steps from Baseline to Proof
1. Current state cost baseline. What does the business spend today on the process AI will improve? Break it into measurable components: labor hours multiplied by fully loaded cost, downtime hours multiplied by cost per hour, defect rate multiplied by rework cost, cycle time multiplied by throughput value. Every process has a countable cost. If it can't be counted, AI can't be measured against it.
2. Predefined KPIs with baselines. Before deployment, define three to five specific metrics that will determine success. Not "improve efficiency" but something like "reduce linehaul freight diversions by 50% within twelve months." Capture the baseline measurement before the AI touches anything. This is where most companies fail first: they deploy without defining what success looks like, so they can never declare it.
3. Projected improvement with stated assumptions. What does the AI change? Express it as a percentage improvement with an explicit basis: "30% reduction in empty miles based on comparable deployments." The assumptions are debatable, but the framework is rigorous because the assumptions are visible.
4. Dollar savings. Baseline cost multiplied by improvement percentage equals annual savings. This is the number that goes in the business case.
5. Implementation cost. License, integration, data preparation, training, change management. The full loaded cost, not just the software price.
6. Payback period. How many months until cumulative savings exceed implementation cost? If the answer is longer than the pilot period, the pilot will be perceived as a failure even if the technology works.
7. Sensitivity analysis. What happens if the improvement is half of what was projected? Does the business case still work? If it doesn't survive a pessimistic scenario, the deployment is a bet, not an investment.
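The seven steps above reduce to a small calculation. This is a minimal sketch, not a production model; the $10 million baseline, 30% improvement, and $1.2 million implementation cost are hypothetical placeholders, not figures from the article.

```python
# Illustrative sketch of the value model (steps 1-7). All inputs are assumptions.

def value_model(baseline_cost, improvement_pct, implementation_cost):
    """Return annual savings, payback in months, and a half-case sensitivity check."""
    annual_savings = baseline_cost * improvement_pct                # step 4: dollar savings
    payback_months = implementation_cost / (annual_savings / 12)    # step 6: payback period
    pessimistic_savings = baseline_cost * (improvement_pct / 2)     # step 7: half the projection
    still_viable = pessimistic_savings > implementation_cost
    return annual_savings, payback_months, still_viable

# Hypothetical deployment: $10M baseline process cost, projected 30%
# improvement, $1.2M fully loaded implementation cost.
savings, payback, survives_half_case = value_model(10_000_000, 0.30, 1_200_000)
print(f"Annual savings: ${savings:,.0f}")             # $3,000,000
print(f"Payback: {payback:.1f} months")               # 4.8 months
print(f"Survives 50% haircut: {survives_half_case}")  # True
```

If the last line comes back False, the deployment is a bet, not an investment, which is the entire point of step seven.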
The Value Model in Practice: XPO Logistics
XPO Logistics (XPO) is the clearest real-world example of this framework in action, because management built what they call a "strict bottom-line attribution framework" that maps AI output directly to the income statement. The company runs a North American LTL (less-than-truckload) freight operation with roughly $900 million in transported freight costs.
The predefined KPIs were specific and countable: linehaul freight diversions, empty miles, and operational efficiency points across three cost categories. Before AI deployment, XPO captured baseline measurements for each. The AI system optimizes routing in real time, and every routing decision flows through the attribution framework so the financial impact is captured automatically.
The results: 80% reduction in total linehaul freight diversions, 12% compression in empty miles, and a 45.5% year-over-year reduction in purchased transportation costs in Q4 2025. Management can trace each efficiency point to a specific dollar figure: $16 million in linehaul savings, $9 million in pickup and delivery savings, and $4 million in dock operations savings per point, totaling $29 million per single efficiency point gained. The operating ratio improved 180 basis points.
That's steps one through four of the value model: baseline cost captured, KPIs defined and measured, improvement achieved and quantified, dollar savings attributed. The reason XPO can do this while most companies can't is that the measurement infrastructure was designed into the system before the AI was deployed. The routing optimization and the attribution framework are the same system. The act of optimizing a route is the act of measuring whether the optimization worked.
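The per-point attribution XPO reports can be checked with back-of-envelope arithmetic. The category figures below are the ones cited above; the dictionary structure is purely illustrative.

```python
# Sanity check of XPO's per-efficiency-point attribution, using the
# figures cited in the text; category names follow the article.
savings_per_point = {
    "linehaul": 16_000_000,
    "pickup_and_delivery": 9_000_000,
    "dock_operations": 4_000_000,
}
total_per_point = sum(savings_per_point.values())
print(f"${total_per_point:,} per efficiency point")  # $29,000,000
```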

The common thread among leaders is that every company in the table above built some version of this value model before deploying AI. Amazon (AMZN) can attribute over $10 billion in incremental sales to Rufus because the attribution framework was part of the product design, isolating AI-assisted shoppers from legacy digital shoppers and measuring conversion lift mathematically. RTX Corp (RTX) defined cycle time, inventory, and production output as KPIs before deploying AI across its CORE operating system, and can now trace a tripling of AMRAAM guidance section output to the same framework. The companies without that foundation are stuck producing activity reports and hoping the board doesn't ask too many questions.
Why is financial services AI ROI the hardest to prove, and why is "replace our labor" the wrong starting goal?
Financial services is where the measurement problem is most acute, and not because the industry is behind: the nature of the work makes ROI fundamentally harder to isolate.
In manufacturing, you count parts inspected, downtime hours, defect rates, and cycle times. RTX cut inventory by 45% and compressed circuit card cycle times by 35%. Deere's (DE) See & Spray technology reduced herbicide usage by targeting only the weeds. The inputs and outputs are physical, and the before-and-after is clean. In healthcare claims processing, Elevance Health (ELV) cut claims denials by 68% through its HealthOS platform. The metric is concrete: fewer denied claims, faster reimbursement, lower cost-to-collect.
Financial services has none of that clarity. The inputs are information and the outputs are decisions whose quality takes years to evaluate.
The core problem is the counterfactual: in markets, success is not measured in a vacuum. Every competitor adopting the same tools degrades your returns without them, which means the real question is how much worse your performance would have been if you hadn't deployed. The spending isn't optional: hedge fund alpha compressed from 3.4% annually between 1994 and 2008 to negative 1.0% between 2009 and 2019, according to CFA Institute and Sullivan data.
Jamie Dimon framed it directly: "Efficiency gains from AI will likely be competed away and passed onto customers rather than permanently adding points to the bank's profit margin." Measuring incremental alpha takes years of data to separate signal from noise, and that's a fundamentally different measurement challenge than counting defect rates on a manufacturing floor.
The signal is emerging, but slowly. Canoe's 2026 data shows the percentage of LSEQ quantitative funds outperforming their benchmark climbing from 30% in 2023 to 37% in 2024 to 52% in 2025. Three data points in sequence over three years. The trajectory suggests AI-driven alpha is real, but it takes multi-year performance data to distinguish it from factor exposure, market regime, or luck.

There's also a category of AI value in financial services that has no pre-existing baseline at all. Ropes & Gray's real estate fund previously reviewed 50 out of 500 leases because human review of all 500 was cost-prohibitive. AI now reviews all 500 in less time than the original 50. The value isn't "faster lease review." It's information the firm structurally could not have had before. Quantifying that requires a model connecting the new capability (reviewing all 500 leases) to a projected financial outcome (fewer missed terms at renewal, better negotiating positions), with stated assumptions that get tested over time.
Why "Replace Our Labor" Is the Wrong Starting Goal
Across all industries, the most common AI ROI framework is headcount replacement: deploy the tool, fire the people, count the salary savings. The appeal is obvious because the math looks simple. But the framework skips three questions that determine whether the math actually works.
First, can AI do the specific, bounded tasks at an error rate the business can tolerate? A manufacturer can tolerate a 2% defect detection miss rate if the alternative is 15% from humans. A small service business with high-value client relationships cannot tolerate appointment mismatches because every error is directly visible.
Second, what does it cost to get it wrong? The error cost analysis is almost always missing from the ROI calculation, and it's the variable that kills the business case most often at small scale, where a misbooked appointment or a compliance violation costs more than the labor savings.
Third, does the infrastructure cost exceed the labor savings at the deployment scale? Replacing a $200,000 analyst at a bank with 500 analysts is a fundamentally different calculation than replacing a $40,000 receptionist at a single location, because the implementation cost for data structuring, workflow design, and monitoring is roughly similar in both cases.
A small service business learned this the hard way after deploying an AI phone system to save a receptionist's salary. The AI could handle conversational flow (the probabilistic part) but couldn't reliably match service requests to the right provider (the deterministic part) because the underlying data wasn't structured. The error rate produced client complaints and missed opportunities that cost more than the salary, and the business rolled it back within months.
The failure came from skipping directly from “AI can do this task” to “fire the person” without the intermediate analysis. That pattern is the micro version of what the banks are doing when they claim 10% productivity savings while compensation expense climbs $2.9 billion. The right starting goal is "add this specific capacity" or "make this workflow produce better outcomes," with the headcount question answered by the data after deployment, not assumed before it.
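The three questions above can be expressed as a single break-even check: labor savings minus infrastructure cost minus expected error cost. Every number in this sketch is a hypothetical placeholder chosen to illustrate the scale asymmetry, not data from the article.

```python
# Hedged sketch of the labor-replacement break-even. All inputs are assumptions.

def replacement_case(labor_savings, infra_cost, error_rate, volume, cost_per_error):
    """Net annual value of a labor-replacement deployment after error costs."""
    expected_error_cost = error_rate * volume * cost_per_error
    return labor_savings - infra_cost - expected_error_cost

# Large scale: 500 analysts' worth of work; infra cost amortizes across volume.
bank = replacement_case(labor_savings=500 * 200_000, infra_cost=5_000_000,
                        error_rate=0.001, volume=1_000_000, cost_per_error=2_000)
# Small scale: one receptionist; every error is directly visible to a client.
shop = replacement_case(labor_savings=40_000, infra_cost=150_000,
                        error_rate=0.03, volume=5_000, cost_per_error=500)
print(f"{bank:,.0f}")  # 93,000,000 -- positive at scale
print(f"{shop:,.0f}")  # -185,000 -- errors plus infra exceed the salary
```

The structure of the formula, not the placeholder values, is the point: the infrastructure cost is roughly fixed while the labor savings and error tolerance scale with the deployment.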
What infrastructure do you need before AI can automate anything, and how does it determine time to value?
Measurement tells you whether individual tasks work. Scaling those tasks into automated workflows across the business requires something different: infrastructure that connects systems, governs data access, and handles two fundamentally different types of computation.
The first type is probabilistic. Large language models generate text, classify intent, extract themes, and reason over unstructured information. They're good at interpretation.
The second type is deterministic. Financial calculations, compliance checks, and data reconciliation require exact precision, where a 99.9% accuracy rate still means a violation every thousand queries. In regulated environments, that's a lawsuit.
As Jon Stefani of OPCO has noted, decades of rational deferrals in banking infrastructure compound the problem. M&A fragmentation left acquiring banks with partially integrated data from acquired institutions. COBOL-based cores still run transaction processing designed in the 1970s. Roughly 70% of US banks outsource core processing to Fiserv, FIS, or Jack Henry, and 41% of banks under $100 billion still manage business-line data in spreadsheets. Each deferral made sense at the time. Collectively, they created data environments where AI can't function reliably even when the measurement methodology exists, which is why the infrastructure layer matters as much as the measurement one.
The infrastructure challenge is routing the right workload to the right engine. The companies getting real ROI from AI automation architected their systems so the LLM reasons over what structured code already computed. Palantir's (PLTR) Foundry platform does this by design: the ontology layer, which is essentially a structured map of all objects and relationships in the business, connects data sources deterministically while AI models run interpretation on top. The companies that let the LLM do the math get hallucinated numbers, lose user trust, kill adoption, and never reach the measurement stage.
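The routing pattern is easy to sketch. This is an illustrative toy, not Palantir's actual API: deterministic code computes the exact number, and the language model is only ever asked to interpret a result that was already computed.

```python
# Illustrative split between deterministic and probabilistic work.
# The `llm` parameter is a hypothetical stand-in for a real model call.

def compute_exposure(positions):
    """Deterministic engine: exact arithmetic, no model involved."""
    return sum(p["qty"] * p["price"] for p in positions)

def narrate(exposure, llm=None):
    """Probabilistic engine: interpretation only; the number arrives precomputed."""
    prompt = f"Summarize for a risk memo: total exposure is ${exposure:,.2f}."
    if llm is None:            # placeholder where a real LLM call would go
        return prompt
    return llm(prompt)

positions = [{"qty": 100, "price": 52.50}, {"qty": 40, "price": 310.00}]
print(narrate(compute_exposure(positions)))
# The model never does the math, so it cannot hallucinate the number.
```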
Infrastructure also determines time to value, meaning how quickly a deployment produces measurable results. A turnkey AI application pre-built for a specific use case compresses time to value from months to weeks because the integration, UX, and model design problems are already solved. A horizontal platform leaves all of those problems open, which is why horizontal deployments take six to twelve months and fail at roughly double the rate.
That gap determines how fast a company can create internal champions: users who experience a real improvement, advocate for expansion to adjacent teams, and generate the organizational momentum that turns a pilot into a production deployment. Companies without infrastructure get stuck in pilot purgatory because they can't prove the pilot worked, because they can't measure it, because the data layer isn't there.
Bank of America (BAC) committed $3 billion to its data foundation between 2014 and 2019, a full decade of investment that predated the generative AI moment. That investment means BAC's AI deployments today run against a unified data layer with governance, lineage tracking, and real-time pipelines already in place. Capital One (COF) is in its thirteenth year of technology transformation and ranked second on the 2025 Evident AI Index. JPMorgan Chase (JPM) allocated $19.8 billion to technology spending with over 1,000 AI use cases running on a unified data platform.
Compare that to the laggard side. Terminal X analysis scored companies on five vulnerability signals: delayed modernization, fragmented data, siloed AI deployments, vendor dependency, and scaling barriers. The companies scoring highest on these signals share a pattern: AI works in isolated pockets but can't scale across the business because the plumbing underneath is fragmented.
C3.ai (AI) is a useful example. The company has real product substance: an agentic platform, the STAFF framework, and 174 initial production deployments in FY25. Its PANDA program for the U.S. Air Force produced verified, measurable results that come up again in the enabler section. But the platform is fragmented across federal, enterprise, and energy segments, and the infrastructure underneath can't support the breadth of what the company is trying to do. That disconnect between product capability and architectural foundation is why C3 shows up as a laggard on infrastructure despite real AI results in narrow use cases.
Outside financial services, the infrastructure gap looks different but produces the same result. Honeywell (HON) manages over 200 legacy IT and OT data sources, with an aerospace spin-off disrupting whatever integration existed. Ford (F) invested billions in AI-driven manufacturing but its data remains siloed across brands and platforms. Walmart (WMT) runs AI experiments at massive scale but legacy data architectures prevent the kind of cross-system automation that would move the needle on margins.

The percentage of S&P 500 companies reporting quantitative AI impacts grew from roughly 10% in Q4 2024 to 21% by Q4 2025. The trend is moving in the right direction, but the absolute number is still remarkably low given the capital deployed. Infrastructure investment is the prerequisite for being able to report quantitative impacts at all. The companies that built their data foundations before the AI hype cycle are the ones measuring ROI now. Everyone else is retrofitting, and the data confirms what that costs in time and returns.
Who is selling enterprise AI, and which delivery model actually helps close the ROI gap?
Most enterprises don't build their AI stack from scratch. They buy it, and the delivery model they choose determines whether they end up with a system that can actually measure its own impact or just another vendor dashboard showing activity metrics.
The basic distinction is between horizontal platforms (raw infrastructure where a company builds its own solution) and turnkey applications (finished products for a specific use case that the company configures rather than builds). The tradeoff is speed versus flexibility, and the data is clear: turnkey deployments achieve 67% production success rates versus 33% for horizontal, because fewer things have to go right when the product is already built.
Horizontal data platforms like Snowflake (SNOW) and Databricks provide the governed data layer where AI runs against enterprise data. Databricks reached a $62 billion valuation with $1.4 billion in AI revenue run rate. These platforms solve the infrastructure problem well, but they push the measurement burden entirely onto the customer. Their native metrics are consumption-based: tokens processed, API calls, seats active. A company running Snowflake knows how much compute it consumed. It doesn't know whether the AI using that compute produced a better decision than what came before.
Vertical integrators like Palantir (PLTR) and C3.ai (AI) promise measurement, infrastructure, and strategy in a single deployment, with 3 to 8x higher ROI over three-year horizons. The tradeoff is flexibility. Vertical solutions carry an estimated six-month shelf life before they risk becoming expensive technical debt, because the architecture is tightly coupled to a specific set of assumptions about how the business operates. When those assumptions change, so does the cost of staying on the platform.
C3.ai (AI) illustrates both the promise and the tension. PANDA is probably the cleanest ROI story in enterprise AI right now because the measurement architecture is built into how the system works. The platform monitors sensor data across the Air Force fleet, generates failure predictions for specific components on specific aircraft, and tracks whether maintenance teams acted on the prediction and whether the component was actually degraded. Every prediction generates a trackable outcome without adding any reporting burden, because the attribution chain captures itself through the existing maintenance workflow.
Each avoided unscheduled maintenance event saves an estimated $47,000 in emergency labor and parts. Across a fleet generating 500 unscheduled events per year, PANDA predicts 40% of them before they happen, roughly $9.4 million in maintenance costs that never hit the budget. The platform costs approximately $3 million per year, meaning this was a 3x return on hard dollars alone. The U.S. Air Force Rapid Sustainment Office raised C3's contract ceiling to $450 million to scale PANDA across the service's fleet based on those results.
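The PANDA arithmetic reconstructs cleanly from the estimates cited above; these are the article's figures, not audited numbers.

```python
# Reconstructing the PANDA ROI math from the estimates in the text.
events_per_year = 500        # unscheduled maintenance events across the fleet
prediction_rate = 0.40       # share of events predicted before they happen
cost_per_event = 47_000      # estimated emergency labor and parts per event
platform_cost = 3_000_000    # approximate annual platform cost

avoided_cost = events_per_year * prediction_rate * cost_per_event
roi_multiple = avoided_cost / platform_cost
print(f"Avoided maintenance cost: ${avoided_cost:,.0f}")  # $9,400,000
print(f"Return multiple: {roi_multiple:.1f}x")            # 3.1x
```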
But C3 tried to sell that same heavy vertical model to commercial enterprises with consumption pricing, capturing the friction of both delivery models without hyperscaler scale. The business result: 46% revenue decline, negative 250% net income margin, and a 26% workforce reduction.
Palantir (PLTR) avoided the same trap by compressing time to value through bootcamps and forward-deployed engineers who embed with clients for days rather than months. PANDA proves vertical AI can produce real, auditable ROI when the workflow generates its own attribution data. The challenge is making that economics work outside a single high-value contract.

The fourth model is the one solving the measurement problem most directly. Domain-specific AI firms operate at the workflow level where every output can be traced to a specific business action. Harvey reached an $11 billion valuation with $190 million in ARR serving 100,000 lawyers across 1,300 organizations. When a lawyer uses Harvey to review a contract, the platform knows which clauses were flagged, which were revised, and what the outcome was. Hebbia automates 90% of manual document synthesis in private equity due diligence, and can trace each synthesized output to the source documents that informed it. AlphaSense provides market intelligence where the query, the sources, and the decision that followed are all linked.
The reason domain-specific firms solve the ROI measurement problem is structural: the workflow is narrow enough that attribution happens automatically. A horizontal platform can tell you how many tokens were processed. A domain-specific tool can tell you that a specific analysis led to a specific investment decision that produced a specific return. That's the difference between counting usage and measuring outcomes.

The market is converging on a hybrid stack where hyperscalers provide compute, horizontal platforms provide governed data, and domain-specific firms deliver the workflow-specific layer where measurement actually happens. The companies closing the ROI gap are the ones with AI embedded deeply enough in their actual work that the act of using it generates the evidence that it's working.
Those are the criteria: native data connectivity across vendors, deterministic computation where precision matters, customization at the workflow level, and measurement that occurs as a byproduct of usage rather than a separate reporting exercise.
What separates companies whose AI gets smarter from those running the same playbook?
Infrastructure lets you automate. Strategy determines whether that automation compounds advantage or just runs the same process faster.
The companies reaching the decisioning layer built systems with two qualities. The first is nimbleness: the architecture survives when the AI cycle shifts underneath it. The shift from chat to agentic workflows to MCP-based orchestration to autonomous systems happened in roughly eighteen months. Companies locked into a single vendor or a single architectural pattern couldn't pivot without rebuilding.
Palantir's (PLTR) Ontology framework is multi-model by design. The ontology is a structured map of all objects and relationships in the customer's business, and it persists regardless of which foundation model runs on top. Customer usage builds the data graph that makes the system more valuable, and the model underneath can be swapped without touching the workflow. Revenue grew 56% with operating margins expanding from 11% to 32%.
JPMorgan's (JPM) multi-model LLM Suite takes a different approach to the same problem. The platform runs different foundation models against the same data layer and switches between them without workflow disruption. When a better model becomes available, JPM adopts it without rebuilding anything.
A company that spent 2023 optimizing workflows around GPT-3.5 faced a choice when GPT-4 arrived months later, when Claude leapfrogged on reasoning, and when agentic tools reshaped developer workflows entirely. If the architecture decoupled the workflow layer from the model layer, the upgrade was a swap. If it didn't, the upgrade meant retraining people, rebuilding prompts, and unwinding the organizational adoption work that took months to complete.
The onboarding itself becomes tech debt when it's coupled to a model generation rather than an abstraction layer. Spending on current systems is fine, and some churn is inevitable, as long as the architecture allows the next generation to slot in without starting the adoption cycle over.
Two Virtuous Cycles, Not One
The second quality is virtuous cycle design, and the full picture is richer than the usual description suggests. Most discussions of AI compounding focus on the data cycle: usage generates more training data, more data improves the model, a better model produces more accurate outputs, and more accurate outputs drive more usage. That cycle is real and it's what companies like Upstart, Visa, and Guardant Health demonstrate.
But there's a second cycle running alongside it that determines whether the data cycle ever reaches scale. Call it the adoption cycle: AI makes an individual user's job measurably easier, that user becomes an internal champion, the champion advocates for expansion to adjacent teams, more users generate more data (feeding the data cycle), the better model produces even easier workflows, and easier workflows create more champions. The data cycle makes the model smarter. The adoption cycle makes the organization willing to use it.
The strategy leaders have both cycles running simultaneously. The laggards have neither. The companies stuck in the middle often have the data cycle working, meaning the model is measurably improving with usage, but not the adoption cycle, because the workflow improvement isn't visible enough at the individual level to create advocates.
A fraud detection system that improves its accuracy from 94% to 96% is meaningful to the business but invisible to the analyst using it. A document synthesis tool that cuts a four-hour task to twenty minutes is impossible to ignore. The second one creates champions. The first one doesn't, even though both produce compounding data value.
The measurement burden on the user is what determines how fast the adoption cycle spins. Historically, proving adoption required explicit structured data capture: clicks, form submissions, time stamps on specific actions. LLM-based analysis of unstructured data has changed this equation. Call transcripts, chat logs, and meeting notes can now be analyzed to extract evidence that the AI is working, without requiring the user to do anything differently.
Structured data is still better when the workflow naturally produces it, and a completed trade is a cleaner signal than a transcript analysis of a phone call. The best systems use both: structured capture where the workflow produces it (XPO's routing, Visa's transaction scoring) and LLM extraction from unstructured data where it doesn't. That dual approach is the design philosophy that enables both virtuous cycles simultaneously, and it means the barrier to creating champions is lower than it has ever been.
Upstart (UPST) is a lending platform that has accumulated 104 million repayment events across more than 2,500 variables. Every loan decision generates an outcome that feeds back into the model. The company's proprietary Upstart Macro Index isolates macroeconomic risk from credit risk, meaning the system can distinguish why a loan performed, not just whether it did. The result: 608 basis points of outperformance over Treasuries, with a dataset that compounds every quarter.
Visa's (V) fraud detection works the same way. Every transaction is scored, the outcome is observed, and the model recalibrates. The system isolated $90 million in fraud in six months, and the attribution data that proves its value is a byproduct of how it works, with no need to create a separate reporting exercise.
Guardant Health's (GH) Infinity AI engine feeds on over 100,000 genomic profiles. Each Shield test improves detection accuracy for the next one, producing a 7% improvement in Stage 1 colorectal cancer detection. Liberty Energy (LBRT) runs continuous ML monitoring on fleet equipment that extended engine life by 27% and compressed maintenance costs by 14%. Deere's (DE) See & Spray creates a physical virtuous cycle where sensor data feeds the model, the model improves spraying precision, and better spraying generates richer sensor data.

Strategy laggards show the inverse pattern: single-vendor lock-in, monolithic deployments, and era-specific architectures that can't adapt. Bank of America (BAC) built Erica on pre-defined NLP rather than LLM architecture, and claims a 30% coding reduction while noninterest expense rose $2.9 billion in 2025. Goldman Sachs (GS) committed to a single-vendor Anthropic dependency for its OneGS 3.0 platform, and while Anthropic's models continue to improve rapidly, concentrating an entire AI strategy on one provider carries execution risk if the competitive dynamics shift. Headcount at GS climbed 2% despite $250 million in severance charges. Microsoft (MSFT) tied $250 billion to OpenAI through 2030 while 365 Copilot sits at 10% seat utilization.
The market hasn't learned to price strategy maturity yet. The data confirms it: among companies that appeared on both the leader and laggard strategy lists, returns ranged from Caterpillar (CAT) at positive 173% to UnitedHealth (UNH) at negative 49%. The signal is too noisy for the market to price consistently. But the companies building nimble, self-reinforcing systems now are positioning for the repricing, because both virtuous cycles compound every quarter the system runs.
Is the stock market already pricing the AI measurement gap?
The first two layers, measurement and infrastructure, are already showing up in stock returns. Terminal X correlated the measurement and infrastructure maturity scores with financial performance across the companies in the analysis, and the relationship is clean.
Companies scoring as dual leaders on both measurement and infrastructure returned 41.38% over twelve months versus the S&P 500's 29.40%, a spread of nearly 1,200 basis points. Over twenty-four months annualized, the gap widened to 1,487 basis points. Revenue growth for dual leaders ran at 17.49% versus 13.40% for the benchmark, with operating margins expanding 120 basis points year-over-year against the index's 40.
The tier structure beneath the dual leaders is where the thesis sharpens. Companies with strong measurement but weak infrastructure returned just 8.14%, trailing the S&P by more than 2,100 basis points. Companies with strong infrastructure but weak measurement returned 18.52%, still trailing the benchmark by roughly 1,100 basis points.
That data point is the article's central empirical claim. Measurement alone doesn't produce returns. Infrastructure alone doesn't produce returns. The layers are nested and sequential, and the market is pricing them that way. The relationship emerged from the data rather than being imposed on it.
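The basis point spreads cited above follow directly from the stated returns; a minimal sketch of the conversion (1 basis point = 0.01 percentage point), using the figures from this section:

```python
def spread_bps(portfolio_return_pct: float, benchmark_return_pct: float) -> int:
    """Spread over the benchmark in basis points (1 bp = 0.01 percentage point)."""
    return round((portfolio_return_pct - benchmark_return_pct) * 100)

benchmark = 29.40  # S&P 500 twelve-month return (%)

print(spread_bps(41.38, benchmark))  # dual leaders: +1198 bps ("nearly 1,200")
print(spread_bps(8.14, benchmark))   # measurement only: -2126 bps
print(spread_bps(18.52, benchmark))  # infrastructure only: -1088 bps
```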

The laggard side confirms the penalty. Salesforce (CRM) declined 37.73%. Workday (WDAY) fell 47.62% despite 12.5% subscriber revenue growth and $100 million in new AI annual contract value. HubSpot (HUBS) dropped 52% while still growing revenue 18% in constant currency. The market is punishing them for spending on AI without building the foundation to prove what it does, even when the underlying business still looks reasonable.

BCG and IBM reached the same conclusion independently. Their research found that AI leaders achieve 3.6x three-year total shareholder return, 1.7x revenue growth, 2.7x return on invested capital, and a 1.6x EBIT margin advantage. That external studies using their own datasets land on the same hierarchical finding strengthens both results.

Goldman Sachs' HALO framework (Heavy Assets, Low Obsolescence) explains the sector hierarchy in the data. Industrial companies like those in transportation, aerospace, and manufacturing show the purest linear correlation between AI maturity and financial performance because physical workflows provide natural measurement architecture: you can count parts, downtime, and defect rates without building a separate measurement system. Financial services has the opposite problem: its workflows produce no such natural measurement architecture, so attribution has to be designed in deliberately. Enterprise tech is the most volatile because a single architectural shift can reshape the competitive picture in months.
Three companies in the dual-leader tier posted negative returns despite strong AI maturity scores. Microsoft (MSFT) at negative 2.75% reflects investor anxiety over $161 billion in infrastructure capex. UnitedHealth (UNH) at negative 49% was driven by Medicare V28 coding headwinds and a DOJ investigation. JPMorgan (JPM) at negative 3.83% tracks to rate sensitivity. Every underperformer in the top tier has an identifiable cause that has nothing to do with AI execution, which is what makes the AI maturity signal credible. If the exceptions couldn't be explained, the correlation might be suspect; because each one can, the signal holds.
S&P Global estimates companies have three to five years before the gap between AI leaders and laggards becomes permanent. KPMG's data suggests investors are operating on a much shorter clock. The companies still building their measurement foundation are running out of room to catch up.
How should investors and operators respond to the AI ROI gap?
Measurement enables tasks, infrastructure enables automation, and strategy enables decisioning. The three layers are nested and sequential, and most companies are still stuck at the first one, buying task-level AI and expecting decisioning-level results. The market has already started separating winners from losers on the first two layers, and the third is the forward-looking bet.
For investors, the screening tool is earnings transcript language. Green flags include named KPIs with before-and-after figures, described attribution methodology, dollar impact with stated assumptions, and explicit connection to financial targets. Red flags include "early innings" language after two or more years of deployment, vanity metrics substituted for financial impact, AI costs buried in "unallocated" line items, and deflection when analysts press on ROI.
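The flag checklist above can be operationalized as a crude first-pass filter. This is a hypothetical sketch, not the Terminal X methodology; the phrase lists are illustrative stand-ins for a real lexicon, and any serious screen would need context-aware analysis rather than substring matching:

```python
# Illustrative phrase lists only; a production screen would use a validated
# lexicon and context-aware parsing, not bare substring matches.
GREEN_FLAGS = ["basis points", "attribution methodology", "dollar impact",
               "financial target", "before and after"]
RED_FLAGS = ["early innings", "unallocated", "too early to quantify"]

def flag_score(transcript: str) -> int:
    """Net score: +1 per green-flag phrase present, -1 per red flag."""
    text = transcript.lower()
    green = sum(phrase in text for phrase in GREEN_FLAGS)
    red = sum(phrase in text for phrase in RED_FLAGS)
    return green - red

print(flag_score("AI routing cut cost per shipment, a dollar impact tied to our financial target."))  # 2
print(flag_score("We're in the early innings of our AI journey."))  # -1
```

A positive score flags a transcript for closer reading; it proves nothing on its own, which mirrors the article's point that the language is a screening tool, not the evidence itself.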
The 30 basis point credit spread penalty that Citi identified is early pricing. As the market matures and more companies report quantitative AI impacts, the penalty for spending without proof will widen. The companies without measurement infrastructure will pay an increasing premium for capital.
For operators, the three layers map to a concrete playbook, and the seven-step value model from Section 2 is the starting point. First, define interim KPIs before deployment, even if those KPIs don't exist yet. Creating new metrics is often necessary because the capability itself is new: Ropes & Gray couldn't measure the value of reviewing all 500 leases until AI made it possible, and the Air Force couldn't measure predicted maintenance events until PANDA existed.
What matters is that the intent is top-down from the beginning, deploying AI to add a specific capacity or improve a specific workflow, with a clear theory of how that connects to a financial outcome. The failure mode is deploying because the board said to and scrambling to justify it after the fact.
Second, design systems so usage generates attribution data as a byproduct rather than requiring a separate reporting exercise. PANDA, XPO, and Visa all do this because the measurement infrastructure was built before the AI was deployed. The value model structure, from baseline cost through sensitivity analysis, should be part of the system from day one. Third, build bridge models connecting interim metrics to projected financial outcomes with stated, testable assumptions that get validated quarterly. The goal is a system where AI dynamically pulls the data people need to make good decisions efficiently across automated workflows, and where using it generates the evidence that it works.
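A bridge model of this kind can be as simple as a few stated assumptions multiplied through, with sensitivity bounds to be validated quarterly. The inputs below (task volume, hours saved, loaded rate, adoption) are invented for illustration; none of these figures come from the article:

```python
# All inputs are hypothetical assumptions, stated explicitly so they can be
# tested against actuals each quarter.
def bridge_value(tasks_per_year: int, hours_saved_per_task: float,
                 loaded_hourly_rate: float, adoption_rate: float) -> float:
    """Project annual dollar impact from an interim KPI (hours saved per task)."""
    return tasks_per_year * hours_saved_per_task * loaded_hourly_rate * adoption_rate

base = bridge_value(tasks_per_year=5000, hours_saved_per_task=3.7,
                    loaded_hourly_rate=85.0, adoption_rate=0.6)
# Sensitivity: vary the weakest assumption (adoption) to bracket the estimate.
low = bridge_value(5000, 3.7, 85.0, 0.4)
high = bridge_value(5000, 3.7, 85.0, 0.8)
print(f"base ${base:,.0f}, range ${low:,.0f} to ${high:,.0f}")
```

The point of the structure is that every term is a testable claim: if measured adoption comes in at 0.4 instead of 0.6, the projection reprices mechanically instead of being re-argued.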
The companies that built measurement into their AI infrastructure from the beginning are compounding their advantage every quarter the system runs. The measurement gap is the bottleneck, not the technology, and the math has already changed. The spending hasn't caught up yet.
Methodology
This article synthesizes twelve original Terminal X research reports analyzing earnings transcripts (last 2–3 quarters), investor presentations, 10-K filings, and analyst Q&A across five sectors: Financial Services, Defense/Aerospace, Healthcare, Manufacturing/Energy, and Enterprise Technology. The methodology is comparative bifurcation: systematically dividing companies into Leaders and Laggards on specific dimensions (measurement rigor, infrastructure readiness, strategy maturity) to expose execution gaps. External data points are sourced from MIT, S&P Global, Gartner, IBM, BCG, Capgemini, Google Cloud, Morgan Stanley, KPMG, Citi, Bank Director, IIF/EY, CFA Institute, Canoe/LSEQ, Deloitte, Goldman Sachs, and Evident AI. All company-specific financial figures are derived from public filings and earnings transcripts.
Disclaimer: This article is for informational purposes only and does not constitute investment advice. Past performance does not guarantee future results. The analysis reflects data available as of April 2026.
Want to build your own thesis? Launch Terminal X