Accuracy Is Not Consistency: Why Enterprise AI Needs Deterministic Guardrails


Evidence from 242 EU261 compensation runs with ChatGPT 4.1, ChatGPT 5.2 + Predictika LVE

 

Sanjay Mittal and Awhan Patnaik

 

Predictika.ai

 

Executive Summary

 

We evaluated ChatGPT 4.1 (and later ChatGPT 5.2; see section 8) on a logic-heavy EU261 travel-delay compensation workflow using 22 distinct test cases executed 11 times each (242 total runs). The LLM’s raw accuracy was 51.65% (125/242), but when guided by Predictika’s Logic Validation Engine (LVE) the user saw a 100% correct answer regardless of how much the LLM hallucinated.

 

Run-to-run consistency—measured by comparing each run’s compensation value to a fixed baseline run within the same test case—ranged from 74.79% to 80.99% depending on the baseline. However, only 8 of 22 test cases were fully consistent (all 11 runs identical). Of these, 5 were consistently correct and 3 were consistently wrong, demonstrating that stability does not imply correctness. The truer measure of consistency is the percentage of test cases that gave the same answer on every run; by this measure, consistency was only 36.4% (8/22). The earlier 75–81% figures measure only local, run-level consistency.

 

Tool calls to LVE succeeded in 98.35% of runs, and the tool arguments matched in 98.35%. Conditioned on a successful tool call, Predictika-guided outputs were 100% correct—turning a ~52% raw LLM accuracy into deterministic user-visible correctness.

 

Experimental Setup

 

We evaluated ChatGPT 4.1 on 22 EU261 travel-delay compensation scenarios (242 total runs, 11 per scenario). Each run produced:

 

llm-comp: compensation computed by the LLM

 

lve-comp: compensation computed by Predictika LVE (treated as ground truth)

 

Additional instrumentation captured:

 

Whether the LLM successfully invoked LVE via tool calling

 

Whether the tool arguments were correctly populated

 

Run-to-run consistency relative to multiple baselines

 

Whether an entire test case was fully consistent (all 11 runs identical)

 

This allowed us to independently measure:

 

Accuracy – agreement with deterministic ground truth

 

Local consistency – agreement across runs within a test case

 

Global consistency – percentage of runs matching a chosen baseline

 

Full consistency – whether every run for a test case matched

 

Quantitative Results

 

Metric                                               Value
Total runs                                           242
Distinct test cases                                  22
LLM-accurate runs                                    125
LLM accuracy                                         51.65%
Tool-call success rate                               98.35%
Tool-argument match rate                             98.35%
Predictika-guided accuracy (given tool call)         100%
Consistency vs baseline run = -1                     77.27%
Consistency vs baseline run = 2                      75.21%
Consistency vs baseline run = 3                      80.99%
Consistency vs baseline run = 4                      80.99%
Consistency vs baseline run = 6                      74.79%
Fully consistent test cases (all 11 runs identical)  8
Full consistency                                     36.4% (8/22)
Fully consistent AND accurate                        5
Fully consistent BUT wrong                           3

 

Accuracy and Consistency Are Orthogonal

 

A central finding is that accuracy and consistency are not interchangeable. Accuracy measures agreement with deterministic ground truth; consistency measures whether the model produces the same answer when asked the same question repeatedly. In this study, local consistency is materially higher than accuracy, while full consistency is significantly lower. In any case, consistency does not imply reliability: an LLM can be consistently wrong, inconsistently wrong, inconsistently right, or consistently right. Only consistently-right behavior is operationally safe, and guiding the LLM with Predictika LVE was the only configuration that produced it every time.
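The four regimes can be made concrete in a few lines of code. This is an illustrative sketch; the function and label names are ours.

```python
def classify(truth, runs):
    """Assign a test case to one of the four reliability regimes,
    based on its repeated runs versus deterministic ground truth."""
    stable = len(set(runs)) == 1  # every run gave the same answer
    some_right = truth in runs    # at least one run was correct
    if stable:
        return "consistently right" if some_right else "consistently wrong"
    return "inconsistently right" if some_right else "inconsistently wrong"
```

Only the first regime is operationally safe; the second is arguably the most dangerous, because stability masquerades as reliability.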

 

The fully-consistent subset illustrates the danger of equating stability with correctness. Only 5 of the 8 fully consistent cases were answered correctly by the LLM; 3 were consistently wrong. Conversely, many correct answers were not stable across runs, meaning a user could receive a correct answer in one interaction and an incorrect answer in the next. This is a serious problem in any domain, but especially in high-volume consumer applications where the same question is asked many times and in varied ways.

 

Hallucination Is Bidirectional

 

Hallucination is often described as fabricating incorrect facts. The run-to-run variability observed here shows a broader phenomenon: bidirectional drift. The model does not simply deviate from truth; it also deviates from its own previous outputs. As a result, hallucinations occur in both directions—moving away from a correct answer and sometimes moving back toward it on subsequent runs.

 

This dataset quantifies that instability: depending on baseline choice, only 75%–81% of runs match the baseline run within the same test case. This makes consistency a practical proxy for stochastic unreliability. Importantly, it captures instability in both incorrect and correct regimes, which accuracy alone cannot reveal.

 

Deterministic Co-Piloting Eliminates User-Visible Error

 

Introducing Predictika LVE as a deterministic co-pilot changes the system outcome. In the guided architecture, the LLM handles natural language interaction while LVE executes EU261 rules deterministically and returns the compensation decision.

 

In our experiment, tool invocation succeeded in 98.35% of runs. Conditioned on successful tool calls, the delivered output was 100% correct. This demonstrates the architectural principle that probabilistic models should generate language, but deterministic engines should own decisions—especially when correctness is defined by regulations, contracts, or policies.
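This division of labor can be sketched in a few lines. The rule function below is a deliberately simplified EU261-style distance/delay table built on the well-known €250/€400/€600 bands, not the full LVE rule set; the function names, the `extraordinary` flag, and the error-handling policy are our own illustrative assumptions.

```python
def lve_compensation(distance_km, delay_hours, extraordinary=False):
    """Simplified EU261-style rule table (illustrative; a real engine
    encodes many more conditions, e.g. rerouting and partial awards)."""
    if extraordinary or delay_hours < 3:
        return 0
    if distance_km <= 1500:
        return 250
    if distance_km <= 3500:
        return 400
    return 600

def decide(tool_call_args):
    """The LLM extracts the facts and populates tool_call_args; the
    deterministic engine, not the model, owns the decision."""
    if tool_call_args is None:  # tool call failed (~1.65% of runs here)
        raise RuntimeError("tool call missing: escalate instead of guessing")
    return lve_compensation(**tool_call_args)
```

The key design choice is that when the tool call is missing, the system escalates rather than letting the LLM improvise a number—so a probabilistic failure degrades to a handled error, never to a wrong compensation amount.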

 

Failure to make a tool call consistently is another kind of hallucination we have encountered in our work, including these EU261 experiments. Fortunately, the error rate was under 2%, but even that can be too high for high-volume applications—and EU261 is certainly high volume, with an estimated 100M passengers dealing with flight delays in 2024 alone. We do not yet have an answer for how to get 100% consistency in invoking the LVE from an LLM, though as we see later, ChatGPT 5.2 achieved 100% tool-call invocation. So ChatGPT 5.2 may be more consistent in invoking the LVE. Time will tell.

 

What This Adds to the Research on Consistency

 

Research has shown that LLM reasoning can vary substantially with sampling randomness, prompt phrasing, and decoding parameters, even when the underlying task is deterministic. Techniques such as self-consistency—sampling multiple reasoning paths and voting—can improve average performance, but they do not guarantee correctness for any single interaction.

 

Our results provide a production-style measurement of that phenomenon in a regulated decision domain. They show that stable answers can still be wrong, correct answers can be unstable, and tool-calling introduces an additional reliability surface that must be engineered and monitored. Together, these motivate deterministic validation as a first-class component of enterprise AI agents.

 

ChatGPT 4.1 vs ChatGPT 5.2: Accuracy Improved, Reliability Still Missing

 

Using the same 22 EU261 scenarios and identical instrumentation, we repeated the experiment with ChatGPT 5.2 while continuing to use Predictika LVE as deterministic ground truth. This provides a direct comparison with ChatGPT 4.1 under controlled conditions.

 

ChatGPT 5.2 increased raw accuracy from 51.65% to approximately 70%. Best-case run-to-run local consistency improved more modestly, from roughly 81% to about 87%. The number of fully consistent test cases rose from 8 to 14 out of 22.

 

These gains are meaningful, but they highlight a fundamental limitation. Even with the latest frontier reasoning model, roughly 30% of cases remain incorrect, and about 36% of test cases still yield different answers across runs: only 14 of 22 were fully consistent, so 8 of 22 (36%) were inconsistent from one run to the next. This 36% inconsistency is a better measure than the raw 87% local-consistency figure reported earlier, because what matters for consistency is whether the same query produces the same result on every run. Model upgrades improve averages, but they do not provide guarantees.

 

To understand what this means in real-world terms, consider passenger volumes in 2024. Industry analyses estimate that roughly 30% of all European air travelers experienced delays or cancellations during the year—corresponding to approximately 287 million disrupted passengers across EU, EEA, UK, and partner jurisdictions following EU261-style protections. Separately, nearly 218,000 departures were delayed by more than three hours or cancelled outright, the categories most likely to trigger EU261 compensation. The numbers for 2025 are markedly higher.

 

Viewed at this scale, a residual 30% LLM error rate is no longer an abstract metric. Applied to actual passenger volumes, probabilistic AI alone could produce tens of millions (possibly over 100M) of incorrect compensation determinations annually. It gets worse when you consider the 36% inconsistency: the same user, asking the same query at different times, will get different answers, confusing the user and eroding confidence in the system.

 

The combined evidence from the ChatGPT 5.2 experiment and 2024 passenger disruption data leads to a clear conclusion: newer models materially improve average accuracy, but they do not eliminate stochastic instability or guarantee correctness at scale. In regulated domains, this gap is decisive. Predictika LVE closes it by converting probabilistic language systems into deterministic decision systems—transforming model progress into operational trust.

 

Implications and Conclusion

 

For enterprise AI, accuracy alone is insufficient as an evaluation metric. Consistency provides an orthogonal lens into stochastic reliability, but it is not a substitute for correctness. The combination of low raw accuracy and imperfect stability means that a purely LLM-based agent cannot be trusted to make regulatory or contractual decisions.

 

A hybrid approach resolves this: LLMs provide flexible natural language interfaces, while deterministic engines validate or decide outcomes based on formal rules. Predictika LVE operationalizes this approach today by enforcing business logic deterministically, producing auditable outcomes and eliminating user-visible decision errors when invoked.

 

Further reading

 

Logic Engines to Guide LLMs for Accurate and Reliable Reasoning — Predictika White Papers (2025)

 

https://predictika.com/assets/doc/Guiding_AI_agents_to_reason_correctly.html

 

https://predictika.com/assets/doc/Logic%20_engine_to_guide_LLM.html

 

https://predictika.com/assets/doc/Where_is_AGI.html

 

EU Travel Compensation Bot

 

U.S. Patents in Conversational AI, Symbolic Reasoning, and LLM Governance (granted 2023–2024)

 



Copyright © Predictika Inc. 2026