The AI Demo Looked Great. Then We Tried to Put It in Production.
I have spent the last few months building an AI document intelligence platform for a customer in a regulated industry: a production system with audit trails, tenant isolation, and data that absolutely cannot leak between clients. The first problem I hit was one that anyone who has used ChatGPT or similar tools will recognise: the outputs varied noticeably from run to run, with the same query returning different answers. I assumed the model was the issue. It was not.
The root cause turned out to be a single API setting: the temperature parameter. It controls how much randomness the model introduces into its responses. Left at the default, the model behaves creatively: it explores different phrasings, offers varied perspectives, and generally acts like it is having a conversation. That is fine for a chatbot, but it is entirely wrong when a compliance team needs the same document analysed the same way every time. Lowering the temperature produced output that was far closer to deterministic: consistent, predictable, and much nearer to what you would expect from a human reviewing the same document twice. It was not a complicated fix, and that is precisely the point.
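As a concrete illustration, here is a minimal sketch using the OpenAI Python client; other providers expose an equivalent knob under a similar name. The model name and prompt are placeholders, and note that a low temperature (plus a seed, where supported) reduces variance rather than guaranteeing bit-identical output:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyse_document(document_text: str) -> str:
    """Run the same analysis prompt with sampling randomness dialled down."""
    response = client.chat.completions.create(
        model="gpt-4o",     # placeholder; pin whatever model you actually deploy
        temperature=0,      # minimise sampling randomness
        seed=42,            # best-effort reproducibility, where the API supports it
        messages=[
            {"role": "system", "content": "Extract the key obligations from the document."},
            {"role": "user", "content": document_text},
        ],
    )
    return response.choices[0].message.content
```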
Most organisations experience AI through demos and slides; the model is impressive, the possibilities feel endless, and the board gets excited. Between a compelling demo and a system that can run in production, though, there is a gap that is rarely discussed honestly, and that gap has nothing to do with the AI model itself. It is about everything around it:

- Tenant isolation. One client's data must never bleed into another client's responses. This is table stakes in regulated industries, but surprisingly easy to get wrong with shared model contexts (see the first sketch below).
- Audit trails. Every query, every response, and every decision point must be logged and traceable, because regulators do not accept "the AI said so" as an answer (see the second sketch below).
- Environment configuration. Temperature settings, token limits, retry logic, timeout handling, rate limiting: each setting sounds trivial, and each one can undermine trust if misconfigured.
- Security boundaries. API keys, network segmentation, data encryption at rest and in transit, and access controls, all made more complex when the model is hosted externally.
- Error handling. Production systems need graceful degradation when the model returns nonsense, hallucinates, or times out mid-response.
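On tenant isolation, the pattern that has served me well is to make the tenant filter impossible to omit: it lives inside the retrieval function, never with the caller. A minimal sketch, assuming a generic vector-store interface; the `store.query` signature and filter syntax are illustrative, not a specific library's API:

```python
def retrieve_for_tenant(store, tenant_id: str, query_embedding: list[float], top_k: int = 5):
    """Retrieve context for a query, hard-scoped to a single tenant.

    The tenant filter is applied inside the data-access layer, so no calling
    code path can reach documents belonging to another client.
    """
    if not tenant_id:
        raise ValueError("tenant_id is required; refusing to run an unscoped query")
    return store.query(
        vector=query_embedding,
        top_k=top_k,
        filter={"tenant_id": {"$eq": tenant_id}},  # the isolation boundary
    )
```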
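On audit trails, the useful discipline is to log the full decision context as structured, append-only records: who asked, what was asked, what was answered, and under which configuration. A minimal sketch using Python's standard logging; the field names are my own convention, not a standard:

```python
import json
import logging
import uuid
from datetime import datetime, timezone

audit_log = logging.getLogger("audit")
audit_log.setLevel(logging.INFO)
audit_log.addHandler(logging.FileHandler("audit.jsonl"))  # one JSON record per line

def record_audit_event(tenant_id: str, user_id: str, query: str,
                       response: str, model: str, temperature: float) -> str:
    """Write one traceable audit record per model call; returns its ID."""
    event_id = str(uuid.uuid4())
    audit_log.info(json.dumps({
        "event_id": event_id,        # stable reference a regulator can cite
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tenant_id": tenant_id,
        "user_id": user_id,
        "query": query,
        "response": response,
        "model": model,              # the exact model version used
        "temperature": temperature,  # the configuration in force at the time
    }))
    return event_id
```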
If you are a CEO or board member evaluating AI investments, this is worth understanding. The model is typically the easiest part of the project, and the vendors will tell you it is plug and play. The real work and the real cost sit in making the system production-grade: infrastructure that scales without surprising you with a bill, security that satisfies your compliance team and not just your engineers, monitoring that tells you when something is wrong before your customers do, and configuration that has been tested, documented, and reviewed rather than left at defaults. I have seen organisations spend months on model selection and days on production readiness, and the ratio should be inverted.
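On "configuration that has been tested, documented, and reviewed": the simplest step is to pull inference settings out of ad-hoc call sites into a single object that is checked into version control and code-reviewed like anything else. A minimal sketch; the specific values are illustrative defaults, not recommendations:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InferenceConfig:
    """Production inference settings: version-controlled, reviewed, documented."""
    model: str = "gpt-4o"         # placeholder; pin the exact deployed model
    temperature: float = 0.0      # deterministic-leaning output for compliance work
    max_tokens: int = 1024        # bound cost and latency per request
    timeout_seconds: float = 30.0 # fail fast instead of hanging a request
    max_retries: int = 2          # bounded retry on transient errors

PRODUCTION = InferenceConfig()    # one reviewed source of truth, not scattered defaults
```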
If you are running AI in production or planning to, these are the questions worth asking:

- Have you reviewed and tuned the temperature and other inference parameters for your specific use case?
- Is there complete tenant isolation if you are serving multiple clients?
- Do you have audit logs that a regulator could review?
- What happens when the model returns an incorrect or nonsensical response? (One answer is sketched after this list.)
- Who owns the production configuration, and how is it version-controlled?
- Have you load-tested the system under realistic conditions?

If any of those are unanswered, your AI system is a demo running in production, and that is a risk rather than a feature.
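On the nonsensical-response question, the production answer is usually a wrapper that enforces timeouts, retries transient failures a bounded number of times, validates the output, and degrades explicitly rather than passing garbage downstream. A minimal sketch; `call_model` and `looks_valid` are stand-ins for your own client call and your own domain-specific validation:

```python
import time

class ModelUnavailable(Exception):
    """Raised when the model cannot produce a usable answer."""

def call_with_fallback(call_model, looks_valid, prompt: str,
                       max_retries: int = 2, backoff_seconds: float = 1.0) -> str:
    """Retry transient failures, validate output, and fail explicitly.

    call_model:  your provider client call (prompt -> text); may raise on timeout.
    looks_valid: a domain-specific check, e.g. "the answer cites a real section".
    """
    for attempt in range(max_retries + 1):
        try:
            response = call_model(prompt)
            if looks_valid(response):
                return response
            # Nonsense that parses is still nonsense: treat it as a failure.
        except Exception:
            pass  # transient error: timeout, rate limit, connection reset
        if attempt < max_retries:
            time.sleep(backoff_seconds * (2 ** attempt))  # exponential backoff
    # Degrade gracefully: surface an explicit error the caller can handle,
    # rather than silently returning an unvalidated answer.
    raise ModelUnavailable("model did not return a valid response; please retry later")
```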
I am spending more of my time on this kind of work now, making AI systems production-ready rather than leaving them as demos and slides, and I find it genuinely rewarding. It is less glamorous than model selection and prompt engineering, but it is where the real value is created and where the real risks are managed. The AI model is the tip of the iceberg; the infrastructure, security, and configuration beneath the surface are what determine whether the system earns trust or erodes it.