May 2026 AI

The SME’s Guide to Running AI Cheaply

Most of what gets written about AI assumes you have an enterprise budget, an enterprise team, and an enterprise appetite for platform fees. This is for the SME owner or CTO, particularly those of us operating in Ireland and across Europe, who wants to use AI meaningfully without signing a six-figure annual contract. The pricing landscape has shifted dramatically over the past twelve months, and the gap between what AI costs at the top end and what it costs if you are willing to do some of the work yourself has never been wider. The good news is that May 2026 is a genuinely good time to be running AI on a budget; the bad news is that the cheapest option is not always the best option, and the hidden costs will find you if you do not plan for them.

Industry-wide, AI inference prices have dropped roughly 80% between 2025 and 2026, and the spread across providers is now enormous. The headline development of the year is Kimi K2.6 from Moonshot AI, an open-weight model that ties GPT-5.5 on the standard coding benchmark while pricing at roughly a fifth of the closed-source equivalents; DeepSeek V3 and V4 sit just below it on capability and well below it on price, and Anthropic's Claude Haiku 4.5 remains the easy default for general business tasks that are not coding-heavy. If you self-host an open-weight model on your own hardware the cost is essentially electricity, three orders of magnitude below the cheapest API, which matters when you are processing real volume. Prompt caching cuts another 50 to 90 percent off repetitive workloads where the same system prompt goes out with every request, which describes most customer support, document processing, and categorisation pipelines. And the free tiers from Google AI Studio, Groq, and Cerebras are now generous enough that low-volume production workloads can run at no cost if you design around the rate limits.

The single biggest cost lever most SMEs do not know about is model routing. The idea is simple: not every query needs the most capable model. A customer asking for your opening hours does not need the same model that is analysing a legal contract. If you route 70% of queries to a budget model like Haiku 4.5 or DeepSeek V3, 20% to a strong mid-tier like Kimi K2.6 or Qwen3-Coder-Next, and 10% to premium for the hardest cases, the blended cost drops 60 to 80 percent compared to running everything through the top-tier model. This is how the larger companies run AI at scale; they just do not talk about it because it is not glamorous. The implementation is straightforward: a lightweight classifier, which can itself be a cheap model, looks at the incoming request, estimates complexity, and routes accordingly. The production readiness considerations I have written about apply here too; routing logic needs testing and monitoring just like any other production component, because a misrouted complex query to a cheap model gives you a bad answer at a low price, which is worse than an expensive good answer. The specific surprise of 2026 is that for coding workloads in particular, the routing decision now favours open-weight models more aggressively than even a year ago; Kimi K2.6 and DeepSeek V4 Pro are effectively tied with Claude Opus and GPT-5.5 on the headline coding benchmarks while costing a fraction of the price, which means SMEs building anything code-adjacent, such as internal tooling, document automation, or ETL pipelines, can route the bulk of that work to a model that costs less than a tenth of the closed-source equivalent without losing meaningful quality.

Self-hosting is where the economics get interesting for SMEs with consistent workloads, and for European businesses it has an additional advantage beyond cost. A used RTX 3090 with 24GB of VRAM, the spec that matters most for local LLM inference, sells in May 2026 for €700 to €1,000 on the second-hand market and remains the value choice; a new RTX 4090 runs €2,200 to €2,700 now that AI demand has pushed it well above launch MSRP. At a €180 monthly API spend a used 3090 pays for itself in about five months and a new 4090 in roughly thirteen, with running cost after that of €40 to €50 per month in Irish commercial electricity. The open-weight ecosystem has matured to the point where Llama 4 Scout, Qwen3-Coder-Next, Mistral Medium 3.5, and Kimi K2.6 all serve locally with no usage fees, no rate limits, and no vendor lock-in; Ollama has made the operational side trivial, with one command downloading and serving any of them on an OpenAI-compatible endpoint, so anything you built against the OpenAI API can point at your local box with a URL change. Self-hosting earns its keep when your workload is consistent; for spiky load you would be paying for idle GPU time, and APIs win at low utilisation. For European SMEs handling client data, keeping inference on your own kit is also the cleanest way to handle tenant isolation and GDPR, since the data never leaves your jurisdiction. If owning hardware is not practical, European cloud providers like OVHcloud, Hetzner, and Scaleway offer GPU instances that keep inference within the EU.

There is a full open-source stack that replicates what enterprise AI platforms sell for five or six figures, running on commodity hardware. The combination of n8n for workflow orchestration, Ollama for serving the model, and Qdrant as a vector database for retrieval-augmented generation gives you triggers, workflows, integrations with email, CRM, and databases, a local model server, and a storage layer for embeddings, all on a single machine. n8n handles the automation logic, connecting seventy-plus AI nodes with three hundred other services; Ollama serves whatever model you choose; and Qdrant stores and retrieves document embeddings so the model can answer questions grounded in your actual data. The total infrastructure cost is the machine and the electricity. The trade-off is that you are your own operations team; there is no vendor to call at 2 AM when something breaks, and you need someone on staff who can read logs, restart services, and debug prompt issues. For an SME with even a moderately technical person on the team, this is manageable; for one without, the support cost of running your own stack may exceed what you save on platform fees.

If you are an Irish SME, there is funding available that can offset the initial investment, and it is worth knowing about before you spend your own money. CeADAR, Ireland's national centre for AI and a European Digital Innovation Hub, offers free and heavily subsidised services to SMEs including technical expertise, test-before-invest projects, and training; their Phase 2 programme running from 2026 to 2029 aims to deliver over 1,100 test-before-invest projects and 200 training courses nationwide. Enterprise Ireland's Digital Transition Fund provides grants of up to €35,000 at 50% funding for digital transformation projects, which explicitly covers AI integration. And the Local Enterprise Offices offer the Grow Digital Voucher: up to €5,000 for implementing digital tools including AI, available after completing the free Digital for Business consultancy. None of these are enormous sums individually, but €5,000 from a LEO voucher covers a GPU and several months of experimentation, and €35,000 from the Digital Transition Fund can resource a properly scoped AI pilot from data preparation through to production. The application processes are straightforward, and the consultancy that unlocks the LEO voucher is itself useful because it forces you to think through what you actually need before you start spending. Across Europe more broadly, the EU AI Act's provisions for SMEs include priority access to regulatory sandboxes, simplified compliance documentation, and proportionate fees; the regulatory environment is explicitly designed to avoid crushing smaller businesses under compliance costs, which is worth remembering when the headlines make the AI Act sound like it only creates obstacles.

The costs that do not appear in the pricing comparison are the ones that catch people. Data preparation, which includes cleaning, formatting, chunking documents for embedding, and building the retrieval pipeline, consumes 10 to 15 percent of total project spend even for straightforward implementations. If you go down the fine-tuning route, the infrastructure overhead adds 15 to 30 percent on top of raw GPU costs, because training runs consume significantly more compute than inference and the tooling around them is not free. Embeddings are cheap per call but add up at volume; if you are embedding a large document corpus and re-embedding as documents change, that is a recurring cost that scales with your data rather than your query volume. And the largest hidden cost of all is iteration time measured in human hours: the first version of any AI workflow will not be the version you keep, and the time spent refining prompts, adjusting retrieval parameters, and testing edge cases is where most of the real spend sits. Grounding checks, which make sure the model only answers from your data and does not fabricate plausible-sounding nonsense, matter just as much on a budget model as on an expensive one, and skipping them to save time is the kind of economy that costs more later.

The most cost-effective AI deployment is sometimes no AI at all, and this is the point I find most underappreciated by the trade press. Rule-based automation for predictable workflows, where the logic can be expressed as "if this then that" without ambiguity, costs fractions of a cent to execute versus €0.01 to €0.10 per AI inference call. Customer routing, form validation, status notifications, scheduled reports, data transformations between systems: these are solved problems that do not benefit from a language model's ability to handle ambiguity, because there is no ambiguity to handle. AI earns its cost when the input is genuinely variable, when the task requires interpretation rather than rule-following, and when the value of a good answer significantly exceeds the cost of generating it. For everything else, a Zapier workflow or a Python script will do the same job for a fraction of the cost, and the projects where that was the right call do not appear in the AI ROI statistics.

The cheapest way to run AI is to be deliberate about where you use it and honest about where you do not need it. The pricing has reached a point where budget is no longer a valid reason to avoid AI entirely, but budget is always a valid reason to be thoughtful about where you deploy it. Start with one workflow where the value is obvious and the data is clean, check whether CeADAR or your LEO can offset the initial cost, run the numbers on API costs versus self-hosting, and resist the temptation to add a second use case until the first one is genuinely paying for itself, with a cost reduction or a revenue line that shows up in the accounts rather than a projection in a slide deck. The tools are there, the models are there, the funding support is there if you look for it, and the only thing missing is the discipline to use them well.

Sources

Moonshot AI, Kimi K2.6: open-weight 1T-parameter MoE (32B active) released April 2026; ties GPT-5.5 on SWE-Bench Pro at 58.6%; API pricing $0.60/$2.50 per million input/output tokens. artificialanalysis.ai
DeepSeek, Models & Pricing: current API pricing for DeepSeek V3 ($0.14/$0.28 per million tokens), R1 ($0.55/$2.19), and V4; off-peak discounts up to 50% on V3 and 75% on R1 during overnight GMT hours. api-docs.deepseek.com
Alibaba Qwen Team, Qwen3-Coder-Next: open-weight coding model released February 2026; 80B total parameters with 3B active; 256K context window; 70.6% on SWE-Bench Verified. github.com/QwenLM/Qwen3-Coder
Meta AI, The Llama 4 herd: Llama 4 Scout (17B active, 16 experts, 10M-token context, fits on a single H100) and Maverick (17B active, 128 experts); both open-weight. ai.meta.com
Anthropic, Claude API Pricing 2026: Haiku 4.5 at $1/$5 per million input/output tokens; prompt caching at 90% discount on cached input. platform.claude.com
MindStudio, The Best Open-Source LLMs for Agentic Coding in 2026: comparative analysis placing Kimi K2.6 and DeepSeek V4 Pro effectively level with the closed-source frontier on SWE-Bench Verified. mindstudio.ai
BestValueGPU and XDA Developers, RTX 4090 and RTX 3090 price trackers (EU and US, May 2026): new RTX 4090 currently retails for €2,200 to €2,700 in the EU; used RTX 3090 (24GB VRAM) sells for roughly €700 to €1,000 on the second-hand market and remains the value choice for local LLM inference. bestvaluegpu.com · xda-developers.com
CeADAR Ireland, Enterprise Ireland Digital Transition Fund, and Local Enterprise Office Grow Digital Voucher: programme details and grant ranges current as of May 2026. ceadar.ie · enterprise-ireland.com · localenterprise.ie

Michael Doody

Contract & Interim CTO

The SME’s Guide to Running AI Cheaply

Get insights like these in your inbox