# Running 70B on Your Desk: An Honest Guide to On-Prem AI in 2026

*Series B - The Amnesia Problem, Article 3*

It usually starts with the invoice.

Someone in finance flags the line item - $14,000 last month for API calls to a model your team started using six months ago as an "experiment." The CTO opens a spreadsheet, does some napkin math, and asks the question that has been quietly forming in the back of every technical leader's mind since late 2025: "What would it actually cost to run this in our own server room?"

The answer, in mid-2026, is more interesting than most people expect. Not because it's cheap - it isn't - but because the calculation has fundamentally changed. The hardware got better. The models got smaller. The tooling grew up. And the regulatory environment started making "someone else's computer" a harder sell for sensitive workloads.

This is the honest map. Not the vendor pitch. Not the hobbyist fantasy. The calibrated, numbers-attached version of what on-prem AI actually looks like today - what works, what hurts, and where the crossover points really are.

## Why the Calculus Shifted

Eighteen months ago, running a 70-billion-parameter model on local hardware was a stunt. You needed $30,000 in GPUs, a small data center's worth of cooling, and the kind of systems engineering talent that doesn't answer recruiter emails. The models were dense, the quantization was lossy, and the tooling assumed you had a PhD and opinions about CUDA kernel fusion.

Three things changed.

**Models got architecturally smarter.** The shift to Mixture-of-Experts (MoE) architectures means that a "70B-class" model in 2026 doesn't necessarily have 70 billion active parameters. Llama 4 Scout has 109B total parameters but only 17B active per forward pass. Qwen3's 35B MoE variant matches what a year ago required a 70B dense model. The performance tier stayed the same - the hardware requirements dropped.

**Quantization matured.** Q4_K_M quantization has become the practical standard - it compresses a 70B dense model from roughly 140 GB to 35-45 GB with minimal quality loss for most business tasks. Google's QAT (Quantization-Aware Training) approach in Gemma 3 pushed this further, making int4 models that were trained to be quantized rather than having quantization applied as an afterthought. The difference is measurable.
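If you're producing your own quantized weights rather than downloading them pre-quantized, the conversion itself is a one-liner. A minimal sketch with llama.cpp's quantization tool - the file paths are placeholders, and the binary is called `llama-quantize` in recent builds (older builds name it `quantize`):

```bash
# Quantize a full-precision GGUF down to Q4_K_M with llama.cpp.
# Paths are placeholders; a 70B model drops from ~140 GB at FP16
# to roughly 40-44 GB at Q4_K_M.
./llama-quantize ./models/llama-3.3-70b-f16.gguf \
                 ./models/llama-3.3-70b-q4_k_m.gguf \
                 Q4_K_M
```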

**Hardware hit an inflection point.** The 2025-2026 generation of consumer and prosumer silicon brought unified memory architectures and VRAM capacities that changed the economics entirely. You can fit a quantized 70B model in a machine that sits on a desk and draws less power than a space heater.

But hardware alone doesn't explain the shift. The regulatory environment is doing its own pushing.

## The Compliance Gravity Well

The EU AI Act's transparency obligations take effect in August 2026. The Omnibus simplification pushed the deadline for standalone high-risk systems out to roughly December 2027, but the direction is clear: if you're deploying AI in the EU, you need to know what your models are doing, with what data, and be able to demonstrate it.

Here's the thing nobody in the compliance space says plainly enough: on-prem doesn't automatically make you compliant. It's not a checkbox. But it removes an entire category of problems.

When your model runs on hardware you control, in a jurisdiction you choose, processing data that never leaves your network perimeter - you've eliminated the CLOUD Act question, the cross-border data transfer question, and the "what does my provider do with my prompts?" question. You haven't solved governance. But you've made governance *possible* in a way that's harder when your inference calls transit through three countries and a terms-of-service document that changes quarterly.

NIS2 adds another dimension. Cloud AI providers are in scope as digital infrastructure. Every dependency you add to your AI stack is a link in your supply chain that NIS2 asks you to assess and monitor. Running inference locally doesn't eliminate supply chain risk - you still depend on model weights, frameworks, silicon vendors - but it shortens the chain considerably.

The sovereign AI movement across Europe reflects this gravity. GAIA-X and national AI initiatives are pushing for European compute sovereignty. Organizations paying the 15-30% premium for sovereign cloud are already doing the math on whether owning the hardware outright makes more sense.

None of this means you *should* run on-prem. It means the decision is no longer purely technical - it's also regulatory and strategic.

## The Hardware Landscape - What Actually Fits on a Desk

Let's get concrete. Here's what the 2026 hardware market looks like for on-prem inference, organized by what you'd actually spend.

### The $2,000-$4,000 Tier - "Surprisingly Viable"

**AMD Strix Halo (128 GB unified memory)** - This is the most interesting development in the on-prem AI space this year. A single system-on-chip with 128 GB of unified LPDDR5X memory, meaning both CPU and GPU share the same memory pool. A 70B Q4 model fits entirely in memory. Inference speed: roughly 5-15 tokens per second depending on context length and batch size. Not fast. But functional, and the tokens-per-dollar ratio is the best in class. Street price for a complete system: roughly $2,000-$3,000.

**NVIDIA RTX 5090 build** - The 5090 gives you 32 GB of GDDR7 with enormous bandwidth. A 70B Q4 model at 40+ GB exceeds VRAM, so you're looking at partial offload to system RAM - which works but costs you speed. Pure GPU inference on models that fit (32B and below): 30-55 tokens per second, which is genuinely fast. MSRP is $1,999 but street prices in mid-2026 run $2,800-$3,600+ due to GDDR7 supply constraints. By the time you build a system around it, you're at $4,000-$5,500.
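Partial offload is worth seeing concretely. A rough sketch with llama.cpp's `llama-cli` - the layer count is a knob you tune to your VRAM, not a recommendation, and the model path is a placeholder:

```bash
# Run a model that doesn't fully fit in VRAM by offloading only some
# layers to the GPU; the rest stay in system RAM (slower, but it runs).
# --n-gpu-layers (-ngl) controls how many layers land on the GPU.
./llama-cli -m ./models/llama-3.3-70b-q4_k_m.gguf \
  --n-gpu-layers 48 \
  -p "Summarize this quarter's infrastructure spend in one paragraph."
```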

The honest take: if your workload is primarily 32B-class models (which, with modern MoE architectures, covers a lot of ground), the 5090 build is faster per token. If you need to run true 70B dense models without compromise, Strix Halo's unified memory wins on simplicity - and on price, given the current GPU markup.

### The $4,000-$8,000 Tier - "The Sweet Spot"

**Apple Mac Studio M4 Max (128 GB)** - The Mac Studio remains the "it just works" option for on-prem inference. 128 GB unified memory, 70B Q4 models at 8-25 tokens per second, near-silent operation, and a power draw that won't require an electrician. Price: roughly $3,500-$4,200 depending on configuration. The ecosystem is mature - `llama.cpp` and `MLX` both run well on Apple Silicon. The limitation is that you're locked into Apple's upgrade cycle and pricing.
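On Apple Silicon, the MLX route is similarly short. A sketch assuming the `mlx-lm` package - the model identifier is illustrative, and the exact invocation may differ slightly between mlx-lm versions:

```bash
# Generate with MLX on Apple Silicon via the mlx-lm package.
# The model name is illustrative - any MLX-format quantized model
# from the mlx-community organization on Hugging Face works.
pip install mlx-lm
python -m mlx_lm.generate \
  --model mlx-community/Qwen2.5-32B-Instruct-4bit \
  --prompt "Draft a three-sentence summary of NIS2 supply chain duties."
```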

**NVIDIA DGX Spark** - NVIDIA's entry into the "desktop AI appliance" category. Purpose-built for inference workloads. Priced in this tier and designed to slot into existing IT infrastructure more cleanly than a consumer GPU build. Worth evaluating if you're deploying for a team rather than personal use.

**Dual RTX 5090 build** - Two 5090s give you 64 GB of combined VRAM. A 70B Q4 model fits across both cards with room for KV cache. Consumer cards haven't shipped NVLink since the 3090, so inter-GPU traffic runs over PCIe - adequate for two-way tensor parallelism, though it's the first place to look if scaling disappoints. Inference speeds approaching 40-60+ tokens per second on 70B models. Total system cost at current street prices: $8,000-$11,000. More complex to set up than a Mac Studio - tensor parallelism configuration isn't plug-and-play - but meaningfully faster.

### The $8,000-$20,000 Tier - "Small Team Production"

**Used NVIDIA H100 servers** - The secondary market for data center GPUs has matured. Used H100s run $15,000-$30,000 for a complete server, down from $30,000-$40,000+ at peak. The H100's 80 GB HBM3 and purpose-built tensor cores make it the performance benchmark. If your workload justifies the cost, nothing else touches it for inference throughput.

**AMD MI300X** - Often cheaper than H100 on the secondary market, with 192 GB of HBM3 - enough to run a 70B model at higher quantization levels or multiple models simultaneously. The software ecosystem (ROCm) has improved significantly but still requires more expertise than CUDA. A good option if you have or are willing to develop AMD GPU competence.

**RTX PRO 6000** - NVIDIA's workstation card. 96 GB GDDR7. Fits 70B models comfortably in a single card. Priced between the consumer 5090 and data center H100. The professional driver stack and ECC memory make it more suitable for always-on production workloads than consumer cards.

### VRAM Quick Reference

For planning purposes, here's what models actually need at Q4_K_M quantization:

| Model Class | VRAM Required | Example Models |
|---|---|---|
| 7-9B | 4.5-6.5 GB | Qwen3-8B, Gemma 3 9B |
| 27-32B | 16-22 GB | Qwen3-32B, Gemma 3 27B |
| 70B dense | 40-44 GB | Llama 3.3 70B, Qwen3-72B |
| MoE (35B active) | 20-28 GB | Qwen3-35B MoE (effective "70B-class") |

The important pattern: MoE models give you 70B-class performance at 32B-class VRAM requirements. This is the architectural shift that makes the $2K-$4K tier viable for serious work.
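For quick sizing, the weights-only math is simple enough to do in a shell. A back-of-envelope sketch - the bits-per-weight and overhead figures are rough assumptions, not measurements:

```bash
# Weights-only VRAM estimate: params * bits_per_weight / 8, plus runtime overhead.
# The KV cache comes on top of this (covered later in the article).
params_billions=70      # total parameters, in billions
bits_per_weight=4.5     # Q4_K_M averages roughly 4.5 bits per weight
overhead=1.10           # ~10% for embeddings and runtime buffers - an assumption

printf "%.0f GB\n" "$(echo "$params_billions * $bits_per_weight / 8 * $overhead" | bc -l)"
# -> ~43 GB for a 70B dense model, consistent with the table above
```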

## The Model Decision Tree

Choosing a model for on-prem deployment is different from choosing an API. You're committing to hardware sizing, operational tooling, and license terms. Here's the framework we use.

**Qwen3 / Qwen3.5 (Apache 2.0)** - The default recommendation for most on-prem deployments. The model family spans from 8B to 397B, with both dense and MoE variants. Apache 2.0 licensing means no usage restrictions. The 32B dense and 35B MoE models hit the sweet spot - 70B-class performance in 16-22 GB of VRAM. Multilingual performance is strong, which matters for European deployments.

**Llama 4 Scout (Meta, custom license)** - The standout feature is the 1M+ token context window, which is relevant if your use case involves processing entire codebases or long documents in a single pass. The 109B/17B MoE architecture (17B active parameters) makes it surprisingly efficient to run. The license is more restrictive than Apache 2.0 - read it carefully if you're building commercial products.

**Gemma 3 (Google, permissive)** - The best option for consumer hardware. Google's QAT approach means the int4 quantized versions were designed to be quantized, and quality holds up better than post-hoc quantization of other models. The 27B variant is excellent on a single RTX 5090.

**DeepSeek V3.2 / R1 (MIT license)** - If your workload is reasoning-heavy - code generation, mathematical proof, complex analysis - DeepSeek's reasoning models are elite. The R1 model competes with frontier API models on reasoning benchmarks. MIT license. The trade-off is size: these are large models that benefit from serious hardware.

**Mistral Small 4 / Large 2** - Strong European provenance (French company, EU jurisdiction). Mistral Small 4 is efficient and capable for its size. Worth considering if European origin and jurisdiction matter for your compliance posture.

### A Note on "70B" as a Performance Tier

Throughout this article, "70B" refers to a performance class, not a strict parameter count. In 2024, getting 70B-class output quality required a 70B dense model. In 2026, you can get there with a well-trained 32B dense model or a 35B MoE model. The naming convention stuck because people understand what "70B-class" means in terms of output quality - it's the tier where models handle complex reasoning, nuanced writing, and multi-step tasks reliably.

When we say "running 70B on your desk," we mean running models that produce 70B-class output quality. On modern hardware, that's often physically easier than the name implies.

## The Money - When On-Prem Actually Wins

Here's where most guides get dishonest, either by ignoring the real costs of on-prem (labor, electricity, opportunity cost) or by using stale API pricing that makes the cloud look more expensive than it is. Let's use current numbers.

### API Pricing (Mid-2026, Blended)

| Provider | Input (per M tokens) | Output (per M tokens) | Blended Estimate |
|---|---|---|---|
| GPT-4o | ~$2.50 | ~$10-15 | ~$4-6 |
| Claude Sonnet 4 | ~$3 | ~$15 | ~$5-7 |
| Gemini Pro | ~$0.25-2 | ~$1.50-12 | ~$2-5 |

Blended rates assume a typical mix of input and output tokens. Your ratio will vary - code generation is output-heavy, RAG is input-heavy.

### On-Prem Monthly TCO

A realistic on-prem monthly cost, assuming a Mac Studio M4 Max deployment:

| Component | Monthly Cost |
|---|---|
| Hardware amortization (3-year) | ~$115 |
| Electricity (24/7, ~200W avg) | ~$30-50 |
| Internet/networking | ~$20 |
| Maintenance/labor (fractional) | ~$200-400 |
| Software/tooling | ~$0-50 |
| **Total** | **~$365-$635** |

For a dual-5090 build, increase hardware amortization to ~$200/month and electricity to ~$80-120. For a used H100, hardware amortization is ~$400-800/month but you get proportionally more throughput.

The labor line is the one people underestimate. Someone needs to monitor the system, update models, handle OOM crashes at 2 AM, and keep the serving stack patched. If you're already running infrastructure and this is incremental work, the cost is lower. If this is your team's first production system, budget more.

### The Crossover Point

At roughly **80-200 million tokens per month**, on-prem and API costs converge. Below that, API is cheaper. Above that, on-prem wins - and the gap widens fast.

- **10M tokens/month**: API wins easily. You'd spend $40-70 on API calls vs. $400+ on-prem TCO. Don't build infrastructure for this volume.
- **100M tokens/month**: The crossover zone. API costs hit $400-700/month. On-prem costs are similar but you gain data sovereignty and predictable pricing.
- **1B tokens/month**: On-prem wins decisively. API costs reach $4,000-7,000/month. On-prem TCO stays in the $600-$1,300 range. That's a 3-10x cost advantage.
- **10B tokens/month**: You should have built on-prem yesterday.

The pattern: on-prem has high fixed costs and near-zero marginal costs. API has low fixed costs and linear marginal costs. The lines cross somewhere in the hundreds of millions of tokens. Know where your usage falls before committing.
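The crossover arithmetic is simple enough to sanity-check yourself. A sketch using the mid-range figures from the tables above - substitute your own TCO and blended rate:

```bash
# Fixed vs. marginal: how many tokens per month before on-prem breaks even?
# Figures are the mid-range estimates from the tables above - plug in your own.
onprem_tco_monthly=500    # $/month, all-in on-prem cost
api_blended_rate=5        # $/million tokens, blended input+output

echo "$(( onprem_tco_monthly / api_blended_rate )) million tokens/month to break even"
# -> 100 million tokens/month, the middle of the 80-200M crossover zone.
# This treats on-prem marginal cost as ~zero, which holds only until your
# volume outgrows the hardware you bought.
```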

## What We Encountered - The Production Gap

Here's where the blog posts and YouTube tutorials diverge from reality. Getting a model to run is the easy part. Getting it to serve a team reliably is where the real engineering lives.

### The Ollama Trap

Ollama is brilliant. It has the best developer experience of any local inference tool - once the model is pulled, `ollama run qwen3:32b` has you generating text in under a minute. For personal use, prototyping, and development, it's the right starting point.

It is not a production serving solution.

We've watched Ollama degrade gracefully up to about 5 concurrent users, then ungracefully after that. By 10 concurrent requests, latency spikes become unpredictable. The issue isn't a bug - it's an architectural choice. Ollama is designed for simplicity, not concurrency. It doesn't implement PagedAttention or sophisticated request batching. It loads models into memory as monolithic blocks.

This isn't a criticism. It's a taxonomy. Ollama is to vLLM what SQLite is to PostgreSQL - perfect for its intended use case, wrong for a different one.
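If you'd rather see the degradation on your own hardware than take our word for it, a crude probe is enough - this is not a proper load test, just a way to watch per-request latency climb as concurrency grows:

```bash
# Fire N concurrent requests at a local Ollama instance and compare
# per-request wall-clock times as N grows (try N=2, 5, 10).
N=10
rm -f latency.log
for i in $(seq 1 "$N"); do
  ( time curl -s http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "qwen3:32b", "messages": [{"role": "user", "content": "One sentence on KV caches."}]}' \
      > /dev/null ) 2>> latency.log &
done
wait
grep real latency.log   # per-request wall-clock times
```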

### KV Cache and the OOM Surprise

The model weights are only part of the memory story. During inference, the KV (key-value) cache grows with context length and batch size. A 70B model that fits in 42 GB of VRAM at load time can OOM at 48 GB during a long conversation because the KV cache expanded beyond what you budgeted.

This is the most common failure mode we've seen in on-prem deployments. The model loads, benchmarks look great, someone sends a 30,000-token document for summarization, and the process crashes.

The fix is one of three things: limit context length (which limits usefulness), use a serving stack with PagedAttention (which manages the KV cache like virtual memory), or provision more VRAM headroom than you think you need. Budget 20-30% headroom above the model's base VRAM requirement.
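The growth is easy to estimate once you know the model's attention geometry. A back-of-envelope sketch assuming Llama 3.3 70B-style dimensions (80 layers, 8 KV heads under GQA, head dimension 128, fp16 cache) - check your model's config for the real numbers:

```bash
# Per-token KV cache = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element.
# Dimensions below assume a Llama 3.3 70B-style config with GQA and an fp16 cache.
layers=80; kv_heads=8; head_dim=128; bytes_per_elem=2
per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))
echo "per token: $per_token bytes (~$(( per_token / 1024 )) KB)"

context=30000   # the long-document request from above
echo "at ${context} tokens: ~$(( per_token * context / 1024 / 1024 / 1024 )) GB of KV cache"
# -> roughly 9 GB on top of the weights, per request at that context length
```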

### Cold Starts

Loading a 70B model from disk to GPU memory takes 30-90 seconds depending on storage speed. If your serving solution unloads idle models (Ollama does this by default), the first request after an idle period eats that full reload time.

For personal use, this is fine. For a team service, it's a support ticket. Keep models pinned in memory for production workloads.
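In Ollama's case, pinning is a configuration change rather than an engineering project. A sketch - to our knowledge, recent Ollama versions treat a `keep_alive` of -1 as "never unload," but verify against the docs for your version:

```bash
# Keep models resident instead of unloading them after the default idle timeout.
# Server-wide: set before starting the server (recent Ollama versions).
export OLLAMA_KEEP_ALIVE=-1
ollama serve &

# Per model: preload it and ask Ollama to keep it loaded indefinitely.
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen3:32b", "keep_alive": -1}'
```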

### Observability Gaps

Cloud APIs give you dashboards, usage tracking, and cost attribution for free. On-prem gives you nothing unless you build it. You'll want, at minimum:

- Request latency tracking (P50, P95, P99)
- Token throughput monitoring
- VRAM utilization and KV cache pressure
- Per-user or per-application usage attribution
- Error rates and OOM tracking

Prometheus plus Grafana gets you most of this. vLLM exposes Prometheus metrics natively. Budget time for setting this up - it's not optional for team deployments.
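vLLM's metrics endpoint gives you the raw material for most of that list. A quick way to see what your build exposes - metric names shift between versions, so treat the grep pattern as illustrative:

```bash
# vLLM serves Prometheus-format metrics on its API port (8000 by default).
# Metric names vary by version; grep broadly and see what your build emits.
curl -s http://localhost:8000/metrics | grep -iE "vllm|token|latency" | head -n 20

# Point a Prometheus scrape job at the same endpoint, then build the
# P50/P95/P99 latency and cache-pressure panels in Grafana on top of it.
```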

## What Worked - A Practical Starting Point

Enough about problems. Here's a concrete path from "interested" to "running."

### Step 1: Start with Ollama (Seriously)

Despite the caveats above, Ollama is the right first step. It lets you validate that your hardware works, your model choice is sound, and your use case benefits from local inference - all before investing in production tooling.

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a 32B model (70B-class performance, fits most hardware)
ollama pull qwen3:32b

# Run an interactive session
ollama run qwen3:32b

# Serve via API (compatible with OpenAI API format)
ollama serve &
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:32b",
    "messages": [{"role": "user", "content": "Summarize the EU AI Act transparency requirements in three bullet points."}]
  }'
```

### Step 2: Benchmark Your Hardware

Before committing to a model or hardware configuration, measure actual performance:

```bash
# Quick benchmark with ollama
# Run a fixed prompt and observe tokens/second in the output
ollama run qwen3:32b --verbose <<< "Write a detailed explanation of how transformer attention mechanisms work, including multi-head attention, scaled dot-product attention, and the role of positional encoding. Be thorough."

# The --verbose flag shows:
# - prompt eval rate (tokens/s for processing input)
# - eval rate (tokens/s for generation)
# - total duration
```

The numbers that matter:
- **Prompt eval rate**: How fast the model processes your input. Matters for RAG and long-context workloads.
- **Eval rate (generation)**: How fast it produces output. This is what users feel.
- **Time to first token**: The latency before generation starts. Matters for interactive use.

If the eval rate is below 5 t/s, the model is too large for your hardware to serve interactively. It may still work for batch processing where latency doesn't matter.

### Step 3: Graduate to Production Serving

When you outgrow Ollama - and if you're serving a team, you will - the two serious options are **vLLM** and **SGLang**.

**vLLM** is the established choice. PagedAttention manages KV cache efficiently, continuous batching handles concurrent requests, and the OpenAI-compatible API makes it a drop-in replacement for cloud endpoints. It handles 100+ concurrent users where Ollama falls over at 10.

**SGLang** is the newer contender. Benchmarks show 20-29% better throughput for shared-context workloads - which is exactly what RAG pipelines produce (many requests sharing the same retrieved context). If your primary use case is RAG, SGLang is worth evaluating.

Both support tensor parallelism across multiple GPUs, quantized model loading, and Prometheus metrics for observability. Neither is as easy to set up as Ollama. That's the trade-off.
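For orientation, here's roughly what the jump looks like. A minimal vLLM launch sized for the dual-GPU build described earlier - the model name, parallelism, and context limit are illustrative, not a recommended configuration:

```bash
# Install vLLM and start its OpenAI-compatible server.
# Model, tensor parallelism, and context length are sized for the
# dual-GPU build described earlier - adjust to your own hardware.
pip install vllm

vllm serve Qwen/Qwen3-32B \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768

# The API is OpenAI-compatible: the curl from Step 1 works unchanged,
# pointed at http://localhost:8000/v1/chat/completions instead.
```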

### Step 4: The Model Selection Framework

Rather than prescribing a model, here's the decision tree:

1. **What's your VRAM budget?**
   - Under 8 GB: Qwen3-8B or Gemma 3 9B (capable, but limited for complex tasks)
   - 16-24 GB: Qwen3-32B or Gemma 3 27B (the sweet spot for single-GPU setups)
   - 40+ GB: Full 70B dense models or large MoE variants
   - 128+ GB unified: Anything you want

2. **What's your primary task?**
   - General knowledge work: Qwen3 family (best all-around)
   - Code generation/reasoning: DeepSeek R1 or V3.2
   - Long document processing: Llama 4 Scout (1M+ context)
   - Consumer hardware, quality-sensitive: Gemma 3 with QAT

3. **What are your license requirements?**
   - Maximum freedom: Apache 2.0 (Qwen3) or MIT (DeepSeek)
   - Permissive with some limits: Gemma, Llama (read the fine print)
   - European provenance matters: Mistral

4. **How many concurrent users?**
   - 1-3: Ollama is fine
   - 5-20: Consider vLLM or SGLang
   - 20+: vLLM or SGLang, mandatory

## The Deeper Pattern

Step back from the specs and pricing for a moment.

What we're watching is the AI industry recapitulate thirty years of computing infrastructure history. Mainframes gave way to distributed systems. Distributed systems moved to the cloud. The cloud is now developing a gravitational counterforce - not back to mainframes, but toward a hybrid model where the location of compute is an engineering decision, not a default assumption.

AI followed the mainframe phase: concentrated in a few providers, accessed via terminal (API). The "distributed" phase is happening now - not because the technology demands it, but because the economics, regulations, and organizational needs demand it.

The organizations that will navigate this well are the ones that treat model deployment as an infrastructure decision with the same rigor they apply to database deployment. You don't put every database in the cloud. You don't run every database on-prem. You evaluate the workload, the sensitivity, the access patterns, the cost profile, and you make a calibrated choice.

AI inference is becoming infrastructure. The sooner it's treated that way - with capacity planning, monitoring, lifecycle management, and honest TCO analysis - the fewer surprises you'll encounter.

## Open Questions

Several genuinely uncertain trends will reshape on-prem AI in the next 12-18 months. These aren't predictions - they're the things we're watching.

**Will MoE consolidation change the hardware math?** If MoE architectures continue improving at the current rate, the entire concept of "70B dense" could become irrelevant for most workloads. A 2027 MoE model with 15B active parameters might match today's 70B dense quality. That would make the $2K tier the default, not the budget option.

**Does unified memory win?** AMD's Strix Halo and Apple's M-series both bet on unified memory - a single pool shared between CPU and GPU. NVIDIA's traditional approach separates VRAM from system RAM. For inference workloads where model size exceeds any single GPU's VRAM, unified memory is simpler. But if models keep getting more efficient (MoE, better quantization), we might never need more than 32 GB of fast VRAM. The question is whether models shrink faster than memory grows.

**What does post-quantum plus AI Act look like?** The EU AI Act's auditability requirements intersect with the growing push for post-quantum cryptography. If your AI system processes data that needs long-term confidentiality (health records, legal documents, financial data), the combination of local inference and quantum-resistant encryption becomes a compliance posture that's genuinely hard to replicate in multi-tenant cloud environments.

**Will inference-as-a-service pricing collapse?** Competition among API providers has already driven prices down dramatically. If Gemini Pro's aggressive pricing signals a race to the bottom, the crossover point could shift from 100M tokens to 1B+ tokens, making on-prem less economically justified for mid-tier volumes. We're not betting on this happening - margins are already thin - but it's worth watching.

## Why This Matters for AI Memory

This article sits at a crossroads in the Amnesia Problem series. The prior articles - [The Rediscovery Tax](/blog/the-rediscovery-tax/) and [Why RAG Isn't Memory](/blog/why-rag-isnt-memory/) - established that AI systems have a memory problem: they forget, they confuse retrieval with recall, and the cost of that amnesia is real and measurable.

The articles that follow - [The Trust Chain Problem](/blog/the-trust-chain-problem/) and [Shadow Memories](/blog/shadow-memories/) - will argue that solving AI memory requires governance, auditability, and trust infrastructure that most organizations haven't built.

This article is the bridge. Because here's the thing: you cannot govern what you don't control.

When your AI inference runs through an API, your model's "memory" - its context, its conversation history, its RAG-retrieved knowledge - transits through infrastructure you don't own, under terms you didn't negotiate, in jurisdictions you may not have evaluated. You can build governance policies, but you can't enforce them past your network boundary.

On-prem AI is where the memory problem becomes solvable. Not automatically solved - that's the work of knowledge infrastructure, which is what this series is building toward. But solvable. You can audit every prompt. You can control every retrieval. You can ensure that your AI's "memory" follows the same retention, classification, and access control policies as every other piece of organizational knowledge.

That's not a technical argument for on-prem. It's an architectural one. And it's why the question "should we run our own models?" is really the question "do we want to control our AI's relationship with organizational knowledge?"

For a growing number of organizations in 2026, the answer is yes.

---

*This is article B.3 in the Amnesia Problem series. Next: [B.4 - The Trust Chain Problem](/blog/the-trust-chain-problem/) - what happens when AI memory needs to be trusted, not just stored.*

---

*Running inference workloads on your own terms is an engineering decision - and one that benefits from experience with the specific failure modes, hardware trade-offs, and operational patterns involved. If your organization is evaluating on-prem AI deployment, [we can help you avoid the expensive lessons](https://sheridan.hu/contact).*
