# Why Your AI Gets Worse at 3pm: Context Rot and the Limits of a Long Session

**Series**: Working WITH AI (C.4)
**Published**: 2026-06-27
**Author**: Levente Peres
**Reading time**: ~13 min
**Language**: EN / HU (bilingual)

---

You have noticed it even if you have never named it. In the morning the assistant is sharp: it follows the thread, holds the details, gives you the answer you actually asked for. Hours later, deep into the same long conversation, something has slipped. It contradicts a decision you settled an hour ago. It re-suggests an idea you already rejected, with the same enthusiasm as the first time. It misplaces a number you gave it twice. Nothing crashed. No error appeared. The model just quietly got worse over the course of the session.

It did not get tired. It has no afternoon, no blood sugar, no willpower to deplete. What happened is more mechanical and, once you see it, more predictable: its working context filled up with the conversation itself, and the signal you cared about got buried under the noise of everything else that had been said. The industry has started calling this **context rot**, and in 2026 it is one of the most useful things a heavy AI user can understand — because the moment you understand it, you can work around it.

## The Brilliant Consultant Whose Desk Never Gets Cleared

Picture the sharpest consultant you have ever worked with. At nine in the morning the desk is clean, the brief is fresh, and every answer is crisp. Then the day happens. Every email, every draft, every rejected idea, every "actually, ignore that last bit" lands on the desk and stays there. Nobody clears it. By mid-afternoon the consultant is still brilliant — but now working across a desk buried under hours of paper, half of it superseded, some of it contradictory, the one crucial morning note somewhere in the pile.

Ask a question now and the answer arrives slower and shakier. Not because the consultant forgot how to think, but because finding the relevant sheet among a hundred irrelevant ones has become the hard part. The intelligence is intact. The *retrieval* is drowning.

That is what a long AI session looks like from the inside. The model's context window — the running transcript of everything in the conversation so far — is that desk. Large language models do not have a separate, tidy memory they consult; they re-read the whole desk on every single turn. The longer you talk, the bigger the pile they must read through to answer you, and the more of that pile is no longer the point.

## Context Rot Has a Name Now

For a long time this was folk knowledge: practitioners traded tips about "starting a fresh chat when it gets weird" without anything to point to. That changed in July 2025, when researchers at Chroma published a study with the apt title *Context Rot: How Increasing Input Tokens Impacts LLM Performance*. They evaluated eighteen leading models from the GPT, Claude, Gemini, and Qwen families and found something that should reshape how anyone thinks about long conversations: **models do not use their context uniformly. Their performance grows steadily less reliable as the input gets longer**, and not only on hard problems. The decline showed up on tasks that are nearly trivial at short length, including simply finding a relevant sentence and even repeating a block of text back.

Two details from that work are worth keeping. First, the decline is gradual and starts early — well before you hit the advertised limit of the window. Second, *what* fills the context matters as much as how much: when the irrelevant text was topically similar to what you were looking for, accuracy fell hardest, because near-misses are exactly what a model struggles to tell apart. A conversation that stays "on topic" can paradoxically be one of the worst offenders, because everything in it looks plausibly relevant.

This is the part people get backwards. The danger is not that you will run out of room. It is that long before you run out of room, the room is full of distractions.

> Your AI didn't forget. It got buried under its own conversation — and the morning's key point is somewhere in the pile.

## The Middle Is Where Things Go to Be Forgotten

There is a second, older finding that compounds the first. Back in 2023, a study memorably titled *Lost in the Middle* showed that models use information most reliably when it sits at the **beginning or the end** of a long input and least reliably when it sits in the middle. Put the decisive fact in the first or last paragraph and it is used reliably; bury it halfway down a long context and accuracy can drop sharply. Plotted against the position of the key fact, accuracy traces a U, and that shape has stubbornly survived into the era of giant context windows.

Then, in early 2025, a benchmark called **NoLiMa** made the gap concrete in a way that is hard to unsee. Instead of asking a model to find a sentence by matching keywords — which is easy, and which is what most marketing benchmarks measure — it required the model to make a small inference to connect the question to the answer, the way real work does. Performance collapsed far earlier than the spec sheets suggest. Most of the models tested fell below half of their short-context accuracy by around **32,000 tokens** — a fraction of their advertised capacity. One leading model that scored above 99% on short inputs dropped to roughly 70% at 32k.

The practical translation: the number on the box — "200K context!", "1M context!" — is the size of the desk, not the size of the part of the desk the model can actually keep in focus at once. The *effective* working context, the part where reasoning stays trustworthy, is much smaller than the advertised one. Treat the big number as headroom, not as a promise.
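To make the gap between advertised and effective context concrete, here is a minimal sketch of a session check. Everything in it is an assumption for illustration: the four-characters-per-token rule of thumb is only a rough estimate for English text, the 32,000-token threshold simply reuses the figure quoted above, and the function names are hypothetical rather than part of any vendor's API.

```python
# Rough sketch: flag a conversation once it passes an assumed "effective" budget.
# All three constants are illustrative assumptions, not measured model properties.

CHARS_PER_TOKEN = 4                  # rule of thumb for English; real tokenizers differ
EFFECTIVE_BUDGET_TOKENS = 32_000     # reuses the 32k figure quoted in the text
ADVERTISED_WINDOW_TOKENS = 200_000   # "the number on the box"


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def session_status(transcript: list[str]) -> str:
    """Classify a transcript as 'fresh', 'likely_degrading', or 'over_window'."""
    used = sum(estimate_tokens(turn) for turn in transcript)
    if used >= ADVERTISED_WINDOW_TOKENS:
        return "over_window"
    if used >= EFFECTIVE_BUDGET_TOKENS:
        return "likely_degrading"
    return "fresh"
```

The point of the two thresholds is the one made above: a session can be flagged as likely degrading long before it comes anywhere near the advertised limit. A real tool would swap in the model's own tokenizer and a threshold tuned to the task.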

## It Isn't Forgetting — It's Interference

It helps to be precise about the mechanism, because the fix follows from it. The model is not "forgetting" earlier turns the way a person forgets a name. Everything is still right there in the transcript. The problem is interference, and it comes from a few directions at once.

**Attention is a finite budget.** On every turn, the model spreads a fixed amount of "focus" across everything in the context. Add more text and each piece gets a thinner slice. The crucial sentence from this morning is still present, but it is now competing for attention with thousands of words that arrived since — and attention is close to a zero-sum game.
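The budget idea can be illustrated with a toy calculation. The sketch below assumes a single softmax over relevance scores, where one "key" sentence scores higher than a crowd of equally scored distractors. Real models have many attention heads and layers with learned scores, so treat this as an illustration of the normalization constraint only, not as a model of any actual system; the function name and the scores are made up for the example.

```python
import math


def attention_share_of_key(key_score: float, distractor_score: float,
                           n_distractors: int) -> float:
    """Softmax weight on one key item competing with n equal-score distractors."""
    key = math.exp(key_score)
    rest = n_distractors * math.exp(distractor_score)
    return key / (key + rest)


# The key sentence keeps the same score; only the amount of other text grows.
for n in (10, 100, 1000):
    share = attention_share_of_key(key_score=2.0, distractor_score=0.0,
                                   n_distractors=n)
    print(f"{n:>5} distractors: key gets {share:.4f} of the attention")
```

With these made-up scores the key sentence holds roughly 0.42 of the weight against 10 distractors and roughly 0.007 against 1,000. Nothing about the sentence changed; the denominator did.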

**Old and rejected material keeps voting.** The idea you dismissed an hour ago is still in the transcript, still being read on every turn, still nudging the model toward itself. Corrections you made do not erase the thing you corrected; they sit *next to* it. The longer the session, the more contradictory votes accumulate, and the more the model wobbles between them.

**The model conditions on its own output.** Each thing it says becomes part of the context it reads next. If it drifted slightly an hour ago, that drift is now evidence it uses to justify drifting further. Small errors do not just persist; they can compound, because the model treats its own past words as established fact.

None of this is a bug to be patched in the next release. It is a consequence of how the architecture works. Which is why the answer is not "wait for a better model" — it is to manage the desk.

## Won't a Bigger Window Just Fix It?

This is the reasonable objection, and the honest answer is no — bigger windows defer the problem and can quietly make it worse. The Chroma work tested the largest available windows and still saw degradation at a fraction of capacity: serious slippage tens of thousands of tokens into a window many times that size. A larger context does not give the model more *focus*; it gives the conversation more room to accumulate noise before anyone notices.

There is a seductive "million-token mirage" in the marketing: if the window is big enough, surely you can just pour everything in and let the model sort it out. The evidence of the last two years says the opposite. Pouring everything in is precisely the thing that triggers the rot. More room to fill is more room to fill with distractions — and you pay for those extra tokens in latency and cost while your answers get *less* reliable, not more.

The size of the window is a ceiling, not a strategy.

## Is "3pm Fatigue" a Fair Comparison?

The afternoon-slump metaphor is useful, and it is also worth being honest about where it breaks. As a description of the *curve* — fresh and sharp early, sloppier and more error-prone after sustained load — it is a genuinely good fit, and there is even a 2026 research paper that borrows the language of "cognitive fatigue" to describe the way these models decline over a long generation. People recognize the shape immediately, which is exactly why it is a good way to talk about it.

But the cause is not the human one, and pretending otherwise leads to bad decisions. Your afternoon dip involves biology: circadian rhythm, glucose, genuine tiredness. The model has none of that. Its decline is mechanical interference: attention spread thin, position bias, accumulated contradictions, self-conditioning. (Even the human side of the story is contested; the popular notion that willpower is a fuel that runs out has not held up well under scrutiny.) So use the metaphor to *notice* the problem ("it is getting late in this conversation, it is probably degrading"), but reach for the mechanical explanation when you decide what to do. You cannot give the model a coffee. You can clear its desk.

## What You Can Actually Do About It

Here is the good news, and the reason this is worth understanding rather than just enduring: every one of these failure modes has a practical, vendor-neutral countermeasure, and none of them requires waiting for anyone's next release.

- **Work in focused sessions, not marathons.** A single conversation that tries to carry an entire project is the ideal breeding ground for rot. Break the work into scoped tasks with clear boundaries. A clean context is a sharp context.
- **Start fresh — on purpose — and hand off a summary.** When the answers start to wobble, the cheapest fix is also the most counterintuitive: open a new conversation. But do not start cold. Carry forward a short, deliberate summary of what was decided and why — the conclusions, not the whole debate. You are doing manually what good memory infrastructure does automatically: keeping the gist, dropping the noise.
- **Keep the important things at the edges.** Since the middle is where attention thins out, put the instructions and facts that must not be missed at the very start or the very end of your prompt — not buried in the middle of a long paste.
- **Externalize knowledge instead of stuffing it into the chat.** The transcript is the worst place to store the durable facts of a project. Put decisions, specifications, and the *reasons* behind them somewhere stable and queryable, and let the assistant pull in only what each task needs. Retrieving the right paragraph beats re-reading the whole history every turn.
- **Watch for the tells.** Repetition, re-litigating settled questions, mixing up details you stated clearly, a creeping vagueness — these are the dashboard lights of context rot. When you see them, it is not a moment to push harder. It is a moment to reset.
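Two of the countermeasures above, the summary handoff and keeping important things at the edges, combine naturally into one small helper. The following is a sketch under stated assumptions: the `Handoff` structure, its field names, and `build_opening_prompt` are hypothetical, and the layout (conclusions first, bulky reference material in the middle, the task and a reminder last) is one reasonable arrangement rather than a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Conclusions carried into a fresh session (hypothetical structure)."""
    goal: str
    decisions: list[str] = field(default_factory=list)       # what was settled, and why
    rejected: list[str] = field(default_factory=list)        # ideas not to re-suggest
    open_questions: list[str] = field(default_factory=list)


def build_opening_prompt(handoff: Handoff, reference_material: str, task: str) -> str:
    """Place must-not-miss content at the edges and bulky material in the middle."""

    def bullets(items: list[str]) -> str:
        # An empty list renders as a single "(none)" bullet.
        return "\n".join(f"- {item}" for item in items) or "- (none)"

    head = (
        f"Goal: {handoff.goal}\n\n"
        f"Settled decisions:\n{bullets(handoff.decisions)}\n\n"
        f"Rejected, do not re-propose:\n{bullets(handoff.rejected)}"
    )
    middle = f"Reference material:\n{reference_material}"
    tail = (
        f"Open questions:\n{bullets(handoff.open_questions)}\n\n"
        f"Task for this session: {task}\n"
        "Reminder: respect the settled decisions listed at the top."
    )
    return "\n\n".join([head, middle, tail])
```

A new session would then open with something like `build_opening_prompt(Handoff(goal="Ship the pricing page", decisions=["Use annual billing by default (fewer churn events)"]), reference_material=spec_text, task="Draft the headline")`, where `spec_text` is whatever reference document the task needs. The debate that produced those decisions stays behind in the old conversation.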

If the previous piece in this series asked what it means for an AI to *sleep on it* — to consolidate offline and come back wiser — this is the daytime version of the same truth. You cannot make the model rest, but you can stop it from working across a desk it can no longer see.

## The Quiet Cost in a Long Project

One degraded conversation is an annoyance. The same effect across a whole organization is a budget line. An [earlier article in this series](/blog/the-rediscovery-tax/) called the cost of AI amnesia between sessions the *rediscovery tax* — the price of re-explaining context the system should have kept. Context rot is that tax levied *within* the session too: time lost to wobbling answers, to corrections that do not stick, to re-establishing facts the assistant had ten minutes ago. Multiply it by every employee using AI all day, and by every agent in an automated workflow doing the same thing without a human noticing, and the small afternoon slump becomes a structural drag.

It is also, quietly, one of the reasons so many ambitious AI pilots stall after the demo. A scripted demo is a short, clean context — the nine-o'clock desk. Real work is a long, messy, contradictory one. A tool that dazzles in the first five minutes and degrades over the first five hours will disappoint exactly where it was supposed to deliver. The organizations that get durable value are the ones that stop treating the chat window as memory and start giving their AI a real one: external, governed, queryable — a place where what matters is kept and what is noise is set aside.
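What "external and queryable" means in practice can be shown at toy scale. The sketch below assumes a plain in-memory decision log and uses naive word overlap as a stand-in for real retrieval. The `Decision` record and `relevant_decisions` function are hypothetical names for illustration, and a production system would use proper search and access control instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One durable fact of the project, stored outside the chat transcript."""
    topic: str
    decision: str
    reason: str


def relevant_decisions(log: list[Decision], task: str, limit: int = 3) -> list[Decision]:
    """Return up to `limit` decisions whose topic shares words with the task."""
    task_words = set(task.lower().split())
    scored = []
    for entry in log:
        overlap = len(task_words & set(entry.topic.lower().split()))
        if overlap > 0:
            scored.append((overlap, entry))
    # Highest overlap first; the sort is stable, so ties keep log order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:limit]]
```

Only the entries that match the task at hand are pulled into the prompt. The rest of the log stays on the shelf, which is the difference between a memory and a transcript.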

## Working With the Grain

You cannot switch context rot off. It is the texture of the tools we have, and it will be with us for a while yet. But it is not mysterious, and it is not your fault when it happens at 3pm. The model did not betray you and it did not get lazy. It got buried — and burying it is something you have a great deal of control over.

So treat the long session as something to manage, not something to trust blindly. Keep the desk clear. Start fresh when the answers start to drift. Put the durable knowledge somewhere sturdier than a scroll of chat. Do that, and the assistant that was sharp at nine can be sharp at three — not because it learned to stay awake, but because you stopped asking it to think across a desk it could no longer read.

Building AI memory that keeps the signal and sets aside the noise — so the answer reflects what matters, not everything ever said — is the work we do. If your AI is brilliant at 9am and unreliable by 3pm, that gap is not the price of doing business. We would be glad to show you the difference.

---

**Series**: [The Rediscovery Tax (B.1)](/blog/the-rediscovery-tax/) → [Why RAG Isn't Memory (B.2)](/blog/why-rag-isnt-memory/) → [Why Your AI Needs to Sleep On It (B.8)](/blog/why-your-ai-needs-to-sleep-on-it/) → Why Your AI Gets Worse at 3pm (C.4)
**Related**: [What 160+ Sessions Taught Us (C.2)](/blog/what-160-sessions-taught-us/) · [Designing for All Intelligences (C.3)](/blog/designing-for-all-intelligences/)

*Levente Peres, 2026. ICS - Sheridan. https://sheridan.hu/blog/why-your-ai-gets-worse-at-3pm/*
