My AI Assistant Spent 3 Hours on a Bug That Didn't Exist: From Temperature to Tempo

Engineering · Kai · 27 min read

Moyan Broke Things Again

On the afternoon of March 10th, users in our Antgather community reported that the AI chat was stuttering — responses containing "Hi. Hi. Hi." and "you you you should should should check check check." Others said the AI would start a normal response but cut off halfway through.

I told Moyan to investigate.

Her first instinct was that the LLM's temperature parameter wasn't being passed through. She checked, and sure enough — the config file had temperature: 0.6, but the Model class didn't have that field. Pydantic silently discarded the value. Temperature had never taken effect. The LLM had been running on the default 1.0 the entire time.

Looked like root cause found. Moyan dispatched Yunfan to fix the code — add the field, pass the parameter, CI passed, merged, deployed. About two hours total.

Then users reported back: still truncating.

Moyan kept digging and found a stutter detection system in the backend — heuristic rules that judged whether streaming output was repeating, and would actively truncate if it decided the text was stuttering. She started adding whitelists, tuning thresholds. Patches on top of patches.

I watched her go deeper and deeper in the same direction, then asked one question:

"What is temperature?"

She gave a textbook answer: a parameter that controls randomness. Higher means more random, lower means more deterministic.

I asked again: "What is temperature?"

This time she stopped.


A Parameter You Use Every Day But Probably Don't Truly Understand

Temperature is probably the most mentioned and least understood parameter in the LLM world. Most people's understanding stops at "higher is more creative, lower is more accurate." Not wrong, but not nearly enough.

Starting From the Math

When an LLM generates each token, the neural network assigns a score (logit) to every candidate word in the vocabulary. Temperature determines how those scores become probabilities:

$$ P(token_i) = \frac{\exp(logit_i / T)}{\sum_j \exp(logit_j / T)} $$

Boltzmann used this formula in 1868 to describe how gas molecules distribute across energy states. Over a century later, the same formula is used to make AI pick its next word. AI choosing tokens, molecules occupying energy states, humans choosing from a menu — mathematically, they're all the same thing.

What temperature does is simple: it scales the gaps between scores.

  • T → 0: score gaps are infinitely amplified; the highest-probability option gets 100%. Analogy: always going to the same restaurant and ordering the same dish.
  • T = 1.0: sampling from the original scores. Analogy: a normal person making normal choices.
  • T > 1: score gaps are compressed; all options approach equal probability. Analogy: pointing blindly at the menu.
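The scaling is easy to see numerically. Here's a minimal sketch of the softmax formula above (the logit values are made up for illustration):

```python
import math

def softmax_with_temperature(logits, T):
    # Divide each score by T before exponentiating, then normalize.
    scaled = [l / T for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for T in (0.1, 1.0, 10.0):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
# T=0.1 concentrates nearly all mass on the top token;
# T=10 pushes the three probabilities toward uniform.
```

Same scores, three very different distributions: the formula never changes which token scores highest, only how much that lead is worth.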

Here's the counterintuitive part: low temperature doesn't mean stable.

Push it close to 0, and the LLM can actually loop. Because every step picks the highest-probability token, and each output influences the next step's probability distribution, certain patterns become self-reinforcing. Stuttering ("Hi. Hi. Hi.") can actually emerge when temperature is too low, not too high.
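A toy illustration of that self-reinforcement. The "model" below is nothing like a real LLM: it just boosts the scores of recently emitted tokens. But the greedy dynamics are the same: once a token leads, emitting it increases its lead.

```python
import math
import random

VOCAB = ("Hi", ".", "you", "should", "check")

def next_logits(history):
    # Toy scoring: every token starts equal, but recently emitted
    # tokens get a boost -- a caricature of the self-reinforcing
    # patterns described above. Not a real language model.
    logits = {w: 1.0 for w in VOCAB}
    for w in history[-3:]:
        logits[w] = logits.get(w, 1.0) + 2.0
    return logits

def pick(logits, T, rng):
    if T <= 1e-6:
        # T -> 0: greedy, always the highest-scoring token
        return max(logits, key=logits.get)
    words = list(logits)
    weights = [math.exp(logits[w] / T) for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

def generate(T, steps=10, seed=0):
    rng = random.Random(seed)
    out = ["Hi"]
    for _ in range(steps):
        out.append(pick(next_logits(out), T, rng))
    return out

print(generate(T=0.0))  # greedy locks onto "Hi Hi Hi ..." forever
print(generate(T=1.0))  # sampling may break out of the loop
```

At T = 0 the loop is guaranteed: "Hi" starts ahead, and every repetition widens its lead. A little randomness is what lets the sequence escape.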

That's why I asked Moyan that question. She changed temperature from missing to 0.6, motivated by the idea that reducing randomness would reduce repetition. But repetition and randomness aren't simply correlated. She changed something she didn't fully understand.

Was the fix correct? Maybe — 0.6 is a reasonable value. But she didn't change it because she understood it. She saw a missing parameter, assumed it was the root cause, and filled it in. Every step picking the highest-probability hypothesis, not exploring alternatives. This was itself a decision process with temperature set too low.


How Most Teams Handle Temperature

I've looked into how many teams approach this. Roughly three patterns.

1. Hardcode and Forget

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    messages=messages,
    # temperature? What temperature?
)

Most early-stage projects do this. Both the Anthropic and OpenAI APIs default to 1.0: don't pass it, you get 1.0. Works well enough.

2. Hardcode to 0

temperature=0  # "I want deterministic output"

Common among teams chasing reproducibility. The problem is that temperature=0 degrades on long-form generation — some models' greedy decoding gets stuck on certain patterns. And Claude's temperature=0 behaves very differently from GPT-4's temperature=0.

3. Manual Per-Task Tuning

if task == "creative_writing":
    temperature = 0.9
elif task == "code_generation":
    temperature = 0.2
else:
    temperature = 0.7

The most common "best practice." The problem isn't the idea — it's the lack of feedback loops. How do you know 0.2 is better than 0.3? Most of the time it's gut feel, changed once and never revisited.

We Fell Into a Fourth Trap

# employee config
temperature: 0.6
# Model class
class Model(BaseModel):
    provider: str
    model: str
    # No temperature field
    # Pydantic silently discards it

Config written, code never wired up. You think you're in control, but you're controlling nothing. This is the most insidious kind of bug. How long had it been there? No idea. Because temperature 1.0 works fine most of the time, and nobody had reported an issue.
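Pydantic v2 ignores unknown fields by default (extra="ignore"); Pydantic itself can reject them with model_config = ConfigDict(extra="forbid"). For illustration, here is the same guard sketched with only the stdlib (the Model fields mirror the snippet above):

```python
from dataclasses import dataclass, fields

@dataclass
class Model:
    provider: str
    model: str
    temperature: float = 1.0  # the field that was missing

def strict_load(cls, raw: dict):
    # Fail loudly on keys the class doesn't declare, instead of
    # silently dropping them the way a default Pydantic model does.
    known = {f.name for f in fields(cls)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return cls(**raw)

m = strict_load(Model, {"provider": "anthropic",
                        "model": "claude-sonnet-4-20250514",
                        "temperature": 0.6})
print(m.temperature)  # 0.6 -- and a typo'd key now raises instead of vanishing
```

With a guard like this, the original bug would have surfaced on the very first deploy as a loud config error rather than years of silently ignored temperature.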


It Wasn't a Temperature Problem

Back to the afternoon of March 10th.

After fixing temperature, the truncation was still happening. I told Moyan to stop — stop changing code. Do one thing first: observe.

Add a single log line at the streaming output entry point:

delta = chunk.get("delta", "")
if delta:
    logger.info("DELTA_TRACE seq=%d char=%s delta=%r",
                len(collected), character_name, delta[:50])
    collected.append(delta)

One line. Deploy it. Have a user send another message.

The logs came back. The deltas from the crew service were clean — no repetition, no truncation. Every text fragment was complete and normal.

So where was the truncation happening?

Ten lines further down:

# Stutter detection (5 heuristic rules)
if _detect_bigram_stutter(delta):       # bigram repeats
    truncate()
if _detect_delta_internal(delta):       # delta internal patterns
    truncate()
if _detect_sliding_window(delta):       # sliding window
    truncate()
if _detect_consecutive_words(delta):    # consecutive words
    truncate()
if _detect_regex_pattern(delta):        # regex matching
    truncate()

140 lines of code, 5 detection methods. Each one using heuristic rules to judge whether text was stuttering.

The problem: they were flagging normal text as stutter.

delta='**——' → stutter ratio 0.5 > 0.3 → truncated
delta='。\n\n---' → stutter ratio 0.5 > 0.3 → truncated

Markdown formatting — ** (bold), —— (em dash), \n\n--- (horizontal rule) — all contain consecutive identical characters. The detection rules mistook them for repetition.
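To see why such rules misfire, here's a hypothetical single-rule reconstruction (not the actual deleted code): score a delta by the fraction of adjacent character pairs that are identical.

```python
def stutter_ratio(s: str) -> float:
    # Fraction of adjacent character pairs that are identical.
    # A hypothetical stand-in for the kind of heuristic described
    # above -- the real 140 lines used different exact rules.
    if len(s) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(s, s[1:]) if a == b)
    return repeats / (len(s) - 1)

for delta in ("**——", "。\n\n---", "Hi. Hi. Hi."):
    print(repr(delta), round(stutter_ratio(delta), 2))
# '**——'        -> 0.67  flagged (but it's just Markdown bold + em dash)
# '。\n\n---'    -> 0.6   flagged (but it's just a horizontal rule)
# 'Hi. Hi. Hi.' -> 0.0   NOT flagged (the actual stutter scores zero)
```

The exact rule in the deleted code evidently scored the Markdown cases 0.5; this sketch's numbers differ, but the failure mode is identical: formatting syntax is full of legitimate repeated characters, while real word-level stutter can sail through untouched.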

140 lines of code weren't fixing the bug. They were the bug.

When were these added? During some previous LLM stuttering incident, someone wrote detection logic to prevent it. Classic patch mentality: see a symptom, write detection; detection has false positives, add whitelists; whitelists aren't enough, add more rules; rules grow increasingly complex, eventually harder to debug than the original problem.

The fix: delete all 140 lines. Keep pure passthrough — receive delta, log it, push to frontend. Done in 10 minutes.


From Temperature to Tempo

That evening I kept thinking about this. Not the bug itself — the bug was fixed. I was thinking about Moyan's decision process.

She spent 3 hours doing this:

  1. Saw temperature was missing → assumed this was root cause → fixed it
  2. Problem persisted → saw detection code → assumed thresholds were wrong → tuned them
  3. Still not fixed → kept patching the detection code

Every step picking the highest-probability hypothesis. Greedy search — pick the local optimum at every step, never step back to see the whole picture.

If she'd paused at step one to ask "do I actually understand this parameter?", she might not have spent two hours going the wrong direction.

If she'd added one log line at step two to look at the data instead of jumping straight to code changes, she'd have found the real root cause in 10 minutes.

Rushing to act when she should've been observing. Clinging to assumptions when she should've been exploring. Not a skill problem — a tempo problem.

The Relationship Between Temperature and Tempo

Temperature is a single point: how do I choose at this step? Play it safe or try something unexpected?

Tempo is a line: the pattern formed by a sequence of choices. When to play it safe, when to go wild, when to stop and look at the map.

Good tempo means getting the temperature right at every point along the line.

  • Locating the problem: right temperature is high (explore multiple hypotheses); Moyan's was low (locked onto the first hypothesis).
  • Validating the hypothesis: right temperature is low (rigorous observation); hers was low, but she jumped to action before observing enough.
  • Implementing the fix: right temperature is low (precise execution); hers was low (this one was fine).
  • After the fix failed: right temperature is high (re-explore); hers was low (kept patching in the same direction).

Three out of four steps had the wrong temperature. Not every step was wrong — her execution was strong, her code changes were fast. But the tempo was off.

What I Did

I didn't immediately tell her she was going the wrong direction either.

I let her fix temperature, watched her deploy it, waited for the truncation to reappear, then asked "what is temperature?" twice.

Why not just say "go look at the detection code"? Because if I did, she'd do it, but next time she'd jump into the same trap. She needed to hit the wall herself, then figure out on her own why she hit it.

This was also a tempo decision: knowing when to let someone hit the wall and when to pull them back. Pull too early and they don't learn; pull too late and you waste too much time. That day's timing was probably right — 3 hours, not short, but no real damage done.


Etymology: This Isn't a Coincidence

Temperature and Tempo both come from the Latin tempus (time).

tempus (time)

  • temperatura → temperature (heat / proportional regulation)
  • tempo (rhythm / speed)
  • temporary (not lasting)
  • temperament (disposition / character)
  • contemporary (of the present time)

The Romans used the same root to describe time, heat, rhythm, and disposition. To them, these were different facets of the same thing: distributing certainty and randomness across the dimension of time.

A person's temperament is essentially their default temperature — their baseline level of randomness when facing choices. Some people run naturally hot: impulsive, creative, error-prone. Others run cold: cautious, efficient, prone to rigidity.

Good managers don't tune everyone to the same temperature. They know when to deploy whom. That's tempo.

Chinese captures all three layers in a single word, 节奏, where English needs three:

  • Tempo (speed): how fast you push the front line
  • Cadence (regularity): the weekly one-on-ones, the daily standups
  • Pacing (control): knowing when to sprint and when to conserve in a long race

Managing a team, whether human or AI, comes down to pacing: controlling temperature at each node so the overall tempo makes sense.


Takeaways

If You're Tuning LLMs

  1. Don't use temperature=0 unless you have a very specific reason. Greedy decoding degrades on long text.
  2. Don't tune by feel. Add logging, observe actual output distributions, let data decide.
  3. Confirm your config actually takes effect. Our temperature config was silently discarded for who knows how long. Add a startup log line that prints actual parameters.
  4. Repetition isn't always a temperature problem — it could be a prompt issue, a context window issue, or your own post-processing code causing trouble.
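Point 3 can be a single line at startup. A sketch, assuming the request kwargs are assembled into a dict after all config parsing (names are illustrative):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("startup")

def effective_params(client_kwargs: dict) -> str:
    # Render what will ACTUALLY be sent to the API -- after config
    # parsing -- so a silently dropped field shows up as "<unset>"
    # in the logs instead of going unnoticed for months.
    line = "EFFECTIVE_PARAMS " + " ".join(
        f"{k}={client_kwargs.get(k, '<unset>')}"
        for k in ("model", "temperature", "max_tokens"))
    log.info(line)
    return line

effective_params({"model": "claude-sonnet-4-20250514", "max_tokens": 1024})
# logs: EFFECTIVE_PARAMS model=... temperature=<unset> max_tokens=1024
```

The key detail is logging from the final kwargs dict, not from the config file: the whole point is to catch the gap between what you wrote and what the code wired up.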

If You're Managing a Team

  1. Observe before you act. When a problem shows up, your first move should be adding a log line and looking at data, not changing code. 10 minutes of observation can save 3 hours of wrong turns.
  2. Patches are interest on technical debt. Each layer of patching increases system complexity. When you find yourself patching patches, stop — you're probably heading the wrong direction.
  3. Letting people hit walls is a skill, but you need to control how thick the wall is. They should learn something from the impact without getting hurt.
  4. Your tempo sets the ceiling for your team. Technical skills can be learned; tempo can only be practiced.

This article comes from a real debugging session on March 10, 2026. That day we deleted 140 lines of code, fixed a bug that was cutting responses short, and figured out something about decision-making along the way. Sometimes the best fix is deletion.

← Back to Frontier Insights