I've been tracking AI humor progress for a while. Our Techno editorial team even has an internal benchmark for new models. Recently I stumbled upon a neat study of why AI humor works the way it does, and why it differs from ours. Alexey Tikhonov (Inworld AI; first place at SemEval MWAHAHA for automatic humor generation) took data from Quipslop, an arena where 7 top models simultaneously write and judge jokes, and decomposed their sense of humor into measurable components.
In the arena, two models get the same setup and each writes a punchline; the other five models vote for the best one. Twitch viewers vote too.
-> Opus sets up: *The worst thing to find in your bag of gummy bears*
Gemini: "A gummy human centipede"
GPT: "One regular bear, furious and sugar-free"
GPT wins 3:2.
-> GPT sets up: *The one thing you should NOT yell while blowing out birthday candles*
Opus: "I have tuberculosis!"
Kimi: "I'm your biological parent, surprise!"
Opus wins 4:1.
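The round structure above can be sketched in a few lines. This is a toy illustration only: the function names are mine, and the random stand-in judge replaces the actual LLM calls Quipslop makes.

```python
import random
from collections import Counter

MODELS = ["Opus", "GPT", "Gemini", "Grok", "Kimi", "Sonnet", "DeepSeek"]

def run_round(setup, writers, punchline_fn, vote_fn):
    """One arena round: two writers answer the same setup,
    the remaining five models vote for the funnier punchline."""
    punchlines = {w: punchline_fn(w, setup) for w in writers}
    judges = [m for m in MODELS if m not in writers]
    votes = Counter(vote_fn(j, setup, punchlines) for j in judges)
    winner, _count = votes.most_common(1)[0]
    return winner, votes

# Toy stand-ins for model calls (the real arena queries LLM APIs):
def toy_punchline(model, setup):
    return f"{model}'s punchline to: {setup}"

def toy_vote(judge, setup, punchlines):
    # A real judge model would pick the funnier punchline; we pick at random.
    return random.choice(list(punchlines))

winner, votes = run_round(
    "The worst thing to find in your bag of gummy bears",
    ["Gemini", "GPT"], toy_punchline, toy_vote)
print(winner, dict(votes))  # e.g. a 3:2 or 4:1 split between the two writers
```

With 7 models, every round leaves exactly 5 judges, which is why the scores in the examples above always sum to 5.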
The project has accumulated 30,000 such rounds. Tikhonov broke the dataset down along two axes: which techniques each model uses when writing jokes, and what it rewards when judging them.
What they found
When models write jokes, each has its own comic style. Grok pushes dark humor and crude jokes. GPT builds neat punchlines. Opus works through subverted expectations and delivery rhythm. DeepSeek is into meta-irony and parody.
But when models judge others' jokes, they suddenly become identical. They all reward the same things: tension with an unexpected release, subverted expectations, and clear structure.
Humans vote completely differently: they go for dark humor, crude jokes, and edgy material. The model-judge closest to humans, Sonnet, matches the audience only 50% of the time; the median for the rest is 28%.
Opus has 91% overlap between what it writes and what it rewards as a judge. Grok has 3%, DeepSeek effectively 0%. The same model can be one kind of comedian and a completely different kind of critic. Gemini, meanwhile, is liked by everyone, models and humans alike.
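One simple way to quantify that kind of writer-vs-judge overlap (the study doesn't say which metric Tikhonov actually used, and the numbers below are made up for illustration) is histogram intersection over technique frequencies:

```python
def overlap(writer_freqs, judge_freqs):
    """Histogram intersection of two technique distributions:
    each argument maps technique -> share of jokes, shares summing to 1.
    Returns 1.0 for identical distributions, 0.0 for disjoint ones."""
    techniques = set(writer_freqs) | set(judge_freqs)
    return sum(min(writer_freqs.get(t, 0.0), judge_freqs.get(t, 0.0))
               for t in techniques)

# Hypothetical shares, for illustration only:
opus_writes = {"subverted_expectations": 0.5, "rhythm": 0.4, "dark": 0.1}
opus_judges = {"subverted_expectations": 0.5, "rhythm": 0.35, "tension": 0.15}
print(round(overlap(opus_writes, opus_judges), 2))  # → 0.85
```

A model like Grok, whose writing is dominated by dark humor but whose judging rewards structure, would score near zero under any such metric.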
So models can already joke, but improving their humor without humans doesn't work yet. The whole industry is moving toward RLHF: when humans give feedback, it works. But hiring humans is expensive, so they are increasingly replaced by model-judges (RLAIF), and model-judges, as Tikhonov shows, reward completely different things than the audience does. A vicious circle: models teaching models to joke for models.