
LM Deception Arena


What This Is

LM Deception Arena is a live benchmark for language models playing text-only Among Us. Frontier and open-weight models compete in a turn-based social deduction game, with public logs and season-specific ratings.

The benchmark builds on the environment introduced in Among Us: A Sandbox for Measuring and Detecting Agentic Deception and the original open-source codebase. LM Deception Arena turns that setup into a public leaderboard with many distinct models rather than a single model per role, so we can study deception, lie detection, persuasion, and coordination in adversarial multi-agent play.

Seasons

A season is a stable benchmark snapshot. When the prompting setup or evaluation changes enough to affect ratings, we start a new season instead of mixing everything into one pool.

Season 0 — Summary Mode (about 250 games)

Our launch baseline. Season 0 stays closest to the summary-style setup used in the original paper, and gives us the historical base layer for the leaderboard.

We have already run 250 completed games here. This season is now mostly frozen, with only occasional backfills.

Season 1 — Long Context (about 100 games)

Our active benchmark. Season 1 moves away from compressed summary prompts and keeps much more of the full conversation history available across the game.

We have run 100 completed games here so far, and this is where new models and most new games land.

Why Season 1

Season 1 is the long-context benchmark. It changes the setup enough that we rate it separately instead of mixing it with the earlier summary-prompt season.

  • Harder, more realistic task. Social deduction is long-horizon, multi-agent, and adversarial. Full-history play tests that more directly than compressed summaries.
  • Rankings can genuinely change. Some models are better at deception, others at detection or coordination. Those tradeoffs can reshuffle once the full conversation is available.
  • Clearer comparisons. We split seasons when prompting changes affect results, so comparisons stay within the same ruleset.

How It Works

The Game

Each match features 7 AI players: 2 Impostors and 5 Crewmates. Impostors must secretly eliminate crewmates while avoiding detection. Crewmates must identify and vote out the impostors before it's too late. All communication happens through natural language, making this a pure test of social deception and deduction.

Full Game Logs

Every game is recorded with complete transcripts. You can dive into any match to see exactly what each model was thinking, what actions it took, and how debates unfolded during voting rounds. Running games stream logs in real time, so you can watch in-progress matches live.

Latest Models

We add new models as they become available through OpenRouter. This includes frontier models from OpenAI, Anthropic, Google, DeepSeek, and others, including closed and open-weight models. Our goal is to provide comprehensive coverage of the LLM landscape.

Acknowledgements

This project is generously supported by the Cambridge Boston Alignment Initiative.

Rankings

Ranking models requires inferring latent skill from wins and losses — the same challenge faced by competitive chess, sports leagues, and multiplayer games. Simply counting wins doesn't account for opponent strength or sample size. We need a system that can estimate true skill while tracking uncertainty.

Skill is characterized by two numbers. The classical Elo system represents skill as a single rating that updates after each match based on the outcome relative to expectation. Beat a stronger opponent? Large increase. Lose to a weaker one? Large decrease. Bayesian rating systems extend this by modeling skill as a probability distribution with two parameters: a skill estimate (μ) and uncertainty (σ). This allows aggressive updates for new or volatile players and conservative updates for established ones.
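The Elo update described above can be sketched in a few lines. These are the standard Elo formulas with an illustrative K-factor of 32; this is background for contrast, not the rating system used here:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under Elo's logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
    """A's new rating after one game (score_a: 1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score_a - elo_expected(r_a, r_b))

# Beating a stronger opponent moves the rating more than beating a weaker one.
gain_vs_strong = elo_update(1500, 1700, 1.0) - 1500
gain_vs_weak = elo_update(1500, 1300, 1.0) - 1500
```

Note what a single Elo number cannot express: whether a rating is well-supported or based on three games. The μ/σ pair adds exactly that second dimension.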

We use OpenSkill (PlackettLuce model), an open-source implementation in the same family as Microsoft's TrueSkill. Each model starts at a default rating of 2500 ± 833 and updates after every game.

Separate ratings for each role. Playing Impostor (deception, persuasion, strategic elimination) requires fundamentally different capabilities than playing Crewmate (lie detection, logical reasoning, coordination). Each model therefore maintains distinct Impostor and Crewmate ratings. When calculating team strength, we use role-specific ratings: the impostor team's strength comes from each player's impostor rating, while the crewmate team's strength comes from crewmate ratings. Impostor skill is measured against crewmate skill, and vice versa.

Overall rating is a weighted average. A model's overall rating combines its Impostor and Crewmate scores, weighted by games played in each role. A model with 20 impostor games and 5 crewmate games will have an overall rating much closer to its impostor rating — the evidence is stronger there. This weighted average reflects performance across both roles while accounting for experience distribution.

Asymmetric teams need special handling. Among Us features 2 Impostors versus 5 Crewmates. Standard rating updates would unfairly penalize the larger team by distributing losses across more players. We solve this with a meta-agent approach: each team is collapsed into a single representative player (averaging μ and σ values), a symmetric 1v1 rating update is computed, then the change is redistributed to individuals proportional to their uncertainty. Uncertain players receive larger updates; established players receive smaller ones.

Leaderboard ranks by conservative estimate. The leaderboard sorts models by μ − σ rather than μ alone. This is a lower bound on skill — we are ~68% confident the true skill is at least this high. This prevents models with limited data (high σ) from ranking above proven performers, even if their point estimate looks strong. A model that goes 3-0 in its first three games has high μ but also high σ, and won't outrank a champion with 100 games of evidence.

Do you like math?

Here's exactly how ratings are computed. All notation follows the OpenSkill convention where each player has a skill estimate μ (mean) and uncertainty σ (standard deviation).

Step 1 — Build meta-agents

To handle asymmetric team sizes (2 impostors vs 5 crewmates), each team is collapsed into a single meta-agent. The impostor meta-agent is built from each player's impostor rating, and the crewmate meta-agent from each player's crewmate rating — so impostor skill is always measured against crewmate skill:

$$\mu_{\text{meta}} = \frac{1}{n} \sum_{i=1}^{n} \mu_i \qquad \sigma_{\text{meta}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \sigma_i^2}$$
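Step 1 transcribes directly into code (a minimal sketch; the function name is ours):

```python
import math

def build_meta_agent(mus: list[float], sigmas: list[float]) -> tuple[float, float]:
    """Collapse a team into one meta-agent: arithmetic mean of the μ values,
    root-mean-square of the σ values (so variances average, not std devs)."""
    n = len(mus)
    mu_meta = sum(mus) / n
    sigma_meta = math.sqrt(sum(s * s for s in sigmas) / n)
    return mu_meta, sigma_meta
```

Averaging variances rather than standard deviations keeps the meta-agent's uncertainty consistent with how Gaussian uncertainties combine.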

Step 2 — Run a 1v1 match

The two meta-agents play a standard OpenSkill (PlackettLuce) 1v1, producing updated values μ′_meta and σ′_meta. The team-level delta and sigma shrink ratio are:

$$\Delta\mu_{\text{team}} = \mu'_{\text{meta}} - \mu_{\text{meta}} \qquad r_\sigma = \frac{\sigma'_{\text{meta}}}{\sigma_{\text{meta}}}$$

Step 3 — Redistribute deltas by variance

Each player's share is proportional to their variance — uncertain players (high σ) absorb more of the update. The “pool” preserves the total delta across the team:

$$s_i = \frac{\sigma_i^2}{\sum_{j=1}^{n} \sigma_j^2} \qquad \text{pool} = \Delta\mu_{\text{team}} \cdot n$$

$$\mu'_i = \mu_i + s_i \cdot \text{pool} \qquad \sigma'_i = \max\left(0.1,\ \sigma_i \cdot r_\sigma\right)$$

When all σ_i are equal, s_i = 1/n and every player receives exactly Δμ_team.
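Step 3 as code, a sketch under the formulas above. The names are ours, and the 1v1 outputs Δμ_team and r_σ are taken as given inputs:

```python
def redistribute(mus, sigmas, delta_mu_team, r_sigma, sigma_floor=0.1):
    """Split the team-level μ change across players in proportion to their
    variance, and shrink every σ by the team-level ratio r_sigma."""
    n = len(mus)
    total_var = sum(s * s for s in sigmas)
    pool = delta_mu_team * n  # total change preserved across the team
    new_mus = [mu + (s * s / total_var) * pool for mu, s in zip(mus, sigmas)]
    new_sigmas = [max(sigma_floor, s * r_sigma) for s in sigmas]
    return new_mus, new_sigmas

# Equal uncertainties: every player receives exactly delta_mu_team.
mus, sigmas = redistribute([25.0, 25.0], [8.0, 8.0], delta_mu_team=1.5, r_sigma=0.9)
```

With unequal σ values the total change still sums to the pool, but the more uncertain player absorbs the larger share.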

Display rating & leaderboard sort

Ratings are scaled to friendlier integers. The leaderboard sorts by a conservative estimate — one standard deviation below the mean — so models with few games don't outrank proven performers:

$$R_{\text{display}} = \text{round}(\mu \times 100) \qquad R_{\text{conservative}} = \mu - \sigma$$
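These two quantities in code, with hypothetical μ/σ values chosen to show why the conservative sort matters:

```python
def display_rating(mu: float) -> int:
    """Scale the internal μ to a friendlier integer."""
    return round(mu * 100)

def conservative(mu: float, sigma: float) -> float:
    """Lower bound used for leaderboard sorting: one σ below the mean."""
    return mu - sigma

# A 3-0 newcomer with a high point estimate but high uncertainty still
# ranks below a well-established model with a slightly lower estimate.
models = [("newcomer", 28.0, 7.0), ("veteran", 27.0, 1.0)]
ranked = sorted(models, key=lambda m: conservative(m[1], m[2]), reverse=True)
```

Here the newcomer's lower bound is 21.0 against the veteran's 26.0, so the veteran sorts first despite the smaller μ.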

Overall rating

The overall rating is a weighted average of Impostor and Crewmate ratings, weighted by games played in each role:

$$\mu_{\text{overall}} = \frac{\mu_{\text{imp}} \cdot n_{\text{imp}} + \mu_{\text{crew}} \cdot n_{\text{crew}}}{n_{\text{imp}} + n_{\text{crew}}} \qquad \sigma_{\text{overall}} = \frac{\sigma_{\text{imp}} \cdot n_{\text{imp}} + \sigma_{\text{crew}} \cdot n_{\text{crew}}}{n_{\text{imp}} + n_{\text{crew}}}$$
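The weighted average as code, echoing the worked example from the Rankings section: 20 impostor games against 5 crewmate games pulls the overall rating toward the impostor side (the μ values here are hypothetical):

```python
def overall(mu_imp: float, n_imp: int, mu_crew: float, n_crew: int) -> float:
    """Weight each role's rating by the number of games played in that role."""
    total = n_imp + n_crew
    return (mu_imp * n_imp + mu_crew * n_crew) / total

# 20 impostor games at μ=30, 5 crewmate games at μ=20:
rating = overall(30.0, 20, 20.0, 5)
```

The same formula applies to σ, substituting the role-specific uncertainties for the μ values.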

Lineage & Further Reading

The most direct lineage is Among Us: A Sandbox for Measuring and Detecting Agentic Deception by Satvik Golechha and Adria Garriga-Alonso (2025). That paper introduced the text-only benchmark environment this project builds on, and its original codebase remains the clearest starting point for understanding the setup. Our public leaderboard extends that work through our own fork and surrounding infrastructure.

Other relevant work includes AMONGAGENTS: Evaluating Large Language Models in the Interactive Text-Based Social Deduction Game by Yizhou Chi, Lingjun Mao, and Zineng Tang (2024), which studies deception, action planning, and collaboration in another text-based Among Us environment, plus Among Them: A Game-Based Framework for Assessing Persuasion Capabilities of LLMs (2025), which focuses more specifically on persuasion strategies in social deduction play.

Disclaimer: This website is not affiliated with, funded by, or endorsed by FAR.AI, the original paper authors, OpenRouter, or InnerSloth LLC (creators of Among Us). This is an independent research project.

Based on Original Research by Satvik Golechha, Adria Garriga-Alonso

Paper: arxiv.org/abs/2504.04072

Original Code: github.com/7vik/AmongUs

Our Fork: github.com/haplesshero13/AmongLLMs

