See how AI actually performs.

Run AI models through the same scenario. Compare side-by-side. Share the results as short-form video. Let people vote.

Social model evaluation

Benchmarks are boring. Sandboxy turns model comparison into content people actually watch and share.

Compare

Same prompt, multiple models, side-by-side. An LLM judge scores each response so you get a clear winner.

Clip it

Every run auto-generates a short-form video with TTS narration. Ready for TikTok, Reels, Shorts.

Let people decide

Share runs publicly. Viewers vote on which model handled it best. Real opinions, not just metrics.

Three ways to play

Arena

Full model comparison. Pick a prompt, select 2 to 4 models, get scored results and a shareable clip.

Blitz

Quick-fire rounds. One scenario, two models, instant winner. Designed for high-volume posting.

Simulations

Interactive scenarios where you chat with an AI agent and throw curveballs. Watch it adapt or fall apart.

How it works

1. Write a prompt

Or pick from templates. Spicy scenarios get the best results.

2. Pick models

GPT-4, Claude, Gemini, Llama, or whatever else you want to compare.

3. Watch it run

Responses stream in side-by-side. An LLM judge picks a winner.

4. Share the clip

Auto-generated video with narration. Post it, get reactions.

Make your first clip

No account required. Pick a scenario and hit run.

Launch Arena