Tags: transparency, AI models, data-driven, model picker, user preferences

Real Data, Not Marketing: How We Actually Rank AI Models

Reverie Team

A Promise We Made in November

Back in our dual-response comparison post, we asked you to do something small: when you saw two AI responses side by side, pick the one you preferred.

We promised those choices would matter. We said your preferences would help us recommend better models, optimize our routing, and "build features that align with real user preferences."

Today we're shipping the first feature that's directly built on that data — and we want to walk you through exactly how it works, because transparency is the whole point.

The New Model Picker

Open the model dropdown in any chat. You'll notice it's no longer a flat list of names. Every model now has two visible metrics:

  • Quality: how often you and other users preferred this model in real A/B comparisons
  • Speed: how fast it actually responds, measured from its most recent real requests

That's it. No marketing badges. No "Editor's Choice." No "Premium" stickers we made up to upsell you. Just two numbers, both derived from real usage data.

How "Quality" Works (And Why It's Different)

Most AI platforms rank models in one of three ways:

  1. Vendor PR: "Anthropic released a new model, so we promote it."
  2. Internal vibes: "Our team tested it for an hour and liked it."
  3. Whatever pays the most: "Our partner is offering us a kickback this month."

None of these tell you whether a model is actually good for roleplay — which is what you're here for.

Here's what we do instead:

Step 1: Real A/B duels

Every time you saw the dual-response comparison and picked one over the other, we recorded it. Same prompt, same character, same context — only the model changed. Your choice was the only signal.

After months of this, we have tens of thousands of head-to-head matchups across every active model on the platform.
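
For the curious, one of those matchups boils down to a record like the sketch below. The shape and field names are illustrative, not our actual schema:

```typescript
// Illustrative shape of one A/B duel record (hypothetical field names,
// not our production schema). Everything is held constant except the
// two models; the user's pick is the only signal stored.
interface DuelRecord {
  promptHash: string;  // same prompt for both responses
  characterId: string; // same character, same context
  modelA: string;
  modelB: string;
  winner: "A" | "B";   // your choice
  votedAt: string;     // ISO 8601 timestamp
}
```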

Step 2: Wilson Score, not naive win rate

Here's a subtle but important detail. If a model has 4 wins and 1 loss, its naive win rate is 80%. Sounds great. But 5 samples is statistical noise. A model with 800 wins and 200 losses also has an 80% win rate, and we should obviously trust the second one more.

We use the Wilson Score Lower Bound — the same algorithm Reddit uses to rank comments. It penalizes small sample sizes, so a model needs both high preference and enough data to climb the ranking.
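
For readers who want the formula, here's the standard Wilson lower bound as a short sketch (z = 1.96 for 95% confidence; our production code may differ in the details):

```typescript
// Wilson score interval, lower bound (standard formula, shown as a
// sketch). Penalizes small samples: high win rates need volume to rank.
function wilsonLowerBound(wins: number, total: number, z = 1.96): number {
  if (total === 0) return 0;
  const p = wins / total; // naive win rate
  const z2 = z * z;
  const center = p + z2 / (2 * total);
  const margin = z * Math.sqrt((p * (1 - p) + z2 / (4 * total)) / total);
  return (center - margin) / (1 + z2 / total);
}

wilsonLowerBound(4, 5);      // ~0.38 -- 80% win rate, tiny sample
wilsonLowerBound(800, 1000); // ~0.77 -- 80% win rate, real volume
```

Both models above have the same 80% naive win rate; the lower bound separates them exactly the way your intuition says it should.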

This is why you'll see a row of small confidence dots next to every model: ●●●●○ means "we have a lot of data on this one." ●●○○○ means "treat the number with caution." We're not hiding the uncertainty — we're showing it to you.
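
The dots themselves are just a bucketed view of sample size. A sketch with made-up thresholds (we don't publish the real cutoffs):

```typescript
// Map duel volume to confidence dots. Thresholds are hypothetical,
// purely to illustrate the bucketing idea.
function confidenceDots(duelCount: number): string {
  const thresholds = [10, 50, 200, 1000, 5000]; // illustrative only
  const filled = thresholds.filter((t) => duelCount >= t).length;
  return "●".repeat(filled) + "○".repeat(5 - filled);
}
```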

Step 3: We never tell you the raw numbers

Here's where we draw a privacy line. The picker shows you the win rate and the confidence level, but never the absolute counts. A model might have been chosen 1,200 times or 12,000 times — you'll see "●●●●●" for both, because exposing exact counts would leak how many users we have and what they're doing.

This is a deliberate trade-off. We want you to trust the rating without us turning into an analytics dashboard that anyone can scrape.

How "Speed" Works

Quality comes from your votes. Speed has to come from measurement: real requests, not vendor-claimed benchmarks.

Every message we serve carries a metadata blob with the actual time-to-first-token and tokens-per-second for that response. We aggregate the most recent ~50 of those per model and surface the median (p50) — not the average.

Why median? Because averages lie when there are outliers. If a model is normally fast but had three slow nights last week, the average will make it look slower than it typically is. The median tells you what you'll typically experience, which is what you actually care about.

If you hover the speed indicator, you'll see the p95 too — the worst-case latency. Some models have very tight latency distributions, others have long tails. Now you can see both.
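
Computing both numbers from the rolling window is simple. A minimal sketch, assuming an array of recent time-to-first-token samples in milliseconds:

```typescript
// Nearest-rank percentile over a rolling latency window (sketch).
// We report p50 in the picker and p95 on hover.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const index = Math.min(
    sorted.length - 1,
    Math.floor((p / 100) * sorted.length),
  );
  return sorted[index];
}

const ttftMs = [620, 700, 640, 9800, 610, 680, 655]; // one bad night
percentile(ttftMs, 50); // 655  -- what you typically get
percentile(ttftMs, 95); // 9800 -- the tail, shown on hover
// For comparison, the average here is ~1958ms: the outlier drags it up.
```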

How "Speed" Score Is Visualized

A 1-second response feels much faster than a 5-second response. A 5-second response feels almost the same as a 6-second response. Latency perception is logarithmic, so the speed bar is mapped on a log scale across [500ms, 20s].

This means a reasoning model that takes 18 seconds for the first token (yes, this is real) shows up with a nearly empty speed bar, not just a slightly shorter version of an 8-second model's bar. Because in your gut, those two experiences feel completely different.
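
In code, the mapping from a p50 latency to bar fullness is a one-line log transform. A sketch over the [500ms, 20s] range mentioned above:

```typescript
// Log-scale speed bar: map latency in [500ms, 20s] to a 0..1 fill,
// where 1 = fast (full bar) and 0 = slow (empty bar).
function speedBarFill(latencyMs: number): number {
  const MIN_MS = 500;
  const MAX_MS = 20_000;
  const clamped = Math.min(Math.max(latencyMs, MIN_MS), MAX_MS);
  return 1 - Math.log(clamped / MIN_MS) / Math.log(MAX_MS / MIN_MS);
}

speedBarFill(1_000);  // ~0.81
speedBarFill(8_000);  // ~0.25
speedBarFill(18_000); // ~0.03 -- the 18-second reasoning model
```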

What We Do Not Do

We want to be specific about the things we deliberately avoid:

  • No "recommended for you" black-box. The default sort is just the data: by Wilson Score, descending. Pro models float to the top only because Pro users paid for them, not because we manually boosted them. Within Pro and within Free, the order is purely earned.
  • No fake "new and improved" badges. A model only gets the NEW tag if it was added to the platform within the last 14 days. After that the badge falls off automatically — no human can keep it pinned.
  • No partner-driven ranking. We don't take payment from any AI provider for placement. If OpenAI or Anthropic released a model tomorrow that scored worst in our duels, it would sit at the bottom of the list. (And honestly, we'd ship it anyway and let the data speak.)
  • No deference to the highest-priced model. Our most expensive Pro model isn't always the highest-quality model on the platform right now. We show you that. We don't hide it.
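
To make the first point concrete, the default ordering is essentially the comparator below. A sketch, assuming each entry carries its tier and its Wilson lower bound:

```typescript
// Default picker sort (sketch): Pro above Free, then Wilson lower
// bound descending within each tier. No manual boosts anywhere.
interface RankedModel {
  name: string;
  tier: "pro" | "free";
  wilsonLowerBound: number;
}

function defaultSort(models: RankedModel[]): RankedModel[] {
  return [...models].sort((a, b) => {
    if (a.tier !== b.tier) return a.tier === "pro" ? -1 : 1;
    return b.wilsonLowerBound - a.wilsonLowerBound;
  });
}
```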

The "Evaluating" Bucket

When a model is brand new, it doesn't have enough A/B data to get a meaningful Wilson Score. Slapping a 50% win rate next to it would be misleading.

So new models go into an Evaluating bucket at the top of the picker. They show their speed (which we can measure immediately) but say "Collecting data" where the quality number would be. Once they accumulate enough duels, they automatically graduate into the main ranked list.
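
The graduation logic is equally plain. A sketch with a hypothetical threshold, since we don't publish the real one:

```typescript
// "Evaluating" bucket check (sketch). MIN_DUELS is hypothetical;
// the real graduation threshold isn't published.
const MIN_DUELS = 500;

function qualityLabel(duelCount: number, winRate: number): string {
  if (duelCount < MIN_DUELS) return "Collecting data";
  return `${Math.round(winRate * 100)}%`;
}
```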

You'll always know whether the rating you're looking at is statistically meaningful or just a placeholder.

Why This Matters

The AI companion space is full of platforms that talk a big game about "the best models" without ever showing you a single number. Some of them front cheaper models behind premium pricing. Some route you to whichever model is on sale that month. Some just guess.

We pick a different default: tell you the truth, show you the math, let you decide.

You don't have to trust us when we say a model is good. You can look at the bar, see the confidence dots, hover for the p95, and make up your own mind. If you think the data is wrong, the dual-response comparison is still running in your chats — and your next preference vote will move the number.

A Reminder: All Official APIs

This is also a good moment to repeat something we've said before: every model in the picker is served through the official provider API. No fine-tuned knock-offs. No quantized stand-ins. No "GPT-4-equivalent" mystery models from third parties.

If the picker says "Claude Opus 4.6", you're talking to Claude Opus 4.6. If it says "DeepSeek V3.2", you're talking to DeepSeek V3.2. The quality scores are meaningful precisely because the models are real.

What's Next

The picker is the visible part. There's more we want to do with this data:

  • Per-character recommendations. Different models excel at different character archetypes. Our preference data should let us suggest "users tend to prefer model X for this kind of character."
  • Personalized rankings. Right now everyone sees the same global ranking. Eventually you should see your preferred models float to the top, based on your own past choices.
  • Live model health alerts. If a provider's API has a bad day and TTFT spikes, the picker should reflect that within the hour, not the next day.

But all of that requires a foundation of honest data. That foundation is what we're shipping today.


Open the model picker in your next chat and take a look. If you've voted in dual-response comparisons, your fingerprints are on every number you see.

Try the New Picker →
