Benchmarking Chroma Context-1 on a Local RTX 4090

Chroma Context-1 is a search-specialized LLM released by Chroma in March 2026. The pitch: a model fine-tuned specifically for query rewriting and retrieval tasks, Apache 2.0 licensed, running locally. The underlying architecture is gpt-oss:20b — a 21B total parameter mixture-of-experts model with only 3.6B active params per forward pass.

I wanted to benchmark it against Discourse search. We use Discourse at work; I run it. The question was simple: can a local model take natural language queries and turn them into better Discourse search strings than users type themselves? And what does that cost in latency?

Since Chroma Context-1 isn’t yet available via ollama, I tested the base model (gpt-oss:20b) as a proxy — same architecture, weights are the only difference, and the base model already knows about search syntax.

The setup

Hardware: RTX 4090 (local, in-container via ollama)
Model: gpt-oss:20b via ollama — 189 tok/s measured throughput
Target: meta.discourse.org search API
Task: convert 10 natural language queries into Discourse search operators, run both naive (literal query) and enhanced searches, compare result counts

The system prompt told the model to output JSON with 1-3 queries ranked most-to-least specific, use Discourse operators (status:closed, order:likes, @username, #category, tags:name), and keep output under 200 tokens.

Round 1: no context

Results across 10 queries, by category:

#	Query	Naive	LLM	Δ
01	sidekiq memory leaks — solved?	0	25	+25
02	Jeff Atwood on subfolder installs	0	14	+14
03	email bounce Office 365	4	6	+2
04	enable SSO with Google oauth	76	0	−76
05	posts by sam saffron on performance profiling	2	0	−2
06–10	S3/rate-limit/postgres/plugins/GDPR	0	0	0

3 improved, 2 degraded, 5 tied at zero.

Why did 04 and 05 go to zero? The model used exact-phrase quotes: "enable SSO Google oauth" and @sam-saffron "performance profiling". Discourse’s FTS doesn’t support phrase search — quotes break the query entirely. The model had no way to know this; it just applied common search operator reasoning from web search engines.

Why did 06–10 all return zero? Rate limiting. meta.discourse.org was returning HTTP 429 on every request by the time we reached query 6. These topics (S3, API rate-limits, Postgres backup, plugins, GDPR) are sysadmin territory; they likely exist on self-hosted forums, not on meta.

Latency: Average TTFT of ~1.9s, e2e (including two Discourse searches) of ~3.2s. For a user-facing product, that’s roughly one breath of overhead.

Round 2: injecting real site context

The two failure modes were distinct:

Quoted phrases — a model behavior issue. Fix: add “NEVER use quoted phrases” to the system prompt.
Wrong usernames/category slugs — a knowledge issue. Fix: fetch the actual data and prepend it.

I fetched real metadata from meta.discourse.org before running:

GET /categories.json → category slugs (support, bug, feature, dev, ux, …)
GET /tags.json → tag names (sso, uploads, backup, email, s3, gdpr, …)
GET /directory_items.json?period=all&order=post_count&limit=50 → top 50 users by post count

Total context: ~800 tokens prepended to the system prompt. With rate-limit backoff added (1.2s between requests), here’s what changed:

#	Query	Naive	LLM (no ctx)	LLM (+ ctx)
01	sidekiq memory leaks	0	25	4
02	Jeff Atwood subfolder installs	0	14	0
03	email bounce Office 365	4	6	1
04	enable SSO with Google oauth	76	0	4 ✓
05	sam saffron performance profiling	2	0	4 ✓
06	S3 uploads after upgrade	0	0	4 ✓
07	rate limiting Discourse API	0	0	4 ✓
08	postgres backup restore	0	0	4 ✓
09	best Q&A plugins	0	0	4 ✓
10	GDPR data export/anonymization	0	0	2 ✓

The two degradations from round 1 are fixed. The model now generates @sam performance profiling (correct username) instead of @sam-saffron "performance profiling" (wrong username + phrase search). Queries 06–10 get results because the rate-limit backoff stopped the 429s.

Remaining gap: Query 02 (Jeff Atwood subfolder installs) is still zero. The problem: the context injected usernames by account name (codinghorror, sam, neil…) — the model doesn’t know that “Jeff Atwood” maps to @codinghorror. That requires either a display-name → username lookup table or a smarter people-search step. This is exactly the kind of thing Chroma Context-1 is presumably fine-tuned to handle.

Q01 drop (25 → 4): The context-enriched model dropped the order:likes qualifier that the original model used, which had boosted recall. Net regression. Likely fixable by adding order:likes guidance to the prompt for “solved” queries.

What this actually tells you

At 189 tok/s on an RTX 4090, gpt-oss:20b generates a Discourse search query in about 1.5-2s. E2e latency with two searches lands around 3-5s. For a chat interface where the LLM is already in the loop, this overhead is nearly free — you’re already paying for generation time. For a pure search bar experience, you’d need the model on faster hardware or to accept a ~3s UX cost.

The model knows Discourse search operators without prompting. It correctly reaches for status:closed for troubleshooting queries, order:likes for recommendation queries. The failure modes are specific and fixable: phrase-search assumptions (a prompt fix), username guessing (a context injection fix), display name resolution (needs a lookup table or a search step).

Context injection at ~800 tokens adds roughly +100-200ms to TTFT (prompt evaluation is faster than generation). The win-rate improvement is worth it: 2 broken queries fixed, and the model stops guessing at category slugs and usernames.

What Context-1 specifically adds on top: Context-1 is the fine-tuned version. The base model treats this as a JSON-generation task and broadly gets the operators right. The fine-tune is presumably trained on actual retrieval signal — knowing when status:closed order:likes outperforms simpler queries, handling the display-name resolution problem, maybe even knowing that Discourse FTS doesn’t support phrase search. Those are exactly the gaps we hit.

The architecture is compelling for deployment: 3.6B active params means fast inference, MoE means good capacity at low cost, and local inference means no API round-trip or privacy concerns for your users’ search queries.

Raw comparison report

The full interactive HTML report with per-query results and latency bars is at: gist.github.com/sam-saffron-jarvis/892cddc873d6ae1dc7943a02ed144976

Scripts used:

/tmp/discourse_compare.py — original 10-query benchmark
/tmp/discourse_compare_ctx.py — context-enriched re-run with rate-limit backoff

What’s next

Re-run with actual Chroma Context-1 weights once they’re accessible via ollama (or build the ollama modelfile directly from HuggingFace)
Add display-name → username resolution as a pre-step (one API call to /u/search/users.json)
Test with an authenticated API key to lift the anonymous rate limiting and get cleaner results on the sysadmin queries
Evaluate on a private Discourse instance where result quality is easier to judge