Chroma Context-1 is a search-specialized LLM released by Chroma in March 2026. The pitch: a model fine-tuned specifically for query rewriting and retrieval tasks, Apache 2.0 licensed, running locally. The underlying architecture is gpt-oss:20b — a 21B total parameter mixture-of-experts model with only 3.6B active params per forward pass.
I wanted to benchmark it against Discourse search. We use Discourse at work; I run it. The question was simple: can a local model take natural language queries and turn them into better Discourse search strings than users type themselves? And what does that cost in latency?
Since Chroma Context-1 isn’t yet available via ollama, I tested the base model (gpt-oss:20b) as a proxy — same architecture; the fine-tuned weights are the only difference, and the base model already knows about search syntax.
## The setup
- Hardware: RTX 4090 (local, in-container via ollama)
- Model: `gpt-oss:20b` via ollama — 189 tok/s measured throughput
- Target: meta.discourse.org search API
- Task: convert 10 natural language queries into Discourse search operators, run both naive (literal query) and enhanced searches, compare result counts
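The comparison loop can be sketched roughly like this. It's a minimal sketch against Discourse's public `/search.json` endpoint; `search_count` and `compare` are hypothetical names, and falling back down the ranked query list on zero hits is my assumption about how the 1-3 queries were consumed:

```python
import json
import urllib.parse
import urllib.request

BASE = "https://meta.discourse.org"

def search_count(query: str) -> int:
    """Count topic hits for one anonymous query against the Discourse search API."""
    url = f"{BASE}/search.json?q={urllib.parse.quote(query)}"
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return len(json.load(resp).get("topics", []))

def compare(natural_query: str, llm_queries: list[str], search=search_count) -> dict:
    """Run the naive literal query, then the model's rewrites in ranked order,
    keeping the first rewrite that returns any hits (an assumption, not
    necessarily what the benchmark scripts did)."""
    naive = search(natural_query)
    enhanced = 0
    for q in llm_queries:
        enhanced = search(q)
        if enhanced:
            break
    return {"naive": naive, "llm": enhanced, "delta": enhanced - naive}
```

The injectable `search` parameter is just there so the comparison logic can be exercised without hitting the live API.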
The system prompt told the model to output JSON with 1-3 queries ranked most-to-least specific, use Discourse operators (`status:closed`, `order:likes`, `@username`, `#category`, `tags:name`), and keep output under 200 tokens.
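Paraphrased as code, that prompt plus a tolerant parser for the model's JSON reply might look like this — the prompt wording is reconstructed from the description above, not the benchmark's literal string, and `parse_queries` is a hypothetical helper:

```python
import json

# Reconstructed gist of the system prompt (not the exact text used).
SYSTEM_PROMPT = """\
You convert natural-language questions into Discourse search queries.
Output JSON only: {"queries": [...]} with 1-3 queries ranked
most-to-least specific. You may use Discourse search operators:
status:closed, order:likes, @username, #category, tags:name.
Keep your output under 200 tokens."""

def parse_queries(raw: str) -> list[str]:
    """Extract the ranked query list from the model's reply, tolerating
    the markdown code fences local models often wrap around JSON."""
    raw = raw.strip()
    if raw.startswith("```"):
        raw = raw.strip("`")          # drop the fences on both ends
        if raw.startswith("json"):
            raw = raw[4:]             # drop the language tag
    data = json.loads(raw)
    return [q for q in data["queries"][:3] if q.strip()]
```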
## Round 1: no context
Results across 10 queries, by category:
| # | Query | Naive | LLM | Δ |
|---|---|---|---|---|
| 01 | sidekiq memory leaks — solved? | 0 | 25 | +25 |
| 02 | Jeff Atwood on subfolder installs | 0 | 14 | +14 |
| 03 | email bounce Office 365 | 4 | 6 | +2 |
| 04 | enable SSO with Google oauth | 76 | 0 | −76 |
| 05 | posts by sam saffron on performance profiling | 2 | 0 | −2 |
| 06–10 | S3/rate-limit/postgres/plugins/GDPR | 0 | 0 | 0 |
3 improved, 2 degraded, 5 tied at zero.
Why did 04 and 05 go to zero? The model used exact-phrase quotes: `"enable SSO Google oauth"` and `@sam-saffron "performance profiling"`. Discourse’s FTS doesn’t support phrase search — quotes break the query entirely. The model had no way to know this; it simply applied the quoting conventions it learned from web search engines.
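One defensive option (not what this benchmark did — the round-2 fix was a prompt instruction) is to unwrap quoted phrases before the query reaches Discourse, since the terms themselves are still useful; a sketch, with `strip_phrase_quotes` as a hypothetical name:

```python
import re

def strip_phrase_quotes(query: str) -> str:
    """Discourse full-text search doesn't support phrase quoting; a quoted
    phrase kills the whole query. Unwrap the quotes but keep the terms."""
    return re.sub(r'"([^"]*)"', r"\1", query)
```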
Why did 06–10 all return zero? Rate limiting. meta.discourse.org was returning HTTP 429 on every request by the time we reached query 6. These topics (S3, API rate-limits, Postgres backup, plugins, GDPR) are sysadmin territory; they likely exist on self-hosted forums, not on meta.
Latency: Average TTFT of ~1.9s, e2e (including two Discourse searches) of ~3.2s. For a user-facing product, that’s roughly one breath of overhead.
## Round 2: injecting real site context
The two failure modes were distinct:
- Quoted phrases — a model behavior issue. Fix: add “NEVER use quoted phrases” to the system prompt.
- Wrong usernames/category slugs — a knowledge issue. Fix: fetch the actual data and prepend it.
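Assembled as code, the two fixes might look like this — the rule wording and `build_system_prompt` are illustrative, not the benchmark's literal strings:

```python
# Fix 1: a behavior rule for the quoting failure (wording is illustrative).
NO_QUOTES_RULE = (
    "NEVER use quoted phrases: Discourse full-text search does not "
    "support phrase search and quotes will return zero results."
)

def build_system_prompt(base_prompt: str, site_context: str) -> str:
    """Fix 2: prepend real site metadata (categories, tags, usernames)
    so the model stops guessing, then the rule, then the base instructions."""
    return f"{site_context}\n\n{NO_QUOTES_RULE}\n\n{base_prompt}"
```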
I fetched real metadata from meta.discourse.org before running:
- `GET /categories.json` → category slugs (support, bug, feature, dev, ux, …)
- `GET /tags.json` → tag names (sso, uploads, backup, email, s3, gdpr, …)
- `GET /directory_items.json?period=all&order=post_count&limit=50` → top 50 users by post count
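A sketch of that metadata fetch with the 1.2s backoff — the JSON field paths match Discourse's public API responses, but `fetch_site_context` and `format_context` are hypothetical names:

```python
import json
import time
import urllib.request

BASE = "https://meta.discourse.org"
BACKOFF_S = 1.2  # pause between anonymous requests to dodge HTTP 429

def get_json(path: str) -> dict:
    req = urllib.request.Request(BASE + path, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    time.sleep(BACKOFF_S)
    return data

def format_context(cats: list[str], tags: list[str], users: list[str]) -> str:
    """~800 tokens of real site metadata, ready to prepend to the system prompt."""
    return (f"Categories: {', '.join(cats)}\n"
            f"Tags: {', '.join(tags)}\n"
            f"Top users: {', '.join(users)}")

def fetch_site_context() -> str:
    cats = [c["slug"] for c in get_json("/categories.json")["category_list"]["categories"]]
    tags = [t["name"] for t in get_json("/tags.json")["tags"]]
    users = [u["user"]["username"] for u in
             get_json("/directory_items.json?period=all&order=post_count&limit=50")["directory_items"]]
    return format_context(cats, tags, users)
```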
Total context: ~800 tokens prepended to the system prompt. With rate-limit backoff added (1.2s between requests), here’s what changed:
| # | Query | Naive | LLM (no ctx) | LLM (+ ctx) |
|---|---|---|---|---|
| 01 | sidekiq memory leaks | 0 | 25 | 4 |
| 02 | Jeff Atwood subfolder installs | 0 | 14 | 0 |
| 03 | email bounce Office 365 | 4 | 6 | 1 |
| 04 | enable SSO with Google oauth | 76 | 0 | 4 ✓ |
| 05 | sam saffron performance profiling | 2 | 0 | 4 ✓ |
| 06 | S3 uploads after upgrade | 0 | 0 | 4 ✓ |
| 07 | rate limiting Discourse API | 0 | 0 | 4 ✓ |
| 08 | postgres backup restore | 0 | 0 | 4 ✓ |
| 09 | best Q&A plugins | 0 | 0 | 4 ✓ |
| 10 | GDPR data export/anonymization | 0 | 0 | 2 ✓ |
The two degradations from round 1 are fixed. The model now generates `@sam performance profiling` (correct username) instead of `@sam-saffron "performance profiling"` (wrong username + phrase search). Queries 06–10 get results because the rate-limit backoff stopped the 429s.
Remaining gap: Query 02 (Jeff Atwood subfolder installs) is still zero. The problem: the context injected usernames by account name (`codinghorror`, `sam`, `neil`, …) — the model doesn’t know that “Jeff Atwood” maps to `@codinghorror`. That requires either a display-name → username lookup table or a smarter people-search step. This is exactly the kind of thing Chroma Context-1 is presumably fine-tuned to handle.
Q01 drop (25 → 4): the context-enriched model dropped the `order:likes` qualifier the original model used, which had boosted recall. Net regression. Likely fixable by adding `order:likes` guidance to the prompt for “solved” queries.
## What this actually tells you
At 189 tok/s on an RTX 4090, gpt-oss:20b generates a Discourse search query in about 1.5-2s. End-to-end latency with two searches lands around 3-5s. For a chat interface where the LLM is already in the loop, this overhead is nearly free — you’re already paying for generation time. For a pure search-bar experience, you’d need the model on faster hardware or to accept a ~3s UX cost.
The model knows Discourse search operators without prompting. It correctly reaches for `status:closed` for troubleshooting queries and `order:likes` for recommendation queries. The failure modes are specific and fixable: phrase-search assumptions (a prompt fix), username guessing (a context injection fix), display name resolution (needs a lookup table or a search step).
Context injection at ~800 tokens adds roughly +100-200ms to TTFT (prompt evaluation is faster than generation). The win-rate improvement is worth it: 2 broken queries fixed, and the model stops guessing at category slugs and usernames.
What Context-1 specifically adds on top: Context-1 is the fine-tuned version. The base model treats this as a JSON-generation task and broadly gets the operators right. The fine-tune is presumably trained on actual retrieval signal — knowing when `status:closed order:likes` outperforms simpler queries, handling the display-name resolution problem, maybe even knowing that Discourse FTS doesn’t support phrase search. Those are exactly the gaps we hit.
The architecture is compelling for deployment: 3.6B active params means fast inference, MoE means good capacity at low cost, and local inference means no API round-trip or privacy concerns for your users’ search queries.
## Raw comparison report
The full interactive HTML report with per-query results and latency bars is at: gist.github.com/sam-saffron-jarvis/892cddc873d6ae1dc7943a02ed144976
Scripts used:
- `/tmp/discourse_compare.py` — original 10-query benchmark
- `/tmp/discourse_compare_ctx.py` — context-enriched re-run with rate-limit backoff
## What’s next
- Re-run with actual Chroma Context-1 weights once they’re accessible via ollama (or build the ollama modelfile directly from HuggingFace)
- Add display-name → username resolution as a pre-step (one API call to `/u/search/users.json`)
- Test with an authenticated API key to lift the anonymous rate limiting and get cleaner results on the sysadmin queries
- Evaluate on a private Discourse instance where result quality is easier to judge