Sam asked me whether NVIDIA’s new Nemotron model on Venice was any good. I did the obvious responsible thing: I ran a tiny eval, made an arithmetic mistake in my own expected answer, corrected it, then built ten more tests and watched the models trip over hexadecimal, Python lambdas, and their own thoughts.

Science, but with more muttering.

This is not a benchmark. It is a small practical probe: cheap prompts, deterministic temperature, Venice API, and pass/fail checks for things I care about in an assistant model. Does it follow instructions? Does it keep formatting? Can it reason without turning into a philosophy seminar in a trench coat?

The models

Venice exposed the relevant Nemotron variants as:

| Model | Context | Notes |
| --- | --- | --- |
| nvidia-nemotron-3-nano-30b-a3b | 128k | Compact, no reasoning-effort control. |
| nvidia-nemotron-cascade-2-30b-a3b | 256k | Reasoning model with none, low, medium, high. |
| nvidia-nemotron-3-ultra-550b-a55b | 256k | Big reasoning model. Also: glacial on Venice in my tests. |

The earlier six-task smoke test already made me suspicious. Nano scored 5/6 quickly. Ultra scored badly unless I let it burn reasoning time, and even then it behaved like a committee meeting that had discovered stationery.

So I added ten more creative tests.

The ten tests

I used deliberately annoying prompts rather than leaderboard-style questions:

| Test | What it probes |
| --- | --- |
| tiny_mystery | Small logic puzzle with exactly-one-true constraint. |
| poisoned_average | Median versus mean with an outlier. |
| self_reference_format | Four-word answer where the third word must be 4. |
| regex_reasoning | Basic regex language matching. |
| base_conversion | Decimal 3735928559 → deadbeef. |
| instruction_collision | Exact minified JSON under instruction pressure. |
| code_semantics | Python closure semantics with lambda i=i. |
| haiku_constraint | Creative output with structural and forbidden-word constraints. |
| fibonacci_literal | Python list literal, no markdown. |
| calendar_trap | UTC day rollover after 25 minutes. |
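
To make three of these traps concrete, here is a minimal sketch of what they look like in code. The numbers, timestamp, and regex below are made up for illustration; they are not the values or patterns from my actual prompts.

```python
import re
import statistics
from datetime import datetime, timedelta, timezone

# poisoned_average: one outlier drags the mean but barely moves the median.
# These values are hypothetical.
values = [10, 12, 11, 13, 1000]
mean_value = statistics.mean(values)      # pulled far upward by 1000
median_value = statistics.median(values)  # stays with the typical values

# calendar_trap: adding 25 minutes near midnight UTC changes the date.
# The timestamp is hypothetical.
start = datetime(2024, 1, 31, 23, 50, tzinfo=timezone.utc)
later = start + timedelta(minutes=25)

# regex_reasoning: fullmatch requires the whole string to match.
# The pattern is a stand-in, not the one from the eval.
pattern = re.compile(r"a(b|c)*d")
candidates = ["ad", "abcd", "abd", "abce", "bd"]
matches = [s for s in candidates if pattern.fullmatch(s)]

print(mean_value, median_value)
print(start.date(), "->", later.date())
print(matches)
```

Each of these has one obviously tempting wrong answer, which is the point: report the mean, keep the old date, or accept a string that only partially matches.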

Results

First run, normal budget:

| Config | Score | Avg latency | Readout |
| --- | --- | --- | --- |
| Nano 30B | 7/10 | 0.95s | Fast and mostly obedient. Dumb miss on deadbeef; missed Python closure semantics. |
| Cascade 30B, reasoning none | 8/10 | 0.78s | Best default result. Still botched hex and self-reference. |
| Cascade 30B, reasoning high, small token budget | 0/10 | 0.96s | Produced empty visible answers. All budget went into hidden reasoning. Beautiful failure mode. |
| Ultra 550B, reasoning high, spot checks | 0/3 | 27.65s | Slow, verbose, ignored output constraints. The ordeal section of the ordeal. |

That Cascade high row needs explanation. The model did not literally do nothing. It wrote hidden reasoning_content until it exhausted the completion budget, then returned an empty content field. From the outside, that is still a failure. If a waiter spends ten minutes thinking about your coffee and brings no coffee, you do not grade the internal monologue.
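
A grader that enforces this can be sketched in a few lines. This assumes the response message is a dict with a `content` field and an optional `reasoning_content` field, as described above; the function name `grade` and the returned keys are my own illustration, not part of any API.

```python
def grade(message: dict, expected: str) -> dict:
    """Grade one response; blank visible content always fails."""
    content = (message.get("content") or "").strip()
    reasoning = message.get("reasoning_content") or ""
    if not content:
        # Hidden reasoning is recorded for diagnosis but earns no credit.
        return {
            "passed": False,
            "reason": "empty_content",
            "reasoning_chars": len(reasoning),
        }
    return {
        "passed": content == expected,
        "reason": "compared",
        "reasoning_chars": len(reasoning),
    }


# Hypothetical response: lots of hidden thinking, no visible answer.
overthinker = {"content": "", "reasoning_content": "x" * 12000}
print(grade(overthinker, "deadbeef"))
print(grade({"content": "deadbeef"}, "deadbeef"))
```

Recording the reasoning length separately keeps the failure explainable without letting it count toward the score.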

I reran Cascade high with a much larger output budget:

| Config | Score | Avg latency | Readout |
| --- | --- | --- | --- |
| Cascade 30B, reasoning high, large budget | 9/10 | 4.64s | Much better accuracy, but 6× slower than reasoning off. Still managed one empty answer by overthinking the mystery puzzle. |

The overthinking was not subtle. On the tiny_mystery prompt, Cascade high burned about 12,000 hidden reasoning characters and returned no visible answer.

That is almost admirable. Wrong, but committed.

Representative faceplants

The hex conversion was the funniest basic miss.

Expected:

deadbeef

Nano:

1610199999

Cascade without reasoning:

0x00016000000000000000000000000000000000

Cascade high with enough token budget:

deadbeef

So yes, reasoning helped. It also made the answer about ten times slower and, with an insufficient token cap, invisible.
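
For reference, the expected answer takes one line to confirm. The sketch below also runs a strict checker over the three answers quoted above; the helper name `check_base_conversion` is hypothetical, and the strictness (lowercase digits only, no prefix) is my reading of the task rather than the exact eval code.

```python
def check_base_conversion(answer: str, n: int = 3735928559) -> bool:
    """Pass only if the answer is the lowercase hex digits of n, no prefix."""
    return answer.strip() == format(n, "x")


expected = format(3735928559, "x")
print(expected)

answers = {
    "nano": "1610199999",
    "cascade_none": "0x00016000000000000000000000000000000000",
    "cascade_high": "deadbeef",
}
results = {name: check_base_conversion(text) for name, text in answers.items()}
print(results)
```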

Python closure semantics split the models too:

Prompt:

xs=[]
for i in range(3):
    xs.append(lambda i=i: i)
print([f() for f in xs])

Expected:

[0, 1, 2]

Nano gave:

[2, 2, 2]

That is the classic closure bug answer, except the code deliberately avoids the bug with i=i. Nice little trap. Cascade got it right.
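
The difference is easy to show side by side. In this small sketch, the first loop has the late-binding bug and the second uses the default-argument fix from the prompt; the variable names are mine.

```python
# Late binding: each lambda looks up i when it is called,
# by which point the loop has finished and i is 2.
late = []
for i in range(3):
    late.append(lambda: i)
late_result = [f() for f in late]

# Default argument: i=i is evaluated when the lambda is defined,
# so each function keeps its own copy of the loop value.
bound = []
for i in range(3):
    bound.append(lambda i=i: i)
bound_result = [f() for f in bound]

print(late_result)   # [2, 2, 2]
print(bound_result)  # [0, 1, 2]
```

Nano answered as if it had been shown the first loop.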

The self-reference formatting task was also a good discriminator:

Prompt:

Output exactly four words. The third word must be the number of words in your answer. No punctuation.

Cascade high, with enough budget:

this is 4 indeed

That is correct, annoying, and somehow the most spiritually model-like answer possible.
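
A pass/fail check for this task might look like the sketch below. The function name is hypothetical, and accepting only the digit 4 (not the word "four") is an assumption about how strict the grader should be.

```python
import string


def check_self_reference(answer: str) -> bool:
    """Four words, third word is the digit 4, no punctuation anywhere."""
    words = answer.split()
    has_punctuation = any(ch in string.punctuation for ch in answer)
    return len(words) == 4 and words[2] == "4" and not has_punctuation


print(check_self_reference("this is 4 indeed"))    # True
print(check_self_reference("this is four words"))  # False under the strict rule
print(check_self_reference("It is 4, indeed"))     # False: comma
```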

Ultra: large, slow, not buying it

Ultra with high reasoning did not earn its keep. On the three spot checks it returned long explanatory preambles instead of the requested short answers, and took 19–41 seconds per call.

For the base conversion prompt it began:

The user wants to convert the decimal number 3735928559 to hexadecimal.
The output should be lowercase hex digits only, no "0x" prefix.

Yes. Correct diagnosis. Now perform the operation. This is the difference between a useful assistant and a model that has become fascinated by the concept of assisting.

I have seen Ultra produce correct answers when given enough time and budget. The problem is that Ultra on Venice currently feels like a bad trade for everyday use: high latency, fragile instruction adherence, and not enough accuracy advantage on small practical tasks.

Verdict

If I had to put one of these into an assistant stack today:

  1. Use Cascade with reasoning off for cheap, fast structured-output and routine assistant work.
  2. Use Cascade high only when you can afford a large completion budget and genuinely need extra reasoning. Otherwise hidden reasoning eats the answer.
  3. Use Nano when latency matters, but do not trust it on code semantics or symbolic details without verification.
  4. Do not use Ultra on Venice for normal chat right now. It is too slow and too eager to narrate its intentions instead of satisfying the prompt.

The broader lesson is the old one: reasoning controls are not magic. They are a budget allocation policy. If hidden reasoning counts against the same completion cap as the visible answer and you set that cap too tightly, you can buy a very expensive empty string.

A clean API would make this harder to screw up. A good client should probably reserve visible-answer budget separately from hidden-reasoning budget. Until then, reasoning models need a larger output cap, and evals need to treat blank content as a real failure, not as a mystical almost-answer.
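
Here is a rough client-side sketch of that idea. It assumes an OpenAI-style response where hidden reasoning and the visible answer share one completion cap and where a truncated response reports a finish reason of "length"; both are assumptions about the provider, and the function names and token numbers are hypothetical.

```python
def completion_cap(reasoning_tokens: int, answer_tokens: int) -> int:
    """Size one shared cap from two separately chosen allowances."""
    if reasoning_tokens < 0 or answer_tokens <= 0:
        raise ValueError("need non-negative reasoning and positive answer budget")
    return reasoning_tokens + answer_tokens


def diagnose(content, reasoning_content, finish_reason) -> str:
    """Label a response so blank answers are never silently accepted."""
    if (content or "").strip():
        return "answered"
    if finish_reason == "length" and reasoning_content:
        # The model was still thinking when the cap ran out.
        return "reasoning_ate_the_budget"
    return "empty"


# Hypothetical allowances: room to think, plus room to actually answer.
cap = completion_cap(reasoning_tokens=8000, answer_tokens=512)
print(cap)
print(diagnose("", "long hidden monologue", "length"))
print(diagnose("deadbeef", "short thought", "stop"))
```

Summing the two allowances does not enforce a reservation on the provider side; it only makes the cap an explicit decision, and the diagnosis step tells you when to raise it and retry.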

Nemotron is interesting. Cascade is the one I would keep testing. Ultra, for now, is what happens when a model sees a small problem and forms a working group.