I wanted a clean comparison: take one transparent cutout of a dog named Romeo and ask Venice edit models to place him under cherry blossoms in Japan.

The first version of this article was muddled. I mixed edit models with text-to-image models, missed a couple of relevant edit models, and used one overstuffed prompt as if all models speak the same language. They don’t. This version fixes that.

Scope

Bottom line: seedream-v4-edit is still the best overall model in this test, gpt-image-1-5-edit is the closest rival, and nano-banana-2-edit belongs in the top tier of the second group rather than being treated like an afterthought.

The prompts

The first prompt tried to do everything at once. That worked on the more forgiving models and dragged the rest down.

Naive prompt

One giant instruction blob: preserve identity, invent the scene, manage composition, and solve anatomy all in one pass.

Place this exact dog naturally sitting under blooming cherry blossom trees in Japan, photorealistic spring scene, soft daylight, pink sakura petals drifting in the air, a Japanese garden or temple path in the background, preserve the dog's exact face, fur texture, coloring, expression, and body proportions, full body visible, natural grounded shadow, high detail, no text, no extra animals, no duplicate limbs, no costume.

Tuned prompt

Same task, but structured properly: lock the subject first, then define the scene.

Use the input dog exactly as the subject. Keep the same face, fur pattern, expression, body size, and body proportions. Place the dog sitting centered on a stone path in a Japanese garden during cherry blossom season. Add blooming sakura trees overhead, soft natural spring daylight, scattered pink petals on the path, and a subtle temple gate in the distant background. Keep the dog photorealistic, natural, and unchanged. No extra limbs, no extra animals, no clothes, no text.
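The tuned prompt's shape can be sketched as a small helper that always states identity preservation before scene description. This is purely illustrative; the function name and field names are mine, not part of any model's API.

```python
# A minimal sketch of the tuned prompt's structure: lock the subject
# first, then define the scene, then close with negative constraints.
# Function and parameter names here are illustrative assumptions.

def build_edit_prompt(subject_lock, scene, constraints):
    """Assemble an edit prompt with identity preservation stated first."""
    parts = []
    parts.extend(subject_lock)   # identity clauses come before any scene detail
    parts.extend(scene)          # then the new environment
    parts.extend(constraints)    # negative constraints go last
    return " ".join(parts)

prompt = build_edit_prompt(
    subject_lock=[
        "Use the input dog exactly as the subject.",
        "Keep the same face, fur pattern, expression, body size, and body proportions.",
    ],
    scene=[
        "Place the dog sitting centered on a stone path in a Japanese garden during cherry blossom season.",
        "Add blooming sakura trees overhead, soft natural spring daylight, scattered pink petals on the path, and a subtle temple gate in the distant background.",
    ],
    constraints=[
        "Keep the dog photorealistic, natural, and unchanged.",
        "No extra limbs, no extra animals, no clothes, no text.",
    ],
)
```

The naive prompt collapses all three groups into one undifferentiated blob; the tuned version keeps the ordering fixed, which is the structural difference the weaker models turned out to care about.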

Why the tuned prompt helped

The model docs point in the same direction: state what must be preserved before describing what should change.

So the real lesson was not that Qwen or FireRed are bad. The real lesson was that I was initially prompting them in the wrong dialect.

Iterations: naive vs tuned

Comparing both passes side by side matters because some of the movement here is real model quality, and some of it is just prompt-shape sensitivity.
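One way to keep the two effects separable is to run every model on both prompt variants and tag each output with its variant, so results can be compared pairwise per model. The sketch below only builds the job list; the payload fields are assumptions, not the real Venice API, and the prompt strings are abbreviated placeholders for the two prompts above.

```python
# Illustrative sweep: one job per (model, prompt-variant) pair, so that
# prompt-shape sensitivity can be separated from raw model quality.
# Payload field names are assumptions, not a real API schema.

MODELS = ["qwen-image-2-edit", "qwen-image-2-pro-edit", "firered-image-edit"]
PROMPTS = {"naive": "...", "tuned": "..."}  # abbreviated stand-ins for the full prompts

def build_requests(models, prompts, image_b64):
    """Produce one request payload per (model, prompt-variant) pair."""
    jobs = []
    for model in models:
        for variant, prompt in prompts.items():
            jobs.append({
                "model": model,
                "variant": variant,  # tag so outputs can be compared pairwise later
                "payload": {"prompt": prompt, "image": image_b64},
            })
    return jobs

jobs = build_requests(MODELS, PROMPTS, image_b64="<romeo-cutout>")
```

With three models and two variants this yields six jobs; any per-model gap between the naive and tuned outputs is then attributable to prompt shape rather than the model itself.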

Qwen Image 2 Edit

Tuned prompting saved it

The naive result looked brittle and overcooked. The tuned result is still not elite, but it crosses the line from “bad benchmark output” to “real candidate.”

Naive Qwen Image 2 Edit output
Naive prompt — harsh, awkward, overcooked.
Tuned Qwen Image 2 Edit output
Tuned prompt — much cleaner and much more believable.

Qwen Image 2 Pro Edit

Less synthetic, more Romeo

The first pass was already workable. The tuned pass made it calmer, more faithful, and easier to take seriously.

Naive Qwen Image 2 Pro Edit output
Naive prompt — decent, but still too processed.
Tuned Qwen Image 2 Pro Edit output
Tuned prompt — better likeness, less nonsense.

FireRed Image Edit

Improved, but still too polished

The tuned prompt helped, but FireRed still lands in a glossy AI-sanitised register that I do not fully trust for identity-sensitive edits.

Naive FireRed output
Naive prompt — too polished, too synthetic.
Tuned FireRed output
Tuned prompt — better, still a bit plastic.

Ranking

  1. seedream-v4-edit
  2. gpt-image-1-5-edit
  3. qwen-image-2-pro-edit (tuned)
  4. nano-banana-2-edit (tuned)
  5. qwen-image-2-edit (tuned)
  6. seedream-v5-lite-edit (tuned)
  7. flux-2-max-edit
  8. nano-banana-pro-edit
  9. firered-image-edit (tuned)
  10. qwen-edit

nano-banana-2-edit deserves specific credit here. It is not number four by courtesy. It produced a strong result: coherent scene, good atmosphere, and far less nonsense than the weaker models. It does not beat the top three on exact likeness, but it absolutely belongs above the middle pack.

What I learned

The obvious answer is the correct one: how you structure the prompt matters almost as much as which model you pick.

The less obvious answer is that this is what makes image model evaluation annoying in practice. A model can look mediocre when the real problem is that the instruction was written in a way the model does not naturally want to follow.