I wanted a clean comparison: take one transparent cutout of Romeo and ask Venice edit models to place him under cherry blossoms in Japan.
The first version of this article was muddled. I mixed edit models with text-to-image models, missed a couple of relevant edit models, and used one overstuffed prompt as if all models speak the same language. They don’t. This version fixes that.
Scope
- Edit models only. Same input dog, same job.
- Ten Venice edit models. No text-to-image contamination.
- Naive and tuned prompting shown side by side. No hidden hand-waving.
- Images are clickable. Open any result full size in a new tab.
seedream-v4-edit is still the best overall model in this test, gpt-image-1-5-edit is the closest rival, and nano-banana-2-edit belongs in the top tier of the second group rather than being treated like an afterthought.
The prompts
The first prompt tried to do everything at once. That worked on the more forgiving models and dragged the rest down.
One giant instruction blob: preserve identity, invent the scene, manage composition, and solve anatomy all in one pass.
Place this exact dog naturally sitting under blooming cherry blossom trees in Japan, photorealistic spring scene, soft daylight, pink sakura petals drifting in the air, a Japanese garden or temple path in the background, preserve the dog's exact face, fur texture, coloring, expression, and body proportions, full body visible, natural grounded shadow, high detail, no text, no extra animals, no duplicate limbs, no costume.
Same task, but structured properly: lock the subject first, then define the scene.
Use the input dog exactly as the subject. Keep the same face, fur pattern, expression, body size, and body proportions. Place the dog sitting centered on a stone path in a Japanese garden during cherry blossom season. Add blooming sakura trees overhead, soft natural spring daylight, scattered pink petals on the path, and a subtle temple gate in the distant background. Keep the dog photorealistic, natural, and unchanged. No extra limbs, no extra animals, no clothes, no text.
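The tuned prompt follows a fixed order: subject lock first, then scene, then negative constraints. As a minimal sketch of that structure (plain string assembly, not any Venice API), the same prompt can be composed from its three parts:

```python
# Illustrative sketch: compose a structured edit prompt in three scoped
# sections. The ordering mirrors the tuned prompt above; this is just
# string assembly, not an official API.

def build_edit_prompt(subject_lock: str, scene: str, constraints: str) -> str:
    """Join the sections in a fixed order: lock the subject first,
    define the scene second, state negative constraints last."""
    return " ".join(part.strip() for part in (subject_lock, scene, constraints))

prompt = build_edit_prompt(
    subject_lock=(
        "Use the input dog exactly as the subject. Keep the same face, "
        "fur pattern, expression, body size, and body proportions."
    ),
    scene=(
        "Place the dog sitting centered on a stone path in a Japanese garden "
        "during cherry blossom season. Add blooming sakura trees overhead, "
        "soft natural spring daylight, scattered pink petals on the path, "
        "and a subtle temple gate in the distant background."
    ),
    constraints=(
        "Keep the dog photorealistic, natural, and unchanged. "
        "No extra limbs, no extra animals, no clothes, no text."
    ),
)
print(prompt)
```

The point of the helper is only that the subject lock always leads; the scene can be swapped out without ever touching the identity-preservation section.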
Why the tuned prompt helped
The model docs point in the same direction.
- Qwen Image Edit likes more targeted instructions and benefits from chained or scoped edits. The naive prompt was too broad.
- FireRed Image Edit explicitly leans on recaptioning, ROI detection, preprocessing, and expanded instructions. Again: the naive prompt was too broad.
- Seedream and GPT Image are more tolerant of messy prompting, which is why they survived the first pass better.
So the real lesson was not that Qwen or FireRed are bad. The real lesson was that I was initially prompting them in the wrong dialect.
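For models that prefer scoped instructions, the practical fix is to split one overstuffed prompt into a short chain of single-purpose edits. The sketch below is hypothetical: `edit_image` is a stand-in for whatever client call you actually use, and the four-pass breakdown is just one reasonable way to split this task.

```python
# Hypothetical sketch of chained, scoped edits for instruction-sensitive
# models (e.g. Qwen Image Edit). `edit_image` is a placeholder for a real
# model call; here it only tags the image string so the chain is visible.

from typing import List

def edit_image(image: str, instruction: str) -> str:
    # Placeholder: a real implementation would call the model's edit
    # endpoint and return the edited image.
    return f"{image} + [{instruction}]"

def chained_edit(image: str, passes: List[str]) -> str:
    """Apply one narrow instruction per pass instead of one giant blob."""
    for instruction in passes:
        image = edit_image(image, instruction)
    return image

result = chained_edit(
    "dog_cutout.png",
    [
        "Keep the dog's face, fur pattern, and proportions unchanged.",
        "Place the dog sitting on a stone path in a Japanese garden.",
        "Add blooming sakura trees overhead and soft spring daylight.",
        "Remove any extra limbs, extra animals, or text.",
    ],
)
```

Each pass gives the model one job, which is closer to the dialect the Qwen and FireRed docs describe than the everything-at-once blob.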
Iterations: naive vs tuned
Showing both passes matters because some of the movement here is real model quality, and some of it is just prompt-shape sensitivity.
Tuned prompting saved it
The naive result looked brittle and overcooked. The tuned result is still not elite, but it crosses the line from “bad benchmark output” to “real candidate.”
Less synthetic, more Romeo
The first pass was already workable. The tuned pass made it calmer, more faithful, and easier to take seriously.
Ranking
- seedream-v4-edit
- gpt-image-1-5-edit
- qwen-image-2-pro-edit (tuned)
- nano-banana-2-edit (tuned)
- qwen-image-2-edit (tuned)
- seedream-v5-lite-edit (tuned)
- flux-2-max-edit
- nano-banana-pro-edit
- firered-image-edit (tuned)
- qwen-edit
nano-banana-2-edit deserves specific credit here. It is not number four by courtesy. It produced a strong result: coherent scene, good atmosphere, and far less nonsense than the weaker models. It does not beat the top three on exact likeness, but it absolutely belongs above the middle pack.
What I learned
The obvious answer is the correct one.
- Model quality matters.
- Prompt structure matters.
- Some models are more forgiving than others.
- Comparing edit models with text-to-image models in one table is a category error.
The less obvious answer is that this is what makes image model evaluation annoying in practice. A model can look mediocre when the real problem is that the instruction was written in a way the model does not naturally want to follow.
