I wanted a clean comparison: take one transparent cutout of Romeo and ask Venice edit models to place him under cherry blossoms in Japan.
The first version of this article was muddled. I mixed edit models with text-to-image models, missed a couple of relevant edit models, and used one overstuffed prompt as if all models speak the same language. They don’t. This version fixes that.
Scope
- Edit models only. Same input dog, same job.
- Ten Venice edit models. No text-to-image contamination.
- Naive and tuned prompting shown side by side. No hidden hand-waving.
- Images are clickable. Open any result full size in a new tab.
seedream-v4-edit is still the best overall model in this test, gpt-image-1-5-edit is the closest rival, and nano-banana-2-edit belongs in the top tier of the second group rather than being treated like an afterthought.
The prompts
The first prompt tried to do everything at once. That worked on the more forgiving models and dragged the rest down.
One giant instruction blob: preserve identity, invent the scene, manage composition, and solve anatomy all in one pass.
Place this exact dog naturally sitting under blooming cherry blossom trees in Japan, photorealistic spring scene, soft daylight, pink sakura petals drifting in the air, a Japanese garden or temple path in the background, preserve the dog's exact face, fur texture, coloring, expression, and body proportions, full body visible, natural grounded shadow, high detail, no text, no extra animals, no duplicate limbs, no costume.
Same task, but structured properly: lock the subject first, then define the scene.
Use the input dog exactly as the subject. Keep the same face, fur pattern, expression, body size, and body proportions. Place the dog sitting centered on a stone path in a Japanese garden during cherry blossom season. Add blooming sakura trees overhead, soft natural spring daylight, scattered pink petals on the path, and a subtle temple gate in the distant background. Keep the dog photorealistic, natural, and unchanged. No extra limbs, no extra animals, no clothes, no text.
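The tuned prompt follows a fixed order: subject lock first, then scene, then negative constraints. As a minimal sketch of that structure (plain string assembly, not any Venice API), the same prompt can be composed from its three parts:

```python
# Illustrative sketch: compose a structured edit prompt in three scoped
# sections. The ordering mirrors the tuned prompt above; this is just
# string assembly, not an official API.

def build_edit_prompt(subject_lock: str, scene: str, constraints: str) -> str:
    """Join the sections in a fixed order: lock the subject first,
    define the scene second, state negative constraints last."""
    return " ".join(part.strip() for part in (subject_lock, scene, constraints))

prompt = build_edit_prompt(
    subject_lock=(
        "Use the input dog exactly as the subject. Keep the same face, "
        "fur pattern, expression, body size, and body proportions."
    ),
    scene=(
        "Place the dog sitting centered on a stone path in a Japanese garden "
        "during cherry blossom season. Add blooming sakura trees overhead, "
        "soft natural spring daylight, scattered pink petals on the path, "
        "and a subtle temple gate in the distant background."
    ),
    constraints=(
        "Keep the dog photorealistic, natural, and unchanged. "
        "No extra limbs, no extra animals, no clothes, no text."
    ),
)
print(prompt)
```

The point of the helper is only that the subject lock always leads; the scene can be swapped out without ever touching the identity-preservation section.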
Why the tuned prompt helped
The model docs point in the same direction.
- Qwen Image Edit likes more targeted instructions and benefits from chained or scoped edits. The naive prompt was too broad.
- FireRed Image Edit explicitly leans on recaptioning, ROI detection, preprocessing, and expanded instructions. Again: the naive prompt was too broad.
- Seedream and GPT Image are more tolerant of messy prompting, which is why they survived the first pass better.
So the real lesson was not that Qwen or FireRed are bad. The real lesson was that I was initially prompting them in the wrong dialect.
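For models that prefer scoped instructions, the practical fix is to split one overstuffed prompt into a short chain of single-purpose edits. The sketch below is hypothetical: `edit_image` is a stand-in for whatever client call you actually use, and the four-pass breakdown is just one reasonable way to split this task.

```python
# Hypothetical sketch of chained, scoped edits for instruction-sensitive
# models (e.g. Qwen Image Edit). `edit_image` is a placeholder for a real
# model call; here it only tags the image string so the chain is visible.

from typing import List

def edit_image(image: str, instruction: str) -> str:
    # Placeholder: a real implementation would call the model's edit
    # endpoint and return the edited image.
    return f"{image} + [{instruction}]"

def chained_edit(image: str, passes: List[str]) -> str:
    """Apply one narrow instruction per pass instead of one giant blob."""
    for instruction in passes:
        image = edit_image(image, instruction)
    return image

result = chained_edit(
    "dog_cutout.png",
    [
        "Keep the dog's face, fur pattern, and proportions unchanged.",
        "Place the dog sitting on a stone path in a Japanese garden.",
        "Add blooming sakura trees overhead and soft spring daylight.",
        "Remove any extra limbs, extra animals, or text.",
    ],
)
```

Each pass gives the model one job, which is closer to the dialect the Qwen and FireRed docs describe than the everything-at-once blob.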
Iterations: naive vs tuned
Showing both passes matters because some of the movement here is real model quality, and some of it is just prompt-shape sensitivity.
Tuned prompting saved it
The naive result looked brittle and overcooked. The tuned result is still not elite, but it crosses the line from “bad benchmark output” to “real candidate.”
Less synthetic, more Romeo
The first pass was already workable. The tuned pass made it calmer, more faithful, and easier to take seriously.
Ranking
- seedream-v4-edit
- gpt-image-1-5-edit
- qwen-image-2-pro-edit (tuned)
- nano-banana-2-edit (tuned)
- qwen-image-2-edit (tuned)
- seedream-v5-lite-edit (tuned)
- flux-2-max-edit
- nano-banana-pro-edit
- firered-image-edit (tuned)
- qwen-edit
nano-banana-2-edit deserves specific credit here. It is not number four by courtesy. It produced a strong result: coherent scene, good atmosphere, and far less nonsense than the weaker models. It does not beat the top three on exact likeness, but it absolutely belongs above the middle pack.
What I learned
The obvious answer is the correct one.
- Model quality matters.
- Prompt structure matters.
- Some models are more forgiving than others.
- Comparing edit models with text-to-image models in one table is a category error.
The less obvious answer is that this is what makes image model evaluation annoying in practice. A model can look mediocre when the real problem is that the instruction was written in a way the model does not naturally want to follow.
