Hello @Mike-E-angelo,
Welcome to Microsoft Q&A. Thank you for reaching out to us.
Thank you for sharing the detailed observation regarding increased latency after moving from dall-e-3 to gpt-image-1-mini. The behavior being observed is understandable and aligns with how different image generation models are designed and optimized.
The key clarification is that gpt-image-1-mini is not architected as a direct latency-equivalent replacement for dall-e-3. While the “mini” variant focuses on cost efficiency and scalable throughput, this does not necessarily translate to lower per-request response time. In practice, performance characteristics vary depending on how the workload is structured, including request size, output configuration, and concurrency. In Azure OpenAI, both latency (per-call response time) and throughput (overall system capacity) are influenced by workload shape and deployment conditions.
Generally, a transition between models may result in differences in generation time even when performing similar tasks.
Key factors contributing to increased latency: the following conditions commonly influence image generation time.
- Higher image resolution (for example, 1024×1024) requiring more processing time
- Generating multiple images per request (n > 1) increasing total compute workload
- Complex or detailed prompts that require additional processing
- Operation under shared (pay-as-you-go) capacity where system load may introduce queueing delays
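To see which of the factors above matters most for a given workload, it can help to time the same request under different settings. The following is a minimal sketch, not an official tool: `measure_latency` and `placeholder_generate` are hypothetical names, and the placeholder only sleeps. It would need to be replaced with the actual image generation call against your own deployment.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(call: Callable[[], object], runs: int = 5) -> Dict[str, float]:
    """Time `call` several times and summarize per-request latency in seconds."""
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def placeholder_generate(size: str, n: int) -> None:
    """Hypothetical stand-in for a real image request.

    It sleeps for a fixed time and does NOT model real service behavior;
    swap in the real call to get meaningful numbers.
    """
    time.sleep(0.01)


if __name__ == "__main__":
    # Compare configurations one variable at a time (here: images per request).
    for size, n in [("1024x1024", 1), ("1024x1024", 2)]:
        stats = measure_latency(lambda: placeholder_generate(size, n), runs=3)
        print(size, n, stats)
```

Changing one parameter at a time (size, number of images, prompt length) and comparing the medians makes it easier to tell a model-level difference from a configuration-level one.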
To improve response time, please check whether the following help:
- Optimizing output size and generation parameters
  - Please try a smaller output size where the model supports one (e.g., from 1024×1024 to 512×512; supported sizes vary by model, so please check the model documentation)
  - Limiting the number of images generated per request
- Refining workload characteristics by
  - Simplifying prompts where possible to reduce processing complexity
  - Distributing requests more evenly to avoid peak usage spikes
- Improving throughput handling where applicable
  - Using parallel requests carefully to improve total output rate
  - Please note that this improves overall throughput rather than individual latency
- Evaluating deployment type for stable performance
  - For workloads requiring consistent latency, Provisioned Throughput deployments can be considered
  - These deployments allocate dedicated model capacity and provide predictable latency and throughput behavior
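The point about parallel requests can be illustrated with a small sketch. It assumes a hypothetical `placeholder_generate` function that sleeps for 50 ms in place of a real image request, and a hypothetical helper `run_batch` that caps concurrency with a thread pool. The numbers it prints are illustrative only and say nothing about real service timings.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def placeholder_generate(prompt: str) -> float:
    """Hypothetical stand-in for one image request; returns its own latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.05)  # pretend the service takes 50 ms; replace with the real call
    return time.perf_counter() - start


def run_batch(prompts: List[str], max_workers: int = 4) -> Tuple[float, List[float]]:
    """Send prompts with bounded concurrency.

    Returns (total wall-clock time, list of per-request latencies).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        latencies = list(pool.map(placeholder_generate, prompts))
    return time.perf_counter() - start, latencies


if __name__ == "__main__":
    wall, latencies = run_batch([f"prompt {i}" for i in range(8)], max_workers=4)
    print(f"wall time for batch: {wall:.2f}s")
    print(f"slowest single request: {max(latencies):.2f}s")
```

In this sketch the batch finishes sooner than the requests would one after another, while each individual request still takes about as long as before. Keeping `max_workers` modest is a deliberate choice, since too many simultaneous requests on shared capacity may run into rate limits or queueing.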
With the above optimizations, observable latency improvements are often achievable. However, depending on workload characteristics, generation time may still differ from dall-e-3, as each model is optimized with different performance tradeoffs, typically cost efficiency and scalability versus lowest per-request latency.
The following references might be helpful; please check them out.
Thank you