@Christopher Mary At this time, you can define a max IOPS limit only with provisioned IOPS, not with auto-scaling IOPS.
Auto-scaling IOPS (also referred to as Paid IO LRS IO Rate Operations) does not generate I/O operations on its own; the I/O is driven by the workload. When you switch to provisioned IOPS and define the max limit, you are optimizing for cost: you accept throttling (slowness) on your workload in exchange for a budget-friendly setup.
With auto-scaling IOPS, you are optimizing for throughput and response time (query performance). As the charts you shared previously (below) show, your workload was driving 3M-4M queries per second, with frequent IO requests going up to 1M IOs daily over the last 30 days, due to the limited memory (1 GB) on the compute node.
When you switched to provisioned IOPS and defined the max IOPS, the additional IO requests from your workload now have to wait, because you have gated the server to allow only 400 IOs per second.
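To illustrate the waiting effect, here is a minimal sketch of how a fixed cap turns bursts into a backlog. It is only a toy model, not how the service implements throttling: the function name `simulate_iops_cap` and the per-second demand numbers are made up for illustration, and only the 400 IOPS cap comes from your configuration.

```python
def simulate_iops_cap(demand_per_second, max_iops=400):
    """Toy model: serve at most max_iops requests each second, queue the rest."""
    served, backlog = [], []
    pending = 0
    for demand in demand_per_second:
        pending += demand               # new requests join anything still waiting
        done = min(pending, max_iops)   # the cap limits what completes this second
        pending -= done
        served.append(done)
        backlog.append(pending)         # requests still waiting after this second
    return served, backlog


# Hypothetical burst: demand exceeds the cap for two seconds, then stops.
demand = [300, 900, 1200, 200, 0, 0]
served, backlog = simulate_iops_cap(demand)
print(served)   # [300, 400, 400, 400, 400, 400]
print(backlog)  # [0, 500, 1300, 1100, 700, 300]
```

In this toy run the server is still working off the burst three seconds after the demand has stopped, which is what shows up as higher response time on the workload side.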
With the new setting in place, the throughput (queries per second) you can drive may have gone down compared to your previous configuration, and your response time may have increased. That is the trade-off of the informed decision you took to optimize for cost.
Please let me know if I am missing something you are trying to highlight, but as far as we can tell, the feature is working as expected. If you believe the system is generating artificial IOPS, please stop the workload and monitor the IO count, which should drop to <50. That will tell us whether the workload is driving the IOPS or the system is generating them by itself.
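If it helps, the check can be written down as a small sketch. It assumes you copy the IO count readings taken after the workload has stopped into a list by hand; the function name and the sample values are hypothetical, and only the <50 threshold comes from the guidance above.

```python
def system_io_suspected(idle_io_samples, idle_threshold=50):
    """Return True if any reading taken with the workload stopped is at or above the threshold."""
    return any(sample >= idle_threshold for sample in idle_io_samples)


# Hypothetical readings taken after the workload was stopped.
quiet_server = [12, 30, 8]
busy_server = [12, 480, 390]
print(system_io_suspected(quiet_server))  # False: IO dropped, so the workload was driving it
print(system_io_suspected(busy_server))   # True: IO stayed high without the workload
```

Give the server a few minutes to settle after stopping the workload before taking readings, so that in-flight work does not skew the result.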
One piece of feedback we take from your experience is to provide a max IOPS limit setting for auto-scaling IOPS, so you can strike the right balance between performance and cost.
I hope this information helps.
Regards,
Geetha