Thank you for using the Microsoft Q&A platform.
For fine-tuning on Azure OpenAI with a 7.1 million token dataset, the time required depends on several factors: system-level throughput, per-call latency, and the underlying hardware configuration. System-level throughput determines how many requests per minute and how many total tokens your deployment can handle, while per-call latency depends on prompt size, generation size, model type, and system load.

As a rough back-of-envelope estimate, high-end GPUs can process on the order of 1,000 tokens per second per GPU, which works out to about 1.97 hours for your dataset; actual times are usually longer because of data loading, checkpointing, and other overheads.

Throughput is also influenced by whether your deployment is provisioned (PTU-based), since input size, output size, and call rate all affect how much capacity is consumed. Azure's documentation indicates that PTU requirements scale roughly linearly (sometimes sublinearly) with call rate when the workload shape stays constant; for example, 800-token prompts at 30 calls per minute require about 100 PTUs.
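As a quick sanity check on those numbers, here is a minimal sketch of the arithmetic. The 1,000 tokens/sec/GPU throughput figure and the 100 PTUs at 30 calls per minute baseline are the assumptions described above, not measured values, and linear PTU scaling is an approximation:

```python
# Back-of-envelope estimates for fine-tuning time and PTU sizing.
# Assumptions (from the discussion above, not measured values):
#   - ~1,000 tokens/sec/GPU effective training throughput
#   - PTUs scale roughly linearly with call rate at a constant workload shape

DATASET_TOKENS = 7_100_000          # 7.1M-token fine-tuning dataset
TOKENS_PER_SEC_PER_GPU = 1_000      # assumed high-end GPU throughput
NUM_GPUS = 1

def estimated_training_hours(tokens: int, tps_per_gpu: int, gpus: int) -> float:
    """Ideal time with no data-loading or checkpointing overhead."""
    return tokens / (tps_per_gpu * gpus) / 3600

def estimated_ptus(calls_per_min: float,
                   baseline_calls: float = 30,
                   baseline_ptus: float = 100) -> float:
    """Linear scaling from the documented 800-token-prompt example."""
    return baseline_ptus * (calls_per_min / baseline_calls)

hours = estimated_training_hours(DATASET_TOKENS, TOKENS_PER_SEC_PER_GPU, NUM_GPUS)
print(f"Ideal training time: {hours:.2f} h")          # ~1.97 h
print(f"PTUs for 60 calls/min: {estimated_ptus(60):.0f}")  # ~200
```

Treat the output as a lower bound; real runs add overhead for validation passes, checkpointing, and queueing.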
To optimize, use Azure Monitor to track processed tokens and adjust your training parameters accordingly (see the sketch below). Keep in mind that fine-tuning time does not scale perfectly linearly with token count, since data handling and model-level optimizations add complexity; accurate time estimates ultimately require benchmarking with your real traffic and workload characteristics.
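If it helps, here is a minimal sketch of pulling token metrics with the azure-monitor-query SDK. The metric name "ProcessedPromptTokens" and the resource URI placeholders are assumptions you would need to adjust; verify the exact metric names exposed by your Azure OpenAI resource in the portal:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Assumed resource ID format -- replace the placeholders with your own values.
RESOURCE_URI = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.CognitiveServices/accounts/<openai-resource-name>"
)

client = MetricsQueryClient(DefaultAzureCredential())

# "ProcessedPromptTokens" is an assumed metric name; check the Metrics
# blade of your resource in the Azure portal for the exact names.
response = client.query_resource(
    RESOURCE_URI,
    metric_names=["ProcessedPromptTokens"],
    timespan=timedelta(days=1),            # look back over the last day
    granularity=timedelta(hours=1),        # one data point per hour
    aggregations=[MetricAggregationType.TOTAL],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            if point.total is not None:
                print(f"{point.timestamp}: {point.total:,.0f} tokens")
```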
If you still need more details after going through the documentation above, please raise a support case through the Azure portal.
I hope this helps. Thank you.
Please don't forget to click "Accept Answer" and "Yes" for "Was this answer helpful".