@Bania RABIA I think you should be able to request the quota you need for the above use case. Once you create an Azure OpenAI resource, you can create deployments of base models under the Standard deployment type. These deployments come with a soft quota limit to ensure the models are used optimally. You can request an increase from the quota page in the Azure OpenAI portal, and once the request is approved, the deployment will use the increased quota for future requests.
To learn more about the models, check the models page to see which regions they are available in and what their limits are. If you need additional, guaranteed throughput, you can instead use provisioned deployments, which offer dedicated capacity.
For example, gpt-4o-mini might already have a default quota of 2,000K tokens per minute set on the account.
Requesting additional quota does not incur any charge; for pay-as-you-go models you are billed only on usage, that is, on the input, cached input, and output tokens consumed. If you have cost constraints, you can set up usage reports or budgets in the Azure portal to monitor spend and configure alerts. I hope this helps!
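To make the token-based billing concrete, here is a minimal sketch of how a pay-as-you-go charge is derived from the three token counts. The per-million-token prices below are placeholders I made up for illustration, not real Azure OpenAI rates; always check the official pricing page for your model and region.

```python
# Illustrative only: prices are hypothetical, NOT real Azure OpenAI rates.
PRICE_PER_1M = {
    "input": 0.60,         # hypothetical $ per 1M input tokens
    "cached_input": 0.30,  # hypothetical $ per 1M cached input tokens
    "output": 2.40,        # hypothetical $ per 1M output tokens
}

def estimate_cost(input_tokens: int,
                  cached_input_tokens: int,
                  output_tokens: int,
                  prices: dict = PRICE_PER_1M) -> float:
    """Estimate the pay-as-you-go charge for one request from its token counts."""
    total = (
        input_tokens * prices["input"]
        + cached_input_tokens * prices["cached_input"]
        + output_tokens * prices["output"]
    )
    return total / 1_000_000  # prices are quoted per 1M tokens

# Example: a request using 10k input, 5k cached input, and 2k output tokens
print(round(estimate_cost(10_000, 5_000, 2_000), 6))
```

Note that only usage is billed this way; the quota (tokens per minute) you request is a rate limit, not something you pay for.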
If this answers your query, do click Accept Answer and Yes for "Was this answer helpful". And if you have any further queries, do let us know.