Llama 3.1 serverless deployments limited to a 4096-token context window

myat.aung 5 Reputation points
2024-11-12T13:25:27.7133333+00:00

Hi,

We've been testing and using Llama 3.1 on serverless deployments for the past few months. However, it seems that the models no longer support context windows larger than 4096 tokens. I can confirm that this limit applies even when sending a raw HTTPS request, or when using the azure.ai.inference SDK in Python.
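For reference, this is roughly the kind of request that now fails. The endpoint URL and key below are placeholders, and the word-to-token ratio is only a rough heuristic; the point is simply that the prompt is comfortably larger than 4096 tokens:

```python
import json

# Placeholder values -- substitute your own serverless deployment details.
ENDPOINT = "https://<deployment-name>.<region>.models.ai.azure.com/chat/completions"
API_KEY = "<your-key>"

# Build a prompt well over 4096 tokens. Rough heuristic: ~0.75 words per
# token, so ~7500 words lands around 10k tokens.
long_prompt = "Summarise the following notes. " + ("lorem ipsum dolor sit amet " * 1500)

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": long_prompt},
    ],
    "max_tokens": 256,
}

body = json.dumps(payload)
approx_tokens = int(len(long_prompt.split()) / 0.75)
print(f"request body: {len(body)} bytes, prompt ~{approx_tokens} tokens")

# POSTing `body` to ENDPOINT with headers
# {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
# now returns an error once the prompt exceeds 4096 tokens, where the same
# request used to succeed.
```

The equivalent call through the azure.ai.inference `ChatCompletionsClient` fails the same way.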

Previously we had no issues getting completions with larger context windows. Can you please confirm whether there has been a change to the serverless deployments? If so, is there a way to work with larger context windows?

Cheers

Azure AI services
