- Shared compute resources for chatbot deployments:
AI Foundry’s Prompt Flow allows multiple deployments to share compute resources when they run on the same compute cluster. This lets you deploy multiple bots with different system prompts and flow logic while pooling resources to keep costs down. Just make sure the cluster is sized for the combined load across all bots (see the first sketch after this list).
- Compute for LLM inference vs. Prompt Flow VMs:
The VMs used by Prompt Flow are primarily for orchestrating your workflows, integrating RAG (retrieval-augmented generation), and handling custom logic. LLM inference (e.g., Azure OpenAI Service) is handled separately, and its cost is billed independently based on your usage (tokens processed). These inference calls do not consume the compute resources used by Prompt Flow itself (the second sketch after this list illustrates the two cost streams).
- Shutting down the compute session after deployment:
Yes. Once a prompt flow has been authored and deployed, the authoring session compute can be shut down without affecting the deployed flow, which continues to run on its own deployment infrastructure. Authoring sessions are temporary and only needed during design, testing, or debugging (see the third sketch after this list).
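For reference, here is a minimal sketch of hosting two bot flows behind one managed online endpoint with the azure-ai-ml SDK v2, which is one common way to deploy prompt flows. All names (endpoint, flow models, instance type) are hypothetical placeholders; size the instance type and count for your combined traffic, not per bot:

```python
# Minimal sketch (azure-ai-ml SDK v2); all names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<ai-foundry-project>",
)

# One endpoint fronting several bot deployments.
endpoint = ManagedOnlineEndpoint(name="chatbots-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Two bots, each a registered prompt flow with its own system prompt and
# flow logic; sized together so the total capacity covers the combined load.
for bot_name, flow_model in [("support-bot", "azureml:support-flow:1"),
                             ("sales-bot", "azureml:sales-flow:1")]:
    deployment = ManagedOnlineDeployment(
        name=bot_name,
        endpoint_name=endpoint.name,
        model=flow_model,                 # prompt flow registered as a model
        instance_type="Standard_DS3_v2",  # choose for aggregate traffic
        instance_count=1,
        # Commonly set for prompt flow serving; verify for your setup.
        environment_variables={"PROMPTFLOW_RUN_MODE": "serving"},
    )
    ml_client.online_deployments.begin_create_or_update(deployment).result()
```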
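To see why the two cost streams are independent, here is a back-of-the-envelope estimate. The per-token and per-hour rates below are made-up example numbers, not real prices; check the current Azure OpenAI and VM pricing pages:

```python
# Token-based LLM cost vs. hourly VM cost -- two separate meters.
PRICE_PER_1K_INPUT = 0.0025   # USD per 1K prompt tokens (assumed example rate)
PRICE_PER_1K_OUTPUT = 0.01    # USD per 1K completion tokens (assumed example rate)

def inference_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """LLM cost, billed on tokens processed -- independent of Prompt Flow VMs."""
    return (prompt_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (completion_tokens / 1000) * PRICE_PER_1K_OUTPUT

# Example month: 1M prompt tokens and 200K completion tokens.
print(f"LLM inference cost: ${inference_cost(1_000_000, 200_000):.2f}")

# The orchestration VM is billed by instance hours, regardless of token volume.
VM_HOURLY_RATE = 0.29  # USD/hour (assumed example rate)
print(f"Prompt Flow VM cost (730 h): ${VM_HOURLY_RATE * 730:.2f}")
```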
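Finally, a sketch of stopping the authoring compute once the flow is deployed, assuming the authoring session runs on an Azure ML compute instance (workspace details and the compute name are placeholders). The deployed endpoint is unaffected:

```python
# Stop the authoring compute instance after deployment to stop its billing.
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<ai-foundry-project>",
)

# Restart it later with ml_client.compute.begin_start(...) when you need
# to author, test, or debug again.
ml_client.compute.begin_stop("my-authoring-compute").result()
```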
If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.
hth
Marcin