Hello,
I am facing deployment issues with my fine-tuned model on Azure ML, which uses DeepSpeed and vLLM for inference. Tensor parallelism is configured, but every call to the endpoint fails.
Deployment Details
- Model: Fine-tuned Phi-3.5-MoE, auto-deployed through Azure ML (MLflow format).
- Inference Engine: vLLM (auto-configured).
- VM SKU: Standard_NC24ads_A100_v4.
- Instance Count: 3 for GPU parallelism.
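For reference, the deployment was created automatically from the model catalog, but it corresponds roughly to the azure-ai-ml sketch below. The endpoint, deployment, and model names are placeholders I filled in for illustration; the actual values were generated by Azure ML.

```python
# Rough reconstruction of the auto-generated deployment (names are placeholders).
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

deployment = ManagedOnlineDeployment(
    name="phi35-moe-finetuned",                    # placeholder
    endpoint_name="phi35-moe-endpoint",            # placeholder
    model="azureml:phi35-moe-finetuned-mlflow:1",  # MLflow-format fine-tuned model (placeholder ID)
    instance_type="Standard_NC24ads_A100_v4",      # 1x A100 80GB per instance
    instance_count=3,
)

ml_client.online_deployments.begin_create_or_update(deployment).result()
```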
Errors in Logs
- Main Error:
IndexError: list index out of range
File "/azureml-envs/default/lib/python3.10/site-packages/llm/optimized/inference/replica_manager.py", line 215, in get_replica
replica = self.engine_replicas[self._replica_index]
- Replica Manager Initialization (a minimal sketch of how this leads to the IndexError above follows this list):
2024-12-16 13:21:00,587 [replica_manager] initialize 136: INFO Lock acquired by worker with pid: 7. Loading model. Using tensor parallel of 2 GPUs per replica.
2024-12-16 13:21:00,974 [replica_manager] initialize 168: INFO Initialized 0 replicas.
- Warnings:
• async_io requires the dev libaio .so object and headers but these were not found.
• sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3.
• using untested triton version (2.3.1), only 1.0.0 is known to be compatible.
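Putting the two log excerpts together: because zero replicas are initialized, any lookup into the replica list raises exactly this IndexError. The snippet below is my own minimal simplification of that behaviour, not the actual replica_manager.py code, just to make the failure mode explicit.

```python
# My simplified stand-in for the replica manager (not the real Azure ML code):
# if no engine replicas get loaded, every request-time lookup must fail.
class ReplicaManagerSketch:
    def __init__(self) -> None:
        self.engine_replicas = []   # "Initialized 0 replicas." -> list stays empty
        self._replica_index = 0

    def get_replica(self):
        # Indexing an empty list raises IndexError regardless of the index value.
        replica = self.engine_replicas[self._replica_index]
        self._replica_index = (self._replica_index + 1) % len(self.engine_replicas)
        return replica


ReplicaManagerSketch().get_replica()  # IndexError: list index out of range
```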
Questions
- Why does the replica manager fail to initialize replicas (Initialized 0 replicas)?
- How can I resolve the list index out of range issue in the inference server?
- Could the warnings (missing libaio headers, the Torch version mismatch, or the untested Triton version) be causing this failure?
- Since the deployment was auto-configured, is there a way to adjust configuration files (DeepSpeed, vLLM) without needing custom Python modifications?
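To make the last question concrete: what I am hoping is possible is something along the lines of the sketch below, i.e. adjusting the engine configuration purely by updating environment variables on the existing deployment. The variable names (TENSOR_PARALLEL, NUM_REPLICAS) are guesses on my part and may not be the supported settings; confirming what can actually be overridden is part of what I am asking.

```python
# Sketch of the kind of configuration override I am hoping for -- no custom scoring code,
# only deployment-level settings. The environment variable names below are GUESSES,
# not documented settings.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

deployment = ml_client.online_deployments.get(
    name="phi35-moe-finetuned",          # placeholder deployment name
    endpoint_name="phi35-moe-endpoint",  # placeholder endpoint name
)

deployment.environment_variables = {
    **(deployment.environment_variables or {}),
    "TENSOR_PARALLEL": "1",   # guessed variable name: one GPU per Standard_NC24ads_A100_v4 instance
    "NUM_REPLICAS": "1",      # guessed variable name
}

ml_client.online_deployments.begin_create_or_update(deployment).result()
```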