Framework pytorch installed with version 1.13.1+cu116 but found version 1.11.0+cu102.

ErvinD-3505 5 Reputation points
2023-09-29T13:46:05.8366667+00:00

Hello,

Recently, using AutoML with TCNForecaster has started to crash in one of the child jobs. The error is shown below.

This worked until one to two months ago. The error seems to come from Horovod, the library that manages distributed training, which raises an error when it detects a version mismatch, as documented here: https://horovod.readthedocs.io/en/latest/_modules/horovod/common/exceptions.html

Thank you.

Execution failed. User process 'python' exited with status code 1. Please check log file 'user_logs/std_log.txt' for error details. Error:   File "hd_forecasting_dnn_driver.py", line 2, in <module>
    import azureml.contrib.automl.dnn.forecasting.wrapper.dispatched.invoker.runner as runner
  File "/azureml-envs/azureml-automl-dnn-forecasting-gpu/lib/python3.8/site-packages/azureml/contrib/automl/dnn/forecasting/wrapper/dispatched/invoker/runner.py", line 15, in <module>
    from ....wrapper.forecast_wrapper import DNNForecastWrapper, DNNParams
  File "/azureml-envs/azureml-automl-dnn-forecasting-gpu/lib/python3.8/site-packages/azureml/contrib/automl/dnn/forecasting/wrapper/forecast_wrapper.py", line 33, in <module>
    from azureml.contrib.automl.dnn.forecasting.wrapper._distributed_helper import DistributedHelper
  File "/azureml-envs/azureml-automl-dnn-forecasting-gpu/lib/python3.8/site-packages/azureml/contrib/automl/dnn/forecasting/wrapper/_distributed_helper.py", line 18, in <module>
    import horovod.torch as hvd
  File "/azureml-envs/azureml-automl-dnn-forecasting-gpu/lib/python3.8/site-packages/horovod/torch/__init__.py", line 35, in <module>
    from horovod.torch import elastic
  File "/azureml-envs/azureml-automl-dnn-forecasting-gpu/lib/python3.8/site-packages/horovod/torch/elastic/__init__.py", line 17, in <module>
    from horovod.torch.mpi_ops import init, shutdown
  File "/azureml-envs/azureml-automl-dnn-forecasting-gpu/lib/python3.8/site-packages/horovod/torch/mpi_ops.py", line 35, in <module>
    check_installed_version('pytorch', torch.__version__, e)
  File "/azureml-envs/azureml-automl-dnn-forecasting-gpu/lib/python3.8/site-packages/horovod/common/util.py", line 260, in check_installed_version
    raise HorovodVersionMismatchError(name, version, installed_version) from exception
horovod.common.exceptions.HorovodVersionMismatchError: Framework pytorch installed with version 1.13.1+cu116 but found version 1.11.0+cu102.
             This can result in unexpected behavior including runtime errors.
             Reinstall Horovod using `pip install --no-cache-dir` to build with the new version.

 Marking the experiment as failed because initial child jobs have failed due to user error
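
For anyone trying to reproduce or triage this, here is a minimal sketch for pulling the two version strings out of the error above and comparing them with what is installed in the environment. The helper names (`parse_mismatch`, `installed_version`) and the `built_with` / `found` labels are my own and not part of Horovod or Azure ML. Reading the first version as the one Horovod was built against and the second as the PyTorch found at import time is my interpretation of the message, not something I have confirmed.

```python
import re
from importlib import metadata

# Pattern for the message text seen in the log above (assumed stable format).
MISMATCH_RE = re.compile(
    r"Framework (?P<framework>\S+) installed with version (?P<built>\S+) "
    r"but found version (?P<found>\S+)"
)


def parse_mismatch(message):
    """Return the framework name and both version strings, or None if no match."""
    match = MISMATCH_RE.search(message)
    if match is None:
        return None
    return {
        "framework": match.group("framework"),
        # Assumption: the version Horovod was built against.
        "built_with": match.group("built"),
        # Assumption: the version found at import time; drop the sentence's final period.
        "found": match.group("found").rstrip("."),
    }


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    log_line = (
        "horovod.common.exceptions.HorovodVersionMismatchError: Framework pytorch "
        "installed with version 1.13.1+cu116 but found version 1.11.0+cu102."
    )
    print(parse_mismatch(log_line))
    # Report what is actually installed without importing horovod.torch,
    # since that import is what fails in the traceback.
    for name in ("torch", "horovod"):
        print(name, installed_version(name))
```

Running this inside the same environment as the failing child job would show whether the installed `torch` matches either of the two versions in the message.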