Framework pytorch installed with version 1.13.1+cu116 but found version 1.11.0+cu102.
ErvinD-3505
Hello,
Recently, using AutoML with TCNForecaster has started to crash in one of the child jobs. The error is shown below.
This worked until 1-2 months ago. The error seems to come from Horovod, the library that manages the distributed training, which raises an error when it detects a version mismatch, as documented here: https://horovod.readthedocs.io/en/latest/_modules/horovod/common/exceptions.html
Thank you.
Execution failed. User process 'python' exited with status code 1. Please check log file 'user_logs/std_log.txt' for error details. Error: File "hd_forecasting_dnn_driver.py", line 2, in <module>
import azureml.contrib.automl.dnn.forecasting.wrapper.dispatched.invoker.runner as runner
File "/azureml-envs/azureml-automl-dnn-forecasting-gpu/lib/python3.8/site-packages/azureml/contrib/automl/dnn/forecasting/wrapper/dispatched/invoker/runner.py", line 15, in <module>
from ....wrapper.forecast_wrapper import DNNForecastWrapper, DNNParams
File "/azureml-envs/azureml-automl-dnn-forecasting-gpu/lib/python3.8/site-packages/azureml/contrib/automl/dnn/forecasting/wrapper/forecast_wrapper.py", line 33, in <module>
from azureml.contrib.automl.dnn.forecasting.wrapper._distributed_helper import DistributedHelper
File "/azureml-envs/azureml-automl-dnn-forecasting-gpu/lib/python3.8/site-packages/azureml/contrib/automl/dnn/forecasting/wrapper/_distributed_helper.py", line 18, in <module>
import horovod.torch as hvd
File "/azureml-envs/azureml-automl-dnn-forecasting-gpu/lib/python3.8/site-packages/horovod/torch/__init__.py", line 35, in <module>
from horovod.torch import elastic
File "/azureml-envs/azureml-automl-dnn-forecasting-gpu/lib/python3.8/site-packages/horovod/torch/elastic/__init__.py", line 17, in <module>
from horovod.torch.mpi_ops import init, shutdown
File "/azureml-envs/azureml-automl-dnn-forecasting-gpu/lib/python3.8/site-packages/horovod/torch/mpi_ops.py", line 35, in <module>
check_installed_version('pytorch', torch.__version__, e)
File "/azureml-envs/azureml-automl-dnn-forecasting-gpu/lib/python3.8/site-packages/horovod/common/util.py", line 260, in check_installed_version
raise HorovodVersionMismatchError(name, version, installed_version) from exception
horovod.common.exceptions.HorovodVersionMismatchError: Framework pytorch installed with version 1.13.1+cu116 but found version 1.11.0+cu102.
This can result in unexpected behavior including runtime errors.
Reinstall Horovod using `pip install --no-cache-dir` to build with the new version.
Marking the experiment as failed because initial child jobs have failed due to user error
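For anyone trying to reproduce this, here is a minimal diagnostic sketch that could be run inside the same environment. It only parses the mismatch message from the log above and looks up installed package versions through the standard library, so it does not import Horovod (which is what fails). The helper names `parse_mismatch` and `installed_version` are hypothetical, and the distribution names `torch` and `horovod` are assumptions about how the packages are registered in the environment. Judging by the traceback, the "found" value is presumably `torch.__version__` at import time, while the "installed with" value is presumably what Horovod recorded when it was built.

```python
import re
from importlib import metadata

# Pattern for the first line of the HorovodVersionMismatchError message in the log.
MISMATCH_RE = re.compile(
    r"Framework (?P<framework>\S+) installed with version (?P<installed_with>\S+) "
    r"but found version (?P<found>\S+?)\.?\s*$"
)


def parse_mismatch(line):
    """Return (framework, installed_with, found) from the error line, or None."""
    match = MISMATCH_RE.search(line.strip())
    if match is None:
        return None
    return match.group("framework"), match.group("installed_with"), match.group("found")


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    log_line = (
        "horovod.common.exceptions.HorovodVersionMismatchError: Framework pytorch "
        "installed with version 1.13.1+cu116 but found version 1.11.0+cu102."
    )
    print(parse_mismatch(log_line))
    # Assumed distribution names; adjust if the environment registers them differently.
    for name in ("torch", "horovod"):
        print(name, installed_version(name))
```

If the reported `torch` version differs from the "installed with" value in the message, that would be consistent with the environment's PyTorch having changed after Horovod was built.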