AutoML TCN Forecaster Error: TCN run has met runtime error.

Dimitri CABAUD 1 Reputation point
2022-08-31T10:05:54.217+00:00

Hello,

I am trying to train a TCN Forecaster model using the AutoML capabilities in AzureML, but during training this error often appears for the child job trials:

"TCN run has met runtime error."

When I check stdlogs.txt, I see this:

2022-08-31:09:37:28,394 WARNING [automl_base_settings.py:740] Received unrecognized parameter is_gpu
2022-08-31:09:38:29,825 INFO [logging_utilities.py:403] [RunId:AutoML_dd8945bf-9a1b-446a-b6d5-4be83fbd4a5b_HD_5]CPU logical cores: 6, CPU cores: 6, virtual memory: 58948935680, swap memory: 58948931584.
2022-08-31:09:38:29,825 INFO [logging_utilities.py:410] [RunId:AutoML_dd8945bf-9a1b-446a-b6d5-4be83fbd4a5b_HD_5]Platform information: Linux.
2022-08-31:09:38:29,896 INFO [logging_utilities.py:403] [RunId:AutoML_dd8945bf-9a1b-446a-b6d5-4be83fbd4a5b_HD_5]CPU logical cores: 6, CPU cores: 6, virtual memory: 58948935680, swap memory: 58948931584.
2022-08-31:09:38:29,896 INFO [logging_utilities.py:410] [RunId:AutoML_dd8945bf-9a1b-446a-b6d5-4be83fbd4a5b_HD_5]Platform information: Linux.
2022-08-31:09:38:30,32 INFO [_distributed_helper.py:27] Horovod import succeeded
Building model
2022-08-31:09:38:37,788 INFO [forecast_tcn_wrapper.py:238] Building model
2022-08-31:09:38:37,789 INFO [tcn_model_utl.py:183] Model used the following hyperparameters: num_cells=3, multilevel=CELL, depth=2, num_channels=128, dropout_rate=0.5, dilation=2
Start time: 2022-08-31T09:36:27.199106Z, latest permissible end time: 2022-08-31 09:48:27.199106+00:00
2022-08-31:09:38:41,984 INFO [forecast_tcn_wrapper.py:301] Start time: 2022-08-31T09:36:27.199106Z, latest permissible end time: 2022-08-31 09:48:27.199106+00:00
the name of the metric used EarlyStoppingCallback normalized_mean_absolute_error
2022-08-31:09:38:41,984 INFO [forecast_tcn_wrapper.py:304] the name of the metric used EarlyStoppingCallback normalized_mean_absolute_error
The patience used in used EarlyStoppingCallback 20
2022-08-31:09:38:41,985 INFO [forecast_tcn_wrapper.py:305] The patience used in used EarlyStoppingCallback 20
the name of the improvement passed to EarlyStoppingCallback 0.001
2022-08-31:09:38:41,985 INFO [forecast_tcn_wrapper.py:306] the name of the improvement passed to EarlyStoppingCallback 0.001
LR Factor 0.5
2022-08-31:09:38:41,985 INFO [forecast_tcn_wrapper.py:307] LR Factor 0.5
Apply log transform to label during training: True
2022-08-31:09:38:41,985 INFO [forecast_tcn_wrapper.py:311] Apply log transform to label during training: True
2022-08-31:09:38:41,986 INFO [forecaster.py:761] No GPU of compute capability >= 7.0 detected; AMP is disabled.
Trying with batch_size: 1024
2022-08-31:09:38:43,141 INFO [forecast_tcn_wrapper.py:178] Trying with batch_size: 1024
2022-08-31:09:38:57,772 ERROR [runner.py:62] TCN runner script terminated with an exception of type: <class 'azureml.automl.core.shared.exceptions.ClientException'>
2022-08-31:09:38:57,872 INFO [logging_handler.py:290] Sending 2428 bytes
2022-08-31:09:38:57,872 INFO [logging_handler.py:304] Finish uploading in 0.065193 seconds.
2022-08-31:09:38:57,873 INFO [run.py:2341] fail is not setting status for submitted runs.
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.24923253059387207 seconds
Traceback (most recent call last):
File "/azureml-envs/azureml_fe4614eb6c16f8a375ce1a2fee114d15/lib/python3.7/site-packages/azureml/contrib/automl/dnn/forecasting/wrapper/forecast_tcn_wrapper.py", line 187, in train
dataloader_val=dataloader_valid)
File "/azureml-envs/azureml_fe4614eb6c16f8a375ce1a2fee114d15/lib/python3.7/site-packages/forecast/forecast/forecaster.py", line 186, in fit
train_loss = self._train_epoch(self.dataloader_train, epoch)
File "/azureml-envs/azureml_fe4614eb6c16f8a375ce1a2fee114d15/lib/python3.7/site-packages/forecast/forecast/forecaster.py", line 303, in _train_epoch
results = self.train_batch(X_past, y_past, X_fut, y_fut)
File "/azureml-envs/azureml_fe4614eb6c16f8a375ce1a2fee114d15/lib/python3.7/site-packages/forecast/forecast/forecaster.py", line 431, in train_batch
predictions = self.model(inputs)
File "/azureml-envs/azureml_fe4614eb6c16f8a375ce1a2fee114d15/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/azureml-envs/azureml_fe4614eb6c16f8a375ce1a2fee114d15/lib/python3.7/site-packages/forecast/forecast/models/model.py", line 110, in forward
out = self._premix(state)
File "/azureml-envs/azureml_fe4614eb6c16f8a375ce1a2fee114d15/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/azureml-envs/azureml_fe4614eb6c16f8a375ce1a2fee114d15/lib/python3.7/site-packages/forecast/forecast/models/premix/conv.py", line 72, in forward
return self._conv(x)
File "/azureml-envs/azureml_fe4614eb6c16f8a375ce1a2fee114d15/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/azureml-envs/azureml_fe4614eb6c16f8a375ce1a2fee114d15/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 298, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/azureml-envs/azureml_fe4614eb6c16f8a375ce1a2fee114d15/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 295, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "hd_forecasting_dnn_driver.py", line 16, in <module>
runner.run()
File "/azureml-envs/azureml_fe4614eb6c16f8a375ce1a2fee114d15/lib/python3.7/site-packages/azureml/contrib/automl/dnn/forecasting/wrapper/dispatched/invoker/runner.py", line 59, in run
_run(mltable_data_json, **kwargs)
File "/azureml-envs/azureml_fe4614eb6c16f8a375ce1a2fee114d15/lib/python3.7/site-packages/azureml/contrib/automl/dnn/forecasting/wrapper/dispatched/invoker/runner.py", line 138, in _run
y_valid=y_valid, featurizer=featurizer)
File "/azureml-envs/azureml_fe4614eb6c16f8a375ce1a2fee114d15/lib/python3.7/site-packages/azureml/contrib/automl/dnn/forecasting/wrapper/forecast_tcn_wrapper.py", line 198, in train
inner_exception=e)) from e
azureml.automl.core.shared.exceptions.ClientException: ClientException:
Message: TCN run has met runtime error.
InnerException: None
ErrorResponse
{
"error": {
"code": "SystemError",
"message": "TCN run has met runtime error.",
"details_uri": "https://aka.ms/automltroubleshoot",
"target": "TCNWrapper",
"inner_error": {
"code": "ClientError",
"inner_error": {
"code": "AutoMLInternal"
}
},
"reference_code": "639c8253-73b1-4844-a051-d2627b52be88"
}
}
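When reading a log like the one above, note that the `ClientException` at the bottom only wraps the real failure. The root cause is the last line of the first traceback, here `RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED`. Below is a minimal sketch for pulling that line out of a saved log. It assumes the chained-traceback layout shown here; the function name `root_cause` is made up for illustration and is not part of any AzureML API, and the sample text is abridged from the log above.

```python
from typing import Optional

CHAIN_MARKER = "The above exception was the direct cause of the following exception:"
TRACEBACK_HEADER = "Traceback (most recent call last):"


def root_cause(log_text: str) -> Optional[str]:
    """Return the exception line that ends the first traceback in log_text.

    Assumes Python's chained-traceback layout: the original exception is
    the last non-empty line before CHAIN_MARKER. If the log has no chained
    exception, the last non-empty line after the traceback header is
    returned instead. Returns None when the log has no traceback at all.
    """
    start = log_text.find(TRACEBACK_HEADER)
    if start == -1:
        return None
    first_traceback = log_text[start:].split(CHAIN_MARKER, 1)[0]
    lines = [ln.strip() for ln in first_traceback.splitlines() if ln.strip()]
    return lines[-1]


# Abridged sample based on the log above.
sample = """Traceback (most recent call last):
  File "conv.py", line 295, in _conv_forward
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "runner.py", line 59, in run
ClientException: TCN run has met runtime error.
"""
print(root_cause(sample))
```

Searching for the inner exception rather than the outer message is useful because "TCN run has met runtime error." is the same for every kind of underlying failure.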

Do you know where this error comes from, and how I can avoid it?

Thanks in advance for your help


1 answer

  1. Dimitri CABAUD 1 Reputation point
    2022-08-31T18:41:39.967+00:00

    Yes, of course, @YutongTie-MSFT. I can give you more information and context about my configuration and what I'm trying to do.

    I'm trying to train a TCN Forecaster on a multivariate dataset with multiple series: around 500k rows, 30 columns and 400 series.
    I use a Standard_NC6 instance with a Tesla K80 GPU, and the code I used to configure the model is below.

    I cannot share the dataset because it contains sensitive data. If you need more information, don't hesitate to ask :)

      from azureml.automl.core.forecasting_parameters import ForecastingParameters
      from azureml.train.automl import AutoMLConfig

      # time_column_name, n_test_periods, train_dataset, compute_target, etc.
      # are defined earlier in my notebook and are not shown here.

      # Forecasting-specific settings: time column, horizon, series IDs, daily frequency.
      forecasting_parameters = ForecastingParameters(
          time_column_name=time_column_name,
          forecast_horizon=n_test_periods,
          time_series_id_column_names=time_series_id_column_names,
          freq="d",
      )

      # Restrict AutoML to the DNN-based TCNForecaster only.
      automl_config = AutoMLConfig(
          task="forecasting",
          debug_log="automl_oj_sales_errors.log",
          primary_metric="normalized_mean_absolute_error",
          enable_dnn=True,
          allowed_models=["TCNForecaster"],
          experiment_timeout_hours=1,
          training_data=train_dataset,
          label_column_name=target_column_name,
          compute_target=compute_target,
          enable_early_stopping=True,
          featurization=featurization_config,
          n_cross_validations=3,
          verbosity=50,
          max_cores_per_iteration=-1,
          max_concurrent_iterations=1,
          forecasting_parameters=forecasting_parameters,
      )
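
Since the traceback ends in a cuDNN initialization failure, one thing that could be checked on the compute node is what the PyTorch build in the training environment reports about the GPU. The following is only a sketch and assumes it is run inside the same environment as the training job. `gpu_report` is a hypothetical helper name, not an AzureML function, and it only gathers facts; it does not confirm the cause of the error.

```python
def gpu_report() -> dict:
    """Collect basic CUDA/cuDNN facts from the installed PyTorch build."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False}

    cuda_available = torch.cuda.is_available()
    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": cuda_available,
        # Only query cuDNN when a CUDA device is actually usable.
        "cudnn_version": torch.backends.cudnn.version() if cuda_available else None,
    }
    if cuda_available:
        report["device_name"] = torch.cuda.get_device_name(0)
        # (major, minor) tuple for the first visible GPU.
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

Comparing the reported compute capability and CUDA/cuDNN versions against what the installed PyTorch build supports may help narrow down whether the problem is in the environment or in the data and batch size.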
