Hello,
I'm working on a hyperparameter tuning job in Azure ML using the CLI/YAML schema. I want to include different optimizers in the hyperparameter search space and also tune their individual hyperparameters, but I'm concerned that doing this will cause the sweep job to trial hyperparameters that aren't actually used by the chosen optimizer, wasting compute time and resources.
Let's say I have the following function to do this:
from keras.optimizers import SGD, Adam, RMSprop, Adagrad, Adadelta

def get_optimizer(args):
    # Build the optimizer named by args.optimizer, passing only the
    # hyperparameters that optimizer actually uses.
    name = args.optimizer
    if name == 'SGD':
        return SGD(learning_rate=args.learning_rate, momentum=args.momentum)
    elif name == 'Adam':
        return Adam(learning_rate=args.learning_rate, beta_1=args.beta_1,
                    beta_2=args.beta_2, epsilon=args.epsilon)
    elif name == 'RMSprop':
        return RMSprop(learning_rate=args.learning_rate, rho=args.rho, epsilon=args.epsilon)
    elif name == 'Adagrad':
        return Adagrad(learning_rate=args.learning_rate, epsilon=args.epsilon)
    elif name == 'Adadelta':
        return Adadelta(learning_rate=args.learning_rate, rho=args.rho, epsilon=args.epsilon)
    else:
        raise ValueError(f"Unknown optimizer: {name}")
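For context, the training script reads these values from the command line, roughly like this (argument names and defaults here are just illustrative placeholders, matching what the sweep job would pass in):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--optimizer', type=str, default='Adam')
parser.add_argument('--learning_rate', type=float, default=0.001)
parser.add_argument('--momentum', type=float, default=0.9)    # only used by SGD
parser.add_argument('--beta_1', type=float, default=0.9)      # only used by Adam
parser.add_argument('--beta_2', type=float, default=0.999)    # only used by Adam
parser.add_argument('--rho', type=float, default=0.9)         # only used by RMSprop/Adadelta
parser.add_argument('--epsilon', type=float, default=1e-7)
args = parser.parse_args()

optimizer = get_optimizer(args)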
The hyperparameter 'momentum' is only relevant for Stochastic Gradient Descent (SGD), so if I use a 'choice' input for the sweep job and it chooses 'Adam' during the sweep, I don't want it to trial lots of different values for 'momentum': that wastes trials and could confuse a Bayesian sampling algorithm.
Ideally, I'd like to be able to define a conditional search space in the sweep job's YAML, something like this:
search_space:
  optimizer:
    type: choice
    values: ['SGD', 'Adam', 'RMSprop', 'Adagrad', 'Adadelta']
  learning_rate:
    type: uniform
    min_value: 0.0001
    max_value: 0.01
  beta_1:
    type: uniform
    min_value: 0.85
    max_value: 0.99
    conditional:
      - parent: optimizer
        value: 'Adam'
  beta_2:
    type: uniform
    min_value: 0.9
    max_value: 0.999
    conditional:
      - parent: optimizer
        value: 'Adam'
  momentum:
    type: uniform
    min_value: 0.5
    max_value: 0.9
    conditional:
      - parent: optimizer
        value: 'SGD'
Does anyone know of a way to achieve this in Azure ML, or do I need to rethink the tuning strategy and use something like Hyperopt on Databricks instead? For reference, Hyperopt can express exactly this kind of conditional space by nesting each optimizer's parameters inside hp.choice; the sketch below shows what I mean (it reuses the ranges above and is not something Azure ML would consume).
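from hyperopt import hp

# Conditional search space: each branch of the choice carries only the
# hyperparameters that its optimizer actually uses.
search_space = hp.choice('optimizer', [
    {
        'name': 'SGD',
        'learning_rate': hp.uniform('sgd_lr', 0.0001, 0.01),
        'momentum': hp.uniform('sgd_momentum', 0.5, 0.9),
    },
    {
        'name': 'Adam',
        'learning_rate': hp.uniform('adam_lr', 0.0001, 0.01),
        'beta_1': hp.uniform('adam_beta_1', 0.85, 0.99),
        'beta_2': hp.uniform('adam_beta_2', 0.9, 0.999),
    },
])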
Many thanks