Custom Learning Rate


We have found that resetting the learning rate and introducing new data (while discarding the old data) in discrete rounds yields better performance for our model. Is it possible to have a similar mechanism in round-based training, so that the learning rate follows a cosine decay within each round? Alternatively, is it possible to define a custom learning rate for the manual discrete-rounds method, so that in each round the cosine decay starts from a lower initial learning rate (because the model has already learned some patterns in the earlier rounds)?

Thank you very much,

PS. Having both options would be extremely helpful. Also, I tried to experiment with the following but got an error suggesting a conflict with the cosine decay:

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(

Hi Ali,

if I understand you correctly, a cosine decay schedule with restarts could be helpful for your situation. You can implement it in your training like this:

schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
optimizer = tf.keras.optimizers.Adam(schedule, global_clipnorm=1.0)
history = trainer.train_rounds([YOUR CODE], optimizer=optimizer)

Edit: I just saw that you are discarding the old data, so t_mul can stay at 1.0 and you can adapt m_mul to control the decrease of the learning rate over restarts.
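To make the effect of m_mul concrete, here is a minimal pure-Python sketch of the learning rate that, as far as I understand the formula, CosineDecayRestarts produces when t_mul is 1.0. The function name cosine_restart_lr and the argument steps_per_round (playing the role of first_decay_steps) are my own illustrative choices, and the default m_mul=0.5 is arbitrary; please check the result against the TensorFlow documentation before relying on it.

```python
import math

def cosine_restart_lr(step, initial_lr, steps_per_round, m_mul=0.5, alpha=0.0):
    """Learning rate at `step` for cosine decay with equally long restarts.

    Hypothetical helper (not part of TensorFlow): every round lasts
    `steps_per_round` steps (the t_mul = 1.0 case) and the starting
    learning rate of each round is scaled by `m_mul`.
    """
    round_idx = step // steps_per_round                      # which restart we are in
    fraction = (step % steps_per_round) / steps_per_round    # progress within the round
    cosine = 0.5 * (m_mul ** round_idx) * (1.0 + math.cos(math.pi * fraction))
    # alpha sets a floor, expressed as a fraction of initial_lr
    return initial_lr * ((1.0 - alpha) * cosine + alpha)

# Starting learning rate of the first three rounds with m_mul = 0.5
starts = [cosine_restart_lr(r * 100, 1e-3, 100) for r in range(3)]
```

For this to line up with the rounds, steps_per_round should equal the number of optimizer steps per round, which I assume is the number of epochs times the number of batches per epoch.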
If you want to use train_rounds and keep previously simulated data, the growing number of data sets makes this a bit tricky, since the multiplicative factor t_mul cannot match the additive growth of the number of iterations within the rounds. You could allow for such behavior by constructing a custom schedule, possibly building upon the TensorFlow CosineDecayRestarts code.
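A rough sketch of what such a custom schedule could compute, written in plain Python so the logic is easy to check. It assumes that round r (counting from 0) runs for (r + 1) * base_steps optimizer steps, i.e. that the step count grows additively because the data set grows by the same amount each round while epochs and batch size stay fixed. The name additive_restart_lr and all of its arguments are hypothetical; to use it in training, the same arithmetic would have to be expressed with TensorFlow ops inside a LearningRateSchedule subclass.

```python
import math

def additive_restart_lr(step, initial_lr, base_steps, m_mul=1.0, alpha=0.0):
    """Cosine decay with restarts whose period grows additively.

    Hypothetical helper: round r (0-based) lasts (r + 1) * base_steps steps,
    and its starting learning rate is initial_lr * m_mul ** r.
    """
    round_idx = 0
    round_start = 0
    round_len = base_steps
    # Walk forward until we reach the round that contains `step`.
    while step >= round_start + round_len:
        round_start += round_len
        round_idx += 1
        round_len = (round_idx + 1) * base_steps
    fraction = (step - round_start) / round_len   # progress within the round
    cosine = 0.5 * (m_mul ** round_idx) * (1.0 + math.cos(math.pi * fraction))
    return initial_lr * ((1.0 - alpha) * cosine + alpha)
```

With base_steps=100, the restarts happen at steps 100, 300, 600, and so on, rather than at the geometrically spaced points that t_mul would give.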



We could add a flag for discarding / keeping old data from previous rounds in round-based training? I assume you are currently “hacking” it with an external loop over offline training?


Thank you very much, both. Yes, I am “hacking” it with an external loop over offline training. The only reason is to avoid high computational costs in the last rounds while providing large batches of data in each round. I think there’s a tradeoff between the two methods. I feel this approach helps to adaptively focus the model’s learning on areas of the parameter space that are poorly understood, whereas the current round-based training refines the posterior approximation with increasing precision, which might not be achieved otherwise.