We have found that resetting the learning rate and introducing new data (while discarding the old data) in discrete rounds yields better performance for our model. Is it possible to have a similar mechanism in round-based training, so that the learning rate follows a cosine decay within each round? Alternatively, is it possible to define a custom learning rate schedule for the manual discrete-rounds approach so that in each round the cosine decay starts from a lower initial learning rate (because the model has already learned some patterns in the earlier rounds)?
Thank you very much,
Ali
PS. Having both options would be extremely helpful. Also, I tried to experiment with the following but got an error suggesting a conflict with the cosine decay:
If I understand you correctly, a cosine decay schedule with restarts could be helpful for your situation. You can implement it in your training like this:
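A minimal sketch with TensorFlow/Keras; the concrete numbers (steps_per_round, the initial learning rate, and the decay factors) are placeholders you would need to tune:

```python
import tensorflow as tf

# Number of optimizer updates in one round (placeholder; derive it from
# your data set size, batch size, and epochs per round).
steps_per_round = 1000

lr_schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=1e-3,
    first_decay_steps=steps_per_round,
    t_mul=1.0,  # keep the restart period constant across rounds
    m_mul=0.8,  # restart each round from 80% of the previous peak learning rate
)

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```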
Edit: I just saw that you are discarding the old data, so t_mul can stay at 1.0 and you can adapt m_mul to control the decrease of the learning rate over restarts.
If you want to use train_rounds and keep previously simulated data, the growing number of data sets makes this a bit tricky, since the multiplicative factor t_mul cannot match the additive growth of the number of iterations within the rounds. You could achieve such behavior by constructing a custom schedule, possibly building upon the TensorFlow CosineDecayRestarts code.
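Purely as an illustration (this is not an existing schedule), such a custom schedule could look roughly like the sketch below, assuming that round r contains r * base_steps optimizer updates because each round adds one new data set of the same size:

```python
import math
import tensorflow as tf

class GrowingCosineDecayRestarts(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Cosine decay with restarts whose period grows additively each round.

    Assumes round r (1-indexed) contains r * base_steps optimizer updates,
    mirroring round-based training that keeps all previously simulated data.
    """

    def __init__(self, initial_learning_rate, base_steps, m_mul=0.8, alpha=0.0):
        self.initial_learning_rate = initial_learning_rate
        self.base_steps = base_steps
        self.m_mul = m_mul
        self.alpha = alpha

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        base = tf.cast(self.base_steps, tf.float32)

        # Cumulative steps before round r are base * r * (r - 1) / 2,
        # so invert the triangular number to find the current round.
        u = step / base
        r = tf.floor((1.0 + tf.sqrt(1.0 + 8.0 * u)) / 2.0)

        steps_before = base * r * (r - 1.0) / 2.0
        period = base * r
        completed = (step - steps_before) / period

        # Standard cosine decay within the current round.
        cosine = 0.5 * (1.0 + tf.cos(math.pi * completed))
        decayed = (1.0 - self.alpha) * cosine + self.alpha

        # Each restart begins from a lower peak learning rate.
        peak = self.initial_learning_rate * tf.pow(self.m_mul, r - 1.0)
        return peak * decayed

    def get_config(self):
        return {
            "initial_learning_rate": self.initial_learning_rate,
            "base_steps": self.base_steps,
            "m_mul": self.m_mul,
            "alpha": self.alpha,
        }
```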
We could add a flag for discarding / keeping old data from previous rounds in round-based training? I assume you are currently “hacking” it with an external loop over offline training?
Thank you very much, both. Yes, I am “hacking” it with an external loop over offline training. The only reason is to avoid high computational costs in the later rounds while still providing large batches of data in each round. I think there is a tradeoff between the two methods: I feel this approach helps the model adaptively focus its learning on regions of the parameter space that are still poorly understood, whereas the current round-based training refines the posterior approximation with increasing precision, which might not be achieved otherwise.
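To make the workflow concrete, the kind of external loop described above could be sketched like this; simulate_new_data and train_offline are hypothetical placeholders for the simulation step and one offline training pass, and the per-round decay factor is just an example value:

```python
import tensorflow as tf

n_rounds = 10
initial_lr = 1e-3
lr_decay_per_round = 0.8  # example: start each round's cosine decay from a lower peak
steps_per_round = 1000    # optimizer updates per round (placeholder)

for r in range(n_rounds):
    # Hypothetical placeholder: draw a fresh training set and discard the old one.
    data = simulate_new_data(round_index=r)

    # Fresh cosine decay each round, starting from a lower initial learning rate.
    lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
        initial_learning_rate=initial_lr * lr_decay_per_round ** r,
        decay_steps=steps_per_round,
    )
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

    # Hypothetical placeholder: one offline training pass on the new data only.
    train_offline(data, optimizer)
```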