We have found that resetting the learning rate and introducing new data (while discarding the old data) in discrete rounds yields better performance for our model. Is it possible to have a similar mechanism in round-based training, so that the learning rate follows a cosine decay within each round? Alternatively, is it possible to define a custom learning rate schedule for the manual discrete-rounds approach so that in each round the cosine decay starts from a lower initial learning rate (because the model has already learned some patterns in the earlier rounds)?
Thank you very much,
Ali
PS. Having both options would be extremely helpful. Also, I tried to experiment with the following but got an error suggesting a conflict with the cosine decay:
If I understand you correctly, a cosine decay schedule with restarts could be helpful for your situation. You can implement it in your training like this:
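A minimal sketch with TensorFlow/Keras; the concrete numbers (steps_per_round, the initial learning rate, and the decay factors) are placeholders you would need to tune:

```python
import tensorflow as tf

# Number of optimizer updates in one round (placeholder; derive it from
# your data set size, batch size, and epochs per round).
steps_per_round = 1000

lr_schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=1e-3,
    first_decay_steps=steps_per_round,
    t_mul=1.0,  # keep the restart period constant across rounds
    m_mul=0.8,  # restart each round from 80% of the previous peak learning rate
)

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```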
Edit: I just saw that you are discarding the old data, so t_mul can stay at 1.0 and you can adapt m_mul to control the decrease of the learning rate over restarts.
If you want to use train_rounds and keep previously simulated data, the growing number of data sets makes this a bit tricky, since the multiplicative factor t_mul cannot match the additive growth of the number of iterations within the rounds. You could achieve such behavior by constructing a custom schedule, possibly building upon the TensorFlow CosineDecayRestarts code.
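Purely as an illustration (this is not an existing schedule), such a custom schedule could look roughly like the sketch below, assuming that round r contains r * base_steps optimizer updates because each round adds one new data set of the same size:

```python
import math
import tensorflow as tf

class GrowingCosineDecayRestarts(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Cosine decay with restarts whose period grows additively each round.

    Assumes round r (1-indexed) contains r * base_steps optimizer updates,
    mirroring round-based training that keeps all previously simulated data.
    """

    def __init__(self, initial_learning_rate, base_steps, m_mul=0.8, alpha=0.0):
        self.initial_learning_rate = initial_learning_rate
        self.base_steps = base_steps
        self.m_mul = m_mul
        self.alpha = alpha

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        base = tf.cast(self.base_steps, tf.float32)

        # Cumulative steps before round r are base * r * (r - 1) / 2,
        # so invert the triangular number to find the current round.
        u = step / base
        r = tf.floor((1.0 + tf.sqrt(1.0 + 8.0 * u)) / 2.0)

        steps_before = base * r * (r - 1.0) / 2.0
        period = base * r
        completed = (step - steps_before) / period

        # Standard cosine decay within the current round.
        cosine = 0.5 * (1.0 + tf.cos(math.pi * completed))
        decayed = (1.0 - self.alpha) * cosine + self.alpha

        # Each restart begins from a lower peak learning rate.
        peak = self.initial_learning_rate * tf.pow(self.m_mul, r - 1.0)
        return peak * decayed

    def get_config(self):
        return {
            "initial_learning_rate": self.initial_learning_rate,
            "base_steps": self.base_steps,
            "m_mul": self.m_mul,
            "alpha": self.alpha,
        }
```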
We could add a flag for discarding / keeping old data from previous rounds in round-based training? I assume you are currently “hacking” it with an external loop over offline training?
Thank you very much, both. Yes, I am “hacking” it with an external loop over offline training. The only reason is to avoid high computational costs in the later rounds while still providing large batches of data in each round. I think there is a tradeoff between the two methods: I feel this approach helps the model adaptively focus its learning on regions of the parameter space that are still poorly understood, whereas the current round-based training refines the posterior approximation with increasing precision, which might not be achieved otherwise.
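To make the workflow concrete, the kind of external loop described above could be sketched like this; simulate_new_data and train_offline are hypothetical placeholders for the simulation step and one offline training pass, and the per-round decay factor is just an example value:

```python
import tensorflow as tf

n_rounds = 10
initial_lr = 1e-3
lr_decay_per_round = 0.8  # example: start each round's cosine decay from a lower peak
steps_per_round = 1000    # optimizer updates per round (placeholder)

for r in range(n_rounds):
    # Hypothetical placeholder: draw a fresh training set and discard the old one.
    data = simulate_new_data(round_index=r)

    # Fresh cosine decay each round, starting from a lower initial learning rate.
    lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
        initial_learning_rate=initial_lr * lr_decay_per_round ** r,
        decay_steps=steps_per_round,
    )
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

    # Hypothetical placeholder: one offline training pass on the new data only.
    train_offline(data, optimizer)
```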