Recently I used an R package to generate some data as summary statistics and put it into BayesFlow for training. The data is high-dimensional. I tried a small training run with epochs=1, iterations_per_epoch=100, batch_size=32, and validation_sims=20. I found that val_losses cannot be plotted. Most importantly, the model cannot generate any posterior samples after training, so the diagnostic plot is empty as well.
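For context, my training call looks roughly like this (a simplified sketch against the BayesFlow 1.x Trainer API; the amortizer and the rpy2-based generative model are defined elsewhere in my script and stand in as placeholders here):

```python
import bayesflow as bf

# Placeholders: the summary/inference networks (amortizer) and the
# rpy2-based generative model are defined elsewhere in my script.
trainer = bf.trainers.Trainer(
    amortizer=amortizer,
    generative_model=generative_model,
)

# The small test run described above
history = trainer.train_online(
    epochs=1,
    iterations_per_epoch=100,
    batch_size=32,
    validation_sims=20,
)

# Plotting the losses; the validation curve comes out empty for me
fig = bf.diagnostics.plot_losses(history["train_losses"], history["val_losses"])
```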
At first, I wondered whether the R data (generated from the R package via rpy2) was in the wrong format, because I wrote simulator_fun using the R package. But when I saw that test data was generated successfully from the model, I concluded that the R package was already working.
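This is roughly how I wrap the R simulator with rpy2 (simplified; the package and function names are placeholders for my actual R code):

```python
import numpy as np
import rpy2.robjects as ro
from rpy2.robjects import numpy2ri

numpy2ri.activate()  # auto-convert R vectors/matrices to NumPy arrays

ro.r("library(mypackage)")  # placeholder for the actual R package

def simulator_fun(theta, n_obs=100):
    """Calls a (placeholder) R simulation function and returns
    a float32 NumPy array for BayesFlow."""
    r_sim = ro.r["simulate_data"]  # placeholder R function name
    out = np.asarray(r_sim(ro.FloatVector(theta), n_obs))
    return out.astype(np.float32)
```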
But I don’t know why “val_losses” has no plot, and when I put the test data into the model, the outcome is NaN. I am not sure whether the training is insufficient (I tried iterations=10000, same situation) or whether there is some other error in the model. I already tried the same setup on a simple toy example, and the outcome was fine. May I ask if you know why the model behaves like this? Did the data from R not go into the training successfully, or is there some other error? Thanks so much.
Receiving NaN outcomes sounds like a breaking error rather than insufficient training. Even if you did not train your network at all, running data through a randomly initialized network should give you some (random) output.
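One quick way to isolate this (a sketch, assuming the BayesFlow 1.x Trainer and AmortizedPosterior API) is to push a single simulated batch through the untrained amortizer:

```python
import numpy as np

# If the *untrained* network already yields NaNs, the problem lies in
# the data or configuration, not in the training itself.
raw = trainer.generative_model(batch_size=8)
conf = trainer.configurator(raw)
samples = trainer.amortizer.sample(conf, n_samples=50)
print("NaNs in untrained samples?", np.isnan(samples).any())
```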
Did you format your training data in the required format (e.g., for offline training, a simulations_dict with multidimensional sim_data and prior_draws NumPy arrays)? Does your training data contain some NaNs that you have to filter out before passing it to the network, or fix during simulation?
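For offline training, the expected structure looks roughly like this (a sketch with example shapes; here 2000 simulations, 5 parameters, and 100-dimensional summaries):

```python
import numpy as np

simulations_dict = {
    "prior_draws": np.random.default_rng(1).normal(size=(2000, 5)).astype(np.float32),
    "sim_data": np.random.default_rng(2).normal(size=(2000, 100)).astype(np.float32),
}

# Check for NaNs/infs before handing the dict to the trainer
for key, arr in simulations_dict.items():
    print(key, arr.shape, "all finite:", np.isfinite(arr).all())
```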
I’m not sure if it’s related, but coincidentally this also happened to me a few hours ago. I was doing round-based training, and it went on for more than 20 hours. In my experience, it happens when the data becomes very large: the NaN issue does not occur when I set a lower number of rounds. Conversely, if I reduce the number of simulations (data points) in each round, I can increase the number of rounds. In my case it was round 7, with 280,000 simulated data points (40K per round). I hope this helps.
Thanks for the reply. Yes, I have checked that there are no NaNs in the training data, and the simulations_dict with multidimensional sim_data and prior_draws are all NumPy arrays.
In general, the only cases where I have seen NaNs in the loss function are:
When training diverges (e.g., exploding gradients)
When there is something terribly wrong with the data (e.g., NaNs, infs, etc.)
Number one can be easily inspected.
Number two requires more attention. Do the data or parameters contain very large numbers? If so, standardization is needed, as in any deep learning application (see the sketch below).
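A minimal standardization sketch in plain NumPy (you would compute the means/stds on the training simulations and reuse them for validation and observed data):

```python
import numpy as np

def standardize(x, mean=None, std=None):
    """Z-standardize along the simulation axis; the small epsilon
    avoids division by zero for near-constant features."""
    mean = x.mean(axis=0, keepdims=True) if mean is None else mean
    std = x.std(axis=0, keepdims=True) if std is None else std
    return (x - mean) / (std + 1e-8), mean, std

# Toy example with very large values, as might come from a simulator
data = np.random.default_rng(0).normal(loc=1e6, scale=1e4, size=(2000, 100))
data_z, data_mean, data_std = standardize(data)
print(data_z.mean(), data_z.std())  # approximately 0 and 1
```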
Stefan is addressing the problem systematically, so I would suggest providing the additional information he requested. But from my limited experience with offline/round-based training, reducing the number of rounds or the amount of data you provide might help. For example, if you are feeding in 100,000 data points, try a smaller number (say, 1,000) to see whether the problem persists, or reduce the number of epochs or rounds. At least, this is how I resolved the issue in my code, though there is a chance that something odd is also going on with my data when I generate a very large number of data points.
Thanks so much. I use online training, so I generated some samples to check the data; it is not big data, and I haven’t seen any NaNs or infs. One weird thing I noticed is that the same code sometimes generates losses and posterior samples with actual values, and sometimes the outcome is all NaNs. I will try to check all of the possible errors.
May I ask one more question?
I tried a very small dataset of only 50 samples and checked that there are no NaNs or infs. I use offline training so that I can control the data. I find that the loss value is NaN, and I can plot the graph for history[“train_losses”], but the graph for history[“val_losses”] is empty. If we can plot history[“train_losses”], does that mean the training was successful, even though there is no value in history[“val_losses”]? I also tried standardization; same situation. Thanks.
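For reference, my offline training call looks roughly like this (simplified; the numbers below are placeholders, and the trainer and the 50-sample simulations_dict are set up as described above):

```python
import matplotlib.pyplot as plt

history = trainer.train_offline(
    simulations_dict,
    epochs=30,           # placeholder values, not my exact settings
    batch_size=16,
    validation_sims=10,  # as far as I understand, needed for val_losses
)

plt.plot(history["train_losses"])  # this plots fine
plt.plot(history["val_losses"])    # this comes out empty for me
```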
Sorry for the late reply. These days I have been trying my best to solve the problems, and I found that sometimes, if I switch to another dataset, the loss has values. I guess my code is right and there may be some problem with the generated data. Thanks so much for your help!
Thanks so much. During my coding, I found that with the same code, sometimes the loss is NaN, and if I rerun it, it has values. Do you know why? Also, I sometimes see the same problem as Ali: using a smaller dataset solves it. But why is a larger dataset more prone to NaNs in the loss?
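To try to reproduce the NaN runs, would fixing the seeds like this be enough (a sketch; I am not sure it covers every source of randomness)?

```python
import random
import numpy as np
import tensorflow as tf  # BayesFlow 1.x runs on TensorFlow

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
```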
Hi, may I ask a question? During training, I see the loss and average loss values (for example, Loss: -2.036, W.Decay: 0.056, Avg.Loss: -1.990), but why is the loss NaN in the pink area? I have checked that my data does not contain any infs or NaNs. Also, if we get an Avg.Loss value, does that mean the training is successful?