Recently I used an R package to generate some data as summary statistics and put it into BayesFlow for training. The data is high-dimensional. I tried a small training run with epochs=1, iterations_per_epoch=100, batch_size=32, and validation_sims=20. I found that val_losses cannot be plotted. Most importantly, the model cannot generate any posterior samples after training, so the diagnostic plot is empty as well.
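For context, my training call looks roughly like this (a simplified sketch against the BayesFlow 1.x Trainer API; the amortizer and the rpy2-based generative model are defined elsewhere in my script and stand in as placeholders here):

```python
import bayesflow as bf

# Placeholders: the summary/inference networks (amortizer) and the
# rpy2-based generative model are defined elsewhere in my script.
trainer = bf.trainers.Trainer(
    amortizer=amortizer,
    generative_model=generative_model,
)

# The small test run described above
history = trainer.train_online(
    epochs=1,
    iterations_per_epoch=100,
    batch_size=32,
    validation_sims=20,
)

# Plotting the losses; the validation curve comes out empty for me
fig = bf.diagnostics.plot_losses(history["train_losses"], history["val_losses"])
```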
At first, I wondered whether the R data (generated from the R package via rpy2) was in the wrong format, because I wrote simulator_fun using the R package. But when I saw that test data was generated successfully from the model, I concluded that the R package was already working.
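This is roughly how I wrap the R simulator with rpy2 (simplified; the package and function names are placeholders for my actual R code):

```python
import numpy as np
import rpy2.robjects as ro
from rpy2.robjects import numpy2ri

numpy2ri.activate()  # auto-convert R vectors/matrices to NumPy arrays

ro.r("library(mypackage)")  # placeholder for the actual R package

def simulator_fun(theta, n_obs=100):
    """Calls a (placeholder) R simulation function and returns
    a float32 NumPy array for BayesFlow."""
    r_sim = ro.r["simulate_data"]  # placeholder R function name
    out = np.asarray(r_sim(ro.FloatVector(theta), n_obs))
    return out.astype(np.float32)
```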
But I don’t know why “val_losses” has no plot, and when I put the test data into the model, the outcome is NaN. I am not sure whether the training is insufficient (I tried iterations=10000, same situation) or whether there is some other error in the model. I already tried the same setup on a simple toy example, and the outcome was fine. May I ask if you know why the model behaves like this? Did the data from R not go into the training successfully, or is there some other error? Thanks so much.
Receiving NaN outcomes sounds like a breaking error rather than insufficient training. Even if you did not train your network at all, running data through a randomly initialized network should give you some (random) output.
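One quick way to isolate this (a sketch, assuming the BayesFlow 1.x Trainer and AmortizedPosterior API) is to push a single simulated batch through the untrained amortizer:

```python
import numpy as np

# If the *untrained* network already yields NaNs, the problem lies in
# the data or configuration, not in the training itself.
raw = trainer.generative_model(batch_size=8)
conf = trainer.configurator(raw)
samples = trainer.amortizer.sample(conf, n_samples=50)
print("NaNs in untrained samples?", np.isnan(samples).any())
```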
Did you format your training data in the required format (e.g., for offline training, a simulations_dict with multidimensional sim_data and prior_draws NumPy arrays)? Does your training data contain some NaNs that you have to filter out before passing it to the network, or fix during simulation?
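For offline training, the expected structure looks roughly like this (a sketch with example shapes; here 2000 simulations, 5 parameters, and 100-dimensional summaries):

```python
import numpy as np

simulations_dict = {
    "prior_draws": np.random.default_rng(1).normal(size=(2000, 5)).astype(np.float32),
    "sim_data": np.random.default_rng(2).normal(size=(2000, 100)).astype(np.float32),
}

# Check for NaNs/infs before handing the dict to the trainer
for key, arr in simulations_dict.items():
    print(key, arr.shape, "all finite:", np.isfinite(arr).all())
```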
I’m not sure if it’s related, but coincidentally this also happened to me a few hours ago. I was doing round-based training, and it went on for more than 20 hours. In my experience, it happens when the data becomes very large: the NaN issue does not occur when I set a lower number of rounds. Conversely, if I reduce the number of simulations (data points) in each round, I can increase the number of rounds. In my case it was round 7, with 280,000 simulated data points (40K per round). I hope this helps.
Thanks for the reply. Yes, I have checked that there are no NaNs in the training data, and the simulations_dict with multidimensional sim_data and prior_draws are all NumPy arrays.
In general, the only cases where I have seen NaNs in the loss function are:
When training diverges (e.g., exploding gradients)
When there is something terribly wrong with the data (e.g., NaNs, infs, etc.)
Number one can be easily inspected.
Number two requires more attention. Do the data or parameters contain very large numbers? If so, standardization is needed, as in any deep learning application (see the sketch below).
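A minimal standardization sketch in plain NumPy (you would compute the means/stds on the training simulations and reuse them for validation and observed data):

```python
import numpy as np

def standardize(x, mean=None, std=None):
    """Z-standardize along the simulation axis; the small epsilon
    avoids division by zero for near-constant features."""
    mean = x.mean(axis=0, keepdims=True) if mean is None else mean
    std = x.std(axis=0, keepdims=True) if std is None else std
    return (x - mean) / (std + 1e-8), mean, std

# Toy example with very large values, as might come from a simulator
data = np.random.default_rng(0).normal(loc=1e6, scale=1e4, size=(2000, 100))
data_z, data_mean, data_std = standardize(data)
print(data_z.mean(), data_z.std())  # approximately 0 and 1
```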
Stefan is addressing the problem systematically, so I would suggest providing the additional information he requested. But from my limited experience with offline/round-based training, reducing the number of rounds or the amount of data you provide might help. For example, if you are feeding in 100,000 data points, try a smaller number (say, 1,000) to see whether the problem persists, or reduce the number of epochs or rounds. At least, this is how I resolved the issue in my code, though there is a chance that something odd is also going on with my data when I generate a very large number of data points.
Thanks so much. I use online training, so I generated some samples to check the data; it is not big data, and I haven’t seen any NaNs or infs. One weird thing I noticed is that the same code sometimes generates losses and posterior samples with actual values, and sometimes the outcome is all NaNs. I will try to check all of the possible errors.
May I ask one more question?
I tried a very small dataset of only 50 samples and checked that there are no NaNs or infs. I use offline training so that I can control the data. I find that the loss value is NaN, and I can plot the graph for history[“train_losses”], but the graph for history[“val_losses”] is empty. If we can plot history[“train_losses”], does that mean the training was successful, even though there is no value in history[“val_losses”]? I also tried standardization; same situation. Thanks.
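For reference, my offline training call looks roughly like this (simplified; the numbers below are placeholders, and the trainer and the 50-sample simulations_dict are set up as described above):

```python
import matplotlib.pyplot as plt

history = trainer.train_offline(
    simulations_dict,
    epochs=30,           # placeholder values, not my exact settings
    batch_size=16,
    validation_sims=10,  # as far as I understand, needed for val_losses
)

plt.plot(history["train_losses"])  # this plots fine
plt.plot(history["val_losses"])    # this comes out empty for me
```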
Sorry for the late reply. These days I have been trying my best to solve the problems, and I found that sometimes, if I switch to another dataset, the loss has values. I guess my code is right and there may be some problem with the generated data. Thanks so much for your help!
Thanks so much. During my coding, I found that with the same code, sometimes the loss is NaN, and if I rerun it, it has values. Do you know why? Also, I sometimes see the same problem as Ali: using a smaller dataset solves it. But why is a larger dataset more prone to NaNs in the loss?
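To try to reproduce the NaN runs, would fixing the seeds like this be enough (a sketch; I am not sure it covers every source of randomness)?

```python
import random
import numpy as np
import tensorflow as tf  # BayesFlow 1.x runs on TensorFlow

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
```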
Hi, may I ask a question? During training, I see the loss and average loss values (for example, Loss: -2.036, W.Decay: 0.056, Avg.Loss: -1.990), but why is the loss NaN in the pink area? I have checked that my data does not contain any infs or NaNs. Also, if we get an Avg.Loss value, does that mean the training is successful?