Simple negative binomial inference

Dear BayesFlow team,

I am trying to get a feel for how to work with the codebase by using a very simple example related to the real inference problem I want to work on. The exercise consists of training a neural network (NN) to perform inference over the parameters of a negative binomial distribution. There are many moving parts and options that I do not understand, and none of the examples in the documentation helped me get over this hurdle. Here is the basic setup:

After the imports, I define my prior function as

def prior_fn():
    # Prior on the negative binomial parameter p
    p = RNG.beta(1, 10)
    # Prior on log of r parameter
    logr = RNG.normal(0, 1)

    # Return the prior
    return np.float32(np.array([p_hat, logr]))

with the corresponding simulation.Prior object

prior = bf.simulation.Prior(
    prior_fun=prior_fn,
    param_names=param_names,
)
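
For completeness, the code above assumes imports and definitions along these lines (the seed and the parameter labels are just the ones I happen to use):

import numpy as np
import bayesflow as bf

# Random number generator shared by the prior and likelihood functions
RNG = np.random.default_rng(seed=42)

# Labels for the two inferred parameters
param_names = [r"$p$", r"$\log r$"]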

Next, I define the likelihood function, in which I convert the integer outputs to float32. For now, I’m using a fixed number of observations (100), but later I would like to understand how to train the NN with a varying number of observations.

def likelihood_fn(params, n_obs=100):
    # Unpack parameters
    p = params[0]
    r = np.exp(params[1])

    # Sample from the negative binomial distribution
    u = RNG.negative_binomial(r, p, size=n_obs)

    # Add an extra dimension to make the output a 3D tensor
    u = np.expand_dims(u, axis=1)

    return np.float32(u)
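
(As an aside, if I read the documentation correctly, a varying number of observations could later be handled with a context generator, roughly along the lines of the untested sketch below; for now, though, I keep n_obs fixed at 100.)

# Untested sketch: draw a random number of observations for each batch
def random_num_obs(min_obs=50, max_obs=500):
    return RNG.integers(low=min_obs, high=max_obs + 1)

# The context generator would be passed to the Simulator below via
# context_generator=context_gen, and its output would arrive as the
# second argument of likelihood_fn (i.e., as n_obs)
context_gen = bf.simulation.ContextGenerator(non_batchable_context_fun=random_num_obs)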

Again, I define the corresponding Simulator and GenerativeModel objects

# Define likelihood simulator function for BayesFlow
simulator = bf.simulation.Simulator(simulator_fun=likelihood_fn)

# Build generative model
model = bf.simulation.GenerativeModel(prior, simulator)

I define the summary network as a DeepSet (but I have also tried a SetTransformer without much success)

# Define summary network as a DeepSet
summary_net = bf.networks.DeepSet(summary_dim=4)

For the inference NN, I don’t have a good intuition about which arguments I should modify. So far, one of my failed attempts looks something like this:

# Define the conditional invertible network with affine coupling layers
inference_net = bf.inference_networks.InvertibleNetwork(
    num_params=prior(1)["prior_draws"].shape[-1],
    num_coupling_layers=6,
    coupling_design="affine",
    permutation="learnable",
    use_act_norm=True,
    coupling_settings={
        "mc_dropout": True,
        "dense_args": dict(units=32, activation="relu"),
    }
)

I put everything together

# Assemble the amortizer that combines the summary network and the
# inference network
amortizer = bf.amortizers.AmortizedPosterior(
    inference_net, summary_net,
)

# Assemble the trainer with the amortizer and generative model
trainer = bf.trainers.Trainer(amortizer=amortizer, generative_model=model)

and train the model using

# Train the model
history = trainer.train_online(
    epochs=20,
    iterations_per_epoch=128,
    batch_size=64,
    validation_sims=200,
)

However, as shown in the figure below, no matter what I try (admittedly with very little understanding of what each argument does), the loss explodes to nonsensical values and then quickly comes back down.

I am sure I am doing something naively wrong, but I have struggled with it long enough to ask for help. I hope you can point me to my mistake.

Thanks in advance for your invaluable help!

Dear Manuel,

Welcome to the BayesFlow forum! Thanks for providing code along with your question. I reproduced your setup (the only change: using the default DeepSet and InvertibleNetwork architectures for simplicity), and while I observed initial loss spikes, training nevertheless quickly converged to a stable solution with nice diagnostics:



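For reference, by "default architectures" I just mean something like

summary_net = bf.networks.DeepSet()
inference_net = bf.inference_networks.InvertibleNetwork(
    num_params=prior(1)["prior_draws"].shape[-1],
)
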
I think the issue causing the exploding loss at the beginning of training is the extreme outliers produced by your generative model (e.g., standard deviations around 13,000 while means are around 200; see also the worse recovery for higher values of r). During training, some minibatches then contain extreme values. You could stabilize this by standardizing or normalizing the input data via a custom configurator (see here). However, if you use simulation-based estimation of, for example, the data mean and std so that the approach generalizes to more complex settings, this becomes a bit tricky and requires very large simulation sizes due to the tail behavior of your simulator. So if the model itself is modifiable, constraining the prior to produce less extreme tail behavior would be the easiest solution here.
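
Just to sketch what the configurator route could look like (take this with a grain of salt: the dictionary keys assume the standard forward-dict and amortizer conventions, and the presimulation size is arbitrary and would likely need to be much larger given the heavy tails):

# Presimulate to estimate the data mean and std for standardization
ref_sims = model(batch_size=2000)
data_mean = ref_sims["sim_data"].mean()
data_std = ref_sims["sim_data"].std()

def configurator(forward_dict):
    # Standardize the simulated data and hand everything to the amortizer
    data = (forward_dict["sim_data"] - data_mean) / data_std
    return {
        "summary_conditions": data.astype(np.float32),
        "parameters": forward_dict["prior_draws"].astype(np.float32),
    }

trainer = bf.trainers.Trainer(
    amortizer=amortizer,
    generative_model=model,
    configurator=configurator,
)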

One small note: prior_fn currently returns p_hat, but only p is defined in the function body.

Hope that helps!
Lasse


Hey Lasse,

Thank you so much for such a quick response! This is a total game-changer. I tried changing my prior to something more reasonable, which now seems to work for this simple example.

Now, I’m ready to try the real inference I’m after. I might come back to the forum if I get stuck again. But this is already incredibly promising!

Thanks again for your help, and congratulations to the whole team on the library. The idea is really neat, and I’m glad to see such a principled application of neural networks to such an important problem.


Hi Lasse and the rest of the BayesFlow team,

I started playing with the package again and ran into another difficulty that I cannot explain:

First, to avoid the extreme outliers from my last post, I changed the priors to something I think is more reasonable. My prior function now looks like this:

def prior_fn():
    # Prior on p_hat for negative binomial parameter
    p_hat = RNG.beta(1, 10)
    # Prior on the r parameter
    r = RNG.gamma(3, 1/5)

    # Return the prior
    return np.float32(np.array([p_hat, r]))

My likelihood function remains the same (I’ll come back to the parameter n_obs as my question relates to this):

def likelihood_fn(params, n_obs=n_obs):
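    # NOTE: n_obs is a module-level variable set before this definition, so the
    # default value is whatever n_obs was when the function was defined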
    # Unpack parameters
    p_hat = params[0]
    r = params[1]

    # Sample from the negative binomial distribution
    u = RNG.negative_binomial(r, p_hat, size=n_obs)

    # Add an extra dimension to make the output a 3D tensor
    u = np.expand_dims(u, axis=1)

    return np.float32(u)

I then define a summary network with a pretty generous summary dimension

# Define summary network as a Deepset
summary_net = bf.networks.DeepSet(summary_dim=128)

and an invertible network

# Define the conditional invertible network with default settings
inference_net = bf.inference_networks.InvertibleNetwork(
    num_params=prior(1)["prior_draws"].shape[-1],
)

After defining the rest of the necessary objects, I train the network as follows:

# Train the model
history = trainer.train_online(
    epochs=100,
    iterations_per_epoch=32,
    batch_size=128,
    validation_sims=128,
)

With this setup in place, here is my question:

When I set n_obs = 100, I get very good results. The latent dimensions seem to be uncorrelated

The rank statistics look really good


There seems to be a decent posterior contraction

And the inferences are generally okay

But when I set n_obs=10_000, which is what my real application will require (or even larger numbers), although the parameter recovery is basically perfect

And there is a real posterior contraction

The rank statistics seem to indicate that “there is something wrong”


Do you have any idea how to interpret this result? Could it be fixed with some changes to my model’s architecture?

Thank you in advance for all of your help!

Hey Manuel,

Great to see how your project is progressing! We have also observed such calibration patterns in some other modeling projects. We are still discussing more principled explanations of the phenomenon, but our practical interpretation was recently put into words in this preprint (in the description of Figure 7): “As we observe particularly strong posterior contraction for this parameter, even minor deviations in posterior coverage would be detected as miscalibration.”

Best,
Lasse


I’m wondering if there has been any more thought or discussion on this recently. I am running into the same phenomenon with a relatively high-dimensional ODE model for which I am learning (for now) a single parameter. The parameter is recovered very well, but shows the same unimodal rank histogram. From my understanding, this is a signature of an overdispersed posterior, i.e., one that is wider than the true posterior. This is probably not a huge problem for scientific parameter recovery, since such posteriors are at least conservative. Still, it has left me wondering: where does this problem originate? Can it be alleviated in any way? Would it have any detrimental effect on the detection of model misspecification?

The effect on misspecification detection would be hard to predict in general, at least in my mind. I believe the problem originates in the networks not being able to learn extreme posterior contractions, which require large changes in volume, but it is possible that different kinds of posterior networks (e.g., flow matching vs. coupling flows) have very different properties in this respect.