Different ECDF outputs between BayesFlow 1 and 2

Hi all,

I am currently running a drift diffusion model focused on post-decisional evidence accumulation and am having trouble reconciling different outcomes between BF versions 1 and 2.

I understand that these processes aren’t deterministic and there will always be some differences, but in BF v2 the rank ECDFs almost always cross the 95% confidence bands across different posterior samples, whereas they don’t in BF v1. Furthermore, in BF v2 they are quite volatile for the boundary separation (a) and non-decision time (ter) parameters.

I’m wondering what the reasons for this could be. It’s also worth noting that recovery is much stronger in BF v2 than in BF v1 across all parameters, especially for v_ratio/efficiency (~0.5 vs ~0.7). My hunch is that I didn’t do a good job of translating the configurator from BF v1 into the adapter in BF v2. Another possible reason is that the former runs on a coupling flow while the latter uses flow matching, although I’m not knowledgeable enough to know whether that would make a big difference.

BF v1

import numpy as np
from tensorflow.keras.utils import to_categorical

# A configurator converts the simulator output into the format the neural networks expect
def configurator(forward_dict):
    
    # Prepare placeholder dict
    out_dict = {}

    # Extract simulated response times
    data = forward_dict["sim_data"]

    # Batchable context, shaped (batch, num_obs, 1)
    context  = np.array(forward_dict["sim_batchable_context"])[..., None].astype(np.float32)
    # Map responses and stimuli from {-1, 1} to {0, 1}
    resp = ((data[:, :, 1] + 1.0) * 0.5).astype(np.float32)
    stim = ((data[:, :, 3] + 1.0) * 0.5).astype(np.float32)

    rt_resp  = data[:, :, 0:1].astype(np.float32)
    cat_resp = to_categorical(resp, num_classes=2)
    cat_acc  = to_categorical(data[:, :, 2], num_classes=2)
    cat_stim = to_categorical(stim, num_classes=2)
    ord_conf = data[:, :, 4:5].astype(np.float32)
    post_rt  = data[:, :, 5:6].astype(np.float32)

    out_dict["summary_conditions"] = np.concatenate([rt_resp, cat_resp, cat_acc, cat_stim, ord_conf, post_rt, context], axis=-1)

    vec_num_obs = forward_dict["sim_non_batchable_context"] * np.ones((data.shape[0], 1))
    out_dict["direct_conditions"] = np.sqrt(vec_num_obs).astype(np.float32)
    out_dict["parameters"] = forward_dict["prior_draws"].astype(np.float32)
    return out_dict

BF v2

adapter = (
    bf.Adapter()
    .broadcast("num_obs", to = "rt")
    .as_set(["rt", "resp", "acc", "stim", "conf_resp", "post_rt"])
    .sqrt("num_obs")
    .shift(["resp", "stim"], by = 1.0)
    .scale(["resp", "stim"], by = 0.5)
    # .one_hot(["resp", "acc", "stim"], num_classes = 2) 
    .concatenate(["v", "a", "ter", "v_bias", "a_bias", "m_bias", "v_ratio"], into = "inference_variables")
    .concatenate(["rt", "resp", "acc", "stim", "conf_resp", "post_rt"], into = "summary_variables")
    .rename("num_obs", "inference_conditions")
    .convert_dtype("float64", "float32"))

# If one_hot is run, calling adapter(sim_draws) raises: IndexError: arrays used as indices must be of integer (or boolean) type

I’ve done my best to make sure the rest of the code functions as similarly as possible.

Hi,
there could be a number of different things going on. To check the role of the adapter, you could compare the outputs of the configurator and the adapter for the same data to see whether they are identical, and if not, where they differ. Changes in the network defaults could play a role as well; that would be my suspicion. It would be best if you could post the recovery plot you obtain with v1 (plus a few more details on the networks you used) and, if possible, a self-contained script with the v2 code.
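To make that comparison concrete, here is a minimal sketch of a key-by-key comparison helper (the function name and the exact key names are illustrative; you would map, e.g., the configurator's "summary_conditions" to the adapter's "summary_variables" before comparing):

```python
import numpy as np

def compare_outputs(dict_a, dict_b, atol=1e-6):
    """Compare two dicts of numpy arrays key by key; return a status per key."""
    report = {}
    for key in sorted(set(dict_a) | set(dict_b)):
        a, b = dict_a.get(key), dict_b.get(key)
        if a is None or b is None:
            report[key] = "missing in one output"
        elif a.shape != b.shape:
            report[key] = f"shape mismatch {a.shape} vs {b.shape}"
        elif not np.allclose(a, b, atol=atol):
            report[key] = f"max abs diff {np.abs(a - b).max():.3g}"
        else:
            report[key] = "match"
    return report
```

Running this on the configurator and adapter outputs for the same simulated batch should show immediately whether a key is missing, reshaped, or numerically different.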

Regarding the one-hot encoding, you may have to ensure that resp and stim have an integer dtype rather than float, and that their values run from 0 to (num_classes - 1). If this doesn’t fix it, please reach out.
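For reference, that IndexError is exactly what NumPy raises when a float array is used to index a one-hot lookup table; a minimal sketch of the cast that avoids it (the resp values here are made up):

```python
import numpy as np

# Responses after the shift/scale from {-1, 1} to {0, 1}: still float dtype
resp = np.array([0.0, 1.0, 1.0, 0.0])

# np.eye(2)[resp] would raise:
# IndexError: arrays used as indices must be of integer (or boolean) type

# Casting to an integer dtype first makes the one-hot indexing work
one_hot = np.eye(2, dtype=np.float32)[resp.astype(np.int64)]
```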

We have recently seen a few estimation problems due to our defaults for the DeepSet (see for example here). If you are using it, that might also be a problem here. But I don’t know the exact changes between v1 and v2 in this regard.

Those would be my first thoughts; with a few more details we might be able to give more hints on how to resolve this.

It would be helpful to see the other settings and diagnostics as well. Note that you don’t need one-hot encoding for binary variables, since a binary vector already contains all the information.

Hi all,

Sorry for the late reply, and thank you very much for the advice.

I’ve checked the role of the adapter using an identical simulator output and the outcome is the same, so it’s likely not that. Regarding one-hot encoding, I also thought one-hot encoding binary variables would be pointless, but BF v1 returned an error when the binary outcomes weren’t one-hot encoded, so I figured the same might apply to BF v2. I managed to get one-hot encoding to work in v2, but it didn’t fix the issue.

Regarding the networks, I’m currently using a SetTransformer as the summary network in both versions, specifically:

BFv1

summary_net = bf.networks.SetTransformer(input_dim = 10, summary_dim = 32, name = "meta_ddm_summary")

inference_net = bf.networks.InvertibleNetwork(
    num_params = len(prior.param_names),
    coupling_settings = {"dense_args": dict(kernel_regularizer=None), "dropout": False},
    num_coupling_layers = 10, 
    coupling_design = "spline",
    name = "meta_ddm_inference")

BFv2

summary_network = bf.networks.SetTransformer(summary_dim = 32, dropout = 0.0)

inference_network = bf.networks.FlowMatching(depth = 10, integrate_kwargs={"method": "euler", "steps": 100}) 

# inference_network = bf.networks.CouplingFlow(depth = 10, transform = "spline") # results in nan loss

I also misread the recovery plot -.- I confused the r2 and r on the v1 plot. The results are actually quite similar, although generally stronger in v2 for most parameters.

The difference in rank ECDF still persists as shown above.

I don’t have a local GPU, so it will take a while for me to set up and verify a self-contained script with the v2 code (I used Colab). In the meantime, I have uploaded the notebooks and checkpoints for both v1 and v2, as well as the Dockerfiles to run the environments, if needed.

Thanks a lot for the additional details. I sometimes observe that different inference network types lead to different calibration, so I would assume the inference network is at the heart of the differences you see. There are a few things you can try:

  • How does the coupling flow with transform="affine" work for you? I’m also not sure why the nan loss arises here; could you check whether it persists when you reduce the depth?
  • Increase the accuracy of the integration for FlowMatching by setting "method": "rk45". This will increase the integration time by roughly a factor of 4 but should give more accurate samples. However, the effect on calibration is hard to predict; it might also increase some forms of miscalibration.
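As a sketch, the second suggestion would amount to swapping the integrator in the FlowMatching call from the question (the other settings are kept as in the original snippet):

```python
# Same FlowMatching setup as above, with the suggested rk45 integrator
# in place of fixed-step euler (an adaptive method chooses its own steps)
inference_network = bf.networks.FlowMatching(
    depth=10,
    integrate_kwargs={"method": "rk45"},
)
```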

I think what you provided is fine for now, so don’t worry about setting up a self-contained script for the moment.

A couple of further points:

  • version 2.0.7 (released yesterday) improved the defaults for sampling with flow matching
  • nan losses are, nine times out of ten, due to problems in the simulator (e.g., rare extreme values, large numerical values)
  • flow matching generally needs longer training than coupling flows (up to a factor of 5–10)
  • miscalibration is common for easily recoverable parameters, as the Bayesian update is very large and calibration checks are dimensionless by design (e.g., a difference between a true posterior standard deviation of 0.01 and an approximate one of 0.02 appears just as grave as a difference between 0.1 and 0.2; this is a property, not a bug)
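The "dimensionless" point can be made concrete with a toy simulation (a hypothetical Gaussian setup, not BayesFlow code): if the approximate posterior standard deviation is off by the same factor, the calibration ranks come out identically whether the underlying scale is 0.01 or 0.1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_draws, sd_ratio = 1000, 99, 2.0

# Shared unit-scale draws, so only the scale factor differs between runs
z_true = rng.standard_normal(n_sims)             # true parameters, unit scale
z_post = rng.standard_normal((n_sims, n_draws))  # approx posterior draws

def ranks(scale):
    theta_true = scale * z_true
    theta_post = scale * sd_ratio * z_post       # posterior sd too wide by sd_ratio
    # Rank of the true value among the posterior draws (0..n_draws)
    return (theta_post < theta_true[:, None]).sum(axis=1)

# The miscalibration pattern is identical regardless of absolute scale
assert np.array_equal(ranks(0.01), ranks(0.1))
```

Because scaling both the true values and the draws by the same positive constant never changes which draws fall below the true value, the rank ECDF sees only the *relative* mismatch, never the absolute size of the posterior.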