Handling missing data in summary and inference networks

Hi,

Thanks for posting the question on the BayesFlow Forums; I appreciate it.

There has been scholarly work on dealing with missing data in Neural Posterior Estimation (NPE): https://www.biorxiv.org/content/10.1101/2023.01.09.523219v1

In a nutshell, they argue for encoding missing data with a specific (impossible) value and adding a missingness mask. For instance, if your data are known to be positive real-valued, y \in \mathbb{R}, y > 0, you would encode missing entries as y = -5 and add a mask (an additional data dimension) that contains m = 0 for observed values and m = 1 for missing ones. Importantly, you also need to include missing data in the neural network training phase. You can achieve this with a configurator, so your current simulator can remain as-is.
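To make this concrete, here is a minimal sketch of such a configurator using plain NumPy. The dictionary keys (`sim_data`, `prior_draws`) and the shapes are assumptions for illustration; adapt them to whatever your simulator actually returns.

```python
import numpy as np

FILL_VALUE = -5.0  # "impossible" value for strictly positive data


def configurator(forward_dict, missing_prob=0.2, rng=None):
    """Sketch: inject missing values and a missingness mask during training.

    Assumes forward_dict["sim_data"] has shape (batch, n_obs, n_dims)
    and forward_dict["prior_draws"] holds the parameter draws.
    """
    rng = np.random.default_rng() if rng is None else rng
    data = forward_dict["sim_data"].astype(np.float32).copy()

    # Randomly mark observations as missing for this training batch
    missing = rng.random(data.shape[:-1]) < missing_prob  # (batch, n_obs)

    # Encode missing entries with the impossible value
    data[missing] = FILL_VALUE

    # Mask channel: m = 0 for observed, m = 1 for missing (as described above)
    mask = missing.astype(np.float32)[..., None]

    return {
        "summary_conditions": np.concatenate([data, mask], axis=-1),
        "parameters": forward_dict["prior_draws"].astype(np.float32),
    }
```

Because the missingness is applied in the configurator, the simulator itself stays untouched, and the network sees a different missingness pattern in every batch.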

I have personally used this technique in the context of multimodal NPE, where we additionally want to integrate data from heterogeneous sources. See Experiment 2 in the paper Fuse It or Lose It: Deep Fusion for Multimodal Simulation-Based Inference (https://arxiv.org/abs/2311.10671).
As described above, I use a normal simulator and handle all missing data in the configurator. The code is currently closed-source, but we will release it in the future. In the meantime, feel free to reach out and I’m happy to share it with you.

Cheers,
Marvin
