Handling missing data in summary and inference networks

Hi, I’m wondering whether it would be of interest to add the ability for sequential or invariant networks to handle missing data. Some of our subjects have only partial data, and it would be a huge waste to simply remove them. From the documentation, handling missing data doesn’t seem straightforward. If the maintainers feel this would be useful, I’m happy to help with the implementation.

I currently have a simple workaround. Part of my data is already categorical, so I simply encode missingness as an additional category. This might not work well if you expect little missing data, since the network would rarely see the extra category. In my data, 60% of the subjects are missing at least a third of their data.
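For the categorical case, a minimal sketch of this workaround could look as follows (the function name and the convention of marking missing entries with -1 are illustrative assumptions, not BayesFlow API):

```python
import numpy as np

def encode_missing_category(codes, n_categories):
    """Map missing entries (here encoded as -1) to an extra category
    index n_categories, so the network treats missingness as its own
    level of the categorical variable."""
    codes = np.asarray(codes)
    return np.where(codes < 0, n_categories, codes)

# Responses from a 3-category variable; -1 marks a missing response
codes = np.array([0, 2, -1, 1])
print(encode_missing_category(codes, n_categories=3))  # [0 2 3 1]
```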

For continuous data, simply filling in the mean across subjects/conditions could be a first step.
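A quick sketch of such mean imputation, assuming a subjects-by-conditions matrix with NaNs for missing values (names are illustrative):

```python
import numpy as np

def mean_impute(x):
    """Replace NaNs in each column (one column per condition)
    with that column's mean over the observed subjects."""
    col_means = np.nanmean(x, axis=0)
    idx = np.where(np.isnan(x))
    x = x.copy()
    x[idx] = np.take(col_means, idx[1])
    return x

x = np.array([[1.0, np.nan],
              [3.0, 4.0]])
print(mean_impute(x))  # NaN in column 1 replaced by that column's mean, 4.0
```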


Thanks for posting the question on the BayesFlow Forums, I appreciate it.

There has been scholarly work on dealing with missing data in Neural Posterior Estimation (NPE): https://www.biorxiv.org/content/10.1101/2023.01.09.523219v1

In a nutshell, they argue for encoding the missing data with some specific (impossible) value and adding a missingness mask. For instance, if your data is known to be positive real-valued, y \in \mathbb{R}, y > 0, you would encode missing data as y = -5 and additionally add a mask (an extra data dimension) that contains m=0 for observed values and m=1 for missing ones. Importantly, you also need to include missing data in the NN training phase. You can achieve this with a configurator; this way, your current simulator can remain as-is.
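A minimal sketch of such a configurator step, assuming numpy arrays of simulated observations; the function name, shapes, and the missingness rate are illustrative assumptions:

```python
import numpy as np

def mask_and_fill(data, fill_value=-5.0, missing_rate=0.3, rng=None):
    """Randomly mask entries during training, replace them with an
    impossible fill value, and append a binary missingness indicator
    as an extra data dimension (m=0: observed, m=1: missing)."""
    rng = np.random.default_rng(rng)
    mask = rng.random(data.shape) < missing_rate  # True where missing
    filled = np.where(mask, fill_value, data)
    # Stack the fill-encoded data and the mask along the last axis
    return np.concatenate(
        [filled[..., None], mask[..., None].astype(np.float32)], axis=-1
    )

# Example: batch of 8 positive-valued time series of length 50
y = np.random.default_rng(1).exponential(size=(8, 50))
out = mask_and_fill(y, missing_rate=0.3, rng=1)
print(out.shape)  # (8, 50, 2)
```

At inference time, the same encoding is applied to the actually observed data, with the mask reflecting which entries are truly missing.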

I have personally used this technique in the context of multimodal NPE, where we additionally want to integrate data from heterogeneous data sources. See Experiment 2 in the paper: [2311.10671] Fuse It or Lose It: Deep Fusion for Multimodal Simulation-Based Inference
As described above, I use a normal simulator and handle all missing data in the configurator. The code is currently closed-source, but we’ll release it in the future. In the meantime, feel free to reach out and I’m happy to share the code with you.



Very helpful, thank you Marvin. Would you use the same strategy for time series data with different sampling frequencies across variables (e.g., x measured weekly and y daily), placing the ‘-5’ values on days without an observation for x so that everything has the same dimensionality? Or are there alternative/more standard ways to pass multivariate time series with different dimensions as input to the summary network?

Hi Hazhir, I believe we have not come across such an application yet, but it is definitely a very interesting one! The approach you propose may actually work out of the box. If there are not many time series (<= 3), you could also feed each univariate time series into a separate summary network and combine the resulting summaries at the end. Let me know which approach works best for you and whether you have any questions implementing the latter.
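The per-series approach could be sketched like this. The "summary networks" below are toy stand-ins (simple pooled statistics plus a fixed random projection) rather than learned networks, and all names are illustrative; the point is the structure: one summary per univariate series, concatenated before the inference network:

```python
import numpy as np

def summary_net(series, out_dim=4, seed=0):
    """Stand-in for a learned summary network: pooled statistics
    (mean, std, min, max) followed by a fixed linear projection."""
    stats = np.stack(
        [series.mean(-1), series.std(-1), series.min(-1), series.max(-1)],
        axis=-1,
    )
    w = np.random.default_rng(seed).normal(size=(stats.shape[-1], out_dim))
    return stats @ w

def late_fusion(series_list):
    """One summary network per univariate series; the resulting
    summaries are concatenated for the inference network."""
    summaries = [summary_net(s, seed=i) for i, s in enumerate(series_list)]
    return np.concatenate(summaries, axis=-1)

# Three series with different lengths (e.g., daily vs. weekly sampling)
rng = np.random.default_rng(0)
x_daily = rng.normal(size=(8, 70))
y_weekly = rng.normal(size=(8, 10))
z_monthly = rng.normal(size=(8, 3))
fused = late_fusion([x_daily, y_weekly, z_monthly])
print(fused.shape)  # (8, 12)
```

Because each series gets its own summary network, the series never need to share a time axis or dimensionality.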



Thank you Stefan, very helpful. I assume you are pointing to the fusion methods in your Fuse It or Lose It paper? Why would that method be limited to <= 3 different time series? My quick reading was that late fusion is quite scalable and performs reasonably well. Also, any insights on how to think about the tradeoffs between the formal fusion approach and the workaround Marvin proposed? In organizational data we often come across different measurement frequencies (e.g., quarterly, monthly, and weekly data for different firm variables), and for something like COVID, some data (e.g., total cases) are often daily while others (e.g., deaths by age) are tracked only weekly.

Yes, late fusion scales linearly with the number of sources, but the requirement of one summary network per source may become demanding with dozens of sources (depending on hardware, of course). I suppose it boils down to testing both approaches and evaluating which works best, as I see hardly any compelling a priori reason why one should outperform the other. Let us know which approach works for you, and we should also think about creating a small tutorial for fusion.