Handling missing data in summary and inference networks

Hi, I’m wondering whether there is any interest in adding the ability for sequential or invariant summary networks to handle missing data. Some of our subjects have only partial data, but it would be a huge waste to simply remove them. From the documentation, handling missing data is not straightforward. If the maintainers feel this would be useful, I’d be happy to help implement it.

I currently have a simple workaround. Part of my data is already categorical, so I simply encode missingness as an additional category. This might not work well if you don’t expect a lot of missing data. In my data, 60% of the subjects are missing at least a third of their observations.

For continuous data, simply filling in the mean across subjects/conditions could be a first step, as in the sketch below.
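Here is a minimal pandas sketch of both workarounds; the column names and values are made up for illustration and are not tied to any BayesFlow interface:

```python
import numpy as np
import pandas as pd

# Toy data with one categorical and one continuous variable
df = pd.DataFrame({
    "condition": ["A", "A", "B", "B"],
    "choice": ["left", None, "right", "left"],   # categorical, with a missing entry
    "rt": [0.42, np.nan, 0.55, np.nan],          # continuous, with missing entries
})

# 1) Categorical: treat missingness as an additional category
df["choice"] = df["choice"].fillna("missing")

# 2) Continuous: fill with the per-condition mean as a first step
df["rt"] = df.groupby("condition")["rt"].transform(lambda s: s.fillna(s.mean()))
```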

Hi,

Thanks for posting the question on the BayesFlow Forums; I appreciate it.

There has been scholarly work on dealing with missing data in Neural Posterior Estimation (NPE): https://www.biorxiv.org/content/10.1101/2023.01.09.523219v1

In a nutshell, they argue for encoding missing data with a specific (impossible) value and adding a missingness mask. For instance, if your data are known to be positive real-valued ($y \in \mathbb{R}$, $y > 0$), you would encode missing values as $y = -5$ and additionally add a mask (an extra data dimension) containing $m = 0$ for observed entries and $m = 1$ for missing ones. For this to work, it’s important to also include missing data in the neural network training phase. You can achieve this with a configurator (see the sketch below) – this way, your current simulator can remain as-is.
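A minimal NumPy sketch of that masking step, assuming missing observations are marked as NaN in a (batch, n_obs, n_dims) array; the function name and fill value are illustrative, not part of the BayesFlow API:

```python
import numpy as np

def mask_missing(data, fill_value=-5.0):
    """Replace NaNs with an impossible value and append a missingness mask."""
    missing = np.isnan(data)
    filled = np.where(missing, fill_value, data)
    # m = 0 for observed entries, m = 1 for missing ones
    mask = missing.astype(np.float32)
    # Append the mask as extra data dimensions, so the network sees both
    # the filled values and which of them were originally missing
    return np.concatenate([filled, mask], axis=-1).astype(np.float32)
```

Calling such a function inside the configurator means the simulator itself never has to produce missing values.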

I have personally used this technique in the context of multimodal NPE, where we additionally want to integrate data from heterogeneous sources. See Experiment 2 in the paper “Fuse It or Lose It: Deep Fusion for Multimodal Simulation-Based Inference” (arXiv:2311.10671).
As described above, I use a normal simulator and handle all missing data in the configurator. The code is currently closed-source, but we’ll release it in the future. In the meantime, feel free to reach out to me and I’m happy to share the code with you.

Cheers,
Marvin

Very helpful, thank you Marvin. Would you use the same strategy for time series data with different frequencies for different variables (e.g., x measured weekly and y daily, where you could put the ‘-5’ values on days for which variable x is not observed) to get them into the same dimensionality? Or are there alternative/more standard ways to pass multi-dimensional time series with different dimensions to the summary network?

Hi Hazhir, I believe we have never come across such an application, but it is definitely a very interesting one! The approach you propose may actually work out of the box. If there are not many time series (<= 3), you could also feed each univariate time series into a separate summary network and combine the resulting summaries at the end (a minimal sketch follows below). Let me know which approach works best for you and whether you have any questions implementing the latter.
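For the second option, here is a minimal sketch in plain PyTorch rather than any specific BayesFlow class; layer types, sizes, and names are purely illustrative:

```python
import torch
import torch.nn as nn

class LateFusionSummary(nn.Module):
    """One recurrent summary network per time series, fused by concatenation."""

    def __init__(self, summary_dim=8):
        super().__init__()
        self.net_x = nn.GRU(input_size=1, hidden_size=summary_dim, batch_first=True)
        self.net_y = nn.GRU(input_size=1, hidden_size=summary_dim, batch_first=True)

    def forward(self, x_weekly, y_daily):
        # x_weekly: (batch, T_x, 1), y_daily: (batch, T_y, 1); lengths may differ
        _, h_x = self.net_x(x_weekly)  # final hidden state as the summary of x
        _, h_y = self.net_y(y_daily)   # final hidden state as the summary of y
        # Concatenated summaries -> (batch, 2 * summary_dim)
        return torch.cat([h_x[-1], h_y[-1]], dim=-1)
```

The concatenated summary vector would then be passed on to the inference network as usual.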

Cheers,
Stefan

Thank you Stefan, very helpful. I assume you are pointing to the fusion methods in your “Fuse It or Lose It” paper? Why would that method be limited to <= 3 different time series? My quick reading was that late fusion is rather scalable and performs reasonably well. Also, do you have any insights on how to conceptually think about the trade-offs between the formal fusion approach and the workaround Marvin proposed? In dealing with organizational data we often come across different measurement frequencies (e.g., quarterly, monthly, and weekly data for different firm variables), but also, for things like COVID, some data (e.g., total cases) are often daily while others (e.g., deaths by age) are only tracked weekly.

Yes, late fusion scales linearly with the number of sources, but the requirement of one summary network per source may become demanding with dozens of sources (depending on the hardware, of course). I suppose it all boils down to testing both approaches and evaluating which works best, as I see hardly any compelling a priori reason why one should outperform the other. Let us know which approach works for you; we should also think about creating a small tutorial on fusion.
