How to define coupling type?

I usually use affine coupling layers to train the model, which is good enough for my problem. When I tried neural splines, training seemed more difficult and took much more time. The papers I read on neural spline flows say that spline flows have higher expressiveness, improved invertibility, and better density estimation than affine layers, so I had expected parameter estimation with splines to outperform affine. Contrary to my expectations, it did not. Could anyone offer some insights or suggestions?
In addition, I also see the interleaved option, which alternates between affine and spline. I do not fully understand the use of a combination of affine and spline. Is there any evidence showing the advantages of using both?
I also have a bizarre result. In some training runs, I got fairly good recovery plots (e.g., R2 over 0.9) but a bad ECDF, with curves falling outside the confidence bands. Interestingly, training for longer seems to make the issue worse rather than better. Can anyone shed light on this phenomenon or offer any suggestions to address it?

Splines may not always work best; the original papers used somewhat contrived examples. Our experience typically shows that:

  1. There is hardly any difference for simple low-dimensional problems.
  2. Splines are better for separating modes in low-dimensional problems (e.g., two moons).
  3. Affines may be better suited for higher-dimensional problems (e.g., MNIST-like).
  4. Splines are easier to overfit, especially for offline learning with small data.

Since your setting has a rather low simulation budget, I suspect that overfitting may be at play. You could inspect validation loss curves for hints at overfitting and increase regularization if that’s the case.
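To make the loss-curve check concrete, here is a minimal sketch in plain Python. It assumes you already have per-epoch training and validation losses as lists of floats; `overfitting_report`, its `patience` argument, and the example numbers are hypothetical and not part of BayesFlow.

```python
def overfitting_report(train_loss, val_loss, patience=3):
    """Summarize per-epoch loss curves given as plain lists of floats."""
    # Epoch with the lowest validation loss.
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    epochs_since_best = len(val_loss) - 1 - best_epoch
    # Flag the run if validation loss has not improved for `patience` epochs
    # while training loss kept decreasing.
    suspected = (
        epochs_since_best >= patience
        and train_loss[-1] < train_loss[best_epoch]
    )
    return {
        "best_epoch": best_epoch,
        "final_gap": val_loss[-1] - train_loss[-1],
        "overfitting_suspected": suspected,
    }


# Made-up curves for illustration: training loss keeps falling,
# validation loss turns around after epoch 3.
train = [2.0, 1.5, 1.1, 0.8, 0.6, 0.4, 0.3]
val = [2.1, 1.6, 1.3, 1.2, 1.3, 1.5, 1.8]
report = overfitting_report(train, val)
print(report)
```

A growing gap between the two curves after the best epoch is the pattern to look for; if it shows up, stronger regularization or stopping at the best epoch are the usual first things to try.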


Splines are indeed more expressive than affine couplings, but they take longer to compute because each input value has to be assigned to a spline bin, which has logarithmic complexity in the number of bins (compared to constant for affine). There are also quite a few more computations involved than the single multiplication and addition of an affine coupling.
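A rough sketch of where the extra cost comes from. This is not the BayesFlow implementation: a piecewise-linear map stands in for the rational-quadratic spline, and the knot values are made up. The point is only that the spline needs a bin lookup (binary search) before it can do any arithmetic.

```python
from bisect import bisect_right


def affine_forward(x, scale, shift):
    # One multiplication and one addition, independent of any bin count.
    return scale * x + shift


def find_bin(x, knots):
    # Binary search over sorted knot positions: O(log K) for K bins.
    # Clamped so that values at or beyond the outer knots use the edge bins.
    return min(max(bisect_right(knots, x) - 1, 0), len(knots) - 2)


def piecewise_linear_forward(x, knots_x, knots_y):
    # Stand-in for a spline transform: locate the bin, then interpolate in it.
    k = find_bin(x, knots_x)
    t = (x - knots_x[k]) / (knots_x[k + 1] - knots_x[k])
    return knots_y[k] + t * (knots_y[k + 1] - knots_y[k])


# Hypothetical knots; in a real coupling layer a network would predict them.
knots_x = [-3.0, -1.0, 0.0, 1.0, 3.0]
knots_y = [-3.0, -2.0, 0.0, 2.0, 3.0]

print(affine_forward(0.5, scale=2.0, shift=1.0))         # 2.0
print(find_bin(0.5, knots_x))                            # 2
print(piecewise_linear_forward(0.5, knots_x, knots_y))   # 1.0
```

A real rational-quadratic spline does more work per bin (it also uses knot derivatives), so the gap to the affine transform is larger than this toy suggests.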

In general, you don’t need as many spline couplings as you need affine couplings to achieve the same level of expressivity, so the effect somewhat cancels out.

As for the interleave option, I am unaware of any public research on this, but in my own research I have found this combination to generally yield a good trade-off between compute and expressivity. Using a combination may also, depending on the implementation of the spline, be necessary to model non-zero-centered distributions.
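To illustrate the point about non-zero-centered distributions, here is a toy sketch. It assumes one common spline design in which the transform only acts on a bounded interval [-B, B] and is the identity outside it; whether that applies depends on the implementation you use. The functions `bounded_spline` and `affine` and all numbers are hypothetical.

```python
import math

B = 3.0  # assumed spline support [-B, B]


def bounded_spline(x):
    # Identity outside the support, so mass out there is never reshaped.
    if abs(x) > B:
        return x
    # Toy monotone warp inside the support (derivative stays positive).
    return x + 0.5 * math.sin(math.pi * x / B)


def affine(x, scale, shift):
    return scale * x + shift


sample = 10.0  # a value from a distribution centered far from zero

# Spline alone: the value lies outside [-B, B] and passes through unchanged.
spline_only = bounded_spline(sample)

# Affine first, then spline: the shift moves the value into the support,
# where the spline can actually act on it.
interleaved = bounded_spline(affine(sample, scale=1.0, shift=-9.0))

print(spline_only, interleaved)
```

In this picture the affine layers handle location and scale while the spline layers reshape the density inside the bounded region, which is one way to read the benefit of alternating the two.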


Thanks for your suggestions! I will try to tune the model by adding dropout. Do you know how to enable checkpoints so that the best model is saved in the BayesFlow pipeline?

I do agree with you! Splines take much more time than affine layers. I also found that splines can better identify multiple modes thanks to their expressiveness.
I will try the interleaved option for my case to see how it works.

Jice, on your second question, we have also had similar experiences. Typically it happens when a parameter is very well identified (very high contraction). I don’t know the explanation, but one hunch is that the curvature of the posterior for those highly identifiable parameters is very sensitive to small changes in the parameter. That makes it harder to train the NN to capture the curvature, especially since the training data is sampled from the prior and not from the area where the curvature is most salient/important. If this hypothesis is on track, then you may see better results by contracting the prior to focus on the regions you care about more, but the problem may persist as long as posterior contraction is very high (close to one for most datasets).


Thanks for the discussion. In some cases, I had very small variances and very accurate parameter recovery, yet poor simulation-based calibration, e.g., a bad ECDF. I tuned the model by adding dropout, adding more coupling layers, and changing the prior. The results seem to improve, but not always. ‘Not always’ here means that training is somewhat stochastic: the results differ from one training run to the next. I believe the tuning is pretty important but difficult. Accurate parameter estimation alone does not indicate that the model is well trained.
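For anyone who wants to see how good recovery and a bad ECDF can coexist, here is a toy simulation in plain Python. It is not BayesFlow code, and every name and number in it is made up. The posterior mean has the same small error in both runs, so recovery is equally good, but in the second run the reported posterior is narrower than that error (overconfident), which pushes the calibration ranks to the extremes.

```python
import random


def sbc_ranks(reported_sd, true_sd=0.1, n_datasets=2000, n_draws=99, seed=1):
    """Rank of the true parameter among posterior draws, per dataset."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_datasets):
        theta = rng.gauss(0.0, 1.0)               # ground truth from the prior
        center = theta + rng.gauss(0.0, true_sd)  # posterior mean with small error
        draws = [rng.gauss(center, reported_sd) for _ in range(n_draws)]
        ranks.append(sum(d < theta for d in draws))  # integer in 0..n_draws
    return ranks


def extreme_fraction(ranks, n_draws=99):
    # Share of ranks in the outer 10% on each side.
    # Roughly 0.2 when ranks are uniform, i.e. when the posterior is calibrated.
    lo, hi = 0.1 * n_draws, 0.9 * n_draws
    return sum(r < lo or r > hi for r in ranks) / len(ranks)


# Reported spread matches the actual error: ranks are close to uniform.
calibrated = extreme_fraction(sbc_ranks(reported_sd=0.1))
# Reported spread is too small: too many ranks land at the extremes.
overconfident = extreme_fraction(sbc_ranks(reported_sd=0.03))

print(calibrated, overconfident)
```

Ranks piling up at both ends is the pattern that pushes an ECDF outside its confidence bands, even though point estimates track the true values closely in both runs.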