Parameter Recovery Issues with Item Response Theory (IRT) Models

Hi everyone,

I’m working on estimating Item Response Theory (IRT) parameters with BayesFlow, specifically implementing a 2PL model in which I want to estimate item discrimination (a) and difficulty (b) parameters from response patterns. However, I’m experiencing poor parameter recovery: the estimated parameters seem to follow the prior distributions rather than recovering the true values. I have only managed to recover the item difficulties (unfortunately not very accurately) with a small number of parameters (~4 items); item discrimination is of course harder to determine under these conditions due to identifiability. For larger numbers of parameters (I tried up to 20 items) I can’t recover any of them. When I provided handcrafted summary statistics (sum scores from external tests), even estimating person abilities (theta) for 100 persons became feasible, but I still struggled with the discriminations and difficulties.
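For reference, the 2PL model implemented below gives person $j$ a probability of answering item $i$ correctly of $\Pr(y_{ij} = 1 \mid \theta_j, a_i, b_i) = 1 / \left(1 + \exp(-a_i(\theta_j - b_i))\right)$, with abilities $\theta_j \sim \mathcal{N}(0, 1)$ and item parameters $a_i \sim \mathrm{LogNormal}(0, 0.5)$, $b_i \sim \mathcal{N}(0, 1)$.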

Code Example

import os
if "KERAS_BACKEND" not in os.environ:
    os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
import bayesflow as bf
import numpy as np

# Set seeds for reproducibility
np.random.seed(42)

# Global parameters
N_ITEMS = 3
N_PERSONS = 40

SUMMARY_DIM = 20            # number of summary dimensions used for summary network (SetTransformer)
NUM_COUPLING_LAYERS = 3     # number of coupling layers (applies to CouplingFlow; FlowMatching does not use them)
EPOCHS = 100
BATCH_SIZE = 16
NUM_BATCHES_PER_EPOCH = 32

# Create simulator
def prior(n_items):
    a = np.random.lognormal(0, 0.5, size=n_items)  # discrimination
    b = np.random.normal(0, 1, size=n_items)       # difficulty

    return dict(a=a, b=b)

def likelihood(a, b, n_persons):
    """Likelihood function generating response data"""
    theta = np.random.normal(0, 1, size=n_persons)

    linear_predictors = a[:, None] * (theta[None, :] - b[:, None])
    p = 1 / (1 + np.exp(-linear_predictors))
    responses = np.random.binomial(1, p)

    return dict(theta=theta, responses=responses)

def meta():
    return dict(n_items=N_ITEMS, n_persons=N_PERSONS)

# Create simulator and adapter
simulator = bf.make_simulator([prior, likelihood], meta_fn=meta)

adapter = (
    bf.adapters.Adapter()
    .to_array()
    .constrain("a", lower=0.01)  # discrimination must be positive
    .convert_dtype(from_dtype="float64", to_dtype="float32")
    .broadcast("n_items", to="a")
    .broadcast("n_persons", to="theta")
    .concatenate(["n_items", "n_persons"], into="inference_conditions")
    .rename("responses", "summary_variables")
    .concatenate(["a", "b"], into="inference_variables")
)

# Test simulator and adapter
test_data = simulator.sample(2)
print(f"Simulator output shapes:")
for key, value in test_data.items():
    if hasattr(value, 'shape'):
        print(f"  {key}: {value.shape}")
    else:
        print(f"  {key}: {value}")

adapted_data = adapter(test_data)
print(f"\nAdapter output shapes:")
for key, value in adapted_data.items():
    if hasattr(value, 'shape'):
        print(f"  {key}: {value.shape}")
    else:
        print(f"  {key}: {value}")

# Define networks
summary_network = bf.networks.SetTransformer(summary_dim=SUMMARY_DIM)
inference_network = bf.networks.FlowMatching()  # FlowMatching has no coupling layers; NUM_COUPLING_LAYERS applies to CouplingFlow

# Create workflow
workflow = bf.BasicWorkflow(
    simulator=simulator,
    adapter=adapter,
    inference_network=inference_network,
    summary_network=summary_network,
)

# Train model
history = workflow.fit_online(
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    num_batches_per_epoch=NUM_BATCHES_PER_EPOCH
)

# Test parameter recovery
validation_data = simulator.sample(50)
result = workflow.sample(conditions=validation_data, num_samples=50)
bf.diagnostics.plots.recovery(
    estimates=result,
    targets=validation_data
)

Output

I have already tried other networks (CouplingFlow as the inference network and DeepSet as the summary network), but unfortunately without success either. Would anyone have ideas on how I could improve parameter recovery with BayesFlow? Any help would be greatly appreciated!

Hi Vinzent,
welcome to the forum! Thanks for the nice reproducible example. I ran your code and noticed that responses has the shape (batch_size, n_items, n_persons). The set transformer and the deep set take inputs with shape (batch_size, set_size, input_dim). Do I understand correctly that you want the persons to be exchangeable? If so, you can add a responses = responses.swapaxes(-1, -2) call to your likelihood to swap the axes. This improves the recovery for me. The discrimination still does not show very good recovery, but as you write, that might be expected. Maybe you can test again and let us know how it goes.
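For illustration, here is a quick shape check (the batch size of 16 and the N_ITEMS/N_PERSONS values are taken from the constants in your post above):

import numpy as np

# responses as returned by the simulator: (batch_size, n_items, n_persons)
responses = np.zeros((16, 3, 40))

# after the swap, persons form the exchangeable set axis the summary
# networks expect: (batch_size, set_size=n_persons, input_dim=n_items)
responses = responses.swapaxes(-1, -2)
print(responses.shape)  # (16, 40, 3)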

If you want to recover more parameters and use a summary network, make sure to set the summary dimension high enough; four times the number of inference parameters might be a good place to start. Also, the DeepSet sometimes does not work well with its default settings, so I would recommend sticking with the set transformer. For testing whether something works at all, coupling flows should be sufficient and give you faster inference times.
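As a concrete sketch, that rule of thumb for the 2PL setup above would read:

# rule of thumb: summary_dim ≈ 4 × number of inference parameters;
# the 2PL model has 2 * N_ITEMS inference variables (a and b)
SUMMARY_DIM = 4 * (2 * N_ITEMS)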

Let us know how it goes, and if you encounter any problems or have questions, don’t hesitate to reach out.


Thanks for the quick response! You’re right, the persons should be exchangeable. I have already experimented with estimating a larger number of items (up to 20) with this variant, and it looks good!

Changes in the code:

N_ITEMS = 20 # tried higher number of items
SUMMARY_DIM = (N_ITEMS * 2) * 4 # adaptive number of summary dimensions
responses = responses.swapaxes(-1, -2) # persons interchangeable
inference_network = bf.networks.CouplingFlow(num_coupling_layers=NUM_COUPLING_LAYERS) # CouplingFlow for faster inference

Adapted code:

import os
if "KERAS_BACKEND" not in os.environ:
    os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
import bayesflow as bf
import numpy as np

# Set seeds for reproducibility
np.random.seed(42)

# Global parameters
N_ITEMS = 20
N_PERSONS = 40

SUMMARY_DIM = (N_ITEMS * 2) * 4   # number of summary dimensions used for summary network (SetTransformer)
NUM_COUPLING_LAYERS = 3           # number of coupling layers used for inference network (CouplingFlow)
EPOCHS = 100
BATCH_SIZE = 16
NUM_BATCHES_PER_EPOCH = 32

# Create simulator
def prior(n_items):
    a = np.random.lognormal(0, 0.5, size=n_items)  # discrimination
    b = np.random.normal(0, 1, size=n_items)       # difficulty

    return dict(a=a, b=b)

def likelihood(a, b, n_persons):
    """Likelihood function generating response data"""
    theta = np.random.normal(0, 1, size=n_persons)

    linear_predictors = a[:, None] * (theta[None, :] - b[:, None])
    p = 1 / (1 + np.exp(-linear_predictors))
    responses = np.random.binomial(1, p)
    responses = responses.swapaxes(-1, -2)

    return dict(theta=theta, responses=responses)

def meta():
    return dict(n_items=N_ITEMS, n_persons=N_PERSONS)

simulator = bf.make_simulator([prior, likelihood], meta_fn=meta)

# Create adapter
adapter = (
    bf.adapters.Adapter()
    .to_array()
    .constrain("a", lower=0.01) # discrimination must be positive
    .convert_dtype(from_dtype="float64", to_dtype="float32")
    .broadcast("n_items", to="a")
    .broadcast("n_persons", to="theta")
    .concatenate(["n_items", "n_persons"], into="inference_conditions")
    .rename("responses", "summary_variables")
    .concatenate(["a", "b"], into="inference_variables")
)

# Test simulator and adapter
test_data = simulator.sample(2)
print(f"Simulator output shapes:")
for key, value in test_data.items():
    if hasattr(value, 'shape'):
        print(f"  {key}: {value.shape}")
    else:
        print(f"  {key}: {value}")

adapted_data = adapter(test_data)
print(f"\nAdapter output shapes:")
for key, value in adapted_data.items():
    if hasattr(value, 'shape'):
        print(f"  {key}: {value.shape}")
    else:
        print(f"  {key}: {value}")

# Define networks
summary_network = bf.networks.SetTransformer(summary_dim=SUMMARY_DIM)
inference_network = bf.networks.CouplingFlow(num_coupling_layers=NUM_COUPLING_LAYERS)

# Create workflow
workflow = bf.BasicWorkflow(
    simulator=simulator,
    adapter=adapter,
    inference_network=inference_network,
    summary_network=summary_network,
)

# Train model
history = workflow.fit_online(
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    num_batches_per_epoch=NUM_BATCHES_PER_EPOCH
)
bf.diagnostics.plots.loss(history)

# Test parameter recovery
validation_data = simulator.sample(50)
result = workflow.sample(conditions=validation_data, num_samples=50)
bf.diagnostics.plots.recovery(
    estimates=result, 
    targets=validation_data
)

Result:

When I let it train longer (~200 epochs), all item difficulties (b) are estimated well. Unfortunately, the item discriminations are still not estimated well, even though I had hoped that the higher number of items would make their estimation possible. I’m still experimenting with what I can do about this; if I find a solution, I’ll post it here. Thanks a lot for the help!


Nice! I thought that item discriminations should be affected by the number of test takers, not the number of items?

@valentin We may think of an architecture that uses two set transformers. One produces a summary across items and one across persons.
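A rough, untested sketch of that idea as a custom summary network (the class name and wiring are illustrative, not an official BayesFlow API):

from keras import ops
import bayesflow as bf
from bayesflow.networks import SummaryNetwork

class TwoAxisSummary(SummaryNetwork):
    """Hypothetical: one SetTransformer pools over persons, one over items."""

    def __init__(self, summary_dim=64, **kwargs):
        super().__init__(**kwargs)
        self.person_net = bf.networks.SetTransformer(summary_dim=summary_dim // 2)
        self.item_net = bf.networks.SetTransformer(summary_dim=summary_dim // 2)

    def call(self, x, training=False):
        # x: (batch, n_persons, n_items), as produced after the swapaxes fix
        person_summary = self.person_net(x, training=training)                    # summary across persons
        item_summary = self.item_net(ops.swapaxes(x, -1, -2), training=training)  # summary across items
        return ops.concatenate([person_summary, item_summary], axis=-1)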

PS. Make sure you obtain enough samples from the posterior for the diagnostics. 50 samples is a very low number.

Regarding item discrimination, I initially thought both factors were necessary, since in a likelihood-based approach I tested earlier, parameter recovery only worked well with sufficiently high numbers of both items and test takers. However, I’m getting counterintuitive results: when I increase the number of test takers (to 300), I’m not seeing good recovery for discrimination, and the recovery of the difficulty gets worse. And thanks for the tip, I increased the number of samples drawn from the posterior.

To me this looks somewhat similar to what we have seen in this thread. My guess would be that the summary network cannot handle that much data with its default settings, so it only provides summary statistics for a few variables and loses the information for the rest. You can try to increase its capacity, but that might not be so straightforward.
A different approach that I’m currently testing, and that seems simpler, is to bundle multiple summary networks of the same type to increase the capacity. I have just opened a pull request to make this easier.

Below is a preliminary version if you want to try it out already. Here, the deep set (with the new defaults) seems to work better than the set transformer.

import os
if "KERAS_BACKEND" not in os.environ:
    os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
import bayesflow as bf
import numpy as np
from collections.abc import Mapping, Sequence
from bayesflow.networks import SummaryNetwork
from bayesflow.utils.serialization import deserialize, serializable, serialize
from bayesflow.types import Tensor, Shape
from keras import ops


@serializable("custom")
class InofficialFusionNetwork(SummaryNetwork):
    def __init__(
        self,
        backbones: Sequence | Mapping[str, keras.Layer],
        head: keras.Layer | None = None,
        **kwargs,
    ):
        """(SN) Wraps multiple summary networks (`backbones`) to learn summary statistics from (optionally)
        multi-modal data.

        There are two modes of operation:

        - Identical input: each backbone receives the same input. The backbones have to be passed as a sequence.
        - Multi-modal input: each backbone gets its own input, which is the usual case for multi-modal data. Networks
          and inputs have to be passed as dictionaries with corresponding keys, so that each
          input is processed by the correct summary network. This means the "summary_variables" entry to the
          approximator has to be a dictionary, which can be achieved using the
          :py:meth:`bayesflow.adapters.Adapter.group` method.

        This network implements _late_ fusion. The output of the individual summary networks is concatenated, and
        can be further processed by another neural network (`head`).

        Parameters
        ----------
        backbones : Sequence or dict
            Either (see above for details):

            - a sequence, when each backbone should receive the same input.
            - a dictionary with names of inputs as keys and corresponding summary networks as values.
        head : keras.Layer, optional
            A network to further process the concatenated outputs of the summary networks. By default,
            the concatenated outputs are returned without further processing.
        **kwargs
            Additional keyword arguments that are passed to the :py:class:`~bayesflow.networks.SummaryNetwork`
            base class.
        """
        super().__init__(**kwargs)
        self.backbones = backbones
        self.head = head
        self._dict_mode = isinstance(backbones, Mapping)
        if self._dict_mode:
            # order keys to always concatenate in the same order
            self._ordered_keys = sorted(list(self.backbones.keys()))

    def build(self, inputs_shape: Shape | Mapping[str, Shape]):
        if self._dict_mode and not isinstance(inputs_shape, Mapping):
            raise ValueError(
                "`backbones` were passed as a dictionary, but the input shapes are not a dictionary. "
                "If you want to pass the same input to each backbone, pass the backbones as a list instead of a "
                "dictionary. If you want to provide each backbone with different input, please ensure that you have "
                "correctly assembled the `summary_variables` to provide a dictionary using the Adapter.group method."
            )
        if self.built:
            return
        output_shapes = []
        if self._dict_mode:
            # note: this catches input keys that have no corresponding backbone
            unexpected_keys = list(set(inputs_shape.keys()).difference(set(self._ordered_keys)))
            if len(unexpected_keys) > 0:
                raise ValueError(
                    f"Expected the input to contain only the following keys: {self._ordered_keys}. "
                    f"Unexpected keys: {unexpected_keys}"
                )
            for k, shape in inputs_shape.items():
                # build each summary network with different input shape
                if not self.backbones[k].built:
                    self.backbones[k].build(shape)
                output_shapes.append(self.backbones[k].compute_output_shape(shape))
        else:
            for backbone in self.backbones:
                # build all summary networks with the same input shape
                if not backbone.built:
                    backbone.build(inputs_shape)
                output_shapes.append(backbone.compute_output_shape(inputs_shape))
        if self.head and not self.head.built:
            fusion_input_shape = (*output_shapes[0][:-1], sum(shape[-1] for shape in output_shapes))
            self.head.build(fusion_input_shape)
        self.built = True

    def compute_output_shape(self, inputs_shape: Mapping[str, Shape]):
        output_shapes = []
        if self._dict_mode:
            output_shapes = [self.backbones[k].compute_output_shape(shape) for k, shape in inputs_shape.items()]
        else:
            output_shapes = [backbone.compute_output_shape(inputs_shape) for backbone in self.backbones]
        output_shape = (*output_shapes[0][:-1], sum(shape[-1] for shape in output_shapes))
        if self.head:
            output_shape = self.head.compute_output_shape(output_shape)
        return output_shape

    def call(self, inputs: Mapping[str, Tensor], training=False):
        """
        Parameters
        ----------
        inputs : Tensor | dict[str, Tensor]
            Either (see above for details):

            - a tensor, when the backbones were passed as a list and should receive identical inputs
            - a dictionary, when the backbones were passed as a dictionary, where each value is the input to the
              summary network with the corresponding key.
        training : bool, optional
            Whether the model is in training mode, affecting layers like dropout and
            batch normalization. Default is False.
        """
        if self._dict_mode:
            outputs = [self.backbones[k](inputs[k], training=training) for k in self._ordered_keys]
        else:
            outputs = [backbone(inputs, training=training) for backbone in self.backbones]
        outputs = ops.concatenate(outputs, axis=-1)
        if self.head is None:
            return outputs
        return self.head(outputs, training=training)

    def compute_metrics(self, inputs: Mapping[str, Tensor], stage: str = "training", **kwargs) -> dict[str, Tensor]:
        """
        Parameters
        ----------
        inputs : Tensor | dict[str, Tensor]
            Either (see above for details):

            - a tensor, when the backbones were passed as a list and should receive identical inputs
            - a dictionary, when the backbones were passed as a dictionary, where each value is the input to the
              summary network with the corresponding key.
        stage : str, optional
            The current stage, e.g., "training" or "validation", affecting layers like dropout and
            batch normalization. Default is "training".
        **kwargs
            Additional keyword arguments.
        """
        if not self.built:
            self.build(keras.tree.map_structure(keras.ops.shape, inputs))
        metrics = {"loss": [], "outputs": []}

        def process_backbone(backbone, input):
            # helper function to avoid code duplication for the two modes
            if isinstance(backbone, SummaryNetwork):
                backbone_metrics = backbone.compute_metrics(input, stage=stage, **kwargs)
                metrics["outputs"].append(backbone_metrics["outputs"])
                if "loss" in backbone_metrics:
                    metrics["loss"].append(backbone_metrics["loss"])
            else:
                metrics["outputs"].append(backbone(input, training=stage == "training"))

        if self._dict_mode:
            for k in self._ordered_keys:
                process_backbone(self.backbones[k], inputs[k])
        else:
            for backbone in self.backbones:
                process_backbone(backbone, inputs)

        if len(metrics["loss"]) == 0:
            del metrics["loss"]
        else:
            metrics["loss"] = ops.sum(metrics["loss"])
        metrics["outputs"] = ops.concatenate(metrics["outputs"], axis=-1)
        if self.head is not None:
            metrics["outputs"] = self.head(metrics["outputs"], training=stage == "training")

        return metrics

    def get_config(self) -> dict:
        base_config = super().get_config()
        config = {
            "backbones": self.backbones,
            "head": self.head,
        }
        return base_config | serialize(config)

    @classmethod
    def from_config(cls, config: dict, custom_objects=None):
        config = deserialize(config, custom_objects=custom_objects)
        return cls(**config)


# Set seeds for reproducibility
np.random.seed(42)

# Global parameters
N_ITEMS = 20
N_PERSONS = 400

SUMMARY_DIM = (N_ITEMS * 2) * 4   # number of summary dimensions used for summary networks in total
NUM_SUMMARY_NETWORKS = 4          # number of summary networks (SetTransformer)
NUM_COUPLING_LAYERS = 3           # number of coupling layers used for inference network (CouplingFlow)
EPOCHS = 10
BATCH_SIZE = 16
NUM_BATCHES_PER_EPOCH = 1000

# Create simulator
def prior(n_items):
    a = np.random.lognormal(0, 0.5, size=n_items)  # discrimination
    b = np.random.normal(0, 1, size=n_items)       # difficulty

    return dict(a=a, b=b)

def likelihood(a, b, n_persons):
    """Likelihood function generating response data"""
    theta = np.random.normal(0, 1, size=n_persons)

    linear_predictors = a[:, None] * (theta[None, :] - b[:, None])
    p = 1 / (1 + np.exp(-linear_predictors))
    responses = np.random.binomial(1, p)
    responses = responses.swapaxes(-1, -2)

    return dict(theta=theta, responses=responses)

def meta():
    return dict(n_items=N_ITEMS, n_persons=N_PERSONS)

simulator = bf.make_simulator([prior, likelihood], meta_fn=meta)

# Create adapter
adapter = (
    bf.adapters.Adapter()
    .to_array()
    .constrain("a", lower=0.01) # discrimination must be positive
    .convert_dtype(from_dtype="float64", to_dtype="float32")
    .broadcast("n_items", to="a")
    .broadcast("n_persons", to="theta")
    .concatenate(["n_items", "n_persons"], into="inference_conditions")
    .rename("responses", "summary_variables")
    .concatenate(["a", "b"], into="inference_variables")
)

# Test simulator and adapter
test_data = simulator.sample(2)
print(f"Simulator output shapes:")
for key, value in test_data.items():
    if hasattr(value, 'shape'):
        print(f"  {key}: {value.shape}")
    else:
        print(f"  {key}: {value}")

adapted_data = adapter(test_data)
print(f"\nAdapter output shapes:")
for key, value in adapted_data.items():
    if hasattr(value, 'shape'):
        print(f"  {key}: {value.shape}")
    else:
        print(f"  {key}: {value}")

# Define networks
summary_network = InofficialFusionNetwork(
    backbones=[
        bf.networks.DeepSet(summary_dim=SUMMARY_DIM // NUM_SUMMARY_NETWORKS, mlp_widths_invariant_outer=(64, 4))
        for _ in range(NUM_SUMMARY_NETWORKS)
    ]
)
inference_network = bf.networks.CouplingFlow(num_coupling_layers=NUM_COUPLING_LAYERS)

# Create workflow
workflow = bf.BasicWorkflow(
    simulator=simulator,
    adapter=adapter,
    inference_network=inference_network,
    summary_network=summary_network,
)
validation_data = simulator.sample(500)

# Train model
history = workflow.fit_online(
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    num_batches_per_epoch=NUM_BATCHES_PER_EPOCH,
    validation_data=validation_data,
)
bf.diagnostics.plots.loss(history)

# Test parameter recovery
result = workflow.sample(conditions=validation_data, num_samples=50)
bf.diagnostics.plots.recovery(
    estimates=result, 
    targets=validation_data
)

This gives the following result:

Nice! As far as I can remember, there is also some fundamental non-identifiability in IRT model families, but @paul.buerkner knows more about it for sure.

I think that, the way the priors are set up, there shouldn’t be non-identifiability in the above simulator. You may want to try changing the parameterization to a * theta + b, which corresponds to a factor-loading-plus-intercept parameterization. This may make it easier for the network, since it no longer has to disentangle two item parameters that are multiplied together.
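In the simulator above, that amounts to changing one line in the likelihood (note that b then acts as an intercept rather than a difficulty, so its sign and interpretation change):

def likelihood(a, b, n_persons):
    """Likelihood with the slope-intercept parameterization a * theta + b"""
    theta = np.random.normal(0, 1, size=n_persons)

    linear_predictors = a[:, None] * theta[None, :] + b[:, None]  # was a * (theta - b)
    p = 1 / (1 + np.exp(-linear_predictors))
    responses = np.random.binomial(1, p).swapaxes(-1, -2)

    return dict(theta=theta, responses=responses)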

Thanks for all the tips and help; I will try these things out and see how they affect the results. When I trained the networks with more data using the parameters below (and all the “old” settings from my last post), I also got quite a bit better results, but hopefully your new tips will lead to even better outcomes.

N_ITEMS = 15
N_PERSONS = 40

SUMMARY_DIM = (N_ITEMS * 2) * 4
NUM_COUPLING_LAYERS = 3

EPOCHS = 2000
BATCH_SIZE = 64
NUM_BATCHES_PER_EPOCH = 128

What does the calibration look like (via diagnostics.plots.calibration_ecdf)? It is not unexpected that the recovery for the discrimination parameters is not super great, especially with few people.
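For reference, assuming the same estimates/targets interface as the recovery plot above, the call would be something like:

bf.diagnostics.plots.calibration_ecdf(
    estimates=result,
    targets=validation_data,
)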

I closed my session and had to load the model, so this is the calibration (and recovery) for some newly generated validation data.
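(For completeness: since the approximator is a Keras model, saving and reloading can be done with standard Keras serialization; the file name below is illustrative.)

# after training
workflow.approximator.save("irt_approximator.keras")

# in a new session (import bayesflow first so custom layers can be deserialized)
approximator = keras.saving.load_model("irt_approximator.keras")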


This looks pretty solid to me. What do you think? Which settings ended up being the ones that made it work like this?

Yeah, these explorations can build some solid intuitions about amortized inference for IRT.

The exact code I used to achieve these results was the following:

import os
if "KERAS_BACKEND" not in os.environ:
    os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
import bayesflow as bf
import numpy as np

# Set seeds for reproducibility
np.random.seed(42)

# Global parameters
N_ITEMS = 15
N_PERSONS = 40

SUMMARY_DIM = (N_ITEMS * 2) * 4             # number of summary dimensions used for summary network (SetTransformer)
NUM_COUPLING_LAYERS = 3                     # number of coupling layers used for inference network (CouplingFlow)
EPOCHS = 2000
BATCH_SIZE = 64
NUM_BATCHES_PER_EPOCH = 128

# Create simulator
def prior(n_items):
    a = np.random.lognormal(0, 0.5, size=n_items)  # discrimination
    b = np.random.normal(0, 1, size=n_items)       # difficulty

    return dict(a=a, b=b)

def likelihood(a, b, n_persons):
    """Likelihood function generating response data"""
    theta = np.random.normal(0, 1, size=n_persons)

    linear_predictors = a[:, None] * (theta[None, :] - b[:, None])
    p = 1 / (1 + np.exp(-linear_predictors))
    responses = np.random.binomial(1, p)
    responses = responses.swapaxes(-1, -2)

    return dict(theta=theta, responses=responses)

def meta():
    return dict(n_items=N_ITEMS, n_persons=N_PERSONS)

simulator = bf.make_simulator([prior, likelihood], meta_fn=meta)

# Create adapter
adapter = (
    bf.adapters.Adapter()
    .to_array()
    .convert_dtype(from_dtype="float64", to_dtype="float32")
    .broadcast("n_items", to="a")
    .broadcast("n_persons", to="theta")
    .concatenate(["n_items", "n_persons"], into="inference_conditions")
    .rename("responses", "summary_variables")
    .concatenate(["a", "b"], into="inference_variables")
)

# Test simulator and adapter
test_data = simulator.sample(2)
print(f"Simulator output shapes:")
for key, value in test_data.items():
    if hasattr(value, 'shape'):
        print(f"  {key}: {value.shape}")
    else:
        print(f"  {key}: {value}")

adapted_data = adapter(test_data)
print(f"\nAdapter output shapes:")
for key, value in adapted_data.items():
    if hasattr(value, 'shape'):
        print(f"  {key}: {value.shape}")
    else:
        print(f"  {key}: {value}")

# Define networks
summary_network = bf.networks.SetTransformer(summary_dim=SUMMARY_DIM)
inference_network = bf.networks.CouplingFlow(num_coupling_layers=NUM_COUPLING_LAYERS)

# Create workflow
workflow = bf.BasicWorkflow(
    simulator=simulator,
    adapter=adapter,
    inference_network=inference_network,
    summary_network=summary_network,
)

# Train model
history = workflow.fit_online(
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    num_batches_per_epoch=NUM_BATCHES_PER_EPOCH
)
bf.diagnostics.plots.loss(history)

# Test parameter recovery
validation_data = simulator.sample(50)
result = workflow.sample(conditions=validation_data, num_samples=1000)
bf.diagnostics.plots.recovery(
    estimates=result, 
    targets=validation_data
)

I basically only added responses = responses.swapaxes(-1, -2), changed the inference network from FlowMatching to CouplingFlow, and increased EPOCHS, BATCH_SIZE, and NUM_BATCHES_PER_EPOCH as well as the num_samples drawn from the posterior. I also removed the .constrain("a", lower=0.01) from the Adapter().


Increasing seed_dim and num_seeds in the SetTransformer can also increase the expressiveness.
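For example (the parameter values here are only illustrative):

summary_network = bf.networks.SetTransformer(
    summary_dim=SUMMARY_DIM,
    num_seeds=2,    # more learnable seed vectors in the pooling block
    seed_dim=128,   # higher-dimensional seed vectors
)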

What do you think is the best place to put those tips (also for other networks) so users can find them? Do we put them in the docstrings, or somewhere in the user guide? @KLDivergence @paul.buerkner

I like notebooks, along the lines of “what to fix if something doesn’t quite work”, illustrated with an example. Of course, that requires a bit of work. Alternatively, I think the user guide would be a good place to just quickly highlight some things to try.


I think the overall challenge is to make inference independent of num_items and num_persons. You probably need person summaries and then another summary over these person summaries (see also https://bpspsychub.onlinelibrary.wiley.com/doi/full/10.1111/bmsp.12396). I will try to work something out in the next weeks. If you’re interested in exchanging some ideas, shoot me a message.


It can be done as yet another example of a varying number of parameters (as opposed to varying numbers of trials / data factors). Instead of representing the parameters as a vector of dimension 2 * \text{num_items}, represent them as a \text{num_items} \times 2 matrix and use a many-to-many summary network (e.g., a set transformer without pooling). All inference networks in bayesflow work with arbitrary tensor dimensions, except for the last axis, which needs to be the same across all realizations.
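A minimal sketch of that representation (illustrative only; the many-to-many summary and inference wiring is left out):

import numpy as np

a = np.random.lognormal(0, 0.5, size=20)  # discriminations, as in the prior above
b = np.random.normal(0, 1, size=20)       # difficulties

# current approach: flat vector of dimension 2 * num_items
flat = np.concatenate([a, b])             # shape (40,)

# alternative: a (num_items, 2) matrix, so items become an exchangeable
# set of (a, b) pairs that a many-to-many network can process
params = np.stack([a, b], axis=-1)        # shape (20, 2)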


So the person abilities/features are implicitly represented in the inference network?
