BayesFlow on HPC Clusters

I have access to HPCs running Windows Server with a high number of CPU cores. Using the tutorial examples and some other toy models, I have noticed that training is actually slower than on laptops and PCs with lower specifications (and training is the bottleneck). I have observed the same behavior on different HPCs (even one with 192 logical processors).

Maybe I am mistaken, but different training modes and configurations did not improve the relative performance of the HPCs compared to my laptop (e.g., offline vs. online training, or different batch_size and epochs settings). In my previous experience, this usually happens when a job runs on a single core, since laptops and PCs usually have better single-core performance than HPC nodes.

I also know that the HPCs I mentioned do not provide powerful GPUs, but I hoped to harness the CPU power instead (I am not even sure how heavily GPUs are involved). So I was wondering if anyone has experience with this and whether there are potential solutions to overcome this issue.


I might be wrong, but I don't think BayesFlow natively handles multiple cores, and this is likely why your code is not faster on the HPC. I haven't personally benchmarked this, but one thing you can try is to do the multiprocessing in the simulator yourself when simulating multiple batches of data. For example, if your simulator receives a generation batch size of 5000, you can use something like joblib to spread the simulations evenly across the available processors, as in the sketch below.
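A minimal sketch of what that could look like, assuming a hypothetical single-draw simulator `simulate_single` (replace it with your actual model); joblib spreads the batch over all available cores:

```python
import numpy as np
from joblib import Parallel, delayed

def simulate_single(theta):
    # Hypothetical placeholder: simulate one dataset for one parameter vector.
    rng = np.random.default_rng()
    return rng.normal(loc=theta.sum(), scale=1.0, size=100)

def batched_simulator(theta_batch, n_jobs=-1):
    # Spread the batch of parameter vectors over all available cores.
    sims = Parallel(n_jobs=n_jobs)(
        delayed(simulate_single)(theta) for theta in theta_batch
    )
    return np.stack(sims)  # shape: (batch_size, n_obs)
```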


I agree that the simulator might be the most straightforward place to parallelize for online training, especially when the simulation program is the bottleneck.

Once you have implemented a parallelized batch simulator (as proposed by Chris), make sure to let BayesFlow know by setting simulator_is_batched=True in the GenerativeModel (see the bayesflow.simulation module documentation).
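A rough sketch of the wiring, assuming the v1 bayesflow.simulation API and reusing the `batched_simulator` from the sketch above (the prior here is a hypothetical stand-in):

```python
import numpy as np
from bayesflow.simulation import GenerativeModel, Prior

# Hypothetical 2D Gaussian prior; batched_simulator is the joblib sketch above.
prior = Prior(prior_fun=lambda: np.random.normal(size=2))

generative_model = GenerativeModel(
    prior=prior,
    simulator=batched_simulator,
    simulator_is_batched=True,  # the simulator expects a whole batch of parameters
)
```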

That being said, adding bespoke multicore handling in BayesFlow seems like a sensible addition. Do you have any insights or ideas about an optimal way to add this functionality?

PS: I share your experience that laptop computers often have better single-core CPU performance than clusters. That’s why I train all of my small-ish stuff on my MacBook :slight_smile:


I personally feel that the simulator is probably the only sensible place to add multi-core support. In my experience, the training itself is compute-bound but hard to distribute across CPUs, and the better solution there is usually to use GPUs rather than more CPUs.


I agree, distributed simulations can speed up online training a lot. As Marvin points out, power users can currently implement a batched parallel simulator and pass it to the wrappers, but it may be a good idea to provide a native solution for non-batched simulators as well. I can try to work on that.

For distributed training, we would need to go through the TensorFlow strategy interface (tf.distribute.Strategy) and implement something along those lines for distributed offline training; a plain-TensorFlow sketch follows below.
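For reference, this is roughly what the strategy interface looks like in plain TensorFlow/Keras. This is not something BayesFlow currently wires up for you, just an illustration of the mechanism we would need to build on:

```python
import tensorflow as tf

# One replica per visible GPU (falls back to CPU if none are found).
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Any model/optimizer created inside the scope is replicated across
    # devices; gradients are aggregated automatically during fit().
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) now distributes each batch across the replicas.
```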

What do you think? Would anyone be willing to help with that?


Thank you all for sharing your thoughts and advice. I think having distributed training would be extremely helpful. I have also seen lower performance when doing offline training on HPCs compared to my MacBook Pro, so the training can potentially be a huge bottleneck. I would have loved to help, but unfortunately I don't have much coding experience. I had actually seen that tutorial but had no idea how one could use it. For testing and replication, I would gladly help.

I have some clarification questions. I am using offline training and still cannot figure out how to speed up the training phase, which I am confident is the bottleneck. So I want to troubleshoot and even try GPU clusters to see if there is any significant difference. However, even on my MacBook Pro (M3 Max), I've noticed that Python only uses CPU resources during training, and the GPU is never activated.

Do I understand it correctly? Does BayesFlow support any form of distributed training on GPUs? How would I make sure that GPUs are engaged on macOS and Windows platforms?

Perhaps @marvinschmitt can provide some pointers about Mac?

I use TensorFlow on Windows and Linux with NVIDIA graphics cards. You can check if TensorFlow finds the graphics card, e.g., via:

https://www.tensorflow.org/api_docs/python/tf/test/is_gpu_available
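A quick check looks like this; the first call is the currently recommended one, the second is the one from the link above (deprecated in recent TensorFlow versions):

```python
import tensorflow as tf

# Recommended check: lists all GPUs TensorFlow can see.
print(tf.config.list_physical_devices("GPU"))

# Older check from the link above (deprecated, prints a warning in newer versions).
print(tf.test.is_gpu_available())
```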

Thank you so much for the response and the link. I think it might be caused by the way TensorFlow is developed. I will test on Linux with NVIDIA and will share my experience here. Thanks again!

Yeah, GPU acceleration with Apple Silicon (M1, M2, M3) is a pain. Last time I checked (sometime last year), my CPU performance was faster than Metal-accelerated GPU performance, so I just gave up.

My current workflow is: Fast iterations on implementing models happens on my local M1 CPU. When I’m happy with the model, I’ll move it over to a Linux+GPU setup and do the more extensive training runs there.

One unexpected phenomenon: when I use transformer summary networks, I can run larger batches on my Mac. The limiting factor is the quadratic space complexity of attention, and the Mac's unified memory gives us more memory than most cheap GPUs on a cluster offer.

To answer your question: No, BayesFlow does not implement dedicated GPU handling. This is all controlled via TensorFlow. The internet has up-to-date documentation on making your particular GPU work with TensorFlow, so that will be more helpful than my handwavy explanations :slight_smile: Once you have set up GPU training via TensorFlow, it will work with BayesFlow too.


Thank you very much, Marvin. I did not know there was a separate configuration and setup for TensorFlow. I went through it yesterday, and it took me a lot of time to make TensorFlow recognize the GPU. However, I am still getting some errors, which I hope to resolve soon (attached screenshot).

I was under the impression that BayesFlow would take care of that during the installation process. But now I understand the complexity.

I am curious whether you have set this up natively in the Windows environment. It would be great to get your opinion on the versions of TensorFlow (which in turn determines the CUDA and cuDNN versions) and Python that match BayesFlow, so that I can run the GPU setup natively on Windows.

I successfully enabled TensorFlow GPU support natively on Windows. There's only a narrow range of versions that overlap. To summarize my experience: I installed TensorFlow 2.10.1, then BayesFlow, but in the end I downgraded TensorFlow Probability to version 0.17. Hope it helps.
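As a quick sanity check after installing that combination, something like this confirms the versions and whether the GPU is visible (assuming `bayesflow` exposes `__version__`, which may differ between releases):

```python
import tensorflow as tf
import tensorflow_probability as tfp
import bayesflow

print("TensorFlow:", tf.__version__)        # expecting 2.10.1 for native Windows GPU
print("TF Probability:", tfp.__version__)   # expecting 0.17.x here
print("BayesFlow:", bayesflow.__version__)
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```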

Thanks, Ali! Yeah, unfortunately, TensorFlow versions > 2.10 no longer support native GPU on Windows and you need to use WSL 2. Fortunately, there is still no significant difference between 2.10 and 2.15.


Hi, I think I get noise during training on HPCs (with or without a GPU, though with a GPU the training is much faster). I tested the same model with a fixed random number generator and the same setup; the training is offline. However, I don't get these unusual results on my MacBook Pro or other regular machines. The validation results are also sometimes off on HPCs; I have to emphasize, only sometimes, not always. Do I have to adjust hyperparameters significantly to prevent this? Or is it the nature of these machines and the way they handle data? Thank you so much.


I found and fixed the issue. It's not even related to TensorFlow or BayesFlow. It relates to parallelization on HPCs in one of the programs I use. I use a piece of software to take care of the offline data generation for ODE problems, and it was using parallel data processing, which caused randomness in the order of the data and simulations (just the order, not the values). This made the learning very random and difficult. I hope this helps other researchers; a minimal sketch of how to keep the order deterministic follows below.
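In case it is useful to others, here is a minimal sketch (with a hypothetical `simulate_one` as a stand-in for the external ODE simulation) of keeping parallel offline data generation deterministic: give each draw a fixed seed and rely on joblib returning results in submission order.

```python
import numpy as np
from joblib import Parallel, delayed

def simulate_one(theta, seed):
    # Hypothetical stand-in for the external ODE simulation.
    rng = np.random.default_rng(seed)
    return rng.normal(loc=theta, scale=0.1, size=10)

thetas = np.linspace(0.0, 1.0, 5000)
seeds = np.arange(len(thetas))  # one fixed seed per parameter draw

# joblib returns results in submission order, so the resulting offline
# dataset is identical across runs, no matter which worker finishes first.
sims = Parallel(n_jobs=-1)(
    delayed(simulate_one)(t, s) for t, s in zip(thetas, seeds)
)
data = np.stack(sims)
```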

BayesFlow and TensorFlow work extremely well and smoothly on the GPU. Thanks again for this great work!


Hi Ali,

Glad you found the problem; I was playing around with the TensorFlow seeds and could not reproduce it.

Your experiences with parallel SBI would most definitely be valuable to others! Feel free to share other results as your project progresses. :slight_smile:

PS: I had some good experience running TensorFlow on GPU in WSL 2; it seems that sooner or later WSL 2 will be the go-to for running TensorFlow on Windows.


Hi all,

Since starting this thread, I have learned a lot about BayesFlow, TensorFlow, and the differences between GPU and CPU. So far, I am getting promising results on GPU. However, on one of my HPCs, which only has CPUs (running Windows), I noticed that I receive this message whenever TensorFlow is imported:

“This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.

To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.”

I suspect this could be why BayesFlow/TensorFlow is much slower on that machine than on my laptop. After searching the internet a bit, I realized that the standard package normal users like me install with 'pip' does not enable all the CPU instruction sets available on the machine. However, users can customize this by building TensorFlow from source (using Bazel). I don't mind spending time rebuilding TensorFlow with the appropriate compiler flags, but has anyone tried this approach before? Also, will the final product be compatible with BayesFlow?

PS: I have not finalized my workflow on WSL 2 because I am using system dynamics software that is highly dependent on Windows.

Thank you very much,
Ali


Hi Ali,

Thanks for the update, that’s very helpful!

“This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.”
I suspect this could be why BayesFlow/TensorFlow is much slower on that machine than on my laptop.

I observed the same when using BayesFlow on some Google Cloud VMs. Since my application was not large-scale and I only used the cluster resources to parallelize many experiments before a deadline, I didn't bother to self-compile TensorFlow, so I have no tips for the process itself, sorry.

Also, will the final product be compatible with BayesFlow?

BayesFlow interfaces with the standard TensorFlow API. This means: once you have installed TensorFlow, BayesFlow should work fine (at least as far as TensorFlow is concerned), regardless of the details of your TensorFlow installation.

If you encounter BayesFlow errors/issues with your self-compiled tensorflow version, please let us know and we’ll try to fix it :slight_smile:

Cheers,
Marvin
