Run alternating compute shader passes

Godot Version

4.6 beta 3

Question

Start reading from the next blockquote to skip all the project explanation

I’m working on a simple particle simulation using Verlet integration, running on the GPU with compute shaders.

The setup is pretty simple, currently with just 2 compute shader passes (plus a rendering pass, but that’s not relevant for this problem):

The first is a space binning pass

The second is a pass in which all collisions and dynamics are calculated for each particle

To improve the simulation quality, I implemented sub-stepping (running the simulation multiple times in a single frame), and I’ve done this by dispatching the second pass multiple times.

Actually, in my current implementation this second pass also has 2 sub-passes: one for forces and gravity integration, and one for collision detection between particles. So I dispatch the same compute shader 2 times for every sub-step of the simulation, with a parameter in the push constants that decides which sub-pass should run in each dispatch, implemented using a simple if statement.
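Roughly, the dispatch side of one sub-step looks like this (a sketch; the names and push constant layout are illustrative, and the pipeline and uniform set are assumed to be bound already):

```gdscript
const SUBPASS_INTEGRATE := 0
const SUBPASS_COLLIDE := 1

# Illustrative sketch: dispatch the same pipeline twice per sub-step,
# selecting the sub-pass through the push constants. Assumes the
# pipeline and uniform set are already bound on this compute list.
func dispatch_substep(rd: RenderingDevice, compute_list: int, groups: int) -> void:
    for subpass in [SUBPASS_INTEGRATE, SUBPASS_COLLIDE]:
        # Godot requires the push constant size to be a multiple of 16 bytes.
        var pc := PackedInt32Array([subpass, 0, 0, 0]).to_byte_array()
        rd.compute_list_set_push_constant(compute_list, pc, pc.size())
        rd.compute_list_dispatch(compute_list, groups, 1, 1)
        rd.compute_list_add_barrier(compute_list)
```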

Although it seems a bit unoptimized to have all the logic and memory cost needed for collision detection present in a simple pass such as the gravity integration pass, the simulation actually runs pretty well and can handle hundreds of thousands of particles while still being surprisingly stable. I would also be pretty curious to know how much of an impact having two sub-passes has on performance.

But the problem is that now I want to implement constraints between particles, using another pass that runs once per constraint rather than once per particle.

Start reading here to skip all of the project explanation

So what I need to do is run multiple alternating compute shader passes, that is to say, to run (excluding the binning pass; ignore this parenthesis if you skipped the explanation), all in the same compute list:

pass 1 → pass 2 → pass 3 → pass 1 → pass 2 → pass 3 → pass 1 → pass 2 → pass 3 → etc., running the same set of 3 distinct passes multiple times.

Now, from what I understand, the only way to do this is to:

  • bind compute pipeline for pass 1
  • bind uniform set for pass 1
  • set push constants for pass 1
  • dispatch
  • add barrier
  • bind compute pipeline for pass 2
  • bind uniform set for pass 2
  • set push constants for pass 2
  • dispatch
  • add barrier
  • bind compute pipeline for pass 3
  • bind uniform set for pass 3
  • set push constants for pass 3
  • dispatch
  • add barrier

And then to repeat all of this over and over again.
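If I understand the RenderingDevice API correctly, that sequence would look roughly like this in GDScript (a sketch; `pipelines`, `uniform_sets` and `push_constants` are hypothetical per-pass arrays built elsewhere):

```gdscript
# Illustrative sketch of the repeated bind/dispatch/barrier sequence.
# `pipelines`, `uniform_sets` and `push_constants` stand for per-pass
# data built elsewhere; all names are hypothetical.
var pipelines: Array[RID]
var uniform_sets: Array[RID]
var push_constants: Array[PackedByteArray]

func run_substeps(rd: RenderingDevice, substeps: int, groups: int) -> void:
    var compute_list := rd.compute_list_begin()
    for _step in substeps:
        for pass_index in 3:
            rd.compute_list_bind_compute_pipeline(compute_list, pipelines[pass_index])
            rd.compute_list_bind_uniform_set(compute_list, uniform_sets[pass_index], 0)
            var pc := push_constants[pass_index]
            rd.compute_list_set_push_constant(compute_list, pc, pc.size())
            rd.compute_list_dispatch(compute_list, groups, 1, 1)
            rd.compute_list_add_barrier(compute_list)
    rd.compute_list_end()
```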

I guess that would work, but it seems really overcomplicated to me, with a lot of room for optimization.

So my question is: is this the only way to do it, or is there a better way that avoids all those compute pipeline and uniform set bindings?

If you need to see the code, feel free to ask; and sorry for any grammatical errors

Thanks in advance for any tips!

Afaik, there’s no other way to do it. The main optimization that could be made is with descriptor set binding, but none of Vulkan’s descriptor-handling nuance is exposed through Godot’s RenderingDevice abstraction.


So sad, just as I feared. Thanks for the reply anyway!

I’ll still keep the topic open in case someone else has some advice on the topic


I played with compute shaders a lot over the past two months, mainly on a Navier-Stokes fluid solver that uses between 23 and 46 (or more) passes. Almost every pass needs a barrier because of the nature of Navier-Stokes, with texture ping-pong. It sounds similar to your algorithm.

I tried different options:

  • One pipeline per pass, each binding its own descriptor set. Basically using each pass like a function call.
  • One pipeline per pass, but with a single descriptor set shared across all pipelines (same uniforms/buffers). It is automatically rebound at each dispatch call.
  • An uber-shader containing all the passes, with pass selection via a push constant PASS_ID and a switch/case, and a descriptor set composed of input and output texture arrays.

I know the uber-shader style is usually considered bad practice, but in my case it was the most performant. I lack the tooling for good profiling (Intel iGPU, Linux, so no Intel profiler or NVIDIA Nsight, and RenderDoc doesn’t help much here). I think it’s because each pass is at most 3-4 lines of GLSL, so the compute itself takes no time compared with switching pipelines/descriptors.
And the bonus: it’s simpler to maintain, with only one pipeline and only one descriptor set; you just re-push the push constants (PASS_ID, input_texture_index(es), output_texture_index(es)).
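On the host side, the uber-shader loop then reduces to something roughly like this (a sketch; `uber_pipeline`, `shared_uniform_set` and the texture indices are illustrative names):

```gdscript
# Illustrative sketch of the uber-shader approach: one pipeline, one
# shared descriptor set, and only the push constants change per dispatch.
var uber_pipeline: RID
var shared_uniform_set: RID
var input_index := 0   # Index into the input texture array.
var output_index := 1  # Index into the output texture array.

func run_passes(rd: RenderingDevice, pass_ids: Array[int], groups: int) -> void:
    var compute_list := rd.compute_list_begin()
    rd.compute_list_bind_compute_pipeline(compute_list, uber_pipeline)
    rd.compute_list_bind_uniform_set(compute_list, shared_uniform_set, 0)
    for pass_id in pass_ids:
        # PASS_ID plus texture indices, padded to the 16-byte multiple
        # Godot requires for push constants.
        var pc := PackedInt32Array([pass_id, input_index, output_index, 0]).to_byte_array()
        rd.compute_list_set_push_constant(compute_list, pc, pc.size())
        rd.compute_list_dispatch(compute_list, groups, 1, 1)
        rd.compute_list_add_barrier(compute_list)
        # Ping-pong: this pass's output becomes the next pass's input.
        var tmp := input_index
        input_index = output_index
        output_index = tmp
    rd.compute_list_end()
```

The ping-pong swap at the end of each iteration is just one way to chain passes; adapt it to however your textures are arranged.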

Also, if you render something, run it on the main rendering device using RenderingServer.call_on_render_thread(), assign the output to a Texture2DRD, and use it directly with a Sprite2D or TextureRect, or as an input to your final gdshader. This way there is no data ping-pong between GPU and CPU RAM, only the push constants, which are less than 128 bytes by definition.
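For reference, the wiring I mean is roughly this (a sketch; `output_texture_rid` and `run_compute_frame` are illustrative placeholders for your own texture RID and dispatch code):

```gdscript
extends Node2D

# Illustrative sketch: show a compute-written texture in the scene with
# zero GPU->CPU readback. `output_texture_rid` must belong to the main
# rendering device (RenderingServer.get_rendering_device()).
var output_texture_rid: RID

func _ready() -> void:
    var tex := Texture2DRD.new()
    tex.texture_rd_rid = output_texture_rid
    $Sprite2D.texture = tex

func _process(_delta: float) -> void:
    # Queue the compute dispatches on the render thread each frame.
    RenderingServer.call_on_render_thread(run_compute_frame)

func run_compute_frame() -> void:
    pass  # Record and dispatch the compute lists here.
```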

RenderingServer.create_local_rendering_device() is fine for things that need to be fetched back into CPU RAM, but that can also be done on the main rendering device. Local rendering devices are not like CPU threads.
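For contrast, a minimal sketch of the local-device flow (assuming `result_buffer_rid` was created on that device, e.g. with `storage_buffer_create()`):

```gdscript
# Illustrative sketch: a local rendering device lets you submit() and
# sync() manually, then read results back into CPU RAM.
var rd := RenderingServer.create_local_rendering_device()
var result_buffer_rid: RID  # Created elsewhere with rd.storage_buffer_create().

func fetch_results() -> PackedByteArray:
    # ... record and end a compute list here ...
    rd.submit()
    rd.sync()  # Blocks until the GPU work has finished.
    return rd.buffer_get_data(result_buffer_rid)
```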

I’m far from being an expert, but I hope I helped somewhere :slight_smile:


Thanks a lot for the reply! I’ll definitely try all of those solutions and pick the most performant one. In the future, though, I’d like to learn how to build the engine from source, create my own branch, and then try to fix the problem in the source code. But I have no idea how to do that yet, nor how difficult it will be.

And for the rendering part, none of the textures ever ends up in CPU RAM. Basically, I have one texture created with the main RenderingDevice and rendered using a Texture2DRD, and another texture created with the local RenderingDevice using RenderingDevice.texture_create_from_extension() that I update after all the particle dynamics, so only when necessary. This is nice because I never call rd.texture_get_data() and in general I don’t have that CPU RAM bottleneck, but I don’t really know whether it is actually more performant than doing the texture update on the main rendering thread. If you’d like to take a look at the code, feel free to ask; it is actually quite simple.
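For context, the sharing trick I use looks roughly like this (a simplified sketch; the format, size and usage flags are illustrative and must match how the texture was created on the main device):

```gdscript
var local_rd := RenderingServer.create_local_rendering_device()
var main_texture_rid: RID  # Texture created on the main rendering device.

# Illustrative sketch: wrap a texture owned by the main rendering device
# so the local device can write to it directly. Format, size and usage
# flags must match how the texture was originally created.
func share_texture(width: int, height: int) -> RID:
    var main_rd := RenderingServer.get_rendering_device()
    var native_handle := main_rd.get_driver_resource(
        RenderingDevice.DRIVER_RESOURCE_TEXTURE, main_texture_rid, 0)
    return local_rd.texture_create_from_extension(
        RenderingDevice.TEXTURE_TYPE_2D,
        RenderingDevice.DATA_FORMAT_R32G32B32A32_SFLOAT,
        RenderingDevice.TEXTURE_SAMPLES_1,
        RenderingDevice.TEXTURE_USAGE_STORAGE_BIT | RenderingDevice.TEXTURE_USAGE_SAMPLING_BIT,
        native_handle, width, height, 1, 1)
```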

Anyway, I’ll do some research to understand more deeply how Godot compute shaders work and related topics.

Again, thanks a lot for the advice!

Yeah, texture_create_from_extension() is the trick found on the forum, but it wasn’t working for me (Godot triggered assert messages). I was on 4.5.1; maybe some rules were relaxed for 4.6, or maybe my driver can’t do it, idk. Using a local device is more of a necessity when you want to submit()/sync() yourself, and its data is not shared with the main device. I found an interesting comment about that here: Getting image from Texture2DRD causes a crash · Issue #111030 · godotengine/godot · GitHub

Vulkan compatibility for compute across OSes, GPUs, driver versions, and mobile-or-not can be a mess. I suggest some reading straight from the source: the Vulkan spec and the GLSL spec. I also used some LLMs as a search engine, because GPUs/shaders/GLSL are a vast topic with 20+ years of history.

I’m not sure compiling your own branch will help. The Godot RenderingDevice API is very close to the Vulkan API; some Vulkan features are missing, but I guess that’s for maximum compatibility. Also, Godot does the Vulkan plumbing itself, which simplifies things a lot. Anyway, if you want to compile something, maybe a C++ GDExtension could be enough, idk; I haven’t tried that path yet (but it’s planned).

I will be happy to take a look at your code. Maybe I can suggest something :slight_smile: