How to pass information from one compute shader to another without leaving the GPU?

godot 4.5

Hello,

Short question: how am I supposed to pass information from one shader to another without needing to sync() with the CPU in GDScript?

More info: I am using GDScript. As the title says, I currently have 2 compute shaders. The first outputs a very large array into an SSBO, which is taken as input by the second for processing.

The problem: this means I have to sync() the data with the CPU, which TANKS performance (it's 20x slower because of the sync).

I have seen this partial answer by luxagenic, but I'm not fully sure I understand it; it seems roundabout, and it seems to be about passing data between a compute and a fragment shader, maybe? My array being of variable size, it seems a bit unwieldy to use that. Is there a "proper" way to pass information between shaders? - #6 by luxagenic

Why would you need sync() if no cpu is involved? Just maintain a shared buffer that’s bound to both shaders, and execute them one after another. Afaik, RenderingDevice inserts a barrier between them automatically.
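A minimal sketch of that shared-buffer setup, assuming a local RenderingDevice (`shader_a`, `shader_b`, and the buffer contents are placeholder names, not from this thread):

```gdscript
# Sketch: one storage buffer shared by two compute shaders.
# shader_a / shader_b are RIDs from rd.shader_create_from_spirv().
var rd := RenderingServer.create_local_rendering_device()

# Create the buffer once; both shaders read/write the same GPU memory.
var shared_bytes := PackedFloat32Array([0.0, 0.0, 0.0, 0.0]).to_byte_array()
var shared_buffer := rd.storage_buffer_create(shared_bytes.size(), shared_bytes)

# Bind the same buffer RID into a uniform set for each shader.
var uniform_a := RDUniform.new()
uniform_a.uniform_type = RenderingDevice.UNIFORM_TYPE_STORAGE_BUFFER
uniform_a.binding = 0
uniform_a.add_id(shared_buffer)
var set_a := rd.uniform_set_create([uniform_a], shader_a, 0)

var uniform_b := RDUniform.new()
uniform_b.uniform_type = RenderingDevice.UNIFORM_TYPE_STORAGE_BUFFER
uniform_b.binding = 0
uniform_b.add_id(shared_buffer)
var set_b := rd.uniform_set_create([uniform_b], shader_b, 0)
```

The data never leaves the GPU; only the final result would need a buffer_get_data() after a sync().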

The shader does not work anymore when I do not sync() (everything goes black, as if it crashed). It throws a bunch of syncing errors:

Am I supposed to have one rendering device for each shader or something?

Here is the error:

device already submitted, call sync to wait until done.
<C++ Error> Condition “local_device_processing” is true.
<C++ Source> servers/rendering/rendering_device.cpp:6294 @ submit()

relevant code snippets. cast_shadow_uniform is the buffer that passes the data:

cast_shadow_uniform = RDUniform.new()
cast_shadow_uniform.uniform_type = RenderingDevice.UNIFORM_TYPE_STORAGE_BUFFER
cast_shadow_uniform.binding = 4
cast_shadow_uniform.add_id(cast_shadow_buffer)

cast_shadow_uniform_set = rd.uniform_set_create([position_uniform, occluder_uniform, occluder_size_uniform, occluder_opacity_uniform, cast_shadow_uniform], cast_shadow_shader, 0)

cast_shadow_compute_list = rd.compute_list_begin()
rd.compute_list_bind_compute_pipeline(cast_shadow_compute_list, cast_shadow_pipeline) 
rd.compute_list_bind_uniform_set(cast_shadow_compute_list, cast_shadow_uniform_set, 0)
var res : int = occluders_total_size / 64 + 1 # one extra workgroup to cover the remainder
rd.compute_list_dispatch(cast_shadow_compute_list, res, 1, 1)
rd.compute_list_end()
rd.submit()
rd.sync() #--------------
cast_shadow_uniform.binding = 8

output_uniform = RDUniform.new()
output_uniform.uniform_type = RenderingDevice.UNIFORM_TYPE_STORAGE_BUFFER
output_uniform.binding = 9
output_uniform.add_id(output_buffer)

uniform_set = rd.uniform_set_create([input_uniform, direction_uniform, angle_uniform, color_uniform, intensity_uniform, falloff_factor_uniform,
position_uniform, triangularized_occluder_uniform, cast_shadow_uniform, output_uniform], shader, 0)

print("Uniform + buffers set up: ", float(Time.get_ticks_usec() - time_ms)/1000)
time_ms = Time.get_ticks_usec()

compute_list = rd.compute_list_begin()
rd.compute_list_bind_compute_pipeline(compute_list, pipeline)
rd.compute_list_bind_uniform_set(compute_list, uniform_set, 0)
rd.compute_list_dispatch(compute_list, work_group_width*work_group_height, 1, 1)
rd.compute_list_end()
rd.submit()
rd.sync()

Try running both shaders inside the same compute list. Here you’ll likely need to insert the barrier manually.

If you need the shaders to run per frame and sync() is taking too long, your shaders are doing too much work per frame anyway though.

I didn’t know you could. How would I do that? Is it going to look like this?

compute_list = rd.compute_list_begin()
rd.compute_list_bind_compute_pipeline(compute_list, pipeline_1) 
rd.compute_list_bind_uniform_set(compute_list, uniform_set_1, 0)
rd.compute_list_dispatch(compute_list, x, y, z)

rd.compute_list_add_barrier(compute_list)

rd.compute_list_bind_compute_pipeline(compute_list, pipeline_2) 
rd.compute_list_bind_uniform_set(compute_list, uniform_set_2, 0)
rd.compute_list_dispatch(compute_list, x, y, z)

rd.compute_list_end()
rd.submit()
rd.sync()

(the final sync is fine and dandy btw, it’s just the sync between shaders that’s a problem)

You may need to call compute_list_add_barrier() after the first dispatch if the second shader depends on the results of the first shader.

If the first sync() is taking a lot of time that means that the first shader is taking a lot of time. Putting them both into the same list likely won’t change that but they’ll at least run one after another as fast as possible.

You can just run the whole thing less often.

Now that I’ve edited it, would you say the snippet I wrote should work? I really am lost as to how you run multiple shaders in one compute list.

compute_list_add_barrier() requires an argument. Check the reference.

Sorry, I think I’m not being clear. I’m asking if the rest of the code (besides the barrier) is how you’d launch 2 shaders sequentially in a compute list.

Run it and see if it works.


Run it and see if it works.

My bad, I should have just run it and seen, indeed. It worked.

You were right that the sync() being that slow was a symptom of the shader being slow. Now I have to improve this horrendously slow shader.

Thank you in any case!


You have opportunities to optimize at several places:

  • the shader code itself
  • adjust the number of invocations per workgroup to be optimal for your target hardware
  • halve or quarter the total invocation resolution (if applicable to your use case)
  • submit the compute list and then wait N frames before calling sync(). Then repeat. The compute results will update every Nth frame but that may be fine for many use cases.
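The last option above could be sketched roughly like this (assuming a `_process`-driven node; `N`, `dispatch_compute()`, and `use_results()` are made-up placeholder names):

```gdscript
# Sketch of deferred readback: submit once, then wait N frames before
# calling sync(), so the CPU rarely stalls on the GPU.
const N := 4                 # read results back every Nth frame
var frames_waited := 0
var work_in_flight := false

func _process(_delta: float) -> void:
	if not work_in_flight:
		dispatch_compute()   # record compute lists, then rd.submit()
		work_in_flight = true
		frames_waited = 0
	else:
		frames_waited += 1
		if frames_waited >= N:
			rd.sync()        # GPU has had N frames; the wait should be short
			use_results()    # e.g. rd.buffer_get_data(output_buffer)
			work_in_flight = false
```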
  • the shader code itself
  • adjust the number of invocations per workgroup to be optimal for your target hardware
I’m looking into those two; they’re the best paths. The shader is a lighting system: the rasterizer can definitely be upgraded with batching and so on, and I think there is too much work per invocation right now. The other two options are really for if nothing else works. Thanks again.
Edit, a few hours later: I already massively upped performance by adjusting the workload per invocation.
