How to pass information from one compute shader to another without leaving the GPU?

godot 4.5

Hello,

Short question: how am I supposed to pass information from one shader to another without needing to sync() with the CPU in GDScript?

More info: I am using GDScript. As the title says, I currently have 2 compute shaders. The first outputs a very large array into an SSBO, which is taken as input by the second for processing.

The problem: this means I have to sync() the data with the CPU, which TANKS performance (it's 20x slower because of the sync).

I have seen this partial answer by luxagenic, but I'm not fully sure I understand it; it seems roundabout, and it seems to be about passing data between a compute and a fragment shader, maybe? My array being of variable size, it seems a bit unwieldy to use that. Is there a "proper" way to pass information between shaders? - #6 by luxagenic

Why would you need sync() if no cpu is involved? Just maintain a shared buffer that’s bound to both shaders, and execute them one after another. Afaik, RenderingDevice inserts a barrier between them automatically.
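A minimal sketch of that shared-buffer setup, assuming a local RenderingDevice (`shader_a`, `shader_b`, and the buffer contents are placeholder names, not from this thread):

```gdscript
# Sketch: one storage buffer shared by two compute shaders.
# shader_a / shader_b are RIDs from rd.shader_create_from_spirv().
var rd := RenderingServer.create_local_rendering_device()

# Create the buffer once; both shaders read/write the same GPU memory.
var shared_bytes := PackedFloat32Array([0.0, 0.0, 0.0, 0.0]).to_byte_array()
var shared_buffer := rd.storage_buffer_create(shared_bytes.size(), shared_bytes)

# Bind the same buffer RID into a uniform set for each shader.
var uniform_a := RDUniform.new()
uniform_a.uniform_type = RenderingDevice.UNIFORM_TYPE_STORAGE_BUFFER
uniform_a.binding = 0
uniform_a.add_id(shared_buffer)
var set_a := rd.uniform_set_create([uniform_a], shader_a, 0)

var uniform_b := RDUniform.new()
uniform_b.uniform_type = RenderingDevice.UNIFORM_TYPE_STORAGE_BUFFER
uniform_b.binding = 0
uniform_b.add_id(shared_buffer)
var set_b := rd.uniform_set_create([uniform_b], shader_b, 0)
```

The data never leaves the GPU; only the final result would need a buffer_get_data() after a sync().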

The shader does not work anymore when I do not sync() (everything goes black, as if it crashed). It throws a bunch of syncing errors:

Am I supposed to have one rendering device for each shader or something?

Here is the error:

device already submitted, call sync to wait until done.
<C++ Error> Condition “local_device_processing” is true.
<C++ Source> servers/rendering/rendering_device.cpp:6294 @ submit()

relevant code snippets. cast_shadow_uniform is the buffer that passes the data:

cast_shadow_uniform = RDUniform.new()
cast_shadow_uniform.uniform_type = RenderingDevice.UNIFORM_TYPE_STORAGE_BUFFER
cast_shadow_uniform.binding = 4
cast_shadow_uniform.add_id(cast_shadow_buffer)

cast_shadow_uniform_set = rd.uniform_set_create([position_uniform, occluder_uniform, occluder_size_uniform, occluder_opacity_uniform, cast_shadow_uniform], cast_shadow_shader, 0)

cast_shadow_compute_list = rd.compute_list_begin()
rd.compute_list_bind_compute_pipeline(cast_shadow_compute_list, cast_shadow_pipeline) 
rd.compute_list_bind_uniform_set(cast_shadow_compute_list, cast_shadow_uniform_set, 0)
var res : int = occluders_total_size / 64 + 1 # one extra workgroup to cover the remainder
rd.compute_list_dispatch(cast_shadow_compute_list, res, 1, 1)
rd.compute_list_end()
rd.submit()
rd.sync() #--------------
cast_shadow_uniform.binding = 8

output_uniform = RDUniform.new()
output_uniform.uniform_type = RenderingDevice.UNIFORM_TYPE_STORAGE_BUFFER
output_uniform.binding = 9
output_uniform.add_id(output_buffer)

uniform_set = rd.uniform_set_create([input_uniform, direction_uniform, angle_uniform, color_uniform, intensity_uniform, falloff_factor_uniform,
position_uniform, triangularized_occluder_uniform, cast_shadow_uniform, output_uniform], shader, 0)

print("Uniform + buffers set up: ", float(Time.get_ticks_usec() - time_ms)/1000)
time_ms = Time.get_ticks_usec()

compute_list = rd.compute_list_begin()
rd.compute_list_bind_compute_pipeline(compute_list, pipeline)
rd.compute_list_bind_uniform_set(compute_list, uniform_set, 0)
rd.compute_list_dispatch(compute_list, work_group_width*work_group_height, 1, 1)
rd.compute_list_end()
rd.submit()
rd.sync()

Try running both shaders inside the same compute list. Here you’ll likely need to insert the barrier manually.

If you need the shaders to run per frame and sync() is taking too long, your shaders are doing too much work per frame anyway though.

I didn’t know you could. How would I do that? Is it going to look like this?

compute_list = rd.compute_list_begin()
rd.compute_list_bind_compute_pipeline(compute_list, pipeline_1) 
rd.compute_list_bind_uniform_set(compute_list, uniform_set_1, 0)
rd.compute_list_dispatch(compute_list, x, y, z)

rd.compute_list_add_barrier(compute_list)

rd.compute_list_bind_compute_pipeline(compute_list, pipeline_2) 
rd.compute_list_bind_uniform_set(compute_list, uniform_set_2, 0)
rd.compute_list_dispatch(compute_list, x, y, z)

rd.compute_list_end()
rd.submit()
rd.sync()

(the final sync is fine and dandy btw, it’s just the sync between shaders that’s a problem)

You may need to call compute_list_add_barrier() after the first dispatch if the second shader depends on the results of the first shader.

If the first sync() is taking a lot of time that means that the first shader is taking a lot of time. Putting them both into the same list likely won’t change that but they’ll at least run one after another as fast as possible.

You can just run the whole thing less often.

Now that I’ve edited it, would you say the snippet I wrote should work? I really am lost as to how you run multiple shaders in one compute list.

compute_list_add_barrier() requires an argument. Check the reference.

Sorry, I think I’m not being clear. I’m asking if the rest of the code (besides the barrier) is how you’d launch 2 shaders sequentially in a compute list.

Run it and see if it works.


Run it and see if it works.

My bad, I should have just run it and seen, indeed. It worked.

You were right that the sync() being that slow was a symptom of the shader being slow. Now I have to improve this horrendously slow shader.

Thank you in any case!


You have opportunities to optimize at several places:

  • the shader code itself
  • adjust the number of invocations per workgroup to be optimal for your target hardware
  • halve or quarter the total invocation resolution (if applicable to your use case)
  • submit the compute list and then wait N frames before calling sync(). Then repeat. The compute results will update every Nth frame but that may be fine for many use cases.
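The last option above could be sketched roughly like this (assuming a `_process`-driven node; `N`, `dispatch_compute()`, and `use_results()` are made-up placeholder names):

```gdscript
# Sketch of deferred readback: submit once, then wait N frames before
# calling sync(), so the CPU rarely stalls on the GPU.
const N := 4                 # read results back every Nth frame
var frames_waited := 0
var work_in_flight := false

func _process(_delta: float) -> void:
	if not work_in_flight:
		dispatch_compute()   # record compute lists, then rd.submit()
		work_in_flight = true
		frames_waited = 0
	else:
		frames_waited += 1
		if frames_waited >= N:
			rd.sync()        # GPU has had N frames; the wait should be short
			use_results()    # e.g. rd.buffer_get_data(output_buffer)
			work_in_flight = false
```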
  • the shader code itself
  • adjust the number of invocations per workgroup to be optimal for your target hardware
I’m looking into those two; they’re the best paths. The shader is a lighting system: the rasterizer can definitely be upgraded with batching and so on, and I think there is too much work per invocation right now. The other two options are really for if nothing else works. Thanks again.
Edit, a few hours later: I already massively upped performance by adjusting the workload per invocation.
