### Godot version
4.2.x master [16a93563bfd3b02ca0a8f6df2026f3a3217f5571]
### System information
Godot v4.2.dev (262d1eaa6) - Ubuntu 20.04.6 LTS (Focal Fossa) - X11 - Vulkan (Forward+) - dedicated AMD Radeon RX 6800 XT - AMD Ryzen 9 5900X 12-Core Processor (24 Threads)
### Issue description
**"OOOFFF"**
That's the only way I can describe this ticket.
It starts as a simple CPU-to-GPU synchronization bug, but there is also another set of wrong assumptions and bugs which, when combined, make synchronization end up being "more or less" correct.
There are so many things going on I don't know where to start, so I'm going to write this in the order in which I discovered these problems.
At worst this is just a waste of VRAM; at best, many of those DEVICE_LOST errors suddenly start making sense.
# A lone validation error that shouldn't have happened
It all started while trying to repro #80286 on AMD, with a random validation error saying `vkDestroyRenderPass` got called on a render pass that was still in use; an error that I could never reproduce again.
So I went looking for `vkDestroyRenderPass` calls in the codebase, and luckily there weren't many.
I'm still evaluating everything to understand what's going on, but it is clear it's **wrong** (there's no way to sugarcoat it).
That's when I noticed the function `RenderingDeviceVulkan::swap_buffers` does the following:
```cpp
void RenderingDeviceVulkan::swap_buffers() {
	context->swap_buffers();
	frame = (frame + 1) % frame_count;
	_begin_frame();
}

// And inside _begin_frame() we call _free_pending_resources(frame):
void RenderingDeviceVulkan::_free_pending_resources(int p_frame) {
	// Free in dependency usage order, so nothing weird happens.
	// Pipelines.
	while (frames[p_frame].render_pipelines_to_dispose_of.front()) {
		RenderPipeline *pipeline = &frames[p_frame].render_pipelines_to_dispose_of.front()->get();
		vkDestroyPipeline(device, pipeline->pipeline, nullptr);
		frames[p_frame].render_pipelines_to_dispose_of.pop_front();
	}
	// ...and lots of other vkDestroy* calls.
}
```
**Where's the wait?**
That's wrong. The code should've been something along the lines of:
```cpp
context->swap_buffers();
frame = (frame + 1) % frame_count;
// Don't free or overwrite anything in frames[frame] until the GPU is done with it.
vkWaitForFences(device, 1, &fences[frame], VK_TRUE, UINT64_MAX);
_begin_frame();
```
I was about to shout 'Eureka!', but I went into `swap_buffers()` just in case there was already a wait there (although that would be some weird math, because it would have to be `vkWaitForFences(device, 1, &fences[frame + 1])`).
That's when it got worse.
# An inverted wait
I couldn't find anything in `swap_buffers`, which left me wondering: where is the `vkWaitForFences` call???
I found it in [VulkanContext::prepare_buffers](https://github.com/godotengine/godot/blob/16a93563bfd3b02ca0a8f6df2026f3a3217f5571/drivers/vulkan/vulkan_context.cpp#L2265-L2266). But sadly it performs the following:
```cpp
// Wait on the fence for frame_index, then immediately reuse it for the new submit...
vkWaitForFences(device, 1, &fences[frame_index], VK_TRUE, UINT64_MAX);
vkResetFences(device, 1, &fences[frame_index]);
vkQueueSubmit(graphics_queue, 1, &submit_info, fences[frame_index]);
// ...and only then advance frame_index. The CPU writes for the next frame
// happen before the next wait, so nothing protects them.
frame_index += 1;
frame_index %= FRAME_LAG;
```
**This is wrong.**
In case someone wonders: for data that is used to communicate between CPU and GPU, we are supposed to have at least 2 pools of memory:
CPU writes to pool 0 while the GPU is reading from pool 1, and CPU writes to pool 1 while the GPU is reading from pool 0. If the GPU is still reading from pool 0 and the CPU now wants to write to pool 0 (it's done writing to 1), it needs to stall and wait.
This is a standard double buffer scheme. A triple buffer scheme is the same, but with an extra pool to minimize potential stalls.
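For reference, a correct double-buffered frame loop looks roughly like this (a minimal sketch; `render_one_frame` and its parameters are names I made up for illustration, and the fences are assumed to be created with `VK_FENCE_CREATE_SIGNALED_BIT` so the very first frames don't deadlock):
```cpp
#include <vulkan/vulkan.h>

constexpr uint32_t FRAME_LAG = 2; // Two pools: CPU writes one while the GPU reads the other.

// Hypothetical helper, not Godot code.
void render_one_frame(VkDevice device, VkQueue graphics_queue,
		VkFence fences[FRAME_LAG], const VkSubmitInfo &submit_info,
		uint32_t &frame_index) {
	// Stall until the GPU has finished the submission that last used
	// pool `frame_index`, so the CPU is free to overwrite it.
	vkWaitForFences(device, 1, &fences[frame_index], VK_TRUE, UINT64_MAX);
	vkResetFences(device, 1, &fences[frame_index]);

	// ... CPU fills the buffers of pool `frame_index` here ...

	// Ask the GPU to signal the fence once this frame's work is done.
	vkQueueSubmit(graphics_queue, 1, &submit_info, fences[frame_index]);

	frame_index = (frame_index + 1) % FRAME_LAG;
}
```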
Unfortunately, with Godot's current code, the following sequence is what will happen after running for several frames.
| Action | Remarks |
|-----------------------|---------------------------------------------------|
| vkWaitForFences on 0 | |
| vkResetFences on 0 | |
| vkQueueSubmit on 0 | |
| frame_index = 1; | |
| Fill gpu buffers on 1 | |
| vkWaitForFences on 1 | |
| vkResetFences on 1 | |
| vkQueueSubmit on 1 | |
| frame_index = 0; | |
| Fill gpu buffers on 0 | Oops!! We never waited to see if fence[0] is done |
| vkWaitForFences on 0 | |
| vkResetFences on 0 | |
| vkQueueSubmit on 0 | |
And thus **we're not actually synchronizing anything!** The CPU keeps writing to data that may be in use by the GPU.
A simple possible solution is the following:
```cpp
vkResetFences(device, 1, &fences[frame_index]);
vkQueueSubmit(graphics_queue, 1, &submit_info, fences[frame_index]);
frame_index += 1;
frame_index %= FRAME_LAG;
// Wait on the *other* fence before the CPU touches that pool again.
vkWaitForFences(device, 1, &fences[frame_index], VK_TRUE, UINT64_MAX);
```
For best performance, it's often suggested to delay the `vkWaitForFences()` as much as possible (i.e. until we actually need to write to a buffer from the CPU, destroy a resource, etc.). But given the chaotic complexity of Godot's code, it would seem this is the only safe solution that won't result in more hazards.
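For completeness, the delayed variant would look something along these lines (a hypothetical sketch with made-up names like `FrameSync` and `ensure_pool_available`, not a proposal for Godot):
```cpp
#include <vulkan/vulkan.h>

// After submitting and advancing frame_index, we only mark the wait as
// pending; the actual stall happens lazily, right before the CPU first
// touches that pool again.
struct FrameSync {
	VkFence fence = VK_NULL_HANDLE;
	bool wait_pending = false;
};

void ensure_pool_available(VkDevice device, FrameSync &sync) {
	if (sync.wait_pending) {
		vkWaitForFences(device, 1, &sync.fence, VK_TRUE, UINT64_MAX);
		vkResetFences(device, 1, &sync.fence);
		sync.wait_pending = false;
	}
}
```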
Sadly, the report doesn't end here. There is more. Godot doesn't have 2 pools, it's got 4.
# Wrong assumptions on Swapchains
`RenderingDeviceVulkan::frame_count` is set to the following:
```cpp
// Always need one extra to ensure it's unused at any time, without having to use a fence for this.
frame_count = p_context->get_swapchain_image_count() + 1;
```
> **Note:** frame_count is almost always 4
That comment is HUGE. **This is a false assumption**.
Neither `vkAcquireNextImageKHR` nor `vkQueuePresentKHR` has an obligation to synchronize the CPU.
They use VkSemaphores, which synchronize GPU work with other GPU work (e.g. so that a window isn't presented to the screen while the GPU is still working on it, producing a similar effect to when VSync is off).
However in practice most (all???) drivers do stall the CPU due to limitations in SW or HW architectures. Linux+X11 is notorious for stalling inside `vkAcquireNextImageKHR`, but other drivers prefer to stall inside `vkQueuePresentKHR` instead.
Since we are debunking false assumptions, we also don't have the guarantee that `vkAcquireNextImageKHR` will return swapchain indices in order (e.g. 0, 1, 2, 3, 0, 1, 2, 3). It could return them out of order or even return the same set multiple times (e.g. 2, 0, 2, 0, 1).
We could set frame_count as high as 30, and we still wouldn't have the guarantee that CPU and GPU work will be synchronized. We *need* a fence.
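To make the distinction concrete, here's roughly which primitive synchronizes what (an illustrative fragment; `image_acquired_semaphore` and the surrounding setup are assumed to exist, and error handling is omitted):
```cpp
uint32_t image_index = 0;
// The semaphore only orders GPU work against the presentation engine.
// Nothing here blocks the CPU, and image_index may arrive out of order.
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
		image_acquired_semaphore, VK_NULL_HANDLE, &image_index);

// Only a fence handed to vkQueueSubmit lets the CPU find out when the GPU
// is actually done with a frame's buffers...
vkQueueSubmit(graphics_queue, 1, &submit_info, fences[frame_index]);
// ...and only waiting on it actually stalls the CPU (later, when the pool
// must be reused):
vkWaitForFences(device, 1, &fences[frame_index], VK_TRUE, UINT64_MAX);
```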
## Swapchains should not dictate the number of pools
As I said earlier, on a double buffer scheme we should have 2 pools of memory (`FRAME_LAG = 2`).
But because of the wrong assumption on swapchains, Godot keeps 4 pools of memory (because `frame_count = 4` on most machines).
The swapchain image count should not be used for anything other than tracking swapchain-related data, like the VkRenderPass.
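In other words, the two counts should size two unrelated sets of data. A hypothetical layout (names made up for illustration):
```cpp
#include <vulkan/vulkan.h>

#include <vector>

constexpr uint32_t FRAME_LAG = 2; // Frames in flight, guarded by fences.

struct PerFramePool { // Sized by FRAME_LAG: everything the CPU recycles.
	VkFence fence;
	VkCommandPool command_pool;
	// ...staging buffers, lists of resources pending destruction, etc.
};

struct PerSwapchainImage { // Sized by the count from vkGetSwapchainImagesKHR.
	VkImageView view;
	VkFramebuffer framebuffer;
};

PerFramePool frames[FRAME_LAG];
std::vector<PerSwapchainImage> swapchain_images;
```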
# Combine these 2 bugs and it works
So we have 2 competing implementations and 2 bugs:
- `VulkanContext::prepare_buffers` uses `vkWaitForFences` but the calls are out of order
- This is what's causing the CPU to wait for the GPU.
- This *should* be used to track the 2 pools.
- `VulkanContext` uses the swapchain image count to keep 4 pools instead of 2.
I've run spreadsheet simulations of what happens when our current `vkWaitForFences` code runs with just 2 fences (`FRAME_LAG = 2`) and `frame_count = 4`, and it would seem that this just happens to synchronize correctly, *a happy accident*.
Because Godot will stall when starting the 3rd frame, and it keeps 4 pools of data, the data should be safe for access on the 3rd frame.
However, the complexity of all this is too large, so it's possible that I missed something and we may still get a race condition (that lone `vkDestroyRenderPass` error that sparked this investigation must've come from somewhere!), just a rare one.
I also noticed that if, for some God-forsaken reason, `frame_count = 3` or lower, then we ***will*** have race conditions.
I analyzed the surrounding code behind the `frame_count` setup, and this is possible, but improbable.
We set `desiredNumOfSwapchainImages = 3`, and I don't think there are implementations with `surfCapabilities.maxImageCount = 2`; however, I could be wrong. Also, if the driver returns 2 images from `vkGetSwapchainImagesKHR` despite us asking for at least 3 (which AFAIK would be a driver bug), we're still in trouble.
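For context, the negotiation follows the usual Vulkan pattern (a simplified sketch from memory, not a verbatim quote of Godot's code):
```cpp
uint32_t desiredNumOfSwapchainImages = 3;
if (desiredNumOfSwapchainImages < surfCapabilities.minImageCount) {
	desiredNumOfSwapchainImages = surfCapabilities.minImageCount;
}
// maxImageCount == 0 means the surface imposes no upper limit.
if ((surfCapabilities.maxImageCount > 0) &&
		(desiredNumOfSwapchainImages > surfCapabilities.maxImageCount)) {
	desiredNumOfSwapchainImages = surfCapabilities.maxImageCount;
}
// A maxImageCount of 2 would clamp us down to 2 images, i.e. frame_count = 3.
```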
However it would make sense that those users who can repro `DEVICE_LOST` with high frequency happen to have `frame_count <= 3`.
I didn't analyze what happens when `separate_present_queue == true`. In fact, this path is so rare that I don't think this code has been tested at all, and we have no way yet to verify whether the bug reports involved it.
### Steps to reproduce
This bug is rare to repro, but it should be much easier to trigger if we force `frame_count` to a value lower than 4.
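For example (a local hack for testing only, not a fix), one could hard-code the value where `frame_count` is computed in `RenderingDeviceVulkan`:
```cpp
// Original:
// frame_count = p_context->get_swapchain_image_count() + 1;
// Hypothetical hack to make the race much more likely:
frame_count = 2;
```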
### Minimal reproduction project
None.