Multithreading trouble, crashes on exit, how to safely load files on the fly

Godot Version

4.3

Hello godot friends!
For context, I develop this tool over here: https://forum.godotengine.org/t/gdimagedb-a-tag-based-media-database-for-your-desktop/

Something that tool has to do a lot is to load image files from the OS filesystem on the fly.
Ideally that should happen asynchronously to not lock up the GUI and to allow background preloads.
Unfortunately I have been running into a lot of issues writing async code with Godot.

The general problem:
I have run into many different scenarios that cause random crashes. These seem to be due to indirect interactions between threads and the scene tree, which are apparently completely forbidden.
Think instantiating a control node within another thread, then adding it to the tree in the main thread.
This would randomly (fairly rarely) crash outright, and with near 100% reproducibility crash on exit.
I have resolved to only load resources in threads, which appears to be mostly safe - except when it isn’t.

The question:
How do I asynchronously load and cache images on the fly without causing crashes on exit?

Apparently I can’t use the WorkerThreadPool, since there is no thread-safe way to notify my code when a task is finished. Emitting a signal from a thread (even via deferred calls) seems to cause crashes on exit - maybe. I am not actually sure what exactly causes the crash.
The documentation says to use wait_for_task_completion, but that cannot be the whole story, since the WorkerThreadPool removes finished tasks and wait_for_task_completion then returns INVALID_PARAMETER for their IDs.

Apparently I also can’t use regular Threads - at least not if I want to cache the images (which is required), as they also cause crashes on exit.

It is very unclear to me what actually is and isn’t thread-safe in Godot.

Edit, more information:
These crashes happen completely outside of my code.
I have rewritten my entire loading and instantiation code in C# since I am more experienced in that language - same result.
I have even built and run debug builds of Godot to try to get to the bottom of the problem, only to end up miles deep in invalid memory and completely out of my depth when it comes to C++ debugging.


Use mutexes when caching on shared resources.
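As a minimal sketch of that idea in GDScript (the names `_cache` and `_cache_mutex` are placeholders, not from the project):

```gdscript
var _cache : Dictionary = {}
var _cache_mutex : Mutex = Mutex.new()

func cache_put(key : int, texture : ImageTexture) -> void:
	_cache_mutex.lock()
	_cache[key] = texture
	_cache_mutex.unlock()

func cache_get(key : int) -> ImageTexture:
	_cache_mutex.lock()
	var result : ImageTexture = _cache.get(key) # null if missing
	_cache_mutex.unlock()
	return result
```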


I found the cause of the problem.

When you hand a refcounted object to a thread, the refcount is incremented (as it should be), but it is not decremented when the thread finishes up and is deleted.
This causes segfaults on exit, presumably because Godot attempts to free all referenced objects - including threads that no longer exist.

This looks like a bug to me - I am going to file a report for it on github.

Edit, in case someone finds this in the future:
It appears that the failure to decrement is not in the threadproc, but in the parameter binding.
Reported here.


That is an interesting find. But it is also not a realistic scenario, as you hold onto a reference in the main thread and block until the thread finishes.

I would want to duplicate the resource so the reference in the main thread can change without thread-safety risks, or never hold a reference from the main thread at all.

If resources are shared, you need to guard them with a mutex.

By the way, I tried your script locally and did not experience a crash. I think something is probably happening with the Callable copy created by bind, which stays on the function stack until the _ready function ends.

The same thing happens with normal methods:

extends Node


func _ready():
	var test = RefCounted.new() # ref-count +1
	print(test.get_reference_count()) # prints 1
	var callme = methodproc.bind(test) # makes a copy of the method with the resource bound, ref-count +1
	print(test.get_reference_count()) # prints 2
	callme.call()
	print(test.get_reference_count()) # prints 2


func methodproc(obj : RefCounted): # ref-count +1 for the duration of the call
	print("method ", obj.get_reference_count()) # prints 3
	# ref-count -1 at end of function

I am well aware of that.
The code I put into the bug report is a minimal reproducible sample, not a snippet from my actual code.

My program is a bit too large and complicated to serve as a good example.

Did you run the code in the editor or in an exported executable?
The crash only occurs in an exported executable.

I did something wrong, please check my comment on the github issue.
I assume that the editor/debugger either performs different cleanup operations or just holds on to the memory until it is closed itself.

–
If you want to see the problem in context you can check out the source of my software over here: GitHub - SpiceOctopus/gdimagedb: A simple media database implemented with godot.

The relevant files are src/ui/media_viewer/media_viewer.gd and src/cache_manager.gd.

The easiest example is in the async_load_display method in the media viewer. That method is executed by the WorkerThreadPool.
It calls CacheManager.get_image and is passed a DBMedia object.
DBMedia is a metadata object that holds information about an image. This is what used to be refcounted, causing the error to occur.

You can see how the image is loaded and cached in the CacheManager.get_image method, which does use mutex locks to prevent simultaneous cache writes.
The DBMedia object cannot change during this process, which is why I saw no reason to duplicate it.

func preload_image(media : DBMedia) -> void:
	var texture = ImageTexture.create_from_image(Image.load_from_file(media.path))
	CacheManager.image_mutex.lock()
	CacheManager.image_cache[media.id] = texture
	CacheManager.image_mutex.unlock()

This has a code smell - why not make a CacheManager API with this code built in? Something like CacheManager.cache_image(id, texture); this way you don’t need to monkey with the CacheManager properties directly. Other than the smell, it’s not problematic.
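Something along these lines, assuming the existing image_cache/image_mutex members (the method name cache_image is just a suggestion):

```gdscript
# In cache_manager.gd - callers never touch the cache or mutex directly.
func cache_image(id : int, texture : ImageTexture) -> void:
	image_mutex.lock()
	image_cache[id] = texture
	image_mutex.unlock()

# preload_image then shrinks to:
func preload_image(media : DBMedia) -> void:
	var texture = ImageTexture.create_from_image(Image.load_from_file(media.path))
	CacheManager.cache_image(media.id, texture)
```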

I looked around, and nothing threw obvious red flags. It would be nice to have a stack trace to see where it crashed, and a means to crash it reliably.
If it’s a low-level segfault, it may be easier to capture if you run the project from the command line, but you may also need a debug build of Godot to show the correct symbols in the trace.

Using call_deferred is an acceptable way to get information back to the main thread, as the MessageQueue singleton should be thread-safe.
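A rough sketch of that pattern (the method names here are made up for illustration):

```gdscript
func start_load(path : String) -> void:
	WorkerThreadPool.add_task(_load_task.bind(path))

func _load_task(path : String) -> void:
	# Runs on a worker thread; only does filesystem and image work.
	var image := Image.load_from_file(path)
	# Hand the result back to the main thread via the MessageQueue.
	_on_image_loaded.call_deferred(image)

func _on_image_loaded(image : Image) -> void:
	# Runs on the main thread, so touching the scene tree is safe here.
	$TextureRect.texture = ImageTexture.create_from_image(image)
```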

I noticed some strange errors when running from the command line:

WARNING: ObjectDB instances leaked at exit (run with --verbose for details).
     at: cleanup (core/object/object.cpp:2284)
Leaked instance: GDScriptNativeClass:9223372051601949353
Leaked instance: GDScript:9223372065980024096 - Resource path: res://db_media.gd
Leaked instance: Object:110058538395
Leaked instance: Object:110075315632
Leaked instance: Object:110092092849
Leaked instance: Object:110041761202
Leaked instance: Object:111534935478
Leaked instance: Object:111551712695
Leaked instance: Object:111568489912
Leaked instance: Object:111585267129
Leaked instance: Object:111736262082
Leaked instance: Object:111753039299
Leaked instance: Object:111769816516
Leaked instance: Object:111786593733
Leaked instance: Object:111920811469
Hint: Leaked instances typically happen when nodes are removed from the scene tree (with `remove_child()`) but not freed (with `free()` or `queue_free()`).
ERROR: 1 resources still in use at exit.
   at: clear (core/io/resource.cpp:599)
Resource still in use: res://db_media.gd (GDScript)

You should probably change DBMedia to a resource instead of an object to resolve these issues.

I can also get it to crash in a strange way if I try to open the media preview and immediately hit ui_cancel, which calls queue_free. It works best with a large image that will take a little time to load on the first run.

	elif Input.is_action_pressed("ui_cancel"):
		closing.emit()
		queue_free()

Another thing: I think any Node class that can spawn threads should wait until its threads are done before being freed (cache_manager and media_viewer).

It would probably also be good to lock the cache reads, out of paranoia:

*_mutex.lock()
if *_cache.has(media.id):
	var resource := *_cache[media.id]
	*_mutex.unlock()
	return resource
*_mutex.unlock() # also unlock on the miss path

This way, if the cache dictionary gets cleared during the lookup, it won’t fall off a cliff, and you can preserve the resource before it’s freed. The clearing thread can then clear the dictionary and update the UI once it’s safe.


You are definitely correct about that one. There is a reason I did not point you to the preload functions^^
The “why” is pretty simple: I just have not gotten around to improving this yet.

I’ll test, but that’s kinda what got me here in the first place. DBMedia was RefCounted in order to avoid leaks. Since Resource inherits RefCounted, I would guess that it has the same issue?
I’ll try and see.
I am aware of the DBMedia objects leaking. If Resource still causes segfaults I will add a cleanup routine to properly dispose the DBMedia objects.
The current state of the code is basically a hotfix to prevent segfaults at the cost of some minor leaks on exit.

Thank you for finding that one, I’ll have to check that out.

If I can find a reliable way to do that, I will change it. I think I already mentioned somewhere above that WorkerThreadPool.wait_for_task_completion is strangely unreliable.

Also a good idea. While I never managed to have a cache read crash, it can’t hurt to be on the safe side.

If you want to crash gdimagedb that way:

  • change the type of DBMedia to RefCounted.
  • open an image in the MediaViewer
  • close the program

This was a 100% reliable way for me to segfault on exit.


This is now fixed on the development branch.


There should be an object level NOTIFICATION_PREDELETE on the _notification(what) callback.

You could wait for tasks during this time.
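For example, something like this in a node that spawns tasks (the _pending_task_ids array is hypothetical - the node would have to record its task ids as it starts them):

```gdscript
var _pending_task_ids : Array[int] = []

func _notification(what : int) -> void:
	if what == NOTIFICATION_PREDELETE:
		# Block until every task this node spawned has finished,
		# so no worker can touch this node's memory after it is freed.
		for task_id in _pending_task_ids:
			WorkerThreadPool.wait_for_task_completion(task_id)
		_pending_task_ids.clear()
```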


Here is the stack trace, after closing the program with the class changed back to RefCounted:

================================================================
handle_crash: Program crashed with signal 11
Engine version: Godot Engine v4.3.stable.custom_build (77dcf97d82cbfe4e4615475fa52ca03da645dbd8)
Dumping the backtrace. 
[1] /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f409b4c8520] (??:0)
[2] /lib/x86_64-linux-gnu/libc.so.6(pthread_mutex_lock+0x4) [0x7f409b51def4] (??:0)
[3] ./godot_src/bin/godot.linuxbsd.editor.x86_64(+0x8bc605) [0x55dfa0a0d605] (/usr/include/c++/11/mutex:111)
[4] ./godot_src/bin/godot.linuxbsd.editor.x86_64(+0x4f2a239) [0x55dfa507b239] (./godot_src/./core/os/memory.h:119)
[5] ./godot_src/bin/godot.linuxbsd.editor.x86_64(+0x4c2fdd1) [0x55dfa4d80dd1] (./godot_src/./core/os/memory.h:119)
[6] ./godot_src/bin/godot.linuxbsd.editor.x86_64(+0x625a88) [0x55dfa0776a88] (./godot_src/./core/variant/variant.h:309)
[7] ./godot_src/bin/godot.linuxbsd.editor.x86_64(+0x4c2ba3f) [0x55dfa4d7ca3f] (./godot_src/core/variant/callable_bind.cpp:173)
[8] ./godot_src/bin/godot.linuxbsd.editor.x86_64(+0x4c291ce) [0x55dfa4d7a1ce] (./godot_src/./core/os/memory.h:119)
[9] ./godot_src/bin/godot.linuxbsd.editor.x86_64(+0x4f6ac96) [0x55dfa50bbc96] (./godot_src/./core/templates/paged_allocator.h:91)
[10] ./godot_src/bin/godot.linuxbsd.editor.x86_64(+0x4f6c0ec) [0x55dfa50bd0ec] (./godot_src/./core/templates/hash_map.h:255)
[11] ./godot_src/bin/godot.linuxbsd.editor.x86_64(+0x491c637) [0x55dfa4a6d637] (./godot_src/./core/os/memory.h:119)
[12] ./godot_src/bin/godot.linuxbsd.editor.x86_64(+0x658b5f) [0x55dfa07a9b5f] (./godot_src/main/main.cpp:4399)
[13] ./godot_src/bin/godot.linuxbsd.editor.x86_64(+0x5bb33b) [0x55dfa070c33b] (./godot_src/platform/linuxbsd/godot_linuxbsd.cpp:91)
[14] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f409b4afd90] (??:0)
[15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f409b4afe40] (??:0)
[16] ./godot_src/bin/godot.linuxbsd.editor.x86_64(+0x5ebf45) [0x55dfa073cf45] (??:?)
-- END OF BACKTRACE --
================================================================

It will take me a little time to analyze the source code.

Here is a better one from gdb:

(gdb) bt
#0  ___pthread_mutex_lock (mutex=0x1f0) at ./nptl/pthread_mutex_lock.c:80
#1  0x0000555555e10605 in __gthread_mutex_lock (__mutex=0x1f0) at /usr/include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749
#2  __gthread_recursive_mutex_lock (__mutex=0x1f0) at /usr/include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:811
#3  std::recursive_mutex::lock (this=0x1f0) at /usr/include/c++/11/mutex:108
#4  std::unique_lock<std::recursive_mutex>::lock (this=<synthetic pointer>) at /usr/include/c++/11/bits/unique_lock.h:139
#5  std::unique_lock<std::recursive_mutex>::unique_lock (__m=..., this=<synthetic pointer>) at /usr/include/c++/11/bits/unique_lock.h:69
#6  MutexLock<MutexImpl<std::recursive_mutex> >::MutexLock (p_mutex=..., this=<synthetic pointer>) at ./core/os/mutex.h:81
#7  GDScriptInstance::~GDScriptInstance (this=0x5555611211b0, __in_chrg=<optimized out>) at modules/gdscript/gdscript.cpp:2144
#8  0x000055555a47e239 in memdelete<ScriptInstance> (p_class=0x5555611211b0) at ./core/os/memory.h:116
#9  Object::~Object (this=0x555560d6d7f0, __in_chrg=<optimized out>) at core/object/object.cpp:2048
#10 0x000055555a183dd1 in memdelete<RefCounted> (p_class=0x555560d6d7f0) at ./core/os/memory.h:111
#11 memdelete<RefCounted> (p_class=0x555560d6d7f0) at ./core/os/memory.h:111
#12 Variant::_clear_internal (this=0x55555ff31660) at core/variant/variant.cpp:1389
#13 0x0000555555b79a88 in Variant::clear (this=0x55555ff31660) at ./core/variant/variant.h:309
#14 Variant::~Variant (this=0x55555ff31660, __in_chrg=<optimized out>) at ./core/variant/variant.h:803
#15 CowData<Variant>::_unref (this=<optimized out>, this=<optimized out>, p_data=0x55555ff31660) at ./core/templates/cowdata.h:263
#16 CowData<Variant>::_unref (this=<optimized out>, p_data=0x55555ff31660) at ./core/templates/cowdata.h:245
#17 0x000055555a17fa3f in CowData<Variant>::~CowData (this=0x55555fcfb118, __in_chrg=<optimized out>) at ./core/templates/cowdata.h:477
#18 Vector<Variant>::~Vector (this=0x55555fcfb110, __in_chrg=<optimized out>) at ./core/templates/vector.h:291
#19 CallableCustomBind::~CallableCustomBind (this=0x55555fcfb0f0, __in_chrg=<optimized out>) at core/variant/callable_bind.cpp:173
#20 0x000055555a17d1ce in memdelete<CallableCustom> (p_class=0x55555fcfb0f0) at ./core/os/memory.h:116
#21 Callable::~Callable (this=this@entry=0x55556111b1f8, __in_chrg=<optimized out>) at core/variant/callable.cpp:431
#22 0x000055555a4bec96 in WorkerThreadPool::Task::~Task (this=0x55556111b1f0, __in_chrg=<optimized out>) at core/object/worker_thread_pool.h:74
#23 PagedAllocator<WorkerThreadPool::Task, false, 1024u>::free (p_mem=0x55556111b1f0, this=0x55555d2dbba0) at ./core/templates/paged_allocator.h:90
--Type <RET> for more, q to quit, c to continue without paging--bt
#24 WorkerThreadPool::finish (this=this@entry=0x55555d2dba30) at core/object/worker_thread_pool.cpp:748
#25 0x000055555a4c00ec in WorkerThreadPool::finish (this=0x55555d2dba30) at ./core/templates/local_vector.h:149
#26 WorkerThreadPool::~WorkerThreadPool (this=0x55555d2dba30, __in_chrg=<optimized out>) at core/object/worker_thread_pool.cpp:771
#27 0x0000555559e70637 in memdelete<WorkerThreadPool> (p_class=0x55555d2dba30) at ./core/os/memory.h:111
#28 memdelete<WorkerThreadPool> (p_class=0x55555d2dba30) at ./core/os/memory.h:111
#29 unregister_core_types () at core/register_core_types.cpp:384
#30 0x0000555555bacb5f in Main::cleanup (p_force=p_force@entry=false) at main/main.cpp:4397
#31 0x0000555555b0f33b in main (argc=<optimized out>, argv=0x7fffffffdd28) at platform/linuxbsd/godot_linuxbsd.cpp:89

Update:
I tried changing DBMedia to the Resource type, but that causes the same segfault on exit.

I assume that anything that inherits RefCounted will have this issue.

More testing:

If I keep a record of all task ids from the WorkerThreadPool in the MediaViewer and call WorkerThreadPool.wait_for_task_completion(id) for each of them at the point of NOTIFICATION_PREDELETE, then using a RefCounted DBMedia will not segfault.

This would be in line with the documentation, which states that wait_for_task_completion has to be called for every task.

This has however led me to discover a new way to cause a segfault on exit.
With a RefCounted DBMedia, calling wait_for_task_completion for a task id that has already been waited on will not only throw an error, but also segfault on exit.

This can be provoked by opening the image next to a large image in the grid and switching over immediately. When the preload thread is not done, wait_for_task_completion will be called upon switching images.


I am not really sure how one is supposed to reliably call wait_for_task_completion for every task, but only exactly once.
The WorkerThreadPool does not tell me when a task is done, and the threadproc can’t know its own task id.
Is the intended implementation that I build a separate id system, so that I can pass a unique id to every task and match it to the task id?
Should I only ever start one task at a time?
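One way to guarantee the exactly-once property might be a small wrapper that removes each id as it is waited on - a sketch with made-up names, not a tested fix for the double-wait segfault described above:

```gdscript
var _pending_task_ids : Array[int] = []

func start_task(task : Callable) -> void:
	_pending_task_ids.append(WorkerThreadPool.add_task(task))

func wait_for_all() -> void:
	# Popping the id before waiting means a second call to
	# wait_for_all can never wait on the same task twice.
	while not _pending_task_ids.is_empty():
		var task_id : int = _pending_task_ids.pop_back()
		WorkerThreadPool.wait_for_task_completion(task_id)
```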

My thought is this: you could spawn threads only within the cache_manager, on behalf of callers, returning a session ticket (instead of each caller spawning its own thread to interact with the cache manager). This would make thread management easier, and along with the ticket system you could manage task ids in one place too.

Since you have a container that needs an image or resource loaded asynchronously, show a default placeholder and keep the session ticket until the thread returns. (Or keep the loading string.)

The thread can call_deferred the loaded resource back to the main thread, and the main thread can emit a signal to the caller that the resource is done loading, along with the session ticket. The caller can then place the resource in a container. If the caller no longer cares about the image (like during the preview), it can safely ignore the returned resource.
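A compact sketch of that ticket idea inside the cache manager (all names here are hypothetical):

```gdscript
signal load_finished(ticket : int, texture : ImageTexture)

var _next_ticket : int = 0

func request_image(path : String) -> int:
	_next_ticket += 1
	var ticket := _next_ticket
	WorkerThreadPool.add_task(_load_task.bind(ticket, path))
	return ticket

func _load_task(ticket : int, path : String) -> void:
	var texture := ImageTexture.create_from_image(Image.load_from_file(path))
	# Bounce back to the main thread; callers compare against their stored ticket.
	_emit_load_finished.call_deferred(ticket, texture)

func _emit_load_finished(ticket : int, texture : ImageTexture) -> void:
	load_finished.emit(ticket, texture)
```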

I was worried about ref counting in Godot, but as long as you don’t modify the resource contents from multiple threads, supposedly it’s okay.


Since the stack trace shows the fault at a low-level mutex lock, I’m wondering if it’s a thread trying to lock the cache_manager mutex after it has been deleted by the main thread on exit. But this is just a hunch based on your investigation. (Mutexes are ref-counted 🤔)
