ENet Server Drops All Clients at Connection Threshold

Godot Version

4.4.1

Question

We’re experimenting with Godot 4.4.1 (C#) and its high-level ENet networking to see whether we can build a truly massively multiplayer online game (MMO) with it.

Our setup is a single Godot project with two export presets: one for a headless dedicated server (usually running in a Linux VM) and one for the game client. In our tests the builds run on separate machines.

The main issue we keep hitting is a hard limit on the number of simultaneous client connections. Once that limit is reached, the next client attempting to join makes the server spike in CPU usage, round-trip time shoots up, and then the server drops all connected clients at once.

To isolate the problem we built a minimal test project: the clients connect and the server does nothing except broadcast the current client count. Even here, each new connection above approximately 1300 clients causes that CPU spike and, eventually, a mass disconnect. Once a client is fully connected the ping returns to normal; the spike happens only during the handshake.

On a powerful Windows PC the issue happens at roughly 1400 clients. On a Linux VM it’s closer to 1300. If we send more data per client, for example player names, positions, or basic gameplay messages, the limit drops further. In our first playable prototype we could only keep around 80 clients connected before everyone was kicked, which is obviously far too low for a massively multiplayer online game.

Our main concern is that the issue doesn’t develop gradually. Instead of high CPU load on the game server causing noticeable lag, or a single client losing its connection, the problem strikes abruptly: once some mysterious limit is crossed, the server kicks out every connected client at once. So far we have been unable to find the reason for that.

Our minimal reproduction project is on GitHub at https://github.com/fl0wByte/GodotNetworkingMRE. Any feedback or tips on how to improve the situation would be greatly appreciated.

1 Like

One thing to keep in mind is that the Godot ENet implementation will be sending background packets to check if a client is still alive and gathering peer statistics, like ping, under the hood.

I think there is a death timer for removing stale peers. My guess is that your channels are so overwhelmed with packets that the death timer gets invoked while packets are still being processed.

Another thing is you probably want to rate limit your network frames to 12/24/30 frames per second or lower. This could be done easily with synchronizers.
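To make the rate-limiting idea concrete, here is a minimal sketch of the accumulator pattern (in Python, with a hypothetical `send_snapshot` callback; this is not Godot API code, just the pattern you would put in a `_process(delta)` callback, or get for free from a synchronizer's replication interval):

```python
# Decouple the network send rate from the frame rate: the game loop runs at
# full speed, but snapshots go out at most SEND_HZ times per second.
SEND_HZ = 20                    # target network frame rate
SEND_INTERVAL = 1.0 / SEND_HZ   # seconds between sends

def tick(accumulator: float, delta: float, send_snapshot) -> tuple[float, int]:
    """Advance the time accumulator; fire send_snapshot at most SEND_HZ/sec."""
    accumulator += delta
    sends = 0
    while accumulator >= SEND_INTERVAL:
        accumulator -= SEND_INTERVAL
        send_snapshot()
        sends += 1
    return accumulator, sends
```

At 60 fps this fires the send roughly every third frame, cutting broadcast traffic to a third of a send-every-frame design.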

var err = peer.CreateServer(Port, MaxPlayers, 2, 0, 0);

You should probably also increase the number of channels allocated. I assume that since you're limiting it to 2 (actually 4 under the hood), it could be limiting the possible buffers available to queue client packets.

You also probably want to turn off the relay setting on the server.

You can also configure peer timeouts.

2 Likes

Update: I took a peek at the ENet implementation in Godot and at the library itself, and I have some corrections.

Channels are probably not that important, as extra channels are not utilized by default in Godot's implementation. Godot reserves two channels: one is dedicated to reliable transfers, the other to unreliable. Any remaining channels have to be used explicitly by the developer. Congestion on that one working channel is probably also a problem for the peer-timeout mechanism, which is driven by periodic pings from the host.

Two, the ping interval, used for determining whether a peer has timed out, is fixed at a period of 500 ms and is not configurable. This will be driving a lot of the traffic when nothing is happening in your stress test.

1 Like

Let me start by saying @pennyloafers knows his stuff when it comes to networking.

I would recommend you also ask the Godot team directly about this. They don’t often drop in here, so I’m tagging @Calinou who might have some insight.

In our first playable prototype we could only keep around 80 clients connected before everyone was kicked, which is obviously far too low for a massively multiplayer online game.

Can you open an issue on GitHub with the minimal reproduction project attached?

1 Like

@pennyloafers
Thank you for the detailed information and for taking the time to look into this in Godot. I’ll give your tips a try and will report back here with my findings.

@Calinou
I’ve just opened a GitHub issue for this.

2 Likes

I would also want to ask how you performed the test?

Spawning 1300+ processes on one machine would not be indicative of a real-world scenario. You probably want to isolate the server on a dedicated machine and have a remote system, or systems, simulate the clients. This will allow you to determine whether it's a hardware resource issue or a Godot/ENet issue. I only say this since you mentioned a more powerful machine can change the numbers.

2 Likes

We actually isolate the server on its own machine—either a Linux VM or a high-end Windows PC—and nothing else runs on that host. Apart from the brief CPU spikes during new-connection handshakes, the server never hits full utilization.

For our load tests, we use several separate Windows PCs to simulate clients. On each of those we launch roughly 300–600 headless client instances that automatically connect, and on one of them we also run a graphical client for monitoring. None of the client machines ever maxes out its CPU, RAM or network during these tests.

2 Likes

Okay, that seems reasonable.

So if the server and clients aren't fully utilizing their systems, it would probably point to some sort of network congestion or packets being dropped? You could probably verify this by doing some math and measuring with a third-party application like Wireshark to see if packets are being dropped when more peers are talking.

By default Godot runs on a single thread, so the server will only utilize one CPU core. And because you see a CPU spike only when a peer joins (I suspect it spikes because the server is acting as a relay and needs to generate packets for all clients in one frame), overall machine-level CPU figures can be misleading: the machine, and by extension Godot, really isn't being fully utilized even while that one core is the bottleneck. That is a detail to consider when looking at CPU utilization.

There won’t be much going on in the Godot engine in general for your test, except for the ping interval that ENet is handling for client timeout detection every 500 ms. That is actually an exposed setting (I missed the binding when I first dove into it).

I suspect that SceneMultiplayer defaulting to relay being true will exacerbate this issue. When relay is true, each time a client joins, the server sends a packet to all peers saying that a new client has joined. When false, there should be just one packet from the server to the new client, announcing the server as a peer.
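To see why relay makes joins progressively more expensive, here is a rough model of the join-notification traffic (my own simplification in Python, not Godot code):

```python
def join_notification_packets(n_clients: int, relay: bool) -> int:
    # Simplified model: with relay on, the server announces each new client
    # to every peer already connected, so the k-th join costs k-1 packets.
    # With relay off, each new client only hears about the server peer.
    if relay:
        return sum(k - 1 for k in range(1, n_clients + 1))
    return n_clients

print(join_notification_packets(1300, True))   # 844350 packets total
print(join_notification_packets(1300, False))  # 1300 packets total
```

The quadratic growth with relay on is one reason each new handshake keeps getting costlier as the client count rises.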

2 Likes

If we do a little math on network requirements, just for ENet's ping concept: each ping needs about 28 bytes per packet, 20 bytes for the IP header and 8 bytes for the UDP header, assuming there is no real data in the ping packet.

ENet's reliable packets go over UDP and require an ack. (Well, I don't recall whether a ping is one reliable packet with an ack, or two unreliable, or two reliable... so potentially multiply by 2 or 4.)

28bytes * 8 bits = 224 bits

2rx/tx * 224bits * 1300 clients = 582,400 bits full-duplex

That is about 0.58 Mb per ping round, or roughly 1.2 Mbps sustained if pings fire every 500 ms, which is modest on average. The real concern is the burst: if all 1300 pings have to be generated and drained at nearly the same time, within a single frame, the instantaneous rate is far higher, and that is what overflows queues. And this still assumes the ping packet is empty.

(My math is a little off, but something to consider)
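Putting that arithmetic in one place (same assumptions as above: 28-byte empty ping, both directions, 1300 clients, 500 ms ping interval; the one-frame burst figure is a worst case I'm assuming, where every ping lands in the same frame):

```python
HEADER_BYTES = 28      # 20-byte IP header + 8-byte UDP header, empty ping
CLIENTS = 1300
PING_INTERVAL_S = 0.5  # ENet ping period discussed above
FRAME_S = 1 / 60       # worst case: the whole burst drains in one frame

bits_per_packet = HEADER_BYTES * 8              # 224 bits
burst_bits = 2 * bits_per_packet * CLIENTS      # rx+tx: 582,400 bits
avg_mbps = burst_bits / PING_INTERVAL_S / 1e6   # sustained average rate
burst_mbps = burst_bits / FRAME_S / 1e6         # if it all fits in one frame

print(round(avg_mbps, 2), round(burst_mbps, 1))  # prints: 1.16 34.9
```

So the link itself is not the limit; the per-frame burst and the single queue absorbing it are.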

3 Likes

I was able to run a local test and got up to 1600 stable clients, but I had to rate limit the join rate to 2 clients per second. I think I could go further, but it takes some time to build the clients up.

I did see the disconnect issue, with almost all clients disconnecting when there was no join rate limiting, but it can be reduced with join rate limiting (i.e. using a timer to create clients).

I was monitoring the packet peer's packet-loss statistic, and for a single peer it does build considerably as the client list grows. Although this number could get pretty high, it did not correlate with disconnect rates. Max packet loss for the first peer joined: 17000.

Changing the ping interval didn't seem to have as large an effect as anticipated.

These tests were performed on a single i7-8550U in Power mode. I had 3 processes, 1 host and 2 clients, using the branched-MultiplayerAPI trick to put multiple clients into one process.

After 1600 clients joined, CPU usage for each client process sat at 13% CPU and 750 MB memory; the server at 2% CPU and 100 MB memory.

Conclusion: I suspect that the single working channel in the Godot ENet implementation is congested and dropping packets for joiners. I think a custom implementation that spreads clients onto the available channels is worth a try to hasten the join rate.
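That "spread clients onto available channels" idea could look something like this round-robin mapping (a hypothetical sketch, untested against Godot; I'm assuming the two engine-reserved channels come first and that extra channels were requested at create time):

```python
RESERVED_CHANNELS = 2  # Godot keeps one reliable + one unreliable channel

def assign_channel(peer_id: int, extra_channels: int) -> int:
    """Round-robin a peer onto one of the developer-allocated channels,
    so no single channel queue absorbs all the traffic."""
    return RESERVED_CHANNELS + (peer_id % extra_channels)

# With 4 extra channels, peers 1..4 land on channels 3, 4, 5, 2.
```

Each peer's traffic would then go out on its assigned channel instead of everyone sharing one queue.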

Code
extends Node

const CLIENT_COUNT = 800

# 2 client proc (cp), 0.2 create rate (cr)
# pp.set_timeout(15000, 30000, 60000)
# pp.ping_interval(5000)
# 840 stabilized, 10000+ packet loss

# 2 cp, 0.3 cr -> 888 unstable, 12000 packet loss, (5000 ping interval) +timeouts
# 2 cp, 0.6 cr -> 900 stable, 11000 packet loss,  (5000 ping interval) +timeouts
# 2 cp, 1.0 cr -> 1200 stable, 21000 packet loss, (5000 ping interval) +timeouts
# 2 cp, 0.6 cr -> 900 stable, 13000 packet loss (500 ping interval) no timeouts
# 3 cp, 0.9 cr -> ~1000 unstable, 17000 packet loss (500 ping interval) no timeouts
# 3 cp, 1.2 cr -> ~1200 unstable, 21000 packet loss (500 ping interval) no timeouts
# 3 cp, 1.5 cr -> unstable
# 2 cp, 1.0 cr -> 1600 stable, 17000 max packet loss (2000 ping interval) timeouts, power mode


func _ready() -> void:
	var peer : NetworkPeer
	if OS.has_feature("server"):
		peer = NetworkPeer.new()
		add_child(peer)
	else:
		# stagger client rate
		if OS.has_feature("client1"): await get_tree().create_timer(0.5).timeout
		for i in CLIENT_COUNT:
			peer = NetworkPeer.new()
			add_child(peer)
			await get_tree().create_timer(1.0).timeout

class NetworkPeer extends Node:
	static var client_number : int = 0
	const IP_ADDR := "localhost"
	const PORT := 5555

	var enet := ENetMultiplayerPeer.new()
	func _ready() -> void:
		var api := SceneMultiplayer.new()
		if OS.has_feature("server"):
			enet.create_server(PORT,4000)
			self.name = "server"
			api.peer_connected.connect(_on_peer_connected)
			api.peer_disconnected.connect(_on_peer_disconnected)

			api.server_relay = false    # forgot to add this in the first run
		else:
			client_number += 1
			enet.create_client(IP_ADDR, PORT)
			self.name = "client" + str(client_number) 
			set_process(false)
			
		api.multiplayer_peer = enet
		get_tree().set_multiplayer(api, self.get_path())
	
	var time: float = 0.0
	var peer_id: int = 0
	func _process(delta: float) -> void:
		if not peer_id:
			return  # no peer yet; avoids get_peer(0) errors before the first connect
		time += delta
		if time > 1.0:
			time = 0
			var p : ENetPacketPeer = multiplayer.multiplayer_peer.get_peer(peer_id)
			var amount : float = p.get_statistic(ENetPacketPeer.PEER_PACKET_LOSS)
			print(amount)
			#if amount > 0.0:
				#breakpoints
			

	func _exit_tree() -> void:
		multiplayer.multiplayer_peer.close()
		multiplayer.multiplayer_peer = OfflineMultiplayerPeer.new()

	func _on_peer_connected(id:int) -> void:
		if not peer_id:
			peer_id = id
		var pp: ENetPacketPeer = enet.get_peer(id)
		pp.set_timeout(15000, 30000, 60000)
		pp.ping_interval(2000)
		print(name, ": #%d peer connected %d" % [multiplayer.get_peers().size(), id])
		
	func _on_peer_disconnected(id:int) -> void:
		print(name, ": #%d disconnected peer %d" % [multiplayer.get_peers().size(), id])
		


Update

I forgot to disable server relay. Adding that helped immensely: I was able to go up to 6 clients per second with almost no packet loss, and CPU usage was only at 4% and 100 MB memory for the clients (1600 total clients). Which means there is some array/vector growing when a lot of packets pile up in a queue somewhere?

I do see another issue, where the host seems to crash but nothing is logged.

3 Likes

So to support a large number of clients, one would need to create a load-balancer?

@pennyloafers, thanks so much for digging into this and all the detailed testing. You’ve given me a lot to think about!

I went ahead and disabled relay and tweaked the ping interval and timeouts as you suggested, and that did indeed make a noticeable improvement. However, I’m still running into mass disconnects whenever a single server CPU core hits 100% usage. When I ran the server on my weak dual core VM, I was able to connect around 1500 clients simultaneously before the server started mass disconnecting.

Do you know if there’s any way to make Godot’s ENet networking multithreaded (or otherwise spread the work across cores)? That feels like it could be the next step to avoid that single-core bottleneck. My goal is to hit those same numbers in a real gameplay prototype, not just in a minimal test.

Thanks again for your effort!

1 Like

I just got to 4000 stable

Here is the code.

Code
extends Node

# I had to limit to 1000 clients per process because of a weird error
const CLIENT_COUNT = 1000

func _ready() -> void:
	var peer : NetworkPeer
	if OS.has_feature("server"):
		peer = NetworkPeer.new()
		add_child(peer)
	else:
		# let server get up
		await get_tree().create_timer(2.0).timeout
		for i in CLIENT_COUNT:
			peer = NetworkPeer.new()
			add_child(peer)
			await get_tree().create_timer(0.02).timeout

class NetworkPeer extends Node:
	static var client_number : int = 0
	const IP_ADDR := "localhost"
	const PORT := 5555

	var enet := ENetMultiplayerPeer.new()
	func _ready() -> void:
		var api := SceneMultiplayer.new()
		if OS.has_feature("server"):
			enet.create_server(PORT,4000)
			self.name = "server"
			api.peer_connected.connect(_on_peer_connected)
			api.peer_disconnected.connect(_on_peer_disconnected)
			api.server_relay = false
			# allow to poll more
			DisplayServer.window_set_vsync_mode(DisplayServer.VSYNC_DISABLED)
		else:
			client_number += 1
			enet.create_client(IP_ADDR, PORT)
			self.name = "client" + str(client_number) 
			set_process(false)
			DisplayServer.window_set_vsync_mode(DisplayServer.VSYNC_ENABLED)
			
		api.multiplayer_peer = enet
		get_tree().set_multiplayer(api, self.get_path())
	
	var time:float = 0.0
	var peer_id: int = 0
	func _process(delta: float) -> void:
		if peer_id:
			#poll more
			multiplayer.poll()
			time += delta
			if time > 1.0:
				time = 0
				var p : ENetPacketPeer = multiplayer.multiplayer_peer.get_peer(peer_id)
				var amount : float = p.get_statistic(ENetPacketPeer.PEER_PACKET_LOSS)
				prints(amount)

	func _exit_tree() -> void:
		multiplayer.multiplayer_peer.close()
		multiplayer.multiplayer_peer = OfflineMultiplayerPeer.new()

	func _on_peer_connected(id:int) -> void:
		if not peer_id:
			peer_id = id
		var pp: ENetPacketPeer = enet.get_peer(id)
		# not sure timeout and ping are needed, but could be useful with more low-level insight
		pp.set_timeout(15000, 30000, 60000)
		pp.ping_interval(2000)
		print(name, ": %s #%d peer connected %d" % [Time.get_ticks_msec() / 1000.0, multiplayer.get_peers().size(), id])
		
	func _on_peer_disconnected(id:int) -> void:
		print_rich(name, ":%s #%d [color=yellow]disconnected[/color] peer %d" % [Time.get_ticks_msec() / 1000.0, multiplayer.get_peers().size(), id])
		

There were two problems with my setup. One, I let the process run in windowed mode with vsync on, which limited the server poll rate to 60 fps. After disabling that for the server and adding an extra poll in the server's process loop, I could spam ~117 clients per second.

The server now sits at 12% CPU with 168 MB memory (running at ~230 fps). The 4 client processes sit at 5% CPU and 120 MB memory each (60 fps).
I had 0 packet loss for the monitored peer, and so far 6 minutes (now 28 minutes) with no disconnections. This supports the theory that the bottleneck is the server's ENet instance.

Oddly, I did have to limit the clients to 1000 peers per process; I hit a strange socket error at exactly peer 1003, very consistently. Maybe processes have a socket limit?

Yeah, I was thinking that until this morning, when I thought about poll rates and started trying to poll more on the server. You might still want to load balance, because once a client joins a real game it will need to sync the world state, so the server will be very busy with new clients.

I think this is sort of how it works in some games, where you sit in a queue to join the servers. I remember this being a thing; it could be due to capacity limits, but also probably rate limiting of new clients on the services?

1 Like

Not practically. The problem is that packet queues and world state will be a shared resource, and threading that wouldn't make a lot of sense. You could maybe dream up a way to segment the world into zones on the server using the branched-multiplayer concept in a single process, and maybe do some quadtree to balance players per server branch.

The main thing I see with ENet is that it uses one channel, and once that channel's queue gets filled, it drops packets. I think if you redesigned the Godot ENet implementation you could spread packets/peers out evenly across the channels, but that is untested. I guess, given my findings, it is possible to poll ENet faster and avoid disconnects, but this could probably be improved in general.

I would look into using Docker, which should allow better hardware access.

@pennyloafers I tested your GDScript code and it works perfectly. However, my project is based on C#, so I translated your code into C# and tested it.

Here is the translated code:

MainNode.cs
  using Godot;
  using System;

  public partial class MainNode : Node
  {
      private const int CLIENT_COUNT = 1000;
  
      public override async void _Ready()
      {
          if (OS.HasFeature("dedicated_server"))
          {
              GD.Print("This is a server");
              var peer = new NetworkPeer();
              AddChild(peer);
          }
          else
          {
              await ToSignal(GetTree().CreateTimer(2.0f), "timeout");
              for (int i = 0; i < CLIENT_COUNT; i++)
              {
                  var peer = new NetworkPeer();
                  AddChild(peer);
                  await ToSignal(GetTree().CreateTimer(0.02f), "timeout");
              }
          }
      }
  }
NetworkPeer.cs
using Godot;
using System;

public partial class NetworkPeer : Node
{
    private static int _clientNumber = 0;
    private const string IP_ADDR = "localhost";
    private const int PORT = 5555;

    private ENetMultiplayerPeer _enet;
    private float _time = 0f;
    private long _peerId = 0;

    public override void _Ready()
    {
        var api = new SceneMultiplayer();

        _enet = new ENetMultiplayerPeer();

        if (OS.HasFeature("dedicated_server"))
        {
            _enet.CreateServer(PORT, 4000);
            Name = "server";

            api.PeerConnected += OnPeerConnected;
            api.PeerDisconnected += OnPeerDisconnected;
            api.ServerRelay = false;
            DisplayServer.WindowSetVsyncMode(DisplayServer.VSyncMode.Disabled);
        }
        else
        {
            _clientNumber++;
            _enet.CreateClient(IP_ADDR, PORT);
            Name = $"client{_clientNumber}";
            SetProcess(false);
            DisplayServer.WindowSetVsyncMode(DisplayServer.VSyncMode.Enabled);
        }

        api.MultiplayerPeer = _enet;
        // Pass the node path so each NetworkPeer gets its own branched
        // MultiplayerAPI, matching the GDScript's set_multiplayer(api, get_path()).
        GetTree().SetMultiplayer(api, GetPath());
    }

    public override void _Process(double delta)
    {
        if (_peerId != 0)
        {
            GetTree().GetMultiplayer(GetPath()).Poll(); // poll this node's branched MultiplayerAPI
            _time += (float)delta;
            if (_time > 1.0f)
            {
                _time = 0f;
                var p = _enet.GetPeer((int)_peerId);
                var loss = p.GetStatistic(ENetPacketPeer.PeerStatistic.PacketLoss);
                GD.Print(loss);
            }
        }
    }

    public override void _ExitTree()
    {
        _enet.Close();
        GetTree().GetMultiplayer(GetPath()).MultiplayerPeer = new OfflineMultiplayerPeer();
    }

    private void OnPeerConnected(long id)
    {
        if (_peerId == 0)
            _peerId = id;

        var pp = _enet.GetPeer((int)id);
        pp.SetTimeout(15000, 30000, 60000);
        pp.PingInterval(2000);

        GD.Print($"{Name}: peer connected {id}");
    }

    private void OnPeerDisconnected(long id)
    {
        GD.PrintRich($"{Name}:  [color=yellow]disconnected[/color] peer {id}");
    }
}

I think it should work pretty much the same.

When I start your GDScript, 1000 clients are created in a few moments. My C# code crashes after only 30 and is not nearly as fast. The packet loss is also much higher. I have the feeling that multiplayer runs much less stably under C#. Any ideas why this might be?

I mean a lot of the lobby stuff back in the day (20+ years ago) was to deal with internet latency and to deal with smaller numbers of players. Fortnite uses lobbies now to make sure a game is full before starting, but it’s like loading all the seats on a rollercoaster at an amusement park.

Back in the day we were testing the exact same capacity things to determine how many servers we needed (and how many load balancers). These days I'd just set up VMs in the cloud (and load balancers) and spin up boxes as I needed more. Then I'd monitor the boxes at high utilization and see where the bottlenecks on the machine are.

TBH, this particular issue seems kind of weird. Usually these days database calls are what limit things (at least in web apps). Handshakes are small; they shouldn't be doing this.

I’d consider using AWS or another cloud resource. They only charge for actual usage and tests like these would be pennies. Then you’re not dealing with hardware issues as an added factor.

Some stuff works better in GDScript than in C#, and there are certain areas where C# has improved performance. However, Godot was made to be driven by GDScript first and C# second.

One thing to keep in mind is that Godot Mono (C#) does allow you to use GDScript scripts as well as C# code. So I would recommend trying to use @pennyloafers 's code as a GDScript file with your C# project and see what happens.

1 Like

I experimented a little more. I now have my project in both C# and GDScript. However, in my project I don't create 1,000 connections in one application, but only one per application. With one connection per application, the programming language made no difference in performance.

What does make a difference, however, is sending RPCs. In my project, I sent a broadcast to all clients per PhysicsProcess. This broadcast contained only a single integer. It seems that the more clients are connected, the more difficult it becomes for the server to send the packet to all clients.

As soon as I remove the RPC, I get 4000 connected clients.
With the broadcast RPC active, only 1500, because then the CPU load becomes too high.

I have vsync disabled on the server side, a ping interval of 2000 and a timeout of (1500, 30000, 60000), and I also run an extra poll in each process.

I know that in practice, you don’t send packets with a tick rate of 60 to all clients. But I think sending a single small broadcast should be possible.

In an MMO, you have to send many more and larger packets.

1 Like

I have now increased the packet size. Instead of broadcasting a single int, I send an array of 75 integers to all clients. The server started disconnecting connections at 430 clients and left 307 connected.
With 150 integers, I only reached about 250 connections; it left 147 connected.

It does not seem to be linear, but with larger packets the number of possible connections decreases.

Interestingly, in both tests, not a single CPU core was even close to being fully utilized.

Here is a summary of my findings:

  • No broadcast RPC from the server → 4000 connections without problems, CPU utilization < 15%.

  • Broadcast (60 times per second) of a single integer → around 1500 clients until a single CPU core reaches 100% and the server starts mass disconnecting.

  • Broadcast (60 times per second) of an array with 75 integers → around 430 clients until the server starts mass disconnecting. The CPU never reaches full load; the maximum was 12% on a single core.

  • Broadcast (60 times per second) of an array with 150 integers → around 250 clients until the server starts mass disconnecting. The CPU never reaches full load; the maximum was 25% on a single core.
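For what it's worth, a back-of-the-envelope estimate of the server's outbound traffic at each of those breaking points. My assumptions: 4 bytes per integer, 28 bytes of UDP/IP headers per packet, and no ENet or Godot RPC serialization overhead, so the real traffic is higher than this:

```python
BROADCAST_HZ = 60
HEADER_BYTES = 28  # IP + UDP headers; ignores ENet and Godot RPC overhead

def outbound_mbps(ints_per_packet: int, clients: int) -> float:
    """Estimated server outbound rate for one broadcast stream."""
    packet_bytes = HEADER_BYTES + 4 * ints_per_packet  # assumed 4 bytes/int
    return BROADCAST_HZ * clients * packet_bytes * 8 / 1e6

print(round(outbound_mbps(1, 1500), 1))   # 1 int,    1500 clients: 23.0 Mbps
print(round(outbound_mbps(75, 430), 1))   # 75 ints,  430 clients:  67.7 Mbps
print(round(outbound_mbps(150, 250), 1))  # 150 ints, 250 clients:  75.4 Mbps
```

The two larger-payload limits land at a similar aggregate rate even though the client counts differ, which hints at a throughput ceiling in the server's single ENet channel/queue rather than a per-client limit.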

1 Like