id Software pretty much perfected this (client-inputs + authorative server) for Quake (in 3D):
see
tldr: Client sends player inputs (incl position) to server while the PLAYER gets moved immediately by the clients engine (0ms lag, simple form of client prediction). The actual world state happens on the server and sent back to client.
Details from jfedor:
"What happens when the network packet containing the next snapshot is delayed or lost and the client runs out of states to interpolate between? Then it’s forced to do what it normally tries to avoid: extrapolate or guess where the objects will be if they keep moving the same way they’re currently moving. This might not be a big problem for objects that move in a straight line like rockets or that are only affected by gravity. But other players move in an unpredictable manner and because of that the game normally wants to interpolate and not extrapolate, even if it means artificially delaying everything by up to a single server frame in addition to whatever network lag is present.
The above description applies to all game objects except the most important one: the player. This case is special, because the player’s eyes are the first person camera through which we are seeing the game world. Any latency or lack of smoothness here would be much more noticeable than with other game objects. Which is why sending user input to the server, waiting for the server to process it and respond with the current player position and only then rendering the world using that delayed information isn’t a good option. Instead the Quake 3 client works as follows. In each frame it samples user input (how much the mouse was moved in which direction, what keys are pressed), constructs a user command, which is that user input translated into a form independent of the actual input method used. It then either sends that user command to the server immediately or stores it to be sent in one batch with future user commands from subsequent frames. The server will use those commands to perform the player movement and calculate the authoritative world state. But instead of waiting for the server to do that and send back a snapshot containing that information, the client also immediately performs the same player movement locally. In this case there’s no interpolation - the move commands are applied onto the latest snapshot received from the server and the results can be seen on the screen immediately. In a way this means that each player lives in the future on their own machine, when compared to the rest of the world, which is behind because of the network latency and the delay needed for interpolation. Applying user inputs on the client without waiting for a response from the server is called prediction."
Also note that your game is highly interactive (Player1<->Ball<->Player2), this needs a fast+high bandwidth server. id software circumvented the problem by letting players host their own games. Obv. this open the door for cheating.