How to Make Robots Play Table Tennis Matches?
Table tennis is currently a major highlight of the Paris Olympics, with players demonstrating exceptional physical fitness, high-speed footwork, precise control over many types of shots, and remarkable agility.
For this reason, researchers have used table tennis as a benchmark for robotics since the 1980s, building many table tennis robots and making progress on key sub-tasks such as returning the ball to the opponent's half, hitting target locations, smashing, and cooperative rallying. Until now, however, no robot had played a complete table tennis match against a previously unseen human opponent.
In this research, the Google DeepMind team combined a hierarchical and modular policy architecture, an iteratively defined task distribution, a simulation-to-simulation adaptation layer, domain randomization, real-time adaptation to unseen opponents, and careful hardware deployment to reach amateur human-level performance in competitive table tennis matches against human players.
1. A Hierarchical and Modular Policy Architecture Based on a Skill Library
Low-Level Controllers (LLCs): The skill library contains policies for individual table tennis skills, such as forehand attack, backhand positioning, and forehand serve. Each LLC is an independent policy specialized for a single skill; the LLCs are neural networks trained in simulation with the MuJoCo physics engine.
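To make the idea of a skill library concrete, here is a minimal NumPy sketch of what an LLC might look like as a small policy network mapping an observation vector (e.g., recent ball positions plus robot joint state) to joint commands. The layer sizes, observation layout, and action space are assumptions made for illustration, not the paper's actual architecture.

```python
import numpy as np

class TinyLLCPolicy:
    """Illustrative low-level controller: a small MLP mapping an
    observation vector (ball track + robot joint state) to joint targets."""

    def __init__(self, obs_dim=24, hidden=64, act_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        # Randomly initialised weights stand in for weights learned with RL.
        self.w1 = rng.normal(0, 0.1, (obs_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0, 0.1, (hidden, act_dim))
        self.b2 = np.zeros(act_dim)

    def act(self, obs):
        h = np.tanh(obs @ self.w1 + self.b1)   # hidden layer
        return np.tanh(h @ self.w2 + self.b2)  # joint commands in [-1, 1]

# Two entries of a hypothetical skill library, one policy per skill.
skill_library = {
    "forehand_attack": TinyLLCPolicy(seed=1),
    "backhand_positioning": TinyLLCPolicy(seed=2),
}
```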
High-Level Controller (HLC): The HLC selects the most appropriate LLC for the current match situation and the opponent's ability. It consists of the following modules:
Style selection policy: This policy chooses between forehand and backhand play based on the type of incoming ball (serve or attack).
Spin classifier: This classifier determines whether the incoming ball has topspin or backspin.
LLC skill descriptors: These descriptors record performance metrics for each LLC under different incoming ball conditions, such as hit rate and ball landing position.
Policy selection module: This module generates a list of LLC candidates based on LLC skill descriptors, match statistics, and opponent ability.
LLC preferences (H-values): This module uses a gradient bandit algorithm to learn a preference value for each LLC online and selects the final LLC from the candidate list according to these preferences; a minimal sketch of this selection step appears just after this list.
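Putting these modules together, the sketch below shows one plausible way the HLC could shortlist LLCs using their skill descriptors and then sample the final choice with a softmax over the learned H-values. The descriptor fields, threshold, and LLC names are hypothetical; only the shortlist-then-sample structure follows the description above.

```python
import numpy as np

# Hypothetical skill descriptors: estimated hit rate per LLC for the
# current incoming-ball class (e.g., "topspin to the backhand side").
skill_descriptors = {
    "backhand_block": {"hit_rate": 0.82},
    "backhand_drive": {"hit_rate": 0.74},
    "backhand_lob":   {"hit_rate": 0.55},
}

# Online-learned preference values (H-values), one per LLC.
h_values = {"backhand_block": 0.3, "backhand_drive": 0.6, "backhand_lob": -0.2}

def select_llc(descriptors, h, min_hit_rate=0.6, rng=np.random.default_rng()):
    # 1) Shortlist candidates whose descriptor says they are viable here.
    candidates = [k for k, d in descriptors.items() if d["hit_rate"] >= min_hit_rate]
    if not candidates:                  # fall back to the full library
        candidates = list(descriptors)
    # 2) Softmax over preferences gives the sampling distribution.
    prefs = np.array([h[k] for k in candidates])
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    return rng.choice(candidates, p=probs), dict(zip(candidates, probs))

llc, probs = select_llc(skill_descriptors, h_values)
print("selected:", llc, "probabilities:", probs)
```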
2. Techniques for Achieving Zero-Shot Sim-to-Real Transfer
Iterative definition of the task distribution: Initial ball states are collected from a small amount of human-vs-human match data, and the LLCs and HLC are trained on this distribution in simulation. Ball states observed when the resulting policies play in the real world are then added back into the dataset, and the process is repeated, gradually refining the training task distribution.
Simulation-to-simulation adaptation layer: Topspin and backspin balls require different physical parameters in the simulated environment, which creates a mismatch when one policy must handle both. The paper proposes two remedies: spin regularization, which adjusts the LLC training data, and a simulation-to-simulation adaptation layer that uses FiLM (feature-wise linear modulation) layers to learn the mapping between the topspin and backspin regimes.
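FiLM works by scaling and shifting intermediate features with parameters produced from a conditioning input. The minimal NumPy sketch below conditions on a spin label to modulate an LLC's hidden features; the dimensions, the one-hot conditioning, and where the layer is inserted are illustrative assumptions rather than the paper's exact design.

```python
import numpy as np

class FiLMLayer:
    """Feature-wise linear modulation: out = gamma(c) * x + beta(c),
    where the conditioning vector c here encodes the spin regime."""

    def __init__(self, cond_dim, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_gamma = rng.normal(0, 0.1, (cond_dim, feat_dim))
        self.b_gamma = np.ones(feat_dim)   # identity scale at initialization
        self.w_beta = rng.normal(0, 0.1, (cond_dim, feat_dim))
        self.b_beta = np.zeros(feat_dim)   # zero shift at initialization

    def __call__(self, features, cond):
        gamma = cond @ self.w_gamma + self.b_gamma
        beta = cond @ self.w_beta + self.b_beta
        return gamma * features + beta

# Conditioning on a one-hot spin label (topspin vs backspin) is an assumption.
topspin, backspin = np.array([1.0, 0.0]), np.array([0.0, 1.0])
film = FiLMLayer(cond_dim=2, feat_dim=64)
hidden = np.random.default_rng(1).normal(size=64)  # some LLC hidden features
adapted_for_backspin = film(hidden, backspin)
```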
Domain randomization: During training, parameters of the simulated environment, such as observation noise, latency, table and racket damping, and friction, are randomized to mimic the uncertainties of the real world.
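As a rough picture of how domain randomization enters training, the sketch below draws a fresh set of simulator parameters at the start of each episode. The parameter names, ranges, and the commented-out reset hook are placeholders, not the paper's actual values or API.

```python
import numpy as np

def sample_domain_randomization(rng):
    """Draw one set of randomized simulation parameters for an episode.
    Ranges below are illustrative placeholders, not the paper's values."""
    return {
        "obs_noise_std_m": rng.uniform(0.0, 0.005),  # ball-observation noise
        "obs_delay_s": rng.uniform(0.0, 0.04),       # perception/action latency
        "table_damping": rng.uniform(0.8, 1.2),      # scale on nominal damping
        "racket_friction": rng.uniform(0.7, 1.3),    # scale on nominal friction
    }

rng = np.random.default_rng(42)
for episode in range(3):
    params = sample_domain_randomization(rng)
    # env.reset(physics_overrides=params)  # hypothetical hook into the simulator
    print(episode, params)
```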
3. Real-Time Adaptation to Unknown Opponents
Real-time tracking of match statistics: The HLC tracks match statistics in real time, such as points won and errors made by both the robot and the opponent, and uses this data to adjust the LLC preference values and adapt to changes in the opponent.
Online learning of LLC preferences: Using the gradient bandit algorithm, the HLC learns the LLC preference values online, steering play toward the LLCs that best exploit the opponent's weaknesses.
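For concreteness, the sketch below applies the standard gradient-bandit preference update (as in Sutton and Barto's bandit chapter): after each point, the chosen LLC's preference moves up or down relative to a running reward baseline. Treating a point outcome as a ±1 reward and the specific step size are assumptions made for illustration.

```python
import numpy as np

def softmax(prefs):
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def gradient_bandit_update(h, chosen, reward, baseline, step_size=0.1):
    """Standard gradient-bandit preference update: raise the chosen LLC's
    preference when the reward beats the baseline and lower the others';
    do the opposite when it falls short."""
    pi = softmax(h)
    advantage = reward - baseline
    h = h - step_size * advantage * pi   # push all preferences down by pi(a)...
    h[chosen] += step_size * advantage   # ...and add the full step back for the chosen LLC
    return h

# Three candidate LLCs; the robot just won a point (+1) using LLC index 1.
h = np.array([0.0, 0.0, 0.0])
baseline = 0.0                           # running average of past rewards
h = gradient_bandit_update(h, chosen=1, reward=+1.0, baseline=baseline)
print(h, softmax(h))
```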
The research team collected a small amount of human-vs-human play data to seed the task conditions, trained agents in simulation with reinforcement learning (RL), and used the techniques above to deploy the policy on real hardware in a zero-shot manner. The deployed agent then played against human players, generating additional training task conditions, and the train-deploy cycle was repeated. As the robot improved, the matches became progressively more challenging while remaining grounded in real-world task conditions. This hybrid sim-real cycle created an automated task curriculum that improved the robot's skills over time.
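A schematic sketch of this train-deploy-collect loop appears below. The training and deployment steps are stubbed out with placeholder functions, so it conveys only the shape of the automated curriculum, not the actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Seed task distribution: initial ball states gathered from a small amount of
# human-vs-human play (random placeholders here, standing in for position + velocity).
ball_state_dataset = [rng.normal(size=6) for _ in range(50)]

def train_in_simulation(dataset):
    """Stub for RL training: in the real system this trains the LLCs and HLC
    in MuJoCo on initial ball states sampled from `dataset`."""
    return {"trained_on": len(dataset)}          # placeholder "policy"

def deploy_and_play(policy):
    """Stub for zero-shot deployment: playing humans yields new, harder
    ball states that feed back into the task distribution."""
    return [rng.normal(size=6) for _ in range(20)]

policy = None
for cycle in range(3):                           # hybrid sim-real curriculum
    policy = train_in_simulation(ball_state_dataset)
    new_ball_states = deploy_and_play(policy)
    ball_state_dataset.extend(new_ball_states)   # task distribution grows
    print(f"cycle {cycle}: dataset size {len(ball_state_dataset)}")
```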
How Well Does It Play?
To evaluate the agent's skill level, the robot played competitive matches against 29 table tennis players of four skill levels (beginner, intermediate, advanced, and advanced+), as determined by professional table tennis coaches.
Against all opponents combined, the robot won 45% of its matches and 46% of individual games. Broken down by skill level, it won every match against beginners, lost every match against advanced and advanced+ players, and won 55% of its matches against intermediate players. This strongly suggests that the agent plays rallies at the level of an intermediate human player.
Research participants enjoyed playing with the robot, giving it high ratings for being "fun" and "engaging." These ratings were consistent across different skill levels, regardless of whether participants won or lost. They also overwhelmingly answered that they would "definitely" play with the robot again. When given free time to play with the robot, they played for an average of 4 minutes and 6 seconds out of a total of 5 minutes.
Advanced players were able to exploit weaknesses in the robot's strategy, but they still enjoyed playing with it. In post-match interviews, they considered it a more dynamic practice partner than a ball machine.
Limitations and Future Prospects
The research team stated that this robot learning system still has some limitations, such as a limited ability to react to fast and low balls, low spin-detection accuracy, and a lack of multi-ball strategic play.
Future research directions include improving the robot's ability to handle various types of balls, learning more complex strategies, and improving motion capture technology.
The research team also noted that the hierarchical policy architecture and zero-shot sim-to-real transfer methods proposed in this study can be applied to other robot learning tasks, that the real-time adaptation techniques can help robots cope with continually changing environments and tasks, and that sound system design principles are crucial for building high-performance, robust robot learning systems.