A quick search will tell you that MCTS is applicable to RL tasks with large or infinite state spaces. But there seems to be no empirical confirmation that it works as well on those tasks as it does on Go. Assume that no rollouts are used, just as in AlphaZero. Go's state space is larger than that of other board games, but its episodes are short (a game lasts only a few hundred timesteps).

The state space of many real-world problems grows exponentially w.r.t. the timestep in the following sense. Take text generation (not just sentence generation but also paragraph/essay generation). It is a fully observable but non-Markovian problem (e.g. the reward depends on words in the distant past), so it can be recast as an MDP by treating the history of all previous states and actions as the current state (for text generation, the actions alone suffice); see LeakGAN for reference. Other examples include music generation, video game playing and various robotics tasks.

This exponential growth in the number of states can be made constant w.r.t. time by techniques like HRL. For example, you can take just the history of the last 40 timesteps as the current state and use HRL techniques to make the overall policy depend on the distant past as well. This enables the problem to be dealt with by MCTS. Alternatively, you can embed the distant history into a binary vector of fixed dimension and run MCTS on that.
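To make the counting argument concrete, here is a minimal Python sketch. It only illustrates how the number of distinct history-states grows with and without a fixed window; it is not an implementation of HRL or MCTS. All names (`full_history_state_count`, `windowed_state_count`, `WindowedState`) and the toy action-set sizes are hypothetical, chosen purely for illustration.

```python
def full_history_state_count(num_actions: int, t: int) -> int:
    """Distinct states at timestep t when the state is the whole action history."""
    return num_actions ** t


def windowed_state_count(num_actions: int, t: int, window: int) -> int:
    """Distinct states at timestep t when only the last `window` actions are kept."""
    return num_actions ** min(t, window)


class WindowedState:
    """Hypothetical state that remembers only the most recent `window` actions.

    A tree search could key its nodes on `key()`, so the number of distinct
    nodes per depth stops growing once t exceeds the window length.
    """

    def __init__(self, window: int, history=()):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window
        # Keep only the tail of the history; older actions are dropped.
        self.history = tuple(history)[-window:]

    def step(self, action):
        """Return the successor state after taking `action`."""
        return WindowedState(self.window, self.history + (action,))

    def key(self):
        """Hashable identifier usable as a search-tree node key."""
        return self.history


if __name__ == "__main__":
    # Toy setting: 3 possible actions (e.g. a 3-token vocabulary), window of 2.
    for t in (1, 2, 5, 10):
        print(t, full_history_state_count(3, t), windowed_state_count(3, t, 2))
```

With the full history the count is `num_actions ** t`, while with a window it is capped at `num_actions ** window`. The dropped part of the history is exactly what the HRL component or a fixed-size embedding would have to account for.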

Is my argument reasonable? Any suggestions for modifying it so that it can be tested empirically? Thoughts?
