Service, not algorithm Supervised learning is essentially observational: some data has been collected and subsequently algorithms are run on it. (Online supervised learning doesn’t necessarily work this way, but mostly online techniques have been used for computational reasons after data collection.) In contrast, counterfactual learning is very difficult do to observationally. Diverse fields such as economics, political science, and epidemiology all attempt to make counterfactual conclusions using observational data, essentially because this is the only data available (at an affordable cost). When testing a new medicine, however, the standard is to run a controlled experiment, because with control over the data collection more complicated conclusions can be made with higher confidence.
Analogously, reinforcement learning is best done “in the loop”, with the algorithm controlling the collection of data which is used for learning. Because of this, a pure library implementation of a reinforcement learning algorithm is unnatural, because of the requisite state management. For example, rewards occur after actions are taken, and these need to be ultimately associated with each other for learning. (One of my first jobs was at a sponsored search company called Overture, and maintaining the search-click join was the full time job of a dozen engineers: note this was merely an immediate join for a singleton session!)
Ergo, packaging reinforcement learning as a service makes more sense. This facilitates distributed coordination of the model updates, the serving (exploration) distribution, and the data collection. This scenario is a natural win for cloud computing providers. However, in practice there needs to be an offline client mode (e.g., for mobile and IOT applications); furthermore, this capability would be utilized even in a pure datacenter environment because of low latency decision requirements. (More generally, there would probably be a “tiered learning” architecture analogous to the tiered storage architectures utilized in cloud computing platforms. Brendan McMahan has been thinking along these lines under the rubric of federated learning.)
Bootstrapping is everything It is amazing how clarifying it is to try and solve and actual problem. I now appreciate that reinforcement learning has been oversold a bit. In particular, the sample complexity requirements for reinforcement learning are quite high. (That’s fancy talk for saying it takes a lot of data to converge.) When you are working in a simulated environment that’s not such a concern, because you have the equivalent of infinite training data, so we see dramatic results in simulated environments.
When reinforcement learning is done on live traffic with real users, you have less data than you think because you always start with a test fraction of data and you don’t get more until you are better (catch 22). So I actually spend a lot of my time developing initial serving policies, unfortunately somewhat idiosyncratically: imitation learning can be great with the right data assets, but heuristic strategies are also important. I suspect initialization via not-smartly-initialized-RL in a simulated environment is another possibility (in dialog simulators aren’t so good so I haven’t leveraged this strategy yet).
This creates some design questions for RL as a service.
- Assuming there is an initial serving policy, how do I specify it? In the decision service you pass in the action that the initial serving policy would take which is fine for contextual bandits, but for a multi-step epoch this could be cumbersome because the initial serving policy needs to maintain state. It would make sense for the service to make it easier to manage this.
- How does the service help me put together the initial serving policy? Considering my experience so far, here are some possible ways to develop an initial serving policy:
- An arbritrary program (“heuristic”). Sometimes this is the easiest way to cold start, or this might be the current “champion” system.
- Imitation learning. Assumes suitable data assets are available.
- Off-policy learning from historical data. This can be better than imitation learning if the historical policy was suitably randomized (e.g., the exhaust of previous invocations of RL as a service).
- Boostrapping via simulation. In dialog this doesn’t seem viable, but if a good simulator is available (e.g., robotics and game engines?), this could be great. Furthermore this would involve direct reuse of the platform, albeit on generated data.
Language is the UI of programming I think ideas from credit-assignment compilation would not only address the question of how to specify the initial policy, but also provide the most natural interface for utilizing RL a service. I’ll do another post exploring that.