In dialogue systems, partial feedback problems abound: anyone who has ever unsuccessfully tried to get a job has considered the counterfactual: “what if I had said something different?” Such questions are difficult to answer using offline data, yet anyone trying to assess a dialogue system offline has to come up with some scheme for doing so, and there are pitfalls.
Online evaluation has different problems. In isolation, it is ideal; but for the scientific community at large it is problematic. For example, Honglak Lee has convinced the registrar of his school to allow him to deploy a live chat system for recommending course registrations. This is a brilliant move on his part, analogous to getting access to a particle accelerator in the 1940s: he’ll be in a position to discover interesting stuff first. But he can’t share this resource broadly, because 1) there are a finite number of chats and 2) the registrar presumably wants to ensure a quality experience. Similar concerns underpin the recent explosion of interest in dialogue systems in the tech sector: companies with access to live dialogues are aware of the competitive moat this creates, and they need to be careful in the treatment of their customers.
That’s fine, and I like getting a paycheck, but: how fast would reinforcement learning be advancing if the Arcade Learning Environment were only available at the University of Alberta?
So here are some ideas.
First, we could have agents talk with each other to solve a task, with no humans involved. Perhaps this would lead to the same rapid progress that has been observed in two-player games. Arguably, we might learn more about ants than people from such a line of research. However, with humans out of the loop, we could use simulated environments and democratize assessment. Possibly we could discover something interesting about what it takes for an agent to learn to repeatedly communicate information in order to cooperate with another agent.
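A toy version of this first idea fits on a laptop: the classic Lewis signaling game, in which a “speaker” agent privately observes a state and emits a discrete symbol, a “listener” agent acts on the symbol alone, and both are rewarded only when they cooperate successfully. The sketch below uses simple Roth–Erev reinforcement; the number of states, the round count, and the learning scheme are all made-up choices for illustration, not a claim about how such research should actually be done:

```python
import random

random.seed(0)

N = 3            # number of world states == number of available signals
ROUNDS = 5000

# Roth-Erev reinforcement weights: start uniform, bump on success.
speaker = [[1.0] * N for _ in range(N)]   # speaker[state][signal]
listener = [[1.0] * N for _ in range(N)]  # listener[signal][guess]

def sample(weights):
    """Sample an index with probability proportional to its weight."""
    r = random.uniform(0, sum(weights))
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

history = []
for _ in range(ROUNDS):
    state = random.randrange(N)        # speaker privately observes the state
    signal = sample(speaker[state])    # sends one discrete symbol
    guess = sample(listener[signal])   # listener acts on the symbol alone
    success = guess == state
    history.append(success)
    if success:                        # shared reward reinforces both choices
        speaker[state][signal] += 1.0
        listener[signal][guess] += 1.0

late_accuracy = sum(history[-1000:]) / 1000
print(f"accuracy over final 1000 rounds: {late_accuracy:.2f}")
```

With no human anywhere in the loop, running this a million times costs nothing, and any researcher can reproduce the evaluation exactly; that is the property the paragraph above is after.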
Second, we could build a platform that democratizes access to an online oracle. Since online assessment is a scarce resource, it would have to cost something, but imagine: suppose we decide task foo is important. We create a standard training program to produce skilled crowdsource workers, plus standard HITs that constitute the task, quality control procedures, etc. Then we try as hard as possible to amortize these fixed costs across all researchers, by letting anyone assess any model in the framework while paying only the marginal costs of the oracle. Finally, instead of doing this just for task foo, we make it easy for researchers to create new tasks as well. To some degree, the crowdsourcing industry does this already (for paying clients), and certainly researchers have been leveraging crowdsourcing extensively. The question is how we can make it easier to 1) come up with reliable benchmark tasks that leverage online assessment, and then 2) provide online access to every researcher at minimum cost. Merely creating a data set from the crowdsourced task is not sufficient, since that reintroduces the problems of offline evaluation.
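The amortization argument is just arithmetic, but it is worth making concrete. In this back-of-the-envelope sketch every number is invented for illustration; only the structure matters: fixed costs (worker training, HIT design, quality control) divide across researchers, while each researcher still pays the marginal cost of their own assessments.

```python
def cost_per_researcher(fixed, marginal_per_assessment, assessments, researchers):
    """Fixed setup costs are shared; marginal oracle costs are not."""
    return fixed / researchers + marginal_per_assessment * assessments

FIXED = 50_000.0   # hypothetical one-time cost to stand up task "foo"
MARGINAL = 0.10    # hypothetical cost per crowdsourced judgment
PER_LAB = 10_000   # hypothetical number of assessments each lab runs

solo = cost_per_researcher(FIXED, MARGINAL, PER_LAB, researchers=1)
shared = cost_per_researcher(FIXED, MARGINAL, PER_LAB, researchers=100)
print(f"going it alone: ${solo:,.0f}; on a shared platform: ${shared:,.0f}")
# -> going it alone: $51,000; on a shared platform: $1,500
```

Under these made-up numbers the fixed costs dominate a single lab’s budget, which is exactly why a shared platform changes who can afford online assessment at all.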
Of course it would be even better if the task were not crowdsourced at all, but rather some natural interactive task happening all the time at such large volume that the main issue is democratizing access. One could imagine, e.g., training on all transcripts of Car Talk and building a dialogue app that tries to diagnose car problems. If it didn’t totally suck, people would not have to be paid to use it, and it could support some level of online assessment for free. Bootstrapping that, however, would itself be a major achievement.