Reinforcement learning continues to ascend, extending the enthusiasm and energy from ICML. The “Imagenet moment” for RL was the Deepmind work in the Arcade Learning Environment. In a talk in the Deep RL workshop, Michael Bowling presented evidence that the big boost in performance could be mostly characterized as 1) decoding the screen better with convnets and 2) using multiple previous frames as input. This was not to detract from the breakthrough, but rather to point out that a hard part of RL (partial feedback over long action sequences) is not addressed by this advance. What’s interesting is that currently no system is good at playing Pitfall, which involves long action sequences before reward is encountered. The Bowling quote is that we are good at games where “you wiggle the joystick randomly and you get some reward.”
However, the community is not standing still: with so much enthusiasm and human talent now thinking in this direction, progress will hopefully accelerate. For instance, an idea I saw recurring was: rewards are partially observed (and sparse!), but sensory inputs are continuously observed. Therefore, decompose the prediction of future rewards into a combination of 1) predicting future sensory inputs conditioned on action sequences, and 2) predicting reward given sensory input. From a sample complexity standpoint, this makes a lot of sense. As Honglak Lee pointed out in his talk at the Deep RL workshop, the same technology powering Transformer Networks can be trained to predict future sensory input conditioned on action sequences, which can be leveraged for simulated play-out. (If you know about POMDPs, then the decomposition perhaps makes less sense, because you cannot necessarily predict reward from the current sensory state; but we have to crawl before we can walk, and maybe ideas from sequence-to-sequence learning can be composed with this kind of decomposition to enable some modeling of unobservable world state.)
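To make the decomposition concrete, here is a minimal tabular sketch. Everything in it is an assumption for illustration: the six-position toy world, the names `true_step`, `true_reward`, `predict_next`, `predict_reward`, and `simulated_return`, and the use of simple counts where the talks used learned networks. It shows the two pieces being fit separately (every transition trains the sensory model, only observed rewards train the reward model) and then composed for simulated play-out.

```python
import random
from collections import Counter, defaultdict

def true_step(state, action):
    """Hypothetical toy world: positions 0..5 on a line, action is -1 or +1."""
    return max(0, min(5, state + action))

def true_reward(state):
    """Reward is sparse: only position 5 pays out."""
    return 1.0 if state == 5 else 0.0

# 1) Sensory model: counts of observed next states for each (state, action).
transition_counts = defaultdict(Counter)
# 2) Reward model: running average of reward observed in each sensory state.
reward_sum = defaultdict(float)
reward_n = defaultdict(int)

rng = random.Random(0)
for _ in range(2000):
    s = rng.randrange(6)          # random restarts, random actions
    a = rng.choice([-1, 1])
    s_next = true_step(s, a)
    transition_counts[(s, a)][s_next] += 1
    reward_sum[s_next] += true_reward(s_next)
    reward_n[s_next] += 1

def predict_next(state, action):
    """Most frequently observed successor; stay put if never observed."""
    counts = transition_counts[(state, action)]
    return counts.most_common(1)[0][0] if counts else state

def predict_reward(state):
    """Average observed reward in this state; 0.0 if never observed."""
    return reward_sum[state] / reward_n[state] if reward_n[state] else 0.0

def simulated_return(state, actions):
    """Play out an action sequence inside the learned models only."""
    total = 0.0
    for a in actions:
        state = predict_next(state, a)
        total += predict_reward(state)
    return total

go_right = simulated_return(0, [1, 1, 1, 1, 1])
go_left = simulated_return(0, [-1, -1, -1, -1, -1])
```

In this toy setting reward is a function of the current sensory state, which is exactly the assumption the POMDP caveat above calls into question.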
Another popular reinforcement learning topic was the need for better exploration strategies. I suspect this is the really important part: how do we explore in a manner which is relevant to regret with respect to our hypothesis class (which can be relatively small, redundant, and full of structural assumptions), rather than the world per se (which is impossibly big)? This is how things play out in contextual bandits: if all good policies want the same action then exploration is less important. At the conference the buzzword was “intrinsic motivation”, roughly meaning “is there a useful progress proxy that can be applied on all those action sequences where no reward is observed?”. Given a decomposition of reward prediction into (action-sequence-conditional sensory input prediction + sensory-reward prediction), then discovering novel sensory states is useful training data, which roughly translates into an exploration strategy of “boldly go where you haven’t gone before” and hope it doesn’t kill you.
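One simple way to picture “boldly go where you haven’t gone before” is a visit-count novelty bonus. This is a sketch of that one proxy, not a claim about what any particular paper at the conference proposed; `novelty_bonus`, `explore`, and `line_world` are hypothetical names, and the transition function is assumed to be available (known or learned) for a one-step look-ahead.

```python
from collections import Counter

def novelty_bonus(visit_counts, state):
    """Intrinsic reward that shrinks as a state becomes familiar."""
    return 1.0 / (1.0 + visit_counts[state]) ** 0.5

def explore(step, actions, start, n_steps):
    """Greedy on novelty: move to the successor state visited least so far.

    `step(state, action)` is an assumed transition function. No extrinsic
    reward is consulted anywhere, and nothing checks whether a state is safe.
    """
    visits = Counter({start: 1})
    state = start
    for _ in range(n_steps):
        state = max((step(state, a) for a in actions),
                    key=lambda s: novelty_bonus(visits, s))
        visits[state] += 1
    return visits

def line_world(state, action):
    """Hypothetical toy world: positions 0..9 on a line."""
    return max(0, min(9, state + action))

visits = explore(line_world, actions=(-1, +1), start=0, n_steps=30)
```

The agent sweeps the whole line without ever seeing a reward, which is the point: novelty supplies a training signal on the action sequences where reward is silent.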
Finally, I have some anecdotal evidence that reinforcement learning is on the path towards a mature industrial technology: at ICML when I talked to Deepmind people they would say they were working on some technical aspect of reinforcement learning. This time around I got answers like “I’m doing RL for ads” or “I’m doing RL for recommendations”. That’s a big change.
There were a variety of other interesting topics at the conference around which I’m still collecting my thoughts.
- I really like the best paper Competitive Distribution Estimation: Why is Good-Turing Good, and I suspect it is relevant for extreme classification.
- Brown and Sandholm are doing amazing things with their Heads-up No-Limit Poker Player. This is one of those “we probably aren’t learning about how humans solve the problem, but it’s still really cool technology.” Navel gazing isn’t everything!
- I still like primal approximations to kernels (in extreme classification we have to hug close to the linear predictor), so I liked Spherical Random Features for Polynomial Kernels.
- I want to try Online F-Measure Optimization. F-measure is an important metric in extreme classification but just computing it is a pain in the butt, forget about optimizing it directly. Maybe that’s different now.
- Automated machine learning aka AutoML is heating up as a topic. One near-term goal is to eliminate the need for expertise in typical supervised learning setups. The poster Efficient and Robust Automated Machine Learning is an interesting example. The AutoML challenge at the CIML workshop and ongoing challenges are also worth keeping an eye on. IBM also had a cool AutoML product demo at their party (parenthetically: what is the word for these things? they are clearly recruiting functions but they masquerade as college parties thrown by a trustafarian with nerdy friends).
- Memory systems, exemplified at the conference by the End-to-End Memory Networks paper, and at the workshops by the RAM workshop. I especially like attention as a mechanism for mitigating sample complexity: if you are not attending to something you are invariant to it, which greatly mitigates data requirements, assuming of course that you are ignoring irrelevant stuff. Is it somehow less expensive statistically to figure out what is important rather than how it is important, preserving precious data resources for the latter? I’m not sure, but Learning Wake-Sleep Recurrent Attention Models is on my reading list.
- Highway networks look pretty sweet. The idea of initializing with the identity transformation makes a lot of sense. For instance, all existing deep networks can be considered highway networks with an uncountable number of identity transformation layers elided past a certain depth, i.e., incompletely optimized “infinitely deep” highway networks.
- Extreme classification is still an active area, and the workshop was reasonably well-attended considering we were opposite the RAM workshop (which was standing-room-only fire-code-violating popular). I especially liked Charles Elkan’s talk, which I could summarize as “we just need to compute a large collection of sparse GLMs, I’m working on doing that directly.” My own work with hierarchical spectral approaches does suggest that the GLM would have excellent performance if we could compute it, so I like this line of attack (also, conceivably, I could compose the two techniques). Also interesting: for squared loss, if the feature dimensionality is small, exact loss gradients can be computed in label-sparsity time via Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets. This is great for typical neural networks with low-dimensional bottlenecks before the output layer (unfortunately, it is not usable as-is for large sparse GLMs, but perhaps a modification of the trick could work?).
could be a cool trick for better optimization of deep networks via eliminating one pesky invariant.
- The Self-Normalized Estimator for Counterfactual Learning. If you like reinforcement learning, you should love counterfactual estimation, as the latter provides critical insights for the former. I need to play around with the proposed estimator, but it looks plausibly superior.
- Taming the Wild: A Unified Analysis of Hogwild Style Algorithms. While I had plenty of empirical evidence that hogwild and matrix factorization play well together, this analysis claims that they should play well together. Neat!
- Last but not least, a shoutout to the Machine Learning Systems workshop, which CISL colleague Markus Weimer co-organized. While not quite fire-code-violating, it was standing room only.
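Returning to the counterfactual estimation item in the list above: a minimal sketch of the difference between the vanilla inverse propensity estimate and a self-normalized one, assuming the usual form in which the importance-weighted reward sum is divided by the sum of the weights rather than by the sample count. The function names and the four logged examples are made up for illustration.

```python
def importance_weights(target_probs, logging_probs):
    """Ratio of the new policy's action probability to the logging policy's."""
    return [t / l for t, l in zip(target_probs, logging_probs)]

def ips(rewards, target_probs, logging_probs):
    """Vanilla inverse propensity score estimate: divide by the sample count."""
    w = importance_weights(target_probs, logging_probs)
    return sum(wi * ri for wi, ri in zip(w, rewards)) / len(rewards)

def snips(rewards, target_probs, logging_probs):
    """Self-normalized estimate: divide by the sum of the weights instead."""
    w = importance_weights(target_probs, logging_probs)
    return sum(wi * ri for wi, ri in zip(w, rewards)) / sum(w)

# Hypothetical logged data: rewards in [0, 1], probabilities of logged actions.
rewards = [1.0, 0.0, 1.0, 0.0]
logging_probs = [0.5, 0.5, 0.25, 0.25]
target_probs = [0.9, 0.1, 0.9, 0.1]

ips_value = ips(rewards, target_probs, logging_probs)
snips_value = snips(rewards, target_probs, logging_probs)
```

On this data the vanilla estimate lands above 1 even though every reward is at most 1, while the self-normalized estimate is a weighted average of the observed rewards and so stays inside their range. Adding a constant to every reward also shifts the self-normalized estimate by exactly that constant, which is not true of the vanilla one.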