This is a paper from MIT that introduces a system called PixelPlayer. Trained on a large number of unlabeled videos, PixelPlayer learns to locate the image regions that produce sounds and to separate the input sound into a set of components that represent the sound from each pixel.

