I never understood the link between β-VAE and interpretable representations. Frankly, I still don’t understand the whole VAE and disentangled (i.e., interpretable) representation literature as applied to static image datasets.
Why should encouraging posterior overlap lead to disentanglement? Why should a factorized posterior imply disentanglement? Why should a factorized prior imply disentanglement?
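To be concrete, the objective I’m asking about is the standard β-VAE modification of the ELBO, which scales the KL term by a factor β > 1:

```latex
\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \beta \, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
```

With the usual factorized Gaussian prior $p(z) = \mathcal{N}(0, I)$, a larger β pulls each posterior $q_\phi(z \mid x)$ toward the prior, which is where the "posterior overlap" and "factorized prior" intuitions in my questions come from. What I don’t see is the step from that pressure to disentanglement.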
None of this is clear to me. Does anyone here actually understand it? If so, I’d like someone to explain to me how these models lead to disentanglement. I agree that they work empirically. I just don’t know why.