In batch normalization, we use mini-batch statistics during training and population statistics at test time (the latter approximated by something like an exponential moving average).
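To make the setup concrete, here is a minimal sketch of that train/test split for a single scalar feature. The class name `BatchNorm1D`, the `momentum` value, and the omission of the learned scale and shift are my own simplifications for illustration, not any particular library's implementation.

```python
import math


class BatchNorm1D:
    """Toy batch normalization for one scalar feature (no learned scale/shift)."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.momentum = momentum    # assumed decay rate for the running averages
        self.eps = eps              # small constant to avoid division by zero
        self.running_mean = 0.0     # exponential average of batch means
        self.running_var = 1.0      # exponential average of batch variances

    def __call__(self, batch, training):
        if training:
            # Training: normalize with the statistics of the current mini-batch.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Update the running estimates that will be used at test time.
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # Test: normalize with the accumulated running estimates.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]


bn = BatchNorm1D()
train_out = bn([1.0, 3.0], training=True)   # uses this batch's mean and variance
test_out = bn([0.2], training=False)        # uses the running averages instead
```

Note that in training mode the running averages are only updated, never used for normalization, which is exactly the asymmetry the question is about.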

When the mini-batch is small, the mini-batch statistics are noisy estimates of the population statistics, so they seem like a poor choice.

So I wonder: why don't we rely more on some kind of exponential average during training as well?

