Commonly cited benefits of the rectified linear unit as an activation function are:
It alleviates the vanishing gradient problem by being the identity function for positive inputs.
It’s simple and cheap to compute.
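To make the first point concrete, here is a minimal NumPy sketch (names are my own, not from any particular library) showing that ReLU’s gradient is exactly 1 for positive inputs, so backpropagating through many ReLU layers does not shrink the signal the way saturating activations can:

```python
import numpy as np

def relu(x):
    # max(0, x): identity for positive inputs, zero otherwise
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is exactly 1 wherever the input is positive,
    # so the backward signal passes through unattenuated.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))
print(relu_grad(x))
```

Compare this with sigmoid, whose gradient is at most 0.25 and decays toward zero for large inputs.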
My interpretation has to do with the dot product in convolutional neural networks. When the kernel is slid over its input window, it computes a dot product with that window, just as with ordinary vectors. Convolution can therefore be interpreted as searching for a specific pattern throughout an image.
Now, the only thing that determines the dot product’s sign is the angle between the two vectors, in this case the kernel and its window. Zero degrees means perfect correlation, 90 degrees means no correlation, and 180 degrees means “opposing” correlation.
Since ReLU zeroes out everything below zero, you can interpret it as discarding dot products where the window doesn’t correlate with the kernel in direction, that is, where the angle between them exceeds 90 degrees. In other words, for there to be any output at all, the window’s direction has to have some positive correlation with the kernel’s.
Once direction has passed this test, the magnitudes of the two vectors take over and scale the strength of the activation up or down.
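This interpretation can be sketched in a few lines of NumPy. The kernel and windows below are hypothetical toy vectors chosen only to illustrate the three angles mentioned above; applying ReLU to each dot product leaves a nonzero activation only for the positively correlated window:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# A toy "kernel" and three windows at 0, 90, and 180 degrees to it.
kernel     = np.array([1.0, 1.0])
aligned    = np.array([2.0, 2.0])    # 0 degrees: same direction
orthogonal = np.array([1.0, -1.0])   # 90 degrees: no correlation
opposing   = np.array([-2.0, -2.0])  # 180 degrees: opposing direction

for name, window in [("aligned", aligned),
                     ("orthogonal", orthogonal),
                     ("opposing", opposing)]:
    d = np.dot(kernel, window)
    # Only the aligned window survives ReLU; note also that doubling
    # the aligned window's magnitude would double its activation.
    print(f"{name}: dot = {d}, relu(dot) = {relu(d)}")
```

Scaling the aligned window up or down changes the activation’s strength without changing which windows survive, matching the magnitude point above.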
Within this interpretation, the ReLU seems like an oddly natural fit for an activation function interleaved with dot products.